* [PATCH v1 0/9] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping
@ 2021-11-15 19:30 ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like
MEMORY_DEVICE_PRIVATE.

Christoph, the suggestion to incorporate Ralph Campbell’s refcount
cleanup patch into our hardware page migration patchset originally came
from you, but it proved impractical to do things in that order because
the refcount cleanup introduced a bug with wide ranging structural
implications. Instead, we amended Ralph’s patch so that it could be
applied after merging the migration work. As we saw from the recent
discussion, merging the refcount work is going to take some time and
cooperation between multiple development groups, while the migration
work is ready now and is needed now. So we propose to merge this
patchset first and continue to work with Ralph and others to merge the
refcount cleanup separately, when it is ready.

This patch series is mostly self-contained except for a few places where
it needs to update other subsystems to handle the new memory type.
System stability and performance are not affected according to our
ongoing testing, including xfstests.

How it works: The system BIOS advertises the GPU device memory
(aka VRAM) as SPM (special purpose memory) in the UEFI system address
map.
The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
this hardware page migration capability is the Frontier supercomputer
project. This functionality is not AMD-specific. We expect other GPU
vendors to find this functionality useful, and possibly other hardware
types in the future.
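
As a minimal sketch, the registration boils down to the following
(condensed from patch 3/9 in this series; the amdgpu-specific names
are taken from that patch):

	pgmap->type = MEMORY_DEVICE_COHERENT;	/* CPU-coherent VRAM (SPM) */
	pgmap->range.start = adev->gmc.aper_base;
	pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
	pgmap->nr_range = 1;
	pgmap->ops = &svm_migrate_pgmap_ops;
	pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
	r = devm_memremap_pages(adev->dev, pgmap);
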
Our test nodes in the lab are similar to the Frontier configuration,
with 0.5 TB of system memory plus 256 GB of device memory split across
4 GPUs, all in a single coherent address space. Page migration is
expected to improve application efficiency significantly. We will
report empirical results as they become available.

We extended hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This
patch set builds on HMM and our SVM memory manager already merged in
5.15.

Alex Sierra (9):
  mm: add zone device coherent type memory support
  mm: add device coherent vma selection for memory migration
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: coherent type as sys mem on migration to ram
  lib: test_hmm add ioctl to get zone device type
  lib: test_hmm add module param for zone device type
  lib: add support for device coherent type in test_hmm
  tools: update hmm-test to support device coherent type
  tools: update test_hmm script to support SP config

 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  34 ++-
 include/linux/memremap.h                 |   8 +
 include/linux/migrate.h                  |   1 +
 include/linux/mm.h                       |   9 +
 lib/test_hmm.c                           | 270 +++++++++++++++++------
 lib/test_hmm_uapi.h                      |  21 +-
 mm/memcontrol.c                          |   6 +-
 mm/memory-failure.c                      |   6 +-
 mm/memremap.c                            |   5 +-
 mm/migrate.c                             |  30 ++-
 tools/testing/selftests/vm/hmm-tests.c   | 156 +++++++++++--
 tools/testing/selftests/vm/test_hmm.sh   |  20 +-
 12 files changed, 446 insertions(+), 120 deletions(-)

-- 
2.32.0


* [PATCH v1 1/9] mm: add zone device coherent type memory support
@ 2021-11-15 19:30   ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

Add MEMORY_DEVICE_COHERENT, a type for device memory that is cache
coherent from both the device's and the CPU's point of view. This is
used on platforms that have an advanced system bus (like CAPI or CXL).
Any page of a process can be migrated to such memory. However, no one
should be allowed to pin such memory so that it can always be evicted.
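
To make the mapping rule concrete, here is the core of the
migrate_vma_insert_page() change from the hunk below (excerpted, not
new logic): coherent pages are CPU-addressable and get a normal
present PTE, while private pages must be mapped as device-private
swap entries:

	if (is_device_private_page(page)) {
		swp_entry = make_readable_device_private_entry(
						page_to_pfn(page));
		entry = swp_entry_to_pte(swp_entry);
	} else {	/* MEMORY_DEVICE_COHERENT */
		entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
		if (vma->vm_flags & VM_WRITE)
			entry = pte_mkwrite(pte_mkdirty(entry));
	}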

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 include/linux/memremap.h |  8 ++++++++
 include/linux/mm.h       |  9 +++++++++
 mm/memcontrol.c          |  6 +++---
 mm/memory-failure.c      |  6 +++++-
 mm/memremap.c            |  5 ++++-
 mm/migrate.c             | 21 +++++++++++++--------
 6 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..ff4d398edf35 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -39,6 +39,13 @@ struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -59,6 +66,7 @@ struct vmem_altmap {
 enum memory_type {
 	/* 0 is reserved to catch uninitialized type fields */
 	MEMORY_DEVICE_PRIVATE = 1,
+	MEMORY_DEVICE_COHERENT,
 	MEMORY_DEVICE_FS_DAX,
 	MEMORY_DEVICE_GENERIC,
 	MEMORY_DEVICE_PCI_P2PDMA,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..d23b6ebd2177 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,7 @@ static inline bool page_is_devmap_managed(struct page *page)
 		return false;
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
 	case MEMORY_DEVICE_FS_DAX:
 		return true;
 	default:
@@ -1191,6 +1192,14 @@ static inline bool is_device_private_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		is_zone_device_page(page) &&
+		(page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+		page->pgmap->type == MEMORY_DEVICE_COHERENT);
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..e0275ebe1198 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
- *     (so ZONE_DEVICE page and thus not on the lru).
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_COHERENT
+ *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
  *     For now we such page is charge like a regular page would be as for all
  *     intent and purposes it is just special memory taking the place of a
  *     regular page.
@@ -5730,7 +5730,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page_memcg(page) == mc.from) {
 			ret = MC_TARGET_PAGE;
-			if (is_device_private_page(page))
+			if (is_device_page(page))
 				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3e6449f2102a..51b55fc5172c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1554,12 +1554,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto unlock;
 	}
 
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
 		/*
 		 * TODO: Handle HMM pages which may need coordination
 		 * with device-side memory.
 		 */
 		goto unlock;
+	default:
+		break;
 	}
 
 	/*
diff --git a/mm/memremap.c b/mm/memremap.c
index ed593bf87109..94d6a1e01d42 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -44,6 +44,7 @@ EXPORT_SYMBOL(devmap_managed_key);
 static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_COHERENT ||
 	    pgmap->type == MEMORY_DEVICE_FS_DAX)
 		static_branch_dec(&devmap_managed_key);
 }
@@ -51,6 +52,7 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_COHERENT ||
 	    pgmap->type == MEMORY_DEVICE_FS_DAX)
 		static_branch_inc(&devmap_managed_key);
 }
@@ -328,6 +330,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
 			WARN(1, "Device private memory not supported\n");
 			return ERR_PTR(-EINVAL);
@@ -498,7 +501,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_devmap_managed_page(struct page *page)
 {
 	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
+	if (!is_device_page(page)) {
 		wake_up_var(&page->_refcount);
 		return;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 1852d787e6ab..f74422a42192 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -362,7 +362,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 	 * Device private pages have an extra refcount as they are
 	 * ZONE_DEVICE pages.
 	 */
-	expected_count += is_device_private_page(page);
+	expected_count += is_device_page(page);
 	if (mapping)
 		expected_count += thp_nr_pages(page) + page_has_private(page);
 
@@ -2503,7 +2503,7 @@ static bool migrate_vma_check_page(struct page *page)
 		 * FIXME proper solution is to rework migration_entry_wait() so
 		 * it does not need to take a reference on page.
 		 */
-		return is_device_private_page(page);
+		return is_device_page(page);
 	}
 
 	/* For file back page */
@@ -2791,7 +2791,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
  *     handle_pte_fault()
  *       do_anonymous_page()
  * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or coherent page.
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
@@ -2867,10 +2867,15 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				swp_entry = make_readable_device_private_entry(
 							page_to_pfn(page));
 			entry = swp_entry_to_pte(swp_entry);
+		} else if (is_device_page(page)) {
+			entry = pte_mkold(mk_pte(page,
+						 READ_ONCE(vma->vm_page_prot)));
+			if (vma->vm_flags & VM_WRITE)
+				entry = pte_mkwrite(pte_mkdirty(entry));
 		} else {
 			/*
-			 * For now we only support migrating to un-addressable
-			 * device memory.
+			 * We support migrating to both private and coherent
+			 * types of ZONE_DEVICE memory.
 			 */
 			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
 			goto abort;
@@ -2976,10 +2981,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 		mapping = page_mapping(page);
 
 		if (is_zone_device_page(newpage)) {
-			if (is_device_private_page(newpage)) {
+			if (is_device_page(newpage)) {
 				/*
-				 * For now only support private anonymous when
-				 * migrating to un-addressable device memory.
+				 * For now, only anonymous pages are supported
+				 * when migrating to device memory.
 				 */
 				if (mapping) {
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-- 
2.32.0


* [PATCH v1 2/9] mm: add device coherent vma selection for memory migration
@ 2021-11-15 19:30   ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

This case is used to migrate pages from device memory back to system
memory. Device coherent type memory is cache coherent from both the
device's and the CPU's point of view.
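
As a usage sketch, a driver migrating coherent device pages back to
system memory sets the new flag when setting up the migration; this
mirrors the amdkfd change in patch 4/9, with illustrative variable
names:

	struct migrate_vma migrate = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.pgmap_owner	= SVM_ADEV_PGMAP_OWNER(adev),
		.flags		= MIGRATE_VMA_SELECT_DEVICE_COHERENT,
	};

	ret = migrate_vma_setup(&migrate);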

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
v2:
Added a condition for migrations from device coherent pages.
---
 include/linux/migrate.h | 1 +
 mm/migrate.c            | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index c8077e936691..e74bb0978f6f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -138,6 +138,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate.c b/mm/migrate.c
index f74422a42192..166bfec7d85e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2340,8 +2340,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_writable_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
-			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
-				goto next;
 			pfn = pte_pfn(pte);
 			if (is_zero_pfn(pfn)) {
 				mpfn = MIGRATE_PFN_MIGRATE;
@@ -2349,6 +2347,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				goto next;
 			}
 			page = vm_normal_page(migrate->vma, addr, pte);
+			if (page && !is_zone_device_page(page) &&
+			    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+				goto next;
+			if (page && is_zone_device_page(page) &&
+			    (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+			     page->pgmap->owner != migrate->pgmap_owner))
+				goto next;
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
-- 
2.32.0


* [PATCH v1 3/9] drm/amdkfd: add SPM support for SVM
@ 2021-11-15 19:30   ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

When the CPU is connected through XGMI, it has coherent
access to VRAM. In that case the resource range is taken
from the device's GMC aperture base rather than requested
from the iomem resource tree. This range is used along
with the device type, which can be DEVICE_PRIVATE or
DEVICE_COHERENT, to create the device page map region.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
v7:
Removed the lookup_resource call, so the exported symbol for this
function is no longer required. Dropped patch "kernel: resource:
lookup_resource as exported symbol".
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 30 ++++++++++++++++----------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index aeade32ec298..9e36fe8aea0f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -935,7 +935,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
 {
 	struct kfd_dev *kfddev = adev->kfd.dev;
 	struct dev_pagemap *pgmap;
-	struct resource *res;
+	struct resource *res = NULL;
 	unsigned long size;
 	void *r;
 
@@ -950,28 +950,34 @@ int svm_migrate_init(struct amdgpu_device *adev)
 	 * should remove reserved size
 	 */
 	size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
-	res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
-	if (IS_ERR(res))
-		return -ENOMEM;
+	if (adev->gmc.xgmi.connected_to_cpu) {
+		pgmap->range.start = adev->gmc.aper_base;
+		pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
+		pgmap->type = MEMORY_DEVICE_COHERENT;
+	} else {
+		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+		if (IS_ERR(res))
+			return -ENOMEM;
+		pgmap->range.start = res->start;
+		pgmap->range.end = res->end;
+		pgmap->type = MEMORY_DEVICE_PRIVATE;
+	}
 
-	pgmap->type = MEMORY_DEVICE_PRIVATE;
 	pgmap->nr_range = 1;
-	pgmap->range.start = res->start;
-	pgmap->range.end = res->end;
 	pgmap->ops = &svm_migrate_pgmap_ops;
 	pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
-	pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
-
+	pgmap->flags = 0;
 	/* Device manager releases device-specific resources, memory region and
 	 * pgmap when driver disconnects from device.
 	 */
 	r = devm_memremap_pages(adev->dev, pgmap);
 	if (IS_ERR(r)) {
 		pr_err("failed to register HMM device memory\n");
-
 		/* Disable SVM support capability */
-		pgmap->type = 0;
-		devm_release_mem_region(adev->dev, res->start, resource_size(res));
+		if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+			devm_release_mem_region(adev->dev, res->start,
+						resource_size(res));
+		pgmap->type = 0;
 		return PTR_ERR(r);
 	}
 
-- 
2.32.0


* [PATCH v1 4/9] drm/amdkfd: coherent type as sys mem on migration to ram
@ 2021-11-15 19:30   ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: willy, apopple, dri-devel, jglisse, amd-gfx, jgg, hch

Coherent device type memory, on VRAM to RAM migration, has similar
access characteristics as system RAM from the CPU's point of view.
The caller selects the migration source with this flag, which in the
coherent type case should be set to MIGRATE_VMA_SELECT_DEVICE_COHERENT.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 9e36fe8aea0f..3e405f078ade 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -661,9 +661,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 	migrate.vma = vma;
 	migrate.start = start;
 	migrate.end = end;
-	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
 
+	if (adev->gmc.xgmi.connected_to_cpu)
+		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+	else
+		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
 	size *= npages;
 	buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
-- 
2.32.0


* [PATCH v1 5/9] lib: test_hmm add ioctl to get zone device type
@ 2021-11-15 19:30   ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

Add a new ioctl command to query the zone device type. This will be
used once test_hmm adds the zone device coherent type.
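
A hedged sketch of the intended userspace usage; the device node name
follows the existing hmm-tests convention and is an assumption here,
as is the one-page dummy range used to satisfy the ioctl's argument
checks:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include "test_hmm_uapi.h"

	struct hmm_dmirror_cmd cmd = { .npages = 1 };
	int fd = open("/dev/hmm_dmirror0", O_RDWR);	/* assumed node name */

	if (fd >= 0 && ioctl(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &cmd) == 0)
		printf("zone device type: %llu\n",
		       (unsigned long long)cmd.zone_device_type);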

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 lib/test_hmm.c      | 15 ++++++++++++++-
 lib/test_hmm_uapi.h |  7 +++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c259842f6d44..2daaa7b3710b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -84,6 +84,7 @@ struct dmirror_chunk {
 struct dmirror_device {
 	struct cdev		cdevice;
 	struct hmm_devmem	*devmem;
+	unsigned int            zone_device_type;
 
 	unsigned int		devmem_capacity;
 	unsigned int		devmem_count;
@@ -470,6 +471,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	if (IS_ERR(res))
 		goto err_devmem;
 
+	mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
 	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
 	devmem->pagemap.range.start = res->start;
 	devmem->pagemap.range.end = res->end;
@@ -1025,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
 	return ret;
 }
 
+static int dmirror_get_device_type(struct dmirror *dmirror,
+			    struct hmm_dmirror_cmd *cmd)
+{
+	mutex_lock(&dmirror->mutex);
+	cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+	mutex_unlock(&dmirror->mutex);
+
+	return 0;
+}
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
 					unsigned int command,
 					unsigned long arg)
@@ -1074,7 +1085,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 	case HMM_DMIRROR_SNAPSHOT:
 		ret = dmirror_snapshot(dmirror, &cmd);
 		break;
-
+	case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+		ret = dmirror_get_device_type(dmirror, &cmd);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f14dea5dcd06..c42e57a6a71e 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -26,6 +26,7 @@ struct hmm_dmirror_cmd {
 	__u64		npages;
 	__u64		cpages;
 	__u64		faults;
+	__u64		zone_device_type;
 };
 
 /* Expose the address space of the calling process through hmm device file */
@@ -35,6 +36,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -62,4 +64,9 @@ enum {
 	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
 };
 
+enum {
+	/* 0 is reserved to catch uninitialized type fields */
+	HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
 #endif /* _LIB_TEST_HMM_UAPI_H */
-- 
2.32.0


* [PATCH v1 6/9] lib: test_hmm add module param for zone device type
@ 2021-11-15 19:30   ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

In order to configure the device coherent type in test_hmm, two module
parameters must be passed, corresponding to the SP start addresses of
the two devices: spm_addr_dev0 and spm_addr_dev1. If no parameters are
passed, the private device type is configured.
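
For example, to select the coherent type using the two SPM ranges from
patch 7/9's description (assuming those ranges were reserved at boot):

	modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000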

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 lib/test_hmm.c      | 67 ++++++++++++++++++++++++++++++++-------------
 lib/test_hmm_uapi.h |  1 +
 2 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 2daaa7b3710b..45334df28d7b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -34,6 +34,16 @@
 #define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE		16
 
+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+		"Specify start address for SPM (special purpose memory) used for device 0. Setting this selects the coherent device type. Make sure spm_addr_dev1 is set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+		"Specify start address for SPM (special purpose memory) used for device 1. Setting this selects the coherent device type. Make sure spm_addr_dev0 is set too");
+
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
@@ -452,11 +462,11 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 	return ret;
 }
 
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 				   struct page **ppage)
 {
 	struct dmirror_chunk *devmem;
-	struct resource *res;
+	struct resource *res = NULL;
 	unsigned long pfn;
 	unsigned long pfn_first;
 	unsigned long pfn_last;
@@ -464,17 +474,30 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
 	if (!devmem)
-		return false;
+		return -ENOMEM;
 
-	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
-				      "hmm_dmirror");
-	if (IS_ERR(res))
-		goto err_devmem;
+	if (!spm_addr_dev0 && !spm_addr_dev1) {
+		res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+					      "hmm_dmirror");
+		if (IS_ERR_OR_NULL(res))
+			goto err_devmem;
+		devmem->pagemap.range.start = res->start;
+		devmem->pagemap.range.end = res->end;
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+	} else if (spm_addr_dev0 && spm_addr_dev1) {
+		devmem->pagemap.range.start = MINOR(mdevice->cdevice.dev) ?
+							spm_addr_dev0 :
+							spm_addr_dev1;
+		devmem->pagemap.range.end = devmem->pagemap.range.start +
+					    DEVMEM_CHUNK_SIZE - 1;
+		devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
+		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+	} else {
+		pr_err("Both spm_addr_dev parameters should be set\n");
+		goto err_devmem;
+	}
 
-	mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-	devmem->pagemap.range.start = res->start;
-	devmem->pagemap.range.end = res->end;
 	devmem->pagemap.nr_range = 1;
 	devmem->pagemap.ops = &dmirror_devmem_ops;
 	devmem->pagemap.owner = mdevice;
@@ -495,10 +517,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		mdevice->devmem_capacity = new_capacity;
 		mdevice->devmem_chunks = new_chunks;
 	}
-
 	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
-	if (IS_ERR(ptr))
+	if (IS_ERR_OR_NULL(ptr)) {
+		if (ptr)
+			ret = PTR_ERR(ptr);
+		else
+			ret = -EFAULT;
 		goto err_release;
+	}
 
 	devmem->mdevice = mdevice;
 	pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -531,7 +557,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
-	release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
+	if (res)
+		release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
 err_devmem:
 	kfree(devmem);
 
@@ -1219,10 +1246,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 	if (ret)
 		return ret;
 
-	/* Build a list of free ZONE_DEVICE private struct pages */
-	dmirror_allocate_chunk(mdevice, NULL);
-
-	return 0;
+	/* Build a list of free ZONE_DEVICE struct pages */
+	return dmirror_allocate_chunk(mdevice, NULL);
 }
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
@@ -1235,8 +1260,9 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
 				mdevice->devmem_chunks[i];
 
 			memunmap_pages(&devmem->pagemap);
-			release_mem_region(devmem->pagemap.range.start,
-					   range_len(&devmem->pagemap.range));
+			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+				release_mem_region(devmem->pagemap.range.start,
+						   range_len(&devmem->pagemap.range));
 			kfree(devmem);
 		}
 		kfree(mdevice->devmem_chunks);
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index c42e57a6a71e..77f81e6314eb 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -67,6 +67,7 @@ enum {
 enum {
 	/* 0 is reserved to catch uninitialized type fields */
 	HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+	HMM_DMIRROR_MEMORY_DEVICE_COHERENT,
 };
 
 #endif /* _LIB_TEST_HMM_UAPI_H */
-- 
2.32.0


* [PATCH v1 7/9] lib: add support for device coherent type in test_hmm
@ 2021-11-15 19:30   ` Alex Sierra
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: willy, apopple, dri-devel, jglisse, amd-gfx, jgg, hch

Device Coherent type uses device memory that is coherently accessible
by the CPU. Such memory shows up as an SP (special purpose) memory
range in the BIOS-e820 memory enumeration. If the system has no SP
memory, it can be faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two SP ranges of at least 256MB
each, specified through the efi_fake_mem kernel parameter. For
example, two 1GB SP ranges starting at physical addresses 0x100000000
and 0x140000000:
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
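
With those ranges reserved, the test driver can then be loaded against
them (illustrative invocation; the spm_addr_dev0/spm_addr_dev1 module
parameters are the ones added by this series):

	modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000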

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 lib/test_hmm.c      | 191 +++++++++++++++++++++++++++++++++-----------
 lib/test_hmm_uapi.h |  15 ++--
 2 files changed, 153 insertions(+), 53 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 45334df28d7b..9e2cc0cd4412 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -471,6 +471,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	unsigned long pfn_first;
 	unsigned long pfn_last;
 	void *ptr;
+	int ret = -ENOMEM;
 
 	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
 	if (!devmem)
@@ -553,7 +554,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	}
 	spin_unlock(&mdevice->lock);
 
-	return true;
+	return 0;
 
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
@@ -562,7 +563,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 err_devmem:
 	kfree(devmem);
 
-	return false;
+	return ret;
 }
 
 static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
@@ -571,8 +572,10 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	struct page *rpage;
 
 	/*
-	 * This is a fake device so we alloc real system memory to store
-	 * our device memory.
+	 * For ZONE_DEVICE private type, this is a fake device so we alloc real
+	 * system memory to store our device memory.
+	 * For ZONE_DEVICE coherent type we use the actual dpage to store the data
+	 * and ignore rpage.
 	 */
 	rpage = alloc_page(GFP_HIGHUSER);
 	if (!rpage)
@@ -623,12 +626,17 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 * unallocated pte_none() or read-only zero page.
 		 */
 		spage = migrate_pfn_to_page(*src);
+		if (spage && is_zone_device_page(spage))
+			pr_err("page already in device spage pfn: 0x%lx\n",
+				page_to_pfn(spage));
+		WARN_ON(spage && is_zone_device_page(spage));
 
 		dpage = dmirror_devmem_alloc_page(mdevice);
 		if (!dpage)
 			continue;
 
-		rpage = dpage->zone_device_data;
+		rpage = is_device_private_page(dpage) ? dpage->zone_device_data :
+							dpage;
 		if (spage)
 			copy_highpage(rpage, spage);
 		else
@@ -640,8 +648,11 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 * the simulated device memory and that page holds the pointer
 		 * to the mirror.
 		 */
+		rpage = dpage->zone_device_data;
 		rpage->zone_device_data = dmirror;
 
+		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+			 page_to_pfn(spage), page_to_pfn(dpage));
 		*dst = migrate_pfn(page_to_pfn(dpage)) |
 			    MIGRATE_PFN_LOCKED;
 		if ((*src & MIGRATE_PFN_WRITE) ||
@@ -721,10 +732,13 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 			continue;
 
 		/*
-		 * Store the page that holds the data so the page table
-		 * doesn't have to deal with ZONE_DEVICE private pages.
+		 * For ZONE_DEVICE private pages we store the page that
+		 * holds the data so the page table doesn't have to deal with it.
+		 * For ZONE_DEVICE coherent pages we store the actual page, since
+		 * the CPU has coherent access to the page.
 		 */
-		entry = dpage->zone_device_data;
+		entry = is_device_private_page(dpage) ? dpage->zone_device_data :
+							dpage;
 		if (*dst & MIGRATE_PFN_WRITE)
 			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
 		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
@@ -804,8 +818,110 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 	return ret;
 }
 
-static int dmirror_migrate(struct dmirror *dmirror,
-			   struct hmm_dmirror_cmd *cmd)
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+						      struct dmirror *dmirror)
+{
+	const unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long start = args->start;
+	unsigned long end = args->end;
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE,
+				       src++, dst++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(*src);
+		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		WARN_ON(!is_device_page(spage));
+		spage = is_device_private_page(spage) ? spage->zone_device_data :
+							spage;
+		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+		if (!dpage)
+			continue;
+		pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+			 page_to_pfn(spage), page_to_pfn(dpage));
+
+		lock_page(dpage);
+		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+		copy_highpage(dpage, spage);
+		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+		if (*src & MIGRATE_PFN_WRITE)
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+	return 0;
+}
+
+static int dmirror_migrate_to_system(struct dmirror *dmirror,
+				     struct hmm_dmirror_cmd *cmd)
+{
+	unsigned long start, end, addr;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	struct mm_struct *mm = dmirror->notifier.mm;
+	struct vm_area_struct *vma;
+	unsigned long src_pfns[64];
+	unsigned long dst_pfns[64];
+	struct migrate_vma args;
+	unsigned long next;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return -EINVAL;
+
+	mmap_read_lock(mm);
+	for (addr = start; addr < end; addr = next) {
+		vma = find_vma(mm, addr);
+		if (!vma || addr < vma->vm_start ||
+		    !(vma->vm_flags & VM_READ)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		if (next > vma->vm_end)
+			next = vma->vm_end;
+
+		args.vma = vma;
+		args.src = src_pfns;
+		args.dst = dst_pfns;
+		args.start = addr;
+		args.end = next;
+		args.pgmap_owner = dmirror->mdevice;
+		args.flags = (dmirror->mdevice->zone_device_type ==
+			      HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+			      MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+			      MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+
+		ret = migrate_vma_setup(&args);
+		if (ret)
+			goto out;
+
+		pr_debug("Migrating from device mem to sys mem\n");
+		dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+
+		migrate_vma_pages(&args);
+		migrate_vma_finalize(&args);
+	}
+	mmap_read_unlock(mm);
+	mmput(mm);
+
+	return ret;
+
+out:
+	mmap_read_unlock(mm);
+	mmput(mm);
+	return ret;
+}
+
+static int dmirror_migrate_to_device(struct dmirror *dmirror,
+				struct hmm_dmirror_cmd *cmd)
 {
 	unsigned long start, end, addr;
 	unsigned long size = cmd->npages << PAGE_SHIFT;
@@ -849,6 +965,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
 		if (ret)
 			goto out;
 
+		pr_debug("Migrating from sys mem to device mem\n");
 		dmirror_migrate_alloc_and_copy(&args, dmirror);
 		migrate_vma_pages(&args);
 		dmirror_migrate_finalize_and_map(&args, dmirror);
@@ -857,7 +974,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
 	mmap_read_unlock(mm);
 	mmput(mm);
 
-	/* Return the migrated data for verification. */
+	/* Return the migrated data for verification. Only for pages in the device zone. */
 	ret = dmirror_bounce_init(&bounce, start, size);
 	if (ret)
 		return ret;
@@ -894,9 +1011,15 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
 	}
 
 	page = hmm_pfn_to_page(entry);
-	if (is_device_private_page(page)) {
-		/* Is the page migrated to this device or some other? */
-		if (dmirror->mdevice == dmirror_page_to_device(page))
+	if (is_device_page(page)) {
+		/* Is page ZONE_DEVICE coherent? */
+		if (!is_device_private_page(page))
+			*perm = HMM_DMIRROR_PROT_DEV_COHERENT;
+		/*
+		 * Is page ZONE_DEVICE private migrated to
+		 * this device or some other?
+		 */
+		else if (dmirror->mdevice == dmirror_page_to_device(page))
 			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
 		else
 			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
@@ -1096,8 +1219,12 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		ret = dmirror_write(dmirror, &cmd);
 		break;
 
-	case HMM_DMIRROR_MIGRATE:
-		ret = dmirror_migrate(dmirror, &cmd);
+	case HMM_DMIRROR_MIGRATE_TO_DEV:
+		ret = dmirror_migrate_to_device(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_MIGRATE_TO_SYS:
+		ret = dmirror_migrate_to_system(dmirror, &cmd);
 		break;
 
 	case HMM_DMIRROR_EXCLUSIVE:
@@ -1152,38 +1279,6 @@ static void dmirror_devmem_free(struct page *page)
 	spin_unlock(&mdevice->lock);
 }
 
-static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
-						      struct dmirror *dmirror)
-{
-	const unsigned long *src = args->src;
-	unsigned long *dst = args->dst;
-	unsigned long start = args->start;
-	unsigned long end = args->end;
-	unsigned long addr;
-
-	for (addr = start; addr < end; addr += PAGE_SIZE,
-				       src++, dst++) {
-		struct page *dpage, *spage;
-
-		spage = migrate_pfn_to_page(*src);
-		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
-			continue;
-		spage = spage->zone_device_data;
-
-		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-		if (!dpage)
-			continue;
-
-		lock_page(dpage);
-		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
-		copy_highpage(dpage, spage);
-		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
-		if (*src & MIGRATE_PFN_WRITE)
-			*dst |= MIGRATE_PFN_WRITE;
-	}
-	return 0;
-}
-
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
 	struct migrate_vma args;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 77f81e6314eb..af15c35a90f4 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -19,6 +19,7 @@
  * @npages: (in) number of pages to read/write
  * @cpages: (out) number of pages copied
  * @faults: (out) number of device page faults seen
+ * @zone_device_type: (out) zone device memory type
  */
 struct hmm_dmirror_cmd {
 	__u64		addr;
@@ -32,11 +33,12 @@ struct hmm_dmirror_cmd {
 /* Expose the address space of the calling process through hmm device file */
 #define HMM_DMIRROR_READ		_IOWR('H', 0x00, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x05, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_GET_MEM_DEV_TYPE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_DEV	_IOWR('H', 0x02, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_SYS	_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE	_IOWR('H', 0x07, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -51,6 +53,8 @@ struct hmm_dmirror_cmd {
  *					device the ioctl() is made
  * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
  *					other device
+ * HMM_DMIRROR_PROT_DEV_COHERENT: Migrate device coherent page on the device
+ *				  the ioctl() is made
  */
 enum {
 	HMM_DMIRROR_PROT_ERROR			= 0xFF,
@@ -62,6 +66,7 @@ enum {
 	HMM_DMIRROR_PROT_ZERO			= 0x10,
 	HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL	= 0x20,
 	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
+	HMM_DMIRROR_PROT_DEV_COHERENT           = 0x40,
 };
 
 enum {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 8/9] tools: update hmm-test to support device coherent type
  2021-11-15 19:30 ` Alex Sierra
@ 2021-11-15 19:30   ` Alex Sierra
  -1 siblings, 0 replies; 38+ messages in thread
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

Test cases such as migrate_fault and migrate_multiple were modified
to migrate explicitly from device to system memory, without relying
on page faults, when the device coherent type is used.

The snapshot test case was updated to read the memory device type
first and, based on that, check for the proper returned results. A
migrate_ping_pong test case was added to test explicit migration
from device to system memory for both private and coherent zone
types.

Helpers to migrate from device to system memory and vice versa were
also added.
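
The updated tests broadly follow this pattern (a minimal sketch
distilled from the diff below, not a complete test;
hmm_is_private_device() and the two migrate helpers are the ones
added by this patch):

	bool is_private;

	/* Query the device type once, then pick the path back to system memory. */
	ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);

	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
	ASSERT_EQ(ret, 0);

	if (!is_private) {
		/* Coherent pages never fault, so migrate them back explicitly. */
		ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
		ASSERT_EQ(ret, 0);
	}

	/* Private pages are faulted back implicitly by the CPU reads below. */
	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
		ASSERT_EQ(ptr[i], i);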

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 tools/testing/selftests/vm/hmm-tests.c | 156 ++++++++++++++++++++++---
 1 file changed, 138 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 864f126ffd78..6091e30636d5 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,7 @@ struct hmm_buffer {
 	int		fd;
 	uint64_t	cpages;
 	uint64_t	faults;
+	int		zone_device_type;
 };
 
 #define TWOMEG		(1 << 21)
@@ -144,6 +145,7 @@ static int hmm_dmirror_cmd(int fd,
 	}
 	buffer->cpages = cmd.cpages;
 	buffer->faults = cmd.faults;
+	buffer->zone_device_type = cmd.zone_device_type;
 
 	return 0;
 }
@@ -211,6 +213,32 @@ static void hmm_nanosleep(unsigned int n)
 	nanosleep(&t, NULL);
 }
 
+static int hmm_migrate_sys_to_dev(int fd,
+				   struct hmm_buffer *buffer,
+				   unsigned long npages)
+{
+	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+				   struct hmm_buffer *buffer,
+				   unsigned long npages)
+{
+	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
+static int hmm_is_private_device(int fd, bool *res)
+{
+	struct hmm_buffer buffer;
+	int ret;
+
+	buffer.ptr = 0;
+	ret = hmm_dmirror_cmd(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &buffer, 1);
+	*res = (buffer.zone_device_type == HMM_DMIRROR_MEMORY_DEVICE_PRIVATE);
+
+	return ret;
+}
+
 /*
  * Simple NULL test of device open/close.
  */
@@ -875,7 +903,7 @@ TEST_F(hmm, migrate)
 		ptr[i] = i;
 
 	/* Migrate memory to device. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, npages);
 
@@ -923,7 +951,7 @@ TEST_F(hmm, migrate_fault)
 		ptr[i] = i;
 
 	/* Migrate memory to device. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, npages);
 
@@ -936,7 +964,7 @@ TEST_F(hmm, migrate_fault)
 		ASSERT_EQ(ptr[i], i);
 
 	/* Migrate memory to the device again. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, npages);
 
@@ -976,7 +1004,7 @@ TEST_F(hmm, migrate_shared)
 	ASSERT_NE(buffer->ptr, MAP_FAILED);
 
 	/* Migrate memory to device. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, -ENOENT);
 
 	hmm_buffer_free(buffer);
@@ -1015,7 +1043,7 @@ TEST_F(hmm2, migrate_mixed)
 	p = buffer->ptr;
 
 	/* Migrating a protected area should be an error. */
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
 	ASSERT_EQ(ret, -EINVAL);
 
 	/* Punch a hole after the first page address. */
@@ -1023,7 +1051,7 @@ TEST_F(hmm2, migrate_mixed)
 	ASSERT_EQ(ret, 0);
 
 	/* We expect an error if the vma doesn't cover the range. */
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
 	ASSERT_EQ(ret, -EINVAL);
 
 	/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1083,13 @@ TEST_F(hmm2, migrate_mixed)
 
 	/* Now try to migrate pages 2-5 to device 1. */
 	buffer->ptr = p + 2 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, 4);
 
 	/* Page 5 won't be migrated to device 0 because it's on device 1. */
 	buffer->ptr = p + 5 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
 	ASSERT_EQ(ret, -ENOENT);
 	buffer->ptr = p;
 
@@ -1070,8 +1098,12 @@ TEST_F(hmm2, migrate_mixed)
 }
 
 /*
- * Migrate anonymous memory to device private memory and fault it back to system
- * memory multiple times.
+ * Migrate anonymous memory to device memory and back to system memory
+ * multiple times. With the private zone configuration, the pages are
+ * faulted back by CPU accesses. With the coherent zone configuration,
+ * the device pages must be migrated back to system memory explicitly,
+ * because coherent device memory is coherently accessible by the CPU
+ * and therefore never generates a page fault.
  */
 TEST_F(hmm, migrate_multiple)
 {
@@ -1082,7 +1114,9 @@ TEST_F(hmm, migrate_multiple)
 	unsigned long c;
 	int *ptr;
 	int ret;
+	bool is_private;
 
+	ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);
 	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
 	ASSERT_NE(npages, 0);
 	size = npages << self->page_shift;
@@ -1107,8 +1141,7 @@ TEST_F(hmm, migrate_multiple)
 			ptr[i] = i;
 
 		/* Migrate memory to device. */
-		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
-				      npages);
+		ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 		ASSERT_EQ(ret, 0);
 		ASSERT_EQ(buffer->cpages, npages);
 
@@ -1116,7 +1149,12 @@ TEST_F(hmm, migrate_multiple)
 		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
 			ASSERT_EQ(ptr[i], i);
 
-		/* Fault pages back to system memory and check them. */
+		/* Migrate back to system memory and check them. */
+		if (!is_private) {
+			ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+			ASSERT_EQ(ret, 0);
+		}
+
 		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
 			ASSERT_EQ(ptr[i], i);
 
@@ -1261,10 +1299,12 @@ TEST_F(hmm2, snapshot)
 	unsigned char *m;
 	int ret;
 	int val;
+	bool is_private;
 
 	npages = 7;
 	size = npages << self->page_shift;
 
+	ASSERT_EQ(hmm_is_private_device(self->fd0, &is_private), 0);
 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
 
@@ -1312,13 +1352,13 @@ TEST_F(hmm2, snapshot)
 
 	/* Page 5 will be migrated to device 0. */
 	buffer->ptr = p + 5 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, 1);
 
 	/* Page 6 will be migrated to device 1. */
 	buffer->ptr = p + 6 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 1);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, 1);
 
@@ -1335,9 +1375,16 @@ TEST_F(hmm2, snapshot)
 	ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
 	ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
 	ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
-	ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
-			HMM_DMIRROR_PROT_WRITE);
-	ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+	if (is_private) {
+		ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+				HMM_DMIRROR_PROT_WRITE);
+		ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+	} else {
+		ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_COHERENT |
+				HMM_DMIRROR_PROT_WRITE);
+		ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_COHERENT |
+				HMM_DMIRROR_PROT_WRITE);
+	}
 
 	hmm_buffer_free(buffer);
 }
@@ -1496,7 +1543,13 @@ TEST_F(hmm, exclusive)
 	unsigned long i;
 	int *ptr;
 	int ret;
+	bool is_private;
 
+	ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);
+	if (!is_private) {
+		printf("Skipping test as memory device type is not private\n");
+		return;
+	}
 	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
 	ASSERT_NE(npages, 0);
 	size = npages << self->page_shift;
@@ -1550,7 +1603,13 @@ TEST_F(hmm, exclusive_mprotect)
 	unsigned long i;
 	int *ptr;
 	int ret;
+	bool is_private;
 
+	ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);
+	if (!is_private) {
+		printf("Skipping test as memory device type is not private\n");
+		return;
+	}
 	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
 	ASSERT_NE(npages, 0);
 	size = npages << self->page_shift;
@@ -1603,7 +1662,13 @@ TEST_F(hmm, exclusive_cow)
 	unsigned long i;
 	int *ptr;
 	int ret;
+	bool is_private;
 
+	ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);
+	if (!is_private) {
+		printf("Skipping test as memory device type is not private\n");
+		return;
+	}
 	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
 	ASSERT_NE(npages, 0);
 	size = npages << self->page_shift;
@@ -1643,4 +1708,59 @@ TEST_F(hmm, exclusive_cow)
 	hmm_buffer_free(buffer);
 }
 
+/*
+ * Migrate anonymous memory to device memory and migrate back to system memory
+ * explicitly, without generating a page fault.
+ */
+TEST_F(hmm, migrate_ping_pong)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+
+	/* Migrate memory back to system mem. */
+	ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+
+	/* Check the buffer migrated back to system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
 TEST_HARNESS_MAIN
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 9/9] tools: update test_hmm script to support SP config
  2021-11-15 19:30 ` Alex Sierra
@ 2021-11-15 19:30   ` Alex Sierra
  -1 siblings, 0 replies; 38+ messages in thread
From: Alex Sierra @ 2021-11-15 19:30 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: willy, apopple, dri-devel, jglisse, amd-gfx, jgg, hch

Add two more parameters to set the spm_addr_dev0 and spm_addr_dev1
addresses. These parameters configure the start SP address for each
device in the test_hmm driver and, consequently, set the zone device
type to coherent.
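
With SPM configured, the smoke test can then be run against the two
range start addresses, matching the usage text added below (the
addresses are examples):

	./test_hmm.sh smoke 0x100000000 0x140000000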

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 tools/testing/selftests/vm/test_hmm.sh | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..3eeabe94399f 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,7 +40,18 @@ check_test_requirements()
 
 load_driver()
 {
-	modprobe $DRIVER > /dev/null 2>&1
+	if [ $# -eq 0 ]; then
+		modprobe $DRIVER > /dev/null 2>&1
+	else
+		if [ $# -eq 2 ]; then
+			modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2 \
+				> /dev/null 2>&1
+		else
+			echo "Missing module parameters. Make sure to pass"\
+			"spm_addr_dev0 and spm_addr_dev1"
+			usage
+		fi
+	fi
 	if [ $? == 0 ]; then
 		major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
 		mknod /dev/hmm_dmirror0 c $major 0
@@ -58,7 +69,7 @@ run_smoke()
 {
 	echo "Running smoke test. Note, this test provides basic coverage."
 
-	load_driver
+	load_driver $1 $2
 	$(dirname "${BASH_SOURCE[0]}")/hmm-tests
 	unload_driver
 }
@@ -75,6 +86,9 @@ usage()
 	echo "# Smoke testing"
 	echo "./${TEST_NAME}.sh smoke"
 	echo
+	echo "# Smoke testing with SPM enabled"
+	echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+	echo
 	exit 0
 }
 
@@ -84,7 +98,7 @@ function run_test()
 		usage
 	else
 		if [ "$1" = "smoke" ]; then
-			run_smoke
+			run_smoke $2 $3
 		else
 			usage
 		fi
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 1/9] mm: add zone device coherent type memory support
  2021-11-15 19:30   ` Alex Sierra
@ 2021-11-17  9:50     ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2021-11-17  9:50 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse, apopple, willy

On Mon, Nov 15, 2021 at 01:30:18PM -0600, Alex Sierra wrote:
> @@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   *     target for charge migration. if @target is not NULL, the entry is stored
>   *     in target->ent.
> - *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
> - *     (so ZONE_DEVICE page and thus not on the lru).
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_COHERENT
> + *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).

Please avoid the overly long line.  But I don't think we need to mention
the exact enum, but rather do something like:

 *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
 *     thus not on the lru.

> +	switch (pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_COHERENT:
>  		/*
>  		 * TODO: Handle HMM pages which may need coordination
>  		 * with device-side memory.

This might be a good opportunity for doing a s/HMM/device/ here.
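
Applied to the quoted comment, that substitution would read:

	/*
	 * TODO: Handle device pages which may need coordination
	 * with device-side memory.
	 */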

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 1/9] mm: add zone device coherent type memory support
  2021-11-15 19:30   ` Alex Sierra
@ 2021-11-18  6:53     ` Alistair Popple
  -1 siblings, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2021-11-18  6:53 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	Alex Sierra
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, willy

On Tuesday, 16 November 2021 6:30:18 AM AEDT Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
>  include/linux/memremap.h |  8 ++++++++
>  include/linux/mm.h       |  9 +++++++++
>  mm/memcontrol.c          |  6 +++---
>  mm/memory-failure.c      |  6 +++++-
>  mm/memremap.c            |  5 ++++-
>  mm/migrate.c             | 21 +++++++++++++--------
>  6 files changed, 42 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c0e9d35889e8..ff4d398edf35 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -39,6 +39,13 @@ struct vmem_altmap {
>   * A more complete discussion of unaddressable memory may be found in
>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>   *
> + * MEMORY_DEVICE_COHERENT:
> + * Device memory that is cache coherent from device and CPU point of view. This
> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> + * type. Any page of a process can be migrated to such memory. However no one
> + * should be allowed to pin such memory so that it can always be evicted.
> + *
>   * MEMORY_DEVICE_FS_DAX:
>   * Host memory that has similar access semantics as System RAM i.e. DMA
>   * coherent and supports page pinning. In support of coordinating page
> @@ -59,6 +66,7 @@ struct vmem_altmap {
>  enum memory_type {
>  	/* 0 is reserved to catch uninitialized type fields */
>  	MEMORY_DEVICE_PRIVATE = 1,
> +	MEMORY_DEVICE_COHERENT,
>  	MEMORY_DEVICE_FS_DAX,
>  	MEMORY_DEVICE_GENERIC,
>  	MEMORY_DEVICE_PCI_P2PDMA,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..d23b6ebd2177 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,7 @@ static inline bool page_is_devmap_managed(struct page *page)
>  		return false;
>  	switch (page->pgmap->type) {
>  	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_COHERENT:
>  	case MEMORY_DEVICE_FS_DAX:
>  		return true;
>  	default:
> @@ -1191,6 +1192,14 @@ static inline bool is_device_private_page(const struct page *page)
>  		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
>  }
>  
> +static inline bool is_device_page(const struct page *page)
> +{
> +	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> +		is_zone_device_page(page) &&
> +		(page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
> +		page->pgmap->type == MEMORY_DEVICE_COHERENT);
> +}
> +
>  static inline bool is_pci_p2pdma_page(const struct page *page)
>  {
>  	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..e0275ebe1198 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   *     target for charge migration. if @target is not NULL, the entry is stored
>   *     in target->ent.
> - *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
> - *     (so ZONE_DEVICE page and thus not on the lru).
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_COHERENT
> + *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
>   *     For now we such page is charge like a regular page would be as for all
>   *     intent and purposes it is just special memory taking the place of a
>   *     regular page.
> @@ -5730,7 +5730,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>  		 */
>  		if (page_memcg(page) == mc.from) {
>  			ret = MC_TARGET_PAGE;
> -			if (is_device_private_page(page))
> +			if (is_device_page(page))
>  				ret = MC_TARGET_DEVICE;
>  			if (target)
>  				target->page = page;
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3e6449f2102a..51b55fc5172c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1554,12 +1554,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  		goto unlock;
>  	}
>  
> -	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> +	switch (pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_COHERENT:
>  		/*
>  		 * TODO: Handle HMM pages which may need coordination
>  		 * with device-side memory.
>  		 */
>  		goto unlock;
> +	default:
> +		break;
>  	}
>  
>  	/*
> diff --git a/mm/memremap.c b/mm/memremap.c
> index ed593bf87109..94d6a1e01d42 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -44,6 +44,7 @@ EXPORT_SYMBOL(devmap_managed_key);
>  static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
>  {
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
> +	    pgmap->type == MEMORY_DEVICE_COHERENT ||
>  	    pgmap->type == MEMORY_DEVICE_FS_DAX)
>  		static_branch_dec(&devmap_managed_key);
>  }
> @@ -51,6 +52,7 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
>  static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
>  {
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
> +	    pgmap->type == MEMORY_DEVICE_COHERENT ||
>  	    pgmap->type == MEMORY_DEVICE_FS_DAX)
>  		static_branch_inc(&devmap_managed_key);
>  }
> @@ -328,6 +330,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>  
>  	switch (pgmap->type) {
>  	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_COHERENT:
>  		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
>  			WARN(1, "Device private memory not supported\n");
>  			return ERR_PTR(-EINVAL);
> @@ -498,7 +501,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  void free_devmap_managed_page(struct page *page)
>  {
>  	/* notify page idle for dax */
> -	if (!is_device_private_page(page)) {
> +	if (!is_device_page(page)) {
>  		wake_up_var(&page->_refcount);
>  		return;
>  	}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1852d787e6ab..f74422a42192 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -362,7 +362,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
>  	 * Device private pages have an extra refcount as they are
>  	 * ZONE_DEVICE pages.
>  	 */
> -	expected_count += is_device_private_page(page);
> +	expected_count += is_device_page(page);
>  	if (mapping)
>  		expected_count += thp_nr_pages(page) + page_has_private(page);
>  
> @@ -2503,7 +2503,7 @@ static bool migrate_vma_check_page(struct page *page)
>  		 * FIXME proper solution is to rework migration_entry_wait() so
>  		 * it does not need to take a reference on page.
>  		 */

Note that I have posted a patch to fix this - see
https://lore.kernel.org/all/20211118020754.954425-1-apopple@nvidia.com/. This
looks ok for now assuming coherent pages can never be pinned.

However that raises a question - what happens when something calls
get_user_pages() on a pfn pointing to a coherent device page? I can't see
anything in this series that prevents pinning of coherent device pages, so we
can't just assume they aren't pinned.

In the case of device-private pages this is enforced by the fact they never
have present pte's, so any attempt to GUP them results in a fault. But if I'm
understanding this series correctly that won't be the case for coherent device
pages right?

> -		return is_device_private_page(page);
> +		return is_device_page(page);
>  	}
>  
>  	/* For file back page */
> @@ -2791,7 +2791,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
>   *     handle_pte_fault()
>   *       do_anonymous_page()
>   * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
> - * private page.
> + * private or coherent page.
>   */
>  static void migrate_vma_insert_page(struct migrate_vma *migrate,
>  				    unsigned long addr,
> @@ -2867,10 +2867,15 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>  				swp_entry = make_readable_device_private_entry(
>  							page_to_pfn(page));
>  			entry = swp_entry_to_pte(swp_entry);
> +		} else if (is_device_page(page)) {

How about adding an explicit `is_device_coherent_page()` helper? It would make
it explicit that this test is expected to handle only coherent pages, and I bet
there will be future changes that need to differentiate between private and
coherent pages anyway.
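
Untested sketch of the helper I have in mind, mirroring the
is_device_private_page() definition quoted above:

	static inline bool is_device_coherent_page(const struct page *page)
	{
		return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
			is_zone_device_page(page) &&
			page->pgmap->type == MEMORY_DEVICE_COHERENT;
	}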

> +			entry = pte_mkold(mk_pte(page,
> +						 READ_ONCE(vma->vm_page_prot)));
> +			if (vma->vm_flags & VM_WRITE)
> +				entry = pte_mkwrite(pte_mkdirty(entry));
>  		} else {
>  			/*
> -			 * For now we only support migrating to un-addressable
> -			 * device memory.
> +			 * We support migrating to private and coherent types
> +			 * for device zone memory.
>  			 */
>  			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
>  			goto abort;
> @@ -2976,10 +2981,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>  		mapping = page_mapping(page);
>  
>  		if (is_zone_device_page(newpage)) {
> -			if (is_device_private_page(newpage)) {
> +			if (is_device_page(newpage)) {
>  				/*
> -				 * For now only support private anonymous when
> -				 * migrating to un-addressable device memory.
> +				 * For now only support private and coherent
> +				 * anonymous when migrating to device memory.
>  				 */
>  				if (mapping) {
>  					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 2/9] mm: add device coherent vma selection for memory migration
  2021-11-15 19:30   ` Alex Sierra
@ 2021-11-18  6:54     ` Alistair Popple
  -1 siblings, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2021-11-18  6:54 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	Alex Sierra
  Cc: amd-gfx, willy, jglisse, dri-devel, jgg, hch

On Tuesday, 16 November 2021 6:30:19 AM AEDT Alex Sierra wrote:
> This case is used to migrate pages from device memory, back to system
> memory. Device coherent type memory is cache coherent from device and CPU
> point of view.
> 
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
> v2:
> condition added when migrations from device coherent pages.
> ---
>  include/linux/migrate.h | 1 +
>  mm/migrate.c            | 9 +++++++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index c8077e936691..e74bb0978f6f 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -138,6 +138,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>  enum migrate_vma_direction {
>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> +	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>  };
>  
>  struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f74422a42192..166bfec7d85e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2340,8 +2340,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			if (is_writable_device_private_entry(entry))
>  				mpfn |= MIGRATE_PFN_WRITE;
>  		} else {
> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> -				goto next;
>  			pfn = pte_pfn(pte);
>  			if (is_zero_pfn(pfn)) {
>  				mpfn = MIGRATE_PFN_MIGRATE;
> @@ -2349,6 +2347,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  				goto next;
>  			}
>  			page = vm_normal_page(migrate->vma, addr, pte);
> +			if (!is_zone_device_page(page) &&
> +			    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> +				goto next;
> +			if (is_zone_device_page(page) &&
> +			    (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
> +			     page->pgmap->owner != migrate->pgmap_owner))
> +				goto next;

Thanks Alex, couple of comments on this:

1. vm_normal_page() can return NULL, so you need to add a check for
   page == NULL; otherwise the call to is_zone_device_page(NULL) will crash.
2. The check for a coherent device page is too indirect. Being explicit and
   using is_zone_device_coherent_page() instead would make it more direct and
   obvious, particularly for developers who may not immediately realise that
   device-private pages should never have pte_present() entries.
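
Putting both together, an untested sketch of what I mean (assuming the
is_device_coherent_page() helper suggested on patch 1/9):

	page = vm_normal_page(migrate->vma, addr, pte);
	if (page && !is_zone_device_page(page) &&
	    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
		goto next;
	else if (page && is_device_coherent_page(page) &&
		 (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
		  page->pgmap->owner != migrate->pgmap_owner))
		goto next;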

>  			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>  		}
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 6/9] lib: test_hmm add module param for zone device type
  2021-11-15 19:30   ` Alex Sierra
@ 2021-11-18  7:19     ` Alistair Popple
  -1 siblings, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2021-11-18  7:19 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	Alex Sierra
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse, willy

On Tuesday, 16 November 2021 6:30:23 AM AEDT Alex Sierra wrote:
> In order to configure device coherent in test_hmm, two module parameters
> should be passed, which correspond to the SP start address of each
> device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
> private device type is configured.

Thanks for taking the time to add proper tests for this. As previously
mentioned I don't like the need for module parameters, but I understand why
they are difficult to avoid here.

However, as also mentioned previously, the restriction of being able to test
only private *or* coherent device pages is unnecessary and makes testing both
types harder, especially if we need to test migration between device private
and coherent pages.
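
To spell out the restriction: with these parameters the choice is global per
module load, e.g. (the SPM addresses here are made up, any valid carve-outs
would do):

	# all test devices expose coherent memory
	modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000

	# all test devices expose device-private memory
	modprobe test_hmm

so a single test run can never exercise both page types at once.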

<snip>
 
> -	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
> -				      "hmm_dmirror");
> -	if (IS_ERR(res))
> -		goto err_devmem;
> +	if (!spm_addr_dev0 && !spm_addr_dev1) {
> +		res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
> +					      "hmm_dmirror");
> +		if (IS_ERR_OR_NULL(res))
> +			goto err_devmem;
> +		devmem->pagemap.range.start = res->start;
> +		devmem->pagemap.range.end = res->end;
> +		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
> +	} else if (spm_addr_dev0 && spm_addr_dev1) {
> +		devmem->pagemap.range.start = MINOR(mdevice->cdevice.dev) ?
> +							spm_addr_dev0 :
> +							spm_addr_dev1;

It seems like it would be fairly straightforward to address this concern by
adding extra minor character devices for the coherent devices. Would it be
possible for you to try that?
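
Rough, untested sketch of that idea, reworking the hunk above so minors 0/1
keep device private memory and new minors 2/3 take the coherent carve-outs:

	if (MINOR(mdevice->cdevice.dev) < 2) {
		/* minors 0 and 1 keep using device private memory */
		res = request_free_mem_region(&iomem_resource,
					      DEVMEM_CHUNK_SIZE, "hmm_dmirror");
		if (IS_ERR(res))
			goto err_devmem;
		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
		devmem->pagemap.range.start = res->start;
		devmem->pagemap.range.end = res->end;
		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
	} else {
		/* minors 2 and 3 expose the spm_addr_dev0/1 carve-outs */
		devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
		devmem->pagemap.range.start = MINOR(mdevice->cdevice.dev) == 2 ?
						spm_addr_dev0 : spm_addr_dev1;
		devmem->pagemap.range.end = devmem->pagemap.range.start +
					    DEVMEM_CHUNK_SIZE - 1;
		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
	}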

> +		devmem->pagemap.range.end = devmem->pagemap.range.start +
> +					    DEVMEM_CHUNK_SIZE - 1;
> +		devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
> +		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
> +	} else {
> +		pr_err("Both spm_addr_dev parameters should be set\n");
> +	}
>  
> -	mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
> -	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> -	devmem->pagemap.range.start = res->start;
> -	devmem->pagemap.range.end = res->end;
>  	devmem->pagemap.nr_range = 1;
>  	devmem->pagemap.ops = &dmirror_devmem_ops;
>  	devmem->pagemap.owner = mdevice;
> @@ -495,10 +517,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  		mdevice->devmem_capacity = new_capacity;
>  		mdevice->devmem_chunks = new_chunks;
>  	}
> -
>  	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
> -	if (IS_ERR(ptr))
> +	if (IS_ERR_OR_NULL(ptr)) {
> +		if (ptr)
> +			ret = PTR_ERR(ptr);
> +		else
> +			ret = -EFAULT;
>  		goto err_release;
> +	}
>  
>  	devmem->mdevice = mdevice;
>  	pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
> @@ -531,7 +557,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  
>  err_release:
>  	mutex_unlock(&mdevice->devmem_lock);
> -	release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
> +	if (res)
> +		release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
>  err_devmem:
>  	kfree(devmem);
>  
> @@ -1219,10 +1246,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
>  	if (ret)
>  		return ret;
>  
> -	/* Build a list of free ZONE_DEVICE private struct pages */
> -	dmirror_allocate_chunk(mdevice, NULL);
> -
> -	return 0;
> +	/* Build a list of free ZONE_DEVICE struct pages */
> +	return dmirror_allocate_chunk(mdevice, NULL);
>  }
>  
>  static void dmirror_device_remove(struct dmirror_device *mdevice)
> @@ -1235,8 +1260,9 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
>  				mdevice->devmem_chunks[i];
>  
>  			memunmap_pages(&devmem->pagemap);
> -			release_mem_region(devmem->pagemap.range.start,
> -					   range_len(&devmem->pagemap.range));
> +			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
> +				release_mem_region(devmem->pagemap.range.start,
> +						   range_len(&devmem->pagemap.range));
>  			kfree(devmem);
>  		}
>  		kfree(mdevice->devmem_chunks);
> diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
> index c42e57a6a71e..77f81e6314eb 100644
> --- a/lib/test_hmm_uapi.h
> +++ b/lib/test_hmm_uapi.h
> @@ -67,6 +67,7 @@ enum {
>  enum {
>  	/* 0 is reserved to catch uninitialized type fields */
>  	HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
> +	HMM_DMIRROR_MEMORY_DEVICE_COHERENT,
>  };
>  
>  #endif /* _LIB_TEST_HMM_UAPI_H */
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 1/9] mm: add zone device coherent type memory support
  2021-11-18  6:53     ` Alistair Popple
@ 2021-11-18 18:59       ` Felix Kuehling
  -1 siblings, 0 replies; 38+ messages in thread
From: Felix Kuehling @ 2021-11-18 18:59 UTC (permalink / raw)
  To: Alistair Popple, akpm, linux-mm, rcampbell, linux-ext4,
	linux-xfs, Alex Sierra
  Cc: amd-gfx, willy, jglisse, dri-devel, jgg, hch


Am 2021-11-18 um 1:53 a.m. schrieb Alistair Popple:
> On Tuesday, 16 November 2021 6:30:18 AM AEDT Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> ---
>>  include/linux/memremap.h |  8 ++++++++
>>  include/linux/mm.h       |  9 +++++++++
>>  mm/memcontrol.c          |  6 +++---
>>  mm/memory-failure.c      |  6 +++++-
>>  mm/memremap.c            |  5 ++++-
>>  mm/migrate.c             | 21 +++++++++++++--------
>>  6 files changed, 42 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index c0e9d35889e8..ff4d398edf35 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -39,6 +39,13 @@ struct vmem_altmap {
>>   * A more complete discussion of unaddressable memory may be found in
>>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>   *
>> + * MEMORY_DEVICE_COHERENT:
>> + * Device memory that is cache coherent from device and CPU point of view. This
>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>> + * type. Any page of a process can be migrated to such memory. However no one
>> + * should be allowed to pin such memory so that it can always be evicted.
>> + *
>>   * MEMORY_DEVICE_FS_DAX:
>>   * Host memory that has similar access semantics as System RAM i.e. DMA
>>   * coherent and supports page pinning. In support of coordinating page
>> @@ -59,6 +66,7 @@ struct vmem_altmap {
>>  enum memory_type {
>>  	/* 0 is reserved to catch uninitialized type fields */
>>  	MEMORY_DEVICE_PRIVATE = 1,
>> +	MEMORY_DEVICE_COHERENT,
>>  	MEMORY_DEVICE_FS_DAX,
>>  	MEMORY_DEVICE_GENERIC,
>>  	MEMORY_DEVICE_PCI_P2PDMA,
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 73a52aba448f..d23b6ebd2177 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1162,6 +1162,7 @@ static inline bool page_is_devmap_managed(struct page *page)
>>  		return false;
>>  	switch (page->pgmap->type) {
>>  	case MEMORY_DEVICE_PRIVATE:
>> +	case MEMORY_DEVICE_COHERENT:
>>  	case MEMORY_DEVICE_FS_DAX:
>>  		return true;
>>  	default:
>> @@ -1191,6 +1192,14 @@ static inline bool is_device_private_page(const struct page *page)
>>  		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
>>  }
>>  
>> +static inline bool is_device_page(const struct page *page)
>> +{
>> +	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
>> +		is_zone_device_page(page) &&
>> +		(page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
>> +		page->pgmap->type == MEMORY_DEVICE_COHERENT);
>> +}
>> +
>>  static inline bool is_pci_p2pdma_page(const struct page *page)
>>  {
>>  	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6da5020a8656..e0275ebe1198 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
>>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>>   *     target for charge migration. if @target is not NULL, the entry is stored
>>   *     in target->ent.
>> - *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
>> - *     (so ZONE_DEVICE page and thus not on the lru).
>> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_COHERENT
>> + *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
>>   *     For now we such page is charge like a regular page would be as for all
>>   *     intent and purposes it is just special memory taking the place of a
>>   *     regular page.
>> @@ -5730,7 +5730,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>>  		 */
>>  		if (page_memcg(page) == mc.from) {
>>  			ret = MC_TARGET_PAGE;
>> -			if (is_device_private_page(page))
>> +			if (is_device_page(page))
>>  				ret = MC_TARGET_DEVICE;
>>  			if (target)
>>  				target->page = page;
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 3e6449f2102a..51b55fc5172c 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1554,12 +1554,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>>  		goto unlock;
>>  	}
>>  
>> -	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> +	switch (pgmap->type) {
>> +	case MEMORY_DEVICE_PRIVATE:
>> +	case MEMORY_DEVICE_COHERENT:
>>  		/*
>>  		 * TODO: Handle HMM pages which may need coordination
>>  		 * with device-side memory.
>>  		 */
>>  		goto unlock;
>> +	default:
>> +		break;
>>  	}
>>  
>>  	/*
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index ed593bf87109..94d6a1e01d42 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -44,6 +44,7 @@ EXPORT_SYMBOL(devmap_managed_key);
>>  static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
>>  {
>>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
>> +	    pgmap->type == MEMORY_DEVICE_COHERENT ||
>>  	    pgmap->type == MEMORY_DEVICE_FS_DAX)
>>  		static_branch_dec(&devmap_managed_key);
>>  }
>> @@ -51,6 +52,7 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
>>  static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
>>  {
>>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
>> +	    pgmap->type == MEMORY_DEVICE_COHERENT ||
>>  	    pgmap->type == MEMORY_DEVICE_FS_DAX)
>>  		static_branch_inc(&devmap_managed_key);
>>  }
>> @@ -328,6 +330,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>>  
>>  	switch (pgmap->type) {
>>  	case MEMORY_DEVICE_PRIVATE:
>> +	case MEMORY_DEVICE_COHERENT:
>>  		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
>>  			WARN(1, "Device private memory not supported\n");
>>  			return ERR_PTR(-EINVAL);
>> @@ -498,7 +501,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>  void free_devmap_managed_page(struct page *page)
>>  {
>>  	/* notify page idle for dax */
>> -	if (!is_device_private_page(page)) {
>> +	if (!is_device_page(page)) {
>>  		wake_up_var(&page->_refcount);
>>  		return;
>>  	}
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 1852d787e6ab..f74422a42192 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -362,7 +362,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
>>  	 * Device private pages have an extra refcount as they are
>>  	 * ZONE_DEVICE pages.
>>  	 */
>> -	expected_count += is_device_private_page(page);
>> +	expected_count += is_device_page(page);
>>  	if (mapping)
>>  		expected_count += thp_nr_pages(page) + page_has_private(page);
>>  
>> @@ -2503,7 +2503,7 @@ static bool migrate_vma_check_page(struct page *page)
>>  		 * FIXME proper solution is to rework migration_entry_wait() so
>>  		 * it does not need to take a reference on page.
>>  		 */
> Note that I have posted a patch to fix this - see
> https://lore.kernel.org/all/20211118020754.954425-1-apopple@nvidia.com/ This
> looks ok for now assuming coherent pages can never be pinned.
>
> However that raises a question - what happens when something calls
> get_user_pages() on a pfn pointing to a coherent device page? I can't see
> anything in this series that prevents pinning of coherent device pages, so we
> can't just assume they aren't pinned.

I agree. I think we need your patch to go in first.

I'm also wondering if we need to do something to prevent get_user_pages
from pinning device pages. And by "pin", I think migrate_vma_check_page
is not talking about FOLL_PIN, but any get_user_pages call. As far as I
can tell, there should be nothing fundamentally wrong with pinning
device pages for a short time. But I think we'll want to avoid
FOLL_LONGTERM because that would affect our memory manager's ability to
evict device memory.


>
> In the case of device-private pages this is enforced by the fact they never
> have present pte's, so any attempt to GUP them results in a fault. But if I'm
> understanding this series correctly that won't be the case for coherent device
> pages right?

Right.

Regards,
  Felix


>
>> -		return is_device_private_page(page);
>> +		return is_device_page(page);
>>  	}
>>  
>>  	/* For file back page */
>> @@ -2791,7 +2791,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
>>   *     handle_pte_fault()
>>   *       do_anonymous_page()
>>   * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
>> - * private page.
>> + * private or coherent page.
>>   */
>>  static void migrate_vma_insert_page(struct migrate_vma *migrate,
>>  				    unsigned long addr,
>> @@ -2867,10 +2867,15 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>>  				swp_entry = make_readable_device_private_entry(
>>  							page_to_pfn(page));
>>  			entry = swp_entry_to_pte(swp_entry);
>> +		} else if (is_device_page(page)) {
> How about adding an explicit `is_device_coherent_page()` helper? It would make
> the test more explicit that this is expected to handle just coherent pages and
> I bet there will be future changes that need to differentiate between private
> and coherent pages anyway.
>
>> +			entry = pte_mkold(mk_pte(page,
>> +						 READ_ONCE(vma->vm_page_prot)));
>> +			if (vma->vm_flags & VM_WRITE)
>> +				entry = pte_mkwrite(pte_mkdirty(entry));
>>  		} else {
>>  			/*
>> -			 * For now we only support migrating to un-addressable
>> -			 * device memory.
>> +			 * We support migrating to private and coherent types
>> +			 * for device zone memory.
>>  			 */
>>  			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
>>  			goto abort;
>> @@ -2976,10 +2981,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>>  		mapping = page_mapping(page);
>>  
>>  		if (is_zone_device_page(newpage)) {
>> -			if (is_device_private_page(newpage)) {
>> +			if (is_device_page(newpage)) {
>>  				/*
>> -				 * For now only support private anonymous when
>> -				 * migrating to un-addressable device memory.
>> +				 * For now only support private and coherent
>> +				 * anonymous when migrating to device memory.
>>  				 */
>>  				if (mapping) {
>>  					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>
>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 1/9] mm: add zone device coherent type memory support
  2021-11-18 18:59       ` Felix Kuehling
@ 2021-11-22  2:40         ` Alistair Popple
  -1 siblings, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2021-11-22  2:40 UTC (permalink / raw)
  To: akpm, linux-mm, rcampbell, linux-ext4, linux-xfs, Alex Sierra,
	Felix Kuehling
  Cc: amd-gfx, willy, jglisse, dri-devel, jgg, hch

> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 1852d787e6ab..f74422a42192 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -362,7 +362,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
> >>  	 * Device private pages have an extra refcount as they are
> >>  	 * ZONE_DEVICE pages.
> >>  	 */
> >> -	expected_count += is_device_private_page(page);
> >> +	expected_count += is_device_page(page);
> >>  	if (mapping)
> >>  		expected_count += thp_nr_pages(page) + page_has_private(page);
> >>  
> >> @@ -2503,7 +2503,7 @@ static bool migrate_vma_check_page(struct page *page)
> >>  		 * FIXME proper solution is to rework migration_entry_wait() so
> >>  		 * it does not need to take a reference on page.
> >>  		 */
> > Note that I have posted a patch to fix this - see
> > https://lore.kernel.org/all/20211118020754.954425-1-apopple@nvidia.com/ This
> > looks ok for now assuming coherent pages can never be pinned.
> >
> > However that raises a question - what happens when something calls
> > get_user_pages() on a pfn pointing to a coherent device page? I can't see
> > anything in this series that prevents pinning of coherent device pages, so we
> > can't just assume they aren't pinned.
> 
> I agree. I think we need to depend on your patch to go in first.
> 
> I'm also wondering if we need to do something to prevent get_user_pages
> from pinning device pages. And by "pin", I think migrate_vma_check_page
> is not talking about FOLL_PIN, but any get_user_pages call. As far as I
> can tell, there should be nothing fundamentally wrong with pinning
> device pages for a short time. But I think we'll want to avoid
> FOLL_LONGTERM because that would affect our memory manager's ability to
> evict device memory.

Right, so long as my fix goes in I don't think there is anything wrong with
pinning device public pages. Agree that we should avoid FOLL_LONGTERM pins for
device memory though. I think the way to do that is to update is_pinnable_page()
so we treat device pages the same as other unpinnable pages, i.e. long-term
pins will migrate the page instead.
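
Roughly what I have in mind - a sketch only, not tested, and the exact
shape of the helper in include/linux/mm.h may differ:

	/*
	 * Sketch: treat ZONE_DEVICE pages as unpinnable so FOLL_LONGTERM
	 * migrates them. The is_zone_device_page() test is the assumed
	 * addition; the ZONE_MOVABLE/CMA checks stand in for what the
	 * helper already does today.
	 */
	static inline bool is_pinnable_page(struct page *page)
	{
		if (is_zone_device_page(page))	/* assumed addition */
			return false;
		return !(zone_idx(page_zone(page)) == ZONE_MOVABLE ||
			 is_migrate_cma_page(page));
	}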

> >
> > In the case of device-private pages this is enforced by the fact they never
> > have present pte's, so any attempt to GUP them results in a fault. But if I'm
> > understanding this series correctly that won't be the case for coherent device
> > pages right?
> 
> Right.
> 
> Regards,
>   Felix
> 
> 
> >
> >> -		return is_device_private_page(page);
> >> +		return is_device_page(page);
> >>  	}
> >>  
> >>  	/* For file back page */
> >> @@ -2791,7 +2791,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
> >>   *     handle_pte_fault()
> >>   *       do_anonymous_page()
> >>   * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
> >> - * private page.
> >> + * private or coherent page.
> >>   */
> >>  static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >>  				    unsigned long addr,
> >> @@ -2867,10 +2867,15 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >>  				swp_entry = make_readable_device_private_entry(
> >>  							page_to_pfn(page));
> >>  			entry = swp_entry_to_pte(swp_entry);
> >> +		} else if (is_device_page(page)) {
> > How about adding an explicit `is_device_coherent_page()` helper? It would make
> > the test more explicit that this is expected to handle just coherent pages and
> > I bet there will be future changes that need to differentiate between private
> > and coherent pages anyway.
> >
> >> +			entry = pte_mkold(mk_pte(page,
> >> +						 READ_ONCE(vma->vm_page_prot)));
> >> +			if (vma->vm_flags & VM_WRITE)
> >> +				entry = pte_mkwrite(pte_mkdirty(entry));
> >>  		} else {
> >>  			/*
> >> -			 * For now we only support migrating to un-addressable
> >> -			 * device memory.
> >> +			 * We support migrating to private and coherent types
> >> +			 * for device zone memory.
> >>  			 */
> >>  			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
> >>  			goto abort;
> >> @@ -2976,10 +2981,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> >>  		mapping = page_mapping(page);
> >>  
> >>  		if (is_zone_device_page(newpage)) {
> >> -			if (is_device_private_page(newpage)) {
> >> +			if (is_device_page(newpage)) {
> >>  				/*
> >> -				 * For now only support private anonymous when
> >> -				 * migrating to un-addressable device memory.
> >> +				 * For now only support private and coherent
> >> +				 * anonymous when migrating to device memory.
> >>  				 */
> >>  				if (mapping) {
> >>  					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> >>
> >
> >
> 





^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH v1 1/9] mm: add zone device coherent type memory support
  2021-11-22  2:40         ` Alistair Popple
@ 2021-11-22 17:16           ` Felix Kuehling
  -1 siblings, 0 replies; 38+ messages in thread
From: Felix Kuehling @ 2021-11-22 17:16 UTC (permalink / raw)
  To: Alistair Popple, akpm, linux-mm, rcampbell, linux-ext4,
	linux-xfs, Alex Sierra
  Cc: amd-gfx, dri-devel, jglisse, willy, jgg, hch

Am 2021-11-21 um 9:40 p.m. schrieb Alistair Popple:
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 1852d787e6ab..f74422a42192 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -362,7 +362,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
>>>>  	 * Device private pages have an extra refcount as they are
>>>>  	 * ZONE_DEVICE pages.
>>>>  	 */
>>>> -	expected_count += is_device_private_page(page);
>>>> +	expected_count += is_device_page(page);
>>>>  	if (mapping)
>>>>  		expected_count += thp_nr_pages(page) + page_has_private(page);
>>>>  
>>>> @@ -2503,7 +2503,7 @@ static bool migrate_vma_check_page(struct page *page)
>>>>  		 * FIXME proper solution is to rework migration_entry_wait() so
>>>>  		 * it does not need to take a reference on page.
>>>>  		 */
>>> Note that I have posted a patch to fix this - see
>>> https://lore.kernel.org/all/20211118020754.954425-1-apopple@nvidia.com/ This
>>> looks ok for now assuming coherent pages can never be pinned.
>>>
>>> However that raises a question - what happens when something calls
>>> get_user_pages() on a pfn pointing to a coherent device page? I can't see
>>> anything in this series that prevents pinning of coherent device pages, so we
>>> can't just assume they aren't pinned.
>> I agree. I think we need to depend on your patch to go in first.
>>
>> I'm also wondering if we need to do something to prevent get_user_pages
>> from pinning device pages. And by "pin", I think migrate_vma_check_page
>> is not talking about FOLL_PIN, but any get_user_pages call. As far as I
>> can tell, there should be nothing fundamentally wrong with pinning
>> device pages for a short time. But I think we'll want to avoid
>> FOLL_LONGTERM because that would affect our memory manager's ability to
>> evict device memory.
> Right, so long as my fix goes in I don't think there is anything wrong with
> pinning device public pages. Agree that we should avoid FOLL_LONGTERM pins for
> device memory though. I think the way to do that is update is_pinnable_page()
> so we treat device pages the same as other unpinnable pages ie. long-term pins
> will migrate the page.

I'm trying to understand check_and_migrate_movable_pages in gup.c. It
doesn't look like the right way to migrate device pages. We may have to
do something different there as well. So instead of changing
is_pinnable_page, it may be better to explicitly check for is_device_page
or is_device_coherent_page in check_and_migrate_movable_pages to migrate
them correctly, or just fail outright.
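
Something along these lines is what I have in mind (sketch only,
assuming a new is_device_coherent_page() helper modeled on
is_device_private_page()):

	/* assumed helper, mirroring is_device_private_page() */
	static inline bool is_device_coherent_page(const struct page *page)
	{
		return is_zone_device_page(page) &&
		       page->pgmap->type == MEMORY_DEVICE_COHERENT;
	}

	/* early in check_and_migrate_movable_pages(), before LRU isolation */
	if (is_device_coherent_page(compound_head(pages[i]))) {
		/* migrate through the migrate_vma_*() path, or fail the pin */
		ret = -EOPNOTSUPP;	/* hypothetical error handling */
		goto unpin_pages;	/* hypothetical label */
	}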

Thanks,
  Felix


>
>>> In the case of device-private pages this is enforced by the fact they never
>>> have present pte's, so any attempt to GUP them results in a fault. But if I'm
>>> understanding this series correctly that won't be the case for coherent device
>>> pages right?
>> Right.
>>
>> Regards,
>>   Felix
>>
>>
>>>> -		return is_device_private_page(page);
>>>> +		return is_device_page(page);
>>>>  	}
>>>>  
>>>>  	/* For file back page */
>>>> @@ -2791,7 +2791,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
>>>>   *     handle_pte_fault()
>>>>   *       do_anonymous_page()
>>>>   * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
>>>> - * private page.
>>>> + * private or coherent page.
>>>>   */
>>>>  static void migrate_vma_insert_page(struct migrate_vma *migrate,
>>>>  				    unsigned long addr,
>>>> @@ -2867,10 +2867,15 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>>>>  				swp_entry = make_readable_device_private_entry(
>>>>  							page_to_pfn(page));
>>>>  			entry = swp_entry_to_pte(swp_entry);
>>>> +		} else if (is_device_page(page)) {
>>> How about adding an explicit `is_device_coherent_page()` helper? It would make
>>> the test more explicit that this is expected to handle just coherent pages and
>>> I bet there will be future changes that need to differentiate between private
>>> and coherent pages anyway.
>>>
>>>> +			entry = pte_mkold(mk_pte(page,
>>>> +						 READ_ONCE(vma->vm_page_prot)));
>>>> +			if (vma->vm_flags & VM_WRITE)
>>>> +				entry = pte_mkwrite(pte_mkdirty(entry));
>>>>  		} else {
>>>>  			/*
>>>> -			 * For now we only support migrating to un-addressable
>>>> -			 * device memory.
>>>> +			 * We support migrating to private and coherent types
>>>> +			 * for device zone memory.
>>>>  			 */
>>>>  			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
>>>>  			goto abort;
>>>> @@ -2976,10 +2981,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>>>>  		mapping = page_mapping(page);
>>>>  
>>>>  		if (is_zone_device_page(newpage)) {
>>>> -			if (is_device_private_page(newpage)) {
>>>> +			if (is_device_page(newpage)) {
>>>>  				/*
>>>> -				 * For now only support private anonymous when
>>>> -				 * migrating to un-addressable device memory.
>>>> +				 * For now only support private and coherent
>>>> +				 * anonymous when migrating to device memory.
>>>>  				 */
>>>>  				if (mapping) {
>>>>  					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>>>
>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH v1 1/9] mm: add zone device coherent type memory support
  2021-11-22 17:16           ` Felix Kuehling
@ 2021-11-23  7:10             ` Alistair Popple
  -1 siblings, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2021-11-23  7:10 UTC (permalink / raw)
  To: akpm, linux-mm, rcampbell, linux-ext4, linux-xfs, Alex Sierra,
	Felix Kuehling
  Cc: amd-gfx, dri-devel, jglisse, willy, jgg, hch

On Tuesday, 23 November 2021 4:16:55 AM AEDT Felix Kuehling wrote:

[...]

> > Right, so long as my fix goes in I don't think there is anything wrong with
> > pinning device public pages. Agree that we should avoid FOLL_LONGTERM pins for
> > device memory though. I think the way to do that is update is_pinnable_page()
> > so we treat device pages the same as other unpinnable pages ie. long-term pins
> > will migrate the page.
> 
> I'm trying to understand check_and_migrate_movable_pages in gup.c. It
> doesn't look like the right way to migrate device pages. We may have to
> do something different there as well. So instead of changing
> is_pinnable_page, it maybe better to explicitly check for is_device_page
> or is_device_coherent_page in check_and_migrate_movable_pages to migrate
> it correctly, or just fail outright.

Yes, I think you're right. I was thinking check_and_migrate_movable_pages()
would work for coherent device pages. Now I see it won't, because it assumes
they are LRU pages and tries to isolate them, which will never succeed
because device pages aren't on an LRU list.
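
For reference, this is roughly the step device pages fall through
(abridged from my reading of mm/gup.c, details may differ):

	/*
	 * isolate_lru_page() only succeeds for pages on an LRU list,
	 * which ZONE_DEVICE pages never are, so this always fails and
	 * the page is left pinned instead of being migrated.
	 */
	if (isolate_lru_page(head))
		isolation_error_count++;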

I think migrating them is the right thing to do for FOLL_LONGTERM though.

 - Alistair

> Thanks,
>   Felix
> 
> >
> >>> In the case of device-private pages this is enforced by the fact they never
> >>> have present pte's, so any attempt to GUP them results in a fault. But if I'm
> >>> understanding this series correctly that won't be the case for coherent device
> >>> pages right?
> >> Right.
> >>
> >> Regards,
> >>   Felix
> >>
> >>
> >>>> -		return is_device_private_page(page);
> >>>> +		return is_device_page(page);
> >>>>  	}
> >>>>  
> >>>>  	/* For file back page */
> >>>> @@ -2791,7 +2791,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
> >>>>   *     handle_pte_fault()
> >>>>   *       do_anonymous_page()
> >>>>   * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
> >>>> - * private page.
> >>>> + * private or coherent page.
> >>>>   */
> >>>>  static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >>>>  				    unsigned long addr,
> >>>> @@ -2867,10 +2867,15 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >>>>  				swp_entry = make_readable_device_private_entry(
> >>>>  							page_to_pfn(page));
> >>>>  			entry = swp_entry_to_pte(swp_entry);
> >>>> +		} else if (is_device_page(page)) {
> >>> How about adding an explicit `is_device_coherent_page()` helper? It would make
> >>> the test more explicit that this is expected to handle just coherent pages and
> >>> I bet there will be future changes that need to differentiate between private
> >>> and coherent pages anyway.
> >>>
> >>>> +			entry = pte_mkold(mk_pte(page,
> >>>> +						 READ_ONCE(vma->vm_page_prot)));
> >>>> +			if (vma->vm_flags & VM_WRITE)
> >>>> +				entry = pte_mkwrite(pte_mkdirty(entry));
> >>>>  		} else {
> >>>>  			/*
> >>>> -			 * For now we only support migrating to un-addressable
> >>>> -			 * device memory.
> >>>> +			 * We support migrating to private and coherent types
> >>>> +			 * for device zone memory.
> >>>>  			 */
> >>>>  			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
> >>>>  			goto abort;
> >>>> @@ -2976,10 +2981,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> >>>>  		mapping = page_mapping(page);
> >>>>  
> >>>>  		if (is_zone_device_page(newpage)) {
> >>>> -			if (is_device_private_page(newpage)) {
> >>>> +			if (is_device_page(newpage)) {
> >>>>  				/*
> >>>> -				 * For now only support private anonymous when
> >>>> -				 * migrating to un-addressable device memory.
> >>>> +				 * For now only support private and coherent
> >>>> +				 * anonymous when migrating to device memory.
> >>>>  				 */
> >>>>  				if (mapping) {
> >>>>  					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> >>>>
> >
> >
> 





^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [PATCH v1 7/9] lib: add support for device coherent type in test_hmm
  2021-11-15 19:30   ` Alex Sierra
@ 2021-11-25  2:12     ` Felix Kuehling
  -1 siblings, 0 replies; 38+ messages in thread
From: Felix Kuehling @ 2021-11-25  2:12 UTC (permalink / raw)
  To: Alex Sierra, akpm, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: willy, apopple, dri-devel, jglisse, amd-gfx, jgg, hch

Am 2021-11-15 um 2:30 p.m. schrieb Alex Sierra:
> Device Coherent type uses device memory that is coherently accessible by
> the CPU. This can show up as an SP (special purpose) memory range in the
> BIOS e820 memory enumeration. If no SP memory is supported by the system,
> it can be faked by setting CONFIG_EFI_FAKE_MEMMAP.
>
> Currently, test_hmm only supports two different SP ranges of at least
> 256MB each, specified via the efi_fake_mem kernel parameter. For example,
> two 1GB SP ranges starting at physical addresses 0x100000000 and
> 0x140000000:
> efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
>
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
>  lib/test_hmm.c      | 191 +++++++++++++++++++++++++++++++++-----------
>  lib/test_hmm_uapi.h |  15 ++--
>  2 files changed, 153 insertions(+), 53 deletions(-)
>
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 45334df28d7b..9e2cc0cd4412 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -471,6 +471,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  	unsigned long pfn_first;
>  	unsigned long pfn_last;
>  	void *ptr;
> +	int ret = -ENOMEM;
>  
>  	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
>  	if (!devmem)
> @@ -553,7 +554,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  	}
>  	spin_unlock(&mdevice->lock);
>  
> -	return true;
> +	return 0;

You're changing the meaning of the return value, but you're not updating
the callers. I think at least dmirror_devmem_alloc_page will be broken
by this change.
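
E.g. if the call site currently does something like

	if (!dmirror_allocate_chunk(mdevice, &dpage))
		goto error;

the test now needs to be inverted, since 0 means success after this
change.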


>  
>  err_release:
>  	mutex_unlock(&mdevice->devmem_lock);
> @@ -562,7 +563,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
>  err_devmem:
>  	kfree(devmem);
>  
> -	return false;
> +	return ret;
>  }
>  
>  static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
> @@ -571,8 +572,10 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
>  	struct page *rpage;
>  
>  	/*
> -	 * This is a fake device so we alloc real system memory to store
> -	 * our device memory.
> +	 * For ZONE_DEVICE private type, this is a fake device so we alloc real
> +	 * system memory to store our device memory.
> +	 * For ZONE_DEVICE coherent type we use the actual dpage to store the data
> +	 * and ignore rpage.
>  	 */
>  	rpage = alloc_page(GFP_HIGHUSER);
>  	if (!rpage)
> @@ -623,12 +626,17 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
>  		 * unallocated pte_none() or read-only zero page.
>  		 */
>  		spage = migrate_pfn_to_page(*src);
> +		if (spage && is_zone_device_page(spage))
> +			pr_err("page already in device spage pfn: 0x%lx\n",
> +				page_to_pfn(spage));
> +		WARN_ON(spage && is_zone_device_page(spage));
You can write WARN_ON inside the if-condition:

	if (WARN_ON(spage && is_zone_device_page(spage)))
		...

But I don't see why you need both pr_err and WARN_ON. You can add a
custom message by using WARN instead of WARN_ON:

	WARN(spage && is_zone_device_page(spage),
	     "page already in device spage pfn: 0x%lx\n",
	     page_to_pfn(spage));


>  
>  		dpage = dmirror_devmem_alloc_page(mdevice);
>  		if (!dpage)
>  			continue;
>  
> -		rpage = dpage->zone_device_data;
> +		rpage = is_device_private_page(dpage) ? dpage->zone_device_data :
> +							dpage;
>  		if (spage)
>  			copy_highpage(rpage, spage);
>  		else
> @@ -640,8 +648,11 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
>  		 * the simulated device memory and that page holds the pointer
>  		 * to the mirror.
>  		 */
> +		rpage = dpage->zone_device_data;
>  		rpage->zone_device_data = dmirror;
>  
> +		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
> +			 page_to_pfn(spage), page_to_pfn(dpage));
>  		*dst = migrate_pfn(page_to_pfn(dpage)) |
>  			    MIGRATE_PFN_LOCKED;
>  		if ((*src & MIGRATE_PFN_WRITE) ||
> @@ -721,10 +732,13 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
>  			continue;
>  
>  		/*
> -		 * Store the page that holds the data so the page table
> -		 * doesn't have to deal with ZONE_DEVICE private pages.
> +		 * For ZONE_DEVICE private pages we store the page that
> +		 * holds the data so the page table doesn't have to deal it.
> +		 * For ZONE_DEVICE coherent pages we store the actual page, since
> +		 * the CPU has coherent access to the page.
>  		 */

I find this explanation confusing. The comment in
dmirror_devmem_alloc_page is much clearer. I think all we need here is
that dpage->zone_device_data points to the backing page for
device_private pages. Device_coherent struct pages don't have a backing
page because they represent a real CPU-accessible page already.


> -		entry = dpage->zone_device_data;
> +		entry = is_device_private_page(dpage) ? dpage->zone_device_data :
> +							dpage;

I see this in a few places. Maybe better to wrap that in a helper
function or macro. Something like this:

#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
			    (page)->zone_device_data : (page))
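
The call sites above then reduce to, e.g.:

	rpage = BACKING_PAGE(dpage);
	...
	entry = BACKING_PAGE(dpage);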


>  		if (*dst & MIGRATE_PFN_WRITE)
>  			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
>  		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
> @@ -804,8 +818,110 @@ static int dmirror_exclusive(struct dmirror *dmirror,
>  	return ret;
>  }
>  
> -static int dmirror_migrate(struct dmirror *dmirror,
> -			   struct hmm_dmirror_cmd *cmd)
> +static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
> +						      struct dmirror *dmirror)
> +{
> +	const unsigned long *src = args->src;
> +	unsigned long *dst = args->dst;
> +	unsigned long start = args->start;
> +	unsigned long end = args->end;
> +	unsigned long addr;
> +
> +	for (addr = start; addr < end; addr += PAGE_SIZE,
> +				       src++, dst++) {
> +		struct page *dpage, *spage;
> +
> +		spage = migrate_pfn_to_page(*src);
> +		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		WARN_ON(!is_device_page(spage));
> +		spage = is_device_private_page(spage) ? spage->zone_device_data :
> +							spage;
> +		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
> +		if (!dpage)
> +			continue;
> +		pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
> +			 page_to_pfn(spage), page_to_pfn(dpage));
> +
> +		lock_page(dpage);
> +		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
> +		copy_highpage(dpage, spage);
> +		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> +		if (*src & MIGRATE_PFN_WRITE)
> +			*dst |= MIGRATE_PFN_WRITE;
> +	}
> +	return 0;
> +}
> +
> +static int dmirror_migrate_to_system(struct dmirror *dmirror,
> +				     struct hmm_dmirror_cmd *cmd)
> +{
> +	unsigned long start, end, addr;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	struct mm_struct *mm = dmirror->notifier.mm;
> +	struct vm_area_struct *vma;
> +	unsigned long src_pfns[64];
> +	unsigned long dst_pfns[64];
> +	struct migrate_vma args;
> +	unsigned long next;
> +	int ret;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	/* Since the mm is for the mirrored process, get a reference first. */
> +	if (!mmget_not_zero(mm))
> +		return -EINVAL;
> +
> +	mmap_read_lock(mm);
> +	for (addr = start; addr < end; addr = next) {
> +		vma = find_vma(mm, addr);
> +		if (!vma || addr < vma->vm_start ||
> +		    !(vma->vm_flags & VM_READ)) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
> +		if (next > vma->vm_end)
> +			next = vma->vm_end;
> +
> +		args.vma = vma;
> +		args.src = src_pfns;
> +		args.dst = dst_pfns;
> +		args.start = addr;
> +		args.end = next;
> +		args.pgmap_owner = dmirror->mdevice;
> +		args.flags = (dmirror->mdevice->zone_device_type ==
> +			      HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
> +			      MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
> +			      MIGRATE_VMA_SELECT_DEVICE_COHERENT;
> +
> +		ret = migrate_vma_setup(&args);
> +		if (ret)
> +			goto out;
> +
> +		pr_debug("Migrating from device mem to sys mem\n");
> +		dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
> +
> +		migrate_vma_pages(&args);
> +		migrate_vma_finalize(&args);
> +	}
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +
> +	return ret;
> +
> +out:
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +	return ret;
> +}
> +
> +static int dmirror_migrate_to_device(struct dmirror *dmirror,
> +				struct hmm_dmirror_cmd *cmd)
>  {
>  	unsigned long start, end, addr;
>  	unsigned long size = cmd->npages << PAGE_SHIFT;
> @@ -849,6 +965,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
>  		if (ret)
>  			goto out;
>  
> +		pr_debug("Migrating from sys mem to device mem\n");
>  		dmirror_migrate_alloc_and_copy(&args, dmirror);
>  		migrate_vma_pages(&args);
>  		dmirror_migrate_finalize_and_map(&args, dmirror);
> @@ -857,7 +974,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
>  	mmap_read_unlock(mm);
>  	mmput(mm);
>  
> -	/* Return the migrated data for verification. */
> +	/* Return the migrated data for verification. only for pages in device zone */
>  	ret = dmirror_bounce_init(&bounce, start, size);
>  	if (ret)
>  		return ret;
> @@ -894,9 +1011,15 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
>  	}
>  
>  	page = hmm_pfn_to_page(entry);
> -	if (is_device_private_page(page)) {
> -		/* Is the page migrated to this device or some other? */
> -		if (dmirror->mdevice == dmirror_page_to_device(page))
> +	if (is_device_page(page)) {
> +		/* Is page ZONE_DEVICE coherent? */
> +		if (!is_device_private_page(page))
> +			*perm = HMM_DMIRROR_PROT_DEV_COHERENT;

Does it make sense to distinguish between DEV_COHERENT_LOCAL and
DEV_COHERENT_REMOTE as well?
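
E.g. (sketch, with assumed HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL/_REMOTE
values):

		if (dmirror->mdevice == dmirror_page_to_device(page))
			*perm = HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL;
		else
			*perm = HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE;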


> +		/*
> +		 * Is page ZONE_DEVICE private migrated to
> +		 * this device or some other?
> +		 */
> +		else if (dmirror->mdevice == dmirror_page_to_device(page))
>  			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
>  		else
>  			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
> @@ -1096,8 +1219,12 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
>  		ret = dmirror_write(dmirror, &cmd);
>  		break;
>  
> -	case HMM_DMIRROR_MIGRATE:
> -		ret = dmirror_migrate(dmirror, &cmd);
> +	case HMM_DMIRROR_MIGRATE_TO_DEV:
> +		ret = dmirror_migrate_to_device(dmirror, &cmd);
> +		break;
> +
> +	case HMM_DMIRROR_MIGRATE_TO_SYS:
> +		ret = dmirror_migrate_to_system(dmirror, &cmd);
>  		break;
>  
>  	case HMM_DMIRROR_EXCLUSIVE:
> @@ -1152,38 +1279,6 @@ static void dmirror_devmem_free(struct page *page)
>  	spin_unlock(&mdevice->lock);
>  }
>  
> -static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
> -						      struct dmirror *dmirror)
> -{
> -	const unsigned long *src = args->src;
> -	unsigned long *dst = args->dst;
> -	unsigned long start = args->start;
> -	unsigned long end = args->end;
> -	unsigned long addr;
> -
> -	for (addr = start; addr < end; addr += PAGE_SIZE,
> -				       src++, dst++) {
> -		struct page *dpage, *spage;
> -
> -		spage = migrate_pfn_to_page(*src);
> -		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
> -			continue;
> -		spage = spage->zone_device_data;
> -
> -		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
> -		if (!dpage)
> -			continue;
> -
> -		lock_page(dpage);
> -		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
> -		copy_highpage(dpage, spage);
> -		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> -		if (*src & MIGRATE_PFN_WRITE)
> -			*dst |= MIGRATE_PFN_WRITE;
> -	}
> -	return 0;
> -}
> -
>  static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>  {
>  	struct migrate_vma args;
> diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
> index 77f81e6314eb..af15c35a90f4 100644
> --- a/lib/test_hmm_uapi.h
> +++ b/lib/test_hmm_uapi.h
> @@ -19,6 +19,7 @@
>   * @npages: (in) number of pages to read/write
>   * @cpages: (out) number of pages copied
>   * @faults: (out) number of device page faults seen
> + * @zone_device_type: (out) zone device memory type
>   */
>  struct hmm_dmirror_cmd {
>  	__u64		addr;
> @@ -32,11 +33,12 @@ struct hmm_dmirror_cmd {
>  /* Expose the address space of the calling process through hmm device file */
>  #define HMM_DMIRROR_READ		_IOWR('H', 0x00, struct hmm_dmirror_cmd)
>  #define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x05, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_GET_MEM_DEV_TYPE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_MIGRATE_TO_DEV	_IOWR('H', 0x02, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_MIGRATE_TO_SYS	_IOWR('H', 0x03, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_GET_MEM_DEV_TYPE	_IOWR('H', 0x07, struct hmm_dmirror_cmd)
>  
>  /*
>   * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
> @@ -51,6 +53,8 @@ struct hmm_dmirror_cmd {
>   *					device the ioctl() is made
>   * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
>   *					other device
> + * HMM_DMIRROR_PROT_DEV_COHERENT: Migrate device coherent page on the device
> + *				  the ioctl() is made
>   */
>  enum {
>  	HMM_DMIRROR_PROT_ERROR			= 0xFF,
> @@ -62,6 +66,7 @@ enum {
>  	HMM_DMIRROR_PROT_ZERO			= 0x10,
>  	HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL	= 0x20,
>  	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
> +	HMM_DMIRROR_PROT_DEV_COHERENT           = 0x40,

Does it make sense to distinguish between DEV_COHERENT_LOCAL and
DEV_COHERENT_REMOTE as well?
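
E.g. (sketch, following the existing PRIVATE_LOCAL/REMOTE pattern):

	HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL	= 0x40,
	HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE	= 0x50,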

Regards,
  Felix


>  };
>  
>  enum {

^ permalink raw reply	[flat|nested] 38+ messages in thread


end of thread, other threads:[~2021-11-25  2:14 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-15 19:30 [PATCH v1 0/9] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping Alex Sierra
2021-11-15 19:30 ` [PATCH v1 1/9] mm: add zone device coherent type memory support Alex Sierra
2021-11-17  9:50   ` Christoph Hellwig
2021-11-18  6:53   ` Alistair Popple
2021-11-18 18:59     ` Felix Kuehling
2021-11-22  2:40       ` Alistair Popple
2021-11-22 17:16         ` Felix Kuehling
2021-11-23  7:10           ` Alistair Popple
2021-11-15 19:30 ` [PATCH v1 2/9] mm: add device coherent vma selection for memory migration Alex Sierra
2021-11-18  6:54   ` Alistair Popple
2021-11-15 19:30 ` [PATCH v1 3/9] drm/amdkfd: add SPM support for SVM Alex Sierra
2021-11-15 19:30 ` [PATCH v1 4/9] drm/amdkfd: coherent type as sys mem on migration to ram Alex Sierra
2021-11-15 19:30 ` [PATCH v1 5/9] lib: test_hmm add ioctl to get zone device type Alex Sierra
2021-11-15 19:30 ` [PATCH v1 6/9] lib: test_hmm add module param for " Alex Sierra
2021-11-18  7:19   ` Alistair Popple
2021-11-15 19:30 ` [PATCH v1 7/9] lib: add support for device coherent type in test_hmm Alex Sierra
2021-11-25  2:12   ` Felix Kuehling
2021-11-15 19:30 ` [PATCH v1 8/9] tools: update hmm-test to support device coherent type Alex Sierra
2021-11-15 19:30 ` [PATCH v1 9/9] tools: update test_hmm script to support SP config Alex Sierra
