linux-mm.kvack.org archive mirror
* [PATCH 0/5] Cache coherent device memory (CDM) with HMM v3
@ 2017-07-03 21:14 Jérôme Glisse
  2017-07-03 21:14 ` [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one Jérôme Glisse
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-07-03 21:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Jérôme Glisse

Only Kconfig and comment changes since v2, git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v3


Cache coherent device memory applies to architectures with a system
bus like CAPI or CCIX. Devices connected to such a system bus can
expose their memory to the system and allow cache coherent access to
it from the CPU.

Even if, for all intents and purposes, device memory behaves like
regular memory, we still want to manage it in isolation from regular
memory, for several reasons. First and foremost, this memory is less
reliable than regular memory: if the device hangs because of invalid
commands, we can lose access to device memory. Second, CPU access to
this memory is expected to be slower than access to regular memory.
Third, placing arbitrary memory on the device means that some of the
bus bandwidth would not be available to the device but would instead
be consumed by CPU accesses.

This is why we want to manage such memory in isolation from regular
memory: the kernel should not try to use this memory as a last resort
when running out of memory, at least for now.

This patchset adds a new type of ZONE_DEVICE memory
(MEMORY_DEVICE_PUBLIC) that is used to represent CDM memory. It builds
on top of the HMM patchset, which already introduced a new type of
ZONE_DEVICE memory for private device memory (see the HMM patchset).

The end result is that, with this patchset, if a device is in use in
a process you might have private anonymous memory or file backed
memory using ZONE_DEVICE (MEMORY_DEVICE_PUBLIC). Thus care must be
taken not to overwrite the lru fields of such pages.

Hence all core mm changes address the assumption that any process
memory is backed by a regular struct page that is part of the lru.
ZONE_DEVICE pages are not on the lru, and the lru pointers of struct
page are used to store device specific information.

Thus this patchset updates every code path that would make assumptions
about the lru-ness of a process page.
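
To make the constraint concrete, here is a minimal sketch of the test
the updated code paths effectively perform before doing any lru work
on a process page. The helper name is made up for illustration; only
is_zone_device_page() is an existing function:

#include <linux/mm.h>

/*
 * Hypothetical helper, for illustration only: a page found in a
 * process page table may now be a ZONE_DEVICE page, which is never
 * on the lru and reuses the lru pointers for driver data.
 */
static bool page_may_use_lru(struct page *page)
{
	/* mlock, reclaim, numa balancing, ... must all skip such pages */
	return !is_zone_device_page(page);
}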

patch 01 - consolidate naming of the different device memory types
patch 02 - deals with all the core mm functions
patch 03 - add a helper to HMM for hotplug of CDM memory
patch 04 - preparatory patch for memory controller changes
patch 05 - update memory controller to properly handle
           ZONE_DEVICE pages when uncharging

Previous posting:
v1 https://lkml.org/lkml/2017/4/7/638
v2 https://lwn.net/Articles/725412/

Jérôme Glisse (5):
  mm/persistent-memory: match IORES_DESC name and enum memory_type one
  mm/device-public-memory: device memory cache coherent with CPU v2
  mm/hmm: add new helper to hotplug CDM memory region
  mm/memcontrol: allow to uncharge page without using page->lru field
  mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC

 fs/proc/task_mmu.c       |   2 +-
 include/linux/hmm.h      |   7 +-
 include/linux/ioport.h   |   1 +
 include/linux/memremap.h |  25 +++++-
 include/linux/mm.h       |  16 ++--
 kernel/memremap.c        |  15 +++-
 mm/Kconfig               |  11 +++
 mm/gup.c                 |   7 ++
 mm/hmm.c                 |  89 +++++++++++++++++--
 mm/madvise.c             |   2 +-
 mm/memcontrol.c          | 226 ++++++++++++++++++++++++++++++-----------------
 mm/memory.c              |  46 ++++++++--
 mm/migrate.c             |  60 ++++++++-----
 mm/swap.c                |  11 +++
 14 files changed, 389 insertions(+), 129 deletions(-)

-- 
2.13.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-03 21:14 [PATCH 0/5] Cache coherent device memory (CDM) with HMM v3 Jérôme Glisse
@ 2017-07-03 21:14 ` Jérôme Glisse
  2017-07-03 23:49   ` Dan Williams
  2017-07-03 21:14 ` [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2 Jérôme Glisse
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Jérôme Glisse @ 2017-07-03 21:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Jérôme Glisse, Ross Zwisler

Use consistent name between IORES_DESC and enum memory_type, rename
MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
the public name for CDM (cache coherent device memory) for which the
term public is a better match.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h | 4 ++--
 kernel/memremap.c        | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 57546a07a558..2299cc2d387d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * Specialize ZONE_DEVICE memory into multiple types each having differents
  * usage.
  *
- * MEMORY_DEVICE_PUBLIC:
+ * MEMORY_DEVICE_PERSISTENT:
  * Persistent device memory (pmem): struct page might be allocated in different
  * memory and architecture might want to perform special actions. It is similar
  * to regular memory, in that the CPU can access it transparently. However,
@@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * include/linux/hmm.h and Documentation/vm/hmm.txt.
  */
 enum memory_type {
-	MEMORY_DEVICE_PUBLIC = 0,
+	MEMORY_DEVICE_PERSISTENT = 0,
 	MEMORY_DEVICE_PRIVATE,
 };
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b9baa6c07918..e82456c39a6a 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
-	pgmap->type = MEMORY_DEVICE_PUBLIC;
+	pgmap->type = MEMORY_DEVICE_PERSISTENT;
 	pgmap->page_fault = NULL;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
-- 
2.13.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2
  2017-07-03 21:14 [PATCH 0/5] Cache coherent device memory (CDM) with HMM v3 Jérôme Glisse
  2017-07-03 21:14 ` [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one Jérôme Glisse
@ 2017-07-03 21:14 ` Jérôme Glisse
  2017-07-11  4:12   ` Balbir Singh
  2017-07-03 21:14 ` [PATCH 3/5] mm/hmm: add new helper to hotplug CDM memory region Jérôme Glisse
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Jérôme Glisse @ 2017-07-03 21:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Jérôme Glisse, Balbir Singh, Aneesh Kumar,
	Paul E . McKenney, Benjamin Herrenschmidt, Ross Zwisler

Platforms with an advanced system bus (like CAPI or CCIX) allow device
memory to be accessible from the CPU in a cache coherent fashion. Add
a new type of ZONE_DEVICE memory to represent such memory. The use
cases are the same as for un-addressable device memory, but without
all the corner cases.
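
As a rough usage sketch (illustrative only, not part of the patch; the
function below is invented), a page table walker that wants to see
device public pages asks _vm_normal_page() for them explicitly and
then avoids any lru handling:

#include <linux/mm.h>
#include <linux/memremap.h>

/* Hypothetical example: count MEMORY_DEVICE_PUBLIC pages in a pte range. */
static int count_device_public_pte(struct vm_area_struct *vma,
				   unsigned long addr, pte_t ptent)
{
	struct page *page;

	if (!pte_present(ptent))
		return 0;

	/* Pass true so device public pages are returned as well. */
	page = _vm_normal_page(vma, addr, ptent, true);
	if (!page)
		return 0;

	/* Such pages are not on the lru: no isolate_lru_page()/putback here. */
	return is_device_public_page(page) ? 1 : 0;
}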

Changed since v1:
  - Kconfig and #if/#else cleanup

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c       |  2 +-
 include/linux/hmm.h      |  4 ++--
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h | 21 +++++++++++++++++
 include/linux/mm.h       | 16 ++++++++-----
 kernel/memremap.c        | 11 ++++++---
 mm/Kconfig               | 11 +++++++++
 mm/gup.c                 |  7 ++++++
 mm/hmm.c                 |  4 ++--
 mm/madvise.c             |  2 +-
 mm/memory.c              | 46 +++++++++++++++++++++++++++++++++----
 mm/migrate.c             | 60 ++++++++++++++++++++++++++++++++----------------
 mm/swap.c                | 11 +++++++++
 13 files changed, 156 insertions(+), 40 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 957b6ea80d5f..1f38f2c7cc34 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1182,7 +1182,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
-		page = vm_normal_page(vma, addr, pte);
+		page = _vm_normal_page(vma, addr, pte, true);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 	} else if (is_swap_pte(pte)) {
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 458d0d6d82f3..a40288309fd2 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct hmm_devmem;
 
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
@@ -443,7 +443,7 @@ struct hmm_device {
  */
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 3a4f69137bc2..f5cf32e80041 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -131,6 +131,7 @@ enum {
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
 	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
+	IORES_DESC_DEVICE_PUBLIC_MEMORY		= 7,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 2299cc2d387d..916ca1653ced 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -57,10 +57,18 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  *
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.txt.
+ *
+ * MEMORY_DEVICE_PUBLIC:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CCIX). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
  */
 enum memory_type {
 	MEMORY_DEVICE_PERSISTENT = 0,
 	MEMORY_DEVICE_PRIVATE,
+	MEMORY_DEVICE_PUBLIC,
 };
 
 /*
@@ -92,6 +100,8 @@ enum memory_type {
  * The page_free() callback is called once the page refcount reaches 1
  * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
  * This allows the device driver to implement its own memory management.)
+ *
+ * For MEMORY_DEVICE_PUBLIC only the page_free() callback matters.
  */
 typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
 				unsigned long addr,
@@ -134,6 +144,12 @@ static inline bool is_device_private_page(const struct page *page)
 	return is_zone_device_page(page) &&
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -157,6 +173,11 @@ static inline bool is_device_private_page(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 330a216ac315..8b72b122de93 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -797,14 +797,15 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page);
+void put_zone_device_private_or_public_page(struct page *page);
 #else
-static inline void put_zone_device_private_page(struct page *page)
+static inline void put_zone_device_private_or_public_page(struct page *page)
 {
 }
 #endif
 
 static inline bool is_device_private_page(const struct page *page);
+static inline bool is_device_public_page(const struct page *page);
 
 DECLARE_STATIC_KEY_FALSE(device_private_key);
 
@@ -830,8 +831,9 @@ static inline void put_page(struct page *page)
 	 * include/linux/memremap.h and HMM for details.
 	 */
 	if (static_branch_unlikely(&device_private_key) &&
-	    unlikely(is_device_private_page(page))) {
-		put_zone_device_private_page(page);
+	    unlikely(is_device_private_page(page) ||
+		     is_device_public_page(page))) {
+		put_zone_device_private_or_public_page(page);
 		return;
 	}
 
@@ -1220,8 +1222,10 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 };
 
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-		pte_t pte);
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device);
+#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index e82456c39a6a..da74775f2247 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -466,7 +466,7 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 
 
 #ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page)
+void put_zone_device_private_or_public_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -474,10 +474,15 @@ void put_zone_device_private_page(struct page *page)
 	 * If refcount is 1 then page is freed and refcount is stable as nobody
 	 * holds a reference on the page.
 	 */
-	if (count == 1)
+	if (count == 1) {
+		/* Clear Active bit in case of parallel mark_page_accessed */
+		__ClearPageActive(page);
+		__ClearPageWaiters(page);
+
 		page->pgmap->page_free(page, page->pgmap->data);
+	}
 	else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_page);
+EXPORT_SYMBOL(put_zone_device_private_or_public_page);
 #endif /* CONFIG_DEVICE_PRIVATE */
diff --git a/mm/Kconfig b/mm/Kconfig
index 3269ff1cc4cd..6002f1e40fcd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -712,12 +712,23 @@ config ZONE_DEVICE
 config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
+	select HMM
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
 	  memory; i.e., memory that is only accessible from the device (or
 	  group of devices).
 
+config DEVICE_PUBLIC
+	bool "Addressable device memory (like GPU memory)"
+	depends on ARCH_HAS_HMM
+	select HMM
+
+	help
+	  Allows creation of struct pages to represent addressable device
+	  memory; i.e., memory that is accessible from both the device and
+	  the CPU
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/gup.c b/mm/gup.c
index 23f01c40c88f..2f8e8604ff80 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -438,6 +438,13 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
 			goto unmap;
 		*page = pte_page(*pte);
+
+		/*
+		 * This should never happen (a device public page in the gate
+		 * area).
+		 */
+		if (is_device_public_page(*page))
+			goto unmap;
 	}
 	get_page(*page);
 out:
diff --git a/mm/hmm.c b/mm/hmm.c
index 4e01c9ba9cc1..eadf70829c34 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -747,7 +747,7 @@ EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
 				       unsigned long addr)
 {
@@ -1190,4 +1190,4 @@ static int __init hmm_init(void)
 }
 
 device_initcall(hmm_init);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..197277156ce3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -343,7 +343,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}
 
-		page = vm_normal_page(vma, addr, ptent);
+		page = _vm_normal_page(vma, addr, ptent, true);
 		if (!page)
 			continue;
 
diff --git a/mm/memory.c b/mm/memory.c
index 4fcdab3ec525..cee2bed01aa0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -789,8 +789,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-				pte_t pte)
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device)
 {
 	unsigned long pfn = pte_pfn(pte);
 
@@ -801,8 +801,31 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
-		if (!is_zero_pfn(pfn))
-			print_bad_pte(vma, addr, pte, NULL);
+		if (is_zero_pfn(pfn))
+			return NULL;
+
+		/*
+		 * Device public pages are special pages (they are ZONE_DEVICE
+		 * pages but different from persistent memory). They behave
+		 * almost like normal pages. The difference is that they are
+		 * not on the lru and thus should never be involved with any-
+		 * thing that involves lru manipulation (mlock, numa balancing,
+		 * ...).
+		 *
+		 * This is why we still want to return NULL for such pages from
+		 * vm_normal_page() so that we do not have to special case every
+		 * call site of vm_normal_page().
+		 */
+		if (likely(pfn < highest_memmap_pfn)) {
+			struct page *page = pfn_to_page(pfn);
+
+			if (is_device_public_page(page)) {
+				if (with_public_device)
+					return page;
+				return NULL;
+			}
+		}
+		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
 
@@ -983,6 +1006,19 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
+	} else if (pte_devmap(pte)) {
+		page = pte_page(pte);
+
+		/*
+		 * Cache coherent device memory behaves like a regular page and
+		 * not like a persistent memory page. For more information see
+		 * MEMORY_DEVICE_PUBLIC in include/linux/memremap.h.
+		 */
+		if (is_device_public_page(page)) {
+			get_page(page);
+			page_dup_rmap(page, false);
+			rss[mm_counter(page)]++;
+		}
 	}
 
 out_set_pte:
@@ -1236,7 +1272,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (pte_present(ptent)) {
 			struct page *page;
 
-			page = vm_normal_page(vma, addr, ptent);
+			page = _vm_normal_page(vma, addr, ptent, true);
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
diff --git a/mm/migrate.c b/mm/migrate.c
index 643ea61ca9bb..f9ae57f0c7a1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -229,12 +229,19 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
-		if (unlikely(is_zone_device_page(new)) &&
-		    is_device_private_page(new)) {
-			entry = make_device_private_entry(new, pte_write(pte));
-			pte = swp_entry_to_pte(entry);
-			if (pte_swp_soft_dirty(*pvmw.pte))
-				pte = pte_mksoft_dirty(pte);
+		if (unlikely(is_zone_device_page(new))) {
+			if (is_device_private_page(new)) {
+				entry = make_device_private_entry(new, pte_write(pte));
+				pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(*pvmw.pte))
+					pte = pte_mksoft_dirty(pte);
+			}
+#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+			else if (is_device_public_page(new)) {
+				pte = pte_mkdevmap(pte);
+				flush_dcache_page(new);
+			}
+#endif /* IS_ENABLED(CONFIG_DEVICE_PUBLIC) */
 		} else
 			flush_dcache_page(new);
 
@@ -408,12 +415,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	void **pslot;
 
 	/*
-	 * ZONE_DEVICE pages have 1 refcount always held by their device
-	 *
-	 * Note that DAX memory will never reach that point as it does not have
-	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 * Device public or private pages have an extra refcount as they are
+	 * ZONE_DEVICE pages.
 	 */
-	expected_count += is_zone_device_page(page);
+	expected_count += is_device_private_page(page);
+	expected_count += is_device_public_page(page);
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -2087,7 +2093,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 #endif /* CONFIG_NUMA */
 
-
 struct migrate_vma {
 	struct vm_area_struct	*vma;
 	unsigned long		*dst;
@@ -2186,7 +2191,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_write_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
-			page = vm_normal_page(migrate->vma, addr, pte);
+			page = _vm_normal_page(migrate->vma, addr, pte, true);
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
@@ -2311,13 +2316,18 @@ static bool migrate_vma_check_page(struct page *page)
 
 	/* Page from ZONE_DEVICE have one extra reference */
 	if (is_zone_device_page(page)) {
-		if (is_device_private_page(page)) {
+		if (is_device_private_page(page) ||
+		    is_device_public_page(page))
 			extra++;
-		} else
+		else
 			/* Other ZONE_DEVICE memory type are not supported */
 			return false;
 	}
 
+	/* For file backed pages */
+	if (page_mapping(page))
+		extra += 1 + page_has_private(page);
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2541,11 +2551,21 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	 */
 	__SetPageUptodate(page);
 
-	if (is_zone_device_page(page) && is_device_private_page(page)) {
-		swp_entry_t swp_entry;
+	if (is_zone_device_page(page)) {
+		if (is_device_private_page(page)) {
+			swp_entry_t swp_entry;
 
-		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
-		entry = swp_entry_to_pte(swp_entry);
+			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+			entry = swp_entry_to_pte(swp_entry);
+		}
+#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+		else if (is_device_public_page(page)) {
+			entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+			if (vma->vm_flags & VM_WRITE)
+				entry = pte_mkwrite(pte_mkdirty(entry));
+			entry = pte_mkdevmap(entry);
+		}
+#endif /* IS_ENABLED(CONFIG_DEVICE_PUBLIC) */
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
@@ -2631,7 +2651,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 					continue;
 				}
-			} else {
+			} else if (!is_device_public_page(newpage)) {
 				/*
 				 * Other types of ZONE_DEVICE page are not
 				 * supported.
diff --git a/mm/swap.c b/mm/swap.c
index 4f44dbd7f780..212370d6cb23 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -760,6 +760,17 @@ void release_pages(struct page **pages, int nr, bool cold)
 		if (is_huge_zero_page(page))
 			continue;
 
+		/* Device public page can not be huge page */
+		if (is_device_public_page(page)) {
+			if (locked_pgdat) {
+				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+						       flags);
+				locked_pgdat = NULL;
+			}
+			put_zone_device_private_or_public_page(page);
+			continue;
+		}
+
 		page = compound_head(page);
 		if (!put_page_testzero(page))
 			continue;
-- 
2.13.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/5] mm/hmm: add new helper to hotplug CDM memory region
  2017-07-03 21:14 [PATCH 0/5] Cache coherent device memory (CDM) with HMM v3 Jérôme Glisse
  2017-07-03 21:14 ` [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one Jérôme Glisse
  2017-07-03 21:14 ` [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2 Jérôme Glisse
@ 2017-07-03 21:14 ` Jérôme Glisse
  2017-07-03 21:14 ` [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
  2017-07-03 21:14 ` [PATCH 5/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
  4 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-07-03 21:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Jérôme Glisse

Unlike unaddressable memory, coherent device memory has a real
resource associated with it on the system (as the CPU can address
it). Add a new helper to hotplug such memory within the HMM
framework.
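
For reference, a driver would use the new helper roughly as follows.
This is a hedged sketch, not taken from an existing driver: the mydev_*
names are invented, the ops callbacks are left to the driver, and the
resource is assumed to already describe the device memory with
desc == IORES_DESC_DEVICE_PUBLIC_MEMORY:

#include <linux/hmm.h>
#include <linux/ioport.h>
#include <linux/err.h>

/* Fault/free callbacks would be filled in by the driver. */
static const struct hmm_devmem_ops mydev_devmem_ops;

static int mydev_hotplug_cdm(struct device *dev, struct resource *res)
{
	struct hmm_devmem *devmem;

	/* res->desc must be IORES_DESC_DEVICE_PUBLIC_MEMORY. */
	devmem = hmm_devmem_add_resource(&mydev_devmem_ops, dev, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/*
	 * Pages in [devmem->pfn_first, devmem->pfn_last) are now
	 * ZONE_DEVICE pages of type MEMORY_DEVICE_PUBLIC.
	 */
	return 0;
}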

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
---
 include/linux/hmm.h |  3 ++
 mm/hmm.c            | 85 +++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index a40288309fd2..e44cb8edb137 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -392,6 +392,9 @@ struct hmm_devmem {
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 				  struct device *device,
 				  unsigned long size);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res);
 void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index eadf70829c34..28e54e3b4e1d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -849,7 +849,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
 	zone = page_zone(page);
 
 	mem_hotplug_begin();
-	__remove_pages(zone, start_pfn, npages);
+	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
+		__remove_pages(zone, start_pfn, npages);
+	else
+		arch_remove_memory(start_pfn << PAGE_SHIFT,
+				   npages << PAGE_SHIFT);
 	mem_hotplug_done();
 
 	hmm_devmem_radix_release(resource);
@@ -885,7 +889,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	if (is_ram == REGION_INTERSECTS)
 		return -ENXIO;
 
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+	else
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+
 	devmem->pagemap.res = devmem->resource;
 	devmem->pagemap.page_fault = hmm_devmem_fault;
 	devmem->pagemap.page_free = hmm_devmem_free;
@@ -924,8 +932,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 		nid = numa_mem_id();
 
 	mem_hotplug_begin();
-	ret = add_pages(nid, align_start >> PAGE_SHIFT,
-			align_size >> PAGE_SHIFT, false);
+	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
+		ret = arch_add_memory(nid, align_start, align_size, false);
+	else
+		ret = add_pages(nid, align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, false);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
@@ -1075,6 +1086,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 }
 EXPORT_SYMBOL(hmm_devmem_add);
 
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res)
+{
+	struct hmm_devmem *devmem;
+	int ret;
+
+	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		return ERR_PTR(-EINVAL);
+
+	static_branch_enable(&device_private_key);
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = res;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_devm_add_action;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add_resource);
+
 /*
  * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
  *
@@ -1088,6 +1160,7 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 {
 	resource_size_t start, size;
 	struct device *device;
+	bool cdm = false;
 
 	if (!devmem)
 		return;
@@ -1096,11 +1169,13 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 	start = devmem->resource->start;
 	size = resource_size(devmem->resource);
 
+	cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY;
 	hmm_devmem_ref_kill(&devmem->ref);
 	hmm_devmem_ref_exit(&devmem->ref);
 	hmm_devmem_pages_remove(devmem);
 
-	devm_release_mem_region(device, start, size);
+	if (!cdm)
+		devm_release_mem_region(device, start, size);
 }
 EXPORT_SYMBOL(hmm_devmem_remove);
 
-- 
2.13.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-03 21:14 [PATCH 0/5] Cache coherent device memory (CDM) with HMM v3 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2017-07-03 21:14 ` [PATCH 3/5] mm/hmm: add new helper to hotplug CDM memory region Jérôme Glisse
@ 2017-07-03 21:14 ` Jérôme Glisse
  2017-07-04 12:51   ` Michal Hocko
  2017-07-03 21:14 ` [PATCH 5/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
  4 siblings, 1 reply; 30+ messages in thread
From: Jérôme Glisse @ 2017-07-03 21:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Jérôme Glisse, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus you cannot use the page->lru field of those pages. This patch
re-arranges the uncharge path to allow a single page to be uncharged
without modifying the lru field of the struct page.

There is no change to the memcontrol logic; it is the same as it was
before this patch.
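
In other words, a single page can now be uncharged with an on-stack
gather structure instead of a one-entry list threaded through
page->lru. A minimal sketch of the new pattern (the helpers are local
to mm/memcontrol.c; the real caller in the hunk below checks
page->mem_cgroup before doing this):

/* Illustrative only; mirrors the new mem_cgroup_uncharge() below. */
static void sketch_uncharge_one_page(struct page *page)
{
	struct uncharge_gather ug;

	uncharge_gather_clear(&ug);
	uncharge_page(page, &ug);	/* accumulates counts, never touches page->lru */
	uncharge_batch(&ug);		/* flushes the gathered counters to ug.memcg */
}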

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 92 insertions(+), 76 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3df3c04d73ab..c709fdceac13 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 	cancel_charge(memcg, nr_pages);
 }
 
-static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
-			   unsigned long nr_anon, unsigned long nr_file,
-			   unsigned long nr_kmem, unsigned long nr_huge,
-			   unsigned long nr_shmem, struct page *dummy_page)
+struct uncharge_gather {
+	struct mem_cgroup *memcg;
+	unsigned long pgpgout;
+	unsigned long nr_anon;
+	unsigned long nr_file;
+	unsigned long nr_kmem;
+	unsigned long nr_huge;
+	unsigned long nr_shmem;
+	struct page *dummy_page;
+};
+
+static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 {
-	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
+	memset(ug, 0, sizeof(*ug));
+}
+
+static void uncharge_batch(const struct uncharge_gather *ug)
+{
+	unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
 	unsigned long flags;
 
-	if (!mem_cgroup_is_root(memcg)) {
-		page_counter_uncharge(&memcg->memory, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg)) {
+		page_counter_uncharge(&ug->memcg->memory, nr_pages);
 		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
-			page_counter_uncharge(&memcg->kmem, nr_kmem);
-		memcg_oom_recover(memcg);
+			page_counter_uncharge(&ug->memcg->memsw, nr_pages);
+		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+		memcg_oom_recover(ug->memcg);
 	}
 
 	local_irq_save(flags);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
-	__this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
-	__this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
-	__this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
-	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
-	memcg_check_events(memcg, dummy_page);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
+	__this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
+	__this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
+	__this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg))
+		css_put_many(&ug->memcg->css, nr_pages);
+}
+
+static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
+
+	if (!page->mem_cgroup)
+		return;
+
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * page->mem_cgroup at this point, we have fully
+	 * exclusive access to the page.
+	 */
+
+	if (ug->memcg != page->mem_cgroup) {
+		if (ug->memcg) {
+			uncharge_batch(ug);
+			uncharge_gather_clear(ug);
+		}
+		ug->memcg = page->mem_cgroup;
+	}
+
+	if (!PageKmemcg(page)) {
+		unsigned int nr_pages = 1;
+
+		if (PageTransHuge(page)) {
+			nr_pages <<= compound_order(page);
+			ug->nr_huge += nr_pages;
+		}
+		if (PageAnon(page))
+			ug->nr_anon += nr_pages;
+		else {
+			ug->nr_file += nr_pages;
+			if (PageSwapBacked(page))
+				ug->nr_shmem += nr_pages;
+		}
+		ug->pgpgout++;
+	} else {
+		ug->nr_kmem += 1 << compound_order(page);
+		__ClearPageKmemcg(page);
+	}
+
+	ug->dummy_page = page;
+	page->mem_cgroup = NULL;
 }
 
 static void uncharge_list(struct list_head *page_list)
 {
-	struct mem_cgroup *memcg = NULL;
-	unsigned long nr_shmem = 0;
-	unsigned long nr_anon = 0;
-	unsigned long nr_file = 0;
-	unsigned long nr_huge = 0;
-	unsigned long nr_kmem = 0;
-	unsigned long pgpgout = 0;
+	struct uncharge_gather ug;
 	struct list_head *next;
-	struct page *page;
+
+	uncharge_gather_clear(&ug);
 
 	/*
 	 * Note that the list can be a single page->lru; hence the
@@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
 	 */
 	next = page_list->next;
 	do {
+		struct page *page;
+
 		page = list_entry(next, struct page, lru);
 		next = page->lru.next;
 
-		VM_BUG_ON_PAGE(PageLRU(page), page);
-		VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
-
-		if (!page->mem_cgroup)
-			continue;
-
-		/*
-		 * Nobody should be changing or seriously looking at
-		 * page->mem_cgroup at this point, we have fully
-		 * exclusive access to the page.
-		 */
-
-		if (memcg != page->mem_cgroup) {
-			if (memcg) {
-				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-					       nr_kmem, nr_huge, nr_shmem, page);
-				pgpgout = nr_anon = nr_file = nr_kmem = 0;
-				nr_huge = nr_shmem = 0;
-			}
-			memcg = page->mem_cgroup;
-		}
-
-		if (!PageKmemcg(page)) {
-			unsigned int nr_pages = 1;
-
-			if (PageTransHuge(page)) {
-				nr_pages <<= compound_order(page);
-				nr_huge += nr_pages;
-			}
-			if (PageAnon(page))
-				nr_anon += nr_pages;
-			else {
-				nr_file += nr_pages;
-				if (PageSwapBacked(page))
-					nr_shmem += nr_pages;
-			}
-			pgpgout++;
-		} else {
-			nr_kmem += 1 << compound_order(page);
-			__ClearPageKmemcg(page);
-		}
-
-		page->mem_cgroup = NULL;
+		uncharge_page(page, &ug);
 	} while (next != page_list);
 
-	if (memcg)
-		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-			       nr_kmem, nr_huge, nr_shmem, page);
+	if (ug.memcg)
+		uncharge_batch(&ug);
 }
 
 /**
@@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
  */
 void mem_cgroup_uncharge(struct page *page)
 {
+	struct uncharge_gather ug;
+
 	if (mem_cgroup_disabled())
 		return;
 
@@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
 	if (!page->mem_cgroup)
 		return;
 
-	INIT_LIST_HEAD(&page->lru);
-	uncharge_list(&page->lru);
+	uncharge_gather_clear(&ug);
+	uncharge_page(page, &ug);
+	uncharge_batch(&ug);
 }
 
 /**
-- 
2.13.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
  2017-07-03 21:14 [PATCH 0/5] Cache coherent device memory (CDM) with HMM v3 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2017-07-03 21:14 ` [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-07-03 21:14 ` Jérôme Glisse
  4 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-07-03 21:14 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Jérôme Glisse, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus need special handling when it comes to lru or refcount. This
patch makes sure that memcontrol properly handles them when it faces
them. Those pages are used like regular pages in a process address
space, either as anonymous pages or as file backed pages, so from the
memcg point of view we want to handle them like regular pages, for
now at least.
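
One detail worth keeping in mind when reading the hunks below:
ZONE_DEVICE pages are considered free at a refcount of 1 rather than
0, so the usual get_page_unless_zero() test is replaced by an "add one
unless the count is exactly 1" test. A minimal sketch of that rule,
using the existing page_ref_add_unless() helper (the function name is
made up for illustration):

#include <linux/mm.h>
#include <linux/page_ref.h>

/* Illustrative only: try to take a reference on a ZONE_DEVICE page. */
static struct page *try_get_zone_device_page(struct page *page)
{
	/* A refcount of 1 means the page is free; do not resurrect it. */
	if (!page_ref_add_unless(page, 1, 1))
		return NULL;
	return page;
}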

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 kernel/memremap.c |  2 ++
 mm/memcontrol.c   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index da74775f2247..584984cf7d18 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
 		__ClearPageActive(page);
 		__ClearPageWaiters(page);
 
+		mem_cgroup_uncharge(page);
+
 		page->pgmap->page_free(page, page->pgmap->data);
 	}
 	else if (!count)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c709fdceac13..04a88aedddbe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4391,12 +4391,13 @@ enum mc_target_type {
 	MC_TARGET_NONE = 0,
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
+	MC_TARGET_DEVICE,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
 {
-	struct page *page = vm_normal_page(vma, addr, ptent);
+	struct page *page = _vm_normal_page(vma, addr, ptent, true);
 
 	if (!page || !page_mapped(page))
 		return NULL;
@@ -4407,13 +4408,20 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 		if (!(mc.flags & MOVE_FILE))
 			return NULL;
 	}
-	if (!get_page_unless_zero(page))
+	if (is_device_public_page(page)) {
+		/*
+		 * MEMORY_DEVICE_PUBLIC means a ZONE_DEVICE page which has a
+		 * refcount of 1 when free (unlike a normal page).
+		 */
+		if (!page_ref_add_unless(page, 1, 1))
+			return NULL;
+	} else if (!get_page_unless_zero(page))
 		return NULL;
 
 	return page;
 }
 
-#ifdef CONFIG_SWAP
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
 static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 			pte_t ptent, swp_entry_t *entry)
 {
@@ -4422,6 +4430,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 
 	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
 		return NULL;
+
+	/*
+	 * Handle MEMORY_DEVICE_PRIVATE, which are ZONE_DEVICE pages belonging
+	 * to a device; because they are not accessible by the CPU they are
+	 * stored as special swap entries in the CPU page table.
+	 */
+	if (is_device_private_entry(ent)) {
+		page = device_private_entry_to_page(ent);
+		/*
+		 * MEMORY_DEVICE_PRIVATE means a ZONE_DEVICE page which has
+		 * a refcount of 1 when free (unlike a normal page).
+		 */
+		if (!page_ref_add_unless(page, 1, 1))
+			return NULL;
+		return page;
+	}
+
 	/*
 	 * Because lookup_swap_cache() updates some statistics counter,
 	 * we call find_get_page() with swapper_space directly.
@@ -4582,6 +4607,8 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PUBLIC
+ *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
  *
  * Called with pte lock held.
  */
@@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
+			if (is_device_private_page(page) ||
+			    is_device_public_page(page))
+				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
 		}
@@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
+		/*
+		 * Note there cannot be MC_TARGET_DEVICE for now as we do not
+		 * support transparent huge pages with MEMORY_DEVICE_PUBLIC or
+		 * MEMORY_DEVICE_PRIVATE, but this might change.
+		 */
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -4884,6 +4919,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				putback_lru_page(page);
 			}
 			put_page(page);
+		} else if (target_type == MC_TARGET_DEVICE) {
+			page = target.page;
+			if (!mem_cgroup_move_account(page, true,
+						     mc.from, mc.to)) {
+				mc.precharge -= HPAGE_PMD_NR;
+				mc.moved_charge += HPAGE_PMD_NR;
+			}
+			put_page(page);
 		}
 		spin_unlock(ptl);
 		return 0;
@@ -4895,12 +4938,16 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
+		bool device = false;
 		swp_entry_t ent;
 
 		if (!mc.precharge)
 			break;
 
 		switch (get_mctgt_type(vma, addr, ptent, &target)) {
+		case MC_TARGET_DEVICE:
+			device = true;
+			/* fall through */
 		case MC_TARGET_PAGE:
 			page = target.page;
 			/*
@@ -4911,7 +4958,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 */
 			if (PageTransCompound(page))
 				goto put;
-			if (isolate_lru_page(page))
+			if (!device && isolate_lru_page(page))
 				goto put;
 			if (!mem_cgroup_move_account(page, false,
 						mc.from, mc.to)) {
@@ -4919,7 +4966,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
 			}
-			putback_lru_page(page);
+			if (!device)
+				putback_lru_page(page);
 put:			/* get_mctgt_type() gets the page */
 			put_page(page);
 			break;
-- 
2.13.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-03 21:14 ` [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one Jérôme Glisse
@ 2017-07-03 23:49   ` Dan Williams
  2017-07-05 14:25     ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2017-07-03 23:49 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Ross Zwisler

On Mon, Jul 3, 2017 at 2:14 PM, Jérôme Glisse <jglisse@redhat.com> wrote:
> Use consistent name between IORES_DESC and enum memory_type, rename
> MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
> the public name for CDM (cache coherent device memory) for which the
> term public is a better match.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  include/linux/memremap.h | 4 ++--
>  kernel/memremap.c        | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 57546a07a558..2299cc2d387d 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>   * Specialize ZONE_DEVICE memory into multiple types each having differents
>   * usage.
>   *
> - * MEMORY_DEVICE_PUBLIC:
> + * MEMORY_DEVICE_PERSISTENT:
>   * Persistent device memory (pmem): struct page might be allocated in different
>   * memory and architecture might want to perform special actions. It is similar
>   * to regular memory, in that the CPU can access it transparently. However,
> @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>   * include/linux/hmm.h and Documentation/vm/hmm.txt.
>   */
>  enum memory_type {
> -       MEMORY_DEVICE_PUBLIC = 0,
> +       MEMORY_DEVICE_PERSISTENT = 0,
>         MEMORY_DEVICE_PRIVATE,
>  };
>
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index b9baa6c07918..e82456c39a6a 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>         }
>         pgmap->ref = ref;
>         pgmap->res = &page_map->res;
> -       pgmap->type = MEMORY_DEVICE_PUBLIC;
> +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
>         pgmap->page_fault = NULL;
>         pgmap->page_free = NULL;
>         pgmap->data = NULL;

I think we need a different name. There's nothing "persistent" about
the devm_memremap_pages() path. Why can't they share a name? Is the
only difference coherence? I'm thinking something like:

MEMORY_DEVICE_PRIVATE
MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
MEMORY_DEVICE_IO /* "public", but not coherent */


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-03 21:14 ` [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-07-04 12:51   ` Michal Hocko
  2017-07-05  3:18     ` Balbir Singh
  2017-07-05 14:35     ` Jerome Glisse
  0 siblings, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2017-07-04 12:51 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon 03-07-17 17:14:14, Jerome Glisse wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE pages and
> thus you cannot use the page->lru field of those pages. This patch
> re-arranges the uncharge path to allow a single page to be uncharged
> without modifying the lru field of the struct page.
> 
> There is no change to the memcontrol logic; it is the same as it was
> before this patch.

What is the memcg semantic of this memory? Why is it even charged? AFAIR
this is not reclaimable memory. If so, how are we going to deal with
memory limits? What should happen if we go OOM? Does killing a process
actually help to release that memory? Isn't it pinned by a device?

As for the patch itself: it is quite ugly, but I haven't spotted anything
obviously wrong with it. It is the memcg semantics of this class of
memory that worry me.

> Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> ---
>  mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 92 insertions(+), 76 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3df3c04d73ab..c709fdceac13 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
>  	cancel_charge(memcg, nr_pages);
>  }
>  
> -static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
> -			   unsigned long nr_anon, unsigned long nr_file,
> -			   unsigned long nr_kmem, unsigned long nr_huge,
> -			   unsigned long nr_shmem, struct page *dummy_page)
> +struct uncharge_gather {
> +	struct mem_cgroup *memcg;
> +	unsigned long pgpgout;
> +	unsigned long nr_anon;
> +	unsigned long nr_file;
> +	unsigned long nr_kmem;
> +	unsigned long nr_huge;
> +	unsigned long nr_shmem;
> +	struct page *dummy_page;
> +};
> +
> +static inline void uncharge_gather_clear(struct uncharge_gather *ug)
>  {
> -	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
> +	memset(ug, 0, sizeof(*ug));
> +}
> +
> +static void uncharge_batch(const struct uncharge_gather *ug)
> +{
> +	unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
>  	unsigned long flags;
>  
> -	if (!mem_cgroup_is_root(memcg)) {
> -		page_counter_uncharge(&memcg->memory, nr_pages);
> +	if (!mem_cgroup_is_root(ug->memcg)) {
> +		page_counter_uncharge(&ug->memcg->memory, nr_pages);
>  		if (do_memsw_account())
> -			page_counter_uncharge(&memcg->memsw, nr_pages);
> -		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
> -			page_counter_uncharge(&memcg->kmem, nr_kmem);
> -		memcg_oom_recover(memcg);
> +			page_counter_uncharge(&ug->memcg->memsw, nr_pages);
> +		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
> +			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
> +		memcg_oom_recover(ug->memcg);
>  	}
>  
>  	local_irq_save(flags);
> -	__this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
> -	__this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
> -	__this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
> -	__this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
> -	__this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
> -	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
> -	memcg_check_events(memcg, dummy_page);
> +	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
> +	__this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
> +	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
> +	__this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
> +	__this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
> +	__this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
> +	memcg_check_events(ug->memcg, ug->dummy_page);
>  	local_irq_restore(flags);
>  
> -	if (!mem_cgroup_is_root(memcg))
> -		css_put_many(&memcg->css, nr_pages);
> +	if (!mem_cgroup_is_root(ug->memcg))
> +		css_put_many(&ug->memcg->css, nr_pages);
> +}
> +
> +static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> +{
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
> +	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> +
> +	if (!page->mem_cgroup)
> +		return;
> +
> +	/*
> +	 * Nobody should be changing or seriously looking at
> +	 * page->mem_cgroup at this point, we have fully
> +	 * exclusive access to the page.
> +	 */
> +
> +	if (ug->memcg != page->mem_cgroup) {
> +		if (ug->memcg) {
> +			uncharge_batch(ug);
> +			uncharge_gather_clear(ug);
> +		}
> +		ug->memcg = page->mem_cgroup;
> +	}
> +
> +	if (!PageKmemcg(page)) {
> +		unsigned int nr_pages = 1;
> +
> +		if (PageTransHuge(page)) {
> +			nr_pages <<= compound_order(page);
> +			ug->nr_huge += nr_pages;
> +		}
> +		if (PageAnon(page))
> +			ug->nr_anon += nr_pages;
> +		else {
> +			ug->nr_file += nr_pages;
> +			if (PageSwapBacked(page))
> +				ug->nr_shmem += nr_pages;
> +		}
> +		ug->pgpgout++;
> +	} else {
> +		ug->nr_kmem += 1 << compound_order(page);
> +		__ClearPageKmemcg(page);
> +	}
> +
> +	ug->dummy_page = page;
> +	page->mem_cgroup = NULL;
>  }
>  
>  static void uncharge_list(struct list_head *page_list)
>  {
> -	struct mem_cgroup *memcg = NULL;
> -	unsigned long nr_shmem = 0;
> -	unsigned long nr_anon = 0;
> -	unsigned long nr_file = 0;
> -	unsigned long nr_huge = 0;
> -	unsigned long nr_kmem = 0;
> -	unsigned long pgpgout = 0;
> +	struct uncharge_gather ug;
>  	struct list_head *next;
> -	struct page *page;
> +
> +	uncharge_gather_clear(&ug);
>  
>  	/*
>  	 * Note that the list can be a single page->lru; hence the
> @@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
>  	 */
>  	next = page_list->next;
>  	do {
> +		struct page *page;
> +
>  		page = list_entry(next, struct page, lru);
>  		next = page->lru.next;
>  
> -		VM_BUG_ON_PAGE(PageLRU(page), page);
> -		VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> -
> -		if (!page->mem_cgroup)
> -			continue;
> -
> -		/*
> -		 * Nobody should be changing or seriously looking at
> -		 * page->mem_cgroup at this point, we have fully
> -		 * exclusive access to the page.
> -		 */
> -
> -		if (memcg != page->mem_cgroup) {
> -			if (memcg) {
> -				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
> -					       nr_kmem, nr_huge, nr_shmem, page);
> -				pgpgout = nr_anon = nr_file = nr_kmem = 0;
> -				nr_huge = nr_shmem = 0;
> -			}
> -			memcg = page->mem_cgroup;
> -		}
> -
> -		if (!PageKmemcg(page)) {
> -			unsigned int nr_pages = 1;
> -
> -			if (PageTransHuge(page)) {
> -				nr_pages <<= compound_order(page);
> -				nr_huge += nr_pages;
> -			}
> -			if (PageAnon(page))
> -				nr_anon += nr_pages;
> -			else {
> -				nr_file += nr_pages;
> -				if (PageSwapBacked(page))
> -					nr_shmem += nr_pages;
> -			}
> -			pgpgout++;
> -		} else {
> -			nr_kmem += 1 << compound_order(page);
> -			__ClearPageKmemcg(page);
> -		}
> -
> -		page->mem_cgroup = NULL;
> +		uncharge_page(page, &ug);
>  	} while (next != page_list);
>  
> -	if (memcg)
> -		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
> -			       nr_kmem, nr_huge, nr_shmem, page);
> +	if (ug.memcg)
> +		uncharge_batch(&ug);
>  }
>  
>  /**
> @@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
>   */
>  void mem_cgroup_uncharge(struct page *page)
>  {
> +	struct uncharge_gather ug;
> +
>  	if (mem_cgroup_disabled())
>  		return;
>  
> @@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
>  	if (!page->mem_cgroup)
>  		return;
>  
> -	INIT_LIST_HEAD(&page->lru);
> -	uncharge_list(&page->lru);
> +	uncharge_gather_clear(&ug);
> +	uncharge_page(page, &ug);
> +	uncharge_batch(&ug);
>  }
>  
>  /**
> -- 
> 2.13.0

-- 
Michal Hocko
SUSE Labs
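
For reference, the rework quoted above boils down to a simple
batch-by-key pattern: per-memcg counters are accumulated in a gather
struct and flushed whenever the memcg changes, so a single page can be
uncharged with clear + gather + flush and no page->lru usage. A minimal
user-space sketch of that pattern (illustrative only, not kernel code;
all names below are made up):

#include <stdio.h>
#include <string.h>

struct gather {
	int key;		/* stands in for ug->memcg */
	unsigned long nr;	/* stands in for the various counters */
};

static void flush(struct gather *g)
{
	if (g->key)
		printf("uncharge %lu pages from memcg %d\n", g->nr, g->key);
	memset(g, 0, sizeof(*g));
}

static void gather_one(struct gather *g, int key)
{
	if (g->key && g->key != key)
		flush(g);		/* key changed: flush the previous batch */
	g->key = key;
	g->nr++;
}

int main(void)
{
	struct gather g = { 0, 0 };
	int keys[] = { 1, 1, 2, 2, 2, 3 };
	unsigned i;

	for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
		gather_one(&g, keys[i]);
	flush(&g);			/* flush the trailing batch */
	return 0;
}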


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-04 12:51   ` Michal Hocko
@ 2017-07-05  3:18     ` Balbir Singh
  2017-07-05  6:38       ` Michal Hocko
  2017-07-05 14:35     ` Jerome Glisse
  1 sibling, 1 reply; 30+ messages in thread
From: Balbir Singh @ 2017-07-05  3:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jérôme Glisse, linux-kernel, linux-mm, John Hubbard,
	David Nellans, Dan Williams, Johannes Weiner, Vladimir Davydov,
	cgroups

On Tue, Jul 4, 2017 at 10:51 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 03-07-17 17:14:14, Jérôme Glisse wrote:
>> HMM pages (private or public device pages) are ZONE_DEVICE page and
>> thus you can not use page->lru fields of those pages. This patch
>> re-arrange the uncharge to allow single page to be uncharge without
>> modifying the lru field of the struct page.
>>
>> There is no change to memcontrol logic, it is the same as it was
>> before this patch.
>
> What is the memcg semantic of the memory? Why is it even charged? AFAIR
> this is not a reclaimable memory. If yes how are we going to deal with
> memory limits? What should happen if go OOM? Does killing an process
> actually help to release that memory? Isn't it pinned by a device?
>
> For the patch itself. It is quite ugly but I haven't spotted anything
> obviously wrong with it. It is the memcg semantic with this class of
> memory which makes me worried.
>

This is the HMM CDM case. Memory is normally malloc'd and then
migrated to ZONE_DEVICE or vice-versa. One of the things we did
discuss was seeing ZONE_DEVICE memory in user page tables.

Balbir Singh.
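
Seen from the application, the CDM model described here is just
ordinary memory; the sketch below (plain user-space C, with struct
migrate_req and DEVICE_IOCTL_MIGRATE as hypothetical driver
interfaces, not anything defined by this series) shows the intended
flow: malloc, let the driver migrate the range to device memory, and
keep using the same pointers since the pages stay in the user page
tables.

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

struct migrate_req {
	void *addr;
	size_t len;
};
/* hypothetical ioctl number, for illustration only */
#define DEVICE_IOCTL_MIGRATE _IOW('x', 0, struct migrate_req)

int run_on_device(int dev_fd, size_t len)
{
	char *buf = malloc(len);	/* ordinary anonymous memory */
	struct migrate_req req;

	if (!buf)
		return -1;
	memset(buf, 0, len);		/* touched: backed by system RAM */

	req.addr = buf;
	req.len = len;
	/* driver migrates the range to its cache coherent (public) memory */
	ioctl(dev_fd, DEVICE_IOCTL_MIGRATE, &req);

	/* still a plain CPU access: public device pages are coherent, so
	 * unlike DEVICE_PRIVATE no fault/migration back is needed here */
	buf[0] = 42;

	free(buf);
	return 0;
}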


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-05  3:18     ` Balbir Singh
@ 2017-07-05  6:38       ` Michal Hocko
  2017-07-05 10:22         ` Balbir Singh
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-07-05  6:38 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Jérôme Glisse, linux-kernel, linux-mm, John Hubbard,
	David Nellans, Dan Williams, Johannes Weiner, Vladimir Davydov,
	cgroups

On Wed 05-07-17 13:18:18, Balbir Singh wrote:
> On Tue, Jul 4, 2017 at 10:51 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 03-07-17 17:14:14, Jerome Glisse wrote:
> >> HMM pages (private or public device pages) are ZONE_DEVICE page and
> >> thus you can not use page->lru fields of those pages. This patch
> >> re-arrange the uncharge to allow single page to be uncharge without
> >> modifying the lru field of the struct page.
> >>
> >> There is no change to memcontrol logic, it is the same as it was
> >> before this patch.
> >
> > What is the memcg semantic of the memory? Why is it even charged? AFAIR
> > this is not a reclaimable memory. If yes how are we going to deal with
> > memory limits? What should happen if go OOM? Does killing an process
> > actually help to release that memory? Isn't it pinned by a device?
> >
> > For the patch itself. It is quite ugly but I haven't spotted anything
> > obviously wrong with it. It is the memcg semantic with this class of
> > memory which makes me worried.
> >
> 
> This is the HMM CDM case. Memory is normally malloc'd and then
> migrated to ZONE_DEVICE or vice-versa. One of the things we did
> discuss was seeing ZONE_DEVICE memory in user page tables.

This doesn't answer any of the above questions though.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-05  6:38       ` Michal Hocko
@ 2017-07-05 10:22         ` Balbir Singh
  0 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2017-07-05 10:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jérôme Glisse, linux-kernel, linux-mm, John Hubbard,
	David Nellans, Dan Williams, Johannes Weiner, Vladimir Davydov,
	cgroups

On Wed, Jul 5, 2017 at 4:38 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 05-07-17 13:18:18, Balbir Singh wrote:
>> On Tue, Jul 4, 2017 at 10:51 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Mon 03-07-17 17:14:14, Jérôme Glisse wrote:
>> >> HMM pages (private or public device pages) are ZONE_DEVICE page and
>> >> thus you can not use page->lru fields of those pages. This patch
>> >> re-arrange the uncharge to allow single page to be uncharge without
>> >> modifying the lru field of the struct page.
>> >>
>> >> There is no change to memcontrol logic, it is the same as it was
>> >> before this patch.
>> >
>> > What is the memcg semantic of the memory? Why is it even charged? AFAIR
>> > this is not a reclaimable memory. If yes how are we going to deal with
>> > memory limits? What should happen if go OOM? Does killing an process
>> > actually help to release that memory? Isn't it pinned by a device?
>> >
>> > For the patch itself. It is quite ugly but I haven't spotted anything
>> > obviously wrong with it. It is the memcg semantic with this class of
>> > memory which makes me worried.
>> >
>>
>> This is the HMM CDM case. Memory is normally malloc'd and then
>> migrated to ZONE_DEVICE or vice-versa. One of the things we did
>> discuss was seeing ZONE_DEVICE memory in user page tables.
>
> This doesn't answer any of the above questions though.


Jerome is the expert and I am sure he has a better answer, but my
understanding is that this path gets called through
release_pages() <-- zap_pte_range(). At first I wondered about the
same thing, but then came across this path.

Balbir Singh.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-03 23:49   ` Dan Williams
@ 2017-07-05 14:25     ` Jerome Glisse
  2017-07-05 16:15       ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-07-05 14:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Ross Zwisler

On Mon, Jul 03, 2017 at 04:49:18PM -0700, Dan Williams wrote:
> On Mon, Jul 3, 2017 at 2:14 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > Use consistent name between IORES_DESC and enum memory_type, rename
> > MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
> > the public name for CDM (cache coherent device memory) for which the
> > term public is a better match.
> >
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  include/linux/memremap.h | 4 ++--
> >  kernel/memremap.c        | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 57546a07a558..2299cc2d387d 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >   * Specialize ZONE_DEVICE memory into multiple types each having differents
> >   * usage.
> >   *
> > - * MEMORY_DEVICE_PUBLIC:
> > + * MEMORY_DEVICE_PERSISTENT:
> >   * Persistent device memory (pmem): struct page might be allocated in different
> >   * memory and architecture might want to perform special actions. It is similar
> >   * to regular memory, in that the CPU can access it transparently. However,
> > @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >   * include/linux/hmm.h and Documentation/vm/hmm.txt.
> >   */
> >  enum memory_type {
> > -       MEMORY_DEVICE_PUBLIC = 0,
> > +       MEMORY_DEVICE_PERSISTENT = 0,
> >         MEMORY_DEVICE_PRIVATE,
> >  };
> >
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index b9baa6c07918..e82456c39a6a 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> >         }
> >         pgmap->ref = ref;
> >         pgmap->res = &page_map->res;
> > -       pgmap->type = MEMORY_DEVICE_PUBLIC;
> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
> >         pgmap->page_fault = NULL;
> >         pgmap->page_free = NULL;
> >         pgmap->data = NULL;
> 
> I think we need a different name. There's nothing "persistent" about
> the devm_memremap_pages() path. Why can't they share name, is the only
> difference coherence? I'm thinking something like:
> 
> MEMORY_DEVICE_PRIVATE
> MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
> MEMORY_DEVICE_IO /* "public", but not coherent */

No, that would not work. Device public (in the context of this
patchset) is like device private, ie a device public page can be
anywhere inside a process address space, either as an anonymous memory
page or as a file backed page of a regular filesystem (ie vma->ops is
not pointing to anything specific to the device memory).

As such, device public is different from how persistent memory is
used, and thus the cache coherency being the same between the two
kinds of memory is not a discerning factor. So I need to distinguish
between persistent memory and device public memory.

I believe keeping enum memory_type close to the IORES_DESC naming is
the cleanest way to do that, but I am open to other name suggestions.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-04 12:51   ` Michal Hocko
  2017-07-05  3:18     ` Balbir Singh
@ 2017-07-05 14:35     ` Jerome Glisse
  2017-07-10  8:28       ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-07-05 14:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Tue, Jul 04, 2017 at 02:51:13PM +0200, Michal Hocko wrote:
> On Mon 03-07-17 17:14:14, Jerome Glisse wrote:
> > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > thus you can not use page->lru fields of those pages. This patch
> > re-arrange the uncharge to allow single page to be uncharge without
> > modifying the lru field of the struct page.
> > 
> > There is no change to memcontrol logic, it is the same as it was
> > before this patch.
> 
> What is the memcg semantic of the memory? Why is it even charged? AFAIR
> this is not a reclaimable memory. If yes how are we going to deal with
> memory limits? What should happen if go OOM? Does killing an process
> actually help to release that memory? Isn't it pinned by a device?
> 
> For the patch itself. It is quite ugly but I haven't spotted anything
> obviously wrong with it. It is the memcg semantic with this class of
> memory which makes me worried.

So I am facing 3 choices. The first one is to not account device
memory at all. The second one is to account device memory like any
other memory inside a process. The third one is to account device
memory as something entirely new.

I pick the second one for two reasons. First, because it means that
migrating back from device memory can not fail because of a memory
cgroup limit, which simplifies an already complex migration code.
Second, because I assume that device memory usage is a transient
state, ie once the device is done with its computation the most
likely outcome is that the memory is migrated back. From this
assumption it follows that you do not want to allow a process to
overuse regular memory while it is using un-accounted device memory.
It sounds safer to account device memory and to keep the process
within its memcg boundary.

Admittedly here I am making an assumption and I can be wrong. The
thing is we do not have enough real data on how this will be used and
how much of an impact device memory will have. That is why for now I
would rather restrict myself to either not account it or account it
as usual.

If you prefer not accounting it until we have more experience on how
it is used and how it impacts memory resource management, I am fine
with that too. It will make the migration code slightly more complex.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-05 14:25     ` Jerome Glisse
@ 2017-07-05 16:15       ` Dan Williams
  2017-07-05 18:49         ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2017-07-05 16:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Ross Zwisler

On Wed, Jul 5, 2017 at 7:25 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Mon, Jul 03, 2017 at 04:49:18PM -0700, Dan Williams wrote:
>> On Mon, Jul 3, 2017 at 2:14 PM, Jérôme Glisse <jglisse@redhat.com> wrote:
>> > Use consistent name between IORES_DESC and enum memory_type, rename
>> > MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
>> > the public name for CDM (cache coherent device memory) for which the
>> > term public is a better match.
>> >
>> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>> > Cc: Dan Williams <dan.j.williams@intel.com>
>> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > ---
>> >  include/linux/memremap.h | 4 ++--
>> >  kernel/memremap.c        | 2 +-
>> >  2 files changed, 3 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> > index 57546a07a558..2299cc2d387d 100644
>> > --- a/include/linux/memremap.h
>> > +++ b/include/linux/memremap.h
>> > @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>> >   * Specialize ZONE_DEVICE memory into multiple types each having differents
>> >   * usage.
>> >   *
>> > - * MEMORY_DEVICE_PUBLIC:
>> > + * MEMORY_DEVICE_PERSISTENT:
>> >   * Persistent device memory (pmem): struct page might be allocated in different
>> >   * memory and architecture might want to perform special actions. It is similar
>> >   * to regular memory, in that the CPU can access it transparently. However,
>> > @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>> >   * include/linux/hmm.h and Documentation/vm/hmm.txt.
>> >   */
>> >  enum memory_type {
>> > -       MEMORY_DEVICE_PUBLIC = 0,
>> > +       MEMORY_DEVICE_PERSISTENT = 0,
>> >         MEMORY_DEVICE_PRIVATE,
>> >  };
>> >
>> > diff --git a/kernel/memremap.c b/kernel/memremap.c
>> > index b9baa6c07918..e82456c39a6a 100644
>> > --- a/kernel/memremap.c
>> > +++ b/kernel/memremap.c
>> > @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>> >         }
>> >         pgmap->ref = ref;
>> >         pgmap->res = &page_map->res;
>> > -       pgmap->type = MEMORY_DEVICE_PUBLIC;
>> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
>> >         pgmap->page_fault = NULL;
>> >         pgmap->page_free = NULL;
>> >         pgmap->data = NULL;
>>
>> I think we need a different name. There's nothing "persistent" about
>> the devm_memremap_pages() path. Why can't they share name, is the only
>> difference coherence? I'm thinking something like:
>>
>> MEMORY_DEVICE_PRIVATE
>> MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
>> MEMORY_DEVICE_IO /* "public", but not coherent */
>
> No that would not work. Device public (in the context of this patchset)
> is like device private ie device public page can be anywhere inside a
> process address space either as anonymous memory page or as file back
> page of regular filesystem (ie vma->ops is not pointing to anything
> specific to the device memory).
>
> As such device public is different from how persistent memory is use
> and those the cache coherency being the same between the two kind of
> memory is not a discerning factor. So i need to distinguish between
> persistent memory and device public memory.
>
> I believe keeping enum memory_type close to IORES_DESC naming is the
> cleanest way to do that but i am open to other name suggestion.
>

The IORES_DESC has nothing to do with how the memory range is handled
by the core mm. It sounds like the distinction this is trying to make
is between MEMORY_DEVICE_{PUBLIC,PRIVATE} and MEMORY_DEVICE_HOST.
Where a "host" memory range is one that does not need coordination
with a specific device.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-05 16:15       ` Dan Williams
@ 2017-07-05 18:49         ` Jerome Glisse
  2017-07-11  3:48           ` Balbir Singh
  2017-07-11  7:31           ` Dan Williams
  0 siblings, 2 replies; 30+ messages in thread
From: Jerome Glisse @ 2017-07-05 18:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Ross Zwisler

On Wed, Jul 05, 2017 at 09:15:35AM -0700, Dan Williams wrote:
> On Wed, Jul 5, 2017 at 7:25 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Mon, Jul 03, 2017 at 04:49:18PM -0700, Dan Williams wrote:
> >> On Mon, Jul 3, 2017 at 2:14 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> > Use consistent name between IORES_DESC and enum memory_type, rename
> >> > MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
> >> > the public name for CDM (cache coherent device memory) for which the
> >> > term public is a better match.
> >> >
> >> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> >> > Cc: Dan Williams <dan.j.williams@intel.com>
> >> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> > ---
> >> >  include/linux/memremap.h | 4 ++--
> >> >  kernel/memremap.c        | 2 +-
> >> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> > index 57546a07a558..2299cc2d387d 100644
> >> > --- a/include/linux/memremap.h
> >> > +++ b/include/linux/memremap.h
> >> > @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >> >   * Specialize ZONE_DEVICE memory into multiple types each having differents
> >> >   * usage.
> >> >   *
> >> > - * MEMORY_DEVICE_PUBLIC:
> >> > + * MEMORY_DEVICE_PERSISTENT:
> >> >   * Persistent device memory (pmem): struct page might be allocated in different
> >> >   * memory and architecture might want to perform special actions. It is similar
> >> >   * to regular memory, in that the CPU can access it transparently. However,
> >> > @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >> >   * include/linux/hmm.h and Documentation/vm/hmm.txt.
> >> >   */
> >> >  enum memory_type {
> >> > -       MEMORY_DEVICE_PUBLIC = 0,
> >> > +       MEMORY_DEVICE_PERSISTENT = 0,
> >> >         MEMORY_DEVICE_PRIVATE,
> >> >  };
> >> >
> >> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> >> > index b9baa6c07918..e82456c39a6a 100644
> >> > --- a/kernel/memremap.c
> >> > +++ b/kernel/memremap.c
> >> > @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> >> >         }
> >> >         pgmap->ref = ref;
> >> >         pgmap->res = &page_map->res;
> >> > -       pgmap->type = MEMORY_DEVICE_PUBLIC;
> >> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
> >> >         pgmap->page_fault = NULL;
> >> >         pgmap->page_free = NULL;
> >> >         pgmap->data = NULL;
> >>
> >> I think we need a different name. There's nothing "persistent" about
> >> the devm_memremap_pages() path. Why can't they share name, is the only
> >> difference coherence? I'm thinking something like:
> >>
> >> MEMORY_DEVICE_PRIVATE
> >> MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
> >> MEMORY_DEVICE_IO /* "public", but not coherent */
> >
> > No that would not work. Device public (in the context of this patchset)
> > is like device private ie device public page can be anywhere inside a
> > process address space either as anonymous memory page or as file back
> > page of regular filesystem (ie vma->ops is not pointing to anything
> > specific to the device memory).
> >
> > As such device public is different from how persistent memory is use
> > and those the cache coherency being the same between the two kind of
> > memory is not a discerning factor. So i need to distinguish between
> > persistent memory and device public memory.
> >
> > I believe keeping enum memory_type close to IORES_DESC naming is the
> > cleanest way to do that but i am open to other name suggestion.
> >
> 
> The IORES_DESC has nothing to do with how the memory range is handled
> by the core mm. It sounds like the distinction this is trying to make
> is between MEMORY_DEVICE_{PUBLIC,PRIVATE} and MEMORY_DEVICE_HOST.
> Where a "host" memory range is one that does not need coordination
> with a specific device.

I want to distinguish between:
  - device memory that is not accessible by the CPU
  - device memory that is accessible by the CPU just like regular
    memory
  - existing users of devm_memremap_pages(), which is persistent
    memory (only pmem seems to call devm_memremap_pages()) that is
    used like a filesystem or block device and thus isn't used like a
    generic page in a process address space

So if the existing users of devm_memremap_pages() are only persistent
memory, then it makes sense to match the IORES_DESC we are expecting
to see on such memory.

For public device memory (in the sense introduced by this patchset) I
do not know how it will be described by IORES_DESC. I think the first
folks with it are IBM with CAPI and I am not sure they defined
something for that already.

I am open to any name besides public (well, any reasonable name :))
but I do need to be able to distinguish persistent memory as used
today from this device memory.

Cheers,
Jerome
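
To make the three cases above concrete, the direction the discussion
points at would look roughly like this in enum memory_type (a sketch
only, not the code merged by this series; MEMORY_DEVICE_HOST is Dan's
suggested name for today's devm_memremap_pages() users):

enum memory_type {
	/* today's devm_memremap_pages() users (pmem): used like a
	 * filesystem/block device, not as generic process pages */
	MEMORY_DEVICE_HOST = 0,
	/* device memory the CPU can not access (HMM private) */
	MEMORY_DEVICE_PRIVATE,
	/* CDM: device memory the CPU accesses coherently, used like
	 * regular memory inside a process address space */
	MEMORY_DEVICE_PUBLIC,
};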


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-05 14:35     ` Jerome Glisse
@ 2017-07-10  8:28       ` Michal Hocko
  2017-07-10 15:32         ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-07-10  8:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Wed 05-07-17 10:35:29, Jerome Glisse wrote:
> On Tue, Jul 04, 2017 at 02:51:13PM +0200, Michal Hocko wrote:
> > On Mon 03-07-17 17:14:14, Jerome Glisse wrote:
> > > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > > thus you can not use page->lru fields of those pages. This patch
> > > re-arrange the uncharge to allow single page to be uncharge without
> > > modifying the lru field of the struct page.
> > > 
> > > There is no change to memcontrol logic, it is the same as it was
> > > before this patch.
> > 
> > What is the memcg semantic of the memory? Why is it even charged? AFAIR
> > this is not a reclaimable memory. If yes how are we going to deal with
> > memory limits? What should happen if go OOM? Does killing an process
> > actually help to release that memory? Isn't it pinned by a device?
> > 
> > For the patch itself. It is quite ugly but I haven't spotted anything
> > obviously wrong with it. It is the memcg semantic with this class of
> > memory which makes me worried.
> 
> So i am facing 3 choices. First one not account device memory at all.
> Second one is account device memory like any other memory inside a
> process. Third one is account device memory as something entirely new.
> 
> I pick the second one for two reasons. First because when migrating
> back from device memory it means that migration can not fail because
> of memory cgroup limit, this simplify an already complex migration
> code. Second because i assume that device memory usage is a transient
> state ie once device is done with its computation the most likely
> outcome is memory is migrated back. From this assumption it means
> that you do not want to allow a process to overuse regular memory
> while it is using un-accounted device memory. It sounds safer to
> account device memory and to keep the process within its memcg
> boundary.
> 
> Admittedly here i am making an assumption and i can be wrong. Thing
> is we do not have enough real data of how this will be use and how
> much of an impact device memory will have. That is why for now i
> would rather restrict myself to either not account it or account it
> as usual.
> 
> If you prefer not accounting it until we have more experience on how
> it is use and how it impacts memory resource management i am fine with
> that too. It will make the migration code slightly more complex.

I can see why you want to do this but the semantic _has_ to be clear.
And as such make sure that the exiting task will simply unpin and
invalidate all the device memory (assuming this memory is not shared
which I am not sure is even possible).
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-10  8:28       ` Michal Hocko
@ 2017-07-10 15:32         ` Jerome Glisse
  2017-07-10 16:04           ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-07-10 15:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon, Jul 10, 2017 at 10:28:06AM +0200, Michal Hocko wrote:
> On Wed 05-07-17 10:35:29, Jerome Glisse wrote:
> > On Tue, Jul 04, 2017 at 02:51:13PM +0200, Michal Hocko wrote:
> > > On Mon 03-07-17 17:14:14, Jerome Glisse wrote:
> > > > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > > > thus you can not use page->lru fields of those pages. This patch
> > > > re-arrange the uncharge to allow single page to be uncharge without
> > > > modifying the lru field of the struct page.
> > > > 
> > > > There is no change to memcontrol logic, it is the same as it was
> > > > before this patch.
> > > 
> > > What is the memcg semantic of the memory? Why is it even charged? AFAIR
> > > this is not a reclaimable memory. If yes how are we going to deal with
> > > memory limits? What should happen if go OOM? Does killing an process
> > > actually help to release that memory? Isn't it pinned by a device?
> > > 
> > > For the patch itself. It is quite ugly but I haven't spotted anything
> > > obviously wrong with it. It is the memcg semantic with this class of
> > > memory which makes me worried.
> > 
> > So i am facing 3 choices. First one not account device memory at all.
> > Second one is account device memory like any other memory inside a
> > process. Third one is account device memory as something entirely new.
> > 
> > I pick the second one for two reasons. First because when migrating
> > back from device memory it means that migration can not fail because
> > of memory cgroup limit, this simplify an already complex migration
> > code. Second because i assume that device memory usage is a transient
> > state ie once device is done with its computation the most likely
> > outcome is memory is migrated back. From this assumption it means
> > that you do not want to allow a process to overuse regular memory
> > while it is using un-accounted device memory. It sounds safer to
> > account device memory and to keep the process within its memcg
> > boundary.
> > 
> > Admittedly here i am making an assumption and i can be wrong. Thing
> > is we do not have enough real data of how this will be use and how
> > much of an impact device memory will have. That is why for now i
> > would rather restrict myself to either not account it or account it
> > as usual.
> > 
> > If you prefer not accounting it until we have more experience on how
> > it is use and how it impacts memory resource management i am fine with
> > that too. It will make the migration code slightly more complex.
> 
> I can see why you want to do this but the semantic _has_ to be clear.
> And as such make sure that the exiting task will simply unpin and
> invalidate all the device memory (assuming this memory is not shared
> which I am not sure is even possible).

So there are 2 different paths out of device memory:
  - munmap/process exiting: memory will get uncharged from its memory
    cgroup just like regular memory
  - migration to non device memory: the memory cgroup charge gets
    transferred to the new page just like for any other page

Do you want me to document all this in any specific place? I will add
a comment in memcontrol.c and in the HMM documentation for this, but
should I add it anywhere else?

Note that the device memory is not pinned. The whole point of HMM is
to do away with any pinning. Though, as device pages are not on the
lru, they are not reclaimed like any other page. However we expect
that device drivers might implement something akin to device memory
reclaim to make room for more important data, based on statistics
collected by the device driver. If there is enough commonality across
devices then we might implement a more generic mechanism, but at this
point I would rather grow as we learn.

Cheers,
Jerome
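
A rough sketch of what "the charge gets transferred" means for the
migration path (illustrative only; the series does this inside the
existing mm/migrate.c code rather than in a made-up helper like the
one below, and mem_cgroup_migrate() is the existing memcg hook for
replacing one page with another):

#include <linux/highmem.h>
#include <linux/memcontrol.h>

/* device_page is being vacated, system_page is its replacement */
static void example_leave_device_memory(struct page *device_page,
					struct page *system_page)
{
	/* copy the data back to system RAM ... */
	copy_highpage(system_page, device_page);
	/* ... and let memcg treat the new page as the old one, so the
	 * charge follows the data instead of being dropped and
	 * re-taken (which is why migrating back can not fail on a
	 * memcg limit) */
	mem_cgroup_migrate(device_page, system_page);
	/* the munmap/exit path needs nothing special: once the last
	 * reference to device_page goes away it is uncharged like any
	 * other page */
}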


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-10 15:32         ` Jerome Glisse
@ 2017-07-10 16:04           ` Michal Hocko
  2017-07-10 16:25             ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-07-10 16:04 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon 10-07-17 11:32:23, Jerome Glisse wrote:
> On Mon, Jul 10, 2017 at 10:28:06AM +0200, Michal Hocko wrote:
> > On Wed 05-07-17 10:35:29, Jerome Glisse wrote:
> > > On Tue, Jul 04, 2017 at 02:51:13PM +0200, Michal Hocko wrote:
> > > > On Mon 03-07-17 17:14:14, Jerome Glisse wrote:
> > > > > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > > > > thus you can not use page->lru fields of those pages. This patch
> > > > > re-arrange the uncharge to allow single page to be uncharge without
> > > > > modifying the lru field of the struct page.
> > > > > 
> > > > > There is no change to memcontrol logic, it is the same as it was
> > > > > before this patch.
> > > > 
> > > > What is the memcg semantic of the memory? Why is it even charged? AFAIR
> > > > this is not a reclaimable memory. If yes how are we going to deal with
> > > > memory limits? What should happen if go OOM? Does killing an process
> > > > actually help to release that memory? Isn't it pinned by a device?
> > > > 
> > > > For the patch itself. It is quite ugly but I haven't spotted anything
> > > > obviously wrong with it. It is the memcg semantic with this class of
> > > > memory which makes me worried.
> > > 
> > > So i am facing 3 choices. First one not account device memory at all.
> > > Second one is account device memory like any other memory inside a
> > > process. Third one is account device memory as something entirely new.
> > > 
> > > I pick the second one for two reasons. First because when migrating
> > > back from device memory it means that migration can not fail because
> > > of memory cgroup limit, this simplify an already complex migration
> > > code. Second because i assume that device memory usage is a transient
> > > state ie once device is done with its computation the most likely
> > > outcome is memory is migrated back. From this assumption it means
> > > that you do not want to allow a process to overuse regular memory
> > > while it is using un-accounted device memory. It sounds safer to
> > > account device memory and to keep the process within its memcg
> > > boundary.
> > > 
> > > Admittedly here i am making an assumption and i can be wrong. Thing
> > > is we do not have enough real data of how this will be use and how
> > > much of an impact device memory will have. That is why for now i
> > > would rather restrict myself to either not account it or account it
> > > as usual.
> > > 
> > > If you prefer not accounting it until we have more experience on how
> > > it is use and how it impacts memory resource management i am fine with
> > > that too. It will make the migration code slightly more complex.
> > 
> > I can see why you want to do this but the semantic _has_ to be clear.
> > And as such make sure that the exiting task will simply unpin and
> > invalidate all the device memory (assuming this memory is not shared
> > which I am not sure is even possible).
> 
> So there is 2 differents path out of device memory:
>   - munmap/process exiting: memory will get uncharge from its memory
>     cgroup just like regular memory

I might have missed that in your patch, I admit I only glanced through
it, but the memcg uncharge happens when the last reference to the page
is released. So if the device pins the page for some reason then the
charge will be there even when the oom victim unmaps the memory.

>   - migration to non device memory, the memory cgroup charge get
>     transfer to the new page just like for any other page
> 
> Do you want me to document all this in any specific place ? I will
> add a comment in memory_control.c and in HMM documentations for this
> but should i add it anywhere else ?

The hmm documentation is sufficient, plus the uncharge path if it
needs any special handling.

> Note that the device memory is not pin. The whole point of HMM is to
> do away with any pining. Thought as device page are not on lru they
> are not reclaim like any other page. However we expect that device
> driver might implement something akin to device memory reclaim to
> make room for more important data base on statistic collected by the
> device driver. If there is enough commonality accross devices then
> we might implement a more generic mechanisms but at this point i
> rather grow as we learn.

Do we have any guarantee that devices will _never_ pin those pages? If
no then we have to make sure we can forcefully tear them down.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-10 16:04           ` Michal Hocko
@ 2017-07-10 16:25             ` Jerome Glisse
  2017-07-10 16:36               ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-07-10 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon, Jul 10, 2017 at 06:04:46PM +0200, Michal Hocko wrote:
> On Mon 10-07-17 11:32:23, Jerome Glisse wrote:
> > On Mon, Jul 10, 2017 at 10:28:06AM +0200, Michal Hocko wrote:
> > > On Wed 05-07-17 10:35:29, Jerome Glisse wrote:
> > > > On Tue, Jul 04, 2017 at 02:51:13PM +0200, Michal Hocko wrote:
> > > > > On Mon 03-07-17 17:14:14, Jerome Glisse wrote:
> > > > > > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > > > > > thus you can not use page->lru fields of those pages. This patch
> > > > > > re-arrange the uncharge to allow single page to be uncharge without
> > > > > > modifying the lru field of the struct page.
> > > > > > 
> > > > > > There is no change to memcontrol logic, it is the same as it was
> > > > > > before this patch.
> > > > > 
> > > > > What is the memcg semantic of the memory? Why is it even charged? AFAIR
> > > > > this is not a reclaimable memory. If yes how are we going to deal with
> > > > > memory limits? What should happen if go OOM? Does killing an process
> > > > > actually help to release that memory? Isn't it pinned by a device?
> > > > > 
> > > > > For the patch itself. It is quite ugly but I haven't spotted anything
> > > > > obviously wrong with it. It is the memcg semantic with this class of
> > > > > memory which makes me worried.
> > > > 
> > > > So i am facing 3 choices. First one not account device memory at all.
> > > > Second one is account device memory like any other memory inside a
> > > > process. Third one is account device memory as something entirely new.
> > > > 
> > > > I pick the second one for two reasons. First because when migrating
> > > > back from device memory it means that migration can not fail because
> > > > of memory cgroup limit, this simplify an already complex migration
> > > > code. Second because i assume that device memory usage is a transient
> > > > state ie once device is done with its computation the most likely
> > > > outcome is memory is migrated back. From this assumption it means
> > > > that you do not want to allow a process to overuse regular memory
> > > > while it is using un-accounted device memory. It sounds safer to
> > > > account device memory and to keep the process within its memcg
> > > > boundary.
> > > > 
> > > > Admittedly here i am making an assumption and i can be wrong. Thing
> > > > is we do not have enough real data of how this will be use and how
> > > > much of an impact device memory will have. That is why for now i
> > > > would rather restrict myself to either not account it or account it
> > > > as usual.
> > > > 
> > > > If you prefer not accounting it until we have more experience on how
> > > > it is use and how it impacts memory resource management i am fine with
> > > > that too. It will make the migration code slightly more complex.
> > > 
> > > I can see why you want to do this but the semantic _has_ to be clear.
> > > And as such make sure that the exiting task will simply unpin and
> > > invalidate all the device memory (assuming this memory is not shared
> > > which I am not sure is even possible).
> > 
> > So there is 2 differents path out of device memory:
> >   - munmap/process exiting: memory will get uncharge from its memory
> >     cgroup just like regular memory
> 
> I might have missed that in your patch, I admit I only glanced through
> that, but the memcg uncharged when the last reference to the page is
> released. So if the device pins the page for some reason then the charge
> will be there even when the oom victim unmaps the memory.

A device can not pin memory; it is part of the "contract" when using
HMM. Device memory can never be pinned, neither by the device driver
nor by any other means, ie we want GUP to trigger a migration back to
regular memory. We will relax the GUP requirement at one point
(especially for direct I/O and other short-lived GUP).


> >   - migration to non device memory, the memory cgroup charge get
> >     transfer to the new page just like for any other page
> > 
> > Do you want me to document all this in any specific place ? I will
> > add a comment in memory_control.c and in HMM documentations for this
> > but should i add it anywhere else ?
> 
> hmm documentation is sufficient and the uncharge path if it needs any
> special handling.

Uncharge happens in the ZONE_DEVICE special handling of the page
refcount, ie a ZONE_DEVICE page is free when its refcount reaches 1,
not 0.

> 
> > Note that the device memory is not pin. The whole point of HMM is to
> > do away with any pining. Thought as device page are not on lru they
> > are not reclaim like any other page. However we expect that device
> > driver might implement something akin to device memory reclaim to
> > make room for more important data base on statistic collected by the
> > device driver. If there is enough commonality accross devices then
> > we might implement a more generic mechanisms but at this point i
> > rather grow as we learn.
> 
> Do we have any guarantee that devices will _never_ pin those pages? If
> no then we have to make sure we can forcefully tear them down.

Well yes we do, as long as I monitor how drivers use things :) The
devices we are targeting are like a CPU from the MMU point of view, ie
you can tear down a device page table entry without the device
freaking out about it. So there is no need for the device to pin
anything; if we update its page table to a non present entry, any
further access to the virtual address will trigger a fault that is
then handled by the device driver.

If the process is being killed then the GPU threads can be killed by
the device driver too. Otherwise the page fault is handled with the
help of HMM like any regular CPU page fault. If for some reason we can
not service the fault then the device driver is responsible for
deciding how to handle the various VM_FAULT_ERROR cases. The
expectation is that it kills the device threads and informs userspace
through a device specific API. I think at one point down the road we
will want to standardize a way to communicate fatal error conditions
that affect device threads.

I will review the HMM documentation again to make sure this is all in
black and white. I am pretty sure that some of it is already there.

Bottom line is that we can always free and uncharge a device memory
page just like any regular page.

Cheers,
Jerome
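
A rough sketch of the release side being described (assuming the
dev_pagemap page_free()/data fields from the ZONE_DEVICE/HMM series;
example_put_device_page() is a made-up name, the real logic lives in
the generic put_page()/release path):

#include <linux/memremap.h>
#include <linux/mm.h>

static void example_put_device_page(struct page *page)
{
	/* ZONE_DEVICE pages count as free at refcount 1, not 0: the
	 * remaining reference belongs to the device driver, which gets
	 * the page back through its page_free() callback */
	if (page_ref_dec_return(page) == 1)
		page->pgmap->page_free(page, page->pgmap->data);
}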


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-10 16:25             ` Jerome Glisse
@ 2017-07-10 16:36               ` Michal Hocko
  2017-07-10 16:54                 ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-07-10 16:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon 10-07-17 12:25:42, Jerome Glisse wrote:
[...]
> Bottom line is that we can always free and uncharge device memory
> page just like any regular page.

OK, this answers my earlier question. Then it should be feasible to
charge this memory. There are still some things to handle. E.g. how do
we consider this memory during oom victim selection (this is not
accounted as an anonymous memory in get_mm_counter, right?), maybe others.
But the primary point is that nobody pins the memory outside of the
mapping.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-10 16:36               ` Michal Hocko
@ 2017-07-10 16:54                 ` Jerome Glisse
  2017-07-10 17:48                   ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-07-10 16:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon, Jul 10, 2017 at 06:36:52PM +0200, Michal Hocko wrote:
> On Mon 10-07-17 12:25:42, Jerome Glisse wrote:
> [...]
> > Bottom line is that we can always free and uncharge device memory
> > page just like any regular page.
> 
> OK, this answers my earlier question. Then it should be feasible to
> charge this memory. There are still some things to handle. E.g. how do
> we consider this memory during oom victim selection (this is not
> accounted as an anonymous memory in get_mm_counter, right?), maybe others.
> But the primary point is that nobody pins the memory outside of the
> mapping.

At this point it is accounted as a regular page would be (anonymous,
file or shared memory). I wanted the mm_counters to reflect memcg but
I can untie that. Like I said, at this point we are unsure how usage
of such memory will impact things, so I wanted to keep everything as
if it was regular memory to avoid anything behaving too differently.

Jerome


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-10 16:54                 ` Jerome Glisse
@ 2017-07-10 17:48                   ` Michal Hocko
  2017-07-10 18:10                     ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-07-10 17:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon 10-07-17 12:54:21, Jerome Glisse wrote:
> On Mon, Jul 10, 2017 at 06:36:52PM +0200, Michal Hocko wrote:
> > On Mon 10-07-17 12:25:42, Jerome Glisse wrote:
> > [...]
> > > Bottom line is that we can always free and uncharge device memory
> > > page just like any regular page.
> > 
> > OK, this answers my earlier question. Then it should be feasible to
> > charge this memory. There are still some things to handle. E.g. how do
> > we consider this memory during oom victim selection (this is not
> > accounted as an anonymous memory in get_mm_counter, right?), maybe others.
> > But the primary point is that nobody pins the memory outside of the
> > mapping.
> 
> At this point it is accounted as a regular page would be (anonymous, file
> or share memory). I wanted mm_counters to reflect memcg but i can untie
> that.

I am not sure I understand. If the device memory is accounted to the
same mm counter as the original page then it is correct. I will try to
double check the implementation (hopefully soon).

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-10 17:48                   ` Michal Hocko
@ 2017-07-10 18:10                     ` Jerome Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2017-07-10 18:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Johannes Weiner, Vladimir Davydov,
	cgroups

On Mon, Jul 10, 2017 at 07:48:58PM +0200, Michal Hocko wrote:
> On Mon 10-07-17 12:54:21, Jerome Glisse wrote:
> > On Mon, Jul 10, 2017 at 06:36:52PM +0200, Michal Hocko wrote:
> > > On Mon 10-07-17 12:25:42, Jerome Glisse wrote:
> > > [...]
> > > > Bottom line is that we can always free and uncharge device memory
> > > > page just like any regular page.
> > > 
> > > OK, this answers my earlier question. Then it should be feasible to
> > > charge this memory. There are still some things to handle. E.g. how do
> > > we consider this memory during oom victim selection (this is not
> > > accounted as an anonymous memory in get_mm_counter, right?), maybe others.
> > > But the primary point is that nobody pins the memory outside of the
> > > mapping.
> > 
> > At this point it is accounted as a regular page would be (anonymous, file
> > or share memory). I wanted mm_counters to reflect memcg but i can untie
> > that.
> 
> I am not sure I understand. If the device memory is accounted to the
> same mm counter as the original page then it is correct. I will try to
> double check the implementation (hopefully soon).

It is accounted like the original page. By "same as memcg" I mean I
made the same kind of choice for the mm counters as I made for memcg.
It is all in the migrate code (migrate.c), ie I don't touch any of
the mm counters when migrating a page.

Jerome


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-05 18:49         ` Jerome Glisse
@ 2017-07-11  3:48           ` Balbir Singh
  2017-07-11  7:31           ` Dan Williams
  1 sibling, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2017-07-11  3:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Ross Zwisler

On Wed, 5 Jul 2017 14:49:33 -0400
Jerome Glisse <jglisse@redhat.com> wrote:

> On Wed, Jul 05, 2017 at 09:15:35AM -0700, Dan Williams wrote:
> > On Wed, Jul 5, 2017 at 7:25 AM, Jerome Glisse <jglisse@redhat.com> wrote:  
> > > On Mon, Jul 03, 2017 at 04:49:18PM -0700, Dan Williams wrote:  
> > >> On Mon, Jul 3, 2017 at 2:14 PM, Jérôme Glisse <jglisse@redhat.com> wrote:  
> > >> > Use consistent name between IORES_DESC and enum memory_type, rename
> > >> > MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
> > >> > the public name for CDM (cache coherent device memory) for which the
> > >> > term public is a better match.
> > >> >
> > >> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > >> > Cc: Dan Williams <dan.j.williams@intel.com>
> > >> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > >> > ---
> > >> >  include/linux/memremap.h | 4 ++--
> > >> >  kernel/memremap.c        | 2 +-
> > >> >  2 files changed, 3 insertions(+), 3 deletions(-)
> > >> >
> > >> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > >> > index 57546a07a558..2299cc2d387d 100644
> > >> > --- a/include/linux/memremap.h
> > >> > +++ b/include/linux/memremap.h
> > >> > @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> > >> >   * Specialize ZONE_DEVICE memory into multiple types each having differents
> > >> >   * usage.
> > >> >   *
> > >> > - * MEMORY_DEVICE_PUBLIC:
> > >> > + * MEMORY_DEVICE_PERSISTENT:
> > >> >   * Persistent device memory (pmem): struct page might be allocated in different
> > >> >   * memory and architecture might want to perform special actions. It is similar
> > >> >   * to regular memory, in that the CPU can access it transparently. However,
> > >> > @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> > >> >   * include/linux/hmm.h and Documentation/vm/hmm.txt.
> > >> >   */
> > >> >  enum memory_type {
> > >> > -       MEMORY_DEVICE_PUBLIC = 0,
> > >> > +       MEMORY_DEVICE_PERSISTENT = 0,
> > >> >         MEMORY_DEVICE_PRIVATE,
> > >> >  };
> > >> >
> > >> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > >> > index b9baa6c07918..e82456c39a6a 100644
> > >> > --- a/kernel/memremap.c
> > >> > +++ b/kernel/memremap.c
> > >> > @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> > >> >         }
> > >> >         pgmap->ref = ref;
> > >> >         pgmap->res = &page_map->res;
> > >> > -       pgmap->type = MEMORY_DEVICE_PUBLIC;
> > >> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
> > >> >         pgmap->page_fault = NULL;
> > >> >         pgmap->page_free = NULL;
> > >> >         pgmap->data = NULL;  
> > >>
> > >> I think we need a different name. There's nothing "persistent" about
> > >> the devm_memremap_pages() path. Why can't they share name, is the only
> > >> difference coherence? I'm thinking something like:
> > >>
> > >> MEMORY_DEVICE_PRIVATE
> > >> MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
> > >> MEMORY_DEVICE_IO /* "public", but not coherent */  
> > >
> > > No that would not work. Device public (in the context of this patchset)
> > > is like device private ie device public page can be anywhere inside a
> > > process address space either as anonymous memory page or as file back
> > > page of regular filesystem (ie vma->ops is not pointing to anything
> > > specific to the device memory).
> > >
> > > As such device public is different from how persistent memory is use
> > > and those the cache coherency being the same between the two kind of
> > > memory is not a discerning factor. So i need to distinguish between
> > > persistent memory and device public memory.
> > >
> > > I believe keeping enum memory_type close to IORES_DESC naming is the
> > > cleanest way to do that but i am open to other name suggestion.
> > >  
> > 
> > The IORES_DESC has nothing to do with how the memory range is handled
> > by the core mm. It sounds like the distinction this is trying to make
> > is between MEMORY_DEVICE_{PUBLIC,PRIVATE} and MEMORY_DEVICE_HOST.
> > Where a "host" memory range is one that does not need coordination
> > with a specific device.  
> 
> I want to distinguish between:
>   - device memory that is not accessible by the CPU
>   - device memory that is accessible by the CPU just like regular
>     memory
>   - existing user of devm_memremap_pages() which is persistent memory
>     (only pmem seems to call devm_memremap_pages()) that is use like a
>     filesystem or block device and thus isn't use like generic page in
>     a process address space
> 
> So if existing user of devm_memremap_pages() are only persistent memory
> then it made sense to match the IORES_DESC we are expecting to see on
> see such memory.
> 
> For public device memory (in the sense introduced by this patchset) i
> do not know how it will be described by IORES_DESC. i think first folks
> with it are IBM with CAPI and i am not sure they defined something for
> that already.
> 
> I am open to any name beside public (well any reasonable name :)) but
> i do need to be able to distinguish persistent memory as use today from
> this device memory.
> 
> Cheers,
> Jérôme


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2
  2017-07-03 21:14 ` [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2 Jérôme Glisse
@ 2017-07-11  4:12   ` Balbir Singh
  2017-07-11 14:57     ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Balbir Singh @ 2017-07-11  4:12 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt, Ross Zwisler

On Mon,  3 Jul 2017 17:14:12 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> Platform with advance system bus (like CAPI or CCIX) allow device
> memory to be accessible from CPU in a cache coherent fashion. Add
> a new type of ZONE_DEVICE to represent such memory. The use case
> are the same as for the un-addressable device memory but without
> all the corners cases.
>

Looks good overall, some comments inline.
 
> Changed since v1:
>   - Kconfig and #if/#else cleanup
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Balbir Singh <balbirs@au1.ibm.com>
> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/proc/task_mmu.c       |  2 +-
>  include/linux/hmm.h      |  4 ++--
>  include/linux/ioport.h   |  1 +
>  include/linux/memremap.h | 21 +++++++++++++++++
>  include/linux/mm.h       | 16 ++++++++-----
>  kernel/memremap.c        | 11 ++++++---
>  mm/Kconfig               | 11 +++++++++
>  mm/gup.c                 |  7 ++++++
>  mm/hmm.c                 |  4 ++--
>  mm/madvise.c             |  2 +-
>  mm/memory.c              | 46 +++++++++++++++++++++++++++++++++----
>  mm/migrate.c             | 60 ++++++++++++++++++++++++++++++++----------------
>  mm/swap.c                | 11 +++++++++
>  13 files changed, 156 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 957b6ea80d5f..1f38f2c7cc34 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1182,7 +1182,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		if (pm->show_pfn)
>  			frame = pte_pfn(pte);
>  		flags |= PM_PRESENT;
> -		page = vm_normal_page(vma, addr, pte);
> +		page = _vm_normal_page(vma, addr, pte, true);
>  		if (pte_soft_dirty(pte))
>  			flags |= PM_SOFT_DIRTY;
>  	} else if (is_swap_pte(pte)) {
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 458d0d6d82f3..a40288309fd2 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>  
>  
> -#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
>  struct hmm_devmem;
>  
>  struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
> @@ -443,7 +443,7 @@ struct hmm_device {
>   */
>  struct hmm_device *hmm_device_new(void *drvdata);
>  void hmm_device_put(struct hmm_device *hmm_device);
> -#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
> +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
>  
>  
>  /* Below are for HMM internal use only! Not to be used by device driver! */
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 3a4f69137bc2..f5cf32e80041 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -131,6 +131,7 @@ enum {
>  	IORES_DESC_PERSISTENT_MEMORY		= 4,
>  	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
>  	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
> +	IORES_DESC_DEVICE_PUBLIC_MEMORY		= 7,
>  };
>  
>  /* helpers to define resources */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 2299cc2d387d..916ca1653ced 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -57,10 +57,18 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>   *
>   * A more complete discussion of unaddressable memory may be found in
>   * include/linux/hmm.h and Documentation/vm/hmm.txt.
> + *
> + * MEMORY_DEVICE_PUBLIC:
> + * Device memory that is cache coherent from device and CPU point of view. This
> + * is use on platform that have an advance system bus (like CAPI or CCIX). A
> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> + * type. Any page of a process can be migrated to such memory. However no one
> + * should be allow to pin such memory so that it can always be evicted.
>   */
>  enum memory_type {
>  	MEMORY_DEVICE_PERSISTENT = 0,
>  	MEMORY_DEVICE_PRIVATE,
> +	MEMORY_DEVICE_PUBLIC,
>  };
>  
>  /*
> @@ -92,6 +100,8 @@ enum memory_type {
>   * The page_free() callback is called once the page refcount reaches 1
>   * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
>   * This allows the device driver to implement its own memory management.)
> + *
> + * For MEMORY_DEVICE_CACHE_COHERENT only the page_free() callback matter.

Correct, but I wonder if we should in the long term allow for minor faults
(due to coherency) via this interface?

>   */
>  typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
>  				unsigned long addr,
> @@ -134,6 +144,12 @@ static inline bool is_device_private_page(const struct page *page)
>  	return is_zone_device_page(page) &&
>  		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
>  }
> +
> +static inline bool is_device_public_page(const struct page *page)
> +{
> +	return is_zone_device_page(page) &&
> +		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
> +}
>  #else
>  static inline void *devm_memremap_pages(struct device *dev,
>  		struct resource *res, struct percpu_ref *ref,
> @@ -157,6 +173,11 @@ static inline bool is_device_private_page(const struct page *page)
>  {
>  	return false;
>  }
> +
> +static inline bool is_device_public_page(const struct page *page)
> +{
> +	return false;
> +}
>  #endif
>  
>  /**
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 330a216ac315..8b72b122de93 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -797,14 +797,15 @@ static inline bool is_zone_device_page(const struct page *page)
>  #endif
>  
>  #ifdef CONFIG_DEVICE_PRIVATE
> -void put_zone_device_private_page(struct page *page);
> +void put_zone_device_private_or_public_page(struct page *page);
>  #else
> -static inline void put_zone_device_private_page(struct page *page)
> +static inline void put_zone_device_private_or_public_page(struct page *page)
>  {
>  }
>  #endif
>  
>  static inline bool is_device_private_page(const struct page *page);
> +static inline bool is_device_public_page(const struct page *page);
>  
>  DECLARE_STATIC_KEY_FALSE(device_private_key);
>  
> @@ -830,8 +831,9 @@ static inline void put_page(struct page *page)
>  	 * include/linux/memremap.h and HMM for details.
>  	 */
>  	if (static_branch_unlikely(&device_private_key) &&
> -	    unlikely(is_device_private_page(page))) {
> -		put_zone_device_private_page(page);
> +	    unlikely(is_device_private_page(page) ||
> +		     is_device_public_page(page))) {
> +		put_zone_device_private_or_public_page(page);
>  		return;
>  	}
>  
> @@ -1220,8 +1222,10 @@ struct zap_details {
>  	pgoff_t last_index;			/* Highest page->index to unmap */
>  };
>  
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -		pte_t pte);
> +struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte, bool with_public_device);
> +#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
> +
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  				pmd_t pmd);
>  
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index e82456c39a6a..da74775f2247 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -466,7 +466,7 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>  
>  
>  #ifdef CONFIG_DEVICE_PRIVATE

Does the #ifdef above need to go as well?

> -void put_zone_device_private_page(struct page *page)
> +void put_zone_device_private_or_public_page(struct page *page)
>  {
>  	int count = page_ref_dec_return(page);
>  
> @@ -474,10 +474,15 @@ void put_zone_device_private_page(struct page *page)
>  	 * If refcount is 1 then page is freed and refcount is stable as nobody
>  	 * holds a reference on the page.
>  	 */
> -	if (count == 1)
> +	if (count == 1) {
> +		/* Clear Active bit in case of parallel mark_page_accessed */
> +		__ClearPageActive(page);
> +		__ClearPageWaiters(page);
> +
>  		page->pgmap->page_free(page, page->pgmap->data);
> +	}
>  	else if (!count)
>  		__put_page(page);
>  }
> -EXPORT_SYMBOL(put_zone_device_private_page);
> +EXPORT_SYMBOL(put_zone_device_private_or_public_page);
>  #endif /* CONFIG_DEVICE_PRIVATE */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3269ff1cc4cd..6002f1e40fcd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -712,12 +712,23 @@ config ZONE_DEVICE
>  config DEVICE_PRIVATE
>  	bool "Unaddressable device memory (GPU memory, ...)"
>  	depends on ARCH_HAS_HMM
> +	select HMM
>  
>  	help
>  	  Allows creation of struct pages to represent unaddressable device
>  	  memory; i.e., memory that is only accessible from the device (or
>  	  group of devices).
>  
> +config DEVICE_PUBLIC
> +	bool "Addressable device memory (like GPU memory)"
> +	depends on ARCH_HAS_HMM
> +	select HMM
> +
> +	help
> +	  Allows creation of struct pages to represent addressable device
> +	  memory; i.e., memory that is accessible from both the device and
> +	  the CPU
> +
>  config FRAME_VECTOR
>  	bool
>  
> diff --git a/mm/gup.c b/mm/gup.c
> index 23f01c40c88f..2f8e8604ff80 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -438,6 +438,13 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>  		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
>  			goto unmap;
>  		*page = pte_page(*pte);
> +
> +		/*
> +		 * This should never happen (a device public page in the gate
> +		 * area).
> +		 */
> +		if (is_device_public_page(*page))
> +			goto unmap;
>  	}
>  	get_page(*page);
>  out:
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 4e01c9ba9cc1..eadf70829c34 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(hmm_vma_fault);
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>  
>  
> -#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
>  struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
>  				       unsigned long addr)
>  {
> @@ -1190,4 +1190,4 @@ static int __init hmm_init(void)
>  }
>  
>  device_initcall(hmm_init);
> -#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
> +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9976852f1e1c..197277156ce3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -343,7 +343,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			continue;
>  		}
>  
> -		page = vm_normal_page(vma, addr, ptent);
> +		page = _vm_normal_page(vma, addr, ptent, true);
>  		if (!page)
>  			continue;
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 4fcdab3ec525..cee2bed01aa0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -789,8 +789,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  #else
>  # define HAVE_PTE_SPECIAL 0
>  #endif
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -				pte_t pte)
> +struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte, bool with_public_device)
>  {
>  	unsigned long pfn = pte_pfn(pte);
>  
> @@ -801,8 +801,31 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  			return vma->vm_ops->find_special_page(vma, addr);
>  		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>  			return NULL;
> -		if (!is_zero_pfn(pfn))
> -			print_bad_pte(vma, addr, pte, NULL);
> +		if (is_zero_pfn(pfn))
> +			return NULL;
> +
> +		/*
> +		 * Device public pages are special pages (they are ZONE_DEVICE
> +		 * pages but different from persistent memory). They behave
> +		 * allmost like normal pages. The difference is that they are
> +		 * not on the lru and thus should never be involve with any-
> +		 * thing that involve lru manipulation (mlock, numa balancing,
> +		 * ...).
> +		 *
> +		 * This is why we still want to return NULL for such page from
> +		 * vm_normal_page() so that we do not have to special case all
> +		 * call site of vm_normal_page().
> +		 */
> +		if (likely(pfn < highest_memmap_pfn)) {
> +			struct page *page = pfn_to_page(pfn);
> +
> +			if (is_device_public_page(page)) {
> +				if (with_public_device)
> +					return page;
> +				return NULL;
> +			}
> +		}
> +		print_bad_pte(vma, addr, pte, NULL);
>  		return NULL;
>  	}
>  
> @@ -983,6 +1006,19 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		get_page(page);
>  		page_dup_rmap(page, false);
>  		rss[mm_counter(page)]++;
> +	} else if (pte_devmap(pte)) {
> +		page = pte_page(pte);
> +
> +		/*
> +		 * Cache coherent device memory behave like regular page and
> +		 * not like persistent memory page. For more informations see
> +		 * MEMORY_DEVICE_CACHE_COHERENT in memory_hotplug.h
> +		 */
> +		if (is_device_public_page(page)) {
> +			get_page(page);
> +			page_dup_rmap(page, false);
> +			rss[mm_counter(page)]++;
> +		}
>  	}
>  
>  out_set_pte:
> @@ -1236,7 +1272,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		if (pte_present(ptent)) {
>  			struct page *page;
>  
> -			page = vm_normal_page(vma, addr, ptent);
> +			page = _vm_normal_page(vma, addr, ptent, true);
>  			if (unlikely(details) && page) {
>  				/*
>  				 * unmap_shared_mapping_pages() wants to
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 643ea61ca9bb..f9ae57f0c7a1 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -229,12 +229,19 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		if (is_write_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
>  
> -		if (unlikely(is_zone_device_page(new)) &&
> -		    is_device_private_page(new)) {
> -			entry = make_device_private_entry(new, pte_write(pte));
> -			pte = swp_entry_to_pte(entry);
> -			if (pte_swp_soft_dirty(*pvmw.pte))
> -				pte = pte_mksoft_dirty(pte);
> +		if (unlikely(is_zone_device_page(new))) {
> +			if (is_device_private_page(new)) {
> +				entry = make_device_private_entry(new, pte_write(pte));
> +				pte = swp_entry_to_pte(entry);
> +				if (pte_swp_soft_dirty(*pvmw.pte))
> +					pte = pte_mksoft_dirty(pte);
> +			}
> +#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)
> +			else if (is_device_public_page(new)) {
> +				pte = pte_mkdevmap(pte);
> +				flush_dcache_page(new);
> +			}
> +#endif /* IS_ENABLED(CONFIG_DEVICE_PUBLIC) */
>  		} else
>  			flush_dcache_page(new);
>  
> @@ -408,12 +415,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
>  	void **pslot;
>  
>  	/*
> -	 * ZONE_DEVICE pages have 1 refcount always held by their device
> -	 *
> -	 * Note that DAX memory will never reach that point as it does not have
> -	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
> +	 * Device public or private pages have an extra refcount as they are
> +	 * ZONE_DEVICE pages.
>  	 */
> -	expected_count += is_zone_device_page(page);
> +	expected_count += is_device_private_page(page);
> +	expected_count += is_device_public_page(page);
>  
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
> @@ -2087,7 +2093,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  
>  #endif /* CONFIG_NUMA */
>  
> -
>  struct migrate_vma {
>  	struct vm_area_struct	*vma;
>  	unsigned long		*dst;
> @@ -2186,7 +2191,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			if (is_write_device_private_entry(entry))
>  				mpfn |= MIGRATE_PFN_WRITE;
>  		} else {
> -			page = vm_normal_page(migrate->vma, addr, pte);
> +			page = _vm_normal_page(migrate->vma, addr, pte, true);
>  			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>  		}
> @@ -2311,13 +2316,18 @@ static bool migrate_vma_check_page(struct page *page)
>  
>  	/* Page from ZONE_DEVICE have one extra reference */
>  	if (is_zone_device_page(page)) {
> -		if (is_device_private_page(page)) {
> +		if (is_device_private_page(page) ||
> +		    is_device_public_page(page))
>  			extra++;
> -		} else
> +		else
>  			/* Other ZONE_DEVICE memory type are not supported */
>  			return false;
>  	}
>  
> +	/* For file back page */
> +	if (page_mapping(page))
> +		extra += 1 + page_has_private(page);
> +
>  	if ((page_count(page) - extra) > page_mapcount(page))
>  		return false;
>  
> @@ -2541,11 +2551,21 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>  	 */
>  	__SetPageUptodate(page);
>  
> -	if (is_zone_device_page(page) && is_device_private_page(page)) {
> -		swp_entry_t swp_entry;
> +	if (is_zone_device_page(page)) {
> +		if (is_device_private_page(page)) {
> +			swp_entry_t swp_entry;
>  
> -		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
> -		entry = swp_entry_to_pte(swp_entry);
> +			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
> +			entry = swp_entry_to_pte(swp_entry);
> +		}
> +#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)

Do we need this #if check? is_device_public_page(page)
will return false if the config is disabled.

> +		else if (is_device_public_page(page)) {
> +			entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> +			if (vma->vm_flags & VM_WRITE)
> +				entry = pte_mkwrite(pte_mkdirty(entry));
> +			entry = pte_mkdevmap(entry);
> +		}
> +#endif /* IS_ENABLED(CONFIG_DEVICE_PUBLIC) */


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-05 18:49         ` Jerome Glisse
  2017-07-11  3:48           ` Balbir Singh
@ 2017-07-11  7:31           ` Dan Williams
  2017-07-11 15:05             ` Jerome Glisse
  1 sibling, 1 reply; 30+ messages in thread
From: Dan Williams @ 2017-07-11  7:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Ross Zwisler

On Wed, Jul 5, 2017 at 11:49 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Wed, Jul 05, 2017 at 09:15:35AM -0700, Dan Williams wrote:
>> On Wed, Jul 5, 2017 at 7:25 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> > On Mon, Jul 03, 2017 at 04:49:18PM -0700, Dan Williams wrote:
>> >> On Mon, Jul 3, 2017 at 2:14 PM, Jérôme Glisse <jglisse@redhat.com> wrote:
>> >> > Use consistent name between IORES_DESC and enum memory_type, rename
>> >> > MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
>> >> > the public name for CDM (cache coherent device memory) for which the
>> >> > term public is a better match.
>> >> >
>> >> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>> >> > Cc: Dan Williams <dan.j.williams@intel.com>
>> >> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> >> > ---
>> >> >  include/linux/memremap.h | 4 ++--
>> >> >  kernel/memremap.c        | 2 +-
>> >> >  2 files changed, 3 insertions(+), 3 deletions(-)
>> >> >
>> >> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> >> > index 57546a07a558..2299cc2d387d 100644
>> >> > --- a/include/linux/memremap.h
>> >> > +++ b/include/linux/memremap.h
>> >> > @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>> >> >   * Specialize ZONE_DEVICE memory into multiple types each having differents
>> >> >   * usage.
>> >> >   *
>> >> > - * MEMORY_DEVICE_PUBLIC:
>> >> > + * MEMORY_DEVICE_PERSISTENT:
>> >> >   * Persistent device memory (pmem): struct page might be allocated in different
>> >> >   * memory and architecture might want to perform special actions. It is similar
>> >> >   * to regular memory, in that the CPU can access it transparently. However,
>> >> > @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>> >> >   * include/linux/hmm.h and Documentation/vm/hmm.txt.
>> >> >   */
>> >> >  enum memory_type {
>> >> > -       MEMORY_DEVICE_PUBLIC = 0,
>> >> > +       MEMORY_DEVICE_PERSISTENT = 0,
>> >> >         MEMORY_DEVICE_PRIVATE,
>> >> >  };
>> >> >
>> >> > diff --git a/kernel/memremap.c b/kernel/memremap.c
>> >> > index b9baa6c07918..e82456c39a6a 100644
>> >> > --- a/kernel/memremap.c
>> >> > +++ b/kernel/memremap.c
>> >> > @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>> >> >         }
>> >> >         pgmap->ref = ref;
>> >> >         pgmap->res = &page_map->res;
>> >> > -       pgmap->type = MEMORY_DEVICE_PUBLIC;
>> >> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
>> >> >         pgmap->page_fault = NULL;
>> >> >         pgmap->page_free = NULL;
>> >> >         pgmap->data = NULL;
>> >>
>> >> I think we need a different name. There's nothing "persistent" about
>> >> the devm_memremap_pages() path. Why can't they share name, is the only
>> >> difference coherence? I'm thinking something like:
>> >>
>> >> MEMORY_DEVICE_PRIVATE
>> >> MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
>> >> MEMORY_DEVICE_IO /* "public", but not coherent */
>> >
>> > No that would not work. Device public (in the context of this patchset)
>> > is like device private ie device public page can be anywhere inside a
>> > process address space either as anonymous memory page or as file back
>> > page of regular filesystem (ie vma->ops is not pointing to anything
>> > specific to the device memory).
>> >
>> > As such device public is different from how persistent memory is use
>> > and those the cache coherency being the same between the two kind of
>> > memory is not a discerning factor. So i need to distinguish between
>> > persistent memory and device public memory.
>> >
>> > I believe keeping enum memory_type close to IORES_DESC naming is the
>> > cleanest way to do that but i am open to other name suggestion.
>> >
>>
>> The IORES_DESC has nothing to do with how the memory range is handled
>> by the core mm. It sounds like the distinction this is trying to make
>> is between MEMORY_DEVICE_{PUBLIC,PRIVATE} and MEMORY_DEVICE_HOST.
>> Where a "host" memory range is one that does not need coordination
>> with a specific device.
>
> I want to distinguish between:
>   - device memory that is not accessible by the CPU
>   - device memory that is accessible by the CPU just like regular
>     memory
>   - existing user of devm_memremap_pages() which is persistent memory
>     (only pmem seems to call devm_memremap_pages()) that is use like a
>     filesystem or block device and thus isn't use like generic page in
>     a process address space
>
> So if existing user of devm_memremap_pages() are only persistent memory
> then it made sense to match the IORES_DESC we are expecting to see on
> see such memory.
>
> For public device memory (in the sense introduced by this patchset) i
> do not know how it will be described by IORES_DESC. i think first folks
> with it are IBM with CAPI and i am not sure they defined something for
> that already.
>
> I am open to any name beside public (well any reasonable name :)) but
> i do need to be able to distinguish persistent memory as use today from
> this device memory.

Right, so that's why I suggested MEMORY_DEVICE_HOST for memory that is
just normal host memory and does not have any device-entanglements
outside of the base ZONE_DEVICE registration.
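
For reference, a minimal sketch of how enum memory_type might read under that naming, combining MEMORY_DEVICE_HOST with the PRIVATE/PUBLIC names discussed elsewhere in the thread (an illustration of the suggestion, not code from the patchset):

	enum memory_type {
		MEMORY_DEVICE_HOST = 0,	/* pmem: plain host-managed ZONE_DEVICE memory */
		MEMORY_DEVICE_PRIVATE,	/* device memory not addressable by the CPU */
		MEMORY_DEVICE_PUBLIC,	/* CDM: CPU addressable and cache coherent */
	};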


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2
  2017-07-11  4:12   ` Balbir Singh
@ 2017-07-11 14:57     ` Jerome Glisse
  2017-07-12  5:50       ` Balbir Singh
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-07-11 14:57 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt, Ross Zwisler

On Tue, Jul 11, 2017 at 02:12:15PM +1000, Balbir Singh wrote:
> On Mon,  3 Jul 2017 17:14:12 -0400
> Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > Platform with advance system bus (like CAPI or CCIX) allow device
> > memory to be accessible from CPU in a cache coherent fashion. Add
> > a new type of ZONE_DEVICE to represent such memory. The use case
> > are the same as for the un-addressable device memory but without
> > all the corners cases.
> >
> 
> Looks good overall, some comments inline.
>  

[...]

> >  /*
> > @@ -92,6 +100,8 @@ enum memory_type {
> >   * The page_free() callback is called once the page refcount reaches 1
> >   * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
> >   * This allows the device driver to implement its own memory management.)
> > + *
> > + * For MEMORY_DEVICE_CACHE_COHERENT only the page_free() callback matter.
> 
> Correct, but I wonder if we should in the long term allow for minor faults
> (due to coherency) via this interface?

This is something we can explore later on.

[...]

> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index e82456c39a6a..da74775f2247 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -466,7 +466,7 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >  
> >  
> >  #ifdef CONFIG_DEVICE_PRIVATE
> 
> Does the #ifdef above need to go as well?

Good catch, I should make that conditional on DEVICE_PUBLIC or whatever
the name ends up being. I will make sure I test without the DEVICE_PRIVATE
config before posting again.
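
A minimal sketch of the intended fix in kernel/memremap.c, assuming the second config keeps the name CONFIG_DEVICE_PUBLIC (it mirrors the guards already used in include/linux/hmm.h and mm/hmm.c, and is an illustration rather than the actual follow-up patch):

	#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
	void put_zone_device_private_or_public_page(struct page *page)
	{
		int count = page_ref_dec_return(page);

		/*
		 * A refcount of 1 means the page is free: hand it back to the
		 * device driver through its page_free() callback.
		 */
		if (count == 1) {
			/* Clear Active bit in case of parallel mark_page_accessed */
			__ClearPageActive(page);
			__ClearPageWaiters(page);

			page->pgmap->page_free(page, page->pgmap->data);
		} else if (!count)
			__put_page(page);
	}
	EXPORT_SYMBOL(put_zone_device_private_or_public_page);
	#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */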

[...]

> > @@ -2541,11 +2551,21 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >  	 */
> >  	__SetPageUptodate(page);
> >  
> > -	if (is_zone_device_page(page) && is_device_private_page(page)) {
> > -		swp_entry_t swp_entry;
> > +	if (is_zone_device_page(page)) {
> > +		if (is_device_private_page(page)) {
> > +			swp_entry_t swp_entry;
> >  
> > -		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
> > -		entry = swp_entry_to_pte(swp_entry);
> > +			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
> > +			entry = swp_entry_to_pte(swp_entry);
> > +		}
> > +#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)
> 
> Do we need this #if check? is_device_public_page(page)
> will return false if the config is disabled

pte_mkdevmap() is not defined if ZONE_DEVICE is not enabled, hence
I had to protect this with #if/#endif to avoid a build error.

> 
> > +		else if (is_device_public_page(page)) {
> > +			entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > +			if (vma->vm_flags & VM_WRITE)
> > +				entry = pte_mkwrite(pte_mkdirty(entry));
> > +			entry = pte_mkdevmap(entry);
> > +		}
> > +#endif /* IS_ENABLED(CONFIG_DEVICE_PUBLIC) */


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-11  7:31           ` Dan Williams
@ 2017-07-11 15:05             ` Jerome Glisse
  2017-07-11 16:49               ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-07-11 15:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Ross Zwisler

On Tue, Jul 11, 2017 at 12:31:22AM -0700, Dan Williams wrote:
> On Wed, Jul 5, 2017 at 11:49 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Wed, Jul 05, 2017 at 09:15:35AM -0700, Dan Williams wrote:
> >> On Wed, Jul 5, 2017 at 7:25 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> > On Mon, Jul 03, 2017 at 04:49:18PM -0700, Dan Williams wrote:
> >> >> On Mon, Jul 3, 2017 at 2:14 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> >> > Use consistent name between IORES_DESC and enum memory_type, rename
> >> >> > MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
> >> >> > the public name for CDM (cache coherent device memory) for which the
> >> >> > term public is a better match.
> >> >> >
> >> >> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> >> >> > Cc: Dan Williams <dan.j.williams@intel.com>
> >> >> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> >> > ---
> >> >> >  include/linux/memremap.h | 4 ++--
> >> >> >  kernel/memremap.c        | 2 +-
> >> >> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >> >> >
> >> >> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> >> > index 57546a07a558..2299cc2d387d 100644
> >> >> > --- a/include/linux/memremap.h
> >> >> > +++ b/include/linux/memremap.h
> >> >> > @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >> >> >   * Specialize ZONE_DEVICE memory into multiple types each having differents
> >> >> >   * usage.
> >> >> >   *
> >> >> > - * MEMORY_DEVICE_PUBLIC:
> >> >> > + * MEMORY_DEVICE_PERSISTENT:
> >> >> >   * Persistent device memory (pmem): struct page might be allocated in different
> >> >> >   * memory and architecture might want to perform special actions. It is similar
> >> >> >   * to regular memory, in that the CPU can access it transparently. However,
> >> >> > @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >> >> >   * include/linux/hmm.h and Documentation/vm/hmm.txt.
> >> >> >   */
> >> >> >  enum memory_type {
> >> >> > -       MEMORY_DEVICE_PUBLIC = 0,
> >> >> > +       MEMORY_DEVICE_PERSISTENT = 0,
> >> >> >         MEMORY_DEVICE_PRIVATE,
> >> >> >  };
> >> >> >
> >> >> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> >> >> > index b9baa6c07918..e82456c39a6a 100644
> >> >> > --- a/kernel/memremap.c
> >> >> > +++ b/kernel/memremap.c
> >> >> > @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> >> >> >         }
> >> >> >         pgmap->ref = ref;
> >> >> >         pgmap->res = &page_map->res;
> >> >> > -       pgmap->type = MEMORY_DEVICE_PUBLIC;
> >> >> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
> >> >> >         pgmap->page_fault = NULL;
> >> >> >         pgmap->page_free = NULL;
> >> >> >         pgmap->data = NULL;
> >> >>
> >> >> I think we need a different name. There's nothing "persistent" about
> >> >> the devm_memremap_pages() path. Why can't they share name, is the only
> >> >> difference coherence? I'm thinking something like:
> >> >>
> >> >> MEMORY_DEVICE_PRIVATE
> >> >> MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
> >> >> MEMORY_DEVICE_IO /* "public", but not coherent */
> >> >
> >> > No that would not work. Device public (in the context of this patchset)
> >> > is like device private ie device public page can be anywhere inside a
> >> > process address space either as anonymous memory page or as file back
> >> > page of regular filesystem (ie vma->ops is not pointing to anything
> >> > specific to the device memory).
> >> >
> >> > As such device public is different from how persistent memory is use
> >> > and those the cache coherency being the same between the two kind of
> >> > memory is not a discerning factor. So i need to distinguish between
> >> > persistent memory and device public memory.
> >> >
> >> > I believe keeping enum memory_type close to IORES_DESC naming is the
> >> > cleanest way to do that but i am open to other name suggestion.
> >> >
> >>
> >> The IORES_DESC has nothing to do with how the memory range is handled
> >> by the core mm. It sounds like the distinction this is trying to make
> >> is between MEMORY_DEVICE_{PUBLIC,PRIVATE} and MEMORY_DEVICE_HOST.
> >> Where a "host" memory range is one that does not need coordination
> >> with a specific device.
> >
> > I want to distinguish between:
> >   - device memory that is not accessible by the CPU
> >   - device memory that is accessible by the CPU just like regular
> >     memory
> >   - existing user of devm_memremap_pages() which is persistent memory
> >     (only pmem seems to call devm_memremap_pages()) that is use like a
> >     filesystem or block device and thus isn't use like generic page in
> >     a process address space
> >
> > So if existing user of devm_memremap_pages() are only persistent memory
> > then it made sense to match the IORES_DESC we are expecting to see on
> > see such memory.
> >
> > For public device memory (in the sense introduced by this patchset) i
> > do not know how it will be described by IORES_DESC. i think first folks
> > with it are IBM with CAPI and i am not sure they defined something for
> > that already.
> >
> > I am open to any name beside public (well any reasonable name :)) but
> > i do need to be able to distinguish persistent memory as use today from
> > this device memory.
> 
> Right, so that's why I suggested MEMORY_DEVICE_HOST for memory that is
> just normal host memory and does not have any device-entanglements
> outside of the base ZONE_DEVICE registration.

Well, the memory considered for DEVICE_PUBLIC is device memory, so it is
very much entangled with a device. It is memory that is physically on
the device. It is just that new system buses like CAPI or CCIX allow the
CPU to access such memory with the same cache coherency as if it were
accessing regular system DDR memory. It is expected that this memory
will be managed by the device driver and not by core memory management.

But I am ok with MEMORY_DEVICE_HOST; after all, this is just a name. But
what you put behind that name is not the reality of the memory. I just
want to be clear on that.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one
  2017-07-11 15:05             ` Jerome Glisse
@ 2017-07-11 16:49               ` Dan Williams
  0 siblings, 0 replies; 30+ messages in thread
From: Dan Williams @ 2017-07-11 16:49 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Ross Zwisler

On Tue, Jul 11, 2017 at 8:05 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Tue, Jul 11, 2017 at 12:31:22AM -0700, Dan Williams wrote:
>> On Wed, Jul 5, 2017 at 11:49 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> > On Wed, Jul 05, 2017 at 09:15:35AM -0700, Dan Williams wrote:
>> >> On Wed, Jul 5, 2017 at 7:25 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> >> > On Mon, Jul 03, 2017 at 04:49:18PM -0700, Dan Williams wrote:
>> >> >> On Mon, Jul 3, 2017 at 2:14 PM, Jérôme Glisse <jglisse@redhat.com> wrote:
>> >> >> > Use consistent name between IORES_DESC and enum memory_type, rename
>> >> >> > MEMORY_DEVICE_PUBLIC to MEMORY_DEVICE_PERSISTENT. This is to free up
>> >> >> > the public name for CDM (cache coherent device memory) for which the
>> >> >> > term public is a better match.
>> >> >> >
>> >> >> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>> >> >> > Cc: Dan Williams <dan.j.williams@intel.com>
>> >> >> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> >> >> > ---
>> >> >> >  include/linux/memremap.h | 4 ++--
>> >> >> >  kernel/memremap.c        | 2 +-
>> >> >> >  2 files changed, 3 insertions(+), 3 deletions(-)
>> >> >> >
>> >> >> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> >> >> > index 57546a07a558..2299cc2d387d 100644
>> >> >> > --- a/include/linux/memremap.h
>> >> >> > +++ b/include/linux/memremap.h
>> >> >> > @@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>> >> >> >   * Specialize ZONE_DEVICE memory into multiple types each having differents
>> >> >> >   * usage.
>> >> >> >   *
>> >> >> > - * MEMORY_DEVICE_PUBLIC:
>> >> >> > + * MEMORY_DEVICE_PERSISTENT:
>> >> >> >   * Persistent device memory (pmem): struct page might be allocated in different
>> >> >> >   * memory and architecture might want to perform special actions. It is similar
>> >> >> >   * to regular memory, in that the CPU can access it transparently. However,
>> >> >> > @@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>> >> >> >   * include/linux/hmm.h and Documentation/vm/hmm.txt.
>> >> >> >   */
>> >> >> >  enum memory_type {
>> >> >> > -       MEMORY_DEVICE_PUBLIC = 0,
>> >> >> > +       MEMORY_DEVICE_PERSISTENT = 0,
>> >> >> >         MEMORY_DEVICE_PRIVATE,
>> >> >> >  };
>> >> >> >
>> >> >> > diff --git a/kernel/memremap.c b/kernel/memremap.c
>> >> >> > index b9baa6c07918..e82456c39a6a 100644
>> >> >> > --- a/kernel/memremap.c
>> >> >> > +++ b/kernel/memremap.c
>> >> >> > @@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>> >> >> >         }
>> >> >> >         pgmap->ref = ref;
>> >> >> >         pgmap->res = &page_map->res;
>> >> >> > -       pgmap->type = MEMORY_DEVICE_PUBLIC;
>> >> >> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
>> >> >> >         pgmap->page_fault = NULL;
>> >> >> >         pgmap->page_free = NULL;
>> >> >> >         pgmap->data = NULL;
>> >> >>
>> >> >> I think we need a different name. There's nothing "persistent" about
>> >> >> the devm_memremap_pages() path. Why can't they share name, is the only
>> >> >> difference coherence? I'm thinking something like:
>> >> >>
>> >> >> MEMORY_DEVICE_PRIVATE
>> >> >> MEMORY_DEVICE_COHERENT /* persistent memory and coherent devices */
>> >> >> MEMORY_DEVICE_IO /* "public", but not coherent */
>> >> >
>> >> > No that would not work. Device public (in the context of this patchset)
>> >> > is like device private ie device public page can be anywhere inside a
>> >> > process address space either as anonymous memory page or as file back
>> >> > page of regular filesystem (ie vma->ops is not pointing to anything
>> >> > specific to the device memory).
>> >> >
>> >> > As such device public is different from how persistent memory is use
>> >> > and those the cache coherency being the same between the two kind of
>> >> > memory is not a discerning factor. So i need to distinguish between
>> >> > persistent memory and device public memory.
>> >> >
>> >> > I believe keeping enum memory_type close to IORES_DESC naming is the
>> >> > cleanest way to do that but i am open to other name suggestion.
>> >> >
>> >>
>> >> The IORES_DESC has nothing to do with how the memory range is handled
>> >> by the core mm. It sounds like the distinction this is trying to make
>> >> is between MEMORY_DEVICE_{PUBLIC,PRIVATE} and MEMORY_DEVICE_HOST.
>> >> Where a "host" memory range is one that does not need coordination
>> >> with a specific device.
>> >
>> > I want to distinguish between:
>> >   - device memory that is not accessible by the CPU
>> >   - device memory that is accessible by the CPU just like regular
>> >     memory
>> >   - existing user of devm_memremap_pages() which is persistent memory
>> >     (only pmem seems to call devm_memremap_pages()) that is use like a
>> >     filesystem or block device and thus isn't use like generic page in
>> >     a process address space
>> >
>> > So if existing user of devm_memremap_pages() are only persistent memory
>> > then it made sense to match the IORES_DESC we are expecting to see on
>> > see such memory.
>> >
>> > For public device memory (in the sense introduced by this patchset) i
>> > do not know how it will be described by IORES_DESC. i think first folks
>> > with it are IBM with CAPI and i am not sure they defined something for
>> > that already.
>> >
>> > I am open to any name beside public (well any reasonable name :)) but
>> > i do need to be able to distinguish persistent memory as use today from
>> > this device memory.
>>
>> Right, so that's why I suggested MEMORY_DEVICE_HOST for memory that is
>> just normal host memory and does not have any device-entanglements
>> outside of the base ZONE_DEVICE registration.
>
> Well the memory considered for DEVICE_PUBLIC is device memory so it is
> very much entangled with a device. It is memory that is physically on
> the device. It is just that new system bus like CAPI or CCIX allows
> CPU to access such memory with same cache coherency as if they were
> accessing regular system DDR memory. It is expect that this memory
> will be manage by the device driver and not core memory management.
>
> But i am ok with MEMORY_DEVICE_HOST after all this just a name. But
> what you put behind that name is not the reality of the memory. I just
> want to be clear on that.
>

I was suggesting MEMORY_DEVICE_HOST for persistent memory, and
MEMORY_DEVICE_PUBLIC, as you want, for CDM.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2
  2017-07-11 14:57     ` Jerome Glisse
@ 2017-07-12  5:50       ` Balbir Singh
  0 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2017-07-12  5:50 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt, Ross Zwisler

On Tue, 11 Jul 2017 10:57:44 -0400
Jerome Glisse <jglisse@redhat.com> wrote:

> On Tue, Jul 11, 2017 at 02:12:15PM +1000, Balbir Singh wrote:
> > On Mon,  3 Jul 2017 17:14:12 -0400
> > Jérôme Glisse <jglisse@redhat.com> wrote:
> >   
> > > Platform with advance system bus (like CAPI or CCIX) allow device
> > > memory to be accessible from CPU in a cache coherent fashion. Add
> > > a new type of ZONE_DEVICE to represent such memory. The use case
> > > are the same as for the un-addressable device memory but without
> > > all the corners cases.
> > >  
> > 
> > Looks good overall, some comments inline.
> >    
> 
> [...]
> 
> > >  /*
> > > @@ -92,6 +100,8 @@ enum memory_type {
> > >   * The page_free() callback is called once the page refcount reaches 1
> > >   * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
> > >   * This allows the device driver to implement its own memory management.)
> > > + *
> > > + * For MEMORY_DEVICE_CACHE_COHERENT only the page_free() callback matter.  
> > 
> > Correct, but I wonder if we should in the long term allow for minor faults
> > (due to coherency) via this interface?  
> 
> This is something we can explore latter on.
> 
> [...]
> 

Agreed

> > > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > > index e82456c39a6a..da74775f2247 100644
> > > --- a/kernel/memremap.c
> > > +++ b/kernel/memremap.c
> > > @@ -466,7 +466,7 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> > >  
> > >  
> > >  #ifdef CONFIG_DEVICE_PRIVATE  
> > 
> > Does the #ifdef above need to go as well?  
> 
> Good catch i should make that conditional on DEVICE_PUBLIC or whatever
> the name endup to be. I will make sure i test without DEVICE_PRIVATE
> config before posting again.
> 
> [...]
> 

I've been testing with this off. I should have sent you a patch, but
I thought I'd also mention it in the review.

> > > @@ -2541,11 +2551,21 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> > >  	 */
> > >  	__SetPageUptodate(page);
> > >  
> > > -	if (is_zone_device_page(page) && is_device_private_page(page)) {
> > > -		swp_entry_t swp_entry;
> > > +	if (is_zone_device_page(page)) {
> > > +		if (is_device_private_page(page)) {
> > > +			swp_entry_t swp_entry;
> > >  
> > > -		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
> > > -		entry = swp_entry_to_pte(swp_entry);
> > > +			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
> > > +			entry = swp_entry_to_pte(swp_entry);
> > > +		}
> > > +#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)  
> > 
> > Do we need this #if check? is_device_public_page(page)
> > will return false if the config is disabled  
> 
> pte_mkdevmap() is not define if ZONE_DEVICE is not enabled hence
> i had to protect this with #if/#endif to avoid build error.

pte_mkdevmap() is always defined; could you please share the build
error?


Balbir Singh.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-07-12  5:50 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-03 21:14 [PATCH 0/5] Cache coherent device memory (CDM) with HMM v3 Jérôme Glisse
2017-07-03 21:14 ` [PATCH 1/5] mm/persistent-memory: match IORES_DESC name and enum memory_type one Jérôme Glisse
2017-07-03 23:49   ` Dan Williams
2017-07-05 14:25     ` Jerome Glisse
2017-07-05 16:15       ` Dan Williams
2017-07-05 18:49         ` Jerome Glisse
2017-07-11  3:48           ` Balbir Singh
2017-07-11  7:31           ` Dan Williams
2017-07-11 15:05             ` Jerome Glisse
2017-07-11 16:49               ` Dan Williams
2017-07-03 21:14 ` [PATCH 2/5] mm/device-public-memory: device memory cache coherent with CPU v2 Jérôme Glisse
2017-07-11  4:12   ` Balbir Singh
2017-07-11 14:57     ` Jerome Glisse
2017-07-12  5:50       ` Balbir Singh
2017-07-03 21:14 ` [PATCH 3/5] mm/hmm: add new helper to hotplug CDM memory region Jérôme Glisse
2017-07-03 21:14 ` [PATCH 4/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
2017-07-04 12:51   ` Michal Hocko
2017-07-05  3:18     ` Balbir Singh
2017-07-05  6:38       ` Michal Hocko
2017-07-05 10:22         ` Balbir Singh
2017-07-05 14:35     ` Jerome Glisse
2017-07-10  8:28       ` Michal Hocko
2017-07-10 15:32         ` Jerome Glisse
2017-07-10 16:04           ` Michal Hocko
2017-07-10 16:25             ` Jerome Glisse
2017-07-10 16:36               ` Michal Hocko
2017-07-10 16:54                 ` Jerome Glisse
2017-07-10 17:48                   ` Michal Hocko
2017-07-10 18:10                     ` Jerome Glisse
2017-07-03 21:14 ` [PATCH 5/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
