linux-mm.kvack.org archive mirror
* [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
@ 2017-07-13 21:15 Jérôme Glisse
  2017-07-13 21:15 ` [PATCH 1/6] mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST Jérôme Glisse
                   ` (6 more replies)
  0 siblings, 7 replies; 43+ messages in thread
From: Jérôme Glisse @ 2017-07-13 21:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Michal Hocko, Jérôme Glisse

Sorry, I made a horrible mistake with the names in v4; I completely
misunderstood the suggestion. So here I repost with proper naming.
This is the only change since v3. Again, sorry about the noise
with v4.

Changes since v4:
  - s/DEVICE_HOST/DEVICE_PUBLIC

Git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5


Cache coherent device memory applies to architectures with a system bus
like CAPI or CCIX. Devices connected to such a system bus can expose
their memory to the system and allow cache coherent access to it
from the CPU.

Even if for all intents and purposes device memory behaves like regular
memory, we still want to manage it in isolation from regular memory.
There are several reasons for that. First and foremost, this memory is
less reliable than regular memory: if the device hangs because of invalid
commands, we can lose access to device memory. Second, CPU access to
this memory is expected to be slower than to regular memory. Third,
having random memory on the device means that some of the bus bandwidth
would not be available to the device but would be used by CPU accesses.

This is why we want to manage such memory in isolation from regular
memory. The kernel should not try to use this memory, even as a last
resort when running out of memory, at least for now.

This patchset adds a new type of ZONE_DEVICE memory (DEVICE_PUBLIC)
that is used to represent CDM memory. This patchset builds on top of
the HMM patchset that already introduces a new type of ZONE_DEVICE
memory for private device memory (see the HMM patchset).

The end result is that with this patchset, if a device is in use in
a process, you might have private anonymous memory or file backed
page memory using ZONE_DEVICE (DEVICE_PUBLIC). Thus care must be
taken to not overwrite the lru fields of such pages.

Hence all core mm changes are done to address the assumption that any
process memory is backed by a regular struct page that is part of
the lru. ZONE_DEVICE pages are not on the lru, and the lru pointers
of struct page are used to store device specific information.

Thus this patchset updates all code paths that would make assumptions
about the lru-ness of a process page.
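
To make this concrete, here is a minimal sketch (illustration only, not
part of the patchset) of the check the updated code paths perform before
doing any lru manipulation:

/*
 * Illustration only: ZONE_DEVICE pages (DEVICE_PRIVATE or DEVICE_PUBLIC)
 * reuse the page->lru space for device specific data, so lru handling
 * must be skipped for them.
 */
static bool example_may_use_lru(struct page *page)
{
	if (is_zone_device_page(page))
		return false;	/* page->lru holds device data, not lru links */
	return true;
}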

patch 01 - rename DEVICE_PUBLIC to DEVICE_HOST to free the DEVICE_PUBLIC name
patch 02 - add DEVICE_PUBLIC type to ZONE_DEVICE (all core mm changes)
patch 03 - add a helper to HMM for hotplug of CDM memory
patch 04 - preparatory patch for memory controller changes (memcg)
patch 05 - update memory controller to properly handle
           ZONE_DEVICE pages when uncharging
patch 06 - documentation patch

Previous posting:
v1 https://lkml.org/lkml/2017/4/7/638
v2 https://lwn.net/Articles/725412/
v3 https://lwn.net/Articles/727114/
v4 https://lwn.net/Articles/727692/

Jérôme Glisse (6):
  mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST
  mm/device-public-memory: device memory cache coherent with CPU v4
  mm/hmm: add new helper to hotplug CDM memory region v3
  mm/memcontrol: allow to uncharge page without using page->lru field
  mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
    v3
  mm/hmm: documents how device memory is accounted in rss and memcg

 Documentation/vm/hmm.txt |  40 ++++++++
 fs/proc/task_mmu.c       |   2 +-
 include/linux/hmm.h      |   7 +-
 include/linux/ioport.h   |   1 +
 include/linux/memremap.h |  25 ++++-
 include/linux/mm.h       |  20 ++--
 kernel/memremap.c        |  19 ++--
 mm/Kconfig               |  11 +++
 mm/gup.c                 |   7 ++
 mm/hmm.c                 |  89 ++++++++++++++++--
 mm/madvise.c             |   2 +-
 mm/memcontrol.c          | 231 ++++++++++++++++++++++++++++++-----------------
 mm/memory.c              |  46 +++++++++-
 mm/migrate.c             |  57 +++++++-----
 mm/swap.c                |  11 +++
 15 files changed, 434 insertions(+), 134 deletions(-)

-- 
2.13.0



* [PATCH 1/6] mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST
  2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
@ 2017-07-13 21:15 ` Jérôme Glisse
  2017-07-17  9:09   ` Balbir Singh
  2017-07-13 21:15 ` [PATCH 2/6] mm/device-public-memory: device memory cache coherent with CPU v4 Jérôme Glisse
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Jérôme Glisse @ 2017-07-13 21:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Michal Hocko, Jérôme Glisse, Ross Zwisler

Existing users of ZONE_DEVICE in its DEVICE_PUBLIC variant are not tied
to a specific device and behave more like host memory. This patch renames
DEVICE_PUBLIC to DEVICE_HOST and frees the name DEVICE_PUBLIC to be used
for cache coherent device memory that has a strong tie to the device on
which the memory is (for instance on-board GPU memory).

There is no functional change here.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h | 4 ++--
 kernel/memremap.c        | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 57546a07a558..ae5ff92f72b4 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,7 +41,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * Specialize ZONE_DEVICE memory into multiple types each having differents
  * usage.
  *
- * MEMORY_DEVICE_PUBLIC:
+ * MEMORY_DEVICE_HOST:
  * Persistent device memory (pmem): struct page might be allocated in different
  * memory and architecture might want to perform special actions. It is similar
  * to regular memory, in that the CPU can access it transparently. However,
@@ -59,7 +59,7 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * include/linux/hmm.h and Documentation/vm/hmm.txt.
  */
 enum memory_type {
-	MEMORY_DEVICE_PUBLIC = 0,
+	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 };
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b9baa6c07918..4e07525aa273 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -350,7 +350,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
-	pgmap->type = MEMORY_DEVICE_PUBLIC;
+	pgmap->type = MEMORY_DEVICE_HOST;
 	pgmap->page_fault = NULL;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
-- 
2.13.0



* [PATCH 2/6] mm/device-public-memory: device memory cache coherent with CPU v4
  2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
  2017-07-13 21:15 ` [PATCH 1/6] mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST Jérôme Glisse
@ 2017-07-13 21:15 ` Jérôme Glisse
  2017-07-13 23:01   ` Balbir Singh
  2017-07-13 21:15 ` [PATCH 3/6] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Jérôme Glisse @ 2017-07-13 21:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Michal Hocko, Jérôme Glisse, Balbir Singh,
	Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt,
	Ross Zwisler

Platforms with an advanced system bus (like CAPI or CCIX) allow device
memory to be accessible from the CPU in a cache coherent fashion. Add
a new type of ZONE_DEVICE to represent such memory. The use cases are
the same as for un-addressable device memory, but without all the
corner cases.
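
Illustration only (not part of this patch): a sketch of how a call site
that is prepared for device public pages opts in through the new
_vm_normal_page() helper, while plain vm_normal_page() keeps returning
NULL for such pages:

/* Illustrative sketch using the helpers added by this patch. */
static void example_inspect_pte(struct vm_area_struct *vma,
				unsigned long addr, pte_t pte)
{
	/* Passing true also returns MEMORY_DEVICE_PUBLIC pages. */
	struct page *page = _vm_normal_page(vma, addr, pte, true);

	if (!page)
		return;

	if (is_device_public_page(page)) {
		/* ZONE_DEVICE page: never touch page->lru here. */
		return;
	}

	/* Regular page handling (lru manipulation allowed) goes here. */
}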

Changed since v3:
  - s/public/public (going back)
Changed since v2:
  - s/public/public
  - add proper include in migrate.c and drop useless #if/#endif
Changed since v1:
  - Kconfig and #if/#else cleanup

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c       |  2 +-
 include/linux/hmm.h      |  4 ++--
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h | 21 ++++++++++++++++++
 include/linux/mm.h       | 20 ++++++++++-------
 kernel/memremap.c        | 15 ++++++++-----
 mm/Kconfig               | 11 ++++++++++
 mm/gup.c                 |  7 ++++++
 mm/hmm.c                 |  4 ++--
 mm/madvise.c             |  2 +-
 mm/memory.c              | 46 +++++++++++++++++++++++++++++++++-----
 mm/migrate.c             | 57 ++++++++++++++++++++++++++++++------------------
 mm/swap.c                | 11 ++++++++++
 13 files changed, 156 insertions(+), 45 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 957b6ea80d5f..1f38f2c7cc34 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1182,7 +1182,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
-		page = vm_normal_page(vma, addr, pte);
+		page = _vm_normal_page(vma, addr, pte, true);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 	} else if (is_swap_pte(pte)) {
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 458d0d6d82f3..a40288309fd2 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct hmm_devmem;
 
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
@@ -443,7 +443,7 @@ struct hmm_device {
  */
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 3a4f69137bc2..f5cf32e80041 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -131,6 +131,7 @@ enum {
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
 	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
+	IORES_DESC_DEVICE_PUBLIC_MEMORY		= 7,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index ae5ff92f72b4..c7b4c75ae3f8 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -57,10 +57,18 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  *
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.txt.
+ *
+ * MEMORY_DEVICE_PUBLIC:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CCIX). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
+	MEMORY_DEVICE_PUBLIC,
 };
 
 /*
@@ -92,6 +100,8 @@ enum memory_type {
  * The page_free() callback is called once the page refcount reaches 1
  * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
  * This allows the device driver to implement its own memory management.)
+ *
+ * For MEMORY_DEVICE_PUBLIC only the page_free() callback matters.
  */
 typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
 				unsigned long addr,
@@ -134,6 +144,12 @@ static inline bool is_device_private_page(const struct page *page)
 	return is_zone_device_page(page) &&
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -157,6 +173,11 @@ static inline bool is_device_private_page(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 330a216ac315..980354828177 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -796,15 +796,16 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page);
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+void put_zone_device_private_or_public_page(struct page *page);
 #else
-static inline void put_zone_device_private_page(struct page *page)
+static inline void put_zone_device_private_or_public_page(struct page *page)
 {
 }
-#endif
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
 static inline bool is_device_private_page(const struct page *page);
+static inline bool is_device_public_page(const struct page *page);
 
 DECLARE_STATIC_KEY_FALSE(device_private_key);
 
@@ -830,8 +831,9 @@ static inline void put_page(struct page *page)
 	 * include/linux/memremap.h and HMM for details.
 	 */
 	if (static_branch_unlikely(&device_private_key) &&
-	    unlikely(is_device_private_page(page))) {
-		put_zone_device_private_page(page);
+	    unlikely(is_device_private_page(page) ||
+		     is_device_public_page(page))) {
+		put_zone_device_private_or_public_page(page);
 		return;
 	}
 
@@ -1220,8 +1222,10 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 };
 
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-		pte_t pte);
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device);
+#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4e07525aa273..25c098151ed2 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -465,8 +465,8 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 #endif /* CONFIG_ZONE_DEVICE */
 
 
-#ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+void put_zone_device_private_or_public_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -474,10 +474,15 @@ void put_zone_device_private_page(struct page *page)
 	 * If refcount is 1 then page is freed and refcount is stable as nobody
 	 * holds a reference on the page.
 	 */
-	if (count == 1)
+	if (count == 1) {
+		/* Clear Active bit in case of parallel mark_page_accessed */
+		__ClearPageActive(page);
+		__ClearPageWaiters(page);
+
 		page->pgmap->page_free(page, page->pgmap->data);
+	}
 	else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_page);
-#endif /* CONFIG_DEVICE_PRIVATE */
+EXPORT_SYMBOL(put_zone_device_private_or_public_page);
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
diff --git a/mm/Kconfig b/mm/Kconfig
index 5960617ef781..424ef60547f8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -716,12 +716,23 @@ config ZONE_DEVICE
 config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
+	select HMM
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
 	  memory; i.e., memory that is only accessible from the device (or
 	  group of devices).
 
+config DEVICE_PUBLIC
+	bool "Addressable device memory (like GPU memory)"
+	depends on ARCH_HAS_HMM
+	select HMM
+
+	help
+	  Allows creation of struct pages to represent addressable device
+	  memory; i.e., memory that is accessible from both the device and
+	  the CPU
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/gup.c b/mm/gup.c
index 23f01c40c88f..2f8e8604ff80 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -438,6 +438,13 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
 			goto unmap;
 		*page = pte_page(*pte);
+
+		/*
+		 * This should never happen (a device public page in the gate
+		 * area).
+		 */
+		if (is_device_public_page(*page))
+			goto unmap;
 	}
 	get_page(*page);
 out:
diff --git a/mm/hmm.c b/mm/hmm.c
index 4e01c9ba9cc1..eadf70829c34 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -747,7 +747,7 @@ EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
 				       unsigned long addr)
 {
@@ -1190,4 +1190,4 @@ static int __init hmm_init(void)
 }
 
 device_initcall(hmm_init);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..197277156ce3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -343,7 +343,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}
 
-		page = vm_normal_page(vma, addr, ptent);
+		page = _vm_normal_page(vma, addr, ptent, true);
 		if (!page)
 			continue;
 
diff --git a/mm/memory.c b/mm/memory.c
index 781935e83ff3..709d7d237234 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -789,8 +789,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-				pte_t pte)
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device)
 {
 	unsigned long pfn = pte_pfn(pte);
 
@@ -801,8 +801,31 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
-		if (!is_zero_pfn(pfn))
-			print_bad_pte(vma, addr, pte, NULL);
+		if (is_zero_pfn(pfn))
+			return NULL;
+
+		/*
+		 * Device public pages are special pages (they are ZONE_DEVICE
+		 * pages but different from persistent memory). They behave
+		 * allmost like normal pages. The difference is that they are
+		 * almost like normal pages. The difference is that they are
+		 * not on the lru and thus should never be involved with any-
+		 * thing that involves lru manipulation (mlock, numa balancing,
+		 * ...).
+		 *
+		 * This is why we still want to return NULL for such pages from
+		 * vm_normal_page() so that we do not have to special case all
+		 * call sites of vm_normal_page().
+		if (likely(pfn < highest_memmap_pfn)) {
+			struct page *page = pfn_to_page(pfn);
+
+			if (is_device_public_page(page)) {
+				if (with_public_device)
+					return page;
+				return NULL;
+			}
+		}
+		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
 
@@ -983,6 +1006,19 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
+	} else if (pte_devmap(pte)) {
+		page = pte_page(pte);
+
+		/*
+		 * Cache coherent device memory behaves like regular pages and
+		 * not like persistent memory pages. For more information see
+		 * MEMORY_DEVICE_PUBLIC in include/linux/memremap.h
+		 */
+		if (is_device_public_page(page)) {
+			get_page(page);
+			page_dup_rmap(page, false);
+			rss[mm_counter(page)]++;
+		}
 	}
 
 out_set_pte:
@@ -1236,7 +1272,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (pte_present(ptent)) {
 			struct page *page;
 
-			page = vm_normal_page(vma, addr, ptent);
+			page = _vm_normal_page(vma, addr, ptent, true);
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
diff --git a/mm/migrate.c b/mm/migrate.c
index 643ea61ca9bb..fbf0b86deecd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
+#include <linux/pfn_t.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/balloon_compaction.h>
@@ -229,12 +230,16 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
-		if (unlikely(is_zone_device_page(new)) &&
-		    is_device_private_page(new)) {
-			entry = make_device_private_entry(new, pte_write(pte));
-			pte = swp_entry_to_pte(entry);
-			if (pte_swp_soft_dirty(*pvmw.pte))
-				pte = pte_mksoft_dirty(pte);
+		if (unlikely(is_zone_device_page(new))) {
+			if (is_device_private_page(new)) {
+				entry = make_device_private_entry(new, pte_write(pte));
+				pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(*pvmw.pte))
+					pte = pte_mksoft_dirty(pte);
+			} else if (is_device_public_page(new)) {
+				pte = pte_mkdevmap(pte);
+				flush_dcache_page(new);
+			}
 		} else
 			flush_dcache_page(new);
 
@@ -408,12 +413,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	void **pslot;
 
 	/*
-	 * ZONE_DEVICE pages have 1 refcount always held by their device
-	 *
-	 * Note that DAX memory will never reach that point as it does not have
-	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 * Device public or private pages have an extra refcount as they are
+	 * ZONE_DEVICE pages.
 	 */
-	expected_count += is_zone_device_page(page);
+	expected_count += is_device_private_page(page);
+	expected_count += is_device_public_page(page);
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -2087,7 +2091,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 #endif /* CONFIG_NUMA */
 
-
 struct migrate_vma {
 	struct vm_area_struct	*vma;
 	unsigned long		*dst;
@@ -2186,7 +2189,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_write_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
-			page = vm_normal_page(migrate->vma, addr, pte);
+			page = _vm_normal_page(migrate->vma, addr, pte, true);
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
@@ -2311,13 +2314,18 @@ static bool migrate_vma_check_page(struct page *page)
 
 	/* Page from ZONE_DEVICE have one extra reference */
 	if (is_zone_device_page(page)) {
-		if (is_device_private_page(page)) {
+		if (is_device_private_page(page) ||
+		    is_device_public_page(page))
 			extra++;
-		} else
+		else
 			/* Other ZONE_DEVICE memory type are not supported */
 			return false;
 	}
 
+	/* For file back page */
+	if (page_mapping(page))
+		extra += 1 + page_has_private(page);
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2541,11 +2549,18 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	 */
 	__SetPageUptodate(page);
 
-	if (is_zone_device_page(page) && is_device_private_page(page)) {
-		swp_entry_t swp_entry;
-
-		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
-		entry = swp_entry_to_pte(swp_entry);
+	if (is_zone_device_page(page)) {
+		if (is_device_private_page(page)) {
+			swp_entry_t swp_entry;
+
+			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+			entry = swp_entry_to_pte(swp_entry);
+		} else if (is_device_public_page(page)) {
+			entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+			if (vma->vm_flags & VM_WRITE)
+				entry = pte_mkwrite(pte_mkdirty(entry));
+			entry = pte_mkdevmap(entry);
+		}
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
@@ -2631,7 +2646,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 					continue;
 				}
-			} else {
+			} else if (!is_device_public_page(newpage)) {
 				/*
 				 * Other types of ZONE_DEVICE page are not
 				 * supported.
diff --git a/mm/swap.c b/mm/swap.c
index 60b1d2a75852..eac0e35f854f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -765,6 +765,17 @@ void release_pages(struct page **pages, int nr, bool cold)
 		if (is_huge_zero_page(page))
 			continue;
 
+		/* Device public page can not be huge page */
+		if (is_device_public_page(page)) {
+			if (locked_pgdat) {
+				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+						       flags);
+				locked_pgdat = NULL;
+			}
+			put_zone_device_private_or_public_page(page);
+			continue;
+		}
+
 		page = compound_head(page);
 		if (!put_page_testzero(page))
 			continue;
-- 
2.13.0



* [PATCH 3/6] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
  2017-07-13 21:15 ` [PATCH 1/6] mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST Jérôme Glisse
  2017-07-13 21:15 ` [PATCH 2/6] mm/device-public-memory: device memory cache coherent with CPU v4 Jérôme Glisse
@ 2017-07-13 21:15 ` Jérôme Glisse
  2017-07-13 21:15 ` [PATCH 4/6] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Jérôme Glisse @ 2017-07-13 21:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Michal Hocko, Jérôme Glisse

Unlike unaddressable memory, coherent device memory has a real
resource associated with it on the system (as the CPU can address
it). Add a new helper to hotplug such memory within the HMM
framework.
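
Illustration only (not part of this patch): a sketch of how a driver
might use the new helper. The ops structure and the way the driver looks
up its CDM resource (my_devmem_ops, my_get_cdm_resource()) are
hypothetical; only the hmm_devmem_add_resource() signature and the
IORES_DESC_DEVICE_PUBLIC_MEMORY requirement come from the patch below:

static int my_driver_hotplug_cdm(struct device *device)
{
	/* The resource must describe device public (CDM) memory. */
	struct resource *res = my_get_cdm_resource(device);
	struct hmm_devmem *devmem;

	if (!res || res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
		return -EINVAL;

	devmem = hmm_devmem_add_resource(&my_devmem_ops, device, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/* Pages in [devmem->pfn_first, devmem->pfn_last) are now hotplugged. */
	return 0;
}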

Changed since v2:
  - s/host/public
Changed since v1:
  - s/public/host

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
---
 include/linux/hmm.h |  3 ++
 mm/hmm.c            | 85 +++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index a40288309fd2..e44cb8edb137 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -392,6 +392,9 @@ struct hmm_devmem {
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 				  struct device *device,
 				  unsigned long size);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res);
 void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index eadf70829c34..28e54e3b4e1d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -849,7 +849,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
 	zone = page_zone(page);
 
 	mem_hotplug_begin();
-	__remove_pages(zone, start_pfn, npages);
+	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
+		__remove_pages(zone, start_pfn, npages);
+	else
+		arch_remove_memory(start_pfn << PAGE_SHIFT,
+				   npages << PAGE_SHIFT);
 	mem_hotplug_done();
 
 	hmm_devmem_radix_release(resource);
@@ -885,7 +889,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	if (is_ram == REGION_INTERSECTS)
 		return -ENXIO;
 
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+	else
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+
 	devmem->pagemap.res = devmem->resource;
 	devmem->pagemap.page_fault = hmm_devmem_fault;
 	devmem->pagemap.page_free = hmm_devmem_free;
@@ -924,8 +932,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 		nid = numa_mem_id();
 
 	mem_hotplug_begin();
-	ret = add_pages(nid, align_start >> PAGE_SHIFT,
-			align_size >> PAGE_SHIFT, false);
+	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
+		ret = arch_add_memory(nid, align_start, align_size, false);
+	else
+		ret = add_pages(nid, align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, false);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
@@ -1075,6 +1086,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 }
 EXPORT_SYMBOL(hmm_devmem_add);
 
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res)
+{
+	struct hmm_devmem *devmem;
+	int ret;
+
+	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		return ERR_PTR(-EINVAL);
+
+	static_branch_enable(&device_private_key);
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = res;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_devm_add_action;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add_resource);
+
 /*
  * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
  *
@@ -1088,6 +1160,7 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 {
 	resource_size_t start, size;
 	struct device *device;
+	bool cdm = false;
 
 	if (!devmem)
 		return;
@@ -1096,11 +1169,13 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 	start = devmem->resource->start;
 	size = resource_size(devmem->resource);
 
+	cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY;
 	hmm_devmem_ref_kill(&devmem->ref);
 	hmm_devmem_ref_exit(&devmem->ref);
 	hmm_devmem_pages_remove(devmem);
 
-	devm_release_mem_region(device, start, size);
+	if (!cdm)
+		devm_release_mem_region(device, start, size);
 }
 EXPORT_SYMBOL(hmm_devmem_remove);
 
-- 
2.13.0



* [PATCH 4/6] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2017-07-13 21:15 ` [PATCH 3/6] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
@ 2017-07-13 21:15 ` Jérôme Glisse
  2017-07-17  9:10   ` Balbir Singh
  2017-07-13 21:15 ` [PATCH 5/6] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC v3 Jérôme Glisse
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Jérôme Glisse @ 2017-07-13 21:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Michal Hocko, Jérôme Glisse, Johannes Weiner,
	Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus you can not use the page->lru field of those pages. This patch
re-arranges the uncharge code to allow a single page to be uncharged
without modifying the lru field of the struct page.

There is no change to the memcontrol logic; it is the same as it was
before this patch.
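
Illustration only: the flow mem_cgroup_uncharge() follows after this
patch, using the static helpers introduced in the diff below:

void example_uncharge_single(struct page *page)
{
	struct uncharge_gather ug;

	/* Pages that were never charged have nothing to uncharge. */
	if (!page->mem_cgroup)
		return;

	uncharge_gather_clear(&ug);	/* zero the gather state */
	uncharge_page(page, &ug);	/* accumulate this page's counters */
	uncharge_batch(&ug);		/* flush once to the memcg counters */
}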

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 92 insertions(+), 76 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3df3c04d73ab..c709fdceac13 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 	cancel_charge(memcg, nr_pages);
 }
 
-static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
-			   unsigned long nr_anon, unsigned long nr_file,
-			   unsigned long nr_kmem, unsigned long nr_huge,
-			   unsigned long nr_shmem, struct page *dummy_page)
+struct uncharge_gather {
+	struct mem_cgroup *memcg;
+	unsigned long pgpgout;
+	unsigned long nr_anon;
+	unsigned long nr_file;
+	unsigned long nr_kmem;
+	unsigned long nr_huge;
+	unsigned long nr_shmem;
+	struct page *dummy_page;
+};
+
+static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 {
-	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
+	memset(ug, 0, sizeof(*ug));
+}
+
+static void uncharge_batch(const struct uncharge_gather *ug)
+{
+	unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
 	unsigned long flags;
 
-	if (!mem_cgroup_is_root(memcg)) {
-		page_counter_uncharge(&memcg->memory, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg)) {
+		page_counter_uncharge(&ug->memcg->memory, nr_pages);
 		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
-			page_counter_uncharge(&memcg->kmem, nr_kmem);
-		memcg_oom_recover(memcg);
+			page_counter_uncharge(&ug->memcg->memsw, nr_pages);
+		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+		memcg_oom_recover(ug->memcg);
 	}
 
 	local_irq_save(flags);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
-	__this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
-	__this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
-	__this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
-	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
-	memcg_check_events(memcg, dummy_page);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
+	__this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
+	__this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
+	__this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg))
+		css_put_many(&ug->memcg->css, nr_pages);
+}
+
+static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
+
+	if (!page->mem_cgroup)
+		return;
+
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * page->mem_cgroup at this point, we have fully
+	 * exclusive access to the page.
+	 */
+
+	if (ug->memcg != page->mem_cgroup) {
+		if (ug->memcg) {
+			uncharge_batch(ug);
+			uncharge_gather_clear(ug);
+		}
+		ug->memcg = page->mem_cgroup;
+	}
+
+	if (!PageKmemcg(page)) {
+		unsigned int nr_pages = 1;
+
+		if (PageTransHuge(page)) {
+			nr_pages <<= compound_order(page);
+			ug->nr_huge += nr_pages;
+		}
+		if (PageAnon(page))
+			ug->nr_anon += nr_pages;
+		else {
+			ug->nr_file += nr_pages;
+			if (PageSwapBacked(page))
+				ug->nr_shmem += nr_pages;
+		}
+		ug->pgpgout++;
+	} else {
+		ug->nr_kmem += 1 << compound_order(page);
+		__ClearPageKmemcg(page);
+	}
+
+	ug->dummy_page = page;
+	page->mem_cgroup = NULL;
 }
 
 static void uncharge_list(struct list_head *page_list)
 {
-	struct mem_cgroup *memcg = NULL;
-	unsigned long nr_shmem = 0;
-	unsigned long nr_anon = 0;
-	unsigned long nr_file = 0;
-	unsigned long nr_huge = 0;
-	unsigned long nr_kmem = 0;
-	unsigned long pgpgout = 0;
+	struct uncharge_gather ug;
 	struct list_head *next;
-	struct page *page;
+
+	uncharge_gather_clear(&ug);
 
 	/*
 	 * Note that the list can be a single page->lru; hence the
@@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
 	 */
 	next = page_list->next;
 	do {
+		struct page *page;
+
 		page = list_entry(next, struct page, lru);
 		next = page->lru.next;
 
-		VM_BUG_ON_PAGE(PageLRU(page), page);
-		VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
-
-		if (!page->mem_cgroup)
-			continue;
-
-		/*
-		 * Nobody should be changing or seriously looking at
-		 * page->mem_cgroup at this point, we have fully
-		 * exclusive access to the page.
-		 */
-
-		if (memcg != page->mem_cgroup) {
-			if (memcg) {
-				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-					       nr_kmem, nr_huge, nr_shmem, page);
-				pgpgout = nr_anon = nr_file = nr_kmem = 0;
-				nr_huge = nr_shmem = 0;
-			}
-			memcg = page->mem_cgroup;
-		}
-
-		if (!PageKmemcg(page)) {
-			unsigned int nr_pages = 1;
-
-			if (PageTransHuge(page)) {
-				nr_pages <<= compound_order(page);
-				nr_huge += nr_pages;
-			}
-			if (PageAnon(page))
-				nr_anon += nr_pages;
-			else {
-				nr_file += nr_pages;
-				if (PageSwapBacked(page))
-					nr_shmem += nr_pages;
-			}
-			pgpgout++;
-		} else {
-			nr_kmem += 1 << compound_order(page);
-			__ClearPageKmemcg(page);
-		}
-
-		page->mem_cgroup = NULL;
+		uncharge_page(page, &ug);
 	} while (next != page_list);
 
-	if (memcg)
-		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-			       nr_kmem, nr_huge, nr_shmem, page);
+	if (ug.memcg)
+		uncharge_batch(&ug);
 }
 
 /**
@@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
  */
 void mem_cgroup_uncharge(struct page *page)
 {
+	struct uncharge_gather ug;
+
 	if (mem_cgroup_disabled())
 		return;
 
@@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
 	if (!page->mem_cgroup)
 		return;
 
-	INIT_LIST_HEAD(&page->lru);
-	uncharge_list(&page->lru);
+	uncharge_gather_clear(&ug);
+	uncharge_page(page, &ug);
+	uncharge_batch(&ug);
 }
 
 /**
-- 
2.13.0



* [PATCH 5/6] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC v3
  2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2017-07-13 21:15 ` [PATCH 4/6] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-07-13 21:15 ` Jérôme Glisse
  2017-07-17  9:15   ` Balbir Singh
  2017-07-13 21:15 ` [PATCH 6/6] mm/hmm: documents how device memory is accounted in rss and memcg Jérôme Glisse
  2017-07-18  3:26 ` [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Bob Liu
  6 siblings, 1 reply; 43+ messages in thread
From: Jérôme Glisse @ 2017-07-13 21:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Michal Hocko, Jérôme Glisse, Johannes Weiner,
	Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus need special handling when it comes to lru or refcount. This
patch makes sure that memcontrol properly handles those when it faces
them. Those pages are used like regular pages in a process address
space, either as anonymous pages or as file backed pages. So from the
memcg point of view we want to handle them like regular pages, for now
at least.
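
Illustration only (not part of this patch): a sketch of the refcount rule
the changes below rely on. ZONE_DEVICE pages are free at refcount 1, not
0, so taking a reference must fail once the count has dropped to 1:

static struct page *example_try_get_page(struct page *page)
{
	if (is_device_public_page(page) || is_device_private_page(page)) {
		/* Fails if the device page is already free (refcount == 1). */
		if (!page_ref_add_unless(page, 1, 1))
			return NULL;
		return page;
	}

	/* Regular pages are free at refcount 0. */
	if (!get_page_unless_zero(page))
		return NULL;
	return page;
}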

Changed since v2:
  - s/host/public
Changed since v1:
  - s/public/host
  - add comments explaining how device memory behave and why

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 kernel/memremap.c |  2 ++
 mm/memcontrol.c   | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 25c098151ed2..4d74b4a4f8f5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
 		__ClearPageActive(page);
 		__ClearPageWaiters(page);
 
+		mem_cgroup_uncharge(page);
+
 		page->pgmap->page_free(page, page->pgmap->data);
 	}
 	else if (!count)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c709fdceac13..858842a741bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4391,12 +4391,13 @@ enum mc_target_type {
 	MC_TARGET_NONE = 0,
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
+	MC_TARGET_DEVICE,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
 {
-	struct page *page = vm_normal_page(vma, addr, ptent);
+	struct page *page = _vm_normal_page(vma, addr, ptent, true);
 
 	if (!page || !page_mapped(page))
 		return NULL;
@@ -4407,13 +4408,20 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 		if (!(mc.flags & MOVE_FILE))
 			return NULL;
 	}
-	if (!get_page_unless_zero(page))
+	if (is_device_public_page(page)) {
+		/*
+		 * MEMORY_DEVICE_PUBLIC means a ZONE_DEVICE page which has a
+		 * refcount of 1 when free (unlike a normal page)
+		 */
+		if (!page_ref_add_unless(page, 1, 1))
+			return NULL;
+	} else if (!get_page_unless_zero(page))
 		return NULL;
 
 	return page;
 }
 
-#ifdef CONFIG_SWAP
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
 static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 			pte_t ptent, swp_entry_t *entry)
 {
@@ -4422,6 +4430,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 
 	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
 		return NULL;
+
+	/*
+	 * Handle MEMORY_DEVICE_PRIVATE which are ZONE_DEVICE pages belonging
+	 * to a device; because they are not accessible by the CPU they are
+	 * stored as special swap entries in the CPU page table.
+	 */
+	if (is_device_private_entry(ent)) {
+		page = device_private_entry_to_page(ent);
+		/*
+		 * MEMORY_DEVICE_PRIVATE means a ZONE_DEVICE page which has
+		 * a refcount of 1 when free (unlike a normal page)
+		 */
+		if (!page_ref_add_unless(page, 1, 1))
+			return NULL;
+		return page;
+	}
+
 	/*
 	 * Because lookup_swap_cache() updates some statistics counter,
 	 * we call find_get_page() with swapper_space directly.
@@ -4582,6 +4607,13 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PUBLIC
+ *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
+ *     For now such a page is charged like a regular page would be, as for all
+ *     intents and purposes it is just special memory taking the place of a
+ *     regular page. See Documentation/vm/hmm.txt and include/linux/hmm.h for
+ *     more information on this type of memory, how it is used and why it is
+ *     charged like this.
  *
  * Called with pte lock held.
  */
@@ -4610,6 +4642,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
+			if (is_device_private_page(page) ||
+			    is_device_public_page(page))
+				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
 		}
@@ -4669,6 +4704,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
+		/*
+		 * Note there can not be MC_TARGET_DEVICE for now as we do not
+		 * support transparent huge page with MEMORY_DEVICE_PUBLIC or
+		 * MEMORY_DEVICE_PRIVATE but this might change.
+		 */
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -4884,6 +4924,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				putback_lru_page(page);
 			}
 			put_page(page);
+		} else if (target_type == MC_TARGET_DEVICE) {
+			page = target.page;
+			if (!mem_cgroup_move_account(page, true,
+						     mc.from, mc.to)) {
+				mc.precharge -= HPAGE_PMD_NR;
+				mc.moved_charge += HPAGE_PMD_NR;
+			}
+			put_page(page);
 		}
 		spin_unlock(ptl);
 		return 0;
@@ -4895,12 +4943,16 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
+		bool device = false;
 		swp_entry_t ent;
 
 		if (!mc.precharge)
 			break;
 
 		switch (get_mctgt_type(vma, addr, ptent, &target)) {
+		case MC_TARGET_DEVICE:
+			device = true;
+			/* fall through */
 		case MC_TARGET_PAGE:
 			page = target.page;
 			/*
@@ -4911,7 +4963,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 */
 			if (PageTransCompound(page))
 				goto put;
-			if (isolate_lru_page(page))
+			if (!device && isolate_lru_page(page))
 				goto put;
 			if (!mem_cgroup_move_account(page, false,
 						mc.from, mc.to)) {
@@ -4919,7 +4971,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
 			}
-			putback_lru_page(page);
+			if (!device)
+				putback_lru_page(page);
 put:			/* get_mctgt_type() gets the page */
 			put_page(page);
 			break;
-- 
2.13.0



* [PATCH 6/6] mm/hmm: documents how device memory is accounted in rss and memcg
  2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2017-07-13 21:15 ` [PATCH 5/6] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC v3 Jérôme Glisse
@ 2017-07-13 21:15 ` Jérôme Glisse
  2017-07-14 13:26   ` Michal Hocko
  2017-07-18  3:26 ` [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Bob Liu
  6 siblings, 1 reply; 43+ messages in thread
From: Jérôme Glisse @ 2017-07-13 21:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh,
	Michal Hocko, Jérôme Glisse

For now we account device memory exactly like a regular page with
respect to rss counters and memory cgroups. We do this so that any
existing application that starts using device memory without knowing
about it will keep running unimpacted. This also simplifies the
migration code.

We will likely revisit this choice once we gain more experience with
how device memory is used and how it impacts overall memory resource
management. For now we believe this is a good enough choice.

Note that device memory can not be pinned, neither by the device
driver nor by GUP, thus device memory can always be freed and
unaccounted when a process exits.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 Documentation/vm/hmm.txt | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 192dcdb38bd1..4d3aac9f4a5d 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -15,6 +15,15 @@ section present the new migration helper that allow to leverage the device DMA
 engine.
 
 
+1) Problems of using device specific memory allocator:
+2) System bus, device memory characteristics
+3) Share address space and migration
+4) Address space mirroring implementation and API
+5) Represent and manage device memory from core kernel point of view
+6) Migrate to and from device memory
+7) Memory cgroup (memcg) and rss accounting
+
+
 -------------------------------------------------------------------------------
 
 1) Problems of using device specific memory allocator:
@@ -342,3 +351,34 @@ that happens then the finalize_and_map() can catch any pages that was not
 migrated. Note those page were still copied to new page and thus we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
+
+
+-------------------------------------------------------------------------------
+
+7) Memory cgroup (memcg) and rss accounting
+
+For now device memory is accounted as any regular page in rss counters (either
+anonymous if device page is use for anonymous, file if device page is use for
+file back page or shmem if device page is use for share memory). This is a
+deliberate choice to keep existing application that might start using device
+memory without knowing about it to keep runing unimpacted.
+
+Drawbacks is that OOM killer might kill an application using a lot of device
+memory and not a lot of regular system memory and thus not freeing much system
+memory. We want to gather more real world experience on how application and
+system react under memory pressure in the presence of device memory before
+deciding to account device memory differently.
+
+
+Same decision was made for memory cgroup. Device memory page are accounted
+against same memory cgroup a regular page would be accounted to. This does
+simplify migration to and from device memory. This also means that migration
+back from device memory to regular memory can not fail because it would
+go above memory cgroup limit. We might revisit this choice latter on once we
+get more experience in how device memory is use and its impact on memory
+resource control.
+
+
+Note that device memory can never be pin nor by device driver nor through GUP
+and thus such memory is always free upon process exit. Or when last reference
+is drop in case of share memory or file back memory.
-- 
2.13.0



* Re: [PATCH 2/6] mm/device-public-memory: device memory cache coherent with CPU v4
  2017-07-13 21:15 ` [PATCH 2/6] mm/device-public-memory: device memory cache coherent with CPU v4 Jérôme Glisse
@ 2017-07-13 23:01   ` Balbir Singh
  0 siblings, 0 replies; 43+ messages in thread
From: Balbir Singh @ 2017-07-13 23:01 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Michal Hocko,
	Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt, Ross Zwisler

On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> Platform with advance system bus (like CAPI or CCIX) allow device
> memory to be accessible from CPU in a cache coherent fashion. Add
> a new type of ZONE_DEVICE to represent such memory. The use case
> are the same as for the un-addressable device memory but without
> all the corners cases.
> 
> Changed since v3:
>   - s/public/public (going back)
> Changed since v2:
>   - s/public/public
>   - add proper include in migrate.c and drop useless #if/#endif
> Changed since v1:
>   - Kconfig and #if/#else cleanup
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Balbir Singh <balbirs@au1.ibm.com>
> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---

Acked-by: Balbir Singh <bsingharora@gmail.com>



* Re: [PATCH 6/6] mm/hmm: documents how device memory is accounted in rss and memcg
  2017-07-13 21:15 ` [PATCH 6/6] mm/hmm: documents how device memory is accounted in rss and memcg Jérôme Glisse
@ 2017-07-14 13:26   ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-07-14 13:26 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh

On Thu 13-07-17 17:15:32, Jerome Glisse wrote:
> For now we account device memory exactly like a regular page in
> respect to rss counters and memory cgroup. We do this so that any
> existing application that starts using device memory without knowing
> about it will keep running unimpacted. This also simplify migration
> code.
> 
> We will likely revisit this choice once we gain more experience with
> how device memory is use and how it impacts overall memory resource
> management. For now we believe this is a good enough choice.
> 
> Note that device memory can not be pin. Nor by device driver, nor
> by GUP thus device memory can always be free and unaccounted when
> a process exit.

I have to look at the implementation but this gives a good idea of what
is going on and why.

> Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/vm/hmm.txt | 40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> index 192dcdb38bd1..4d3aac9f4a5d 100644
> --- a/Documentation/vm/hmm.txt
> +++ b/Documentation/vm/hmm.txt
> @@ -15,6 +15,15 @@ section present the new migration helper that allow to leverage the device DMA
>  engine.
>  
>  
> +1) Problems of using device specific memory allocator:
> +2) System bus, device memory characteristics
> +3) Share address space and migration
> +4) Address space mirroring implementation and API
> +5) Represent and manage device memory from core kernel point of view
> +6) Migrate to and from device memory
> +7) Memory cgroup (memcg) and rss accounting
> +
> +
>  -------------------------------------------------------------------------------
>  
>  1) Problems of using device specific memory allocator:
> @@ -342,3 +351,34 @@ that happens then the finalize_and_map() can catch any pages that was not
>  migrated. Note those page were still copied to new page and thus we wasted
>  bandwidth but this is considered as a rare event and a price that we are
>  willing to pay to keep all the code simpler.
> +
> +
> +-------------------------------------------------------------------------------
> +
> +7) Memory cgroup (memcg) and rss accounting
> +
> +For now device memory is accounted as any regular page in rss counters (either
> +anonymous if the device page is used for anonymous memory, file if the device
> +page is used for a file backed page, or shmem if the device page is used for
> +shared memory). This is a deliberate choice so that existing applications that
> +might start using device memory without knowing about it keep running unimpacted.
> +
> +The drawback is that the OOM killer might kill an application using a lot of
> +device memory and not a lot of regular system memory, and thus not free much
> +system memory. We want to gather more real world experience on how applications
> +and systems react under memory pressure in the presence of device memory before
> +deciding to account device memory differently.
> +
> +
> +The same decision was made for memory cgroup. Device memory pages are accounted
> +against the same memory cgroup a regular page would be accounted to. This
> +simplifies migration to and from device memory. It also means that migration
> +back from device memory to regular memory cannot fail because it would go
> +above the memory cgroup limit. We might revisit this choice later on once we
> +get more experience in how device memory is used and its impact on memory
> +resource control.
> +
> +
> +Note that device memory can never be pinned, neither by the device driver nor
> +through GUP, and thus such memory is always freed upon process exit, or when
> +the last reference is dropped in the case of shared memory or file backed memory.
> -- 
> 2.13.0

-- 
Michal Hocko
SUSE Labs
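
As an illustration of the accounting described above, here is a minimal
sketch (not the actual patch; the function name is hypothetical) showing
that a ZONE_DEVICE page can be charged with the same memcg entry points a
regular anonymous page goes through:

    #include <linux/memcontrol.h>
    #include <linux/mm.h>

    /* Hypothetical sketch: device pages are charged like regular anon pages. */
    static int example_charge_device_page(struct page *devpage,
                                          struct vm_area_struct *vma)
    {
        struct mem_cgroup *memcg;
        int ret;

        /* Same call a regular anonymous page would go through. */
        ret = mem_cgroup_try_charge(devpage, vma->vm_mm, GFP_KERNEL,
                                    &memcg, false);
        if (ret)
            return ret;

        /* lrucare == false: ZONE_DEVICE pages are never on the lru. */
        mem_cgroup_commit_charge(devpage, memcg, false, false);
        return 0;
    }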


* Re: [PATCH 1/6] mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST
  2017-07-13 21:15 ` [PATCH 1/6] mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST Jérôme Glisse
@ 2017-07-17  9:09   ` Balbir Singh
  0 siblings, 0 replies; 43+ messages in thread
From: Balbir Singh @ 2017-07-17  9:09 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Michal Hocko, Ross Zwisler

On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> Existing users of ZONE_DEVICE in its DEVICE_PUBLIC variant are not tied
> to a specific device and behave more like host memory. This patch renames
> DEVICE_PUBLIC to DEVICE_HOST and frees the name DEVICE_PUBLIC to be used
> for cache coherent device memory that has a strong tie with the device
> on which the memory is (for instance on-board GPU memory).
> 
> There is no functional change here.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---

Acked-by: Balbir Singh <bsingharora@gmail.com>


* Re: [PATCH 4/6] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-07-13 21:15 ` [PATCH 4/6] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-07-17  9:10   ` Balbir Singh
  0 siblings, 0 replies; 43+ messages in thread
From: Balbir Singh @ 2017-07-17  9:10 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, cgroups

On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE pages and
> thus you cannot use the page->lru field of those pages. This patch
> re-arranges the uncharge path to allow a single page to be uncharged
> without modifying the lru field of the struct page.
> 
> There is no change to memcontrol logic, it is the same as it was
> before this patch.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> ---

Acked-by: Balbir Singh <bsingharora@gmail.com>
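
For readers not familiar with the memcontrol code, a rough sketch of the
shape this takes (hypothetical names, not the actual patch): gather per-page
counts in a local structure instead of linking pages through page->lru:

    #include <linux/memcontrol.h>
    #include <linux/mm.h>

    /* Hypothetical sketch: uncharge one page without touching page->lru. */
    struct example_uncharge_gather {
        struct mem_cgroup *memcg;
        unsigned long nr_pages;
    };

    static void example_uncharge_flush(struct example_uncharge_gather *ug)
    {
        /* report ug->nr_pages uncharged pages to ug->memcg counters here */
        ug->memcg = NULL;
        ug->nr_pages = 0;
    }

    static void example_uncharge_one(struct example_uncharge_gather *ug,
                                     struct page *page)
    {
        /* Flush the gathered counts whenever the memcg changes. */
        if (ug->memcg && ug->memcg != page->mem_cgroup)
            example_uncharge_flush(ug);

        ug->memcg = page->mem_cgroup;
        ug->nr_pages++;
        page->mem_cgroup = NULL;
    }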


* Re: [PATCH 5/6] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC v3
  2017-07-13 21:15 ` [PATCH 5/6] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC v3 Jérôme Glisse
@ 2017-07-17  9:15   ` Balbir Singh
  0 siblings, 0 replies; 43+ messages in thread
From: Balbir Singh @ 2017-07-17  9:15 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, cgroups

On Thu, 2017-07-13 at 17:15 -0400, Jérôme Glisse wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE pages and
> thus need special handling when it comes to lru or refcount. This
> patch makes sure that memcontrol properly handles them when it faces
> them. Those pages are used like regular pages in a process address
> space, either as anonymous pages or as file backed pages, so from the
> memcg point of view we want to handle them like regular pages, for now
> at least.
> 
> Changed since v2:
>   - s/host/public
> Changed since v1:
>   - s/public/host
>   - add comments explaining how device memory behave and why
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> ---

Acked-by: Balbir Singh <bsingharora@gmail.com>
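
For context, the kind of special-casing involved looks roughly like the
following (a hypothetical helper, not the actual memcontrol code): core
paths have to skip lru handling for device pages because their page->lru
field carries driver data.

    #include <linux/list.h>
    #include <linux/mm.h>

    /* Hypothetical sketch: never do lru bookkeeping on ZONE_DEVICE pages. */
    static void example_release_page(struct page *page, struct list_head *to_free)
    {
        if (is_device_private_page(page) || is_device_public_page(page)) {
            /*
             * page->lru holds device driver data for these pages; never
             * link them on an lru or a local list. The final put hands
             * the page back to the device driver.
             */
            put_page(page);
            return;
        }

        /* Regular page: safe to batch it through its lru field. */
        list_add(&page->lru, to_free);
    }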


* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2017-07-13 21:15 ` [PATCH 6/6] mm/hmm: documents how device memory is accounted in rss and memcg Jérôme Glisse
@ 2017-07-18  3:26 ` Bob Liu
  2017-07-18 15:38   ` Jerome Glisse
  6 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-07-18  3:26 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Dan Williams, Balbir Singh, Michal Hocko

On 2017/7/14 5:15, Jérôme Glisse wrote:
> Sorry i made horrible mistake on names in v4, i completly miss-
> understood the suggestion. So here i repost with proper naming.
> This is the only change since v3. Again sorry about the noise
> with v4.
> 
> Changes since v4:
>   - s/DEVICE_HOST/DEVICE_PUBLIC
> 
> Git tree:
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
> 
> 
> Cache coherent device memory apply to architecture with system bus
> like CAPI or CCIX. Device connected to such system bus can expose
> their memory to the system and allow cache coherent access to it
> from the CPU.
> 
> Even if for all intent and purposes device memory behave like regular
> memory, we still want to manage it in isolation from regular memory.
> Several reasons for that, first and foremost this memory is less
> reliable than regular memory if the device hangs because of invalid
> commands we can loose access to device memory. Second CPU access to
> this memory is expected to be slower than to regular memory. Third
> having random memory into device means that some of the bus bandwith
> wouldn't be available to the device but would be use by CPU access.
> 
> This is why we want to manage such memory in isolation from regular
> memory. Kernel should not try to use this memory even as last resort
> when running out of memory, at least for now.
>

I think setting a very large node distance for "Cache Coherent Device Memory" may be an easier way to address these concerns.

--
Regards,
Bob Liu


 
> This patchset add a new type of ZONE_DEVICE memory (DEVICE_HOST)
> that is use to represent CDM memory. This patchset build on top of
> the HMM patchset that already introduce a new type of ZONE_DEVICE
> memory for private device memory (see HMM patchset).
> 
> The end result is that with this patchset if a device is in use in
> a process you might have private anonymous memory or file back
> page memory using ZONE_DEVICE (DEVICE_HOST). Thus care must be
> taken to not overwritte lru fields of such pages.
> 
> Hence all core mm changes are done to address assumption that any
> process memory is back by a regular struct page that is part of
> the lru. ZONE_DEVICE page are not on the lru and the lru pointer
> of struct page are use to store device specific informations.
> 
> Thus this patchset update all code path that would make assumptions
> about lruness of a process page.
> 
> patch 01 - rename DEVICE_PUBLIC to DEVICE_HOST to free DEVICE_PUBLIC name
> patch 02 - add DEVICE_PUBLIC type to ZONE_DEVICE (all core mm changes)
> patch 03 - add an helper to HMM for hotplug of CDM memory
> patch 04 - preparatory patch for memory controller changes (memch)
> patch 05 - update memory controller to properly handle
>            ZONE_DEVICE pages when uncharging
> patch 06 - documentation patch
> 
> Previous posting:
> v1 https://lkml.org/lkml/2017/4/7/638
> v2 https://lwn.net/Articles/725412/
> v3 https://lwn.net/Articles/727114/
> v4 https://lwn.net/Articles/727692/
> 
> Jérôme Glisse (6):
>   mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST
>   mm/device-public-memory: device memory cache coherent with CPU v4
>   mm/hmm: add new helper to hotplug CDM memory region v3
>   mm/memcontrol: allow to uncharge page without using page->lru field
>   mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
>     v3
>   mm/hmm: documents how device memory is accounted in rss and memcg
> 
>  Documentation/vm/hmm.txt |  40 ++++++++
>  fs/proc/task_mmu.c       |   2 +-
>  include/linux/hmm.h      |   7 +-
>  include/linux/ioport.h   |   1 +
>  include/linux/memremap.h |  25 ++++-
>  include/linux/mm.h       |  20 ++--
>  kernel/memremap.c        |  19 ++--
>  mm/Kconfig               |  11 +++
>  mm/gup.c                 |   7 ++
>  mm/hmm.c                 |  89 ++++++++++++++++--
>  mm/madvise.c             |   2 +-
>  mm/memcontrol.c          | 231 ++++++++++++++++++++++++++++++-----------------
>  mm/memory.c              |  46 +++++++++-
>  mm/migrate.c             |  57 +++++++-----
>  mm/swap.c                |  11 +++
>  15 files changed, 434 insertions(+), 134 deletions(-)
> 



* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-18  3:26 ` [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Bob Liu
@ 2017-07-18 15:38   ` Jerome Glisse
  2017-07-19  1:46     ` Bob Liu
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-07-18 15:38 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> On 2017/7/14 5:15, Jerome Glisse wrote:
> > Sorry i made horrible mistake on names in v4, i completly miss-
> > understood the suggestion. So here i repost with proper naming.
> > This is the only change since v3. Again sorry about the noise
> > with v4.
> > 
> > Changes since v4:
> >   - s/DEVICE_HOST/DEVICE_PUBLIC
> > 
> > Git tree:
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
> > 
> > 
> > Cache coherent device memory apply to architecture with system bus
> > like CAPI or CCIX. Device connected to such system bus can expose
> > their memory to the system and allow cache coherent access to it
> > from the CPU.
> > 
> > Even if for all intent and purposes device memory behave like regular
> > memory, we still want to manage it in isolation from regular memory.
> > Several reasons for that, first and foremost this memory is less
> > reliable than regular memory if the device hangs because of invalid
> > commands we can loose access to device memory. Second CPU access to
> > this memory is expected to be slower than to regular memory. Third
> > having random memory into device means that some of the bus bandwith
> > wouldn't be available to the device but would be use by CPU access.
> > 
> > This is why we want to manage such memory in isolation from regular
> > memory. Kernel should not try to use this memory even as last resort
> > when running out of memory, at least for now.
> >
> 
> I think set a very large node distance for "Cache Coherent Device Memory"
> may be a easier way to address these concerns.

Such an approach was discussed at length in the past, see the links below.
Outcome of the discussion:
  - CPU-less nodes are bad
  - device memory can be unreliable (device hang) with no way for an
    application to understand that
  - application and driver NUMA madvise/mbind/mempolicy ... can conflict
    with each other and there is no way the kernel can figure out which
    should apply
  - NUMA as it is now would not work as we need further isolation than
    what a large node distance would provide

There are probably a few other arguments I forget.

https://lists.gt.net/linux/kernel/2551369
https://groups.google.com/forum/#!topic/linux.kernel/Za_e8C3XnRs%5B1-25%5D
https://lwn.net/Articles/720380/

Cheers,
Jerome


* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-18 15:38   ` Jerome Glisse
@ 2017-07-19  1:46     ` Bob Liu
  2017-07-19  2:25       ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-07-19  1:46 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On 2017/7/18 23:38, Jerome Glisse wrote:
> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> On 2017/7/14 5:15, Jerome Glisse wrote:
>>> Sorry i made horrible mistake on names in v4, i completly miss-
>>> understood the suggestion. So here i repost with proper naming.
>>> This is the only change since v3. Again sorry about the noise
>>> with v4.
>>>
>>> Changes since v4:
>>>   - s/DEVICE_HOST/DEVICE_PUBLIC
>>>
>>> Git tree:
>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
>>>
>>>
>>> Cache coherent device memory apply to architecture with system bus
>>> like CAPI or CCIX. Device connected to such system bus can expose
>>> their memory to the system and allow cache coherent access to it
>>> from the CPU.
>>>
>>> Even if for all intent and purposes device memory behave like regular
>>> memory, we still want to manage it in isolation from regular memory.
>>> Several reasons for that, first and foremost this memory is less
>>> reliable than regular memory if the device hangs because of invalid
>>> commands we can loose access to device memory. Second CPU access to
>>> this memory is expected to be slower than to regular memory. Third
>>> having random memory into device means that some of the bus bandwith
>>> wouldn't be available to the device but would be use by CPU access.
>>>
>>> This is why we want to manage such memory in isolation from regular
>>> memory. Kernel should not try to use this memory even as last resort
>>> when running out of memory, at least for now.
>>>
>>
>> I think set a very large node distance for "Cache Coherent Device Memory"
>> may be a easier way to address these concerns.
> 
> Such approach was discuss at length in the past see links below. Outcome
> of discussion:
>   - CPU less node are bad
>   - device memory can be unreliable (device hang) no way for application
>     to understand that

Device memory can also be more reliable if using high quality and expensive memory.

>   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
>     with each other and no way the kernel can figure out which should
>     apply
>   - NUMA as it is now would not work as we need further isolation that
>     what a large node distance would provide
> 

Agreed, that's where we need to spend time.

One drawback of HMM-CDM I'm worried about is one extra copy.
In the cache coherent case, the CPU can write data to device memory directly and then start the FPGA/GPU/other accelerators.

Thanks,
Bob Liu



* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-19  1:46     ` Bob Liu
@ 2017-07-19  2:25       ` Jerome Glisse
  2017-07-19  9:09         ` Bob Liu
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-07-19  2:25 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> On 2017/7/18 23:38, Jerome Glisse wrote:
> > On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> On 2017/7/14 5:15, Jerome Glisse wrote:
> >>> Sorry i made horrible mistake on names in v4, i completly miss-
> >>> understood the suggestion. So here i repost with proper naming.
> >>> This is the only change since v3. Again sorry about the noise
> >>> with v4.
> >>>
> >>> Changes since v4:
> >>>   - s/DEVICE_HOST/DEVICE_PUBLIC
> >>>
> >>> Git tree:
> >>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
> >>>
> >>>
> >>> Cache coherent device memory apply to architecture with system bus
> >>> like CAPI or CCIX. Device connected to such system bus can expose
> >>> their memory to the system and allow cache coherent access to it
> >>> from the CPU.
> >>>
> >>> Even if for all intent and purposes device memory behave like regular
> >>> memory, we still want to manage it in isolation from regular memory.
> >>> Several reasons for that, first and foremost this memory is less
> >>> reliable than regular memory if the device hangs because of invalid
> >>> commands we can loose access to device memory. Second CPU access to
> >>> this memory is expected to be slower than to regular memory. Third
> >>> having random memory into device means that some of the bus bandwith
> >>> wouldn't be available to the device but would be use by CPU access.
> >>>
> >>> This is why we want to manage such memory in isolation from regular
> >>> memory. Kernel should not try to use this memory even as last resort
> >>> when running out of memory, at least for now.
> >>>
> >>
> >> I think set a very large node distance for "Cache Coherent Device Memory"
> >> may be a easier way to address these concerns.
> > 
> > Such approach was discuss at length in the past see links below. Outcome
> > of discussion:
> >   - CPU less node are bad
> >   - device memory can be unreliable (device hang) no way for application
> >     to understand that
> 
> Device memory can also be more reliable if using high quality and expensive memory.

Even ECC memory does not compensate for a device hang. When your GPU locks up
you might need to re-init the GPU from scratch, after which the content of the
device memory is unreliable. During init the device memory might not get a
proper clock or proper refresh cycle and thus is susceptible to corruption.

> 
> >   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
> >     with each other and no way the kernel can figure out which should
> >     apply
> >   - NUMA as it is now would not work as we need further isolation that
> >     what a large node distance would provide
> > 
> 
> Agree, that's where we need spend time on.
> 
> One drawback of HMM-CDM I'm worry about is one more extra copy.
> In the cache coherent case, CPU can write data to device memory
> directly then start fpga/GPU/other accelerators.

There is not necessarily an extra copy. The device driver can pre-populate a
virtual address range of a process with device memory, and a device page
fault can directly allocate device memory. Once allocated, CPU access will
use the device memory.

There is a plan to allow other allocations (CPU page fault, file cache, ...)
to also use device memory directly. We just don't know what kind of
userspace API will fit best for that, so at first it might be hidden behind
device driver specific ioctls.
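
To make that flow a bit more concrete, here is a rough sketch of the driver
side using the helper added in patch 3 (simplified; the exact callback
signatures should be taken from the patch itself, and example_bind_cdm is a
made up function):

    #include <linux/err.h>
    #include <linux/hmm.h>

    /* Sketch only: error handling and the devmem ops callbacks are elided. */
    int example_bind_cdm(const struct hmm_devmem_ops *ops,
                         struct device *dev, struct resource *cdm_res)
    {
        struct hmm_devmem *devmem;

        /* Register the device physical range as ZONE_DEVICE DEVICE_PUBLIC. */
        devmem = hmm_devmem_add_resource(ops, dev, cdm_res);
        if (IS_ERR(devmem))
            return PTR_ERR(devmem);

        /*
         * From here the driver can service its own page faults (or explicit
         * requests) by handing out struct pages from this range; once mapped
         * into a process, CPU accesses go directly to device memory.
         */
        return 0;
    }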

Jerome


* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-19  2:25       ` Jerome Glisse
@ 2017-07-19  9:09         ` Bob Liu
  2017-07-20 15:03           ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-07-19  9:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On 2017/7/19 10:25, Jerome Glisse wrote:
> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
>>>>> Sorry i made horrible mistake on names in v4, i completly miss-
>>>>> understood the suggestion. So here i repost with proper naming.
>>>>> This is the only change since v3. Again sorry about the noise
>>>>> with v4.
>>>>>
>>>>> Changes since v4:
>>>>>   - s/DEVICE_HOST/DEVICE_PUBLIC
>>>>>
>>>>> Git tree:
>>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
>>>>>
>>>>>
>>>>> Cache coherent device memory apply to architecture with system bus
>>>>> like CAPI or CCIX. Device connected to such system bus can expose
>>>>> their memory to the system and allow cache coherent access to it
>>>>> from the CPU.
>>>>>
>>>>> Even if for all intent and purposes device memory behave like regular
>>>>> memory, we still want to manage it in isolation from regular memory.
>>>>> Several reasons for that, first and foremost this memory is less
>>>>> reliable than regular memory if the device hangs because of invalid
>>>>> commands we can loose access to device memory. Second CPU access to
>>>>> this memory is expected to be slower than to regular memory. Third
>>>>> having random memory into device means that some of the bus bandwith
>>>>> wouldn't be available to the device but would be use by CPU access.
>>>>>
>>>>> This is why we want to manage such memory in isolation from regular
>>>>> memory. Kernel should not try to use this memory even as last resort
>>>>> when running out of memory, at least for now.
>>>>>
>>>>
>>>> I think set a very large node distance for "Cache Coherent Device Memory"
>>>> may be a easier way to address these concerns.
>>>
>>> Such approach was discuss at length in the past see links below. Outcome
>>> of discussion:
>>>   - CPU less node are bad
>>>   - device memory can be unreliable (device hang) no way for application
>>>     to understand that
>>
>> Device memory can also be more reliable if using high quality and expensive memory.
> 
> Even ECC memory does not compensate for device hang. When your GPU lockups
> you might need to re-init GPU from scratch after which the content of the
> device memory is unreliable. During init the device memory might not get
> proper clock or proper refresh cycle and thus is susceptible to corruption.
> 
>>
>>>   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
>>>     with each other and no way the kernel can figure out which should
>>>     apply
>>>   - NUMA as it is now would not work as we need further isolation that
>>>     what a large node distance would provide
>>>
>>
>> Agree, that's where we need spend time on.
>>
>> One drawback of HMM-CDM I'm worry about is one more extra copy.
>> In the cache coherent case, CPU can write data to device memory
>> directly then start fpga/GPU/other accelerators.
> 
> There is not necessarily an extra copy. Device driver can pre-allocate
> virtual address range of a process with device memory. Device page fault

Okay, I get your point.
But the typical use case is the CPU allocating memory and preparing/writing data, then launching a GPU "cuda kernel".
How do you control whether the allocation goes to device memory, e.g. HBM, or to system DDR at the beginning, without explicit advice from the user?
If it goes to DDR by default, there is an extra copy. If it goes to HBM by default, the HBM may be wasted.

> can directly allocate device memory. Once allocated CPU access will use
> the device memory.
> 

Then it's more like replacing the NUMA node solution (CDM) with ZONE_DEVICE (type MEMORY_DEVICE_PUBLIC).
But the problem is the same, e.g. how to make sure the device memory, say HBM, won't be occupied by normal CPU allocations.
Things will be more complex if there are multiple GPUs connected by nvlink (also cache coherent) in a system, each GPU with its own HBM.
How do you decide whether to allocate physical memory from local HBM/DDR or remote HBM/DDR?
If using the NUMA (CDM) approach there are at least the NUMA mempolicy and autonuma mechanisms.

Thanks,
Bob



* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-19  9:09         ` Bob Liu
@ 2017-07-20 15:03           ` Jerome Glisse
  2017-07-21  1:15             ` Bob Liu
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-07-20 15:03 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> On 2017/7/19 10:25, Jerome Glisse wrote:
> > On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>> On 2017/7/14 5:15, Jerome Glisse wrote:
> >>>>> Sorry i made horrible mistake on names in v4, i completly miss-
> >>>>> understood the suggestion. So here i repost with proper naming.
> >>>>> This is the only change since v3. Again sorry about the noise
> >>>>> with v4.
> >>>>>
> >>>>> Changes since v4:
> >>>>>   - s/DEVICE_HOST/DEVICE_PUBLIC
> >>>>>
> >>>>> Git tree:
> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
> >>>>>
> >>>>>
> >>>>> Cache coherent device memory apply to architecture with system bus
> >>>>> like CAPI or CCIX. Device connected to such system bus can expose
> >>>>> their memory to the system and allow cache coherent access to it
> >>>>> from the CPU.
> >>>>>
> >>>>> Even if for all intent and purposes device memory behave like regular
> >>>>> memory, we still want to manage it in isolation from regular memory.
> >>>>> Several reasons for that, first and foremost this memory is less
> >>>>> reliable than regular memory if the device hangs because of invalid
> >>>>> commands we can loose access to device memory. Second CPU access to
> >>>>> this memory is expected to be slower than to regular memory. Third
> >>>>> having random memory into device means that some of the bus bandwith
> >>>>> wouldn't be available to the device but would be use by CPU access.
> >>>>>
> >>>>> This is why we want to manage such memory in isolation from regular
> >>>>> memory. Kernel should not try to use this memory even as last resort
> >>>>> when running out of memory, at least for now.
> >>>>>
> >>>>
> >>>> I think set a very large node distance for "Cache Coherent Device Memory"
> >>>> may be a easier way to address these concerns.
> >>>
> >>> Such approach was discuss at length in the past see links below. Outcome
> >>> of discussion:
> >>>   - CPU less node are bad
> >>>   - device memory can be unreliable (device hang) no way for application
> >>>     to understand that
> >>
> >> Device memory can also be more reliable if using high quality and expensive memory.
> > 
> > Even ECC memory does not compensate for device hang. When your GPU lockups
> > you might need to re-init GPU from scratch after which the content of the
> > device memory is unreliable. During init the device memory might not get
> > proper clock or proper refresh cycle and thus is susceptible to corruption.
> > 
> >>
> >>>   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
> >>>     with each other and no way the kernel can figure out which should
> >>>     apply
> >>>   - NUMA as it is now would not work as we need further isolation that
> >>>     what a large node distance would provide
> >>>
> >>
> >> Agree, that's where we need spend time on.
> >>
> >> One drawback of HMM-CDM I'm worry about is one more extra copy.
> >> In the cache coherent case, CPU can write data to device memory
> >> directly then start fpga/GPU/other accelerators.
> > 
> > There is not necessarily an extra copy. Device driver can pre-allocate
> > virtual address range of a process with device memory. Device page fault
> 
> Okay, I get your point. But the typical use case is CPU allocate a memory
> and prepare/write data then launch GPU "cuda kernel".

I don't think we should make too many assumptions about what the typical case
is. GPU compute is fast evolving and there are new domains where it applies;
for instance some folks use it to process network streams and the network
adapter writes directly into GPU memory, so there is never a CPU copy of it.
So I'd rather not make any restrictive assumption about how it will be used.

> How to control the allocation go to device memory e.g HBM or system
> DDR at the beginning without user explicit advise? If goes to DDR by
> default, there is an extra copy. If goes to HBM by default, the HBM
> may be waste.

Yes it is a hard problem to solve. We are working with NVidia and IBM on
this and there are several paths. But as a first solution we will rely on
hints/directives given by the userspace program through existing GPGPU APIs
like CUDA or OpenCL. There are plans to have hardware monitor bus traffic
to gather statistics and do automatic memory placement from those.
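
As an illustration of the kind of userspace hint meant here (this is on the
CUDA runtime side, not part of this patchset):

    #include <cuda_runtime.h>

    int main(void)
    {
        float *buf;
        size_t n = 1 << 20;
        int dev = 0;

        /* Managed allocation visible to both CPU and GPU. */
        cudaMallocManaged((void **)&buf, n * sizeof(*buf), cudaMemAttachGlobal);

        /* Hint: prefer placing this range in the GPU's memory. */
        cudaMemAdvise(buf, n * sizeof(*buf),
                      cudaMemAdviseSetPreferredLocation, dev);

        /* CPU initialization; on a coherent bus no extra copy is needed
         * before launching work on the device. */
        for (size_t i = 0; i < n; i++)
            buf[i] = 1.0f;

        cudaFree(buf);
        return 0;
    }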


> > can directly allocate device memory. Once allocated CPU access will use
> > the device memory.
> > 
> 
> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
> sure the device memory say HBM won't be occupied by normal CPU allocation.
> Things will be more complex if there are multi GPU connected by nvlink
> (also cache coherent) in a system, each GPU has their own HBM.
>
> How to decide allocate physical memory from local HBM/DDR or remote HBM/
> DDR? 
>
> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
> at least.

NUMA is not as easy as you think. First, like I said, we want the device
memory to be isolated from most existing mm mechanisms, because the memory
is unreliable and also because the device might need to be able to evict
memory to make contiguous physical memory allocations for graphics.

Second, device drivers are not integrated closely enough within the mm and
scheduler kernel code to allow efficiently plugging in device access
notification to a page (ie to update struct page so that the numa worker
thread can migrate memory based on accurate information).

Third, it can be hard to decide who wins between CPU and device access
when it comes to updating things like the last CPU id.

Fourth, there is no such thing as a device id, ie an equivalent of the CPU
id. If we were to add something, the CPU id field in the flags of struct
page would not be big enough, so this could have repercussions on struct
page size. This is not an easy sell.

There are other issues I can't think of right now. I think for now it
is easier and better to take the HMM-CDM approach and later down the
road, once we have more existing users, to start thinking about a numa or
numa-like solution.

Bottom line is we spent time thinking about this and yes numa makes
sense from a conceptual point of view, but there are many things we do
not know, so we are not confident that we can make something good with
numa as it is.

Cheers,
Jerome


* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-20 15:03           ` Jerome Glisse
@ 2017-07-21  1:15             ` Bob Liu
  2017-07-21  1:41               ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-07-21  1:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On 2017/7/20 23:03, Jerome Glisse wrote:
> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
>>>>>>> Sorry i made horrible mistake on names in v4, i completly miss-
>>>>>>> understood the suggestion. So here i repost with proper naming.
>>>>>>> This is the only change since v3. Again sorry about the noise
>>>>>>> with v4.
>>>>>>>
>>>>>>> Changes since v4:
>>>>>>>   - s/DEVICE_HOST/DEVICE_PUBLIC
>>>>>>>
>>>>>>> Git tree:
>>>>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
>>>>>>>
>>>>>>>
>>>>>>> Cache coherent device memory apply to architecture with system bus
>>>>>>> like CAPI or CCIX. Device connected to such system bus can expose
>>>>>>> their memory to the system and allow cache coherent access to it
>>>>>>> from the CPU.
>>>>>>>
>>>>>>> Even if for all intent and purposes device memory behave like regular
>>>>>>> memory, we still want to manage it in isolation from regular memory.
>>>>>>> Several reasons for that, first and foremost this memory is less
>>>>>>> reliable than regular memory if the device hangs because of invalid
>>>>>>> commands we can loose access to device memory. Second CPU access to
>>>>>>> this memory is expected to be slower than to regular memory. Third
>>>>>>> having random memory into device means that some of the bus bandwith
>>>>>>> wouldn't be available to the device but would be use by CPU access.
>>>>>>>
>>>>>>> This is why we want to manage such memory in isolation from regular
>>>>>>> memory. Kernel should not try to use this memory even as last resort
>>>>>>> when running out of memory, at least for now.
>>>>>>>
>>>>>>
>>>>>> I think set a very large node distance for "Cache Coherent Device Memory"
>>>>>> may be a easier way to address these concerns.
>>>>>
>>>>> Such approach was discuss at length in the past see links below. Outcome
>>>>> of discussion:
>>>>>   - CPU less node are bad
>>>>>   - device memory can be unreliable (device hang) no way for application
>>>>>     to understand that
>>>>
>>>> Device memory can also be more reliable if using high quality and expensive memory.
>>>
>>> Even ECC memory does not compensate for device hang. When your GPU lockups
>>> you might need to re-init GPU from scratch after which the content of the
>>> device memory is unreliable. During init the device memory might not get
>>> proper clock or proper refresh cycle and thus is susceptible to corruption.
>>>
>>>>
>>>>>   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
>>>>>     with each other and no way the kernel can figure out which should
>>>>>     apply
>>>>>   - NUMA as it is now would not work as we need further isolation that
>>>>>     what a large node distance would provide
>>>>>
>>>>
>>>> Agree, that's where we need spend time on.
>>>>
>>>> One drawback of HMM-CDM I'm worry about is one more extra copy.
>>>> In the cache coherent case, CPU can write data to device memory
>>>> directly then start fpga/GPU/other accelerators.
>>>
>>> There is not necessarily an extra copy. Device driver can pre-allocate
>>> virtual address range of a process with device memory. Device page fault
>>
>> Okay, I get your point. But the typical use case is CPU allocate a memory
>> and prepare/write data then launch GPU "cuda kernel".
> 
> I don't think we should make to many assumption on what is typical case.
> GPU compute is fast evolving and they are new domains where it is apply
> for instance some folks use it to process network stream and the network
> adapter directly write into GPU memory so there is never a CPU copy of
> it. So i rather not make any restrictive assumption on how it will be use.
> 
>> How to control the allocation go to device memory e.g HBM or system
>> DDR at the beginning without user explicit advise? If goes to DDR by
>> default, there is an extra copy. If goes to HBM by default, the HBM
>> may be waste.
> 
> Yes it is a hard problem to solve. We are working with NVidia and IBM
> on this and there are several path. But as first solution we will rely
> on hint/directive given by userspace program through existing GPGPU API
> like CUDA or OpenCL. They are plan to have hardware monitor bus traffic
> to gather statistics and do automatic memory placement from thos.
> 
> 
>>> can directly allocate device memory. Once allocated CPU access will use
>>> the device memory.
>>>
>>
>> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
>> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
>> sure the device memory say HBM won't be occupied by normal CPU allocation.
>> Things will be more complex if there are multi GPU connected by nvlink
>> (also cache coherent) in a system, each GPU has their own HBM.
>>
>> How to decide allocate physical memory from local HBM/DDR or remote HBM/
>> DDR? 
>>
>> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
>> at least.
> 
> NUMA is not as easy as you think. First like i said we want the device
> memory to be isolated from most existing mm mechanism. Because memory
> is unreliable and also because device might need to be able to evict
> memory to make contiguous physical memory allocation for graphics.
> 

Right, but we need isolation anyway.
For hmm-cdm, the isolation is not adding device memory to the lru list, plus
many if (is_device_public_page(page)) ... checks.

But how do we evict device memory?

> Second device driver are not integrated that closely within mm and the
> scheduler kernel code to allow to efficiently plug in device access
> notification to page (ie to update struct page so that numa worker
> thread can migrate memory base on accurate informations).
> 
> Third it can be hard to decide who win between CPU and device access
> when it comes to updating thing like last CPU id.
> 
> Fourth there is no such thing like device id ie equivalent of CPU id.
> If we were to add something the CPU id field in flags of struct page
> would not be big enough so this can have repercusion on struct page
> size. This is not an easy sell.
> 
> They are other issues i can't think of right now. I think for now it

My opinion is that most of the issues are the same no matter whether we use
CDM or HMM-CDM. I just care about a more complete solution, whether it is
CDM, HMM-CDM or something else. HMM or HMM-CDM depends on the device driver,
but I haven't seen a public/full driver to demonstrate that the whole
solution works fine.

Cheers,
Bob

> is easier and better to take the HMM-CDM approach and latter down the
> road once we have more existing user to start thinking about numa or
> numa like solution.
> 
> Bottom line is we spend time thinking about this and yes numa make
> sense from conceptual point of view but they are many things we do
> not know to feel confident that we can make something good with numa
> as it is.



* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-21  1:15             ` Bob Liu
@ 2017-07-21  1:41               ` Jerome Glisse
  2017-07-21  2:10                 ` Bob Liu
  2017-07-21  3:48                 ` Dan Williams
  0 siblings, 2 replies; 43+ messages in thread
From: Jerome Glisse @ 2017-07-21  1:41 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> On 2017/7/20 23:03, Jerome Glisse wrote:
> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:

[...]

> >> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
> >> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
> >> sure the device memory say HBM won't be occupied by normal CPU allocation.
> >> Things will be more complex if there are multi GPU connected by nvlink
> >> (also cache coherent) in a system, each GPU has their own HBM.
> >>
> >> How to decide allocate physical memory from local HBM/DDR or remote HBM/
> >> DDR? 
> >>
> >> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
> >> at least.
> > 
> > NUMA is not as easy as you think. First like i said we want the device
> > memory to be isolated from most existing mm mechanism. Because memory
> > is unreliable and also because device might need to be able to evict
> > memory to make contiguous physical memory allocation for graphics.
> > 
> 
> Right, but we need isolation any way.
> For hmm-cdm, the isolation is not adding device memory to lru list, and many
> if (is_device_public_page(page)) ...
> 
> But how to evict device memory?

What do you mean by evict? The device driver can evict whenever it sees the
need to do so. A CPU page fault will evict too. Process exit or munmap() will
free the device memory.

Are you referring to evicting in the sense of memory reclaim under pressure?

The way it flows for memory pressure is that if the device driver wants to
make room it can evict stuff to system memory, and if there is not enough
system memory then things get reclaimed as usual before the device driver
can make progress on device memory reclaim.
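
A minimal sketch of that flow, with purely made up driver helpers (the
example_* names and types are not real kernel functions), assuming the
driver evicts by migrating device pages back to system memory:

    #include <linux/errno.h>

    /* Hypothetical sketch of the eviction flow described above. */
    int example_make_room(struct example_device *edev, unsigned long npages)
    {
        while (example_device_free_pages(edev) < npages) {
            struct example_range *victim;

            /* The driver picks a victim by its own policy (its own LRU, ...). */
            victim = example_pick_victim(edev);
            if (!victim)
                return -ENOMEM;

            /*
             * Migrating back to system memory allocates regular pages; if the
             * system is short on memory, normal reclaim runs as part of that
             * allocation, before the device memory itself becomes free again.
             */
            example_migrate_to_system_memory(victim);
        }
        return 0;
    }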


> > Second device driver are not integrated that closely within mm and the
> > scheduler kernel code to allow to efficiently plug in device access
> > notification to page (ie to update struct page so that numa worker
> > thread can migrate memory base on accurate informations).
> > 
> > Third it can be hard to decide who win between CPU and device access
> > when it comes to updating thing like last CPU id.
> > 
> > Fourth there is no such thing like device id ie equivalent of CPU id.
> > If we were to add something the CPU id field in flags of struct page
> > would not be big enough so this can have repercusion on struct page
> > size. This is not an easy sell.
> > 
> > They are other issues i can't think of right now. I think for now it
> 
> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to 
> demonstrate the whole solution works fine.

I am working with the NVidia closed source driver team to make sure that it
works well for them. I am also working on the nouveau open source driver for
the same NVidia hardware, though it will be of less use as what is missing
there is a solid open source userspace to leverage this. Nonetheless open
source drivers are in the works.

The way I see it is to start with HMM-CDM, which isolates most of the changes
in hmm code. Once we get more experience with real workloads, and not with
device driver test suites, then we can start revisiting NUMA and deeper
integration with the linux kernel. I'd rather grow organically toward that
than try to design something that would make major changes all over the
kernel without knowing for sure that we are going in the right direction. I
hope that this makes sense to others too.

Cheers,
Jerome


* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-21  1:41               ` Jerome Glisse
@ 2017-07-21  2:10                 ` Bob Liu
  2017-07-21 12:01                   ` Bob Liu
  2017-07-21  3:48                 ` Dan Williams
  1 sibling, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-07-21  2:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On 2017/7/21 9:41, Jerome Glisse wrote:
> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
> 
> [...]
> 
>>>> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
>>>> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
>>>> sure the device memory say HBM won't be occupied by normal CPU allocation.
>>>> Things will be more complex if there are multi GPU connected by nvlink
>>>> (also cache coherent) in a system, each GPU has their own HBM.
>>>>
>>>> How to decide allocate physical memory from local HBM/DDR or remote HBM/
>>>> DDR? 
>>>>
>>>> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
>>>> at least.
>>>
>>> NUMA is not as easy as you think. First like i said we want the device
>>> memory to be isolated from most existing mm mechanism. Because memory
>>> is unreliable and also because device might need to be able to evict
>>> memory to make contiguous physical memory allocation for graphics.
>>>
>>
>> Right, but we need isolation any way.
>> For hmm-cdm, the isolation is not adding device memory to lru list, and many
>> if (is_device_public_page(page)) ...
>>
>> But how to evict device memory?
> 
> What you mean by evict ? Device driver can evict whenever they see the need
> to do so. CPU page fault will evict too. Process exit or munmap() will free
> the device memory.
> 
> Are you refering to evict in the sense of memory reclaim under pressure ?
> 
> So the way it flows for memory pressure is that if device driver want to
> make room it can evict stuff to system memory and if there is not enough

Yes, I mean this.
So every driver has to maintain its own LRU-like list instead of reusing what is already in the linux kernel.

> system memory than thing get reclaim as usual before device driver can
> make progress on device memory reclaim.
> 
> 
>>> Second device driver are not integrated that closely within mm and the
>>> scheduler kernel code to allow to efficiently plug in device access
>>> notification to page (ie to update struct page so that numa worker
>>> thread can migrate memory base on accurate informations).
>>>
>>> Third it can be hard to decide who win between CPU and device access
>>> when it comes to updating thing like last CPU id.
>>>
>>> Fourth there is no such thing like device id ie equivalent of CPU id.
>>> If we were to add something the CPU id field in flags of struct page
>>> would not be big enough so this can have repercusion on struct page
>>> size. This is not an easy sell.
>>>
>>> They are other issues i can't think of right now. I think for now it
>>
>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to 
>> demonstrate the whole solution works fine.
> 
> I am working with NVidia close source driver team to make sure that it works
> well for them. I am also working on nouveau open source driver for same NVidia
> hardware thought it will be of less use as what is missing there is a solid
> open source userspace to leverage this. Nonetheless open source driver are in
> the work.
> 

Looking forward to seeing these drivers become public.

> The way i see it is start with HMM-CDM which isolate most of the changes in
> hmm code. Once we get more experience with real workload and not with device
> driver test suite then we can start revisiting NUMA and deeper integration
> with the linux kernel. I rather grow organicaly toward that than trying to
> design something that would make major changes all over the kernel without
> knowing for sure that we are going in the right direction. I hope that this
> make sense to others too.
> 

Makes sense.

Thanks,
Bob Liu



* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-21  1:41               ` Jerome Glisse
  2017-07-21  2:10                 ` Bob Liu
@ 2017-07-21  3:48                 ` Dan Williams
  2017-07-21 15:22                   ` Jerome Glisse
  2017-09-05 19:36                   ` Jerome Glisse
  1 sibling, 2 replies; 43+ messages in thread
From: Dan Williams @ 2017-07-21  3:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Michal Hocko

On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> On 2017/7/20 23:03, Jerome Glisse wrote:
>> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>
> [...]
>
>> >> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
>> >> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
>> >> sure the device memory say HBM won't be occupied by normal CPU allocation.
>> >> Things will be more complex if there are multi GPU connected by nvlink
>> >> (also cache coherent) in a system, each GPU has their own HBM.
>> >>
>> >> How to decide allocate physical memory from local HBM/DDR or remote HBM/
>> >> DDR?
>> >>
>> >> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
>> >> at least.
>> >
>> > NUMA is not as easy as you think. First like i said we want the device
>> > memory to be isolated from most existing mm mechanism. Because memory
>> > is unreliable and also because device might need to be able to evict
>> > memory to make contiguous physical memory allocation for graphics.
>> >
>>
>> Right, but we need isolation any way.
>> For hmm-cdm, the isolation is not adding device memory to lru list, and many
>> if (is_device_public_page(page)) ...
>>
>> But how to evict device memory?
>
> What you mean by evict ? Device driver can evict whenever they see the need
> to do so. CPU page fault will evict too. Process exit or munmap() will free
> the device memory.
>
> Are you refering to evict in the sense of memory reclaim under pressure ?
>
> So the way it flows for memory pressure is that if device driver want to
> make room it can evict stuff to system memory and if there is not enough
> system memory than thing get reclaim as usual before device driver can
> make progress on device memory reclaim.
>
>
>> > Second device driver are not integrated that closely within mm and the
>> > scheduler kernel code to allow to efficiently plug in device access
>> > notification to page (ie to update struct page so that numa worker
>> > thread can migrate memory base on accurate informations).
>> >
>> > Third it can be hard to decide who win between CPU and device access
>> > when it comes to updating thing like last CPU id.
>> >
>> > Fourth there is no such thing like device id ie equivalent of CPU id.
>> > If we were to add something the CPU id field in flags of struct page
>> > would not be big enough so this can have repercusion on struct page
>> > size. This is not an easy sell.
>> >
>> > They are other issues i can't think of right now. I think for now it
>>
>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>> demonstrate the whole solution works fine.
>
> I am working with NVidia close source driver team to make sure that it works
> well for them. I am also working on nouveau open source driver for same NVidia
> hardware thought it will be of less use as what is missing there is a solid
> open source userspace to leverage this. Nonetheless open source driver are in
> the work.

Can you point to the nouveau patches? I still find these HMM patches
un-reviewable without an upstream consumer.


* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-21  2:10                 ` Bob Liu
@ 2017-07-21 12:01                   ` Bob Liu
  2017-07-21 15:21                     ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-07-21 12:01 UTC (permalink / raw)
  To: Bob Liu
  Cc: Jerome Glisse, Linux-Kernel, Linux-MM, John Hubbard,
	David Nellans, Dan Williams, Balbir Singh, Michal Hocko

On Fri, Jul 21, 2017 at 10:10 AM, Bob Liu <liubo95@huawei.com> wrote:
> On 2017/7/21 9:41, Jerome Glisse wrote:
>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>
>> [...]
>>
>>>>> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
>>>>> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
>>>>> sure the device memory say HBM won't be occupied by normal CPU allocation.
>>>>> Things will be more complex if there are multi GPU connected by nvlink
>>>>> (also cache coherent) in a system, each GPU has their own HBM.
>>>>>
>>>>> How to decide allocate physical memory from local HBM/DDR or remote HBM/
>>>>> DDR?
>>>>>
>>>>> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
>>>>> at least.
>>>>
>>>> NUMA is not as easy as you think. First like i said we want the device
>>>> memory to be isolated from most existing mm mechanism. Because memory
>>>> is unreliable and also because device might need to be able to evict
>>>> memory to make contiguous physical memory allocation for graphics.
>>>>
>>>
>>> Right, but we need isolation any way.
>>> For hmm-cdm, the isolation is not adding device memory to lru list, and many
>>> if (is_device_public_page(page)) ...
>>>
>>> But how to evict device memory?
>>
>> What you mean by evict ? Device driver can evict whenever they see the need
>> to do so. CPU page fault will evict too. Process exit or munmap() will free
>> the device memory.
>>
>> Are you refering to evict in the sense of memory reclaim under pressure ?
>>
>> So the way it flows for memory pressure is that if device driver want to
>> make room it can evict stuff to system memory and if there is not enough
>
> Yes, I mean this.
> So every driver have to maintain their own LRU-similar list instead of reuse what already in linux kernel.
>

And how can HMM-CDM handle multiple devices, or a device with multiple
device memories (possibly with different properties as well)?
This kind of hardware platform will be very common once CCIX is out.

Thanks,
Bob Liu



>> system memory than thing get reclaim as usual before device driver can
>> make progress on device memory reclaim.
>>
>>
>>>> Second device driver are not integrated that closely within mm and the
>>>> scheduler kernel code to allow to efficiently plug in device access
>>>> notification to page (ie to update struct page so that numa worker
>>>> thread can migrate memory base on accurate informations).
>>>>
>>>> Third it can be hard to decide who win between CPU and device access
>>>> when it comes to updating thing like last CPU id.
>>>>
>>>> Fourth there is no such thing like device id ie equivalent of CPU id.
>>>> If we were to add something the CPU id field in flags of struct page
>>>> would not be big enough so this can have repercusion on struct page
>>>> size. This is not an easy sell.
>>>>
>>>> They are other issues i can't think of right now. I think for now it
>>>
>>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>>> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>>> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>>> demonstrate the whole solution works fine.
>>
>> I am working with NVidia close source driver team to make sure that it works
>> well for them. I am also working on nouveau open source driver for same NVidia
>> hardware thought it will be of less use as what is missing there is a solid
>> open source userspace to leverage this. Nonetheless open source driver are in
>> the work.
>>
>
> Looking forward to see these drivers be public.
>
>> The way i see it is start with HMM-CDM which isolate most of the changes in
>> hmm code. Once we get more experience with real workload and not with device
>> driver test suite then we can start revisiting NUMA and deeper integration
>> with the linux kernel. I rather grow organicaly toward that than trying to
>> design something that would make major changes all over the kernel without
>> knowing for sure that we are going in the right direction. I hope that this
>> make sense to others too.
>>
>
> Make sense.
>
> Thanks,
> Bob Liu
>
>



-- 
Regards,
--Bob


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-21 12:01                   ` Bob Liu
@ 2017-07-21 15:21                     ` Jerome Glisse
  0 siblings, 0 replies; 43+ messages in thread
From: Jerome Glisse @ 2017-07-21 15:21 UTC (permalink / raw)
  To: Bob Liu
  Cc: Bob Liu, Linux-Kernel, Linux-MM, John Hubbard, David Nellans,
	Dan Williams, Balbir Singh, Michal Hocko

On Fri, Jul 21, 2017 at 08:01:07PM +0800, Bob Liu wrote:
> On Fri, Jul 21, 2017 at 10:10 AM, Bob Liu <liubo95@huawei.com> wrote:
> > On 2017/7/21 9:41, Jerome Glisse wrote:
> >> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >>> On 2017/7/20 23:03, Jerome Glisse wrote:
> >>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
> >>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>>>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
> >>
> >> [...]
> >>
> >>>>> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
> >>>>> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
> >>>>> sure the device memory say HBM won't be occupied by normal CPU allocation.
> >>>>> Things will be more complex if there are multi GPU connected by nvlink
> >>>>> (also cache coherent) in a system, each GPU has their own HBM.
> >>>>>
> >>>>> How to decide allocate physical memory from local HBM/DDR or remote HBM/
> >>>>> DDR?
> >>>>>
> >>>>> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
> >>>>> at least.
> >>>>
> >>>> NUMA is not as easy as you think. First like i said we want the device
> >>>> memory to be isolated from most existing mm mechanism. Because memory
> >>>> is unreliable and also because device might need to be able to evict
> >>>> memory to make contiguous physical memory allocation for graphics.
> >>>>
> >>>
> >>> Right, but we need isolation any way.
> >>> For hmm-cdm, the isolation is not adding device memory to lru list, and many
> >>> if (is_device_public_page(page)) ...
> >>>
> >>> But how to evict device memory?
> >>
> >> What you mean by evict ? Device driver can evict whenever they see the need
> >> to do so. CPU page fault will evict too. Process exit or munmap() will free
> >> the device memory.
> >>
> >> Are you refering to evict in the sense of memory reclaim under pressure ?
> >>
> >> So the way it flows for memory pressure is that if device driver want to
> >> make room it can evict stuff to system memory and if there is not enough
> >
> > Yes, I mean this.
> > So every driver have to maintain their own LRU-similar list instead of
> > reuse what already in linux kernel.

Regarding LRU it is again not as easy. First we do not necessarily have
access information for the device page table the way we do for the CPU page
table. Second the mmu_notifier callback on a per page basis is costly.
Finally devices are used differently than the CPU: usually you schedule a
job and once that job is done you can safely evict the memory it was using.
Existing device drivers already have quite large memory management code of
their own because of that different usage model.

LRU might make sense at one point but so far I doubt it is the right
solution for device memory.
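
To make that usage model concrete, a minimal sketch of driver-side eviction
driven by job completion rather than by an LRU might look like the following;
struct dev_device, struct dev_job, struct dev_range and devmem_evict_range()
are illustrative names only, not an existing kernel API:

/* Minimal sketch: eviction driven by job completion instead of an LRU.
 * All structures and helpers here are hypothetical names for illustration. */
static void dev_job_done(struct dev_device *dev, struct dev_job *job)
{
        struct dev_range *range;

        /* Once the job retires, its working set is no longer needed on the
         * device, so the driver can migrate those pages back to system
         * memory whenever it wants to make room. */
        list_for_each_entry(range, &job->ranges, list) {
                if (range->pinned)
                        continue;
                devmem_evict_range(dev, range->start, range->end);
        }
}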

> 
> And how HMM-CDM can handle multiple devices or device with multiple
> device memories(may with different properties also)?
> This kind of hardware platform would be very common when CCIX is out soon.

A) Multiple devices of the same kind are under the control of the device
driver. Multiple devices linked to each other through a dedicated link can
have a complex topology among themselves, and remote access between devices
is highly tied to the device (how to program the device mmu and device
registers) and thus to the device driver.

If we identify common design patterns between different hardware then we
might start thinking about factoring out some common code to help those
cases.


B) Multiple different devices is a harder problem. Each device provides its
own userspace API and it is through that API that you will get memory
placement advice. If several devices fight over placement of the same chunk
of memory one can argue that the application is broken or the device is
broken. But for now we assume that devices and applications will behave.

Rate limiting migration is hard: you need to keep migration statistics and
that needs memory. So unless we really need to do that I would rather avoid
it. Again this is a thing for which we will have to wait and see how things
pan out.


Maybe I should stress that HMM is a set of helpers for device memory and it
is not intended to be a policy maker or to manage device memory. The
intention is that device drivers will keep managing device memory as they
already do today.

A deeper integration with process memory management is probably bound to
happen but for now it is just about having a toolbox for device drivers.

Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-21  3:48                 ` Dan Williams
@ 2017-07-21 15:22                   ` Jerome Glisse
  2017-09-05 19:36                   ` Jerome Glisse
  1 sibling, 0 replies; 43+ messages in thread
From: Jerome Glisse @ 2017-07-21 15:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bob Liu, linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Michal Hocko

On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
> >
> > [...]
> >
> >> >> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
> >> >> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
> >> >> sure the device memory say HBM won't be occupied by normal CPU allocation.
> >> >> Things will be more complex if there are multi GPU connected by nvlink
> >> >> (also cache coherent) in a system, each GPU has their own HBM.
> >> >>
> >> >> How to decide allocate physical memory from local HBM/DDR or remote HBM/
> >> >> DDR?
> >> >>
> >> >> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
> >> >> at least.
> >> >
> >> > NUMA is not as easy as you think. First like i said we want the device
> >> > memory to be isolated from most existing mm mechanism. Because memory
> >> > is unreliable and also because device might need to be able to evict
> >> > memory to make contiguous physical memory allocation for graphics.
> >> >
> >>
> >> Right, but we need isolation any way.
> >> For hmm-cdm, the isolation is not adding device memory to lru list, and many
> >> if (is_device_public_page(page)) ...
> >>
> >> But how to evict device memory?
> >
> > What you mean by evict ? Device driver can evict whenever they see the need
> > to do so. CPU page fault will evict too. Process exit or munmap() will free
> > the device memory.
> >
> > Are you refering to evict in the sense of memory reclaim under pressure ?
> >
> > So the way it flows for memory pressure is that if device driver want to
> > make room it can evict stuff to system memory and if there is not enough
> > system memory than thing get reclaim as usual before device driver can
> > make progress on device memory reclaim.
> >
> >
> >> > Second device driver are not integrated that closely within mm and the
> >> > scheduler kernel code to allow to efficiently plug in device access
> >> > notification to page (ie to update struct page so that numa worker
> >> > thread can migrate memory base on accurate informations).
> >> >
> >> > Third it can be hard to decide who win between CPU and device access
> >> > when it comes to updating thing like last CPU id.
> >> >
> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> > If we were to add something the CPU id field in flags of struct page
> >> > would not be big enough so this can have repercusion on struct page
> >> > size. This is not an easy sell.
> >> >
> >> > They are other issues i can't think of right now. I think for now it
> >>
> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> demonstrate the whole solution works fine.
> >
> > I am working with NVidia close source driver team to make sure that it works
> > well for them. I am also working on nouveau open source driver for same NVidia
> > hardware thought it will be of less use as what is missing there is a solid
> > open source userspace to leverage this. Nonetheless open source driver are in
> > the work.
> 
> Can you point to the nouveau patches? I still find these HMM patches
> un-reviewable without an upstream consumer.

I am still working on those, I hope I will be able to post them in 3 weeks or so.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-07-21  3:48                 ` Dan Williams
  2017-07-21 15:22                   ` Jerome Glisse
@ 2017-09-05 19:36                   ` Jerome Glisse
  2017-09-09 23:22                     ` Bob Liu
  1 sibling, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-09-05 19:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bob Liu, linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, Michal Hocko, Andrew Morton

On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:

[...]

> >> > Second device driver are not integrated that closely within mm and the
> >> > scheduler kernel code to allow to efficiently plug in device access
> >> > notification to page (ie to update struct page so that numa worker
> >> > thread can migrate memory base on accurate informations).
> >> >
> >> > Third it can be hard to decide who win between CPU and device access
> >> > when it comes to updating thing like last CPU id.
> >> >
> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> > If we were to add something the CPU id field in flags of struct page
> >> > would not be big enough so this can have repercusion on struct page
> >> > size. This is not an easy sell.
> >> >
> >> > They are other issues i can't think of right now. I think for now it
> >>
> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> demonstrate the whole solution works fine.
> >
> > I am working with NVidia close source driver team to make sure that it works
> > well for them. I am also working on nouveau open source driver for same NVidia
> > hardware thought it will be of less use as what is missing there is a solid
> > open source userspace to leverage this. Nonetheless open source driver are in
> > the work.
> 
> Can you point to the nouveau patches? I still find these HMM patches
> un-reviewable without an upstream consumer.

So I pushed a branch with WIP for nouveau to use HMM:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau

The top 16 patches are HMM related (implementing the logic inside the driver
to use HMM). The next 16 patches are hardware specific patches and some
nouveau changes needed to allow page faults.

It is enough to have a simple malloc test case working:

https://cgit.freedesktop.org/~glisse/compote

There are 2 programs here: the old one is the existing way you use a GPU for
a compute task, while the new one is what HMM allows to achieve, ie using
malloc memory directly.
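
Roughly, the difference between the two programs is the one sketched below;
the gpu_*() calls are placeholder names standing in for whatever
driver/userspace API is used, not the actual compote code:

/* Old way (placeholder gpu_*() API): explicit device buffer and copies. */
float *in = malloc(n * sizeof(*in));
void *dbuf = gpu_alloc(dev, n * sizeof(*in));
gpu_copy_to_device(dev, dbuf, in, n * sizeof(*in));
gpu_launch(dev, kernel, dbuf, n);
gpu_copy_from_device(dev, in, dbuf, n * sizeof(*in));

/* With HMM: the GPU faults on and mirrors the process address space, so
 * the malloc'ed pointer is handed to the device directly, no staging copy. */
float *buf = malloc(n * sizeof(*buf));
gpu_launch(dev, kernel, buf, n);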


I haven't added the device memory support yet, it is in the works, and I
will push an update to this branch and repo for that. Probably next week if
no pressing bug preempts my time.


So there is a lot of ugliness in all this and I don't expect this to be
what ends up upstream. Right now there is a large rework of the nouveau vm
(virtual memory) code happening, to completely rework how we do address
space management within nouveau. This work is a prerequisite for a clean
implementation of HMM inside nouveau (it will also lift the 40bits address
space limitation that exists today inside the nouveau driver). Once that
work lands I will work on a clean upstreamable implementation for nouveau
to use HMM as well as userspace to leverage it (this is a requirement for
an upstream GPU driver to have open source userspace that makes use of the
features). All this is a lot of work and there are not many people working
on this.


There are other initiatives under way related to this that I can not talk
about publicly, but if they bear fruit they might help to speed up all this.

Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-05 19:36                   ` Jerome Glisse
@ 2017-09-09 23:22                     ` Bob Liu
  2017-09-11 23:36                       ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-09-09 23:22 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Bob Liu, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> >> On 2017/7/20 23:03, Jerome Glisse wrote:
>> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>
> [...]
>
>> >> > Second device driver are not integrated that closely within mm and the
>> >> > scheduler kernel code to allow to efficiently plug in device access
>> >> > notification to page (ie to update struct page so that numa worker
>> >> > thread can migrate memory base on accurate informations).
>> >> >
>> >> > Third it can be hard to decide who win between CPU and device access
>> >> > when it comes to updating thing like last CPU id.
>> >> >
>> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
>> >> > If we were to add something the CPU id field in flags of struct page
>> >> > would not be big enough so this can have repercusion on struct page
>> >> > size. This is not an easy sell.
>> >> >
>> >> > They are other issues i can't think of right now. I think for now it
>> >>
>> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>> >> demonstrate the whole solution works fine.
>> >
>> > I am working with NVidia close source driver team to make sure that it works
>> > well for them. I am also working on nouveau open source driver for same NVidia
>> > hardware thought it will be of less use as what is missing there is a solid
>> > open source userspace to leverage this. Nonetheless open source driver are in
>> > the work.
>>
>> Can you point to the nouveau patches? I still find these HMM patches
>> un-reviewable without an upstream consumer.
>
> So i pushed a branch with WIP for nouveau to use HMM:
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>

Nice to see that.
Btw, do you have any plan for a CDM-HMM driver, so that the CPU can write to
device memory directly without an extra copy?

--
Thanks,
Bob Liu

> Top 16 patches are HMM related (implementic logic inside the driver to use
> HMM). The next 16 patches are hardware specific patches and some nouveau
> changes needed to allow page fault.
>
> It is enough to have simple malloc test case working:
>
> https://cgit.freedesktop.org/~glisse/compote
>
> There is 2 program here the old one is existing way you use GPU for compute
> task while the new one is what HMM allow to achieve ie use malloc memory
> directly.
>
>
> I haven't added yet the device memory support it is in work and i will push
> update to this branch and repo for that. Probably next week if no pressing
> bug preempt my time.
>
>
> So there is a lot of ugliness in all this and i don't expect this to be what
> end up upstream. Right now there is a large rework of nouveau vm (virtual
> memory) code happening to rework completely how we do address space management
> within nouveau. This work is prerequisite for a clean implementation for HMM
> inside nouveau (it will also lift the 40bits address space limitation that
> exist today inside nouveau driver). Once that work land i will work on clean
> upstreamable implementation for nouveau to use HMM as well as userspace to
> leverage it (this is requirement for upstream GPU driver to have open source
> userspace that make use of features). All this is a lot of work and there is
> not many people working on this.
>
>
> They are other initiatives under way related to this that i can not talk about
> publicly but if they bare fruit they might help to speedup all this.
>
> Jérôme
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-09 23:22                     ` Bob Liu
@ 2017-09-11 23:36                       ` Jerome Glisse
  2017-09-12  1:02                         ` Bob Liu
  2017-09-26  9:56                         ` Bob Liu
  0 siblings, 2 replies; 43+ messages in thread
From: Jerome Glisse @ 2017-09-11 23:36 UTC (permalink / raw)
  To: Bob Liu
  Cc: Dan Williams, Bob Liu, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >> >>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
> >
> > [...]
> >
> >> >> > Second device driver are not integrated that closely within mm and the
> >> >> > scheduler kernel code to allow to efficiently plug in device access
> >> >> > notification to page (ie to update struct page so that numa worker
> >> >> > thread can migrate memory base on accurate informations).
> >> >> >
> >> >> > Third it can be hard to decide who win between CPU and device access
> >> >> > when it comes to updating thing like last CPU id.
> >> >> >
> >> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> >> > If we were to add something the CPU id field in flags of struct page
> >> >> > would not be big enough so this can have repercusion on struct page
> >> >> > size. This is not an easy sell.
> >> >> >
> >> >> > They are other issues i can't think of right now. I think for now it
> >> >>
> >> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> >> demonstrate the whole solution works fine.
> >> >
> >> > I am working with NVidia close source driver team to make sure that it works
> >> > well for them. I am also working on nouveau open source driver for same NVidia
> >> > hardware thought it will be of less use as what is missing there is a solid
> >> > open source userspace to leverage this. Nonetheless open source driver are in
> >> > the work.
> >>
> >> Can you point to the nouveau patches? I still find these HMM patches
> >> un-reviewable without an upstream consumer.
> >
> > So i pushed a branch with WIP for nouveau to use HMM:
> >
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >
> 
> Nice to see that.
> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> Device memory directly without extra copy.

Yes, nouveau CDM support on PPC (which is the only CDM platform commercially
available today) is on the TODO list. Note that the driver changes for CDM
are minimal (probably less than 100 lines of code). From the driver point
of view this is memory and it doesn't matter whether it is CDM or not.

The real burden is on the application developers who need to update their
code to leverage this.


Also, as a data point, you want to avoid CPU access to CDM device memory as
much as possible. The overhead for single cache line accesses is high (this
is PCIE or a derivative protocol, and it is a packet protocol).

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-11 23:36                       ` Jerome Glisse
@ 2017-09-12  1:02                         ` Bob Liu
  2017-09-12 16:17                           ` Jerome Glisse
  2017-09-26  9:56                         ` Bob Liu
  1 sibling, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-09-12  1:02 UTC (permalink / raw)
  To: Jerome Glisse, Bob Liu
  Cc: Dan Williams, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On 2017/9/12 7:36, Jerome Glisse wrote:
> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>>>>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>>>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
>>>
>>> [...]
>>>
>>>>>>> Second device driver are not integrated that closely within mm and the
>>>>>>> scheduler kernel code to allow to efficiently plug in device access
>>>>>>> notification to page (ie to update struct page so that numa worker
>>>>>>> thread can migrate memory base on accurate informations).
>>>>>>>
>>>>>>> Third it can be hard to decide who win between CPU and device access
>>>>>>> when it comes to updating thing like last CPU id.
>>>>>>>
>>>>>>> Fourth there is no such thing like device id ie equivalent of CPU id.
>>>>>>> If we were to add something the CPU id field in flags of struct page
>>>>>>> would not be big enough so this can have repercusion on struct page
>>>>>>> size. This is not an easy sell.
>>>>>>>
>>>>>>> They are other issues i can't think of right now. I think for now it
>>>>>>
>>>>>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>>>>>> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>>>>>> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>>>>>> demonstrate the whole solution works fine.
>>>>>
>>>>> I am working with NVidia close source driver team to make sure that it works
>>>>> well for them. I am also working on nouveau open source driver for same NVidia
>>>>> hardware thought it will be of less use as what is missing there is a solid
>>>>> open source userspace to leverage this. Nonetheless open source driver are in
>>>>> the work.
>>>>
>>>> Can you point to the nouveau patches? I still find these HMM patches
>>>> un-reviewable without an upstream consumer.
>>>
>>> So i pushed a branch with WIP for nouveau to use HMM:
>>>
>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>>>
>>
>> Nice to see that.
>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> Device memory directly without extra copy.
> 
> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> available today) is on the TODO list. Note that the driver changes for CDM
> are minimal (probably less than 100 lines of code). From the driver point
> of view this is memory and it doesn't matter if it is CDM or not.
> 
> The real burden is on the application developpers who need to update their
> code to leverage this.
> 

Why is it not transparent to the application?
The application just uses system malloc() and doesn't care whether the data
is copied or not.

> 
> Also as a data point you want to avoid CPU access to CDM device memory as
> much as possible. The overhead for single cache line access are high (this
> is PCIE or derivative protocol and it is a packet protocol).
> 

Thank you for the hint, we are going to follow CDM-HMM since HMM has already
been merged upstream.

--
Thanks,
Bob




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-12  1:02                         ` Bob Liu
@ 2017-09-12 16:17                           ` Jerome Glisse
  0 siblings, 0 replies; 43+ messages in thread
From: Jerome Glisse @ 2017-09-12 16:17 UTC (permalink / raw)
  To: Bob Liu
  Cc: Bob Liu, Dan Williams, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Tue, Sep 12, 2017 at 09:02:19AM +0800, Bob Liu wrote:
> On 2017/9/12 7:36, Jerome Glisse wrote:
> > On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >>>>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >>>>>> On 2017/7/20 23:03, Jerome Glisse wrote:
> >>>>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >>>>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
> >>>>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >>>>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >>>>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >>>>>>>>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
> >>>
> >>> [...]
> >>>
> >>>>>>> Second device driver are not integrated that closely within mm and the
> >>>>>>> scheduler kernel code to allow to efficiently plug in device access
> >>>>>>> notification to page (ie to update struct page so that numa worker
> >>>>>>> thread can migrate memory base on accurate informations).
> >>>>>>>
> >>>>>>> Third it can be hard to decide who win between CPU and device access
> >>>>>>> when it comes to updating thing like last CPU id.
> >>>>>>>
> >>>>>>> Fourth there is no such thing like device id ie equivalent of CPU id.
> >>>>>>> If we were to add something the CPU id field in flags of struct page
> >>>>>>> would not be big enough so this can have repercusion on struct page
> >>>>>>> size. This is not an easy sell.
> >>>>>>>
> >>>>>>> They are other issues i can't think of right now. I think for now it
> >>>>>>
> >>>>>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >>>>>> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >>>>>> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >>>>>> demonstrate the whole solution works fine.
> >>>>>
> >>>>> I am working with NVidia close source driver team to make sure that it works
> >>>>> well for them. I am also working on nouveau open source driver for same NVidia
> >>>>> hardware thought it will be of less use as what is missing there is a solid
> >>>>> open source userspace to leverage this. Nonetheless open source driver are in
> >>>>> the work.
> >>>>
> >>>> Can you point to the nouveau patches? I still find these HMM patches
> >>>> un-reviewable without an upstream consumer.
> >>>
> >>> So i pushed a branch with WIP for nouveau to use HMM:
> >>>
> >>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >>>
> >>
> >> Nice to see that.
> >> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >> Device memory directly without extra copy.
> > 
> > Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> > available today) is on the TODO list. Note that the driver changes for CDM
> > are minimal (probably less than 100 lines of code). From the driver point
> > of view this is memory and it doesn't matter if it is CDM or not.
> > 
> > The real burden is on the application developpers who need to update their
> > code to leverage this.
> > 
> 
> Why it's not transparent to application?
> Application just use system malloc() and don't care whether the data is copied or not.

Porting today's software to malloc/mmap is easy and applies to both non-CDM
and CDM hardware.

So malloc/mmap is a given; what I mean is that having a CPU capable of doing
cache coherent access to device memory is a new thing. It never existed
before and thus no one ever thought of how to take advantage of that, ie
there is no existing program designed with that in mind.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-11 23:36                       ` Jerome Glisse
  2017-09-12  1:02                         ` Bob Liu
@ 2017-09-26  9:56                         ` Bob Liu
  2017-09-26 16:16                           ` Jerome Glisse
  1 sibling, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-09-26  9:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Bob Liu, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> > On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> >> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> >> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> >> >> On 2017/7/20 23:03, Jerome Glisse wrote:
>> >> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>> >
>> > [...]
>> >
>> >> >> > Second device driver are not integrated that closely within mm and the
>> >> >> > scheduler kernel code to allow to efficiently plug in device access
>> >> >> > notification to page (ie to update struct page so that numa worker
>> >> >> > thread can migrate memory base on accurate informations).
>> >> >> >
>> >> >> > Third it can be hard to decide who win between CPU and device access
>> >> >> > when it comes to updating thing like last CPU id.
>> >> >> >
>> >> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
>> >> >> > If we were to add something the CPU id field in flags of struct page
>> >> >> > would not be big enough so this can have repercusion on struct page
>> >> >> > size. This is not an easy sell.
>> >> >> >
>> >> >> > They are other issues i can't think of right now. I think for now it
>> >> >>
>> >> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>> >> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
>> >> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
>> >> >> demonstrate the whole solution works fine.
>> >> >
>> >> > I am working with NVidia close source driver team to make sure that it works
>> >> > well for them. I am also working on nouveau open source driver for same NVidia
>> >> > hardware thought it will be of less use as what is missing there is a solid
>> >> > open source userspace to leverage this. Nonetheless open source driver are in
>> >> > the work.
>> >>
>> >> Can you point to the nouveau patches? I still find these HMM patches
>> >> un-reviewable without an upstream consumer.
>> >
>> > So i pushed a branch with WIP for nouveau to use HMM:
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>> >
>>
>> Nice to see that.
>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> Device memory directly without extra copy.
>
> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> available today) is on the TODO list. Note that the driver changes for CDM
> are minimal (probably less than 100 lines of code). From the driver point
> of view this is memory and it doesn't matter if it is CDM or not.
>

It seems we have to migrate/copy memory between system memory and
device memory even in the HMM-CDM solution.
Because device memory is not added into the buddy system, the page fault
for a normal malloc() always allocates memory from system memory!!
If the device then accesses the same virtual address, the data is copied
to device memory.

Correct me if I misunderstand something.
@Balbir, how do you plan to make zero-copy work if using HMM-CDM?

--
Thanks,
Bob


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-26  9:56                         ` Bob Liu
@ 2017-09-26 16:16                           ` Jerome Glisse
  2017-09-30  2:57                             ` Bob Liu
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-09-26 16:16 UTC (permalink / raw)
  To: Bob Liu
  Cc: Dan Williams, Bob Liu, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> > On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >> >> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> >> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
> >> >> >> On 2017/7/20 23:03, Jerome Glisse wrote:
> >> >> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
> >> >> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
> >> >> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> >> >> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
> >> >> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> >> >> >>>>>> On 2017/7/14 5:15, Jerome Glisse wrote:
> >> >
> >> > [...]
> >> >
> >> >> >> > Second device driver are not integrated that closely within mm and the
> >> >> >> > scheduler kernel code to allow to efficiently plug in device access
> >> >> >> > notification to page (ie to update struct page so that numa worker
> >> >> >> > thread can migrate memory base on accurate informations).
> >> >> >> >
> >> >> >> > Third it can be hard to decide who win between CPU and device access
> >> >> >> > when it comes to updating thing like last CPU id.
> >> >> >> >
> >> >> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
> >> >> >> > If we were to add something the CPU id field in flags of struct page
> >> >> >> > would not be big enough so this can have repercusion on struct page
> >> >> >> > size. This is not an easy sell.
> >> >> >> >
> >> >> >> > They are other issues i can't think of right now. I think for now it
> >> >> >>
> >> >> >> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
> >> >> >> I just care about a more complete solution no matter CDM,HMM-CDM or other ways.
> >> >> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full driver to
> >> >> >> demonstrate the whole solution works fine.
> >> >> >
> >> >> > I am working with NVidia close source driver team to make sure that it works
> >> >> > well for them. I am also working on nouveau open source driver for same NVidia
> >> >> > hardware thought it will be of less use as what is missing there is a solid
> >> >> > open source userspace to leverage this. Nonetheless open source driver are in
> >> >> > the work.
> >> >>
> >> >> Can you point to the nouveau patches? I still find these HMM patches
> >> >> un-reviewable without an upstream consumer.
> >> >
> >> > So i pushed a branch with WIP for nouveau to use HMM:
> >> >
> >> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >> >
> >>
> >> Nice to see that.
> >> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >> Device memory directly without extra copy.
> >
> > Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> > available today) is on the TODO list. Note that the driver changes for CDM
> > are minimal (probably less than 100 lines of code). From the driver point
> > of view this is memory and it doesn't matter if it is CDM or not.
> >
> 
> It seems have to migrate/copy memory between system-memory and
> device-memory even in HMM-CDM solution.
> Because device-memory is not added into buddy system, the page fault
> for normal malloc() always allocate memory from system-memory!!
> If the device then access the same virtual address, the data is copied
> to device-memory.
> 
> Correct me if I misunderstand something.
> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?

The device can access system memory, so a copy to the device is _not_
mandatory. Copying data to the device is for performance only, ie the device
driver takes hints from userspace and monitors device activity to decide
which memory should be migrated to device memory to maximize performance.

Moreover, in some previous versions of the HMM patchset we had a helper that
allowed to directly allocate device memory on a device page fault. I intend
to post this helper again. With that helper you can have zero copy when the
device is the first to access the memory.
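
As a rough illustration of what such a helper enables (all names below are
hypothetical, not the actual helper from the earlier patchset), a device
fault handler could back a never-touched address directly with device memory:

/* Hypothetical sketch only: none of these names are real kernel API.
 * On a device page fault for an address the CPU never touched, back the
 * address directly with a device page so no copy from system memory is
 * needed. */
static int dev_handle_fault(struct dev_mirror *mirror, unsigned long addr)
{
        struct page *page;

        page = hmm_devmem_alloc_on_fault(mirror->devmem, mirror->mm, addr);
        if (!page)
                return -ENOMEM; /* or fall back to system memory */

        return dev_map_page(mirror, addr, page);
}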

The plan is to get what we have today working properly with the open source
driver and make it perform well. Once we get some experience with real
workloads we might look into allowing CPU page faults to be directed to
device memory, but at this time I don't think we need this.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-26 16:16                           ` Jerome Glisse
@ 2017-09-30  2:57                             ` Bob Liu
  2017-09-30 22:49                               ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-09-30  2:57 UTC (permalink / raw)
  To: Jerome Glisse, Bob Liu
  Cc: Dan Williams, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On 2017/9/27 0:16, Jerome Glisse wrote:
> On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
>> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
[...]
>>>>> So i pushed a branch with WIP for nouveau to use HMM:
>>>>>
>>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>>>>>
>>>>
>>>> Nice to see that.
>>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>>>> Device memory directly without extra copy.
>>>
>>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
>>> available today) is on the TODO list. Note that the driver changes for CDM
>>> are minimal (probably less than 100 lines of code). From the driver point
>>> of view this is memory and it doesn't matter if it is CDM or not.
>>>
>>
>> It seems have to migrate/copy memory between system-memory and
>> device-memory even in HMM-CDM solution.
>> Because device-memory is not added into buddy system, the page fault
>> for normal malloc() always allocate memory from system-memory!!
>> If the device then access the same virtual address, the data is copied
>> to device-memory.
>>
>> Correct me if I misunderstand something.
>> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
> 
> Device can access system memory so copy to device is _not_ mandatory. Copying
> data to device is for performance only ie the device driver take hint from
> userspace and monitor device activity to decide which memory should be migrated
> to device memory to maximize performance.
> 
> Moreover in some previous version of the HMM patchset we had an helper that

Could you point out which version? I'd like to have a look.

> allowed to directly allocate device memory on device page fault. I intend to
> post this helper again. With that helper you can have zero copy when device
> is the first to access the memory.
> 
> Plan is to get what we have today work properly with the open source driver
> and make it perform well. Once we get some experience with real workload we
> might look into allowing CPU page fault to be directed to device memory but
> at this time i don't think we need this.
> 

For us, we need this feature where a CPU page fault can be directed to
device memory, so that there is no need to copy data from system memory to
device memory.
Do you have any suggestions on the implementation? I'll try to make a
prototype patch.

--
Thanks,
Bob


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-30  2:57                             ` Bob Liu
@ 2017-09-30 22:49                               ` Jerome Glisse
  2017-10-11 13:15                                 ` Bob Liu
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-09-30 22:49 UTC (permalink / raw)
  To: Bob Liu
  Cc: Bob Liu, Dan Williams, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Sat, Sep 30, 2017 at 10:57:38AM +0800, Bob Liu wrote:
> On 2017/9/27 0:16, Jerome Glisse wrote:
> > On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
> >> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> [...]
> >>>>> So i pushed a branch with WIP for nouveau to use HMM:
> >>>>>
> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >>>>>
> >>>>
> >>>> Nice to see that.
> >>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >>>> Device memory directly without extra copy.
> >>>
> >>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> >>> available today) is on the TODO list. Note that the driver changes for CDM
> >>> are minimal (probably less than 100 lines of code). From the driver point
> >>> of view this is memory and it doesn't matter if it is CDM or not.
> >>>
> >>
> >> It seems have to migrate/copy memory between system-memory and
> >> device-memory even in HMM-CDM solution.
> >> Because device-memory is not added into buddy system, the page fault
> >> for normal malloc() always allocate memory from system-memory!!
> >> If the device then access the same virtual address, the data is copied
> >> to device-memory.
> >>
> >> Correct me if I misunderstand something.
> >> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
> > 
> > Device can access system memory so copy to device is _not_ mandatory. Copying
> > data to device is for performance only ie the device driver take hint from
> > userspace and monitor device activity to decide which memory should be migrated
> > to device memory to maximize performance.
> > 
> > Moreover in some previous version of the HMM patchset we had an helper that
> 
> Could you point in which version? I'd like to have a look.

I will need to dig in.

> 
> > allowed to directly allocate device memory on device page fault. I intend to
> > post this helper again. With that helper you can have zero copy when device
> > is the first to access the memory.
> > 
> > Plan is to get what we have today work properly with the open source driver
> > and make it perform well. Once we get some experience with real workload we
> > might look into allowing CPU page fault to be directed to device memory but
> > at this time i don't think we need this.
> > 
> 
> For us, we need this feature that CPU page fault can be direct to device memory.
> So that don't need to copy data from system memory to device memory.
> Do you have any suggestion on the implementation? I'll try to make a prototype patch.

Why do you need that? What is the device and what are the requirements?

Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-09-30 22:49                               ` Jerome Glisse
@ 2017-10-11 13:15                                 ` Bob Liu
  2017-10-12 15:37                                   ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-10-11 13:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, Dan Williams, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Sun, Oct 1, 2017 at 6:49 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Sat, Sep 30, 2017 at 10:57:38AM +0800, Bob Liu wrote:
>> On 2017/9/27 0:16, Jerome Glisse wrote:
>> > On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
>> >> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> >>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> >>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> >>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> >>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> [...]
>> >>>>> So i pushed a branch with WIP for nouveau to use HMM:
>> >>>>>
>> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>> >>>>>
>> >>>>
>> >>>> Nice to see that.
>> >>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> >>>> Device memory directly without extra copy.
>> >>>
>> >>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
>> >>> available today) is on the TODO list. Note that the driver changes for CDM
>> >>> are minimal (probably less than 100 lines of code). From the driver point
>> >>> of view this is memory and it doesn't matter if it is CDM or not.
>> >>>
>> >>
>> >> It seems have to migrate/copy memory between system-memory and
>> >> device-memory even in HMM-CDM solution.
>> >> Because device-memory is not added into buddy system, the page fault
>> >> for normal malloc() always allocate memory from system-memory!!
>> >> If the device then access the same virtual address, the data is copied
>> >> to device-memory.
>> >>
>> >> Correct me if I misunderstand something.
>> >> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
>> >
>> > Device can access system memory so copy to device is _not_ mandatory. Copying
>> > data to device is for performance only ie the device driver take hint from
>> > userspace and monitor device activity to decide which memory should be migrated
>> > to device memory to maximize performance.
>> >
>> > Moreover in some previous version of the HMM patchset we had an helper that
>>
>> Could you point in which version? I'd like to have a look.
>
> I will need to dig in.
>

Thank you.

>>
>> > allowed to directly allocate device memory on device page fault. I intend to
>> > post this helper again. With that helper you can have zero copy when device
>> > is the first to access the memory.
>> >
>> > Plan is to get what we have today work properly with the open source driver
>> > and make it perform well. Once we get some experience with real workload we
>> > might look into allowing CPU page fault to be directed to device memory but
>> > at this time i don't think we need this.
>> >
>>
>> For us, we need this feature that CPU page fault can be direct to device memory.
>> So that don't need to copy data from system memory to device memory.
>> Do you have any suggestion on the implementation? I'll try to make a prototype patch.
>
> Why do you need that ? What is the device and what are the requirement ?
>

You may think of it as a CCIX device or a CAPI device.
The requirement is to eliminate any extra copy.
A typical use case is that malloc() and madvise() allocate from
device memory, then the CPU writes data to device memory directly and
triggers the device to read the data and do its calculation.
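
To make the intent concrete, a rough sketch of the flow we want (the
madvise flag and the compute ioctl below are made-up names used only to
illustrate the idea, they do not exist today):

  buf = malloc(size);
  madvise(buf, size, MADV_PREFER_DEVICE);  /* hint: back range with device memory */
  memcpy(buf, input, size);                /* CPU writes device memory directly */
  ioctl(dev_fd, DEV_START_COMPUTE, &job);  /* device reads/computes in place */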

-- 
Regards,
--Bob


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-10-11 13:15                                 ` Bob Liu
@ 2017-10-12 15:37                                   ` Jerome Glisse
  2017-11-16  2:10                                     ` chet l
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-10-12 15:37 UTC (permalink / raw)
  To: Bob Liu
  Cc: Bob Liu, Dan Williams, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, Michal Hocko, Andrew Morton

On Wed, Oct 11, 2017 at 09:15:57PM +0800, Bob Liu wrote:
> On Sun, Oct 1, 2017 at 6:49 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Sat, Sep 30, 2017 at 10:57:38AM +0800, Bob Liu wrote:
> >> On 2017/9/27 0:16, Jerome Glisse wrote:
> >> > On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
> >> >> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> >>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
> >> >>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> >>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
> >> >>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> [...]
> >> >>>>> So i pushed a branch with WIP for nouveau to use HMM:
> >> >>>>>
> >> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
> >> >>>>>
> >> >>>>
> >> >>>> Nice to see that.
> >> >>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
> >> >>>> Device memory directly without extra copy.
> >> >>>
> >> >>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> >> >>> available today) is on the TODO list. Note that the driver changes for CDM
> >> >>> are minimal (probably less than 100 lines of code). From the driver point
> >> >>> of view this is memory and it doesn't matter if it is CDM or not.
> >> >>>
> >> >>
> >> >> It seems have to migrate/copy memory between system-memory and
> >> >> device-memory even in HMM-CDM solution.
> >> >> Because device-memory is not added into buddy system, the page fault
> >> >> for normal malloc() always allocate memory from system-memory!!
> >> >> If the device then access the same virtual address, the data is copied
> >> >> to device-memory.
> >> >>
> >> >> Correct me if I misunderstand something.
> >> >> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
> >> >
> >> > Device can access system memory so copy to device is _not_ mandatory. Copying
> >> > data to device is for performance only ie the device driver take hint from
> >> > userspace and monitor device activity to decide which memory should be migrated
> >> > to device memory to maximize performance.
> >> >
> >> > Moreover in some previous version of the HMM patchset we had an helper that
> >>
> >> Could you point in which version? I'd like to have a look.
> >
> > I will need to dig in.
> >
> 
> Thank you.

I forgot about this, sorry, I was traveling and I am still catching up. I will send
you those patches once I unearth where I ended up backing them up.

> 
> >>
> >> > allowed to directly allocate device memory on device page fault. I intend to
> >> > post this helper again. With that helper you can have zero copy when device
> >> > is the first to access the memory.
> >> >
> >> > Plan is to get what we have today work properly with the open source driver
> >> > and make it perform well. Once we get some experience with real workload we
> >> > might look into allowing CPU page fault to be directed to device memory but
> >> > at this time i don't think we need this.
> >> >
> >>
> >> For us, we need this feature that CPU page fault can be direct to device memory.
> >> So that don't need to copy data from system memory to device memory.
> >> Do you have any suggestion on the implementation? I'll try to make a prototype patch.
> >
> > Why do you need that ? What is the device and what are the requirement ?
> >
> 
> You may think it as a CCIX device or CAPI device.
> The requirement is eliminate any extra copy.
> A typical usecase/requirement is malloc() and madvise() allocate from
> device memory, then CPU write data to device memory directly and
> trigger device to read the data/do calculation.

I suggest you rely on the device driver userspace API to do a migration after malloc
then. Something like:
  ptr = malloc(size);
  my_device_migrate(ptr, size);

my_device_migrate() would call an ioctl of the device driver, which itself would
migrate memory, or allocate device memory for the range if the pointer returned by
malloc is not yet backed by any pages.
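
As a minimal userspace sketch of such a wrapper (the ioctl number and the
argument struct are hypothetical names for illustration, not an existing
kernel ABI; dev_fd is assumed to be the device file opened elsewhere):

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/ioctl.h>

  struct my_dev_migrate_args {
          uint64_t addr;   /* start of the virtual range */
          uint64_t size;   /* length of the range in bytes */
  };

  #define MY_DEV_IOC_MIGRATE _IOW('x', 0x01, struct my_dev_migrate_args)

  extern int dev_fd;       /* e.g. open("/dev/mydev", O_RDWR) done at init */

  static int my_device_migrate(void *ptr, size_t size)
  {
          struct my_dev_migrate_args args = {
                  .addr = (uint64_t)(uintptr_t)ptr,
                  .size = size,
          };

          /* The driver either migrates existing pages to device memory or,
           * if the range is not yet backed by any page, allocates device
           * memory so the first touch lands there. */
          return ioctl(dev_fd, MY_DEV_IOC_MIGRATE, &args);
  }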

There have been several discussions already about madvise/mbind/set_mempolicy/
move_pages, and at this time I don't think we want to add or change any of them to
understand device memory. My personal opinion is that we first need enough upstream
users, and an understanding of how it is actually used, before it makes sense to try
to formalize and define a syscall or change an existing one. User-facing APIs are
set in stone and I don't want to design them by making broad assumptions on how I
think device memory will be used.

So for the time being I think it is better to use the existing device API to manage
memory and give the kernel hints on where memory should be (ie should device memory
be used for some range). The first users of this are GPUs, and they already have a
lot of ioctls to manage and propagate hints from user space. So at this time I
suggest that you piggy-back on an existing ioctl of your device or add a new one.

Hope this helps.
Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-10-12 15:37                                   ` Jerome Glisse
@ 2017-11-16  2:10                                     ` chet l
  2017-11-16  2:44                                       ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: chet l @ 2017-11-16  2:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, Bob Liu, Dan Williams, linux-kernel, Linux MM,
	John Hubbard, David Nellans, Balbir Singh, Michal Hocko,
	Andrew Morton

>> You may think it as a CCIX device or CAPI device.
>> The requirement is eliminate any extra copy.
>> A typical usecase/requirement is malloc() and madvise() allocate from
>> device memory, then CPU write data to device memory directly and
>> trigger device to read the data/do calculation.
>
> I suggest you rely on the device driver userspace API to do a migration after malloc
> then. Something like:
>   ptr = malloc(size);
>   my_device_migrate(ptr, size);
>
> Which would call an ioctl of the device driver which itself would migrate memory or
> allocate device memory for the range if pointer return by malloc is not yet back by
> any pages.
>

So for CCIX, I don't think there is going to be an inline device
driver that would allocate any memory for you. The expansion memory
will become part of system memory as part of the boot process. So,
if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
total system memory will be 260GB.

Assume that the mm is taught to mark/anoint the ZONE_DEVICE (or
ZONE_XXX) range from 256GB to 260GB. Then, for kmalloc the mm won't use
the ZONE_DEV range, but for a malloc it will/can use that range.


> There has been several discussions already about madvise/mbind/set_mempolicy/
> move_pages and at this time i don't think we want to add or change any of them to
> understand device memory. My personal opinion is that we first need to have enough

We will visit these APIs when we are closer to building exotic
CCIX devices. The plan is to present/express the CCIX proximity
attributes just like a NUMA node-proximity attribute today. That way
there would be minimal disruption to the existing OS ecosystem.



Chetan


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-11-16  2:10                                     ` chet l
@ 2017-11-16  2:44                                       ` Jerome Glisse
  2017-11-16  3:23                                         ` chetan L
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-11-16  2:44 UTC (permalink / raw)
  To: chet l
  Cc: Bob Liu, Bob Liu, Dan Williams, linux-kernel, Linux MM,
	John Hubbard, David Nellans, Balbir Singh, Michal Hocko,
	Andrew Morton

On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
> >> You may think it as a CCIX device or CAPI device.
> >> The requirement is eliminate any extra copy.
> >> A typical usecase/requirement is malloc() and madvise() allocate from
> >> device memory, then CPU write data to device memory directly and
> >> trigger device to read the data/do calculation.
> >
> > I suggest you rely on the device driver userspace API to do a migration after malloc
> > then. Something like:
> >   ptr = malloc(size);
> >   my_device_migrate(ptr, size);
> >
> > Which would call an ioctl of the device driver which itself would migrate memory or
> > allocate device memory for the range if pointer return by malloc is not yet back by
> > any pages.
> >
> 
> So for CCIX, I don't think there is going to be an inline device
> driver that would allocate any memory for you. The expansion memory
> will become part of the system memory as part of the boot process. So,
> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
> total system mem will be 260GB.
> 
> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
> the ZONE_DEV range. But for a malloc, it will/can use that range.

HMM ZONE_DEVICE memory would work with that; you just need to teach the
platform to identify this memory zone and not hotplug it. Again, you
should rely on a specific device driver API to allocate this memory.
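
As a rough driver-side sketch of what that registration could look like
(using the CDM hotplug helper from patch 3 of this series; the callback
bodies, the "mydev" names and the resource describing the coherent range
are placeholders, and the exact signatures may differ from what finally
lands):

  static void mydev_page_free(struct hmm_devmem *devmem, struct page *page)
  {
          /* hand the page back to the driver's own allocator */
  }

  static int mydev_fault(struct hmm_devmem *devmem, struct vm_area_struct *vma,
                         unsigned long addr, const struct page *page,
                         unsigned int flags, pmd_t *pmdp)
  {
          /* not expected for coherent (public) memory, CPU maps it directly */
          return VM_FAULT_SIGBUS;
  }

  static const struct hmm_devmem_ops mydev_devmem_ops = {
          .free  = mydev_page_free,
          .fault = mydev_fault,
  };

  /* res describes the device's coherent memory range (from firmware/bus) */
  devmem = hmm_devmem_add_resource(&mydev_devmem_ops, &pdev->dev, res);
  if (IS_ERR(devmem))
          return PTR_ERR(devmem);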

> > There has been several discussions already about madvise/mbind/set_mempolicy/
> > move_pages and at this time i don't think we want to add or change any of them to
> > understand device memory. My personal opinion is that we first need to have enough
> 
> We will visit these APIs when we are more closer to building exotic
> CCIX devices. And the plan is to present/express the CCIX proximity
> attributes just like a NUMA node-proximity attribute today. That way
> there would be minimal disruptions to the existing OS ecosystem.

NUMA has been rejected previously, see the CDM/CAPI threads, so I don't see
it being accepted for CCIX either. My belief is that we want to hide this
inside the device driver, and only once we see multiple devices all doing the
same kind of thing should we move toward building something generic that
caters to CCIX devices.

Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-11-16  2:44                                       ` Jerome Glisse
@ 2017-11-16  3:23                                         ` chetan L
  2017-11-16  3:29                                           ` chetan L
  0 siblings, 1 reply; 43+ messages in thread
From: chetan L @ 2017-11-16  3:23 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, Bob Liu, Dan Williams, linux-kernel, Linux MM,
	John Hubbard, David Nellans, Balbir Singh, Michal Hocko,
	Andrew Morton, linux-accelerators, Jonathan.Cameron

CC'ing : linux-accelerators@vger.kernel.org

On Wed, Nov 15, 2017 at 6:44 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
>> >> You may think it as a CCIX device or CAPI device.
>> >> The requirement is eliminate any extra copy.
>> >> A typical usecase/requirement is malloc() and madvise() allocate from
>> >> device memory, then CPU write data to device memory directly and
>> >> trigger device to read the data/do calculation.
>> >
>> > I suggest you rely on the device driver userspace API to do a migration after malloc
>> > then. Something like:
>> >   ptr = malloc(size);
>> >   my_device_migrate(ptr, size);
>> >
>> > Which would call an ioctl of the device driver which itself would migrate memory or
>> > allocate device memory for the range if pointer return by malloc is not yet back by
>> > any pages.
>> >
>>
>> So for CCIX, I don't think there is going to be an inline device
>> driver that would allocate any memory for you. The expansion memory
>> will become part of the system memory as part of the boot process. So,
>> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
>> total system mem will be 260GB.
>>
>> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
>> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
>> the ZONE_DEV range. But for a malloc, it will/can use that range.
>
> HMM zone device memory would work with that, you just need to teach the
> platform to identify this memory zone and not hotplug it. Again you
> should rely on specific device driver API to allocate this memory.
>

@Jerome - a new linux-accelerators list has just been created. I have
CC'd that list since we have overlapping interests w.r.t. CCIX.

I cannot comment on surprise add/remove as of now ... will cross that
bridge later.


>> > There has been several discussions already about madvise/mbind/set_mempolicy/
>> > move_pages and at this time i don't think we want to add or change any of them to
>> > understand device memory. My personal opinion is that we first need to have enough
>>
>> We will visit these APIs when we are more closer to building exotic
>> CCIX devices. And the plan is to present/express the CCIX proximity
>> attributes just like a NUMA node-proximity attribute today. That way
>> there would be minimal disruptions to the existing OS ecosystem.
>
> NUMA have been rejected previously see CDM/CAPI threads. So i don't see
> it being accepted for CCIX either. My belief is that we want to hide this
> inside device driver and only once we see multiple devices all doing the
> same kind of thing we should move toward building something generic that
> catter to CCIX devices.


Thanks for pointing out the NUMA thingy. I will visit the CDM/CAPI
threads to understand what was discussed before commenting further.


Chetan


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-11-16  3:23                                         ` chetan L
@ 2017-11-16  3:29                                           ` chetan L
  2017-11-16 21:29                                             ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: chetan L @ 2017-11-16  3:29 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, Bob Liu, Dan Williams, linux-kernel, Linux MM,
	John Hubbard, David Nellans, Balbir Singh, Michal Hocko,
	Andrew Morton, Jonathan.Cameron, linux-accelerators

On Wed, Nov 15, 2017 at 7:23 PM, chetan L <loke.chetan@gmail.com> wrote:
> CC'ing : linux-accelerators@vger.kernel.org
>

Sorry, CC'ing the correct list this time: linux-accelerators@lists.ozlabs.org



> On Wed, Nov 15, 2017 at 6:44 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
>>> >> You may think it as a CCIX device or CAPI device.
>>> >> The requirement is eliminate any extra copy.
>>> >> A typical usecase/requirement is malloc() and madvise() allocate from
>>> >> device memory, then CPU write data to device memory directly and
>>> >> trigger device to read the data/do calculation.
>>> >
>>> > I suggest you rely on the device driver userspace API to do a migration after malloc
>>> > then. Something like:
>>> >   ptr = malloc(size);
>>> >   my_device_migrate(ptr, size);
>>> >
>>> > Which would call an ioctl of the device driver which itself would migrate memory or
>>> > allocate device memory for the range if pointer return by malloc is not yet back by
>>> > any pages.
>>> >
>>>
>>> So for CCIX, I don't think there is going to be an inline device
>>> driver that would allocate any memory for you. The expansion memory
>>> will become part of the system memory as part of the boot process. So,
>>> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
>>> total system mem will be 260GB.
>>>
>>> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
>>> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
>>> the ZONE_DEV range. But for a malloc, it will/can use that range.
>>
>> HMM zone device memory would work with that, you just need to teach the
>> platform to identify this memory zone and not hotplug it. Again you
>> should rely on specific device driver API to allocate this memory.
>>
>
> @Jerome - a new linux-accelerator's list has just been created. I have
> CC'd that list since we have overlapping interests w.r.t CCIX.
>
> I cannot comment on surprise add/remove as of now ... will cross the
> bridge later.
>
>
>>> > There has been several discussions already about madvise/mbind/set_mempolicy/
>>> > move_pages and at this time i don't think we want to add or change any of them to
>>> > understand device memory. My personal opinion is that we first need to have enough
>>>
>>> We will visit these APIs when we are more closer to building exotic
>>> CCIX devices. And the plan is to present/express the CCIX proximity
>>> attributes just like a NUMA node-proximity attribute today. That way
>>> there would be minimal disruptions to the existing OS ecosystem.
>>
>> NUMA have been rejected previously see CDM/CAPI threads. So i don't see
>> it being accepted for CCIX either. My belief is that we want to hide this
>> inside device driver and only once we see multiple devices all doing the
>> same kind of thing we should move toward building something generic that
>> catter to CCIX devices.
>
>
> Thanks for pointing out the NUMA thingy. I will visit the CDM/CAPI
> threads to understand what was discussed before commenting further.
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-11-16  3:29                                           ` chetan L
@ 2017-11-16 21:29                                             ` Jerome Glisse
  2017-11-16 22:41                                               ` chetan L
  0 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-11-16 21:29 UTC (permalink / raw)
  To: chetan L
  Cc: Bob Liu, David Nellans, John Hubbard, Balbir Singh, linux-kernel,
	Michal Hocko, Linux MM, Dan Williams, Andrew Morton,
	linux-accelerators

On Wed, Nov 15, 2017 at 07:29:10PM -0800, chetan L wrote:
> On Wed, Nov 15, 2017 at 7:23 PM, chetan L <loke.chetan@gmail.com> wrote:
> > On Wed, Nov 15, 2017 at 6:44 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> On Wed, Nov 15, 2017 at 06:10:08PM -0800, chet l wrote:
> >>> >> You may think it as a CCIX device or CAPI device.
> >>> >> The requirement is eliminate any extra copy.
> >>> >> A typical usecase/requirement is malloc() and madvise() allocate from
> >>> >> device memory, then CPU write data to device memory directly and
> >>> >> trigger device to read the data/do calculation.
> >>> >
> >>> > I suggest you rely on the device driver userspace API to do a migration after malloc
> >>> > then. Something like:
> >>> >   ptr = malloc(size);
> >>> >   my_device_migrate(ptr, size);
> >>> >
> >>> > Which would call an ioctl of the device driver which itself would migrate memory or
> >>> > allocate device memory for the range if pointer return by malloc is not yet back by
> >>> > any pages.
> >>> >
> >>>
> >>> So for CCIX, I don't think there is going to be an inline device
> >>> driver that would allocate any memory for you. The expansion memory
> >>> will become part of the system memory as part of the boot process. So,
> >>> if the host DDR is 256GB and the CCIX expansion memory is 4GB, the
> >>> total system mem will be 260GB.
> >>>
> >>> Assume that the 'mm' is taught to mark/anoint the ZONE_DEVICE(or
> >>> ZONE_XXX) range from 256 to 260 GB. Then, for kmalloc it(mm) won't use
> >>> the ZONE_DEV range. But for a malloc, it will/can use that range.
> >>
> >> HMM zone device memory would work with that, you just need to teach the
> >> platform to identify this memory zone and not hotplug it. Again you
> >> should rely on specific device driver API to allocate this memory.
> >>
> >
> > @Jerome - a new linux-accelerator's list has just been created. I have
> > CC'd that list since we have overlapping interests w.r.t CCIX.
> >
> > I cannot comment on surprise add/remove as of now ... will cross the
> > bridge later.

Note that this is not hotplug strictly speaking. The design today is that it
is the device driver that registers the memory. From the kernel's point of view
this is a hotplug, but for many of the target architectures there is no
real hotplug, ie the device and its memory were present at boot time.

Like I said, I think for now we are better off having each device manage and
register its memory. HMM provides a toolbox for that. If we see a common trend
across multiple devices then we can think about making something more
generic.


For the NUMA discussion, this is related to CPU-less nodes, ie not wanting
to add any more CPU-less nodes (nodes with only memory), and there are other
aspects too. For instance, you do not necessarily have good information
from the device to know if a page is accessed a lot by the device (this
kind of information is often only accessible by the device driver). Thus
automatic NUMA placement is useless here. Not to mention that for it
to work we would need to change how it currently works (IIRC there is an
issue when you do not have a CPU id you can use).

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-11-16 21:29                                             ` Jerome Glisse
@ 2017-11-16 22:41                                               ` chetan L
  2017-11-16 23:11                                                 ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: chetan L @ 2017-11-16 22:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, David Nellans, John Hubbard, Balbir Singh, linux-kernel,
	Michal Hocko, Linux MM, Dan Williams, Andrew Morton,
	linux-accelerators

On Thu, Nov 16, 2017 at 1:29 PM, Jerome Glisse <jglisse@redhat.com> wrote:

>
> For the NUMA discussion this is related to CPU less node ie not wanting
> to add any more CPU less node (node with only memory) and they are other
> aspect too. For instance you do not necessarily have good informations
> from the device to know if a page is access a lot by the device (this
> kind of information is often only accessible by the device driver). Thus

@Jerome - one comment w.r.t. 'do not necessarily have good info on
device access'.

You could be assuming a few things here :). CCIX extends the CPU
complex's coherency domain (it is now a single/unified coherency
domain). The CCIX-EP (let's say an accelerator/XPU or a NIC or a combo)
is now a true peer w.r.t. the host NUMA node(s) (aka a first-class
citizen). I don't know how much info was revealed at the latest ARM
TechCon where CCIX was presented, so I cannot divulge any further
details until I see that slide deck. However, you can safely assume
that the host will have *all* the info w.r.t. device access and
vice versa.

Chetan


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
  2017-11-16 22:41                                               ` chetan L
@ 2017-11-16 23:11                                                 ` Jerome Glisse
  0 siblings, 0 replies; 43+ messages in thread
From: Jerome Glisse @ 2017-11-16 23:11 UTC (permalink / raw)
  To: chetan L
  Cc: Bob Liu, David Nellans, John Hubbard, Balbir Singh, linux-kernel,
	Michal Hocko, Linux MM, Dan Williams, Andrew Morton,
	linux-accelerators

On Thu, Nov 16, 2017 at 02:41:39PM -0800, chetan L wrote:
> On Thu, Nov 16, 2017 at 1:29 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> 
> >
> > For the NUMA discussion this is related to CPU less node ie not wanting
> > to add any more CPU less node (node with only memory) and they are other
> > aspect too. For instance you do not necessarily have good informations
> > from the device to know if a page is access a lot by the device (this
> > kind of information is often only accessible by the device driver). Thus
> 
> @Jerome - one comment w.r.t 'do not necessarily have good info on
> device access'.
> 
> So you could be assuming a few things here :). CCIX extends the CPU
> complex's coherency domain(it is now a single/unified coherency
> domain). The CCIX-EP (lets say an accelerator/XPU or a NIC or a combo)
> is now a true peer w.r.t the host-numa-node(s) (aka 1st class
> citizen). I don't know how much info was revealed at the latest ARM
> techcon where CCIX was presented. So I cannot divulge any further
> details until I see that slide deck. However, you can safely assume
> that the host will have *all* the info w.r.t the device-access and
> vice-versa.

I do have access to CCIX. The last time I read the draft, a few months ago,
my understanding was that there is no mechanism to differentiate between
devices behind the root complex. So when you do autonuma you don't know
which of your CCIX devices is the one faulting, hence you cannot keep
track of that inside struct page for autonuma (ignoring the issue with
the lack of a CPU id for each device).

This is what I mean by NUMA not being a good fit as it is. Yes, everything
is cache coherent and all, but that is just a small part of what is
needed to make autonuma, as it exists today, work.

Jerome


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-11-16 23:12 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-13 21:15 [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Jérôme Glisse
2017-07-13 21:15 ` [PATCH 1/6] mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST Jérôme Glisse
2017-07-17  9:09   ` Balbir Singh
2017-07-13 21:15 ` [PATCH 2/6] mm/device-public-memory: device memory cache coherent with CPU v4 Jérôme Glisse
2017-07-13 23:01   ` Balbir Singh
2017-07-13 21:15 ` [PATCH 3/6] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
2017-07-13 21:15 ` [PATCH 4/6] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
2017-07-17  9:10   ` Balbir Singh
2017-07-13 21:15 ` [PATCH 5/6] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC v3 Jérôme Glisse
2017-07-17  9:15   ` Balbir Singh
2017-07-13 21:15 ` [PATCH 6/6] mm/hmm: documents how device memory is accounted in rss and memcg Jérôme Glisse
2017-07-14 13:26   ` Michal Hocko
2017-07-18  3:26 ` [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5 Bob Liu
2017-07-18 15:38   ` Jerome Glisse
2017-07-19  1:46     ` Bob Liu
2017-07-19  2:25       ` Jerome Glisse
2017-07-19  9:09         ` Bob Liu
2017-07-20 15:03           ` Jerome Glisse
2017-07-21  1:15             ` Bob Liu
2017-07-21  1:41               ` Jerome Glisse
2017-07-21  2:10                 ` Bob Liu
2017-07-21 12:01                   ` Bob Liu
2017-07-21 15:21                     ` Jerome Glisse
2017-07-21  3:48                 ` Dan Williams
2017-07-21 15:22                   ` Jerome Glisse
2017-09-05 19:36                   ` Jerome Glisse
2017-09-09 23:22                     ` Bob Liu
2017-09-11 23:36                       ` Jerome Glisse
2017-09-12  1:02                         ` Bob Liu
2017-09-12 16:17                           ` Jerome Glisse
2017-09-26  9:56                         ` Bob Liu
2017-09-26 16:16                           ` Jerome Glisse
2017-09-30  2:57                             ` Bob Liu
2017-09-30 22:49                               ` Jerome Glisse
2017-10-11 13:15                                 ` Bob Liu
2017-10-12 15:37                                   ` Jerome Glisse
2017-11-16  2:10                                     ` chet l
2017-11-16  2:44                                       ` Jerome Glisse
2017-11-16  3:23                                         ` chetan L
2017-11-16  3:29                                           ` chetan L
2017-11-16 21:29                                             ` Jerome Glisse
2017-11-16 22:41                                               ` chetan L
2017-11-16 23:11                                                 ` Jerome Glisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).