linux-kernel.vger.kernel.org archive mirror
* [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
@ 2017-06-14 20:11 Jérôme Glisse
  2017-06-14 20:11 ` [HMM-CDM 1/5] mm/device-public-memory: device memory cache coherent with CPU Jérôme Glisse
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Jérôme Glisse, cgroups,
	Dan Williams, Ross Zwisler, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

Cache coherent device memory applies to architectures with a system
bus like CAPI or CCIX. A device connected to such a bus can expose
its memory to the system and allow cache coherent access to it
from the CPU.

Even if, for all intents and purposes, device memory behaves like
regular memory, we still want to manage it in isolation from regular
memory. There are several reasons for that. First and foremost, this
memory is less reliable than regular memory: if the device hangs
because of invalid commands, we can lose access to device memory.
Second, CPU access to this memory is expected to be slower than
access to regular memory. Third, having random memory in the device
means that some of the bus bandwidth would not be available to the
device but would instead be used by CPU accesses.

This is why we want to manage such memory in isolation from regular
memory. The kernel should not try to use this memory as a last
resort when running out of memory, at least for now.

This patchset adds a new type of ZONE_DEVICE memory (DEVICE_PUBLIC)
that is used to represent CDM memory. It builds on top of the HMM
patchset, which already introduced a new type of ZONE_DEVICE memory
for private device memory (see the HMM patchset).

The end result is that, with this patchset, if a device is in use in
a process you might have private anonymous memory or file backed
page memory using ZONE_DEVICE (MEMORY_DEVICE_PUBLIC). Thus care must
be taken not to overwrite the lru fields of such pages.

Hence all core mm changes are done to address the assumption that any
process memory is backed by a regular struct page that is part of
the lru. ZONE_DEVICE pages are not on the lru, and the lru pointers
of struct page are used to store device specific information.

Thus this patchset updates all code paths that would make
assumptions about the lru-ness of a process page.

patch 01 - deals with all the core mm functions
patch 02 - add a helper to HMM for hotplug of CDM memory
patch 03 - preparatory patch for memory controller changes
patch 04 - update memory controller to properly handle
           ZONE_DEVICE pages when uncharging
patch 05 - kernel configuration updates and cleanup

git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v2


Cc: cgroups@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Jérôme Glisse (5):
  mm/device-public-memory: device memory cache coherent with CPU
  mm/hmm: add new helper to hotplug CDM memory region
  mm/memcontrol: allow to uncharge page without using page->lru field
  mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
  mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64

 fs/proc/task_mmu.c       |   2 +-
 include/linux/hmm.h      |   7 +-
 include/linux/ioport.h   |   1 +
 include/linux/memremap.h |  21 +++++
 include/linux/mm.h       |  16 ++--
 kernel/memremap.c        |  13 ++-
 mm/Kconfig               |  30 +++----
 mm/gup.c                 |   7 ++
 mm/hmm.c                 |  89 +++++++++++++++++--
 mm/madvise.c             |   2 +-
 mm/memcontrol.c          | 226 ++++++++++++++++++++++++++++++-----------------
 mm/memory.c              |  46 ++++++++--
 mm/migrate.c             |  60 ++++++++-----
 mm/swap.c                |  11 +++
 14 files changed, 389 insertions(+), 142 deletions(-)

-- 
2.9.3


* [HMM-CDM 1/5] mm/device-public-memory: device memory cache coherent with CPU
  2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
@ 2017-06-14 20:11 ` Jérôme Glisse
  2017-06-14 20:11 ` [HMM-CDM 2/5] mm/hmm: add new helper to hotplug CDM memory region Jérôme Glisse
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Jérôme Glisse,
	Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt, Dan Williams, Ross Zwisler

Platforms with an advanced system bus (like CAPI or CCIX) allow
device memory to be accessible from the CPU in a cache coherent
fashion. Add a new type of ZONE_DEVICE to represent such memory.
The use cases are the same as for the un-addressable device memory,
but without all the corner cases.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c       |  2 +-
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h | 21 +++++++++++++++++
 include/linux/mm.h       | 16 ++++++++-----
 kernel/memremap.c        | 11 ++++++---
 mm/Kconfig               | 13 +++++++++++
 mm/gup.c                 |  7 ++++++
 mm/madvise.c             |  2 +-
 mm/memory.c              | 46 +++++++++++++++++++++++++++++++++----
 mm/migrate.c             | 60 ++++++++++++++++++++++++++++++++----------------
 mm/swap.c                | 11 +++++++++
 11 files changed, 154 insertions(+), 36 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 90b2fa4..78b1eca 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1187,7 +1187,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
-		page = vm_normal_page(vma, addr, pte);
+		page = _vm_normal_page(vma, addr, pte, true);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 	} else if (is_swap_pte(pte)) {
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 3a4f691..f5cf32e 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -131,6 +131,7 @@ enum {
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
 	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
+	IORES_DESC_DEVICE_PUBLIC_MEMORY		= 7,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index d6909ea..c8506d3 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -56,10 +56,18 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * page must be treated as an opaque object, rather than a "normal" struct page.
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.txt.
+ *
+ * MEMORY_DEVICE_PUBLIC:
+ * Device memory that is cache coherent from device and CPU point of view.
+ * This is used on platforms that have an advanced system bus (like CAPI
+ * or CCIX). A driver can hotplug device memory using ZONE_DEVICE with
+ * that memory type. Any page of a process can be migrated to such memory.
+ * However no one should be allowed to pin it, so it can always be evicted.
  */
 enum memory_type {
 	MEMORY_DEVICE_PERSISTENT = 0,
 	MEMORY_DEVICE_PRIVATE,
+	MEMORY_DEVICE_PUBLIC,
 };
 
 /*
@@ -91,6 +99,8 @@ enum memory_type {
  * The page_free() callback is called once the page refcount reaches 1
  * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
  * This allows the device driver to implement its own memory management.)
+ *
+ * For MEMORY_DEVICE_PUBLIC only the page_free() callback matters.
  */
 typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
 				unsigned long addr,
@@ -133,6 +143,12 @@ static inline bool is_device_private_page(const struct page *page)
 	return is_zone_device_page(page) &&
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -156,6 +172,11 @@ static inline bool is_device_private_page(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e74a183..d267482 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -797,14 +797,15 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page);
+void put_zone_device_private_or_public_page(struct page *page);
 #else
-static inline void put_zone_device_private_page(struct page *page)
+static inline void put_zone_device_private_or_public_page(struct page *page)
 {
 }
 #endif
 
 static inline bool is_device_private_page(const struct page *page);
+static inline bool is_device_public_page(const struct page *page);
 
 DECLARE_STATIC_KEY_FALSE(device_private_key);
 
@@ -830,8 +831,9 @@ static inline void put_page(struct page *page)
 	 * include/linux/memremap.h and HMM for details.
 	 */
 	if (static_branch_unlikely(&device_private_key) &&
-	    unlikely(is_device_private_page(page))) {
-		put_zone_device_private_page(page);
+	    unlikely(is_device_private_page(page) ||
+		     is_device_public_page(page))) {
+		put_zone_device_private_or_public_page(page);
 		return;
 	}
 
@@ -1220,8 +1222,10 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 };
 
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-		pte_t pte);
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device);
+#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index e82456c..da74775 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -466,7 +466,7 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 
 
 #ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page)
+void put_zone_device_private_or_public_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -474,10 +474,15 @@ void put_zone_device_private_page(struct page *page)
 	 * If refcount is 1 then page is freed and refcount is stable as nobody
 	 * holds a reference on the page.
 	 */
-	if (count == 1)
+	if (count == 1) {
+		/* Clear Active bit in case of parallel mark_page_accessed */
+		__ClearPageActive(page);
+		__ClearPageWaiters(page);
+
 		page->pgmap->page_free(page, page->pgmap->data);
+	}
 	else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_page);
+EXPORT_SYMBOL(put_zone_device_private_or_public_page);
 #endif /* CONFIG_DEVICE_PRIVATE */
diff --git a/mm/Kconfig b/mm/Kconfig
index dd11b4d..ad082b9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -731,6 +731,19 @@ config DEVICE_PRIVATE
 	  memory; i.e., memory that is only accessible from the device (or
 	  group of devices).
 
+config DEVICE_PUBLIC
+	bool "Addressable device memory (GPU memory, ...)"
+	depends on X86_64
+	depends on ZONE_DEVICE
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on SPARSEMEM_VMEMMAP
+
+	help
+	  Allows creation of struct pages to represent addressable device
+	  memory; i.e., memory that is accessible from both the device and
+	  the CPU.
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/gup.c b/mm/gup.c
index b73e426..d771850 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -438,6 +438,13 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
 			goto unmap;
 		*page = pte_page(*pte);
+
+		/*
+		 * This should never happen (a device public page in the gate
+		 * area).
+		 */
+		if (is_device_public_page(*page))
+			goto unmap;
 	}
 	get_page(*page);
 out:
diff --git a/mm/madvise.c b/mm/madvise.c
index 8eda184..e66ef0a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -343,7 +343,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}
 
-		page = vm_normal_page(vma, addr, ptent);
+		page = _vm_normal_page(vma, addr, ptent, true);
 		if (!page)
 			continue;
 
diff --git a/mm/memory.c b/mm/memory.c
index 471ed99..e7f0a4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -789,8 +789,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-				pte_t pte)
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device)
 {
 	unsigned long pfn = pte_pfn(pte);
 
@@ -801,8 +801,31 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
-		if (!is_zero_pfn(pfn))
-			print_bad_pte(vma, addr, pte, NULL);
+		if (is_zero_pfn(pfn))
+			return NULL;
+
+		/*
+		 * Device public pages are special pages (they are ZONE_DEVICE
+		 * pages but different from persistent memory). They behave
+		 * almost like normal pages. The difference is that they are
+		 * not on the lru and thus should never be involved with any-
+		 * thing that involves lru manipulation (mlock, numa balancing,
+		 * ...).
+		 *
+		 * This is why we still want to return NULL for such pages from
+		 * vm_normal_page() so that we do not have to special case all
+		 * call sites of vm_normal_page().
+		 */
+		if (likely(pfn < highest_memmap_pfn)) {
+			struct page *page = pfn_to_page(pfn);
+
+			if (is_device_public_page(page)) {
+				if (with_public_device)
+					return page;
+				return NULL;
+			}
+		}
+		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
 
@@ -983,6 +1006,19 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
+	} else if (pte_devmap(pte)) {
+		page = pte_page(pte);
+
+		/*
+		 * Cache coherent device memory behaves like regular pages and
+		 * not like persistent memory pages. For more information see
+		 * MEMORY_DEVICE_PUBLIC in include/linux/memremap.h.
+		 */
+		if (is_device_public_page(page)) {
+			get_page(page);
+			page_dup_rmap(page, false);
+			rss[mm_counter(page)]++;
+		}
 	}
 
 out_set_pte:
@@ -1236,7 +1272,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (pte_present(ptent)) {
 			struct page *page;
 
-			page = vm_normal_page(vma, addr, ptent);
+			page = _vm_normal_page(vma, addr, ptent, true);
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
diff --git a/mm/migrate.c b/mm/migrate.c
index 729f341..9c3c323 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -229,12 +229,19 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
-		if (unlikely(is_zone_device_page(new)) &&
-		    is_device_private_page(new)) {
-			entry = make_device_private_entry(new, pte_write(pte));
-			pte = swp_entry_to_pte(entry);
-			if (pte_swp_soft_dirty(*pvmw.pte))
-				pte = pte_mksoft_dirty(pte);
+		if (unlikely(is_zone_device_page(new))) {
+			if (is_device_private_page(new)) {
+				entry = make_device_private_entry(new, pte_write(pte));
+				pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(*pvmw.pte))
+					pte = pte_mksoft_dirty(pte);
+			}
+#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+			else if (is_device_public_page(new)) {
+				pte = pte_mkdevmap(pte);
+				flush_dcache_page(new);
+			}
+#endif /* IS_ENABLED(CONFIG_DEVICE_PUBLIC) */
 		} else
 			flush_dcache_page(new);
 
@@ -408,12 +415,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	void **pslot;
 
 	/*
-	 * ZONE_DEVICE pages have 1 refcount always held by their device
-	 *
-	 * Note that DAX memory will never reach that point as it does not have
-	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 * Device public or private pages have an extra refcount as they are
+	 * ZONE_DEVICE pages.
 	 */
-	expected_count += is_zone_device_page(page);
+	expected_count += is_device_private_page(page);
+	expected_count += is_device_public_page(page);
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -2098,7 +2104,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 #endif /* CONFIG_NUMA */
 
-
 struct migrate_vma {
 	struct vm_area_struct	*vma;
 	unsigned long		*dst;
@@ -2177,7 +2182,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_write_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
-			page = vm_normal_page(migrate->vma, addr, pte);
+			page = _vm_normal_page(migrate->vma, addr, pte, true);
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
@@ -2302,13 +2307,18 @@ static bool migrate_vma_check_page(struct page *page)
 
 	/* Page from ZONE_DEVICE have one extra reference */
 	if (is_zone_device_page(page)) {
-		if (is_device_private_page(page)) {
+		if (is_device_private_page(page) ||
+		    is_device_public_page(page))
 			extra++;
-		} else
+		else
 			/* Other ZONE_DEVICE memory type are not supported */
 			return false;
 	}
 
+	/* For file backed pages */
+	if (page_mapping(page))
+		extra += 1 + page_has_private(page);
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2532,11 +2542,21 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	 */
 	__SetPageUptodate(page);
 
-	if (is_zone_device_page(page) && is_device_private_page(page)) {
-		swp_entry_t swp_entry;
+	if (is_zone_device_page(page)) {
+		if (is_device_private_page(page)) {
+			swp_entry_t swp_entry;
 
-		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
-		entry = swp_entry_to_pte(swp_entry);
+			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+			entry = swp_entry_to_pte(swp_entry);
+		}
+#if IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+		else if (is_device_public_page(page)) {
+			entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+			if (vma->vm_flags & VM_WRITE)
+				entry = pte_mkwrite(pte_mkdirty(entry));
+			entry = pte_mkdevmap(entry);
+		}
+#endif /* IS_ENABLED(CONFIG_DEVICE_PUBLIC) */
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
@@ -2623,7 +2643,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 					continue;
 				}
-			} else {
+			} else if (!is_device_public_page(newpage)) {
 				/*
 				 * Other types of ZONE_DEVICE page are not
 				 * supported.
diff --git a/mm/swap.c b/mm/swap.c
index 4f44dbd..212370d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -760,6 +760,17 @@ void release_pages(struct page **pages, int nr, bool cold)
 		if (is_huge_zero_page(page))
 			continue;
 
+		/* A device public page cannot be a huge page */
+		if (is_device_public_page(page)) {
+			if (locked_pgdat) {
+				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+						       flags);
+				locked_pgdat = NULL;
+			}
+			put_zone_device_private_or_public_page(page);
+			continue;
+		}
+
 		page = compound_head(page);
 		if (!put_page_testzero(page))
 			continue;
-- 
2.9.3


* [HMM-CDM 2/5] mm/hmm: add new helper to hotplug CDM memory region
  2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
  2017-06-14 20:11 ` [HMM-CDM 1/5] mm/device-public-memory: device memory cache coherent with CPU Jérôme Glisse
@ 2017-06-14 20:11 ` Jérôme Glisse
  2017-06-15  4:28   ` Balbir Singh
  2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Jérôme Glisse

Unlike unaddressable memory, coherent device memory has a real
resource associated with it on the system (as the CPU can address
it). Add a new helper to hotplug such memory within the HMM
framework.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h |  3 ++
 mm/hmm.c            | 85 +++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2f2a6ff..f6713b2 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -392,6 +392,9 @@ struct hmm_devmem {
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 				  struct device *device,
 				  unsigned long size);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res);
 void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 int hmm_devmem_fault_range(struct hmm_devmem *devmem,
diff --git a/mm/hmm.c b/mm/hmm.c
index 4ecf618..aed110e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -849,7 +849,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
 	zone = page_zone(page);
 
 	mem_hotplug_begin();
-	__remove_pages(zone, start_pfn, npages);
+	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
+		__remove_pages(zone, start_pfn, npages);
+	else
+		arch_remove_memory(start_pfn << PAGE_SHIFT,
+				   npages << PAGE_SHIFT);
 	mem_hotplug_done();
 
 	hmm_devmem_radix_release(resource);
@@ -885,7 +889,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	if (is_ram == REGION_INTERSECTS)
 		return -ENXIO;
 
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+	else
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+
 	devmem->pagemap.res = devmem->resource;
 	devmem->pagemap.page_fault = hmm_devmem_fault;
 	devmem->pagemap.page_free = hmm_devmem_free;
@@ -924,8 +932,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 		nid = numa_mem_id();
 
 	mem_hotplug_begin();
-	ret = add_pages(nid, align_start >> PAGE_SHIFT,
-			align_size >> PAGE_SHIFT, false);
+	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
+		ret = arch_add_memory(nid, align_start, align_size, false);
+	else
+		ret = add_pages(nid, align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, false);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
@@ -1075,6 +1086,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 }
 EXPORT_SYMBOL(hmm_devmem_add);
 
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res)
+{
+	struct hmm_devmem *devmem;
+	int ret;
+
+	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		return ERR_PTR(-EINVAL);
+
+	static_branch_enable(&device_private_key);
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = res;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_devm_add_action;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add_resource);
+
 /*
  * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
  *
@@ -1088,6 +1160,7 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 {
 	resource_size_t start, size;
 	struct device *device;
+	bool cdm = false;
 
 	if (!devmem)
 		return;
@@ -1096,11 +1169,13 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 	start = devmem->resource->start;
 	size = resource_size(devmem->resource);
 
+	cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY;
 	hmm_devmem_ref_kill(&devmem->ref);
 	hmm_devmem_ref_exit(&devmem->ref);
 	hmm_devmem_pages_remove(devmem);
 
-	devm_release_mem_region(device, start, size);
+	if (!cdm)
+		devm_release_mem_region(device, start, size);
 }
 EXPORT_SYMBOL(hmm_devmem_remove);
 
-- 
2.9.3


* [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
  2017-06-14 20:11 ` [HMM-CDM 1/5] mm/device-public-memory: device memory cache coherent with CPU Jérôme Glisse
  2017-06-14 20:11 ` [HMM-CDM 2/5] mm/hmm: add new helper to hotplug CDM memory region Jérôme Glisse
@ 2017-06-14 20:11 ` Jérôme Glisse
  2017-06-15  3:31   ` Balbir Singh
  2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Jérôme Glisse,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus you cannot use the page->lru field of those pages. This patch
re-arranges the uncharge code to allow a single page to be uncharged
without modifying the lru field of its struct page.

There is no change to the memcontrol logic; it is the same as it was
before this patch.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 92 insertions(+), 76 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3fe4d0..b93f5fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 	cancel_charge(memcg, nr_pages);
 }
 
-static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
-			   unsigned long nr_anon, unsigned long nr_file,
-			   unsigned long nr_kmem, unsigned long nr_huge,
-			   unsigned long nr_shmem, struct page *dummy_page)
+struct uncharge_gather {
+	struct mem_cgroup *memcg;
+	unsigned long pgpgout;
+	unsigned long nr_anon;
+	unsigned long nr_file;
+	unsigned long nr_kmem;
+	unsigned long nr_huge;
+	unsigned long nr_shmem;
+	struct page *dummy_page;
+};
+
+static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 {
-	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
+	memset(ug, 0, sizeof(*ug));
+}
+
+static void uncharge_batch(const struct uncharge_gather *ug)
+{
+	unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
 	unsigned long flags;
 
-	if (!mem_cgroup_is_root(memcg)) {
-		page_counter_uncharge(&memcg->memory, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg)) {
+		page_counter_uncharge(&ug->memcg->memory, nr_pages);
 		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
-			page_counter_uncharge(&memcg->kmem, nr_kmem);
-		memcg_oom_recover(memcg);
+			page_counter_uncharge(&ug->memcg->memsw, nr_pages);
+		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+		memcg_oom_recover(ug->memcg);
 	}
 
 	local_irq_save(flags);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
-	__this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
-	__this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
-	__this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
-	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
-	memcg_check_events(memcg, dummy_page);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
+	__this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
+	__this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
+	__this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg))
+		css_put_many(&ug->memcg->css, nr_pages);
+}
+
+static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
+
+	if (!page->mem_cgroup)
+		return;
+
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * page->mem_cgroup at this point, we have fully
+	 * exclusive access to the page.
+	 */
+
+	if (ug->memcg != page->mem_cgroup) {
+		if (ug->memcg) {
+			uncharge_batch(ug);
+			uncharge_gather_clear(ug);
+		}
+		ug->memcg = page->mem_cgroup;
+	}
+
+	if (!PageKmemcg(page)) {
+		unsigned int nr_pages = 1;
+
+		if (PageTransHuge(page)) {
+			nr_pages <<= compound_order(page);
+			ug->nr_huge += nr_pages;
+		}
+		if (PageAnon(page))
+			ug->nr_anon += nr_pages;
+		else {
+			ug->nr_file += nr_pages;
+			if (PageSwapBacked(page))
+				ug->nr_shmem += nr_pages;
+		}
+		ug->pgpgout++;
+	} else {
+		ug->nr_kmem += 1 << compound_order(page);
+		__ClearPageKmemcg(page);
+	}
+
+	ug->dummy_page = page;
+	page->mem_cgroup = NULL;
 }
 
 static void uncharge_list(struct list_head *page_list)
 {
-	struct mem_cgroup *memcg = NULL;
-	unsigned long nr_shmem = 0;
-	unsigned long nr_anon = 0;
-	unsigned long nr_file = 0;
-	unsigned long nr_huge = 0;
-	unsigned long nr_kmem = 0;
-	unsigned long pgpgout = 0;
+	struct uncharge_gather ug;
 	struct list_head *next;
-	struct page *page;
+
+	uncharge_gather_clear(&ug);
 
 	/*
 	 * Note that the list can be a single page->lru; hence the
@@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
 	 */
 	next = page_list->next;
 	do {
+		struct page *page;
+
 		page = list_entry(next, struct page, lru);
 		next = page->lru.next;
 
-		VM_BUG_ON_PAGE(PageLRU(page), page);
-		VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
-
-		if (!page->mem_cgroup)
-			continue;
-
-		/*
-		 * Nobody should be changing or seriously looking at
-		 * page->mem_cgroup at this point, we have fully
-		 * exclusive access to the page.
-		 */
-
-		if (memcg != page->mem_cgroup) {
-			if (memcg) {
-				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-					       nr_kmem, nr_huge, nr_shmem, page);
-				pgpgout = nr_anon = nr_file = nr_kmem = 0;
-				nr_huge = nr_shmem = 0;
-			}
-			memcg = page->mem_cgroup;
-		}
-
-		if (!PageKmemcg(page)) {
-			unsigned int nr_pages = 1;
-
-			if (PageTransHuge(page)) {
-				nr_pages <<= compound_order(page);
-				nr_huge += nr_pages;
-			}
-			if (PageAnon(page))
-				nr_anon += nr_pages;
-			else {
-				nr_file += nr_pages;
-				if (PageSwapBacked(page))
-					nr_shmem += nr_pages;
-			}
-			pgpgout++;
-		} else {
-			nr_kmem += 1 << compound_order(page);
-			__ClearPageKmemcg(page);
-		}
-
-		page->mem_cgroup = NULL;
+		uncharge_page(page, &ug);
 	} while (next != page_list);
 
-	if (memcg)
-		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-			       nr_kmem, nr_huge, nr_shmem, page);
+	if (ug.memcg)
+		uncharge_batch(&ug);
 }
 
 /**
@@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
  */
 void mem_cgroup_uncharge(struct page *page)
 {
+	struct uncharge_gather ug;
+
 	if (mem_cgroup_disabled())
 		return;
 
@@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
 	if (!page->mem_cgroup)
 		return;
 
-	INIT_LIST_HEAD(&page->lru);
-	uncharge_list(&page->lru);
+	uncharge_gather_clear(&ug);
+	uncharge_page(page, &ug);
+	uncharge_batch(&ug);
 }
 
 /**
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
  2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
                   ` (2 preceding siblings ...)
  2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-06-14 20:11 ` Jérôme Glisse
  2017-06-15  1:41   ` Balbir Singh
  2017-06-14 20:11 ` [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64 Jérôme Glisse
  2017-06-14 21:20 ` [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Dave Hansen
  5 siblings, 1 reply; 23+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Jérôme Glisse,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus need special handling when it comes to lru or refcount. This
patch makes sure that memcontrol properly handles those pages when it
faces them. Those pages are used like regular pages in a process
address space, either as anonymous pages or as file backed pages, so
from the memcg point of view we want to handle them like regular
pages, at least for now.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 kernel/memremap.c |  2 ++
 mm/memcontrol.c   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index da74775..584984c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
 		__ClearPageActive(page);
 		__ClearPageWaiters(page);
 
+		mem_cgroup_uncharge(page);
+
 		page->pgmap->page_free(page, page->pgmap->data);
 	}
 	else if (!count)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b93f5fe..171b638 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4391,12 +4391,13 @@ enum mc_target_type {
 	MC_TARGET_NONE = 0,
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
+	MC_TARGET_DEVICE,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
 {
-	struct page *page = vm_normal_page(vma, addr, ptent);
+	struct page *page = _vm_normal_page(vma, addr, ptent, true);
 
 	if (!page || !page_mapped(page))
 		return NULL;
@@ -4407,13 +4408,20 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 		if (!(mc.flags & MOVE_FILE))
 			return NULL;
 	}
-	if (!get_page_unless_zero(page))
+	if (is_device_public_page(page)) {
+		/*
+	 * MEMORY_DEVICE_PUBLIC means a ZONE_DEVICE page which has a
+	 * refcount of 1 when free (unlike a normal page)
+		 */
+		if (!page_ref_add_unless(page, 1, 1))
+			return NULL;
+	} else if (!get_page_unless_zero(page))
 		return NULL;
 
 	return page;
 }
 
-#ifdef CONFIG_SWAP
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
 static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 			pte_t ptent, swp_entry_t *entry)
 {
@@ -4422,6 +4430,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 
 	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
 		return NULL;
+
+	/*
+	 * Handle MEMORY_DEVICE_PRIVATE, which are ZONE_DEVICE pages belonging
+	 * to a device; because they are not accessible by the CPU they are
+	 * stored as special swap entries in the CPU page table.
+	 */
+	if (is_device_private_entry(ent)) {
+		page = device_private_entry_to_page(ent);
+		/*
+		 * MEMORY_DEVICE_PRIVATE means a ZONE_DEVICE page which has
+		 * a refcount of 1 when free (unlike a normal page)
+		 */
+		if (!page_ref_add_unless(page, 1, 1))
+			return NULL;
+		return page;
+	}
+
 	/*
 	 * Because lookup_swap_cache() updates some statistics counter,
 	 * we call find_get_page() with swapper_space directly.
@@ -4582,6 +4607,8 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PUBLIC
+ *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
  *
  * Called with pte lock held.
  */
@@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
+			if (is_device_private_page(page) ||
+			    is_device_public_page(page))
+				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
 		}
@@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
+		/*
+		 * Note there can not be MC_TARGET_DEVICE for now, as we do not
+		 * support transparent huge pages with MEMORY_DEVICE_PUBLIC or
+		 * MEMORY_DEVICE_PRIVATE, but this might change.
+		 */
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -4884,6 +4919,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				putback_lru_page(page);
 			}
 			put_page(page);
+		} else if (target_type == MC_TARGET_DEVICE) {
+			page = target.page;
+			if (!mem_cgroup_move_account(page, true,
+						     mc.from, mc.to)) {
+				mc.precharge -= HPAGE_PMD_NR;
+				mc.moved_charge += HPAGE_PMD_NR;
+			}
+			put_page(page);
 		}
 		spin_unlock(ptl);
 		return 0;
@@ -4895,12 +4938,16 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
+		bool device = false;
 		swp_entry_t ent;
 
 		if (!mc.precharge)
 			break;
 
 		switch (get_mctgt_type(vma, addr, ptent, &target)) {
+		case MC_TARGET_DEVICE:
+			device = true;
+			/* fall through */
 		case MC_TARGET_PAGE:
 			page = target.page;
 			/*
@@ -4911,7 +4958,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 */
 			if (PageTransCompound(page))
 				goto put;
-			if (isolate_lru_page(page))
+			if (!device && isolate_lru_page(page))
 				goto put;
 			if (!mem_cgroup_move_account(page, false,
 						mc.from, mc.to)) {
@@ -4919,7 +4966,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
 			}
-			putback_lru_page(page);
+			if (!device)
+				putback_lru_page(page);
 put:			/* get_mctgt_type() gets the page */
 			put_page(page);
 			break;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
  2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
                   ` (3 preceding siblings ...)
  2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
@ 2017-06-14 20:11 ` Jérôme Glisse
  2017-06-14 23:10   ` John Hubbard
  2017-06-15  1:46   ` Balbir Singh
  2017-06-14 21:20 ` [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Dave Hansen
  5 siblings, 2 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Jérôme Glisse,
	Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

This just simplifies kconfig and allows HMM and DEVICE_PUBLIC to be
selected for ppc64, once ZONE_DEVICE is allowed on ppc64 (different
patchset).

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 include/linux/hmm.h |  4 ++--
 mm/Kconfig          | 27 ++++++---------------------
 mm/hmm.c            |  4 ++--
 3 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f6713b2..720d18c 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct hmm_devmem;
 
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
@@ -456,7 +456,7 @@ struct hmm_device {
  */
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
-#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
diff --git a/mm/Kconfig b/mm/Kconfig
index ad082b9..7de939a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -265,7 +265,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_HAS_HMM
 	bool
 	default y
-	depends on X86_64
+	depends on X86_64 || PPC64
 	depends on ZONE_DEVICE
 	depends on MMU && 64BIT
 	depends on MEMORY_HOTPLUG
@@ -277,7 +277,7 @@ config HMM
 
 config HMM_MIRROR
 	bool "HMM mirror CPU page table into a device page table"
-	depends on ARCH_HAS_HMM
+	depends on ARCH_HAS_HMM && X86_64
 	select MMU_NOTIFIER
 	select HMM
 	help
@@ -287,15 +287,6 @@ config HMM_MIRROR
 	  page tables (at PAGE_SIZE granularity), and must be able to recover from
 	  the resulting potential page faults.
 
-config HMM_DEVMEM
-	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
-	depends on ARCH_HAS_HMM
-	select HMM
-	help
-	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
-	  feature. This is just to avoid having device drivers to replicating a lot
-	  of boiler plate code.  See Documentation/vm/hmm.txt.
-
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
@@ -720,11 +711,8 @@ config ZONE_DEVICE
 
 config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
-	depends on X86_64
-	depends on ZONE_DEVICE
-	depends on MEMORY_HOTPLUG
-	depends on MEMORY_HOTREMOVE
-	depends on SPARSEMEM_VMEMMAP
+	depends on ARCH_HAS_HMM && X86_64
+	select HMM
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -733,11 +721,8 @@ config DEVICE_PRIVATE
 
 config DEVICE_PUBLIC
 	bool "Unaddressable device memory (GPU memory, ...)"
-	depends on X86_64
-	depends on ZONE_DEVICE
-	depends on MEMORY_HOTPLUG
-	depends on MEMORY_HOTREMOVE
-	depends on SPARSEMEM_VMEMMAP
+	depends on ARCH_HAS_HMM
+	select HMM
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index aed110e..085cc06 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -747,7 +747,7 @@ EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
 				       unsigned long addr)
 {
@@ -1306,4 +1306,4 @@ static int __init hmm_init(void)
 }
 
 device_initcall(hmm_init);
-#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
  2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
                   ` (4 preceding siblings ...)
  2017-06-14 20:11 ` [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64 Jérôme Glisse
@ 2017-06-14 21:20 ` Dave Hansen
  2017-06-14 21:38   ` Jerome Glisse
  5 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2017-06-14 21:20 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, cgroups, Dan Williams, Ross Zwisler,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh,
	Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt

On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
> Cache coherent device memory apply to architecture with system bus
> like CAPI or CCIX. Device connected to such system bus can expose
> their memory to the system and allow cache coherent access to it
> from the CPU.

How does this interact with device memory that's enumerated in the new
ACPI 6.2 HMAT?  That stuff is also in the normal e820 and, by default,
treated as normal system RAM.  Would this mechanism be used for those
devices as well?

http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
  2017-06-14 21:20 ` [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Dave Hansen
@ 2017-06-14 21:38   ` Jerome Glisse
  2017-06-14 21:58     ` Dave Hansen
  0 siblings, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2017-06-14 21:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans, cgroups,
	Dan Williams, Ross Zwisler, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

On Wed, Jun 14, 2017 at 02:20:23PM -0700, Dave Hansen wrote:
> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
> > Cache coherent device memory apply to architecture with system bus
> > like CAPI or CCIX. Device connected to such system bus can expose
> > their memory to the system and allow cache coherent access to it
> > from the CPU.
> 
> How does this interact with device memory that's enumerated in the new
> ACPI 6.2 HMAT?  That stuff is also in the normal e820 and, by default,
> treated as normal system RAM.  Would this mechanism be used for those
> devices as well?
> 
> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

It doesn't interact with that. HMM-CDM is a set of helpers that do not
do anything unless instructed to. So for device memory to be presented
as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC), which
can be done with the helper introduced in patch 2 of this patchset.

I don't think that the HMAT inside ACPI is restricted to, or even
intended for, device memory. The kind of memory I am referring to in
HMM-CDM is for instance GPU on-board memory. On a PCIe system the CPU
can not access such memory in the same manner as regular memory, but
on a CAPI or CCIX system it can.

How such memory is listed is platform/architecture specific and HMM-CDM
does not deal with that. So PowerPC CAPI will have its own way of
discovering such device memory. So will CCIX platforms (though for CCIX
I expect that UEFI/ACPI will be involved through HMAT or something new).

Also, even if HMAT allows representing device memory, the choice to use
HMM-CDM would still be with the device driver. For persistent memory it
does not make sense, and it might not make sense for other devices
either. I expect this to be useful for GPUs, FPGAs or similar
accelerators (like Xeon Phi as an add-on card, if there is ever
something like CCIX supported by Intel).

Hope this answers your question,

Jérôme

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
  2017-06-14 21:38   ` Jerome Glisse
@ 2017-06-14 21:58     ` Dave Hansen
  2017-06-14 22:07       ` Benjamin Herrenschmidt
  2017-06-14 23:40       ` Balbir Singh
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Hansen @ 2017-06-14 21:58 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans, cgroups,
	Dan Williams, Ross Zwisler, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

On 06/14/2017 02:38 PM, Jerome Glisse wrote:
> On Wed, Jun 14, 2017 at 02:20:23PM -0700, Dave Hansen wrote:
>> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
>>> Cache coherent device memory apply to architecture with system bus
>>> like CAPI or CCIX. Device connected to such system bus can expose
>>> their memory to the system and allow cache coherent access to it
>>> from the CPU.
>> How does this interact with device memory that's enumerated in the new
>> ACPI 6.2 HMAT?  That stuff is also in the normal e820 and, by default,
>> treated as normal system RAM.  Would this mechanism be used for those
>> devices as well?
>>
>> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> It doesn't interact with that. HMM-CDM is a set of helper that don't
> do anything unless instructed so. So for device memory to be presented
> as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC) which
> can be done with the helper introduced in patch 2 of this patchset.

I guess I'm asking whether we *should* instruct HMM-CDM to manage all
coherent device memory.  If not, where do we draw the line for what we
use HMM-CDM, and for what we use the core MM?

> I don't think that the HMAT inside ACPI is restricted or even intended
> for device memory.

It can definitely describe memory attached to memory controllers which
are not directly attached to CPUs.  That means either some kind of
memory expander, or device memory.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
  2017-06-14 21:58     ` Dave Hansen
@ 2017-06-14 22:07       ` Benjamin Herrenschmidt
  2017-06-14 23:40       ` Balbir Singh
  1 sibling, 0 replies; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2017-06-14 22:07 UTC (permalink / raw)
  To: Dave Hansen, Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans, cgroups,
	Dan Williams, Ross Zwisler, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Balbir Singh, Aneesh Kumar, Paul E . McKenney

On Wed, 2017-06-14 at 14:58 -0700, Dave Hansen wrote:
> > > http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> > 
> > It doesn't interact with that. HMM-CDM is a set of helper that don't
> > do anything unless instructed so. So for device memory to be presented
> > as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC) which
> > can be done with the helper introduced in patch 2 of this patchset.
> 
> I guess I'm asking whether we *should* instruct HMM-CDM to manage all
> coherent device memory.  If not, where do we draw the line for what we
> use HMM-CDM, and for what we use the core MM?

Well, if you want the features of HMM ... It basically boils down to
whether you have some kind of coherent processing unit close to that
memory and want to manage transparent migration of pages between system
and device memory, that sort of thing.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
  2017-06-14 20:11 ` [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64 Jérôme Glisse
@ 2017-06-14 23:10   ` John Hubbard
  2017-06-15  2:09     ` Jerome Glisse
  2017-06-15  1:46   ` Balbir Singh
  1 sibling, 1 reply; 23+ messages in thread
From: John Hubbard @ 2017-06-14 23:10 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: David Nellans, Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
> This just simplify kconfig and allow HMM and DEVICE_PUBLIC to be
> selected for ppc64 once ZONE_DEVICE is allowed on ppc64 (different
> patchset).
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Cc: Balbir Singh <balbirs@au1.ibm.com>
> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
>   include/linux/hmm.h |  4 ++--
>   mm/Kconfig          | 27 ++++++---------------------
>   mm/hmm.c            |  4 ++--
>   3 files changed, 10 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index f6713b2..720d18c 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
>   #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>   
>   
> -#if IS_ENABLED(CONFIG_HMM_DEVMEM)
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
>   struct hmm_devmem;
>   
>   struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
> @@ -456,7 +456,7 @@ struct hmm_device {
>    */
>   struct hmm_device *hmm_device_new(void *drvdata);
>   void hmm_device_put(struct hmm_device *hmm_device);
> -#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
> +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
>   
>   
>   /* Below are for HMM internal use only! Not to be used by device driver! */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ad082b9..7de939a 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -265,7 +265,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>   config ARCH_HAS_HMM
>   	bool
>   	default y
> -	depends on X86_64
> +	depends on X86_64 || PPC64
>   	depends on ZONE_DEVICE
>   	depends on MMU && 64BIT
>   	depends on MEMORY_HOTPLUG
> @@ -277,7 +277,7 @@ config HMM
>   
>   config HMM_MIRROR
>   	bool "HMM mirror CPU page table into a device page table"
> -	depends on ARCH_HAS_HMM
> +	depends on ARCH_HAS_HMM && X86_64
>   	select MMU_NOTIFIER
>   	select HMM
>   	help

Hi Jerome,

There are still some problems with using this configuration. First and foremost, it is still 
possible (and likely, given the complete dissimilarity in naming, and difference in location on the 
screen) to choose HMM_MIRROR, and *not* to choose either DEVICE_PRIVATE or DEVICE_PUBLIC. And then 
we end up with a swath of important page fault handling code being ifdef'd out, and one ends up 
having to investigate why.

As for solutions, at least for the x86 (DEVICE_PRIVATE) case, we could do this:

diff --git a/mm/Kconfig b/mm/Kconfig
index 7de939a29466..f64182d7b956 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -279,6 +279,7 @@ config HMM_MIRROR
         bool "HMM mirror CPU page table into a device page table"
         depends on ARCH_HAS_HMM && X86_64
         select MMU_NOTIFIER
+       select DEVICE_PRIVATE
         select HMM
         help
           Select HMM_MIRROR if you want to mirror range of the CPU page table of a

...and that is better than the other direction (having HMM_MIRROR depend on DEVICE_PRIVATE), because 
in the latter case, HMM_MIRROR will disappear (and it's several lines above) until you select 
DEVICE_PRIVATE. That is hard to work with for the user.

The user will tend to select HMM_MIRROR, but it is *not* obvious that he/she should also select 
DEVICE_PRIVATE. So Kconfig should do it for them.

In fact, I'm not even sure if the DEVICE_PRIVATE and DEVICE_PUBLIC actually need Kconfig protection, 
but if they don't, then life would be easier for whoever is configuring their kernel.
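
For illustration, the HMM_MIRROR entry after the suggested change would read roughly as follows (a sketch of the combined result, assuming the one-line diff above is applied on top of this series; help text unchanged and elided here):

```
config HMM_MIRROR
	bool "HMM mirror CPU page table into a device page table"
	depends on ARCH_HAS_HMM && X86_64
	select MMU_NOTIFIER
	select DEVICE_PRIVATE
	select HMM
	help
	  (unchanged)
```

With `select` rather than `depends on`, enabling HMM_MIRROR pulls in DEVICE_PRIVATE automatically instead of hiding the option until the user finds the right prerequisite.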


> @@ -287,15 +287,6 @@ config HMM_MIRROR
>   	  page tables (at PAGE_SIZE granularity), and must be able to recover from
>   	  the resulting potential page faults.
>   
> -config HMM_DEVMEM
> -	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
> -	depends on ARCH_HAS_HMM
> -	select HMM
> -	help
> -	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
> -	  feature. This is just to avoid having device drivers to replicating a lot
> -	  of boiler plate code.  See Documentation/vm/hmm.txt.
> -

Yes, probably good to remove HMM_DEVMEM as a separate config choice.

>   config PHYS_ADDR_T_64BIT
>   	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
>   
> @@ -720,11 +711,8 @@ config ZONE_DEVICE
>   
>   config DEVICE_PRIVATE
>   	bool "Unaddressable device memory (GPU memory, ...)"
> -	depends on X86_64
> -	depends on ZONE_DEVICE
> -	depends on MEMORY_HOTPLUG
> -	depends on MEMORY_HOTREMOVE
> -	depends on SPARSEMEM_VMEMMAP
> +	depends on ARCH_HAS_HMM && X86_64
> +	select HMM
>   
>   	help
>   	  Allows creation of struct pages to represent unaddressable device
> @@ -733,11 +721,8 @@ config DEVICE_PRIVATE
>   
>   config DEVICE_PUBLIC
>   	bool "Unaddressable device memory (GPU memory, ...)"

Typo: this is a copy-and-paste from DEVICE_PRIVATE, but the "Unaddressable" part wasn't changed, so 
you'll end up with two identical-looking lines in `make menuconfig`.

Maybe "Directly addressable device memory"? And make the line less identical to DEVICE_PRIVATE?

thanks,
--
John Hubbard
NVIDIA

> -	depends on X86_64
> -	depends on ZONE_DEVICE
> -	depends on MEMORY_HOTPLUG
> -	depends on MEMORY_HOTREMOVE
> -	depends on SPARSEMEM_VMEMMAP
> +	depends on ARCH_HAS_HMM
> +	select HMM
>   
>   	help
>   	  Allows creation of struct pages to represent addressable device
> diff --git a/mm/hmm.c b/mm/hmm.c
> index aed110e..085cc06 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(hmm_vma_fault);
>   #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>   
>   
> -#if IS_ENABLED(CONFIG_HMM_DEVMEM)
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
>   struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
>   				       unsigned long addr)
>   {
> @@ -1306,4 +1306,4 @@ static int __init hmm_init(void)
>   }
>   
>   device_initcall(hmm_init);
> -#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
> +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -- 
> 2.9.3
> 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
  2017-06-14 21:58     ` Dave Hansen
  2017-06-14 22:07       ` Benjamin Herrenschmidt
@ 2017-06-14 23:40       ` Balbir Singh
  1 sibling, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2017-06-14 23:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jerome Glisse, linux-kernel, linux-mm, John Hubbard,
	David Nellans, cgroups, Dan Williams, Ross Zwisler,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Aneesh Kumar,
	Paul E . McKenney, Benjamin Herrenschmidt

On Thu, Jun 15, 2017 at 7:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 06/14/2017 02:38 PM, Jerome Glisse wrote:
>> On Wed, Jun 14, 2017 at 02:20:23PM -0700, Dave Hansen wrote:
>>> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
>>>> Cache coherent device memory apply to architecture with system bus
>>>> like CAPI or CCIX. Device connected to such system bus can expose
>>>> their memory to the system and allow cache coherent access to it
>>>> from the CPU.
>>> How does this interact with device memory that's enumerated in the new
>>> ACPI 6.2 HMAT?  That stuff is also in the normal e820 and, by default,
>>> treated as normal system RAM.  Would this mechanism be used for those
>>> devices as well?
>>>
>>> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>> It doesn't interact with that. HMM-CDM is a set of helper that don't
>> do anything unless instructed so. So for device memory to be presented
>> as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC) which
>> can be done with the helper introduced in patch 2 of this patchset.
>

[Removing my cc'd email id and responding from a different address]

> I guess I'm asking whether we *should* instruct HMM-CDM to manage all
> coherent device memory.  If not, where do we draw the line for what we
> use HMM-CDM, and for what we use the core MM?
>

If you believe the memory is managed by the device (and owned by a device
driver) I'd suggest using HMM-CDM. The idea behind HMM-CDM was that it
enables transparent migration of pages, and it is preferred when locality
of computation and locality of memory access is the desired model.

The other model was N_COHERENT_MEMORY, which used the core MM, but
there were objections to exposing device memory using that technology.

Balbir Singh.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
  2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
@ 2017-06-15  1:41   ` Balbir Singh
  2017-06-15  2:04     ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2017-06-15  1:41 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

On Wed, 14 Jun 2017 16:11:43 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> HMM pages (private or public device pages) are ZONE_DEVICE page and
> thus need special handling when it comes to lru or refcount. This
> patch make sure that memcontrol properly handle those when it face
> them. Those pages are use like regular pages in a process address
> space either as anonymous page or as file back page. So from memcg
> point of view we want to handle them like regular page for now at
> least.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> ---
>  kernel/memremap.c |  2 ++
>  mm/memcontrol.c   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 55 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index da74775..584984c 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
>  		__ClearPageActive(page);
>  		__ClearPageWaiters(page);
>  
> +		mem_cgroup_uncharge(page);
> +

A zone device page could have a mem_cgroup charge if

1. The old page was charged to a cgroup and the new page from ZONE_DEVICE then
gets the charge that we need to drop here

And should not be charged if

2. The driver allowed mmap based allocation (these pages are not on the LRU)


Since put_zone_device_private_or_public_page() is called from release_pages(),
I think the assumption is that 2 is not a problem? I've not tested the mmap
bits yet.

>  		page->pgmap->page_free(page, page->pgmap->data);
>  	}
>  	else if (!count)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b93f5fe..171b638 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4391,12 +4391,13 @@ enum mc_target_type {
>  	MC_TARGET_NONE = 0,
>  	MC_TARGET_PAGE,
>  	MC_TARGET_SWAP,
> +	MC_TARGET_DEVICE,
>  };
>  
>  static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
>  						unsigned long addr, pte_t ptent)
>  {
> -	struct page *page = vm_normal_page(vma, addr, ptent);
> +	struct page *page = _vm_normal_page(vma, addr, ptent, true);
>  
>  	if (!page || !page_mapped(page))
>  		return NULL;
> @@ -4407,13 +4408,20 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
>  		if (!(mc.flags & MOVE_FILE))
>  			return NULL;
>  	}
> -	if (!get_page_unless_zero(page))
> +	if (is_device_public_page(page)) {
> +		/*
> +		 * MEMORY_DEVICE_PUBLIC means ZONE_DEVICE page and which have a
> +		 * refcount of 1 when free (unlike normal page)
> +		 */
> +		if (!page_ref_add_unless(page, 1, 1))
> +			return NULL;
> +	} else if (!get_page_unless_zero(page))
>  		return NULL;
>  
>  	return page;
>  }
>  
> -#ifdef CONFIG_SWAP
> +#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
>  static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>  			pte_t ptent, swp_entry_t *entry)
>  {
> @@ -4422,6 +4430,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>  
>  	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
>  		return NULL;
> +
> +	/*
> +	 * Handle MEMORY_DEVICE_PRIVATE which are ZONE_DEVICE page belonging to
> +	 * a device and because they are not accessible by CPU they are store
> +	 * as special swap entry in the CPU page table.
> +	 */
> +	if (is_device_private_entry(ent)) {
> +		page = device_private_entry_to_page(ent);
> +		/*
> +		 * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
> +		 * a refcount of 1 when free (unlike normal page)
> +		 */
> +		if (!page_ref_add_unless(page, 1, 1))
> +			return NULL;
> +		return page;
> +	}
> +
>  	/*
>  	 * Because lookup_swap_cache() updates some statistics counter,
>  	 * we call find_get_page() with swapper_space directly.
> @@ -4582,6 +4607,8 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   *     target for charge migration. if @target is not NULL, the entry is stored
>   *     in target->ent.
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PUBLIC
> + *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
>   *
>   * Called with pte lock held.
>   */
> @@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>  		 */
>  		if (page->mem_cgroup == mc.from) {
>  			ret = MC_TARGET_PAGE;
> +			if (is_device_private_page(page) ||
> +			    is_device_public_page(page))
> +				ret = MC_TARGET_DEVICE;
>  			if (target)
>  				target->page = page;
>  		}
> @@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>  
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> +		/*
> +		 * Note their can not be MC_TARGET_DEVICE for now as we do not
                        there
> +		 * support transparent huge page with MEMORY_DEVICE_PUBLIC or
> +		 * MEMORY_DEVICE_PRIVATE but this might change.

I am trying to remind myself why THP and MEMORY_DEVICE_* pages don't work well
together today; the driver could allocate a THP-sized set of pages and migrate
it. There are patches to do THP migration, though they are not upstream yet.
Could you remind me of any other limitations?

Balbir Singh
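
The two take-a-reference rules that the quoted hunks distinguish (regular pages
are free at refcount 0, ZONE_DEVICE pages are free at refcount 1, hence
get_page_unless_zero() versus page_ref_add_unless(page, 1, 1)) can be modeled
outside the kernel. The following is a plain-Python sketch for illustration
only, not kernel code:

```python
# Toy model of the two "take a reference" rules from the quoted patch.
# Regular pages are free when their refcount hits 0; ZONE_DEVICE pages
# are free when their refcount hits 1, so a different guard is needed.

def get_page_unless_zero(refcount):
    """Regular page: take a reference unless the page is already free (0)."""
    if refcount == 0:
        return False, refcount
    return True, refcount + 1

def page_ref_add_unless(refcount, nr, unless):
    """Add nr to refcount unless it currently equals `unless` (cf. the
    kernel's atomic add-unless primitive)."""
    if refcount == unless:
        return False, refcount
    return True, refcount + nr

# A free ZONE_DEVICE page sits at refcount 1, so the memcg code must
# refuse to pin it exactly there:
ok, _ = page_ref_add_unless(1, 1, 1)
assert not ok                       # free device page: no reference taken
ok, count = page_ref_add_unless(2, 1, 1)
assert ok and count == 3            # in-use device page: reference taken
ok, _ = get_page_unless_zero(0)
assert not ok                       # free regular page: no reference taken
```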

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
  2017-06-14 20:11 ` [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64 Jérôme Glisse
  2017-06-14 23:10   ` John Hubbard
@ 2017-06-15  1:46   ` Balbir Singh
  2017-06-15  2:07     ` Jerome Glisse
  1 sibling, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2017-06-15  1:46 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

On Wed, 14 Jun 2017 16:11:44 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> This just simplify kconfig and allow HMM and DEVICE_PUBLIC to be
> selected for ppc64 once ZONE_DEVICE is allowed on ppc64 (different
> patchset).
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Cc: Balbir Singh <balbirs@au1.ibm.com>
> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
>  include/linux/hmm.h |  4 ++--
>  mm/Kconfig          | 27 ++++++---------------------
>  mm/hmm.c            |  4 ++--
>  3 files changed, 10 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index f6713b2..720d18c 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>  
>  
> -#if IS_ENABLED(CONFIG_HMM_DEVMEM)
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
>  struct hmm_devmem;
>  
>  struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
> @@ -456,7 +456,7 @@ struct hmm_device {
>   */
>  struct hmm_device *hmm_device_new(void *drvdata);
>  void hmm_device_put(struct hmm_device *hmm_device);
> -#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
> +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
>  
>  
>  /* Below are for HMM internal use only! Not to be used by device driver! */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ad082b9..7de939a 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -265,7 +265,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  config ARCH_HAS_HMM
>  	bool
>  	default y
> -	depends on X86_64
> +	depends on X86_64 || PPC64

Ideally we want to make this (PPC64 && PPC_BOOK3S)

>  	depends on ZONE_DEVICE
>  	depends on MMU && 64BIT
>  	depends on MEMORY_HOTPLUG
> @@ -277,7 +277,7 @@ config HMM
>  
>  config HMM_MIRROR
>  	bool "HMM mirror CPU page table into a device page table"
> -	depends on ARCH_HAS_HMM
> +	depends on ARCH_HAS_HMM && X86_64

We would need HMM_MIRROR for the generation of hardware that does
not have CDM

>  	select MMU_NOTIFIER
>  	select HMM
>  	help
> @@ -287,15 +287,6 @@ config HMM_MIRROR
>  	  page tables (at PAGE_SIZE granularity), and must be able to recover from
>  	  the resulting potential page faults.
>  
> -config HMM_DEVMEM
> -	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
> -	depends on ARCH_HAS_HMM
> -	select HMM
> -	help
> -	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
> -	  feature. This is just to avoid having device drivers to replicating a lot
> -	  of boiler plate code.  See Documentation/vm/hmm.txt.
> -
>  config PHYS_ADDR_T_64BIT
>  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
>  
> @@ -720,11 +711,8 @@ config ZONE_DEVICE
>  
>  config DEVICE_PRIVATE
>  	bool "Unaddressable device memory (GPU memory, ...)"
> -	depends on X86_64
> -	depends on ZONE_DEVICE
> -	depends on MEMORY_HOTPLUG
> -	depends on MEMORY_HOTREMOVE
> -	depends on SPARSEMEM_VMEMMAP
> +	depends on ARCH_HAS_HMM && X86_64

Same as above

> +	select HMM
>  
>  	help
>  	  Allows creation of struct pages to represent unaddressable device
> @@ -733,11 +721,8 @@ config DEVICE_PRIVATE
>  
>  config DEVICE_PUBLIC
>  	bool "Unaddressable device memory (GPU memory, ...)"

The "Unaddressable" here is a copy-and-paste typo from the DEVICE_PRIVATE
entry above.

> -	depends on X86_64
> -	depends on ZONE_DEVICE
> -	depends on MEMORY_HOTPLUG
> -	depends on MEMORY_HOTREMOVE
> -	depends on SPARSEMEM_VMEMMAP
> +	depends on ARCH_HAS_HMM
> +	select HMM
>  

Balbir Singh.


* Re: [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
  2017-06-15  1:41   ` Balbir Singh
@ 2017-06-15  2:04     ` Jerome Glisse
  2017-06-15  3:10       ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2017-06-15  2:04 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

On Thu, Jun 15, 2017 at 11:41:59AM +1000, Balbir Singh wrote:
> On Wed, 14 Jun 2017 16:11:43 -0400
> Jérôme Glisse <jglisse@redhat.com> wrote:
> 
> > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > thus need special handling when it comes to lru or refcount. This
> > patch make sure that memcontrol properly handle those when it face
> > them. Those pages are use like regular pages in a process address
> > space either as anonymous page or as file back page. So from memcg
> > point of view we want to handle them like regular page for now at
> > least.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: cgroups@vger.kernel.org
> > ---
> >  kernel/memremap.c |  2 ++
> >  mm/memcontrol.c   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 55 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index da74775..584984c 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
> >  		__ClearPageActive(page);
> >  		__ClearPageWaiters(page);
> >  
> > +		mem_cgroup_uncharge(page);
> > +
> 
> A zone device page could have a mem_cgroup charge if
> 
> 1. The old page was charged to a cgroup and the new page from ZONE_DEVICE then
> gets the charge that we need to drop here
> 
> And should not be charged
> 
> 2. If the driver allowed mmap based allocation (these pages are not on LRU
> 
> 
> Since put_zone_device_private_or_public_page() is called from release_pages(),
> I think the assumption is that 2 is not a problem? I've not tested the mmap
> bits yet.

Well, that is one of the big questions. Do we care about the memory cgroup
despite the page not being on the LRU, and thus not being reclaimable through
the usual path?

I believe we do want to keep charging ZONE_DEVICE pages against the memory
cgroup so that userspace limits are enforced. This is important especially for
device private memory when migrating back to system memory due to a CPU page
fault. We do not want the migration back to fail because of a memory cgroup
limit.

This is why I want to charge ZONE_DEVICE pages just like regular pages. If
people run into OOM because of this, then we can start thinking about how to
account those pages slightly differently inside the memory cgroup.

For now I believe we do want this patch.


[...]

> > @@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> >  		 */
> >  		if (page->mem_cgroup == mc.from) {
> >  			ret = MC_TARGET_PAGE;
> > +			if (is_device_private_page(page) ||
> > +			    is_device_public_page(page))
> > +				ret = MC_TARGET_DEVICE;
> >  			if (target)
> >  				target->page = page;
> >  		}
> > @@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> >  
> >  	ptl = pmd_trans_huge_lock(pmd, vma);
> >  	if (ptl) {
> > +		/*
> > +		 * Note their can not be MC_TARGET_DEVICE for now as we do not
>                         there
> > +		 * support transparent huge page with MEMORY_DEVICE_PUBLIC or
> > +		 * MEMORY_DEVICE_PRIVATE but this might change.
> 
> I am trying to remind myself why THP and MEMORY_DEVICE_* pages don't work well
> together today, the driver could allocate a THP size set of pages and migrate it.
> There are patches to do THP migration, not upstream yet. Could you remind me
> of any other limitations?

No, there is nothing that would be problematic AFAICT. Persistent memory
already uses huge pages, so we should be in the clear. But I would rather
enable that as a separate patchset altogether and have proper testing
specifically for that scenario.

Jérôme
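
The charging policy argued for above (device pages stay charged against the
cgroup, so migrating back to system memory replaces an existing charge rather
than adding a new one, and cannot fail against a limit the process was already
within) can be illustrated with a small model. `MemCgroup` below is a toy
stand-in for illustration, not the kernel's mem_cgroup API:

```python
class MemCgroup:
    """Toy model of the memcg charge/uncharge pairing discussed above."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = 0

    def try_charge(self, pages=1):
        # Charging fails once the cgroup would exceed its limit.
        if self.usage + pages > self.limit:
            return False
        self.usage += pages
        return True

    def uncharge(self, pages=1):
        # Counterpart of the mem_cgroup_uncharge() call the patch adds to
        # put_zone_device_private_or_public_page().
        assert self.usage >= pages
        self.usage -= pages

memcg = MemCgroup(limit=4)
# Device pages are charged while mapped, so a later migration back to
# system memory swaps one charge for another instead of adding a second
# one; it cannot fail on a cgroup that was already at its limit.
assert memcg.try_charge(4)   # process fills its limit with device pages
memcg.uncharge(1)            # device page released on migration back...
assert memcg.try_charge(1)   # ...so the replacement system page still fits
```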


* Re: [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
  2017-06-15  1:46   ` Balbir Singh
@ 2017-06-15  2:07     ` Jerome Glisse
  2017-06-15  2:59       ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2017-06-15  2:07 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

On Thu, Jun 15, 2017 at 11:46:11AM +1000, Balbir Singh wrote:
> On Wed, 14 Jun 2017 16:11:44 -0400
> Jérôme Glisse <jglisse@redhat.com> wrote:
> 
> > This just simplify kconfig and allow HMM and DEVICE_PUBLIC to be
> > selected for ppc64 once ZONE_DEVICE is allowed on ppc64 (different
> > patchset).
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > Cc: Balbir Singh <balbirs@au1.ibm.com>
> > Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > ---
> >  include/linux/hmm.h |  4 ++--
> >  mm/Kconfig          | 27 ++++++---------------------
> >  mm/hmm.c            |  4 ++--
> >  3 files changed, 10 insertions(+), 25 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index f6713b2..720d18c 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
> >  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> >  
> >  
> > -#if IS_ENABLED(CONFIG_HMM_DEVMEM)
> > +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
> >  struct hmm_devmem;
> >  
> >  struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
> > @@ -456,7 +456,7 @@ struct hmm_device {
> >   */
> >  struct hmm_device *hmm_device_new(void *drvdata);
> >  void hmm_device_put(struct hmm_device *hmm_device);
> > -#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
> > +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> >  
> >  
> >  /* Below are for HMM internal use only! Not to be used by device driver! */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index ad082b9..7de939a 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -265,7 +265,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> >  config ARCH_HAS_HMM
> >  	bool
> >  	default y
> > -	depends on X86_64
> > +	depends on X86_64 || PPC64
> 
> Ideally we want to make this (PPC64 && PPC_BOOK3S)

BOOK3S, really? :)

> 
> >  	depends on ZONE_DEVICE
> >  	depends on MMU && 64BIT
> >  	depends on MEMORY_HOTPLUG
> > @@ -277,7 +277,7 @@ config HMM
> >  
> >  config HMM_MIRROR
> >  	bool "HMM mirror CPU page table into a device page table"
> > -	depends on ARCH_HAS_HMM
> > +	depends on ARCH_HAS_HMM && X86_64
> 
> We would need HMM_MIRROR for the generation of hardware that does
> not have CDM

That would require code changes to the mirror code; mostly, ppc is missing
something like pmd_index() IIRC. So it is best to tackle that as a separate
patchset.

> 
> >  	select MMU_NOTIFIER
> >  	select HMM
> >  	help
> > @@ -287,15 +287,6 @@ config HMM_MIRROR
> >  	  page tables (at PAGE_SIZE granularity), and must be able to recover from
> >  	  the resulting potential page faults.
> >  
> > -config HMM_DEVMEM
> > -	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
> > -	depends on ARCH_HAS_HMM
> > -	select HMM
> > -	help
> > -	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
> > -	  feature. This is just to avoid having device drivers to replicating a lot
> > -	  of boiler plate code.  See Documentation/vm/hmm.txt.
> > -
> >  config PHYS_ADDR_T_64BIT
> >  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> >  
> > @@ -720,11 +711,8 @@ config ZONE_DEVICE
> >  
> >  config DEVICE_PRIVATE
> >  	bool "Unaddressable device memory (GPU memory, ...)"
> > -	depends on X86_64
> > -	depends on ZONE_DEVICE
> > -	depends on MEMORY_HOTPLUG
> > -	depends on MEMORY_HOTREMOVE
> > -	depends on SPARSEMEM_VMEMMAP
> > +	depends on ARCH_HAS_HMM && X86_64
> 
> Same as above
> 
> > +	select HMM
> >  
> >  	help
> >  	  Allows creation of struct pages to represent unaddressable device
> > @@ -733,11 +721,8 @@ config DEVICE_PRIVATE
> >  
> >  config DEVICE_PUBLIC
> >  	bool "Unaddressable device memory (GPU memory, ...)"
> 
> The unaddressable is a typo from above.

Yup, cut and paste; thanks for catching that.

Jérôme


* Re: [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
  2017-06-14 23:10   ` John Hubbard
@ 2017-06-15  2:09     ` Jerome Glisse
  2017-06-15  3:15       ` John Hubbard
  0 siblings, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2017-06-15  2:09 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-kernel, linux-mm, David Nellans, Balbir Singh,
	Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt

On Wed, Jun 14, 2017 at 04:10:32PM -0700, John Hubbard wrote:
> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
> > This just simplify kconfig and allow HMM and DEVICE_PUBLIC to be
> > selected for ppc64 once ZONE_DEVICE is allowed on ppc64 (different
> > patchset).
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > Cc: Balbir Singh <balbirs@au1.ibm.com>
> > Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > ---
> >   include/linux/hmm.h |  4 ++--
> >   mm/Kconfig          | 27 ++++++---------------------
> >   mm/hmm.c            |  4 ++--
> >   3 files changed, 10 insertions(+), 25 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index f6713b2..720d18c 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
> >   #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> > -#if IS_ENABLED(CONFIG_HMM_DEVMEM)
> > +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
> >   struct hmm_devmem;
> >   struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
> > @@ -456,7 +456,7 @@ struct hmm_device {
> >    */
> >   struct hmm_device *hmm_device_new(void *drvdata);
> >   void hmm_device_put(struct hmm_device *hmm_device);
> > -#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
> > +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> >   /* Below are for HMM internal use only! Not to be used by device driver! */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index ad082b9..7de939a 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -265,7 +265,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> >   config ARCH_HAS_HMM
> >   	bool
> >   	default y
> > -	depends on X86_64
> > +	depends on X86_64 || PPC64
> >   	depends on ZONE_DEVICE
> >   	depends on MMU && 64BIT
> >   	depends on MEMORY_HOTPLUG
> > @@ -277,7 +277,7 @@ config HMM
> >   config HMM_MIRROR
> >   	bool "HMM mirror CPU page table into a device page table"
> > -	depends on ARCH_HAS_HMM
> > +	depends on ARCH_HAS_HMM && X86_64
> >   	select MMU_NOTIFIER
> >   	select HMM
> >   	help
> 
> Hi Jerome,
> 
> There are still some problems with using this configuration. First and
> foremost, it is still possible (and likely, given the complete dissimilarity
> in naming, and difference in location on the screen) to choose HMM_MIRROR,
> and *not* to choose either DEVICE_PRIVATE or DEVICE_PUBLIC. And then we end
> up with a swath of important page fault handling code being ifdef'd out, and
> one ends up having to investigate why.
> 
> As for solutions, at least for the x86 (DEVICE_PRIVATE)case, we could do this:
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7de939a29466..f64182d7b956 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -279,6 +279,7 @@ config HMM_MIRROR
>         bool "HMM mirror CPU page table into a device page table"
>         depends on ARCH_HAS_HMM && X86_64
>         select MMU_NOTIFIER
> +       select DEVICE_PRIVATE
>         select HMM
>         help
>           Select HMM_MIRROR if you want to mirror range of the CPU page table of a
> 
> ...and that is better than the other direction (having HMM_MIRROR depend on
> DEVICE_PRIVATE), because in the latter case, HMM_MIRROR will disappear (and
> it's several lines above) until you select DEVICE_PRIVATE. That is hard to
> work with for the user.
> 
> The user will tend to select HMM_MIRROR, but it is *not* obvious that he/she
> should also select DEVICE_PRIVATE. So Kconfig should do it for them.
> 
> In fact, I'm not even sure if the DEVICE_PRIVATE and DEVICE_PUBLIC actually
> need Kconfig protection, but if they don't, then life would be easier for
> whoever is configuring their kernel.
> 

We do need Kconfig options for DEVICE_PRIVATE and DEVICE_PUBLIC. I can remove
HMM_MIRROR and have the HMM mirror code ifdef'd on DEVICE_PRIVATE.

Cheers,
Jérôme
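
The resolution proposed above (dropping HMM_MIRROR as a user-visible option and
keying the mirror code off DEVICE_PRIVATE) might look roughly like the fragment
below. This is a sketch only; the exact dependency list was still under
discussion in this thread:

```kconfig
# Sketch: no separate HMM_MIRROR prompt; enabling DEVICE_PRIVATE pulls in
# everything the mirror code needs, so the page fault handling paths can
# no longer be configured out by accident.
config DEVICE_PRIVATE
	bool "Unaddressable device memory (GPU memory, ...)"
	depends on ARCH_HAS_HMM && X86_64
	select MMU_NOTIFIER
	select HMM
	help
	  Allows creation of struct pages to represent unaddressable device
	  memory, and builds the HMM mirror code that handles the resulting
	  CPU page faults.
```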


* Re: [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
  2017-06-15  2:07     ` Jerome Glisse
@ 2017-06-15  2:59       ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2017-06-15  2:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Balbir Singh, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt

On Wed, 14 Jun 2017 22:07:09 -0400
Jerome Glisse <jglisse@redhat.com> wrote:

> On Thu, Jun 15, 2017 at 11:46:11AM +1000, Balbir Singh wrote:
> > On Wed, 14 Jun 2017 16:11:44 -0400
> > Jérôme Glisse <jglisse@redhat.com> wrote:
> >   
> > > This just simplify kconfig and allow HMM and DEVICE_PUBLIC to be
> > > selected for ppc64 once ZONE_DEVICE is allowed on ppc64 (different
> > > patchset).
> > > 
> > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > > Cc: Balbir Singh <balbirs@au1.ibm.com>
> > > Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > ---
> > >  include/linux/hmm.h |  4 ++--
> > >  mm/Kconfig          | 27 ++++++---------------------
> > >  mm/hmm.c            |  4 ++--
> > >  3 files changed, 10 insertions(+), 25 deletions(-)
> > > 
> > > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > > index f6713b2..720d18c 100644
> > > --- a/include/linux/hmm.h
> > > +++ b/include/linux/hmm.h
> > > @@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
> > >  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> > >  
> > >  
> > > -#if IS_ENABLED(CONFIG_HMM_DEVMEM)
> > > +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
> > >  struct hmm_devmem;
> > >  
> > >  struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
> > > @@ -456,7 +456,7 @@ struct hmm_device {
> > >   */
> > >  struct hmm_device *hmm_device_new(void *drvdata);
> > >  void hmm_device_put(struct hmm_device *hmm_device);
> > > -#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
> > > +#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> > >  
> > >  
> > >  /* Below are for HMM internal use only! Not to be used by device driver! */
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index ad082b9..7de939a 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -265,7 +265,7 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > >  config ARCH_HAS_HMM
> > >  	bool
> > >  	default y
> > > -	depends on X86_64
> > > +	depends on X86_64 || PPC64  
> > 
> > Ideally we want to make this (PPC64 && PPC_BOOK3S)  
> 
> BOOK3S really ? :)

Sorry, what's wrong with that? I am not sure we want it for 64-bit BOOK3E.
Or maybe we could just make ARCH_HAS_HMM a config option set by the arch,
like ARCH_HAS_SET_MEMORY.

> 
> >   
> > >  	depends on ZONE_DEVICE
> > >  	depends on MMU && 64BIT
> > >  	depends on MEMORY_HOTPLUG
> > > @@ -277,7 +277,7 @@ config HMM
> > >  
> > >  config HMM_MIRROR
> > >  	bool "HMM mirror CPU page table into a device page table"
> > > -	depends on ARCH_HAS_HMM
> > > +	depends on ARCH_HAS_HMM && X86_64  
> > 
> > We would need HMM_MIRROR for the generation of hardware that does
> > not have CDM  
> 
> That would require could change to mirror code mostly ppc is missing
> something like pmd_index() iirc. So best to tackle that as separate
> patchset.
> 

We do have pmd_index(), and I think Reza got the whole of HMM to compile
on powerpc. I am OK with tackling it later. Do you suspect there is a
compile-time dependency? I got HMM_MIRROR to compile for me as well.

> >   
> > >  	select MMU_NOTIFIER
> > >  	select HMM
> > >  	help
> > > @@ -287,15 +287,6 @@ config HMM_MIRROR
> > >  	  page tables (at PAGE_SIZE granularity), and must be able to recover from
> > >  	  the resulting potential page faults.
> > >  
> > > -config HMM_DEVMEM
> > > -	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
> > > -	depends on ARCH_HAS_HMM
> > > -	select HMM
> > > -	help
> > > -	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
> > > -	  feature. This is just to avoid having device drivers to replicating a lot
> > > -	  of boiler plate code.  See Documentation/vm/hmm.txt.
> > > -
> > >  config PHYS_ADDR_T_64BIT
> > >  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> > >  
> > > @@ -720,11 +711,8 @@ config ZONE_DEVICE
> > >  
> > >  config DEVICE_PRIVATE
> > >  	bool "Unaddressable device memory (GPU memory, ...)"
> > > -	depends on X86_64
> > > -	depends on ZONE_DEVICE
> > > -	depends on MEMORY_HOTPLUG
> > > -	depends on MEMORY_HOTREMOVE
> > > -	depends on SPARSEMEM_VMEMMAP
> > > +	depends on ARCH_HAS_HMM && X86_64  
> > 
> > Same as above
> >   
> > > +	select HMM
> > >  
> > >  	help
> > >  	  Allows creation of struct pages to represent unaddressable device
> > > @@ -733,11 +721,8 @@ config DEVICE_PRIVATE
> > >  
> > >  config DEVICE_PUBLIC
> > >  	bool "Unaddressable device memory (GPU memory, ...)"  
> > 
> > The unaddressable is a typo from above.  
> 
> Yup cut and paste thank for catching that.

You're welcome.
Balbir Singh.


* Re: [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
  2017-06-15  2:04     ` Jerome Glisse
@ 2017-06-15  3:10       ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2017-06-15  3:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

On Wed, 14 Jun 2017 22:04:55 -0400
Jerome Glisse <jglisse@redhat.com> wrote:

> On Thu, Jun 15, 2017 at 11:41:59AM +1000, Balbir Singh wrote:
> > On Wed, 14 Jun 2017 16:11:43 -0400
> > Jérôme Glisse <jglisse@redhat.com> wrote:
> >   
> > > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > > thus need special handling when it comes to lru or refcount. This
> > > patch make sure that memcontrol properly handle those when it face
> > > them. Those pages are use like regular pages in a process address
> > > space either as anonymous page or as file back page. So from memcg
> > > point of view we want to handle them like regular page for now at
> > > least.
> > > 
> > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > > Cc: cgroups@vger.kernel.org
> > > ---
> > >  kernel/memremap.c |  2 ++
> > >  mm/memcontrol.c   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > >  2 files changed, 55 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > > index da74775..584984c 100644
> > > --- a/kernel/memremap.c
> > > +++ b/kernel/memremap.c
> > > @@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
> > >  		__ClearPageActive(page);
> > >  		__ClearPageWaiters(page);
> > >  
> > > +		mem_cgroup_uncharge(page);
> > > +  
> > 
> > A zone device page could have a mem_cgroup charge if
> > 
> > 1. The old page was charged to a cgroup and the new page from ZONE_DEVICE then
> > gets the charge that we need to drop here
> > 
> > And should not be charged
> > 
> > 2. If the driver allowed mmap based allocation (these pages are not on LRU
> > 
> > 
> > Since put_zone_device_private_or_public_page() is called from release_pages(),
> > I think the assumption is that 2 is not a problem? I've not tested the mmap
> > bits yet.  
> 
> Well that is one of the big question. Do we care about memory cgroup despite
> page not being on lru and thus not being reclaimable through the usual path ?
> 
> I believe we do want to keep charging ZONE_DEVICE page against memory cgroup
> so that userspace limit are enforced. This is important especialy for device
> private when migrating back to system memory due to CPU page fault. We do not
> want the migration back to fail because of memory cgroup limit.
> 
> Hence why i do want to charge ZONE_DEVICE page just like regular page. If we
> have people that run into OOM because of this then we can start thinking about
> how to account those pages slightly differently inside the memory cgroup.
> 
> For now i believe we do want this patch.
> 

Yes, we do need the patch. I was trying to check whether we'll end up trying
to uncharge a page that is not charged; just double checking.

> 
> [...]
> 
> > > @@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > >  		 */
> > >  		if (page->mem_cgroup == mc.from) {
> > >  			ret = MC_TARGET_PAGE;
> > > +			if (is_device_private_page(page) ||
> > > +			    is_device_public_page(page))
> > > +				ret = MC_TARGET_DEVICE;
> > >  			if (target)
> > >  				target->page = page;
> > >  		}
> > > @@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> > >  
> > >  	ptl = pmd_trans_huge_lock(pmd, vma);
> > >  	if (ptl) {
> > > +		/*
> > > +		 * Note their can not be MC_TARGET_DEVICE for now as we do not  
> >                         there  
> > > +		 * support transparent huge page with MEMORY_DEVICE_PUBLIC or
> > > +		 * MEMORY_DEVICE_PRIVATE but this might change.  
> > 
> > I am trying to remind myself why THP and MEMORY_DEVICE_* pages don't work well
> > together today, the driver could allocate a THP size set of pages and migrate it.
> > There are patches to do THP migration, not upstream yet. Could you remind me
> > of any other limitations?  
> 
> No there is nothing that would be problematic AFAICT. Persistent memory already
> use huge page so we should be in the clear. But i would rather enable that as
> a separate patchset alltogether and have proper testing specificaly for such
> scenario.

Agreed
Balbir Singh.


* Re: [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
  2017-06-15  2:09     ` Jerome Glisse
@ 2017-06-15  3:15       ` John Hubbard
  0 siblings, 0 replies; 23+ messages in thread
From: John Hubbard @ 2017-06-15  3:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, linux-mm, David Nellans, Balbir Singh,
	Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt

On 06/14/2017 07:09 PM, Jerome Glisse wrote:
> On Wed, Jun 14, 2017 at 04:10:32PM -0700, John Hubbard wrote:
>> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
[...]
>> Hi Jerome,
>>
>> There are still some problems with using this configuration. First and
>> foremost, it is still possible (and likely, given the complete dissimilarity
>> in naming, and difference in location on the screen) to choose HMM_MIRROR,
>> and *not* to choose either DEVICE_PRIVATE or DEVICE_PUBLIC. And then we end
>> up with a swath of important page fault handling code being ifdef'd out, and
>> one ends up having to investigate why.
>>
>> As for solutions, at least for the x86 (DEVICE_PRIVATE) case, we could do this:
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7de939a29466..f64182d7b956 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -279,6 +279,7 @@ config HMM_MIRROR
>>          bool "HMM mirror CPU page table into a device page table"
>>          depends on ARCH_HAS_HMM && X86_64
>>          select MMU_NOTIFIER
>> +       select DEVICE_PRIVATE
>>          select HMM
>>          help
>>            Select HMM_MIRROR if you want to mirror range of the CPU page table of a
>>
>> ...and that is better than the other direction (having HMM_MIRROR depend on
>> DEVICE_PRIVATE), because in the latter case, HMM_MIRROR will disappear (and
>> it's several lines above) until you select DEVICE_PRIVATE. That is hard to
>> work with for the user.
>>
>> The user will tend to select HMM_MIRROR, but it is *not* obvious that he/she
>> should also select DEVICE_PRIVATE. So Kconfig should do it for them.
>>
>> In fact, I'm not even sure if the DEVICE_PRIVATE and DEVICE_PUBLIC actually
>> need Kconfig protection, but if they don't, then life would be easier for
>> whoever is configuring their kernel.
>>
> 
> We do need Kconfig for DEVICE_PRIVATE and DEVICE_PUBLIC. I can remove HMM_MIRROR
> and have HMM mirror code ifdef on DEVICE_PRIVATE.
> 
> Cheers,
> Jérôme

That's probably fine.

(I see that you may have missed the rest of my response, but looks like Balbir 
covered it.)

thanks,
john h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-06-15  3:31   ` Balbir Singh
  2017-06-15 15:35     ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2017-06-15  3:31 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

On Wed, 14 Jun 2017 16:11:42 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> HMM pages (private or public device pages) are ZONE_DEVICE pages and
> thus you can not use the page->lru field of those pages. This patch
> re-arranges the uncharge code to allow a single page to be uncharged
> without modifying the lru field of the struct page.
> 
> There is no change to memcontrol logic, it is the same as it was
> before this patch.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> ---
>  mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 92 insertions(+), 76 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e3fe4d0..b93f5fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
>  	cancel_charge(memcg, nr_pages);
>  }
>  
> -static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
> -			   unsigned long nr_anon, unsigned long nr_file,
> -			   unsigned long nr_kmem, unsigned long nr_huge,
> -			   unsigned long nr_shmem, struct page *dummy_page)
> +struct uncharge_gather {
> +	struct mem_cgroup *memcg;
> +	unsigned long pgpgout;
> +	unsigned long nr_anon;
> +	unsigned long nr_file;
> +	unsigned long nr_kmem;
> +	unsigned long nr_huge;
> +	unsigned long nr_shmem;
> +	struct page *dummy_page;
> +};
> +
> +static inline void uncharge_gather_clear(struct uncharge_gather *ug)
>  {
> -	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
> +	memset(ug, 0, sizeof(*ug));
> +}
> +
> +static void uncharge_batch(const struct uncharge_gather *ug)
> +{

Can we pass page as an argument so that we can do check events on the page?

> +	unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
>  	unsigned long flags;
>  
> -	if (!mem_cgroup_is_root(memcg)) {
> -		page_counter_uncharge(&memcg->memory, nr_pages);
> +	if (!mem_cgroup_is_root(ug->memcg)) {
> +		page_counter_uncharge(&ug->memcg->memory, nr_pages);
>  		if (do_memsw_account())
> -			page_counter_uncharge(&memcg->memsw, nr_pages);
> -		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
> -			page_counter_uncharge(&memcg->kmem, nr_kmem);
> -		memcg_oom_recover(memcg);
> +			page_counter_uncharge(&ug->memcg->memsw, nr_pages);
> +		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
> +			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
> +		memcg_oom_recover(ug->memcg);
>  	}
>  
>  	local_irq_save(flags);
> -	__this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
> -	__this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
> -	__this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
> -	__this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
> -	__this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
> -	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
> -	memcg_check_events(memcg, dummy_page);
> +	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
> +	__this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
> +	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
> +	__this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
> +	__this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
> +	__this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
> +	memcg_check_events(ug->memcg, ug->dummy_page);
>  	local_irq_restore(flags);
>  
> -	if (!mem_cgroup_is_root(memcg))
> -		css_put_many(&memcg->css, nr_pages);
> +	if (!mem_cgroup_is_root(ug->memcg))
> +		css_put_many(&ug->memcg->css, nr_pages);
> +}
> +
> +static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> +{
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
> +	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> +
> +	if (!page->mem_cgroup)
> +		return;
> +
> +	/*
> +	 * Nobody should be changing or seriously looking at
> +	 * page->mem_cgroup at this point, we have fully
> +	 * exclusive access to the page.
> +	 */
> +
> +	if (ug->memcg != page->mem_cgroup) {
> +		if (ug->memcg) {
> +			uncharge_batch(ug);

What is ug->dummy_page set to at this point? ug->dummy_page is assigned below

> +			uncharge_gather_clear(ug);
> +		}
> +		ug->memcg = page->mem_cgroup;
> +	}
> +
> +	if (!PageKmemcg(page)) {
> +		unsigned int nr_pages = 1;
> +
> +		if (PageTransHuge(page)) {
> +			nr_pages <<= compound_order(page);
> +			ug->nr_huge += nr_pages;
> +		}
> +		if (PageAnon(page))
> +			ug->nr_anon += nr_pages;
> +		else {
> +			ug->nr_file += nr_pages;
> +			if (PageSwapBacked(page))
> +				ug->nr_shmem += nr_pages;
> +		}
> +		ug->pgpgout++;
> +	} else {
> +		ug->nr_kmem += 1 << compound_order(page);
> +		__ClearPageKmemcg(page);
> +	}
> +
> +	ug->dummy_page = page;
> +	page->mem_cgroup = NULL;
>  }
>  
>  static void uncharge_list(struct list_head *page_list)
>  {
> -	struct mem_cgroup *memcg = NULL;
> -	unsigned long nr_shmem = 0;
> -	unsigned long nr_anon = 0;
> -	unsigned long nr_file = 0;
> -	unsigned long nr_huge = 0;
> -	unsigned long nr_kmem = 0;
> -	unsigned long pgpgout = 0;
> +	struct uncharge_gather ug;
>  	struct list_head *next;
> -	struct page *page;
> +
> +	uncharge_gather_clear(&ug);
>  
>  	/*
>  	 * Note that the list can be a single page->lru; hence the
> @@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
>  	 */
>  	next = page_list->next;
>  	do {
> +		struct page *page;
> +

Nit pick

VM_WARN_ON(is_zone_device_page(page));

>  		page = list_entry(next, struct page, lru);
>  		next = page->lru.next;
>  
> -		VM_BUG_ON_PAGE(PageLRU(page), page);
> -		VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> -
> -		if (!page->mem_cgroup)
> -			continue;
> -
> -		/*
> -		 * Nobody should be changing or seriously looking at
> -		 * page->mem_cgroup at this point, we have fully
> -		 * exclusive access to the page.
> -		 */
> -
> -		if (memcg != page->mem_cgroup) {
> -			if (memcg) {
> -				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
> -					       nr_kmem, nr_huge, nr_shmem, page);
> -				pgpgout = nr_anon = nr_file = nr_kmem = 0;
> -				nr_huge = nr_shmem = 0;
> -			}
> -			memcg = page->mem_cgroup;
> -		}
> -
> -		if (!PageKmemcg(page)) {
> -			unsigned int nr_pages = 1;
> -
> -			if (PageTransHuge(page)) {
> -				nr_pages <<= compound_order(page);
> -				nr_huge += nr_pages;
> -			}
> -			if (PageAnon(page))
> -				nr_anon += nr_pages;
> -			else {
> -				nr_file += nr_pages;
> -				if (PageSwapBacked(page))
> -					nr_shmem += nr_pages;
> -			}
> -			pgpgout++;
> -		} else {
> -			nr_kmem += 1 << compound_order(page);
> -			__ClearPageKmemcg(page);
> -		}
> -
> -		page->mem_cgroup = NULL;
> +		uncharge_page(page, &ug);
>  	} while (next != page_list);
>  
> -	if (memcg)
> -		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
> -			       nr_kmem, nr_huge, nr_shmem, page);
> +	if (ug.memcg)
> +		uncharge_batch(&ug);
>  }
>  
>  /**
> @@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
>   */
>  void mem_cgroup_uncharge(struct page *page)
>  {
> +	struct uncharge_gather ug;
> +
>  	if (mem_cgroup_disabled())
>  		return;
>  
> @@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
>  	if (!page->mem_cgroup)
>  		return;
>  
> -	INIT_LIST_HEAD(&page->lru);
> -	uncharge_list(&page->lru);
> +	uncharge_gather_clear(&ug);
> +	uncharge_page(page, &ug);
> +	uncharge_batch(&ug);
>  }


Balbir Singh.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 2/5] mm/hmm: add new helper to hotplug CDM memory region
  2017-06-14 20:11 ` [HMM-CDM 2/5] mm/hmm: add new helper to hotplug CDM memory region Jérôme Glisse
@ 2017-06-15  4:28   ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2017-06-15  4:28 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans

On Wed, 14 Jun 2017 16:11:41 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> Unlike unaddressable memory, coherent device memory has a real
> resource associated with it on the system (as CPU can address
> it). Add a new helper to hotplug such memory within the HMM
> framework.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---

Looks good to me

Reviewed-by: Balbir Singh <bsingharora@gmail.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-06-15  3:31   ` Balbir Singh
@ 2017-06-15 15:35     ` Jerome Glisse
  0 siblings, 0 replies; 23+ messages in thread
From: Jerome Glisse @ 2017-06-15 15:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

On Thu, Jun 15, 2017 at 01:31:28PM +1000, Balbir Singh wrote:
> On Wed, 14 Jun 2017 16:11:42 -0400
> Jérôme Glisse <jglisse@redhat.com> wrote:
> 
> > HMM pages (private or public device pages) are ZONE_DEVICE pages and
> > thus you can not use the page->lru field of those pages. This patch
> > re-arranges the uncharge code to allow a single page to be uncharged
> > without modifying the lru field of the struct page.
> > 
> > There is no change to memcontrol logic, it is the same as it was
> > before this patch.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: cgroups@vger.kernel.org
> > ---
> >  mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
> >  1 file changed, 92 insertions(+), 76 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e3fe4d0..b93f5fe 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
> >  	cancel_charge(memcg, nr_pages);
> >  }
> >  
> > -static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
> > -			   unsigned long nr_anon, unsigned long nr_file,
> > -			   unsigned long nr_kmem, unsigned long nr_huge,
> > -			   unsigned long nr_shmem, struct page *dummy_page)
> > +struct uncharge_gather {
> > +	struct mem_cgroup *memcg;
> > +	unsigned long pgpgout;
> > +	unsigned long nr_anon;
> > +	unsigned long nr_file;
> > +	unsigned long nr_kmem;
> > +	unsigned long nr_huge;
> > +	unsigned long nr_shmem;
> > +	struct page *dummy_page;
> > +};
> > +
> > +static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> >  {
> > -	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
> > +	memset(ug, 0, sizeof(*ug));
> > +}
> > +
> > +static void uncharge_batch(const struct uncharge_gather *ug)
> > +{
> 
> Can we pass page as an argument so that we can do check events on the page?

Well, that is what dummy_page is for. I wanted to keep the code as close
to the existing code as possible, to make it easier for people to see
that there is no logic change in this patch.


[...]

> > +static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> > +{
> > +	VM_BUG_ON_PAGE(PageLRU(page), page);
> > +	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> > +
> > +	if (!page->mem_cgroup)
> > +		return;
> > +
> > +	/*
> > +	 * Nobody should be changing or seriously looking at
> > +	 * page->mem_cgroup at this point, we have fully
> > +	 * exclusive access to the page.
> > +	 */
> > +
> > +	if (ug->memcg != page->mem_cgroup) {
> > +		if (ug->memcg) {
> > +			uncharge_batch(ug);
> 
> What is ug->dummy_page set to at this point? ug->dummy_page is assigned below

So at the beginning ug->memcg is NULL and so is ug->dummy_page. After the
first call to uncharge_page(), if ug->memcg isn't NULL then ug->dummy_page
points to a valid page. So uncharge_batch() can never be called with
dummy_page == NULL, same as before this patch.


[...]

> >  static void uncharge_list(struct list_head *page_list)
> >  {
> > -	struct mem_cgroup *memcg = NULL;
> > -	unsigned long nr_shmem = 0;
> > -	unsigned long nr_anon = 0;
> > -	unsigned long nr_file = 0;
> > -	unsigned long nr_huge = 0;
> > -	unsigned long nr_kmem = 0;
> > -	unsigned long pgpgout = 0;
> > +	struct uncharge_gather ug;
> >  	struct list_head *next;
> > -	struct page *page;
> > +
> > +	uncharge_gather_clear(&ug);
> >  
> >  	/*
> >  	 * Note that the list can be a single page->lru; hence the
> > @@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
> >  	 */
> >  	next = page_list->next;
> >  	do {
> > +		struct page *page;
> > +
> 
> Nit pick
> 
> VM_WARN_ON(is_zone_device_page(page));

Yeah, probably a good thing to add. I will add it as part of the
other memcontrol patch, as I want to keep this one about moving
stuff around with no logic change.

Thanks for reviewing
Jérôme

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-06-15 15:35 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
2017-06-14 20:11 ` [HMM-CDM 1/5] mm/device-public-memory: device memory cache coherent with CPU Jérôme Glisse
2017-06-14 20:11 ` [HMM-CDM 2/5] mm/hmm: add new helper to hotplug CDM memory region Jérôme Glisse
2017-06-15  4:28   ` Balbir Singh
2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
2017-06-15  3:31   ` Balbir Singh
2017-06-15 15:35     ` Jerome Glisse
2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
2017-06-15  1:41   ` Balbir Singh
2017-06-15  2:04     ` Jerome Glisse
2017-06-15  3:10       ` Balbir Singh
2017-06-14 20:11 ` [HMM-CDM 5/5] mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64 Jérôme Glisse
2017-06-14 23:10   ` John Hubbard
2017-06-15  2:09     ` Jerome Glisse
2017-06-15  3:15       ` John Hubbard
2017-06-15  1:46   ` Balbir Singh
2017-06-15  2:07     ` Jerome Glisse
2017-06-15  2:59       ` Balbir Singh
2017-06-14 21:20 ` [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Dave Hansen
2017-06-14 21:38   ` Jerome Glisse
2017-06-14 21:58     ` Dave Hansen
2017-06-14 22:07       ` Benjamin Herrenschmidt
2017-06-14 23:40       ` Balbir Singh
