linux-kernel.vger.kernel.org archive mirror
* [PATCH V3 0/4] Define coherent device memory node
@ 2017-02-15 12:07 Anshuman Khandual
  2017-02-15 12:07 ` [PATCH V3 1/4] mm: Define coherent device memory (CDM) node Anshuman Khandual
                   ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-15 12:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

	These four patches define a CDM node with HugeTLB & Buddy allocation
isolation. Please refer to the last RFC posting linked below for more
details. The RFC series has been split up for an easier review process. The
next part of the work, such as VM flags, auto NUMA and KSM interactions
with tagged VMAs, will follow later.

RFC https://lkml.org/lkml/2017/1/29/198

Changes in V3:

* Changed is_coherent_node interface into is_cdm_node
* Fixed the CDM allocation leak problem seen when cpuset is enabled on the
  system, which required changes to the get_page_from_freelist() function
  to verify that a non-NULL nodemask is indeed the requested one and not
  the result of a cpuset override on the way

Changes in V2:	(https://lkml.org/lkml/2017/2/10/183)

* Removed redundant nodemask_has_cdm() check from the zonelist iterator
* Dropped the nodemask_has_cdm() function itself
* Added node_set/clear_state_cdm() functions and removed a bunch of #ifdefs
* Moved CDM helper functions into nodemask.h from the node.h header file
* Fixed the build failure with an additional CONFIG_NEED_MULTIPLE_NODES check

Previous V1:	(https://lkml.org/lkml/2017/2/8/329)

Anshuman Khandual (4):
  mm: Define coherent device memory (CDM) node
  mm: Enable HugeTLB allocation isolation for CDM nodes
  mm: Add new parameter to get_page_from_freelist() function
  mm: Enable Buddy allocation isolation for CDM nodes

 Documentation/ABI/stable/sysfs-devices-node |  7 +++
 arch/powerpc/Kconfig                        |  1 +
 arch/powerpc/mm/numa.c                      |  7 +++
 drivers/base/node.c                         |  6 +++
 include/linux/nodemask.h                    | 58 ++++++++++++++++++++-
 mm/Kconfig                                  |  4 ++
 mm/hugetlb.c                                | 25 +++++----
 mm/memory_hotplug.c                         |  3 ++
 mm/page_alloc.c                             | 81 +++++++++++++++++++++--------
 9 files changed, 160 insertions(+), 32 deletions(-)

-- 
2.9.3


* [PATCH V3 1/4] mm: Define coherent device memory (CDM) node
  2017-02-15 12:07 [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
@ 2017-02-15 12:07 ` Anshuman Khandual
  2017-02-17 14:05   ` Bob Liu
  2017-02-15 12:07 ` [PATCH V3 2/4] mm: Enable HugeTLB allocation isolation for CDM nodes Anshuman Khandual
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-15 12:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

There are certain devices like specialized accelerators, GPU cards, network
cards, FPGA cards etc. which may contain onboard memory that is coherent
with the existing system RAM while being accessed either from the CPU or
from the device. Such memory shares some properties with normal system RAM
but can also differ from it in other respects.

User applications might be interested in using this kind of coherent device
memory explicitly or implicitly alongside the system RAM, utilizing all
possible core memory functions like anon mapping (LRU), file mapping (LRU),
page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
etc. To achieve this kind of tight integration with the core memory
subsystem, the device's onboard coherent memory must be represented as a
memory-only NUMA node. At the same time the arch must export some kind of a
function to identify such a node as coherent device memory and not as any
other regular cpu-less memory-only NUMA node.

Even after this integration with the core memory subsystem, coherent device
memory might still need some special consideration inside the kernel. There
can be a variety of coherent memory nodes with different expectations from
the core kernel. But right now only one kind of special treatment is
considered, one which requires a certain amount of isolation.

Now consider the case of a coherent device memory node type which requires
isolation. This kind of coherent memory is onboard an external device
attached to the system through a link where there is always a chance of a
link failure taking down the entire memory node with it. Moreover the
memory might also have a higher chance of ECC failure compared to the
system RAM. Hence allocations into this kind of coherent memory node should
be regulated. Kernel allocations must not come here. Normal user space
allocations too should not come here implicitly (without the user
application knowing about it). This summarizes the isolation requirement of
one kind of coherent device memory node as an example; there can be other
kinds of isolation requirements as well.

Some coherent memory devices might not require isolation at all. Other
coherent memory devices might require some different special treatment
after becoming part of the core memory representation. For now, this
series looks only at isolation-seeking coherent device memory nodes, not
the other kinds.

To implement the integration as well as isolation, the coherent memory node
must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
the node_states[] array. During memory hotplug operations, the new nodemask
N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
memory nodes. This also creates the following new sysfs based interface,
which lists all the coherent device memory nodes in the system.

	/sys/devices/system/node/is_cdm_node

Architectures that enable CONFIG_COHERENT_DEVICE must export the function
arch_check_node_cdm(), which identifies coherent device memory nodes.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 Documentation/ABI/stable/sysfs-devices-node |  7 ++++
 arch/powerpc/Kconfig                        |  1 +
 arch/powerpc/mm/numa.c                      |  7 ++++
 drivers/base/node.c                         |  6 +++
 include/linux/nodemask.h                    | 58 ++++++++++++++++++++++++++++-
 mm/Kconfig                                  |  4 ++
 mm/memory_hotplug.c                         |  3 ++
 mm/page_alloc.c                             |  8 +++-
 8 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 5b2d0f0..5df18f7 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -29,6 +29,13 @@ Description:
 		Nodes that have regular or high memory.
 		Depends on CONFIG_HIGHMEM.
 
+What:		/sys/devices/system/node/is_cdm_node
+Date:		January 2017
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		Lists the nodemask of nodes that have coherent device memory.
+		Depends on CONFIG_COHERENT_DEVICE.
+
 What:		/sys/devices/system/node/nodeX
 Date:		October 2002
 Contact:	Linux Memory Management list <linux-mm@kvack.org>
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 281f4f1..1cff239 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -164,6 +164,7 @@ config PPC
 	select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
 	select HAVE_ARCH_HARDENED_USERCOPY
 	select HAVE_KERNEL_GZIP
+	select COHERENT_DEVICE if PPC_BOOK3S_64 && NEED_MULTIPLE_NODES
 
 config GENERIC_CSUM
 	def_bool CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b1099cb..14f0b98 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -41,6 +41,13 @@
 #include <asm/setup.h>
 #include <asm/vdso.h>
 
+#ifdef CONFIG_COHERENT_DEVICE
+inline int arch_check_node_cdm(int nid)
+{
+	return 0;
+}
+#endif
+
 static int numa_enabled = 1;
 
 static char *cmdline __initdata;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..f39e615 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -661,6 +661,9 @@ static struct node_attr node_state_attr[] = {
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 #endif
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
+#ifdef CONFIG_COHERENT_DEVICE
+	[N_COHERENT_DEVICE] = _NODE_ATTR(is_cdm_node, N_COHERENT_DEVICE),
+#endif
 };
 
 static struct attribute *node_state_attrs[] = {
@@ -674,6 +677,9 @@ static struct attribute *node_state_attrs[] = {
 	&node_state_attr[N_MEMORY].attr.attr,
 #endif
 	&node_state_attr[N_CPU].attr.attr,
+#ifdef CONFIG_COHERENT_DEVICE
+	&node_state_attr[N_COHERENT_DEVICE].attr.attr,
+#endif
 	NULL
 };
 
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index f746e44..175c2d6 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -388,11 +388,14 @@ enum node_states {
 	N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
 #ifdef CONFIG_MOVABLE_NODE
-	N_MEMORY,		/* The node has memory(regular, high, movable) */
+	N_MEMORY,	/* The node has memory(regular, high, movable, cdm) */
 #else
 	N_MEMORY = N_HIGH_MEMORY,
 #endif
 	N_CPU,		/* The node has one or more cpus */
+#ifdef CONFIG_COHERENT_DEVICE
+	N_COHERENT_DEVICE,	/* The node has CDM memory */
+#endif
 	NR_NODE_STATES
 };
 
@@ -496,6 +499,59 @@ static inline int node_random(const nodemask_t *mask)
 }
 #endif
 
+#ifdef CONFIG_COHERENT_DEVICE
+extern int arch_check_node_cdm(int nid);
+
+static inline nodemask_t system_mem_nodemask(void)
+{
+	nodemask_t system_mem;
+
+	nodes_clear(system_mem);
+	nodes_andnot(system_mem, node_states[N_MEMORY],
+			node_states[N_COHERENT_DEVICE]);
+	return system_mem;
+}
+
+static inline bool is_cdm_node(int node)
+{
+	return node_isset(node, node_states[N_COHERENT_DEVICE]);
+}
+
+static inline void node_set_state_cdm(int node)
+{
+	if (arch_check_node_cdm(node))
+		node_set_state(node, N_COHERENT_DEVICE);
+}
+
+static inline void node_clear_state_cdm(int node)
+{
+	if (arch_check_node_cdm(node))
+		node_clear_state(node, N_COHERENT_DEVICE);
+}
+
+#else
+
+static inline int arch_check_node_cdm(int nid) { return 0; }
+
+static inline nodemask_t system_mem_nodemask(void)
+{
+	return node_states[N_MEMORY];
+}
+
+static inline bool is_cdm_node(int node)
+{
+	return false;
+}
+
+static inline void node_set_state_cdm(int node)
+{
+}
+
+static inline void node_clear_state_cdm(int node)
+{
+}
+#endif	/* CONFIG_COHERENT_DEVICE */
+
 #define node_online_map 	node_states[N_ONLINE]
 #define node_possible_map 	node_states[N_POSSIBLE]
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb..6263a65 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -143,6 +143,10 @@ config HAVE_GENERIC_RCU_GUP
 config ARCH_DISCARD_MEMBLOCK
 	bool
 
+config COHERENT_DEVICE
+	bool
+	default n
+
 config NO_BOOTMEM
 	bool
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b8c11e0..6bce093 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1030,6 +1030,7 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 	if (arg->status_change_nid_high >= 0)
 		node_set_state(node, N_HIGH_MEMORY);
 
+	node_set_state_cdm(node);
 	node_set_state(node, N_MEMORY);
 }
 
@@ -1843,6 +1844,8 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 	if ((N_MEMORY != N_HIGH_MEMORY) &&
 	    (arg->status_change_nid >= 0))
 		node_clear_state(node, N_MEMORY);
+
+	node_clear_state_cdm(node);
 }
 
 static int __ref __offline_pages(unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3e0c69..84d61bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6080,8 +6080,10 @@ static unsigned long __init early_calculate_totalpages(void)
 		unsigned long pages = end_pfn - start_pfn;
 
 		totalpages += pages;
-		if (pages)
+		if (pages) {
+			node_set_state_cdm(nid);
 			node_set_state(nid, N_MEMORY);
+		}
 	}
 	return totalpages;
 }
@@ -6392,8 +6394,10 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 				find_min_pfn_for_node(nid), NULL);
 
 		/* Any memory on that node */
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
+			node_set_state_cdm(nid);
 			node_set_state(nid, N_MEMORY);
+		}
 		check_for_memory(pgdat, nid);
 	}
 }
-- 
2.9.3


* [PATCH V3 2/4] mm: Enable HugeTLB allocation isolation for CDM nodes
  2017-02-15 12:07 [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
  2017-02-15 12:07 ` [PATCH V3 1/4] mm: Define coherent device memory (CDM) node Anshuman Khandual
@ 2017-02-15 12:07 ` Anshuman Khandual
  2017-02-15 12:07 ` [PATCH V3 3/4] mm: Add new parameter to get_page_from_freelist() function Anshuman Khandual
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-15 12:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

HugeTLB allocation/release/accounting currently spans all the nodes in the
N_MEMORY node mask. Coherent device memory nodes should not be part of
these allocations. So use the system_mem_nodemask() helper to fetch the
system-RAM-only nodes on the platform, which can then be used for HugeTLB
allocation purposes instead of the N_MEMORY node mask. This isolates
coherent device memory nodes from HugeTLB allocations.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/hugetlb.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c7025c1..9a46d9f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1790,6 +1790,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	nodemask_t system_mem = system_mem_nodemask();
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -1816,7 +1817,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	while (nr_pages--) {
 		h->resv_huge_pages--;
 		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		if (!free_pool_huge_page(h, &system_mem, 1))
 			goto out;
 		cond_resched_lock(&hugetlb_lock);
 	}
@@ -2107,8 +2108,9 @@ int __weak alloc_bootmem_huge_page(struct hstate *h)
 {
 	struct huge_bootmem_page *m;
 	int nr_nodes, node;
+	nodemask_t system_mem = system_mem_nodemask();
 
-	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+	for_each_node_mask_to_alloc(h, nr_nodes, node, &system_mem) {
 		void *addr;
 
 		addr = memblock_virt_alloc_try_nid_nopanic(
@@ -2177,13 +2179,14 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	nodemask_t system_mem = system_mem_nodemask();
+
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
-		} else if (!alloc_fresh_huge_page(h,
-					 &node_states[N_MEMORY]))
+		} else if (!alloc_fresh_huge_page(h, &system_mem))
 			break;
 	}
 	h->max_huge_pages = i;
@@ -2420,6 +2423,8 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 					   unsigned long count, size_t len)
 {
 	int err;
+	nodemask_t system_mem = system_mem_nodemask();
+
 	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
@@ -2434,7 +2439,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		if (!(obey_mempolicy &&
 				init_nodemask_of_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
-			nodes_allowed = &node_states[N_MEMORY];
+			nodes_allowed = &system_mem;
 		}
 	} else if (nodes_allowed) {
 		/*
@@ -2444,11 +2449,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
 		init_nodemask_of_node(nodes_allowed, nid);
 	} else
-		nodes_allowed = &node_states[N_MEMORY];
+		nodes_allowed = &system_mem;
 
 	h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
 
-	if (nodes_allowed != &node_states[N_MEMORY])
+	if (nodes_allowed != &system_mem)
 		NODEMASK_FREE(nodes_allowed);
 
 	return len;
@@ -2745,9 +2750,10 @@ static void hugetlb_register_node(struct node *node)
  */
 static void __init hugetlb_register_all_nodes(void)
 {
+	nodemask_t nodes = system_mem_nodemask();
 	int nid;
 
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_mask(nid, nodes) {
 		struct node *node = node_devices[nid];
 		if (node->dev.id == nid)
 			hugetlb_register_node(node);
@@ -3019,11 +3025,12 @@ void hugetlb_show_meminfo(void)
 {
 	struct hstate *h;
 	int nid;
+	nodemask_t system_mem = system_mem_nodemask();
 
 	if (!hugepages_supported())
 		return;
 
-	for_each_node_state(nid, N_MEMORY)
+	for_each_node_mask(nid, system_mem)
 		for_each_hstate(h)
 			pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n",
 				nid,
-- 
2.9.3


* [PATCH V3 3/4] mm: Add new parameter to get_page_from_freelist() function
  2017-02-15 12:07 [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
  2017-02-15 12:07 ` [PATCH V3 1/4] mm: Define coherent device memory (CDM) node Anshuman Khandual
  2017-02-15 12:07 ` [PATCH V3 2/4] mm: Enable HugeTLB allocation isolation for CDM nodes Anshuman Khandual
@ 2017-02-15 12:07 ` Anshuman Khandual
  2017-02-15 12:07 ` [PATCH V3 4/4] mm: Enable Buddy allocation isolation for CDM nodes Anshuman Khandual
  2017-02-15 18:20 ` [PATCH V3 0/4] Define coherent device memory node Mel Gorman
  4 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-15 12:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

This changes the get_page_from_freelist() function to accept a new
parameter, orig_mask, which preserves the requested nodemask up to the
zonelist iteration phase, where it can be used to check whether a cpuset
based nodemask override was ever performed. This enables deciding the
acceptance of a CDM zone during allocation, which is implemented in the
subsequent patch.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 50 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 84d61bb..afbd24d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2893,7 +2893,7 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
-						const struct alloc_context *ac)
+			const struct alloc_context *ac, nodemask_t *orig_mask)
 {
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
@@ -3050,7 +3050,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+	const struct alloc_context *ac, unsigned long *did_some_progress,
+	nodemask_t *orig_mask)
 {
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
@@ -3079,7 +3080,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 * we're still under heavy pressure.
 	 */
 	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
-					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+					ALLOC_WMARK_HIGH|ALLOC_CPUSET,
+					ac, orig_mask);
 	if (page)
 		goto out;
 
@@ -3115,14 +3117,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 
 		if (gfp_mask & __GFP_NOFAIL) {
 			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+					ALLOC_NO_WATERMARKS|ALLOC_CPUSET,
+					ac, orig_mask);
 			/*
 			 * fallback to ignore cpuset restriction if our nodes
 			 * are depleted
 			 */
 			if (!page)
 				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
+					ALLOC_NO_WATERMARKS, ac, orig_mask);
 		}
 	}
 out:
@@ -3141,7 +3144,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum compact_priority prio, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result,
+		nodemask_t *orig_mask)
 {
 	struct page *page;
 
@@ -3162,8 +3166,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTSTALL);
 
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
-
+	page = get_page_from_freelist(gfp_mask, order,
+					alloc_flags, ac, orig_mask);
 	if (page) {
 		struct zone *zone = page_zone(page);
 
@@ -3247,7 +3251,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum compact_priority prio, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result,
+		nodemask_t *orig_mask)
 {
 	*compact_result = COMPACT_SKIPPED;
 	return NULL;
@@ -3314,7 +3319,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		unsigned long *did_some_progress)
+		unsigned long *did_some_progress, nodemask_t *orig_mask)
 {
 	struct page *page = NULL;
 	bool drained = false;
@@ -3324,7 +3329,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 
 retry:
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	page = get_page_from_freelist(gfp_mask, order,
+					alloc_flags, ac, orig_mask);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -3517,7 +3523,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
-						struct alloc_context *ac)
+			struct alloc_context *ac, nodemask_t *orig_mask)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
@@ -3581,7 +3587,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * The adjusted alloc_flags might result in immediate success, so try
 	 * that first
 	 */
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	page = get_page_from_freelist(gfp_mask, order,
+					alloc_flags, ac, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3596,7 +3603,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		page = __alloc_pages_direct_compact(gfp_mask, order,
 						alloc_flags, ac,
 						INIT_COMPACT_PRIORITY,
-						&compact_result);
+						&compact_result, orig_mask);
 		if (page)
 			goto got_pg;
 
@@ -3645,7 +3652,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Attempt with potentially adjusted zonelist and alloc_flags */
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	page = get_page_from_freelist(gfp_mask, order,
+					alloc_flags, ac, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3681,13 +3689,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
-							&did_some_progress);
+						&did_some_progress, orig_mask);
 	if (page)
 		goto got_pg;
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
-					compact_priority, &compact_result);
+				compact_priority, &compact_result, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3734,7 +3742,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, ac,
+					&did_some_progress, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3826,7 +3835,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* First allocation attempt */
-	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+	page = get_page_from_freelist(alloc_mask, order,
+					alloc_flags, &ac, nodemask);
 	if (likely(page))
 		goto out;
 
@@ -3845,7 +3855,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(ac.nodemask != nodemask))
 		ac.nodemask = nodemask;
 
-	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+	page = __alloc_pages_slowpath(alloc_mask, order, &ac, nodemask);
 
 out:
 	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
-- 
2.9.3


* [PATCH V3 4/4] mm: Enable Buddy allocation isolation for CDM nodes
  2017-02-15 12:07 [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
                   ` (2 preceding siblings ...)
  2017-02-15 12:07 ` [PATCH V3 3/4] mm: Add new parameter to get_page_from_freelist() function Anshuman Khandual
@ 2017-02-15 12:07 ` Anshuman Khandual
  2017-02-15 18:20 ` [PATCH V3 0/4] Define coherent device memory node Mel Gorman
  4 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-15 12:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

This implements allocation isolation for CDM nodes in the buddy allocator.
CDM memory zones are skipped at all times except when the gfp flags have
__GFP_THISNODE set, or when a non-NULL nodemask contains CDM nodes (an
explicit allocation request in the kernel, or a user process MPOL_BIND
policy based request).

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index afbd24d..40c942c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -64,6 +64,7 @@
 #include <linux/page_owner.h>
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
+#include <linux/node.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -2898,6 +2899,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	bool cpuset_fallback = false;
+
+	if (ac->nodemask != orig_mask)
+		cpuset_fallback = true;
 
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2908,6 +2913,24 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		struct page *page;
 		unsigned long mark;
 
+		/*
+		 * CDM nodes get skipped if the requested gfp flag
+		 * does not have __GFP_THISNODE set or the nodemask
+		 * does not have any CDM nodes in case the nodemask
+		 * is non NULL (explicit allocation requests from
+		 * kernel or user process MPOL_BIND policy which has
+		 * CDM nodes).
+		 */
+		if (is_cdm_node(zone->zone_pgdat->node_id)) {
+			if (!(gfp_mask & __GFP_THISNODE)) {
+				if (cpuset_fallback)
+					continue;
+
+				if (!ac->nodemask)
+					continue;
+			}
+		}
+
 		if (cpusets_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
 			!__cpuset_zone_allowed(zone, gfp_mask))
-- 
2.9.3


* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-15 12:07 [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
                   ` (3 preceding siblings ...)
  2017-02-15 12:07 ` [PATCH V3 4/4] mm: Enable Buddy allocation isolation for CDM nodes Anshuman Khandual
@ 2017-02-15 18:20 ` Mel Gorman
  2017-02-16 22:14   ` Balbir Singh
  2017-02-17 11:41   ` [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
  4 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2017-02-15 18:20 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
> 	This four patches define CDM node with HugeTLB & Buddy allocation
> isolation. Please refer to the last RFC posting mentioned here for more

Always include the background with the changelog itself. Do not assume that
people are willing to trawl through a load of past postings to assemble
the picture. I'm only taking a brief look because of the page allocator
impact but it does not appear that previous feedback was addressed.

In itself, the series does very little and as Vlastimil already pointed
out, it's not a good idea to try merge piecemeal when people could not
agree on the big picture (I didn't dig into it).

The only reason I'm commenting at all is to say that I am extremely opposed
to the changes made to the page allocator paths that are specific to
CDM. It's been continual significant effort to keep the cost there down
and this is a mess of special cases for CDM. The changes to hugetlb to
identify "memory that is not really memory" with special casing is also
quite horrible.

It's completely unclear that even if one was to assume that CDM memory
should be expressed as nodes why such systems do not isolate all processes
from CDM nodes by default and then allow access via memory policies or
cpusets instead of special casing the page allocator fast path. It's also
completely unclear what happens if a device should then access the CDM
and how that should be synchronised with the core, if that is even possible.

It's also unclear if this is even usable by an application in userspace
at this point in time. If it is and the special casing is needed then the
regions should be isolated from early mem allocations in the arch layer
that is CDM aware, initialised late, and then setup userspace to isolate
all but privileged applications from the CDM nodes. Do not litter the core
with is_cdm_whatever checks.

At best this is incomplete because it does not look as if it could be used
by anything properly and the fast path alterations are horrible even if
it could be used. As it is, it should not be merged in my opinion.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-15 18:20 ` [PATCH V3 0/4] Define coherent device memory node Mel Gorman
@ 2017-02-16 22:14   ` Balbir Singh
  2017-02-17  9:33     ` Mel Gorman
  2017-02-17 11:41   ` [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Balbir Singh @ 2017-02-16 22:14 UTC (permalink / raw)
  To: Mel Gorman, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	srikar, haren, jglisse, dave.hansen, dan.j.williams



On 16/02/17 05:20, Mel Gorman wrote:
> On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
>> 	This four patches define CDM node with HugeTLB & Buddy allocation
>> isolation. Please refer to the last RFC posting mentioned here for more
> 
> Always include the background with the changelog itself. Do not assume that
> people are willing to trawl through a load of past postings to assemble
> the picture. I'm only taking a brief look because of the page allocator
> impact but it does not appear that previous feedback was addressed.
> 
> In itself, the series does very little and as Vlastimil already pointed
> out, it's not a good idea to try merge piecemeal when people could not
> agree on the big picture (I didn't dig into it).
> 

The idea of CDM is independent of how some of the other problems related
to AutoNUMA balancing are handled. The idea of this patchset was to
introduce the concept of memory that is not necessarily system memory,
but is coherent in terms of visibility/access, with some restrictions.

> The only reason I'm commenting at all is to say that I am extremely opposed
> to the changes made to the page allocator paths that are specific to
> CDM. It's been continual significant effort to keep the cost there down
> and this is a mess of special cases for CDM. The changes to hugetlb to
> identify "memory that is not really memory" with special casing is also
> quite horrible.
> 
> It's completely unclear that even if one was to assume that CDM memory
> should be expressed as nodes why such systems do not isolate all processes
> from CDM nodes by default and then allow access via memory policies or
> cpusets instead of special casing the page allocator fast path. It's also
> completely unclear what happens if a device should then access the CDM
> and how that should be synchronised with the core, if that is even possible.
> 

A big part of this is driven by the need to special case which allocations
go there. The idea is that an allocation should land there only when
explicitly requested. Unfortunately, IIUC, node distance is not a good
isolation metric. Cpusets are heavily driven by user space, and we
believe that setting up CDM is not an administrative operation; it is
going to be hard for an administrator or a user space application to set
up the right policy, or for an installer to figure it out. It does not
help that cpusets assume inheritance from the root hierarchy. As far as
the overheads go, one could consider using static keys if that is
worthwhile.


> It's also unclear if this is even usable by an application in userspace
> at this point in time. If it is and the special casing is needed then the
> regions should be isolated from early mem allocations in the arch layer
> that is CDM aware, initialised late, and then setup userspace to isolate
> all but privileged applications from the CDM nodes. Do not litter the core
> with is_cdm_whatever checks.
> 

The idea is to have these nodes as ZONE_MOVABLE, so they are isolated
from early mem allocations. Any new feature requires checks, but one
could consider consolidating those checks.

Balbir Singh.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-16 22:14   ` Balbir Singh
@ 2017-02-17  9:33     ` Mel Gorman
  2017-02-21  2:57       ` Balbir Singh
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2017-02-17  9:33 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, vbabka,
	minchan, aneesh.kumar, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On Fri, Feb 17, 2017 at 09:14:44AM +1100, Balbir Singh wrote:
> 
> 
> On 16/02/17 05:20, Mel Gorman wrote:
> > On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
> >> 	This four patches define CDM node with HugeTLB & Buddy allocation
> >> isolation. Please refer to the last RFC posting mentioned here for more
> > 
> > Always include the background with the changelog itself. Do not assume that
> > people are willing to trawl through a load of past postings to assemble
> > the picture. I'm only taking a brief look because of the page allocator
> > impact but it does not appear that previous feedback was addressed.
> > 
> > In itself, the series does very little and as Vlastimil already pointed
> > out, it's not a good idea to try merge piecemeal when people could not
> > agree on the big picture (I didn't dig into it).
> > 
> 
> The idea of CDM is independent of how some of the other problems related
> to AutoNUMA balancing is handled.

What has Automatic NUMA balancing got to do with CDM?

Even if you're trying to draw a comparison between how the patches were
developed in comparison to CDM, it's a poor example. Regardless of which
generation of NUMA balancing implementation is considered (there were three
contenders), each of them was a working implementation that had a measurable
impact on a number of workloads. In many cases, performance data was
included. The instructions on how workloads could use it were clear even
if there were disagreements on exactly what the tuning options should be.
While the feature evolved over time and improved for different classes of
workload, the first set of patches merged were functional.

> The idea of this patchset was to introduce
> the concept of memory that is not necessarily system memory, but is coherent
> in terms of visibility/access with some restrictions
> 

Which should be done without special casing the page allocator or how
cpusets are handled. Such special casing is not necessary for any other
mechanism used to restrict access to portions of memory, such as cpusets,
mempolicies or even memblock reservations.

> > The only reason I'm commenting at all is to say that I am extremely opposed
> > to the changes made to the page allocator paths that are specific to
> > CDM. It's been continual significant effort to keep the cost there down
> > and this is a mess of special cases for CDM. The changes to hugetlb to
> > identify "memory that is not really memory" with special casing is also
> > quite horrible.
> > 
> > It's completely unclear that even if one was to assume that CDM memory
> > should be expressed as nodes why such systems do not isolate all processes
> > from CDM nodes by default and then allow access via memory policies or
> > cpusets instead of special casing the page allocator fast path. It's also
> > completely unclear what happens if a device should then access the CDM
> > and how that should be synchronised with the core, if that is even possible.
> > 
> 
> A big part of this is driven by the need to special case what allocations
> go there. The idea being that an allocation should get there only when
> explicitly requested.

cpuset, mempolicy or mmap of a device file that mediates whether device
or system memory is used. For the last option, I don't know the specifics,
but given that HMM worked on this for years, there should be accounts of
the considerations and complications that arise.

> Unfortunately, IIUC node distance is not a good
> isolation metric.

I don't recall suggesting that.

> CPUsets are heavily driven by user space and we
> believe that setting up CDM is not an administrative operation, its
> going to be hard for an administrator or user space application to set
> up the right policy or an installer to figure it out.

So by this design, an application is expected to know nothing about how
to access CDM yet be CDM-aware? The application is either aware of CDM or
it isn't. It either knows how to access it or it does not.

Even if it were the case that the arch layer provided hooks to alter the
global nodemask and exposed a special file with the CDM nodemask to
userspace, it would still avoid special casing in the various allocators.
It would not address at all the problem of how devices are meant to be
informed that there is CDM memory with work to do, but that has been
raised elsewhere.

> It does not help
> that CPUSets assume inheritance from the root hierarchy. As far as the
> overheads go, one could consider using STATIC_KEYS if that is worthwhile.
> 

Hiding the overhead behind static keys would not change the fact that the
various allocator paths should not need to be CDM-aware or special case
CDM when there already are existing mechanisms for avoiding regions of
memory.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-15 18:20 ` [PATCH V3 0/4] Define coherent device memory node Mel Gorman
  2017-02-16 22:14   ` Balbir Singh
@ 2017-02-17 11:41   ` Anshuman Khandual
  2017-02-17 13:32     ` Mel Gorman
  2017-02-21 11:11     ` Michal Hocko
  1 sibling, 2 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-17 11:41 UTC (permalink / raw)
  To: Mel Gorman, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On 02/15/2017 11:50 PM, Mel Gorman wrote:
> On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
>> 	This four patches define CDM node with HugeTLB & Buddy allocation
>> isolation. Please refer to the last RFC posting mentioned here for more
> 
> Always include the background with the changelog itself. Do not assume that
> people are willing to trawl through a load of past postings to assemble
> the picture. I'm only taking a brief look because of the page allocator
> impact but it does not appear that previous feedback was addressed.

Sure, I made a mistake. I will include the complete background from my
previous RFCs in the next version, which will show the entire context
of this patch series. I have addressed the previous feedback regarding
cpuset enabled allocation leaks into CDM memory, as pointed out by
Vlastimil Babka on the last version. Did I miss anything else inside
the Buddy allocator apart from that?

> 
> In itself, the series does very little and as Vlastimil already pointed
> out, it's not a good idea to try merge piecemeal when people could not
> agree on the big picture (I didn't dig into it).

With the proposed kernel changes and an associated driver, this is
complete enough to drive user space based CPU/device hybrid compute
interchangeably on an mmap() allocated memory buffer, transparently and
effectively. I had also mentioned these points on the last posting in
response to a comment from Vlastimil.

From this response (https://lkml.org/lkml/2017/2/14/50):

* User space using mbind() to get CDM memory is an additional benefit
  we get by making CDM plug in as a node and be part of the buddy
  allocator. But the overall idea from the user space point of view
  is that the application can allocate any generic buffer and try to
  use the buffer either from the CPU side or from the device, without
  knowing where the buffer is really mapped physically. That gives a
  seamless and transparent view to user space where CPU compute and
  possible device based compute can work together. This is not
  possible through a driver allocated buffer.

* The placement of the memory on the buffer can happen in system memory
  when the CPU faults while accessing it. But a driver can manage the
  migration between system RAM and CDM memory once the buffer is being
  used from the CPU and the device interchangeably. As you have
  mentioned, the driver will have more information about where each
  part of the buffer should be placed at any point in time, and it can
  make that happen with migration. So both allocation and placement
  are decided by the driver during runtime. CDM provides the framework
  for this kind of device assisted compute and driver managed memory
  placement.

* If an application has not been using the CDM memory placed on its
  buffer for a long time, while another application was forced to fall
  back on system RAM when what it really wanted was CDM, the driver
  can detect these kinds of situations through memory access patterns
  on the device HW and take the necessary migration decisions.

I hope this explains the rationale of the framework. In fact these
four patches give logically complete CPU/Device operating framework.
Other parts of the bigger picture are VMA management, KSM, Auto NUMA
etc which are improvements on top of this basic framework.

> 
> The only reason I'm commenting at all is to say that I am extremely opposed
> to the changes made to the page allocator paths that are specific to
> CDM. It's been continual significant effort to keep the cost there down
> and this is a mess of special cases for CDM. The changes to hugetlb to
> identify "memory that is not really memory" with special casing is also
> quite horrible.

We have already removed the O(n^2) search during zonelist iteration as
pointed out by Vlastimil, and the current overhead is linear for the
CDM special case. We do similar checks for the cpuset function as well,
so how is this horrible? On HugeTLB, we isolate CDM based on a resultant
(MEMORY - CDM) node_states[] element which identifies system memory
instead of all of the accessible memory, and keep HugeTLB limited to
that nodemask. But if you feel there is any better approach, we can
definitely try it out.

> 
> It's completely unclear that even if one was to assume that CDM memory
> should be expressed as nodes why such systems do not isolate all processes
> from CDM nodes by default and then allow access via memory policies or
> cpusets instead of special casing the page allocator fast path. It's also
> completely unclear what happens if a device should then access the CDM
> and how that should be synchronised with the core, if that is even possible.

I think Balbir has already commented on the cpuset part. The device and
the CPU can work consistently on the commonly allocated buffer, and the
HW takes care of access coherency.

> 
> It's also unclear if this is even usable by an application in userspace
> at this point in time. If it is and the special casing is needed then the

Yeah, with the current CDM approach it's usable from user space, as
explained before.

> regions should be isolated from early mem allocations in the arch layer
> that is CDM aware, initialised late, and then setup userspace to isolate
> all but privileged applications from the CDM nodes. Do not litter the core
> with is_cdm_whatever checks.

I guess you are referring to allocating the entire CDM memory node with
memblock_reserve() and then the arch managing the memory when user space
wants to use it through some sort of mmap/vm_ops methods. That defeats
the whole purpose of integrating CDM memory with the core VM. I am
afraid it will also make migration between CDM memory and system memory
difficult, which is essential in making the whole hybrid compute
operation transparent to user space.

> 
> At best this is incomplete because it does not look as if it could be used
> by anything properly and the fast path alterations are horrible even if
> it could be used. As it is, it should not be merged in my opinion.

I have explained in detail above how this much code change enables us
to use CDM in a transparent way from user space. Please do let me know
if it still does not make sense and I will try again.

On the fast path changes, I can really understand your concern from the
performance point of view, as it was achieved over a long time. It
would be great if you could suggest how to improve from here.

- Anshuman

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-17 11:41   ` [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
@ 2017-02-17 13:32     ` Mel Gorman
  2017-02-21 13:09       ` Anshuman Khandual
  2017-02-21 11:11     ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2017-02-17 13:32 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On Fri, Feb 17, 2017 at 05:11:57PM +0530, Anshuman Khandual wrote:
> On 02/15/2017 11:50 PM, Mel Gorman wrote:
> > On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
> >> 	This four patches define CDM node with HugeTLB & Buddy allocation
> >> isolation. Please refer to the last RFC posting mentioned here for more
> > 
> > Always include the background with the changelog itself. Do not assume that
> > people are willing to trawl through a load of past postings to assemble
> > the picture. I'm only taking a brief look because of the page allocator
> > impact but it does not appear that previous feedback was addressed.
> 
> Sure, I made a mistake. Will include the complete background from my
> previous RFCs in the next version which will show the entire context
> of this patch series. I have addressed the previous feedback regarding
> cpuset enabled allocation leaks into CDM memory as pointed out by
> Vlastimil Babka on the last version. Did I miss anything else inside
> the Buddy allocator apart from that ?
> 

The cpuset fix is crude and unconditionally passes around a mask to
special case CDM requirements which should be controlled by cpusets or
policies if CDM is to be treated as if it's normal memory. Special
casing like this is fragile, and it'll be difficult for modifications to
the allocator to be made with any degree of certainty that CDM is not
impacted. I strongly suspect it would collide with the cpuset rework
that Vlastimil is working on to avoid races between allocation and
cpuset modification.

> > 
> > In itself, the series does very little and as Vlastimil already pointed
> > out, it's not a good idea to try merge piecemeal when people could not
> > agree on the big picture (I didn't dig into it).
> 
> With the proposed kernel changes and a associated driver its complete to
> drive a user space based CPU/Device hybrid compute interchangeably on a
> mmap() allocated memory buffer transparently and effectively.

How is the device informed that data is available for processing?
What prevents an application modifying the data on the device while it's
being processed?
Why can this not be expressed with cpusets and memory policies
controlled by a combination of administrative steps for a privileged
application and an application that is CDM aware?

> I had also
> mentioned these points on the last posting in response to a comment from
> Vlastimil.
> 
> From this response (https://lkml.org/lkml/2017/2/14/50).
> 
> * User space using mbind() to get CDM memory is an additional benefit
>   we get by making the CDM plug in as a node and be part of the buddy
>   allocator. But the over all idea from the user space point of view
>   is that the application can allocate any generic buffer and try to
>   use the buffer either from the CPU side or from the device without
>   knowing about where the buffer is really mapped physically. That
>   gives a seamless and transparent view to the user space where CPU
>   compute and possible device based compute can work together. This
>   is not possible through a driver allocated buffer.
> 

Which can also be done with a cpuset that prevents use of CDM memory:
place all non-CDM processes into that cpuset, with a separate cpuset for
CDM-aware applications that allows access to CDM memory.

> * The placement of the memory on the buffer can happen on system memory
>   when the CPU faults while accessing it. But a driver can manage the
>   migration between system RAM and CDM memory once the buffer is being
>   used from CPU and the device interchangeably.

While I'm not familiar with the details because I'm not generally involved
in hardware enablement, why was HMM not suitable? I know HMM had its own
problems with merging, but as it also managed migrations between RAM and
device memory, how did it not meet your requirements? If there were parts
of HMM missing, why were they not finished?

I know HMM had a history of problems getting merged but part of that was a
chicken and egg problem where it was a lot of infrastructure to maintain
with no in-kernel users. If CDM is a potential user then CDM could be
built on top and ask for a merge of both the core infrastructure required
and the drivers at the same time.

It's not an easy path but the difficulties there do not justify special
casing CDM in the core allocator.


>   As you have mentioned
>   driver will have more information about where which part of the buffer
>   should be placed at any point of time and it can make it happen with
>   migration. So both allocation and placement are decided by the driver
>   during runtime. CDM provides the framework for this can kind device
>   assisted compute and driver managed memory placements.
> 

Which sounds like what HMM needed, and the problems of co-ordinating
whether data within a VMA is located in system RAM or device memory, and
what that means, are not addressed by the series.

Even if HMM is unsuitable, it should be clearly explained why.

> * If any application is not using CDM memory for along time placed on
>   its buffer and another application is forced to fallback on system
>   RAM when it really wanted is CDM, the driver can detect these kind
>   of situations through memory access patterns on the device HW and
>   take necessary migration decisions.
> 
> I hope this explains the rationale of the framework. In fact these
> four patches give logically complete CPU/Device operating framework.
> Other parts of the bigger picture are VMA management, KSM, Auto NUMA
> etc which are improvements on top of this basic framework.
> 

Automatic NUMA balancing is a particular oddity as that is about
CPU->RAM locality and not RAM->device considerations.

But all that aside, it still does not explain why it's necessary to
special case CDM in the allocator paths that can be controlled with
existing mechanisms.

> > 
> > The only reason I'm commenting at all is to say that I am extremely opposed
> > to the changes made to the page allocator paths that are specific to
> > CDM. It's been continual significant effort to keep the cost there down
> > and this is a mess of special cases for CDM. The changes to hugetlb to
> > identify "memory that is not really memory" with special casing is also
> > quite horrible.
> 
> We have already removed the O (n^2) search during zonelist iteration as
> pointed out by Vlastimil and the current overhead is linear for the CDM
> special case. We do similar checks for the cpuset function as well. Then
> how is this horrible ?

Because there are existing mechanisms for avoiding nodes that are not
device specific.

> On HugeTLB, we isolate CDM based on a resultant
> (MEMORY - CDM) node_states[] element which identifies system memory
> instead of all of the accessible memory and keep the HugeTLB limited to
> that nodemask. But if you feel there is any other better approach, we
> can definitely try out.
> 

Use cpusets, with non-CDM applications in a cpuset that does not include
the CDM nodes.

> > It's also unclear if this is even usable by an application in userspace
> > at this point in time. If it is and the special casing is needed then the
> 
> Yeah with the current CDM approach its usable from user space as
> explained before.
> 

Minus the parts where the application notifies the driver to do some work
or mediates between what is accessing what memory and when.

> > regions should be isolated from early mem allocations in the arch layer
> > that is CDM aware, initialised late, and then setup userspace to isolate
> > all but privileged applications from the CDM nodes. Do not litter the core
> > with is_cdm_whatever checks.
> 
> I guess your are referring to allocating the entire CDM memory node with
> memblock_reserve() and then arch managing the memory when user space
> wants to use it through some sort of mmap, vm_ops methods. That defeats
> the whole purpose of integrating CDM memory with core VM. I am afraid it
> will also make migration between CDM memory and system memory difficult
> which is essential in making the whole hybrid compute operation
> transparent from  the user space.
> 

The memblock reservation is only to avoid bootmem allocations from that
area. It can be managed in the arch layer: first pass in all the system
RAM, tear down the bootmem allocator, set up the nodelists, set the
system nodemask, init CDM, init the allocator for that, and then
optionally expose the CDM nodemask for userspace to do the isolation.

For that matter, the driver could do the discovery and then fake a
memory hot-add.

It would be tough to do this but it would confine the logic to the arch
and driver that cares instead of special casing the allocators.

> > At best this is incomplete because it does not look as if it could be used
> > by anything properly and the fast path alterations are horrible even if
> > it could be used. As it is, it should not be merged in my opinion.
> 
> I have mentioned in detail above how this much of code change enables
> us to use the CDM in a transparent way from the user space. Please do
> let me know if it still does not make sense, will try again.
> 
> On the fast path changes issue, I can really understand your concern
> from the performance point of view

And a maintenance overhead, because changing any part where CDM is
special cased will be impossible for many to test and verify.

> as its achieved over a long time.
> It would be great if you can suggest on how to improve from here.
> 

I do not have the bandwidth to get involved in how hardware enablement
for this feature should be done on power. The objection is that however
it is handled, it should not need special casing in the allocator, which
already has mechanisms for limiting what memory is used.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/4] mm: Define coherent device memory (CDM) node
  2017-02-15 12:07 ` [PATCH V3 1/4] mm: Define coherent device memory (CDM) node Anshuman Khandual
@ 2017-02-17 14:05   ` Bob Liu
  2017-02-21 10:20     ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-02-17 14:05 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Linux-Kernel, Linux-MM, mhocko, Vlastimil Babka, mgorman,
	Minchan Kim, aneesh.kumar, Balbir Singh, srikar, haren, jglisse,
	Dave Hansen, Dan Williams

Hi Anshuman,

I have a few questions about coherent device memory.

On Wed, Feb 15, 2017 at 8:07 PM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> There are certain devices like specialized accelerator, GPU cards, network
> cards, FPGA cards etc which might contain onboard memory which is coherent
> along with the existing system RAM while being accessed either from the CPU
> or from the device. They share some similar properties with that of normal

What's the general size of this kind of memory?

> system RAM but at the same time can also be different with respect to
> system RAM.
>
> User applications might be interested in using this kind of coherent device

What kind of applications?

> memory explicitly or implicitly along side the system RAM utilizing all
> possible core memory functions like anon mapping (LRU), file mapping (LRU),
> page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations

I don't see the benefit of managing the onboard memory the same way as
system RAM. Why not just map this kind of onboard memory to userspace
directly, so that only those specific applications can manage/access/use
it?

It does not sound like a good idea to complicate the core memory
framework a lot because of some not widely used devices and uncertain
applications.

-
Regards,
Bob

> etc. To achieve this kind of tight integration with the core memory
> subsystem, the device onboard coherent memory must be represented as a
> memory only NUMA node. At the same time the arch must export some kind
> of function to identify this node as coherent device memory and not any
> other regular cpu-less memory-only NUMA node.
>
> After achieving the integration with core memory subsystem coherent device
> memory might still need some special consideration inside the kernel. There
> can be a variety of coherent memory nodes with different expectations from
> the core kernel memory. But right now only one kind of special treatment is
> considered which requires certain isolation.
>
> Now consider the case of a coherent device memory node type which requires
> isolation. This kind of coherent memory is onboard an external device
> attached to the system through a link where there is always a chance of a
> link failure taking down the entire memory node with it. More over the
> memory might also have higher chance of ECC failure as compared to the
> system RAM. Hence allocation into this kind of coherent memory node should
> be regulated. Kernel allocations must not come here. Normal user space
> allocations too should not come here implicitly (without user application
> knowing about it). This summarizes isolation requirement of certain kind of
> coherent device memory node as an example. There can be different kinds of
> isolation requirement also.
>
> Some coherent memory devices might not require isolation altogether after
> all. Then there might be other coherent memory devices which might require
> some other special treatment after being part of the core memory
> representation. For now, we will look into isolation seeking coherent
> device memory nodes, not the other ones.
>
> To implement the integration as well as isolation, the coherent memory node
> must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
> the node_states[] array. During memory hotplug operations, the new nodemask
> N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
> memory nodes. This also creates the following new sysfs based interface to
> list down all the coherent memory nodes of the system.
>
>         /sys/devices/system/node/is_cdm_node
>
> Architectures must export function arch_check_node_cdm() which identifies
> any coherent device memory node in case they enable CONFIG_COHERENT_DEVICE.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  Documentation/ABI/stable/sysfs-devices-node |  7 ++++
>  arch/powerpc/Kconfig                        |  1 +
>  arch/powerpc/mm/numa.c                      |  7 ++++
>  drivers/base/node.c                         |  6 +++
>  include/linux/nodemask.h                    | 58 ++++++++++++++++++++++++++++-
>  mm/Kconfig                                  |  4 ++
>  mm/memory_hotplug.c                         |  3 ++
>  mm/page_alloc.c                             |  8 +++-
>  8 files changed, 91 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 5b2d0f0..5df18f7 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -29,6 +29,13 @@ Description:
>                 Nodes that have regular or high memory.
>                 Depends on CONFIG_HIGHMEM.
>
> +What:          /sys/devices/system/node/is_cdm_node
> +Date:          January 2017
> +Contact:       Linux Memory Management list <linux-mm@kvack.org>
> +Description:
> +               Lists the nodemask of nodes that have coherent device memory.
> +               Depends on CONFIG_COHERENT_DEVICE.
> +
>  What:          /sys/devices/system/node/nodeX
>  Date:          October 2002
>  Contact:       Linux Memory Management list <linux-mm@kvack.org>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 281f4f1..1cff239 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -164,6 +164,7 @@ config PPC
>         select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
>         select HAVE_ARCH_HARDENED_USERCOPY
>         select HAVE_KERNEL_GZIP
> +       select COHERENT_DEVICE if PPC_BOOK3S_64 && NEED_MULTIPLE_NODES
>
>  config GENERIC_CSUM
>         def_bool CPU_LITTLE_ENDIAN
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b1099cb..14f0b98 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -41,6 +41,13 @@
>  #include <asm/setup.h>
>  #include <asm/vdso.h>
>
> +#ifdef CONFIG_COHERENT_DEVICE
> +inline int arch_check_node_cdm(int nid)
> +{
> +       return 0;
> +}
> +#endif
> +
>  static int numa_enabled = 1;
>
>  static char *cmdline __initdata;
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 5548f96..f39e615 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -661,6 +661,9 @@ static struct node_attr node_state_attr[] = {
>         [N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
>  #endif
>         [N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
> +#ifdef CONFIG_COHERENT_DEVICE
> +       [N_COHERENT_DEVICE] = _NODE_ATTR(is_cdm_node, N_COHERENT_DEVICE),
> +#endif
>  };
>
>  static struct attribute *node_state_attrs[] = {
> @@ -674,6 +677,9 @@ static struct attribute *node_state_attrs[] = {
>         &node_state_attr[N_MEMORY].attr.attr,
>  #endif
>         &node_state_attr[N_CPU].attr.attr,
> +#ifdef CONFIG_COHERENT_DEVICE
> +       &node_state_attr[N_COHERENT_DEVICE].attr.attr,
> +#endif
>         NULL
>  };
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index f746e44..175c2d6 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -388,11 +388,14 @@ enum node_states {
>         N_HIGH_MEMORY = N_NORMAL_MEMORY,
>  #endif
>  #ifdef CONFIG_MOVABLE_NODE
> -       N_MEMORY,               /* The node has memory(regular, high, movable) */
> +       N_MEMORY,       /* The node has memory(regular, high, movable, cdm) */
>  #else
>         N_MEMORY = N_HIGH_MEMORY,
>  #endif
>         N_CPU,          /* The node has one or more cpus */
> +#ifdef CONFIG_COHERENT_DEVICE
> +       N_COHERENT_DEVICE,      /* The node has CDM memory */
> +#endif
>         NR_NODE_STATES
>  };
>
> @@ -496,6 +499,59 @@ static inline int node_random(const nodemask_t *mask)
>  }
>  #endif
>
> +#ifdef CONFIG_COHERENT_DEVICE
> +extern int arch_check_node_cdm(int nid);
> +
> +static inline nodemask_t system_mem_nodemask(void)
> +{
> +       nodemask_t system_mem;
> +
> +       nodes_clear(system_mem);
> +       nodes_andnot(system_mem, node_states[N_MEMORY],
> +                       node_states[N_COHERENT_DEVICE]);
> +       return system_mem;
> +}
> +
> +static inline bool is_cdm_node(int node)
> +{
> +       return node_isset(node, node_states[N_COHERENT_DEVICE]);
> +}
> +
> +static inline void node_set_state_cdm(int node)
> +{
> +       if (arch_check_node_cdm(node))
> +               node_set_state(node, N_COHERENT_DEVICE);
> +}
> +
> +static inline void node_clear_state_cdm(int node)
> +{
> +       if (arch_check_node_cdm(node))
> +               node_clear_state(node, N_COHERENT_DEVICE);
> +}
> +
> +#else
> +
> +static inline int arch_check_node_cdm(int nid) { return 0; }
> +
> +static inline nodemask_t system_mem_nodemask(void)
> +{
> +       return node_states[N_MEMORY];
> +}
> +
> +static inline bool is_cdm_node(int node)
> +{
> +       return false;
> +}
> +
> +static inline void node_set_state_cdm(int node)
> +{
> +}
> +
> +static inline void node_clear_state_cdm(int node)
> +{
> +}
> +#endif /* CONFIG_COHERENT_DEVICE */
> +
>  #define node_online_map        node_states[N_ONLINE]
>  #define node_possible_map      node_states[N_POSSIBLE]
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 9b8fccb..6263a65 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -143,6 +143,10 @@ config HAVE_GENERIC_RCU_GUP
>  config ARCH_DISCARD_MEMBLOCK
>         bool
>
> +config COHERENT_DEVICE
> +       bool
> +       default n
> +
>  config NO_BOOTMEM
>         bool
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b8c11e0..6bce093 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1030,6 +1030,7 @@ static void node_states_set_node(int node, struct memory_notify *arg)
>         if (arg->status_change_nid_high >= 0)
>                 node_set_state(node, N_HIGH_MEMORY);
>
> +       node_set_state_cdm(node);
>         node_set_state(node, N_MEMORY);
>  }
>
> @@ -1843,6 +1844,8 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>         if ((N_MEMORY != N_HIGH_MEMORY) &&
>             (arg->status_change_nid >= 0))
>                 node_clear_state(node, N_MEMORY);
> +
> +       node_clear_state_cdm(node);
>  }
>
>  static int __ref __offline_pages(unsigned long start_pfn,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f3e0c69..84d61bb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6080,8 +6080,10 @@ static unsigned long __init early_calculate_totalpages(void)
>                 unsigned long pages = end_pfn - start_pfn;
>
>                 totalpages += pages;
> -               if (pages)
> +               if (pages) {
> +                       node_set_state_cdm(nid);
>                         node_set_state(nid, N_MEMORY);
> +               }
>         }
>         return totalpages;
>  }
> @@ -6392,8 +6394,10 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>                                 find_min_pfn_for_node(nid), NULL);
>
>                 /* Any memory on that node */
> -               if (pgdat->node_present_pages)
> +               if (pgdat->node_present_pages) {
> +                       node_set_state_cdm(nid);
>                         node_set_state(nid, N_MEMORY);
> +               }
>                 check_for_memory(pgdat, nid);
>         }
>  }
> --
> 2.9.3

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-17  9:33     ` Mel Gorman
@ 2017-02-21  2:57       ` Balbir Singh
  2017-03-01  2:42         ` Balbir Singh
  0 siblings, 1 reply; 43+ messages in thread
From: Balbir Singh @ 2017-02-21  2:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, vbabka,
	minchan, aneesh.kumar, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On Fri, 2017-02-17 at 09:33 +0000, Mel Gorman wrote:
> On Fri, Feb 17, 2017 at 09:14:44AM +1100, Balbir Singh wrote:
> > 
> > 
> > On 16/02/17 05:20, Mel Gorman wrote:
> > > On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
> > > > 	This four patches define CDM node with HugeTLB & Buddy allocation
> > > > isolation. Please refer to the last RFC posting mentioned here for more
> > > 
> > > Always include the background with the changelog itself. Do not assume that
> > > people are willing to trawl through a load of past postings to assemble
> > > the picture. I'm only taking a brief look because of the page allocator
> > > impact but it does not appear that previous feedback was addressed.
> > > 
> > > In itself, the series does very little and as Vlastimil already pointed
> > > out, it's not a good idea to try merge piecemeal when people could not
> > > agree on the big picture (I didn't dig into it).
> > > 
> > 
> > The idea of CDM is independent of how some of the other problems related
> > to AutoNUMA balancing is handled.
> 
> What has Automatic NUMA balancing got to do with CDM?
> 

The idea is to have a policy to determine (based on the RFC discussion) whether
CDM nodes should participate in NUMA balancing.

> Even if you're trying to draw a comparison between how the patches were
> developed in comparison to CDM, it's a poor example. Regardless of which
> generation of NUMA balancing implementation considered (there were three
> contenders), each of them was a working implementation that had a measurable
> impact on a number of workloads. In many cases, performance data was
> included. The instructions on how workloads could use it were clear even
> if there were disagreements on exactly what the tuning options should be.
> While the feature evolved over time and improved for different classes of
> workload, the first set of patches merged were functional.
> 
> > The idea of this patchset was to introduce
> > the concept of memory that is not necessarily system memory, but is coherent
> > in terms of visibility/access with some restrictions
> > 
> 
> Which should be done without special casing the page allocator, cpusets and
> special casing how cpusets are handled. It's not necessary for any other
> mechanism used to restrict access to portions of memory such as cpusets,
> mempolicies or even memblock reservations.

Agreed, I mentioned a limitation that we see with cpusets. I do agree that
we should reuse any infrastructure we have, but cpusets are more static
in nature and inheritance than what CDM requires.

> 
> > > The only reason I'm commenting at all is to say that I am extremely opposed
> > > to the changes made to the page allocator paths that are specific to
> > > CDM. It's been continual significant effort to keep the cost there down
> > > and this is a mess of special cases for CDM. The changes to hugetlb to
> > > identify "memory that is not really memory" with special casing is also
> > > quite horrible.
> > > 
> > > It's completely unclear that even if one was to assume that CDM memory
> > > should be expressed as nodes why such systems do not isolate all processes
> > > from CDM nodes by default and then allow access via memory policies or
> > > cpusets instead of special casing the page allocator fast path. It's also
> > > completely unclear what happens if a device should then access the CDM
> > > and how that should be synchronised with the core, if that is even possible.
> > > 
> > 
> > A big part of this is driven by the need to special case what allocations
> > go there. The idea being that an allocation should get there only when
> > explicitly requested.
> 
> cpuset, mempolicy or mmap of a device file that mediates whether device
> or system memory is used. For the last option, I don't know the specifics
> but given that HMM worked on this for years, there should be examples of
> the considerations and complications that arise. I'm not familiar with
> the specifics.
> 
> > Unfortunately, IIUC node distance is not a good
> > isolation metric.
> 
> I don't recall suggesting that.

True, I am just saying :)

> 
> > CPUsets are heavily driven by user space and we
> > believe that setting up CDM is not an administrative operation, its
> > going to be hard for an administrator or user space application to set
> > up the right policy or an installer to figure it out.
> 
> So by this design, an application is expected to know nothing about how
> to access CDM yet be CDM-aware? 

A higher layer abstracts what/where the memory is. The memory is coherent
(CDM), but for performance it may be migrated. In some special cases
an aware application may request explicit allocation; in other cases an
unaware application may use it seamlessly and have its memory migrated.

> The application is either aware of CDM or
> it isn't. It either knows how to access it or it does not.
> 
> Even if it was a case that the arch layer provides hooks to alter the global
> nodemask and expose a special file of the CDM nodemask to userspace then
> it would still avoid special casing in the various allocators. It would
> not address the problem at all of how devices are meant to be informed
> that there is CDM memory with work to do but that has been raised elsewhere.
> 
> > It does not help
> > that CPUSets assume inheritance from the root hierarchy. As far as the
> > overheads go, one could consider using STATIC_KEYS if that is worthwhile.
> > 
> 
> Hiding the overhead in static keys could not change the fact that the various
> allocator paths should not need to be CDM-aware or special casing CDM when
> there already are existing mechanisms for avoiding regions of memory.
>


We don't want to hide things, but to make it zero-overhead for non-users.
There might be better ways of doing it. Thanks for the review!

Balbir Singh. 


* Re: [PATCH V3 1/4] mm: Define coherent device memory (CDM) node
  2017-02-17 14:05   ` Bob Liu
@ 2017-02-21 10:20     ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-21 10:20 UTC (permalink / raw)
  To: Bob Liu, Anshuman Khandual
  Cc: Linux-Kernel, Linux-MM, mhocko, Vlastimil Babka, mgorman,
	Minchan Kim, aneesh.kumar, Balbir Singh, srikar, haren, jglisse,
	Dave Hansen, Dan Williams

On 02/17/2017 07:35 PM, Bob Liu wrote:
> Hi Anshuman,
> 
> I have a few questions about coherent device memory.

Sure.

> 
> On Wed, Feb 15, 2017 at 8:07 PM, Anshuman Khandual
> <khandual@linux.vnet.ibm.com> wrote:
>> There are certain devices like specialized accelerator, GPU cards, network
>> cards, FPGA cards etc which might contain onboard memory which is coherent
>> along with the existing system RAM while being accessed either from the CPU
>> or from the device. They share some similar properties with that of normal
> 
> What's the general size of this kind of memory?

It's more comparable to available system RAM sizes, and not as large as
persistent storage memory or NVDIMM.

> 
>> system RAM but at the same time can also be different with respect to
>> system RAM.
>>
>> User applications might be interested in using this kind of coherent device
> 
> What kind of applications?

Applications which want to use CPU compute as well as device compute on the
same allocated buffer transparently. For example, an application may want
to load the problem statement into the allocated buffer and ask the device,
through the driver, to compute results out of the problem statement.


> 
>> memory explicitly or implicitly along side the system RAM utilizing all
>> possible core memory functions like anon mapping (LRU), file mapping (LRU),
>> page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
> 
> I didn't see the benefit of managing the onboard memory the same way as system RAM.
> Why not just map this kind of onboard memory to userspace directly?
> And only those specific applications can manage/access/use it.

Integration with core MM along with driver assisted migrations gives the
application the ability to use the allocated buffer seamlessly from the
CPU or the device without bothering about the actual physical placement of
the pages. That enables a hybrid CPU and device based compute framework
which cannot be achieved by mapping the device memory directly into user
space.

> 
> It sounds not very good to complicate the core memory framework a lot
> because of some not widely used devices and uncertain applications.

Applications are not uncertain; they intend to use this framework to
achieve hybrid CPU/device compute working transparently on the same
allocated virtual buffer. IIUC we would want the Linux kernel to enable
new device technologies regardless of whether they are widely used or
not.


* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-17 11:41   ` [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
  2017-02-17 13:32     ` Mel Gorman
@ 2017-02-21 11:11     ` Michal Hocko
  2017-02-21 13:39       ` Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-02-21 11:11 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
[...]
> * User space using mbind() to get CDM memory is an additional benefit
>   we get by making the CDM plug in as a node and be part of the buddy
>   allocator. But the over all idea from the user space point of view
>   is that the application can allocate any generic buffer and try to
>   use the buffer either from the CPU side or from the device without
>   knowing about where the buffer is really mapped physically. That
>   gives a seamless and transparent view to the user space where CPU
>   compute and possible device based compute can work together. This
>   is not possible through a driver allocated buffer.

But how are you going to define any policy around that. Who is allowed
to allocate and how much of this "special memory". Is it possible that
we will eventually need some access control mechanism? If yes then mbind
is really not suitable interface to (ab)use. Also what should happen if
the mbind mentions only CDM memory and that is depleted?

Could you also explain why the transparent view is really better than
using a device specific mmap (aka CDM awareness)?
 
> * The placement of the memory on the buffer can happen on system memory
>   when the CPU faults while accessing it. But a driver can manage the
>   migration between system RAM and CDM memory once the buffer is being
>   used from CPU and the device interchangeably. As you have mentioned
>   driver will have more information about where which part of the buffer
>   should be placed at any point of time and it can make it happen with
>   migration. So both allocation and placement are decided by the driver
> >   during runtime. CDM provides the framework for this kind of device
> >   assisted compute and driver managed memory placements.
> 
> * If any application is not using the CDM memory placed on its buffer
>   for a long time, and another application that really wanted CDM is
>   forced to fall back on system RAM, the driver can detect these kinds
>   of situations through memory access patterns on the device HW and
>   take necessary migration decisions.

Is this implemented or at least designed?

Btw. I believe that sending new versions of the patchset with minor
changes is not really helping the review process. I believe the
highlevel concerns about the API are not resolved yet and that is the
number 1 thing to deal with currently.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-17 13:32     ` Mel Gorman
@ 2017-02-21 13:09       ` Anshuman Khandual
  2017-02-21 20:14         ` Jerome Glisse
                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-21 13:09 UTC (permalink / raw)
  To: Mel Gorman, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On 02/17/2017 07:02 PM, Mel Gorman wrote:
> On Fri, Feb 17, 2017 at 05:11:57PM +0530, Anshuman Khandual wrote:
>> On 02/15/2017 11:50 PM, Mel Gorman wrote:
>>> On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
>>>> 	This four patches define CDM node with HugeTLB & Buddy allocation
>>>> isolation. Please refer to the last RFC posting mentioned here for more
>>>
>>> Always include the background with the changelog itself. Do not assume that
>>> people are willing to trawl through a load of past postings to assemble
>>> the picture. I'm only taking a brief look because of the page allocator
>>> impact but it does not appear that previous feedback was addressed.
>>
>> Sure, I made a mistake. Will include the complete background from my
>> previous RFCs in the next version which will show the entire context
>> of this patch series. I have addressed the previous feedback regarding
>> cpuset enabled allocation leaks into CDM memory as pointed out by
>> Vlastimil Babka on the last version. Did I miss anything else inside
>> the Buddy allocator apart from that ?
>>
> 
> The cpuset fix is crude and uncondtionally passes around a mask to
> special case CDM requirements which should be controlled by cpusets or
> policies if CDM is to be treated as if it's normal memory. Special
> casing like this is fragile and it'll be difficult for modifications to
> the allocator to be made with any degree of certainity that CDM is not
> impacted. I strongly suspect it would collide with the cpuset rework
> that Vlastimil is working on to avoid races between allocation and
> cpuset modification.

I understand your point. Cpusets should be improved to accommodate
CDM nodes through policy constructs, not through special casing. I will
look into it again and work with Vlastimil to see if the new rework can
accommodate these.

> 
>>>
>>> In itself, the series does very little and as Vlastimil already pointed
>>> out, it's not a good idea to try merge piecemeal when people could not
>>> agree on the big picture (I didn't dig into it).
>>
>> With the proposed kernel changes and a associated driver its complete to
>> drive a user space based CPU/Device hybrid compute interchangeably on a
>> mmap() allocated memory buffer transparently and effectively.
> 
> How is the device informed at that data is available for processing?

It will be through a call to the driver from user space which can take
the required buffer address as an argument.

> What prevents and application modifying the data on the device while it's
> being processed?

Nothing in software. The application should take care of that, but accesses
from both sides are coherent. It should wait for the device till it
finishes the compute it had asked for earlier, to prevent overwrites and
eventual corruption.

> Why can this not be expressed with cpusets and memory policies
> controlled by a combination of administrative steps for a privileged
> application and an application that is CDM aware?

Hmm, that can be done but having an in kernel infrastructure has the
following benefits.

* The administrator does not have to listen for node add notifications
  and keep the isolation/allowed cpusets up to date all the time.
  This can be a significant overhead on the admin/userspace on systems
  which have a number of separate device memory nodes.

* With the cpuset solution, tasks which are part of a CDM allowed cpuset
  can have all their VMAs allocate from CDM memory, which may not be
  something the user wants. For example, the user may not want the
  text segments and libraries to allocate from CDM. To achieve this
  the user will have to explicitly block allocation access from CDM
  through mbind(MPOL_BIND) memory policy setups. This negative setup
  is a big overhead. But with the in kernel CDM framework, isolation is
  enabled by default. For CDM allocations the application just has
  to set up a memory policy with the CDM node in the allowed nodemask.

Even with the cpuset solution, applications still need to know which nodes
are CDM on the system at a given point of time. So we will have to store
that in a nodemask and export it on sysfs somehow.

> 
>> I had also
>> mentioned these points on the last posting in response to a comment from
>> Vlastimil.
>>
>> From this response (https://lkml.org/lkml/2017/2/14/50).
>>
>> * User space using mbind() to get CDM memory is an additional benefit
>>   we get by making the CDM plug in as a node and be part of the buddy
>>   allocator. But the over all idea from the user space point of view
>>   is that the application can allocate any generic buffer and try to
>>   use the buffer either from the CPU side or from the device without
>>   knowing about where the buffer is really mapped physically. That
>>   gives a seamless and transparent view to the user space where CPU
>>   compute and possible device based compute can work together. This
>>   is not possible through a driver allocated buffer.
>>
> 
> Which can also be done with cpusets that prevents use of CDM memory and
> place all non-CDM processes into that cpuset with a separate cpuset for
> CDM-aware applications that allow access to CDM memory.

Right, but with additional overheads as explained above.

> 
>> * The placement of the memory on the buffer can happen on system memory
>>   when the CPU faults while accessing it. But a driver can manage the
>>   migration between system RAM and CDM memory once the buffer is being
>>   used from CPU and the device interchangeably.
> 
> While I'm not familiar with the details because I'm not generally involved
> in hardware enablement, why was HMM not suitable? I know HMM had it's own
> problems with merging but as it also managed migrations between RAM and
> device memory, how did it not meet your requirements? If there were parts
> of HMM missing, why was that not finished?


These are the reasons which prohibit the use of HMM for the coherent
addressable device memory use case.

(1) IIUC HMM currently supports only a subset of anon mappings in the
user space. It does not support shared anon mappings or any sort of file
mapping for that matter. We need support for all mappings in the user space
for the CPU/device compute to be effective and transparent. As HMM depends
on ZONE DEVICE for device memory representation, there are some unique
challenges in making it work for file mapping (and page cache) during
migrations between system RAM and device memory.

(2) ZONE_DEVICE has been modified to support un-addressable memory apart
from addressable persistent memory which is not movable. It still would
have to support coherent device memory which will be movable.

(3) Application cannot directly allocate into device memory from user
space using existing memory related system calls like mmap() and mbind()
as the device memory hides away in ZONE_DEVICE.

Apart from that, the CDM framework provides a different approach to device
memory representation which does not require special device memory kind
of handling and the associated callbacks as implemented by HMM. It provides
NUMA node based visibility to the user space which can be extended to
support new features.

>
> I know HMM had a history of problems getting merged but part of that was a
> chicken and egg problem where it was a lot of infrastructure to maintain
> with no in-kernel users. If CDM is a potential user then CDM could be

CDM is not a user of HMM there; HMM would need to change (given the above
challenges) to accommodate coherent device memory, which it does not
support at this moment.

> built on top and ask for a merge of both the core infrastructure required
> and the drivers at the same time.

I am afraid the drivers would be HW vendor specific.

> 
> It's not an easy path but the difficulties there do not justify special
> casing CDM in the core allocator.

Hmm. Even if HMM supported all sorts of mappings in user space and the
related migrations, we still would not have direct allocations from user
space with the mmap() and mbind() system calls.

> 
> 
>>   As you have mentioned
>>   driver will have more information about where which part of the buffer
>>   should be placed at any point of time and it can make it happen with
>>   migration. So both allocation and placement are decided by the driver
>>   during runtime. CDM provides the framework for this can kind device
>>   assisted compute and driver managed memory placements.
>>
> 
> Which sounds like what HMM needed and the problems of co-ordinating whether
> data within a VMA is located on system RAM or device memory and what that
> means is not addressed by the series.

Did not get that. What is not addressed by this series? How are the
requirements of HMM and the CDM framework different?

> 
> Even if HMM is unsuitable, it should be clearly explained why

I just did explain in the previous paragraphs above.

> 
>> * If any application is not using the CDM memory placed on its buffer
>>   for a long time, and another application that really wanted CDM is
>>   forced to fall back on system RAM, the driver can detect these kinds
>>   of situations through memory access patterns on the device HW and
>>   take necessary migration decisions.
>>
>> I hope this explains the rationale of the framework. In fact these
>> four patches give logically complete CPU/Device operating framework.
>> Other parts of the bigger picture are VMA management, KSM, Auto NUMA
>> etc which are improvements on top of this basic framework.
>>
> 
> Automatic NUMA balancing is a particular oddity as that is about
> CPU->RAM locality and not RAM->device considerations.

Right. But when there are migrations happening between system RAM and
device memory, automatic NUMA balancing, with its CPU fault information,
can migrate pages between system RAM nodes, which might not be necessary
and can lead to conflicts or overhead. Hence automatic NUMA balancing
needs to be switched off at times for the VMAs of concern, but that is
not addressed in this patch series. As mentioned before, it will be in
the follow-up work as improvements on this series.

> 
> But all that aside, it still does not explain why it's necessary to
> special case CDM in the allocator paths that can be controlled with
> existing mechanisms.

Okay

> 
>>>
>>> The only reason I'm commenting at all is to say that I am extremely opposed
>>> to the changes made to the page allocator paths that are specific to
>>> CDM. It's been continual significant effort to keep the cost there down
>>> and this is a mess of special cases for CDM. The changes to hugetlb to
>>> identify "memory that is not really memory" with special casing is also
>>> quite horrible.
>>
We have already removed the O(n^2) search during zonelist iteration as
pointed out by Vlastimil, and the current overhead is linear for the CDM
special case. We do similar checks for the cpuset function as well. Then
how is this horrible?
> 
> Because there are existing mechanisms for avoiding nodes that are not
> device specific.

So the changes are not horrible but they might be called redundant.

> 
>> On HugeTLB, we isolate CDM based on a resultant
>> (MEMORY - CDM) node_states[] element which identifies system memory
>> instead of all of the accessible memory and keep the HugeTLB limited to
>> that nodemask. But if you feel there is any other better approach, we
>> can definitely try out.
>>
> 
> cpusets with non-CDM application in a cpuset that does not include CDM
> nodes.

Okay, if cpuset turns out to be a viable option for CDM implementation.

> 
>>> It's also unclear if this is even usable by an application in userspace
>>> at this point in time. If it is and the special casing is needed then the
>>
Yeah, with the current CDM approach it's usable from user space as
explained before.
>>
> 
> Minus the parts where the application notifies the driver to do some work
> or mediate between what is accessing what memory and when.

As mentioned before, the application invokes the driver to start a device
compute on a designated buffer, waits for completion till the driver returns
the callback to the application, and then accesses the buffer from the CPU.
Yes, the application needs to work with the driver to guarantee buffer data
integrity and prevent corruption.

> 
>>> regions should be isolated from early mem allocations in the arch layer
>>> that is CDM aware, initialised late, and then setup userspace to isolate
>>> all but privileged applications from the CDM nodes. Do not litter the core
>>> with is_cdm_whatever checks.
>>
I guess you are referring to allocating the entire CDM memory node with
memblock_reserve() and then the arch managing the memory when user space
wants to use it through some sort of mmap, vm_ops methods. That defeats
the whole purpose of integrating CDM memory with the core VM. I am afraid it
will also make migration between CDM memory and system memory difficult,
which is essential in making the whole hybrid compute operation
transparent from the user space.
>>
> 
> The memblock is to only avoid bootmem allocations from that area. It can
> be managed in the arch layer to first pass in all the system ram,
> teardown the bootmem allocator, setup the nodelists, set system
> nodemask, init CDM, init the allocator for that, and then optionally add
> it to the system CDM for userspace to do the isolation or provide.
> 
> For that matter, the driver could do the discovery and then fake a
> memory hot-add.

Not sure I got this correctly. Could you please explain more.

> 
> It would be tough to do this but it would confine the logic to the arch
> and driver that cares instead of special casing the allocators.

I did not get this part, could you please give some more details.

> 
>>> At best this is incomplete because it does not look as if it could be used
>>> by anything properly and the fast path alterations are horrible even if
>>> it could be used. As it is, it should not be merged in my opinion.
>>
>> I have mentioned in detail above how this much of code change enables
>> us to use the CDM in a transparent way from the user space. Please do
>> let me know if it still does not make sense, will try again.
>>
>> On the fast path changes issue, I can really understand your concern
>> from the performance point of view
> 
> And a maintenance overhead because changing any part where CDM is special
> cased will be impossible for many to test and verify.

Okay.

> 
>> as it has been achieved over a long time.
>> It would be great if you can suggest on how to improve from here.
>>
> 
> I do not have the bandwidth to get involved in how hardware enablement
> for this feature should be done on power. The objection is that however
> it is handled should not need to add special casing to the allocator
> which already has mechanisms for limiting what memory is used.
> 

Okay.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21 11:11     ` Michal Hocko
@ 2017-02-21 13:39       ` Anshuman Khandual
  2017-02-22  9:50         ` Michal Hocko
  2017-02-24  1:06         ` Bob Liu
  0 siblings, 2 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-21 13:39 UTC (permalink / raw)
  To: Michal Hocko, Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On 02/21/2017 04:41 PM, Michal Hocko wrote:
> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
> [...]
>> * User space using mbind() to get CDM memory is an additional benefit
>>   we get by making the CDM plug in as a node and be part of the buddy
>>   allocator. But the overall idea from the user space point of view
>>   is that the application can allocate any generic buffer and try to
>>   use the buffer either from the CPU side or from the device without
>>   knowing about where the buffer is really mapped physically. That
>>   gives a seamless and transparent view to the user space where CPU
>>   compute and possible device based compute can work together. This
>>   is not possible through a driver allocated buffer.
> 
> But how are you going to define any policy around that. Who is allowed

The user space VMA can define the policy with an mbind(MPOL_BIND) call
with CDM/CDMs in the nodemask.

> to allocate and how much of this "special memory". Is it possible that

Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
the nodemask can allocate from the CDM memory. "How much" gets controlled
by how we fault from CPU and the default behavior of the buddy allocator.

> we will eventually need some access control mechanism? If yes then mbind

No access control mechanism is needed. If an application wants to use
CDM memory by specifying in the mbind() it can. Nothing prevents it
from using the CDM memory.

> is really not suitable interface to (ab)use. Also what should happen if
> the mbind mentions only CDM memory and that is depleted?

IIUC *only CDM* cannot be requested from user space as there is no user
visible interface which can translate to __GFP_THISNODE. MPOL_BIND with
CDM in the nodemask will eventually pick a FALLBACK zonelist which will
have zones of the system including the CDM ones. If the resultant CDM
zones run out of memory, we fail the allocation request as usual.

> 
> Could you also explain why the transparent view is really better than
> using a device specific mmap (aka CDM awareness)?

Okay, with a transparent view we can achieve an application control flow
like the following.

(1) Allocate a buffer:		alloc_buffer(buf, size)
(2) CPU compute on buffer:	cpu_compute(buf, size)
(3) Device compute on buffer:	device_compute(buf, size)
(4) CPU compute on buffer:	cpu_compute(buf, size)
(5) Release the buffer:		release_buffer(buf, size)

With assistance from a device specific driver, the actual page mapping of
the buffer can change between system RAM and device memory depending on
which side is accessing at a given point. This will be achieved through
driver initiated migrations.

>  
>> * The placement of the memory for the buffer can happen on system memory
>>   when the CPU faults while accessing it. But a driver can manage the
>>   migration between system RAM and CDM memory once the buffer is being
>>   used from the CPU and the device interchangeably. As you have mentioned,
>>   the driver will have more information about where which part of the
>>   buffer should be placed at any point of time, and it can make that
>>   happen with migration. So both allocation and placement are decided by
>>   the driver during runtime. CDM provides the framework for this kind of
>>   device assisted compute and driver managed memory placements.
>>
>> * If any application is not using the CDM memory placed on its buffer
>>   for a long time, and another application was forced to fall back on
>>   system RAM when it really wanted CDM, the driver can detect these
>>   kinds of situations through memory access patterns on the device HW
>>   and take the necessary migration decisions.
> 
> Is this implemented or at least designed?

Yeah, it's being designed.

> 
> Btw. I believe that sending new versions of the patchset with minor
> changes is not really helping the review process. I believe the
> highlevel concerns about the API are not resolved yet and that is the
> number 1 thing to deal with currently.

Got it.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21 13:09       ` Anshuman Khandual
@ 2017-02-21 20:14         ` Jerome Glisse
  2017-02-23  8:14           ` Anshuman Khandual
  2017-02-22  9:29         ` Michal Hocko
  2017-02-23 15:57         ` Mel Gorman
  2 siblings, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-02-21 20:14 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, mhocko, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On Tue, Feb 21, 2017 at 06:39:17PM +0530, Anshuman Khandual wrote:
> On 02/17/2017 07:02 PM, Mel Gorman wrote:
> > On Fri, Feb 17, 2017 at 05:11:57PM +0530, Anshuman Khandual wrote:
> >> On 02/15/2017 11:50 PM, Mel Gorman wrote:
> >>> On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:

[...]

> >> * The placement of the memory on the buffer can happen on system memory
> >>   when the CPU faults while accessing it. But a driver can manage the
> >>   migration between system RAM and CDM memory once the buffer is being
> >>   used from CPU and the device interchangeably.
> > 
> > While I'm not familiar with the details because I'm not generally involved
> > in hardware enablement, why was HMM not suitable? I know HMM had it's own
> > problems with merging but as it also managed migrations between RAM and
> > device memory, how did it not meet your requirements? If there were parts
> > of HMM missing, why was that not finished?
> 
> 
> These are the reasons which prohibit the use of HMM for coherent
> addressable device memory purpose.
> 
> (1) IIUC HMM currently supports only a subset of anon mappings in the
> user space. It does not support shared anon mappings or any sort of file
> mapping for that matter. We need support for all mappings in the user
> space for the CPU/device compute to be effective and transparent. As HMM
> depends on ZONE_DEVICE for device memory representation, there are some
> unique challenges in making it work for file mapping (and page cache)
> during migrations between system RAM and device memory.

I need to debunk that. HMM does not support file-backed pages (or shared
memory) for a single reason: the CPU cannot access HMM memory. If the device
memory is accessible from the CPU in a cache coherent fashion then adding
support for file-backed pages is easy. There are only a handful of places in
the filesystem that assume pages are on the LRU, and all that is needed is
allowing file-backed pages not to be on the LRU. An extra thing would be to
forbid GUP, but that is easy.


> 
> (2) ZONE_DEVICE has been modified to support un-addressable memory apart
> from addressable persistent memory which is not movable. It still would
> have to support coherent device memory which will be movable.

Again, this isn't how it is implemented. I split the un-addressable part
from the movable property. So you can implement addressable and movable
memory using the HMM modifications to ZONE_DEVICE.

> 
> (3) Application cannot directly allocate into device memory from user
> space using existing memory related system calls like mmap() and mbind()
> as the device memory hides away in ZONE_DEVICE.

That's true, but this is a deliberate choice. From the beginning my choices
have been guided by the principle that I do not want to add or modify
existing syscalls because we do not have real world experience with this.

Once HMM is used with real world workloads by people other than me or
NVidia and we get feedback on what people writing applications that
leverage this would like to do, then we might start thinking about mbind()
or other APIs to expose more policy control to applications.

For the time being all policy and migration decisions are done by the
driver, which collects hints and statistics from the userspace driver of
the GPU. So this is all device specific and it uses existing driver
mechanisms.

> 
> Apart from that, CDM framework provides a different approach to device
> memory representation which does not require special device memory kind
> of handling and associated call backs as implemented by HMM. It provides
> NUMA node based visibility to the user space which can be extended to
> support new features.

True, we diverge there. I am not convinced that NUMA is the right direction.
NUMA was designed for CPUs, and CDM or device memory is more at a sub-level
than NUMA. Each device is attached to a given CPU node, itself part of the
NUMA hierarchy. So to me CDM is more about having a hierarchy of memory
at the node level and thus should not be implemented in NUMA. Something new
is needed, not only for device memory but for things like stacked memory
that won't be used as last level cache, as has been done in existing Intel
CPUs. I believe we will have a deeper hierarchy of memory, from fast high
bandwidth stacked memory (on top of the CPU/GPU die) to regular memory as
we know it, and also device memory.
 

> > I know HMM had a history of problems getting merged but part of that was a
> > chicken and egg problem where it was a lot of infrastructure to maintain
> > with no in-kernel users. If CDM is a potential user then CDM could be
> 
> CDM is not a user there; HMM needs to change (with the above challenges)
> to accommodate coherent device memory, which it does not support at this
> moment.

There is no need to change anything in the current HMM to support CDM. What
you would want is to add file-backed pages, which would require allowing
non-LRU pages (this LRU assumption for file-backed pages exists in only a
couple of places, and I don't remember thinking it would be a challenge to
change that).


> > built on top and ask for a merge of both the core infrastructure required
> > and the drivers at the same time.
> 
> I am afraid the drivers would be HW vendor specific.
> 
> > 
> > It's not an easy path but the difficulties there do not justify special
> > casing CDM in the core allocator.
> 
> Hmm. Even if HMM supports all sorts of mappings in user space and related
> migrations, we still will not have direct allocations from user space with
> mmap() and mbind() system calls.

I am not sure we want to have this kind of direct allocation from day one.
I would rather have the whole thing fire-tested with real applications and
real users through device drivers, then wait and see if common usage
patterns warrant creating a generic API to direct new memory allocations to
device memory.

 
> >>   As you have mentioned,
> >>   the driver will have more information about where which part of the
> >>   buffer should be placed at any point of time, and it can make that
> >>   happen with migration. So both allocation and placement are decided
> >>   by the driver during runtime. CDM provides the framework for this
> >>   kind of device assisted compute and driver managed memory placements.
> >>
> > 
> > Which sounds like what HMM needed and the problems of co-ordinating whether
> > data within a VMA is located on system RAM or device memory and what that
> > means is not addressed by the series.
> 
> Did not get that. What is not addressed by this series? How are the
> requirements of the HMM and CDM frameworks different?

The VMA flag of CDM is really bad from my point of view. I do
understand and agree that you want to block auto-NUMA and KSM or
anything similar from happening to CDM memory, but this is a property
of the memory that backs some addresses in a given VMA. It is not a
property of a VMA region. Given that auto-NUMA and KSM work from the
VMA down to memory, I understand why one would want to block them
there, but it is wrong.

I already said that a common pattern will be a fragmented VMA, i.e. a
VMA in which some addresses are backed by device memory and others by
regular memory (and no, you do not want to split the VMA). So to me it
is clear you need to block KSM or auto-NUMA at the page level, e.g. by
using a memory type property of the node to which the page belongs.

Dropping the CDM flag would simplify your whole patchset.

 
> > 
> > Even if HMM is unsuitable, it should be clearly explained why
> 
> I just did explain in the previous paragraphs above.
> 
> > 
> >> * If any application is not using the CDM memory placed on its buffer
> >>   for a long time, and another application was forced to fall back on
> >>   system RAM when it really wanted CDM, the driver can detect these
> >>   kinds of situations through memory access patterns on the device HW
> >>   and take the necessary migration decisions.
> >>
> >> I hope this explains the rationale of the framework. In fact these
> >> four patches give logically complete CPU/Device operating framework.
> >> Other parts of the bigger picture are VMA management, KSM, Auto NUMA
> >> etc which are improvements on top of this basic framework.
> >>
> > 
> > Automatic NUMA balancing is a particular oddity as that is about
> > CPU->RAM locality and not RAM->device considerations.
> 
> Right. But when there are migrations happening between system RAM and
> device memory, Auto NUMA with its CPU fault information can migrate
> between system RAM nodes, which might not be necessary and can lead to
> conflicts or overhead. Hence Auto NUMA needs to be switched off at times
> for the VMAs of concern, but that is not addressed in the patch series.
> As mentioned before, it will be in the follow-up work as improvements
> on this series.

I do not think auto-NUMA needs to be switched off for the whole VMA; only
block it for device memory, because auto-NUMA can't gather device memory
usage statistics.

[...]

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21 13:09       ` Anshuman Khandual
  2017-02-21 20:14         ` Jerome Glisse
@ 2017-02-22  9:29         ` Michal Hocko
  2017-02-22 14:59           ` Jerome Glisse
  2017-02-23  8:52           ` Anshuman Khandual
  2017-02-23 15:57         ` Mel Gorman
  2 siblings, 2 replies; 43+ messages in thread
From: Michal Hocko @ 2017-02-22  9:29 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On Tue 21-02-17 18:39:17, Anshuman Khandual wrote:
> On 02/17/2017 07:02 PM, Mel Gorman wrote:
[...]
> > Why can this not be expressed with cpusets and memory policies
> > controlled by a combination of administrative steps for a privileged
> > application and an application that is CDM aware?
> 
> Hmm, that can be done but having an in kernel infrastructure has the
> following benefits.
> 
> * Administrator does not have to listen to node add notifications
>   and keep the isolation/allowed cpusets upto date all the time.
>   This can be a significant overhead on the admin/userspace which
>   have a number of separate device memory nodes.

But the application has to communicate with the device, so why can it not
use a device specific allocation as well? I really fail to see why
something this special should hide behind a generic API and spread all
the special casing into the kernel instead.
 
> * With the cpuset solution, tasks which are part of a CDM allowed cpuset
>   can have all their VMAs allocate from CDM memory, which may not be
>   something the user wants. For example, the user may not want the
>   text segments or libraries allocated from CDM. To achieve this
>   the user will have to explicitly block allocation access from CDM
>   through mbind(MPOL_BIND) memory policy setups. This negative setup
>   is a big overhead. But with the in-kernel CDM framework, isolation is
>   enabled by default. For CDM allocations the application just has
>   to set up a memory policy with the CDM node in the allowed nodemask.

Which makes cpusets vs. mempolicies even bigger mess, doesn't it? So say
that you have an application which wants to benefit from CDM and use
mbind to have an access to this memory for particular buffer. Now you
try to run this application in a cpuset which doesn't include this node
and now what? Cpuset will override the application policy so the buffer
will never reach the requested node. At least not without even more
hacks to cpuset handling. I really do not like that!

[...]
> These are the reasons which prohibit the use of HMM for coherent
> addressable device memory purpose.
> 
[...]
> (3) Application cannot directly allocate into device memory from user
> space using existing memory related system calls like mmap() and mbind()
> as the device memory hides away in ZONE_DEVICE.

Why cannot the application simply use mmap on the device file?

> Apart from that, CDM framework provides a different approach to device
> memory representation which does not require special device memory kind
> of handling and associated call backs as implemented by HMM. It provides
> NUMA node based visibility to the user space which can be extended to
> support new features.

What do you mean by new features and how users will use/request those
features (aka what is the API)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21 13:39       ` Anshuman Khandual
@ 2017-02-22  9:50         ` Michal Hocko
  2017-02-23  6:52           ` Anshuman Khandual
  2017-02-24  1:06         ` Bob Liu
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-02-22  9:50 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On Tue 21-02-17 19:09:18, Anshuman Khandual wrote:
> On 02/21/2017 04:41 PM, Michal Hocko wrote:
> > On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
> > [...]
> >> * User space using mbind() to get CDM memory is an additional benefit
> >>   we get by making the CDM plug in as a node and be part of the buddy
> >>   allocator. But the overall idea from the user space point of view
> >>   is that the application can allocate any generic buffer and try to
> >>   use the buffer either from the CPU side or from the device without
> >>   knowing about where the buffer is really mapped physically. That
> >>   gives a seamless and transparent view to the user space where CPU
> >>   compute and possible device based compute can work together. This
> >>   is not possible through a driver allocated buffer.
> > 
> > But how are you going to define any policy around that. Who is allowed
> 
> The user space VMA can define the policy with a mbind(MPOL_BIND) call
> with CDM/CDMs in the nodemask.
>
> > to allocate and how much of this "special memory". Is it possible that
> 
> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
> the nodemask can allocate from the CDM memory. "How much" gets controlled
> by how we fault from CPU and the default behavior of the buddy allocator.

In other words the policy is implemented by the kernel. Why is this a
good thing?

> > we will eventually need some access control mechanism? If yes then mbind
> 
> No access control mechanism is needed. If an application wants to use
> CDM memory by specifying in the mbind() it can. Nothing prevents it
> from using the CDM memory.

What if we find out that an access control _is_ really needed? I can
easily imagine that some devices will come up with really fast and expensive
memory. You do not want some random user to steal it from you when you
want to use it for your workload.

> > is really not suitable interface to (ab)use. Also what should happen if
> > the mbind mentions only CDM memory and that is depleted?
> 
> IIUC *only CDM* cannot be requested from user space as there are no user
> visible interface which can translate to __GFP_THISNODE.

I do not understand what __GFP_THISNODE has to do with this. This is an
internal flag.

> MPOL_BIND with
> CDM in the nodemask will eventually pick a FALLBACK zonelist which will
> have zones of the system including CDM ones. If the resultant CDM zones
> run out of memory, we fail the allocation request as usual.

OK, so let's say you mbind to a single node which is CDM. You seem to be
saying that we will simply break the NUMA affinity in this special case?
Currently we invoke the OOM killer if nodes which the application binds
to are depleted and cannot be reclaimed.
 
> > Could you also explain why the transparent view is really better than
> > using a device specific mmap (aka CDM awareness)?
> 
> Okay with a transparent view, we can achieve a control flow of application
> like the following.
> 
> (1) Allocate a buffer:		alloc_buffer(buf, size)
> (2) CPU compute on buffer:	cpu_compute(buf, size)
> (3) Device compute on buffer:	device_compute(buf, size)
> (4) CPU compute on buffer:	cpu_compute(buf, size)
> (5) Release the buffer:		release_buffer(buf, size)
> 
> With assistance from a device specific driver, the actual page mapping of
> the buffer can change between system RAM and device memory depending on
> which side is accessing at a given point. This will be achieved through
> driver initiated migrations.

But then you do not need any NUMA affinity, right? The driver can do
all this automagically. How does the NUMA policy come into the game in
your example above? Sorry for being dense, I might be really missing
something important here, but I really fail to see why NUMA is the
proper interface here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-22  9:29         ` Michal Hocko
@ 2017-02-22 14:59           ` Jerome Glisse
  2017-02-22 16:54             ` Michal Hocko
  2017-02-23  8:52           ` Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-02-22 14:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anshuman Khandual, Mel Gorman, linux-kernel, linux-mm, vbabka,
	minchan, aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On Wed, Feb 22, 2017 at 10:29:21AM +0100, Michal Hocko wrote:
> On Tue 21-02-17 18:39:17, Anshuman Khandual wrote:
> > On 02/17/2017 07:02 PM, Mel Gorman wrote:

[...]

> [...]
> > These are the reasons which prohibit the use of HMM for coherent
> > addressable device memory purpose.
> > 
> [...]
> > (3) Application cannot directly allocate into device memory from user
> > space using existing memory related system calls like mmap() and mbind()
> > as the device memory hides away in ZONE_DEVICE.
> 
> Why cannot the application simply use mmap on the device file?

This has been said before, but we want to share the address space, and
this implies that you cannot rely on a special allocator. For instance
you can have an application that uses a library, and the library uses
the GPU, but the application is unaware; thus any data provided by the
application to the library will come from generic malloc (mmap anonymous
or from a regular file).

Currently what happens is that the library reallocates memory through a
special allocator and copies things. Not only does this waste memory
(the new memory is often regular memory too) but you also have to pay
the cost of copying GB of data.

The last bullet to this is complex data structures (lists, trees, ...):
having to go through a special allocator means you have to rebuild the
whole structure with the duplicated memory.


Allowing the direct use of memory allocated from malloc (mmap anonymous
private or from a regular file) avoids the copy operation and the complex
duplication of data structures. Moving the dataset to the GPU is then a
simple memory migration from the kernel point of view.

This shared address space without a special allocator is mandatory in new
or future standards such as OpenCL, CUDA, C++, OpenMP, ... Some other OSes
already have this and the industry wants it. So the question is: do we
want to support any of this, do we care about GPGPU?


I believe we want to support all these new standards, but maybe I am the
only one.

In the HMM case I have the extra painful fact that the device memory is
not accessible by the CPU. For CDM, on the contrary, the CPU can access
the device memory in a cache coherent way and all operations behave as on
regular memory (things like atomic operations, for instance).


I hope this clearly explains why we can no longer rely on a dedicated/
specialized memory allocator.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-22 14:59           ` Jerome Glisse
@ 2017-02-22 16:54             ` Michal Hocko
  2017-03-06  5:48               ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-02-22 16:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Anshuman Khandual, Mel Gorman, linux-kernel, linux-mm, vbabka,
	minchan, aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On Wed 22-02-17 09:59:15, Jerome Glisse wrote:
> On Wed, Feb 22, 2017 at 10:29:21AM +0100, Michal Hocko wrote:
> > On Tue 21-02-17 18:39:17, Anshuman Khandual wrote:
> > > On 02/17/2017 07:02 PM, Mel Gorman wrote:
> 
> [...]
> 
> > [...]
> > > These are the reasons which prohibit the use of HMM for coherent
> > > addressable device memory purpose.
> > > 
> > [...]
> > > (3) Application cannot directly allocate into device memory from user
> > > space using existing memory related system calls like mmap() and mbind()
> > > as the device memory hides away in ZONE_DEVICE.
> > 
> > Why cannot the application simply use mmap on the device file?
> 
> This has been said before, but we want to share the address space, and
> this implies that you cannot rely on a special allocator. For instance
> you can have an application that uses a library, and the library uses
> the GPU, but the application is unaware; thus any data provided by the
> application to the library will come from generic malloc (mmap anonymous
> or from a regular file).
> 
> Currently what happens is that the library reallocates memory through a
> special allocator and copies things. Not only does this waste memory
> (the new memory is often regular memory too) but you also have to pay
> the cost of copying GB of data.
> 
> The last bullet to this is complex data structures (lists, trees, ...):
> having to go through a special allocator means you have to rebuild the
> whole structure with the duplicated memory.
> 
> 
> Allowing the direct use of memory allocated from malloc (mmap anonymous
> private or from a regular file) avoids the copy operation and the complex
> duplication of data structures. Moving the dataset to the GPU is then a
> simple memory migration from the kernel point of view.
> 
> This shared address space without a special allocator is mandatory in new
> or future standards such as OpenCL, CUDA, C++, OpenMP, ... Some other OSes
> already have this and the industry wants it. So the question is: do we
> want to support any of this, do we care about GPGPU?
> 
> 
> I believe we want to support all these new standards, but maybe I am the
> only one.
> 
> In the HMM case I have the extra painful fact that the device memory is
> not accessible by the CPU. For CDM, on the contrary, the CPU can access
> the device memory in a cache coherent way and all operations behave as on
> regular memory (things like atomic operations, for instance).
> 
> 
> I hope this clearly explains why we can no longer rely on a dedicated/
> specialized memory allocator.

Yes this clarifies this point. Thanks for the information which would be
really helpful in the initial description. Maybe I've just missed it,
though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-22  9:50         ` Michal Hocko
@ 2017-02-23  6:52           ` Anshuman Khandual
  2017-03-05 12:39             ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-23  6:52 UTC (permalink / raw)
  To: Michal Hocko, Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On 02/22/2017 03:20 PM, Michal Hocko wrote:
> On Tue 21-02-17 19:09:18, Anshuman Khandual wrote:
>> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>>> [...]
>>>> * User space using mbind() to get CDM memory is an additional benefit
>>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>>   allocator. But the overall idea from the user space point of view
>>>>   is that the application can allocate any generic buffer and try to
>>>>   use the buffer either from the CPU side or from the device without
>>>>   knowing about where the buffer is really mapped physically. That
>>>>   gives a seamless and transparent view to the user space where CPU
>>>>   compute and possible device based compute can work together. This
>>>>   is not possible through a driver allocated buffer.
>>>
>>> But how are you going to define any policy around that. Who is allowed
>>
>> The user space VMA can define the policy with a mbind(MPOL_BIND) call
>> with CDM/CDMs in the nodemask.
>>
>>> to allocate and how much of this "special memory". Is it possible that
>>
>> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
>> the nodemask can allocate from the CDM memory. "How much" gets controlled
>> by how we fault from CPU and the default behavior of the buddy allocator.
> 
> In other words the policy is implemented by the kernel. Why is this a
> good thing?

It's controlled by the kernel only during the page fault paths of either
the CPU or the device. But the device driver will actually do the
placements afterwards, after taking into consideration access patterns
and relative performance. We don't want the driver to be involved in page
fault path memory allocations, which should naturally go through the
buddy allocator.

> 
>>> we will eventually need some access control mechanism? If yes then mbind
>>
>> No access control mechanism is needed. If an application wants to use
>> CDM memory by specifying in the mbind() it can. Nothing prevents it
>> from using the CDM memory.
> 
> What if we find out that an access control _is_ really needed? I can
> easily imagine that some devices will come up with really fast and expensive
> memory. You do not want some random user to steal it from you when you
> want to use it for your workload.

Hmm, it makes sense but I think it's not something we have to deal with
right away. Later we may have to think about some generic access control
mechanism for mbind() and then accommodate CDM with it.

> 
>>> is really not suitable interface to (ab)use. Also what should happen if
>>> the mbind mentions only CDM memory and that is depleted?
>>
>> IIUC *only CDM* cannot be requested from user space as there are no user
>> visible interface which can translate to __GFP_THISNODE.
> 
> I do not understand what __GFP_THISNODE has to do with this. This is an
> internal flag.

Right. My bad. I was just referring to the fact that there is nothing in
user space which can make the buddy allocator pick the NOFALLBACK list
instead of the FALLBACK list.

> 
>> MPOL_BIND with
>> CDM in the nodemask will eventually pick a FALLBACK zonelist which will
>> have zones of the system including CDM ones. If the resultant CDM zones
>> run out of memory, we fail the allocation request as usual.
> 
> OK, so let's say you mbind to a single node which is CDM. You seem to be
> saying that we will simply break the NUMA affinity in this special case?

Why? It should simply follow what already happens today when we bind to
a single regular NUMA node.

> Currently we invoke the OOM killer if nodes which the application binds
> to are depleted and cannot be reclaimed.

Right, the same should happen here for CDM as well.

>  
>>> Could you also explain why the transparent view is really better than
>>> using a device specific mmap (aka CDM awareness)?
>>
>> Okay with a transparent view, we can achieve a control flow of application
>> like the following.
>>
>> (1) Allocate a buffer:		alloc_buffer(buf, size)
>> (2) CPU compute on buffer:	cpu_compute(buf, size)
>> (3) Device compute on buffer:	device_compute(buf, size)
>> (4) CPU compute on buffer:	cpu_compute(buf, size)
>> (5) Release the buffer:		release_buffer(buf, size)
>>
>> With assistance from a device specific driver, the actual page mapping of
>> the buffer can change between system RAM and device memory depending on
>> which side is accessing at a given point. This will be achieved through
>> driver initiated migrations.
> 
> But then you do not need any NUMA affinity, right? The driver can do
> all this automagically. How does the numa policy comes into the game in
> your above example. Sorry for being dense, I might be really missing
> something important here, but I really fail to see why the NUMA is the
> proper interface here.

You are right. The driver can migrate any mapping in userspace to
anywhere on the system as long as cpuset does not prohibit it. But we
still want the driver to conform to the applicable VMA memory policy set
from userspace. Hence a VMA policy needs to be set from user space. The
NUMA VMA memory policy also restricts allocations during page fault
paths (CPU and device) to the applicable nodemask.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21 20:14         ` Jerome Glisse
@ 2017-02-23  8:14           ` Anshuman Khandual
  2017-02-23 15:27             ` Jerome Glisse
  0 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-23  8:14 UTC (permalink / raw)
  To: Jerome Glisse, Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, mhocko, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On 02/22/2017 01:44 AM, Jerome Glisse wrote:
> On Tue, Feb 21, 2017 at 06:39:17PM +0530, Anshuman Khandual wrote:
>> On 02/17/2017 07:02 PM, Mel Gorman wrote:
>>> On Fri, Feb 17, 2017 at 05:11:57PM +0530, Anshuman Khandual wrote:
>>>> On 02/15/2017 11:50 PM, Mel Gorman wrote:
>>>>> On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
> 
> [...]
> 
>>>> * The placement of the memory on the buffer can happen on system memory
>>>>   when the CPU faults while accessing it. But a driver can manage the
>>>>   migration between system RAM and CDM memory once the buffer is being
>>>>   used from CPU and the device interchangeably.
>>>
>>> While I'm not familiar with the details because I'm not generally involved
>>> in hardware enablement, why was HMM not suitable? I know HMM had it's own
>>> problems with merging but as it also managed migrations between RAM and
>>> device memory, how did it not meet your requirements? If there were parts
>>> of HMM missing, why was that not finished?
>>
>>
>> These are the reasons which prohibit the use of HMM for coherent
>> addressable device memory purpose.
>>
>> (1) IIUC HMM currently supports only a subset of anon mapping in the
>> user space. It does not support shared anon mapping or any sort of file
>> mapping for that matter. We need support for all mapping in the user space
>> for the CPU/device compute to be effective and transparent. As HMM depends
>> on ZONE DEVICE for device memory representation, there are some unique
>> challenges in making it work for file mapping (and page cache) during
>> migrations between system RAM and device memory.
> 
> I need to debunk that. HMM does not support file back page (or share memory)
> for a single reason: CPU can not access HMM memory. If the device memory is
> accessible from CPU in cache coherent fashion then adding support for file
> back page is easy. There is only an handfull of place in the filesystem that

Does this need to be done in all filesystems that support file mapping
in user space and page caches?

> assume page are on the lru and all that is needed is allowing file back page
> to not be on the lru. Extra thing would be to forbid GUP but that is easy.

If it's not on the LRU, how are we going to manage reclaim and writeback
of the dirty pages to disk? In which order? Would a brand new
infrastructure need to be created for that purpose? Why does GUP access
need to be blocked for these device pages?

> 
> 
>>
>> (2) ZONE_DEVICE has been modified to support un-addressable memory apart
>> from addressable persistent memory which is not movable. It still would
>> have to support coherent device memory which will be movable.
> 
> Again this isn't how it is implemented. I splitted the un-addressable part
> from the move-able property. So you can implement addressable and moveable
> memory using HMM modification to ZONE_DEVICE.

Need to check this again, but yes, it's not a very big issue.

> 
>>
>> (3) Application cannot directly allocate into device memory from user
>> space using existing memory related system calls like mmap() and mbind()
>> as the device memory hides away in ZONE_DEVICE.
> 
> That's true but this is deliberate choice. From the begining my choice
> have been guided by the principle that i do not want to add or modify
> existing syscall because we do not have real world experience with this.

With the current proposal for CDM, memory system calls just work
on CDM without requiring any changes.

> 
> Once HMM is use with real world workload by people other than me or
> NVidia and we get feedback on what people writting application leveraging
> this would like to do. Then we might start thinking about mbind() or other
> API to expose more policy control to application.

I am not really sure how much effort would be required to make
ZONE_DEVICE pages accessible from user space with the existing
memory system calls. The NUMA representation just makes it work without
any further changes. But I got your point.

> 
> For time being all policy and migration decision are done by the driver
> that collect hint and statistic from the userspace driver of the GPU.
> So this is all device specific and it use existing driver mechanism.

The CDM framework also has the exact same expectations from the driver.
But it gives user space more control and visibility regarding what's
happening with the memory buffer.

> 
>>
>> Apart from that, CDM framework provides a different approach to device
>> memory representation which does not require special device memory kind
>> of handling and associated call backs as implemented by HMM. It provides
>> NUMA node based visibility to the user space which can be extended to
>> support new features.
> 
> True we diverge there. I am not convince that NUMA is the right direction.

Yeah true, we diverge here :)

> NUMA was design for CPU and CDM or device memory is more at a sub-level
> than NUMA. Each device is attach to a given CPU node itself part of the
> NUMA hierarchy. So to me CDM is more about having a hierarchy of memory
> at node level and thus should not be implemented in NUMA. Something new

Currently NUMA does not support any memory hierarchy at node level.

> is needed. Not only for device memory but for thing like stack memory
> that won't use as last level cache as it has been done in existing Intel
> CPU. I believe we will have deeper hierarchy of memory, from fast high
> bandwidth stack memory (on top of CPU/GPU die) to the regular memory as
> we know it and also device memory.

I agree, but in the absence of that infrastructure NUMA seems to be a
suitable fallback for now.
 
>  
> 
>>> I know HMM had a history of problems getting merged but part of that was a
>>> chicken and egg problem where it was a lot of infrastructure to maintain
>>> with no in-kernel users. If CDM is a potential user then CDM could be
>>
>> CDM is not a user there, HMM needs to change (with above challenges) to
>> accommodate coherent device memory which it does not support at this
>> moment.
> 
> There is no need to change anything with current HMM to support CDM. What
> you would want is to add file back page which would require to allow non
> lru page (this lru assumption of file back page exist only in couple place
> and i don't remember thinking it would be a challenge to change that).

I am afraid this statement oversimplifies the challenge at hand. Maybe
we need to start looking into the actual details to figure out how many
changes are really required for this enablement.

> 
> 
>>> built on top and ask for a merge of both the core infrastructure required
>>> and the drivers at the same time.
>>
>> I am afraid the drivers would be HW vendor specific.
>>
>>>
>>> It's not an easy path but the difficulties there do not justify special
>>> casing CDM in the core allocator.
>>
>> Hmm. Even if HMM supports all sorts of mappings in user space and related
>> migrations, we still will not have direct allocations from user space with
>> mmap() and mbind() system calls.
> 
> I am not sure we want to have this kind of direct allocation from day one.
> I would rather have the whole thing fire tested with real application and
> real user through device driver. Then wait to see if common usage pattern
> warrant to create a generic API to direct new memory allocation to device
> memory.

But we should also not overlook this aspect and go in a direction
where it can be difficult to implement at a later point in time. I am
not saying it's going to be difficult, but it's something we have to
find out.


> 
>  
>>>>   As you have mentioned
>>>>   driver will have more information about where which part of the buffer
>>>>   should be placed at any point of time and it can make it happen with
>>>>   migration. So both allocation and placement are decided by the driver
>>>>   during runtime. CDM provides the framework for this can kind device
>>>>   assisted compute and driver managed memory placements.
>>>>
>>>
>>> Which sounds like what HMM needed and the problems of co-ordinating whether
>>> data within a VMA is located on system RAM or device memory and what that
>>> means is not addressed by the series.
>>
>> Did not get that. What is not addressed by this series ? How is the
>> requirements of HMM and CDM framework are different ?
> 
> The VMA flag of CDM is really really bad from my point of view. I do
> understand and agree that you want to block auto-numa and ksm or any-
> thing similar from happening to CDM memory but this is a property of
> the memory that back some address in a given VMA. It is not a property
> of a VMA region. Given that auto-numa and KSM work from VMA down to
> memory i understand why one would want to block it there but it is
> wrong.

Okay. We have already discussed these, and the current proposed patch
series does not have those changes. I had decided to split the previous
series and posted only the isolation bits of it. We can debate the
VMA/page aspect of the solution, but only after we reach an agreement
on the isolation parts.

> 
> I already said that a common pattern will be fragmented VMA ie a VMA
> in which you have some address back by device memory and other back
> by regular memory (and no you do not want to split VMA). So to me it
> is clear you need to block KSM or auto-numa at page level ie by using
> memory type property from node to which the page belong for instance.
> 
> Droping the CDM flag would simplify your whole patchset.

Okay, got your point.

> 
>  
>>>
>>> Even if HMM is unsuitable, it should be clearly explained why
>>
>> I just did explain in the previous paragraphs above.
>>
>>>
>>>> * If any application is not using CDM memory for along time placed on
>>>>   its buffer and another application is forced to fallback on system
>>>>   RAM when it really wanted is CDM, the driver can detect these kind
>>>>   of situations through memory access patterns on the device HW and
>>>>   take necessary migration decisions.
>>>>
>>>> I hope this explains the rationale of the framework. In fact these
>>>> four patches give logically complete CPU/Device operating framework.
>>>> Other parts of the bigger picture are VMA management, KSM, Auto NUMA
>>>> etc which are improvements on top of this basic framework.
>>>>
>>>
>>> Automatic NUMA balancing is a particular oddity as that is about
>>> CPU->RAM locality and not RAM->device considerations.
>>
>> Right. But when there are migrations happening between system RAM and
>> device memory. Auto NUMA with its CPU fault information can migrate
>> between system RAM nodes which might not be necessary and can lead to
>> conflict or overhead. Hence Auto NUMA needs to be switched off at times
>> for the VMAs of concern but its not addressed in the patch series. As
>> mentioned before, it will be in the follow up work as improvements on
>> this series.
> 
> I do not think auto-numa need to be switch of for the whole VMA but only
> block it for device memory. Because auto-numa can't gather device memory
> usage statistics.

Right.


* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-22  9:29         ` Michal Hocko
  2017-02-22 14:59           ` Jerome Glisse
@ 2017-02-23  8:52           ` Anshuman Khandual
  1 sibling, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-23  8:52 UTC (permalink / raw)
  To: Michal Hocko, Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On 02/22/2017 02:59 PM, Michal Hocko wrote:
> On Tue 21-02-17 18:39:17, Anshuman Khandual wrote:
>> On 02/17/2017 07:02 PM, Mel Gorman wrote:
> [...]
>>> Why can this not be expressed with cpusets and memory policies
>>> controlled by a combination of administrative steps for a privileged
>>> application and an application that is CDM aware?
>>
>> Hmm, that can be done but having an in kernel infrastructure has the
>> following benefits.
>>
>> * Administrator does not have to listen to node add notifications
>>   and keep the isolation/allowed cpusets upto date all the time.
>>   This can be a significant overhead on the admin/userspace which
>>   have a number of separate device memory nodes.
> 
> But the application has to communicate with the device so why it cannot
> use a device specific allocation as well? I really fail to see why
> something this special should hide behind a generic API to spread all
> the special casing into the kernel instead.

Eventually both the memory and the compute parts of a hybrid
CPU/device scheme should be implemented through generic APIs.
The scheduler should be able to take inputs from user space
to schedule a device specific compute function on a device
compute thread for a single run. The scheduler should then have
callbacks, registered by the device, which can be called
to schedule the function for device compute. But right now
we are not there yet. So we are walking half the way and
trying to do it only for memory now.

>  
>> * With cpuset solution, tasks which are part of CDM allowed cpuset
>>   can have all it's VMAs allocate from CDM memory which may not be
>>   something the user want. For example user may not want to have
>>   the text segments, libraries allocate from CDM. To achieve this
>>   the user will have to explicitly block allocation access from CDM
>>   through mbind(MPOL_BIND) memory policy setups. This negative setup
>>   is a big overhead. But with in kernel CDM framework, isolation is
>>   enabled by default. For CDM allocations the application just has
>>   to setup memory policy with CDM node in the allowed nodemask.
> 
> Which makes cpusets vs. mempolicies even bigger mess, doesn't it? So say

Hence I am trying to defend the CDM framework in comparison to the
cpuset + mbind() solution from user space as suggested by Mel.

> that you have an application which wants to benefit from CDM and use
> mbind to have an access to this memory for particular buffer. Now you
> try to run this application in a cpuset which doesn't include this node
> and now what? Cpuset will override the application policy so the buffer
> will never reach the requested node. At least not without even more
> hacks to cpuset handling. I really do not like that!


Right, it will not reach the CDM. The cpuset based solution was to
have the applications which want CDM in a CDM-including cpuset and
all other applications/tasks in a CDM-excluded cpuset. A CDM aware
application can then set its own memory policies which *may*
include CDM, and they will be allowed as its cpuset contains the
nodes. But these two cpusets, one containing all the CDM nodes and
one without them, would have to be maintained all the time. No
kernel cpuset hacks would be required.

> 
> [...]
>> These are the reasons which prohibit the use of HMM for coherent
>> addressable device memory purpose.
>>
> [...]
>> (3) Application cannot directly allocate into device memory from user
>> space using existing memory related system calls like mmap() and mbind()
>> as the device memory hides away in ZONE_DEVICE.
> 
> Why cannot the application simply use mmap on the device file?

Yeah, that's possible, but then it does not go through the core VM any more.

> 
>> Apart from that, CDM framework provides a different approach to device
>> memory representation which does not require special device memory kind
>> of handling and associated call backs as implemented by HMM. It provides
>> NUMA node based visibility to the user space which can be extended to
>> support new features.
> 
> What do you mean by new features and how users will use/request those
> features (aka what is the API)?

I don't have plans for this right now. But what I meant was that once
the core VM understands CDM, even when it is represented as a NUMA node,
the existing APIs can be modified to accommodate special functions for
CDM memory.


* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-23  8:14           ` Anshuman Khandual
@ 2017-02-23 15:27             ` Jerome Glisse
  0 siblings, 0 replies; 43+ messages in thread
From: Jerome Glisse @ 2017-02-23 15:27 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, mhocko, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On Thu, Feb 23, 2017 at 01:44:06PM +0530, Anshuman Khandual wrote:
> On 02/22/2017 01:44 AM, Jerome Glisse wrote:
> > On Tue, Feb 21, 2017 at 06:39:17PM +0530, Anshuman Khandual wrote:
> >> On 02/17/2017 07:02 PM, Mel Gorman wrote:
> >>> On Fri, Feb 17, 2017 at 05:11:57PM +0530, Anshuman Khandual wrote:
> >>>> On 02/15/2017 11:50 PM, Mel Gorman wrote:
> >>>>> On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
> > 
> > [...]
> > 
> >>>> * The placement of the memory on the buffer can happen on system memory
> >>>>   when the CPU faults while accessing it. But a driver can manage the
> >>>>   migration between system RAM and CDM memory once the buffer is being
> >>>>   used from CPU and the device interchangeably.
> >>>
> >>> While I'm not familiar with the details because I'm not generally involved
> >>> in hardware enablement, why was HMM not suitable? I know HMM had it's own
> >>> problems with merging but as it also managed migrations between RAM and
> >>> device memory, how did it not meet your requirements? If there were parts
> >>> of HMM missing, why was that not finished?
> >>
> >>
> >> These are the reasons which prohibit the use of HMM for coherent
> >> addressable device memory purpose.
> >>
> >> (1) IIUC HMM currently supports only a subset of anon mapping in the
> >> user space. It does not support shared anon mapping or any sort of file
> >> mapping for that matter. We need support for all mapping in the user space
> >> for the CPU/device compute to be effective and transparent. As HMM depends
> >> on ZONE DEVICE for device memory representation, there are some unique
> >> challenges in making it work for file mapping (and page cache) during
> >> migrations between system RAM and device memory.
> > 
> > I need to debunk that. HMM does not support file back page (or share memory)
> > for a single reason: CPU can not access HMM memory. If the device memory is
> > accessible from CPU in cache coherent fashion then adding support for file
> > back page is easy. There is only an handfull of place in the filesystem that
> 
> This needs to be done in all file systems possible which supports file
> mapping in the user space and page caches ?

No, last time I checked only a couple of filesystems made assumptions in
a couple of places about pages being on the LRU (fuse and xfs, if my
memory serves me right).

The remaining place is a single function in common fs code (again, if my
memory is serving me properly).


> > assume page are on the lru and all that is needed is allowing file back page
> > to not be on the lru. Extra thing would be to forbid GUP but that is easy.
> 
> If its not on LRU how we are going to manage the reclaim and write back
> into the disk for the dirty pages ? In which order ? Then a brand new
> infrastructure needs to be created for that purpose ? Why GUP access
> needs to be blocked for these device pages ?

Writeback does not rely on the LRU last time I checked, so writeback is
fine. Reclaim obviously ignores such pages; this can trigger issues, I
guess, if we have enough pages in such memory that we reach thresholds
that constantly force regular pages to be reclaimed.

To me the question is: do we want regular reclaim? I do not think so, as
the current reclaim code cannot gather statistics from each device to
know what memory is being used by whom. I see reclaim as a per device,
per node problem. Regular memory should be reclaimed with the existing
mechanism, but device memory should be reclaimed with new infrastructure
that can work with the device driver to know what memory to reclaim and
when to reclaim it.

Existing reclaim has been fine tuned for CPU workloads over the years
and I would rather not disturb that for a new workload we don't have
much experience with yet.

GUP blocking is to forbid anyone from pinning a page inside device
memory so that device memory can always be reclaimed. What I would
really like is to add a new API for things like direct I/O that only
need to pin memory for a short period of time, versus drivers abusing
GUP to pin memory for device purposes. That way things like direct I/O
could work while blocking long lived pins.

I want to block pins because many GPUs have a contiguous memory
requirement on their graphics side (the compute side doesn't have this
kind of restriction), so you do not want to fragment device memory with
pinned pages.


> >> (2) ZONE_DEVICE has been modified to support un-addressable memory apart
> >> from addressable persistent memory which is not movable. It still would
> >> have to support coherent device memory which will be movable.
> > 
> > Again this isn't how it is implemented. I splitted the un-addressable part
> > from the move-able property. So you can implement addressable and moveable
> > memory using HMM modification to ZONE_DEVICE.
> 
> Need to check this again but yes its not a very big issue.
> 
> > 
> >>
> >> (3) Application cannot directly allocate into device memory from user
> >> space using existing memory related system calls like mmap() and mbind()
> >> as the device memory hides away in ZONE_DEVICE.
> > 
> > That's true but this is deliberate choice. From the begining my choice
> > have been guided by the principle that i do not want to add or modify
> > existing syscall because we do not have real world experience with this.
> 
> With the current proposal for CDM, memory system calls just work
> on CDM without requiring any changes.
> 
> > 
> > Once HMM is use with real world workload by people other than me or
> > NVidia and we get feedback on what people writting application leveraging
> > this would like to do. Then we might start thinking about mbind() or other
> > API to expose more policy control to application.
> 
> I am not really sure how much of effort would be required to make
> ZONE_DEVICE pages to be accessible from user space with existing
> memory system calls. NUMA representation just makes it work without
> any further changes. But I got your point.
> 
> > 
> > For time being all policy and migration decision are done by the driver
> > that collect hint and statistic from the userspace driver of the GPU.
> > So this is all device specific and it use existing driver mechanism.
> 
> CDM framework also has the exact same expectations from the driver. But
> it gives user space more control and visibility regarding whats happening
> with the memory buffer.

I understand you want a generic API to expose to userspace to allow
programs finer control over where memory is allocated, and I want that
too long term. I just don't think we have enough experience with real
workloads to make sure we are making the right decision. The fact that
you use existing APIs is good in my view, as it means you are not adding
things we will regret later :) If the CDM NUMA node is not working well
we can just stop reporting it, and thus I don't think your patchset is
cornering us. So I believe it is worth adding now to gather experience
with it. We can easily back off.

 
> >> Apart from that, CDM framework provides a different approach to device
> >> memory representation which does not require special device memory kind
> >> of handling and associated call backs as implemented by HMM. It provides
> >> NUMA node based visibility to the user space which can be extended to
> >> support new features.
> > 
> > True we diverge there. I am not convince that NUMA is the right direction.
> 
> Yeah true, we diverge here :)
> 
> > NUMA was design for CPU and CDM or device memory is more at a sub-level
> > than NUMA. Each device is attach to a given CPU node itself part of the
> > NUMA hierarchy. So to me CDM is more about having a hierarchy of memory
> > at node level and thus should not be implemented in NUMA. Something new
> 
> Currently NUMA does not support any memory hierarchy at node level.

Yes, and this is what I would like to see. This is something we will
need with CPUs that have a big chunk of fast on-die memory plus the
regular memory, and you can even add persistent memory to the mix as a
bigger but slower kind of memory. I think this is something we need to
think about, and that it will be needed not only for device memory.

I do not think that we should hold off CDM until we have that. From my
point of view, beside the VMA flag, CDM is fine (though I won't go into
the cpuset discussion as I am ignorant of that part). But I believe this
NUMA solution you have shouldn't be the end of it.


> > is needed. Not only for device memory but for thing like stack memory
> > that won't use as last level cache as it has been done in existing Intel
> > CPU. I believe we will have deeper hierarchy of memory, from fast high
> > bandwidth stack memory (on top of CPU/GPU die) to the regular memory as
> > we know it and also device memory.
> 
> I agree but in absence of the infrastructure NUMA seems to be a suitable
> fallback for now.

Yes agree.
  
> >>> I know HMM had a history of problems getting merged but part of that was a
> >>> chicken and egg problem where it was a lot of infrastructure to maintain
> >>> with no in-kernel users. If CDM is a potential user then CDM could be
> >>
> >> CDM is not a user there, HMM needs to change (with above challenges) to
> >> accommodate coherent device memory which it does not support at this
> >> moment.
> > 
> > There is no need to change anything with current HMM to support CDM. What
> > you would want is to add file back page which would require to allow non
> > lru page (this lru assumption of file back page exist only in couple place
> > and i don't remember thinking it would be a challenge to change that).
> 
> I am afraid this statement over simplifies the challenge in hand. May be
> we need to start looking into actual details to figure out how much of
> changes are really required for this enablement.

Like I said, from memory writeback is fine as it doesn't deal with the
LRU; the only places that do are readahead and the generic read helpers
that populate the page cache. This can be prototyped without device
memory, just using regular memory and pretending it is device memory. I
can take a look into that, maybe before the mm summit.

 
> >> I am afraid the drivers would be HW vendor specific.
> >>
> >>>
> >>> It's not an easy path but the difficulties there do not justify special
> >>> casing CDM in the core allocator.
> >>
> >> Hmm. Even if HMM supports all sorts of mappings in user space and related
> >> migrations, we still will not have direct allocations from user space with
> >> mmap() and mbind() system calls.
> > 
> > I am not sure we want to have this kind of direct allocation from day one.
> > I would rather have the whole thing fire tested with real application and
> > real user through device driver. Then wait to see if common usage pattern
> > warrant to create a generic API to direct new memory allocation to device
> > memory.
> 
> But we should not also over look this aspect and go in a direction
> where it can be difficult to implement at later point in time. I am
> not saying its going to be difficult but its something we have to
> find out.

Yes agree.


Cheers,
Jérôme


* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21 13:09       ` Anshuman Khandual
  2017-02-21 20:14         ` Jerome Glisse
  2017-02-22  9:29         ` Michal Hocko
@ 2017-02-23 15:57         ` Mel Gorman
  2017-03-06  5:12           ` Anshuman Khandual
  2 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2017-02-23 15:57 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On Tue, Feb 21, 2017 at 06:39:17PM +0530, Anshuman Khandual wrote:
> >>> In itself, the series does very little and as Vlastimil already pointed
> >>> out, it's not a good idea to try merge piecemeal when people could not
> >>> agree on the big picture (I didn't dig into it).
> >>
> >> With the proposed kernel changes and a associated driver its complete to
> >> drive a user space based CPU/Device hybrid compute interchangeably on a
> >> mmap() allocated memory buffer transparently and effectively.
> > 
> > How is the device informed that data is available for processing?
> 
> It will be informed through a call to the driver from user space, which
> can take the required buffer address as an argument.
> 

Which goes back to tending towards what HMM intended, but Jerome has covered
all the relevant points there so I won't repeat any of them in this response.
It did sound as though HMM was in part not used because it was missing some
small steps which could have been added to meet their requirements, instead
of proposing something different that requires special casing.

> > What prevents an application from modifying the data on the device while
> > it's being processed?
> 
> Nothing in software. The application should take care of that, but access
> from both sides is coherent. It should wait for the device until it
> finishes the compute it had asked for earlier, to prevent overwrites and
> eventual corruption.
> 

Which adds the caveat that applications must be fully CDM aware, so if
there are additional calls related to policies or administrative tasks
for cpusets, then it follows that the application can also be aware of them.

> > Why can this not be expressed with cpusets and memory policies
> > controlled by a combination of administrative steps for a privileged
> > application and an application that is CDM aware?
> 
> Hmm, that can be done but having an in kernel infrastructure has the
> following benefits.
> 
> * The administrator does not have to listen for node add notifications
>   and keep the isolation/allowed cpusets up to date all the time.
>   This can be a significant overhead on the admin/userspace for systems
>   which have a number of separate device memory nodes.
> 

This could be handled with udev triggers potentially, or, if udev events are
not raised by the memory hot-add, it could still be polled.

> * With the cpuset solution, tasks which are part of a CDM-allowed cpuset
>   can have all their VMAs allocate from CDM memory, which may not be
>   something the user wants. For example, the user may not want the
>   text segments and libraries to allocate from CDM. To achieve this
>   the user would have to explicitly block allocation access from CDM
>   through mbind(MPOL_BIND) memory policy setups. This negative setup
>   is a big overhead. But with the in-kernel CDM framework, isolation is
>   enabled by default. For CDM allocations the application just has
>   to set up a memory policy with the CDM node in the allowed nodemask.
> 

Then distinguish between task-wide policies that forbid CDM nodes and
per-VMA policies that allow the CDM nodes. Migration between system
memory and devices remains a separate problem but migration would also
not be covered by special casing the allocator.

> Even with the cpuset solution, applications still need to know which nodes
> are CDM on the system at any given point in time. So we will have to store
> that in a nodemask and export it on sysfs somehow.
> 

Which in itself is not too bad and doesn't require special casing the
allocator.

> > 
> >> I had also
> >> mentioned these points on the last posting in response to a comment from
> >> Vlastimil.
> >>
> >> From this response (https://lkml.org/lkml/2017/2/14/50).
> >>
> >> * User space using mbind() to get CDM memory is an additional benefit
> >>   we get by making the CDM plug in as a node and be part of the buddy
> >>   allocator. But the over all idea from the user space point of view
> >>   is that the application can allocate any generic buffer and try to
> >>   use the buffer either from the CPU side or from the device without
> >>   knowing about where the buffer is really mapped physically. That
> >>   gives a seamless and transparent view to the user space where CPU
> >>   compute and possible device based compute can work together. This
> >>   is not possible through a driver allocated buffer.
> >>
> > 
> > Which can also be done with cpusets that prevents use of CDM memory and
> > place all non-CDM processes into that cpuset with a separate cpuset for
> > CDM-aware applications that allow access to CDM memory.
> 
> Right, but with additional overheads as explained above.
> 

The application must already be aware of the CDM nodes.

> >> * If an application is not using the CDM memory placed on its buffer
> >>   for a long time, and another application is forced to fall back on
> >>   system RAM when what it really wanted was CDM, the driver can detect
> >>   these kinds of situations through memory access patterns on the
> >>   device HW and take the necessary migration decisions.
> >>
> >> I hope this explains the rationale of the framework. In fact these
> >> four patches give a logically complete CPU/device operating framework.
> >> Other parts of the bigger picture are VMA management, KSM, Auto NUMA
> >> etc. which are improvements on top of this basic framework.
> >>
> > 
> > Automatic NUMA balancing is a particular oddity as that is about
> > CPU->RAM locality and not RAM->device considerations.
> 
> Right. But when there are migrations happening between system RAM and
> device memory, Auto NUMA with its CPU fault information can migrate
> between system RAM nodes, which might not be necessary and can lead to
> conflicts or overhead. Hence Auto NUMA needs to be switched off at times
> for the VMAs of concern, but that is not addressed in this patch series.
> As mentioned before, it will be in the follow-up work as improvements on
> this series.

Ensure the policy settings for CDM-backed VMAs do not set MPOL_F_MOF and
automatic NUMA balancing will skip them. It does not require special casing
of the allocator or specific CDM-awareness.

> > The memblock is to only avoid bootmem allocations from that area. It can
> > be managed in the arch layer to first pass in all the system ram,
> > teardown the bootmem allocator, setup the nodelists, set system
> > nodemask, init CDM, init the allocator for that, and then optionally add
> > it to the system CDM for userspace to do the isolation or provide.
> > 
> > For that matter, the driver could do the discovery and then fake a
> > memory hot-add.
> 
> Not sure I got this correctly. Could you please explain more.
> 

Discover the device, and online the memory later as memory hotplug generally
does. If the faked memory hot-add operation raised an event that udev
can detect then the administrative functions could also be triggered
in userspace.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21 13:39       ` Anshuman Khandual
  2017-02-22  9:50         ` Michal Hocko
@ 2017-02-24  1:06         ` Bob Liu
  2017-02-24  4:39           ` John Hubbard
  2017-02-24  4:53           ` Jerome Glisse
  1 sibling, 2 replies; 43+ messages in thread
From: Bob Liu @ 2017-02-24  1:06 UTC (permalink / raw)
  To: Anshuman Khandual, Michal Hocko
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On 2017/2/21 21:39, Anshuman Khandual wrote:
> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>> [...]
>>> * User space using mbind() to get CDM memory is an additional benefit
>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>   allocator. But the over all idea from the user space point of view
>>>   is that the application can allocate any generic buffer and try to
>>>   use the buffer either from the CPU side or from the device without
>>>   knowing about where the buffer is really mapped physically. That
>>>   gives a seamless and transparent view to the user space where CPU
>>>   compute and possible device based compute can work together. This
>>>   is not possible through a driver allocated buffer.
>>
>> But how are you going to define any policy around that. Who is allowed
> 
> The user space VMA can define the policy with a mbind(MPOL_BIND) call
> with CDM/CDMs in the nodemask.
> 
>> to allocate and how much of this "special memory". Is it possible that
> 
> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
> the nodemask can allocate from the CDM memory. "How much" gets controlled
> by how we fault from CPU and the default behavior of the buddy allocator.
> 
>> we will eventually need some access control mechanism? If yes then mbind
> 
> No access control mechanism is needed. If an application wants to use
> CDM memory by specifying in the mbind() it can. Nothing prevents it
> from using the CDM memory.
> 
>> is really not suitable interface to (ab)use. Also what should happen if
>> the mbind mentions only CDM memory and that is depleted?
> 
> IIUC *only CDM* cannot be requested from user space as there is no user
> visible interface which can translate to __GFP_THISNODE. MPOL_BIND with
> CDM in the nodemask will eventually pick the FALLBACK zonelist, which will
> have zones of the whole system including the CDM ones. If the resultant CDM
> zones run out of memory, we fail the allocation request as usual.
> 
>>
>> Could you also explain why the transparent view is really better than
>> using a device specific mmap (aka CDM awareness)?
> 
> Okay with a transparent view, we can achieve a control flow of application
> like the following.
> 
> (1) Allocate a buffer:		alloc_buffer(buf, size)
> (2) CPU compute on buffer:	cpu_compute(buf, size)
> (3) Device compute on buffer:	device_compute(buf, size)
> (4) CPU compute on buffer:	cpu_compute(buf, size)
> (5) Release the buffer:		release_buffer(buf, size)
> 
> With assistance from a device specific driver, the actual page mapping of
> the buffer can change between system RAM and device memory depending on
> which side is accessing at a given point. This will be achieved through
> driver initiated migrations.
> 

Sorry, I'm a bit confused here.
What's the difference from Heterogeneous Memory Management (HMM), which
also "allows to use device memory transparently inside any process
without any modifications to process program code"?

Thanks,
-Bob

>>  
>>> * The placement of the memory on the buffer can happen on system memory
>>>   when the CPU faults while accessing it. But a driver can manage the
>>>   migration between system RAM and CDM memory once the buffer is being
>>>   used from CPU and the device interchangeably. As you have mentioned
>>>   driver will have more information about where which part of the buffer
>>>   should be placed at any point of time and it can make it happen with
> >>>   migration. So both allocation and placement are decided by the driver
> >>>   during runtime. CDM provides the framework for this kind of device
> >>>   assisted compute and driver managed memory placements.
>>>
>>> * If any application is not using CDM memory for along time placed on
>>>   its buffer and another application is forced to fallback on system
>>>   RAM when it really wanted is CDM, the driver can detect these kind
>>>   of situations through memory access patterns on the device HW and
>>>   take necessary migration decisions.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-24  1:06         ` Bob Liu
@ 2017-02-24  4:39           ` John Hubbard
  2017-02-24  4:53           ` Jerome Glisse
  1 sibling, 0 replies; 43+ messages in thread
From: John Hubbard @ 2017-02-24  4:39 UTC (permalink / raw)
  To: Bob Liu, Anshuman Khandual, Michal Hocko
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On 02/23/2017 05:06 PM, Bob Liu wrote:
> On 2017/2/21 21:39, Anshuman Khandual wrote:
>> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>>> [...]
>>>
>>> Could you also explain why the transparent view is really better than
>>> using a device specific mmap (aka CDM awareness)?
>>
>> Okay with a transparent view, we can achieve a control flow of application
>> like the following.
>>
>> (1) Allocate a buffer:		alloc_buffer(buf, size)
>> (2) CPU compute on buffer:	cpu_compute(buf, size)
>> (3) Device compute on buffer:	device_compute(buf, size)
>> (4) CPU compute on buffer:	cpu_compute(buf, size)
>> (5) Release the buffer:		release_buffer(buf, size)
>>
>> With assistance from a device specific driver, the actual page mapping of
>> the buffer can change between system RAM and device memory depending on
>> which side is accessing at a given point. This will be achieved through
>> driver initiated migrations.
>>
>
> Sorry, I'm a bit confused here.
> What's the difference with the Heterogeneous memory management?
> Which also "allows to use device memory transparently inside any process
> without any modifications to process program code."

OK, Jerome, let me answer this one. :)

Hi Bob,

Yes, from a userspace app's point of view, both HMM and the various NUMA-based proposals appear to 
provide the same thing: transparent, coherent access to both CPU and device memory. It's just the 
implementation that's different, and each implementation has a role:

HMM: for systems that do not provide direct access to device memory, we do need HMM. It provides a 
fault-based mechanism for transparently moving pages to the right place, and mapping them to the 
local process (CPU or device). You can think of HMM as something that provides coherent memory 
access, via software.

NUMA-based solutions: for systems that *can* provide directly addressable, coherent device memory, 
we let programs directly address the memory, and let the (probably enhanced) NUMA system handle page 
placement. There will be lots more NUMA enhancement discussions and patchsets coming, from what I 
can tell.

There are distinct advantages and disadvantages to each approach. For example, fault-based HMM can 
be slow, but it works even with hardware that doesn't directly provide coherent access--and it also 
has page fault information to guide it on page placement (thrashing detection). And NUMA systems, 
which do *not* fault nearly as much, require various artificial ways to detect when a page (or 
process) is on a suboptimal node. The NUMA approach is also, very arguably, conceptually simpler (it 
really depends on which area you look at).

So again: yes, both systems are providing a sort of coherent memory. HMM provides software based 
coherence, while NUMA assumes hardware-based memory coherence as a prerequisite.

I hope that helps, and doesn't just further muddy the waters?

--
John Hubbard
NVIDIA

>
> Thanks,
> -Bob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-24  1:06         ` Bob Liu
  2017-02-24  4:39           ` John Hubbard
@ 2017-02-24  4:53           ` Jerome Glisse
  2017-02-27  1:56             ` Bob Liu
  1 sibling, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2017-02-24  4:53 UTC (permalink / raw)
  To: Bob Liu
  Cc: Anshuman Khandual, Michal Hocko, Mel Gorman, linux-kernel,
	linux-mm, vbabka, minchan, aneesh.kumar, bsingharora, srikar,
	haren, dave.hansen, dan.j.williams

On Fri, Feb 24, 2017 at 09:06:19AM +0800, Bob Liu wrote:
> On 2017/2/21 21:39, Anshuman Khandual wrote:
> > On 02/21/2017 04:41 PM, Michal Hocko wrote:
> >> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
> >> [...]
> >>> * User space using mbind() to get CDM memory is an additional benefit
> >>>   we get by making the CDM plug in as a node and be part of the buddy
> >>>   allocator. But the over all idea from the user space point of view
> >>>   is that the application can allocate any generic buffer and try to
> >>>   use the buffer either from the CPU side or from the device without
> >>>   knowing about where the buffer is really mapped physically. That
> >>>   gives a seamless and transparent view to the user space where CPU
> >>>   compute and possible device based compute can work together. This
> >>>   is not possible through a driver allocated buffer.
> >>
> >> But how are you going to define any policy around that. Who is allowed
> > 
> > The user space VMA can define the policy with a mbind(MPOL_BIND) call
> > with CDM/CDMs in the nodemask.
> > 
> >> to allocate and how much of this "special memory". Is it possible that
> > 
> > Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
> > the nodemask can allocate from the CDM memory. "How much" gets controlled
> > by how we fault from CPU and the default behavior of the buddy allocator.
> > 
> >> we will eventually need some access control mechanism? If yes then mbind
> > 
> > No access control mechanism is needed. If an application wants to use
> > CDM memory by specifying in the mbind() it can. Nothing prevents it
> > from using the CDM memory.
> > 
> >> is really not suitable interface to (ab)use. Also what should happen if
> >> the mbind mentions only CDM memory and that is depleted?
> > 
> > IIUC *only CDM* cannot be requested from user space as there are no user
> > visible interface which can translate to __GFP_THISNODE. MPOL_BIND with
> > CDM in the nodemask will eventually pick a FALLBACK zonelist which will
> > have zones of the system including CDM ones. If the resultant CDM zones
> > run out of memory, we fail the allocation request as usual.
> > 
> >>
> >> Could you also explain why the transparent view is really better than
> >> using a device specific mmap (aka CDM awareness)?
> > 
> > Okay with a transparent view, we can achieve a control flow of application
> > like the following.
> > 
> > (1) Allocate a buffer:		alloc_buffer(buf, size)
> > (2) CPU compute on buffer:	cpu_compute(buf, size)
> > (3) Device compute on buffer:	device_compute(buf, size)
> > (4) CPU compute on buffer:	cpu_compute(buf, size)
> > (5) Release the buffer:		release_buffer(buf, size)
> > 
> > With assistance from a device specific driver, the actual page mapping of
> > the buffer can change between system RAM and device memory depending on
> > which side is accessing at a given point. This will be achieved through
> > driver initiated migrations.
> > 
> 
> Sorry, I'm a bit confused here.
> What's the difference with the Heterogeneous memory management?
> Which also "allows to use device memory transparently inside any process
> without any modifications to process program code."

HMM is first and foremost for platforms (like Intel's) where the CPU cannot
access device memory in a cache coherent way, or at all. CDM is for more
advanced platforms with a system bus that allows the CPU to access device
memory in a cache coherent way.

Hence CDM was designed to integrate more closely with existing concepts like
NUMA. From my point of view it is like another level in the memory
hierarchy. Nowadays you have local node memory and other node memory.
In the not too distant future you will have fast CPU on-die memory, local
memory (your beloved DDR3/DDR4), slightly slower but gigantic persistent
memory, and also device memory (all of those local to a node).

On top of that you will still have the regular NUMA hierarchy between
nodes. But each node will have its own local hierarchy of memory.

CDM wants to integrate with the existing memory hinting APIs, and I believe
this is needed to get some experience with how end users might want to
use this to fine tune their applications.

Some bits of HMM are generic and will be reused by CDM, for instance the
DMA capable memory migration helpers. Whether they can also share HMM's
approach of using ZONE_DEVICE is yet to be proven, but it comes with
limitations (pages can't be on the lru or have a device lru) that might
hinder a closer integration of CDM memory with many aspects of kernel mm.


This is my own view and it likely differs in some ways from the view of
the people behind CDM :)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-24  4:53           ` Jerome Glisse
@ 2017-02-27  1:56             ` Bob Liu
  2017-02-27  5:41               ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Bob Liu @ 2017-02-27  1:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Anshuman Khandual, Michal Hocko, Mel Gorman, linux-kernel,
	linux-mm, vbabka, minchan, aneesh.kumar, bsingharora, srikar,
	haren, dave.hansen, dan.j.williams@intel.com; jhubbard

On 2017/2/24 12:53, Jerome Glisse wrote:
> On Fri, Feb 24, 2017 at 09:06:19AM +0800, Bob Liu wrote:
>> On 2017/2/21 21:39, Anshuman Khandual wrote:
>>> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>>>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>>>> [...]
>>>>> * User space using mbind() to get CDM memory is an additional benefit
>>>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>>>   allocator. But the over all idea from the user space point of view
>>>>>   is that the application can allocate any generic buffer and try to
>>>>>   use the buffer either from the CPU side or from the device without
>>>>>   knowing about where the buffer is really mapped physically. That
>>>>>   gives a seamless and transparent view to the user space where CPU
>>>>>   compute and possible device based compute can work together. This
>>>>>   is not possible through a driver allocated buffer.
>>>>
>>>> But how are you going to define any policy around that. Who is allowed
>>>
>>> The user space VMA can define the policy with a mbind(MPOL_BIND) call
>>> with CDM/CDMs in the nodemask.
>>>
>>>> to allocate and how much of this "special memory". Is it possible that
>>>
>>> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
>>> the nodemask can allocate from the CDM memory. "How much" gets controlled
>>> by how we fault from CPU and the default behavior of the buddy allocator.
>>>
>>>> we will eventually need some access control mechanism? If yes then mbind
>>>
>>> No access control mechanism is needed. If an application wants to use
>>> CDM memory by specifying in the mbind() it can. Nothing prevents it
>>> from using the CDM memory.
>>>
>>>> is really not suitable interface to (ab)use. Also what should happen if
>>>> the mbind mentions only CDM memory and that is depleted?
>>>
>>> IIUC *only CDM* cannot be requested from user space as there are no user
>>> visible interface which can translate to __GFP_THISNODE. MPOL_BIND with
>>> CDM in the nodemask will eventually pick a FALLBACK zonelist which will
>>> have zones of the system including CDM ones. If the resultant CDM zones
>>> run out of memory, we fail the allocation request as usual.
>>>
>>>>
>>>> Could you also explain why the transparent view is really better than
>>>> using a device specific mmap (aka CDM awareness)?
>>>
>>> Okay with a transparent view, we can achieve a control flow of application
>>> like the following.
>>>
>>> (1) Allocate a buffer:		alloc_buffer(buf, size)
>>> (2) CPU compute on buffer:	cpu_compute(buf, size)
>>> (3) Device compute on buffer:	device_compute(buf, size)
>>> (4) CPU compute on buffer:	cpu_compute(buf, size)
>>> (5) Release the buffer:		release_buffer(buf, size)
>>>
>>> With assistance from a device specific driver, the actual page mapping of
>>> the buffer can change between system RAM and device memory depending on
>>> which side is accessing at a given point. This will be achieved through
>>> driver initiated migrations.
>>>
>>
>> Sorry, I'm a bit confused here.
>> What's the difference with the Heterogeneous memory management?
>> Which also "allows to use device memory transparently inside any process
>> without any modifications to process program code."
> 
> HMM is first and foremost for platform (like Intel) where CPU can not
> access device memory in cache coherent way or at all. CDM is for more
> advance platform with a system bus that allow the CPU to access device
> memory in cache coherent way.
> 
> Hence CDM was design to integrate more closely in existing concept like
> NUMA. From my point of view it is like another level in the memory
> hierarchy. Nowaday you have local node memory and other node memory.
> In not too distant future you will have fast CPU on die memory, local
> memory (you beloved DDR3/DDR4), slightly slower but gigantic persistant
> memory and also device memory (all those local to a node).
> 
> On top of that you will still have the regular NUMA hierarchy between
> nodes. But each node will have its own local hierarchy of memory.
> 
> CDM wants to integrate with existing memory hinting API and i believe
> this is needed to get some experience with how end user might want to
> use this to fine tune their application.
> 
> Some bit of HMM are generic and will be reuse by CDM, for instance the
> DMA capable memory migration helpers. Wether they can also share HMM
> approach of using ZONE_DEVICE is yet to be proven but it comes with
> limitations (can't be on lru or have device lru) that might hinder a
> closer integration of CDM memory with many aspect of kernel mm.
> 
> 
> This is my own view and it likely differ in some way from the view of
> the people behind CDM :)
> 

Got it, thank you for the kind explanation.
And also thank you, John.

Regards,
Bob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-27  1:56             ` Bob Liu
@ 2017-02-27  5:41               ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-02-27  5:41 UTC (permalink / raw)
  To: Bob Liu, Jerome Glisse
  Cc: Anshuman Khandual, Michal Hocko, Mel Gorman, linux-kernel,
	linux-mm, vbabka, minchan, aneesh.kumar, bsingharora, srikar,
	haren, dave.hansen, dan.j.williams@intel.com; jhubbard

On 02/27/2017 07:26 AM, Bob Liu wrote:
> On 2017/2/24 12:53, Jerome Glisse wrote:
>> On Fri, Feb 24, 2017 at 09:06:19AM +0800, Bob Liu wrote:
>>> On 2017/2/21 21:39, Anshuman Khandual wrote:
>>>> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>>>>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>>>>> [...]
>>>>>> * User space using mbind() to get CDM memory is an additional benefit
>>>>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>>>>   allocator. But the over all idea from the user space point of view
>>>>>>   is that the application can allocate any generic buffer and try to
>>>>>>   use the buffer either from the CPU side or from the device without
>>>>>>   knowing about where the buffer is really mapped physically. That
>>>>>>   gives a seamless and transparent view to the user space where CPU
>>>>>>   compute and possible device based compute can work together. This
>>>>>>   is not possible through a driver allocated buffer.
>>>>>
>>>>> But how are you going to define any policy around that. Who is allowed
>>>>
>>>> The user space VMA can define the policy with a mbind(MPOL_BIND) call
>>>> with CDM/CDMs in the nodemask.
>>>>
>>>>> to allocate and how much of this "special memory". Is it possible that
>>>>
>>>> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
>>>> the nodemask can allocate from the CDM memory. "How much" gets controlled
>>>> by how we fault from CPU and the default behavior of the buddy allocator.
>>>>
>>>>> we will eventually need some access control mechanism? If yes then mbind
>>>>
>>>> No access control mechanism is needed. If an application wants to use
>>>> CDM memory by specifying in the mbind() it can. Nothing prevents it
>>>> from using the CDM memory.
>>>>
>>>>> is really not suitable interface to (ab)use. Also what should happen if
>>>>> the mbind mentions only CDM memory and that is depleted?
>>>>
>>>> IIUC *only CDM* cannot be requested from user space as there are no user
>>>> visible interface which can translate to __GFP_THISNODE. MPOL_BIND with
>>>> CDM in the nodemask will eventually pick a FALLBACK zonelist which will
>>>> have zones of the system including CDM ones. If the resultant CDM zones
>>>> run out of memory, we fail the allocation request as usual.
>>>>
>>>>>
>>>>> Could you also explain why the transparent view is really better than
>>>>> using a device specific mmap (aka CDM awareness)?
>>>>
>>>> Okay with a transparent view, we can achieve a control flow of application
>>>> like the following.
>>>>
>>>> (1) Allocate a buffer:		alloc_buffer(buf, size)
>>>> (2) CPU compute on buffer:	cpu_compute(buf, size)
>>>> (3) Device compute on buffer:	device_compute(buf, size)
>>>> (4) CPU compute on buffer:	cpu_compute(buf, size)
>>>> (5) Release the buffer:		release_buffer(buf, size)
>>>>
>>>> With assistance from a device specific driver, the actual page mapping of
>>>> the buffer can change between system RAM and device memory depending on
>>>> which side is accessing at a given point. This will be achieved through
>>>> driver initiated migrations.
>>>>
>>>
>>> Sorry, I'm a bit confused here.
>>> What's the difference with the Heterogeneous memory management?
>>> Which also "allows to use device memory transparently inside any process
>>> without any modifications to process program code."
>>
>> HMM is first and foremost for platform (like Intel) where CPU can not
>> access device memory in cache coherent way or at all. CDM is for more
>> advance platform with a system bus that allow the CPU to access device
>> memory in cache coherent way.
>>
>> Hence CDM was design to integrate more closely in existing concept like
>> NUMA. From my point of view it is like another level in the memory
>> hierarchy. Nowaday you have local node memory and other node memory.
>> In not too distant future you will have fast CPU on die memory, local
>> memory (you beloved DDR3/DDR4), slightly slower but gigantic persistant
>> memory and also device memory (all those local to a node).
>>
>> On top of that you will still have the regular NUMA hierarchy between
>> nodes. But each node will have its own local hierarchy of memory.
>>
>> CDM wants to integrate with existing memory hinting API and i believe
>> this is needed to get some experience with how end user might want to
>> use this to fine tune their application.
>>
>> Some bit of HMM are generic and will be reuse by CDM, for instance the
>> DMA capable memory migration helpers. Wether they can also share HMM
>> approach of using ZONE_DEVICE is yet to be proven but it comes with
>> limitations (can't be on lru or have device lru) that might hinder a
>> closer integration of CDM memory with many aspect of kernel mm.
>>
>>
>> This is my own view and it likely differ in some way from the view of
>> the people behind CDM :)
>>
> 
> Got it, thank you for the kindly explanation.
> And also thank you, John.

Thanks Jerome and John for helping out with the detailed explanation.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-21  2:57       ` Balbir Singh
@ 2017-03-01  2:42         ` Balbir Singh
  2017-03-01  9:55           ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Balbir Singh @ 2017-03-01  2:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, vbabka,
	minchan, aneesh.kumar, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

>>> The idea of this patchset was to introduce
>>> the concept of memory that is not necessarily system memory, but is coherent
>>> in terms of visibility/access with some restrictions
>>>
>>
>> Which should be done without special casing the page allocator, cpusets and
>> special casing how cpusets are handled. It's not necessary for any other
>> mechanism used to restrict access to portions of memory such as cpusets,
>> mempolicies or even memblock reservations.
>
> Agreed, I mentioned a limitation that we see with cpusets. I do agree that
> we should reuse any infrastructure we have, but cpusets are more static
> in nature and inheritance compared to the requirements of CDM.
>

Mel, I went back and looked at cpusets and found some of the limitations
I mentioned earlier: isolating a particular node requires a fair amount
of laborious work, isolating all tasks away from the root cpuset and
then creating a hierarchy where the root cpuset is empty and everything
belongs to a child cpuset that has every node except the one we intend
to isolate. Even with hardwalling, this does not prevent allocations
from the parent cpuset.
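For reference, the "laborious work" described above corresponds roughly to
the following cgroup-v1 cpuset sequence. This is an illustrative sketch,
not part of any patch: paths and the node number are assumptions, and the
steps are printed rather than executed (the real thing needs root and a
mounted cpuset filesystem).

```shell
#!/bin/bash
# Rough sketch of isolating node 1 (the would-be CDM node) with cpusets.
# Steps are printed via run(); drop the 'echo' inside run() to perform
# them for real (as root, with the cgroup-v1 cpuset fs mounted).
CPUSET=/sys/fs/cgroup/cpuset

run() { echo "+ $*"; }

# 1. Create a child cpuset holding all CPUs but only the non-CDM nodes.
run mkdir $CPUSET/system
run "cat $CPUSET/cpuset.cpus > $CPUSET/system/cpuset.cpus"
run "echo 0 > $CPUSET/system/cpuset.mems"   # every node except node 1

# 2. Migrate every task out of the root cpuset into the child.
run "for pid in \$(cat $CPUSET/tasks); do echo \$pid > $CPUSET/system/tasks; done"

# 3. Hardwall the child -- which, as noted above, still does not stop
#    allocations made on behalf of tasks left in the parent cpuset.
run "echo 1 > $CPUSET/system/cpuset.mem_hardwall"
```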

I am trying to understand the concerns that you/Michal/Vlastimil have
so that Anshuman/I/other stakeholders can respond to the concerns
in one place, if that makes sense. Here are the concerns I have heard
so far:

1. Let's not add any overhead to the page allocator path
2. Let's try and keep the allocator changes easy to read/parse
3. Why do we need a NUMA interface?
4. How does this compare with HMM?
5. Why can't we use cpusets?

Would that be a fair set of concerns to address?

@Anshuman/@Srikar/@Aneesh anything else you'd like to add in terms
of concerns/issues? I think it will also make a good discussion thread
for those attending LSF/MM (I am not there) on this topic.

Balbir Singh.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-03-01  2:42         ` Balbir Singh
@ 2017-03-01  9:55           ` Mel Gorman
  2017-03-01 10:59             ` Balbir Singh
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2017-03-01  9:55 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, vbabka,
	minchan, aneesh.kumar, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On Wed, Mar 01, 2017 at 01:42:40PM +1100, Balbir Singh wrote:
> >>>The idea of this patchset was to introduce
> >>>the concept of memory that is not necessarily system memory, but is coherent
> >>>in terms of visibility/access with some restrictions
> >>>
> >>
> >>Which should be done without special casing the page allocator, cpusets and
> >>special casing how cpusets are handled. It's not necessary for any other
> >>mechanism used to restrict access to portions of memory such as cpusets,
> >>mempolicies or even memblock reservations.
> >
> >Agreed, I mentioned a limitation that we see with cpusets. I do agree that
> >we should reuse any infrastructure we have, but cpusets are more static
> >in nature and inheritance compared to the requirements of CDM.
> >
> 
> Mel, I went back and looked at cpusets and found some of the limitations
> I mentioned earlier: isolating a particular node requires a fair amount
> of laborious work, isolating all tasks away from the root cpuset and
> then creating a hierarchy where the root cpuset is empty and everything
> belongs to a child cpuset that has every node except the one we intend
> to isolate. Even with hardwalling, this does not prevent allocations
> from the parent cpuset.
> 

That it is difficult does not in itself justify adding a third mechanism
specific to one type of device for controlling access to memory.

> I am trying to understand the concerns that you/Michal/Vlastimil have
> so that Anshuman/I/other stakeholders can respond to the concerns
> in one place if that makes sense. Here are the concerns I have heard
> so far
> 
> 1. Let's not add any overhead to the page allocator path

Yes and that includes both runtime overhead and maintenance overhead.
Littering the allocator paths with special casing with runtime overhead
masked by static branches would still be a maintenance burden given that
most people will not have the hardware necessary to avoid regressions.

> 2. Let's try and keep the allocator changes easy to read/parse

No, simply do not add a new mechanism for controlling access to memory
when cpusets and memory policies already exist.

> 3. Why do we need a NUMA interface?
> 4. How does this compare with HMM?

Others discussed this topic in detail.

> 5. Why can't we use cpusets?
> 

That is your assertion. The concerns you have are that the work is
laborious and that designing the administrative interfaces may be
difficult. In itself that does not justify adding a third mechanism for
controlling memory access.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-03-01  9:55           ` Mel Gorman
@ 2017-03-01 10:59             ` Balbir Singh
  2017-03-08  9:04               ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Balbir Singh @ 2017-03-01 10:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Anshuman Khandual, linux-kernel, linux-mm, Michal Hocko,
	Vlastimil Babka, Minchan Kim, Aneesh Kumar KV, Srikar Dronamraju,
	haren, Jérôme Glisse, Dave Hansen, Dan Williams

On Wed, Mar 1, 2017 at 8:55 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Mar 01, 2017 at 01:42:40PM +1100, Balbir Singh wrote:
>> >>>The idea of this patchset was to introduce
>> >>>the concept of memory that is not necessarily system memory, but is coherent
>> >>>in terms of visibility/access with some restrictions
>> >>>
>> >>
>> >>Which should be done without special casing the page allocator, cpusets and
>> >>special casing how cpusets are handled. It's not necessary for any other
>> >>mechanism used to restrict access to portions of memory such as cpusets,
>> >>mempolicies or even memblock reservations.
>> >
>> >Agreed, I mentioned a limitation that we see with cpusets. I do agree that
>> >we should reuse any infrastructure we have, but cpusets are more static
>> >in nature and inheritance compared to the requirements of CDM.
>> >
>>
>> Mel, I went back and looked at cpusets and found some of the limitations
>> I mentioned earlier: isolating a particular node requires a fair amount
>> of laborious work, isolating all tasks away from the root cpuset and
>> then creating a hierarchy where the root cpuset is empty and everything
>> belongs to a child cpuset that has every node except the one we intend
>> to isolate. Even with hardwalling, this does not prevent allocations
>> from the parent cpuset.
>>
>
> That it is difficult does not in itself justify adding a third mechanism
> specific to one type of device for controlling access to memory.
>

Not only is it difficult, but there are several tasks that refuse to
change cpusets once created. I also noticed that the isolation may
begin a little too late; some allocations may end up on the node to
be isolated.

I also want to eventually control whether auto-NUMA
balancing/kswapd/reclaim etc. run on this node (something that cpusets
do not provide). The reasons for these decisions depend heavily on the
properties of the node. The isolation mechanism that exists today is
insufficient. Moreover the correct abstraction for device memory would
be a class similar to N_MEMORY, but limited in what we include (which
is why I was asking whether questions 3 and 4 are clear). You might
argue these are not NUMA nodes then, but in the general sense they are
NUMA nodes (with non-uniform properties and access times). NUMA, with
the right hardware, lets us expose the right programming model. Please
consider reading the full details at

https://patchwork.kernel.org/patch/9566393/
https://lkml.org/lkml/2016/11/22/339

cpusets are designed for slab/kernel allocations and user memory
policies; we need more control for things like the ones I mentioned
above. We would definitely work with the existing framework so that we
don't duplicate code or add more complexity.

>> I am trying to understand the concerns that you/Michal/Vlastimil have
>> so that Anshuman/I/other stakeholders can respond to the concerns
>> in one place if that makes sense. Here are the concerns I have heard
>> so far
>>
>> 1. Let's not add any overhead to the page allocator path
>
> Yes and that includes both runtime overhead and maintenance overhead.
> Littering the allocator paths with special casing with runtime overhead
> masked by static branches would still be a maintenance burden given that
> most people will not have the hardware necessary to avoid regressions.
>

I understand that; we'll try and keep the allocator changes to zero
and where possible reuse cpusets. After all my experiments, cpusets
come up short on some counts: cpusets use nodes, but do not represent
node characteristics or work the way N_MEMORY works.

>> 2. Let's try and keep the allocator changes easy to read/parse
>
> No, simply do not add a new mechanism for controlling access to memory
> when cpusets and memory policies already exist.
>

Same as above

>> 3. Why do we need a NUMA interface?
>> 4. How does this compare with HMM?
>
> Others discussed this topic in detail.
>
>> 5. Why can't we use cpusets?
>>
>
> That is your assertion. The concerns you have are that the work is
> laborious and that designing the administrative interfaces may be
> difficult. In itself that does not justify adding a third mechanism for
> controlling memory access.

Laborious and insufficient as stated above unless we make changes to
the way cpusets work.

Thanks for the feedback,
Balbir Singh.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-23  6:52           ` Anshuman Khandual
@ 2017-03-05 12:39             ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-03-05 12:39 UTC (permalink / raw)
  To: Anshuman Khandual, Michal Hocko
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On 02/23/2017 12:22 PM, Anshuman Khandual wrote:
> On 02/22/2017 03:20 PM, Michal Hocko wrote:
>> On Tue 21-02-17 19:09:18, Anshuman Khandual wrote:
>>> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>>>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>>>> [...]
>>>>> * User space using mbind() to get CDM memory is an additional benefit
>>>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>>>   allocator. But the over all idea from the user space point of view
>>>>>   is that the application can allocate any generic buffer and try to
>>>>>   use the buffer either from the CPU side or from the device without
>>>>>   knowing about where the buffer is really mapped physically. That
>>>>>   gives a seamless and transparent view to the user space where CPU
>>>>>   compute and possible device based compute can work together. This
>>>>>   is not possible through a driver allocated buffer.
>>>>
>>>> But how are you going to define any policy around that. Who is allowed
>>>
>>> The user space VMA can define the policy with a mbind(MPOL_BIND) call
>>> with CDM/CDMs in the nodemask.
>>>
>>>> to allocate and how much of this "special memory". Is it possible that
>>>
>>> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
>>> the nodemask can allocate from the CDM memory. "How much" gets controlled
>>> by how we fault from CPU and the default behavior of the buddy allocator.
>>
>> In other words the policy is implemented by the kernel. Why is this a
>> good thing?
> 
> It's controlled by the kernel only during the page fault paths of either
> the CPU or the device. But the device driver will actually do the
> placements afterwards, taking into consideration access patterns and
> relative performance. We don't want the driver to be involved in page
> fault path memory allocations, which should naturally go through the
> buddy allocator.
> 
>>
>>>> we will eventually need some access control mechanism? If yes then mbind
>>>
>>> No access control mechanism is needed. If an application wants to use
>>> CDM memory by specifying in the mbind() it can. Nothing prevents it
>>> from using the CDM memory.
>>
>> What if we find out that an access control _is_ really needed? I can
>> easily imagine that some devices will come up with really fast and expensive
>> memory. You do not want some random user to steal it from you when you
>> want to use it for your workload.
> 
> Hmm, it makes sense but I think it's not something we have to deal with
> right away. Later we may have to think about some generic access control
> mechanism for mbind() and then accommodate CDM with it.
> 
>>
>>>> is really not suitable interface to (ab)use. Also what should happen if
>>>> the mbind mentions only CDM memory and that is depleted?
>>>
>>> IIUC *only CDM* cannot be requested from user space as there are no user
>>> visible interface which can translate to __GFP_THISNODE.
>>
>> I do not understand what __GFP_THISNODE has to do with this. This is an
>> internal flag.
> 
> Right. My bad. I was just referring to the fact that there is nothing in
> user space which can make the buddy allocator pick the NOFALLBACK list
> instead of the FALLBACK list.
> 
>>
>>> MPOL_BIND with
>>> CDM in the nodemask will eventually pick a FALLBACK zonelist which will
>>> have zones of the system including CDM ones. If the resultant CDM zones
>>> run out of memory, we fail the allocation request as usual.
>>
>> OK, so let's say you mbind to a single node which is CDM. You seem to be
>> saying that we will simply break the NUMA affinity in this special case?
> 
> Why ? It should simply follow what happens when we pick a single NUMA node
> in previous situations.
> 
>> Currently we invoke the OOM killer if nodes which the application binds
>> to are depleted and cannot be reclaimed.
> 
> Right, the same should happen here for CDM as well.
> 
>>  
>>>> Could you also explain why the transparent view is really better than
>>>> using a device specific mmap (aka CDM awareness)?
>>>
>>> Okay, with a transparent view we can achieve an application control flow
>>> like the following.
>>>
>>> (1) Allocate a buffer:		alloc_buffer(buf, size)
>>> (2) CPU compute on buffer:	cpu_compute(buf, size)
>>> (3) Device compute on buffer:	device_compute(buf, size)
>>> (4) CPU compute on buffer:	cpu_compute(buf, size)
>>> (5) Release the buffer:		release_buffer(buf, size)
>>>
>>> With assistance from a device-specific driver, the actual page mapping of
>>> the buffer can change between system RAM and device memory depending on
>>> which side is accessing it at a given point. This will be achieved through
>>> driver-initiated migrations.
>>
>> But then you do not need any NUMA affinity, right? The driver can do
>> all this automagically. How does the NUMA policy come into play in
>> your above example. Sorry for being dense, I might be really missing
>> something important here, but I really fail to see why the NUMA is the
>> proper interface here.
> 
> You are right. The driver can migrate any mapping in userspace to
> anywhere on the system as long as cpusets do not prohibit it. But we
> still want the driver to conform to the applicable VMA memory policy
> set from userspace. Hence a VMA policy needs to be set from user space.
> A NUMA VMA memory policy also restricts allocations to the applicable
> nodemask during page fault paths (CPU and device) as well.

Hello Michal,

Does that answer your question?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-23 15:57         ` Mel Gorman
@ 2017-03-06  5:12           ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-03-06  5:12 UTC (permalink / raw)
  To: Mel Gorman, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On 02/23/2017 09:27 PM, Mel Gorman wrote:
> On Tue, Feb 21, 2017 at 06:39:17PM +0530, Anshuman Khandual wrote:
>>>>> In itself, the series does very little and as Vlastimil already pointed
>>>>> out, it's not a good idea to try merge piecemeal when people could not
>>>>> agree on the big picture (I didn't dig into it).
>>>>
>>>> With the proposed kernel changes and an associated driver it is complete
>>>> enough to drive user space based CPU/device hybrid compute interchangeably
>>>> on an mmap() allocated memory buffer transparently and effectively.
>>>
>>> How is the device informed at that data is available for processing?
>>
>> It will be through a call to the driver from user space, which can take
>> the required buffer address as an argument.
>>
> 
> Which goes back to tending towards what HMM intended, but Jerome has covered
> all relevant points there so I won't repeat any of them in this response. It
> did sound like HMM was in part not used because it was missing some
> small steps which could have been included, instead of proposing something
> different that did not meet their requirements but requires special casing.

Hmm. As I have mentioned in the HMM response thread from Jerome, at this
point we really don't know the amount of work required to get the missing
pieces of support, like file mapping on all file systems, LRU support etc.,
done with HMM before a ZONE_DEVICE based CDM implementation can be
supported. But let's discuss these things on the other thread that Balbir
started.

> 
>>> What prevents and application modifying the data on the device while it's
>>> being processed?
>>
>> Nothing in software. The application should take care of that, but access
>> from both sides is coherent. It should wait for the device till it
>> finishes the compute it had asked for earlier, to prevent overrides and
>> eventual corruption.
>>
> 
> Which adds the caveat that applications must be fully CDM aware so if
> there are additional calls related to policies or administrative tasks
> for cpusets then it follows the application can also be aware of them.

Yeah, got your point.

> 
>>> Why can this not be expressed with cpusets and memory policies
>>> controlled by a combination of administrative steps for a privileged
>>> application and an application that is CDM aware?
>>
>> Hmm, that can be done but having an in-kernel infrastructure has the
>> following benefits.
>>
>> * The administrator does not have to listen for node add notifications
>>   and keep the isolation/allowed cpusets up to date all the time.
>>   This can be a significant overhead on the admin/userspace on systems
>>   which have a number of separate device memory nodes.
>>
> 
> Could be handled with udev triggers potentially or if udev events are not
> raised by the memory hot-add then it could still be polled.

Okay. 
> 
>> * With the cpuset solution, tasks which are part of a CDM-allowed
>>   cpuset can have all their VMAs allocate from CDM memory, which may
>>   not be something the user wants. For example the user may not want
>>   the text segments and libraries allocated from CDM. To achieve this
>>   the user will have to explicitly block allocation access from CDM
>>   through mbind(MPOL_BIND) memory policy setups. This negative setup
>>   is a big overhead. But with the in-kernel CDM framework, isolation
>>   is enabled by default. For CDM allocations the application just has
>>   to set up a memory policy with the CDM node in the allowed nodemask.
>>
> 
> Then distinguish between task-wide policies that forbid CDM nodes and
> per-VMA policies that allow the CDM nodes. Migration between system

So the application first calls set_mempolicy() to deny itself the CDM
nodes (though the cpuset allows them), and then calls mbind() with CDM
nodes for the buffer which it wants backed by CDM memory? Is that what
you imply?

> memory and devices remains a separate problem but migration would also
> not be covered by special casing the allocator.

If the CDM node is allowed for a VMA along with other system RAM nodes
then the driver can migrate between them whenever it wants. What would
be the problem?

> 
>> Even with the cpuset solution, applications still need to know which
>> nodes are CDM on the system at a given point in time. So we will have
>> to store that in a nodemask and export it on sysfs somehow.
>>
> 
> Which in itself is not too bad and doesn't require special casing the
> allocator.

Right, the first patch in the series would do.

> 
>>>
>>>> I had also
>>>> mentioned these points on the last posting in response to a comment from
>>>> Vlastimil.
>>>>
>>>> From this response (https://lkml.org/lkml/2017/2/14/50).
>>>>
>>>> * User space using mbind() to get CDM memory is an additional benefit
>>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>>   allocator. But the over all idea from the user space point of view
>>>>   is that the application can allocate any generic buffer and try to
>>>>   use the buffer either from the CPU side or from the device without
>>>>   knowing about where the buffer is really mapped physically. That
>>>>   gives a seamless and transparent view to the user space where CPU
>>>>   compute and possible device based compute can work together. This
>>>>   is not possible through a driver allocated buffer.
>>>>
>>>
>>> Which can also be done with cpusets that prevents use of CDM memory and
>>> place all non-CDM processes into that cpuset with a separate cpuset for
>>> CDM-aware applications that allow access to CDM memory.
>>
>> Right, but with additional overheads as explained above.
>>
> 
> The application must already be aware of the CDM nodes.

Absolutely.

> 
>>>> * If any application is not using CDM memory for a long time placed on
>>>>   its buffer and another application is forced to fall back on system
>>>>   RAM when it really wanted CDM, the driver can detect these kinds
>>>>   of situations through memory access patterns on the device HW and
>>>>   take the necessary migration decisions.
>>>>
>>>> I hope this explains the rationale of the framework. In fact these
>>>> four patches give a logically complete CPU/device operating framework.
>>>> Other parts of the bigger picture are VMA management, KSM, auto NUMA
>>>> etc. which are improvements on top of this basic framework.
>>>>
>>>
>>> Automatic NUMA balancing is a particular oddity as that is about
>>> CPU->RAM locality and not RAM->device considerations.
>>
>> Right. But when there are migrations happening between system RAM and
>> device memory, auto NUMA with its CPU fault information can migrate
>> between system RAM nodes, which might not be necessary and can lead to
>> conflicts or overhead. Hence auto NUMA needs to be switched off at
>> times for the VMAs of concern, but that is not addressed in this patch
>> series. As mentioned before, it will be in the follow-up work as
>> improvements on this series.
> 
> Ensure the policy settings for CDM-backed VMAs do not set MPOL_F_MOF and
> automatic NUMA balancing will skip them. It does not require special casing
> of the allocator or specific CDM-awareness.

Though changes to automatic NUMA balancing were not proposed in this
series, we will take this into account and look at whether these
internal policy flags of the VMAs (which cannot be set from user space)
can be set from the driver to make sure that the VMAs don't participate
in auto NUMA balancing.

> 
>>> The memblock is to only avoid bootmem allocations from that area. It can
>>> be managed in the arch layer to first pass in all the system RAM,
>>> tear down the bootmem allocator, set up the nodelists, set the system
>>> nodemask, init CDM, init the allocator for that, and then optionally add
>>> it to the system CDM for userspace to do the isolation or provide.
>>>
>>> For that matter, the driver could do the discovery and then fake a
>>> memory hot-add.
>>
>> Not sure I got this correctly. Could you please explain more.
>>
> 
> Discover the device, and online the memory later as memory hotplug generally
> does. If the faked memory hot-add operation raised an event that udev
> can detect then the administrative functions could also be triggered
> in userspace.

Yes, that is possible. On POWER though we don't support hot add of a
node; the node can be detected during boot as a memory-less node and
then later on memory can be hot plugged into it, which can be detected
as a udev event as you have mentioned. As Balbir has started another
sub-thread discussing the merits and disadvantages of the various
approaches, we will discuss them on that thread.
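If the hot-plug route were taken, the userspace trigger could be as
simple as a udev rule along these lines. The rule text is a sketch and
the helper script name is made up for illustration; it is not part of
the patch set.

```
# /etc/udev/rules.d/99-cdm-memory.rules  (illustrative)
# Run an admin helper whenever a memory block comes online, so the
# cpuset/mempolicy bookkeeping can be refreshed without polling.
SUBSYSTEM=="memory", ACTION=="online", RUN+="/usr/local/sbin/cdm-node-setup.sh"
```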

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-02-22 16:54             ` Michal Hocko
@ 2017-03-06  5:48               ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-03-06  5:48 UTC (permalink / raw)
  To: Michal Hocko, Jerome Glisse
  Cc: Anshuman Khandual, Mel Gorman, linux-kernel, linux-mm, vbabka,
	minchan, aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On 02/22/2017 10:24 PM, Michal Hocko wrote:
> On Wed 22-02-17 09:59:15, Jerome Glisse wrote:
>> On Wed, Feb 22, 2017 at 10:29:21AM +0100, Michal Hocko wrote:
>>> On Tue 21-02-17 18:39:17, Anshuman Khandual wrote:
>>>> On 02/17/2017 07:02 PM, Mel Gorman wrote:
>>
>> [...]
>>
>>> [...]
>>>> These are the reasons which prohibit the use of HMM for coherent
>>>> addressable device memory purpose.
>>>>
>>> [...]
>>>> (3) Application cannot directly allocate into device memory from user
>>>> space using existing memory related system calls like mmap() and mbind()
>>>> as the device memory hides away in ZONE_DEVICE.
>>>
>>> Why cannot the application simply use mmap on the device file?
>>
>> This has been said before but we want to share the address space; this does
>> imply that you cannot rely on a special allocator. For instance you can
>> have an application that uses a library, and the library uses the GPU but
>> the application is unaware; thus any data provided by the application
>> to the library will come from generic malloc (mmap anonymous or from a
>> regular file).
>>
>> Currently what happens is that the library reallocates memory through a
>> special allocator and copies things over. Not only does this waste memory
>> (the new memory is often regular memory too) but you also have to pay the
>> cost of copying GB of data.
>>
>> Last bullet to this: with complex data structures (list, tree, ...), having
>> to go through a special allocator means you have to re-build the whole
>> structure with the duplicated memory.
>>
>>
>> Allowing direct use of memory allocated from malloc (mmap anonymous
>> private or from a regular file) avoids the copy operation and the complex
>> duplication of data structures. Moving the dataset to the GPU is then a
>> simple memory migration from the kernel's point of view.
>>
>> This is why a shared address space without a special allocator is mandatory
>> in new or future standards such as OpenCL, Cuda, C++, OpenMP, ... Some other
>> OSes already have this and the industry wants it. So the question is: do we
>> want to support any of this, do we care about GPGPU?
>>
>>
>> I believe we want to support all these new standards but maybe I am the
>> only one.
>>
>> In the HMM case I have the extra painful fact that the device memory is
>> not accessible by the CPU. For CDM, on the contrary, the CPU can access the
>> device memory in a cache coherent way and all operations behave as on
>> regular memory (things like atomic operations for instance).
>>
>>
>> I hope this clearly explains why we can no longer rely on a dedicated/
>> specialized memory allocator.
> 
> Yes this clarifies this point. Thanks for the information which would be
> really helpful in the initial description. Maybe I've just missed it,
> though.

Sure, will add this into the patch description.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 0/4] Define coherent device memory node
  2017-03-01 10:59             ` Balbir Singh
@ 2017-03-08  9:04               ` Anshuman Khandual
  2017-03-08  9:21                 ` [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
  2017-03-08  9:21                 ` [PATCH 2/2] mm: Change mbind(MPOL_BIND) implementation for CDM nodes Anshuman Khandual
  0 siblings, 2 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-03-08  9:04 UTC (permalink / raw)
  To: Balbir Singh, Mel Gorman
  Cc: Anshuman Khandual, linux-kernel, linux-mm, Michal Hocko,
	Vlastimil Babka, Minchan Kim, Aneesh Kumar KV, Srikar Dronamraju,
	haren, Jérôme Glisse, Dave Hansen, Dan Williams

On 03/01/2017 04:29 PM, Balbir Singh wrote:
> On Wed, Mar 1, 2017 at 8:55 PM, Mel Gorman <mgorman@suse.de> wrote:
>> On Wed, Mar 01, 2017 at 01:42:40PM +1100, Balbir Singh wrote:
>>>>>> The idea of this patchset was to introduce
>>>>>> the concept of memory that is not necessarily system memory, but is coherent
>>>>>> in terms of visibility/access with some restrictions
>>>>>>
>>>>> Which should be done without special casing the page allocator, cpusets and
>>>>> special casing how cpusets are handled. It's not necessary for any other
>>>>> mechanism used to restrict access to portions of memory such as cpusets,
>>>>> mempolicies or even memblock reservations.
>>>> Agreed, I mentioned a limitation that we see with cpusets. I do agree that
>>>> we should reuse any infrastructure we have, but cpusets are more static
>>>> in nature and inheritance compared to the requirements of CDM.
>>>>
>>> Mel, I went back and looked at cpusets and found some of the limitations
>>> I mentioned earlier: isolating a particular node requires a fair amount
>>> of laborious work, isolating all tasks away from the root cpuset and
>>> then creating a hierarchy where the root cpuset is empty and everything
>>> belongs to a child cpuset that has every node except the one we intend
>>> to isolate. Even with hardwalling, this does not prevent allocations
>>> from the parent cpuset.
>>>
>> That it is difficult does not in itself justify adding a third mechanism
>> specific to one type of device for controlling access to memory.
>>
> Not only is it difficult, but there are several tasks that refuse to
> change cpusets once created. I also noticed that the isolation may
> begin a little too late; some allocations may end up on the node to
> be isolated.
> 
> I also want to eventually control whether auto-NUMA
> balancing/kswapd/reclaim etc. run on this node (something that cpusets
> do not provide). The reasons for these decisions depend heavily on the
> properties of the node. The isolation mechanism that exists today is
> insufficient. Moreover the correct abstraction for device memory would
> be a class similar to N_MEMORY, but limited in what we include (which
> is why I was asking whether questions 3 and 4 are clear). You might
> argue these are not NUMA nodes then, but in the general sense they are
> NUMA nodes (with non-uniform properties and access times). NUMA, with
> the right hardware, lets us expose the right programming model. Please
> consider reading the full details at
> 
> https://patchwork.kernel.org/patch/9566393/
> https://lkml.org/lkml/2016/11/22/339

As explained by Balbir, right now the cpuset mechanism gives only
isolation and is insufficient for creating the other properties required
for a full-fledged CDM representation. A NUMA representation is the
closest match for CDM memory, which has non-uniform attributes rather
than distance as its only differentiating property. Once it is
represented as a NUMA node in the kernel, we can achieve the isolation
requirement either through buddy allocator changes as proposed in this
series or through some alternative approaches. As I had mentioned in
the last RFC, there is another way to achieve isolation, through
zonelist rebuild process changes and mbind() implementation changes.
Please find the two relevant commits here.

https://github.com/akhandual/linux/commit/da1093599db29c31d12422a34d4e0cbf4683618f
https://github.com/akhandual/linux/commit/faadab4e9dc9685ab7a564a84d4a06bde8fc79d8

Will post these commits on this thread for further discussion. Do let
me know your views and suggestions on this approach.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process
  2017-03-08  9:04               ` Anshuman Khandual
@ 2017-03-08  9:21                 ` Anshuman Khandual
  2017-03-08 11:07                   ` John Hubbard
  2017-03-08  9:21                 ` [PATCH 2/2] mm: Change mbind(MPOL_BIND) implementation for CDM nodes Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2017-03-08  9:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams, zi.yan

Kernel allocations from a CDM node are already prevented by placing its
entire memory in ZONE_MOVABLE. But CDM nodes must also be isolated from
implicit allocations happening elsewhere on the system.

Any isolation seeking CDM node requires isolation from implicit memory
allocations from user space, while at the same time there should also be
an explicit way to allocate its memory.

A node's two zonelists are fundamental to where memory comes from when
there is an allocation request. To achieve the two objectives stated
above, the zonelist building process has to change, as both zonelists
(i.e. FALLBACK and NOFALLBACK) give access to the node's memory zones
during any kind of memory allocation. The following changes are
implemented in this regard.

* CDM node's zones are not part of any other node's FALLBACK zonelist
* CDM node's FALLBACK list contains its own memory zones followed by
  all system RAM zones in regular order as before
* CDM node's zones are part of its own NOFALLBACK zonelist

These changes ensure the following, which in turn isolates the CDM
nodes as desired.

* There won't be any implicit memory allocation ending up in a CDM node
* Only __GFP_THISNODE marked allocations will come from a CDM node
* CDM node memory can be allocated through the mbind(MPOL_BIND) interface
* System RAM will be used as a fallback option, in regular order, in
  case the CDM memory is insufficient during a targeted allocation request

Sample zonelist configuration:

[NODE (0)]						RAM
        ZONELIST_FALLBACK (0xc00000000140da00)
                (0) (node 0) (DMA     0xc00000000140c000)
                (1) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc000000001411a10)
                (0) (node 0) (DMA     0xc00000000140c000)
[NODE (1)]						RAM
        ZONELIST_FALLBACK (0xc000000100001a00)
                (0) (node 1) (DMA     0xc000000100000000)
                (1) (node 0) (DMA     0xc00000000140c000)
        ZONELIST_NOFALLBACK (0xc000000100005a10)
                (0) (node 1) (DMA     0xc000000100000000)
[NODE (2)]						CDM
        ZONELIST_FALLBACK (0xc000000001427700)
                (0) (node 2) (Movable 0xc000000001427080)
                (1) (node 0) (DMA     0xc00000000140c000)
                (2) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc00000000142b710)
                (0) (node 2) (Movable 0xc000000001427080)
[NODE (3)]						CDM
        ZONELIST_FALLBACK (0xc000000001431400)
                (0) (node 3) (Movable 0xc000000001430d80)
                (1) (node 0) (DMA     0xc00000000140c000)
                (2) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc000000001435410)
                (0) (node 3) (Movable 0xc000000001430d80)
[NODE (4)]						CDM
        ZONELIST_FALLBACK (0xc00000000143b100)
                (0) (node 4) (Movable 0xc00000000143aa80)
                (1) (node 0) (DMA     0xc00000000140c000)
                (2) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc00000000143f110)
                (0) (node 4) (Movable 0xc00000000143aa80)

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40908de..6f7dddc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4825,6 +4825,16 @@ static void build_zonelists(pg_data_t *pgdat)
 	i = 0;
 
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+#ifdef CONFIG_COHERENT_DEVICE
+		/*
+		 * A CDM node's own zones should not be part of any other
+		 * node's fallback zonelist, but only its own fallback
+		 * zonelist.
+		 */
+		if (is_cdm_node(node) && (pgdat->node_id != node))
+			continue;
+#endif
+
 		/*
 		 * We don't want to pressure a particular node.
 		 * So adding penalty to the first node in same
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/2] mm: Change mbind(MPOL_BIND) implementation for CDM nodes
  2017-03-08  9:04               ` Anshuman Khandual
  2017-03-08  9:21                 ` [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
@ 2017-03-08  9:21                 ` Anshuman Khandual
  1 sibling, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2017-03-08  9:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams, zi.yan

CDM nodes need a mechanism for explicit memory allocation from user
space. After the preceding FALLBACK zonelist rebuilding changes, an
mbind(MPOL_BIND) based allocation request targeting a CDM node fails.
This is because the requesting local node's FALLBACK zonelist is selected
for the nodemask processing that implements MPOL_BIND. As the CDM node's
zones are not part of any regular node's FALLBACK zonelist, the
allocation simply fails without finding a valid zone. The allocation
requesting node is always going to be different from the CDM node, which
does not have any CPU. Hence the MPOL_BIND implementation must choose the
given CDM node's FALLBACK zonelist instead of the requesting local node's
FALLBACK zonelist. This patch implements that change.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/mempolicy.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1e7873e..6089c711 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1692,6 +1692,27 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+#ifdef CONFIG_COHERENT_DEVICE
+	/*
+	 * Coherent Device Memory (CDM)
+	 *
+	 * In case the local requesting node is not part of the nodemask, test
+	 * whether the first node in the nodemask is CDM, in which case select it.
+	 *
+	 * XXX: There are multiple ways of doing this. The node check can be
+	 * restricted to the first node in the nodemask (as implemented here),
+	 * the entire nodemask can be scanned for any present CDM node, or the
+	 * first CDM node can be selected only if all other nodes in the
+	 * nodemask are CDM. Of these various possible approaches, the first
+	 * one is implemented here.
+	 */
+	if (policy->mode == MPOL_BIND) {
+		if (unlikely(!node_isset(nd, policy->v.nodes))) {
+			if (is_cdm_node(first_node(policy->v.nodes)))
+				nd = first_node(policy->v.nodes);
+		}
+	}
+#endif
 	return node_zonelist(nd, gfp);
 }
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process
  2017-03-08  9:21                 ` [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
@ 2017-03-08 11:07                   ` John Hubbard
  2017-03-14 13:33                     ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: John Hubbard @ 2017-03-08 11:07 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams, zi.yan

On 03/08/2017 01:21 AM, Anshuman Khandual wrote:
> Kernel allocation to CDM node has already been prevented by putting it's
> entire memory in ZONE_MOVABLE. But the CDM nodes must also be isolated
> from implicit allocations happening on the system.
>
> Any isolation seeking CDM node requires isolation from implicit memory
> allocations from user space but at the same time there should also have
> an explicit way to do the memory allocation.
>
> Platform node's both zonelists are fundamental to where the memory comes
> from when there is an allocation request. In order to achieve these two
> objectives as stated above, zonelists building process has to change as
> both zonelists (i.e FALLBACK and NOFALLBACK) gives access to the node's
> memory zones during any kind of memory allocation. The following changes
> are implemented in this regard.
>
> * CDM node's zones are not part of any other node's FALLBACK zonelist
> * CDM node's FALLBACK list contains it's own memory zones followed by
>   all system RAM zones in regular order as before

There was a discussion, on an earlier version of this patchset, in which someone 
pointed out that a slight over-allocation on a device that has much more memory than 
the CPU does could use up system memory. Your latest approach here does not address 
this.

I'm thinking that, until oversubscription between NUMA nodes is more fully 
implemented in a way that can be properly controlled, you'd probably be better off 
just not falling back to system memory. In other words, a CDM node really is 
*isolated* from other nodes--no automatic use in either direction.

Also, naming and purpose: maybe this is a "Limited NUMA Node", rather than a 
Coherent Device Memory node. Because: the real point of this thing is to limit the 
normal operation of NUMA, just enough to work with what I am *told* is 
memory-that-is-too-fragile-for-kernel-use (I remain somewhat on the fence, there, 
even though you did talk me into it earlier, heh).

On process: it would probably help if you gathered up previous discussion points and 
carefully, concisely addressed each one somewhere (maybe in a cover letter). 
Because otherwise, it's too easy for earlier, important problems to be forgotten. 
And reviewers don't want to have to repeat themselves, of course.

thanks
John Hubbard
NVIDIA

> * CDM node's zones are part of it's own NOFALLBACK zonelist
>
> These above changes ensure the following which in turn isolates the CDM
> nodes as desired.
>
> * There wont be any implicit memory allocation ending up in the CDM node
> * Only __GFP_THISNODE marked allocations will come from the CDM node
> * CDM node memory can be allocated through mbind(MPOL_BIND) interface
> * System RAM memory will be used as fallback option in regular order in
>   case the CDM memory is insufficient during targted allocation request
>
> Sample zonelist configuration:
>
> [NODE (0)]						RAM
>         ZONELIST_FALLBACK (0xc00000000140da00)
>                 (0) (node 0) (DMA     0xc00000000140c000)
>                 (1) (node 1) (DMA     0xc000000100000000)
>         ZONELIST_NOFALLBACK (0xc000000001411a10)
>                 (0) (node 0) (DMA     0xc00000000140c000)
> [NODE (1)]						RAM
>         ZONELIST_FALLBACK (0xc000000100001a00)
>                 (0) (node 1) (DMA     0xc000000100000000)
>                 (1) (node 0) (DMA     0xc00000000140c000)
>         ZONELIST_NOFALLBACK (0xc000000100005a10)
>                 (0) (node 1) (DMA     0xc000000100000000)
> [NODE (2)]						CDM
>         ZONELIST_FALLBACK (0xc000000001427700)
>                 (0) (node 2) (Movable 0xc000000001427080)
>                 (1) (node 0) (DMA     0xc00000000140c000)
>                 (2) (node 1) (DMA     0xc000000100000000)
>         ZONELIST_NOFALLBACK (0xc00000000142b710)
>                 (0) (node 2) (Movable 0xc000000001427080)
> [NODE (3)]						CDM
>         ZONELIST_FALLBACK (0xc000000001431400)
>                 (0) (node 3) (Movable 0xc000000001430d80)
>                 (1) (node 0) (DMA     0xc00000000140c000)
>                 (2) (node 1) (DMA     0xc000000100000000)
>         ZONELIST_NOFALLBACK (0xc000000001435410)
>                 (0) (node 3) (Movable 0xc000000001430d80)
> [NODE (4)]						CDM
>         ZONELIST_FALLBACK (0xc00000000143b100)
>                 (0) (node 4) (Movable 0xc00000000143aa80)
>                 (1) (node 0) (DMA     0xc00000000140c000)
>                 (2) (node 1) (DMA     0xc000000100000000)
>         ZONELIST_NOFALLBACK (0xc00000000143f110)
>                 (0) (node 4) (Movable 0xc00000000143aa80)
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 40908de..6f7dddc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4825,6 +4825,16 @@ static void build_zonelists(pg_data_t *pgdat)
>  	i = 0;
>
>  	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
> +#ifdef CONFIG_COHERENT_DEVICE
> +		/*
> +		 * CDM node's own zones should not be part of any other
> +		 * node's fallback zonelist but only it's own fallback
> +		 * zonelist.
> +		 */
> +		if (is_cdm_node(node) && (pgdat->node_id != node))
> +			continue;
> +#endif
> +
>  		/*
>  		 * We don't want to pressure a particular node.
>  		 * So adding penalty to the first node in same
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process
  2017-03-08 11:07                   ` John Hubbard
@ 2017-03-14 13:33                     ` Anshuman Khandual
  2017-03-15  4:10                       ` John Hubbard
  0 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2017-03-14 13:33 UTC (permalink / raw)
  To: John Hubbard, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams, zi.yan

On 03/08/2017 04:37 PM, John Hubbard wrote:
> On 03/08/2017 01:21 AM, Anshuman Khandual wrote:
>> Kernel allocation to CDM node has already been prevented by putting it's
>> entire memory in ZONE_MOVABLE. But the CDM nodes must also be isolated
>> from implicit allocations happening on the system.
>>
>> Any isolation seeking CDM node requires isolation from implicit memory
>> allocations from user space but at the same time there should also have
>> an explicit way to do the memory allocation.
>>
>> Platform node's both zonelists are fundamental to where the memory comes
>> from when there is an allocation request. In order to achieve these two
>> objectives as stated above, zonelists building process has to change as
>> both zonelists (i.e FALLBACK and NOFALLBACK) gives access to the node's
>> memory zones during any kind of memory allocation. The following changes
>> are implemented in this regard.
>>
>> * CDM node's zones are not part of any other node's FALLBACK zonelist
>> * CDM node's FALLBACK list contains it's own memory zones followed by
>>   all system RAM zones in regular order as before


> 
> There was a discussion, on an earlier version of this patchset, in which
> someone pointed out that a slight over-allocation on a device that has
> much more memory than the CPU has, could use up system memory. Your
> latest approach here does not address this.

Hmm, I don't remember this. Could you please be more specific and point
me to the discussion on this?

> 
> I'm thinking that, until oversubscription between NUMA nodes is more
> fully implemented in a way that can be properly controlled, you'd

I did not get you. What does oversubscription mean in this context?
The FALLBACK zonelist on each node has memory from every node, including
its own. Hence an allocation request targeted at any node is symmetrical
with respect to where the memory will be allocated from.

> probably better just not fallback to system memory. In other words, a
> CDM node really is *isolated* from other nodes--no automatic use in
> either direction.

That is debatable. With this proposed solution, the CDM FALLBACK
zonelist contains system RAM zones as a fallback option, which will
be used in case CDM memory is depleted. IMHO, I think that's the
right thing to do, as it still maintains the symmetry to some
extent.

> 
> Also, naming and purpose: maybe this is a "Limited NUMA Node", rather
> than a Coherent Device Memory node. Because: the real point of this
> thing is to limit the normal operation of NUMA, just enough to work with
> what I am *told* is memory-that-is-too-fragile-for-kernel-use (I remain
> soemwhat on the fence, there, even though you did talk me into it
> earlier, heh).

:) Naming can be debated later, after we all agree on the proposal
in principle. We have already discussed kernel memory on CDM
in detail.

> 
> On process: it would probably help if you gathered up previous
> discussion points and carefully, concisely addressed each one,
> somewhere, (maybe in a cover letter). Because otherwise, it's too easy
> for earlier, important problems to be forgotten. And reviewers don't
> want to have to repeat themselves, of course.

Will do.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process
  2017-03-14 13:33                     ` Anshuman Khandual
@ 2017-03-15  4:10                       ` John Hubbard
  0 siblings, 0 replies; 43+ messages in thread
From: John Hubbard @ 2017-03-15  4:10 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams, zi.yan

On 03/14/2017 06:33 AM, Anshuman Khandual wrote:
> On 03/08/2017 04:37 PM, John Hubbard wrote:
[...]
>> There was a discussion, on an earlier version of this patchset, in which
>> someone pointed out that a slight over-allocation on a device that has
>> much more memory than the CPU has, could use up system memory. Your
>> latest approach here does not address this.
>
> Hmm, I dont remember this. Could you please be more specific and point
> me to the discussion on this.

That idea came from Dave Hansen, who was commenting on your RFC V2 patch:

https://lkml.org/lkml/2017/1/30/894

..."A device who got its memory usage off by 1% could start to starve the rest of the system..."

>
>>
>> I'm thinking that, until oversubscription between NUMA nodes is more
>> fully implemented in a way that can be properly controlled, you'd
>
> I did not get you. What does over subscription mean in this context ?
> FALLBACK zonelist on each node has memory from every node including
> it's own. Hence the allocation request targeted towards any node is
> symmetrical with respect to from where the memory will be allocated.
>

Here, I was referring to the lack of support in the kernel today for allocating X+N bytes on a NUMA 
node when that node only has X bytes associated with it. Currently, the system uses a fallback node 
list to try to allocate on other nodes in that case, but that's not ideal. If NUMA allocation 
instead supported "oversubscription", it could allow the allocation to succeed, and then fault and 
evict (to other nodes) to support a working set that is larger than the physical memory that the 
node has.

This is what GPUs do today, in order to handle workloads that are too large for GPU memory. This 
enables a whole other level of applications that the user can run.

Maybe there are other ways to get the same result, so if others have ideas, please chime in. I'm 
assuming for now that this sort of thing will just be required in the coming months.

>> probably better just not fallback to system memory. In other words, a
>> CDM node really is *isolated* from other nodes--no automatic use in
>> either direction.
>
> That is debatable. With this proposed solution the CDM FALLBACK
> zonelist contains system RAM zones as fallback option which will
> be used in case CDM memory is depleted. IMHO, I think thats the
> right thing to do as it still maintains the symmetry to some
> extent.
>

Yes, it's worth discussing. Again, Dave's note applies here.

>>
>> Also, naming and purpose: maybe this is a "Limited NUMA Node", rather
>> than a Coherent Device Memory node. Because: the real point of this
>> thing is to limit the normal operation of NUMA, just enough to work with
>> what I am *told* is memory-that-is-too-fragile-for-kernel-use (I remain
>> soemwhat on the fence, there, even though you did talk me into it
>> earlier, heh).
>
> :) Naming can be debated later after we all agree on the proposal
> in principle. We have already discussed about kernel memory on CDM
> in detail.

OK.

thanks,
John Hubbard
NVIDIA

>
>>
>> On process: it would probably help if you gathered up previous
>> discussion points and carefully, concisely addressed each one,
>> somewhere, (maybe in a cover letter). Because otherwise, it's too easy
>> for earlier, important problems to be forgotten. And reviewers don't
>> want to have to repeat themselves, of course.
>
> Will do.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-03-15  4:11 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-15 12:07 [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
2017-02-15 12:07 ` [PATCH V3 1/4] mm: Define coherent device memory (CDM) node Anshuman Khandual
2017-02-17 14:05   ` Bob Liu
2017-02-21 10:20     ` Anshuman Khandual
2017-02-15 12:07 ` [PATCH V3 2/4] mm: Enable HugeTLB allocation isolation for CDM nodes Anshuman Khandual
2017-02-15 12:07 ` [PATCH V3 3/4] mm: Add new parameter to get_page_from_freelist() function Anshuman Khandual
2017-02-15 12:07 ` [PATCH V3 4/4] mm: Enable Buddy allocation isolation for CDM nodes Anshuman Khandual
2017-02-15 18:20 ` [PATCH V3 0/4] Define coherent device memory node Mel Gorman
2017-02-16 22:14   ` Balbir Singh
2017-02-17  9:33     ` Mel Gorman
2017-02-21  2:57       ` Balbir Singh
2017-03-01  2:42         ` Balbir Singh
2017-03-01  9:55           ` Mel Gorman
2017-03-01 10:59             ` Balbir Singh
2017-03-08  9:04               ` Anshuman Khandual
2017-03-08  9:21                 ` [PATCH 1/2] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
2017-03-08 11:07                   ` John Hubbard
2017-03-14 13:33                     ` Anshuman Khandual
2017-03-15  4:10                       ` John Hubbard
2017-03-08  9:21                 ` [PATCH 2/2] mm: Change mbind(MPOL_BIND) implementation for CDM nodes Anshuman Khandual
2017-02-17 11:41   ` [PATCH V3 0/4] Define coherent device memory node Anshuman Khandual
2017-02-17 13:32     ` Mel Gorman
2017-02-21 13:09       ` Anshuman Khandual
2017-02-21 20:14         ` Jerome Glisse
2017-02-23  8:14           ` Anshuman Khandual
2017-02-23 15:27             ` Jerome Glisse
2017-02-22  9:29         ` Michal Hocko
2017-02-22 14:59           ` Jerome Glisse
2017-02-22 16:54             ` Michal Hocko
2017-03-06  5:48               ` Anshuman Khandual
2017-02-23  8:52           ` Anshuman Khandual
2017-02-23 15:57         ` Mel Gorman
2017-03-06  5:12           ` Anshuman Khandual
2017-02-21 11:11     ` Michal Hocko
2017-02-21 13:39       ` Anshuman Khandual
2017-02-22  9:50         ` Michal Hocko
2017-02-23  6:52           ` Anshuman Khandual
2017-03-05 12:39             ` Anshuman Khandual
2017-02-24  1:06         ` Bob Liu
2017-02-24  4:39           ` John Hubbard
2017-02-24  4:53           ` Jerome Glisse
2017-02-27  1:56             ` Bob Liu
2017-02-27  5:41               ` Anshuman Khandual

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).