* [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
@ 2017-04-19  7:52 Balbir Singh
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
                   ` (6 more replies)
  0 siblings, 7 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

This is a request for comments on the approaches for coherent
memory discussed at mm-summit (some of the details are at
https://lwn.net/Articles/717601/). The latest posted patch
series is at https://lwn.net/Articles/713035/. I am reposting
this as an RFC. Michal Hocko suggested using HMM for CDM, but
we believe there are stronger reasons to use the NUMA approach.
The earlier patches for coherent device memory were designed
and implemented by Anshuman Khandual.

Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
The patches do a great deal to enable CDM with HMM, but we
still believe that HMM with CDM is not a natural way to
represent coherent device memory and the mm will need
to be audited and enhanced for it to even work.

With HMM we'll see ZONE_DEVICE pages mapped into
user space, and that would mean a thorough audit of all code
paths to make sure we are ready for such a use case and can
enable it, as HMM-CDM patch 1 does by changing move_pages()
and the migration paths. I've done a quick evaluation of the
feature set and found limitations around migration (page cache
migration), fault handling to the right location (direct page
cache allocation in the coherent memory), mlock handling, RSS
accounting, memcg enforcement for pages not on the LRU, etc.

This series consists of 4 patches.

The first patch defines N_COHERENT_MEMORY and supports onlining of
N_COHERENT_MEMORY.  The second one enables marking of coherent
memory nodes in architecture-specific code. The third patch
changes mempolicy (MPOL_BIND and MPOL_PREFERRED) so that a node
can be explicitly specified for allocation. The fourth patch adds
documentation explaining the design and motivation behind
coherent memory. The primary motivation of these patches
is to avoid the allocator overhead that Mel Gorman had concerns
about, but mempolicy changes are required to allow explicit
specification of a node in the nodemask.

Introduction and design (taken from patch 4)

Introduction

CDM device memory is cache coherent with system memory, and we would like
this to show up as a NUMA node; however, there are certain algorithms
that might not currently be suitable for N_COHERENT_MEMORY:

1. AutoNUMA balancing
2. kswapd reclaim

The reason for exposing this device memory as NUMA is to simplify
the programming model, where memory allocation via malloc() or
mmap() for example would seamlessly work across both kinds of
memory. Since we expect the size of device memory to be smaller
than system RAM, we would like to control the allocation of such
memory. The proposed mechanism reuses nodemasks and explicit
specification of the coherent node in the nodemask for allocation
from device memory. This implementation also allows for kernel
level allocation via __GFP_THISNODE and existing techniques
such as page migration to work.
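
As an illustration of the user-space side (not part of this series), a
minimal sketch that explicitly places an anonymous mapping on the
coherent node; the node number (3) is an assumption for the example,
error handling is trimmed, and it needs numaif.h / -lnuma:

#include <numaif.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;
	unsigned long mask = 1UL << 3;	/* assumed coherent node 3 */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	/* MPOL_BIND with only the coherent node in the nodemask */
	if (mbind(buf, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0))
		return 1;
	memset(buf, 0, len);	/* first touch allocates on node 3 */
	return 0;
}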

Assumptions:

1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
effectively they are CPUless memory nodes
2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
Slub allocations from these nodes will fail otherwise.

Implementation Details

A new node state N_COHERENT_MEMORY is created. Each architecture
can then mark devices as being N_COHERENT_MEMORY and the implementation
makes sure this node set is disjoint from the N_MEMORY node state
nodes. A typical node zonelist (FALLBACK) with N_COHERENT_MEMORY would
be:

Assuming we have 2 nodes and 1 coherent memory node

Node1:	Node 1 --> Node 2

Node2:	Node 2 --> Node 1

Node3:	Node 3 --> Node 2 --> Node 1

This effectively means that allocations that have Node 1 and Node 2
in the nodemask will not allocate from Node 3. Allocations with
__GFP_THISNODE use the NOFALLBACK list and should allocate from Node 3,
if it is specified.  Since Node 3 has no CPUs, we don't expect any
default allocations occurring from it.

However, to support allocation from the coherent node, changes have been
made to mempolicy, specifically policy_nodemask() and policy_zonelist(),
such that (a small user-space sketch follows the list):

1. MPOL_BIND with the coherent node (Node 3 in the above example) will
not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
is in N_COHERENT_MEMORY
2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
if a policy that specifies a preference to it is used.
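
A minimal user-space sketch of (2), assuming node 3 is the coherent
node (the node number is illustrative only; set_mempolicy() comes from
numaif.h / libnuma):

#include <numaif.h>

/* Prefer the (assumed) coherent node for this task's allocations */
static int prefer_coherent_node(int nid)
{
	unsigned long mask = 1UL << nid;

	return set_mempolicy(MPOL_PREFERRED, &mask, sizeof(mask) * 8);
}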

Limitations

A limitation of this approach is that in the future we may want more
granular control over which algorithms a node participates in: for
example, could we have N_COHERENT_MEMORY devices that want to participate
in AutoNUMA balancing, but not in kswapd reclaim, or vice versa? One way
to solve the problem would be to have tunables or to extend the notion of
N_COHERENT_MEMORY.

Using coherent memory is not compatible with cpusets, since cpusets
would enforce mems_allowed and mems_allowed will not contain the
coherent node. With numactl, for example, the user would have to use
"-a" to allow allocation from all nodes.

Coherent memory relies on the node being a movable_node, which is a
requirement for device memory anyway due to the need to hotplug it.

Review Recommendations

Michal Hocko/Mel Gorman for the approach and allocator bits
Vlastimil Babka/Christoph Lameter for the mempolicy changes.

Testing

I tested these patches in a virtual machine where I was able to simulate
coherent device memory. I had 3 normal NUMA nodes and one N_COHERENT_MEMORY
node. I ran mmtests with the config-global-dhp__pagealloc-performance config
and noted the numbers for the following tests in particular:
page_test, brk_test, exec_test and fork_test. Observations from these
tests:

1. page_test shows similar rates with and without coherent memory,
given the same number of nodes
2. brk_test was faster with coherent memory (3 NUMA, 1 COHERENT) compared
to 4 NUMA nodes, but had rates similar to the system with (3 NUMA, 0 COHERENT)
3. exec_test was a bit slower on the system with coherent memory compared
to a system with no coherent memory
4. fork_test was a bit slower on the system with coherent memory compared
to a system with no coherent memory

I also did some basic tests with numactl -a memhog with various membind
and preferred policies. I wrote a small kernel module to allocate
memory with __GFP_THISNODE and GFP_HIGHUSER_MOVABLE (for memory on the
coherent node).
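
For reference, a sketch along the lines of that test module; this is
not the actual module used, and the node id (3) is an assumption:

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *page;

static int __init cdm_alloc_init(void)
{
	int nid = 3;	/* assumed coherent device memory node */

	/* __GFP_THISNODE: fail rather than fall back to other nodes */
	page = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
	if (!page)
		return -ENOMEM;

	pr_info("allocated a page on node %d\n", page_to_nid(page));
	return 0;
}

static void __exit cdm_alloc_exit(void)
{
	if (page)
		__free_pages(page, 0);
}

module_init(cdm_alloc_init);
module_exit(cdm_alloc_exit);
MODULE_LICENSE("GPL");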

Balbir Singh (4):
  mm: create N_COHERENT_MEMORY
  arch/powerpc/mm: add support for coherent memory
  mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the
    system
  linux/mm: Add documentation for coherent memory

 Documentation/memory-hotplug.txt     | 11 +++++++
 Documentation/vm/00-INDEX            |  2 ++
 Documentation/vm/coherent-memory.txt | 59 ++++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/numa.c               |  8 +++++
 drivers/base/memory.c                |  3 ++
 drivers/base/node.c                  |  2 ++
 include/linux/memory_hotplug.h       |  1 +
 include/linux/nodemask.h             |  1 +
 mm/memory_hotplug.c                  |  8 +++--
 mm/mempolicy.c                       | 30 ++++++++++++++++--
 mm/page_alloc.c                      | 20 +++++++++---
 11 files changed, 136 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/vm/coherent-memory.txt

-- 
2.9.3


* [RFC 1/4] mm: create N_COHERENT_MEMORY
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-27 18:42   ` Reza Arbab
  2017-04-19  7:52 ` [RFC 2/4] arch/powerpc/mm: add support for coherent memory Balbir Singh
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

The idea of coherent memory has been defined in earlier
RFCs and patchsets; in particular, https://lwn.net/Articles/704403/
has the details. This patch has a summary of the intentions
and implementation. The earlier patches were designed
and implemented by Anshuman Khandual.

A coherent memory device is a NUMA node: yes, it's non-uniform
memory access and also non-uniform memory attributes :) New hardware
has the capability to allow for coherency between device memory
and CPU memory. This memory is visible as a part of system memory,
but its attributes are different. The debate is about how we expose
this memory so that the programming model stays simple. HMM provides
a similar approach, but since it targets hardware without coherence
it cannot make things as simple as exposing the memory as a NUMA node.

In this patch we create N_COHERENT_MEMORY, which is different
from N_MEMORY. A node hotplugged as coherent memory will have
this state set. The expectation then is that this memory gets
onlined like regular nodes. Memory allocation from such nodes
occurs only when the node is contained explicitly in the
mask.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 Documentation/memory-hotplug.txt | 13 +++++++++++++
 drivers/base/memory.c            |  3 +++
 drivers/base/node.c              |  2 ++
 include/linux/memory_hotplug.h   |  1 +
 include/linux/nodemask.h         |  1 +
 mm/memory_hotplug.c              |  5 ++++-
 6 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 670f3de..26736d8 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -298,6 +298,19 @@ available memory will be increased.
 Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
 This may be changed in future.
 
+% echo online_coherent > /sys/devices/system/memory/memoryXXX/state
+
+After this memory is onlined, same as "echo online" above, except that the node
+is marked as N_COHERENT_MEMORY and it is not a part of N_MEMORY. Effectively
+it means that this node is not a part of any node zonelist, except itself.
+Ideally N_COHERENT_MEMORY nodes have no cpus on them.
+
+A user space program can use numactl with -a to allocate on this node with
+an explicit node specification. From the kernel, one may use __GFP_THISNODE
+with the node specified and alloc_pages_node() to allocate.
+
+NOTE: This node will not show up in mems_allowed and will not work with
+cpusets in general.
 
 
 ------------------------
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index cc4f1d0..9a96c6e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -323,6 +323,8 @@ store_mem_state(struct device *dev,
 		online_type = MMOP_ONLINE_KERNEL;
 	else if (sysfs_streq(buf, "online_movable"))
 		online_type = MMOP_ONLINE_MOVABLE;
+	else if (sysfs_streq(buf, "online_coherent"))
+		online_type = MMOP_ONLINE_COHERENT;
 	else if (sysfs_streq(buf, "online"))
 		online_type = MMOP_ONLINE_KEEP;
 	else if (sysfs_streq(buf, "offline"))
@@ -345,6 +347,7 @@ store_mem_state(struct device *dev,
 	case MMOP_ONLINE_KERNEL:
 	case MMOP_ONLINE_MOVABLE:
 	case MMOP_ONLINE_KEEP:
+	case MMOP_ONLINE_COHERENT:
 		mem->online_type = online_type;
 		ret = device_online(&mem->dev);
 		break;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..6bfdfd6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -660,6 +660,7 @@ static struct node_attr node_state_attr[] = {
 #ifdef CONFIG_MOVABLE_NODE
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 #endif
+	[N_COHERENT_MEMORY] = _NODE_ATTR(has_coherent_memory, N_COHERENT_MEMORY),
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
 };
 
@@ -673,6 +674,7 @@ static struct attribute *node_state_attrs[] = {
 #ifdef CONFIG_MOVABLE_NODE
 	&node_state_attr[N_MEMORY].attr.attr,
 #endif
+	&node_state_attr[N_COHERENT_MEMORY].attr.attr,
 	&node_state_attr[N_CPU].attr.attr,
 	NULL
 };
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f6..aa927aa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,7 @@ enum {
 	MMOP_ONLINE_KEEP,
 	MMOP_ONLINE_KERNEL,
 	MMOP_ONLINE_MOVABLE,
+	MMOP_ONLINE_COHERENT,
 };
 
 /*
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index f746e44..037e34a 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -393,6 +393,7 @@ enum node_states {
 	N_MEMORY = N_HIGH_MEMORY,
 #endif
 	N_CPU,		/* The node has one or more cpus */
+	N_COHERENT_MEMORY,	/* The node has cache coherent device memory */
 	NR_NODE_STATES
 };
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b63d7d1..ebeb3af 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1149,7 +1149,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
 	if (onlined_pages) {
-		node_states_set_node(nid, &arg);
+		if (online_type == MMOP_ONLINE_COHERENT)
+			node_set_state(nid, N_COHERENT_MEMORY);
+		else
+			node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
 			build_all_zonelists(NULL, NULL);
 		else
-- 
2.9.3


* [RFC 2/4] arch/powerpc/mm: add support for coherent memory
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-19  7:52 ` [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system Balbir Singh
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

Add support for N_COHERENT_MEMORY by marking nodes compatible
with ibm,coherent-device-memory as coherent nodes. The code
sets N_COHERENT_MEMORY before the system has had a chance to
set N_MEMORY.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/mm/numa.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 371792e..c977de8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -652,6 +652,7 @@ static void __init parse_drconf_memory(struct device_node *memory)
 	unsigned long lmb_size, base, size, sz;
 	int nid;
 	struct assoc_arrays aa = { .arrays = NULL };
+	int coherent = 0;
 
 	n = of_get_drconf_memory(memory, &dm);
 	if (!n)
@@ -696,6 +697,10 @@ static void __init parse_drconf_memory(struct device_node *memory)
 				size = read_n_cells(n_mem_size_cells, &usm);
 			}
 			nid = of_drconf_to_nid_single(&drmem, &aa);
+			coherent = of_device_is_compatible(memory,
+					"ibm,coherent-device-memory");
+			if (coherent)
+				node_set_state(nid, N_COHERENT_MEMORY);
 			fake_numa_create_new_node(
 				((base + size) >> PAGE_SHIFT),
 					   &nid);
@@ -713,6 +718,7 @@ static int __init parse_numa_properties(void)
 	struct device_node *memory;
 	int default_nid = 0;
 	unsigned long i;
+	int coherent = 0;
 
 	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
@@ -785,6 +791,10 @@ static int __init parse_numa_properties(void)
 
 		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
 		node_set_online(nid);
+		coherent = of_device_is_compatible(memory,
+				"ibm,coherent-device-memory");
+		if (coherent)
+			node_set_state(nid, N_COHERENT_MEMORY);
 
 		size = numa_enforce_memory_limit(start, size);
 		if (size)
-- 
2.9.3


* [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
  2017-04-19  7:52 ` [RFC 2/4] arch/powerpc/mm: add support for coherent memory Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-19  7:52 ` [RFC 4/4] mm: Add documentation for coherent memory Balbir Singh
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

This patch integrates N_COHERENT_MEMORY more deeply with the rest of
the system. It does the following:

1. Modifies mempolicy so as to
	a. Allow policy_zonelist() and policy_nodemask() to
	   understand N_COHERENT_MEMORY nodes and allow the
	   right mask/list to be built when the policy contains
	   those nodes
	b. Checks for N_COHERENT_MEMORY in mpol_new_nodemask()
	   and other places with hard-coded checks for N_MEMORY
2. Modifies mm/page_alloc.c, so that nodes marked as N_COHERENT_MEMORY
   are not marked as N_MEMORY
3. Changes node zonelist creation, so that coherent memory is
   present in the fallback in case multiple such nodes are
   present.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 mm/memory_hotplug.c |  3 ++-
 mm/mempolicy.c      | 31 ++++++++++++++++++++++++++++---
 mm/page_alloc.c     | 21 +++++++++++++++++----
 3 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ebeb3af..12d5431 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1037,7 +1037,8 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 	if (arg->status_change_nid_high >= 0)
 		node_set_state(node, N_HIGH_MEMORY);
 
-	node_set_state(node, N_MEMORY);
+	if (!node_state(node, N_COHERENT_MEMORY))
+		node_set_state(node, N_MEMORY);
 }
 
 bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b33..141398e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -217,6 +217,8 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 		     const nodemask_t *nodes, struct nodemask_scratch *nsc)
 {
 	int ret;
+	int n;
+	nodemask_t tmp;
 
 	/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
 	if (pol == NULL)
@@ -226,6 +228,14 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
 	VM_BUG_ON(!nodes);
+
+	for_each_node_mask(n, *nodes) {
+		if (node_state(n, N_COHERENT_MEMORY)) {
+			tmp = nodemask_of_node(n);
+			nodes_or(nsc->mask1, nsc->mask1, tmp);
+		}
+	}
+
 	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
 		nodes = NULL;	/* explicit local allocation */
 	else {
@@ -1435,7 +1445,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
 		goto out_put;
 	}
 
-	if (!nodes_subset(*new, node_states[N_MEMORY])) {
+	if (!nodes_subset(*new, node_states[N_MEMORY]) &&
+		!nodes_subset(*new, node_states[N_COHERENT_MEMORY])) {
 		err = -EINVAL;
 		goto out_put;
 	}
@@ -1670,7 +1681,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
 	if (unlikely(policy->mode == MPOL_BIND) &&
 			apply_policy_zone(policy, gfp_zone(gfp)) &&
-			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
+			(cpuset_nodemask_valid_mems_allowed(&policy->v.nodes) ||
+			nodes_intersects(policy->v.nodes,
+				node_states[N_COHERENT_MEMORY])))
 		return &policy->v.nodes;
 
 	return NULL;
@@ -1691,6 +1704,17 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	/*
+	 * It is not sufficient to have the right nodemask, we need the
+	 * correct zonelist for N_COHERENT_MEMORY
+	 */
+	if (node_state(nd, N_COHERENT_MEMORY))
+		/*
+		 * Ideally we should pick the best node, but for now use
+		 * any one
+		 */
+		nd = first_node(node_states[N_COHERENT_MEMORY]);
+
 	return node_zonelist(nd, gfp);
 }
 
@@ -2689,7 +2713,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		*nodelist++ = '\0';
 		if (nodelist_parse(nodelist, nodes))
 			goto out;
-		if (!nodes_subset(nodes, node_states[N_MEMORY]))
+		if (!nodes_subset(nodes, node_states[N_MEMORY]) &&
+			!nodes_subset(nodes, node_states[N_COHERENT_MEMORY]))
 			goto out;
 	} else
 		nodes_clear(nodes);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2c687d..59e4d30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4856,6 +4856,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	int min_val = INT_MAX;
 	int best_node = NUMA_NO_NODE;
 	const struct cpumask *tmp = cpumask_of_node(0);
+	nodemask_t tmp_mask, tmp_mask2;
 
 	/* Use the local node if we haven't already */
 	if (!node_isset(node, *used_node_mask)) {
@@ -4863,7 +4864,17 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 		return node;
 	}
 
-	for_each_node_state(n, N_MEMORY) {
+	tmp_mask = node_states[N_MEMORY];
+	tmp_mask2 = node_states[N_COHERENT_MEMORY];
+
+	/*
+	 * If the nodemask has one coherent node, add others
+	 * as well
+	 */
+	if (node_state(node, N_COHERENT_MEMORY))
+		nodes_or(tmp_mask, tmp_mask2, tmp_mask);
+
+	for_each_node_mask(n, tmp_mask) {
 
 		/* Don't want a node to appear more than once */
 		if (node_isset(n, *used_node_mask))
@@ -6288,7 +6299,7 @@ static unsigned long __init early_calculate_totalpages(void)
 		unsigned long pages = end_pfn - start_pfn;
 
 		totalpages += pages;
-		if (pages)
+		if (pages && !node_state(nid, N_COHERENT_MEMORY))
 			node_set_state(nid, N_MEMORY);
 	}
 	return totalpages;
@@ -6598,9 +6609,11 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 				find_min_pfn_for_node(nid), NULL);
 
 		/* Any memory on that node */
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages &&
+			!node_state(nid, N_COHERENT_MEMORY)) {
 			node_set_state(nid, N_MEMORY);
-		check_for_memory(pgdat, nid);
+			check_for_memory(pgdat, nid);
+		}
 	}
 }
 
-- 
2.9.3


* [RFC 4/4] mm: Add documentation for coherent memory
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (2 preceding siblings ...)
  2017-04-19  7:52 ` [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 Documentation/vm/00-INDEX            |  2 ++
 Documentation/vm/coherent-memory.txt | 59 ++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 Documentation/vm/coherent-memory.txt

diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 11d3d8d..99175e9 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -6,6 +6,8 @@ balance
 	- various information on memory balancing.
 cleancache.txt
 	- Intro to cleancache and page-granularity victim cache.
+coherent-memory.txt
+	- Introduction to coherent memory handling (N_COHERENT_MEMORY)
 frontswap.txt
 	- Outline frontswap, part of the transcendent memory frontend.
 highmem.txt
diff --git a/Documentation/vm/coherent-memory.txt b/Documentation/vm/coherent-memory.txt
new file mode 100644
index 0000000..bd60e5b
--- /dev/null
+++ b/Documentation/vm/coherent-memory.txt
@@ -0,0 +1,59 @@
+Introduction
+
+This document describes a new type of node called N_COHERENT_MEMORY.
+This memory is cache coherent with system memory and we would like
+this to show up as a NUMA node, however there are certain algorithms
+that might not be currently suitable for N_COHERENT_MEMORY
+
+1. AutoNUMA balancing
+2. kswapd reclaim
+
+The reason for exposing this device memory as NUMA is to simplify
+the programming model, where memory allocation via malloc() or
+mmap() for example would seamlessly work across both kinds of
+memory. Since we expect the size of device memory to be smaller
+than system RAM, we would like to control the allocation of such
+memory. The proposed mechanism reuses nodemasks and explicit
+specification of the coherent node in the nodemask for allocation
+from device memory. This implementation also allows for kernel
+level allocation via __GFP_THISNODE and existing techniques
+such as page migration to work.
+
+Assumptions:
+
+1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
+effectively they are CPUless memory nodes
+2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
+Slub allocations from these nodes will fail otherwise.
+
+Implementation Details
+
+A new node state N_COHERENT_MEMORY is created. Each architecture
+can then mark devices as being N_COHERENT_MEMORY and the implementation
+makes sure this node set is disjoint from the N_MEMORY node state
+nodes. A typical node zonelist (FALLBACK) with N_COHERENT_MEMORY would
+be:
+
+Assuming we have 2 nodes and 1 coherent memory node
+
+Node1:	Node 1 --> Node 2
+
+Node2:	Node 2 --> Node 1
+
+Node3:	Node 3 --> Node 2 --> Node 1
+
+This effectively means that allocations that have Node 1 and Node 2
+in the nodemask will not allocate from Node 3. Allocations with __GFP_THISNODE
+use the NOFALLBACK list and should allocate from Node 3, if it
+is specified.  Since Node 3 has no CPUs, we don't expect any default
+allocations occurring from it.
+
+However to support allocation from the coherent node, changes have been
+made to mempolicy, specifically policy_nodemask() and policy_zonelist()
+such that
+
+1. MPOL_BIND with the coherent node (Node 3 in the above example) will
+not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
+is in N_COHERENT_MEMORY
+2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
+if a policy that specifies a preference to it is used.
-- 
2.9.3


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (3 preceding siblings ...)
  2017-04-19  7:52 ` [RFC 4/4] mm: Add documentation for coherent memory Balbir Singh
@ 2017-04-19 19:02 ` Christoph Lameter
  2017-04-20  1:25   ` Balbir Singh
  2017-05-01 20:41 ` John Hubbard
  2017-05-02 14:36 ` Michal Hocko
  6 siblings, 1 reply; 45+ messages in thread
From: Christoph Lameter @ 2017-04-19 19:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Wed, 19 Apr 2017, Balbir Singh wrote:

> The first patch defines N_COHERENT_MEMORY and supports onlining of
> N_COHERENT_MEMORY.  The second one enables marking of coherent

The name is confusing. All other NUMA nodes are coherent. Can we name this
in some way that describes what is special about these nodes?

And we already have support for memory only nodes. Why is that not sufficient?
If you can answer that question then we may get to the term to be used to
name these nodes. We also have support for hotplug memory. How does the
memory here differ from hotplug?

 > memory nodes in architecture specific code, the third patch
> enables mempolicy MPOL_BIND and MPOL_PREFERRED changes to
> explicitly specify a node for allocation. The fourth patch adds

Huh? MPOL_PREFERRED already allows specifying a node.
MPOL_BIND requires a set of nodes. ??

> 1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
> effectively they are CPUless memory nodes
> 2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
> Slub allocations from these nodes will fail otherwise.

Isn't that what hotpluggable nodes do already?

> 1. MPOL_BIND with the coherent node (Node 3 in the above example) will
> not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
> is in N_COHERENT_MEMORY
> 2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
> if a policy that specifies a preference to it is used.

So this means that "coherent" nodes need a different fallback
mechanism? Something like an ISOLATED_NODE or something?

The approach sounds pretty invasive to me. Can we first clarify what
features you need and develop terminology that describes things in terms
of a view from the Linux MM perspective? Coherent memory is nothing
special from there. It is special from the perspective of offload devices
that have heretofore not offered that. So it's mainly a marketing term. We
need something descriptive here.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
@ 2017-04-20  1:25   ` Balbir Singh
  2017-04-20 15:29     ` Christoph Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-04-20  1:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Wed, 2017-04-19 at 14:02 -0500, Christoph Lameter wrote:
> On Wed, 19 Apr 2017, Balbir Singh wrote:
> 
> > The first patch defines N_COHERENT_MEMORY and supports onlining of
> > N_COHERENT_MEMORY.  The second one enables marking of coherent
> 
> The name is confusing. All other NUMA nodes are coherent. Can we name this
> in some way that describes what is special about these nodes?
> 
> And we already have support for memory only nodes. Why is that not sufficient?
> If you can answer that question then we may get to the term to be used to
> name these nodes. We also have support for hotplug memory. How does the
> memory here differ from hotplug?
> 
>  > memory nodes in architecture specific code, the third patch
> > enables mempolicy MPOL_BIND and MPOL_PREFERRED changes to
> > explicitly specify a node for allocation. The fourth patch adds
> 
> Huh? MPOL_PREFERRED already allows specifying a node.
> MPOL_BIND requires a set of nodes. ??

Wording issue; I meant that the changes support explicit
specification of the coherent memory node for allocation.

> 
> > 1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
> > effectively they are CPUless memory nodes
> > 2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
> > Slub allocations from these nodes will fail otherwise.
> 
> Isnt that what hotpluggable nodes do already?

Yes, and we need that for coherent device memory as well.

> 
> > 1. MPOL_BIND with the coherent node (Node 3 in the above example) will
> > not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
> > is in N_COHERENT_MEMORY
> > 2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
> > if a policy that specifies a preference to it is used.
> 
> So this means that "Coherent" nodes means that you need a different
> fallback mechanism? Something like a ISOLATED_NODE or something?

A couple of things are needed:

1. Isolation of allocation
2. Isolation of certain algorithms like kswapd/auto-numa balancing

There are some notes on (2) in the limitations section as well.

> 
> The approach sounds pretty invasive to me.

Could you please elaborate? Do you mean the user space programming bits?


> Can we first clarify what
> features you need and develop terminology that describes things in terms
> of a view from the Linux MM perspective?

Ideally we need the following:

1. Transparency about being able to allocate memory anywhere and the ability
to migrate memory between coherent device memory and normal system memory
2. The ability to explicitly allocate memory from coherent device memory
3. Isolation of normal allocations from coherent device memory unless
explicitly stated, same as (2) above
4. The ability to hotplug in and out the memory at run-time
5. Exchange pointers between coherent device memory and normal memory
for the compute on the coherent device memory to use

I could list further things, but largely coherent device memory is like
system memory except that we believe that things like auto-numa balancing
and kswapd will not work well due to lack of information about references
and faults.

Some of the mm-summit notes are at https://lwn.net/Articles/717601/
The goals align with HMM, except that the device memory is coherent. HMM
has a CDM variation as well.

> Coherent memory is nothing
> special from there. It is special from the perspective of offload devices
> that have heretofore not offered that. So its mainly a marketing term. We
> need something descriptive here.
> 

We've been using the term coherent device memory (CDM). I could rephrase the
text and documentation for consistency. Would you prefer a different term?

Thanks for the review!
Balbir Singh.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20  1:25   ` Balbir Singh
@ 2017-04-20 15:29     ` Christoph Lameter
  2017-04-20 21:26       ` Benjamin Herrenschmidt
  2017-04-24  0:20       ` Balbir Singh
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Lameter @ 2017-04-20 15:29 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Thu, 20 Apr 2017, Balbir Singh wrote:
> Couple of things are needed
>
> 1. Isolation of allocation

cgroups, memory policy and cpuset provide that

> 2. Isolation of certain algorithms like kswapd/auto-numa balancing

Ok that may mean adding some generic functionality to limit those

> > The approach sounds pretty invasive to me.
>
> Could you please elaborate, you mean the user space programming bits?

No I mean the modification of the memory policies in particular. We are
adding more exceptions to an already complex and fragile system.

Can we do this in a generic way just using hotplug nodes and some of the
existing isolation mechanisms?


> Ideally we need the following:
>
> 1. Transparency about being able to allocate memory anywhere and the ability
> to migrate memory between coherent device memory and normal system memory

If it is a memory node then you have that already.

> 2. The ability to explictly allocate memory from coherent device memory

Ditto

> 3. Isolation of normal allocations from coherent device memory unless
> explictly stated, same as (2) above

memory policies etc do that.

> 4. The ability to hotplug in and out the memory at run-time

hotplug code does that.


> 5. Exchange pointers between coherent device memory and normal memory
> for the compute on the coherent device memory to use

I don't see anything preventing that from occurring right now. That's a
device issue with doing proper virtual to physical mapping, right?

> I could list further things, but largely coherent device memory is like
> system memory except that we believe that things like auto-numa balancing
> and kswapd will not work well due to lack of information about references
> and faults.

Ok so far I do not see that we need coherent nodes at all.

> Some of the mm-summit notes are at https://lwn.net/Articles/717601/
> The goals align with HMM, except that the device memory is coherent. HMM
> has a CDM variation as well.

I was at the presentation but at that point you were interested in a
different approach it seems.

> We've been using the term coherent device memory (CDM). I could rephrase the
> text and documentation for consistency. Would you prefer a different term?

Hotplug memory node?


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20 15:29     ` Christoph Lameter
@ 2017-04-20 21:26       ` Benjamin Herrenschmidt
  2017-04-21 16:13         ` Christoph Lameter
  2017-04-24  0:20       ` Balbir Singh
  1 sibling, 1 reply; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-20 21:26 UTC (permalink / raw)
  To: Christoph Lameter, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, mhocko, arbab, vbabka

On Thu, 2017-04-20 at 10:29 -0500, Christoph Lameter wrote:
> On Thu, 20 Apr 2017, Balbir Singh wrote:
> > Couple of things are needed
> > 
> > 1. Isolation of allocation
> 
> cgroups, memory policy and cpuset provide that

Can these be configured appropriately by the accelerator or GPU driver
at the point where it hot-plugs the memory?

The problem is we need to ensure there is no window in which the kernel
will start putting things like skb's etc... in there.

My original idea was to cover the whole thing with a CMA, which helps
with the case where the user wants to use the "legacy" APIs of manually
controlling the allocations on the GPU since in that case, the
user/driver might need to do fairly large contiguous allocations.

I was told there are some plumbing issues with having a bunch of CMAs
around though.

Basically the whole debate at the moment revolves around whether to use
HMM/CDM/ZONE_DEVICE vs. making it just a NUMA nodes with a sprinkle of
added foo.

The former approach pretty clearly puts that device into a separate
category and keeps most of the VM stuff at bay. However, it has a
number of disadvantage. ZONE_DEVICE was meant for providing struct
pages & DAX etc... for things like flash storage, "new memory" etc....

What we have here is effectively a bit more like a NUMA node, whose
processing unit is just not a CPU but a GPU or some kind of
accelerator.

The difference boils down to how we want to use it. We want any page,
anonymous memory, mapped file, you name it... to be able to migrate
back and forth depending on which piece of HW is most actively
accessing it. This is helped by a bunch of things such as very fast DMA
engines to facilitate migration, and HW counter to detect when parts of
that memory are accessed "remotely" (and thus request migrations).

So the NUMA model fits reasonably well, with that memory being overall
treated normally. The ZONE_DEVICE model on the other hand creates those
"special" pages which require a pile of special casing in all sort of
places as Balbir has mentioned, with still a bunch of rather standard
stuff not working with them.

However, we do need to address a few quirks, which is what this is
about.

Mostly we want to keep kernel allocations away from it, in part because
the memory is more prone to fail and not terribly fast for direct CPU
access, in part because we want to maximize the availability of it for
dedicated applications.

I find it clumsy (and racy) to require establishing policies from
userspace after it's been instantiated. At least for that isolation
mechanism.

Other things are possibly more realistic to do that way, such as taking
KSM and AutoNuma off the picture for it.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20 21:26       ` Benjamin Herrenschmidt
@ 2017-04-21 16:13         ` Christoph Lameter
  2017-04-21 21:15           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Lameter @ 2017-04-21 16:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, mhocko, arbab, vbabka

On Fri, 21 Apr 2017, Benjamin Herrenschmidt wrote:

> On Thu, 2017-04-20 at 10:29 -0500, Christoph Lameter wrote:
> > On Thu, 20 Apr 2017, Balbir Singh wrote:
> > > Couple of things are needed
> > >
> > > 1. Isolation of allocation
> >
> > cgroups, memory policy and cpuset provide that
>
> Can these be configured appropriately by the accelerator or GPU driver
> at the point where it hot plugs the memory ?

A driver could be able to setup a memory policy. Sure.

> The problem is we need to ensure there is no window in which the kernel
> will start putting things like skb's etc... in there.

skbs are not put into user space pages. They are unmovable and thus
hotplugged memory will not be used.

> Basically the whole debate at the moment revolves around whether to use
> HMM/CDM/ZONE_DEVICE vs. making it just a NUMA nodes with a sprinkle of
> added foo.

I think the memory hotplug idea should be making this easy to do. Not
much rigging around needed.

> What we have here is effectively a bit more like a NUMA node, whose
> processing unit is just not a CPU but a GPU or some kind of
> accelerator.

It's like a memory-only node. That is a common use case for NUMA nodes
(HP has made use of memory-only nodes at a large scale).

> The difference boils down to how we want to use is. We want any page,
> anonymous memory, mapped file, you name it... to be able to migrate
> back and forth depending on which piece of HW is most actively
> accessing it. This is helped by a bunch of things such as very fast DMA
> engines to facilitate migration, and HW counter to detect when parts of
> that memory are accessed "remotely" (and thus request migrations).

Well that migration can even be done from userspace. See the
migrate_pages() syscall.
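
(As an aside, a minimal sketch of driving such a migration from user
space with migrate_pages(2); the node numbers 0 and 3 are assumptions
for the example, and it needs numaif.h / -lnuma:)

#include <numaif.h>
#include <sys/types.h>

/* Move a task's pages from (assumed) node 0 to the coherent node 3 */
static long move_to_coherent(pid_t pid)
{
	unsigned long from = 1UL << 0;
	unsigned long to = 1UL << 3;

	return migrate_pages(pid, sizeof(from) * 8, &from, &to);
}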

> So the NUMA model fits reasonably well, with that memory being overall
> treated normally. The ZONE_DEVICE model on the other hand creates those
> "special" pages which require a pile of special casing in all sort of
> places as Balbir has mentioned, with still a bunch of rather standard
> stuff not working with them.

Right.

> However, we do need to address a few quirks, which is what this is
> about.
>
> Mostly we want to keep kernel allocations away from it, in part because
> the memory is more prone to fail and not terribly fast for direct CPU
> access, in part because we want to maximize the availability of it for
> dedicated applications.

Hotplugged memory contains only movable pages. This means kernel
allocations do not occur there. You are fine.

> Other things are possibly more realistic to do that way, such as taking
> KSM and AutoNuma off the picture for it.

Well just pinning those pages or mlocking those will stop these scans.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-21 16:13         ` Christoph Lameter
@ 2017-04-21 21:15           ` Benjamin Herrenschmidt
  2017-04-24 13:57             ` Christoph Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-21 21:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, mhocko, arbab, vbabka

On Fri, 2017-04-21 at 11:13 -0500, Christoph Lameter wrote:
> > Other things are possibly more realistic to do that way, such as
> > taking
> > KSM and AutoNuma off the picture for it.
> 
> Well just pinning those pages or mlocking those will stop these
> scans.

But that will stop migration too :-) These are mostly policy
adjustments; we need to look at other options here.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20 15:29     ` Christoph Lameter
  2017-04-20 21:26       ` Benjamin Herrenschmidt
@ 2017-04-24  0:20       ` Balbir Singh
  2017-04-24 14:00         ` Christoph Lameter
  1 sibling, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-04-24  0:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Thu, 2017-04-20 at 10:29 -0500, Christoph Lameter wrote:
> On Thu, 20 Apr 2017, Balbir Singh wrote:
> > Couple of things are needed
> > 
> > 1. Isolation of allocation
> 
> cgroups, memory policy and cpuset provide that
> 

Yes, and we are building on top of mempolicies. The problem becomes a little
worse when the coherent device memory node is seen as a CPUless node. I
was trying to solve 1 and 2 with the same approach.

> > 2. Isolation of certain algorithms like kswapd/auto-numa balancing
> 
> Ok that may mean adding some generic functionality to limit those

As in per-algorithm tunables? I think it would be definitely good to have
that. I do not know how well that would scale?

> 
> > > The approach sounds pretty invasive to me.
> > 
> > Could you please elaborate, you mean the user space programming bits?
> 
> No I mean the modification of the memory policies in particular. We are
> adding more exceptions to an already complex and fragile system.
> 
> Can we do this in a generic way just using hotplug nodes and some of the
> existing isolation mechanisms?
>

Yes, that was the first approach we tried and we are reusing whatever
we can -- HMM for driver driven migration, mempolicies for allocation
control and N_COHERENT_MEMORY for isolation because of 1 and 2 above
combined.
 
> 
> > Ideally we need the following:
> > 
> > 1. Transparency about being able to allocate memory anywhere and the ability
> > to migrate memory between coherent device memory and normal system memory
> 
> If it is a memory node then you have that already.
> 
> > 2. The ability to explictly allocate memory from coherent device memory
> 
> Ditto
> 
> > 3. Isolation of normal allocations from coherent device memory unless
> > explictly stated, same as (2) above
> 
> memory policies etc do that.
> 
> > 4. The ability to hotplug in and out the memory at run-time
> 
> hotplug code does that.
> 
> 
> > 5. Exchange pointers between coherent device memory and normal memory
> > for the compute on the coherent device memory to use

> 
> I dont see anything preventing that from occurring right now. Thats a
> device issue with doing proper virtual to physical mapping right?
> 

Some of these requirements come from whether we use NUMA or HMM-CDM.
We prefer NUMA and it meets the above requirements quite well.

> > I could list further things, but largely coherent device memory is like
> > system memory except that we believe that things like auto-numa balancing
> > and kswapd will not work well due to lack of information about references
> > and faults.
> 
> Ok so far I do not see that we need coherent nodes at all.
>

I presume you are suggesting this based on the fact that we add additional
infrastructure for auto-numa/kswapd/etc isolation?
 
> > Some of the mm-summit notes are at https://lwn.net/Articles/717601/
> > The goals align with HMM, except that the device memory is coherent. HMM
> > has a CDM variation as well.
> 
> I was at the presentation but at that point you were interested in a
> different approach it seems.

I do remember you were present; I don't think things have changed since then.

> 
> > We've been using the term coherent device memory (CDM). I could rephrase the
> > text and documentation for consistency. Would you prefer a different term?
> 
> Hotplug memory node?
> 

Normal memory is hotpluggable too, but I'd be fine as long as everyone agrees.

Balbir


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-21 21:15           ` Benjamin Herrenschmidt
@ 2017-04-24 13:57             ` Christoph Lameter
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2017-04-24 13:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, mhocko, arbab, vbabka

On Sat, 22 Apr 2017, Benjamin Herrenschmidt wrote:

> On Fri, 2017-04-21 at 11:13 -0500, Christoph Lameter wrote:
> > > Other things are possibly more realistic to do that way, such as
> > > taking
> > > KSM and AutoNuma off the picture for it.
> >
> > Well just pinning those pages or mlocking those will stop these
> > scans.
>
> But that will stop migration too :-) These are mostly policy
> adjustement, we need to look at other options here.

Well yes that probably means some sort of policy layer that allows the
exclusion of certain nodes from KSM and AutoNUMA.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-24  0:20       ` Balbir Singh
@ 2017-04-24 14:00         ` Christoph Lameter
  2017-04-25  0:52           ` Balbir Singh
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Lameter @ 2017-04-24 14:00 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Mon, 24 Apr 2017, Balbir Singh wrote:

> > cgroups, memory policy and cpuset provide that
> >
>
> Yes and we are building on top of mempolicies. The problem becomes a little
> worse when the coherent device memory node is seen as CPUless node. I
> was trying to solve 1 and 2 with the same approach.

Well, I think having the ability to restrict autonuma/ksm per node may also
be useful for other things, like running regular processes on node 0 and
running low-latency stuff on node 1 that should not be interrupted. Right
now you cannot do that.

> > > 2. Isolation of certain algorithms like kswapd/auto-numa balancing
> >
> > Ok that may mean adding some generic functionality to limit those
>
> As in per-algorithm tunables? I think it would be definitely good to have
> that. I do not know how well that would scale?

From what I can see, it should not be too difficult to implement a node
mask constraining those activities.

> Some of these requirements come from whether we use NUMA or HMM-CDM.
> We prefer NUMA and it meets the above requirements quite well.

Great.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-24 14:00         ` Christoph Lameter
@ 2017-04-25  0:52           ` Balbir Singh
  0 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-25  0:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Mon, 2017-04-24 at 09:00 -0500, Christoph Lameter wrote:
> On Mon, 24 Apr 2017, Balbir Singh wrote:
> 
> > > cgroups, memory policy and cpuset provide that
> > > 
> > 
> > Yes and we are building on top of mempolicies. The problem becomes a little
> > worse when the coherent device memory node is seen as CPUless node. I
> > was trying to solve 1 and 2 with the same approach.
> 
> Well I think having the ability to restrict autonuma/ksm per node may also
> be useful for other things. Like running regular processes on node 0 and
> running low latency stuff on  node 1 that should not be interrupted. Right
> now you cannot do that.
> 

I presume it also means differential allocation (applications allocating
on this node will be different) and isolation of allocation. Would you like
to restrict allocations from nodes? The one difference we have is that
coherent device memory

a. probably has compute on it which is not visible directly to the system
b. shows up as a CPUless node

From a solution perspective, today all these daemons work off of
N_MEMORY. Without going too deep and speculating, one approach could
be to create N_ISOLATED_MEMORY with tunables for each set of algorithms.

I did a quick grep and got the following list of N_MEMORY-dependent
code paths:

1. kcompactd
2. bootmem huge pages
3. memcg reclaim (soft limit)
4. mempolicy
5. migrate
6. kswapd

Which reminds me that I should fix 5 in my patchset :). For KSM I found
merge_across_nodes; I presume some of the isolation across nodes can be
achieved using it, and then by applications not using madvise MADV_MERGEABLE?

Would N_COHERENT_MEMORY meet your needs? Maybe we could call it
N_ISOLATED_MEMORY and then add tunables per algorithm?



> > > > 2. Isolation of certain algorithms like kswapd/auto-numa balancing
> > > 
> > > Ok that may mean adding some generic functionality to limit those
> > 
> > As in per-algorithm tunables? I think it would be definitely good to have
> > that. I do not know how well that would scale?
> 
> From what I can see it should not be too difficult to implement a node
> mask constraining those activities.
> 
> > Some of these requirements come from whether we use NUMA or HMM-CDM.
> > We prefer NUMA and it meets the above requirements quite well.
> 
> Great.
>

Thanks

Balbir Singh. 


* Re: [RFC 1/4] mm: create N_COHERENT_MEMORY
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
@ 2017-04-27 18:42   ` Reza Arbab
  2017-04-28  5:07     ` Balbir Singh
  0 siblings, 1 reply; 45+ messages in thread
From: Reza Arbab @ 2017-04-27 18:42 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, vbabka, cl

On Wed, Apr 19, 2017 at 05:52:39PM +1000, Balbir Singh wrote:
>In this patch we create N_COHERENT_MEMORY, which is different
>from N_MEMORY. A node hotplugged as coherent memory will have
>this state set. The expectation then is that this memory gets
>onlined like regular nodes. Memory allocation from such nodes
>occurs only when the node is contained explicitly in the
>mask.

Finally got around to test-driving this. From what I can see, as expected,
both kernel and userspace seem to ignore these nodes unless you
allocate specifically from them. Very convenient.

Is "online_coherent"/MMOP_ONLINE_COHERENT the right way to trigger this?  
That mechanism is used to specify zone, and only for a single block of 
memory. This concept applies to the node as a whole. I think it should 
be independent of memory onlining.

I mean, let's say online_kernel N blocks, some of them get allocated, 
and then you online_coherent block N+1, flipping the entire node into 
N_COHERENT_MEMORY. That doesn't seem right.

That said, this set as it stands needs an adjustment when based on top 
of Michal's onlining revamp [1]. As-is, allow_online_pfn_range() is 
returning false. The patch below fixed it for me.

[1] http://lkml.kernel.org/r/20170421120512.23960-1-mhocko@kernel.org

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4a535f1..ccb7a84 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -869,16 +869,20 @@ bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages,
 	 * though so let's stick with it for simplicity for now.
 	 * TODO make sure we do not overlap with ZONE_DEVICE
 	 */
-	if (online_type == MMOP_ONLINE_KERNEL) {
+	switch (online_type) {
+	case MMOP_ONLINE_KERNEL:
 		if (zone_is_empty(movable_zone))
 			return true;
 		return movable_zone->zone_start_pfn >= pfn + nr_pages;
-	} else if (online_type == MMOP_ONLINE_MOVABLE) {
+	case MMOP_ONLINE_MOVABLE:
 		return zone_end_pfn(normal_zone) <= pfn;
+	case MMOP_ONLINE_KEEP:
+	case MMOP_ONLINE_COHERENT:
+		/* These will always succeed and inherit the current zone */
+		return true;
 	}
 
-	/* MMOP_ONLINE_KEEP will always succeed and inherits the current zone */
-	return online_type == MMOP_ONLINE_KEEP;
+	return false;
 }
 
 static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,


-- 
Reza Arbab


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 1/4] mm: create N_COHERENT_MEMORY
  2017-04-27 18:42   ` Reza Arbab
@ 2017-04-28  5:07     ` Balbir Singh
  0 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-28  5:07 UTC (permalink / raw)
  To: Reza Arbab
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, vbabka, cl

On Thu, 2017-04-27 at 13:42 -0500, Reza Arbab wrote:
> On Wed, Apr 19, 2017 at 05:52:39PM +1000, Balbir Singh wrote:
> > In this patch we create N_COHERENT_MEMORY, which is different
> > from N_MEMORY. A node hotplugged as coherent memory will have
> > this state set. The expectation then is that this memory gets
> > onlined like regular nodes. Memory allocation from such nodes
> > occurs only when the node is contained explicitly in the
> > mask.
> 
> Finally got around to test drive this. From what I can see, as expected,
> both kernel and userspace seem to ignore these nodes, unless you 
> allocate specifically from them. Very convenient.

Thanks for testing them!

> 
> Is "online_coherent"/MMOP_ONLINE_COHERENT the right way to trigger this?  

Now that we mark the node state at boot/hotplug time, I think we can ignore
these changes.

> That mechanism is used to specify zone, and only for a single block of 
> memory. This concept applies to the node as a whole. I think it should 
> be independent of memory onlining.
> 
> I mean, let's say online_kernel N blocks, some of them get allocated, 
> and then you online_coherent block N+1, flipping the entire node into 
> N_COHERENT_MEMORY. That doesn't seem right.
> 

Agreed, I'll remove these bits in the next posting.

Thanks for the review!
Balbir Singh.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (4 preceding siblings ...)
  2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
@ 2017-05-01 20:41 ` John Hubbard
  2017-05-01 21:04   ` Reza Arbab
  2017-05-02  1:29   ` Balbir Singh
  2017-05-02 14:36 ` Michal Hocko
  6 siblings, 2 replies; 45+ messages in thread
From: John Hubbard @ 2017-05-01 20:41 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On 04/19/2017 12:52 AM, Balbir Singh wrote:
> This is a request for comments on the discussed approaches
> for coherent memory at mm-summit (some of the details are at
> https://lwn.net/Articles/717601/). The latest posted patch
> series is at https://lwn.net/Articles/713035/. I am reposting
> this as RFC, Michal Hocko suggested using HMM for CDM, but
> we believe there are stronger reasons to use the NUMA approach.
> The earlier patches for Coherent Device memory were implemented
> and designed by Anshuman Khandual.
> 

Hi Balbir,

Although I think everyone agrees that in the [very] long term, these 
hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to 
code up over the next few years, we need to get a clear idea of what has to be done 
for each possible approach.

Here, the CDM discussion is falling just a bit short, because it does not yet 
include the whole story of what we would need to do. Earlier threads pointed this 
out: the idea started as a large patchset RFC, but then, "for ease of review", it 
got turned into a smaller RFC, which loses too much context.

So, I'd suggest putting together something more complete, so that it can be fairly 
compared against the HMM-for-hardware-coherent-nodes approach.


> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> The patches do a great deal to enable CDM with HMM, but we
> still believe that HMM with CDM is not a natural way to
> represent coherent device memory and the mm will need
> to be audited and enhanced for it to even work.

That is also true for the CDM approach. Specifically, in order for this to be of any 
use to device drivers, we'll need the following:

1. A way to move pages between NUMA nodes, both virtual address and physical 
address-based, from kernel mode.

2. A way to provide reverse mapping information to device drivers, even if 
indirectly. (I'm not proposing exposing rmap, but this has to be thought through, 
because at some point, a device will need to do something with a physical page.)

This strikes me as the hardest part of the problem.

3. Detection and mitigation of page thrashing between NUMA nodes (shared 
responsibility between core -mm and device driver, but probably missing some APIs 
today).

4. Handling of oversubscription (allocating more memory than is physically on a NUMA 
node, by evicting "LRU-like" pages, rather than the current fallback to other NUMA 
nodes). Similar to (3) with respect to where we're at today.

5. Something to handle the story of bringing NUMA nodes online and putting them back 
offline, given that they require a device driver that may not yet have been loaded. 
There are a few minor missing bits there.

thanks,

--
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 20:41 ` John Hubbard
@ 2017-05-01 21:04   ` Reza Arbab
  2017-05-01 21:56     ` John Hubbard
  2017-05-02  1:29   ` Balbir Singh
  1 sibling, 1 reply; 45+ messages in thread
From: Reza Arbab @ 2017-05-01 21:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>1. A way to move pages between NUMA nodes, both virtual address and 
>physical address-based, from kernel mode.

Jerome's migrate_vma() and migrate_dma() should have this covered, 
including DMA-accelerated copy.

>5. Something to handle the story of bringing NUMA nodes online and 
>putting them back offline, given that they require a device driver that 
>may not yet have been loaded. There are a few minor missing bits there.

This has been prototyped with the driver doing memory hotplug/hotremove.  
Could you elaborate a little on what you feel is missing?

-- 
Reza Arbab


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 21:04   ` Reza Arbab
@ 2017-05-01 21:56     ` John Hubbard
  2017-05-01 23:51       ` Reza Arbab
  0 siblings, 1 reply; 45+ messages in thread
From: John Hubbard @ 2017-05-01 21:56 UTC (permalink / raw)
  To: Reza Arbab
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On 05/01/2017 02:04 PM, Reza Arbab wrote:
> On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>> 1. A way to move pages between NUMA nodes, both virtual address and physical 
>> address-based, from kernel mode.
> 
> Jérôme's migrate_vma() and migrate_dma() should have this covered, including 
> DMA-accelerated copy.

Yes, that's good. I wasn't sure from this discussion here if either or both of those 
would be used, but now I see.

Are those APIs ready for moving pages between NUMA nodes? As there is no NUMA node 
id in the API, are we relying on the pages' membership (using each page and updating 
which node it is on)?

> 
>> 5. Something to handle the story of bringing NUMA nodes online and putting them 
>> back offline, given that they require a device driver that may not yet have been 
>> loaded. There are a few minor missing bits there.
> 
> This has been prototyped with the driver doing memory hotplug/hotremove. Could you 
> elaborate a little on what you feel is missing?
> 

We just worked through how to deal with this in our driver, and I remember feeling 
worried about the way NUMA nodes can only be put online via a user space action 
(through sysfs). It seemed like you'd want to do that from kernel as well, when a 
device driver gets loaded.

I was also uneasy about user space trying to bring a node online before the 
associated device driver was loaded, and I think it would be nice to be sure that 
that whole story is looked at.

The theme here is that driver load/unload is, today, independent from the NUMA node 
online/offline, and that's a problem. Not a huge one, though, just worth enumerating 
here.

thanks
john h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 21:56     ` John Hubbard
@ 2017-05-01 23:51       ` Reza Arbab
  2017-05-01 23:58         ` John Hubbard
  0 siblings, 1 reply; 45+ messages in thread
From: Reza Arbab @ 2017-05-01 23:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On Mon, May 01, 2017 at 02:56:34PM -0700, John Hubbard wrote:
>On 05/01/2017 02:04 PM, Reza Arbab wrote:
>>On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>>>1. A way to move pages between NUMA nodes, both virtual address 
>>>and physical address-based, from kernel mode.
>>
>>Jerome's migrate_vma() and migrate_dma() should have this covered, 
>>including DMA-accelerated copy.
>
>Yes, that's good. I wasn't sure from this discussion here if either or 
>both of those would be used, but now I see.
>
>Are those APIs ready for moving pages between NUMA nodes? As there is 
>no NUMA node id in the API, are we relying on the pages' membership 
>(using each page and updating which node it is on)?

Yes. Those APIs work by callback. The alloc_and_copy() function you 
provide will be called at the appropriate point in the migration. Yours 
would allocate from a specific destination node, and copy using DMA.
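
Very roughly, and with the caveat that the ops and MIGRATE_PFN_* names
follow Jerome's posting and my_dev_dma_copy() is only a placeholder for
the driver's DMA engine, the callback could look like this untested sketch:

/*
 * Untested sketch of an alloc_and_copy() callback for the callback-based
 * migrate_vma() interface from Jerome's series.  Flag and helper names
 * follow that posting; my_dev_dma_copy() stands in for driver code.
 */
static void cdm_alloc_and_copy(struct vm_area_struct *vma,
			       const unsigned long *src,
			       unsigned long *dst,
			       unsigned long start,
			       unsigned long end,
			       void *private)
{
	int target_nid = *(int *)private;	/* destination (CDM) node */
	unsigned long addr;
	unsigned long i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		struct page *newpage;

		/* Skip entries core mm decided not to migrate. */
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		/* Allocate the destination page on the chosen node. */
		newpage = alloc_pages_node(target_nid,
					   GFP_HIGHUSER_MOVABLE, 0);
		if (!newpage)
			continue;

		/* DMA-accelerated copy, entirely driver specific. */
		my_dev_dma_copy(newpage, migrate_pfn_to_page(src[i]));

		dst[i] = migrate_pfn(page_to_pfn(newpage)) |
			 MIGRATE_PFN_LOCKED;
	}
}

So the destination node really is a driver decision; core mm just walks
the range and collects/installs the pages around the callback.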

>>>5. Something to handle the story of bringing NUMA nodes online and 
>>>putting them back offline, given that they require a device driver 
>>>that may not yet have been loaded. There are a few minor missing bits 
>>>there.
>>
>>This has been prototyped with the driver doing memory 
>>hotplug/hotremove. Could you elaborate a little on what you feel is 
>>missing?
>>
>
>We just worked through how to deal with this in our driver, and I 
>remember feeling worried about the way NUMA nodes can only be put 
>online via a user space action (through sysfs). It seemed like you'd 
>want to do that from kernel as well, when a device driver gets loaded.

That's true. I don't think we have a way to online/offline from a 
driver. To online, the alternatives are memhp_auto_online (incapable of 
doing online_movable), or udev rules (not ideal in this driver 
controlled memory use case). To offline, nothing that I know of.
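
For reference, the hotplug half is conceptually as small as the untested
sketch below (cdm_hotplug() is a made-up name, add_memory() per the
current API); the onlining step after it is exactly what has no in-kernel
interface today:

#include <linux/memory_hotplug.h>

static int cdm_hotplug(int nid, u64 start, u64 size)
{
	int ret;

	/* Creates and registers the memory blocks for the device memory. */
	ret = add_memory(nid, start, size);
	if (ret)
		return ret;

	/*
	 * Missing piece: there is no in-kernel equivalent of
	 * "echo online_movable > /sys/devices/system/memory/memoryN/state",
	 * so onlining still depends on user space.
	 */
	return 0;
}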

>I was also uneasy about user space trying to bring a node online before 
>the associated device driver was loaded, and I think it would be nice 
>to be sure that that whole story is looked at.
>
>The theme here is that driver load/unload is, today, independent from 
>the NUMA node online/offline, and that's a problem. Not a huge one, 
>though, just worth enumerating here.

-- 
Reza Arbab


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 23:51       ` Reza Arbab
@ 2017-05-01 23:58         ` John Hubbard
  2017-05-02  0:04           ` Reza Arbab
  0 siblings, 1 reply; 45+ messages in thread
From: John Hubbard @ 2017-05-01 23:58 UTC (permalink / raw)
  To: Reza Arbab
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On 05/01/2017 04:51 PM, Reza Arbab wrote:
> On Mon, May 01, 2017 at 02:56:34PM -0700, John Hubbard wrote:
>> On 05/01/2017 02:04 PM, Reza Arbab wrote:
>>> On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>>>> 1. A way to move pages between NUMA nodes, both virtual address and physical 
>>>> address-based, from kernel mode.
>>>
>>> Jérôme's migrate_vma() and migrate_dma() should have this covered, including 
>>> DMA-accelerated copy.
>>
>> Yes, that's good. I wasn't sure from this discussion here if either or both of 
>> those would be used, but now I see.
>>
>> Are those APIs ready for moving pages between NUMA nodes? As there is no NUMA node 
>> id in the API, are we relying on the pages' membership (using each page and 
>> updating which node it is on)?
> 
> Yes. Those APIs work by callback. The alloc_and_copy() function you provide will be 
> called at the appropriate point in the migration. Yours would allocate from a 
> specific destination node, and copy using DMA.
> 

hmmm, that reminds me: the whole story of "which device is this, and which NUMA node 
does it correlate to?" will have to be wired up. That is *probably* all in the 
device driver, but since I haven't worked through it, I'd be inclined to list it as 
an item on the checklist, just in case it requires some little hook in the upstream 
kernel.

thanks,
john h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 23:58         ` John Hubbard
@ 2017-05-02  0:04           ` Reza Arbab
  0 siblings, 0 replies; 45+ messages in thread
From: Reza Arbab @ 2017-05-02  0:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On Mon, May 01, 2017 at 04:58:14PM -0700, John Hubbard wrote:
>On 05/01/2017 04:51 PM, Reza Arbab wrote:
>>On Mon, May 01, 2017 at 02:56:34PM -0700, John Hubbard wrote:
>>>On 05/01/2017 02:04 PM, Reza Arbab wrote:
>>>>On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>>>>>1. A way to move pages between NUMA nodes, both virtual 
>>>>>address and physical address-based, from kernel mode.
>>>>
>>>>Jerome's migrate_vma() and migrate_dma() should have this 
>>>>covered, including DMA-accelerated copy.
>>>
>>>Yes, that's good. I wasn't sure from this discussion here if 
>>>either or both of those would be used, but now I see.
>>>
>>>Are those APIs ready for moving pages between NUMA nodes? As there 
>>>is no NUMA node id in the API, are we relying on the pages' 
>>>membership (using each page and updating which node it is on)?
>>
>>Yes. Those APIs work by callback. The alloc_and_copy() function you 
>>provide will be called at the appropriate point in the migration. 
>>Yours would allocate from a specific destination node, and copy 
>>using DMA.
>>
>
>hmmm, that reminds me: the whole story of "which device is this, and 
>which NUMA node does it correlate to?" will have to be wired up. That 
>is *probably* all in the device driver, but since I haven't worked 
>through it, I'd be inclined to list it as an item on the checklist, 
>just in case it requires some little hook in the upstream kernel.

I've worked this out. It can be contained to the driver itself.

-- 
Reza Arbab


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 20:41 ` John Hubbard
  2017-05-01 21:04   ` Reza Arbab
@ 2017-05-02  1:29   ` Balbir Singh
  2017-05-02  5:47     ` John Hubbard
  1 sibling, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-02  1:29 UTC (permalink / raw)
  To: John Hubbard, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
> On 04/19/2017 12:52 AM, Balbir Singh wrote:
> > This is a request for comments on the discussed approaches
> > for coherent memory at mm-summit (some of the details are at
> > https://lwn.net/Articles/717601/). The latest posted patch
> > series is at https://lwn.net/Articles/713035/. I am reposting
> > this as RFC, Michal Hocko suggested using HMM for CDM, but
> > we believe there are stronger reasons to use the NUMA approach.
> > The earlier patches for Coherent Device memory were implemented
> > and designed by Anshuman Khandual.
> > 
> 
> Hi Balbir,
> 
> Although I think everyone agrees that in the [very] long term, these 
> hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to 
> code up over the next few years, we need to get a clear idea of what has to be done 
> for each possible approach.
> 
> Here, the CDM discussion is falling just a bit short, because it does not yet 
> include the whole story of what we would need to do. Earlier threads pointed this 
> out: the idea started as a large patchset RFC, but then, "for ease of review", it 
> got turned into a smaller RFC, which loses too much context.

Hi, John

I thought I explained the context, but I'll try again. I see the whole solution
as a composite of the following primitives:

1. Enable hotplug of CDM nodes
2. Isolation of CDM memory
3. Migration to/from CDM memory
4. Performance enhancements for migration

The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
in the community. I think the larger goals are the same as HMM's, except that we
don't need unaddressable memory, since the memory is cache coherent.

> 
> So, I'd suggest putting together something more complete, so that it can be fairly 
> compared against the HMM-for-hardware-coherent-nodes approach.
>

Since I intend to reuse bits of HMM, I am not sure if I want to repost those
patches as a part of my RFC. I hope my answers make sense; the goal is to
reuse as much of what is already available. From a user perspective:

1. We see no new interface being added in either case, though the programming
model would differ
2. We expect the programming model to be abstracted behind a user space
framework, potentially like CUDA or CXL

 
> 
> > Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> > The patches do a great deal to enable CDM with HMM, but we
> > still believe that HMM with CDM is not a natural way to
> > represent coherent device memory and the mm will need
> > to be audited and enhanced for it to even work.
> 
> That is also true for the CDM approach. Specifically, in order for this to be of any 
> use to device drivers, we'll need the following:
>

Since Reza answered these questions, I'll skip them in this email

Thanks for the review!
Balbir Singh 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02  1:29   ` Balbir Singh
@ 2017-05-02  5:47     ` John Hubbard
  2017-05-02  7:23       ` Balbir Singh
  0 siblings, 1 reply; 45+ messages in thread
From: John Hubbard @ 2017-05-02  5:47 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl



On 05/01/2017 06:29 PM, Balbir Singh wrote:
> On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
>> On 04/19/2017 12:52 AM, Balbir Singh wrote:
>>> This is a request for comments on the discussed approaches
>>> for coherent memory at mm-summit (some of the details are at
>>> https://lwn.net/Articles/717601/). The latest posted patch
>>> series is at https://lwn.net/Articles/713035/. I am reposting
>>> this as RFC, Michal Hocko suggested using HMM for CDM, but
>>> we believe there are stronger reasons to use the NUMA approach.
>>> The earlier patches for Coherent Device memory were implemented
>>> and designed by Anshuman Khandual.
>>>
>>
>> Hi Balbir,
>>
>> Although I think everyone agrees that in the [very] long term, these
>> hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to
>> code up over the next few years, we need to get a clear idea of what has to be done
>> for each possible approach.
>>
>> Here, the CDM discussion is falling just a bit short, because it does not yet
>> include the whole story of what we would need to do. Earlier threads pointed this
>> out: the idea started as a large patchset RFC, but then, "for ease of review", it
>> got turned into a smaller RFC, which loses too much context.
> 
> Hi, John
> 
> I thought I explained the context, but I'll try again. I see the whole solution
> as a composite of the following primitives:
> 
> 1. Enable hotplug of CDM nodes
> 2. Isolation of CDM memory
> 3. Migration to/from CDM memory
> 4. Performance enhancements for migration
> 

So, there is a little more required than the above, which is why I made that short
list. I'm in particular concerned about the various system calls that userspace can
make to control NUMA memory; the device drivers will need notification (probably
mmu_notifiers, I guess), and once they get notification, in many cases they'll need
some way to deal with reverse mapping.

HMM provides all of that support, so it needs to happen here, too.
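
To be concrete about the notification half, the sort of thing I have in
mind is a bare-bones, untested sketch like the one below (only the start
callback is shown; its body is obviously device specific):

/*
 * Untested sketch: a CDM driver registering an mmu_notifier so it learns
 * about invalidations done by core mm against the address space it is
 * tracking.  Everything except the start callback is elided.
 */
#include <linux/mmu_notifier.h>

static void cdm_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	/* Tear down device TLB / page-table entries covering [start, end). */
}

static const struct mmu_notifier_ops cdm_mn_ops = {
	.invalidate_range_start	= cdm_invalidate_range_start,
};

static struct mmu_notifier cdm_mn = {
	.ops = &cdm_mn_ops,
};

static int cdm_track_mm(struct mm_struct *mm)
{
	return mmu_notifier_register(&cdm_mn, mm);
}

The rmap side is the part that has no equivalent helper today, which is
why I keep bringing it up.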



> The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
> in the community. I think the larger goals are same as HMM, except that we
> don't need unaddressable memory, since the memory is cache coherent.
> 
>>
>> So, I'd suggest putting together something more complete, so that it can be fairly
>> compared against the HMM-for-hardware-coherent-nodes approach.
>>
> 
> Since I intend to reuse bits of HMM, I am not sure if I want to repost those
> patches as a part of my RFC. I hope my answers make sense, the goal is to
> reuse as much of what is available. From a user perspective

It's hard to keep track of what the plan is, so explaining exactly what you're doing 
helps.

> 
> 1. We see no new interface being added in either case, the programming model
> would differ though
> 2. We expect the programming model to be abstracted behind a user space
> framework, potentially like CUDA or CXL
> 
>   
>>
>>> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
>>> The patches do a great deal to enable CDM with HMM, but we
>>> still believe that HMM with CDM is not a natural way to
>>> represent coherent device memory and the mm will need
>>> to be audited and enhanced for it to even work.
>>
>> That is also true for the CDM approach. Specifically, in order for this to be of any
>> use to device drivers, we'll need the following:
>>
> 
> Since Reza answered these questions, I'll skip them in this email

Yes, but he skipped over the rmap question, which I think is an important one.

thanks
john h

> 
> Thanks for the review!
> Balbir Singh
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02  5:47     ` John Hubbard
@ 2017-05-02  7:23       ` Balbir Singh
  2017-05-02 17:50         ` John Hubbard
  0 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-02  7:23 UTC (permalink / raw)
  To: John Hubbard, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On Mon, 2017-05-01 at 22:47 -0700, John Hubbard wrote:
> 
> On 05/01/2017 06:29 PM, Balbir Singh wrote:
> > On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
> > > On 04/19/2017 12:52 AM, Balbir Singh wrote:
> > > > This is a request for comments on the discussed approaches
> > > > for coherent memory at mm-summit (some of the details are at
> > > > https://lwn.net/Articles/717601/). The latest posted patch
> > > > series is at https://lwn.net/Articles/713035/. I am reposting
> > > > this as RFC, Michal Hocko suggested using HMM for CDM, but
> > > > we believe there are stronger reasons to use the NUMA approach.
> > > > The earlier patches for Coherent Device memory were implemented
> > > > and designed by Anshuman Khandual.
> > > > 
> > > 
> > > Hi Balbir,
> > > 
> > > Although I think everyone agrees that in the [very] long term, these
> > > hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to
> > > code up over the next few years, we need to get a clear idea of what has to be done
> > > for each possible approach.
> > > 
> > > Here, the CDM discussion is falling just a bit short, because it does not yet
> > > include the whole story of what we would need to do. Earlier threads pointed this
> > > out: the idea started as a large patchset RFC, but then, "for ease of review", it
> > > got turned into a smaller RFC, which loses too much context.
> > 
> > Hi, John
> > 
> > I thought I explained the context, but I'll try again. I see the whole solution
> > as a composite of the following primitives:
> > 
> > 1. Enable hotplug of CDM nodes
> > 2. Isolation of CDM memory
> > 3. Migration to/from CDM memory
> > 4. Performance enhancements for migration
> > 
> 
> So, there is a little more than the above required, which is why I made that short 
> list. I'm in particular concerned about the various system calls that userspace can 
> make to control NUMA memory, and the device drivers will need notification (probably 
> mmu_notifiers, I guess), and once they get notification, in many cases they'll need 
> some way to deal with reverse mapping.

Are you suggesting that the system calls user space makes should be audited to
check whether they should be used with a CDM device? I would
think a whole lot of this should be transparent to user space, unless it opts
in to using CDM and explicitly wants to allocate and free memory -- the whole
isolation premise. W.r.t. device drivers, are you suggesting that the device
driver needs to know the state of each page -- free/in-use? Reverse mapping
for migration?

> 
> HMM provides all of that support, so it needs to happen here, too.
> 
> 
> 
> > The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
> > in the community. I think the larger goals are same as HMM, except that we
> > don't need unaddressable memory, since the memory is cache coherent.
> > 
> > > 
> > > So, I'd suggest putting together something more complete, so that it can be fairly
> > > compared against the HMM-for-hardware-coherent-nodes approach.
> > > 
> > 
> > Since I intend to reuse bits of HMM, I am not sure if I want to repost those
> > patches as a part of my RFC. I hope my answers make sense, the goal is to
> > reuse as much of what is available. From a user perspective
> 
> It's hard to keep track of what the plan is, so explaining exactly what you're doing 
> helps.
> 

Fair enough, I hope I answered the questions?

> > 
> > 1. We see no new interface being added in either case, the programming model
> > would differ though
> > 2. We expect the programming model to be abstracted behind a user space
> > framework, potentially like CUDA or CXL
> > 
> >   
> > > 
> > > > Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> > > > The patches do a great deal to enable CDM with HMM, but we
> > > > still believe that HMM with CDM is not a natural way to
> > > > represent coherent device memory and the mm will need
> > > > to be audited and enhanced for it to even work.
> > > 
> > > That is also true for the CDM approach. Specifically, in order for this to be of any
> > > use to device drivers, we'll need the following:
> > > 
> > 
> > Since Reza answered these questions, I'll skip them in this email
> 
> Yes, but he skipped over the rmap question, which I think is an important one.
>

If it is for migration, then we are going to rely on changes from HMM-CDM.
How does HMM deal with the rmap case? I presume it is not required for
unaddressable memory?

Balbir Singh. 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (5 preceding siblings ...)
  2017-05-01 20:41 ` John Hubbard
@ 2017-05-02 14:36 ` Michal Hocko
  2017-05-04  5:26   ` Balbir Singh
  6 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2017-05-02 14:36 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, arbab, vbabka, cl

On Wed 19-04-17 17:52:38, Balbir Singh wrote:
> This is a request for comments on the discussed approaches
> for coherent memory at mm-summit (some of the details are at
> https://lwn.net/Articles/717601/). The latest posted patch
> series is at https://lwn.net/Articles/713035/. I am reposting
> this as RFC, Michal Hocko suggested using HMM for CDM, but
> we believe there are stronger reasons to use the NUMA approach.
> The earlier patches for Coherent Device memory were implemented
> and designed by Anshuman Khandual.
> 
> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> The patches do a great deal to enable CDM with HMM, but we
> still believe that HMM with CDM is not a natural way to
> represent coherent device memory and the mm will need
> to be audited and enhanced for it to even work.
> 
> With HMM we'll see ZONE_DEVICE pages mapped into
> user space and that would mean a thorough audit of all code
> paths to make sure we are ready for such a use case and enabling
> those use cases, like with HMM CDM patch 1, which changes
> move_pages() and migration paths. I've done a quick
> evaluation to check for features and found limitationd around
> features like migration (page cache
> migration), fault handling to the right location
> (direct page cache allocation in the coherent memory), mlock
> handling, RSS accounting, memcg enforcement for pages not on LRU, etc.

Are those problems not viable to solve?

[...]
> Introduction
> 
> CDM device memory is cache coherent with system memory and we would like
> this to show up as a NUMA node, however there are certain algorithms
> that might not be currently suitable for N_COHERENT_MEMORY
> 
> 1. AutoNUMA balancing

OK, I can see a reason for that but theoretically the same applies to
cpuless numa nodes in general, no?

> 2. kswapd reclaim

How is the memory reclaim handled then? How are users expected to handle
OOM situation?

> The reason for exposing this device memory as NUMA is to simplify
> the programming model, where memory allocation via malloc() or
> mmap() for example would seamlessly work across both kinds of
> memory. Since we expect the size of device memory to be smaller
> than system RAM, we would like to control the allocation of such
> memory. The proposed mechanism reuses nodemasks and explicit
> specification of the coherent node in the nodemask for allocation
> from device memory. This implementation also allows for kernel
> level allocation via __GFP_THISNODE and existing techniques
> such as page migration to work.

so it basically resembles isol_cpus except for memory, right. I believe
scheduler people are more than unhappy about this interface...

Anyway, I consider CPUless nodes a dirty hack (especially when I see
them mostly used with poorly configured LPARs where no CPUs are left for
a particular memory).  Now this is trying to extend this concept even
further to a memory which is not reclaimable by the kernel and requires
an explicit and cooperative memory reclaim from userspace. How is this
going to work? The memory also has different reliability properties
from RAM which user space doesn't have any clue about from the NUMA
properties exported. Or am I misunderstanding it? That all sounds quite
scary to me.

I very much agree with the last email from Mel and I would really like
to see how would a real application benefit from these nodes.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02  7:23       ` Balbir Singh
@ 2017-05-02 17:50         ` John Hubbard
  0 siblings, 0 replies; 45+ messages in thread
From: John Hubbard @ 2017-05-02 17:50 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On 05/02/2017 12:23 AM, Balbir Singh wrote:
> On Mon, 2017-05-01 at 22:47 -0700, John Hubbard wrote:
>>
>> On 05/01/2017 06:29 PM, Balbir Singh wrote:
>>> On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
>>>> On 04/19/2017 12:52 AM, Balbir Singh wrote:
[...]
>>> 1. Enable hotplug of CDM nodes
>>> 2. Isolation of CDM memory
>>> 3. Migration to/from CDM memory
>>> 4. Performance enhancements for migration
>>>
>>
>> So, there is a little more than the above required, which is why I made that short
>> list. I'm in particular concerned about the various system calls that userspace can
>> make to control NUMA memory, and the device drivers will need notification (probably
>> mmu_notifiers, I guess), and once they get notification, in many cases they'll need
>> some way to deal with reverse mapping.
> 
> Are you suggesting that the system calls user space makes should be audited to
> check whether they should be used with a CDM device? I would
> think a whole lot of this should be transparent to user space, unless it opts
> in to using CDM and explicitly wants to allocate and free memory -- the whole
> isolation premise. W.r.t. device drivers, are you suggesting that the device
> driver needs to know the state of each page -- free/in-use? Reverse mapping
> for migration?
> 

Interesting question. No, I was not going in that direction (auditing the various system calls...) at
all, actually. Rather, I was expecting this system to interact as normally as possible with all
of the system calls, and that is what led me to expect that some combination of "device driver + 
enhanced NUMA subsystem" would need to do rmap lookups.

Going through and special-casing CDM for various system calls would probably not be well-received, 
because it would be an indication of force-fitting this into the NUMA model before it's ready, right?

>>
>> HMM provides all of that support, so it needs to happen here, too.
>>
>>
>>
>>> The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
>>> in the community. I think the larger goals are same as HMM, except that we
>>> don't need unaddressable memory, since the memory is cache coherent.
>>>
>>>>
>>>> So, I'd suggest putting together something more complete, so that it can be fairly
>>>> compared against the HMM-for-hardware-coherent-nodes approach.
>>>>
>>>
>>> Since I intend to reuse bits of HMM, I am not sure if I want to repost those
>>> patches as a part of my RFC. I hope my answers make sense, the goal is to
>>> reuse as much of what is available. From a user perspective
>>
>> It's hard to keep track of what the plan is, so explaining exactly what you're doing
>> helps.
>>
> 
> Fair enough, I hope I answered the questions?

Yes, thanks.

>>>
>>> 1. We see no new interface being added in either case, the programming model
>>> would differ though
>>> 2. We expect the programming model to be abstracted behind a user space
>>> framework, potentially like CUDA or CXL
>>>
>>>    
>>>>
>>>>> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
>>>>> The patches do a great deal to enable CDM with HMM, but we
>>>>> still believe that HMM with CDM is not a natural way to
>>>>> represent coherent device memory and the mm will need
>>>>> to be audited and enhanced for it to even work.
>>>>
>>>> That is also true for the CDM approach. Specifically, in order for this to be of any
>>>> use to device drivers, we'll need the following:
>>>>
>>>
>>> Since Reza answered these questions, I'll skip them in this email
>>
>> Yes, but he skipped over the rmap question, which I think is an important one.
>>
> 
> If it is for migration, then we are going to rely on changes from HMM-CDM.
> How does HMM deal with the rmap case? I presume it is not required for
> unaddressable memory?
> 
> Balbir Singh.
> 

That's correct, we don't need rmap access for device drivers in the "pure HMM" case, because the HMM 
core handles it.

thanks
john h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02 14:36 ` Michal Hocko
@ 2017-05-04  5:26   ` Balbir Singh
  2017-05-04 12:52     ` Michal Hocko
  0 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-04  5:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, arbab, vbabka, cl

On Tue, 2017-05-02 at 16:36 +0200, Michal Hocko wrote:
> On Wed 19-04-17 17:52:38, Balbir Singh wrote:
> > This is a request for comments on the discussed approaches
> > for coherent memory at mm-summit (some of the details are at
> > https://lwn.net/Articles/717601/). The latest posted patch
> > series is at https://lwn.net/Articles/713035/. I am reposting
> > this as RFC, Michal Hocko suggested using HMM for CDM, but
> > we believe there are stronger reasons to use the NUMA approach.
> > The earlier patches for Coherent Device memory were implemented
> > and designed by Anshuman Khandual.
> > 
> > Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> > The patches do a great deal to enable CDM with HMM, but we
> > still believe that HMM with CDM is not a natural way to
> > represent coherent device memory and the mm will need
> > to be audited and enhanced for it to even work.
> > 
> > With HMM we'll see ZONE_DEVICE pages mapped into
> > user space and that would mean a thorough audit of all code
> > paths to make sure we are ready for such a use case and enabling
> > those use cases, like with HMM CDM patch 1, which changes
> > move_pages() and migration paths. I've done a quick
> > evaluation to check for features and found limitationd around
> > features like migration (page cache
> > migration), fault handling to the right location
> > (direct page cache allocation in the coherent memory), mlock
> > handling, RSS accounting, memcg enforcement for pages not on LRU, etc.
> 
> Are those problems not viable to solve?

Yes, except IIUC the direct page cache allocation one. The reason for calling
them out is to make it clear that HMM CDM would require new mm changes/audits
to support ZONE_DEVICE pages across several parts of the mm subsystem.

> 
> [...]
> > Introduction
> > 
> > CDM device memory is cache coherent with system memory and we would like
> > this to show up as a NUMA node, however there are certain algorithms
> > that might not be currently suitable for N_COHERENT_MEMORY
> > 
> > 1. AutoNUMA balancing
> 
> OK, I can see a reason for that but theoretically the same applies to
> cpuless numa nodes in general, no?


That is correct. Christoph has shown some interest in isolating some
algorithms as well. I have some ideas that I can send out later.

> 
> > 2. kswapd reclaim
> 
> How is the memory reclaim handled then? How are users expected to handle
> OOM situation?
> 

1. The fallback node list for coherent memory includes regular memory
   nodes
2. Direct reclaim works, I've tested it

> > The reason for exposing this device memory as NUMA is to simplify
> > the programming model, where memory allocation via malloc() or
> > mmap() for example would seamlessly work across both kinds of
> > memory. Since we expect the size of device memory to be smaller
> > than system RAM, we would like to control the allocation of such
> > memory. The proposed mechanism reuses nodemasks and explicit
> > specification of the coherent node in the nodemask for allocation
> > from device memory. This implementation also allows for kernel
> > level allocation via __GFP_THISNODE and existing techniques
> > such as page migration to work.
> 
> so it basically resembles isol_cpus except for memory, right. I believe
> scheduler people are more than unhappy about this interface...
>

isol_cpus were for an era when timer/interrupts and other scheduler
infrastructure present today was not around, but I don't mean to digress.
 
> Anyway, I consider CPUless nodes a dirty hack (especially when I see
> them mostly used with poorly configured LPARs where no CPUs are left for
> a particular memory).  Now this is trying to extend this concept even
> further to a memory which is not reclaimable by the kernel and requires

Direct reclaim still works

> an explicit and cooperative memory reclaim from userspace. How is this
> going to work? The memory also has a different reliability properties
> from RAM which user space doesn't have any clue about from the NUMA
> properties exported. Or am I misunderstanding it? That all sounds quite
> scary to me.
> 
> I very much agree with the last email from Mel and I would really like
> to see how would a real application benefit from these nodes.
>

I see two use cases:

1. Aware application/library - allocates from this node and uses this memory
2. Unaware application/library - allocates memory anywhere, but does not use
CDM memory by default, since it is isolated.

Both 1 and 2 can work together, and an aware application can use an unaware
library and, if required, migrate pages between the two. Both 1 and 2
can access each other's memory due to coherency, so the final application
level use case is similar to HMM. That is why HMM-CDM and NUMA-CDM are
equivalent from an application programming model perspective,
except for the limitations mentioned above.
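
For (1), the aware case is just mempolicy with an explicit nodemask. A
minimal, untested userspace sketch (the CDM node id is passed in by the
caller; link with -lnuma for the mbind() wrapper):

/*
 * Untested sketch of the "aware" case: bind an anonymous mapping to the
 * coherent device node via MPOL_BIND with an explicit nodemask.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stddef.h>

static void *alloc_on_cdm_node(size_t len, int cdm_node)
{
	unsigned long nodemask = 1UL << cdm_node;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* The explicit nodemask is what makes the CDM node eligible at all. */
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

The unaware case needs no code at all, which is the point of keeping the
node out of the default fallback lists.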

Balbir Singh.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04  5:26   ` Balbir Singh
@ 2017-05-04 12:52     ` Michal Hocko
  2017-05-04 15:49       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2017-05-04 12:52 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, arbab, vbabka, cl

On Thu 04-05-17 15:26:55, Balbir Singh wrote:
> On Tue, 2017-05-02 at 16:36 +0200, Michal Hocko wrote:
> > On Wed 19-04-17 17:52:38, Balbir Singh wrote:
[...]
> > > 2. kswapd reclaim
> > 
> > How is the memory reclaim handled then? How are users expected to handle
> > OOM situation?
> > 
> 
> 1. The fallback node list for coherent memory includes regular memory
>    nodes
> 2. Direct reclaim works, I've tested it

But the direct reclaim would be effective only _after_ all other nodes
are full.

I thought that kswapd reclaim is a problem because the HW doesn't
support aging properly but as the direct reclaim works then what is the
actual problem?
 
> > > The reason for exposing this device memory as NUMA is to simplify
> > > the programming model, where memory allocation via malloc() or
> > > mmap() for example would seamlessly work across both kinds of
> > > memory. Since we expect the size of device memory to be smaller
> > > than system RAM, we would like to control the allocation of such
> > > memory. The proposed mechanism reuses nodemasks and explicit
> > > specification of the coherent node in the nodemask for allocation
> > > from device memory. This implementation also allows for kernel
> > > level allocation via __GFP_THISNODE and existing techniques
> > > such as page migration to work.
> > 
> > so it basically resembles isol_cpus except for memory, right. I believe
> > scheduler people are more than unhappy about this interface...
> >
> 
> isol_cpus were for an era when timer/interrupts and other scheduler
> infrastructure present today was not around, but I don't mean to digress.

AFAIU, it has been added to _isolate_ some cpus from the scheduling domain
and have them available for explicit affinity usage. You are
effectively proposing the same thing.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 12:52     ` Michal Hocko
@ 2017-05-04 15:49       ` Benjamin Herrenschmidt
  2017-05-04 17:33         ` Dave Hansen
  2017-05-05 14:52         ` Michal Hocko
  0 siblings, 2 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-04 15:49 UTC (permalink / raw)
  To: Michal Hocko, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> But the direct reclaim would be effective only _after_ all other nodes
> are full.
> 
> I thought that kswapd reclaim is a problem because the HW doesn't
> support aging properly but as the direct reclaim works then what is the
> actual problem?

Ageing isn't completely broken. The ATS MMU supports
dirty/accessed just fine.

However the TLB invalidations are quite expensive with a GPU so too
much harvesting is detrimental, and the GPU tends to check pages out
using a special "read with intent to write" mode, which means it almost
always sets the dirty bit if the page is writable to begin with.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 15:49       ` Benjamin Herrenschmidt
@ 2017-05-04 17:33         ` Dave Hansen
  2017-05-05  3:17           ` Balbir Singh
  2017-05-05  7:49           ` Benjamin Herrenschmidt
  2017-05-05 14:52         ` Michal Hocko
  1 sibling, 2 replies; 45+ messages in thread
From: Dave Hansen @ 2017-05-04 17:33 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Michal Hocko, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On 05/04/2017 08:49 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
>> But the direct reclaim would be effective only _after_ all other nodes
>> are full.
>>
>> I thought that kswapd reclaim is a problem because the HW doesn't
>> support aging properly but as the direct reclaim works then what is the
>> actual problem?
> 
> Ageing isn't completely broken. The ATS MMU supports
> dirty/accessed just fine.
> 
> However the TLB invalidations are quite expensive with a GPU so too
> much harvesting is detrimental, and the GPU tends to check pages out
> using a special "read with intent to write" mode, which means it almost
> always sets the dirty bit if the page is writable to begin with.

Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
that it can keep things in the TLB for super-long periods of time?

We don't flush the TLB on clearing Accessed on x86 normally.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 17:33         ` Dave Hansen
@ 2017-05-05  3:17           ` Balbir Singh
  2017-05-05 14:51             ` Dave Hansen
  2017-05-05  7:49           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-05  3:17 UTC (permalink / raw)
  To: Dave Hansen, Benjamin Herrenschmidt, Michal Hocko
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Thu, 2017-05-04 at 10:33 -0700, Dave Hansen wrote:
> On 05/04/2017 08:49 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> > > But the direct reclaim would be effective only _after_ all other nodes
> > > are full.
> > > 
> > > I thought that kswapd reclaim is a problem because the HW doesn't
> > > support aging properly but as the direct reclaim works then what is the
> > > actual problem?
> > 
> > Ageing isn't completely broken. The ATS MMU supports
> > dirty/accessed just fine.
> > 
> > However the TLB invalidations are quite expensive with a GPU so too
> > much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.
> 
> Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
> that it can keep things in the TLB for super-long periods of time?
> 
> We don't flush the TLB on clearing Accessed on x86 normally.

Isn't that mostly because x86 relies on non-global pages to be flushed
on context switch?

Balbir Singh.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 17:33         ` Dave Hansen
  2017-05-05  3:17           ` Balbir Singh
@ 2017-05-05  7:49           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-05  7:49 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Thu, 2017-05-04 at 10:33 -0700, Dave Hansen wrote:
> > However the TLB invalidations are quite expensive with a GPU so too
> > much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.
> 
> Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
> that it can keep things in the TLB for super-long periods of time?
> 
> We don't flush the TLB on clearing Accessed on x86 normally.

We don't *have* to but there is no telling when it will get set again.

I always found the non-invalidation of the TLB for harvesting
"Accessed" on x86 chancy ... if a process pounds on a handful of pages
heavily, they never get seen as accessed, which is just plain weird.

But yes, we can do the same thing.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05  3:17           ` Balbir Singh
@ 2017-05-05 14:51             ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2017-05-05 14:51 UTC (permalink / raw)
  To: Balbir Singh, Benjamin Herrenschmidt, Michal Hocko
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On 05/04/2017 08:17 PM, Balbir Singh wrote:
>>> However the TLB invalidations are quite expensive with a GPU so too
>>> much harvesting is detrimental, and the GPU tends to check pages out
>>> using a special "read with intent to write" mode, which means it almost
>>> always sets the dirty bit if the page is writable to begin with.
>> Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
>> that it can keep things in the TLB for super-long periods of time?
>>
>> We don't flush the TLB on clearing Accessed on x86 normally.
> Isn't that mostly because x86 relies on non-global pages to be flushed
> on context switch?

Well, that's not the case with Process Context Identifiers.  Somebody
will enable those some day.  It also isn't true for a long-lived process
camping on a CPU core.

I don't know about "mostly", but it's certainly a combination of stuff
having to be reloaded in the TLB and flushed at context switch today.
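
For reference, this is roughly how the "no TLB flush when clearing Accessed"
behaviour looks on the kernel side (a simplified sketch modelled on x86's
arch/x86/mm/pgtable.c; the exact code differs between kernel versions):

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        /*
         * Clear the Accessed bit without flushing the TLB.  Worst case,
         * a page that is still hot in the TLB is not marked young again
         * until its entry is evicted or flushed at context switch.
         */
        return ptep_test_and_clear_young(vma, address, ptep);
}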


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 15:49       ` Benjamin Herrenschmidt
  2017-05-04 17:33         ` Dave Hansen
@ 2017-05-05 14:52         ` Michal Hocko
  2017-05-05 15:57           ` Benjamin Herrenschmidt
  2017-05-09  7:51           ` Balbir Singh
  1 sibling, 2 replies; 45+ messages in thread
From: Michal Hocko @ 2017-05-05 14:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Thu 04-05-17 17:49:21, Benjamin Herrenschmidt wrote:
> On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> > But the direct reclaim would be effective only _after_ all other nodes
> > are full.
> > 
> > I thought that kswapd reclaim is a problem because the HW doesn't
> > support aging properly but as the direct reclaim works then what is the
> > actual problem?
> 
> > Ageing isn't completely broken. The ATS MMU supports
> dirty/accessed just fine.
> 
> However the TLB invalidations are quite expensive with a GPU so too
> much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.

This sounds pretty much like a HW-specific detail, which is not the
right criterion to design a general CDM around.

So let me repeat the fundamental question. Is the only difference from
cpuless nodes the fact that the node should be invisible to processes
unless they specify an explicit node mask? If yes then we are talking
about policy in the kernel and that sounds like a big no-no to me.
Moreover cpusets already support exclusive numa nodes AFAIR.

I am either missing something important here, and the discussion so far
hasn't helped to be honest, or this whole CDM effort tries to build a
generic interface around a _specific_ piece of HW. The matter is made worse
by the fact that the described usecases are so vague that it is hard to
build a good picture whether this is generic enough that a new/different
HW will still fit into this picture.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 14:52         ` Michal Hocko
@ 2017-05-05 15:57           ` Benjamin Herrenschmidt
  2017-05-05 17:48             ` Jerome Glisse
  2017-05-09 11:36             ` Michal Hocko
  2017-05-09  7:51           ` Balbir Singh
  1 sibling, 2 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-05 15:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> 
> This sounds pretty much like a HW-specific detail, which is not the
> right criterion to design a general CDM around.

Which is why I don't see what's the problem with simply making this
a hot-plugged NUMA node, since it's basically what it is with a
"different" kind of CPU, possibly covered with a CMA, which provides
both some isolation and the ability to do large physical allocations
for applications that choose to use the legacy programming interfaces and
manually control the memory.

Then, the "issues" with things like reclaim, autonuma can be handled
with policy tunables. Possibly node attributes.

It seems to me that such a model fits well in the picture where we are
heading not just with GPUs, but with OpenCAPI based memory, CCIX or
other similar technologies that can provide memory possibly with co-
located acceleration devices.

It also mostly already just works.
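
As a rough illustration of that model (a sketch only: the function name and
error handling are hypothetical, add_memory() is the real hotplug entry
point, and the new blocks still need to be onlined, typically as movable):

/* Hypothetical driver hook exposing coherent device memory as a
 * hot-plugged, CPU-less NUMA node. */
static int cdm_online_device_memory(int nid, u64 start, u64 size)
{
        int ret;

        /* Register the range with the memory hotplug core; this
         * creates/extends node `nid` and its memory blocks. */
        ret = add_memory(nid, start, size);
        if (ret)
                return ret;

        /* Once onlined (e.g. into ZONE_MOVABLE), the node behaves like
         * any other CPU-less memory node: reachable via nodemasks,
         * __GFP_THISNODE, page migration, and so on. */
        return 0;
}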

> So let me repeat the fundamental question. Is the only difference from
> cpuless nodes the fact that the node should be invisible to processes
> unless they specify an explicit node mask?

It would be *preferable* that it is.

It's not necessarily an absolute requirement as long as what lands
there can be kicked out. However the system would potentially be
performing poorly if too much unrelated stuff lands on the GPU memory
as it has a much higher latency.

Due to the nature of GPUs (and possibly other such accelerators but not
necessarily all of them), that memory is also more likely to fail. GPUs
crash often. However that isn't necessarily true of OpenCAPI devices or
CCIX.

This is the kind of attributes of the memory (quality ?) that can be
provided by the driver that is putting it online. We can then
orthogonally decide how we chose (or not) to take those into account,
either in the default mm algorithms or from explicit policy mechanisms
set from userspace, but the latter is often awkward and never done
right.

>  If yes then we are talking
> about policy in the kernel and that sounds like a big no-no to me.

It makes sense to expose a concept of "characteristics" of a given
memory node that affect the various policies the user can set.

It makes sense to have "default" policy models selected.

Policies aren't always decided in the kernel indeed (though they are
more often than not, face it, most of the time, leaving it to userspace
results in things simply not working). However the mechanisms by which
the policy is applied are in the kernel.

> Moreover cpusets already support exclusive numa nodes AFAIR.

Which implies that the user would have to do explicit cpuset
manipulations for the system to work right? Most users wouldn't, and the
result is that most users would have badly working systems. That's almost
always what happens when we choose to bounce *all* policy decisions to
the user without the kernel attempting to have some kind of semi-sane
default.

> I am either missing something important here, and the discussion so far
> hasn't helped to be honest, or this whole CDM effort tries to build a
> generic interface around a _specific_ piece of HW.

No. You guys have just been sticking your heads in the sand for months
for reasons I can't quite understand completely :-)

There is a definite direction out there for devices to participate in
cache coherency and to operate within user process MMU contexts. This
is what the GPUs on P9 will be doing via nvlink, but this will also be
possible with technologies like OpenCAPI, I believe CCIX, etc...

This is by no means a special case.

> The matter is made worse
> by the fact that the described usecases are so vague that it is hard to
> build a good picture whether this is generic enough that a new/different
> HW will still fit into this picture.

The GPU use case is rather trivial.

The end goal is to simply have accelerators transparently operate in
userspace context, along with the ability to migrate pages to the memory
that is the most efficient for a given operation.

Thus for example, mmap a large file (page cache) and have the program
pass a pointer to that mmap to a GPU program that starts churning on
it.

In the specific GPU case, we have HW on the link telling us the pages
are pounded on remotely, allowing us to trigger migration toward GPU
memory (but the other way works too).
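
(To make the mechanism concrete: once the device memory is a node, even the
existing move_pages(2) interface can express this placement from userspace.
A minimal sketch, where cdm_node is a hypothetical node id for the device;
the driver-directed path described above would use the in-kernel migration
machinery instead.)

#include <numaif.h>

/* Move the page backing `addr` to the (hypothetical) CDM node. */
static long move_to_cdm(void *addr, int cdm_node)
{
        void *pages[1]  = { addr };
        int   nodes[1]  = { cdm_node };
        int   status[1];

        /* pid 0 == current process; status[0] reports the resulting
         * node or a negative errno for that page. */
        return move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
}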

The problem with the HMM based approach is that it is based on
ZONE_DEVICE. This means "special" struct pages that aren't in LRU and
implies, at least that's my understanding, piles of special cases all
over the place to deal with them, along with various APIs etc... that
don't work with such pages.

So it makes it difficult to be able to pick up anything mapped into a
process address space, whether it is page cache pages, shared memory,
etc... and migrate it to GPU pages.

At least, that's my understanding and Jerome somewhat confirmed it,
we'd end up fighting an uphill battle dealing with all those special
cases. HMM is well suited for non-coherent systems with a distinct MMU
translation on the remote device.

This is why we think a NUMA based approach is a lot simpler. We start
by having the GPU memory be "normal" memory, and then we look at what
needs to be done to improve the default system behaviour and policies
to take into account its slightly different characteristics.

Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 15:57           ` Benjamin Herrenschmidt
@ 2017-05-05 17:48             ` Jerome Glisse
  2017-05-05 17:59               ` Benjamin Herrenschmidt
  2017-05-09 11:36             ` Michal Hocko
  1 sibling, 1 reply; 45+ messages in thread
From: Jerome Glisse @ 2017-05-05 17:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michal Hocko, Balbir Singh, linux-mm, akpm, khandual,
	aneesh.kumar, paulmck, srikar, haren, mgorman, arbab, vbabka, cl

On Fri, May 05, 2017 at 05:57:02PM +0200, Benjamin Herrenschmidt wrote:
> On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> > 
> > This sounds pretty much like a HW-specific detail, which is not the
> > right criterion to design a general CDM around.
> 
> Which is why I don't see what's the problem with simply making this
> a hot-plugged NUMA node, since it's basically what it is with a
> "different" kind of CPU, possibly covered with a CMA, which provides
> both some isolation and the ability to do large physical allocations
> for applications that choose to use the legacy programming interfaces and
> manually control the memory.
> 
> Then, the "issues" with things like reclaim, autonuma can be handled
> with policy tunables. Possibly node attributes.
> 
> It seems to me that such a model fits well in the picture where we are
> heading not just with GPUs, but with OpenCAPI based memory, CCIX or
> other similar technologies that can provide memory possibly with co-
> located acceleration devices.
> 
> It also mostly already just works.
> 
> > So let me repeat the fundamental question. Is the only difference from
> > cpuless nodes the fact that the node should be invisible to processes
> > unless they specify an explicit node mask?
> 
> It would be *preferable* that it is.
> 
> It's not necessarily an absolute requirement as long as what lands
> there can be kicked out. However the system would potentially be
> performing poorly if too much unrelated stuff lands on the GPU memory
> as it has a much higher latency.
> 
> Due to the nature of GPUs (and possibly other such accelerators but not
> necessarily all of them), that memory is also more likely to fail. GPUs
> crash often. However that isn't necessarily true of OpenCAPI devices or
> CCIX.
> 
> This is the kind of attributes of the memory (quality ?) that can be
> provided by the driver that is putting it online. We can then
> orthogonally decide how we chose (or not) to take those into account,
> either in the default mm algorithms or from explicit policy mechanisms
> set from userspace, but the latter is often awkward and never done
> right.
> 
> >  If yes then we are talking
> > about policy in the kernel and that sounds like a big no-no to me.
> 
> It makes sense to expose a concept of "characteristics" of a given
> memory node that affect the various policies the user can set.
> 
> It makes sense to have "default" policy models selected.
> 
> Policies aren't always decided in the kernel indeed (though they are
> more often than not, face it, most of the time, leaving it to userspace
> results in things simply not working). However the mechanisms by which
> the policy is applied are in the kernel.
> 
> > Moreover cpusets already support exclusive numa nodes AFAIR.
> 
> Which implies that the user would have to do explicit cpuset
> manipulations for the system to work right? Most users wouldn't, and the
> result is that most users would have badly working systems. That's almost
> always what happens when we choose to bounce *all* policy decisions to
> the user without the kernel attempting to have some kind of semi-sane
> default.
> 
> > I am either missing something important here, and the discussion so far
> > hasn't helped to be honest, or this whole CDM effort tries to build a
> > generic interface around a _specific_ piece of HW. 
> 
> No. You guys have just been sticking your heads in the sand for months
> for reasons I can't quite understand completely :-)
> 
> There is a definite direction out there for devices to participate in
> cache coherency and to operate within user process MMU contexts. This
> is what the GPUs on P9 will be doing via nvlink, but this will also be
> possible with technologies like OpenCAPI, I believe CCIX, etc...
> 
> This is by no means a special case.
> 
> > The matter is made worse
> > by the fact that the described usecases are so vague that it is hard to
> > build a good picture whether this is generic enough that a new/different
> > HW will still fit into this picture.
> 
> The GPU use case is rather trivial.
> 
> The end goal is to simply have accelerators transparently operate in
> userspace context, along with the ability to migrate pages to the memory
> that is the most efficient for a given operation.
> 
> Thus for example, mmap a large file (page cache) and have the program
> pass a pointer to that mmap to a GPU program that starts churning on
> it.
> 
> In the specific GPU case, we have HW on the link telling us the pages
> are pounded on remotely, allowing us to trigger migration toward GPU
> memory (but the other way works too).
> 
> The problem with the HMM based approach is that it is based on
> ZONE_DEVICE. This means "special" struct pages that aren't in LRU and
> implies, at least that's my understanding, piles of special cases all
> over the place to deal with them, along with various APIs etc... that
> don't work with such pages.
> 
> So it makes it difficult to be able to pick up anything mapped into a
> process address space, whether it is page cache pages, shared memory,
> etc... and migrate it to GPU pages.
> 
> At least, that's my understanding and Jerome somewhat confirmed it,
> we'd end up fighting an uphill battle dealing with all those special
> cases. HMM is well suited for non-coherent systems with a distinct MMU
> translation on the remote device.

Well there are _no_ migration issues with HMM (anonymous or file-backed
pages). What you don't get is things like LRU or NUMA balancing, but I
believe you do not want either of those anyway.

Sure, a careful audit of all code paths is needed to make sure that
such pages do not end up in places they shouldn't (like being put back
on the LRU).

Now you are also excluded from a lot of things; for instance, readahead
would never use a ZONE_DEVICE page to read ahead a file. But many of
those things are easy to add back if they are important to you.
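
(The kind of guard such an audit adds is usually small. Roughly, using the
existing is_zone_device_page() helper, the putback/LRU paths would grow
checks along the lines of:

        /* Never put device pages back on the LRU. */
        if (is_zone_device_page(page))
                return;

though exactly where those checks are needed is what the audit is for.)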

Jerome


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 17:48             ` Jerome Glisse
@ 2017-05-05 17:59               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-05 17:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, Balbir Singh, linux-mm, akpm, khandual,
	aneesh.kumar, paulmck, srikar, haren, mgorman, arbab, vbabka, cl

On Fri, 2017-05-05 at 13:48 -0400, Jerome Glisse wrote:
> Well there are _no_ migration issues with HMM (anonymous or file-backed
> pages). What you don't get is things like LRU or NUMA balancing, but I
> believe you do not want either of those anyway.

We don't want them in the specific case of GPUs today, for various
reasons related more to how they are used and to specific implementation
shortcomings, so it's a matter of policy.

However, I don't think they are necessarily to be excluded in the grand
scheme of things of coherent accelerators with local memory.

So my gut feeling (but we can agree to disagree; in the end, what we
need is *a* workable solution to enable these things, whichever it is
that wins) is that we are better off simply treating them as normal
NUMA nodes, and adding more policy tunables where needed, if possible
with some of these being set to reasonable defaults by the driver
itself to account for implementation shortcomings.

Now, if Michal and Mel strongly prefer the approach based on HMM, we
can make it work as well I believe. It feels less "natural" and more
convoluted. That's it.

This is by no means a criticism of HMM btw :-) HMM is still a critical
part of getting the non-coherent devices working properly, and whichever
representation we use for the memory on the coherent ones, we will
also use parts of the HMM infrastructure for driver-directed migration
anyway.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 14:52         ` Michal Hocko
  2017-05-05 15:57           ` Benjamin Herrenschmidt
@ 2017-05-09  7:51           ` Balbir Singh
  1 sibling, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-05-09  7:51 UTC (permalink / raw)
  To: Michal Hocko, Benjamin Herrenschmidt
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> On Thu 04-05-17 17:49:21, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> > > But the direct reclaim would be effective only _after_ all other nodes
> > > are full.
> > > 
> > > I thought that kswapd reclaim is a problem because the HW doesn't
> > > support aging properly but as the direct reclaim works then what is the
> > > actual problem?
> > 
> > Ageing isn't completely broken. The ATS MMU supports
> > dirty/accessed just fine.
> > 
> > However the TLB invalidations are quite expensive with a GPU so too
> > much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.
> 
> This sounds pretty much like a HW-specific detail, which is not the
> right criterion to design a general CDM around.

I think Ben answered several of these questions. We felt NUMA was the best
representation of such memory, but it has limitations in that we'd like
to isolate these nodes from some default algorithms that run on all nodes
marked N_MEMORY. Do you see that as a concern? Would you like to see a
generic policy, like Ben said, to handle node attributes like reclaim,
autonuma, etc.?

> 
> So let me repeat the fundamental question. Is the only difference from
> cpuless nodes the fact that the node should be invisible to processes
> unless they specify an explicit node mask? If yes then we are talking
> about policy in the kernel and that sounds like a big no-no to me.
> Moreover cpusets already support exclusive numa nodes AFAIR.

Why do you see this as policy? It's a mechanism for isolating nodes;
the nodes themselves are then used via mempolicy.
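
For example, an application (or a library on its behalf) could target the
CDM node with something like the sketch below, where cdm_node is a
hypothetical node id obtained from the device driver (error handling
omitted, and cdm_node assumed to be smaller than BITS_PER_LONG):

#include <numaif.h>
#include <sys/mman.h>

void *alloc_on_cdm(size_t len, int cdm_node)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << cdm_node;

        /* MPOL_BIND with an explicit nodemask: faults on this range are
         * satisfied only from the coherent device node. */
        mbind(buf, len, MPOL_BIND, &nodemask, cdm_node + 2, 0);
        return buf;
}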

> 
> I am either missing something important here, and the discussion so far
> hasn't helped to be honest, or this whole CDM effort tries to build a
> generic interface around a _specific_ piece of HW. The matter is made worse
> by the fact that the described usecases are so vague that it is hard to
> build a good picture whether this is generic enough that a new/different
> HW will still fit into this picture.

The use case is similar to HMM, except that we've got coherent memory.
We treat it as important and want to keep normal allocations off it,
unless the allocation explicitly specifies the node. Cpusets provide an
isolation mechanism, but we see autonuma, for example, moving pages away
when there is an access from the system side. With reclaim, it would be
better to use the fallback list first, then swap. Again, the use case is:

I'm trying to do a FAQ version here:

Isolate memory - why?
 - CDM memory is not meant for normal usage; applications request it
   explicitly and offload their compute to the device where the memory is
   (the offload is via a user space API like CUDA/OpenCL/...).
How do we isolate - NUMA or HMM?
 - Since the memory is coherent, NUMA provides the mechanism to isolate to
   a large extent via mempolicy. With NUMA we also get autonuma/kswapd/etc.
   running, something we would like to avoid. NUMA gives the application
   a transparent view of memory, in the sense that all mm features work,
   like direct page cache allocation in coherent device memory, limiting
   memory via cgroups if required, etc. With cpusets, it's
   possible for us to isolate allocation. One challenge is that the
   admin on the system may use them differently, and applications need to
   be aware of running in the right cpuset to allocate memory from the
   CDM node. Putting all applications in the cpuset with the CDM node is
   not the right thing to do, which means the application needs to move itself
   to the right cpuset before requesting CDM memory. It's not impossible
   to use cpusets, just hard to configure correctly.
 - With HMM, we would need an HMM variant, HMM-CDM, so that we are not marking
   the pages as unavailable; the page cache cannot go directly to coherent
   memory. An audit of mm paths is required. Most of the other things should
   work. User access to HMM-CDM memory behind ZONE_DEVICE is via a device
   driver.
Why do we need migration?
 - Depending on where the memory is being accessed from, we would like to
   migrate pages between system and coherent device memory. HMM provides
   DMA offload capability that is useful in both cases.
What is the larger picture - end to end?
 - Applications can allocate memory on the device or in system memory,
   offload the compute via user space API. Migration can be used for performance
   if required since it helps to keep the memory local to the compute.

Ben/Jerome/John/others did I get the FAQ right?

From my side, I want to ensure that the decision between HMM-CDM and NUMA-CDM
is based on our design and understanding, as opposed to the use case being
unclear or insufficient. I'd be happy if we said: we understand the use case
and believe that HMM-CDM is better from the mm's perspective because..., as
opposed to isolating NUMA attributes because..., or vice-versa.

Thanks for the review,
Balbir Singh.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 15:57           ` Benjamin Herrenschmidt
  2017-05-05 17:48             ` Jerome Glisse
@ 2017-05-09 11:36             ` Michal Hocko
  2017-05-09 13:43               ` Benjamin Herrenschmidt
  2017-05-10 23:04               ` Balbir Singh
  1 sibling, 2 replies; 45+ messages in thread
From: Michal Hocko @ 2017-05-09 11:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Fri 05-05-17 17:57:02, Benjamin Herrenschmidt wrote:
> On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> > 
> > This sounds pretty much like a HW-specific detail, which is not the
> > right criterion to design a general CDM around.
> 
> Which is why I don't see what's the problem with simply making this
> a hot-plugged NUMA node, since it's basically what it is with a
> "different" kind of CPU, possibly covered with a CMA, which provides
> both some isolation and the ability to do large physical allocations
> > for applications that choose to use the legacy programming interfaces and
> manually control the memory.
> 
> Then, the "issues" with things like reclaim, autonuma can be handled
> with policy tunables. Possibly node attributes.
> 
> It seems to me that such a model fits well in the picture where we are
> heading not just with GPUs, but with OpenCAPI based memory, CCIX or
> other similar technologies that can provide memory possibly with co-
> located acceleration devices.
> 
> It also mostly already just works.

But this is not what the CDM as proposed here is about AFAIU. It is
argued this is not a _normal_ cpuless node and it needs tweaks here and
there. And that is my main objection. I do not mind if the memory
is presented as a hot-pluggable cpuless memory node. I just do not want it
to be any more special than cpuless nodes are already.

> > So let me repeat the fundamental question. Is the only difference from
> > cpuless nodes the fact that the node should be invisible to processes
> > unless they specify an explicit node mask?
> 
> It would be *preferable* that it is.
> 
> It's not necessarily an absolute requirement as long as what lands
> there can be kicked out. However the system would potentially be
> performing poorly if too much unrelated stuff lands on the GPU memory
> as it has a much higher latency.

This is a general concern for many cpuless NUMA node systems. You have
to pay for the suboptimal performance when accessing that memory. And
you have means to cope with that.

> Due to the nature of GPUs (and possibly other such accelerators but not
> necessarily all of them), that memory is also more likely to fail. GPUs
> crash often. However that isn't necessarily true of OpenCAPI devices or
> CCIX.
> 
> This is the kind of attributes of the memory (quality ?) that can be
> provided by the driver that is putting it online. We can then
> orthogonally decide how we chose (or not) to take those into account,
> either in the default mm algorithms or from explicit policy mechanisms
> set from userspace, but the latter is often awkward and never done
> right.

The first adds maintenance costs all over the place, and just looking at
what became of memory policies and cpusets makes me cry. I definitely do
not want more special casing on top (and just to make it clear, a special
N_MEMORY_$FOO falls into the same category).

[...]
> > Moreover cpusets already support exclusive numa nodes AFAIR.
> 
> > Which implies that the user would have to do explicit cpuset
> > manipulations for the system to work right? Most users wouldn't, and the
> > result is that most users would have badly working systems. That's almost
> > always what happens when we choose to bounce *all* policy decisions to
> > the user without the kernel attempting to have some kind of semi-sane
> > default.

I would argue that this is the case for cpuless NUMA nodes already.
Users had better know what they are doing when using such specialized
HW. And that includes a specialized configuration.

[...]
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-09 11:36             ` Michal Hocko
@ 2017-05-09 13:43               ` Benjamin Herrenschmidt
  2017-05-15 12:55                 ` Michal Hocko
  2017-05-10 23:04               ` Balbir Singh
  1 sibling, 1 reply; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-09 13:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Tue, 2017-05-09 at 13:36 +0200, Michal Hocko wrote:
> But this is not what the CDM as proposed here is about AFAIU. It is
> argued this is not a _normal_ cpuless node and it needs tweaks here and
> there. And that is my main objection. I do not mind if the memory
> is presented as a hot-pluggable cpuless memory node. I just do not want it
> to be any more special than cpuless nodes are already.

But if you look at where things are going with the new kinds of memory
technologies appearing etc... I think the concept of "normal" for
memory is rather fragile.

So I think it makes sense to grow the idea that nodes have "attributes"
that affect the memory policies.

That said, one thing we do need to clarify, especially in the context
of our short-term GPU usage model, is which of those attributes are
inherent to the way the HW works and which are more related to the actual
userspace usage model, the latter possibly being better dealt with using
existing policy mechanisms.

Also, maybe understand how many of these things are likely to be shared
with other types of devices such as OpenCAPI or CCIX.

> > > So let me repeat the fundamental question. Is the only difference from
> > > cpuless nodes the fact that the node should be invisible to processes
> > > unless they specify an explicit node mask?
> > 
> > It would be *preferable* that it is.
> > 
> > It's not necessarily an absolute requirement as long as what lands
> > there can be kicked out. However the system would potentially be
> > performing poorly if too much unrelated stuff lands on the GPU memory
> > as it has a much higher latency.
> 
> This is a general concern for many cpuless NUMA node systems. You have
> to pay for the suboptimal performance when accessing that memory. And
> you have means to cope with that.

Yup. However in this case, GPU memory is really bad, so that's one
reason why we want to push the idea of effectively not allowing non-
explicit allocations from it.

Thus, memory would be allocated from that node only if either the
application (or driver) uses explicit APIs to grab some of it, or if the
driver migrates pages to it. (Or possibly, if we can make that work,
the memory is provisioned as the result of a page fault by the GPU
itself).
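
(On the driver side, "explicit APIs to grab some of it" can be as simple as a
node-pinned allocation; a sketch only, with cdm_nid standing in for the
device's node id:)

static struct page *cdm_alloc_page(int cdm_nid)
{
        /* __GFP_THISNODE pins the allocation to the device node and
         * disables fallback to other nodes, so an explicit request
         * either lands on the CDM node or fails. */
        return alloc_pages_node(cdm_nid,
                                GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}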

> > Due to the nature of GPUs (and possibly other such accelerators but not
> > necessarily all of them), that memory is also more likely to fail. GPUs
> > crash often. However that isn't necessarily true of OpenCAPI devices or
> > CCIX.
> > 
> > This is the kind of attributes of the memory (quality ?) that can be
> > provided by the driver that is putting it online. We can then
> > orthogonally decide how we chose (or not) to take those into account,
> > either in the default mm algorithms or from explicit policy mechanisms
> > set from userspace, but the latter is often awkward and never done
> > right.
> 
> The first adds maintenance costs all over the place, and just looking at
> what became of memory policies and cpusets makes me cry. I definitely do
> not want more special casing on top (and just to make it clear, a special
> N_MEMORY_$FOO falls into the same category).
> 
> [...]
> > > Moreover cpusets already support exclusive numa nodes AFAIR.
> > 
> > Which implies that the user would have to do explicit cpuset
> > manipulations for the system to work right? Most users wouldn't, and the
> > result is that most users would have badly working systems. That's almost
> > always what happens when we choose to bounce *all* policy decisions to
> > the user without the kernel attempting to have some kind of semi-sane
> > default.
> 
> I would argue that this is the case for cpuless NUMA nodes already.
> Users had better know what they are doing when using such specialized
> HW. And that includes a specialized configuration.

So what you are saying is that users who want to use GPUs or FPGAs or
accelerated devices will need to have intimate knowledge of Linux CPU
and memory policy management at a low level.

That's where I disagree.

People want to throw these things at all sorts of problems out there,
hide them behind libraries, and have things "just work".

The user will just use applications normally. Those will use
more/less standard libraries to perform various computations; these
libraries will know how to take advantage of accelerators; nothing in
that chain knows about memory policies & placement, cpusets etc... and
nothing *should*.

Of course, the special case of the HPC user trying to milk the last
cycle out of the system is probably going to do what you suggest. But
most users won't.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-09 11:36             ` Michal Hocko
  2017-05-09 13:43               ` Benjamin Herrenschmidt
@ 2017-05-10 23:04               ` Balbir Singh
  1 sibling, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-05-10 23:04 UTC (permalink / raw)
  To: Michal Hocko, Benjamin Herrenschmidt
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Tue, 2017-05-09 at 13:36 +0200, Michal Hocko wrote:
> On Fri 05-05-17 17:57:02, Benjamin Herrenschmidt wrote:
> > On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> > > 
> > This sounds pretty much like a HW-specific detail, which is not the
> > right criterion to design a general CDM around.
> > 
> > Which is why I don't see what's the problem with simply making this
> > a hot-plugged NUMA node, since it's basically what it is with a
> > "different" kind of CPU, possibly covered with a CMA, which provides
> > both some isolation and the ability to do large physical allocations
> > for applications that choose to use the legacy programming interfaces and
> > manually control the memory.
> > 
> > Then, the "issues" with things like reclaim, autonuma can be handled
> > with policy tunables. Possibly node attributes.
> > 
> > It seems to me that such a model fits well in the picture where we are
> > heading not just with GPUs, but with OpenCAPI based memory, CCIX or
> > other similar technologies that can provide memory possibly with co-
> > located acceleration devices.
> > 
> > It also mostly already just works.
> 
> But this is not what the CDM as proposed here is about AFAIU.

The main reason for the patches was to address "issues" with things like
reclaim, autonuma isolation, etc., under the constraint of not being
willing to make allocator changes.

Do we see node attributes as something we need generically? Is there
consensus that we need this, or do we see all new algorithms working
across all of N_MEMORY all the time?
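
(To illustrate what "not working across all of N_MEMORY" would mean: the kind
of check this series has in mind looks roughly like the sketch below, where
N_COHERENT_MEMORY is the node state proposed by the series and the loop body
is purely illustrative.)

int nid;

for_each_node_state(nid, N_MEMORY) {
        /* Leave coherent device nodes to explicit users only. */
        if (node_state(nid, N_COHERENT_MEMORY))
                continue;

        /* ... default work: kswapd scanning, autonuma, etc. ... */
}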

> It is
> argued this is not a _normal_ cpuless node and it needs tweaks here and
> there. And that is my main objection. I do not mind if the memory
> is presented as a hot-pluggable cpuless memory node. I just do not want it
> to be any more special than cpuless nodes are already.

The downsides being code complexity/run-time overhead? Like Ben stated,
there are several devices that will also have coherent memory; do you see
all of them abstracted as HMM-CDM?

> 
> > > So let me repeat the fundamental question. Is the only difference from
> > > cpuless nodes the fact that the node should be invisible to processes
> > > unless they specify an explicit node mask?
> > 
> > It would be *preferable* that it is.
> > 
> > It's not necessarily an absolute requirement as long as what lands
> > there can be kicked out. However the system would potentially be
> > performing poorly if too much unrelated stuff lands on the GPU memory
> > as it has a much higher latency.
> 
> This is a general concern for many cpuless NUMA node systems. You have
> to pay for the suboptimal performance when accessing that memory. And
> you have means to cope with that.
> 

How do we evolve the NUMA subsystem to deal with additional requirements?
Do we not enhance NUMA and move to ZONE_DEVICE?

> > Due to the nature of GPUs (and possibly other such accelerators but not
> > necessarily all of them), that memory is also more likely to fail. GPUs
> > crash often. However that isn't necessarily true of OpenCAPI devices or
> > CCIX.
> > 
> > This is the kind of attributes of the memory (quality ?) that can be
> > provided by the driver that is putting it online. We can then
> > orthogonally decide how we chose (or not) to take those into account,
> > either in the default mm algorithms or from explicit policy mechanisms
> > set from userspace, but the latter is often awkward and never done
> > right.
> 
> The first adds maintenance costs all over the place, and just looking at
> what became of memory policies and cpusets makes me cry. I definitely do
> not want more special casing on top (and just to make it clear, a special
> N_MEMORY_$FOO falls into the same category).
> 

And I thought it was a cleaner design; yes, we have been special-casing some
of the N_COHERENT_MEMORY bits in mm/mempolicy.c.

> [...]
> > > Moreover cpusets already support exclusive numa nodes AFAIR.
> > 
> > Which implies that the user would have to do explicit cpuset
> > manipulations for the system to work right? Most users wouldn't, and the
> > result is that most users would have badly working systems. That's almost
> > always what happens when we choose to bounce *all* policy decisions to
> > the user without the kernel attempting to have some kind of semi-sane
> > default.
> 
> I would argue that this is the case for cpuless NUMA nodes already.
> Users had better know what they are doing when using such specialized
> HW. And that includes a specialized configuration.
>

Like Ben said, assuming intimate knowledge of specialized hardware
is unfair. It sounds like the decision then is that we either do
HMM-CDM or live with cpuless nodes without enhancements?

Balbir


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-09 13:43               ` Benjamin Herrenschmidt
@ 2017-05-15 12:55                 ` Michal Hocko
  2017-05-15 15:53                   ` Christoph Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2017-05-15 12:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

[Oops, for some reason this got stuck in my drafts folder and didn't get
sent out]

On Tue 09-05-17 15:43:12, Benjamin Herrenschmidt wrote:
> On Tue, 2017-05-09 at 13:36 +0200, Michal Hocko wrote:
> > But this is not what the CDM as proposed here is about AFAIU. It is
> > argued this is not a _normal_ cpuless node and it needs tweaks here and
> > there. And that is my main objection. I do not mind if the memory
> > is presented as a hot-pluggable cpuless memory node. I just do not want it
> > to be any more special than cpuless nodes are already.
> 
> But if you look at where things are going with the new kind of memory
> technologies appearing etc... I think the concept of "normal" for
> memory is rather fragile.
> 
> So I think it makes sense to grow the idea that nodes have "attributes"
> that affect the memory policies.

I am not really sure our current API fits into such a world and a change
would require much deeper consideration.

[...]
> > This is a general concern for many cpuless NUMA node systems. You have
> > to pay for the suboptimal performance when accessing that memory. And
> > you have means to cope with that.
> 
> Yup. However in this case, GPU memory is really bad, so that's one
> reason why we want to push the idea of effectively not allowing non-
> explicit allocations from it.

I would argue that a cpuless node with a NUMA distance larger than a
certain threshold falls pretty much into the same category.
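
(Roughly the kind of distance check the allocator already applies when
deciding whether to reclaim from a remote node; a sketch, not the exact
code:

static bool too_remote(int local_nid, int remote_nid)
{
        /* Nodes beyond RECLAIM_DISTANCE are considered too remote to
         * reclaim from on behalf of a local allocation. */
        return node_distance(local_nid, remote_nid) > RECLAIM_DISTANCE;
}

A very distant CDM node would therefore already get some of this treatment
for free.)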

> Thus, memory would be allocated from that node only if either the
> application (or driver) use explicit APIs to grab some of it, or if the
> driver migrates pages to it. (Or possibly, if we can make that work,
> the memory is provisioned as the result of a page fault by the GPU
> itself).

That sounds like HMM to me.
 
[...]
> > I would argue that this is the case for cpuless numa nodes already.
> > Users should better know what they are doing when using such a
> > specialized HW. And that includes a specialized configuration.
> 
> So what you are saying is that users who want to use GPUs or FPGAs or
> accelerated devices will need to have intimate knowledge of Linux CPU
> and memory policy management at a low level.

No, I am not saying that. I am saying that if you want to use GPUs/FPGAs
and what-not effectively, you will most likely have to take additional
steps anyway.

> That's where I disagree.
> 
> People want to throw these things at all sort of problems out there,
> hide them behind libraries, and have things "just work".
> 
> The user will just use applications normally. Those will be use
> more/less standard libraries to perform various computations, these
> libraries will know how to take advantage of accelerators, nothing in
> that chains knows about memory policies & placement, cpusets etc... and
> nothing *should*.

With the proposed solution, they would need to set up mempolicy/cpusets,
so I must be missing something here...

> Of course, the special case of the HPC user trying to milk the last
> cycle out of the system is probably going to do what you suggest. But
> most users won't.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-15 12:55                 ` Michal Hocko
@ 2017-05-15 15:53                   ` Christoph Lameter
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2017-05-15 15:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Benjamin Herrenschmidt, Balbir Singh, linux-mm, akpm, khandual,
	aneesh.kumar, paulmck, srikar, haren, jglisse, mgorman, arbab,
	vbabka

On Mon, 15 May 2017, Michal Hocko wrote:

> With the proposed solution, they would need to set up mempolicy/cpuset
> so I must be missing something here...
>
> > Of course, the special case of the HPC user trying to milk the last
> > cycle out of the system is probably going to do what you suggest. But
> > most users won't.

It's going to be the HPC users who will be trying to take advantage of it
anyway. I doubt that enterprise-class users will even be buying the
accelerators. If it goes that way (after a couple of years) we will hopefully
have matured things a bit and have experience in how to configure the special
NUMA nodes in the system to behave properly with an accelerator.

I think the simplest way is to just go ahead and create the NUMA node
approach and see how much can be covered with the existing NUMA features.
Then work from there to simplify and enhance.



Thread overview: 45+ messages
2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
2017-04-27 18:42   ` Reza Arbab
2017-04-28  5:07     ` Balbir Singh
2017-04-19  7:52 ` [RFC 2/4] arch/powerpc/mm: add support for coherent memory Balbir Singh
2017-04-19  7:52 ` [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system Balbir Singh
2017-04-19  7:52 ` [RFC 4/4] mm: Add documentation for coherent memory Balbir Singh
2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
2017-04-20  1:25   ` Balbir Singh
2017-04-20 15:29     ` Christoph Lameter
2017-04-20 21:26       ` Benjamin Herrenschmidt
2017-04-21 16:13         ` Christoph Lameter
2017-04-21 21:15           ` Benjamin Herrenschmidt
2017-04-24 13:57             ` Christoph Lameter
2017-04-24  0:20       ` Balbir Singh
2017-04-24 14:00         ` Christoph Lameter
2017-04-25  0:52           ` Balbir Singh
2017-05-01 20:41 ` John Hubbard
2017-05-01 21:04   ` Reza Arbab
2017-05-01 21:56     ` John Hubbard
2017-05-01 23:51       ` Reza Arbab
2017-05-01 23:58         ` John Hubbard
2017-05-02  0:04           ` Reza Arbab
2017-05-02  1:29   ` Balbir Singh
2017-05-02  5:47     ` John Hubbard
2017-05-02  7:23       ` Balbir Singh
2017-05-02 17:50         ` John Hubbard
2017-05-02 14:36 ` Michal Hocko
2017-05-04  5:26   ` Balbir Singh
2017-05-04 12:52     ` Michal Hocko
2017-05-04 15:49       ` Benjamin Herrenschmidt
2017-05-04 17:33         ` Dave Hansen
2017-05-05  3:17           ` Balbir Singh
2017-05-05 14:51             ` Dave Hansen
2017-05-05  7:49           ` Benjamin Herrenschmidt
2017-05-05 14:52         ` Michal Hocko
2017-05-05 15:57           ` Benjamin Herrenschmidt
2017-05-05 17:48             ` Jerome Glisse
2017-05-05 17:59               ` Benjamin Herrenschmidt
2017-05-09 11:36             ` Michal Hocko
2017-05-09 13:43               ` Benjamin Herrenschmidt
2017-05-15 12:55                 ` Michal Hocko
2017-05-15 15:53                   ` Christoph Lameter
2017-05-10 23:04               ` Balbir Singh
2017-05-09  7:51           ` Balbir Singh
