* [RFC V2 00/12] Define coherent device memory node
@ 2017-01-30  3:35 Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 01/12] mm: Define coherent device memory (CDM) node Anshuman Khandual
                   ` (21 more replies)
  0 siblings, 22 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

	There are certain devices like accelerators, GPU cards, network
cards, FPGA cards, PLD cards etc which might contain on-board memory. This
on-board memory can be coherent with system RAM and accessible from either
the CPU or the device. The coherency is usually achieved by synchronizing
the cache accesses from either side, which makes the device memory appear
in the same address space as the system RAM. The on-board device memory
and system RAM are coherent but differ in their properties, as elaborated
below. The following diagram shows how the coherent device memory appears
in the memory address space.

                +-----------------+         +-----------------+
                |                 |         |                 |
                |       CPU       |         |     DEVICE      |
                |                 |         |                 |
                +-----------------+         +-----------------+
                         |                           |
                         |   Shared Address Space    |
 +---------------------------------------------------------------------+
 |                                             |                       |
 |                                             |                       |
 |                 System RAM                  |     Coherent Memory   |
 |                                             |                       |
 |                                             |                       |
 +---------------------------------------------------------------------+

	User space applications might be interested in using the coherent
device memory, either explicitly or implicitly, along with the system RAM,
using the basic semantics for memory allocation, access and release. In
essence, user applications should be able to allocate memory anywhere
(system RAM or coherent memory) and have it accessed either from the CPU
or from the coherent device for various computation or data transformation
purposes. User space should not have to be concerned about memory
placement or about where the subsequent allocations land when the memory
actually faults on access.
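
	As an illustration, here is a minimal user space sketch of the
intended explicit usage. It assumes node 2 is a CDM node and that explicit
binding to it is permitted (as proposed later in this series); build with
-lnuma.

#include <numaif.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;
	/* Hypothetical CDM node number 2 */
	unsigned long cdm_mask = 1UL << 2;
	void *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Bind the range to the CDM node before first touch */
	if (mbind(buf, len, MPOL_BIND, &cdm_mask,
		  sizeof(cdm_mask) * 8, 0))
		return 1;

	/* First touch faults the pages in on the CDM node */
	memset(buf, 0, len);
	munmap(buf, len);
	return 0;
}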

	To achieve seamless integration between system RAM and coherent
device memory, the coherent memory must be able to utilize core kernel
memory features like anon mapping, file mapping, page cache, driver
managed pages, HW poisoning, migrations, reclaim, compaction, etc. Making
the coherent device memory appear as a distinct memory-only NUMA node,
initialized like any other node with memory, creates this integration with
the existing system RAM. At the same time there should be a distinguishing
mark which indicates that this node is a coherent device memory node, not
just another memory-only system RAM node.
 
	Coherent device memory invariably isn't available until the driver
for the device has been initialized. It is desirable, but not required,
for the device to support memory offlining for purposes such as power
management, link management and hardware errors. Kernel allocations should
not land in this memory, as they cannot be moved out. Hence coherent
device memory should go into the ZONE_MOVABLE zone. This guarantees that
kernel allocations will never be satisfied from this memory, and any
process holding un-movable pages on this coherent device memory (likely
through pinning after the initial allocation) can be killed to free up
memory from its page tables, eventually allowing the node to be hot
plugged out.
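
	For reference, a hedged kernel side sketch of how a device driver
might plug its coherent memory in as a NUMA node once the device is up.
The names cdm_nid, cdm_base and cdm_size are hypothetical and come from
device discovery; onlining the new memory blocks into ZONE_MOVABLE is then
completed from user space by writing "online_movable" to the per-block
sysfs state files.

#include <linux/memory_hotplug.h>
#include <linux/printk.h>

static int cdm_register_memory(int cdm_nid, u64 cdm_base, u64 cdm_size)
{
	int ret;

	/* Creates the node if needed and adds the memory sections */
	ret = add_memory(cdm_nid, cdm_base, cdm_size);
	if (ret)
		pr_err("CDM: add_memory() failed: %d\n", ret);
	return ret;
}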

	Even when represented as a NUMA node, the coherent memory might
still need some special consideration inside the kernel.
There can be a variety of coherent device memory nodes with different
expectations and special considerations from the core kernel. This RFC
discusses only one such scenario where the coherent device memory requires
just isolation.

	Now let us consider in detail the case of a coherent device memory
node which requires isolation. This kind of coherent device memory is on
board an external device attached to the system through a link where there
is a chance that link errors will take the entire memory node out with it.
Moreover, the memory might also have a higher chance of ECC errors
compared to the system RAM. These are just some possibilities; the fact
remains that coherent device memory can have other differing properties
which might not be desirable for some user space applications. An
application should not be exposed to the risks of a device if it is not
taking advantage of the special features of that device and its memory.

	For the reasons explained above, allocations on an isolation based
coherent device memory node should be further regulated, beyond the
earlier requirement that kernel allocations never land there. User space
allocations should not end up here implicitly, without the user
application explicitly knowing about it. This summarizes the isolation
requirement of one kind of coherent device memory node, as an example.

	Some coherent memory devices may not require isolation at all, and
there might be other coherent memory devices which require some other
special treatment after becoming part of the core memory representation in
the kernel. Though the framework suggested by this RFC makes provisions
for them, it does not consider any requirement other than isolation for
now.

	Though this RFC series currently implements one such isolation
seeking coherent device memory example, the framework can be extended to
accommodate any present or future coherent memory device, even with new
requirements other than isolation. For an isolation seeking coherent
device memory node, there are other core VM code paths which need to be
taken care of before it can be completely isolated as required.

	Core kernel memory features like reclamation, evictions etc. might
need to be restricted or modified on the coherent device memory node as
they can be performance limiting. The RFC does not propose anything on this
yet but it can be looked into later on. For now it just disables Auto NUMA
for any VMA which has coherent device memory.

	Seamless integration of coherent device memory with system memory
will enable various other features, some of which can be listed as follows.

	a. Seamless migrations between system RAM and the coherent memory
	b. Will have asynchronous and high throughput migrations
	c. Be able to allocate huge order pages from these memory regions
	d. Restrict allocations to a large extent to the tasks using the
	   device for workload acceleration

	Before concluding, let us look into the reasons why the existing
solutions don't work. There are two basic requirements which have to be
satisfied before the coherent device memory can be integrated with the
core kernel seamlessly.

	a. Every PFN must have a struct page
	b. The struct page must be able to sit on the standard LRU lists

	These two basic requirements rule out the existing device memory
representation approaches listed below, which is what creates the need for
a new framework.

(1) Traditional ioremap

	a. Memory is mapped into kernel (linear and virtual) and user space
	b. These PFNs do not have struct pages associated with them
	c. These special PFNs are marked with special flags inside the PTE
	d. Cannot participate much in core VM functions because of this
	e. Cannot do easy user space migrations

(2) Zone ZONE_DEVICE

	a. Memory is mapped into kernel and user space
	b. PFNs do have struct pages associated with them
	c. These struct pages are allocated inside the device's own
	   memory range
	d. Unfortunately the struct page's union containing the LRU
	   list_head has been taken over by the struct dev_pagemap
	   pointer (illustrated below)
	e. Hence it cannot be part of any LRU (like the page cache)
	f. Hence file cached mappings cannot reside on these PFNs
	g. Cannot do easy migrations
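
	For reference, a hedged and heavily abridged illustration of that
conflict (example_page is illustrative, not the real struct page
definition): the dev_pagemap pointer shares a union with the lru list
head, so a ZONE_DEVICE page can never be linked onto an LRU list without
corrupting its pgmap pointer.

#include <linux/types.h>

struct dev_pagemap;

struct example_page {
	unsigned long flags;
	union {
		struct list_head lru;	   /* LRU linkage for normal pages */
		struct dev_pagemap *pgmap; /* ZONE_DEVICE hosting pagemap */
	};
};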

	I had also explored a non LRU representation of this coherent
device memory where the integration with system RAM in the core VM is
limited to the following functions. Not being on the LRU definitely
reduces the scope of tight integration with system RAM.

(1) Migration support between system RAM and coherent memory
(2) Migration support between various coherent memory nodes
(3) Isolation of the coherent memory
(4) Mapping the coherent memory into user space through driver's
    struct vm_operations
(5) HW poisoning of the coherent memory

	Allocating the entire memory of the coherent device node into
ZONE_MOVABLE right after hot plug (where the memory is already inside the
buddy system) would still expose a time window where other user space
allocations can land on the coherent device memory node and defeat the
intended isolation. So traditional hot plug is not the solution. Hence I
started looking into a CMA based non LRU solution, but then hit the
following roadblocks.

(1) CMA does not support hot plugging of a new memory node
	a. CMA areas need to be marked during boot, before the buddy
	   allocator is initialized (see the sketch below)
	b. cma_alloc()/cma_release() can then happen on the marked areas
	c. We would need to be able to mark CMA areas just after memory
	   hot plug
	d. cma_alloc()/cma_release() would then happen later, after the
	   hot plug
	e. This is not currently supported

(2) Mapped non LRU migration of pages
	a. Recent work from Minchan Kim makes non LRU pages migratable
	b. But it still does not support migration of mapped non LRU pages
	c. With a non LRU CMA reservation, again there are some additional
	   challenges
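
	For reference, a minimal sketch of the existing boot-time only CMA
flow at this kernel version (cdm_cma and the wrappers are hypothetical).
cma_declare_contiguous() must run during early (memblock) init, before the
buddy allocator is up, which is exactly what a hot plugged CDM node cannot
satisfy.

#include <linux/cma.h>
#include <linux/init.h>

static struct cma *cdm_cma;

/* Only valid during early boot, before buddy initialization */
static int __init cdm_cma_reserve(phys_addr_t base, phys_addr_t size)
{
	return cma_declare_contiguous(base, size, 0, 0, 0, true, &cdm_cma);
}

/* Run time allocation from the reserved area */
static struct page *cdm_cma_get(size_t nr_pages)
{
	return cma_alloc(cdm_cma, nr_pages, 0);
}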

	With hot pluggable CMA and non LRU mapped migration support there
may be an alternate approach to represent coherent device memory. Please do
review this RFC proposal and let me know your comments or suggestions.

Challenges:

ZONELIST approach:
	* Requires the node FALLBACK zonelist creation process to be
	  changed to implement implicit allocation isolation, while still
	  providing access to the CDM memory, which has system RAM zones
	  as fallback.
	* Requires mbind(MPOL_BIND) semantics to be changed only for the
	  CDM nodes without introducing any regression for others.

CPUSET approach:
	* Removes CDM nodes from the root cpuset and from every process's
	  mems_allowed nodemask to achieve implicit allocation isolation.
	  In the process this also removes the ability to allocate on a
	  CDM node even with the help of the __GFP_THISNODE allocation
	  flag from inside the kernel. Currently we have a brute hack
	  which forces the allocator to ignore cpusets completely if the
	  allocation flag has got __GFP_THISNODE.
	* Cpuset will require changes in its handling of hardwalled and
	  exclusive cpusets. Its interaction with __GFP_HARDWALL also
	  needs to be audited to implement isolation as well as the
	  ability to allocate from CDM nodes both in and out of the
	  context of the task.

BUDDY approach:
	* We are still looking into this approach, where the core buddy
	  allocator __alloc_pages_nodemask() can be changed to implement
	  the required isolation as well as the allocation method. This
	  might involve how the passed nodemask interacts with
	  mems_allowed and the applicable memory policy, which can also be
	  influenced by cpuset changes in the system, during both fast
	  path and slow path allocation.

Open Questions:

HugeTLB allocation isolation:

Right now we ensure complete HugeTLB allocation isolation from CDM nodes.
Going forward, if we need to support HugeTLB allocation on CDM nodes on a
targeted basis, we would have to enable those allocations through the
/sys/devices/system/node/nodeN/hugepages/hugepages-16384kB/nr_hugepages
interface while still ensuring isolation from the generic sysctl and
/sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages interfaces.


FALLBACK zonelist creation:

CDM nodes' FALLBACK zonelists could also be changed to accommodate other
CDM memory zones along with the system RAM zones, in which case the other
CDM zones would be used as fallback options before falling back on the
system RAM zones when a node's own memory is insufficient for an
allocation.

Choice of zonelist:

There are multiple ways of choosing a CDM node's FALLBACK zonelist, which
will contain CDM memory, in cases where the requesting local node is not
part of the user provided nodemask. The nodemask check can be restricted
to the first node in the nodemask (as implemented), or it can scan through
the entire nodemask to find any present CDM node, or it can select the
first CDM node only if all the nodes in the nodemask are CDM. These are
various possible approaches; the first one is implemented in the current
series. We could also add a restriction to the mbind() system call where,
if a single node in the nodemask is CDM, then all of them have to be CDM,
else the request gets declined, to maintain simplicity (see the sketch
below).
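
	If that stricter all-or-none mbind() restriction were adopted, a
hedged sketch of the check, built on the is_cdm_node() helper introduced
in this series, could look like this:

#include <linux/node.h>
#include <linux/nodemask.h>

/* True when the nodemask is either all CDM nodes or none at all */
static bool nodemask_cdm_uniform(const nodemask_t *mask)
{
	int node, cdm = 0, total = 0;

	for_each_node_mask(node, *mask) {
		total++;
		if (is_cdm_node(node))
			cdm++;
	}
	return cdm == 0 || cdm == total;
}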

VM_CDM tagged VMA:

There are two parts to this problem.

* How to mark a VMA with VM_CDM ?
	- During page fault path
	- During mbind(MPOL_BIND) call
	- Any other paths ?
	- Should a driver mark a VMA with VM_CDM explicitly ?

* How VM_CDM marked VMA gets treated ?

	- Disabled from auto NUMA migrations
	- Disabled from KSM merging
	- Anything else ?

Previous versions:

V1  RFC: https://lkml.org/lkml/2016/10/24/19
V12 RFC: https://lkml.org/lkml/2016/11/22/339

Changes in V2:

* Both the ZONELIST and CPUSET approaches have been combined together to
  utilize the maximum possible common code, debug facilities and test
  programs. This will also help in the review of individual components.
  For this purpose CONFIG_COHERENT_DEVICE still depends on CONFIG_CPUSETS
  just to achieve this unification, even though it is not required for the
  ZONELIST based proposal.

* NODE_DATA(node)->coherent_device based CDM identification is no longer
  supported. Now CDM is identified through the
  node_states[N_COHERENT_DEVICE] nodemask element, which gets populated
  when the particular node gets hot plugged into the system and cleared
  when the node gets hot plugged out. This enables a new sysfs based
  interface, /sys/devices/system/node/is_coherent_device, which shows all
  the CDM nodes present on the system. Supporting architectures must
  export a function arch_check_node_cdm() which identifies CDM nodes.

* We still have all the names as CDM. But as Dave Hansen had mentioned
  before, all of this can still be applicable to any generic coherent
  memory type which is not on a device but might be present on the
  processor itself. The names will be changed appropriately later on.

* Complete isolation on CDM for HugeTLB allocation is achieved through
  ram_nodemask() fetched nodemask instead of using node_states[N_MEMORY]
  directly.

* After the changes with commit 6d8409580b: ("mm, mempolicy: clean up
  __GFP_THISNODE confusion in policy zonelist"), the MPOL_BIND changes
  had to be adjusted for CDM nodes. As regular node's zonelist does not
  contain CDM memory, the zonelist selection has to be from the given
  CDM nodes for the memory to be allocated successfully.

* Added a new patch which makes the KSM ignore madvise(MADV_MERGEABLE)
  request on a CDM memory based VMA.

* Added a new patch which will mark all applicable VMAs with VM_CDM flag
  during mbind(MPOL_BIND) call if the user passed nodemask has a CDM node.

* Brought back all VM_CDM based changes including the auto NUMA change.

* Some amount of code, documentation and commit message cleanups.

Changes in V12:

* Moved from specialized zonelist rebuilding to cpuset based isolation for
  the coherent device memory nodes

* Right now with this new approach, there is no explicit way of allocation
  into the coherent device memory nodes from user space, though it can be
  explored into later on

* Changed the behaviour of __alloc_pages_nodemask() when both cpuset is
  enabled and the allocation request has __GFP_THISNODE flag

* Dropped the VMA flag VM_CDM and related auto NUMA changes

* Dropped migrate_virtual_range() function from the series and moved that
  into the DEBUG patches

NOTE: These two sets of patches are mutually exclusive of each other and
represent two different approaches. Only one of these sets should be
applied at any point of time.

Set1:
  mm: Change generic FALLBACK zonelist creation process
  mm: Change mbind(MPOL_BIND) implementation for CDM nodes

Set2:
  cpuset: Add cpuset_inc() inside cpuset_init()
  mm: Exclude CDM nodes from task->mems_allowed and root cpuset
  mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE

Anshuman Khandual (12):
  mm: Define coherent device memory (CDM) node
  mm: Isolate HugeTLB allocations away from CDM nodes
  mm: Change generic FALLBACK zonelist creation process
  mm: Change mbind(MPOL_BIND) implementation for CDM nodes
  cpuset: Add cpuset_inc() inside cpuset_init()
  mm: Exclude CDM nodes from task->mems_allowed and root cpuset
  mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE
  mm: Add new VMA flag VM_CDM
  mm: Exclude CDM marked VMAs from auto NUMA
  mm: Ignore madvise(MADV_MERGEABLE) request for VM_CDM marked VMAs
  mm: Tag VMA with VM_CDM flag during page fault
  mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND)

 Documentation/ABI/stable/sysfs-devices-node |  7 ++++
 arch/powerpc/Kconfig                        |  1 +
 arch/powerpc/mm/numa.c                      |  7 ++++
 drivers/base/node.c                         |  6 +++
 include/linux/mempolicy.h                   | 14 +++++++
 include/linux/mm.h                          |  5 +++
 include/linux/node.h                        | 49 +++++++++++++++++++++++
 include/linux/nodemask.h                    |  3 ++
 kernel/cpuset.c                             | 14 ++++---
 kernel/sched/fair.c                         |  3 +-
 mm/Kconfig                                  |  5 +++
 mm/hugetlb.c                                | 25 +++++++-----
 mm/ksm.c                                    |  4 ++
 mm/memory_hotplug.c                         | 10 +++++
 mm/mempolicy.c                              | 62 +++++++++++++++++++++++++++++
 mm/page_alloc.c                             | 12 +++++-
 16 files changed, 211 insertions(+), 16 deletions(-)

-- 
2.9.3


* [RFC V2 01/12] mm: Define coherent device memory (CDM) node
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes Anshuman Khandual
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

There are certain devices like specialized accelerators, GPU cards,
network cards, FPGA cards etc which might contain onboard memory which is
coherent with the existing system RAM while being accessed either from the
CPU or from the device. It shares some properties with normal system RAM
but can also differ from it in other respects.

User applications might be interested in using this kind of coherent
device memory, explicitly or implicitly, alongside the system RAM,
utilizing all possible core memory functions like anon mapping (LRU), file
mapping (LRU), page cache (LRU), driver managed (non LRU), HW poisoning,
NUMA migrations etc. To achieve this kind of tight integration with the
core memory subsystem, the device onboard coherent memory must be
represented as a memory-only NUMA node. At the same time the arch must
export some kind of function to identify this node as a coherent device
memory node and not just another regular CPU-less memory-only NUMA node.

After achieving the integration with the core memory subsystem, coherent
device memory might still need some special consideration inside the
kernel. There can be a variety of coherent memory nodes with different
expectations from the core kernel. But right now only one kind of special
treatment is considered, which requires certain isolation.

Now consider the case of a coherent device memory node type which requires
isolation. This kind of coherent memory is onboard an external device
attached to the system through a link where there is always a chance of a
link failure taking down the entire memory node with it. Moreover, the
memory might also have a higher chance of ECC failure compared to the
system RAM. Hence allocations on this kind of coherent memory node should
be regulated. Kernel allocations must not come here. Normal user space
allocations too should not come here implicitly (without the user
application knowing about it). This summarizes the isolation requirement
of one kind of coherent device memory node, as an example. There can be
other kinds of isolation requirements as well.

Some coherent memory devices might not require isolation at all, while
other coherent memory devices might require some other special treatment
after becoming part of the core memory representation. For now, we look
only into isolation seeking coherent device memory nodes, not the other
ones.

To implement the integration as well as the isolation, the coherent memory
node must be present in N_MEMORY and in a new N_COHERENT_DEVICE nodemask
inside the node_states[] array. During memory hotplug operations, the new
N_COHERENT_DEVICE nodemask is updated along with N_MEMORY for these
coherent device memory nodes. This also creates the following new sysfs
based interface to list all the coherent memory nodes of the system.

	/sys/devices/system/node/is_coherent_device

Architectures must export the function arch_check_node_cdm(), which
identifies any coherent device memory node, in case they enable
CONFIG_COHERENT_DEVICE.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 Documentation/ABI/stable/sysfs-devices-node |  7 +++++
 arch/powerpc/Kconfig                        |  1 +
 arch/powerpc/mm/numa.c                      |  7 +++++
 drivers/base/node.c                         |  6 ++++
 include/linux/node.h                        | 49 +++++++++++++++++++++++++++++
 include/linux/nodemask.h                    |  3 ++
 mm/Kconfig                                  |  5 +++
 mm/memory_hotplug.c                         | 10 ++++++
 8 files changed, 88 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 5b2d0f0..fa2f105 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -29,6 +29,13 @@ Description:
 		Nodes that have regular or high memory.
 		Depends on CONFIG_HIGHMEM.
 
+What:		/sys/devices/system/node/is_coherent_device
+Date:		January 2017
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		Lists the nodemask of nodes that have coherent device memory.
+		Depends on CONFIG_COHERENT_DEVICE.
+
 What:		/sys/devices/system/node/nodeX
 Date:		October 2002
 Contact:	Linux Memory Management list <linux-mm@kvack.org>
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a8ee573..8273e6e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -165,6 +165,7 @@ config PPC
 	select HAVE_ARCH_HARDENED_USERCOPY
 	select HAVE_KERNEL_GZIP
 	select HAVE_CC_STACKPROTECTOR
+	select COHERENT_DEVICE if PPC64 && CPUSETS
 
 config GENERIC_CSUM
 	def_bool CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b1099cb..9c73fbe 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -41,6 +41,13 @@
 #include <asm/setup.h>
 #include <asm/vdso.h>
 
+#ifdef CONFIG_COHERENT_DEVICE
+int arch_check_node_cdm(int nid)
+{
+	return 0;
+}
+#endif
+
 static int numa_enabled = 1;
 
 static char *cmdline __initdata;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..5b5dd89 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -661,6 +661,9 @@ static struct node_attr node_state_attr[] = {
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 #endif
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
+#ifdef CONFIG_COHERENT_DEVICE
+	[N_COHERENT_DEVICE] = _NODE_ATTR(is_coherent_device, N_COHERENT_DEVICE),
+#endif
 };
 
 static struct attribute *node_state_attrs[] = {
@@ -674,6 +677,9 @@ static struct attribute *node_state_attrs[] = {
 	&node_state_attr[N_MEMORY].attr.attr,
 #endif
 	&node_state_attr[N_CPU].attr.attr,
+#ifdef CONFIG_COHERENT_DEVICE
+	&node_state_attr[N_COHERENT_DEVICE].attr.attr,
+#endif
 	NULL
 };
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 2115ad5..e284206 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -81,4 +81,53 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
+
+#ifdef CONFIG_COHERENT_DEVICE
+extern int arch_check_node_cdm(int nid);
+
+static inline nodemask_t ram_nodemask(void)
+{
+	nodemask_t ram_nodes;
+
+	nodes_clear(ram_nodes);
+	nodes_andnot(ram_nodes, node_states[N_MEMORY],
+				node_states[N_COHERENT_DEVICE]);
+	return ram_nodes;
+}
+
+static inline bool is_cdm_node(int node)
+{
+	return node_isset(node, node_states[N_COHERENT_DEVICE]);
+}
+
+static inline bool nodemask_has_cdm(nodemask_t mask)
+{
+	int node, i;
+
+	node = first_node(mask);
+	for (i = 0; i < nodes_weight(mask); i++) {
+		if (is_cdm_node(node))
+			return true;
+		node = next_node(node, mask);
+	}
+	return false;
+}
+#else
+static inline int arch_check_node_cdm(int nid) { return 0; }
+
+static inline nodemask_t ram_nodemask(void)
+{
+	return node_states[N_MEMORY];
+}
+
+static inline bool is_cdm_node(int node)
+{
+	return false;
+}
+
+static inline bool nodemask_has_cdm(nodemask_t mask)
+{
+	return false;
+}
+#endif /* CONFIG_COHERENT_DEVICE */
 #endif /* _LINUX_NODE_H_ */
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index f746e44..6e66cfd 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -393,6 +393,9 @@ enum node_states {
 	N_MEMORY = N_HIGH_MEMORY,
 #endif
 	N_CPU,		/* The node has one or more cpus */
+#ifdef CONFIG_COHERENT_DEVICE
+	N_COHERENT_DEVICE,
+#endif
 	NR_NODE_STATES
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb..5b7d1e7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -143,6 +143,11 @@ config HAVE_GENERIC_RCU_GUP
 config ARCH_DISCARD_MEMBLOCK
 	bool
 
+config COHERENT_DEVICE
+	bool
+	depends on CPUSETS
+	default n
+
 config NO_BOOTMEM
 	bool
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca2723d..b63010a2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1030,6 +1030,11 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 	if (arg->status_change_nid_high >= 0)
 		node_set_state(node, N_HIGH_MEMORY);
 
+#ifdef CONFIG_COHERENT_DEVICE
+	if (arch_check_node_cdm(node))
+		node_set_state(node, N_COHERENT_DEVICE);
+#endif
+
 	node_set_state(node, N_MEMORY);
 }
 
@@ -1830,6 +1835,11 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 	if ((N_MEMORY != N_HIGH_MEMORY) &&
 	    (arg->status_change_nid >= 0))
 		node_clear_state(node, N_MEMORY);
+
+#ifdef CONFIG_COHERENT_DEVICE
+	if (arch_check_node_cdm(node))
+		node_clear_state(node, N_COHERENT_DEVICE);
+#endif
 }
 
 static int __ref __offline_pages(unsigned long start_pfn,
-- 
2.9.3


* [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 01/12] mm: Define coherent device memory (CDM) node Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30 17:19   ` Dave Hansen
  2017-01-30  3:35 ` [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

HugeTLB allocation/release/accounting currently spans all the nodes in the
N_MEMORY node mask. Coherent device memory nodes should not be part of
these allocations. So use the ram_nodemask() helper to fetch the system
RAM only nodes on the platform, which can then be used for HugeTLB
allocation purposes instead of the N_MEMORY node mask. This isolates
coherent device memory nodes from HugeTLB allocations.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/hugetlb.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c7025c1..698af91 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1790,6 +1790,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	nodemask_t ram_nodes = ram_nodemask();
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -1816,7 +1817,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	while (nr_pages--) {
 		h->resv_huge_pages--;
 		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		if (!free_pool_huge_page(h, &ram_nodes, 1))
 			goto out;
 		cond_resched_lock(&hugetlb_lock);
 	}
@@ -2107,8 +2108,9 @@ int __weak alloc_bootmem_huge_page(struct hstate *h)
 {
 	struct huge_bootmem_page *m;
 	int nr_nodes, node;
+	nodemask_t ram_nodes = ram_nodemask();
 
-	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+	for_each_node_mask_to_alloc(h, nr_nodes, node, &ram_nodes) {
 		void *addr;
 
 		addr = memblock_virt_alloc_try_nid_nopanic(
@@ -2177,13 +2179,14 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	nodemask_t ram_nodes = ram_nodemask();
+
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
-		} else if (!alloc_fresh_huge_page(h,
-					 &node_states[N_MEMORY]))
+		} else if (!alloc_fresh_huge_page(h, &ram_nodes))
 			break;
 	}
 	h->max_huge_pages = i;
@@ -2420,6 +2423,8 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 					   unsigned long count, size_t len)
 {
 	int err;
+	nodemask_t ram_nodes = ram_nodemask();
+
 	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
@@ -2434,7 +2439,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		if (!(obey_mempolicy &&
 				init_nodemask_of_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
-			nodes_allowed = &node_states[N_MEMORY];
+			nodes_allowed = &ram_nodes;
 		}
 	} else if (nodes_allowed) {
 		/*
@@ -2444,11 +2449,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
 		init_nodemask_of_node(nodes_allowed, nid);
 	} else
-		nodes_allowed = &node_states[N_MEMORY];
+		nodes_allowed = &ram_nodes;
 
 	h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
 
-	if (nodes_allowed != &node_states[N_MEMORY])
+	if (nodes_allowed != &ram_nodes)
 		NODEMASK_FREE(nodes_allowed);
 
 	return len;
@@ -2745,9 +2750,10 @@ static void hugetlb_register_node(struct node *node)
  */
 static void __init hugetlb_register_all_nodes(void)
 {
+	nodemask_t nodes = ram_nodemask();
 	int nid;
 
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_mask(nid, nodes) {
 		struct node *node = node_devices[nid];
 		if (node->dev.id == nid)
 			hugetlb_register_node(node);
@@ -3019,11 +3025,12 @@ void hugetlb_show_meminfo(void)
 {
 	struct hstate *h;
 	int nid;
+	nodemask_t ram_nodes = ram_nodemask();
 
 	if (!hugepages_supported())
 		return;
 
-	for_each_node_state(nid, N_MEMORY)
+	for_each_node_mask(nid, ram_nodes)
 		for_each_hstate(h)
 			pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n",
 				nid,
-- 
2.9.3


* [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 01/12] mm: Define coherent device memory (CDM) node Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30 17:34   ` Dave Hansen
  2017-01-30  3:35 ` [RFC V2 04/12] mm: Change mbind(MPOL_BIND) implementation for CDM nodes Anshuman Khandual
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Kernel allocations on a CDM node have already been prevented by putting
its entire memory in ZONE_MOVABLE. But the CDM nodes must also be isolated
from implicit allocations happening on the system.

Any isolation seeking CDM node requires isolation from implicit memory
allocations from user space, but at the same time there should also be an
explicit way to allocate its memory.

A node's two zonelists are fundamental to where the memory comes from when
there is an allocation request. In order to achieve the two objectives
stated above, the zonelist building process has to change, as both
zonelists (i.e. FALLBACK and NOFALLBACK) give access to the node's memory
zones during any kind of memory allocation. The following changes are
implemented in this regard.

* CDM node's zones are not part of any other node's FALLBACK zonelist
* CDM node's FALLBACK list contains its own memory zones, followed by
  all system RAM zones in regular order as before
* CDM node's zones are part of its own NOFALLBACK zonelist

These changes ensure the following, which in turn isolates the CDM nodes
as desired.

* There won't be any implicit memory allocation ending up on a CDM node
* Only __GFP_THISNODE marked allocations will come from a CDM node
* CDM node memory can be allocated through the mbind(MPOL_BIND) interface
* System RAM will be used as a fallback option, in regular order, in case
  the CDM memory is insufficient during a targeted allocation request

Sample zonelist configuration:

[NODE (0)]						RAM
        ZONELIST_FALLBACK (0xc00000000140da00)
                (0) (node 0) (DMA     0xc00000000140c000)
                (1) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc000000001411a10)
                (0) (node 0) (DMA     0xc00000000140c000)
[NODE (1)]						RAM
        ZONELIST_FALLBACK (0xc000000100001a00)
                (0) (node 1) (DMA     0xc000000100000000)
                (1) (node 0) (DMA     0xc00000000140c000)
        ZONELIST_NOFALLBACK (0xc000000100005a10)
                (0) (node 1) (DMA     0xc000000100000000)
[NODE (2)]						CDM
        ZONELIST_FALLBACK (0xc000000001427700)
                (0) (node 2) (Movable 0xc000000001427080)
                (1) (node 0) (DMA     0xc00000000140c000)
                (2) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc00000000142b710)
                (0) (node 2) (Movable 0xc000000001427080)
[NODE (3)]						CDM
        ZONELIST_FALLBACK (0xc000000001431400)
                (0) (node 3) (Movable 0xc000000001430d80)
                (1) (node 0) (DMA     0xc00000000140c000)
                (2) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc000000001435410)
                (0) (node 3) (Movable 0xc000000001430d80)
[NODE (4)]						CDM
        ZONELIST_FALLBACK (0xc00000000143b100)
                (0) (node 4) (Movable 0xc00000000143aa80)
                (1) (node 0) (DMA     0xc00000000140c000)
                (2) (node 1) (DMA     0xc000000100000000)
        ZONELIST_NOFALLBACK (0xc00000000143f110)
                (0) (node 4) (Movable 0xc00000000143aa80)

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3e0c69..5db353a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4825,6 +4825,16 @@ static void build_zonelists(pg_data_t *pgdat)
 	i = 0;
 
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+#ifdef CONFIG_COHERENT_DEVICE
+		/*
+		 * CDM node's own zones should not be part of any other
+		 * node's fallback zonelist but only it's own fallback
+		 * zonelist.
+		 */
+		if (is_cdm_node(node) && (pgdat->node_id != node))
+			continue;
+#endif
+
 		/*
 		 * We don't want to pressure a particular node.
 		 * So adding penalty to the first node in same
-- 
2.9.3


* [RFC V2 04/12] mm: Change mbind(MPOL_BIND) implementation for CDM nodes
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (2 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Anshuman Khandual
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

CDM nodes need a way to support explicit memory allocation from user
space. After the previous FALLBACK zonelist rebuilding process changes, an
mbind(MPOL_BIND) based allocation request fails on a CDM node. This is
because the allocation requesting local node's FALLBACK zonelist is
selected for the nodemask processing targeted at the MPOL_BIND
implementation. As the CDM node's zones are not part of any regular node's
FALLBACK zonelist, the allocation simply fails without finding any valid
zone. The allocation requesting node is always going to be different from
the CDM node, which does not have any CPU. Hence the MPOL_BIND
implementation must choose the given CDM node's FALLBACK zonelist instead
of the requesting local node's FALLBACK zonelist. This implements that
change.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/mempolicy.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1e7873e..6089c711 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1692,6 +1692,27 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+#ifdef CONFIG_COHERENT_DEVICE
+	/*
+	 * Coherent Device Memory (CDM)
+	 *
+	 * In case the local requesting node is not part of the nodemask, test
+	 * if the first node in the nodemask is CDM, in which case select it.
+	 *
+	 * XXX: There are multiple ways of doing this. This node check can be
+	 * restricted to the first node in the node mask as implemented here or
+	 * scan through the entire nodemask to find out any present CDM node on
+	 * it or select the first CDM node only if all other nodes in the node
+	 * mask are CDM. These are various possible approaches; the first one
+	 * is implemented here.
+	 */
+	if (policy->mode == MPOL_BIND) {
+		if (unlikely(!node_isset(nd, policy->v.nodes))) {
+			if (is_cdm_node(first_node(policy->v.nodes)))
+				nd = first_node(policy->v.nodes);
+		}
+	}
+#endif
 	return node_zonelist(nd, gfp);
 }
 
-- 
2.9.3


* [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init()
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (3 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 04/12] mm: Change mbind(MPOL_BIND) implementation for CDM nodes Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30 17:36   ` Dave Hansen
  2017-01-30 20:30   ` Mel Gorman
  2017-01-30  3:35 ` [RFC V2 06/12] mm: Exclude CDM nodes from task->mems_allowed and root cpuset Anshuman Khandual
                   ` (16 subsequent siblings)
  21 siblings, 2 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Currently cpusets_enabled() wrongly returns 0 even when a root cpuset is
configured on the system. This was missed when the jump label was
introduced in place of number_of_cpusets by commit 664eeddeef65
("mm: page_alloc: use jump labels to avoid checking number_of_cpusets").
This fixes the problem so that cpusets_enabled() returns positive even
with only the root cpuset present.

Fixes: 664eeddeef65 ("mm: page_alloc: use jump labels to avoid checking number_of_cpusets")
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 kernel/cpuset.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b308888..be75f3f 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2133,6 +2133,8 @@ int __init cpuset_init(void)
 	set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
 	top_cpuset.relax_domain_level = -1;
 
+	cpuset_inc();
+
 	err = register_filesystem(&cpuset_fs_type);
 	if (err < 0)
 		return err;
-- 
2.9.3


* [RFC V2 06/12] mm: Exclude CDM nodes from task->mems_allowed and root cpuset
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (4 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 07/12] mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE Anshuman Khandual
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

The task struct's mems_allowed element decides the final nodemask from
which memory can be allocated in the task context, irrespective of any
applicable memory policy. CDM nodes should not be used for user
allocations; this is one of their overall isolation requirements. So they
should not be part of any task's mems_allowed nodemask. The system RAM
nodemask is used instead of the node_states[N_MEMORY] nodemask during
mems_allowed initialization and during its updates on memory hotplugs.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 kernel/cpuset.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index be75f3f..4e1df26 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -364,9 +364,11 @@ static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
  */
 static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 {
-	while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY]))
+	nodemask_t ram_nodes = ram_nodemask();
+
+	while (!nodes_intersects(cs->effective_mems, ram_nodes))
 		cs = parent_cs(cs);
-	nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]);
+	nodes_and(*pmask, cs->effective_mems, ram_nodes);
 }
 
 /*
@@ -2303,7 +2305,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* fetch the available cpus/mems and find out which changed how */
 	cpumask_copy(&new_cpus, cpu_active_mask);
-	new_mems = node_states[N_MEMORY];
+	new_mems = ram_nodemask();
 
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
@@ -2395,11 +2397,11 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
 void __init cpuset_init_smp(void)
 {
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
-	top_cpuset.mems_allowed = node_states[N_MEMORY];
+	top_cpuset.mems_allowed = ram_nodemask();
 	top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;
 
 	cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
-	top_cpuset.effective_mems = node_states[N_MEMORY];
+	top_cpuset.effective_mems = ram_nodemask();
 
 	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
 
-- 
2.9.3


* [RFC V2 07/12] mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (5 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 06/12] mm: Exclude CDM nodes from task->mems_allowed and root cpuset Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 08/12] mm: Add new VMA flag VM_CDM Anshuman Khandual
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

__GFP_THISNODE specifically asks for the memory to be allocated from the
given node. Not all the requests that end up in __alloc_pages_nodemask()
originate from the process context, where cpusets make sense. The current
condition enforces the cpuset limitation on every allocation, whether
originated from process context or not, which prevents __GFP_THISNODE
mandated allocations from coming from the specified node. In the context
of a coherent device memory node, which is isolated from every cpuset
nodemask in the system, this blocks the only way of allocating into it.
This patch changes that condition.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5db353a..609cf9c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3778,7 +3778,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
 
-	if (cpusets_enabled()) {
+	if (cpusets_enabled() && !(alloc_mask & __GFP_THISNODE)) {
 		alloc_mask |= __GFP_HARDWALL;
 		alloc_flags |= ALLOC_CPUSET;
 		if (!ac.nodemask)
-- 
2.9.3
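
	With this change, a kernel side user such as the device driver can
target the CDM node explicitly. A hedged sketch, where cdm_nid is
hypothetical:

#include <linux/gfp.h>

/* Without __GFP_THISNODE, the cpuset mems_allowed mask (which
 * excludes CDM nodes) would veto this allocation. */
static struct page *cdm_alloc_page(int cdm_nid)
{
	return alloc_pages_node(cdm_nid,
				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}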


* [RFC V2 08/12] mm: Add new VMA flag VM_CDM
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (6 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 07/12] mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30 18:52   ` Jerome Glisse
  2017-01-30  3:35 ` [RFC V2 09/12] mm: Exclude CDM marked VMAs from auto NUMA Anshuman Khandual
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

VMAs which contain CDM memory pages should be marked with the new VM_CDM
flag. These VMAs need to be identified in various core kernel paths for
special handling, and this flag will help in their identification.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 include/linux/mm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b..82482d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -182,6 +182,11 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
+
+#ifdef CONFIG_COHERENT_DEVICE
+#define VM_CDM		0x00800000	/* Contains coherent device memory */
+#endif
+
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
 #define VM_ARCH_2	0x02000000
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
-- 
2.9.3


* [RFC V2 09/12] mm: Exclude CDM marked VMAs from auto NUMA
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (7 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 08/12] mm: Add new VMA flag VM_CDM Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 10/12] mm: Ignore madvise(MADV_MERGEABLE) request for VM_CDM marked VMAs Anshuman Khandual
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

The kernel cannot track device memory accesses behind VMAs containing CDM
memory. Hence all the VM_CDM marked VMAs should be kept out of the auto
NUMA migration scheme. This patch also adds a new function is_cdm_vma()
to detect any VMA marked with the flag VM_CDM.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 include/linux/mempolicy.h | 14 ++++++++++++++
 kernel/sched/fair.c       |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f4d828..ff0c6bc 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -172,6 +172,20 @@ extern int mpol_parse_str(char *str, struct mempolicy **mpol);
 
 extern void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol);
 
+#ifdef CONFIG_COHERENT_DEVICE
+static inline bool is_cdm_vma(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_CDM)
+		return true;
+	return false;
+}
+#else
+static inline bool is_cdm_vma(struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 /* Check if a vma is migratable */
 static inline bool vma_migratable(struct vm_area_struct *vma)
 {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6559d19..523508c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2482,7 +2482,8 @@ void task_numa_work(struct callback_head *work)
 	}
 	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
-			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+			is_vm_hugetlb_page(vma) || is_cdm_vma(vma) ||
+					(vma->vm_flags & VM_MIXEDMAP)) {
 			continue;
 		}
 
-- 
2.9.3


* [RFC V2 10/12] mm: Ignore madvise(MADV_MERGEABLE) request for VM_CDM marked VMAs
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (8 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 09/12] mm: Exclude CDM marked VMAs from auto NUMA Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault Anshuman Khandual
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

VMAs containing CDM memory should be excluded from KSM merging. With this
change, a madvise(MADV_MERGEABLE) request on such a target VMA is ignored.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/ksm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 9ae6011..2fb8939 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -37,6 +37,7 @@
 #include <linux/freezer.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/mempolicy.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -1751,6 +1752,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 				 VM_HUGETLB | VM_MIXEDMAP))
 			return 0;		/* just ignore the advice */
 
+		if (is_cdm_vma(vma))
+			return 0;
+
 #ifdef VM_SAO
 		if (*vm_flags & VM_SAO)
 			return 0;
-- 
2.9.3


* [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (9 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 10/12] mm: Ignore madvise(MADV_MERGEABLE) request for VM_CDM marked VMAs Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30 17:51   ` Dave Hansen
  2017-01-30  3:35 ` [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND) Anshuman Khandual
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Mark the corresponding VMA with the VM_CDM flag if the allocated page
happens to be from a CDM node. This can be expensive from a performance
standpoint. There are multiple checks to avoid an expensive page_to_nid()
lookup, but it can be optimized further.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/mempolicy.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6089c711..78e095b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -174,6 +174,29 @@ static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
 	nodes_onto(*ret, tmp, *rel);
 }
 
+#ifdef CONFIG_COHERENT_DEVICE
+static void mark_vma_cdm(nodemask_t *nmask,
+		struct page *page, struct vm_area_struct *vma)
+{
+	if (!page)
+		return;
+
+	if (vma->vm_flags & VM_CDM)
+		return;
+
+	if (nmask && !nodemask_has_cdm(*nmask))
+		return;
+
+	if (is_cdm_node(page_to_nid(page)))
+		vma->vm_flags |= VM_CDM;
+}
+#else
+static void mark_vma_cdm(nodemask_t *nmask,
+		struct page *page, struct vm_area_struct *vma)
+{
+}
+#endif
+
 static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
@@ -2039,6 +2062,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	nmask = policy_nodemask(gfp, pol);
 	zl = policy_zonelist(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
+	mark_vma_cdm(nmask, page, vma);
 	mpol_cond_put(pol);
 out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-- 
2.9.3


* [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND)
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (10 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30 17:54   ` Dave Hansen
  2017-01-30  3:35 ` [DEBUG 13/21] powerpc/mm: Identify coherent device memory nodes during platform init Anshuman Khandual
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Mark all the applicable VMAs with VM_CDM explicitly during an
mbind(MPOL_BIND) call if the user provided nodemask has a CDM node.

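For illustration, a minimal userspace sketch of the scenario this targets.
It assumes node 2 happens to be a CDM node on the running system (the node
number is made up for the example), and needs -lnuma for the mbind() wrapper:

	#include <stdio.h>
	#include <sys/mman.h>
	#include <numaif.h>

	int main(void)
	{
		unsigned long nodemask = 1UL << 2;	/* assume node 2 is CDM */
		size_t len = 1UL << 24;
		void *buf;

		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		/*
		 * With this patch, an MPOL_BIND nodemask containing a CDM
		 * node tags the backing VMA with VM_CDM right here, before
		 * any page has been faulted in.
		 */
		if (mbind(buf, len, MPOL_BIND, &nodemask,
			  8 * sizeof(nodemask), 0))
			perror("mbind");
		return 0;
	}
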
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/mempolicy.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 78e095b..4482140 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -175,6 +175,16 @@ static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
 }
 
 #ifdef CONFIG_COHERENT_DEVICE
+static inline void set_vm_cdm(struct vm_area_struct *vma)
+{
+	vma->vm_flags |= VM_CDM;
+}
+
+static inline void clr_vm_cdm(struct vm_area_struct *vma)
+{
+	vma->vm_flags &= ~VM_CDM;
+}
+
 static void mark_vma_cdm(nodemask_t *nmask,
 		struct page *page, struct vm_area_struct *vma)
 {
@@ -191,6 +201,9 @@ static void mark_vma_cdm(nodemask_t *nmask,
 		vma->vm_flags |= VM_CDM;
 }
 #else
+static inline void set_vm_cdm(struct vm_area_struct *vma) { }
+static inline void clr_vm_cdm(struct vm_area_struct *vma) { }
+
 static void mark_vma_cdm(nodemask_t *nmask,
 		struct page *page, struct vm_area_struct *vma)
 {
@@ -770,6 +783,10 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 		vmstart = max(start, vma->vm_start);
 		vmend   = min(end, vma->vm_end);
 
+		if ((new_pol->mode == MPOL_BIND)
+			&& nodemask_has_cdm(new_pol->v.nodes))
+			set_vm_cdm(vma);
+
 		if (mpol_equal(vma_policy(vma), new_pol))
 			continue;
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 13/21] powerpc/mm: Identify coherent device memory nodes during platform init
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (11 preceding siblings ...)
  2017-01-30  3:35 ` [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND) Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [DEBUG 14/21] powerpc/mm: Create numa nodes for hotplug memory Anshuman Khandual
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Coherent device memory nodes will have "ibm,hotplug-aperture" as one of the
compatible properties in their respective device nodes in the device tree.
Detect them early during NUMA platform initialization and mark them as such
in the node_to_phys_device_map[] array, which in turn backs the
arch_check_node_cdm() function for the core VM.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9c73fbe..6def078 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -41,10 +41,12 @@
 #include <asm/setup.h>
 #include <asm/vdso.h>
 
+static int node_to_phys_device_map[MAX_NUMNODES];
+
 #ifdef CONFIG_COHERENT_DEVICE
 int arch_check_node_cdm(int nid)
 {
-	return 0;
+	return node_to_phys_device_map[nid];
 }
 #endif
 
@@ -790,6 +792,9 @@ static int __init parse_numa_properties(void)
 		if (nid < 0)
 			nid = default_nid;
 
+		if (of_device_is_compatible(memory, "ibm,hotplug-aperture"))
+			node_to_phys_device_map[nid] = 1;
+
 		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
 		node_set_online(nid);
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 14/21] powerpc/mm: Create numa nodes for hotplug memory
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (12 preceding siblings ...)
  2017-01-30  3:35 ` [DEBUG 13/21] powerpc/mm: Identify coherent device memory nodes during platform init Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [DEBUG 15/21] powerpc/mm: Enable CONFIG_MOVABLE_NODE for PPC64 platform Anshuman Khandual
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

From: Reza Arbab <arbab@linux.vnet.ibm.com>

When scanning the device tree to initialize the system NUMA topology,
process device tree elements with the compatible id "ibm,hotplug-aperture"
to create memoryless NUMA nodes.

These nodes will be filled when hotplug occurs within the associated
address range.

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 .../bindings/powerpc/opal/hotplug-aperture.txt     | 26 ++++++++++++++++++++++
 arch/powerpc/mm/numa.c                             | 10 +++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt

diff --git a/Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt b/Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt
new file mode 100644
index 0000000..b8dffaa
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt
@@ -0,0 +1,26 @@
+Designated hotplug memory
+-------------------------
+
+This binding describes a region of hotplug memory which is not present at boot,
+allowing its eventual NUMA associativity to be prespecified.
+
+Required properties:
+
+- compatible
+	"ibm,hotplug-aperture"
+
+- reg
+	base address and size of the region (standard definition)
+
+- ibm,associativity
+	NUMA associativity (standard definition)
+
+Example:
+
+A 2 GiB aperture at 0x100000000, to be part of nid 3 when hotplugged:
+
+	hotplug-memory@100000000 {
+		compatible = "ibm,hotplug-aperture";
+		reg = <0x0 0x100000000 0x0 0x80000000>;
+		ibm,associativity = <0x4 0x0 0x0 0x0 0x3>;
+	};
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6def078..5370833 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -717,6 +717,12 @@ static void __init parse_drconf_memory(struct device_node *memory)
 	}
 }
 
+static const struct of_device_id memory_match[] = {
+	{ .type = "memory" },
+	{ .compatible = "ibm,hotplug-aperture" },
+	{ /* sentinel */ }
+};
+
 static int __init parse_numa_properties(void)
 {
 	struct device_node *memory;
@@ -761,7 +767,7 @@ static int __init parse_numa_properties(void)
 
 	get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
 
-	for_each_node_by_type(memory, "memory") {
+	for_each_matching_node(memory, memory_match) {
 		unsigned long start;
 		unsigned long size;
 		int nid;
@@ -1056,7 +1062,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
 	struct device_node *memory;
 	int nid = -1;
 
-	for_each_node_by_type(memory, "memory") {
+	for_each_matching_node(memory, memory_match) {
 		unsigned long start, size;
 		int ranges;
 		const __be32 *memcell_buf;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 15/21] powerpc/mm: Enable CONFIG_MOVABLE_NODE for PPC64 platform
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (13 preceding siblings ...)
  2017-01-30  3:35 ` [DEBUG 14/21] powerpc/mm: Create numa nodes for hotplug memory Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [DEBUG 16/21] mm: Enable CONFIG_MOVABLE_NODE on powerpc Anshuman Khandual
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Enable the MOVABLE_NODE config option for the PPC64 platform by default.
This prevents accidentally building the kernel without the required
config option.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8273e6e..f7e1cd8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -314,6 +314,10 @@ config PGTABLE_LEVELS
 	default 3 if PPC_64K_PAGES && !PPC_BOOK3S_64
 	default 4
 
+config MOVABLE_NODE
+	bool
+	default y if PPC64
+
 source "init/Kconfig"
 
 source "kernel/Kconfig.freezer"
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 16/21] mm: Enable CONFIG_MOVABLE_NODE on powerpc
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (14 preceding siblings ...)
  2017-01-30  3:35 ` [DEBUG 15/21] powerpc/mm: Enable CONFIG_MOVABLE_NODE for PPC64 platform Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [DEBUG 17/21] mm: Export definition of 'zone_names' array through mmzone.h Anshuman Khandual
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

From: Reza Arbab <arbab@linux.vnet.ibm.com>

Onlining memory into ZONE_MOVABLE requires CONFIG_MOVABLE_NODE.

Enable the use of this config option on PPC64 platforms.

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 5b7d1e7..bc6ff72 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -158,7 +158,7 @@ config MOVABLE_NODE
 	bool "Enable to assign a node which has only movable memory"
 	depends on HAVE_MEMBLOCK
 	depends on NO_BOOTMEM
-	depends on X86_64 || OF_EARLY_FLATTREE || MEMORY_HOTPLUG
+	depends on X86_64 || PPC64 || OF_EARLY_FLATTREE || MEMORY_HOTPLUG
 	depends on NUMA
 	default n
 	help
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 17/21] mm: Export definition of 'zone_names' array through mmzone.h
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (15 preceding siblings ...)
  2017-01-30  3:35 ` [DEBUG 16/21] mm: Enable CONFIG_MOVABLE_NODE on powerpc Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:35 ` [DEBUG 18/21] mm: Add debugfs interface to dump each node's zonelist information Anshuman Khandual
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

zone_names[] is used to identify any zone given its index, which can be
useful in many other places. So export the definition through the
include/linux/mmzone.h header for broader access.

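As a hypothetical illustration of the kind of consumer this export enables
(the debugfs zonelist dump added later in this series is the actual user),
a minimal sketch:

	#include <linux/mmzone.h>	/* zone_names[], NODE_DATA() */
	#include <linux/printk.h>

	/* Sketch only: print the name of every populated zone on a node. */
	static void dump_node_zones(int nid)
	{
		pg_data_t *pgdat = NODE_DATA(nid);
		int i;

		for (i = 0; i < MAX_NR_ZONES; i++)
			if (populated_zone(&pgdat->node_zones[i]))
				pr_info("node %d: zone %s\n", nid,
					zone_names[i]);
	}
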
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 include/linux/mmzone.h | 1 +
 mm/page_alloc.c        | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f4aac87..fd1ab32 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -341,6 +341,7 @@ enum zone_type {
 
 };
 
+extern char * const zone_names[];
 #ifndef __GENERATING_BOUNDS_H
 
 struct zone {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 609cf9c..80fba54 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -212,7 +212,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 
 EXPORT_SYMBOL(totalram_pages);
 
-static char * const zone_names[MAX_NR_ZONES] = {
+char * const zone_names[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
 	 "DMA",
 #endif
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 18/21] mm: Add debugfs interface to dump each node's zonelist information
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (16 preceding siblings ...)
  2017-01-30  3:35 ` [DEBUG 17/21] mm: Export definition of 'zone_names' array through mmzone.h Anshuman Khandual
@ 2017-01-30  3:35 ` Anshuman Khandual
  2017-01-30  3:36 ` [DEBUG 19/21] mm: Add migrate_virtual_range migration interface Anshuman Khandual
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Each individual node in the system has a ZONELIST_FALLBACK zonelist
and a ZONELIST_NOFALLBACK zonelist. These zonelists decide the fallback
order of zones during memory allocations. Sometimes it helps to dump
these zonelists to see the priority order of the various zones in them.

This is particularly useful on platforms which support memory hotplug
into zones which did not exist at boot, where it helps in visualizing
which zonelists of the system, and at what priority level, the newly
hot-added memory ends up in. POWER is such a platform, where all the
memory detected during boot remains in ZONE_DMA for good, but the
hotplug process can then bring new memory into ZONE_MOVABLE. So having
a way to get a snapshot of the zonelists on the system after memory or
node hot[un]plug is desirable. This change adds a new debugfs interface
(/sys/kernel/debug/zonelists) which fetches and dumps this information.

Example zonelist information from a KVM guest with four NUMA nodes
on a POWER8 platform.

[NODE (0)]
	ZONELIST_FALLBACK
		(0) (Node 0) (DMA)
		(1) (Node 1) (DMA)
		(2) (Node 2) (DMA)
		(3) (Node 3) (DMA)
	ZONELIST_NOFALLBACK
		(0) (Node 0) (DMA)
[NODE (1)]
	ZONELIST_FALLBACK
		(0) (Node 1) (DMA)
		(1) (Node 2) (DMA)
		(2) (Node 3) (DMA)
		(3) (Node 0) (DMA)
	ZONELIST_NOFALLBACK
		(0) (Node 1) (DMA)
[NODE (2)]
	ZONELIST_FALLBACK
		(0) (Node 2) (DMA)
		(1) (Node 3) (DMA)
		(2) (Node 0) (DMA)
		(3) (Node 1) (DMA)
	ZONELIST_NOFALLBACK
		(0) (Node 2) (DMA)
[NODE (3)]
	ZONELIST_FALLBACK
		(0) (Node 3) (DMA)
		(1) (Node 0) (DMA)
		(2) (Node 1) (DMA)
		(3) (Node 2) (DMA)
	ZONELIST_NOFALLBACK
		(0) (Node 3) (DMA)

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/memory.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 6bf2b47..1099d35 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -64,6 +64,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/mmzone.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -3153,6 +3154,68 @@ static int __init fault_around_debugfs(void)
 		pr_warn("Failed to create fault_around_bytes in debugfs");
 	return 0;
 }
+
+#ifdef CONFIG_NUMA
+static void show_zonelist(struct seq_file *m, struct zonelist *zonelist)
+{
+	unsigned int i;
+
+	for (i = 0; zonelist->_zonerefs[i].zone; i++) {
+		seq_printf(m, "\t\t(%d) (Node %d) (%-7s 0x%pK)\n", i,
+			zonelist->_zonerefs[i].zone->zone_pgdat->node_id,
+			zone_names[zonelist->_zonerefs[i].zone_idx],
+			(void *) zonelist->_zonerefs[i].zone);
+	}
+}
+
+static int zonelists_show(struct seq_file *m, void *v)
+{
+	struct zonelist *zonelist;
+	unsigned int node;
+
+	for_each_online_node(node) {
+		zonelist = &(NODE_DATA(node)->
+				node_zonelists[ZONELIST_FALLBACK]);
+		seq_printf(m, "[NODE (%d)]\n", node);
+		seq_puts(m, "\tZONELIST_FALLBACK ");
+		seq_printf(m, "(0x%pK)\n", zonelist);
+		show_zonelist(m, zonelist);
+
+		zonelist = &(NODE_DATA(node)->
+				node_zonelists[ZONELIST_NOFALLBACK]);
+		seq_puts(m, "\tZONELIST_NOFALLBACK ");
+		seq_printf(m, "(0x%pK)\n", zonelist);
+		show_zonelist(m, zonelist);
+	}
+	return 0;
+}
+
+static int zonelists_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, zonelists_show, NULL);
+}
+
+static const struct file_operations zonelists_fops = {
+	.open		= zonelists_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init zonelists_debugfs(void)
+{
+	struct dentry *ret;
+
+	ret = debugfs_create_file("zonelists", 0444, NULL, NULL,
+			&zonelists_fops);
+	if (!ret)
+		pr_warn("Failed to create zonelists in debugfs");
+	return 0;
+}
+
+late_initcall(zonelists_debugfs);
+#endif /* CONFIG_NUMA */
+
 late_initcall(fault_around_debugfs);
 #endif
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 19/21] mm: Add migrate_virtual_range migration interface
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (17 preceding siblings ...)
  2017-01-30  3:35 ` [DEBUG 18/21] mm: Add debugfs interface to dump each node's zonelist information Anshuman Khandual
@ 2017-01-30  3:36 ` Anshuman Khandual
  2017-01-30  3:36 ` [DEBUG 20/21] drivers: Add two drivers for coherent device memory tests Anshuman Khandual
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Currently there is no interface a driver can call to initiate migration of
a user process's virtual address range. This adds one such function and
exports it for use by drivers.

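A minimal sketch of a driver-side caller, just to show the intended calling
convention; the helper name and the idea of a per-device CDM node id are
assumptions made up for this example:

	#include <linux/mempolicy.h>	/* migrate_virtual_range() */
	#include <linux/printk.h>

	/*
	 * Hypothetical driver helper: push one of a client's buffers onto
	 * the device's CDM node before launching a computation on it.
	 */
	static int push_range_to_device(int pid, unsigned long start,
					unsigned long len, int cdm_nid)
	{
		int ret;

		ret = migrate_virtual_range(pid, start, start + len, cdm_nid);
		if (ret > 0)
			pr_warn("%d pages were left behind\n", ret);
		return ret;
	}
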
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/mempolicy.c            | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ff0c6bc..b07d6dc 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -153,6 +153,8 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
 extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 				const nodemask_t *mask);
 extern unsigned int mempolicy_slab_node(void);
+extern int migrate_virtual_range(int pid, unsigned long start,
+			unsigned long end, int nid);
 
 extern enum zone_type policy_zone;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4482140..13cd5eb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2919,3 +2919,48 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+/*
+ * migrate_virtual_range - migrate all pages of a process faulted within
+ * a virtual address range to a specified node. This function is also
+ * exported to be used by device drivers dealing with CDM memory.
+ *
+ * @pid:	Process ID of the target process
+ * @start:	Start address of virtual range
+ * @end:	End address of virtual range
+ * @nid:	Target node for migration
+ *
+ * Returns 0 on success, the number of pages that were not migrated on
+ * failure, or a negative error code for invalid arguments.
+ */
+int migrate_virtual_range(int pid, unsigned long start,
+			unsigned long end, int nid)
+{
+	struct mm_struct *mm;
+	int ret = 0;
+
+	LIST_HEAD(mlist);
+
+	if ((!start) || (!end)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	rcu_read_lock();
+	mm = find_task_by_vpid(pid)->mm;
+	rcu_read_unlock();
+
+	down_write(&mm->mmap_sem);
+	queue_pages_range(mm, start, end, &node_states[N_MEMORY],
+			MPOL_MF_MOVE_ALL | MPOL_MF_DISCONTIG_OK, &mlist);
+	if (!list_empty(&mlist)) {
+		ret = migrate_pages(&mlist, new_node_page, NULL,
+					nid, MIGRATE_SYNC, MR_NUMA_MISPLACED);
+		if (ret)
+			putback_movable_pages(&mlist);
+	}
+	up_write(&mm->mmap_sem);
+out:
+	return ret;
+}
+EXPORT_SYMBOL(migrate_virtual_range);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 20/21] drivers: Add two drivers for coherent device memory tests
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (18 preceding siblings ...)
  2017-01-30  3:36 ` [DEBUG 19/21] mm: Add migrate_virtual_range migration interface Anshuman Khandual
@ 2017-01-30  3:36 ` Anshuman Khandual
  2017-01-30  3:36 ` [DEBUG 21/21] selftests/powerpc: Add a script to perform random VMA migrations Anshuman Khandual
  2017-01-31  5:48 ` [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

This adds two different drivers inside the drivers/char/ directory under two
new kernel config options, COHERENT_HOTPLUG_DEMO and COHERENT_MEMORY_DEMO.

1) coherent_hotplug_demo: Detects and hotplugs the coherent device memory
2) coherent_memory_demo:  Exports a debugfs interface for VMA migrations
   (see the usage sketch below)

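A userspace sketch of driving the second module. The device path is an
assumption: the module only registers char major 89 as "coherent_memory",
so a node would first need to be created with something like
'mknod /dev/coherent_memory c 89 0':

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>

	#define RAM_CRNT_MIGRATE 1	/* mirrors the driver's ioctl codes */
	#define CRNT_RAM_MIGRATE 2

	int main(void)
	{
		int fd = open("/dev/coherent_memory", O_RDWR);	/* assumed node */
		char *buf;

		if (fd < 0)
			return 1;

		buf = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (buf == MAP_FAILED)
			return 1;

		buf[0] = 1;	/* fault a page in via the demo vmops */

		/* Migrate the VMA containing 'buf' over to coherent memory */
		ioctl(fd, RAM_CRNT_MIGRATE, (unsigned long) buf);

		/* ... device works on the buffer ... then bring it back */
		ioctl(fd, CRNT_RAM_MIGRATE, (unsigned long) buf);

		close(fd);
		return 0;
	}
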
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 drivers/char/Kconfig                 |  23 +++
 drivers/char/Makefile                |   2 +
 drivers/char/coherent_hotplug_demo.c | 133 ++++++++++++++
 drivers/char/coherent_memory_demo.c  | 337 +++++++++++++++++++++++++++++++++++
 drivers/char/memory_online_sysfs.h   | 148 +++++++++++++++
 mm/mempolicy.c                       |   9 +-
 6 files changed, 651 insertions(+), 1 deletion(-)
 create mode 100644 drivers/char/coherent_hotplug_demo.c
 create mode 100644 drivers/char/coherent_memory_demo.c
 create mode 100644 drivers/char/memory_online_sysfs.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index fde005e..0a9fb82 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -588,6 +588,29 @@ config TILE_SROM
 	  device appear much like a simple EEPROM, and knows
 	  how to partition a single ROM for multiple purposes.
 
+config COHERENT_HOTPLUG_DEMO
+	tristate "Demo driver to test coherent memory node hotplug"
+	depends on PPC64 || COHERENT_DEVICE
+	default n
+	help
+	  Say yes when you want to build a test driver to hotplug all
+	  the coherent memory nodes present on the system. This driver
+	  scans the device tree for nodes compatible with
+	  "ibm,memory-device" and onlines their memory. When unloaded,
+	  it goes through the list of memory ranges it onlined before
+	  and offlines them one by one. If not sure, select N.
+
+config COHERENT_MEMORY_DEMO
+	tristate "Demo driver to test coherent memory node functionality"
+	depends on PPC64 || COHERENT_DEVICE
+	default n
+	help
+	  Say yes when you want to build a test driver to demonstrate
+	  the coherent memory functionality, capabilities and possible
+	  utilization. It also exports a debugfs file to accept inputs
+	  for virtual address range migration of any process. If not
+	  sure, select N.
+
 source "drivers/char/xillybus/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 6e6c244..92fa338 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -60,3 +60,5 @@ js-rtc-y = rtc.o
 obj-$(CONFIG_TILE_SROM)		+= tile-srom.o
 obj-$(CONFIG_XILLYBUS)		+= xillybus/
 obj-$(CONFIG_POWERNV_OP_PANEL)	+= powernv-op-panel.o
+obj-$(CONFIG_COHERENT_HOTPLUG_DEMO)	+= coherent_hotplug_demo.o
+obj-$(CONFIG_COHERENT_MEMORY_DEMO)	+= coherent_memory_demo.o
diff --git a/drivers/char/coherent_hotplug_demo.c b/drivers/char/coherent_hotplug_demo.c
new file mode 100644
index 0000000..bfc1254
--- /dev/null
+++ b/drivers/char/coherent_hotplug_demo.c
@@ -0,0 +1,133 @@
+/*
+ * Memory hotplug support for coherent memory nodes in runtime.
+ *
+ * Copyright (C) 2016, Reza Arbab, IBM Corporation.
+ * Copyright (C) 2016, Anshuman Khandual, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+#include <linux/of.h>
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/init.h>
+#include <linux/memblock.h>
+#include <linux/module.h>
+#include <linux/memory.h>
+#include <linux/sizes.h>
+#include <linux/bitops.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/migrate.h>
+#include <linux/memblock.h>
+#include <linux/uaccess.h>
+
+#include <asm/mmu.h>
+#include <asm/pgalloc.h>
+#include "memory_online_sysfs.h"
+
+#define MAX_HOTADD_NODES 100
+static phys_addr_t addr[MAX_HOTADD_NODES][2];
+static int nr_addr;
+
+/*
+ * extern int memory_failure(unsigned long pfn, int trapno, int flags);
+ * extern int min_free_kbytes;
+ * extern int user_min_free_kbytes;
+ *
+ * extern unsigned long nr_kernel_pages;
+ * extern unsigned long nr_all_pages;
+ * extern unsigned long dma_reserve;
+ */
+
+static void dump_core_vm_tunables(void)
+{
+/*
+ *	printk(":::::::: VM TUNABLES :::::::\n");
+ *	printk("[min_free_kbytes]	%d\n", min_free_kbytes);
+ *	printk("[user_min_free_kbytes]	%d\n", user_min_free_kbytes);
+ *	printk("[nr_kernel_pages]	%ld\n", nr_kernel_pages);
+ *	printk("[nr_all_pages]		%ld\n", nr_all_pages);
+ *	printk("[dma_reserve]		%ld\n", dma_reserve);
+ */
+}
+
+
+
+static int online_coherent_memory(void)
+{
+	struct device_node *memory;
+
+	nr_addr = 0;
+	disable_auto_online();
+	dump_core_vm_tunables();
+	for_each_compatible_node(memory, NULL, "ibm,memory-device") {
+		struct device_node *mem;
+		const __be64 *reg;
+		unsigned int len, ret;
+		phys_addr_t start, size;
+
+		mem = of_parse_phandle(memory, "memory-region", 0);
+		if (!mem) {
+			pr_info("memory-region property not found\n");
+			return -1;
+		}
+
+		reg = of_get_property(mem, "reg", &len);
+		if (!reg || len <= 0) {
+			pr_info("reg property not found\n");
+			return -1;
+		}
+		start = be64_to_cpu(*reg);
+		size = be64_to_cpu(*(reg + 1));
+		pr_info("Coherent memory start %llx size %llx\n", start, size);
+		ret = memory_probe_store(start, size);
+		if (ret)
+			pr_info("probe failed\n");
+
+		ret = store_mem_state(start, size, "online_movable");
+		if (ret)
+			pr_info("online_movable failed\n");
+
+		addr[nr_addr][0] = start;
+		addr[nr_addr][1] = size;
+		nr_addr++;
+	}
+	dump_core_vm_tunables();
+	enable_auto_online();
+	return 0;
+}
+
+static int offline_coherent_memory(void)
+{
+	int i;
+
+	for (i = 0; i < nr_addr; i++)
+		store_mem_state(addr[i][0], addr[i][1], "offline");
+	return 0;
+}
+
+static void __exit coherent_hotplug_exit(void)
+{
+	pr_info("%s\n", __func__);
+	offline_coherent_memory();
+}
+
+static int __init coherent_hotplug_init(void)
+{
+	pr_info("%s\n", __func__);
+	return online_coherent_memory();
+}
+module_init(coherent_hotplug_init);
+module_exit(coherent_hotplug_exit);
+MODULE_LICENSE("GPL");
diff --git a/drivers/char/coherent_memory_demo.c b/drivers/char/coherent_memory_demo.c
new file mode 100644
index 0000000..e711165
--- /dev/null
+++ b/drivers/char/coherent_memory_demo.c
@@ -0,0 +1,337 @@
+/*
+ * Demonstrating various aspects of the coherent memory.
+ *
+ * Copyright (C) 2016, Anshuman Khandual, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+#include <linux/of.h>
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/init.h>
+#include <linux/memblock.h>
+#include <linux/module.h>
+#include <linux/memory.h>
+#include <linux/sizes.h>
+#include <linux/bitops.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/migrate.h>
+#include <linux/memblock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+
+#include <asm/mmu.h>
+#include <asm/pgalloc.h>
+
+#define COHERENT_DEV_MAJOR 89
+#define COHERENT_DEV_NAME  "coherent_memory"
+
+#define CRNT_NODE_NID1 1
+#define CRNT_NODE_NID2 2
+#define CRNT_NODE_NID3 3
+
+#define RAM_CRNT_MIGRATE 1
+#define CRNT_RAM_MIGRATE 2
+
+struct vma_map_info {
+	struct list_head list;
+	unsigned long nr_pages;
+	spinlock_t lock;
+};
+
+static void vma_map_info_init(struct vm_area_struct *vma)
+{
+	struct vma_map_info *info = kmalloc(sizeof(struct vma_map_info),
+								GFP_KERNEL);
+
+	WARN_ON(!info);
+	INIT_LIST_HEAD(&info->list);
+	spin_lock_init(&info->lock);
+	vma->vm_private_data = info;
+	info->nr_pages = 0;
+}
+
+static void coherent_vmops_open(struct vm_area_struct *vma)
+{
+	vma_map_info_init(vma);
+}
+
+static void coherent_vmops_close(struct vm_area_struct *vma)
+{
+	struct vma_map_info *info = vma->vm_private_data;
+
+	WARN_ON(!info);
+again:
+	cond_resched();
+	spin_lock(&info->lock);
+	while (info->nr_pages) {
+		struct page *page, *page2;
+
+		list_for_each_entry_safe(page, page2, &info->list, lru) {
+			if (!trylock_page(page)) {
+				spin_unlock(&info->lock);
+				goto again;
+			}
+
+			list_del_init(&page->lru);
+			info->nr_pages--;
+			unlock_page(page);
+			SetPageReclaim(page);
+			put_page(page);
+		}
+		spin_unlock(&info->lock);
+		cond_resched();
+		spin_lock(&info->lock);
+	}
+	spin_unlock(&info->lock);
+	kfree(info);
+	vma->vm_private_data = NULL;
+}
+
+static int coherent_vmops_fault(struct vm_area_struct *vma,
+					struct vm_fault *vmf)
+{
+	struct vma_map_info *info;
+	struct page *page;
+	static int coherent_node = CRNT_NODE_NID1;
+
+	if (coherent_node == CRNT_NODE_NID1)
+		coherent_node = CRNT_NODE_NID2;
+	else
+		coherent_node = CRNT_NODE_NID1;
+
+	page = alloc_pages_node(coherent_node,
+				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
+	if (!page)
+		return VM_FAULT_SIGBUS;
+
+	info = (struct vma_map_info *) vma->vm_private_data;
+	WARN_ON(!info);
+	spin_lock(&info->lock);
+	list_add(&page->lru, &info->list);
+	info->nr_pages++;
+	spin_unlock(&info->lock);
+
+	page->index = vmf->pgoff;
+	get_page(page);
+	vmf->page = page;
+	return 0;
+}
+
+static const struct vm_operations_struct coherent_memory_vmops = {
+	.open = coherent_vmops_open,
+	.close = coherent_vmops_close,
+	.fault = coherent_vmops_fault,
+};
+
+static int coherent_memory_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	pr_info("Mmap opened (file: %lx vma: %lx)\n",
+			(unsigned long) file, (unsigned long) vma);
+	vma->vm_ops = &coherent_memory_vmops;
+	coherent_vmops_open(vma);
+	return 0;
+}
+
+static int coherent_memory_open(struct inode *inode, struct file *file)
+{
+	pr_info("Device opened (inode: %lx file: %lx)\n",
+			(unsigned long) inode, (unsigned long) file);
+	return 0;
+}
+
+static int coherent_memory_close(struct inode *inode, struct file *file)
+{
+	pr_info("Device closed (inode: %lx file: %lx)\n",
+			(unsigned long) inode, (unsigned long) file);
+	return 0;
+}
+
+static void lru_ram_coherent_migrate(unsigned long addr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	nodemask_t nmask;
+	LIST_HEAD(mlist);
+
+	nodes_clear(nmask);
+	nodes_setall(nmask);
+	down_write(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if ((addr < vma->vm_start) || (addr >= vma->vm_end))
+			continue;
+		break;
+	}
+	up_write(&mm->mmap_sem);
+	if (!vma) {
+		pr_info("%s: No VMA found\n", __func__);
+		return;
+	}
+	migrate_virtual_range(current->pid, vma->vm_start, vma->vm_end, 2);
+}
+
+static void lru_coherent_ram_migrate(unsigned long addr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	nodemask_t nmask;
+	LIST_HEAD(mlist);
+
+	nodes_clear(nmask);
+	nodes_setall(nmask);
+	down_write(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if ((addr < vma->vm_start) || (addr >= vma->vm_end))
+			continue;
+		break;
+	}
+	up_write(&mm->mmap_sem);
+	if (!vma) {
+		pr_info("%s: No VMA found\n", __func__);
+		return;
+	}
+	migrate_virtual_range(current->pid, vma->vm_start, vma->vm_end, 0);
+}
+
+static long coherent_memory_ioctl(struct file *file,
+					unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case RAM_CRNT_MIGRATE:
+		lru_ram_coherent_migrate(arg);
+		break;
+
+	case CRNT_RAM_MIGRATE:
+		lru_coherent_ram_migrate(arg);
+		break;
+
+	default:
+		pr_info("%s Invalid ioctl() command: %d\n", __func__, cmd);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static const struct file_operations fops = {
+	.mmap = coherent_memory_mmap,
+	.open = coherent_memory_open,
+	.release = coherent_memory_close,
+	.unlocked_ioctl = &coherent_memory_ioctl
+};
+
+static char kbuf[100];	/* Will store original user passed buffer */
+static char str[100];	/* Working copy for individual substring */
+
+static u64 args[4];
+static u64 index;
+static void convert_substring(const char *buf)
+{
+	u64 val = 0;
+
+	if (kstrtou64(buf, 0, &val))
+		pr_info("String conversion failed\n");
+
+	args[index] = val;
+	index++;
+}
+
+static ssize_t coherent_debug_write(struct file *file,
+					const char __user *user_buf,
+					size_t count, loff_t *ppos)
+{
+	char *tmp, *tmp1;
+	ssize_t ret;
+
+	memset(args, 0, sizeof(args));
+	index = 0;
+
+	ret = simple_write_to_buffer(kbuf, sizeof(kbuf) - 1, ppos, user_buf, count);
+	if (ret < 0)
+		return ret;
+
+	kbuf[ret] = '\0';
+	tmp = kbuf;
+	do {
+		tmp1 = strchr(tmp, ',');
+		if (tmp1) {
+			*tmp1 = '\0';
+			strlcpy(str, tmp, sizeof(str));
+			convert_substring(str);
+		} else {
+			strlcpy(str, tmp, sizeof(str));
+			convert_substring(str);
+			break;
+		}
+		tmp = tmp1 + 1;
+		memset(str, 0, sizeof(str));
+	} while (true);
+	migrate_virtual_range(args[0], args[1], args[2], args[3]);
+	return ret;
+}
+
+static int coherent_debug_show(struct seq_file *m, void *v)
+{
+	seq_puts(m, "Expected value: <pid,vaddr_start,vaddr_end,nid>\n");
+	return 0;
+}
+
+static int coherent_debug_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, coherent_debug_show, NULL);
+}
+
+static const struct file_operations coherent_debug_fops = {
+	.open		= coherent_debug_open,
+	.write		= coherent_debug_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct dentry *debugfile;
+
+static void coherent_memory_debugfs(void)
+{
+
+	debugfile = debugfs_create_file("coherent_debug", 0644, NULL, NULL,
+				&coherent_debug_fops);
+	if (!debugfile)
+		pr_warn("Failed to create coherent_memory in debugfs");
+}
+
+static void __exit coherent_memory_exit(void)
+{
+	pr_info("%s\n", __func__);
+	debugfs_remove(debugfile);
+	unregister_chrdev(COHERENT_DEV_MAJOR, COHERENT_DEV_NAME);
+}
+
+static int __init coherent_memory_init(void)
+{
+	int ret;
+
+	pr_info("%s\n", __func__);
+	ret = register_chrdev(COHERENT_DEV_MAJOR, COHERENT_DEV_NAME, &fops);
+	if (ret < 0) {
+		pr_info("%s register_chrdev() failed\n", __func__);
+		return -1;
+	}
+	coherent_memory_debugfs();
+	return 0;
+}
+
+module_init(coherent_memory_init);
+module_exit(coherent_memory_exit);
+MODULE_LICENSE("GPL");
diff --git a/drivers/char/memory_online_sysfs.h b/drivers/char/memory_online_sysfs.h
new file mode 100644
index 0000000..a5f022d
--- /dev/null
+++ b/drivers/char/memory_online_sysfs.h
@@ -0,0 +1,148 @@
+/*
+ * Accessing sysfs interface for memory hotplug operation from
+ * inside the kernel.
+ *
+ * Licensed under GPL V2
+ */
+#ifndef __SYSFS_H
+#define __SYSFS_H
+
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+
+#define AUTO_ONLINE_BLOCKS "/sys/devices/system/memory/auto_online_blocks"
+#define BLOCK_SIZE_BYTES   "/sys/devices/system/memory/block_size_bytes"
+#define MEMORY_PROBE       "/sys/devices/system/memory/probe"
+
+static ssize_t read_buf(char *filename, char *buf, ssize_t count)
+{
+	mm_segment_t old_fs;
+	struct file *filp;
+	loff_t pos = 0;
+
+	if (!count)
+		return 0;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	filp = filp_open(filename, O_RDONLY, 0);
+	if (IS_ERR(filp)) {
+		count = PTR_ERR(filp);
+		goto err_open;
+	}
+
+	count = vfs_read(filp, buf, count - 1, &pos);
+	buf[count] = '\0';
+
+	filp_close(filp, NULL);
+
+err_open:
+	set_fs(old_fs);
+
+	return count;
+}
+
+static unsigned long long read_0x(char *filename)
+{
+	unsigned long long ret;
+	char buf[32];
+
+	if (read_buf(filename, buf, 32) <= 0)
+		return 0;
+
+	if (kstrtoull(buf, 16, &ret))
+		return 0;
+
+	return ret;
+}
+
+static ssize_t write_buf(char *filename, char *buf)
+{
+	int ret;
+	mm_segment_t old_fs;
+	struct file *filp;
+	loff_t pos = 0;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	filp = filp_open(filename, O_WRONLY, 0);
+	if (IS_ERR(filp)) {
+		ret = PTR_ERR(filp);
+		goto err_open;
+	}
+
+	ret = vfs_write(filp, buf, strlen(buf), &pos);
+
+	filp_close(filp, NULL);
+
+err_open:
+	set_fs(old_fs);
+
+	return ret;
+}
+
+int memory_probe_store(phys_addr_t addr, phys_addr_t size)
+{
+	phys_addr_t block_sz =
+		read_0x(BLOCK_SIZE_BYTES);
+	long i;
+
+	for (i = 0; i < size / block_sz; i++, addr += block_sz) {
+		char s[32];
+		ssize_t count;
+
+		snprintf(s, 32, "0x%llx", addr);
+
+		count = write_buf(MEMORY_PROBE, s);
+		if (count < 0)
+			return count;
+	}
+
+	return 0;
+}
+
+int store_mem_state(phys_addr_t addr, phys_addr_t size, char *state)
+{
+	phys_addr_t block_sz = read_0x(BLOCK_SIZE_BYTES);
+	unsigned long start_block, end_block, i;
+
+	start_block = addr / block_sz;
+	end_block = start_block + size / block_sz;
+
+	for (i = end_block; i-- > start_block;) {	/* no unsigned wrap */
+		char filename[64];
+		ssize_t count;
+
+		snprintf(filename, 64,
+			 "/sys/devices/system/memory/memory%lu/state", i);
+
+		count = write_buf(filename, state);
+		if (count < 0)
+			return count;
+	}
+
+	return 0;
+}
+
+int disable_auto_online(void)
+{
+	int ret;
+
+	ret = write_buf(AUTO_ONLINE_BLOCKS, "offline");
+	if (ret)
+		return ret;
+	return 0;
+}
+
+int enable_auto_online(void)
+{
+	int ret;
+
+	ret = write_buf(AUTO_ONLINE_BLOCKS, "online");
+	if (ret)
+		return ret;
+	return 0;
+}
+#endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 13cd5eb..f65810a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2946,6 +2946,7 @@ int migrate_virtual_range(int pid, unsigned long start,
 		goto out;
 	}
 
+	pr_info("%s: %d %lx %lx %d: ", __func__, pid, start, end, nid);
 	rcu_read_lock();
 	mm = find_task_by_vpid(pid)->mm;
 	rcu_read_unlock();
@@ -2956,8 +2957,14 @@ int migrate_virtual_range(int pid, unsigned long start,
 	if (!list_empty(&mlist)) {
 		ret = migrate_pages(&mlist, new_node_page, NULL,
 					nid, MIGRATE_SYNC, MR_NUMA_MISPLACED);
-		if (ret)
+		if (ret) {
+			pr_info("migration_failed for %d pages\n", ret);
 			putback_movable_pages(&mlist);
+		} else {
+			pr_info("migration_passed\n");
+		}
+	} else {
+		pr_info("list_empty\n");
 	}
 	up_write(&mm->mmap_sem);
 out:
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [DEBUG 21/21] selftests/powerpc: Add a script to perform random VMA migrations
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (19 preceding siblings ...)
  2017-01-30  3:36 ` [DEBUG 20/21] drivers: Add two drivers for coherent device memory tests Anshuman Khandual
@ 2017-01-30  3:36 ` Anshuman Khandual
  2017-01-31  5:48 ` [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
  21 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-30  3:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

This is a test script which creates a workload (e.g. ebizzy), goes through
its VMAs (/proc/pid/maps) and initiates migration to random nodes, which can
be either system memory nodes or coherent memory nodes.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 tools/testing/selftests/vm/cdm_migration.sh | 77 +++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100755 tools/testing/selftests/vm/cdm_migration.sh

diff --git a/tools/testing/selftests/vm/cdm_migration.sh b/tools/testing/selftests/vm/cdm_migration.sh
new file mode 100755
index 0000000..3ded302
--- /dev/null
+++ b/tools/testing/selftests/vm/cdm_migration.sh
@@ -0,0 +1,77 @@
+#!/usr/bin/bash
+#
+# Should work with any workload and workload command line.
+# But for now ebizzy should be installed. Please run it
+# as root.
+#
+# Copyright (C) Anshuman Khandual 2016, IBM Corporation
+#
+# Licensed under GPL V2
+
+# Unload, build and reload modules
+if [ "$1" = "reload" ]
+then
+	rmmod coherent_memory_demo
+	rmmod coherent_hotplug_demo
+	cd ../../../../
+	make -s -j 64 modules
+	insmod drivers/char/coherent_hotplug_demo.ko
+	insmod drivers/char/coherent_memory_demo.ko
+	cd -
+fi
+
+# Workload
+workload=ebizzy
+work_cmd="ebizzy -T -z -m -t 128 -n 100000 -s 32768 -S 10000"
+
+pkill $workload
+$work_cmd &
+
+# File
+if [ -e input_file.txt ]
+then
+	rm input_file.txt
+fi
+
+# Inputs
+pid=`pidof ebizzy`
+cp /proc/$pid/maps input_file.txt
+if [ ! -e input_file.txt ]
+then
+	echo "Input file was not created"
+	exit
+fi
+input=input_file.txt
+
+# Migrations
+dmesg -C
+while read line
+do
+	addr_start=$(echo $line | cut -d '-' -f1)
+	addr_end=$(echo $line | cut -d '-' -f2 | cut -d ' ' -f1)
+	node=`expr $RANDOM % 5`
+
+	echo $pid,0x$addr_start,0x$addr_end,$node > \
+			/sys/kernel/debug/coherent_debug
+done < "$input"
+
+# Analyze dmesg output
+passed=`dmesg | grep "migration_passed" | wc -l`
+failed=`dmesg | grep "migration_failed" | wc -l`
+queuef=`dmesg | grep "queue_pages_range_failed" | wc -l`
+empty=`dmesg | grep "list_empty" | wc -l`
+missing=`dmesg | grep "vma_missing" | wc -l`
+
+# Stats
+echo passed	$passed
+echo failed	$failed
+echo queuef	$queuef
+echo empty	$empty
+echo missing	$missing
+
+# Cleanup
+rm input_file.txt
+if pgrep -x $workload > /dev/null
+then
+	pkill $workload
+fi
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes
  2017-01-30  3:35 ` [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes Anshuman Khandual
@ 2017-01-30 17:19   ` Dave Hansen
  2017-01-31  1:03     ` Anshuman Khandual
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Hansen @ 2017-01-30 17:19 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
> HugeTLB allocation/release/accounting currently spans across all the nodes
> under N_MEMORY node mask. Coherent memory nodes should not be part of these
> allocations. So use system_ram() call to fetch system RAM only nodes on the
> platform which can then be used for HugeTLB allocation purpose instead of
> N_MEMORY node mask. This isolates coherent device memory nodes from HugeTLB
> allocations.

Does this end up making it impossible to use hugetlbfs to access device
memory?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-30  3:35 ` [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
@ 2017-01-30 17:34   ` Dave Hansen
  2017-01-31  1:36     ` Anshuman Khandual
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Hansen @ 2017-01-30 17:34 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
> * CDM node's zones are not part of any other node's FALLBACK zonelist
> * CDM node's FALLBACK list contains it's own memory zones followed by
>   all system RAM zones in regular order as before
> * CDM node's zones are part of it's own NOFALLBACK zonelist

This seems like a sane policy for the system that you're describing.
But, it's still a policy, and it's rather hard-coded into the kernel.
Let's say we had a CDM node with 100x more RAM than the rest of the
system and it was just as fast as the rest of the RAM.  Would we still
want it isolated like this?  Or would we want a different policy?

Why do we need this hard-coded along with the cpuset stuff later in the
series.  Doesn't taking a node out of the cpuset also take it out of the
fallback lists?

>  	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
> +#ifdef CONFIG_COHERENT_DEVICE
> +		/*
> +		 * CDM node's own zones should not be part of any other
> +		 * node's fallback zonelist but only it's own fallback
> +		 * zonelist.
> +		 */
> +		if (is_cdm_node(node) && (pgdat->node_id != node))
> +			continue;
> +#endif

On a superficial note: Isn't that #ifdef unnecessary?  is_cdm_node() has
a 'return 0' stub when the config option is off anyway.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init()
  2017-01-30  3:35 ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Anshuman Khandual
@ 2017-01-30 17:36   ` Dave Hansen
  2017-01-30 20:30   ` Mel Gorman
  1 sibling, 0 replies; 58+ messages in thread
From: Dave Hansen @ 2017-01-30 17:36 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
> Currently cpusets_enabled() wrongfully returns 0 even if we have a root
> cpuset configured on the system. This got missed when jump level was
> introduced in place of number_of_cpusets with the commit 664eeddeef65
> ("mm: page_alloc: use jump labels to avoid checking number_of_cpusets")
> . This fixes the problem so that cpusets_enabled() returns positive even
> for the root cpuset.
> 
> Fixes: 664eeddeef65 ("mm: page_alloc: use jump labels to avoid")
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>

This needs to go upstream separately, right?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault
  2017-01-30  3:35 ` [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault Anshuman Khandual
@ 2017-01-30 17:51   ` Dave Hansen
  2017-01-31  5:10     ` Anshuman Khandual
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Hansen @ 2017-01-30 17:51 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

Here's the flag definition:

> +#ifdef CONFIG_COHERENT_DEVICE
> +#define VM_CDM		0x00800000	/* Contains coherent device memory */
> +#endif

But it doesn't match the implementation:

> +#ifdef CONFIG_COHERENT_DEVICE
> +static void mark_vma_cdm(nodemask_t *nmask,
> +		struct page *page, struct vm_area_struct *vma)
> +{
> +	if (!page)
> +		return;
> +
> +	if (vma->vm_flags & VM_CDM)
> +		return;
> +
> +	if (nmask && !nodemask_has_cdm(*nmask))
> +		return;
> +
> +	if (is_cdm_node(page_to_nid(page)))
> +		vma->vm_flags |= VM_CDM;
> +}

That flag is a one-way trip.  Any VMA with that flag set on it will keep
it for the life of the VMA, regardless of whether it has CDM pages in it
now or not.  Even if you changed the policy back to one that doesn't allow
CDM and forced all the pages to be migrated out.

This also assumes that the only way to get a page mapped into a VMA is
via alloc_pages_vma().  Do the NUMA migration APIs use this path?

When you *set* this flag, you don't go and turn off KSM merging, for
instance.  You keep it from being turned on from this point forward, but
you don't turn it off.

This is happening with mmap_sem held for read.  Correct?  Is it OK that
you're modifying the VMA?  That vm_flags manipulation is non-atomic, so
how can that even be safe?

If you're going to go down this route, I think you need to be very
careful.  We need to ensure that when this flag gets set, it's never set
on VMAs that are "normal" and will only be set on VMAs that were
*explicitly* set up for accessing CDM.  That means that you'll need to
make sure that there's no possible way to get a CDM page faulted into a
VMA unless it's via an explicitly assigned policy that would have caused
the VMA to be split from any "normal" one in the system.

This all makes me really nervous.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND)
  2017-01-30  3:35 ` [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND) Anshuman Khandual
@ 2017-01-30 17:54   ` Dave Hansen
  2017-01-31  4:36     ` Anshuman Khandual
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Hansen @ 2017-01-30 17:54 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
> +		if ((new_pol->mode == MPOL_BIND)
> +			&& nodemask_has_cdm(new_pol->v.nodes))
> +			set_vm_cdm(vma);

So, if you did:

	mbind(addr, PAGE_SIZE, MPOL_BIND, all_nodes, ...);
	mbind(addr, PAGE_SIZE, MPOL_BIND, one_non_cdm_node, ...);

You end up with a VMA that can never have KSM done on it, etc...  Even
though there's no good reason for it.  I guess /proc/$pid/smaps might be
able to help us figure out what was going on here, but that still seems
like an awful lot of damage.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 08/12] mm: Add new VMA flag VM_CDM
  2017-01-30  3:35 ` [RFC V2 08/12] mm: Add new VMA flag VM_CDM Anshuman Khandual
@ 2017-01-30 18:52   ` Jerome Glisse
  2017-01-31  4:22     ` Anshuman Khandual
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2017-01-30 18:52 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, mgorman, minchan,
	aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On Mon, Jan 30, 2017 at 09:05:49AM +0530, Anshuman Khandual wrote:
> VMA which contains CDM memory pages should be marked with new VM_CDM flag.
> These VMAs need to be identified in various core kernel paths for special
> handling and this flag will help in their identification.
> 
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>


Why do this on a per-vma basis? Why not special-case all those paths on a
per-page basis?

After all, you can have a big vma with some pages in it being cdm and others
being regular pages. The CPU process might migrate to a different CPU in a
different node and you might still want the regular pages to migrate
to this new node while keeping the cdm pages where they are, as the device
is still working on them.

This is just an example; the same can apply to ksm or any other kernel feature
you want to special-case. Maybe we can store a set of flags in the node that
tell what is allowed for pages in that node (ksm, hugetlb, migrate, numa, ...).

This would be more flexible and the policy choice can be left to each of
the device drivers.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init()
  2017-01-30  3:35 ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Anshuman Khandual
  2017-01-30 17:36   ` Dave Hansen
@ 2017-01-30 20:30   ` Mel Gorman
  2017-01-31 14:22     ` [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask Anshuman Khandual
  2017-01-31 14:36     ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Vlastimil Babka
  1 sibling, 2 replies; 58+ messages in thread
From: Mel Gorman @ 2017-01-30 20:30 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On Mon, Jan 30, 2017 at 09:05:46AM +0530, Anshuman Khandual wrote:
> Currently cpusets_enabled() wrongfully returns 0 even if we have a root
> cpuset configured on the system. This got missed when jump level was
> introduced in place of number_of_cpusets with the commit 664eeddeef65
> ("mm: page_alloc: use jump labels to avoid checking number_of_cpusets")
> . This fixes the problem so that cpusets_enabled() returns positive even
> for the root cpuset.
> 
> Fixes: 664eeddeef65 ("mm: page_alloc: use jump labels to avoid")
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>

Superficially, this appears to always activate the cpusets_enabled()
branch when it doesn't really make sense for the root cpuset to be
restricted. I strongly suspect it should be altered to cpuset_inc() only
if the root cpuset is configured to isolate memory.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes
  2017-01-30 17:19   ` Dave Hansen
@ 2017-01-31  1:03     ` Anshuman Khandual
  2017-01-31  1:37       ` Dave Hansen
  0 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31  1:03 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 10:49 PM, Dave Hansen wrote:
> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>> HugeTLB allocation/release/accounting currently spans across all the nodes
>> under N_MEMORY node mask. Coherent memory nodes should not be part of these
>> allocations. So use system_ram() call to fetch system RAM only nodes on the
>> platform which can then be used for HugeTLB allocation purpose instead of
>> N_MEMORY node mask. This isolates coherent device memory nodes from HugeTLB
>> allocations.
> 
> Does this end up making it impossible to use hugetlbfs to access device
> memory?

Right, that's the implementation at the moment. But going forward, if we need
to have HugeTLB pages on the CDM node, then we can implement that through the
sysfs interface from individual NUMA node paths instead of changing the
generic HugeTLB path. I wrote this up in the cover letter but should also
have mentioned it in the comment section of this patch as well. Does this
approach look okay?

"Now, we ensure complete HugeTLB allocation isolation from CDM nodes. Going
forward if we need to support HugeTLB allocation on CDM nodes on targeted
basis, then we would have to enable those allocations through the
/sys/devices/system/node/nodeN/hugepages/hugepages-16384kB/nr_hugepages
interface while still ensuring isolation from other generic sysctl and
/sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages interfaces."

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-30 17:34   ` Dave Hansen
@ 2017-01-31  1:36     ` Anshuman Khandual
  2017-01-31  1:57       ` Dave Hansen
  0 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31  1:36 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 11:04 PM, Dave Hansen wrote:
> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>> * CDM node's zones are not part of any other node's FALLBACK zonelist
>> * CDM node's FALLBACK list contains it's own memory zones followed by
>>   all system RAM zones in regular order as before
>> * CDM node's zones are part of it's own NOFALLBACK zonelist
> 
> This seems like a sane policy for the system that you're describing.
> But, it's still a policy, and it's rather hard-coded into the kernel.

Right. In the original RFC which I had posted in October, I had thought
about this issue and created 'pglist_data->coherent_device' as a u64
element where each bit in the mask can indicate a specific policy request
for the hot-plugged coherent device. But it looked too complicated for
the moment, in the absence of other potential coherent memory HW which
really requires anything other than isolation and an explicit allocation
method.

> Let's say we had a CDM node with 100x more RAM than the rest of the
> system and it was just as fast as the rest of the RAM.  Would we still
> want it isolated like this?  Or would we want a different policy?

In this particular case the CDM could be hot plugged into the system as a
normal NUMA node (I don't see any reason why it should not be treated as
a normal NUMA node), but I do understand the need for different policy
requirements for different kinds of coherent memory.

But then the other argument is: don't we want to keep this 100X more
memory isolated for some special purpose, to be utilized by specific
applications?

There is a sense that if the non system RAM memory is coherent and
similar to system RAM, there cannot be many differences in what it
would expect from the kernel.

> 
> Why do we need this hard-coded along with the cpuset stuff later in the
> series.  Doesn't taking a node out of the cpuset also take it out of the
> fallback lists?

There are two mutually exclusive approaches which are described in
this patch series.

(1) zonelist modification based approach
(2) cpuset restriction based approach

As mentioned in the cover letter,

"
NOTE: These two sets of patches are mutually exclusive of each other and
represent two different approaches. Only one of these sets should be
applied at any point of time.

Set1:
  mm: Change generic FALLBACK zonelist creation process
  mm: Change mbind(MPOL_BIND) implementation for CDM nodes

Set2:
  cpuset: Add cpuset_inc() inside cpuset_init()
  mm: Exclude CDM nodes from task->mems_allowed and root cpuset
  mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE
"

> 
>>  	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +		/*
>> +		 * CDM node's own zones should not be part of any other
>> +		 * node's fallback zonelist but only its own fallback
>> +		 * zonelist.
>> +		 */
>> +		if (is_cdm_node(node) && (pgdat->node_id != node))
>> +			continue;
>> +#endif
> 
> On a superficial note: Isn't that #ifdef unnecessary?  is_cdm_node() has
> a 'return 0' stub when the config option is off anyway.

Right, will fix it up.
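
For reference, the fixed hunk would then read something like this (a sketch
relying on the 'return 0' stub of is_cdm_node() when CONFIG_COHERENT_DEVICE
is off):

	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
		/*
		 * CDM node's own zones should not be part of any other
		 * node's fallback zonelist, only its own.
		 */
		if (is_cdm_node(node) && pgdat->node_id != node)
			continue;
		...
	}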

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes
  2017-01-31  1:03     ` Anshuman Khandual
@ 2017-01-31  1:37       ` Dave Hansen
  2017-02-01 13:59         ` Anshuman Khandual
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Hansen @ 2017-01-31  1:37 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 05:03 PM, Anshuman Khandual wrote:
> On 01/30/2017 10:49 PM, Dave Hansen wrote:
>> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>>> HugeTLB allocation/release/accounting currently spans across all the nodes
>>> under N_MEMORY node mask. Coherent memory nodes should not be part of these
>>> allocations. So use system_ram() call to fetch system RAM only nodes on the
>>> platform which can then be used for HugeTLB allocation purpose instead of
>>> N_MEMORY node mask. This isolates coherent device memory nodes from HugeTLB
>>> allocations.
>>
>> Does this end up making it impossible to use hugetlbfs to access device
>> memory?
> 
> Right, that's the implementation at the moment. But going forward, if we need
> to have HugeTLB pages on the CDM node, we can implement that through the
> sysfs interface of the individual NUMA node paths instead of changing the
> generic HugeTLB path. I wrote this up in the cover letter but should also
> have mentioned it in the comment section of this patch. Does this
> approach look okay?

The cover letter is not the most approachable document I've ever seen. :)

> "Now, we ensure complete HugeTLB allocation isolation from CDM nodes. Going
> forward if we need to support HugeTLB allocation on CDM nodes on targeted
> basis, then we would have to enable those allocations through the
> /sys/devices/system/node/nodeN/hugepages/hugepages-16384kB/nr_hugepages
> interface while still ensuring isolation from other generic sysctl and
> /sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages interfaces."

That would be passable if that's the only way you can allocate hugetlbfs
pages.  But we also have the fault-based allocations that can pull stuff
right out of the buddy allocator.  This approach would break that path
entirely.

FWIW, I think you really need to separate the true "CDM" stuff that's
*really* device-specific from the parts of this for which you really
just want to implement isolation.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-31  1:36     ` Anshuman Khandual
@ 2017-01-31  1:57       ` Dave Hansen
  2017-01-31  7:25         ` John Hubbard
  2017-02-01  6:40         ` Anshuman Khandual
  0 siblings, 2 replies; 58+ messages in thread
From: Dave Hansen @ 2017-01-31  1:57 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>> Let's say we had a CDM node with 100x more RAM than the rest of the
>> system and it was just as fast as the rest of the RAM.  Would we still
>> want it isolated like this?  Or would we want a different policy?
> 
> But then the other argument is: don't we want to keep this 100X more
> memory isolated for some special purpose, to be utilized by specific
> applications?

I was thinking that in this case, we wouldn't even want to bother with
having "system RAM" in the fallback lists.  A device who got its memory
usage off by 1% could start to starve the rest of the system.  A sane
policy in this case might be to isolate the "system RAM" from the device's.

>> Why do we need this hard-coded along with the cpuset stuff later in the
>> series.  Doesn't taking a node out of the cpuset also take it out of the
>> fallback lists?
> 
> There are two mutually exclusive approaches which are described in
> this patch series.
> 
> (1) zonelist modification based approach
> (2) cpuset restriction based approach
> 
> As mentioned in the cover letter,

Well, I'm glad you coded both of them up, but now that we have them how
do we pick which one to throw to the wolves?  Or, do we just merge both
of them and let one bitrot? ;)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 08/12] mm: Add new VMA flag VM_CDM
  2017-01-30 18:52   ` Jerome Glisse
@ 2017-01-31  4:22     ` Anshuman Khandual
  2017-01-31  6:05       ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31  4:22 UTC (permalink / raw)
  To: Jerome Glisse, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, mgorman, minchan,
	aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On 01/31/2017 12:22 AM, Jerome Glisse wrote:
> On Mon, Jan 30, 2017 at 09:05:49AM +0530, Anshuman Khandual wrote:
>> VMA which contains CDM memory pages should be marked with new VM_CDM flag.
>> These VMAs need to be identified in various core kernel paths for special
>> handling and this flag will help in their identification.
>>
>> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> 
> 
> Why do this on a vma basis? Why not special case all those paths on a page
> basis?

The primary motivation is the cost. Won't it be too expensive to account
for and act on individual pages rather than on the VMA as a whole? For
example, page_to_nid() seemed pretty expensive when I tried to tag the
VMA on an individual page fault basis.

> 
> After all, you can have a big vma with some pages in it being cdm and others
> being regular pages. The CPU process might migrate to a different CPU in a
> different node, and you might still want the regular pages to migrate
> to this new node and keep the cdm pages while the device is still working
> on them.

Right, that is the ideal thing to do. But won't it be better to split the
big VMA into smaller chunks and tag them appropriately, so that the tagged
VMAs contain as many CDM pages as possible and are then likely to be
restricted from auto NUMA, KSM etc.?

> 
> This is just an example; the same can apply to ksm or any other kernel feature
> you want to special case. Maybe we can store a set of flags in the node that
> tells what is allowed for pages in the node (ksm, hugetlb, migrate, numa, ...).
>
> This would be more flexible and the policy choice can be left to each of
> the device drivers.

Hmm, that's another way of doing the special cases. The other way, as Dave
had mentioned before, is to classify coherent memory properties into various
kinds, store them for each node, and implement a predefined set of
restrictions for each kind of coherent memory, which might include features
like auto NUMA, HugeTLB, KSM etc. Won't maintaining two different property
sets (one for the kind of coherent memory and the other for each special
case) be too complicated?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND)
  2017-01-30 17:54   ` Dave Hansen
@ 2017-01-31  4:36     ` Anshuman Khandual
  2017-02-07 18:07       ` Dave Hansen
  0 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31  4:36 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 11:24 PM, Dave Hansen wrote:
> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>> +		if ((new_pol->mode == MPOL_BIND)
>> +			&& nodemask_has_cdm(new_pol->v.nodes))
>> +			set_vm_cdm(vma);
> So, if you did:
> 
> 	mbind(addr, PAGE_SIZE, MPOL_BIND, all_nodes, ...);
> 	mbind(addr, PAGE_SIZE, MPOL_BIND, one_non_cdm_node, ...);
> 
> You end up with a VMA that can never have KSM done on it, etc...  Even
> though there's no good reason for it.  I guess /proc/$pid/smaps might be
> able to help us figure out what was going on here, but that still seems
> like an awful lot of damage.

Agreed, this VMA should not remain tagged after the second call. It does
not make sense. For this kind of scenario we can re-evaluate the VMA
tag every time a nodemask change is attempted. But if we are looking for
some runtime re-evaluation, then we need to steal some cycles during
general VMA processing opportunity points, like merge and split, to do
the necessary re-evaluation. Should we do both kinds of re-evaluation
to be more optimal?
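
For the mbind() side, the static re-evaluation could be a small helper along
these lines (a sketch only; the helper name is made up, while VM_CDM and
nodemask_has_cdm() are from this series):

	/* re-evaluate the tag on every policy change instead of only setting it */
	static void vma_reeval_cdm(struct vm_area_struct *vma,
				   struct mempolicy *new_pol)
	{
		if (new_pol->mode == MPOL_BIND &&
		    nodemask_has_cdm(new_pol->v.nodes))
			vma->vm_flags |= VM_CDM;
		else
			vma->vm_flags &= ~VM_CDM;
	}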

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault
  2017-01-30 17:51   ` Dave Hansen
@ 2017-01-31  5:10     ` Anshuman Khandual
  2017-01-31 17:54       ` Dave Hansen
  0 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31  5:10 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 11:21 PM, Dave Hansen wrote:
> Here's the flag definition:
> 
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +#define VM_CDM		0x00800000	/* Contains coherent device memory */
>> +#endif
> 
> But it doesn't match the implementation:
> 
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +static void mark_vma_cdm(nodemask_t *nmask,
>> +		struct page *page, struct vm_area_struct *vma)
>> +{
>> +	if (!page)
>> +		return;
>> +
>> +	if (vma->vm_flags & VM_CDM)
>> +		return;
>> +
>> +	if (nmask && !nodemask_has_cdm(*nmask))
>> +		return;
>> +
>> +	if (is_cdm_node(page_to_nid(page)))
>> +		vma->vm_flags |= VM_CDM;
>> +}
> 
> That flag is a one-way trip.  Any VMA with that flag set on it will keep
> it for the life of the VMA, despite whether it has CDM pages in it now
> or not.  Even if you changed the policy back to one that doesn't allow
> CDM and forced all the pages to be migrated out.

Right, we have this limitation right now. But as I have mentioned in the
reply on the other thread, I will work towards both static and runtime
re-evaluation of the VMA flag next time around.

> 
> This also assumes that the only way to get a page mapped into a VMA is
> via alloc_pages_vma().  Do the NUMA migration APIs use this path?

Right now I have just taken care of these two paths.

* Page fault path
* mbind() path

Agreed, I will work on the NUMA migration API paths next. I am wondering
whether I also need to update the migrate_pages() kernel API, as it will
be used by the driver, or whether the driver should tag the VMA explicitly,
knowing what has just happened. I had also mentioned this in the cover
letter :) But as you have pointed out, I will move the documentation
into the patches.

"
VM_CDM tagged VMA:

There are two parts to this problem.

* How to mark a VMA with VM_CDM ?
	- During page fault path
	- During mbind(MPOL_BIND) call
	- Any other paths ?
	- Should a driver mark a VMA with VM_CDM explicitly ?

* How VM_CDM marked VMA gets treated ?

	- Disabled from auto NUMA migrations
	- Disabled from KSM merging
	- Anything else ?
"

> 
> When you *set* this flag, you don't go and turn off KSM merging, for
> instance.  You keep it from being turned on from this point forward, but
> you don't turn it off.

I was under the impression that KSM merging does not start unless we do a
madvise(MADV_MERGEABLE) call on the VMA (which is where it is blocked now).
I might be missing something here if it can start beforehand.
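
For reference, that block amounts to an early bail-out in ksm_madvise(),
along these lines (a simplified sketch; the exact placement in the series
may differ):

	case MADV_MERGEABLE:
		/* refuse to make CDM backed VMAs mergeable */
		if (vma->vm_flags & VM_CDM)
			return 0;
		...
		*vm_flags |= VM_MERGEABLE;
		break;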

> 
> This is happening with mmap_sem held for read.  Correct?  Is it OK that
> you're modifying the VMA?  That vm_flags manipulation is non-atomic, so
> how can that even be safe?

Hmm, should it be done with mmap_sem held for write? Will look into this
further. But is intercepting the page faults inside alloc_pages_vma() to
tag the VMA okay from an overall design perspective? Or should this be
moved up or down the call chain in the page fault path?

> 
> If you're going to go down this route, I think you need to be very
> careful.  We need to ensure that when this flag gets set, it's never set
> on VMAs that are "normal" and will only be set on VMAs that were
> *explicitly* set up for accessing CDM.  That means that you'll need to
> make sure that there's no possible way to get a CDM page faulted into a
> VMA unless it's via an explicitly assigned policy that would have caused
> the VMA to be split from any "normal" one in the system.
> 
> This all makes me really nervous.

Got it, will work towards this.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 00/12] Define coherent device memory node
  2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
                   ` (20 preceding siblings ...)
  2017-01-30  3:36 ` [DEBUG 21/21] selftests/powerpc: Add a script to perform random VMA migrations Anshuman Khandual
@ 2017-01-31  5:48 ` Anshuman Khandual
  2017-01-31  6:15   ` Jerome Glisse
  21 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31  5:48 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

Hello Dave/Jerome/Mel,

Here is the overall layout of the functions I am trying to put together
through this patch series.

(1) Define CDM from core VM and kernel perspective

(2) Isolation/Special consideration for HugeTLB allocations

(3) Isolation/Special consideration for buddy allocations

	(a) Zonelist modification based isolation (proposed)
	(b) Cpuset modification based isolation	  (proposed)
	(c) Buddy modification based isolation	  (working)

(4) Define VMA containing CDM memory with a new flag VM_CDM

(5) Special consideration for VM_CDM marked VMAs

	(a) Special consideration for auto NUMA
	(b) Special consideration for KSM

Is there any other area which needs to be taken care of before a CDM
node can be represented completely inside the kernel?

Regards
Anshuman

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 08/12] mm: Add new VMA flag VM_CDM
  2017-01-31  4:22     ` Anshuman Khandual
@ 2017-01-31  6:05       ` Jerome Glisse
  0 siblings, 0 replies; 58+ messages in thread
From: Jerome Glisse @ 2017-01-31  6:05 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, mgorman, minchan,
	aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On Tue, Jan 31, 2017 at 09:52:20AM +0530, Anshuman Khandual wrote:
> On 01/31/2017 12:22 AM, Jerome Glisse wrote:
> > On Mon, Jan 30, 2017 at 09:05:49AM +0530, Anshuman Khandual wrote:
> >> VMA which contains CDM memory pages should be marked with new VM_CDM flag.
> >> These VMAs need to be identified in various core kernel paths for special
> >> handling and this flag will help in their identification.
> >>
> >> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> > 
> > 
> > Why do this on a vma basis? Why not special case all those paths on a page
> > basis?
> 
> The primary motivation is the cost. Won't it be too expensive to account
> for and act on individual pages rather than on the VMA as a whole? For
> example, page_to_nid() seemed pretty expensive when I tried to tag the
> VMA on an individual page fault basis.

No, I don't think it would be too expensive. What is confusing in this
patchset is that you are conflating 3 different problems. The first one is
how to create struct pages for coherent device memory and exclude those
pages from regular allocations.

The second one is how to allow userspace to set an allocation policy that
would direct allocation for a given vma to use a specific device memory.

Finally, the last one is how to block some kernel features such as numa or
ksm, as you expect (and I share that belief) that they will be hurtful.

I do believe that this last requirement is better left to be done on a per
page basis, as page_to_nid() is only a memory lookup, and I would be stunned
if that memory lookup registered as more than a blip on any profiler radar.

The vma flag as an all or nothing choice is bad in my view, and its
stickiness and how to handle its lifetime and inheritance are troubling and
hard. Checking through the node whether a page should undergo ksm or numa
is a better solution in my view.

> 
> > 
> > After all, you can have a big vma with some pages in it being cdm and others
> > being regular pages. The CPU process might migrate to a different CPU in a
> > different node, and you might still want the regular pages to migrate
> > to this new node and keep the cdm pages while the device is still working
> > on them.
> 
> Right, that is the ideal thing to do. But won't it be better to split the
> big VMA into smaller chunks and tag them appropriately, so that the tagged
> VMAs contain as many CDM pages as possible and are then likely to be
> restricted from auto NUMA, KSM etc.?

Think of a vma in which every odd 4k address points to a device page and
every even 4k address points to a regular page; would you want to create
as many vmas for this?

Setting a policy for allocation makes sense, but setting a flag that
enables/disables kernel features for a range, overriding other policies,
is bad in my view.

> 
> > 
> > This is just an example; the same can apply to ksm or any other kernel feature
> > you want to special case. Maybe we can store a set of flags in the node that
> > tells what is allowed for pages in the node (ksm, hugetlb, migrate, numa, ...).
> >
> > This would be more flexible and the policy choice can be left to each of
> > the device drivers.
> 
> Hmm, that's another way of doing the special cases. The other way, as Dave
> had mentioned before, is to classify coherent memory properties into various
> kinds, store them for each node, and implement a predefined set of
> restrictions for each kind of coherent memory, which might include features
> like auto NUMA, HugeTLB, KSM etc. Won't maintaining two different property
> sets (one for the kind of coherent memory and the other for each special
> case) be too complicated?

I am not sure I follow. You have a single mask provided by the driver that
registers the memory, something like:

CDM_ALLOW_NUMA (1 << 0)
CDM_ALLOW_KSM  (1 << 1)
...

Then you have bool page_node_allow_numa(page), bool page_node_allow_ksm(page),
... and that is it. Both numa and ksm perform heavy operations, and having to
go check a mask inside the node struct isn't gonna slow them down.

I am not talking about matching kinds to sets of restrictions. Just a simple
mask of things that are allowed on that memory. You can add things like GUP
or any other mechanism that I can't think of right now.
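
Concretely, a sketch of what I have in mind (the field and helper names here
are made up):

	/* filled in by the device driver when it registers the memory node */
	#define CDM_ALLOW_NUMA	(1 << 0)
	#define CDM_ALLOW_KSM	(1 << 1)

	static inline bool page_node_allow_ksm(struct page *page)
	{
		pg_data_t *pgdat = NODE_DATA(page_to_nid(page));

		/* coherent_features would be a new field in struct pglist_data */
		return pgdat->coherent_features & CDM_ALLOW_KSM;
	}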

I really think that the vma flag is a bad idea; my expectation is that we
will see more vmas with a mix of device and regular memory. I don't think
the only workloads will be some big device-only vma (ie only accessed by the
device) or CPU-only ones. I believe we will see everything on the spectrum
from highly fragmented to completely regular.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 00/12] Define coherent device memory node
  2017-01-31  5:48 ` [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
@ 2017-01-31  6:15   ` Jerome Glisse
  0 siblings, 0 replies; 58+ messages in thread
From: Jerome Glisse @ 2017-01-31  6:15 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, mgorman, minchan,
	aneesh.kumar, bsingharora, srikar, haren, dave.hansen,
	dan.j.williams

On Tue, Jan 31, 2017 at 11:18:49AM +0530, Anshuman Khandual wrote:
> Hello Dave/Jerome/Mel,
> 
> Here is the overall layout of the functions I am trying to put together
> through this patch series.
> 
> (1) Define CDM from core VM and kernel perspective
> 
> (2) Isolation/Special consideration for HugeTLB allocations
> 
> (3) Isolation/Special consideration for buddy allocations
> 
> 	(a) Zonelist modification based isolation (proposed)
> 	(b) Cpuset modification based isolation	  (proposed)
> 	(c) Buddy modification based isolation	  (working)
> 
> (4) Define VMA containing CDM memory with a new flag VM_CDM
> 
> (5) Special consideration for VM_CDM marked VMAs
> 
> 	(a) Special consideration for auto NUMA
> 	(b) Special consideration for KSM

I believe (5) should not be done on a per vma basis but on a page basis,
thus rendering (4) pointless. A vma shouldn't be special because it has
some special kind of memory, irrespective of what the vma points to.


> Is there any other area which needs to be taken care of before a CDM
> node can be represented completely inside the kernel?

Maybe things like swap or suspend and resume (I know you are targeting big
computers and not laptops :)), but you can't presume what platforms CDM
might be used on later.

Also, userspace might be confused by looking at /proc/meminfo or any of the
sysfs files and seeing all this device memory without understanding that it
is special and might be unwise to use for regular CPU-only tasks.

I would probably want CDM memory to be reported separately from the rest of
memory, which also most likely has repercussions for the memory cgroup.

My expectation is that you want to use device memory in a process if and
only if that process also uses the device to some extent. So having a new
cgroup hierarchy for this memory is probably a better path forward.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-31  1:57       ` Dave Hansen
@ 2017-01-31  7:25         ` John Hubbard
  2017-01-31 18:04           ` Dave Hansen
  2017-02-01  6:46           ` Anshuman Khandual
  2017-02-01  6:40         ` Anshuman Khandual
  1 sibling, 2 replies; 58+ messages in thread
From: John Hubbard @ 2017-01-31  7:25 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 05:57 PM, Dave Hansen wrote:
> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>> system and it was just as fast as the rest of the RAM.  Would we still
>>> want it isolated like this?  Or would we want a different policy?
>>
>> But then the other argument is: don't we want to keep this 100X more
>> memory isolated for some special purpose, to be utilized by specific
>> applications?
>
> I was thinking that in this case, we wouldn't even want to bother with
> having "system RAM" in the fallback lists.  A device who got its memory
> usage off by 1% could start to starve the rest of the system.  A sane
> policy in this case might be to isolate the "system RAM" from the device's.

I also don't like having these policies hard-coded, and your 100x example above 
helps clarify what can go wrong about it. It would be nicer if, instead, we could 
better express the "distance" between nodes (bandwidth, latency, relative to sysmem, 
perhaps), and let the NUMA system figure out the Right Thing To Do.

I realize that this is not quite possible with NUMA just yet, but I wonder if that's 
a reasonable direction to go with this?

thanks,
john h

>
>>> Why do we need this hard-coded along with the cpuset stuff later in the
>>> series.  Doesn't taking a node out of the cpuset also take it out of the
>>> fallback lists?
>>
>> There are two mutually exclusive approaches which are described in
>> this patch series.
>>
>> (1) zonelist modification based approach
>> (2) cpuset restriction based approach
>>
>> As mentioned in the cover letter,
>
> Well, I'm glad you coded both of them up, but now that we have them how
> do we pick which one to throw to the wolves?  Or, do we just merge both
> of them and let one bitrot? ;)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask
  2017-01-30 20:30   ` Mel Gorman
@ 2017-01-31 14:22     ` Anshuman Khandual
  2017-01-31 16:00       ` Mel Gorman
  2017-01-31 14:36     ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Vlastimil Babka
  1 sibling, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31 14:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

At present, top_cpuset.mems_allowed is the same as node_states[N_MEMORY] and
cannot be changed at runtime. The maximum possible node_states[N_MEMORY] also
gets reflected in the top_cpuset.effective_mems interface. This prevents
someone from removing or restricting memory placement system wide on a given
memory node through the cpuset mechanism, which can be limiting. This patch
solves the problem by enabling the update_nodemask() function to accept
changes to top_cpuset.mems_allowed as well. Once changed, it also updates
the value of top_cpuset.effective_mems and updates all of its tasks'
mems_allowed nodemasks. It calls cpuset_inc() to make sure cpuset is
accounted for in the buddy allocator through the cpusets_enabled() check.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
Tested for

* Enforcement of changed top_cpuset.mems_allowed
* Global mems_allowed cannot be changed while there are other
  cpusets present underneath the top root cpuset. I guess that
  is expected.

 kernel/cpuset.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b308888..e8c105a 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1210,15 +1210,6 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	int retval;
 
 	/*
-	 * top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
-	 * it's read-only
-	 */
-	if (cs == &top_cpuset) {
-		retval = -EACCES;
-		goto done;
-	}
-
-	/*
 	 * An empty mems_allowed is ok iff there are no tasks in the cpuset.
 	 * Since nodelist_parse() fails on an empty mask, we special case
 	 * that parsing.  The validate_change() call ensures that cpusets
@@ -1232,7 +1223,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 			goto done;
 
 		if (!nodes_subset(trialcs->mems_allowed,
-				  top_cpuset.mems_allowed)) {
+				  node_states[N_MEMORY])) {
 			retval = -EINVAL;
 			goto done;
 		}
@@ -1250,6 +1241,16 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	cs->mems_allowed = trialcs->mems_allowed;
 	spin_unlock_irq(&callback_lock);
 
+	if (cs == &top_cpuset) {
+		spin_lock_irq(&callback_lock);
+		cs->effective_mems = trialcs->mems_allowed;
+		spin_unlock_irq(&callback_lock);
+
+		update_tasks_nodemask(cs);
+		cpuset_inc();
+		goto done;
+	}
+
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
 done:
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init()
  2017-01-30 20:30   ` Mel Gorman
  2017-01-31 14:22     ` [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask Anshuman Khandual
@ 2017-01-31 14:36     ` Vlastimil Babka
  2017-01-31 15:30       ` Anshuman Khandual
  1 sibling, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2017-01-31 14:36 UTC (permalink / raw)
  To: Mel Gorman, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On 01/30/2017 09:30 PM, Mel Gorman wrote:
> On Mon, Jan 30, 2017 at 09:05:46AM +0530, Anshuman Khandual wrote:
>> Currently cpusets_enabled() wrongfully returns 0 even if we have a root
>> cpuset configured on the system. This got missed when jump level was
>> introduced in place of number_of_cpusets with the commit 664eeddeef65
>> ("mm: page_alloc: use jump labels to avoid checking number_of_cpusets")
>> . This fixes the problem so that cpusets_enabled() returns positive even
>> for the root cpuset.
>>
>> Fixes: 664eeddeef65 ("mm: page_alloc: use jump labels to avoid")
>> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> 
> Superficially, this appears to always activate the cpuset_enabled()
> branch when it doesn't really make sense that the root cpuset be
> restricted.

Yes, that's why the root cpuset doesn't "count", as it's not supposed to be
restricted (it's also documented in cpusets.txt). Thus the "Fixes:" tag
is very misleading.

> I strongly suspect it should be altered to cpuset_inc only
> if the root cpuset is configured to isolate memory.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init()
  2017-01-31 14:36     ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Vlastimil Babka
@ 2017-01-31 15:30       ` Anshuman Khandual
  0 siblings, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-01-31 15:30 UTC (permalink / raw)
  To: Vlastimil Babka, Mel Gorman, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On 01/31/2017 08:06 PM, Vlastimil Babka wrote:
> On 01/30/2017 09:30 PM, Mel Gorman wrote:
>> On Mon, Jan 30, 2017 at 09:05:46AM +0530, Anshuman Khandual wrote:
>>> Currently cpusets_enabled() wrongfully returns 0 even if we have a root
>>> cpuset configured on the system. This got missed when jump level was
>>> introduced in place of number_of_cpusets with the commit 664eeddeef65
>>> ("mm: page_alloc: use jump labels to avoid checking number_of_cpusets")
>>> . This fixes the problem so that cpusets_enabled() returns positive even
>>> for the root cpuset.
>>>
>>> Fixes: 664eeddeef65 ("mm: page_alloc: use jump labels to avoid")
>>> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>>
>> Superficially, this appears to always activate the cpuset_enabled()
>> branch when it doesn't really make sense that the root cpuset be
>> restricted.
> 
> Yes that's why root cpuset doesn't "count", as it's not supposed to be
> restricted (it's also documented in cpusets.txt) Thus the "Fixes:" tag
> is very misleading.

Agreed, I have removed the "Fixes: " tag in the proposed RFC already
posted on this thread, which presents this as a new enablement instead,
an addition to the capability we already have with cpuset.
It would be great if you could please take a look and provide feedback.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask
  2017-01-31 14:22     ` [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask Anshuman Khandual
@ 2017-01-31 16:00       ` Mel Gorman
  2017-02-01  7:31         ` Anshuman Khandual
  0 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2017-01-31 16:00 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On Tue, Jan 31, 2017 at 07:52:37PM +0530, Anshuman Khandual wrote:
> At present, top_cpuset.mems_allowed is the same as node_states[N_MEMORY] and
> cannot be changed at runtime. The maximum possible node_states[N_MEMORY] also
> gets reflected in the top_cpuset.effective_mems interface. This prevents
> someone from removing or restricting memory placement system wide on a given
> memory node through the cpuset mechanism, which can be limiting. This patch
> solves the problem by enabling the update_nodemask() function to accept
> changes to top_cpuset.mems_allowed as well. Once changed, it also updates
> the value of top_cpuset.effective_mems and updates all of its tasks'
> mems_allowed nodemasks. It calls cpuset_inc() to make sure cpuset is
> accounted for in the buddy allocator through the cpusets_enabled() check.
> 

What's the point of allowing the root cpuset to be restricted?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault
  2017-01-31  5:10     ` Anshuman Khandual
@ 2017-01-31 17:54       ` Dave Hansen
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Hansen @ 2017-01-31 17:54 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 09:10 PM, Anshuman Khandual wrote:
>> This is happening with mmap_sem held for read.  Correct?  Is it OK that
>> you're modifying the VMA?  That vm_flags manipulation is non-atomic, so
>> how can that even be safe?
> Hmm, should it be done with mmap_sem held for write? Will look into this
> further. But is intercepting the page faults inside alloc_pages_vma() to
> tag the VMA okay from an overall design perspective? Or should this be
> moved up or down the call chain in the page fault path?

Doing it in the fault path seems wrong to me.

Apps have to take *explicit* action to go and get access to device
memory.  It seems like we should mark the VMA *then*, at the time of the
explicit action.  I also think _implying_ that we want KSM, etc...
turned off just because of the target of an mbind() is a bad idea.  Apps
have to ask for this stuff *explicitly*, so why not also have them turn
KSM off explicitly?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-31  7:25         ` John Hubbard
@ 2017-01-31 18:04           ` Dave Hansen
  2017-01-31 19:14             ` David Nellans
  2017-02-01  6:56             ` Anshuman Khandual
  2017-02-01  6:46           ` Anshuman Khandual
  1 sibling, 2 replies; 58+ messages in thread
From: Dave Hansen @ 2017-01-31 18:04 UTC (permalink / raw)
  To: John Hubbard, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 11:25 PM, John Hubbard wrote:
> I also don't like having these policies hard-coded, and your 100x
> example above helps clarify what can go wrong about it. It would be
> nicer if, instead, we could better express the "distance" between nodes
> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
> system figure out the Right Thing To Do.
> 
> I realize that this is not quite possible with NUMA just yet, but I
> wonder if that's a reasonable direction to go with this?

In the end, I don't think the kernel can make the "right" decision very
widely here.

Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
evidently has a higher latency than DRAM.  Given a plain malloc(), how
is the kernel to know that the memory will be used for AVX-512
instructions that need lots of bandwidth vs. some random data structure
that's latency-sensitive?

In the end, I think all we can do is keep the kernel's existing default
of "low latency to the CPU that allocated it", and let apps override
when that policy doesn't fit them.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-31 18:04           ` Dave Hansen
@ 2017-01-31 19:14             ` David Nellans
  2017-02-01  6:56             ` Anshuman Khandual
  1 sibling, 0 replies; 58+ messages in thread
From: David Nellans @ 2017-01-31 19:14 UTC (permalink / raw)
  To: Dave Hansen, John Hubbard, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams



On 01/31/2017 12:04 PM, Dave Hansen wrote:
> On 01/30/2017 11:25 PM, John Hubbard wrote:
>> I also don't like having these policies hard-coded, and your 100x
>> example above helps clarify what can go wrong about it. It would be
>> nicer if, instead, we could better express the "distance" between nodes
>> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
>> system figure out the Right Thing To Do.
>>
>> I realize that this is not quite possible with NUMA just yet, but I
>> wonder if that's a reasonable direction to go with this?
> In the end, I don't think the kernel can make the "right" decision very
> widely here.
>
> Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
> evidently has a higher latency than DRAM.  Given a plain malloc(), how
> is the kernel to know that the memory will be used for AVX-512
> instructions that need lots of bandwidth vs. some random data structure
> that's latency-sensitive?
>
> In the end, I think all we can do is keep the kernel's existing default
> of "low latency to the CPU that allocated it", and let apps override
> when that policy doesn't fit them.
>
I think John's point is that latency might not be the predominant factor
anymore for certain sections of the CPU and GPU world.  What if a Phi has
MCDRAM physically attached, but DDR4 connected via QPI that still has lower
total latency (might be a stretch for Phi but not a stretch for GPUs with
deep sorting memory controllers)?  Lowest latency is probably the wrong
choice.  Latency has really been a numeric proxy for physical proximity,
under the assumption that the most closely coupled memory is the right
placement, but HBM/MCDRAM is causing that relationship to break down in
all sorts of interesting ways.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-31  1:57       ` Dave Hansen
  2017-01-31  7:25         ` John Hubbard
@ 2017-02-01  6:40         ` Anshuman Khandual
  1 sibling, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-02-01  6:40 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/31/2017 07:27 AM, Dave Hansen wrote:
> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>> system and it was just as fast as the rest of the RAM.  Would we still
>>> want it isolated like this?  Or would we want a different policy?
>>
>> But then the other argument is: don't we want to keep this 100X more
>> memory isolated for some special purpose, to be utilized by specific
>> applications?
> 
> I was thinking that in this case, we wouldn't even want to bother with
> having "system RAM" in the fallback lists.  A device who got its memory

System RAM is in the fallback list of the CDM node for the following
purpose.

If the user asks explicitly through mbind() and there is insufficient
memory on the CDM node to fulfill the request, then it is better to
fall back on a system RAM node than to fail the request. This is in
line with the expectations from the mbind() call. User space also has
ways, like /proc/pid/numa_maps, to query where exactly a given page
has come from at runtime.
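
For instance, a user space sketch of such an explicit request (the CDM node
id here is hypothetical):

	#include <numaif.h>
	#include <sys/mman.h>

	#define CDM_NODE 2	/* hypothetical CDM node id */

	static void *cdm_alloc(size_t len)
	{
		unsigned long nodemask = 1UL << CDM_NODE;
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return NULL;
		/* bind to the CDM node; memory faults in on first access */
		mbind(buf, len, MPOL_BIND, &nodemask,
		      8 * sizeof(nodemask), 0);
		return buf;
	}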

But keeping options open, I have noted this down in the cover letter.

"
FALLBACK zonelist creation:

CDM node's FALLBACK zonelist can also be changed to accommodate other CDM
memory zones along with system RAM zones, in which case those can be used as
fallback options instead of first depending on the system RAM zones when
its own memory is insufficient during allocation.
"

> usage off by 1% could start to starve the rest of the system.  A sane

I did not get this point. Could you please elaborate more on this?

> policy in this case might be to isolate the "system RAM" from the device's.

Hmm.

> 
>>> Why do we need this hard-coded along with the cpuset stuff later in the
>>> series.  Doesn't taking a node out of the cpuset also take it out of the
>>> fallback lists?
>>
>> There are two mutually exclusive approaches which are described in
>> this patch series.
>>
>> (1) zonelist modification based approach
>> (2) cpuset restriction based approach
>>
>> As mentioned in the cover letter,
> 
> Well, I'm glad you coded both of them up, but now that we have them how
> do we pick which one to throw to the wolves?  Or, do we just merge both
> of them and let one bitrot? ;)

I am just trying to see how each isolation method stacks up from a benefit
and cost point of view, so that we can have an informed debate about their
individual merits. Meanwhile I have started looking at whether the core
buddy allocator __alloc_pages_nodemask() and its interaction with the
nodemask at various stages can also be modified to implement the intended
solution.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-31  7:25         ` John Hubbard
  2017-01-31 18:04           ` Dave Hansen
@ 2017-02-01  6:46           ` Anshuman Khandual
  1 sibling, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-02-01  6:46 UTC (permalink / raw)
  To: John Hubbard, Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/31/2017 12:55 PM, John Hubbard wrote:
> On 01/30/2017 05:57 PM, Dave Hansen wrote:
>> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>>> system and it was just as fast as the rest of the RAM.  Would we still
>>>> want it isolated like this?  Or would we want a different policy?
>>>
>>> But then the other argument is: don't we want to keep this 100X more
>>> memory isolated for some special purpose, to be utilized by specific
>>> applications?
>>
>> I was thinking that in this case, we wouldn't even want to bother with
>> having "system RAM" in the fallback lists.  A device who got its memory
>> usage off by 1% could start to starve the rest of the system.  A sane
>> policy in this case might be to isolate the "system RAM" from the
>> device's.
> 
> I also don't like having these policies hard-coded, and your 100x
> example above helps clarify what can go wrong about it. It would be
> nicer if, instead, we could better express the "distance" between nodes
> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
> system figure out the Right Thing To Do.
> 
> I realize that this is not quite possible with NUMA just yet, but I
> wonder if that's a reasonable direction to go with this?

That is a complete overhaul of the NUMA representation in the kernel. What
CDM attempts is to find a solution within the existing NUMA framework, with
as little code change as possible.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
  2017-01-31 18:04           ` Dave Hansen
  2017-01-31 19:14             ` David Nellans
@ 2017-02-01  6:56             ` Anshuman Khandual
  1 sibling, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-02-01  6:56 UTC (permalink / raw)
  To: Dave Hansen, John Hubbard, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/31/2017 11:34 PM, Dave Hansen wrote:
> On 01/30/2017 11:25 PM, John Hubbard wrote:
>> I also don't like having these policies hard-coded, and your 100x
>> example above helps clarify what can go wrong about it. It would be
>> nicer if, instead, we could better express the "distance" between nodes
>> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
>> system figure out the Right Thing To Do.
>>
>> I realize that this is not quite possible with NUMA just yet, but I
>> wonder if that's a reasonable direction to go with this?
> 
> In the end, I don't think the kernel can make the "right" decision very
> widely here.
> 
> Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
> evidently has a higher latency than DRAM.  Given a plain malloc(), how
> is the kernel to know that the memory will be used for AVX-512
> instructions that need lots of bandwidth vs. some random data structure
> that's latency-sensitive?

CDM has been designed to work with a driver which can take these kinds of
appropriate memory placement decisions along the way. Taking the above
example of a generic malloc() allocated buffer:

(1) System RAM gets allocated if the first faults come from the CPU
(2) CDM memory gets allocated if the first faults come from device access
(3) By monitoring the access patterns thereafter, the driver can then
    take the required "right" decisions about the buffer's eventual
    placement and migrate memory as required (a user space query sketch
    follows below)
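
The query sketch referenced in (3) above: move_pages() with a NULL 'nodes'
argument just reports the node currently backing each page, so user space
(or a test harness) can observe where a page ended up.

	#include <numaif.h>

	/* returns the NUMA node currently backing 'addr', negative on error */
	static int page_node_of(void *addr)
	{
		void *pages[1] = { addr };
		int status = -1;

		if (move_pages(0 /* self */, 1, pages, NULL, &status, 0))
			return -1;
		return status;
	}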

> 
> In the end, I think all we can do is keep the kernel's existing default
> of "low latency to the CPU that allocated it", and let apps override
> when that policy doesn't fit them.

I think this is quite similar to what we are trying to achieve with the
CDM representation and driver based migrations. Don't you agree?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask
  2017-01-31 16:00       ` Mel Gorman
@ 2017-02-01  7:31         ` Anshuman Khandual
  2017-02-01  8:53           ` Michal Hocko
  2017-02-01  9:18           ` Mel Gorman
  0 siblings, 2 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-02-01  7:31 UTC (permalink / raw)
  To: Mel Gorman, Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On 01/31/2017 09:30 PM, Mel Gorman wrote:
> On Tue, Jan 31, 2017 at 07:52:37PM +0530, Anshuman Khandual wrote:
>> At present, top_cpuset.mems_allowed is the same as node_states[N_MEMORY] and
>> cannot be changed at runtime. The maximum possible node_states[N_MEMORY] also
>> gets reflected in the top_cpuset.effective_mems interface. This prevents
>> someone from removing or restricting memory placement system wide on a given
>> memory node through the cpuset mechanism, which can be limiting. This patch
>> solves the problem by enabling the update_nodemask() function to accept
>> changes to top_cpuset.mems_allowed as well. Once changed, it also updates
>> the value of top_cpuset.effective_mems and updates all of its tasks'
>> mems_allowed nodemasks. It calls cpuset_inc() to make sure cpuset is
>> accounted for in the buddy allocator through the cpusets_enabled() check.
>>
> 
> What's the point of allowing the root cpuset to be restricted?

After an extended period of run time on a system, if we currently have
to run HW diagnostics and dump (which are run out of band) for debugging
purposes, we have to stop further allocations to the node. Hot plugging
the memory node out of the kernel will achieve this. But it can also
be made possible by just enabling top_cpuset.memory_migrate and then
restricting all the allocations by removing the node from the
top_cpuset.mems_allowed nodemask. This will force all the existing
allocations out of the target node.
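
A minimal sketch of that sequence (assuming a cgroup v1 cpuset mount at
/sys/fs/cgroup/cpuset, a hypothetical CDM node 2 on a four node system, and
the top level cpuset.mems made writable by this patch):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	static void cpuset_write(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd >= 0) {
			write(fd, val, strlen(val));
			close(fd);
		}
	}

	int main(void)
	{
		/* migrate existing pages whenever mems shrinks */
		cpuset_write("/sys/fs/cgroup/cpuset/cpuset.memory_migrate", "1");
		/* drop node 2 from the top level mems_allowed */
		cpuset_write("/sys/fs/cgroup/cpuset/cpuset.mems", "0-1,3");
		return 0;
	}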

More importantly, it also extends the cpuset memory restriction feature
to its logical completion without adding any regressions for the
existing use cases. Then why not do this? Does it add any overhead?

In the future this feature can also be used to isolate a memory node
from all possible general allocations and at the same time provide an
alternate method for explicit allocation into it (still working on this
part, though I have a hack right now). The current RFC series proposes
one such possible use case through the top_cpuset.mems_allowed nodemask.
But in this case it is being restricted during boot as well as after
hotplug of a memory only NUMA node.

If you think this currently does not have a use case to stand on its
own, then I will carry it along with this patch series as part of the
proposed cpuset based isolation solution (with explicit allocation
access to the isolated node) as described just above.

- Anshuman

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask
  2017-02-01  7:31         ` Anshuman Khandual
@ 2017-02-01  8:53           ` Michal Hocko
  2017-02-01  9:18           ` Mel Gorman
  1 sibling, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-02-01  8:53 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, linux-kernel, linux-mm, vbabka, minchan,
	aneesh.kumar, bsingharora, srikar, haren, jglisse, dave.hansen,
	dan.j.williams

On Wed 01-02-17 13:01:24, Anshuman Khandual wrote:
[...]
> More importantly it also extends the cpuset memory restriction feature
> to the logical completion without adding any regressions for the
> existing use cases. Then why not do this ? Does it add any overhead ?

Maybe it doesn't add any overhead, but it just breaks the cgroups
expectation that the root cgroup covers the full resource set. No cgroup
controller allows setting limits on the root cgroup. So all this looks
like an abuse of the interface.

I haven't read the full series yet, but this particular change looks like
a no-go to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask
  2017-02-01  7:31         ` Anshuman Khandual
  2017-02-01  8:53           ` Michal Hocko
@ 2017-02-01  9:18           ` Mel Gorman
  1 sibling, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2017-02-01  9:18 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-mm, mhocko, vbabka, minchan, aneesh.kumar,
	bsingharora, srikar, haren, jglisse, dave.hansen, dan.j.williams

On Wed, Feb 01, 2017 at 01:01:24PM +0530, Anshuman Khandual wrote:
> On 01/31/2017 09:30 PM, Mel Gorman wrote:
> > On Tue, Jan 31, 2017 at 07:52:37PM +0530, Anshuman Khandual wrote:
> >> At present, top_cpuset.mems_allowed is the same as node_states[N_MEMORY] and
> >> cannot be changed at runtime. The maximum possible node_states[N_MEMORY] also
> >> gets reflected in the top_cpuset.effective_mems interface. This prevents
> >> someone from removing or restricting memory placement system wide on a given
> >> memory node through the cpuset mechanism, which can be limiting. This patch
> >> solves the problem by enabling the update_nodemask() function to accept
> >> changes to top_cpuset.mems_allowed as well. Once changed, it also updates
> >> the value of top_cpuset.effective_mems and updates all of its tasks'
> >> mems_allowed nodemasks. It calls cpuset_inc() to make sure cpuset is
> >> accounted for in the buddy allocator through the cpusets_enabled() check.
> >>
> > 
> > What's the point of allowing the root cpuset to be restricted?
> 
> After an extended period of run time on a system, if we currently have
> to run HW diagnostics and dump (which are run out of band) for debugging
> purposes, we have to stop further allocations to the node. Hot plugging
> the memory node out of the kernel will achieve this. But it can also
> be made possible by just enabling top_cpuset.memory_migrate and then
> restricting all the allocations by removing the node from the
> top_cpuset.mems_allowed nodemask. This will force all the existing
> allocations out of the target node.
> 

So would creating a restricted cpuset and migrating all tasks from the
root cpuset into it.

> More importantly it also extends the cpuset memory restriction feature
> to the logical completion without adding any regressions for the
> existing use cases. Then why not do this ? Does it add any overhead ?
> 

It violates the expectation that the root cgroup can access all
resources. Once enabled, there is some overhead in the page allocator as
it must check all cpusets even for tasks that weren't configured to be
isolated.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes
  2017-01-31  1:37       ` Dave Hansen
@ 2017-02-01 13:59         ` Anshuman Khandual
  2017-02-01 19:01           ` Dave Hansen
  0 siblings, 1 reply; 58+ messages in thread
From: Anshuman Khandual @ 2017-02-01 13:59 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/31/2017 07:07 AM, Dave Hansen wrote:
> On 01/30/2017 05:03 PM, Anshuman Khandual wrote:
>> On 01/30/2017 10:49 PM, Dave Hansen wrote:
>>> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>>>> HugeTLB allocation/release/accounting currently spans across all the nodes
>>>> under N_MEMORY node mask. Coherent memory nodes should not be part of these
>>>> allocations. So use system_ram() call to fetch system RAM only nodes on the
>>>> platform which can then be used for HugeTLB allocation purpose instead of
>>>> N_MEMORY node mask. This isolates coherent device memory nodes from HugeTLB
>>>> allocations.
>>>
>>> Does this end up making it impossible to use hugetlbfs to access device
>>> memory?
>>
>> Right, that's the implementation at the moment. But going forward, if we need
>> to have HugeTLB pages on the CDM node, we can implement that through the
>> sysfs interface of the individual NUMA node paths instead of changing the
>> generic HugeTLB path. I wrote this up in the cover letter but should also
>> have mentioned it in the comment section of this patch. Does this
>> approach look okay?
> 
> The cover letter is not the most approachable document I've ever seen. :)

Hmm.

So shall we write all these details in the comment section of each
patch after the SOB statement, to be more visible? Or somewhere as
in-code documentation, as a FIXME or XXX or something? These are
fairly large paragraphs, hence I was wondering.

> 
>> "Now, we ensure complete HugeTLB allocation isolation from CDM nodes. Going
>> forward if we need to support HugeTLB allocation on CDM nodes on targeted
>> basis, then we would have to enable those allocations through the
>> /sys/devices/system/node/nodeN/hugepages/hugepages-16384kB/nr_hugepages
>> interface while still ensuring isolation from other generic sysctl and
>> /sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages interfaces."
> 
> That would be passable if that's the only way you can allocate hugetlbfs
> pages.  But we also have the fault-based allocations that can pull stuff
> right out of the buddy allocator.  This approach would break that path
> entirely.

There are two distinct points which I think will prevent the problem you
just mentioned.

* No regular node has CDM memory in its fallback zonelist. Hence any
  allocation attempt without __GFP_THISNODE will never go into CDM memory
  zones. If the allocation happens with the __GFP_THISNODE flag it will
  only come from the exact node. Remember we have removed CDM nodes from
  the global nodemask iterators. Then how can pre-allocated reserve
  HugeTLB pages come from CDM nodes? (See the sketch below.)

* Page faults (which will probably use __GFP_THISNODE) cannot come from
  the CDM nodes as they don't have any CPUs.

I did a quick scan of all the allocation paths leading up to the
allocation functions alloc_pages_node() and __alloc_pages_node() inside
the hugetlb.c file. I might be missing something here.
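
To make the first point concrete, here is a minimal sketch of a targeted
allocation under these rules (cdm_nid is assumed to hold a CDM node id;
alloc_pages_node() and the gfp flags are the standard kernel API):

/*
 * Without __GFP_THISNODE the regular zonelist fallback applies, and
 * with this series CDM zones are absent from every regular node's
 * fallback list. With __GFP_THISNODE the allocation either comes
 * from the named node or fails outright.
 */
struct page *page;

page = alloc_pages_node(cdm_nid, GFP_HIGHUSER | __GFP_THISNODE, 0);
if (!page)
	return -ENOMEM;	/* no silent fallback to another node */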

> 
> FWIW, I think you really need to separate the true "CDM" stuff that's
> *really* device-specific from the parts of this from which you really
> just want to implement isolation.

IIUC, are you suggesting something like a pure CDM HugeTLB implementation
which is completely separate from the generic one?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes
  2017-02-01 13:59         ` Anshuman Khandual
@ 2017-02-01 19:01           ` Dave Hansen
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Hansen @ 2017-02-01 19:01 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 02/01/2017 05:59 AM, Anshuman Khandual wrote:
> So shall we write all these details in the comment section for each
> patch after the SOB statement to be more visible? Or somewhere as
> in-code documentation, as a FIXME or XXX or something? These are
> fairly large paragraphs, hence I was wondering.

I would make an effort to convey a maximum amount of content in a
minimal number of words. :)

But, yeah, it is pretty obvious that you've got too much in the cover
letter and not enough in the patch descriptions.

...
> * Page faults (which will probably use __GFP_THISNODE) cannot come from
>   the CDM nodes as they don't have any CPUs.

Page faults happen on CPUs, but they happen on VMAs that could be bound
to a CDM node.  We allocate based on the VMA policy first, then fall back
to the default policy, which is based on the CPU doing the fault, if the
VMA doesn't have a specific policy.
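
Roughly, the lookup order looks like this (a simplified sketch of what
get_vma_policy() in mm/mempolicy.c does; the real code has more cases,
e.g. vm_ops->get_policy for shared mappings):

struct mempolicy *pol = vma->vm_policy;	/* explicit VMA policy, if any */

if (!pol)
	pol = current->mempolicy;	/* task policy */
if (!pol)
	pol = &default_policy;		/* node-local to the faulting CPU */

So a VMA bound to a CDM node can pull fault-time allocations there no
matter which CPU takes the fault.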

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND)
  2017-01-31  4:36     ` Anshuman Khandual
@ 2017-02-07 18:07       ` Dave Hansen
  2017-02-08 14:13         ` Anshuman Khandual
  2017-02-08 15:04         ` Jerome Glisse
  0 siblings, 2 replies; 58+ messages in thread
From: Dave Hansen @ 2017-02-07 18:07 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 01/30/2017 08:36 PM, Anshuman Khandual wrote:
> On 01/30/2017 11:24 PM, Dave Hansen wrote:
>> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>>> +		if ((new_pol->mode == MPOL_BIND)
>>> +			&& nodemask_has_cdm(new_pol->v.nodes))
>>> +			set_vm_cdm(vma);
>> So, if you did:
>>
>> 	mbind(addr, PAGE_SIZE, MPOL_BIND, all_nodes, ...);
>> 	mbind(addr, PAGE_SIZE, MPOL_BIND, one_non_cdm_node, ...);
>>
>> You end up with a VMA that can never have KSM done on it, etc...  Even
>> though there's no good reason for it.  I guess /proc/$pid/smaps might be
>> able to help us figure out what was going on here, but that still seems
>> like an awful lot of damage.
> 
> Agreed, this VMA should not remain tagged after the second call. It does
> not make sense. For this kind of scenario we can re-evaluate the VMA
> tag every time a nodemask change is attempted. But if we are looking for
> some runtime re-evaluation then we need to steal some cycles during
> general VMA processing opportunities like merging and splitting to do
> the necessary re-evaluation. Should we do both kinds of re-evaluation
> to be more optimal?

I'm still unconvinced that you *need* detection like this.  Scanning big
VMAs is going to be really painful.

I thought I asked before but I can't find it in this thread.  But, we
have explicit interfaces for disabling KSM and khugepaged.  Why do we
need implicit ones like this in addition to those?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND)
  2017-02-07 18:07       ` Dave Hansen
@ 2017-02-08 14:13         ` Anshuman Khandual
  2017-02-08 15:04         ` Jerome Glisse
  1 sibling, 0 replies; 58+ messages in thread
From: Anshuman Khandual @ 2017-02-08 14:13 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dan.j.williams

On 02/07/2017 11:37 PM, Dave Hansen wrote:
>> On 01/30/2017 11:24 PM, Dave Hansen wrote:
>>> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>>>> +		if ((new_pol->mode == MPOL_BIND)
>>>> +			&& nodemask_has_cdm(new_pol->v.nodes))
>>>> +			set_vm_cdm(vma);
>>> So, if you did:
>>>
>>> 	mbind(addr, PAGE_SIZE, MPOL_BIND, all_nodes, ...);
>>> 	mbind(addr, PAGE_SIZE, MPOL_BIND, one_non_cdm_node, ...);
>>>
>>> You end up with a VMA that can never have KSM done on it, etc...  Even
>>> though there's no good reason for it.  I guess /proc/$pid/smaps might be
>>> able to help us figure out what was going on here, but that still seems
>>> like an awful lot of damage.
>> Agreed, this VMA should not remain tagged after the second call. It does
>> not make sense. For this kind of scenario we can re-evaluate the VMA
>> tag every time a nodemask change is attempted. But if we are looking for
>> some runtime re-evaluation then we need to steal some cycles during
>> general VMA processing opportunities like merging and splitting to do
>> the necessary re-evaluation. Should we do both kinds of re-evaluation
>> to be more optimal?
> I'm still unconvinced that you *need* detection like this.  Scanning big
> VMAs is going to be really painful.
> 
> I thought I asked before but I can't find it in this thread.  But, we
> have explicit interfaces for disabling KSM and khugepaged.  Why do we
> need implicit ones like this in addition to those?

Missed the discussion we had on this last time around, I think. My bad,
sorry about that. IIUC we can disable KSM through the madvise() call; in
fact I guess it's disabled by default and needs to be enabled. We can
just have a similar interface to disable auto NUMA for a specific VMA,
or we can handle it on a page-by-page basis with something like the diff
below.

diff --git a/mm/memory.c b/mm/memory.c
index 1099d35..101dfd9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3518,6 +3518,10 @@ static int do_numa_page(struct vm_fault *vmf)
                goto out;
        }
 
+       /* Never auto-NUMA-migrate a page that already lives on a CDM node */
+       if (is_cdm_node(page_to_nid(page)))
+               goto out;
+
        /* Migrate to the requested node */
        migrated = migrate_misplaced_page(page, vma, target_nid);
        if (migrated) {

I am still looking into these aspects. BTW, I have posted the minimum
set of CDM patches which defines and isolates the CDM node.
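
For reference, the existing KSM knob I mentioned is per-VMA via
madvise(); here is a minimal userspace sketch (MADV_MERGEABLE and
MADV_UNMERGEABLE are the existing flags; the mapping itself is just
for illustration):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16 * 4096;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* KSM merging is off by default; opt this VMA in ... */
	if (madvise(buf, len, MADV_MERGEABLE))
		perror("madvise(MADV_MERGEABLE)");

	/* ... and back out again */
	if (madvise(buf, len, MADV_UNMERGEABLE))
		perror("madvise(MADV_UNMERGEABLE)");

	munmap(buf, len);
	return 0;
}

A hypothetical MADV_ flag for auto NUMA could follow the same per-VMA
pattern.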

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND)
  2017-02-07 18:07       ` Dave Hansen
  2017-02-08 14:13         ` Anshuman Khandual
@ 2017-02-08 15:04         ` Jerome Glisse
  1 sibling, 0 replies; 58+ messages in thread
From: Jerome Glisse @ 2017-02-08 15:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, vbabka,
	mgorman, minchan, aneesh.kumar, bsingharora, srikar, haren,
	dan.j.williams

> On 01/30/2017 08:36 PM, Anshuman Khandual wrote:
> > On 01/30/2017 11:24 PM, Dave Hansen wrote:
> >> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
> >>> +		if ((new_pol->mode == MPOL_BIND)
> >>> +			&& nodemask_has_cdm(new_pol->v.nodes))
> >>> +			set_vm_cdm(vma);
> >> So, if you did:
> >>
> >> 	mbind(addr, PAGE_SIZE, MPOL_BIND, all_nodes, ...);
> >> 	mbind(addr, PAGE_SIZE, MPOL_BIND, one_non_cdm_node, ...);
> >>
> >> You end up with a VMA that can never have KSM done on it, etc...  Even
> >> though there's no good reason for it.  I guess /proc/$pid/smaps might be
> >> able to help us figure out what was going on here, but that still seems
> >> like an awful lot of damage.
> > 
> > Agreed, this VMA should not remain tagged after the second call. It does
> > not make sense. For this kind of scenario we can re-evaluate the VMA
> > tag every time a nodemask change is attempted. But if we are looking for
> > some runtime re-evaluation then we need to steal some cycles during
> > general VMA processing opportunities like merging and splitting to do
> > the necessary re-evaluation. Should we do both kinds of re-evaluation
> > to be more optimal?
> 
> I'm still unconvinced that you *need* detection like this.  Scanning big
> VMAs is going to be really painful.
> 
> I thought I asked before but I can't find it in this thread.  But, we
> have explicit interfaces for disabling KSM and khugepaged.  Why do we
> need implicit ones like this in addition to those?
> 

I said it in another part of the thread: I think the VMA flag is a no
go, because it tries to set something that is orthogonal to the VMA.
That you want some VMA to use device memory on new allocations is a
valid policy for a VMA to have. But a flag that tells various kernel
subsystems "hey, my memory is special, skip me" is wrong.

The fact that you want to exclude device memory from KSM or autonuma is
valid, but it should be done at the struct page level, i.e. KSM or
autonuma should check the type of the page before doing anything, and
skip CDM pages. It could be the flags idea that was discussed, as in
the sketch below.
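
As a rough illustration (page_is_cdm() is a made-up helper here;
is_cdm_node() comes from this patch series):

/*
 * A page-level predicate that KSM or autonuma could test before
 * touching a page, instead of consulting a VMA flag.
 */
static inline bool page_is_cdm(struct page *page)
{
	return is_cdm_node(page_to_nid(page));
}

Each scanner would then simply skip pages for which this returns true.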

The overhead of doing it at the page level is far lower than trying to
manage a VMA flag, with all the issues related to VMA merging, splitting
and the lifetime of such a flag. Moreover the flag is all or nothing; it
does not consider the case where a VMA holds as many regular pages as
CDM pages. It would block the regular pages from undergoing the usual
KSM/autonuma treatment.

I do strongly believe that this VMA flag is a bad idea.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2017-02-08 18:35 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-30  3:35 [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 01/12] mm: Define coherent device memory (CDM) node Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 02/12] mm: Isolate HugeTLB allocations away from CDM nodes Anshuman Khandual
2017-01-30 17:19   ` Dave Hansen
2017-01-31  1:03     ` Anshuman Khandual
2017-01-31  1:37       ` Dave Hansen
2017-02-01 13:59         ` Anshuman Khandual
2017-02-01 19:01           ` Dave Hansen
2017-01-30  3:35 ` [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process Anshuman Khandual
2017-01-30 17:34   ` Dave Hansen
2017-01-31  1:36     ` Anshuman Khandual
2017-01-31  1:57       ` Dave Hansen
2017-01-31  7:25         ` John Hubbard
2017-01-31 18:04           ` Dave Hansen
2017-01-31 19:14             ` David Nellans
2017-02-01  6:56             ` Anshuman Khandual
2017-02-01  6:46           ` Anshuman Khandual
2017-02-01  6:40         ` Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 04/12] mm: Change mbind(MPOL_BIND) implementation for CDM nodes Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Anshuman Khandual
2017-01-30 17:36   ` Dave Hansen
2017-01-30 20:30   ` Mel Gorman
2017-01-31 14:22     ` [RFC] cpuset: Enable changing of top_cpuset's mems_allowed nodemask Anshuman Khandual
2017-01-31 16:00       ` Mel Gorman
2017-02-01  7:31         ` Anshuman Khandual
2017-02-01  8:53           ` Michal Hocko
2017-02-01  9:18           ` Mel Gorman
2017-01-31 14:36     ` [RFC V2 05/12] cpuset: Add cpuset_inc() inside cpuset_init() Vlastimil Babka
2017-01-31 15:30       ` Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 06/12] mm: Exclude CDM nodes from task->mems_allowed and root cpuset Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 07/12] mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 08/12] mm: Add new VMA flag VM_CDM Anshuman Khandual
2017-01-30 18:52   ` Jerome Glisse
2017-01-31  4:22     ` Anshuman Khandual
2017-01-31  6:05       ` Jerome Glisse
2017-01-30  3:35 ` [RFC V2 09/12] mm: Exclude CDM marked VMAs from auto NUMA Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 10/12] mm: Ignore madvise(MADV_MERGEABLE) request for VM_CDM marked VMAs Anshuman Khandual
2017-01-30  3:35 ` [RFC V2 11/12] mm: Tag VMA with VM_CDM flag during page fault Anshuman Khandual
2017-01-30 17:51   ` Dave Hansen
2017-01-31  5:10     ` Anshuman Khandual
2017-01-31 17:54       ` Dave Hansen
2017-01-30  3:35 ` [RFC V2 12/12] mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND) Anshuman Khandual
2017-01-30 17:54   ` Dave Hansen
2017-01-31  4:36     ` Anshuman Khandual
2017-02-07 18:07       ` Dave Hansen
2017-02-08 14:13         ` Anshuman Khandual
2017-02-08 15:04         ` Jerome Glisse
2017-01-30  3:35 ` [DEBUG 13/21] powerpc/mm: Identify coherent device memory nodes during platform init Anshuman Khandual
2017-01-30  3:35 ` [DEBUG 14/21] powerpc/mm: Create numa nodes for hotplug memory Anshuman Khandual
2017-01-30  3:35 ` [DEBUG 15/21] powerpc/mm: Enable CONFIG_MOVABLE_NODE for PPC64 platform Anshuman Khandual
2017-01-30  3:35 ` [DEBUG 16/21] mm: Enable CONFIG_MOVABLE_NODE on powerpc Anshuman Khandual
2017-01-30  3:35 ` [DEBUG 17/21] mm: Export definition of 'zone_names' array through mmzone.h Anshuman Khandual
2017-01-30  3:35 ` [DEBUG 18/21] mm: Add debugfs interface to dump each node's zonelist information Anshuman Khandual
2017-01-30  3:36 ` [DEBUG 19/21] mm: Add migrate_virtual_range migration interface Anshuman Khandual
2017-01-30  3:36 ` [DEBUG 20/21] drivers: Add two drivers for coherent device memory tests Anshuman Khandual
2017-01-30  3:36 ` [DEBUG 21/21] selftests/powerpc: Add a script to perform random VMA migrations Anshuman Khandual
2017-01-31  5:48 ` [RFC V2 00/12] Define coherent device memory node Anshuman Khandual
2017-01-31  6:15   ` Jerome Glisse
