linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
@ 2018-12-26 13:14 Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM Fengguang Wu
                   ` (21 more replies)
  0 siblings, 22 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, kvm, LKML, Fan Du, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi, Dan Williams, Fengguang Wu

This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
transparent to normal applications and virtual machines.

The code is still in active development. It's provided for early design review.

Key functionalities:

1) create and describe PMEM NUMA node for NVDIMM memory
2) dumb /proc/PID/idle_pages interface, for user space driven hot page accounting
3) passive kernel cold page migration in page reclaim path
4) improved move_pages() for active user space hot/cold page migration

(1) is the foundation for transparent use of NVDIMM by normal apps and virtual
machines. (2-4) enable automatically placing hot pages in DRAM for better
performance. A user space migration daemon is being built on top of this kernel
patchset to form the full vertical solution.
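
A minimal user space sketch of the daemon's demotion step is below (an
illustration only, not part of this patchset). It assumes cold pages have
already been identified via /proc/PID/idle_pages and the PMEM peer node has
been looked up via the sysfs interfaces added in patches 07-08, and it uses
the stock move_pages(2) wrapper from libnuma (link with -lnuma). The hot page
promotion path would additionally pass the MPOL_MF_SW_YOUNG flag added in
patch 19.

	#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE (libnuma) */
	#include <stdlib.h>

	/* Demote a batch of presumed-cold pages of @pid to @pmem_node. */
	static long demote_cold_pages(int pid, void **pages, unsigned long nr,
				      int pmem_node)
	{
		int *nodes = calloc(nr, sizeof(int));
		int *status = calloc(nr, sizeof(int));
		unsigned long i;
		long ret;

		for (i = 0; i < nr; i++)
			nodes[i] = pmem_node;	/* target: the PMEM peer node */

		/* MPOL_MF_MOVE: only move pages mapped by this process alone */
		ret = move_pages(pid, nr, pages, nodes, status, MPOL_MF_MOVE);

		free(nodes);
		free(status);
		return ret;
	}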

Base kernel is v4.20. The patches are not suitable for upstreaming in the near
future -- some are quick hacks, others need more work. However, they are
complete enough to demonstrate the kernel changes necessary for the proposed
app- and VM-transparent NVDIMM volatile use model.

The interfaces are far from finalized. They illustrate what would be necessary
for a user space driven solution; the exact forms will need more thought and
input. We may adopt an HMAT based solution for the NUMA node related interfaces
when that is ready. The /proc/PID/idle_pages interface is standalone but
non-trivial. Before it can be upstreamed some day, it's expected to take a long
time to collect real use cases and feedback, so as to refine and stabilize the
format.

Create PMEM numa node

	[PATCH 01/21] e820: cheat PMEM as DRAM

Mark numa node as DRAM/PMEM

	[PATCH 02/21] acpi/numa: memorize NUMA node type from SRAT table
	[PATCH 03/21] x86/numa_emulation: fix fake NUMA in uniform case
	[PATCH 04/21] x86/numa_emulation: pass numa node type to fake nodes
	[PATCH 05/21] mmzone: new pgdat flags for DRAM and PMEM
	[PATCH 06/21] x86,numa: update numa node type
	[PATCH 07/21] mm: export node type {pmem|dram} under /sys/bus/node

Point neighbor DRAM/PMEM to each other

	[PATCH 08/21] mm: introduce and export pgdat peer_node
	[PATCH 09/21] mm: avoid duplicate peer target node

Standalone zonelist for DRAM and PMEM nodes

	[PATCH 10/21] mm: build separate zonelist for PMEM and DRAM node

Keep page table pages in DRAM

	[PATCH 11/21] kvm: allocate page table pages from DRAM
	[PATCH 12/21] x86/pgtable: allocate page table pages from DRAM

/proc/PID/idle_pages interface for virtual machine and normal tasks

	[PATCH 13/21] x86/pgtable: dont check PMD accessed bit
	[PATCH 14/21] kvm: register in mm_struct
	[PATCH 15/21] ept-idle: EPT walk for virtual machine
	[PATCH 16/21] mm-idle: mm_walk for normal task
	[PATCH 17/21] proc: introduce /proc/PID/idle_pages
	[PATCH 18/21] kvm-ept-idle: enable module

Mark hot pages

	[PATCH 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag

Kernel DRAM=>PMEM migration

	[PATCH 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node
	[PATCH 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM

 arch/x86/include/asm/numa.h    |    2 
 arch/x86/include/asm/pgalloc.h |   10 
 arch/x86/include/asm/pgtable.h |    3 
 arch/x86/kernel/e820.c         |    3 
 arch/x86/kvm/Kconfig           |   11 
 arch/x86/kvm/Makefile          |    4 
 arch/x86/kvm/ept_idle.c        |  841 +++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h        |  116 ++++
 arch/x86/kvm/mmu.c             |   12 
 arch/x86/mm/numa.c             |    3 
 arch/x86/mm/numa_emulation.c   |   30 +
 arch/x86/mm/pgtable.c          |   22 
 drivers/acpi/numa.c            |    5 
 drivers/base/node.c            |   21 
 fs/proc/base.c                 |    2 
 fs/proc/internal.h             |    1 
 fs/proc/task_mmu.c             |   54 +
 include/linux/mm_types.h       |   11 
 include/linux/mmzone.h         |   38 +
 mm/mempolicy.c                 |   14 
 mm/migrate.c                   |   13 
 mm/page_alloc.c                |   77 ++
 mm/pagewalk.c                  |    1 
 mm/vmscan.c                    |   38 +
 virt/kvm/kvm_main.c            |    3 
 25 files changed, 1306 insertions(+), 29 deletions(-)

V1 patches: https://lkml.org/lkml/2018/9/2/13

Regards,
Fengguang



* [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-27  3:41   ` Matthew Wilcox
  2018-12-26 13:14 ` [RFC][PATCH v2 02/21] acpi/numa: memorize NUMA node type from SRAT table Fengguang Wu
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0001-e820-Force-PMEM-entry-as-RAM-type-to-enumerate-NUMA-.patch --]
[-- Type: text/plain, Size: 869 bytes --]

From: Fan Du <fan.du@intel.com>

This is a hack to enumerate PMEM as NUMA nodes.
It's necessary for current BIOSes that don't yet fill the ACPI HMAT table.

WARNING: take care to back up your data. This is mutually exclusive with the
libnvdimm subsystem and can destroy ndctl managed namespaces.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kernel/e820.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/kernel/e820.c	2018-12-23 19:20:34.587078783 +0800
+++ linux/arch/x86/kernel/e820.c	2018-12-23 19:20:34.587078783 +0800
@@ -403,7 +403,8 @@ static int __init __append_e820_table(st
 		/* Ignore the entry on 64-bit overflow: */
 		if (start > end && likely(size))
 			return -1;
-
+		if (type == E820_TYPE_PMEM)
+			type = E820_TYPE_RAM;
 		e820__range_add(start, size, type);
 
 		entry++;




* [RFC][PATCH v2 02/21] acpi/numa: memorize NUMA node type from SRAT table
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 03/21] x86/numa_emulation: fix fake NUMA in uniform case Fengguang Wu
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0002-acpi-Memorize-numa-node-type-from-SRAT-table.patch --]
[-- Type: text/plain, Size: 1920 bytes --]

From: Fan Du <fan.du@intel.com>

Mark each NUMA node as DRAM or PMEM.

This can happen at boot time (see the e820 PMEM type override patch), or on
the fly when binding a devdax device to the kmem driver.

It depends on the BIOS supplying PMEM NUMA proximity in the SRAT table, which
current production BIOSes do.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/include/asm/numa.h |    2 ++
 arch/x86/mm/numa.c          |    2 ++
 drivers/acpi/numa.c         |    5 +++++
 3 files changed, 9 insertions(+)

--- linux.orig/arch/x86/include/asm/numa.h	2018-12-23 19:20:39.890947888 +0800
+++ linux/arch/x86/include/asm/numa.h	2018-12-23 19:20:39.890947888 +0800
@@ -30,6 +30,8 @@ extern int numa_off;
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_nodes_pmem;
+extern nodemask_t numa_nodes_dram;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
--- linux.orig/arch/x86/mm/numa.c	2018-12-23 19:20:39.890947888 +0800
+++ linux/arch/x86/mm/numa.c	2018-12-23 19:20:39.890947888 +0800
@@ -20,6 +20,8 @@
 
 int numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_nodes_pmem;
+nodemask_t numa_nodes_dram;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
--- linux.orig/drivers/acpi/numa.c	2018-12-23 19:20:39.890947888 +0800
+++ linux/drivers/acpi/numa.c	2018-12-23 19:20:39.890947888 +0800
@@ -297,6 +297,11 @@ acpi_numa_memory_affinity_init(struct ac
 
 	node_set(node, numa_nodes_parsed);
 
+	if (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE)
+		node_set(node, numa_nodes_pmem);
+	else
+		node_set(node, numa_nodes_dram);
+
 	pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
 		node, pxm,
 		(unsigned long long) start, (unsigned long long) end - 1,




* [RFC][PATCH v2 03/21] x86/numa_emulation: fix fake NUMA in uniform case
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 02/21] acpi/numa: memorize NUMA node type from SRAT table Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 04/21] x86/numa_emulation: pass numa node type to fake nodes Fengguang Wu
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, kvm, LKML, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi, Dan Williams, Fengguang Wu

[-- Attachment #1: fix-fake-numa.patch --]
[-- Type: text/plain, Size: 1451 bytes --]

From: Fan Du <fan.du@intel.com>

The index into numa_meminfo.blk[] is expected to be the same as the node id,
and numa_remove_memblk_from() breaks that expectation.

A 2S system does not break, because

before numa_remove_memblk_from()
index  nid
0	0
1	1

after numa_remove_memblk_from()

index  nid
0	1
1	1

If you try to configure uniform fake nodes on a 4S system:
index  nid
0	0
1	1
2	2
3	3

node 3 will be removed by numa_remove_memblk_from() when iterating index 2,
so fake nodes are only created for 3 physical nodes, and a portion of memory
is wasted, as much as hits the lost-pages check in numa_meminfo_cover_memory().

Signed-off-by: Fan Du <fan.du@intel.com>

---
 arch/x86/mm/numa_emulation.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/mm/numa_emulation.c	2018-12-23 19:20:51.570664269 +0800
+++ linux/arch/x86/mm/numa_emulation.c	2018-12-23 19:20:51.566664364 +0800
@@ -381,7 +381,21 @@ void __init numa_emulation(struct numa_m
 		goto no_emu;
 
 	memset(&ei, 0, sizeof(ei));
-	pi = *numa_meminfo;
+
+	{
+		/* Make sure the index is identical with nid */
+		struct numa_meminfo *mi = numa_meminfo;
+		int nid;
+
+		for (i = 0; i < mi->nr_blks; i++) {
+			nid = mi->blk[i].nid;
+			pi.blk[nid].nid = nid;
+			pi.blk[nid].start = mi->blk[i].start;
+			pi.blk[nid].end = mi->blk[i].end;
+		}
+		pi.nr_blks = mi->nr_blks;
+
+	}
 
 	for (i = 0; i < MAX_NUMNODES; i++)
 		emu_nid_to_phys[i] = NUMA_NO_NODE;




* [RFC][PATCH v2 04/21] x86/numa_emulation: pass numa node type to fake nodes
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (2 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 03/21] x86/numa_emulation: fix fake NUMA in uniform case Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 05/21] mmzone: new pgdat flags for DRAM and PMEM Fengguang Wu
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, kvm, LKML, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi, Dan Williams, Fengguang Wu

[-- Attachment #1: 0021-x86-numa-Fix-fake-numa-in-uniform-case.patch --]
[-- Type: text/plain, Size: 1257 bytes --]

From: Fan Du <fan.du@intel.com>

Signed-off-by: Fan Du <fan.du@intel.com>
---
 arch/x86/mm/numa_emulation.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

--- linux.orig/arch/x86/mm/numa_emulation.c	2018-12-23 19:21:11.002206144 +0800
+++ linux/arch/x86/mm/numa_emulation.c	2018-12-23 19:21:10.998206236 +0800
@@ -12,6 +12,8 @@
 
 static int emu_nid_to_phys[MAX_NUMNODES];
 static char *emu_cmdline __initdata;
+static nodemask_t emu_numa_nodes_pmem;
+static nodemask_t emu_numa_nodes_dram;
 
 void __init numa_emu_cmdline(char *str)
 {
@@ -311,6 +313,12 @@ static int __init split_nodes_size_inter
 					       min(end, limit) - start);
 			if (ret < 0)
 				return ret;
+
+			/* Update numa node type for fake numa node */
+			if (node_isset(i, emu_numa_nodes_pmem))
+				node_set(nid - 1, numa_nodes_pmem);
+			else
+				node_set(nid - 1, numa_nodes_dram);
 		}
 	}
 	return nid;
@@ -410,6 +418,12 @@ void __init numa_emulation(struct numa_m
 		unsigned long n;
 		int nid = 0;
 
+		emu_numa_nodes_pmem = numa_nodes_pmem;
+		emu_numa_nodes_dram = numa_nodes_dram;
+
+		nodes_clear(numa_nodes_pmem);
+		nodes_clear(numa_nodes_dram);
+
 		n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
 		ret = -1;
 		for_each_node_mask(i, physnode_mask) {




* [RFC][PATCH v2 05/21] mmzone: new pgdat flags for DRAM and PMEM
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (3 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 04/21] x86/numa_emulation: pass numa node type to fake nodes Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 06/21] x86,numa: update numa node type Fengguang Wu
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0003-mmzone-Introduce-new-flag-to-tag-pgdat-type.patch --]
[-- Type: text/plain, Size: 1607 bytes --]

From: Fan Du <fan.du@intel.com>

On a system with both DRAM and PMEM, we need new flags to tag whether a
pgdat is made of DRAM or persistent memory.

This patch serves as preparation for the follow-up patches.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/mmzone.h |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- linux.orig/include/linux/mmzone.h	2018-12-23 19:29:42.430602202 +0800
+++ linux/include/linux/mmzone.h	2018-12-23 19:29:42.430602202 +0800
@@ -522,6 +522,8 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_DRAM,			/* Volatile DRAM memory node */
+	PGDAT_PMEM,			/* Persistent memory node */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -919,6 +921,30 @@ extern struct pglist_data contig_page_da
 
 #endif /* !CONFIG_NEED_MULTIPLE_NODES */
 
+static inline int is_node_pmem(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_PMEM, &pgdat->flags);
+}
+
+static inline int is_node_dram(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
+static inline void set_node_type(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (node_isset(nid, numa_nodes_pmem))
+		set_bit(PGDAT_PMEM, &pgdat->flags);
+	else
+		set_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
 extern struct pglist_data *first_online_pgdat(void);
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);




* [RFC][PATCH v2 06/21] x86,numa: update numa node type
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (4 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 05/21] mmzone: new pgdat flags for DRAM and PMEM Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 07/21] mm: export node type {pmem|dram} under /sys/bus/node Fengguang Wu
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0004-x86-numa-Update-numa-node-type.patch --]
[-- Type: text/plain, Size: 510 bytes --]

From: Fan Du <fan.du@intel.com>

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/mm/numa.c |    1 +
 1 file changed, 1 insertion(+)

--- linux.orig/arch/x86/mm/numa.c	2018-12-23 19:38:17.363582512 +0800
+++ linux/arch/x86/mm/numa.c	2018-12-23 19:38:17.363582512 +0800
@@ -594,6 +594,7 @@ static int __init numa_register_memblks(
 			continue;
 
 		alloc_node_data(nid);
+		set_node_type(nid);
 	}
 
 	/* Dump memblock with node info and return. */




* [RFC][PATCH v2 07/21] mm: export node type {pmem|dram} under /sys/bus/node
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (5 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 06/21] x86,numa: update numa node type Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node Fengguang Wu
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0005-Export-node-type-pmem-ram-in-sys-bus-node.patch --]
[-- Type: text/plain, Size: 1806 bytes --]

From: Fan Du <fan.du@intel.com>

A user space migration daemon can check
/sys/bus/node/devices/nodeX/type for the node type.

Software can interrogate the node type and distance to pick a desirable
target node for migration.

grep -r . /sys/devices/system/node/*/type
/sys/devices/system/node/node0/type:dram
/sys/devices/system/node/node1/type:dram
/sys/devices/system/node/node2/type:pmem
/sys/devices/system/node/node3/type:pmem

Together with the next patch, which exports `peer_node`, the migration
daemon can easily find the memory type of the current node and the target
node for migration.

grep -r . /sys/devices/system/node/*/peer_node
/sys/devices/system/node/node0/peer_node:2
/sys/devices/system/node/node1/peer_node:3
/sys/devices/system/node/node2/peer_node:0
/sys/devices/system/node/node3/peer_node:1

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 drivers/base/node.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- linux.orig/drivers/base/node.c	2018-12-23 19:39:04.763414931 +0800
+++ linux/drivers/base/node.c	2018-12-23 19:39:04.763414931 +0800
@@ -233,6 +233,15 @@ static ssize_t node_read_distance(struct
 }
 static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+static ssize_t type_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+
+	return sprintf(buf, is_node_pmem(nid) ? "pmem\n" : "dram\n");
+}
+static DEVICE_ATTR(type, S_IRUGO, type_show, NULL);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_cpumap.attr,
 	&dev_attr_cpulist.attr,
@@ -240,6 +249,7 @@ static struct attribute *node_dev_attrs[
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+	&dev_attr_type.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(node_dev);
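
As an illustration only (not part of this patch), a user space daemon could
read the two attributes like this. The sysfs paths are exactly what the grep
output above shows; everything else is an assumption:

	#include <stdio.h>

	/* Read one attribute of /sys/devices/system/node/node<nid>/,
	 * e.g. "type" => "dram\n" or "peer_node" => "2\n". */
	static int read_node_attr(int nid, const char *attr, char *buf, int len)
	{
		char path[128];
		FILE *f;
		int ok;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/%s", nid, attr);
		f = fopen(path, "r");
		if (!f)
			return -1;
		ok = fgets(buf, len, f) != NULL;
		fclose(f);
		return ok ? 0 : -1;
	}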




* [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (6 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 07/21] mm: export node type {pmem|dram} under /sys/bus/node Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-27 20:07   ` Christopher Lameter
  2018-12-26 13:14 ` [RFC][PATCH v2 09/21] mm: avoid duplicate peer target node Fengguang Wu
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0019-mm-Introduce-and-export-peer_node-for-pgdat.patch --]
[-- Type: text/plain, Size: 3314 bytes --]

From: Fan Du <fan.du@intel.com>

Each CPU socket can have one DRAM and one PMEM node; we call them "peer nodes".
Migration between DRAM and PMEM will by default happen between peer nodes.

This is a temporary solution. With multiple memory layers, a node can have
both promotion and demotion targets instead of a single peer node. User space
may also be able to infer promotion/demotion targets based on future HMAT
info.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 drivers/base/node.c    |   11 +++++++++++
 include/linux/mmzone.h |   12 ++++++++++++
 mm/page_alloc.c        |   29 +++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

--- linux.orig/drivers/base/node.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/drivers/base/node.c	2018-12-23 19:39:51.643261112 +0800
@@ -242,6 +242,16 @@ static ssize_t type_show(struct device *
 }
 static DEVICE_ATTR(type, S_IRUGO, type_show, NULL);
 
+static ssize_t peer_node_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+	return sprintf(buf, "%d\n", pgdat->peer_node);
+}
+static DEVICE_ATTR(peer_node, S_IRUGO, peer_node_show, NULL);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_cpumap.attr,
 	&dev_attr_cpulist.attr,
@@ -250,6 +260,7 @@ static struct attribute *node_dev_attrs[
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
 	&dev_attr_type.attr,
+	&dev_attr_peer_node.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(node_dev);
--- linux.orig/include/linux/mmzone.h	2018-12-23 19:39:51.647261099 +0800
+++ linux/include/linux/mmzone.h	2018-12-23 19:39:51.643261112 +0800
@@ -713,6 +713,18 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+
+	/*
+	 * Points to the nearest node in terms of latency
+	 * E.g. peer of node 0 is node 2 per SLIT
+	 * node distances:
+	 * node   0   1   2   3
+	 *   0:  10  21  17  28
+	 *   1:  21  10  28  17
+	 *   2:  17  28  10  28
+	 *   3:  28  17  28  10
+	 */
+	int	peer_node;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
--- linux.orig/mm/page_alloc.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/mm/page_alloc.c	2018-12-23 19:39:51.643261112 +0800
@@ -6926,6 +6926,34 @@ static void check_for_memory(pg_data_t *
 	}
 }
 
+/*
+ * Return the nearest peer node in terms of *locality*
+ * E.g. peer of node 0 is node 2 per SLIT
+ * node distances:
+ * node   0   1   2   3
+ *   0:  10  21  17  28
+ *   1:  21  10  28  17
+ *   2:  17  28  10  28
+ *   3:  28  17  28  10
+ */
+static int find_best_peer_node(int nid)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int peer = NUMA_NO_NODE;
+
+	for_each_online_node(n) {
+		if (n == nid)
+			continue;
+		val = node_distance(nid, n);
+		if (val < min_val) {
+			min_val = val;
+			peer = n;
+		}
+	}
+	return peer;
+}
+
 /**
  * free_area_init_nodes - Initialise all pg_data_t and zone data
  * @max_zone_pfn: an array of max PFNs for each zone
@@ -7012,6 +7040,7 @@ void __init free_area_init_nodes(unsigne
 		if (pgdat->node_present_pages)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
+		pgdat->peer_node = find_best_peer_node(nid);
 	}
 }
 




* [RFC][PATCH v2 09/21] mm: avoid duplicate peer target node
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (7 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node Fengguang Wu
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fengguang Wu, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0020-page_alloc-avoid-duplicate-peer-target-node.patch --]
[-- Type: text/plain, Size: 2145 bytes --]

To ensure a 1:1 peer node mapping on a broken BIOS with SLIT distances like

	node distances:
	node   0   1   2   3
	  0:  10  21  20  20
	  1:  21  10  20  20
	  2:  20  20  10  20
	  3:  20  20  20  10

or with numa=fake=4U

	node distances:
	node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
	  0:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
	  1:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
	  2:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
	  3:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
	  4:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
	  5:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
	  6:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
	  7:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
	  8:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
	  9:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
	 10:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
	 11:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
	 12:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10
	 13:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10
	 14:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10
	 15:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/page_alloc.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- linux.orig/mm/page_alloc.c	2018-12-23 19:48:27.366110325 +0800
+++ linux/mm/page_alloc.c	2018-12-23 19:48:27.362110332 +0800
@@ -6941,16 +6941,22 @@ static int find_best_peer_node(int nid)
 	int n, val;
 	int min_val = INT_MAX;
 	int peer = NUMA_NO_NODE;
+	static nodemask_t target_nodes = NODE_MASK_NONE;
 
 	for_each_online_node(n) {
 		if (n == nid)
 			continue;
 		val = node_distance(nid, n);
+		if (val == LOCAL_DISTANCE)
+			continue;
+		if (node_isset(n, target_nodes))
+			continue;
 		if (val < min_val) {
 			min_val = val;
 			peer = n;
 		}
 	}
+	node_set(peer, target_nodes);
 	return peer;
 }
 




* [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (8 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 09/21] mm: avoid duplicate peer target node Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2019-01-01  9:14   ` Aneesh Kumar K.V
  2018-12-26 13:14 ` [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM Fengguang Wu
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0016-page-alloc-Build-separate-zonelist-for-PMEM-and-RAM-.patch --]
[-- Type: text/plain, Size: 3845 bytes --]

From: Fan Du <fan.du@intel.com>

When allocating pages, DRAM and PMEM nodes had better not fall back to
each other. This allows migration code to explicitly control which type
of node to allocate pages from.

With this patch, a PMEM NUMA node can only be used in 2 ways (see the
mbind() sketch appended after the diff):
- migrate in and out
- numactl

That guarantees PMEM NUMA nodes will only hold anonymous pages.
We don't detect hotness for other types of pages for now, so we need to
prevent a PMEM page from going hot while we are unable to detect it and
move it to DRAM.

Another implication is that new page allocations will by default go to
DRAM nodes, which is normally a good choice -- since DRAM writes are
cheaper than PMEM writes, it's often beneficial to watch new pages in
DRAM for some time and only move the likely-cold pages to PMEM.

However there can be exceptions. For example, if the PMEM:DRAM ratio is
very high, some page allocations may better go to PMEM nodes directly.
In the long term, we may create more kinds of fallback zonelists and make
them configurable through NUMA policy.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/mempolicy.c  |   14 ++++++++++++++
 mm/page_alloc.c |   42 +++++++++++++++++++++++++++++-------------
 2 files changed, 43 insertions(+), 13 deletions(-)

--- linux.orig/mm/mempolicy.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/mempolicy.c	2018-12-26 20:29:24.597884301 +0800
@@ -1745,6 +1745,20 @@ static int policy_node(gfp_t gfp, struct
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	if (policy->mode == MPOL_BIND) {
+		nodemask_t nodes = policy->v.nodes;
+
+		/*
+		 * The rule is if we run on DRAM node and mbind to PMEM node,
+		 * perferred node id is the peer node, vice versa.
+		 * if we run on DRAM node and mbind to DRAM node, #PF node is
+		 * the preferred node, vice versa, so just fall back.
+		 */
+		if ((is_node_dram(nd) && nodes_subset(nodes, numa_nodes_pmem)) ||
+			(is_node_pmem(nd) && nodes_subset(nodes, numa_nodes_dram)))
+			nd = NODE_DATA(nd)->peer_node;
+	}
+
 	return nd;
 }
 
--- linux.orig/mm/page_alloc.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/page_alloc.c	2018-12-26 20:03:49.817417321 +0800
@@ -5153,6 +5153,10 @@ static int find_next_best_node(int node,
 		if (node_isset(n, *used_node_mask))
 			continue;
 
+		/* DRAM node doesn't fallback to pmem node */
+		if (is_node_pmem(n))
+			continue;
+
 		/* Use the distance array to find the distance */
 		val = node_distance(node, n);
 
@@ -5242,19 +5246,31 @@ static void build_zonelists(pg_data_t *p
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
-
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
+	/* Pmem node doesn't fallback to DRAM node */
+	if (is_node_pmem(local_node)) {
+		int n;
+
+		/* Pmem nodes should fallback to each other */
+		node_order[nr_nodes++] = local_node;
+		for_each_node_state(n, N_MEMORY) {
+			if ((n != local_node) && is_node_pmem(n))
+				node_order[nr_nodes++] = n;
+		}
+	} else {
+		while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+			/*
+			 * We don't want to pressure a particular node.
+			 * So adding penalty to the first node in same
+			 * distance group to make it round-robin.
+			 */
+			if (node_distance(local_node, node) !=
+			    node_distance(local_node, prev_node))
+				node_load[node] = load;
+
+			node_order[nr_nodes++] = node;
+			prev_node = node;
+			load--;
+		}
 	}
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
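
Appended illustration (not part of the patch): a minimal sketch of the second
way listed in the changelog -- explicitly binding anonymous memory to a PMEM
node with mbind(2). The node id 2 is an assumption matching the example
topology shown in the earlier sysfs patches; link with -lnuma.

	#include <numaif.h>	/* mbind(), MPOL_BIND (libnuma) */
	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		size_t len = 64UL << 20;		/* 64MB of anon memory */
		int pmem_node = 2;			/* assumed PMEM node id */
		unsigned long nodemask = 1UL << pmem_node;
		void *p;

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Bind the range to the PMEM node before first touch. */
		if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
			perror("mbind");

		return 0;
	}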




* [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (9 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2019-01-01  9:23   ` Aneesh Kumar K.V
  2019-01-02 16:47   ` Dave Hansen
  2018-12-26 13:14 ` [RFC][PATCH v2 12/21] x86/pgtable: " Fengguang Wu
                   ` (10 subsequent siblings)
  21 siblings, 2 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Yao Yuan, Fengguang Wu, kvm, LKML,
	Fan Du, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0001-kvm-allocate-page-table-pages-from-DRAM.patch --]
[-- Type: text/plain, Size: 1212 bytes --]

From: Yao Yuan <yuan.yao@intel.com>

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
arch/x86/kvm/mmu.c |   12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
+++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
@@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
 		kmem_cache_free(cache, mc->objects[--mc->nobjs]);
 }
 
+static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
+{
+       struct page *page;
+
+       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
+       if (!page)
+	       return 0;
+       return (unsigned long) page_address(page);
+}
+
 static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
 				       int min)
 {
@@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
 		if (!page)
 			return cache->nobjs >= min ? 0 : -ENOMEM;
 		cache->objects[cache->nobjs++] = page;




* [RFC][PATCH v2 12/21] x86/pgtable: allocate page table pages from DRAM
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (10 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:14 ` [RFC][PATCH v2 13/21] x86/pgtable: dont check PMD accessed bit Fengguang Wu
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fengguang Wu, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0018-pgtable-force-pgtable-allocation-from-DRAM-node-0.patch --]
[-- Type: text/plain, Size: 3554 bytes --]

On random reads/writes over large data sets, we find nearly half of the
memory accesses are caused by TLB misses and hence hit page table pages. So
it is better to keep page table pages in the faster DRAM nodes.

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/include/asm/pgalloc.h |   10 +++++++---
 arch/x86/mm/pgtable.c          |   22 ++++++++++++++++++----
 2 files changed, 25 insertions(+), 7 deletions(-)

--- linux.orig/arch/x86/mm/pgtable.c	2018-12-26 19:41:57.494900885 +0800
+++ linux/arch/x86/mm/pgtable.c	2018-12-26 19:42:35.531621035 +0800
@@ -22,17 +22,30 @@ EXPORT_SYMBOL(physical_mask);
 #endif
 
 gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
+nodemask_t all_node_mask = NODE_MASK_ALL;
+
+unsigned long __get_free_pgtable_pages(gfp_t gfp_mask,
+						     unsigned int order)
+{
+	struct page *page;
+
+	page = __alloc_pages_nodemask(gfp_mask, order, numa_node_id(), &all_node_mask);
+	if (!page)
+		return 0;
+	return (unsigned long) page_address(page);
+}
+EXPORT_SYMBOL(__get_free_pgtable_pages);
 
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
+	return (pte_t *)__get_free_pgtable_pages(PGALLOC_GFP & ~__GFP_ACCOUNT, 0);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
 
-	pte = alloc_pages(__userpte_alloc_gfp, 0);
+	pte = __alloc_pages_nodemask(__userpte_alloc_gfp, 0, numa_node_id(), &all_node_mask);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
@@ -241,7 +254,7 @@ static int preallocate_pmds(struct mm_st
 		gfp &= ~__GFP_ACCOUNT;
 
 	for (i = 0; i < count; i++) {
-		pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
+		pmd_t *pmd = (pmd_t *)__get_free_pgtable_pages(gfp, 0);
 		if (!pmd)
 			failed = true;
 		if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
@@ -422,7 +435,8 @@ static inline void _pgd_free(pgd_t *pgd)
 
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
+	return (pgd_t *)__get_free_pgtable_pages(PGALLOC_GFP,
+						 PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
--- linux.orig/arch/x86/include/asm/pgalloc.h	2018-12-26 19:40:12.992251270 +0800
+++ linux/arch/x86/include/asm/pgalloc.h	2018-12-26 19:42:35.531621035 +0800
@@ -96,10 +96,11 @@ static inline pmd_t *pmd_alloc_one(struc
 {
 	struct page *page;
 	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
+	nodemask_t all_node_mask = NODE_MASK_ALL;
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
-	page = alloc_pages(gfp, 0);
+	page = __alloc_pages_nodemask(gfp, 0, numa_node_id(), &all_node_mask);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -141,13 +142,16 @@ static inline void p4d_populate(struct m
 	set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
+extern unsigned long __get_free_pgtable_pages(gfp_t gfp_mask,
+					      unsigned int order);
+
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
 	gfp_t gfp = GFP_KERNEL_ACCOUNT;
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
-	return (pud_t *)get_zeroed_page(gfp);
+	return (pud_t *)__get_free_pgtable_pages(gfp | __GFP_ZERO, 0);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -179,7 +183,7 @@ static inline p4d_t *p4d_alloc_one(struc
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
-	return (p4d_t *)get_zeroed_page(gfp);
+	return (p4d_t *)__get_free_pgtable_pages(gfp | __GFP_ZERO, 0);
 }
 
 static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)




* [RFC][PATCH v2 13/21] x86/pgtable: dont check PMD accessed bit
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (11 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 12/21] x86/pgtable: " Fengguang Wu
@ 2018-12-26 13:14 ` Fengguang Wu
  2018-12-26 13:15 ` [RFC][PATCH v2 14/21] kvm: register in mm_struct Fengguang Wu
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Jingqi Liu, Fengguang Wu, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0006-pgtable-don-t-check-the-page-accessed-bit.patch --]
[-- Type: text/plain, Size: 1062 bytes --]

From: Jingqi Liu <jingqi.liu@intel.com>

ept-idle will clear the PMD accessed bit to speed up the PTE scan -- if the
bit remains unset in the next scan, all 512 PTEs under it can be skipped.

So don't complain about !_PAGE_ACCESSED in pmd_bad().

Note that clearing the PMD accessed bit has its own cost; the optimization
may only be worthwhile for
- large idle areas
- sparsely populated areas

Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/include/asm/pgtable.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/include/asm/pgtable.h	2018-12-23 19:50:50.917902600 +0800
+++ linux/arch/x86/include/asm/pgtable.h	2018-12-23 19:50:50.913902605 +0800
@@ -821,7 +821,8 @@ static inline pte_t *pte_offset_kernel(p
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+					(_KERNPG_TABLE & ~_PAGE_ACCESSED);
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)




* [RFC][PATCH v2 14/21] kvm: register in mm_struct
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (12 preceding siblings ...)
  2018-12-26 13:14 ` [RFC][PATCH v2 13/21] x86/pgtable: dont check PMD accessed bit Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2019-02-02  6:57   ` Peter Xu
  2018-12-26 13:15 ` [RFC][PATCH v2 15/21] ept-idle: EPT walk for virtual machine Fengguang Wu
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Nikita Leshenko,
	Christian Borntraeger, Fengguang Wu, kvm, LKML, Fan Du, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi, Dan Williams

[-- Attachment #1: 0009-kvm-register-in-mm_struct.patch --]
[-- Type: text/plain, Size: 2028 bytes --]

A VM is associated with an address space and not a specific thread.

From Documentation/virtual/kvm/api.txt:
   Only run VM ioctls from the same process (address space) that was used
   to create the VM.

CC: Nikita Leshenko <nikita.leshchenko@oracle.com>
CC: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/mm_types.h |   11 +++++++++++
 virt/kvm/kvm_main.c      |    3 +++
 2 files changed, 14 insertions(+)

--- linux.orig/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
+++ linux/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
@@ -27,6 +27,7 @@ typedef int vm_fault_t;
 struct address_space;
 struct mem_cgroup;
 struct hmm;
+struct kvm;
 
 /*
  * Each physical page in the system has a struct page associated with
@@ -496,6 +497,10 @@ struct mm_struct {
 		/* HMM needs to track a few things per mm */
 		struct hmm *hmm;
 #endif
+
+#if IS_ENABLED(CONFIG_KVM)
+		struct kvm *kvm;
+#endif
 	} __randomize_layout;
 
 	/*
@@ -507,6 +512,12 @@ struct mm_struct {
 
 extern struct mm_struct init_mm;
 
+#if IS_ENABLED(CONFIG_KVM)
+static inline struct kvm *mm_kvm(struct mm_struct *mm) { return mm->kvm; }
+#else
+static inline struct kvm *mm_kvm(struct mm_struct *mm) { return NULL; }
+#endif
+
 /* Pointer magic because the dynamic array size confuses some compilers. */
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
--- linux.orig/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
+++ linux/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
@@ -727,6 +727,7 @@ static void kvm_destroy_vm(struct kvm *k
 	struct mm_struct *mm = kvm->mm;
 
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
+	mm->kvm = NULL;
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
 	spin_lock(&kvm_lock);
@@ -3224,6 +3225,8 @@ static int kvm_dev_ioctl_create_vm(unsig
 		fput(file);
 		return -ENOMEM;
 	}
+
+	kvm->mm->kvm = kvm;
 	kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
 
 	fd_install(r, file);




* [RFC][PATCH v2 15/21] ept-idle: EPT walk for virtual machine
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (13 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 14/21] kvm: register in mm_struct Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2018-12-26 13:15 ` [RFC][PATCH v2 16/21] mm-idle: mm_walk for normal task Fengguang Wu
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Dave Hansen, Peng Dong, Liu Jingqi,
	Fengguang Wu, kvm, LKML, Fan Du, Yao Yuan, Huang Ying,
	Dong Eddie, Zhang Yi, Dan Williams

[-- Attachment #1: 0014-kvm-ept-idle-EPT-page-table-walk-for-A-bits.patch --]
[-- Type: text/plain, Size: 19039 bytes --]

For virtual machines, "accessed" bits will be set in the guest page tables
and in the EPT/NPT. So for a qemu-kvm process, convert HVA to GFN to GPA,
then do EPT/NPT walks.

This borrows the host page table walk macros/functions to do the EPT/NPT
walk, so it depends on EPT/NPT using the same number of levels as the host
page tables.

As proposed by Dave Hansen, invalidate the TLB when one round of scan is
finished, in order to ensure HW will set the accessed bit for super-hot
pages.

V2: convert idle_bitmap to idle_pages to be more efficient on
- huge pages
- sparse page tables
- ranges of similar pages

The new idle_pages file contains a series of records of different sizes,
reporting ranges of different page sizes to user space. That interface has a
major downside: it breaks the read() assumption that range_to_read ==
read_buffer_size. For now we work around this problem by deducing
range_to_read from read_buffer_size, and letting read() return when either
the read buffer is filled or range_to_read is fully scanned.

To make the interface more precise, we may need to switch to ioctl().
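
Below is an assumption-laden user space sketch (not shipped with this
patchset) of how a reader could decode the stream. The record layout is
inferred from eic_report_addr()/eic_add_page() in the patch: a
PIP_CMD_SET_HVA byte is followed by an 8-byte big-endian virtual address, and
every other byte packs the page type in its high nibble and a repeat count in
its low nibble. The numeric value of PIP_CMD_SET_HVA and the type-to-page-size
table live in ept_idle.h and are assumed here.

	#include <stdio.h>
	#include <stdint.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define PIP_CMD_SET_HVA 0xf0	/* assumed value; see ept_idle.h */

	int main(int argc, char **argv)
	{
		uint8_t buf[4096];
		uint64_t va = 0;
		ssize_t n;
		int i, fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);	/* e.g. /proc/1234/idle_pages */
		if (fd < 0)
			return 1;

		/* The read(2) file offset selects the starting virtual address. */
		n = pread(fd, buf, sizeof(buf), 0);

		for (i = 0; i < n; i++) {
			if (buf[i] == PIP_CMD_SET_HVA && i + 8 < n) {
				int j;

				va = 0;
				for (j = 1; j <= 8; j++)	/* big-endian u64 */
					va = (va << 8) | buf[i + j];
				i += 8;
				continue;
			}
			/* Advancing va per record would need pagetype_size[]
			 * from ept_idle.c; omitted in this sketch. */
			printf("va=%#llx type=%u count=%u\n",
			       (unsigned long long)va, buf[i] >> 4, buf[i] & 0xf);
		}
		close(fd);
		return 0;
	}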

CC: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Peng Dong <dongx.peng@intel.com>
Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/ept_idle.c |  637 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h |  116 ++++++
 2 files changed, 753 insertions(+)
 create mode 100644 arch/x86/kvm/ept_idle.c
 create mode 100644 arch/x86/kvm/ept_idle.h

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/arch/x86/kvm/ept_idle.c	2018-12-26 20:38:07.298994533 +0800
@@ -0,0 +1,637 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pagemap.h>
+#include <linux/mm.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/proc_fs.h>
+#include <linux/uaccess.h>
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/bitmap.h>
+#include <linux/sched/mm.h>
+#include <asm/tlbflush.h>
+
+#include "ept_idle.h"
+
+/* #define DEBUG 1 */
+
+#ifdef DEBUG
+
+#define debug_printk trace_printk
+
+#define set_restart_gpa(val, note)	({			\
+	unsigned long old_val = eic->restart_gpa;		\
+	eic->restart_gpa = (val);				\
+	trace_printk("restart_gpa=%lx %luK  %s  %s %d\n",	\
+		     (val), (eic->restart_gpa - old_val) >> 10,	\
+		     note, __func__, __LINE__);			\
+})
+
+#define set_next_hva(val, note)	({				\
+	unsigned long old_val = eic->next_hva;			\
+	eic->next_hva = (val);					\
+	trace_printk("   next_hva=%lx %luK  %s  %s %d\n",	\
+		     (val), (eic->next_hva - old_val) >> 10,	\
+		     note, __func__, __LINE__);			\
+})
+
+#else
+
+#define debug_printk(...)
+
+#define set_restart_gpa(val, note)	({			\
+	eic->restart_gpa = (val);				\
+})
+
+#define set_next_hva(val, note)	({				\
+	eic->next_hva = (val);					\
+})
+
+#endif
+
+static unsigned long pagetype_size[16] = {
+	[PTE_ACCESSED]	= PAGE_SIZE,	/* 4k page */
+	[PMD_ACCESSED]	= PMD_SIZE,	/* 2M page */
+	[PUD_PRESENT]	= PUD_SIZE,	/* 1G page */
+
+	[PTE_DIRTY]	= PAGE_SIZE,
+	[PMD_DIRTY]	= PMD_SIZE,
+
+	[PTE_IDLE]	= PAGE_SIZE,
+	[PMD_IDLE]	= PMD_SIZE,
+	[PMD_IDLE_PTES] = PMD_SIZE,
+
+	[PTE_HOLE]	= PAGE_SIZE,
+	[PMD_HOLE]	= PMD_SIZE,
+};
+
+static void u64_to_u8(uint64_t n, uint8_t *p)
+{
+	p += sizeof(uint64_t) - 1;
+
+	*p-- = n; n >>= 8;
+	*p-- = n; n >>= 8;
+	*p-- = n; n >>= 8;
+	*p-- = n; n >>= 8;
+
+	*p-- = n; n >>= 8;
+	*p-- = n; n >>= 8;
+	*p-- = n; n >>= 8;
+	*p   = n;
+}
+
+static void dump_eic(struct ept_idle_ctrl *eic)
+{
+	debug_printk("ept_idle_ctrl: pie_read=%d pie_read_max=%d buf_size=%d "
+		     "bytes_copied=%d next_hva=%lx restart_gpa=%lx "
+		     "gpa_to_hva=%lx\n",
+		     eic->pie_read,
+		     eic->pie_read_max,
+		     eic->buf_size,
+		     eic->bytes_copied,
+		     eic->next_hva,
+		     eic->restart_gpa,
+		     eic->gpa_to_hva);
+}
+
+static void eic_report_addr(struct ept_idle_ctrl *eic, unsigned long addr)
+{
+	unsigned long hva;
+	eic->kpie[eic->pie_read++] = PIP_CMD_SET_HVA;
+	hva = addr;
+	u64_to_u8(hva, &eic->kpie[eic->pie_read]);
+	eic->pie_read += sizeof(uint64_t);
+	debug_printk("eic_report_addr %lx\n", addr);
+	dump_eic(eic);
+}
+
+static int eic_add_page(struct ept_idle_ctrl *eic,
+			unsigned long addr,
+			unsigned long next,
+			enum ProcIdlePageType page_type)
+{
+	int page_size = pagetype_size[page_type];
+
+	debug_printk("eic_add_page addr=%lx next=%lx "
+		     "page_type=%d pagesize=%dK\n",
+		     addr, next, (int)page_type, (int)page_size >> 10);
+	dump_eic(eic);
+
+	/* align kernel/user vision of cursor position */
+	next = round_up(next, page_size);
+
+	if (!eic->pie_read ||
+	    addr + eic->gpa_to_hva != eic->next_hva) {
+		/* merge hole */
+		if (page_type == PTE_HOLE ||
+		    page_type == PMD_HOLE) {
+			set_restart_gpa(next, "PTE_HOLE|PMD_HOLE");
+			return 0;
+		}
+
+		if (addr + eic->gpa_to_hva < eic->next_hva) {
+			debug_printk("ept_idle: addr moves backwards\n");
+			WARN_ONCE(1, "ept_idle: addr moves backwards");
+		}
+
+		if (eic->pie_read + sizeof(uint64_t) + 2 >= eic->pie_read_max) {
+			set_restart_gpa(addr, "EPT_IDLE_KBUF_FULL");
+			return EPT_IDLE_KBUF_FULL;
+		}
+
+		eic_report_addr(eic, round_down(addr, page_size) +
+							eic->gpa_to_hva);
+	} else {
+		if (PIP_TYPE(eic->kpie[eic->pie_read - 1]) == page_type &&
+		    PIP_SIZE(eic->kpie[eic->pie_read - 1]) < 0xF) {
+			set_next_hva(next + eic->gpa_to_hva, "IN-PLACE INC");
+			set_restart_gpa(next, "IN-PLACE INC");
+			eic->kpie[eic->pie_read - 1]++;
+			WARN_ONCE(page_size < next-addr, "next-addr too large");
+			return 0;
+		}
+		if (eic->pie_read >= eic->pie_read_max) {
+			set_restart_gpa(addr, "EPT_IDLE_KBUF_FULL");
+			return EPT_IDLE_KBUF_FULL;
+		}
+	}
+
+	set_next_hva(next + eic->gpa_to_hva, "NEW-ITEM");
+	set_restart_gpa(next, "NEW-ITEM");
+	eic->kpie[eic->pie_read] = PIP_COMPOSE(page_type, 1);
+	eic->pie_read++;
+
+	return 0;
+}
+
+static int ept_pte_range(struct ept_idle_ctrl *eic,
+			 pmd_t *pmd, unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	enum ProcIdlePageType page_type;
+	int err = 0;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!ept_pte_present(*pte))
+			page_type = PTE_HOLE;
+		else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED,
+					     (unsigned long *) &pte->pte))
+			page_type = PTE_IDLE;
+		else {
+			page_type = PTE_ACCESSED;
+		}
+
+		err = eic_add_page(eic, addr, addr + PAGE_SIZE, page_type);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	return err;
+}
+
+static int ept_pmd_range(struct ept_idle_ctrl *eic,
+			 pud_t *pud, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	enum ProcIdlePageType page_type;
+	enum ProcIdlePageType pte_page_type;
+	int err = 0;
+
+	if (eic->flags & SCAN_HUGE_PAGE)
+		pte_page_type = PMD_IDLE_PTES;
+	else
+		pte_page_type = IDLE_PAGE_TYPE_MAX;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+
+		if (!ept_pmd_present(*pmd))
+			page_type = PMD_HOLE;	/* likely won't hit here */
+		else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED,
+					     (unsigned long *)pmd)) {
+			if (pmd_large(*pmd))
+				page_type = PMD_IDLE;
+			else if (eic->flags & SCAN_SKIM_IDLE)
+				page_type = PMD_IDLE_PTES;
+			else
+				page_type = pte_page_type;
+		} else if (pmd_large(*pmd)) {
+			page_type = PMD_ACCESSED;
+		} else
+			page_type = pte_page_type;
+
+		if (page_type != IDLE_PAGE_TYPE_MAX)
+			err = eic_add_page(eic, addr, next, page_type);
+		else
+			err = ept_pte_range(eic, pmd, addr, next);
+		if (err)
+			break;
+	} while (pmd++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_pud_range(struct ept_idle_ctrl *eic,
+			 p4d_t *p4d, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+	int err = 0;
+
+	pud = pud_offset(p4d, addr);
+	do {
+		next = pud_addr_end(addr, end);
+
+		if (!ept_pud_present(*pud)) {
+			set_restart_gpa(next, "PUD_HOLE");
+			continue;
+		}
+
+		if (pud_large(*pud))
+			err = eic_add_page(eic, addr, next, PUD_PRESENT);
+		else
+			err = ept_pmd_range(eic, pud, addr, next);
+
+		if (err)
+			break;
+	} while (pud++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_p4d_range(struct ept_idle_ctrl *eic,
+			 pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	int err = 0;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (!ept_p4d_present(*p4d)) {
+			set_restart_gpa(next, "P4D_HOLE");
+			continue;
+		}
+
+		err = ept_pud_range(eic, p4d, addr, next);
+		if (err)
+			break;
+	} while (p4d++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_page_range(struct ept_idle_ctrl *eic,
+			  unsigned long addr,
+			  unsigned long end)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_mmu *mmu;
+	pgd_t *ept_root;
+	pgd_t *pgd;
+	unsigned long next;
+	int err = 0;
+
+	BUG_ON(addr >= end);
+
+	spin_lock(&eic->kvm->mmu_lock);
+
+	vcpu = kvm_get_vcpu(eic->kvm, 0);
+	if (!vcpu) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	mmu = vcpu->arch.mmu;
+	if (!VALID_PAGE(mmu->root_hpa)) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	ept_root = __va(mmu->root_hpa);
+
+	local_irq_disable();
+	pgd = pgd_offset_pgd(ept_root, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (!ept_pgd_present(*pgd)) {
+			set_restart_gpa(next, "PGD_HOLE");
+			continue;
+		}
+
+		err = ept_p4d_range(eic, pgd, addr, next);
+		if (err)
+			break;
+	} while (pgd++, addr = next, addr != end);
+	local_irq_enable();
+out_unlock:
+	spin_unlock(&eic->kvm->mmu_lock);
+	return err;
+}
+
+static void init_ept_idle_ctrl_buffer(struct ept_idle_ctrl *eic)
+{
+	eic->pie_read = 0;
+	eic->pie_read_max = min(EPT_IDLE_KBUF_SIZE,
+				eic->buf_size - eic->bytes_copied);
+	/* reserve space for PIP_CMD_SET_HVA in the end */
+	eic->pie_read_max -= sizeof(uint64_t) + 1;
+	memset(eic->kpie, 0, sizeof(eic->kpie));
+}
+
+static int ept_idle_copy_user(struct ept_idle_ctrl *eic,
+			      unsigned long start, unsigned long end)
+{
+	int bytes_read;
+	int lc = 0;	/* last copy? */
+	int ret;
+
+	debug_printk("ept_idle_copy_user %lx %lx\n", start, end);
+	dump_eic(eic);
+
+	/* Break out of loop on no more progress. */
+	if (!eic->pie_read) {
+		lc = 1;
+		if (start < end)
+			start = end;
+	}
+
+	if (start >= end && start > eic->next_hva) {
+		set_next_hva(start, "TAIL-HOLE");
+		eic_report_addr(eic, start);
+	}
+
+	bytes_read = eic->pie_read;
+	if (!bytes_read)
+		return 1;
+
+	ret = copy_to_user(eic->buf, eic->kpie, bytes_read);
+	if (ret)
+		return -EFAULT;
+
+	eic->buf += bytes_read;
+	eic->bytes_copied += bytes_read;
+	if (eic->bytes_copied >= eic->buf_size)
+		return EPT_IDLE_BUF_FULL;
+	if (lc)
+		return lc;
+
+	init_ept_idle_ctrl_buffer(eic);
+	cond_resched();
+	return 0;
+}
+
+/*
+ * Depending on whether hva falls in a memslot:
+ *
+ * 1) found => return gpa and remaining memslot size in *addr_range
+ *
+ *                 |<----- addr_range --------->|
+ *         [               mem slot             ]
+ *                 ^hva
+ *
+ * 2) not found => return hole size in *addr_range
+ *
+ *                 |<----- addr_range --------->|
+ *                                              [   first mem slot above hva  ]
+ *                 ^hva
+ *
+ * If hva is above all mem slots, *addr_range will be ~0UL. We can finish read(2).
+ */
+static unsigned long ept_idle_find_gpa(struct ept_idle_ctrl *eic,
+				       unsigned long hva,
+				       unsigned long *addr_range)
+{
+	struct kvm *kvm = eic->kvm;
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	unsigned long hva_end;
+	gfn_t gfn;
+
+	*addr_range = ~0UL;
+	mutex_lock(&kvm->slots_lock);
+	slots = kvm_memslots(eic->kvm);
+	kvm_for_each_memslot(memslot, slots) {
+		hva_end = memslot->userspace_addr +
+		    (memslot->npages << PAGE_SHIFT);
+
+		if (hva >= memslot->userspace_addr && hva < hva_end) {
+			gpa_t gpa;
+			gfn = hva_to_gfn_memslot(hva, memslot);
+			*addr_range = hva_end - hva;
+			gpa = gfn_to_gpa(gfn);
+			debug_printk("ept_idle_find_gpa slot %lx=>%llx %lx=>%llx "
+				     "delta %llx size %lx\n",
+				     memslot->userspace_addr,
+				     gfn_to_gpa(memslot->base_gfn),
+				     hva, gpa,
+				     hva - gpa,
+				     memslot->npages << PAGE_SHIFT);
+			mutex_unlock(&kvm->slots_lock);
+			return gpa;
+		}
+
+		if (memslot->userspace_addr > hva)
+			*addr_range = min(*addr_range,
+					  memslot->userspace_addr - hva);
+	}
+	mutex_unlock(&kvm->slots_lock);
+	return INVALID_PAGE;
+}
+
+static int ept_idle_supports_cpu(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_mmu *mmu;
+	int ret;
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	if (!vcpu)
+		return -EINVAL;
+
+	spin_lock(&kvm->mmu_lock);
+	mmu = vcpu->arch.mmu;
+	if (mmu->mmu_role.base.ad_disabled) {
+		printk(KERN_NOTICE
+		       "CPU does not support EPT A/D bits tracking\n");
+		ret = -EINVAL;
+	} else if (mmu->shadow_root_level != 4 + (!!pgtable_l5_enabled())) {
+		printk(KERN_NOTICE "Unsupported EPT level %d\n",
+		       mmu->shadow_root_level);
+		ret = -EINVAL;
+	} else
+		ret = 0;
+	spin_unlock(&kvm->mmu_lock);
+
+	return ret;
+}
+
+static int ept_idle_walk_hva_range(struct ept_idle_ctrl *eic,
+				   unsigned long start, unsigned long end)
+{
+	unsigned long gpa_addr;
+	unsigned long addr_range;
+	int ret;
+
+	ret = ept_idle_supports_cpu(eic->kvm);
+	if (ret)
+		return ret;
+
+	init_ept_idle_ctrl_buffer(eic);
+
+	for (; start < end;) {
+		gpa_addr = ept_idle_find_gpa(eic, start, &addr_range);
+
+		if (gpa_addr == INVALID_PAGE) {
+			eic->gpa_to_hva = 0;
+			if (addr_range == ~0UL) /* beyond max virtual address */
+				set_restart_gpa(TASK_SIZE, "EOF");
+			else {
+				start += addr_range;
+				set_restart_gpa(start, "OUT-OF-SLOT");
+			}
+		} else {
+			eic->gpa_to_hva = start - gpa_addr;
+			ept_page_range(eic, gpa_addr, gpa_addr + addr_range);
+		}
+
+		start = eic->restart_gpa + eic->gpa_to_hva;
+		ret = ept_idle_copy_user(eic, start, end);
+		if (ret)
+			break;
+	}
+
+	if (eic->bytes_copied)
+		ret = 0;
+	return ret;
+}
+
+static ssize_t ept_idle_read(struct file *file, char *buf,
+			     size_t count, loff_t *ppos)
+{
+	struct mm_struct *mm = file->private_data;
+	struct ept_idle_ctrl *eic;
+	unsigned long hva_start = *ppos;
+	unsigned long hva_end = hva_start + (count << (3 + PAGE_SHIFT));
+	int ret;
+
+	if (hva_start >= TASK_SIZE) {
+		debug_printk("ept_idle_read past TASK_SIZE: %lx %lx\n",
+			     hva_start, TASK_SIZE);
+		return 0;
+	}
+
+	if (!mm_kvm(mm))
+		return mm_idle_read(file, buf, count, ppos);
+
+	if (hva_end <= hva_start) {
+		debug_printk("ept_idle_read past EOF: %lx %lx\n",
+			     hva_start, hva_end);
+		return 0;
+	}
+	if (*ppos & (PAGE_SIZE - 1)) {
+		debug_printk("ept_idle_read unaligned ppos: %lx\n",
+			     hva_start);
+		return -EINVAL;
+	}
+	if (count < EPT_IDLE_BUF_MIN) {
+		debug_printk("ept_idle_read small count: %lx\n",
+			     (unsigned long)count);
+		return -EINVAL;
+	}
+
+	eic = kzalloc(sizeof(*eic), GFP_KERNEL);
+	if (!eic)
+		return -ENOMEM;
+
+	if (!mm || !mmget_not_zero(mm)) {
+		ret = -ESRCH;
+		goto out_free_eic;
+	}
+
+	eic->buf = buf;
+	eic->buf_size = count;
+	eic->mm = mm;
+	eic->kvm = mm_kvm(mm);
+	if (!eic->kvm) {
+		ret = -EINVAL;
+		goto out_mm;
+	}
+
+	kvm_get_kvm(eic->kvm);
+
+	ret = ept_idle_walk_hva_range(eic, hva_start, hva_end);
+	if (ret)
+		goto out_kvm;
+
+	ret = eic->bytes_copied;
+	*ppos = eic->next_hva;
+	debug_printk("ppos=%lx bytes_copied=%d\n",
+		     eic->next_hva, ret);
+out_kvm:
+	kvm_put_kvm(eic->kvm);
+out_mm:
+	mmput(mm);
+out_free_eic:
+	kfree(eic);
+	return ret;
+}
+
+static int ept_idle_open(struct inode *inode, struct file *file)
+{
+	if (!try_module_get(THIS_MODULE))
+		return -EBUSY;
+
+	return 0;
+}
+
+static int ept_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+	struct kvm *kvm;
+	int ret = 0;
+
+	if (!mm) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	kvm = mm_kvm(mm);
+	if (!kvm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	spin_lock(&kvm->mmu_lock);
+	kvm_flush_remote_tlbs(kvm);
+	spin_unlock(&kvm->mmu_lock);
+
+out:
+	module_put(THIS_MODULE);
+	return ret;
+}
+
+extern struct file_operations proc_ept_idle_operations;
+
+static int ept_idle_entry(void)
+{
+	proc_ept_idle_operations.owner = THIS_MODULE;
+	proc_ept_idle_operations.read = ept_idle_read;
+	proc_ept_idle_operations.open = ept_idle_open;
+	proc_ept_idle_operations.release = ept_idle_release;
+
+	return 0;
+}
+
+static void ept_idle_exit(void)
+{
+	memset(&proc_ept_idle_operations, 0, sizeof(proc_ept_idle_operations));
+}
+
+MODULE_LICENSE("GPL");
+module_init(ept_idle_entry);
+module_exit(ept_idle_exit);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/arch/x86/kvm/ept_idle.h	2018-12-26 20:32:09.775444685 +0800
@@ -0,0 +1,116 @@
+#ifndef _EPT_IDLE_H
+#define _EPT_IDLE_H
+
+#define SCAN_HUGE_PAGE		O_NONBLOCK	/* only huge page */
+#define SCAN_SKIM_IDLE		O_NOFOLLOW	/* stop on PMD_IDLE_PTES */
+
+enum ProcIdlePageType {
+	PTE_ACCESSED,	/* 4k page */
+	PMD_ACCESSED,	/* 2M page */
+	PUD_PRESENT,	/* 1G page */
+
+	PTE_DIRTY,
+	PMD_DIRTY,
+
+	PTE_IDLE,
+	PMD_IDLE,
+	PMD_IDLE_PTES,	/* all PTE idle */
+
+	PTE_HOLE,
+	PMD_HOLE,
+
+	PIP_CMD,
+
+	IDLE_PAGE_TYPE_MAX
+};
+
+#define PIP_TYPE(a)		(0xf & (a >> 4))
+#define PIP_SIZE(a)		(0xf & a)
+#define PIP_COMPOSE(type, nr)	((type << 4) | nr)
+
+#define PIP_CMD_SET_HVA		PIP_COMPOSE(PIP_CMD, 0)
+
+#define _PAGE_BIT_EPT_ACCESSED	8
+#define _PAGE_EPT_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_EPT_ACCESSED)
+
+#define _PAGE_EPT_PRESENT	(_AT(pteval_t, 7))
+
+static inline int ept_pte_present(pte_t a)
+{
+	return pte_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pmd_present(pmd_t a)
+{
+	return pmd_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pud_present(pud_t a)
+{
+	return pud_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_p4d_present(p4d_t a)
+{
+	return p4d_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pgd_present(pgd_t a)
+{
+	return pgd_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pte_accessed(pte_t a)
+{
+	return pte_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pmd_accessed(pmd_t a)
+{
+	return pmd_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pud_accessed(pud_t a)
+{
+	return pud_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_p4d_accessed(p4d_t a)
+{
+	return p4d_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pgd_accessed(pgd_t a)
+{
+	return pgd_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+extern struct file_operations proc_ept_idle_operations;
+
+#define EPT_IDLE_KBUF_FULL	1
+#define EPT_IDLE_BUF_FULL	2
+#define EPT_IDLE_BUF_MIN	(sizeof(uint64_t) * 2 + 3)
+
+#define EPT_IDLE_KBUF_SIZE	8000
+
+struct ept_idle_ctrl {
+	struct mm_struct *mm;
+	struct kvm *kvm;
+
+	uint8_t kpie[EPT_IDLE_KBUF_SIZE];
+	int pie_read;
+	int pie_read_max;
+
+	void __user *buf;
+	int buf_size;
+	int bytes_copied;
+
+	unsigned long next_hva;		/* GPA for EPT; VA for PT */
+	unsigned long gpa_to_hva;
+	unsigned long restart_gpa;
+	unsigned long last_va;
+
+	unsigned int flags;
+};
+
+#endif



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC][PATCH v2 16/21] mm-idle: mm_walk for normal task
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (14 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 15/21] ept-idle: EPT walk for virtual machine Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2018-12-26 13:15 ` [RFC][PATCH v2 17/21] proc: introduce /proc/PID/idle_pages Fengguang Wu
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Zhang Yi, Fengguang Wu, kvm, LKML,
	Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Dan Williams

[-- Attachment #1: 0015-page-idle-Added-mmu-idle-page-walk.patch --]
[-- Type: text/plain, Size: 6243 bytes --]

From: Zhang Yi <yi.z.zhang@linux.intel.com>

File pages are skipped for now. In general they are not guaranteed to be
mapped, which means that when they become hot, there is no guarantee we can
find them and move them to DRAM nodes.

Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/ept_idle.c |  204 ++++++++++++++++++++++++++++++++++++++
 mm/pagewalk.c           |    1 
 2 files changed, 205 insertions(+)

--- linux.orig/arch/x86/kvm/ept_idle.c	2018-12-26 19:58:30.576894801 +0800
+++ linux/arch/x86/kvm/ept_idle.c	2018-12-26 19:58:39.840936072 +0800
@@ -510,6 +510,9 @@ static int ept_idle_walk_hva_range(struc
 	return ret;
 }
 
+static ssize_t mm_idle_read(struct file *file, char *buf,
+			    size_t count, loff_t *ppos);
+
 static ssize_t ept_idle_read(struct file *file, char *buf,
 			     size_t count, loff_t *ppos)
 {
@@ -615,6 +618,207 @@ out:
 	return ret;
 }
 
+static int mm_idle_pte_range(struct ept_idle_ctrl *eic, pmd_t *pmd,
+			     unsigned long addr, unsigned long next)
+{
+	enum ProcIdlePageType page_type;
+	pte_t *pte;
+	int err = 0;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!pte_present(*pte))
+			page_type = PTE_HOLE;
+		else if (!test_and_clear_bit(_PAGE_BIT_ACCESSED,
+					     (unsigned long *) &pte->pte))
+			page_type = PTE_IDLE;
+		else {
+			page_type = PTE_ACCESSED;
+		}
+
+		err = eic_add_page(eic, addr, addr + PAGE_SIZE, page_type);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != next);
+
+	return err;
+}
+
+static int mm_idle_pmd_entry(pmd_t *pmd, unsigned long addr,
+			     unsigned long next, struct mm_walk *walk)
+{
+	struct ept_idle_ctrl *eic = walk->private;
+	enum ProcIdlePageType page_type;
+	enum ProcIdlePageType pte_page_type;
+	int err;
+
+	/*
+	 * Skip duplicate PMD_IDLE_PTES: when the PMD crosses VMA boundary,
+	 * walk_page_range() can call on the same PMD twice.
+	 */
+	if ((addr & PMD_MASK) == (eic->last_va & PMD_MASK)) {
+		debug_printk("ignore duplicate addr %lx %lx\n",
+			     addr, eic->last_va);
+		return 0;
+	}
+	eic->last_va = addr;
+
+	if (eic->flags & SCAN_HUGE_PAGE)
+		pte_page_type = PMD_IDLE_PTES;
+	else
+		pte_page_type = IDLE_PAGE_TYPE_MAX;
+
+	if (!pmd_present(*pmd))
+		page_type = PMD_HOLE;
+	else if (!test_and_clear_bit(_PAGE_BIT_ACCESSED, (unsigned long *)pmd)) {
+		if (pmd_large(*pmd))
+			page_type = PMD_IDLE;
+		else if (eic->flags & SCAN_SKIM_IDLE)
+			page_type = PMD_IDLE_PTES;
+		else
+			page_type = pte_page_type;
+	} else if (pmd_large(*pmd)) {
+		page_type = PMD_ACCESSED;
+	} else
+		page_type = pte_page_type;
+
+	if (page_type != IDLE_PAGE_TYPE_MAX)
+		err = eic_add_page(eic, addr, next, page_type);
+	else
+		err = mm_idle_pte_range(eic, pmd, addr, next);
+
+	return err;
+}
+
+static int mm_idle_pud_entry(pud_t *pud, unsigned long addr,
+			     unsigned long next, struct mm_walk *walk)
+{
+	struct ept_idle_ctrl *eic = walk->private;
+
+	if ((addr & PUD_MASK) != (eic->last_va & PUD_MASK)) {
+		eic_add_page(eic, addr, next, PUD_PRESENT);
+		eic->last_va = addr;
+	}
+	return 1;
+}
+
+static int mm_idle_test_walk(unsigned long start, unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (vma->vm_file) {
+		if ((vma->vm_flags & (VM_WRITE|VM_MAYSHARE)) == VM_WRITE)
+		    return 0;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int mm_idle_walk_range(struct ept_idle_ctrl *eic,
+			      unsigned long start,
+			      unsigned long end,
+			      struct mm_walk *walk)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	init_ept_idle_ctrl_buffer(eic);
+
+	for (; start < end;)
+	{
+		down_read(&walk->mm->mmap_sem);
+		vma = find_vma(walk->mm, start);
+		if (vma) {
+			if (end > vma->vm_start) {
+				local_irq_disable();
+				ret = walk_page_range(start, end, walk);
+				local_irq_enable();
+			} else
+				set_restart_gpa(vma->vm_start, "VMA-HOLE");
+		} else
+			set_restart_gpa(TASK_SIZE, "EOF");
+		up_read(&walk->mm->mmap_sem);
+
+		WARN_ONCE(eic->gpa_to_hva, "non-zero gpa_to_hva");
+		start = eic->restart_gpa;
+		ret = ept_idle_copy_user(eic, start, end);
+		if (ret)
+			break;
+	}
+
+	if (eic->bytes_copied) {
+		if (ret != EPT_IDLE_BUF_FULL && eic->next_hva < end)
+			debug_printk("partial scan: next_hva=%lx end=%lx\n",
+				     eic->next_hva, end);
+		ret = 0;
+	} else
+		WARN_ONCE(1, "nothing read");
+	return ret;
+}
+
+static ssize_t mm_idle_read(struct file *file, char *buf,
+			    size_t count, loff_t *ppos)
+{
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk mm_walk = {};
+	struct ept_idle_ctrl *eic;
+	unsigned long va_start = *ppos;
+	unsigned long va_end = va_start + (count << (3 + PAGE_SHIFT));
+	int ret;
+
+	if (va_end <= va_start) {
+		debug_printk("mm_idle_read past EOF: %lx %lx\n",
+			     va_start, va_end);
+		return 0;
+	}
+	if (*ppos & (PAGE_SIZE - 1)) {
+		debug_printk("mm_idle_read unaligned ppos: %lx\n",
+			     va_start);
+		return -EINVAL;
+	}
+	if (count < EPT_IDLE_BUF_MIN) {
+		debug_printk("mm_idle_read small count: %lx\n",
+			     (unsigned long)count);
+		return -EINVAL;
+	}
+
+	eic = kzalloc(sizeof(*eic), GFP_KERNEL);
+	if (!eic)
+		return -ENOMEM;
+
+	if (!mm || !mmget_not_zero(mm)) {
+		ret = -ESRCH;
+		goto out_free;
+	}
+
+	eic->buf = buf;
+	eic->buf_size = count;
+	eic->mm = mm;
+	eic->flags = file->f_flags;
+
+	mm_walk.mm = mm;
+	mm_walk.pmd_entry = mm_idle_pmd_entry;
+	mm_walk.pud_entry = mm_idle_pud_entry;
+	mm_walk.test_walk = mm_idle_test_walk;
+	mm_walk.private = eic;
+
+	ret = mm_idle_walk_range(eic, va_start, va_end, &mm_walk);
+	if (ret)
+		goto out_mm;
+
+	ret = eic->bytes_copied;
+	*ppos = eic->next_hva;
+	debug_printk("ppos=%lx bytes_copied=%d\n",
+		     eic->next_hva, ret);
+out_mm:
+	mmput(mm);
+out_free:
+	kfree(eic);
+	return ret;
+}
+
 extern struct file_operations proc_ept_idle_operations;
 
 static int ept_idle_entry(void)
--- linux.orig/mm/pagewalk.c	2018-12-26 19:58:30.576894801 +0800
+++ linux/mm/pagewalk.c	2018-12-26 19:58:30.576894801 +0800
@@ -338,6 +338,7 @@ int walk_page_range(unsigned long start,
 	} while (start = next, start < end);
 	return err;
 }
+EXPORT_SYMBOL(walk_page_range);
 
 int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk)
 {



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC][PATCH v2 17/21] proc: introduce /proc/PID/idle_pages
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (15 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 16/21] mm-idle: mm_walk for normal task Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2018-12-26 13:15 ` [RFC][PATCH v2 18/21] kvm-ept-idle: enable module Fengguang Wu
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Huang Ying, Brendan Gregg,
	Fengguang Wu, kvm, LKML, Fan Du, Yao Yuan, Peng Dong, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0008-proc-introduce-proc-PID-idle_pages.patch --]
[-- Type: text/plain, Size: 4845 bytes --]

This will be similar to /sys/kernel/mm/page_idle/bitmap documented in
Documentation/admin-guide/mm/idle_page_tracking.rst, however indexed
by process virtual address.

When using the global PFN indexed idle bitmap, we find 2 kinds of
overhead:

- to track a task's working set, Brendan Gregg ended up writing wss-v1
  for small tasks and wss-v2 for large tasks:

  https://github.com/brendangregg/wss

  That's because VAs may point to random PAs throughout the physical
  address space. So we either query /proc/pid/pagemap first and then access
  lots of random PFNs (with lots of syscalls) in the bitmap, or write+read
  the whole system idle bitmap beforehand.

- page table walking by PFN has much more overhead than walking a
  page table in its natural order:
  - rmap queries
  - more locking
  - random memory reads/writes

This interface provides a cheap path for the common case of non-shared
mapped pages. To walk 1TB of memory in active 4k pages, it costs 2s vs 15s
of system time to scan the per-task vs the global idle bitmap -- a ~7x
speedup. The gap will widen further if we consider

- the extra /proc/pid/pagemap walk
- natural page table walks can skip the whole 512 PTEs if PMD is idle

OTOH, the per-task idle bitmap is not suitable in some situations:

- it is not accurate for shared pages
- it doesn't work with non-mapped file pages
- it doesn't perform well for sparse page tables (pointed out by Huang Ying)

So it's more about complementing the existing global idle bitmap.
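
For illustration, a minimal user space reader of this interface could look
like the sketch below. It is not part of this patch: the PIP_* byte layout
is copied from ept_idle.h, and the assumption that a PIP_CMD_SET_HVA byte is
followed by a 64-bit virtual address is inferred from the buffer reservation
in the EPT walker, so treat it as a sketch rather than a reference decoder.

/* idle_pages_dump.c: decode one read(2) worth of /proc/PID/idle_pages */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PIP_TYPE(a)		(0xf & ((a) >> 4))
#define PIP_SIZE(a)		(0xf & (a))
#define PIP_CMD_SET_HVA		0xa0	/* PIP_COMPOSE(PIP_CMD, 0) */

int main(int argc, char *argv[])
{
	uint8_t buf[4096];
	char path[64];
	unsigned long va = 0;	/* page aligned virtual address to scan from */
	ssize_t n;
	int fd, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s PID [start_va]\n", argv[0]);
		return 1;
	}
	if (argc > 2)
		va = strtoul(argv[2], NULL, 0);

	snprintf(path, sizeof(path), "/proc/%s/idle_pages", argv[1]);
	/* O_NONBLOCK/O_NOFOLLOW select SCAN_HUGE_PAGE/SCAN_SKIM_IDLE */
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror(path);
		return 1;
	}

	/* the file offset is interpreted as the virtual address to start at */
	if (lseek(fd, va, SEEK_SET) == (off_t)-1) {
		perror("lseek");
		return 1;
	}

	n = read(fd, buf, sizeof(buf));
	for (i = 0; i < n; i++) {
		if (buf[i] == PIP_CMD_SET_HVA && i + 8 < n) {
			uint64_t hva;

			/* assumption: the command byte carries a 64-bit address */
			memcpy(&hva, buf + i + 1, sizeof(hva));
			printf("restart at VA 0x%" PRIx64 "\n", hva);
			i += 8;
			continue;
		}
		printf("type %d count %d\n", PIP_TYPE(buf[i]), PIP_SIZE(buf[i]));
	}

	close(fd);
	return n < 0 ? 1 : 0;
}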

CC: Huang Ying <ying.huang@intel.com>
CC: Brendan Gregg <bgregg@netflix.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/proc/base.c     |    2 +
 fs/proc/internal.h |    1 
 fs/proc/task_mmu.c |   54 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+)

--- linux.orig/fs/proc/base.c	2018-12-23 20:08:14.228919325 +0800
+++ linux/fs/proc/base.c	2018-12-23 20:08:14.224919327 +0800
@@ -2969,6 +2969,7 @@ static const struct pid_entry tgid_base_
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+	REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3357,6 +3358,7 @@ static const struct pid_entry tid_base_s
 	REG("smaps",     S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+	REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",      S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
--- linux.orig/fs/proc/internal.h	2018-12-23 20:08:14.228919325 +0800
+++ linux/fs/proc/internal.h	2018-12-23 20:08:14.224919327 +0800
@@ -298,6 +298,7 @@ extern const struct file_operations proc
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_mm_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
--- linux.orig/fs/proc/task_mmu.c	2018-12-23 20:08:14.228919325 +0800
+++ linux/fs/proc/task_mmu.c	2018-12-23 20:08:14.224919327 +0800
@@ -1559,6 +1559,60 @@ const struct file_operations proc_pagema
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+/* will be filled when kvm_ept_idle module loads */
+struct file_operations proc_ept_idle_operations = {
+};
+EXPORT_SYMBOL_GPL(proc_ept_idle_operations);
+
+static ssize_t mm_idle_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	if (proc_ept_idle_operations.read)
+		return proc_ept_idle_operations.read(file, buf, count, ppos);
+
+	return 0;
+}
+
+
+static int mm_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = proc_mem_open(inode, PTRACE_MODE_READ);
+
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+
+	file->private_data = mm;
+
+	if (proc_ept_idle_operations.open)
+		return proc_ept_idle_operations.open(inode, file);
+
+	return 0;
+}
+
+static int mm_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	if (mm) {
+		if (!mm_kvm(mm))
+			flush_tlb_mm(mm);
+		mmdrop(mm);
+	}
+
+	if (proc_ept_idle_operations.release)
+		return proc_ept_idle_operations.release(inode, file);
+
+	return 0;
+}
+
+const struct file_operations proc_mm_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= mm_idle_read,
+	.open		= mm_idle_open,
+	.release	= mm_idle_release,
+};
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC][PATCH v2 18/21] kvm-ept-idle: enable module
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (16 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 17/21] proc: introduce /proc/PID/idle_pages Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2018-12-26 13:15 ` [RFC][PATCH v2 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag Fengguang Wu
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fengguang Wu, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0007-kvm-ept-idle-enable-module.patch --]
[-- Type: text/plain, Size: 1395 bytes --]

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/Kconfig  |   11 +++++++++++
 arch/x86/kvm/Makefile |    4 ++++
 2 files changed, 15 insertions(+)

--- linux.orig/arch/x86/kvm/Kconfig	2018-12-23 20:09:04.628882396 +0800
+++ linux/arch/x86/kvm/Kconfig	2018-12-23 20:09:04.628882396 +0800
@@ -96,6 +96,17 @@ config KVM_MMU_AUDIT
 	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 	 auditing of KVM MMU events at runtime.
 
+config KVM_EPT_IDLE
+	tristate "KVM EPT idle page tracking"
+	depends on KVM_INTEL
+	depends on PROC_PAGE_MONITOR
+	---help---
+	  Provides support for walking EPT to get the A bits on Intel
+	  processors equipped with the VT extensions.
+
+	  To compile this as a module, choose M here: the module
+	  will be called kvm-ept-idle.
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source drivers/vhost/Kconfig
--- linux.orig/arch/x86/kvm/Makefile	2018-12-23 20:09:04.628882396 +0800
+++ linux/arch/x86/kvm/Makefile	2018-12-23 20:09:04.628882396 +0800
@@ -19,6 +19,10 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o
 kvm-intel-y		+= vmx.o pmu_intel.o
 kvm-amd-y		+= svm.o pmu_amd.o
 
+kvm-ept-idle-y		+= ept_idle.o
+
 obj-$(CONFIG_KVM)	+= kvm.o
 obj-$(CONFIG_KVM_INTEL)	+= kvm-intel.o
 obj-$(CONFIG_KVM_AMD)	+= kvm-amd.o
+
+obj-$(CONFIG_KVM_EPT_IDLE)	+= kvm-ept-idle.o



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC][PATCH v2 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (17 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 18/21] kvm-ept-idle: enable module Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2018-12-26 13:15 ` [RFC][PATCH v2 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node Fengguang Wu
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Liu Jingqi, Fengguang Wu, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0010-migrate-check-if-the-page-is-software-young-when-mov.patch --]
[-- Type: text/plain, Size: 3637 bytes --]

From: Liu Jingqi <jingqi.liu@intel.com>

Introduce an MPOL_MF_SW_YOUNG flag for move_pages(). When it is set,
the already-in-DRAM pages will have PG_referenced set.

Background:
The user space migration daemon frequently scans page tables and
read-clears the accessed bits to detect hot/cold pages, then migrates hot
pages from PMEM to DRAM nodes. When doing so, it also tells the kernel that
these form the hot page set. This maintains a persistent view of hot/cold
pages between the kernel and the user space daemon.

The more concrete steps are

1) do multiple scans of the page table, counting accessed bits
2) highest accessed count => hot pages
3) call move_pages(hot pages, DRAM nodes, MPOL_MF_SW_YOUNG)

(1) regularly clears PTE young, which makes the kernel lose access to
    the PTE young information

(2) for anonymous pages, the user space daemon defines which pages are
    hot and which are cold

(3) conveys the user space view of hot/cold pages to the kernel through
    PG_referenced

In the long run, most hot pages could already be in DRAM.
move_pages(MPOL_MF_SW_YOUNG) sets PG_referenced for those already-in-DRAM
hot pages, but not for newly migrated hot pages, since the latter are
expected to be put at the end of the LRU and thus have enough time in the
LRU to gather accessed/PG_referenced bits and prove to the kernel that they
are really hot.

The daemon may select only DRAM/2 pages as hot, for 2 purposes:
- avoid thrashing, e.g. some warm pages getting promoted and then demoted
  soon after
- make sure enough DRAM LRU pages look "cold" to the kernel, so that vmscan
  won't run into trouble busily scanning LRU lists
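
For reference, a user space caller just ORs the new flag into the
move_pages(2) flags, roughly as in the sketch below. The flag value mirrors
the definition this patch adds to mm/migrate.c; since it is not exported in
a uapi header here, user space has to define it itself, and the helper name
is made up for this example.

#include <numaif.h>	/* move_pages(2) wrapper from libnuma */

#ifndef MPOL_MF_SW_YOUNG
#define MPOL_MF_SW_YOUNG (1 << 7)	/* must match the kernel definition */
#endif

/*
 * Promote the daemon's hot page set to DRAM, marking the ones already
 * in DRAM as referenced so the kernel shares the daemon's hotness view.
 */
static long promote_hot_pages(int pid, unsigned long count, void **pages,
			      const int *dram_nodes, int *status)
{
	return move_pages(pid, count, pages, dram_nodes, status,
			  MPOL_MF_MOVE | MPOL_MF_SW_YOUNG);
}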

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/migrate.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- linux.orig/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
+++ linux/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
@@ -55,6 +55,8 @@
 
 #include "internal.h"
 
+#define MPOL_MF_SW_YOUNG (1<<7)
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -1484,12 +1486,13 @@ static int do_move_pages_to_node(struct
  * the target node
  */
 static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
-		int node, struct list_head *pagelist, bool migrate_all)
+		int node, struct list_head *pagelist, int flags)
 {
 	struct vm_area_struct *vma;
 	struct page *page;
 	unsigned int follflags;
 	int err;
+	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
 
 	down_read(&mm->mmap_sem);
 	err = -EFAULT;
@@ -1519,6 +1522,8 @@ static int add_page_for_migration(struct
 
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
+			if (flags & MPOL_MF_SW_YOUNG)
+				SetPageReferenced(page);
 			isolate_huge_page(page, pagelist);
 			err = 0;
 		}
@@ -1531,6 +1536,8 @@ static int add_page_for_migration(struct
 			goto out_putpage;
 
 		err = 0;
+		if (flags & MPOL_MF_SW_YOUNG)
+			SetPageReferenced(head);
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_cache(head),
@@ -1606,7 +1613,7 @@ static int do_pages_move(struct mm_struc
 		 * report them via status
 		 */
 		err = add_page_for_migration(mm, addr, current_node,
-				&pagelist, flags & MPOL_MF_MOVE_ALL);
+				&pagelist, flags);
 		if (!err)
 			continue;
 
@@ -1725,7 +1732,7 @@ static int kernel_move_pages(pid_t pid,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_SW_YOUNG))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC][PATCH v2 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (18 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2018-12-26 13:15 ` [RFC][PATCH v2 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM Fengguang Wu
  2018-12-27 20:31 ` [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Michal Hocko
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Jingqi Liu, Fengguang Wu,
	kvm, LKML, Yao Yuan, Peng Dong, Huang Ying, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0012-vmscan-migrate-anonymous-pages-to-pmem-node-before-s.patch --]
[-- Type: text/plain, Size: 3814 bytes --]

From: Jingqi Liu <jingqi.liu@intel.com>

With PMEM nodes, the demotion path could be

1) DRAM pages: migrate to PMEM node
2) PMEM pages: swap out

This patch does (1) for anonymous pages only, since we cannot
detect the hotness of (unmapped) page cache pages for now.

The user space daemon can do migration in both directions:
- PMEM=>DRAM hot page migration
- DRAM=>PMEM cold page migration
However it's more natural for user space to do hot page migration
and the kernel to do cold page migration. In particular, only the kernel
can guarantee on-demand migration when there is memory pressure.

So the big picture will look like this: the user space daemon does regular
hot page migration to DRAM, creating memory pressure on the DRAM nodes,
which in turn triggers kernel cold page migration to the PMEM nodes.

Du Fan:
- Support multiple NUMA nodes.
- Don't migrate clean MADV_FREE pages to PMEM node.

With the madvise(MADV_FREE) syscall, both the vma structure and
its corresponding page table entries still live, but we get
MADV_FREE pages: anonymous but WITHOUT SwapBacked.

On page reclaim, clean MADV_FREE pages will be
freed and returned to the buddy system, while the dirty ones
turn back into canonical anonymous pages with
PageSwapBacked(page) set and are put onto the LRU_INACTIVE_FILE
list, falling into the standard aging routine.

The point is that clean MADV_FREE pages should not be migrated:
their user data is stale (useless) once madvise(MADV_FREE) has been
called, so we guard against such scenarios.

P.S. MADV_FREE is heavily used by the jemalloc engine and
workloads like redis; refer to [1] for detailed background,
use cases, and benchmark results.

[1]
https://lore.kernel.org/patchwork/patch/622179/

Fengguang:
- detect migrate thp and hugetlb
- avoid moving pages to a non-existent node

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/vmscan.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- linux.orig/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
+++ linux/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
@@ -1112,6 +1112,7 @@ static unsigned long shrink_page_list(st
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(move_pages);
 	int pgactivate = 0;
 	unsigned nr_unqueued_dirty = 0;
 	unsigned nr_dirty = 0;
@@ -1121,6 +1122,7 @@ static unsigned long shrink_page_list(st
 	unsigned nr_immediate = 0;
 	unsigned nr_ref_keep = 0;
 	unsigned nr_unmap_fail = 0;
+	int page_on_dram = is_node_dram(pgdat->node_id);
 
 	cond_resched();
 
@@ -1275,6 +1277,21 @@ static unsigned long shrink_page_list(st
 		}
 
 		/*
+		 * Check if the page is in DRAM numa node.
+		 * Skip MADV_FREE pages as it might be freed
+		 * immediately to buddy system if it's clean.
+		 */
+		if (node_online(pgdat->peer_node) &&
+			PageAnon(page) && (PageSwapBacked(page) || PageTransHuge(page))) {
+			if (page_on_dram) {
+				/* Add to the page list which will be moved to pmem numa node. */
+				list_add(&page->lru, &move_pages);
+				unlock_page(page);
+				continue;
+			}
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1496,6 +1513,22 @@ keep:
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
+	/* Move the anonymous pages to PMEM numa node. */
+	if (!list_empty(&move_pages)) {
+		int err;
+
+		/* Could not block. */
+		err = migrate_pages(&move_pages, alloc_new_node_page, NULL,
+					pgdat->peer_node,
+					MIGRATE_ASYNC, MR_NUMA_MISPLACED);
+		if (err) {
+			putback_movable_pages(&move_pages);
+
+			/* Join the pages which were not migrated.  */
+			list_splice(&ret_pages, &move_pages);
+		}
+	}
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_unref_page_list(&free_pages);



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC][PATCH v2 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (19 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node Fengguang Wu
@ 2018-12-26 13:15 ` Fengguang Wu
  2018-12-27 20:31 ` [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Michal Hocko
  21 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-26 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fengguang Wu, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: 0013-vmscan-disable-0-swap-space-optimization.patch --]
[-- Type: text/plain, Size: 1125 bytes --]

Fix OOM by making in-kernel DRAM=>PMEM migration reachable.

Here we assume these 2 possible demotion paths:
- DRAM pages migrate to PMEM
- PMEM pages swap out to the swap device

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/vmscan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- linux.orig/mm/vmscan.c	2018-12-23 20:38:44.310446223 +0800
+++ linux/mm/vmscan.c	2018-12-23 20:38:44.306446146 +0800
@@ -2259,7 +2259,7 @@ static bool inactive_list_is_low(struct
 	 * If we don't have swap space, anonymous page deactivation
 	 * is pointless.
 	 */
-	if (!file && !total_swap_pages)
+	if (!file && (is_node_pmem(pgdat->node_id) && !total_swap_pages))
 		return false;
 
 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2340,7 +2340,8 @@ static void get_scan_count(struct lruvec
 	enum lru_list lru;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+	if (is_node_pmem(pgdat->node_id) &&
+	    (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM
  2018-12-26 13:14 ` [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM Fengguang Wu
@ 2018-12-27  3:41   ` Matthew Wilcox
  2018-12-27  4:11     ` Fengguang Wu
  0 siblings, 1 reply; 62+ messages in thread
From: Matthew Wilcox @ 2018-12-27  3:41 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Fan Du, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Wed, Dec 26, 2018 at 09:14:47PM +0800, Fengguang Wu wrote:
> From: Fan Du <fan.du@intel.com>
> 
> This is a hack to enumerate PMEM as NUMA nodes.
> It's necessary for current BIOS that don't yet fill ACPI HMAT table.
> 
> WARNING: take care to backup. It is mutual exclusive with libnvdimm
> subsystem and can destroy ndctl managed namespaces.

Why depend on firmware to present this "correctly"?  It seems to me like
less effort all around to have ndctl label some namespaces as being for
this kind of use.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM
  2018-12-27  3:41   ` Matthew Wilcox
@ 2018-12-27  4:11     ` Fengguang Wu
  2018-12-27  5:13       ` Dan Williams
  0 siblings, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-27  4:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Linux Memory Management List, Fan Du, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Wed, Dec 26, 2018 at 07:41:41PM -0800, Matthew Wilcox wrote:
>On Wed, Dec 26, 2018 at 09:14:47PM +0800, Fengguang Wu wrote:
>> From: Fan Du <fan.du@intel.com>
>>
>> This is a hack to enumerate PMEM as NUMA nodes.
>> It's necessary for current BIOS that don't yet fill ACPI HMAT table.
>>
>> WARNING: take care to backup. It is mutual exclusive with libnvdimm
>> subsystem and can destroy ndctl managed namespaces.
>
>Why depend on firmware to present this "correctly"?  It seems to me like
>less effort all around to have ndctl label some namespaces as being for
>this kind of use.

Dave Hansen may be more suitable to answer your question. He posted
patches to make PMEM NUMA node coexist with libnvdimm and ndctl:

[PATCH 0/9] Allow persistent memory to be used like normal RAM
https://lkml.org/lkml/2018/10/23/9

That depends on future BIOS. So we did this quick hack to test out
PMEM NUMA node for the existing BIOS.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM
  2018-12-27  4:11     ` Fengguang Wu
@ 2018-12-27  5:13       ` Dan Williams
  2018-12-27 19:32         ` Yang Shi
  0 siblings, 1 reply; 62+ messages in thread
From: Dan Williams @ 2018-12-27  5:13 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Matthew Wilcox, Andrew Morton, Linux Memory Management List,
	Fan Du, KVM list, LKML, Yao Yuan, Peng Dong, Huang Ying,
	Liu Jingqi, Dong Eddie, Dave Hansen, Zhang Yi

On Wed, Dec 26, 2018 at 8:11 PM Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> On Wed, Dec 26, 2018 at 07:41:41PM -0800, Matthew Wilcox wrote:
> >On Wed, Dec 26, 2018 at 09:14:47PM +0800, Fengguang Wu wrote:
> >> From: Fan Du <fan.du@intel.com>
> >>
> >> This is a hack to enumerate PMEM as NUMA nodes.
> >> It's necessary for current BIOS that don't yet fill ACPI HMAT table.
> >>
> >> WARNING: take care to backup. It is mutual exclusive with libnvdimm
> >> subsystem and can destroy ndctl managed namespaces.
> >
> >Why depend on firmware to present this "correctly"?  It seems to me like
> >less effort all around to have ndctl label some namespaces as being for
> >this kind of use.
>
> Dave Hansen may be more suitable to answer your question. He posted
> patches to make PMEM NUMA node coexist with libnvdimm and ndctl:
>
> [PATCH 0/9] Allow persistent memory to be used like normal RAM
> https://lkml.org/lkml/2018/10/23/9
>
> That depends on future BIOS. So we did this quick hack to test out
> PMEM NUMA node for the existing BIOS.

No, it does not depend on a future BIOS.

Willy, have a look here [1], here [2], and here [3] for the
work-in-progress ndctl takeover approach (actually 'daxctl' in this
case).

[1]: https://lkml.org/lkml/2018/10/23/9
[2]: https://lkml.org/lkml/2018/10/31/243
[3]: https://lists.01.org/pipermail/linux-nvdimm/2018-November/018677.html

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM
  2018-12-27  5:13       ` Dan Williams
@ 2018-12-27 19:32         ` Yang Shi
  2018-12-28  3:27           ` Fengguang Wu
  0 siblings, 1 reply; 62+ messages in thread
From: Yang Shi @ 2018-12-27 19:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Fengguang Wu, Matthew Wilcox, Andrew Morton,
	Linux Memory Management List, Fan Du, KVM list, LKML, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi

On Wed, Dec 26, 2018 at 9:13 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Dec 26, 2018 at 8:11 PM Fengguang Wu <fengguang.wu@intel.com> wrote:
> >
> > On Wed, Dec 26, 2018 at 07:41:41PM -0800, Matthew Wilcox wrote:
> > >On Wed, Dec 26, 2018 at 09:14:47PM +0800, Fengguang Wu wrote:
> > >> From: Fan Du <fan.du@intel.com>
> > >>
> > >> This is a hack to enumerate PMEM as NUMA nodes.
> > >> It's necessary for current BIOS that don't yet fill ACPI HMAT table.
> > >>
> > >> WARNING: take care to backup. It is mutual exclusive with libnvdimm
> > >> subsystem and can destroy ndctl managed namespaces.
> > >
> > >Why depend on firmware to present this "correctly"?  It seems to me like
> > >less effort all around to have ndctl label some namespaces as being for
> > >this kind of use.
> >
> > Dave Hansen may be more suitable to answer your question. He posted
> > patches to make PMEM NUMA node coexist with libnvdimm and ndctl:
> >
> > [PATCH 0/9] Allow persistent memory to be used like normal RAM
> > https://lkml.org/lkml/2018/10/23/9
> >
> > That depends on future BIOS. So we did this quick hack to test out
> > PMEM NUMA node for the existing BIOS.
>
> No, it does not depend on a future BIOS.

That is correct. We already have Dave's patches + Dan's patch (which added
the target_node field) working on our machine, which has SRAT.

Thanks,
Yang

>
> Willy, have a look here [1], here [2], and here [3] for the
> work-in-progress ndctl takeover approach (actually 'daxctl' in this
> case).
>
> [1]: https://lkml.org/lkml/2018/10/23/9
> [2]: https://lkml.org/lkml/2018/10/31/243
> [3]: https://lists.01.org/pipermail/linux-nvdimm/2018-November/018677.html
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node
  2018-12-26 13:14 ` [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node Fengguang Wu
@ 2018-12-27 20:07   ` Christopher Lameter
  2018-12-28  2:31     ` Fengguang Wu
  0 siblings, 1 reply; 62+ messages in thread
From: Christopher Lameter @ 2018-12-27 20:07 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Fan Du, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Wed, 26 Dec 2018, Fengguang Wu wrote:

> Each CPU socket can have 1 DRAM and 1 PMEM node, we call them "peer nodes".
> Migration between DRAM and PMEM will by default happen between peer nodes.

Which one does numa_node_id() point to? I guess that is the DRAM node and
then we fall back to the PMEM node?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
                   ` (20 preceding siblings ...)
  2018-12-26 13:15 ` [RFC][PATCH v2 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM Fengguang Wu
@ 2018-12-27 20:31 ` Michal Hocko
  2018-12-28  5:08   ` Fengguang Wu
  21 siblings, 1 reply; 62+ messages in thread
From: Michal Hocko @ 2018-12-27 20:31 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Wed 26-12-18 21:14:46, Wu Fengguang wrote:
> This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
> transparent to normal applications and virtual machines.
> 
> The code is still in active development. It's provided for early design review.

So can we get a high level description of the design and expected
usecases please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node
  2018-12-27 20:07   ` Christopher Lameter
@ 2018-12-28  2:31     ` Fengguang Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-28  2:31 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Linux Memory Management List, Fan Du, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Thu, Dec 27, 2018 at 08:07:26PM +0000, Christopher Lameter wrote:
>On Wed, 26 Dec 2018, Fengguang Wu wrote:
>
>> Each CPU socket can have 1 DRAM and 1 PMEM node, we call them "peer nodes".
>> Migration between DRAM and PMEM will by default happen between peer nodes.
>
>Which one does numa_node_id() point to? I guess that is the DRAM node and

Yes. On our test machine, PMEM nodes show up as memory-only nodes, so
numa_node_id() points to the DRAM node.

Here is numactl --hardware output on a 2S test machine.

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
node 0 size: 257712 MB
node 0 free: 178251 MB
node 1 cpus: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102
103
node 1 size: 258038 MB
node 1 free: 174796 MB
node 2 cpus:
node 2 size: 503999 MB
node 2 free: 438349 MB
node 3 cpus:
node 3 size: 503999 MB
node 3 free: 438349 MB
node distances:
node   0   1   2   3
  0:  10  21  20  20
  1:  21  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

>then we fall back to the PMEM node?

Fallback is possible but not in the scope of this patchset. We modified
the fallback zonelists in patch 10 to simplify PMEM usage. With that
patch, page allocations on DRAM nodes won't fall back to PMEM nodes.
Instead, PMEM nodes will mainly be used via explicit numactl placement
and as migration targets. When there is memory pressure on a DRAM node,
its LRU cold pages will be demote-migrated to the peer PMEM node on
the same socket by patch 20.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM
  2018-12-27 19:32         ` Yang Shi
@ 2018-12-28  3:27           ` Fengguang Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-28  3:27 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dan Williams, Matthew Wilcox, Andrew Morton,
	Linux Memory Management List, Fan Du, KVM list, LKML, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi

On Thu, Dec 27, 2018 at 11:32:06AM -0800, Yang Shi wrote:
>On Wed, Dec 26, 2018 at 9:13 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Wed, Dec 26, 2018 at 8:11 PM Fengguang Wu <fengguang.wu@intel.com> wrote:
>> >
>> > On Wed, Dec 26, 2018 at 07:41:41PM -0800, Matthew Wilcox wrote:
>> > >On Wed, Dec 26, 2018 at 09:14:47PM +0800, Fengguang Wu wrote:
>> > >> From: Fan Du <fan.du@intel.com>
>> > >>
>> > >> This is a hack to enumerate PMEM as NUMA nodes.
>> > >> It's necessary for current BIOS that don't yet fill ACPI HMAT table.
>> > >>
>> > >> WARNING: take care to backup. It is mutual exclusive with libnvdimm
>> > >> subsystem and can destroy ndctl managed namespaces.
>> > >
>> > >Why depend on firmware to present this "correctly"?  It seems to me like
>> > >less effort all around to have ndctl label some namespaces as being for
>> > >this kind of use.
>> >
>> > Dave Hansen may be more suitable to answer your question. He posted
>> > patches to make PMEM NUMA node coexist with libnvdimm and ndctl:
>> >
>> > [PATCH 0/9] Allow persistent memory to be used like normal RAM
>> > https://lkml.org/lkml/2018/10/23/9
>> >
>> > That depends on future BIOS. So we did this quick hack to test out
>> > PMEM NUMA node for the existing BIOS.
>>
>> No, it does not depend on a future BIOS.
>
>It is correct. We already have Dave's patches + Dan's patch (added
>target_node field) work on our machine which has SRAT.

Thanks for the correction. It looks like my perception was out of date.
So we can follow Dave+Dan's patches to create the PMEM NUMA nodes.

Thanks,
Fengguang

>>
>> Willy, have a look here [1], here [2], and here [3] for the
>> work-in-progress ndctl takeover approach (actually 'daxctl' in this
>> case).
>>
>> [1]: https://lkml.org/lkml/2018/10/23/9
>> [2]: https://lkml.org/lkml/2018/10/31/243
>> [3]: https://lists.01.org/pipermail/linux-nvdimm/2018-November/018677.html
>>
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-27 20:31 ` [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Michal Hocko
@ 2018-12-28  5:08   ` Fengguang Wu
  2018-12-28  8:41     ` Michal Hocko
  0 siblings, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-28  5:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Thu, Dec 27, 2018 at 09:31:58PM +0100, Michal Hocko wrote:
>On Wed 26-12-18 21:14:46, Wu Fengguang wrote:
>> This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
>> transparent to normal applications and virtual machines.
>>
>> The code is still in active development. It's provided for early design review.
>
>So can we get a high level description of the design and expected
>usecases please?

Good question.

Use cases
=========

The general use case is to use PMEM as slower but cheaper "DRAM".
The suitable ones can be

- workloads that care about memory size more than bandwidth/latency
- workloads with a set of warm/cold pages that don't change rapidly over time
- low cost VM/containers

Foundation: create PMEM NUMA nodes
==================================

To create PMEM nodes with a native kernel, Dave Hansen and Dan Williams
have working patches for the kernel and ndctl. According to Ying, it'll
work like this:

        ndctl destroy-namespace -f namespace0.0
        ndctl destroy-namespace -f namespace1.0
        ipmctl create -goal MemoryMode=100
        reboot

To create PMEM nodes in QEMU VMs, current Debian/Fedora etc. distros
already support this

	qemu-system-x86_64
	-machine pc,nvdimm
        -enable-kvm
        -smp 64
        -m 256G
        # DRAM node 0
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/shm/qemu_node0,id=tmpfs-node0
	-numa node,cpus=0-31,nodeid=0,memdev=tmpfs-node0
        # PMEM node 1
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/dax1.0,align=128M,id=dax-node1
        -numa node,cpus=32-63,nodeid=1,memdev=dax-node1

Optimization: do hot/cold page tracking and migration
=====================================================

Since PMEM is slower than DRAM, we need to make sure hot pages go to
DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.

- DRAM=>PMEM cold page migration

It can be done in kernel page reclaim path, near the anonymous page
swap out point. Instead of swapping out, we now have the option to
migrate cold pages to PMEM NUMA nodes.

User space may also do it, however it cannot act on demand when there
is memory pressure in DRAM nodes.

- PMEM=>DRAM hot page migration

While LRU can be good enough for identifying cold pages, frequency
based accounting can be more suitable for identifying hot pages.

Our design choice is to create a flexible user space daemon to drive
the accounting and migration, with the necessary kernel support provided
by this patchset.

Linux kernel already offers move_pages(2) for user space to migrate
pages to specified NUMA nodes. The major gap lies in hotness accounting.

User space driven hotness accounting
====================================

One way to find out hot/cold pages is to scan the page tables multiple
times and collect the "accessed" bits.

We created the kvm-ept-idle kernel module to provide the "accessed"
bits via the /proc/PID/idle_pages interface. User space can open it and
read the "accessed" bits for a range of virtual addresses.

Inside the kernel module, it implements 2 independent sets of page table
scan code, seamlessly providing the same interface:

- for QEMU, scan the HVA range of the VM's EPT (Extended Page Table)
- for others, scan the VA range of the process page table

With /proc/PID/idle_pages and move_pages(2), the user space daemon
can work like this

One round of scan+migration:

        loop N=(3-10) times:
                sleep 0.01-10s (typical values)
                scan page tables and read/accumulate accessed bits into arrays
        treat pages with accessed_count == N as hot  pages
        treat pages with accessed_count == 0 as cold pages
        migrate hot  pages to DRAM nodes
        migrate cold pages to PMEM nodes (optional; may do it once every several scan rounds, to make sure they are really cold)
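
For illustration only, the classification step at the end of a round could
look like the sketch below (the helper name and the flat per-page counter
array are assumptions of this example; a real daemon would track VMAs and
the huge page sizes reported by idle_pages):

/* Split a scanned VA range into hot/cold candidates for move_pages(2). */
static void classify_pages(char *va_base, unsigned long page_size,
			   const unsigned char *accessed_count,
			   unsigned long nr_pages, unsigned int nr_scans,
			   void **hot, unsigned long *nr_hot,
			   void **cold, unsigned long *nr_cold)
{
	unsigned long i;

	*nr_hot = *nr_cold = 0;
	for (i = 0; i < nr_pages; i++) {
		if (accessed_count[i] == nr_scans)	/* accessed in every scan */
			hot[(*nr_hot)++] = va_base + i * page_size;
		else if (accessed_count[i] == 0)	/* never accessed */
			cold[(*nr_cold)++] = va_base + i * page_size;
	}
}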

That just describes the bare minimal working model. A real world
daemon needs to consider a lot more to be useful and robust. The notable
concern is avoiding thrashing.

Hotness accounting can be rough and workloads can be unstable. We need
to avoid promoting a warm page to DRAM and then demoting it soon after.

The basic scheme is to auto-control the scan interval and count, so that
each round of scanning finds fewer hot pages than 1/2 the DRAM size.

We may also do multiple rounds of scans before migration, to filter out
unstable/bursty accesses.
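
One possible (purely illustrative) auto-control heuristic, using the
1/2 DRAM size target and the 0.01-10s interval range mentioned above:

/* Shrink the scan interval when too many pages look hot, grow it otherwise. */
static double adjust_scan_interval(double interval, unsigned long nr_hot,
				   unsigned long nr_dram_pages)
{
	if (nr_hot > nr_dram_pages / 2)
		interval /= 2;		/* too many pages look hot: use shorter windows */
	else if (nr_hot < nr_dram_pages / 8)
		interval *= 2;		/* very few hot pages: relax the sampling */

	if (interval < 0.01)
		interval = 0.01;
	if (interval > 10)
		interval = 10;
	return interval;
}

A real daemon will of course want smarter policies; this only shows where
the knob sits.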

In the long run, most of the accounted hot pages will already be in DRAM,
so we only need to migrate the new ones to DRAM. When doing so, we should
consider QoS and rate limiting to reduce the impact on user workloads.

When user space drives hot page migration, the DRAM nodes may well be
pressured, which will in turn trigger in-kernel cold page migration.
The above 1/2-DRAM-size hot page target helps the kernel easily find
cold pages on LRU scan.

To avoid thrashing, it's also important to maintain a persistent kernel
and user-space view of hot/cold pages, since they do migrations
in 2 different directions:

- the regular page table scans will clear PMD/PTE young
- user space compensates for that by setting PG_referenced via
  move_pages(hot pages, MPOL_MF_SW_YOUNG)

That guarantees the user-space-collected view of hot pages will be
conveyed to the kernel.

Regards,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28  5:08   ` Fengguang Wu
@ 2018-12-28  8:41     ` Michal Hocko
  2018-12-28  9:42       ` Fengguang Wu
  2019-01-02 18:12       ` Dave Hansen
  0 siblings, 2 replies; 62+ messages in thread
From: Michal Hocko @ 2018-12-28  8:41 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
[...]
> Optimization: do hot/cold page tracking and migration
> =====================================================
> 
> Since PMEM is slower than DRAM, we need to make sure hot pages go to
> DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.
> 
> - DRAM=>PMEM cold page migration
> 
> It can be done in kernel page reclaim path, near the anonymous page
> swap out point. Instead of swapping out, we now have the option to
> migrate cold pages to PMEM NUMA nodes.

OK, this makes sense to me except I am not sure this is something that
should be pmem specific. Is there any reason why we shouldn't migrate
pages on memory pressure to other nodes in general? In other words
rather than paging out we would migrate over to the next node that is
not under memory pressure. Swapout would be the next level when the
memory is (almost) fully utilized. That wouldn't be pmem specific.

> User space may also do it, however it cannot act on demand when there
> is memory pressure in DRAM nodes.
> 
> - PMEM=>DRAM hot page migration
> 
> While LRU can be good enough for identifying cold pages, frequency
> based accounting can be more suitable for identifying hot pages.
> 
> Our design choice is to create a flexible user space daemon to drive
> the accounting and migration, with the necessary kernel support provided
> by this patchset.

We do have numa balancing, so why can't we rely on it? This along with the
above would allow having pmem numa nodes (cpuless nodes in fact)
without any special casing, as a natural part of the MM. It would
only be a matter of configuration to set the appropriate distances to
allow a reasonable allocation fallback strategy.

I haven't looked at the implementation yet but if you are proposing a
special-cased zone list, then this is something CDM (Coherent Device
Memory) was trying to do two years ago, and there was quite some
skepticism about that approach.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28  8:41     ` Michal Hocko
@ 2018-12-28  9:42       ` Fengguang Wu
  2018-12-28 12:15         ` Michal Hocko
  2019-01-02 18:12       ` Dave Hansen
  1 sibling, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-28  9:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Fri, Dec 28, 2018 at 09:41:05AM +0100, Michal Hocko wrote:
>On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
>[...]
>> Optimization: do hot/cold page tracking and migration
>> =====================================================
>>
>> Since PMEM is slower than DRAM, we need to make sure hot pages go to
>> DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.
>>
>> - DRAM=>PMEM cold page migration
>>
>> It can be done in kernel page reclaim path, near the anonymous page
>> swap out point. Instead of swapping out, we now have the option to
>> migrate cold pages to PMEM NUMA nodes.
>
>OK, this makes sense to me except I am not sure this is something that
>should be pmem specific. Is there any reason why we shouldn't migrate
>pages on memory pressure to other nodes in general? In other words
>rather than paging out we would migrate over to the next node that is
>not under memory pressure. Swapout would be the next level when the
>memory is (almost) fully utilized. That wouldn't be pmem specific.

In the future there could be multiple memory levels with different
performance/size/cost metrics. There is ongoing HMAT work to describe
that. When it is ready, we can switch to the HMAT based general
infrastructure. Then the code will no longer be PMEM specific, but will
do general promotion/demotion migrations between high/low memory levels.
Swapout could then happen from the lowest level memory.

Migration between peer nodes is the obvious simple way and a good
choice for the initial implementation. But yeah, it's possible to
migrate to other nodes. For example, it can be combined with NUMA
balancing: if we know the page is mostly accessed by the other socket,
then it'd be best to migrate hot/cold pages directly to that socket.
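
To make the demote-instead-of-swap direction concrete, here is a rough
reclaim-side sketch. It assumes the per-node peer_node notion from patch
08/21; alloc_demote_page() and demote_page_list() are illustrative names,
not existing kernel functions.

#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/mm.h>

/* Illustrative only: allocate the destination page on the PMEM peer node. */
static struct page *alloc_demote_page(struct page *page, unsigned long node)
{
	return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}

/* Called from reclaim on a list of cold pages, instead of swapping them out. */
static int demote_page_list(struct list_head *cold_pages, int pmem_node)
{
	/* MIGRATE_ASYNC keeps reclaim latency bounded; the reason tag is reused. */
	return migrate_pages(cold_pages, alloc_demote_page, NULL,
			     pmem_node, MIGRATE_ASYNC, MR_NUMA_MISPLACED);
}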

>> User space may also do it, however cannot act on-demand, when there
>> are memory pressure in DRAM nodes.
>>
>> - PMEM=>DRAM hot page migration
>>
>> While LRU can be good enough for identifying cold pages, frequency
>> based accounting can be more suitable for identifying hot pages.
>>
>> Our design choice is to create a flexible user space daemon to drive
>> the accounting and migration, with necessary kernel supports by this
>> patchset.
>
>We do have numa balancing, why cannot we rely on it? This along with the
>above would allow to have pmem numa nodes (cpuless nodes in fact)
>without any special casing and a natural part of the MM. It would be
>only the matter of the configuration to set the appropriate distance to
>allow reasonable allocation fallback strategy.

Good question. We actually tried reusing the NUMA balancing mechanism to
do page-fault triggered migration. move_pages() only calls
change_prot_numa(). It turns out the 2 migration types have different
purposes (one for hotness, another for home node) and hence different
implementation details. We ended up modifying a few pieces of NUMA
balancing logic -- removing rate limiting, changing the target node
logic, etc.

Those look like unnecessary complexities for this post. This v2 patchset
mainly fulfills our first milestone goal: a minimal viable solution
that's relatively clean to backport. Even when preparing new
upstreamable versions, it may be good to keep things simple for the
initial upstream inclusion.

>I haven't looked at the implementation yet but if you are proposing a
>special cased zone lists then this is something CDM (Coherent Device
>Memory) was trying to do two years ago and there was quite some
>skepticism in the approach.

It looks like we are pretty different from CDM. :)
We create new NUMA nodes rather than CDM's new ZONE.
The zonelists modification is just to make PMEM nodes more separated.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28  9:42       ` Fengguang Wu
@ 2018-12-28 12:15         ` Michal Hocko
  2018-12-28 13:15           ` Fengguang Wu
  2018-12-28 13:31           ` Fengguang Wu
  0 siblings, 2 replies; 62+ messages in thread
From: Michal Hocko @ 2018-12-28 12:15 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
[...]
> Those look unnecessary complexities for this post. This v2 patchset
> mainly fulfills our first milestone goal: a minimal viable solution
> that's relatively clean to backport. Even when preparing for new
> upstreamable versions, it may be good to keep it simple for the
> initial upstream inclusion.

On the other hand this is creating a new NUMA semantic and I would like
to have something long term rather than throw something in now and care
about the long term later. So I would really prefer to talk about long term
plans first and only care about implementation details later.

> > I haven't looked at the implementation yet but if you are proposing a
> > special cased zone lists then this is something CDM (Coherent Device
> > Memory) was trying to do two years ago and there was quite some
> > skepticism in the approach.
> 
> It looks we are pretty different than CDM. :)
> We creating new NUMA nodes rather than CDM's new ZONE.
> The zonelists modification is just to make PMEM nodes more separated.

Yes, this is exactly what CDM was after. Have a zone which is not
reachable without explicit request AFAIR. So no, I do not think you are
too different, you just use a different terminology ;)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28 12:15         ` Michal Hocko
@ 2018-12-28 13:15           ` Fengguang Wu
  2018-12-28 19:46             ` Michal Hocko
  2018-12-28 13:31           ` Fengguang Wu
  1 sibling, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2018-12-28 13:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

On Fri, Dec 28, 2018 at 01:15:15PM +0100, Michal Hocko wrote:
>On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
>[...]
>> Those look unnecessary complexities for this post. This v2 patchset
>> mainly fulfills our first milestone goal: a minimal viable solution
>> that's relatively clean to backport. Even when preparing for new
>> upstreamable versions, it may be good to keep it simple for the
>> initial upstream inclusion.
>
>On the other hand this is creating a new NUMA semantic and I would like
>to have something long term rather than throw something in now and care
>about long term later. So I would really prefer to talk about long term
>plans first and only care about implementation details later.

That makes good sense. FYI here are several in-house patches that
try to leverage (but not yet integrate with) NUMA balancing. The last
one is brute-force hacking. They obviously break the original NUMA
balancing logic.

Thanks,
Fengguang

[-- Attachment #2: 0074-migrate-set-PROT_NONE-on-the-PTEs-and-let-NUMA-balan.patch --]
[-- Type: text/x-diff, Size: 1332 bytes --]

From ef41a542568913c8c62251021c3bc38b7a549440 Mon Sep 17 00:00:00 2001
From: Liu Jingqi <jingqi.liu@intel.com>
Date: Sat, 29 Sep 2018 23:29:56 +0800
Subject: [PATCH 074/166] migrate: set PROT_NONE on the PTEs and let NUMA
 balancing

CONFIG_NUMA_BALANCING needs to be enabled first.
Set PROT_NONE on the PTEs that map to the page,
and do the actual migration in the context of the process which initiates the migration.

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/migrate.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index b27a287081c2..d933f6966601 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1530,6 +1530,21 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	if (page_mapcount(page) > 1 && !migrate_all)
 		goto out_putpage;
 
+	if (flags & MPOL_MF_SW_YOUNG) {
+		unsigned long start, end;
+		unsigned long nr_pte_updates = 0;
+
+		start = max(addr, vma->vm_start);
+
+		/* TODO: if huge page  */
+		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
+		end = min(end, vma->vm_end);
+		nr_pte_updates = change_prot_numa(vma, start, end);
+
+		err = 0;
+		goto out_putpage;
+	}
+
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
 			/* Check if the page is software young. */
-- 
2.15.0


[-- Attachment #3: 0075-migrate-consolidate-MPOL_MF_SW_YOUNG-behaviors.patch --]
[-- Type: text/x-diff, Size: 3514 bytes --]

From e617e8c2034387cbed50bafa786cf83528dbe3df Mon Sep 17 00:00:00 2001
From: Fengguang Wu <fengguang.wu@intel.com>
Date: Sun, 30 Sep 2018 10:50:58 +0800
Subject: [PATCH 075/166] migrate: consolidate MPOL_MF_SW_YOUNG behaviors

- if page already in target node: SetPageReferenced
- otherwise: change_prot_numa

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/Kconfig |  1 +
 mm/migrate.c         | 65 +++++++++++++++++++++++++++++++---------------------
 2 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4c6dec47fac6..c103373536fc 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -100,6 +100,7 @@ config KVM_EPT_IDLE
 	tristate "KVM EPT idle page tracking"
 	depends on KVM_INTEL
 	depends on PROC_PAGE_MONITOR
+	depends on NUMA_BALANCING
 	---help---
 	  Provides support for walking EPT to get the A bits on Intel
 	  processors equipped with the VT extensions.
diff --git a/mm/migrate.c b/mm/migrate.c
index d933f6966601..d944f031c9ea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1500,6 +1500,8 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 {
 	struct vm_area_struct *vma;
 	struct page *page;
+	unsigned long end;
+	unsigned int page_nid;
 	unsigned int follflags;
 	int err;
 	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
@@ -1522,49 +1524,60 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	if (!page)
 		goto out;
 
-	err = 0;
-	if (page_to_nid(page) == node)
-		goto out_putpage;
+	page_nid = page_to_nid(page);
 
 	err = -EACCES;
 	if (page_mapcount(page) > 1 && !migrate_all)
 		goto out_putpage;
 
-	if (flags & MPOL_MF_SW_YOUNG) {
-		unsigned long start, end;
-		unsigned long nr_pte_updates = 0;
-
-		start = max(addr, vma->vm_start);
-
-		/* TODO: if huge page  */
-		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
-		end = min(end, vma->vm_end);
-		nr_pte_updates = change_prot_numa(vma, start, end);
-
-		err = 0;
-		goto out_putpage;
-	}
-
+	err = 0;
 	if (PageHuge(page)) {
-		if (PageHead(page)) {
-			/* Check if the page is software young. */
-			if (flags & MPOL_MF_SW_YOUNG)
+		if (!PageHead(page)) {
+			err = -EACCES;
+			goto out_putpage;
+		}
+		if (flags & MPOL_MF_SW_YOUNG) {
+			if (page_nid == node)
 				SetPageReferenced(page);
-			isolate_huge_page(page, pagelist);
-			err = 0;
+			else if (PageAnon(page)) {
+				end = addr + (hpage_nr_pages(page) << PAGE_SHIFT);
+				if (end <= vma->vm_end)
+					change_prot_numa(vma, addr, end);
+			}
+			goto out_putpage;
 		}
+		if (page_nid == node)
+			goto out_putpage;
+		isolate_huge_page(page, pagelist);
 	} else {
 		struct page *head;
 
 		head = compound_head(page);
+
+		if (flags & MPOL_MF_SW_YOUNG) {
+			if (page_nid == node)
+				SetPageReferenced(head);
+			else {
+				unsigned long size;
+				size = hpage_nr_pages(head) << PAGE_SHIFT;
+				end = addr + size;
+				if (unlikely(addr & (size - 1)))
+					err = -EXDEV;
+				else if (likely(end <= vma->vm_end))
+					change_prot_numa(vma, addr, end);
+				else
+					err = -ERANGE;
+			}
+			goto out_putpage;
+		}
+		if (page_nid == node)
+			goto out_putpage;
+
 		err = isolate_lru_page(head);
 		if (err)
 			goto out_putpage;
 
 		err = 0;
-		/* Check if the page is software young. */
-		if (flags & MPOL_MF_SW_YOUNG)
-			SetPageReferenced(head);
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_cache(head),
-- 
2.15.0


[-- Attachment #4: 0076-mempolicy-force-NUMA-balancing.patch --]
[-- Type: text/x-diff, Size: 1511 bytes --]

From a2d9740d1639f807868014c16dc9e2620d356f3c Mon Sep 17 00:00:00 2001
From: Fengguang Wu <fengguang.wu@intel.com>
Date: Sun, 30 Sep 2018 19:22:27 +0800
Subject: [PATCH 076/166] mempolicy: force NUMA balancing

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/memory.c    | 3 ++-
 mm/mempolicy.c | 5 -----
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..20c7efdff63b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3775,7 +3775,8 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		*flags |= TNF_FAULT_LOCAL;
 	}
 
-	return mpol_misplaced(page, vma, addr);
+	return 0;
+	/* return mpol_misplaced(page, vma, addr); */
 }
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..21dc6ba1d062 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2295,8 +2295,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	int ret = -1;
 
 	pol = get_vma_policy(vma, addr);
-	if (!(pol->flags & MPOL_F_MOF))
-		goto out;
 
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
@@ -2336,9 +2334,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
 		polnid = thisnid;
-
-		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
-			goto out;
 	}
 
 	if (curnid != polnid)
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28 12:15         ` Michal Hocko
  2018-12-28 13:15           ` Fengguang Wu
@ 2018-12-28 13:31           ` Fengguang Wu
  2018-12-28 18:28             ` Yang Shi
  2018-12-28 19:52             ` Michal Hocko
  1 sibling, 2 replies; 62+ messages in thread
From: Fengguang Wu @ 2018-12-28 13:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

>> > I haven't looked at the implementation yet but if you are proposing a
>> > special cased zone lists then this is something CDM (Coherent Device
>> > Memory) was trying to do two years ago and there was quite some
>> > skepticism in the approach.
>>
>> It looks we are pretty different than CDM. :)
>> We creating new NUMA nodes rather than CDM's new ZONE.
>> The zonelists modification is just to make PMEM nodes more separated.
>
>Yes, this is exactly what CDM was after. Have a zone which is not
>reachable without explicit request AFAIR. So no, I do not think you are
>too different, you just use a different terminology ;)

Got it. OK, the fallback zonelists patch does need more thought.

In the long term, Linux should be prepared for multi-level memory.
Then there will arise the need to "allocate from this level of memory".
So it looks good to have separate zonelists for each level of memory.

On the other hand, there will also be page allocations that don't care
about the exact memory level. So it looks reasonable to expect
different kinds of fallback zonelists that can be selected by NUMA policy.
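
As a user-space illustration of "selected by NUMA policy": with the PMEM
node id known (for example from the node type export of patch 07/21),
plain mbind(2) is already enough to place a range on PMEM. The node id
passed in below is only an assumption, not fixed by the patchset.

#include <stddef.h>
#include <numaif.h>
#include <sys/mman.h>

/* Bind an anonymous range to an assumed PMEM node (e.g. node 2). */
static void *alloc_on_pmem(size_t len, int pmem_node)
{
	unsigned long nodemask = 1UL << pmem_node;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	/* Page faults in [buf, buf+len) will now allocate from the PMEM node. */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}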

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28 13:31           ` Fengguang Wu
@ 2018-12-28 18:28             ` Yang Shi
  2018-12-28 19:52             ` Michal Hocko
  1 sibling, 0 replies; 62+ messages in thread
From: Yang Shi @ 2018-12-28 18:28 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Michal Hocko, Andrew Morton, Linux Memory Management List,
	KVM list, LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying,
	Liu Jingqi, Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams

On Fri, Dec 28, 2018 at 5:31 AM Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> >> > I haven't looked at the implementation yet but if you are proposing a
> >> > special cased zone lists then this is something CDM (Coherent Device
> >> > Memory) was trying to do two years ago and there was quite some
> >> > skepticism in the approach.
> >>
> >> It looks we are pretty different than CDM. :)
> >> We creating new NUMA nodes rather than CDM's new ZONE.
> >> The zonelists modification is just to make PMEM nodes more separated.
> >
> >Yes, this is exactly what CDM was after. Have a zone which is not
> >reachable without explicit request AFAIR. So no, I do not think you are
> >too different, you just use a different terminology ;)
>
> Got it. OK.. The fall back zonelists patch does need more thoughts.
>
> In long term POV, Linux should be prepared for multi-level memory.
> Then there will arise the need to "allocate from this level memory".
> So it looks good to have separated zonelists for each level of memory.

I tend to agree with Fengguang. We do have needs for finer-grained
control over the usage of DRAM and PMEM, for example, controlling the
percentage of DRAM and PMEM for a specific VMA.

NUMA policy does not sound good enough for some use cases since it can
only control which mempolicy is used for which memory range. Our use
case's memory access pattern is random within a VMA. So, we can't control
the percentage by mempolicy. We have to put PMEM into a separate zonelist
to make sure memory allocation happens on PMEM when certain criteria
are met, as Fengguang does in this patch series.

Thanks,
Yang

>
> On the other hand, there will also be page allocations that don't care
> about the exact memory level. So it looks reasonable to expect
> different kind of fallback zonelists that can be selected by NUMA policy.
>
> Thanks,
> Fengguang
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28 13:15           ` Fengguang Wu
@ 2018-12-28 19:46             ` Michal Hocko
  0 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2018-12-28 19:46 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams, Mel Gorman,
	Andrea Arcangeli

[Cc Mel and Andrea - the thread started
http://lkml.kernel.org/r/20181226131446.330864849@intel.com]

On Fri 28-12-18 21:15:42, Wu Fengguang wrote:
> On Fri, Dec 28, 2018 at 01:15:15PM +0100, Michal Hocko wrote:
> > On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
> > [...]
> > > Those look unnecessary complexities for this post. This v2 patchset
> > > mainly fulfills our first milestone goal: a minimal viable solution
> > > that's relatively clean to backport. Even when preparing for new
> > > upstreamable versions, it may be good to keep it simple for the
> > > initial upstream inclusion.
> > 
> > On the other hand this is creating a new NUMA semantic and I would like
> > to have something long term rather than throw something in now and care
> > about long term later. So I would really prefer to talk about long term
> > plans first and only care about implementation details later.
> 
> That makes good sense. FYI here are the several in-house patches that
> try to leverage (but not yet integrate with) NUMA balancing. The last
> one is brutal force hacking. They obviously break original NUMA
> balancing logic.
> 
> Thanks,
> Fengguang

> >From ef41a542568913c8c62251021c3bc38b7a549440 Mon Sep 17 00:00:00 2001
> From: Liu Jingqi <jingqi.liu@intel.com>
> Date: Sat, 29 Sep 2018 23:29:56 +0800
> Subject: [PATCH 074/166] migrate: set PROT_NONE on the PTEs and let NUMA
>  balancing
> 
> Need to enable CONFIG_NUMA_BALANCING firstly.
> Set PROT_NONE on the PTEs that map to the page,
> and do the actual migration in the context of process which initiate migration.
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  mm/migrate.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b27a287081c2..d933f6966601 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1530,6 +1530,21 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  	if (page_mapcount(page) > 1 && !migrate_all)
>  		goto out_putpage;
>  
> +	if (flags & MPOL_MF_SW_YOUNG) {
> +		unsigned long start, end;
> +		unsigned long nr_pte_updates = 0;
> +
> +		start = max(addr, vma->vm_start);
> +
> +		/* TODO: if huge page  */
> +		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
> +		end = min(end, vma->vm_end);
> +		nr_pte_updates = change_prot_numa(vma, start, end);
> +
> +		err = 0;
> +		goto out_putpage;
> +	}
> +
>  	if (PageHuge(page)) {
>  		if (PageHead(page)) {
>  			/* Check if the page is software young. */
> -- 
> 2.15.0
> 

> >From e617e8c2034387cbed50bafa786cf83528dbe3df Mon Sep 17 00:00:00 2001
> From: Fengguang Wu <fengguang.wu@intel.com>
> Date: Sun, 30 Sep 2018 10:50:58 +0800
> Subject: [PATCH 075/166] migrate: consolidate MPOL_MF_SW_YOUNG behaviors
> 
> - if page already in target node: SetPageReferenced
> - otherwise: change_prot_numa
> 
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  arch/x86/kvm/Kconfig |  1 +
>  mm/migrate.c         | 65 +++++++++++++++++++++++++++++++---------------------
>  2 files changed, 40 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 4c6dec47fac6..c103373536fc 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -100,6 +100,7 @@ config KVM_EPT_IDLE
>  	tristate "KVM EPT idle page tracking"
>  	depends on KVM_INTEL
>  	depends on PROC_PAGE_MONITOR
> +	depends on NUMA_BALANCING
>  	---help---
>  	  Provides support for walking EPT to get the A bits on Intel
>  	  processors equipped with the VT extensions.
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d933f6966601..d944f031c9ea 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1500,6 +1500,8 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  {
>  	struct vm_area_struct *vma;
>  	struct page *page;
> +	unsigned long end;
> +	unsigned int page_nid;
>  	unsigned int follflags;
>  	int err;
>  	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
> @@ -1522,49 +1524,60 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  	if (!page)
>  		goto out;
>  
> -	err = 0;
> -	if (page_to_nid(page) == node)
> -		goto out_putpage;
> +	page_nid = page_to_nid(page);
>  
>  	err = -EACCES;
>  	if (page_mapcount(page) > 1 && !migrate_all)
>  		goto out_putpage;
>  
> -	if (flags & MPOL_MF_SW_YOUNG) {
> -		unsigned long start, end;
> -		unsigned long nr_pte_updates = 0;
> -
> -		start = max(addr, vma->vm_start);
> -
> -		/* TODO: if huge page  */
> -		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
> -		end = min(end, vma->vm_end);
> -		nr_pte_updates = change_prot_numa(vma, start, end);
> -
> -		err = 0;
> -		goto out_putpage;
> -	}
> -
> +	err = 0;
>  	if (PageHuge(page)) {
> -		if (PageHead(page)) {
> -			/* Check if the page is software young. */
> -			if (flags & MPOL_MF_SW_YOUNG)
> +		if (!PageHead(page)) {
> +			err = -EACCES;
> +			goto out_putpage;
> +		}
> +		if (flags & MPOL_MF_SW_YOUNG) {
> +			if (page_nid == node)
>  				SetPageReferenced(page);
> -			isolate_huge_page(page, pagelist);
> -			err = 0;
> +			else if (PageAnon(page)) {
> +				end = addr + (hpage_nr_pages(page) << PAGE_SHIFT);
> +				if (end <= vma->vm_end)
> +					change_prot_numa(vma, addr, end);
> +			}
> +			goto out_putpage;
>  		}
> +		if (page_nid == node)
> +			goto out_putpage;
> +		isolate_huge_page(page, pagelist);
>  	} else {
>  		struct page *head;
>  
>  		head = compound_head(page);
> +
> +		if (flags & MPOL_MF_SW_YOUNG) {
> +			if (page_nid == node)
> +				SetPageReferenced(head);
> +			else {
> +				unsigned long size;
> +				size = hpage_nr_pages(head) << PAGE_SHIFT;
> +				end = addr + size;
> +				if (unlikely(addr & (size - 1)))
> +					err = -EXDEV;
> +				else if (likely(end <= vma->vm_end))
> +					change_prot_numa(vma, addr, end);
> +				else
> +					err = -ERANGE;
> +			}
> +			goto out_putpage;
> +		}
> +		if (page_nid == node)
> +			goto out_putpage;
> +
>  		err = isolate_lru_page(head);
>  		if (err)
>  			goto out_putpage;
>  
>  		err = 0;
> -		/* Check if the page is software young. */
> -		if (flags & MPOL_MF_SW_YOUNG)
> -			SetPageReferenced(head);
>  		list_add_tail(&head->lru, pagelist);
>  		mod_node_page_state(page_pgdat(head),
>  			NR_ISOLATED_ANON + page_is_file_cache(head),
> -- 
> 2.15.0
> 

> >From a2d9740d1639f807868014c16dc9e2620d356f3c Mon Sep 17 00:00:00 2001
> From: Fengguang Wu <fengguang.wu@intel.com>
> Date: Sun, 30 Sep 2018 19:22:27 +0800
> Subject: [PATCH 076/166] mempolicy: force NUMA balancing
> 
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  mm/memory.c    | 3 ++-
>  mm/mempolicy.c | 5 -----
>  2 files changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c467102a5cbc..20c7efdff63b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3775,7 +3775,8 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>  		*flags |= TNF_FAULT_LOCAL;
>  	}
>  
> -	return mpol_misplaced(page, vma, addr);
> +	return 0;
> +	/* return mpol_misplaced(page, vma, addr); */
>  }
>  
>  static vm_fault_t do_numa_page(struct vm_fault *vmf)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..21dc6ba1d062 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2295,8 +2295,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  	int ret = -1;
>  
>  	pol = get_vma_policy(vma, addr);
> -	if (!(pol->flags & MPOL_F_MOF))
> -		goto out;
>  
>  	switch (pol->mode) {
>  	case MPOL_INTERLEAVE:
> @@ -2336,9 +2334,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  	/* Migrate the page towards the node whose CPU is referencing it */
>  	if (pol->flags & MPOL_F_MORON) {
>  		polnid = thisnid;
> -
> -		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
> -			goto out;
>  	}
>  
>  	if (curnid != polnid)
> -- 
> 2.15.0
> 


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28 13:31           ` Fengguang Wu
  2018-12-28 18:28             ` Yang Shi
@ 2018-12-28 19:52             ` Michal Hocko
  2019-01-03 10:57               ` Mel Gorman
                                 ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Michal Hocko @ 2018-12-28 19:52 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams, Mel Gorman,
	Andrea Arcangeli

[Ccing Mel and Andrea]

On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > I haven't looked at the implementation yet but if you are proposing a
> > > > special cased zone lists then this is something CDM (Coherent Device
> > > > Memory) was trying to do two years ago and there was quite some
> > > > skepticism in the approach.
> > > 
> > > It looks we are pretty different than CDM. :)
> > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > The zonelists modification is just to make PMEM nodes more separated.
> > 
> > Yes, this is exactly what CDM was after. Have a zone which is not
> > reachable without explicit request AFAIR. So no, I do not think you are
> > too different, you just use a different terminology ;)
> 
> Got it. OK.. The fall back zonelists patch does need more thoughts.
> 
> In long term POV, Linux should be prepared for multi-level memory.
> Then there will arise the need to "allocate from this level memory".
> So it looks good to have separated zonelists for each level of memory.

Well, I do not have a good answer for you here. We do not have good
experiences with those systems, I am afraid. NUMA is with us for more
than a decade yet our APIs are coarse to say the least and have been
broken so many times as well. Starting a new API just based on PMEM
sounds like a ticket to another disaster to me.

I would like to see solid arguments why the current model of numa nodes
with fallback in distance order cannot be used for those new
technologies in the beginning, and develop something better based on the
experience that we gain on the way.

I would be especially interested in the possibility of the memory
migration idea during memory pressure and relying on numa balancing to
re-sort the locality on demand rather than hiding certain NUMA nodes or
zones from the allocator and exposing them only to the userspace.

> On the other hand, there will also be page allocations that don't care
> about the exact memory level. So it looks reasonable to expect
> different kind of fallback zonelists that can be selected by NUMA policy.
> 
> Thanks,
> Fengguang

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node
  2018-12-26 13:14 ` [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node Fengguang Wu
@ 2019-01-01  9:14   ` Aneesh Kumar K.V
  2019-01-07  9:57     ` Fengguang Wu
  0 siblings, 1 reply; 62+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-01  9:14 UTC (permalink / raw)
  To: Fengguang Wu, Andrew Morton
  Cc: Linux Memory Management List, Fan Du, Fengguang Wu, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

Fengguang Wu <fengguang.wu@intel.com> writes:

> From: Fan Du <fan.du@intel.com>
>
> When allocate page, DRAM and PMEM node should better not fall back to
> each other. This allows migration code to explicitly control which type
> of node to allocate pages from.
>
> With this patch, PMEM NUMA node can only be used in 2 ways:
> - migrate in and out
> - numactl

Can we achieve this using a nodemask? That way we don't tag nodes with
different properties such as DRAM/PMEM. We can then give the
flexibility to the device init code to add the new memory nodes to
the right nodemask.

-aneesh


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM
  2018-12-26 13:14 ` [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM Fengguang Wu
@ 2019-01-01  9:23   ` Aneesh Kumar K.V
  2019-01-02  0:59     ` Yuan Yao
  2019-01-02 16:47   ` Dave Hansen
  1 sibling, 1 reply; 62+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-01  9:23 UTC (permalink / raw)
  To: Fengguang Wu, Andrew Morton
  Cc: Linux Memory Management List, Yao Yuan, Fengguang Wu, kvm, LKML,
	Fan Du, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

Fengguang Wu <fengguang.wu@intel.com> writes:

> From: Yao Yuan <yuan.yao@intel.com>
>
> Signed-off-by: Yao Yuan <yuan.yao@intel.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
> arch/x86/kvm/mmu.c |   12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> --- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
> +++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
> @@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
>  		kmem_cache_free(cache, mc->objects[--mc->nobjs]);
>  }
>  
> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> +{
> +       struct page *page;
> +
> +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> +       if (!page)
> +	       return 0;
> +       return (unsigned long) page_address(page);
> +}
> +

Maybe it is explained in other patches. What is preventing the
allocation from PMEM here? Is it that we are not using the memory
policy preferred node id and hence the zonelist we built won't have the
PMEM node?


>  static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
>  				       int min)
>  {
> @@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
>  	if (cache->nobjs >= min)
>  		return 0;
>  	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
> -		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> +		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
>  		if (!page)
>  			return cache->nobjs >= min ? 0 : -ENOMEM;
>  		cache->objects[cache->nobjs++] = page;

-aneesh


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM
  2019-01-01  9:23   ` Aneesh Kumar K.V
@ 2019-01-02  0:59     ` Yuan Yao
  0 siblings, 0 replies; 62+ messages in thread
From: Yuan Yao @ 2019-01-02  0:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List,
	Yao Yuan, kvm, LKML, Fan Du, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams

On Tue, Jan 01, 2019 at 02:53:07PM +0530, Aneesh Kumar K.V wrote:
> Fengguang Wu <fengguang.wu@intel.com> writes:
> 
> > From: Yao Yuan <yuan.yao@intel.com>
> >
> > Signed-off-by: Yao Yuan <yuan.yao@intel.com>
> > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> > ---
> > arch/x86/kvm/mmu.c |   12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > --- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
> > +++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
> > @@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
> >  		kmem_cache_free(cache, mc->objects[--mc->nobjs]);
> >  }
> >  
> > +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> > +{
> > +       struct page *page;
> > +
> > +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> > +       if (!page)
> > +	       return 0;
> > +       return (unsigned long) page_address(page);
> > +}
> > +
> 
> Maybe it is explained in other patches. What is preventing the
> allocation from PMEM here? Is it that we are not using the memory
> policy preferred node id and hence the zonelist we built won't have the
> PMEM node?

That's because the PMEM nodes are memory-only nodes in the patchset,
so numa_node_id() will always return a node id from the DRAM nodes.

About the zonelist: yes, in patch 10/21 we put the PMEM nodes into a
separate zonelist, so DRAM nodes will not fall back to PMEM nodes.

> 
> >  static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
> >  				       int min)
> >  {
> > @@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
> >  	if (cache->nobjs >= min)
> >  		return 0;
> >  	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
> > -		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> > +		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
> >  		if (!page)
> >  			return cache->nobjs >= min ? 0 : -ENOMEM;
> >  		cache->objects[cache->nobjs++] = page;
> 
> -aneesh
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM
  2018-12-26 13:14 ` [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM Fengguang Wu
  2019-01-01  9:23   ` Aneesh Kumar K.V
@ 2019-01-02 16:47   ` Dave Hansen
  2019-01-07 10:21     ` Fengguang Wu
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2019-01-02 16:47 UTC (permalink / raw)
  To: Fengguang Wu, Andrew Morton
  Cc: Linux Memory Management List, Yao Yuan, kvm, LKML, Fan Du,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Zhang Yi,
	Dan Williams

On 12/26/18 5:14 AM, Fengguang Wu wrote:
> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> +{
> +       struct page *page;
> +
> +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> +       if (!page)
> +	       return 0;
> +       return (unsigned long) page_address(page);
> +}

There seems to be a ton of *policy* baked into these patches.  For
instance: thou shalt not allocate page table pages from PMEM.  That's
surely not a policy we want to inflict on every Linux user until the end
of time.

I think the more important question is how we can have the specific
policy that this patch implements, but also leave open room for other
policies, such as: "I don't care how slow this VM runs, minimize the
amount of fast memory it eats."

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28  8:41     ` Michal Hocko
  2018-12-28  9:42       ` Fengguang Wu
@ 2019-01-02 18:12       ` Dave Hansen
  2019-01-08 14:53         ` Michal Hocko
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2019-01-02 18:12 UTC (permalink / raw)
  To: Michal Hocko, Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, kvm, LKML, Fan Du,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Zhang Yi, Dan Williams

On 12/28/18 12:41 AM, Michal Hocko wrote:
>>
>> It can be done in kernel page reclaim path, near the anonymous page
>> swap out point. Instead of swapping out, we now have the option to
>> migrate cold pages to PMEM NUMA nodes.
> OK, this makes sense to me except I am not sure this is something that
> should be pmem specific. Is there any reason why we shouldn't migrate
> pages on memory pressure to other nodes in general? In other words
> rather than paging out we would migrate over to the next node that is
> not under memory pressure. Swapout would be the next level when the
> memory is (almost) fully utilized. That wouldn't be pmem specific.

Yeah, we don't want to make this specific to any particular kind of
memory.  For instance, with lots of pressure on expensive, small
high-bandwidth memory (HBM), we might want to migrate some HBM contents
to DRAM.

We need to decide on whether we want to cause pressure on the
destination nodes or not, though.  I think you're suggesting that we try
to look for things under some pressure and totally avoid them.  That
sounds sane, but I also like the idea of this being somewhat ordered.

Think of if we have three nodes, A, B, C.  A is fast, B is medium, C is
slow.  If A and B are "full" and we want to reclaim some of A, do we:

1. Migrate A->B, and put pressure on a later B->C migration, or
2. Migrate A->C directly

?

Doing A->C is less resource intensive because there's only one migration
involved.  But, doing A->B/B->C probably makes the app behave better
because the "A data" is presumably more valuable and is more
appropriately placed in B rather than being demoted all the way to C.
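
A toy model of the two strategies, only to make the trade-off concrete
(the tier numbering and helper below are made up for illustration and are
not kernel code):

/* Toy model: tier 0 = A (fast), 1 = B (medium), 2 = C (slow). */
enum demote_strategy { DEMOTE_NEXT_TIER, DEMOTE_SLOWEST_TIER };

#define SLOWEST_TIER	2

static int pick_demotion_target(int src_tier, enum demote_strategy s)
{
	if (s == DEMOTE_NEXT_TIER)
		return src_tier + 1;	/* A->B now, B->C only under later pressure */
	return SLOWEST_TIER;		/* A->C in a single, cheaper migration */
}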

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28 19:52             ` Michal Hocko
@ 2019-01-03 10:57               ` Mel Gorman
  2019-01-10 16:25               ` Jerome Glisse
       [not found]               ` <20190102122110.00000206@huawei.com>
  2 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2019-01-03 10:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams,
	Andrea Arcangeli

On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> [Ccing Mel and Andrea]
> 
> On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > Memory) was trying to do two years ago and there was quite some
> > > > > skepticism in the approach.
> > > > 
> > > > It looks we are pretty different than CDM. :)
> > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > The zonelists modification is just to make PMEM nodes more separated.
> > > 
> > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > reachable without explicit request AFAIR. So no, I do not think you are
> > > too different, you just use a different terminology ;)
> > 
> > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > 
> > In long term POV, Linux should be prepared for multi-level memory.
> > Then there will arise the need to "allocate from this level memory".
> > So it looks good to have separated zonelists for each level of memory.
> 
> Well, I do not have a good answer for you here. We do not have good
> experiences with those systems, I am afraid. NUMA is with us for more
> than a decade yet our APIs are coarse to say the least and broken at so
> many times as well. Starting a new API just based on PMEM sounds like a
> ticket to another disaster to me.
> 
> I would like to see solid arguments why the current model of numa nodes
> with fallback in distances order cannot be used for those new
> technologies in the beginning and develop something better based on our
> experiences that we gain on the way.
> 
> I would be especially interested about a possibility of the memory
> migration idea during a memory pressure and relying on numa balancing to
> resort the locality on demand rather than hiding certain NUMA nodes or
> zones from the allocator and expose them only to the userspace.
> 

I didn't read the thread as I'm backlogged as I imagine a lot of people
are. However, I would agree that zonelists are not a good fit for something
like PMEM-based being available via a zonelist with a fake distance combined
with NUMA balancing moving pages in and out DRAM and PMEM. The same applies
to a much lesser extent for something like a special higher-speed memory
that is faster than RAM.

The fundamental problem encountered will be a hot-page-inversion issue.
In the PMEM case, DRAM fills, then PMEM starts filling except now we
know that the most recently allocated page which is potentially the most
important in terms of hotness is allocated on slower "remote" memory.
Reclaim kicks in for the DRAM node and then there is interleaving of
hotness between DRAM and PMEM with NUMA balancing then getting involved
with non-deterministic performance.

I recognise that the same problem happens for remote NUMA nodes and it
also has an inversion issue once reclaim gets involved, but it also has a
clearly defined API for dealing with that problem if applications encounter
it. It's also relatively well known given the age of the problem and how
to cope with it. It's less clear whether applications could be able to
cope of it's a more distant PMEM instead of a remote DRAM and how that
should be advertised.

This has been brought up repeatedly over the last few years since high
speed memory was first mentioned but I think long-term what we should
be thinking of is "age-based-migration" where cold pages from DRAM
get migrated to PMEM when DRAM fills and use NUMA balancing to promote
hot pages from PMEM to DRAM. It should also be workable for remote DRAM
although that *might* violate the principle of least surprise given that
applications exist that are remote NUMA aware. It might be safer overall
if such age-based-migration is specific to local-but-different-speed
memory with the main DRAM only being in the zonelists. NUMA balancing
could still optionally promote from DRAM->faster memory while aging
moves pages from fast->slow as memory pressure dictates.

There still would need to be thought on exactly how this is advertised
to userspace because while "distance" is reasonably well understood,
it's not as clear to me whether distance is appropriate to describe
"local-but-different-speed" memory given that accessing a remote
NUMA node can saturate a single link whereas the same may not
be true of local-but-different-speed memory which probably has
dedicated channels. In an ideal world, application developers
interested in higher-speed-memory-reserved-for-important-use and
cheaper-lower-speed-memory could describe what sort of application
modifications they'd be willing to do but that might be unlikely.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node
  2019-01-01  9:14   ` Aneesh Kumar K.V
@ 2019-01-07  9:57     ` Fengguang Wu
  2019-01-07 14:09       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 62+ messages in thread
From: Fengguang Wu @ 2019-01-07  9:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Linux Memory Management List, Fan Du, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

On Tue, Jan 01, 2019 at 02:44:41PM +0530, Aneesh Kumar K.V wrote:
>Fengguang Wu <fengguang.wu@intel.com> writes:
>
>> From: Fan Du <fan.du@intel.com>
>>
>> When allocate page, DRAM and PMEM node should better not fall back to
>> each other. This allows migration code to explicitly control which type
>> of node to allocate pages from.
>>
>> With this patch, PMEM NUMA node can only be used in 2 ways:
>> - migrate in and out
>> - numactl
>
>Can we achieve this using nodemask? That way we don't tag nodes with
>different properties such as DRAM/PMEM. We can then give the
>flexibility to the device init code to add the new memory nodes to
>the right nodemask

Aneesh, in patch 2 we did create the nodemasks numa_nodes_pmem and
numa_nodes_dram. What's your proposed way of "using nodemask"?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM
  2019-01-02 16:47   ` Dave Hansen
@ 2019-01-07 10:21     ` Fengguang Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2019-01-07 10:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Linux Memory Management List, Yao Yuan, kvm, LKML,
	Fan Du, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Zhang Yi,
	Dan Williams

On Wed, Jan 02, 2019 at 08:47:25AM -0800, Dave Hansen wrote:
>On 12/26/18 5:14 AM, Fengguang Wu wrote:
>> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
>> +{
>> +       struct page *page;
>> +
>> +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
>> +       if (!page)
>> +	       return 0;
>> +       return (unsigned long) page_address(page);
>> +}
>
>There seems to be a ton of *policy* baked into these patches.  For
>instance: thou shalt not allocate page tables pages from PMEM.  That's
>surely not a policy we want to inflict on every Linux user until the end
>of time.

Right. It's a straightforward policy for users that care about performance.
The project is planned in 3 steps; at this moment we are in phase (1):

1) core functionalities, easy to backport
2) upstream-able total solution
3) upstream when the API has stabilized

The dumb kernel interface /proc/PID/idle_pages enables implementing the
majority of policies in user space. However, for the other smaller
parts, it looks easier to just implement an obvious policy first and
then consider more possibilities.
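
For reference, a minimal sketch of how a user-space daemon might poll
/proc/PID/idle_pages. The record format (and whether seeking by virtual
address is required) is defined by patch 17/21 and is deliberately not
interpreted here; only the open/read loop is shown.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sample the page "young" information exported for one task. */
static int sample_idle_pages(int pid)
{
	char path[64];
	char buf[4096];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/idle_pages", pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;	/* decode records here, per the patch 17/21 format */

	close(fd);
	return 0;
}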

>I think the more important question is how we can have the specific
>policy that this patch implements, but also leave open room for other
>policies, such as: "I don't care how slow this VM runs, minimize the
>amount of fast memory it eats."

Agreed. I'm open for more ways. We can treat these patches as the
soliciting version. If anyone send reasonable improvements or even
totally different way of doing it, I'd be happy to incorporate.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node
  2019-01-07  9:57     ` Fengguang Wu
@ 2019-01-07 14:09       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 62+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-07 14:09 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Fan Du, kvm, LKML,
	Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie,
	Dave Hansen, Zhang Yi, Dan Williams

Fengguang Wu <fengguang.wu@intel.com> writes:

> On Tue, Jan 01, 2019 at 02:44:41PM +0530, Aneesh Kumar K.V wrote:
>>Fengguang Wu <fengguang.wu@intel.com> writes:
>>
>>> From: Fan Du <fan.du@intel.com>
>>>
>>> When allocate page, DRAM and PMEM node should better not fall back to
>>> each other. This allows migration code to explicitly control which type
>>> of node to allocate pages from.
>>>
>>> With this patch, PMEM NUMA node can only be used in 2 ways:
>>> - migrate in and out
>>> - numactl
>>
>>Can we achieve this using nodemask? That way we don't tag nodes with
>>different properties such as DRAM/PMEM. We can then give the
>>flexibilility to the device init code to add the new memory nodes to
>>the right nodemask
>
> Aneesh, in patch 2 we did create nodemask numa_nodes_pmem and
> numa_nodes_dram. What's your supposed way of "using nodemask"?
>

IIUC the patch is to avoid allocation from PMEM nodes, and the way you
achieve it is by checking if (is_node_pmem(n)). We already have an
abstraction to avoid allocation from a node using a nodemask. I was
wondering whether we can do the equivalent of the above using that.

i.e., __next_zone_zonelist can do zref_in_nodemask(z,
default_exclude_nodemask) and decide whether to use the specific zone
or not.

That way we don't add special code like 

+	PGDAT_DRAM,			/* Volatile DRAM memory node */
+	PGDAT_PMEM,			/* Persistent memory node */

The reason is that there could be other device memory that would want to
be excluded from that default allocation, like you are doing for PMEM.
-aneesh


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
       [not found]               ` <20190102122110.00000206@huawei.com>
@ 2019-01-08 14:52                 ` Michal Hocko
  2019-01-10 15:53                   ` Jerome Glisse
  2019-01-10 18:26                   ` Jonathan Cameron
  2019-01-28 17:42                 ` Jonathan Cameron
  1 sibling, 2 replies; 62+ messages in thread
From: Michal Hocko @ 2019-01-08 14:52 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams, Mel Gorman,
	Andrea Arcangeli, linux-accelerators

On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
[...]
> So ideally I'd love this set to head in a direction that helps me tick off
> at least some of the above usecases and hopefully have some visibility on
> how to address the others moving forwards,

Is it sufficient to have such a memory marked as movable (aka only have
ZONE_MOVABLE)? That should rule out most of the kernel allocations and
it fits the "balance by migration" concept.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-02 18:12       ` Dave Hansen
@ 2019-01-08 14:53         ` Michal Hocko
  0 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2019-01-08 14:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Zhang Yi, Dan Williams

On Wed 02-01-19 10:12:04, Dave Hansen wrote:
> On 12/28/18 12:41 AM, Michal Hocko wrote:
> >>
> >> It can be done in kernel page reclaim path, near the anonymous page
> >> swap out point. Instead of swapping out, we now have the option to
> >> migrate cold pages to PMEM NUMA nodes.
> > OK, this makes sense to me except I am not sure this is something that
> > should be pmem specific. Is there any reason why we shouldn't migrate
> > pages on memory pressure to other nodes in general? In other words
> > rather than paging out we would migrate over to the next node that is
> > not under memory pressure. Swapout would be the next level when the
> > memory is (almost) fully utilized. That wouldn't be pmem specific.
> 
> Yeah, we don't want to make this specific to any particular kind of
> memory.  For instance, with lots of pressure on expensive, small
> high-bandwidth memory (HBM), we might want to migrate some HBM contents
> to DRAM.
> 
> We need to decide on whether we want to cause pressure on the
> destination nodes or not, though.  I think you're suggesting that we try
> to look for things under some pressure and totally avoid them.  That
> sounds sane, but I also like the idea of this being somewhat ordered.
> 
> Think of if we have three nodes, A, B, C.  A is fast, B is medium, C is
> slow.  If A and B are "full" and we want to reclaim some of A, do we:
> 
> 1. Migrate A->B, and put pressure on a later B->C migration, or
> 2. Migrate A->C directly
> 
> ?
> 
> Doing A->C is less resource intensive because there's only one migration
> involved.  But, doing A->B/B->C probably makes the app behave better
> because the "A data" is presumably more valuable and is more
> appropriately placed in B rather than being demoted all the way to C.

This is a good question and I do not have a good answer because I lack
experiences with such "many levels" systems. If we followed CPU caches
model ten you are right that the fallback should be gradual. This is
more complex implementation wise of course. Anyway, I believe that there
is a lot of room for experimentations. If this stays an internal
implementation detail without user API then there is also no promise on
future behavior so nothing gets carved into stone since the day 1 when
our experiences are limited.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-08 14:52                 ` Michal Hocko
@ 2019-01-10 15:53                   ` Jerome Glisse
  2019-01-10 16:42                     ` Michal Hocko
  2019-01-10 18:26                   ` Jonathan Cameron
  1 sibling, 1 reply; 62+ messages in thread
From: Jerome Glisse @ 2019-01-10 15:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jonathan Cameron, Fengguang Wu, Andrew Morton,
	Linux Memory Management List, kvm, LKML, Fan Du, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi, Dan Williams, Mel Gorman, Andrea Arcangeli,
	linux-accelerators

On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote:
> On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> [...]
> > So ideally I'd love this set to head in a direction that helps me tick off
> > at least some of the above usecases and hopefully have some visibility on
> > how to address the others moving forwards,
> 
> Is it sufficient to have such a memory marked as movable (aka only have
> ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> it fits the "balance by migration" concept.

This would not work for GPUs. GPU drivers really want to be in total
control of their memory, yet sometimes they want to migrate some part
of the process's memory to their memory.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2018-12-28 19:52             ` Michal Hocko
  2019-01-03 10:57               ` Mel Gorman
@ 2019-01-10 16:25               ` Jerome Glisse
  2019-01-10 16:50                 ` Michal Hocko
       [not found]               ` <20190102122110.00000206@huawei.com>
  2 siblings, 1 reply; 62+ messages in thread
From: Jerome Glisse @ 2019-01-10 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams, Mel Gorman,
	Andrea Arcangeli

On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> [Ccing Mel and Andrea]
> 
> On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > Memory) was trying to do two years ago and there was quite some
> > > > > skepticism in the approach.
> > > > 
> > > > It looks we are pretty different than CDM. :)
> > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > The zonelists modification is just to make PMEM nodes more separated.
> > > 
> > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > reachable without explicit request AFAIR. So no, I do not think you are
> > > too different, you just use a different terminology ;)
> > 
> > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > 
> > In long term POV, Linux should be prepared for multi-level memory.
> > Then there will arise the need to "allocate from this level memory".
> > So it looks good to have separated zonelists for each level of memory.
> 
> Well, I do not have a good answer for you here. We do not have good
> experiences with those systems, I am afraid. NUMA is with us for more
> than a decade yet our APIs are coarse to say the least and broken at so
> many times as well. Starting a new API just based on PMEM sounds like a
> ticket to another disaster to me.
> 
> I would like to see solid arguments why the current model of numa nodes
> with fallback in distances order cannot be used for those new
> technologies in the beginning and develop something better based on our
> experiences that we gain on the way.

I see several issues with distance. First, it fully abstracts away the
underlying topology, and this might be problematic. For instance, if you
have memory with different characteristics in the same node, like
persistent memory connected to some CPU, then it might be faster for that
CPU to access that persistent memory, as it has a dedicated link to it,
than to access some other remote memory for which the CPU might have to
share the link with other CPUs or devices.

Second, distance is no longer easy to compute when you are not trying
to answer what is the fastest memory for CPU-N but rather asking what
is the fastest memory for CPU-N and device-M, i.e. when you are trying to
find the best memory for a group of CPUs/devices. The answer can
change drastically depending on the members of the group.
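
Just to illustrate, a rough sketch of what "best memory for a group"
would mean even with only the existing node_distance(); treating each
device as having a home node, and best_node_for_group() itself, are
assumptions for illustration only:

        /*
         * Sketch only: pick the memory node with the smallest worst-case
         * distance to a group of initiators (CPU nodes and, hypothetically,
         * device home nodes).  Whether to minimize the worst case or the
         * sum is itself already a policy question.
         */
        #include <linux/kernel.h>
        #include <linux/nodemask.h>
        #include <linux/numa.h>
        #include <linux/topology.h>

        static int best_node_for_group(const int *initiators, int count)
        {
                int nid, i, best = NUMA_NO_NODE, best_worst = INT_MAX;

                for_each_online_node(nid) {
                        int worst = 0;

                        for (i = 0; i < count; i++) {
                                int d = node_distance(initiators[i], nid);

                                worst = max(worst, d);
                        }
                        if (worst < best_worst) {
                                best_worst = worst;
                                best = nid;
                        }
                }
                return best;
        }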


Some advanced programmers already do graph matching, i.e. they match the
graph of their program's dataset/computation with the topology graph
of the computer they run on, to determine the best placement for both
threads and memory.


> I would be especially interested about a possibility of the memory
> migration idea during a memory pressure and relying on numa balancing to
> resort the locality on demand rather than hiding certain NUMA nodes or
> zones from the allocator and expose them only to the userspace.

For device memory we have more things to think of, like:
    - memory not accessible by the CPU
    - non-cache-coherent memory (yet still useful in some cases if the
      application explicitly asks for it)
    - device drivers that want to keep full control over memory, as older
      applications like graphics on GPUs do need contiguous physical
      memory and other tight control over physical memory placement

So if we are talking about something to replace NUMA, I would really
like for that to be inclusive of device memory (which can itself be
a hierarchy of different memories with different characteristics).

Note that I do believe the proposed NUMA solution is something useful
now. But for a new API it would be good to allow things like device
memory.

This is a good topic to discuss during the next LSF/MM.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-10 15:53                   ` Jerome Glisse
@ 2019-01-10 16:42                     ` Michal Hocko
  2019-01-10 17:42                       ` Jerome Glisse
  0 siblings, 1 reply; 62+ messages in thread
From: Michal Hocko @ 2019-01-10 16:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jonathan Cameron, Fengguang Wu, Andrew Morton,
	Linux Memory Management List, kvm, LKML, Fan Du, Yao Yuan,
	Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen,
	Zhang Yi, Dan Williams, Mel Gorman, Andrea Arcangeli,
	linux-accelerators

On Thu 10-01-19 10:53:17, Jerome Glisse wrote:
> On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote:
> > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> > [...]
> > > So ideally I'd love this set to head in a direction that helps me tick off
> > > at least some of the above usecases and hopefully have some visibility on
> > > how to address the others moving forwards,
> > 
> > Is it sufficient to have such a memory marked as movable (aka only have
> > ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> > it fits the "balance by migration" concept.
> 
> This would not work for GPU, GPU driver really want to be in total
> control of their memory yet sometimes they want to migrate some part
> of the process to their memory.

But that also means that GPU doesn't really fit the model discussed
here, right? I thought HMM is the way to manage such a memory.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-10 16:25               ` Jerome Glisse
@ 2019-01-10 16:50                 ` Michal Hocko
  2019-01-10 18:02                   ` Jerome Glisse
  0 siblings, 1 reply; 62+ messages in thread
From: Michal Hocko @ 2019-01-10 16:50 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams, Mel Gorman,
	Andrea Arcangeli

On Thu 10-01-19 11:25:56, Jerome Glisse wrote:
> On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> > [Ccing Mel and Andrea]
> > 
> > On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > > Memory) was trying to do two years ago and there was quite some
> > > > > > skepticism in the approach.
> > > > > 
> > > > > It looks we are pretty different than CDM. :)
> > > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > > The zonelists modification is just to make PMEM nodes more separated.
> > > > 
> > > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > > reachable without explicit request AFAIR. So no, I do not think you are
> > > > too different, you just use a different terminology ;)
> > > 
> > > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > > 
> > > In long term POV, Linux should be prepared for multi-level memory.
> > > Then there will arise the need to "allocate from this level memory".
> > > So it looks good to have separated zonelists for each level of memory.
> > 
> > Well, I do not have a good answer for you here. We do not have good
> > experiences with those systems, I am afraid. NUMA is with us for more
> > than a decade yet our APIs are coarse to say the least and broken at so
> > many times as well. Starting a new API just based on PMEM sounds like a
> > ticket to another disaster to me.
> > 
> > I would like to see solid arguments why the current model of numa nodes
> > with fallback in distances order cannot be used for those new
> > technologies in the beginning and develop something better based on our
> > experiences that we gain on the way.
> 
> I see several issues with distance. First it does fully abstract the
> underlying topology and this might be problematic, for instance if
> you memory with different characteristic in same node like persistent
> memory connected to some CPU then it might be faster for that CPU to
> access that persistent memory has it has dedicated link to it than to
> access some other remote memory for which the CPU might have to share
> the link with other CPUs or devices.
> 
> Second distance is no longer easy to compute when you are not trying
> to answer what is the fastest memory for CPU-N but rather asking what
> is the fastest memory for CPU-N and device-M ie when you are trying to
> find the best memory for a group of CPUs/devices. The answer can
> changes drasticly depending on members of the groups.

While you might be right, I would _really_ appreciate starting with a
simpler model and moving to a more complex one based on real HW and real
experience, rather than starting with an overly complicated and
over-engineered approach from scratch.

> Some advance programmer already do graph matching ie they match the
> graph of their program dataset/computation with the topology graph
> of the computer they run on to determine what is best placement both
> for threads and memory.

And those can still use our mempolicy API to describe their needs. If
the existing API is not sufficient then let's talk about which pieces are
missing.
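
For the record, something as simple as this (userspace, linked with
-lnuma; node 1 standing in for a PMEM node is made up) already lets an
application pin a particular range to a particular node:

        /*
         * Minimal example of describing placement with the existing
         * mempolicy API.
         */
        #include <numaif.h>
        #include <sys/mman.h>
        #include <stdio.h>

        int main(void)
        {
                unsigned long nodemask = 1UL << 1;  /* hypothetical PMEM node 1 */
                size_t len = 64UL << 20;            /* 64 MB of cold data */
                void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (buf == MAP_FAILED)
                        return 1;

                /* bind just this range; the rest of the process keeps
                 * the default policy */
                if (mbind(buf, len, MPOL_BIND, &nodemask,
                          sizeof(nodemask) * 8, 0))
                        perror("mbind");

                return 0;
        }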

> > I would be especially interested about a possibility of the memory
> > migration idea during a memory pressure and relying on numa balancing to
> > resort the locality on demand rather than hiding certain NUMA nodes or
> > zones from the allocator and expose them only to the userspace.
> 
> For device memory we have more things to think of like:
>     - memory not accessible by CPU
>     - non cache coherent memory (yet still useful in some case if
>       application explicitly ask for it)
>     - device driver want to keep full control over memory as older
>       application like graphic for GPU, do need contiguous physical
>       memory and other tight control over physical memory placement

Again, I believe that HMM is meant to target that non-coherent or
non-accessible memory and I do not think it is helpful to put it into
the mix here.

> So if we are talking about something to replace NUMA i would really
> like for that to be inclusive of device memory (which can itself be
> a hierarchy of different memory with different characteristics).

I think we should build on the existing NUMA infrastructure we have.
Developing something completely new is not going to happen anytime soon
and I am not convinced the result would be that much better either.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-10 16:42                     ` Michal Hocko
@ 2019-01-10 17:42                       ` Jerome Glisse
  0 siblings, 0 replies; 62+ messages in thread
From: Jerome Glisse @ 2019-01-10 17:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Huang Ying, Zhang Yi, kvm, Dave Hansen,
	Liu Jingqi, Yao Yuan, Fan Du, Dong Eddie, LKML,
	Linux Memory Management List, Peng Dong, Andrew Morton,
	Fengguang Wu, Dan Williams, linux-accelerators, Mel Gorman

On Thu, Jan 10, 2019 at 05:42:48PM +0100, Michal Hocko wrote:
> On Thu 10-01-19 10:53:17, Jerome Glisse wrote:
> > On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote:
> > > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> > > [...]
> > > > So ideally I'd love this set to head in a direction that helps me tick off
> > > > at least some of the above usecases and hopefully have some visibility on
> > > > how to address the others moving forwards,
> > > 
> > > Is it sufficient to have such a memory marked as movable (aka only have
> > > ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> > > it fits the "balance by migration" concept.
> > 
> > This would not work for GPU, GPU driver really want to be in total
> > control of their memory yet sometimes they want to migrate some part
> > of the process to their memory.
> 
> But that also means that GPU doesn't really fit the model discussed
> here, right? I thought HMM is the way to manage such a memory.

HMM provides the plumbing and tools to manage such memory, but right now
the patchset for nouveau exposes its API through the nouveau device file
as nouveau ioctls. This is not a good long-term solution when you want to
mix and match memory from multiple GPUs (possibly from different vendors).
Then you get each device driver implementing its own mem policy
infrastructure without any coordination between devices/drivers. While it
is _mostly_ OK for the single-GPU case, it is seriously crippling for the
multi-GPU or multi-device cases (for instance when you chain a network
device and a GPU together, or a GPU and storage).

People have been asking for a single common API to manage both regular
memory and device memory, since the common case is that you move things
around depending on which devices/CPUs are working on the dataset.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-10 16:50                 ` Michal Hocko
@ 2019-01-10 18:02                   ` Jerome Glisse
  0 siblings, 0 replies; 62+ messages in thread
From: Jerome Glisse @ 2019-01-10 18:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams, Mel Gorman,
	Andrea Arcangeli

On Thu, Jan 10, 2019 at 05:50:01PM +0100, Michal Hocko wrote:
> On Thu 10-01-19 11:25:56, Jerome Glisse wrote:
> > On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> > > [Ccing Mel and Andrea]
> > > 
> > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > > > Memory) was trying to do two years ago and there was quite some
> > > > > > > skepticism in the approach.
> > > > > > 
> > > > > > It looks we are pretty different than CDM. :)
> > > > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > > > The zonelists modification is just to make PMEM nodes more separated.
> > > > > 
> > > > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > > > reachable without explicit request AFAIR. So no, I do not think you are
> > > > > too different, you just use a different terminology ;)
> > > > 
> > > > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > > > 
> > > > In long term POV, Linux should be prepared for multi-level memory.
> > > > Then there will arise the need to "allocate from this level memory".
> > > > So it looks good to have separated zonelists for each level of memory.
> > > 
> > > Well, I do not have a good answer for you here. We do not have good
> > > experiences with those systems, I am afraid. NUMA is with us for more
> > > than a decade yet our APIs are coarse to say the least and broken at so
> > > many times as well. Starting a new API just based on PMEM sounds like a
> > > ticket to another disaster to me.
> > > 
> > > I would like to see solid arguments why the current model of numa nodes
> > > with fallback in distances order cannot be used for those new
> > > technologies in the beginning and develop something better based on our
> > > experiences that we gain on the way.
> > 
> > I see several issues with distance. First it does fully abstract the
> > underlying topology and this might be problematic, for instance if
> > you memory with different characteristic in same node like persistent
> > memory connected to some CPU then it might be faster for that CPU to
> > access that persistent memory has it has dedicated link to it than to
> > access some other remote memory for which the CPU might have to share
> > the link with other CPUs or devices.
> > 
> > Second distance is no longer easy to compute when you are not trying
> > to answer what is the fastest memory for CPU-N but rather asking what
> > is the fastest memory for CPU-N and device-M ie when you are trying to
> > find the best memory for a group of CPUs/devices. The answer can
> > changes drasticly depending on members of the groups.
> 
> While you might be right, I would _really_ appreciate to start with a
> simpler model and go to a more complex one based on realy HW and real
> experiences than start with an overly complicated and over engineered
> approach from scratch.
> 
> > Some advance programmer already do graph matching ie they match the
> > graph of their program dataset/computation with the topology graph
> > of the computer they run on to determine what is best placement both
> > for threads and memory.
> 
> And those can still use our mempolicy API to describe their needs. If
> existing API is not sufficient then let's talk about which pieces are
> missing.

I understand people don't want the full topology thing, but device memory
cannot be exposed as a NUMA node, hence at the very least we need something
that is not NUMA-node-only, and most likely an API that does not use a
bitmask as the front-facing userspace API. So some kind of UID for memory:
one for each type of memory on each node (and also one for each device
memory). It can be a 1:1 match with the NUMA node id for all regular NUMA
node memory, with extra ids for device memory (for instance by setting the
high bit of the UID for device memory).
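
To make that concrete, a sketch of the encoding I have in mind (the
names and the exact layout are just an example, nothing here exists):

        /*
         * Sketch of the proposed memory UID encoding, illustration only:
         * regular memory keeps a 1:1 mapping with the NUMA node id,
         * device memory sets the high bit and carries a device index.
         */
        #include <stdbool.h>
        #include <stdint.h>

        #define MEM_UID_DEVICE_BIT      (1u << 31)

        static inline uint32_t mem_uid_from_node(unsigned int nid)
        {
                return nid;                     /* same value as the node id */
        }

        static inline uint32_t mem_uid_from_device(uint32_t device_index)
        {
                return MEM_UID_DEVICE_BIT | device_index;
        }

        static inline bool mem_uid_is_device(uint32_t uid)
        {
                return uid & MEM_UID_DEVICE_BIT;
        }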


> > > I would be especially interested about a possibility of the memory
> > > migration idea during a memory pressure and relying on numa balancing to
> > > resort the locality on demand rather than hiding certain NUMA nodes or
> > > zones from the allocator and expose them only to the userspace.
> > 
> > For device memory we have more things to think of like:
> >     - memory not accessible by CPU
> >     - non cache coherent memory (yet still useful in some case if
> >       application explicitly ask for it)
> >     - device driver want to keep full control over memory as older
> >       application like graphic for GPU, do need contiguous physical
> >       memory and other tight control over physical memory placement
> 
> Again, I believe that HMM is to target those non-coherent or
> non-accessible memory and I do not think it is helpful to put them into
> the mix here.

HMM is the kernel plumbing; it does not expose anything to userspace.
While right now for nouveau the plan is to expose the API through nouveau
ioctls, this does not scale/work for multiple devices or when you mix
and match different devices. A single API that can handle both device
memory and regular memory would be much more useful. Long term, at least,
that's what I would like to see.


> > So if we are talking about something to replace NUMA i would really
> > like for that to be inclusive of device memory (which can itself be
> > a hierarchy of different memory with different characteristics).
> 
> I think we should build on the existing NUMA infrastructure we have.
> Developing something completely new is not going to happen anytime soon
> and I am not convinced the result would be that much better either.

The issue with NUMA is that I do not see a way to add device memory as a
node, as that memory needs to be fully managed by the device driver. Also
the number of nodes might get out of hand (think 32 devices per CPU, so
with 1024 CPUs that's up to 2^15 nodes ...), which leads to the node mask
taking a full page.
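
(Back-of-the-envelope, just to show where "a full page" comes from; the
macro names are only for illustration:)

        /* 32 devices per CPU x 1024 CPUs = 32768 possible nodes = 2^15;
         * one bit per node is 32768 / 8 = 4096 bytes, i.e. a whole
         * 4 KiB page just for a node mask.
         */
        #define EXAMPLE_MAX_NODES       (32 * 1024)             /* 32768 */
        #define EXAMPLE_NODEMASK_BYTES  (EXAMPLE_MAX_NODES / 8) /* 4096  */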

Also the whole NUMA access tracking does not work with devices (it can
be added but right now it is non-existent). Forcing page faults to track
access is highly disruptive for GPUs, while the hw can provide much better
information without faulting, and CPU counters might also be something we
want to use rather than faulting.

I am not saying something new will solve all the issues we have today
with NUMA; actually I don't believe we can solve all of them. But it
could at least be more flexible in terms of what memory a program can
bind to.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-08 14:52                 ` Michal Hocko
  2019-01-10 15:53                   ` Jerome Glisse
@ 2019-01-10 18:26                   ` Jonathan Cameron
  1 sibling, 0 replies; 62+ messages in thread
From: Jonathan Cameron @ 2019-01-10 18:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fengguang Wu, Andrew Morton, Linux Memory Management List, kvm,
	LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
	Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams, Mel Gorman,
	Andrea Arcangeli, linux-accelerators

On Tue, 8 Jan 2019 15:52:56 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> [...]
> > So ideally I'd love this set to head in a direction that helps me tick off
> > at least some of the above usecases and hopefully have some visibility on
> > how to address the others moving forwards,  
> 
> Is it sufficient to have such a memory marked as movable (aka only have
> ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> it fits the "balance by migration" concept.

Yes, to some degree. That's exactly what we are doing, though as things
currently stand I think you have to turn it on via a kernel command line
option and mark the memory hotpluggable in ACPI. Given it may or may not
actually be hotpluggable, that's less than elegant.

Let's randomly decide not to explore that one further for a few more weeks.
la la la la

If we have general balancing by migration then things are definitely
heading in a useful direction, as long as 'hot' takes into account the
main user not being a CPU.  You are right that migration dealing with
the movable kernel allocations is a nice side effect, though, which I
hadn't thought about.  In the long run we might end up with everything
where it should be after some burn-in period.  A generic version of
this proposal is looking nicer and nicer!

Thanks,

Jonathan






^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
       [not found]               ` <20190102122110.00000206@huawei.com>
  2019-01-08 14:52                 ` Michal Hocko
@ 2019-01-28 17:42                 ` Jonathan Cameron
  2019-01-29  2:00                   ` Fengguang Wu
  1 sibling, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2019-01-28 17:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Huang Ying, Zhang Yi, kvm, Dave Hansen,
	Liu Jingqi, Fan Du, Dong Eddie, LKML, linux-accelerators,
	Linux Memory Management List, Peng Dong, Yao Yuan, Andrew Morton,
	Fengguang Wu, Dan Williams, Mel Gorman

On Wed, 2 Jan 2019 12:21:10 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

> On Fri, 28 Dec 2018 20:52:24 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > [Ccing Mel and Andrea]
> > 

Hi,

I just wanted to highlight this section as I didn't feel we really addressed this
in the earlier conversation.

> * Hot pages may not be hot just because the host is using them a lot.  It would be
>   very useful to have a means of adding information available from accelerators
>   beyond simple accessed bits (dreaming ;)  One problem here is translation
>   caches (ATCs) as they won't normally result in any updates to the page accessed
>   bits.  The arm SMMU v3 spec for example makes it clear (though it's kind of
>   obvious) that the ATS request is the only opportunity to update the accessed
>   bit.  The nasty option here would be to periodically flush the ATC to force
>   the access bit updates via repeats of the ATS request (ouch).
>   That option only works if the iommu supports updating the accessed flag
>   (optional on SMMU v3 for example).
> 

If we ignore the IOMMU hardware update issue, which will simply need to be
addressed by future hardware if these techniques become common, how do we
address the Address Translation Cache (ATC) issue without potentially causing
big performance problems by flushing the cache just to force an accessed-bit
update?

These devices are frequently used with PRI and Shared Virtual Addressing
and can be accessing most of your memory without you having any visibility
of it in the page tables (as they aren't walked if your ATC is well matched
in size to your usecase).

A classic example would be accelerated DB walkers like the CCIX demo
Xilinx has shown at a few conferences.  The whole point of those is that
most of the time only your large set of database walkers is using your
memory, and they have translations cached for a good part of what
they are accessing.  Flushing that cache could hurt a lot.
Pinning pages hurts for all the normal flexibility reasons.

The last thing we want is to be migrating pages that can be very hot,
but in an invisible fashion.

Thanks,

Jonathan
 



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
  2019-01-28 17:42                 ` Jonathan Cameron
@ 2019-01-29  2:00                   ` Fengguang Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2019-01-29  2:00 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Michal Hocko, Andrea Arcangeli, Huang Ying, Zhang Yi, kvm,
	Dave Hansen, Liu Jingqi, Fan Du, Dong Eddie, LKML,
	linux-accelerators, Linux Memory Management List, Peng Dong,
	Yao Yuan, Andrew Morton, Dan Williams, Mel Gorman

Hi Jonathan,

Thanks for showing the gap on tracking hot accesses from devices.

On Mon, Jan 28, 2019 at 05:42:39PM +0000, Jonathan Cameron wrote:
>On Wed, 2 Jan 2019 12:21:10 +0000
>Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
>> On Fri, 28 Dec 2018 20:52:24 +0100
>> Michal Hocko <mhocko@kernel.org> wrote:
>>
>> > [Ccing Mel and Andrea]
>> >
>
>Hi,
>
>I just wanted to highlight this section as I didn't feel we really addressed this
>in the earlier conversation.
>
>> * Hot pages may not be hot just because the host is using them a lot.  It would be
>>   very useful to have a means of adding information available from accelerators
>>   beyond simple accessed bits (dreaming ;)  One problem here is translation
>>   caches (ATCs) as they won't normally result in any updates to the page accessed
>>   bits.  The arm SMMU v3 spec for example makes it clear (though it's kind of
>>   obvious) that the ATS request is the only opportunity to update the accessed
>>   bit.  The nasty option here would be to periodically flush the ATC to force
>>   the access bit updates via repeats of the ATS request (ouch).
>>   That option only works if the iommu supports updating the accessed flag
>>   (optional on SMMU v3 for example).

If ATS-based updates are supported, we may trigger them when closing the
/proc/PID/idle_pages file. We already do TLB flushes at that time. For
example:

[PATCH 15/21] ept-idle: EPT walk for virtual machine

        ept_idle_release():
          kvm_flush_remote_tlbs(kvm);

[PATCH 17/21] proc: introduce /proc/PID/idle_pages

        mm_idle_release():
          flush_tlb_mm(mm);

The flush cost is kind of the necessary minimum in our current use
model, where the user space scan+migration daemon runs a loop like this:

loop:
        walk page table N times:
                open,read,close /proc/PID/idle_pages
                (flushes TLB on file close)
                sleep for a short interval
        sort and migrate hot pages
        sleep for a while
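
A stripped-down sketch of that loop in C, just to show the shape of it.
The idle_pages record format is not final, so the read side below treats
the data as opaque bytes; the sort step is elided; and MPOL_MF_SW_YOUNG
comes from patch 19, so it is only available with the patched kernel
headers:

        /*
         * Sketch only: scan /proc/PID/idle_pages a few times, then
         * migrate the pages judged hot via move_pages().
         */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <numaif.h>

        static void scan_once(const char *idle_path, void *buf, size_t size)
        {
                int fd = open(idle_path, O_RDONLY);

                if (fd < 0)
                        return;
                while (read(fd, buf, size) > 0)
                        ;               /* accumulate accessed bits here */
                close(fd);              /* close flushes the TLB, see above */
        }

        /* pages[]/nodes[] come from the (elided) sort step */
        static void migrate_hot(pid_t pid, unsigned long count, void **pages,
                                int *nodes, int *status)
        {
                /* MPOL_MF_SW_YOUNG is added by patch 19 of this series */
                if (move_pages(pid, count, pages, nodes, status,
                               MPOL_MF_MOVE | MPOL_MF_SW_YOUNG) < 0)
                        perror("move_pages");
        }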

>If we ignore the IOMMU hardware update issue which will simply need to be addressed
>by future hardware if these techniques become common, how do we address the
>Address Translation Cache issue without potentially causing big performance
>problems by flushing the cache just to force an accessed bit update?
>
>These devices are frequently used with PRI and Shared Virtual Addressing
>and can be accessing most of your memory without you having any visibility
>of it in the page tables (as they aren't walked if your ATC is well matched
>in size to your usecase.
>
>Classic example would be accelerated DB walkers like the the CCIX demo
>Xilinx has shown at a few conferences.   The whole point of those is that
>most of the time only your large set of database walkers is using your
>memory and they have translations cached for for a good part of what
>they are accessing.  Flushing that cache could hurt a lot.
>Pinning pages hurts for all the normal flexibility reasons.
>
>Last thing we want is to be migrating these pages that can be very hot but
>in an invisible fashion.

If there is some other way to get hotness for special device memory,
the user space daemon may be extended to cover that, perhaps by
querying another new kernel interface.

By driving hotness accounting and migration in user space, we gain
this kind of flexibility. From the daemon's POV, /proc/PID/idle_pages
provides one common way to get "accessed" bits and hence hotness, though
the daemon does not need to depend solely on it.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 14/21] kvm: register in mm_struct
  2018-12-26 13:15 ` [RFC][PATCH v2 14/21] kvm: register in mm_struct Fengguang Wu
@ 2019-02-02  6:57   ` Peter Xu
  2019-02-02 10:50     ` Fengguang Wu
  2019-02-04 10:46     ` Paolo Bonzini
  0 siblings, 2 replies; 62+ messages in thread
From: Peter Xu @ 2019-02-02  6:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Nikita Leshenko,
	Christian Borntraeger, kvm, LKML, Fan Du, Yao Yuan, Peng Dong,
	Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen, Zhang Yi,
	Dan Williams, Paolo Bonzini

On Wed, Dec 26, 2018 at 09:15:00PM +0800, Fengguang Wu wrote:
> VM is associated with an address space and not a specific thread.
> 
> >From Documentation/virtual/kvm/api.txt:
>    Only run VM ioctls from the same process (address space) that was used
>    to create the VM.

Hi, Fengguang,

AFAIU the commit message only explains why a kvm object needs to bind
to a single mm object (say, the reason why there is kvm->mm) however
not the reverse (say, the reason why there is mm->kvm), while the
latter is what this patch really needs?

I'm thinking whether it's legal for multiple VMs to run on a single mm
address space.  I don't see a limitation so far but it's very possible
I am just missing something there (if there is, IMHO they might be
something nice to put into the commit message?).  Thanks,

> 
> CC: Nikita Leshenko <nikita.leshchenko@oracle.com>
> CC: Christian Borntraeger <borntraeger@de.ibm.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  include/linux/mm_types.h |   11 +++++++++++
>  virt/kvm/kvm_main.c      |    3 +++
>  2 files changed, 14 insertions(+)
> 
> --- linux.orig/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
> +++ linux/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
> @@ -27,6 +27,7 @@ typedef int vm_fault_t;
>  struct address_space;
>  struct mem_cgroup;
>  struct hmm;
> +struct kvm;
>  
>  /*
>   * Each physical page in the system has a struct page associated with
> @@ -496,6 +497,10 @@ struct mm_struct {
>  		/* HMM needs to track a few things per mm */
>  		struct hmm *hmm;
>  #endif
> +
> +#if IS_ENABLED(CONFIG_KVM)
> +		struct kvm *kvm;
> +#endif
>  	} __randomize_layout;
>  
>  	/*
> @@ -507,6 +512,12 @@ struct mm_struct {
>  
>  extern struct mm_struct init_mm;
>  
> +#if IS_ENABLED(CONFIG_KVM)
> +static inline struct kvm *mm_kvm(struct mm_struct *mm) { return mm->kvm; }
> +#else
> +static inline struct kvm *mm_kvm(struct mm_struct *mm) { return NULL; }
> +#endif
> +
>  /* Pointer magic because the dynamic array size confuses some compilers. */
>  static inline void mm_init_cpumask(struct mm_struct *mm)
>  {
> --- linux.orig/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
> +++ linux/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
> @@ -727,6 +727,7 @@ static void kvm_destroy_vm(struct kvm *k
>  	struct mm_struct *mm = kvm->mm;
>  
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> +	mm->kvm = NULL;
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
>  	spin_lock(&kvm_lock);
> @@ -3224,6 +3225,8 @@ static int kvm_dev_ioctl_create_vm(unsig
>  		fput(file);
>  		return -ENOMEM;
>  	}
> +
> +	kvm->mm->kvm = kvm;
>  	kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
>  
>  	fd_install(r, file);
> 
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 14/21] kvm: register in mm_struct
  2019-02-02  6:57   ` Peter Xu
@ 2019-02-02 10:50     ` Fengguang Wu
  2019-02-04 10:46     ` Paolo Bonzini
  1 sibling, 0 replies; 62+ messages in thread
From: Fengguang Wu @ 2019-02-02 10:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, Linux Memory Management List, Nikita Leshenko,
	Christian Borntraeger, kvm, LKML, Fan Du, Yao Yuan, Peng Dong,
	Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen, Zhang Yi,
	Dan Williams, Paolo Bonzini

Hi Peter,

On Sat, Feb 02, 2019 at 02:57:41PM +0800, Peter Xu wrote:
>On Wed, Dec 26, 2018 at 09:15:00PM +0800, Fengguang Wu wrote:
>> VM is associated with an address space and not a specific thread.
>>
>> >From Documentation/virtual/kvm/api.txt:
>>    Only run VM ioctls from the same process (address space) that was used
>>    to create the VM.
>
>Hi, Fengguang,
>
>AFAIU the commit message only explains why a kvm object needs to bind
>to a single mm object (say, the reason why there is kvm->mm) however
>not the reverse (say, the reason why there is mm->kvm), while the
>latter is what this patch really needs?

Yeah, good point. The addition of mm->kvm keeps the code in this patchset
simple. However, if that field is considered not generally useful for
other possible users, and the added space overhead is a concern, we
can instead make do with a flag (saying the mm is referenced by some KVM)
and add extra lookup code to find the exact kvm instance.
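
A rough sketch of that alternative (MMF_HAS_KVM is a made-up flag;
vm_list and kvm_lock are the existing globals in virt/kvm/kvm_main.c;
a real version would also need to grab a reference before dropping
kvm_lock):

        /*
         * Sketch only: rather than mm->kvm, set a flag on the mm and
         * look the kvm instance up on demand.
         */
        static struct kvm *mm_to_kvm(struct mm_struct *mm)
        {
                struct kvm *kvm, *found = NULL;

                if (!test_bit(MMF_HAS_KVM, &mm->flags))
                        return NULL;

                spin_lock(&kvm_lock);
                list_for_each_entry(kvm, &vm_list, vm_list) {
                        if (kvm->mm == mm) {
                                found = kvm;
                                break;
                        }
                }
                spin_unlock(&kvm_lock);

                return found;
        }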

>I'm thinking whether it's legal for multiple VMs to run on a single mm
>address space.  I don't see a limitation so far but it's very possible
>I am just missing something there (if there is, IMHO they might be
>something nice to put into the commit message?).  Thanks,

So far one QEMU only starts one KVM. I cannot think of any strong
benefit to starting multiple KVMs in one single QEMU, so it may well
remain so in the future. Anyway, it's an internal data structure rather
than an API, so it can adapt to possible future changes.

Thanks,
Fengguang

>> CC: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> CC: Christian Borntraeger <borntraeger@de.ibm.com>
>> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
>> ---
>>  include/linux/mm_types.h |   11 +++++++++++
>>  virt/kvm/kvm_main.c      |    3 +++
>>  2 files changed, 14 insertions(+)
>>
>> --- linux.orig/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
>> +++ linux/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
>> @@ -27,6 +27,7 @@ typedef int vm_fault_t;
>>  struct address_space;
>>  struct mem_cgroup;
>>  struct hmm;
>> +struct kvm;
>>
>>  /*
>>   * Each physical page in the system has a struct page associated with
>> @@ -496,6 +497,10 @@ struct mm_struct {
>>  		/* HMM needs to track a few things per mm */
>>  		struct hmm *hmm;
>>  #endif
>> +
>> +#if IS_ENABLED(CONFIG_KVM)
>> +		struct kvm *kvm;
>> +#endif
>>  	} __randomize_layout;
>>
>>  	/*
>> @@ -507,6 +512,12 @@ struct mm_struct {
>>
>>  extern struct mm_struct init_mm;
>>
>> +#if IS_ENABLED(CONFIG_KVM)
>> +static inline struct kvm *mm_kvm(struct mm_struct *mm) { return mm->kvm; }
>> +#else
>> +static inline struct kvm *mm_kvm(struct mm_struct *mm) { return NULL; }
>> +#endif
>> +
>>  /* Pointer magic because the dynamic array size confuses some compilers. */
>>  static inline void mm_init_cpumask(struct mm_struct *mm)
>>  {
>> --- linux.orig/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
>> +++ linux/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
>> @@ -727,6 +727,7 @@ static void kvm_destroy_vm(struct kvm *k
>>  	struct mm_struct *mm = kvm->mm;
>>
>>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>> +	mm->kvm = NULL;
>>  	kvm_destroy_vm_debugfs(kvm);
>>  	kvm_arch_sync_events(kvm);
>>  	spin_lock(&kvm_lock);
>> @@ -3224,6 +3225,8 @@ static int kvm_dev_ioctl_create_vm(unsig
>>  		fput(file);
>>  		return -ENOMEM;
>>  	}
>> +
>> +	kvm->mm->kvm = kvm;
>>  	kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
>>
>>  	fd_install(r, file);
>>
>>
>
>Regards,
>
>-- 
>Peter Xu
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH v2 14/21] kvm: register in mm_struct
  2019-02-02  6:57   ` Peter Xu
  2019-02-02 10:50     ` Fengguang Wu
@ 2019-02-04 10:46     ` Paolo Bonzini
  1 sibling, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2019-02-04 10:46 UTC (permalink / raw)
  To: Peter Xu, Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Nikita Leshenko,
	Christian Borntraeger, kvm, LKML, Fan Du, Yao Yuan, Peng Dong,
	Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen, Zhang Yi,
	Dan Williams

On 02/02/19 07:57, Peter Xu wrote:
> 
> I'm thinking whether it's legal for multiple VMs to run on a single mm
> address space.  I don't see a limitation so far but it's very possible
> I am just missing something there (if there is, IMHO they might be
> something nice to put into the commit message?).  Thanks,

Yes, it certainly is legal, and even useful in fact.

For example there are people running WebAssembly in a KVM sandbox.  In
that case you can have multiple KVM instances in a single process.

It seems to me that there is already a perfect way to link an mm to its
users, which is the MMU notifier.  Why do you need a separate
proc_ept_idle_operations?  You could change ept_idle_read into an MMU
notifier callback, and have the core mm/ code combine the output of
mm_idle_read and all the MMU notifiers.  Basically, ept_idle_ctrl
becomes an argument to the new MMU notifier callback, or something like
that.
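
Something along these lines, purely as a sketch (the .read_idle hook,
its arguments and ept_idle_walk() are all made up, nothing of this
exists today):

        /*
         * Hypothetical: an extra MMU notifier callback that the core mm
         * invokes from the idle_pages read path, so KVM's EPT walk and
         * the normal page table walk feed the same output buffer.
         */
        static int kvm_mmu_notifier_read_idle(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end,
                                              struct ept_idle_ctrl *ctrl)
        {
                struct kvm *kvm = mmu_notifier_to_kvm(mn);

                /* reuse the EPT walk from patch 15 to fill ctrl's buffer */
                return ept_idle_walk(kvm, start, end, ctrl);
        }

        static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
                /* ... the existing callbacks such as .clear_flush_young ... */
                .read_idle      = kvm_mmu_notifier_read_idle,   /* new hook */
        };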

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2019-02-04 10:47 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-26 13:14 [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM Fengguang Wu
2018-12-27  3:41   ` Matthew Wilcox
2018-12-27  4:11     ` Fengguang Wu
2018-12-27  5:13       ` Dan Williams
2018-12-27 19:32         ` Yang Shi
2018-12-28  3:27           ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 02/21] acpi/numa: memorize NUMA node type from SRAT table Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 03/21] x86/numa_emulation: fix fake NUMA in uniform case Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 04/21] x86/numa_emulation: pass numa node type to fake nodes Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 05/21] mmzone: new pgdat flags for DRAM and PMEM Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 06/21] x86,numa: update numa node type Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 07/21] mm: export node type {pmem|dram} under /sys/bus/node Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node Fengguang Wu
2018-12-27 20:07   ` Christopher Lameter
2018-12-28  2:31     ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 09/21] mm: avoid duplicate peer target node Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node Fengguang Wu
2019-01-01  9:14   ` Aneesh Kumar K.V
2019-01-07  9:57     ` Fengguang Wu
2019-01-07 14:09       ` Aneesh Kumar K.V
2018-12-26 13:14 ` [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM Fengguang Wu
2019-01-01  9:23   ` Aneesh Kumar K.V
2019-01-02  0:59     ` Yuan Yao
2019-01-02 16:47   ` Dave Hansen
2019-01-07 10:21     ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 12/21] x86/pgtable: " Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 13/21] x86/pgtable: dont check PMD accessed bit Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 14/21] kvm: register in mm_struct Fengguang Wu
2019-02-02  6:57   ` Peter Xu
2019-02-02 10:50     ` Fengguang Wu
2019-02-04 10:46     ` Paolo Bonzini
2018-12-26 13:15 ` [RFC][PATCH v2 15/21] ept-idle: EPT walk for virtual machine Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 16/21] mm-idle: mm_walk for normal task Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 17/21] proc: introduce /proc/PID/idle_pages Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 18/21] kvm-ept-idle: enable module Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM Fengguang Wu
2018-12-27 20:31 ` [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Michal Hocko
2018-12-28  5:08   ` Fengguang Wu
2018-12-28  8:41     ` Michal Hocko
2018-12-28  9:42       ` Fengguang Wu
2018-12-28 12:15         ` Michal Hocko
2018-12-28 13:15           ` Fengguang Wu
2018-12-28 19:46             ` Michal Hocko
2018-12-28 13:31           ` Fengguang Wu
2018-12-28 18:28             ` Yang Shi
2018-12-28 19:52             ` Michal Hocko
2019-01-03 10:57               ` Mel Gorman
2019-01-10 16:25               ` Jerome Glisse
2019-01-10 16:50                 ` Michal Hocko
2019-01-10 18:02                   ` Jerome Glisse
     [not found]               ` <20190102122110.00000206@huawei.com>
2019-01-08 14:52                 ` Michal Hocko
2019-01-10 15:53                   ` Jerome Glisse
2019-01-10 16:42                     ` Michal Hocko
2019-01-10 17:42                       ` Jerome Glisse
2019-01-10 18:26                   ` Jonathan Cameron
2019-01-28 17:42                 ` Jonathan Cameron
2019-01-29  2:00                   ` Fengguang Wu
2019-01-02 18:12       ` Dave Hansen
2019-01-08 14:53         ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).