* [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
@ 2019-04-25  1:21 Fan Du
  2019-04-25  1:21 ` [RFC PATCH 1/5] acpi/numa: memorize NUMA node type from SRAT table Fan Du
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Fan Du @ 2019-04-25  1:21 UTC (permalink / raw)
  To: akpm, mhocko, fengguang.wu, dan.j.williams, dave.hansen,
	xishi.qiuxishi, ying.huang
  Cc: linux-mm, linux-kernel, Fan Du

This is another approach to building the zonelists, based on patch #10
of patchset [1].

For systems with heterogeneous DRAM and PMEM (persistent memory),

1) change ZONELIST_FALLBACK to first fall back to nodes of the same type,
   then to the other types

2) add ZONELIST_FALLBACK_SAME_TYPE to fall back only to nodes of the same
   type, explicitly selected by __GFP_SAME_NODE_TYPE.

For example, a 2S DRAM+PMEM system may have NUMA distances:
node   0   1   2   3 
  0:  10  21  17  28 
  1:  21  10  28  17 
  2:  17  28  10  28 
  3:  28  17  28  10

Nodes 0 and 1 are DRAM nodes; nodes 2 and 3 are PMEM nodes.

ZONELIST_FALLBACK
=================
The current fallback zonelists are built on NUMA distance only,
which means a page allocation request from node 0 will iterate zones
in the order: DRAM node 0 -> PMEM node 2 -> DRAM node 1 -> PMEM node 3.

However, PMEM has different characteristics from DRAM, so the more
reasonable or desirable fallback order would be:
DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3,
i.e. only try PMEM once DRAM is exhausted.

ZONELIST_FALLBACK_SAME_TYPE
===========================
Some use cases fit PMEM characteristics well, e.g. pages that are read
much more frequently than they are written; other cases may prefer DRAM
only. In either case it does not matter whether the page comes from the
local node or a remote one.

Create __GFP_SAME_NODE_TYPE to request a page from nodes of the same
type only, so we get either DRAM (from node 0 or 1) or PMEM (from node
2 or 3). It is a kind of extension of the nofallback list, widened to
all nodes of the same type.
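
As an illustration only (no caller is converted in this patchset), a
user of the new flag would look roughly like:

        struct page *page;

        /*
         * Sketch: request a movable page from node <nid> or, failing
         * that, from any other node of the same type (DRAM if <nid> is
         * DRAM, PMEM if <nid> is PMEM), via the new fallback list.
         */
        page = __alloc_pages_node(nid,
                        GFP_HIGHUSER_MOVABLE | __GFP_SAME_NODE_TYPE, 0);

Which zonelist is walked is decided in gfp_zonelist(), see patch 5.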

This patchset is self-contained, and based on Linux 5.1-rc6.

[1]:
https://lkml.org/lkml/2018/12/26/138

Fan Du (5):
  acpi/numa: memorize NUMA node type from SRAT table
  mmzone: new pgdat flags for DRAM and PMEM
  x86,numa: update numa node type
  mm, page alloc: build fallback list on per node type basis
  mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list

 arch/x86/include/asm/numa.h |  2 ++
 arch/x86/mm/numa.c          |  3 +++
 drivers/acpi/numa.c         |  5 ++++
 include/linux/gfp.h         |  7 ++++++
 include/linux/mmzone.h      | 35 ++++++++++++++++++++++++++++
 mm/page_alloc.c             | 57 ++++++++++++++++++++++++++++++++-------------
 6 files changed, 93 insertions(+), 16 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/5] acpi/numa: memorize NUMA node type from SRAT table
  2019-04-25  1:21 [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Fan Du
@ 2019-04-25  1:21 ` Fan Du
  2019-04-25  1:21 ` [RFC PATCH 2/5] mmzone: new pgdat flags for DRAM and PMEM Fan Du
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Fan Du @ 2019-04-25  1:21 UTC (permalink / raw)
  To: akpm, mhocko, fengguang.wu, dan.j.williams, dave.hansen,
	xishi.qiuxishi, ying.huang
  Cc: linux-mm, linux-kernel, Fan Du

Mark each NUMA node as DRAM or PMEM.

This can happen at boot time (see the e820 pmem type override patch),
or on the fly when a devdax device is bound to the kmem driver.

It depends on the BIOS supplying the PMEM NUMA proximity in the SRAT
table, which current production BIOSes already do.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 arch/x86/include/asm/numa.h | 2 ++
 arch/x86/mm/numa.c          | 2 ++
 drivers/acpi/numa.c         | 5 +++++
 3 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index bbfde3d..5191198 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -30,6 +30,8 @@
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_nodes_pmem;
+extern nodemask_t numa_nodes_dram;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index dfb6c4d..3c3a1f5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -20,6 +20,8 @@
 
 int numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_nodes_pmem;
+nodemask_t numa_nodes_dram;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 867f6e3..ec4b7a7e 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -298,6 +298,11 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
 
 	node_set(node, numa_nodes_parsed);
 
+	if (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE)
+		node_set(node, numa_nodes_pmem);
+	else
+		node_set(node, numa_nodes_dram);
+
 	pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
 		node, pxm,
 		(unsigned long long) start, (unsigned long long) end - 1,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 2/5] mmzone: new pgdat flags for DRAM and PMEM
  2019-04-25  1:21 [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Fan Du
  2019-04-25  1:21 ` [RFC PATCH 1/5] acpi/numa: memorize NUMA node type from SRAT table Fan Du
@ 2019-04-25  1:21 ` Fan Du
  2019-04-25  1:21 ` [RFC PATCH 3/5] x86,numa: update numa node type Fan Du
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Fan Du @ 2019-04-25  1:21 UTC (permalink / raw)
  To: akpm, mhocko, fengguang.wu, dan.j.williams, dave.hansen,
	xishi.qiuxishi, ying.huang
  Cc: linux-mm, linux-kernel, Fan Du

On a system with both DRAM and PMEM, we need new flags to tag whether
a pgdat is made of DRAM or of persistent memory.

This patch serves as preparation for the follow-up patches.
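
A minimal sketch of the intended use (patch 3 of this series adds
exactly this at node creation time; the later patches build the
fallback lists per node type):

        /* tag the freshly created node as DRAM or PMEM, per SRAT */
        alloc_node_data(nid);
        set_node_type(nid);

        /* callers can then query the node type */
        if (is_node_pmem(nid))
                pr_info("node %d is a persistent memory node\n", nid);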

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/linux/mmzone.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..d3ee9f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -520,6 +520,8 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_DRAM,			/* Volatile DRAM memory node */
+	PGDAT_PMEM,			/* Persistent memory node */
 };
 
 enum zone_flags {
@@ -923,6 +925,30 @@ extern int numa_zonelist_order_handler(struct ctl_table *, int,
 
 #endif /* !CONFIG_NEED_MULTIPLE_NODES */
 
+static inline int is_node_pmem(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_PMEM, &pgdat->flags);
+}
+
+static inline int is_node_dram(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
+static inline void set_node_type(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (node_isset(nid, numa_nodes_pmem))
+		set_bit(PGDAT_PMEM, &pgdat->flags);
+	else
+		set_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
 extern struct pglist_data *first_online_pgdat(void);
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 3/5] x86,numa: update numa node type
  2019-04-25  1:21 [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Fan Du
  2019-04-25  1:21 ` [RFC PATCH 1/5] acpi/numa: memorize NUMA node type from SRAT table Fan Du
  2019-04-25  1:21 ` [RFC PATCH 2/5] mmzone: new pgdat flags for DRAM and PMEM Fan Du
@ 2019-04-25  1:21 ` Fan Du
  2019-04-25  1:21 ` [RFC PATCH 4/5] mm, page alloc: build fallback list on per node type basis Fan Du
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Fan Du @ 2019-04-25  1:21 UTC (permalink / raw)
  To: akpm, mhocko, fengguang.wu, dan.j.williams, dave.hansen,
	xishi.qiuxishi, ying.huang
  Cc: linux-mm, linux-kernel, Fan Du

Give the newly created node a type per SRAT attribution.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 arch/x86/mm/numa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3c3a1f5..ff8ad63 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -590,6 +590,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		set_node_type(nid);
 	}
 
 	/* Dump memblock with node info and return. */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 4/5] mm, page alloc: build fallback list on per node type basis
  2019-04-25  1:21 [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Fan Du
                   ` (2 preceding siblings ...)
  2019-04-25  1:21 ` [RFC PATCH 3/5] x86,numa: update numa node type Fan Du
@ 2019-04-25  1:21 ` Fan Du
  2019-04-25  1:21 ` [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list Fan Du
  2019-04-25  6:37 ` [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Michal Hocko
  5 siblings, 0 replies; 23+ messages in thread
From: Fan Du @ 2019-04-25  1:21 UTC (permalink / raw)
  To: akpm, mhocko, fengguang.wu, dan.j.williams, dave.hansen,
	xishi.qiuxishi, ying.huang
  Cc: linux-mm, linux-kernel, Fan Du

On a box with both DRAM and PMEM managed by the mm system,
usually nodes 0 and 1 are DRAM nodes and nodes 2 and 3 are PMEM nodes.
The nofallback lists are the same as before; the fallback lists are now
rearranged on a per node type basis, iow, an allocation request for a
DRAM page starting from node 0 will go through the
node0->node1->node2->node3 zonelist.
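
Taking the cover letter's example topology (nodes 0/1 DRAM, nodes 2/3
PMEM) and ignoring the round-robin load penalty, the rebuilt
ZONELIST_FALLBACK node order would roughly be:

node 0: 0 -> 1 -> 2 -> 3
node 1: 1 -> 0 -> 3 -> 2
node 2: 2 -> 3 -> 0 -> 1
node 3: 3 -> 2 -> 1 -> 0

i.e. same-type nodes ordered by distance first, then the other type.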

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/linux/mmzone.h |  8 ++++++++
 mm/page_alloc.c        | 42 ++++++++++++++++++++++++++----------------
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d3ee9f9..8c37e1c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -939,6 +939,14 @@ static inline int is_node_dram(int nid)
 	return test_bit(PGDAT_DRAM, &pgdat->flags);
 }
 
+static inline int is_node_same_type(int nida, int nidb)
+{
+	if (node_isset(nida, numa_nodes_pmem))
+		return node_isset(nidb, numa_nodes_pmem);
+	else
+		return node_isset(nidb, numa_nodes_dram);
+}
+
 static inline void set_node_type(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6ce20a..a408a91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5372,7 +5372,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask, int need_same_type)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -5380,7 +5380,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	const struct cpumask *tmp = cpumask_of_node(0);
 
 	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
+	if (need_same_type && !node_isset(node, *used_node_mask)) {
 		node_set(node, *used_node_mask);
 		return node;
 	}
@@ -5391,6 +5391,12 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 		if (node_isset(n, *used_node_mask))
 			continue;
 
+		if (need_same_type && !is_node_same_type(node, n))
+			continue;
+
+		if (!need_same_type && is_node_same_type(node, n))
+			continue;
+
 		/* Use the distance array to find the distance */
 		val = node_distance(node, n);
 
@@ -5472,31 +5478,35 @@ static void build_zonelists(pg_data_t *pgdat)
 	int node, load, nr_nodes = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
+	int need_same_type;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
 	load = nr_online_nodes;
 	prev_node = local_node;
-	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
+	for (need_same_type = 1; need_same_type >= 0; need_same_type--) {
+		nodes_clear(used_mask);
+		while ((node = find_next_best_node(local_node, &used_mask,
+				need_same_type)) >= 0) {
+			/*
+			 * We don't want to pressure a particular node.
+			 * So adding penalty to the first node in same
+			 * distance group to make it round-robin.
+			 */
+			if (node_distance(local_node, node) !=
+			    node_distance(local_node, prev_node))
+				node_load[node] = load;
 
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
+			node_order[nr_nodes++] = node;
+			prev_node = node;
+			load--;
+		}
 	}
-
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
+
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  1:21 [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Fan Du
                   ` (3 preceding siblings ...)
  2019-04-25  1:21 ` [RFC PATCH 4/5] mm, page alloc: build fallback list on per node type basis Fan Du
@ 2019-04-25  1:21 ` Fan Du
       [not found]   ` <a0728518-a067-4f89-a8ae-3fa279f768f2.xishi.qiuxishi@alibaba-inc.com>
  2019-04-25  6:38   ` Michal Hocko
  2019-04-25  6:37 ` [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Michal Hocko
  5 siblings, 2 replies; 23+ messages in thread
From: Fan Du @ 2019-04-25  1:21 UTC (permalink / raw)
  To: akpm, mhocko, fengguang.wu, dan.j.williams, dave.hansen,
	xishi.qiuxishi, ying.huang
  Cc: linux-mm, linux-kernel, Fan Du

On a system with heterogeneous memory, reasonable fallback lists would be:
a. No fallback, stick to the current running node.
b. Fall back to other nodes of the same type or of a different type,
   e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
c. Fall back to other nodes of the same type only,
   e.g. DRAM node 0 -> DRAM node 1

a. is already in place; the previous patch implements b., providing a
way to satisfy memory requests as best effort by default. This patch
builds c., falling back only to nodes of the same type, when the user
specifies GFP_SAME_NODE_TYPE.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/linux/gfp.h    |  7 +++++++
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 15 +++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de..ca5fdfc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -44,6 +44,8 @@
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
+#define ___GFP_SAME_NODE_TYPE	0x1000000u
+
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -215,6 +217,7 @@
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+#define __GFP_SAME_NODE_TYPE ((__force gfp_t)___GFP_SAME_NODE_TYPE)
 
 /* Room for N __GFP_FOO bits */
 #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
@@ -301,6 +304,8 @@
 			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
 
+#define GFP_SAME_NODE_TYPE (__GFP_SAME_NODE_TYPE)
+
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 #define GFP_MOVABLE_SHIFT 3
@@ -438,6 +443,8 @@ static inline int gfp_zonelist(gfp_t flags)
 #ifdef CONFIG_NUMA
 	if (unlikely(flags & __GFP_THISNODE))
 		return ZONELIST_NOFALLBACK;
+	if (unlikely(flags & __GFP_SAME_NODE_TYPE))
+		return ZONELIST_FALLBACK_SAME_TYPE;
 #endif
 	return ZONELIST_FALLBACK;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8c37e1c..2f8603e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -583,6 +583,7 @@ static inline bool zone_intersects(struct zone *zone,
 
 enum {
 	ZONELIST_FALLBACK,	/* zonelist with fallback */
+	ZONELIST_FALLBACK_SAME_TYPE,	/* zonelist with fallback to the same type node */
 #ifdef CONFIG_NUMA
 	/*
 	 * The NUMA zonelists are doubled because we need zonelists that
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a408a91..de797921 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5448,6 +5448,21 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
 	}
 	zonerefs->zone = NULL;
 	zonerefs->zone_idx = 0;
+
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK_SAME_TYPE]._zonerefs;
+
+	for (i = 0; i < nr_nodes; i++) {
+		int nr_zones;
+
+		pg_data_t *node = NODE_DATA(node_order[i]);
+
+		if (!is_node_same_type(node->node_id, pgdat->node_id))
+			continue;
+		nr_zones = build_zonerefs_node(node, zonerefs);
+		zonerefs += nr_zones;
+	}
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
       [not found]   ` <a0728518-a067-4f89-a8ae-3fa279f768f2.xishi.qiuxishi@alibaba-inc.com>
@ 2019-04-25  3:26     ` Xishi Qiu
  2019-04-25  7:45       ` Du, Fan
  0 siblings, 1 reply; 23+ messages in thread
From: Xishi Qiu @ 2019-04-25  3:26 UTC (permalink / raw)
  To: Fengguang Wu, fan.du
  Cc: akpm, Michal Hocko, Dan Williams, dave.hansen, ying.huang,
	linux-mm, Linux Kernel Mailing List

Hi Fan Du,

I think we should change the print in mminit_verify_zonelist too.

This patch changes the order of ZONELIST_FALLBACK, so the default numa policy can
alloc DRAM first, then PMEM, right?

Thanks,
Xishi Qiu
>     On system with heterogeneous memory, reasonable fall back lists woul be:
>     a. No fall back, stick to current running node.
>     b. Fall back to other nodes of the same type or different type
>        e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
>     c. Fall back to other nodes of the same type only.
>        e.g. DRAM node 0 -> DRAM node 1
> 
>     a. is already in place, previous patch implement b. providing way to
>     satisfy memory request as best effort by default. And this patch of
>     writing build c. to fallback to the same node type when user specify
>     GFP_SAME_NODE_TYPE only.
> 
>     Signed-off-by: Fan Du <fan.du@intel.com>
>     ---
>      include/linux/gfp.h    |  7 +++++++
>      include/linux/mmzone.h |  1 +
>      mm/page_alloc.c        | 15 +++++++++++++++
>      3 files changed, 23 insertions(+)
> 
>     diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>     index fdab7de..ca5fdfc 100644
>     --- a/include/linux/gfp.h
>     +++ b/include/linux/gfp.h
>     @@ -44,6 +44,8 @@
>      #else
>      #define ___GFP_NOLOCKDEP 0
>      #endif
>     +#define ___GFP_SAME_NODE_TYPE 0x1000000u
>     +
>      /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>      
>      /*
>     @@ -215,6 +217,7 @@
>      
>      /* Disable lockdep for GFP context tracking */
>      #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
>     +#define __GFP_SAME_NODE_TYPE ((__force gfp_t)___GFP_SAME_NODE_TYPE)
>      
>      /* Room for N __GFP_FOO bits */
>      #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
>     @@ -301,6 +304,8 @@
>          __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
>      #define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
>      
>     +#define GFP_SAME_NODE_TYPE (__GFP_SAME_NODE_TYPE)
>     +
>      /* Convert GFP flags to their corresponding migrate type */
>      #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
>      #define GFP_MOVABLE_SHIFT 3
>     @@ -438,6 +443,8 @@ static inline int gfp_zonelist(gfp_t flags)
>      #ifdef CONFIG_NUMA
>       if (unlikely(flags & __GFP_THISNODE))
>        return ZONELIST_NOFALLBACK;
>     + if (unlikely(flags & __GFP_SAME_NODE_TYPE))
>     +  return ZONELIST_FALLBACK_SAME_TYPE;
>      #endif
>       return ZONELIST_FALLBACK;
>      }
>     diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>     index 8c37e1c..2f8603e 100644
>     --- a/include/linux/mmzone.h
>     +++ b/include/linux/mmzone.h
>     @@ -583,6 +583,7 @@ static inline bool zone_intersects(struct zone *zone,
>      
>      enum {
>       ZONELIST_FALLBACK, /* zonelist with fallback */
>     + ZONELIST_FALLBACK_SAME_TYPE, /* zonelist with fallback to the same type node */
>      #ifdef CONFIG_NUMA
>       /*
>        * The NUMA zonelists are doubled because we need zonelists that
>     diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>     index a408a91..de797921 100644
>     --- a/mm/page_alloc.c
>     +++ b/mm/page_alloc.c
>     @@ -5448,6 +5448,21 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
>       }
>       zonerefs->zone = NULL;
>       zonerefs->zone_idx = 0;
>     +
>     + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK_SAME_TYPE]._zonerefs;
>     +
>     + for (i = 0; i < nr_nodes; i++) {
>     +  int nr_zones;
>     +
>     +  pg_data_t *node = NODE_DATA(node_order[i]);
>     +
>     +  if (!is_node_same_type(node->node_id, pgdat->node_id))
>     +   continue;
>     +  nr_zones = build_zonerefs_node(node, zonerefs);
>     +  zonerefs += nr_zones;
>     + }
>     + zonerefs->zone = NULL;
>     + zonerefs->zone_idx = 0;
>      }
>      
>      /*
>     -- 
>     1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
  2019-04-25  1:21 [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Fan Du
                   ` (4 preceding siblings ...)
  2019-04-25  1:21 ` [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list Fan Du
@ 2019-04-25  6:37 ` Michal Hocko
  2019-04-25  7:41   ` Du, Fan
  5 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-04-25  6:37 UTC (permalink / raw)
  To: Fan Du
  Cc: akpm, fengguang.wu, dan.j.williams, dave.hansen, xishi.qiuxishi,
	ying.huang, linux-mm, linux-kernel

On Thu 25-04-19 09:21:30, Fan Du wrote:
[...]
> However PMEM has different characteristics from DRAM,
> the more reasonable or desirable fallback style would be:
> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
> When DRAM is exhausted, try PMEM then. 

Why and who does care? NUMA is fundamentally about memory nodes with
different access characteristics so why is PMEM any special?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  1:21 ` [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list Fan Du
       [not found]   ` <a0728518-a067-4f89-a8ae-3fa279f768f2.xishi.qiuxishi@alibaba-inc.com>
@ 2019-04-25  6:38   ` Michal Hocko
  2019-04-25  7:43     ` Du, Fan
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-04-25  6:38 UTC (permalink / raw)
  To: Fan Du
  Cc: akpm, fengguang.wu, dan.j.williams, dave.hansen, xishi.qiuxishi,
	ying.huang, linux-mm, linux-kernel

On Thu 25-04-19 09:21:35, Fan Du wrote:
> On system with heterogeneous memory, reasonable fall back lists woul be:
> a. No fall back, stick to current running node.
> b. Fall back to other nodes of the same type or different type
>    e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
> c. Fall back to other nodes of the same type only.
>    e.g. DRAM node 0 -> DRAM node 1
> 
> a. is already in place, previous patch implement b. providing way to
> satisfy memory request as best effort by default. And this patch of
> writing build c. to fallback to the same node type when user specify
> GFP_SAME_NODE_TYPE only.

So an immediate question which should be answered by this changelog. Who
is going to use the new gfp flag? Why cannot all allocations without an
explicit numa policy fallback to all existing nodes?
 
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
>  include/linux/gfp.h    |  7 +++++++
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 15 +++++++++++++++
>  3 files changed, 23 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index fdab7de..ca5fdfc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -44,6 +44,8 @@
>  #else
>  #define ___GFP_NOLOCKDEP	0
>  #endif
> +#define ___GFP_SAME_NODE_TYPE	0x1000000u
> +
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
>  /*
> @@ -215,6 +217,7 @@
>  
>  /* Disable lockdep for GFP context tracking */
>  #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
> +#define __GFP_SAME_NODE_TYPE ((__force gfp_t)___GFP_SAME_NODE_TYPE)
>  
>  /* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
> @@ -301,6 +304,8 @@
>  			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
>  #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
>  
> +#define GFP_SAME_NODE_TYPE (__GFP_SAME_NODE_TYPE)
> +
>  /* Convert GFP flags to their corresponding migrate type */
>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
>  #define GFP_MOVABLE_SHIFT 3
> @@ -438,6 +443,8 @@ static inline int gfp_zonelist(gfp_t flags)
>  #ifdef CONFIG_NUMA
>  	if (unlikely(flags & __GFP_THISNODE))
>  		return ZONELIST_NOFALLBACK;
> +	if (unlikely(flags & __GFP_SAME_NODE_TYPE))
> +		return ZONELIST_FALLBACK_SAME_TYPE;
>  #endif
>  	return ZONELIST_FALLBACK;
>  }
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8c37e1c..2f8603e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -583,6 +583,7 @@ static inline bool zone_intersects(struct zone *zone,
>  
>  enum {
>  	ZONELIST_FALLBACK,	/* zonelist with fallback */
> +	ZONELIST_FALLBACK_SAME_TYPE,	/* zonelist with fallback to the same type node */
>  #ifdef CONFIG_NUMA
>  	/*
>  	 * The NUMA zonelists are doubled because we need zonelists that
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a408a91..de797921 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5448,6 +5448,21 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
>  	}
>  	zonerefs->zone = NULL;
>  	zonerefs->zone_idx = 0;
> +
> +	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK_SAME_TYPE]._zonerefs;
> +
> +	for (i = 0; i < nr_nodes; i++) {
> +		int nr_zones;
> +
> +		pg_data_t *node = NODE_DATA(node_order[i]);
> +
> +		if (!is_node_same_type(node->node_id, pgdat->node_id))
> +			continue;
> +		nr_zones = build_zonerefs_node(node, zonerefs);
> +		zonerefs += nr_zones;
> +	}
> +	zonerefs->zone = NULL;
> +	zonerefs->zone_idx = 0;
>  }
>  
>  /*
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
  2019-04-25  6:37 ` [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Michal Hocko
@ 2019-04-25  7:41   ` Du, Fan
  2019-04-25  7:53     ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Du, Fan @ 2019-04-25  7:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel, Du, Fan



>-----Original Message-----
>From: Michal Hocko [mailto:mhocko@kernel.org]
>Sent: Thursday, April 25, 2019 2:37 PM
>To: Du, Fan <fan.du@intel.com>
>Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
>Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>memory system
>
>On Thu 25-04-19 09:21:30, Fan Du wrote:
>[...]
>> However PMEM has different characteristics from DRAM,
>> the more reasonable or desirable fallback style would be:
>> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
>> When DRAM is exhausted, try PMEM then.
>
>Why and who does care? NUMA is fundamentally about memory nodes with
>different access characteristics so why is PMEM any special?

Michal, thanks for your comments!

The "different" lies in the local or remote access, usually the underlying
memory is the same type, i.e. DRAM.

By "special", PMEM is usually in gigantic capacity than DRAM per dimm, 
while with different read/write access latency than DRAM. Iow PMEM
sits right under DRAM in the memory tier hierarchy.

This makes PMEM to be far memory, or second class memory.
So we give first class DRAM page to user, fallback to PMEM when
necessary.

The Cloud Service Provider can use DRAM + PMEM in their system,
Leveraging method [1] to keep hot page in DRAM and warm or cold
Page in PMEM, achieve optimal performance and reduce total cost
of ownership at the same time.

[1]:
https://github.com/fengguang/memory-optimizer

>--
>Michal Hocko
>SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  6:38   ` Michal Hocko
@ 2019-04-25  7:43     ` Du, Fan
  2019-04-25  7:48       ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Du, Fan @ 2019-04-25  7:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel, Du, Fan



>-----Original Message-----
>From: Michal Hocko [mailto:mhocko@kernel.org]
>Sent: Thursday, April 25, 2019 2:38 PM
>To: Du, Fan <fan.du@intel.com>
>Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
>Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
>ZONELIST_FALLBACK_SAME_TYPE fallback list
>
>On Thu 25-04-19 09:21:35, Fan Du wrote:
>> On system with heterogeneous memory, reasonable fall back lists woul be:
>> a. No fall back, stick to current running node.
>> b. Fall back to other nodes of the same type or different type
>>    e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
>> c. Fall back to other nodes of the same type only.
>>    e.g. DRAM node 0 -> DRAM node 1
>>
>> a. is already in place, previous patch implement b. providing way to
>> satisfy memory request as best effort by default. And this patch of
>> writing build c. to fallback to the same node type when user specify
>> GFP_SAME_NODE_TYPE only.
>
>So an immediate question which should be answered by this changelog. Who
>is going to use the new gfp flag? Why cannot all allocations without an
>explicit numa policy fallback to all existing nodes?

PMEM is a good fit for frequently read pages, e.g. the page cache
(implicit page requests) or user space databases (explicit page requests).

For now this patch creates GFP_SAME_NODE_TYPE for such cases; additional
implementation will follow up.

For example:
a. Open file
b. Populate the page cache with PMEM pages if the user opened the file
   O_RDONLY (a rough sketch of this follows below)
c. Migrate frequently read pages from DRAM to PMEM,
   for cases w/o O_RDONLY.
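
A very rough sketch of how b. above could be wired up in the read path
(nothing below is implemented in this patchset; __page_cache_alloc()
and mapping_gfp_mask() are existing helpers, the rest is hypothetical
follow-up work):

        gfp_t gfp = mapping_gfp_mask(mapping);

        /*
         * For a file opened read-only, allow only nodes of the same
         * type as the preferred node; with a PMEM node preferred, the
         * page cache page comes from PMEM and falls back to other PMEM
         * nodes only.
         */
        if ((file->f_flags & O_ACCMODE) == O_RDONLY)
                gfp |= __GFP_SAME_NODE_TYPE;

        page = __page_cache_alloc(gfp);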


>> Signed-off-by: Fan Du <fan.du@intel.com>
>> ---
>>  include/linux/gfp.h    |  7 +++++++
>>  include/linux/mmzone.h |  1 +
>>  mm/page_alloc.c        | 15 +++++++++++++++
>>  3 files changed, 23 insertions(+)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index fdab7de..ca5fdfc 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -44,6 +44,8 @@
>>  #else
>>  #define ___GFP_NOLOCKDEP	0
>>  #endif
>> +#define ___GFP_SAME_NODE_TYPE	0x1000000u
>> +
>>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>>
>>  /*
>> @@ -215,6 +217,7 @@
>>
>>  /* Disable lockdep for GFP context tracking */
>>  #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
>> +#define __GFP_SAME_NODE_TYPE ((__force
>gfp_t)___GFP_SAME_NODE_TYPE)
>>
>>  /* Room for N __GFP_FOO bits */
>>  #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
>> @@ -301,6 +304,8 @@
>>  			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
>>  #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT |
>__GFP_DIRECT_RECLAIM)
>>
>> +#define GFP_SAME_NODE_TYPE (__GFP_SAME_NODE_TYPE)
>> +
>>  /* Convert GFP flags to their corresponding migrate type */
>>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
>>  #define GFP_MOVABLE_SHIFT 3
>> @@ -438,6 +443,8 @@ static inline int gfp_zonelist(gfp_t flags)
>>  #ifdef CONFIG_NUMA
>>  	if (unlikely(flags & __GFP_THISNODE))
>>  		return ZONELIST_NOFALLBACK;
>> +	if (unlikely(flags & __GFP_SAME_NODE_TYPE))
>> +		return ZONELIST_FALLBACK_SAME_TYPE;
>>  #endif
>>  	return ZONELIST_FALLBACK;
>>  }
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 8c37e1c..2f8603e 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -583,6 +583,7 @@ static inline bool zone_intersects(struct zone *zone,
>>
>>  enum {
>>  	ZONELIST_FALLBACK,	/* zonelist with fallback */
>> +	ZONELIST_FALLBACK_SAME_TYPE,	/* zonelist with fallback to the
>same type node */
>>  #ifdef CONFIG_NUMA
>>  	/*
>>  	 * The NUMA zonelists are doubled because we need zonelists that
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index a408a91..de797921 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5448,6 +5448,21 @@ static void
>build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
>>  	}
>>  	zonerefs->zone = NULL;
>>  	zonerefs->zone_idx = 0;
>> +
>> +	zonerefs =
>pgdat->node_zonelists[ZONELIST_FALLBACK_SAME_TYPE]._zonerefs;
>> +
>> +	for (i = 0; i < nr_nodes; i++) {
>> +		int nr_zones;
>> +
>> +		pg_data_t *node = NODE_DATA(node_order[i]);
>> +
>> +		if (!is_node_same_type(node->node_id, pgdat->node_id))
>> +			continue;
>> +		nr_zones = build_zonerefs_node(node, zonerefs);
>> +		zonerefs += nr_zones;
>> +	}
>> +	zonerefs->zone = NULL;
>> +	zonerefs->zone_idx = 0;
>>  }
>>
>>  /*
>> --
>> 1.8.3.1
>>
>
>--
>Michal Hocko
>SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  3:26     ` Xishi Qiu
@ 2019-04-25  7:45       ` Du, Fan
  0 siblings, 0 replies; 23+ messages in thread
From: Du, Fan @ 2019-04-25  7:45 UTC (permalink / raw)
  To: Xishi Qiu, Wu, Fengguang
  Cc: akpm, Michal Hocko, Williams, Dan J, Hansen, Dave, Huang, Ying,
	linux-mm, Linux Kernel Mailing List, Du, Fan



>-----Original Message-----
>From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>Behalf Of Xishi Qiu
>Sent: Thursday, April 25, 2019 11:26 AM
>To: Wu, Fengguang <fengguang.wu@intel.com>; Du, Fan <fan.du@intel.com>
>Cc: akpm@linux-foundation.org; Michal Hocko <mhocko@suse.com>;
>Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; Huang, Ying <ying.huang@intel.com>;
>linux-mm@kvack.org; Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
>Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
>ZONELIST_FALLBACK_SAME_TYPE fallback list
>
>Hi Fan Du,
>
>I think we should change the print in mminit_verify_zonelist too.
>
>This patch changes the order of ZONELIST_FALLBACK, so the default numa
>policy can
>alloc DRAM first, then PMEM, right?

Yes, you are right. :)

>Thanks,
>Xishi Qiu
>>     On system with heterogeneous memory, reasonable fall back lists woul be:
>>     a. No fall back, stick to current running node.
>>     b. Fall back to other nodes of the same type or different type
>>        e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
>>     c. Fall back to other nodes of the same type only.
>>        e.g. DRAM node 0 -> DRAM node 1
>>
>>     a. is already in place, previous patch implement b. providing way to
>>     satisfy memory request as best effort by default. And this patch of
>>     writing build c. to fallback to the same node type when user specify
>>     GFP_SAME_NODE_TYPE only.
>>
>>     Signed-off-by: Fan Du <fan.du@intel.com>
>>     ---
>>      include/linux/gfp.h    |  7 +++++++
>>      include/linux/mmzone.h |  1 +
>>      mm/page_alloc.c        | 15 +++++++++++++++
>>      3 files changed, 23 insertions(+)
>>
>>     diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>     index fdab7de..ca5fdfc 100644
>>     --- a/include/linux/gfp.h
>>     +++ b/include/linux/gfp.h
>>     @@ -44,6 +44,8 @@
>>      #else
>>      #define ___GFP_NOLOCKDEP 0
>>      #endif
>>     +#define ___GFP_SAME_NODE_TYPE 0x1000000u
>>     +
>>      /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>>      
>>      /*
>>     @@ -215,6 +217,7 @@
>>      
>>      /* Disable lockdep for GFP context tracking */
>>      #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
>>     +#define __GFP_SAME_NODE_TYPE ((__force gfp_t)___GFP_SAME_NODE_TYPE)
>>      
>>      /* Room for N __GFP_FOO bits */
>>      #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
>>     @@ -301,6 +304,8 @@
>>          __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
>>      #define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
>>      
>>     +#define GFP_SAME_NODE_TYPE (__GFP_SAME_NODE_TYPE)
>>     +
>>      /* Convert GFP flags to their corresponding migrate type */
>>      #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
>>      #define GFP_MOVABLE_SHIFT 3
>>     @@ -438,6 +443,8 @@ static inline int gfp_zonelist(gfp_t flags)
>>      #ifdef CONFIG_NUMA
>>       if (unlikely(flags & __GFP_THISNODE))
>>        return ZONELIST_NOFALLBACK;
>>     + if (unlikely(flags & __GFP_SAME_NODE_TYPE))
>>     +  return ZONELIST_FALLBACK_SAME_TYPE;
>>      #endif
>>       return ZONELIST_FALLBACK;
>>      }
>>     diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>     index 8c37e1c..2f8603e 100644
>>     --- a/include/linux/mmzone.h
>>     +++ b/include/linux/mmzone.h
>>     @@ -583,6 +583,7 @@ static inline bool zone_intersects(struct zone *zone,
>>      
>>      enum {
>>       ZONELIST_FALLBACK, /* zonelist with fallback */
>>     + ZONELIST_FALLBACK_SAME_TYPE, /* zonelist with fallback to the same type node */
>>      #ifdef CONFIG_NUMA
>>       /*
>>        * The NUMA zonelists are doubled because we need zonelists that
>>     diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>     index a408a91..de797921 100644
>>     --- a/mm/page_alloc.c
>>     +++ b/mm/page_alloc.c
>>     @@ -5448,6 +5448,21 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
>>       }
>>       zonerefs->zone = NULL;
>>       zonerefs->zone_idx = 0;
>>     +
>>     + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK_SAME_TYPE]._zonerefs;
>>     +
>>     + for (i = 0; i < nr_nodes; i++) {
>>     +  int nr_zones;
>>     +
>>     +  pg_data_t *node = NODE_DATA(node_order[i]);
>>     +
>>     +  if (!is_node_same_type(node->node_id, pgdat->node_id))
>>     +   continue;
>>     +  nr_zones = build_zonerefs_node(node, zonerefs);
>>     +  zonerefs += nr_zones;
>>     + }
>>     + zonerefs->zone = NULL;
>>     + zonerefs->zone_idx = 0;
>>      }
>>      
>>      /*
>>     -- 
>>     1.8.3.1
>>
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  7:43     ` Du, Fan
@ 2019-04-25  7:48       ` Michal Hocko
  2019-04-25  7:55         ` Du, Fan
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-04-25  7:48 UTC (permalink / raw)
  To: Du, Fan
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel

On Thu 25-04-19 07:43:09, Du, Fan wrote:
> 
> 
> >-----Original Message-----
> >From: Michal Hocko [mailto:mhocko@kernel.org]
> >Sent: Thursday, April 25, 2019 2:38 PM
> >To: Du, Fan <fan.du@intel.com>
> >Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
> ><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> >Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
> >ZONELIST_FALLBACK_SAME_TYPE fallback list
> >
> >On Thu 25-04-19 09:21:35, Fan Du wrote:
> >> On system with heterogeneous memory, reasonable fall back lists woul be:
> >> a. No fall back, stick to current running node.
> >> b. Fall back to other nodes of the same type or different type
> >>    e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
> >> c. Fall back to other nodes of the same type only.
> >>    e.g. DRAM node 0 -> DRAM node 1
> >>
> >> a. is already in place, previous patch implement b. providing way to
> >> satisfy memory request as best effort by default. And this patch of
> >> writing build c. to fallback to the same node type when user specify
> >> GFP_SAME_NODE_TYPE only.
> >
> >So an immediate question which should be answered by this changelog. Who
> >is going to use the new gfp flag? Why cannot all allocations without an
> >explicit numa policy fallback to all existing nodes?
> 
> PMEM is good for frequently read accessed page, e.g. page cache(implicit page
> request), or user space data base (explicit page request)
> For now this patch create GFP_SAME_NODE_TYPE for such cases, additional
> Implementation will be followed up.

Then simply configure that NUMA node as movable and you get these
allocations for any movable allocation. I am not really convinced a new
gfp flag is really justified.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
  2019-04-25  7:41   ` Du, Fan
@ 2019-04-25  7:53     ` Michal Hocko
  2019-04-25  8:05       ` Du, Fan
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-04-25  7:53 UTC (permalink / raw)
  To: Du, Fan
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel

On Thu 25-04-19 07:41:40, Du, Fan wrote:
> 
> 
> >-----Original Message-----
> >From: Michal Hocko [mailto:mhocko@kernel.org]
> >Sent: Thursday, April 25, 2019 2:37 PM
> >To: Du, Fan <fan.du@intel.com>
> >Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
> ><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >memory system
> >
> >On Thu 25-04-19 09:21:30, Fan Du wrote:
> >[...]
> >> However PMEM has different characteristics from DRAM,
> >> the more reasonable or desirable fallback style would be:
> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
> >> When DRAM is exhausted, try PMEM then.
> >
> >Why and who does care? NUMA is fundamentally about memory nodes with
> >different access characteristics so why is PMEM any special?
> 
> Michal, thanks for your comments!
> 
> The "different" lies in the local or remote access, usually the underlying
> memory is the same type, i.e. DRAM.
> 
> By "special", PMEM is usually in gigantic capacity than DRAM per dimm, 
> while with different read/write access latency than DRAM.

You are describing NUMA in general here. Yes, access to different NUMA
nodes has different read/write latency. But that doesn't make PMEM
really special compared to regular DRAM. There are a few other people
trying to work with PMEM as NUMA nodes and these kinds of arguments keep
repeating again and again. So far I haven't really heard much beyond
hand waving. Please go and read through those discussions so that we do
not have to go through the same set of arguments again.

I absolutely do see and understand that people want to find a way to use
their shiny NVDIMMs, but please step back and try to think in more
general terms than "PMEM is special and we have to treat it that way".
We currently have ways to use it as a DAX device and as a NUMA node, so
focus on how to improve our NUMA handling to get the maximum out of the
HW rather than making a PMEM NUMA node a special snowflake.

Thank you.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  7:48       ` Michal Hocko
@ 2019-04-25  7:55         ` Du, Fan
  2019-04-25  8:09           ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Du, Fan @ 2019-04-25  7:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel, Du, Fan



>-----Original Message-----
>From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>Behalf Of Michal Hocko
>Sent: Thursday, April 25, 2019 3:49 PM
>To: Du, Fan <fan.du@intel.com>
>Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
>Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
>ZONELIST_FALLBACK_SAME_TYPE fallback list
>
>On Thu 25-04-19 07:43:09, Du, Fan wrote:
>>
>>
>> >-----Original Message-----
>> >From: Michal Hocko [mailto:mhocko@kernel.org]
>> >Sent: Thursday, April 25, 2019 2:38 PM
>> >To: Du, Fan <fan.du@intel.com>
>> >Cc: akpm@linux-foundation.org; Wu, Fengguang
><fengguang.wu@intel.com>;
>> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
>> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
>> ><ying.huang@intel.com>; linux-mm@kvack.org;
>linux-kernel@vger.kernel.org
>> >Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
>> >ZONELIST_FALLBACK_SAME_TYPE fallback list
>> >
>> >On Thu 25-04-19 09:21:35, Fan Du wrote:
>> >> On system with heterogeneous memory, reasonable fall back lists woul
>be:
>> >> a. No fall back, stick to current running node.
>> >> b. Fall back to other nodes of the same type or different type
>> >>    e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node
>3
>> >> c. Fall back to other nodes of the same type only.
>> >>    e.g. DRAM node 0 -> DRAM node 1
>> >>
>> >> a. is already in place, previous patch implement b. providing way to
>> >> satisfy memory request as best effort by default. And this patch of
>> >> writing build c. to fallback to the same node type when user specify
>> >> GFP_SAME_NODE_TYPE only.
>> >
>> >So an immediate question which should be answered by this changelog.
>Who
>> >is going to use the new gfp flag? Why cannot all allocations without an
>> >explicit numa policy fallback to all existing nodes?
>>
>> PMEM is good for frequently read accessed page, e.g. page cache(implicit
>page
>> request), or user space data base (explicit page request)
>> For now this patch create GFP_SAME_NODE_TYPE for such cases, additional
>> Implementation will be followed up.
>
>Then simply configure that NUMA node as movable and you get these
>allocations for any movable allocation. I am not really convinced a new
>gfp flag is really justified.

Case 1: frequently written and/or read pages deserve DRAM
Case 2: frequently read pages deserve PMEM

We need something like a new gfp flag to tell these two cases apart.

>--
>Michal Hocko
>SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
  2019-04-25  7:53     ` Michal Hocko
@ 2019-04-25  8:05       ` Du, Fan
  2019-04-25 15:43           ` Dan Williams
  0 siblings, 1 reply; 23+ messages in thread
From: Du, Fan @ 2019-04-25  8:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel, Du, Fan



>-----Original Message-----
>From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>Behalf Of Michal Hocko
>Sent: Thursday, April 25, 2019 3:54 PM
>To: Du, Fan <fan.du@intel.com>
>Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
>Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>memory system
>
>On Thu 25-04-19 07:41:40, Du, Fan wrote:
>>
>>
>> >-----Original Message-----
>> >From: Michal Hocko [mailto:mhocko@kernel.org]
>> >Sent: Thursday, April 25, 2019 2:37 PM
>> >To: Du, Fan <fan.du@intel.com>
>> >Cc: akpm@linux-foundation.org; Wu, Fengguang
><fengguang.wu@intel.com>;
>> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
>> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
>> ><ying.huang@intel.com>; linux-mm@kvack.org;
>linux-kernel@vger.kernel.org
>> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >memory system
>> >
>> >On Thu 25-04-19 09:21:30, Fan Du wrote:
>> >[...]
>> >> However PMEM has different characteristics from DRAM,
>> >> the more reasonable or desirable fallback style would be:
>> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
>> >> When DRAM is exhausted, try PMEM then.
>> >
>> >Why and who does care? NUMA is fundamentally about memory nodes
>with
>> >different access characteristics so why is PMEM any special?
>>
>> Michal, thanks for your comments!
>>
>> The "different" lies in the local or remote access, usually the underlying
>> memory is the same type, i.e. DRAM.
>>
>> By "special", PMEM is usually in gigantic capacity than DRAM per dimm,
>> while with different read/write access latency than DRAM.
>
>You are describing a NUMA in general here. Yes access to different NUMA
>nodes has a different read/write latency. But that doesn't make PMEM
>really special from a regular DRAM. 

It is not the NUMA distance between the CPU and the PMEM node that makes
PMEM different from DRAM. The difference lies in the physical layer: the
access latency characteristics come from the media itself.

>There are few other people trying to
>work with PMEM as NUMA nodes and these kind of arguments are repeating
>again and again. So far I haven't really heard much beyond hand waving.
>Please go and read through those discussion so that we do not have to go
>throug the same set of arguments again.
>
>I absolutely do see and understand people want to find a way to use
>their shiny NVIDIMs but please step back and try to think in more
>general terms than PMEM is special and we have to treat it that way.
>We currently have ways to use it as DAX device and a NUMA node then
>focus on how to improve our NUMA handling so that we can get maximum
>out
>of the HW rather than make a PMEM NUMA node a special snow flake.
>
>Thank you.
>
>--
>Michal Hocko
>SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  7:55         ` Du, Fan
@ 2019-04-25  8:09           ` Michal Hocko
  2019-04-25  8:20             ` Du, Fan
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-04-25  8:09 UTC (permalink / raw)
  To: Du, Fan
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel

On Thu 25-04-19 07:55:58, Du, Fan wrote:
> >> PMEM is good for frequently read accessed page, e.g. page cache(implicit
> >> page
> >> request), or user space data base (explicit page request)
> >> For now this patch create GFP_SAME_NODE_TYPE for such cases, additional
> >> Implementation will be followed up.
> >
> >Then simply configure that NUMA node as movable and you get these
> >allocations for any movable allocation. I am not really convinced a new
> >gfp flag is really justified.
> 
> Case 1: frequently write and/or read accessed page deserved to DRAM

NUMA balancing

> Case 2: frequently read accessed page deserved to PMEM

memory reclaim to move those pages to a more distant node (e.g. a PMEM).

Btw. none of the above is a static thing you would easily know at the
allocation time.

Please spare some time reading surrounding discussions - e.g.
http://lkml.kernel.org/r/1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  8:09           ` Michal Hocko
@ 2019-04-25  8:20             ` Du, Fan
  2019-04-25  8:43               ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Du, Fan @ 2019-04-25  8:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel, Du, Fan



>-----Original Message-----
>From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>Behalf Of Michal Hocko
>Sent: Thursday, April 25, 2019 4:10 PM
>To: Du, Fan <fan.du@intel.com>
>Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
>Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
>ZONELIST_FALLBACK_SAME_TYPE fallback list
>
>On Thu 25-04-19 07:55:58, Du, Fan wrote:
>> >> PMEM is good for frequently read accessed page, e.g. page cache(implicit
>> >> page
>> >> request), or user space data base (explicit page request)
>> >> For now this patch create GFP_SAME_NODE_TYPE for such cases,
>additional
>> >> Implementation will be followed up.
>> >
>> >Then simply configure that NUMA node as movable and you get these
>> >allocations for any movable allocation. I am not really convinced a new
>> >gfp flag is really justified.
>>
>> Case 1: frequently write and/or read accessed page deserved to DRAM
>
>NUMA balancing

Sorry, I meant the page cache case here.
NUMA balancing only works for pages that are mapped via page tables.

>> Case 2: frequently read accessed page deserved to PMEM
>
>memory reclaim to move those pages to a more distant node (e.g. a PMEM).
>
>Btw. none of the above is a static thing you would easily know at the
>allocation time.
>
>Please spare some time reading surrounding discussions - e.g.
>http://lkml.kernel.org/r/1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com

Thanks for the point.

>Michal Hocko
>SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  8:20             ` Du, Fan
@ 2019-04-25  8:43               ` Michal Hocko
  2019-04-25  9:18                 ` Du, Fan
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-04-25  8:43 UTC (permalink / raw)
  To: Du, Fan
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel

On Thu 25-04-19 08:20:28, Du, Fan wrote:
> 
> 
> >-----Original Message-----
> >From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
> >Behalf Of Michal Hocko
> >Sent: Thursday, April 25, 2019 4:10 PM
> >To: Du, Fan <fan.du@intel.com>
> >Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
> ><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> >Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
> >ZONELIST_FALLBACK_SAME_TYPE fallback list
> >
> >On Thu 25-04-19 07:55:58, Du, Fan wrote:
> >> >> PMEM is good for frequently read accessed page, e.g. page cache(implicit
> >> >> page
> >> >> request), or user space data base (explicit page request)
> >> >> For now this patch create GFP_SAME_NODE_TYPE for such cases,
> >additional
> >> >> Implementation will be followed up.
> >> >
> >> >Then simply configure that NUMA node as movable and you get these
> >> >allocations for any movable allocation. I am not really convinced a new
> >> >gfp flag is really justified.
> >>
> >> Case 1: frequently write and/or read accessed page deserved to DRAM
> >
> >NUMA balancing
> 
> Sorry, I meant the page cache case here.
> NUMA balancing only works for pages that are mapped through page tables.

I would still expect that remote PMEM node access latency is smaller
than, or at least comparable to, that of real storage, so the promotion
part is not that important for the unmapped page cache. Maybe I am wrong
here, but that really begs for some experiments before we start adding
special casing.
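
Something like the following pointer-chasing sketch could be a starting
point for such an experiment. It is only a rough userspace sketch under a
few assumptions: libnuma is available, the PMEM range is onlined as a
regular NUMA node (e.g. via the dax/kmem driver), and the node IDs in
main() are adjusted to the actual topology. Build with
"gcc -O2 chase.c -lnuma".

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16UL * 1024 * 1024)	/* 16M pointer slots = 128 MB on 64-bit */

static double chase_ns(int node)
{
	void **buf = numa_alloc_onnode(N * sizeof(void *), node);
	size_t *idx = malloc(N * sizeof(size_t));
	struct timespec t0, t1;
	void * volatile p;
	size_t i, j, tmp;

	if (!buf || !idx) {
		perror("alloc");
		exit(1);
	}

	/* Shuffle the slot indices and link them into one big random
	 * cycle so the hardware prefetcher cannot hide the latency. */
	for (i = 0; i < N; i++)
		idx[i] = i;
	for (i = N - 1; i > 0; i--) {
		j = rand() % (i + 1);
		tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
	}
	for (i = 0; i < N; i++)
		buf[idx[i]] = &buf[idx[(i + 1) % N]];

	p = &buf[idx[0]];
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < N; i++)		/* one full lap of the cycle */
		p = *(void **)p;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	free(idx);
	numa_free(buf, N * sizeof(void *));
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / N;
}

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	numa_set_bind_policy(1);	/* make numa_alloc_onnode() strict */
	/* Assumed node IDs: 0 is a DRAM node, 2 is a PMEM-backed node. */
	printf("node 0: %.1f ns/load\n", chase_ns(0));
	printf("node 2: %.1f ns/load\n", chase_ns(2));
	return 0;
}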
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
  2019-04-25  8:43               ` Michal Hocko
@ 2019-04-25  9:18                 ` Du, Fan
  0 siblings, 0 replies; 23+ messages in thread
From: Du, Fan @ 2019-04-25  9:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Wu, Fengguang, Williams, Dan J, Hansen, Dave,
	xishi.qiuxishi, Huang, Ying, linux-mm, linux-kernel, Du, Fan



>-----Original Message-----
>From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>Behalf Of Michal Hocko
>Sent: Thursday, April 25, 2019 4:43 PM
>To: Du, Fan <fan.du@intel.com>
>Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
>Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
>ZONELIST_FALLBACK_SAME_TYPE fallback list
>
>On Thu 25-04-19 08:20:28, Du, Fan wrote:
>>
>>
>> >-----Original Message-----
>> >From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>> >Behalf Of Michal Hocko
>> >Sent: Thursday, April 25, 2019 4:10 PM
>> >To: Du, Fan <fan.du@intel.com>
>> >Cc: akpm@linux-foundation.org; Wu, Fengguang
><fengguang.wu@intel.com>;
>> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
>> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
>> ><ying.huang@intel.com>; linux-mm@kvack.org;
>linux-kernel@vger.kernel.org
>> >Subject: Re: [RFC PATCH 5/5] mm, page_alloc: Introduce
>> >ZONELIST_FALLBACK_SAME_TYPE fallback list
>> >
>> >On Thu 25-04-19 07:55:58, Du, Fan wrote:
>> >> >> PMEM is good for frequently read accessed page, e.g. page
>cache(implicit
>> >> >> page
>> >> >> request), or user space data base (explicit page request)
>> >> >> For now this patch create GFP_SAME_NODE_TYPE for such cases,
>> >additional
>> >> >> Implementation will be followed up.
>> >> >
>> >> >Then simply configure that NUMA node as movable and you get these
>> >> >allocations for any movable allocation. I am not really convinced a new
>> >> >gfp flag is really justified.
>> >>
>> >> Case 1: frequently write and/or read accessed page deserved to DRAM
>> >
>> >NUMA balancing
>>
>> Sorry, I meant the page cache case here.
>> NUMA balancing only works for pages that are mapped through page tables.
>
>I would still expect that remote PMEM node access latency is smaller
>than, or at least comparable to, that of real storage, so the promotion
>part is not that important for the unmapped page cache. Maybe I am wrong
>here, but that really begs for some experiments before we start adding
>special casing.

I understand your concern :). Please refer to the following summary from a third party:
https://arxiv.org/pdf/1903.05714.pdf


>--
>Michal Hocko
>SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
  2019-04-25  8:05       ` Du, Fan
@ 2019-04-25 15:43           ` Dan Williams
  0 siblings, 0 replies; 23+ messages in thread
From: Dan Williams @ 2019-04-25 15:43 UTC (permalink / raw)
  To: Du, Fan
  Cc: Michal Hocko, akpm, Wu, Fengguang, Hansen, Dave, xishi.qiuxishi,
	Huang, Ying, linux-mm, linux-kernel

On Thu, Apr 25, 2019 at 1:05 AM Du, Fan <fan.du@intel.com> wrote:
>
>
>
> >-----Original Message-----
> >From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
> >Behalf Of Michal Hocko
> >Sent: Thursday, April 25, 2019 3:54 PM
> >To: Du, Fan <fan.du@intel.com>
> >Cc: akpm@linux-foundation.org; Wu, Fengguang <fengguang.wu@intel.com>;
> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
> ><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >memory system
> >
> >On Thu 25-04-19 07:41:40, Du, Fan wrote:
> >>
> >>
> >> >-----Original Message-----
> >> >From: Michal Hocko [mailto:mhocko@kernel.org]
> >> >Sent: Thursday, April 25, 2019 2:37 PM
> >> >To: Du, Fan <fan.du@intel.com>
> >> >Cc: akpm@linux-foundation.org; Wu, Fengguang
> ><fengguang.wu@intel.com>;
> >> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
> >> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
> >> ><ying.huang@intel.com>; linux-mm@kvack.org;
> >linux-kernel@vger.kernel.org
> >> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >> >memory system
> >> >
> >> >On Thu 25-04-19 09:21:30, Fan Du wrote:
> >> >[...]
> >> >> However PMEM has different characteristics from DRAM,
> >> >> the more reasonable or desirable fallback style would be:
> >> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
> >> >> When DRAM is exhausted, try PMEM then.
> >> >
> >> >Why and who does care? NUMA is fundamentally about memory nodes
> >with
> >> >different access characteristics so why is PMEM any special?
> >>
> >> Michal, thanks for your comments!
> >>
> >> The "different" lies in the local or remote access, usually the underlying
> >> memory is the same type, i.e. DRAM.
> >>
> >> By "special", PMEM is usually in gigantic capacity than DRAM per dimm,
> >> while with different read/write access latency than DRAM.
> >
> >You are describing a NUMA in general here. Yes access to different NUMA
> >nodes has a different read/write latency. But that doesn't make PMEM
> >really special from a regular DRAM.
>
> Not the numa distance b/w cpu and PMEM node make PMEM different than
> DRAM. The difference lies in the physical layer. The access latency characteristics
> comes from media level.

No, there is no such thing as a "PMEM node". I've pushed back on this
broken concept in the past [1] [2]. Consider that PMEM could be as
fast as DRAM for technologies like NVDIMM-N or in emulation
environments. These attempts to look at persistence as an attribute of
performance are entirely missing the point that the system can have
multiple varied memory types and the platform firmware needs to
enumerate these performance properties in the HMAT on ACPI platforms.
Any scheme that only considers a binary DRAM and not-DRAM property is
immediately invalidated the moment the OS needs to consider a 3rd or
4th memory type, or a more varied connection topology.

[1]: https://lore.kernel.org/lkml/CAPcyv4heiUbZvP7Ewoy-Hy=-mPrdjCjEuSw+0rwdOUHdjwetxg@mail.gmail.com/

[2]: https://lore.kernel.org/lkml/CAPcyv4it1w7SdDVBV24cRCVHtLb3s1pVB5+SDM02Uw4RbahKiA@mail.gmail.com/
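
For reference, a small sketch of how such HMAT-enumerated properties can
reach userspace. This assumes a kernel newer than the 5.1 base of this
series, where the per-node access attributes are exported under
/sys/devices/system/node/nodeX/access0/initiators/ (they are not there
on 5.1); on older kernels the loop below simply prints nothing.

#include <stdio.h>

int main(void)
{
	const char *attrs[] = { "read_latency", "write_latency",
				"read_bandwidth", "write_bandwidth" };
	char path[128], buf[64];
	int node, i;

	for (node = 0; node < 16; node++) {	/* probe the first few nodes */
		for (i = 0; i < 4; i++) {
			FILE *f;

			snprintf(path, sizeof(path),
				 "/sys/devices/system/node/node%d/access0/initiators/%s",
				 node, attrs[i]);
			f = fopen(path, "r");
			if (!f)
				continue;	/* node absent or no HMAT data */
			if (fgets(buf, sizeof(buf), f))
				printf("node%d %s: %s", node, attrs[i], buf);
			fclose(f);
		}
	}
	return 0;
}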

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
  2019-04-25 15:43           ` Dan Williams
@ 2019-04-26  2:40           ` Du, Fan
  -1 siblings, 0 replies; 23+ messages in thread
From: Du, Fan @ 2019-04-26  2:40 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: Michal Hocko, akpm, Wu, Fengguang, Hansen, Dave, xishi.qiuxishi,
	Huang, Ying, linux-mm, linux-kernel, Du, Fan



>-----Original Message-----
>From: Dan Williams [mailto:dan.j.williams@intel.com]
>Sent: Thursday, April 25, 2019 11:43 PM
>To: Du, Fan <fan.du@intel.com>
>Cc: Michal Hocko <mhocko@kernel.org>; akpm@linux-foundation.org; Wu,
>Fengguang <fengguang.wu@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
><ying.huang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>memory system
>
>On Thu, Apr 25, 2019 at 1:05 AM Du, Fan <fan.du@intel.com> wrote:
>>
>>
>>
>> >-----Original Message-----
>> >From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>> >Behalf Of Michal Hocko
>> >Sent: Thursday, April 25, 2019 3:54 PM
>> >To: Du, Fan <fan.du@intel.com>
>> >Cc: akpm@linux-foundation.org; Wu, Fengguang
><fengguang.wu@intel.com>;
>> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
>> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
>> ><ying.huang@intel.com>; linux-mm@kvack.org;
>linux-kernel@vger.kernel.org
>> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >memory system
>> >
>> >On Thu 25-04-19 07:41:40, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Michal Hocko [mailto:mhocko@kernel.org]
>> >> >Sent: Thursday, April 25, 2019 2:37 PM
>> >> >To: Du, Fan <fan.du@intel.com>
>> >> >Cc: akpm@linux-foundation.org; Wu, Fengguang
>> ><fengguang.wu@intel.com>;
>> >> >Williams, Dan J <dan.j.williams@intel.com>; Hansen, Dave
>> >> ><dave.hansen@intel.com>; xishi.qiuxishi@alibaba-inc.com; Huang, Ying
>> >> ><ying.huang@intel.com>; linux-mm@kvack.org;
>> >linux-kernel@vger.kernel.org
>> >> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >> >memory system
>> >> >
>> >> >On Thu 25-04-19 09:21:30, Fan Du wrote:
>> >> >[...]
>> >> >> However PMEM has different characteristics from DRAM,
>> >> >> the more reasonable or desirable fallback style would be:
>> >> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
>> >> >> When DRAM is exhausted, try PMEM then.
>> >> >
>> >> >Why and who does care? NUMA is fundamentally about memory nodes
>> >with
>> >> >different access characteristics so why is PMEM any special?
>> >>
>> >> Michal, thanks for your comments!
>> >>
>> >> The "different" lies in the local or remote access, usually the underlying
>> >> memory is the same type, i.e. DRAM.
>> >>
>> >> By "special", PMEM is usually in gigantic capacity than DRAM per dimm,
>> >> while with different read/write access latency than DRAM.
>> >
>> >You are describing a NUMA in general here. Yes access to different NUMA
>> >nodes has a different read/write latency. But that doesn't make PMEM
>> >really special from a regular DRAM.
>>
>> Not the numa distance b/w cpu and PMEM node make PMEM different
>than
>> DRAM. The difference lies in the physical layer. The access latency
>characteristics
>> comes from media level.
>
>No, there is no such thing as a "PMEM node". I've pushed back on this
>broken concept in the past [1] [2]. Consider that PMEM could be as
>fast as DRAM for technologies like NVDIMM-N or in emulation
>environments. These attempts to look at persistence as an attribute of
>performance are entirely missing the point that the system can have
>multiple varied memory types and the platform firmware needs to
>enumerate these performance properties in the HMAT on ACPI platforms.
>Any scheme that only considers a binary DRAM and not-DRAM property is
>immediately invalidated the moment the OS needs to consider a 3rd or
>4th memory type, or a more varied connection topology.

Dan, Thanks for your comments!

I've understood your point since the very first of your posts on this.
Below is what I have in mind, speaking as a standalone individual
contributor only:
a. I fully recognize what HMAT is designed for.
b. I understand your point that the DRAM/PMEM "type" split is temporary,
   and I think you are right about it.

A generic approach is indeed required; however, I want to elaborate on
the problem I'm trying to solve for customers, not on how we or other
people might solve it one way or another.

Customers want to fully utilize system memory, no matter whether it is
DRAM, first-generation PMEM, or a future generation of PMEM that beats
DRAM. They also want explicit, coarse-grained control over memory
allocation across the different latency/bandwidth tiers.

Maybe it is more worthwhile to think about what is essentially needed to
solve the problem, and to make sure it scales well enough for some time.

a. Build the fallback list for a heterogeneous system.
   I prefer to build it from the HMAT, because the HMAT exposes
   latency/bandwidth from the local node's perspective and is already
   standardized in the ACPI spec. NUMA node distance from the SLIT is no
   longer accurate enough for a heterogeneous memory system. (A minimal
   userspace sketch of this ordering follows below.)

b. Provide an explicit page allocation option for frequently read pages.
   This requirement is well justified too: any scenario, in the kernel
   or at user level, that does not care about write latency should be
   able to use such an option to achieve better overall performance.

c. NUMA balancing for a heterogeneous system.
   I'm aware of this topic, but it is not what I have in mind for (a)
   and (b) right now.
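
To make point (a) concrete, below is a minimal userspace sketch of the
intended ordering, not the kernel code from patch 4/5 of this series.
The node IDs, types and distances in the table are illustrative
assumptions, chosen so that the nearest PMEM node is closer than the
remote DRAM node, which is exactly where distance-only ordering and
type-then-distance ordering disagree.

#include <stdio.h>
#include <stdlib.h>

struct node {
	int nid;
	int is_pmem;	/* 0 = DRAM-like, 1 = PMEM-like */
	int distance;	/* SLIT-style distance from the allocating node 0 */
};

/* Current behaviour: order strictly by distance. */
static int by_distance(const void *a, const void *b)
{
	return ((const struct node *)a)->distance -
	       ((const struct node *)b)->distance;
}

/* Proposed behaviour: same memory type first, then distance.
 * The allocating node is assumed to be DRAM, so DRAM sorts first. */
static int by_type_then_distance(const void *a, const void *b)
{
	const struct node *na = a, *nb = b;

	if (na->is_pmem != nb->is_pmem)
		return na->is_pmem - nb->is_pmem;
	return na->distance - nb->distance;
}

static void show(const char *title, const struct node *n, int cnt)
{
	int i;

	printf("%s:", title);
	for (i = 0; i < cnt; i++)
		printf(" node%d(%s,%d)", n[i].nid,
		       n[i].is_pmem ? "PMEM" : "DRAM", n[i].distance);
	printf("\n");
}

int main(void)
{
	struct node nodes[] = {
		{ 0, 0, 10 },	/* local DRAM */
		{ 1, 0, 21 },	/* remote DRAM */
		{ 2, 1, 14 },	/* near PMEM, closer than remote DRAM */
		{ 3, 1, 24 },	/* far PMEM */
	};
	int cnt = sizeof(nodes) / sizeof(nodes[0]);

	qsort(nodes, cnt, sizeof(nodes[0]), by_distance);
	show("distance-only fallback (current)", nodes, cnt);

	qsort(nodes, cnt, sizeof(nodes[0]), by_type_then_distance);
	show("same-type-first fallback (proposed)", nodes, cnt);
	return 0;
}

On the assumed topology, the first order interleaves DRAM and PMEM nodes
by distance, while the second exhausts the DRAM nodes before touching PMEM.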


>[1]:
>https://lore.kernel.org/lkml/CAPcyv4heiUbZvP7Ewoy-Hy=-mPrdjCjEuSw+0rwd
>OUHdjwetxg@mail.gmail.com/
>
>[2]:
>https://lore.kernel.org/lkml/CAPcyv4it1w7SdDVBV24cRCVHtLb3s1pVB5+SDM0
>2Uw4RbahKiA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-04-26  2:40 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-25  1:21 [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Fan Du
2019-04-25  1:21 ` [RFC PATCH 1/5] acpi/numa: memorize NUMA node type from SRAT table Fan Du
2019-04-25  1:21 ` [RFC PATCH 2/5] mmzone: new pgdat flags for DRAM and PMEM Fan Du
2019-04-25  1:21 ` [RFC PATCH 3/5] x86,numa: update numa node type Fan Du
2019-04-25  1:21 ` [RFC PATCH 4/5] mm, page alloc: build fallback list on per node type basis Fan Du
2019-04-25  1:21 ` [RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list Fan Du
     [not found]   ` <a0728518-a067-4f89-a8ae-3fa279f768f2.xishi.qiuxishi@alibaba-inc.com>
2019-04-25  3:26     ` Xishi Qiu
2019-04-25  7:45       ` Du, Fan
2019-04-25  6:38   ` Michal Hocko
2019-04-25  7:43     ` Du, Fan
2019-04-25  7:48       ` Michal Hocko
2019-04-25  7:55         ` Du, Fan
2019-04-25  8:09           ` Michal Hocko
2019-04-25  8:20             ` Du, Fan
2019-04-25  8:43               ` Michal Hocko
2019-04-25  9:18                 ` Du, Fan
2019-04-25  6:37 ` [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system Michal Hocko
2019-04-25  7:41   ` Du, Fan
2019-04-25  7:53     ` Michal Hocko
2019-04-25  8:05       ` Du, Fan
2019-04-25 15:43         ` Dan Williams
2019-04-26  2:40           ` Du, Fan
