* [PATCH RFC 0/4] Demotion Profiling Improvements
@ 2023-11-02  2:56 Li Zhijian
  2023-11-02  2:56 ` [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface Li Zhijian
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Li Zhijian @ 2023-11-02  2:56 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm
  Cc: ying.huang, y-goto, linux-kernel, Li Zhijian

With the deployment of high-capacity CXL Type 3 memory, HBM, and NVDIMM,
the kernel now supports memory tiering. Building on this foundation, and
aiming to further improve memory efficiency, the kernel has introduced
demotion and promotion features.

To provide users with a more intuitive way to observe demotion-related
information, we have made several improvements to demotion profiling,
in two main aspects:

Patch 1 introduces a new interface: /sys/devices/system/node/node0/demotion_nodes
This interface is used to display the target nodes to which a node can demote.

Patch 2, Patch 3, and Patch 4 are improvements to demotion statistics.
Patch 2 changes the granularity of demotion statistics from global to per-node.
Patch 3 and Patch 4 further differentiate demotion statistics into demotion
source and demotion destination.

The names of the statistics are open to discussion; they could be named something
like pgdemote_from/to_* etc.
One issue with this patch set is that SUM(pgdemote_src_*) always equals SUM(pgdemote_dst_*)
in the global statistics (/proc/vmstat). Should we hide one of them?
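
That accounting can be modeled in a few lines (a hypothetical Python
sketch, not kernel code): each successful demotion charges pgdemote_src_*
on the source node and pgdemote_dst_* on the target node, which is why
the two global sums necessarily agree.

```python
from collections import Counter

def account_demotions(events):
    """Model the proposed per-node counters.

    events: iterable of (src_node, dst_node, nr_pages) tuples, one per
    successful demotion (mirroring the mod_node_page_state() pair).
    """
    src = Counter()  # per-node pgdemote_src_* totals
    dst = Counter()  # per-node pgdemote_dst_* totals
    for s, d, n in events:
        src[s] += n  # charged to the demoting (source) node
        dst[d] += n  # charged to the demotion target node
    return src, dst

# Two DRAM nodes (0 and 1) demoting to one PMEM node (3).
src, dst = account_demotions([(0, 3, 100), (1, 3, 50), (0, 3, 25)])

# Every page leaves exactly one node and lands on exactly one node, so
# SUM(pgdemote_src_*) == SUM(pgdemote_dst_*) in the global view.
assert sum(src.values()) == sum(dst.values()) == 175
```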

Any feedback is welcome.

TO: Andrew Morton <akpm@linux-foundation.org> 
TO: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 
TO: "Rafael J. Wysocki" <rafael@kernel.org> 
CC: "Huang, Ying" <ying.huang@intel.com>
CC: y-goto@fujitsu.com
CC: linux-kernel@vger.kernel.org 
TO: linux-mm@kvack.org 

Li Zhijian (4):
  drivers/base/node: Add demotion_nodes sys interface
  mm/vmstat: Move pgdemote_* to per-node stats
  mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  drivers/base/node: add demote_src and demote_dst to numastat

 drivers/base/node.c           | 29 +++++++++++++++++++++++++++--
 include/linux/memory-tiers.h  |  6 ++++++
 include/linux/mmzone.h        |  7 +++++++
 include/linux/vm_event_item.h |  3 ---
 mm/memory-tiers.c             |  8 ++++++++
 mm/vmscan.c                   | 14 +++++++++++---
 mm/vmstat.c                   |  9 ++++++---
 7 files changed, 65 insertions(+), 11 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
  2023-11-02  2:56 [PATCH RFC 0/4] Demotion Profiling Improvements Li Zhijian
@ 2023-11-02  2:56 ` Li Zhijian
  2023-11-02  3:17   ` Huang, Ying
  2023-11-03  2:21   ` kernel test robot
  2023-11-02  2:56 ` [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats Li Zhijian
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 34+ messages in thread
From: Li Zhijian @ 2023-11-02  2:56 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm
  Cc: ying.huang, y-goto, linux-kernel, Li Zhijian

It shows the demotion target nodes of a node. Export this information
to userspace directly.

Below is an example where node0 and node1 are DRAM and node3 is a PMEM node.
- Before PMEM is online, no demotion_nodes for node0 and node1.
$ cat /sys/devices/system/node/node0/demotion_nodes
 <show nothing>
- After node3 is online as kmem
$ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
[
  {
    "chardev":"dax0.0",
    "size":1054867456,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":0,
    "total_memblocks":7
  }
]
$ cat /sys/devices/system/node/node0/demotion_nodes
3
$ cat /sys/devices/system/node/node1/demotion_nodes
3
$ cat /sys/devices/system/node/node3/demotion_nodes
 <show nothing>

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/base/node.c          | 13 +++++++++++++
 include/linux/memory-tiers.h |  6 ++++++
 mm/memory-tiers.c            |  8 ++++++++
 3 files changed, 27 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..27e8502548a7 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/memory-tiers.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
 }
 static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
 
+static ssize_t demotion_nodes_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	int ret;
+	nodemask_t nmask = next_demotion_nodes(dev->id);
+
+	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
+	return ret;
+}
+static DEVICE_ATTR_RO(demotion_nodes);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_meminfo.attr,
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+	&dev_attr_demotion_nodes.attr,
 	NULL
 };
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 437441cdf78f..8eb04923f965 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+nodemask_t next_demotion_nodes(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 bool node_is_toptier(int node);
 #else
@@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 
+static inline nodemask_t next_demotion_nodes(int node)
+{
+	return NODE_MASK_NONE;
+}
+
 static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
 	*targets = NODE_MASK_NONE;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 37a4f59d9585..90047f37d98a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 	rcu_read_unlock();
 }
 
+nodemask_t next_demotion_nodes(int node)
+{
+	if (!node_demotion)
+		return NODE_MASK_NONE;
+
+	return node_demotion[node].preferred;
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats
  2023-11-02  2:56 [PATCH RFC 0/4] Demotion Profiling Improvements Li Zhijian
  2023-11-02  2:56 ` [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface Li Zhijian
@ 2023-11-02  2:56 ` Li Zhijian
  2023-11-02  4:56   ` Huang, Ying
  2023-11-02  5:43   ` Huang, Ying
  2023-11-02  2:56 ` [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_* Li Zhijian
  2023-11-02  2:56 ` [PATCH RFC 4/4] drivers/base/node: add demote_src and demote_dst to numastat Li Zhijian
  3 siblings, 2 replies; 34+ messages in thread
From: Li Zhijian @ 2023-11-02  2:56 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm
  Cc: ying.huang, y-goto, linux-kernel, Li Zhijian

This is a preparation to improve the demotion profiling in the later
patches.

Per-node demotion stats help users to quickly identify which node is
under high stress, and take some special operations if needed.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 include/linux/mmzone.h        | 4 ++++
 include/linux/vm_event_item.h | 3 ---
 mm/vmscan.c                   | 3 ++-
 mm/vmstat.c                   | 6 +++---
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..ad0309eea850 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -206,6 +206,10 @@ enum node_stat_item {
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
+	/* PGDEMOTE_*: pages demoted */
+	PGDEMOTE_KSWAPD,
+	PGDEMOTE_DIRECT,
+	PGDEMOTE_KHUGEPAGED,
 #endif
 	NR_VM_NODE_STAT_ITEMS
 };
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..d1b847502f09 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -41,9 +41,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
 		PGSTEAL_KHUGEPAGED,
-		PGDEMOTE_KSWAPD,
-		PGDEMOTE_DIRECT,
-		PGDEMOTE_KHUGEPAGED,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_KHUGEPAGED,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..2f1fb4ec3235 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1678,7 +1678,8 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
-	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
+	mod_node_page_state(NODE_DATA(target_nid),
+		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
 
 	return nr_succeeded;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..f141c48c39e4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1244,6 +1244,9 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
 	"pgpromote_candidate",
+	"pgdemote_kswapd",
+	"pgdemote_direct",
+	"pgdemote_khugepaged",
 #endif
 
 	/* enum writeback_stat_item counters */
@@ -1275,9 +1278,6 @@ const char * const vmstat_text[] = {
 	"pgsteal_kswapd",
 	"pgsteal_direct",
 	"pgsteal_khugepaged",
-	"pgdemote_kswapd",
-	"pgdemote_direct",
-	"pgdemote_khugepaged",
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_khugepaged",
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  2:56 [PATCH RFC 0/4] Demotion Profiling Improvements Li Zhijian
  2023-11-02  2:56 ` [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface Li Zhijian
  2023-11-02  2:56 ` [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats Li Zhijian
@ 2023-11-02  2:56 ` Li Zhijian
  2023-11-02  5:45   ` Huang, Ying
  2023-11-02 17:16   ` kernel test robot
  2023-11-02  2:56 ` [PATCH RFC 4/4] drivers/base/node: add demote_src and demote_dst to numastat Li Zhijian
  3 siblings, 2 replies; 34+ messages in thread
From: Li Zhijian @ 2023-11-02  2:56 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm
  Cc: ying.huang, y-goto, linux-kernel, Li Zhijian

pgdemote_src_*: pages demoted from this node.
pgdemote_dst_*: pages demoted to this node.

With this, users can check per-node demotion stats in both directions.

In this environment, node0 and node1 are DRAM nodes and node3 is PMEM.

Global stats:
$ grep -E 'demote' /proc/vmstat
pgdemote_src_kswapd 130155
pgdemote_src_direct 113497
pgdemote_src_khugepaged 0
pgdemote_dst_kswapd 130155
pgdemote_dst_direct 113497
pgdemote_dst_khugepaged 0

Per-node stats:
$ grep demote /sys/devices/system/node/node0/vmstat
pgdemote_src_kswapd 68454
pgdemote_src_direct 83431
pgdemote_src_khugepaged 0
pgdemote_dst_kswapd 0
pgdemote_dst_direct 0
pgdemote_dst_khugepaged 0

$ grep demote /sys/devices/system/node/node1/vmstat
pgdemote_src_kswapd 185834
pgdemote_src_direct 30066
pgdemote_src_khugepaged 0
pgdemote_dst_kswapd 0
pgdemote_dst_direct 0
pgdemote_dst_khugepaged 0

$ grep demote /sys/devices/system/node/node3/vmstat
pgdemote_src_kswapd 0
pgdemote_src_direct 0
pgdemote_src_khugepaged 0
pgdemote_dst_kswapd 254288
pgdemote_dst_direct 113497
pgdemote_dst_khugepaged 0

From the stats above, we can see that node3 is the demotion destination
to which node0 and node1 demote.
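
As a sanity check on the per-node numbers above (an illustrative Python
snippet, not part of the patch), the source counters on node0 and node1
add up exactly to the destination counters on node3:

```python
# Per-node counters copied from the vmstat output above.
node0_src = {"kswapd": 68454, "direct": 83431, "khugepaged": 0}
node1_src = {"kswapd": 185834, "direct": 30066, "khugepaged": 0}
node3_dst = {"kswapd": 254288, "direct": 113497, "khugepaged": 0}

# Pages demoted *from* the DRAM nodes must all arrive *at* the PMEM node.
for mode, total in node3_dst.items():
    assert node0_src[mode] + node1_src[mode] == total
```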

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
RFC: the names are open to discussion; maybe pgdemote_from_*/pgdemote_to_*.
Another drawback of this patch is that SUM(pgdemote_src_*) always equals
SUM(pgdemote_dst_*) in the global stats; shall we hide one of them?
---
 include/linux/mmzone.h |  9 ++++++---
 mm/vmscan.c            | 13 ++++++++++---
 mm/vmstat.c            |  9 ++++++---
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ad0309eea850..a6140d894bec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -207,9 +207,12 @@ enum node_stat_item {
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
 	/* PGDEMOTE_*: pages demoted */
-	PGDEMOTE_KSWAPD,
-	PGDEMOTE_DIRECT,
-	PGDEMOTE_KHUGEPAGED,
+	PGDEMOTE_SRC_KSWAPD,
+	PGDEMOTE_SRC_DIRECT,
+	PGDEMOTE_SRC_KHUGEPAGED,
+	PGDEMOTE_DST_KSWAPD,
+	PGDEMOTE_DST_DIRECT,
+	PGDEMOTE_DST_KHUGEPAGED,
 #endif
 	NR_VM_NODE_STAT_ITEMS
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2f1fb4ec3235..55d2287d7150 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1111,13 +1111,18 @@ void drop_slab(void)
 static int reclaimer_offset(void)
 {
 	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
-			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
+			PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
 	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
 			PGSCAN_DIRECT - PGSCAN_KSWAPD);
 	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
-			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
+			PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
 	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
 			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
+	BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
+			PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
+	BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
+			PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
+
 
 	if (current_is_kswapd())
 		return 0;
@@ -1678,8 +1683,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
+	mod_node_page_state(pgdat,
+		    PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
 	mod_node_page_state(NODE_DATA(target_nid),
-		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
+		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
 
 	return nr_succeeded;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f141c48c39e4..63f106a5e008 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1244,9 +1244,12 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
 	"pgpromote_candidate",
-	"pgdemote_kswapd",
-	"pgdemote_direct",
-	"pgdemote_khugepaged",
+	"pgdemote_src_kswapd",
+	"pgdemote_src_direct",
+	"pgdemote_src_khugepaged",
+	"pgdemote_dst_kswapd",
+	"pgdemote_dst_direct",
+	"pgdemote_dst_khugepaged",
 #endif
 
 	/* enum writeback_stat_item counters */
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC 4/4] drivers/base/node: add demote_src and demote_dst to numastat
  2023-11-02  2:56 [PATCH RFC 0/4] Demotion Profiling Improvements Li Zhijian
                   ` (2 preceding siblings ...)
  2023-11-02  2:56 ` [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_* Li Zhijian
@ 2023-11-02  2:56 ` Li Zhijian
  2023-11-02  5:40   ` Greg Kroah-Hartman
  3 siblings, 1 reply; 34+ messages in thread
From: Li Zhijian @ 2023-11-02  2:56 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm
  Cc: ying.huang, y-goto, linux-kernel, Li Zhijian

node0 and node1 are DRAM nodes; node3 is a PMEM node.

$ cat /sys/devices/system/node/node1/numastat
numa_hit 646590
numa_miss 3963
numa_foreign 30651
interleave_hit 416
local_node 645252
other_node 5301
demote_src 200478
demote_dst 0

Of course, the userspace numastat tool will be extended to support these
two new fields in the future, like:
$ numastat
                           node0           node1           node3
numa_hit                  741793          702460          364154
numa_miss                   1759            8104           28893
numa_foreign                8105           30651               0
interleave_hit               653             416               0
local_node                741762          701115               0
other_node                  1790            9449          393047
demote_src                163612          203828               0
demote_dst                     0               0          367440

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/base/node.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 27e8502548a7..d3fc70599b6a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -496,20 +496,32 @@ static DEVICE_ATTR(meminfo, 0444, node_read_meminfo, NULL);
 static ssize_t node_read_numastat(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
+	struct pglist_data *pgdat = NODE_DATA(dev->id);
+	unsigned long demote_src, demote_dst;
+
 	fold_vm_numa_events();
+	demote_src = node_page_state_pages(pgdat, PGDEMOTE_SRC_KSWAPD) +
+		     node_page_state_pages(pgdat, PGDEMOTE_SRC_DIRECT) +
+		     node_page_state_pages(pgdat, PGDEMOTE_SRC_KHUGEPAGED);
+	demote_dst = node_page_state_pages(pgdat, PGDEMOTE_DST_KSWAPD) +
+		     node_page_state_pages(pgdat, PGDEMOTE_DST_DIRECT) +
+		     node_page_state_pages(pgdat, PGDEMOTE_DST_KHUGEPAGED);
 	return sysfs_emit(buf,
 			  "numa_hit %lu\n"
 			  "numa_miss %lu\n"
 			  "numa_foreign %lu\n"
 			  "interleave_hit %lu\n"
 			  "local_node %lu\n"
-			  "other_node %lu\n",
+			  "other_node %lu\n"
+			  "demote_src %lu\n"
+			  "demote_dst %lu\n",
 			  sum_zone_numa_event_state(dev->id, NUMA_HIT),
 			  sum_zone_numa_event_state(dev->id, NUMA_MISS),
 			  sum_zone_numa_event_state(dev->id, NUMA_FOREIGN),
 			  sum_zone_numa_event_state(dev->id, NUMA_INTERLEAVE_HIT),
 			  sum_zone_numa_event_state(dev->id, NUMA_LOCAL),
-			  sum_zone_numa_event_state(dev->id, NUMA_OTHER));
+			  sum_zone_numa_event_state(dev->id, NUMA_OTHER),
+			  demote_src, demote_dst);
 }
 static DEVICE_ATTR(numastat, 0444, node_read_numastat, NULL);
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
  2023-11-02  2:56 ` [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface Li Zhijian
@ 2023-11-02  3:17   ` Huang, Ying
  2023-11-02  3:39     ` Zhijian Li (Fujitsu)
  2024-01-30  8:53     ` Li Zhijian
  2023-11-03  2:21   ` kernel test robot
  1 sibling, 2 replies; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  3:17 UTC (permalink / raw)
  To: Li Zhijian
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm, y-goto,
	linux-kernel

Li Zhijian <lizhijian@fujitsu.com> writes:

> It shows the demotion target nodes of a node. Export this information to
> user directly.
>
> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
> - Before PMEM is online, no demotion_nodes for node0 and node1.
> $ cat /sys/devices/system/node/node0/demotion_nodes
>  <show nothing>
> - After node3 is online as kmem
> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
> [
>   {
>     "chardev":"dax0.0",
>     "size":1054867456,
>     "target_node":3,
>     "align":2097152,
>     "mode":"system-ram",
>     "online_memblocks":0,
>     "total_memblocks":7
>   }
> ]
> $ cat /sys/devices/system/node/node0/demotion_nodes
> 3
> $ cat /sys/devices/system/node/node1/demotion_nodes
> 3
> $ cat /sys/devices/system/node/node3/demotion_nodes
>  <show nothing>

We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
already.  A node in a higher tier can demote to any node in the lower
tiers.  What's more need to be displayed in nodeX/demotion_nodes?

--
Best Regards,
Huang, Ying

> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>  drivers/base/node.c          | 13 +++++++++++++
>  include/linux/memory-tiers.h |  6 ++++++
>  mm/memory-tiers.c            |  8 ++++++++
>  3 files changed, 27 insertions(+)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 493d533f8375..27e8502548a7 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/memory-tiers.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>  }
>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>  
> +static ssize_t demotion_nodes_show(struct device *dev,
> +			     struct device_attribute *attr, char *buf)
> +{
> +	int ret;
> +	nodemask_t nmask = next_demotion_nodes(dev->id);
> +
> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
> +	return ret;
> +}
> +static DEVICE_ATTR_RO(demotion_nodes);
> +
>  static struct attribute *node_dev_attrs[] = {
>  	&dev_attr_meminfo.attr,
>  	&dev_attr_numastat.attr,
>  	&dev_attr_distance.attr,
>  	&dev_attr_vmstat.attr,
> +	&dev_attr_demotion_nodes.attr,
>  	NULL
>  };
>  
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 437441cdf78f..8eb04923f965 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
> +nodemask_t next_demotion_nodes(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>  bool node_is_toptier(int node);
>  #else
> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>  	return NUMA_NO_NODE;
>  }
>  
> +static inline nodemask_t next_demotion_nodes(int node)
> +{
> +	return NODE_MASK_NONE;
> +}
> +
>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>  {
>  	*targets = NODE_MASK_NONE;
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 37a4f59d9585..90047f37d98a 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>  	rcu_read_unlock();
>  }
>  
> +nodemask_t next_demotion_nodes(int node)
> +{
> +	if (!node_demotion)
> +		return NODE_MASK_NONE;
> +
> +	return node_demotion[node].preferred;
> +}
> +
>  /**
>   * next_demotion_node() - Get the next node in the demotion path
>   * @node: The starting node to lookup the next node

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
  2023-11-02  3:17   ` Huang, Ying
@ 2023-11-02  3:39     ` Zhijian Li (Fujitsu)
  2023-11-02  5:18       ` Huang, Ying
  2024-01-30  8:53     ` Li Zhijian
  1 sibling, 1 reply; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-11-02  3:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel


> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> already.  A node in a higher tier can demote to any node in the lower
> tiers.  What's more need to be displayed in nodeX/demotion_nodes?

IIRC, they are not the same. A memory_tier[number] is shared by memory
backed by the same memory driver (dax/kmem etc.); it does not reflect the
actual distance across nodes (nodes with different distances may be
grouped into the same memory_tier). Demotion, however, only selects the
nearest nodelist to demote to.

Below is an example: node0 and node1 are DRAM, node2 and node3 are PMEM,
but their distances to the DRAM nodes differ.
 
# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 964 MB
node 0 free: 746 MB
node 1 cpus: 1
node 1 size: 685 MB
node 1 free: 455 MB
node 2 cpus:
node 2 size: 896 MB
node 2 free: 897 MB
node 3 cpus:
node 3 size: 896 MB
node 3 free: 896 MB
node distances:
node   0   1   2   3
  0:  10  20  20  25
  1:  20  10  25  20
  2:  20  25  10  20
  3:  25  20  20  10
# cat /sys/devices/system/node/node0/demotion_nodes
2
# cat /sys/devices/system/node/node1/demotion_nodes
3
# cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
2-3
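
The behaviour shown above — each DRAM node preferring its nearest
lower-tier node rather than the whole tier nodelist — can be sketched
from the distance matrix (an illustrative Python model; the names and the
tier membership are assumptions, not kernel APIs):

```python
# SLIT distances from the numactl -H output above.
dist = {
    0: {0: 10, 1: 20, 2: 20, 3: 25},
    1: {0: 20, 1: 10, 2: 25, 3: 20},
    2: {0: 20, 1: 25, 2: 10, 3: 20},
    3: {0: 25, 1: 20, 2: 20, 3: 10},
}
lower_tier = {2, 3}  # both PMEM nodes land in the same memory_tier

def preferred_demotion_nodes(node):
    """Nearest lower-tier node(s), like node_demotion[node].preferred."""
    if node in lower_tier:  # bottom tier: nothing to demote to
        return set()
    best = min(dist[node][t] for t in lower_tier)
    return {t for t in lower_tier if dist[node][t] == best}

assert preferred_demotion_nodes(0) == {2}  # matches node0/demotion_nodes
assert preferred_demotion_nodes(1) == {3}  # matches node1/demotion_nodes
assert preferred_demotion_nodes(3) == set()
```

So while memory_tier22/nodelist reports 2-3 for both DRAM nodes, the
preferred demotion target differs per node.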

Thanks
Zhijian

(I hate the outlook native reply composition format.)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats
  2023-11-02  2:56 ` [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats Li Zhijian
@ 2023-11-02  4:56   ` Huang, Ying
  2023-11-02  5:43   ` Huang, Ying
  1 sibling, 0 replies; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  4:56 UTC (permalink / raw)
  To: Li Zhijian
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm, y-goto,
	linux-kernel

Li Zhijian <lizhijian@fujitsu.com> writes:

> This is a preparation to improve the demotion profiling in the later
> patches.

I think that this patch has its value even without the following
patches.  So, don't need to define it as preparation.

> Per-node demotion stats help users to quickly identify which node is
> under high stress, and take some special operations if needed.

Better to add more description.  For example, memory pressure on one
node, etc.

> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>

After addressing the comments above, feel free to add

Acked-by: "Huang, Ying" <ying.huang@intel.com>

--
Best Regards,
Huang, Ying

> ---
>  include/linux/mmzone.h        | 4 ++++
>  include/linux/vm_event_item.h | 3 ---
>  mm/vmscan.c                   | 3 ++-
>  mm/vmstat.c                   | 6 +++---
>  4 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc5b4b3..ad0309eea850 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -206,6 +206,10 @@ enum node_stat_item {
>  #ifdef CONFIG_NUMA_BALANCING
>  	PGPROMOTE_SUCCESS,	/* promote successfully */
>  	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
> +	/* PGDEMOTE_*: pages demoted */
> +	PGDEMOTE_KSWAPD,
> +	PGDEMOTE_DIRECT,
> +	PGDEMOTE_KHUGEPAGED,
>  #endif
>  	NR_VM_NODE_STAT_ITEMS
>  };
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 8abfa1240040..d1b847502f09 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -41,9 +41,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		PGSTEAL_KSWAPD,
>  		PGSTEAL_DIRECT,
>  		PGSTEAL_KHUGEPAGED,
> -		PGDEMOTE_KSWAPD,
> -		PGDEMOTE_DIRECT,
> -		PGDEMOTE_KHUGEPAGED,
>  		PGSCAN_KSWAPD,
>  		PGSCAN_DIRECT,
>  		PGSCAN_KHUGEPAGED,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6f13394b112e..2f1fb4ec3235 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1678,7 +1678,8 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>  		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>  		      &nr_succeeded);
>  
> -	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> +	mod_node_page_state(NODE_DATA(target_nid),
> +		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>  
>  	return nr_succeeded;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 00e81e99c6ee..f141c48c39e4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1244,6 +1244,9 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_NUMA_BALANCING
>  	"pgpromote_success",
>  	"pgpromote_candidate",
> +	"pgdemote_kswapd",
> +	"pgdemote_direct",
> +	"pgdemote_khugepaged",
>  #endif
>  
>  	/* enum writeback_stat_item counters */
> @@ -1275,9 +1278,6 @@ const char * const vmstat_text[] = {
>  	"pgsteal_kswapd",
>  	"pgsteal_direct",
>  	"pgsteal_khugepaged",
> -	"pgdemote_kswapd",
> -	"pgdemote_direct",
> -	"pgdemote_khugepaged",
>  	"pgscan_kswapd",
>  	"pgscan_direct",
>  	"pgscan_khugepaged",

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2023-11-02  3:39     ` Zhijian Li (Fujitsu)
@ 2023-11-02  5:18       ` Huang, Ying
  2023-11-02  5:54         ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  5:18 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel

"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> already.  A node in a higher tier can demote to any node in the lower
>> tiers.  What more needs to be displayed in nodeX/demotion_nodes?
>
> IIRC, they are not the same. memory_tier[number] uses a number that is shared by
> all memory backed by the same memory driver (dax/kmem etc.), so it does not reflect
> the actual distance across nodes (different distances can be grouped into the same
> memory_tier). Demotion, however, will only select the nearest nodelist to demote to.

In the following patchset, we will use the performance information from
HMAT to place nodes using the same memory driver into different memory
tiers.

https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@intel.com/

The patch is in mm-stable tree.

> Below is an example, node0 node1 are DRAM, node2 node3 are PMEM, but distance to DRAM nodes
> are different.
>  
> # numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0
> node 0 size: 964 MB
> node 0 free: 746 MB
> node 1 cpus: 1
> node 1 size: 685 MB
> node 1 free: 455 MB
> node 2 cpus:
> node 2 size: 896 MB
> node 2 free: 897 MB
> node 3 cpus:
> node 3 size: 896 MB
> node 3 free: 896 MB
> node distances:
> node   0   1   2   3
>   0:  10  20  20  25
>   1:  20  10  25  20
>   2:  20  25  10  20
>   3:  25  20  20  10
> # cat /sys/devices/system/node/node0/demotion_nodes
> 2

node 2 is only the preferred demotion target.  In fact, memory in node 0
can be demoted to node 2,3.  Please check demote_folio_list() for
details.

--
Best Regards,
Huang, Ying

> # cat /sys/devices/system/node/node1/demotion_nodes
> 3
> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
> 2-3
>
> Thanks
> Zhijian
>
> (I hate the outlook native reply composition format.)
> ________________________________________
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Thursday, November 2, 2023 11:17
> To: Li, Zhijian/李 智坚
> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@kernel.org; linux-mm@kvack.org; Gotou, Yasunori/五島 康文; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
>
> Li Zhijian <lizhijian@fujitsu.com> writes:
>
>> It shows the demotion target nodes of a node. Export this information to
>> users directly.
>>
>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>  <show nothing>
>> - After node3 is online as kmem
>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>> [
>>   {
>>     "chardev":"dax0.0",
>>     "size":1054867456,
>>     "target_node":3,
>>     "align":2097152,
>>     "mode":"system-ram",
>>     "online_memblocks":0,
>>     "total_memblocks":7
>>   }
>> ]
>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> 3
>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> 3
>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>  <show nothing>
>
> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> already.  A node in a higher tier can demote to any node in the lower
> tiers.  What more needs to be displayed in nodeX/demotion_nodes?
>
> --
> Best Regards,
> Huang, Ying
>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>  drivers/base/node.c          | 13 +++++++++++++
>>  include/linux/memory-tiers.h |  6 ++++++
>>  mm/memory-tiers.c            |  8 ++++++++
>>  3 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 493d533f8375..27e8502548a7 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -7,6 +7,7 @@
>>  #include <linux/init.h>
>>  #include <linux/mm.h>
>>  #include <linux/memory.h>
>> +#include <linux/memory-tiers.h>
>>  #include <linux/vmstat.h>
>>  #include <linux/notifier.h>
>>  #include <linux/node.h>
>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>  }
>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>
>> +static ssize_t demotion_nodes_show(struct device *dev,
>> +                          struct device_attribute *attr, char *buf)
>> +{
>> +     int ret;
>> +     nodemask_t nmask = next_demotion_nodes(dev->id);
>> +
>> +     ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> +     return ret;
>> +}
>> +static DEVICE_ATTR_RO(demotion_nodes);
>> +
>>  static struct attribute *node_dev_attrs[] = {
>>       &dev_attr_meminfo.attr,
>>       &dev_attr_numastat.attr,
>>       &dev_attr_distance.attr,
>>       &dev_attr_vmstat.attr,
>> +     &dev_attr_demotion_nodes.attr,
>>       NULL
>>  };
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 437441cdf78f..8eb04923f965 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>  #ifdef CONFIG_MIGRATION
>>  int next_demotion_node(int node);
>> +nodemask_t next_demotion_nodes(int node);
>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>  bool node_is_toptier(int node);
>>  #else
>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>       return NUMA_NO_NODE;
>>  }
>>
>> +static inline nodemask_t next_demotion_nodes(int node)
>> +{
>> +     return NODE_MASK_NONE;
>> +}
>> +
>>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>  {
>>       *targets = NODE_MASK_NONE;
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 37a4f59d9585..90047f37d98a 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>       rcu_read_unlock();
>>  }
>>
>> +nodemask_t next_demotion_nodes(int node)
>> +{
>> +     if (!node_demotion)
>> +             return NODE_MASK_NONE;
>> +
>> +     return node_demotion[node].preferred;
>> +}
>> +
>>  /**
>>   * next_demotion_node() - Get the next node in the demotion path
>>   * @node: The starting node to lookup the next node


* Re: [PATCH RFC 4/4] drivers/base/node: add demote_src and demote_dst to numastat
  2023-11-02  2:56 ` [PATCH RFC 4/4] drivers/base/node: add demote_src and demote_dst to numastat Li Zhijian
@ 2023-11-02  5:40   ` Greg Kroah-Hartman
  2023-11-02  8:15     ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 34+ messages in thread
From: Greg Kroah-Hartman @ 2023-11-02  5:40 UTC (permalink / raw)
  To: Li Zhijian
  Cc: Andrew Morton, rafael, linux-mm, ying.huang, y-goto, linux-kernel

On Thu, Nov 02, 2023 at 10:56:48AM +0800, Li Zhijian wrote:
> node0 and node1 are DRAM nodes, node3 is a PMEM node.
> 
> $ cat /sys/devices/system/node/node1/numastat
> numa_hit 646590
> numa_miss 3963
> numa_foreign 30651
> interleave_hit 416
> local_node 645252
> other_node 5301
> demote_src 200478
> demote_dst 0
> 
> Of course, the userspace numastat will be extended to support these 2
> new fields in the future, like:
> $ numastat
>                            node0           node1           node3
> numa_hit                  741793          702460          364154
> numa_miss                   1759            8104           28893
> numa_foreign                8105           30651               0
> interleave_hit               653             416               0
> local_node                741762          701115               0
> other_node                  1790            9449          393047
> demote_src                163612          203828               0
> demote_dst                     0               0          367440
> 
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>  drivers/base/node.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 27e8502548a7..d3fc70599b6a 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -496,20 +496,32 @@ static DEVICE_ATTR(meminfo, 0444, node_read_meminfo, NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>  				  struct device_attribute *attr, char *buf)
>  {
> +	struct pglist_data *pgdat = NODE_DATA(dev->id);
> +	unsigned long demote_src, demote_dst;
> +
>  	fold_vm_numa_events();
> +	demote_src = node_page_state_pages(pgdat, PGDEMOTE_SRC_KSWAPD) +
> +		     node_page_state_pages(pgdat, PGDEMOTE_SRC_DIRECT) +
> +		     node_page_state_pages(pgdat, PGDEMOTE_SRC_KHUGEPAGED);
> +	demote_dst = node_page_state_pages(pgdat, PGDEMOTE_DST_KSWAPD) +
> +		     node_page_state_pages(pgdat, PGDEMOTE_DST_DIRECT) +
> +		     node_page_state_pages(pgdat, PGDEMOTE_DST_KHUGEPAGED);
>  	return sysfs_emit(buf,
>  			  "numa_hit %lu\n"
>  			  "numa_miss %lu\n"
>  			  "numa_foreign %lu\n"
>  			  "interleave_hit %lu\n"
>  			  "local_node %lu\n"
> -			  "other_node %lu\n",
> +			  "other_node %lu\n"
> +			  "demote_src %lu\n"
> +			  "demote_dst %lu\n",

This sysfs file is already a total abuse of sysfs so please, do NOT make
it worse by adding more fields, that's just wrong and something I can
not take at all for obvious reasons.

thanks,

greg k-h


* Re: [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats
  2023-11-02  2:56 ` [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats Li Zhijian
  2023-11-02  4:56   ` Huang, Ying
@ 2023-11-02  5:43   ` Huang, Ying
  2023-11-02  5:57     ` Zhijian Li (Fujitsu)
  1 sibling, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  5:43 UTC (permalink / raw)
  To: Li Zhijian
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm, y-goto,
	linux-kernel

Li Zhijian <lizhijian@fujitsu.com> writes:

> This is a preparation for improving demotion profiling in the later
> patches.
>
> Per-node demotion stats help users to quickly identify which
> node is under high stress, and take special actions if needed.
>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>  include/linux/mmzone.h        | 4 ++++
>  include/linux/vm_event_item.h | 3 ---
>  mm/vmscan.c                   | 3 ++-
>  mm/vmstat.c                   | 6 +++---
>  4 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc5b4b3..ad0309eea850 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -206,6 +206,10 @@ enum node_stat_item {
>  #ifdef CONFIG_NUMA_BALANCING
>  	PGPROMOTE_SUCCESS,	/* promote successfully */
>  	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
> +	/* PGDEMOTE_*: pages demoted */
> +	PGDEMOTE_KSWAPD,
> +	PGDEMOTE_DIRECT,
> +	PGDEMOTE_KHUGEPAGED,
>  #endif
>  	NR_VM_NODE_STAT_ITEMS
>  };
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 8abfa1240040..d1b847502f09 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -41,9 +41,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		PGSTEAL_KSWAPD,
>  		PGSTEAL_DIRECT,
>  		PGSTEAL_KHUGEPAGED,
> -		PGDEMOTE_KSWAPD,
> -		PGDEMOTE_DIRECT,
> -		PGDEMOTE_KHUGEPAGED,
>  		PGSCAN_KSWAPD,
>  		PGSCAN_DIRECT,
>  		PGSCAN_KHUGEPAGED,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6f13394b112e..2f1fb4ec3235 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1678,7 +1678,8 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>  		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>  		      &nr_succeeded);
>  
> -	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> +	mod_node_page_state(NODE_DATA(target_nid),
> +		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);

Think again.  It seems that it's better to count the demotion event for the
source node, because demotion comes from the memory pressure of the
source node.  The target node isn't so important.  Do you agree?

--
Best Regards,
Huang, Ying

>  
>  	return nr_succeeded;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 00e81e99c6ee..f141c48c39e4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1244,6 +1244,9 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_NUMA_BALANCING
>  	"pgpromote_success",
>  	"pgpromote_candidate",
> +	"pgdemote_kswapd",
> +	"pgdemote_direct",
> +	"pgdemote_khugepaged",
>  #endif
>  
>  	/* enum writeback_stat_item counters */
> @@ -1275,9 +1278,6 @@ const char * const vmstat_text[] = {
>  	"pgsteal_kswapd",
>  	"pgsteal_direct",
>  	"pgsteal_khugepaged",
> -	"pgdemote_kswapd",
> -	"pgdemote_direct",
> -	"pgdemote_khugepaged",
>  	"pgscan_kswapd",
>  	"pgscan_direct",
>  	"pgscan_khugepaged",


* Re: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  2:56 ` [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_* Li Zhijian
@ 2023-11-02  5:45   ` Huang, Ying
  2023-11-02  6:34     ` Zhijian Li (Fujitsu)
  2023-11-02 17:16   ` kernel test robot
  1 sibling, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  5:45 UTC (permalink / raw)
  To: Li Zhijian
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm, y-goto,
	linux-kernel

Li Zhijian <lizhijian@fujitsu.com> writes:

> pgdemote_src_*: pages demoted from this node.
> pgdemote_dst_*: pages demoted to this node.
>
> So that we are able to know the per-node demotion stats by checking these.
>
> In the environment, node0 and node1 are DRAM, node3 is PMEM.
>
> Global stats:
> $ grep -E 'demote' /proc/vmstat
> pgdemote_src_kswapd 130155
> pgdemote_src_direct 113497
> pgdemote_src_khugepaged 0
> pgdemote_dst_kswapd 130155
> pgdemote_dst_direct 113497
> pgdemote_dst_khugepaged 0
>
> Per-node stats:
> $ grep demote /sys/devices/system/node/node0/vmstat
> pgdemote_src_kswapd 68454
> pgdemote_src_direct 83431
> pgdemote_src_khugepaged 0
> pgdemote_dst_kswapd 0
> pgdemote_dst_direct 0
> pgdemote_dst_khugepaged 0
>
> $ grep demote /sys/devices/system/node/node1/vmstat
> pgdemote_src_kswapd 185834
> pgdemote_src_direct 30066
> pgdemote_src_khugepaged 0
> pgdemote_dst_kswapd 0
> pgdemote_dst_direct 0
> pgdemote_dst_khugepaged 0
>
> $ grep demote /sys/devices/system/node/node3/vmstat
> pgdemote_src_kswapd 0
> pgdemote_src_direct 0
> pgdemote_src_khugepaged 0
> pgdemote_dst_kswapd 254288
> pgdemote_dst_direct 113497
> pgdemote_dst_khugepaged 0
>
> From the above stats, we know node3 is the demotion destination that
> node0 and node1 demote to.

Why do we need this information?  Do you have a use case?

--
Best Regards,
Huang, Ying

> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
> RFC: the names are open to discussion, maybe pgdemote_from/to_*.
> Another defect of this patch is that SUM(pgdemote_src_*) is always the same
> as SUM(pgdemote_dst_*) in the global stats; shall we hide one of them?
> ---
>  include/linux/mmzone.h |  9 ++++++---
>  mm/vmscan.c            | 13 ++++++++++---
>  mm/vmstat.c            |  9 ++++++---
>  3 files changed, 22 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ad0309eea850..a6140d894bec 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -207,9 +207,12 @@ enum node_stat_item {
>  	PGPROMOTE_SUCCESS,	/* promote successfully */
>  	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
>  	/* PGDEMOTE_*: pages demoted */
> -	PGDEMOTE_KSWAPD,
> -	PGDEMOTE_DIRECT,
> -	PGDEMOTE_KHUGEPAGED,
> +	PGDEMOTE_SRC_KSWAPD,
> +	PGDEMOTE_SRC_DIRECT,
> +	PGDEMOTE_SRC_KHUGEPAGED,
> +	PGDEMOTE_DST_KSWAPD,
> +	PGDEMOTE_DST_DIRECT,
> +	PGDEMOTE_DST_KHUGEPAGED,
>  #endif
>  	NR_VM_NODE_STAT_ITEMS
>  };
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2f1fb4ec3235..55d2287d7150 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1111,13 +1111,18 @@ void drop_slab(void)
>  static int reclaimer_offset(void)
>  {
>  	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> -			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
> +			PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
>  	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
>  			PGSCAN_DIRECT - PGSCAN_KSWAPD);
>  	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> -			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
> +			PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
>  	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
>  			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
> +	BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
> +			PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
> +	BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
> +			PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
> +
>  
>  	if (current_is_kswapd())
>  		return 0;
> @@ -1678,8 +1683,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>  		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>  		      &nr_succeeded);
>  
> +	mod_node_page_state(pgdat,
> +		    PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
>  	mod_node_page_state(NODE_DATA(target_nid),
> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
>  
>  	return nr_succeeded;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f141c48c39e4..63f106a5e008 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1244,9 +1244,12 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_NUMA_BALANCING
>  	"pgpromote_success",
>  	"pgpromote_candidate",
> -	"pgdemote_kswapd",
> -	"pgdemote_direct",
> -	"pgdemote_khugepaged",
> +	"pgdemote_src_kswapd",
> +	"pgdemote_src_direct",
> +	"pgdemote_src_khugepaged",
> +	"pgdemote_dst_kswapd",
> +	"pgdemote_dst_direct",
> +	"pgdemote_dst_khugepaged",
>  #endif
>  
>  	/* enum writeback_stat_item counters */


* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2023-11-02  5:18       ` Huang, Ying
@ 2023-11-02  5:54         ` Zhijian Li (Fujitsu)
  2023-11-02  5:58           ` Huang, Ying
  0 siblings, 1 reply; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-11-02  5:54 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel



On 02/11/2023 13:18, Huang, Ying wrote:
> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
> 
>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>> already.  A node in a higher tier can demote to any node in the lower
>>> tiers.  What more needs to be displayed in nodeX/demotion_nodes?
>>
>> IIRC, they are not the same. memory_tier[number] uses a number that is shared by
>> all memory backed by the same memory driver (dax/kmem etc.), so it does not reflect
>> the actual distance across nodes (different distances can be grouped into the same
>> memory_tier). Demotion, however, will only select the nearest nodelist to demote to.
> 
> In the following patchset, we will use the performance information from
> HMAT to place nodes using the same memory driver into different memory
> tiers.
> 
> https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@intel.com/

Thanks for your reminder. It seems like I've fallen behind the world by months.
I will rebase on it later if this patch is still needed.

> 
> The patch is in mm-stable tree.
> 
>> Below is an example, node0 node1 are DRAM, node2 node3 are PMEM, but distance to DRAM nodes
>> are different.
>>   
>> # numactl -H
>> available: 4 nodes (0-3)
>> node 0 cpus: 0
>> node 0 size: 964 MB
>> node 0 free: 746 MB
>> node 1 cpus: 1
>> node 1 size: 685 MB
>> node 1 free: 455 MB
>> node 2 cpus:
>> node 2 size: 896 MB
>> node 2 free: 897 MB
>> node 3 cpus:
>> node 3 size: 896 MB
>> node 3 free: 896 MB
>> node distances:
>> node   0   1   2   3
>>    0:  10  20  20  25
>>    1:  20  10  25  20
>>    2:  20  25  10  20
>>    3:  25  20  20  10
>> # cat /sys/devices/system/node/node0/demotion_nodes
>> 2
> 
> node 2 is only the preferred demotion target.  In fact, memory in node 0
> can be demoted to node 2,3.  Please check demote_folio_list() for
> details.

Have I missed something? At least on the master tree, nd->preferred only includes the
nearest ones (chosen by specific algorithms), so in the above NUMA topology, nd->preferred of
node0 is node2 only. node0's distance to node3 is 25, greater than its distance to node2 (20).

> 1657         int target_nid = next_demotion_node(pgdat->node_id);

So target_nid cannot be node3, IIUC.

(I cooked these patches weeks ago; maybe something has changed. I will also take a deeper look later.)

1650 /*
1651  * Take folios on @demote_folios and attempt to demote them to another node.
1652  * Folios which are not demoted are left on @demote_folios.
1653  */
1654 static unsigned int demote_folio_list(struct list_head *demote_folios,
1655                                      struct pglist_data *pgdat)
1656 {
1657         int target_nid = next_demotion_node(pgdat->node_id);
1658         unsigned int nr_succeeded;
1659         nodemask_t allowed_mask;
1660
1661         struct migration_target_control mtc = {
1662                 /*
1663                  * Allocate from 'node', or fail quickly and quietly.
1664                  * When this happens, 'page' will likely just be discarded
1665                  * instead of migrated.
1666                  */
1667                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
1668                         __GFP_NOMEMALLOC | GFP_NOWAIT,
1669                 .nid = target_nid,
1670                 .nmask = &allowed_mask
1671         };
1672
1673         if (list_empty(demote_folios))
1674                 return 0;
1675
1676         if (target_nid == NUMA_NO_NODE)
1677                 return 0;
1678
1679         node_get_allowed_targets(pgdat, &allowed_mask);
1680
1681         /* Demotion ignores all cpuset and mempolicy settings */
1682         migrate_pages(demote_folios, alloc_demote_folio, NULL,
1683                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
1684                       &nr_succeeded);


> 
> --
> Best Regards,
> Huang, Ying
> 
>> # cat /sys/devices/system/node/node1/demotion_nodes
>> 3
>> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
>> 2-3
>>
>> Thanks
>> Zhijian
>>
>> (I hate the outlook native reply composition format.)
>> ________________________________________
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Thursday, November 2, 2023 11:17
>> To: Li, Zhijian/李 智坚
>> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@kernel.org; linux-mm@kvack.org; Gotou, Yasunori/五島 康文; linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
>>
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>
>>> It shows the demotion target nodes of a node. Export this information to
>>> users directly.
>>>
>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>   <show nothing>
>>> - After node3 is online as kmem
>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>> [
>>>    {
>>>      "chardev":"dax0.0",
>>>      "size":1054867456,
>>>      "target_node":3,
>>>      "align":2097152,
>>>      "mode":"system-ram",
>>>      "online_memblocks":0,
>>>      "total_memblocks":7
>>>    }
>>> ]
>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>> 3
>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>> 3
>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>   <show nothing>
>>
>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> already.  A node in a higher tier can demote to any node in the lower
>> tiers.  What more needs to be displayed in nodeX/demotion_nodes?
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>> ---
>>>   drivers/base/node.c          | 13 +++++++++++++
>>>   include/linux/memory-tiers.h |  6 ++++++
>>>   mm/memory-tiers.c            |  8 ++++++++
>>>   3 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index 493d533f8375..27e8502548a7 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -7,6 +7,7 @@
>>>   #include <linux/init.h>
>>>   #include <linux/mm.h>
>>>   #include <linux/memory.h>
>>> +#include <linux/memory-tiers.h>
>>>   #include <linux/vmstat.h>
>>>   #include <linux/notifier.h>
>>>   #include <linux/node.h>
>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>   }
>>>   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>
>>> +static ssize_t demotion_nodes_show(struct device *dev,
>>> +                          struct device_attribute *attr, char *buf)
>>> +{
>>> +     int ret;
>>> +     nodemask_t nmask = next_demotion_nodes(dev->id);
>>> +
>>> +     ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>> +     return ret;
>>> +}
>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>> +
>>>   static struct attribute *node_dev_attrs[] = {
>>>        &dev_attr_meminfo.attr,
>>>        &dev_attr_numastat.attr,
>>>        &dev_attr_distance.attr,
>>>        &dev_attr_vmstat.attr,
>>> +     &dev_attr_demotion_nodes.attr,
>>>        NULL
>>>   };
>>>
>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> index 437441cdf78f..8eb04923f965 100644
>>> --- a/include/linux/memory-tiers.h
>>> +++ b/include/linux/memory-tiers.h
>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>   void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>   #ifdef CONFIG_MIGRATION
>>>   int next_demotion_node(int node);
>>> +nodemask_t next_demotion_nodes(int node);
>>>   void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>   bool node_is_toptier(int node);
>>>   #else
>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>        return NUMA_NO_NODE;
>>>   }
>>>
>>> +static inline nodemask_t next_demotion_nodes(int node)
>>> +{
>>> +     return NODE_MASK_NONE;
>>> +}
>>> +
>>>   static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>   {
>>>        *targets = NODE_MASK_NONE;
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> index 37a4f59d9585..90047f37d98a 100644
>>> --- a/mm/memory-tiers.c
>>> +++ b/mm/memory-tiers.c
>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>        rcu_read_unlock();
>>>   }
>>>
>>> +nodemask_t next_demotion_nodes(int node)
>>> +{
>>> +     if (!node_demotion)
>>> +             return NODE_MASK_NONE;
>>> +
>>> +     return node_demotion[node].preferred;
>>> +}
>>> +
>>>   /**
>>>    * next_demotion_node() - Get the next node in the demotion path
>>>    * @node: The starting node to lookup the next node


* Re: [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats
  2023-11-02  5:43   ` Huang, Ying
@ 2023-11-02  5:57     ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-11-02  5:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel



On 02/11/2023 13:43, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
> 
>> This is a preparation for improving demotion profiling in the later
>> patches.
>>
>> Per-node demotion stats help users to quickly identify which
>> node is under high stress, and take special actions if needed.
>>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>   include/linux/mmzone.h        | 4 ++++
>>   include/linux/vm_event_item.h | 3 ---
>>   mm/vmscan.c                   | 3 ++-
>>   mm/vmstat.c                   | 6 +++---
>>   4 files changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 4106fbc5b4b3..ad0309eea850 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -206,6 +206,10 @@ enum node_stat_item {
>>   #ifdef CONFIG_NUMA_BALANCING
>>   	PGPROMOTE_SUCCESS,	/* promote successfully */
>>   	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
>> +	/* PGDEMOTE_*: pages demoted */
>> +	PGDEMOTE_KSWAPD,
>> +	PGDEMOTE_DIRECT,
>> +	PGDEMOTE_KHUGEPAGED,
>>   #endif
>>   	NR_VM_NODE_STAT_ITEMS
>>   };
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 8abfa1240040..d1b847502f09 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -41,9 +41,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>   		PGSTEAL_KSWAPD,
>>   		PGSTEAL_DIRECT,
>>   		PGSTEAL_KHUGEPAGED,
>> -		PGDEMOTE_KSWAPD,
>> -		PGDEMOTE_DIRECT,
>> -		PGDEMOTE_KHUGEPAGED,
>>   		PGSCAN_KSWAPD,
>>   		PGSCAN_DIRECT,
>>   		PGSCAN_KHUGEPAGED,
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6f13394b112e..2f1fb4ec3235 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1678,7 +1678,8 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>>   		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>>   		      &nr_succeeded);
>>   
>> -	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>> +	mod_node_page_state(NODE_DATA(target_nid),
>> +		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> 
> Think again.  It seems that it's better to count the demotion event for the
> source node, because demotion comes from the memory pressure of the
> source node.  The target node isn't so important.  Do you agree?

Good idea, I will update it.



> 
> --
> Best Regards,
> Huang, Ying
> 
>>   
>>   	return nr_succeeded;
>>   }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 00e81e99c6ee..f141c48c39e4 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1244,6 +1244,9 @@ const char * const vmstat_text[] = {
>>   #ifdef CONFIG_NUMA_BALANCING
>>   	"pgpromote_success",
>>   	"pgpromote_candidate",
>> +	"pgdemote_kswapd",
>> +	"pgdemote_direct",
>> +	"pgdemote_khugepaged",
>>   #endif
>>   
>>   	/* enum writeback_stat_item counters */
>> @@ -1275,9 +1278,6 @@ const char * const vmstat_text[] = {
>>   	"pgsteal_kswapd",
>>   	"pgsteal_direct",
>>   	"pgsteal_khugepaged",
>> -	"pgdemote_kswapd",
>> -	"pgdemote_direct",
>> -	"pgdemote_khugepaged",
>>   	"pgscan_kswapd",
>>   	"pgscan_direct",
>>   	"pgscan_khugepaged",


* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2023-11-02  5:54         ` Zhijian Li (Fujitsu)
@ 2023-11-02  5:58           ` Huang, Ying
  2023-11-03  3:05             ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  5:58 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel

"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 02/11/2023 13:18, Huang, Ying wrote:
>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>> 
>>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>>> already.  A node in a higher tier can demote to any node in the lower
>>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>
>>> IIRC, they are not the same. In memory_tier[number], the number is shared by
>>> memory using the same memory driver (dax/kmem etc.); it does not reflect the actual
>>> distance across nodes (different distances may be grouped into the same memory_tier).
>>> But demotion will only select the nearest nodelist to demote to.
>> 
>> In the following patchset, we will use the performance information from
>> HMAT to place nodes using the same memory driver into different memory
>> tiers.
>> 
>> https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@intel.com/
>
> Thanks for your reminder. It seems like I've fallen behind the world by months.
> I will rebase on it later if this patch is still needed.
>
>> 
>> The patch is in mm-stable tree.
>> 
>>> Below is an example, node0 node1 are DRAM, node2 node3 are PMEM, but distance to DRAM nodes
>>> are different.
>>>   
>>> # numactl -H
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0
>>> node 0 size: 964 MB
>>> node 0 free: 746 MB
>>> node 1 cpus: 1
>>> node 1 size: 685 MB
>>> node 1 free: 455 MB
>>> node 2 cpus:
>>> node 2 size: 896 MB
>>> node 2 free: 897 MB
>>> node 3 cpus:
>>> node 3 size: 896 MB
>>> node 3 free: 896 MB
>>> node distances:
>>> node   0   1   2   3
>>>    0:  10  20  20  25
>>>    1:  20  10  25  20
>>>    2:  20  25  10  20
>>>    3:  25  20  20  10
>>> # cat /sys/devices/system/node/node0/demotion_nodes
>>> 2
>> 
>> node 2 is only the preferred demotion target.  In fact, memory in node 0
>> can be demoted to node 2,3.  Please check demote_folio_list() for
>> details.
>
> Have I missed something? At least on the master tree, nd->preferred only includes the
> nearest nodes (selected by a specific algorithm), so in the above NUMA topology, nd->preferred
> of node0 is node2 only. node0's distance to node3 is 25, greater than its distance to node2 (20).
>
>> 1657         int target_nid = next_demotion_node(pgdat->node_id);
>
> So target_nid cannot be node3 IIUC.
>
> (I cooked these patches weeks ago; maybe something has changed. I will also take a deeper look later.)
>
> 1650 /*
> 1651  * Take folios on @demote_folios and attempt to demote them to another node.
> 1652  * Folios which are not demoted are left on @demote_folios.
> 1653  */
> 1654 static unsigned int demote_folio_list(struct list_head *demote_folios,
> 1655                                      struct pglist_data *pgdat)
> 1656 {
> 1657         int target_nid = next_demotion_node(pgdat->node_id);
> 1658         unsigned int nr_succeeded;
> 1659         nodemask_t allowed_mask;
> 1660
> 1661         struct migration_target_control mtc = {
> 1662                 /*
> 1663                  * Allocate from 'node', or fail quickly and quietly.
> 1664                  * When this happens, 'page' will likely just be discarded
> 1665                  * instead of migrated.
> 1666                  */
> 1667                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> 1668                         __GFP_NOMEMALLOC | GFP_NOWAIT,
> 1669                 .nid = target_nid,
> 1670                 .nmask = &allowed_mask
> 1671         };
> 1672
> 1673         if (list_empty(demote_folios))
> 1674                 return 0;
> 1675
> 1676         if (target_nid == NUMA_NO_NODE)
> 1677                 return 0;
> 1678
> 1679         node_get_allowed_targets(pgdat, &allowed_mask);
> 1680
> 1681         /* Demotion ignores all cpuset and mempolicy settings */
> 1682         migrate_pages(demote_folios, alloc_demote_folio, NULL,
> 1683                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> 1684                       &nr_succeeded);
>

In alloc_demote_folio(), target_nid is tried first. Then, if allocation
fails, any node in allowed_mask will be tried.
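To make that two-step behavior concrete, here is a toy model (Python, not kernel code) of trying the preferred target first and then falling back to the allowed mask. The function and its arguments are illustrative, not the kernel's actual API.

```python
def pick_demotion_target(preferred, allowed, has_free):
    """Toy model: try the preferred target node first; if allocation
    there fails, fall back to any other node in the allowed mask,
    mirroring the alloc_demote_folio() behavior described above."""
    if has_free(preferred):
        return preferred
    for node in sorted(allowed - {preferred}):
        if has_free(node):
            return node
    return None  # demotion fails; the folio stays on the source node

# Topology from the thread: node0 prefers node2, allowed mask is {2, 3}.
free_pages = {2: 0, 3: 128}  # node2 is full in this example
target = pick_demotion_target(2, {2, 3}, lambda n: free_pages[n] > 0)
print(target)  # 3: allocation falls back beyond the preferred node
```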

--
Best Regards,
Huang, Ying

>> 
>>> # cat /sys/devices/system/node/node1/demotion_nodes
>>> 3
>>> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
>>> 2-3
>>>
>>> Thanks
>>> Zhijian
>>>
>>> (I hate the outlook native reply composition format.)
>>> ________________________________________
>>> From: Huang, Ying <ying.huang@intel.com>
>>> Sent: Thursday, November 2, 2023 11:17
>>> To: Li, Zhijian/李 智坚
>>> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@kernel.org; linux-mm@kvack.org; Gotou, Yasunori/五島 康文; linux-kernel@vger.kernel.org
>>> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
>>>
>>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>>
>>>> It shows the demotion target nodes of a node. Export this information to
>>>> user directly.
>>>>
>>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>>   <show nothing>
>>>> - After node3 is online as kmem
>>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>>> [
>>>>    {
>>>>      "chardev":"dax0.0",
>>>>      "size":1054867456,
>>>>      "target_node":3,
>>>>      "align":2097152,
>>>>      "mode":"system-ram",
>>>>      "online_memblocks":0,
>>>>      "total_memblocks":7
>>>>    }
>>>> ]
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>>   <show nothing>
>>>
>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>> already.  A node in a higher tier can demote to any node in the lower
>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>
>>> --
>>> Best Regards,
>>> Huang, Ying
>>>
>>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>>> ---
>>>>   drivers/base/node.c          | 13 +++++++++++++
>>>>   include/linux/memory-tiers.h |  6 ++++++
>>>>   mm/memory-tiers.c            |  8 ++++++++
>>>>   3 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index 493d533f8375..27e8502548a7 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -7,6 +7,7 @@
>>>>   #include <linux/init.h>
>>>>   #include <linux/mm.h>
>>>>   #include <linux/memory.h>
>>>> +#include <linux/memory-tiers.h>
>>>>   #include <linux/vmstat.h>
>>>>   #include <linux/notifier.h>
>>>>   #include <linux/node.h>
>>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>>   }
>>>>   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>>
>>>> +static ssize_t demotion_nodes_show(struct device *dev,
>>>> +                          struct device_attribute *attr, char *buf)
>>>> +{
>>>> +     int ret;
>>>> +     nodemask_t nmask = next_demotion_nodes(dev->id);
>>>> +
>>>> +     ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>>> +     return ret;
>>>> +}
>>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>>> +
>>>>   static struct attribute *node_dev_attrs[] = {
>>>>        &dev_attr_meminfo.attr,
>>>>        &dev_attr_numastat.attr,
>>>>        &dev_attr_distance.attr,
>>>>        &dev_attr_vmstat.attr,
>>>> +     &dev_attr_demotion_nodes.attr,
>>>>        NULL
>>>>   };
>>>>
>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>> index 437441cdf78f..8eb04923f965 100644
>>>> --- a/include/linux/memory-tiers.h
>>>> +++ b/include/linux/memory-tiers.h
>>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>>   void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>>   #ifdef CONFIG_MIGRATION
>>>>   int next_demotion_node(int node);
>>>> +nodemask_t next_demotion_nodes(int node);
>>>>   void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>>   bool node_is_toptier(int node);
>>>>   #else
>>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>>        return NUMA_NO_NODE;
>>>>   }
>>>>
>>>> +static inline nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +     return NODE_MASK_NONE;
>>>> +}
>>>> +
>>>>   static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>   {
>>>>        *targets = NODE_MASK_NONE;
>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>> index 37a4f59d9585..90047f37d98a 100644
>>>> --- a/mm/memory-tiers.c
>>>> +++ b/mm/memory-tiers.c
>>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>        rcu_read_unlock();
>>>>   }
>>>>
>>>> +nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +     if (!node_demotion)
>>>> +             return NODE_MASK_NONE;
>>>> +
>>>> +     return node_demotion[node].preferred;
>>>> +}
>>>> +
>>>>   /**
>>>>    * next_demotion_node() - Get the next node in the demotion path
>>>>    * @node: The starting node to lookup the next node


* Re: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  5:45   ` Huang, Ying
@ 2023-11-02  6:34     ` Zhijian Li (Fujitsu)
  2023-11-02  6:56       ` Huang, Ying
  2023-11-02  7:38       ` Yasunori Gotou (Fujitsu)
  0 siblings, 2 replies; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-11-02  6:34 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel



On 02/11/2023 13:45, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
> 
>> pgdemote_src_*: pages demoted from this node.
>> pgdemote_dst_*: pages demoted to this node.
>>
>> So that we are able to know their demotion per-node stats by checking this.
>>
>> In the environment, node0 and node1 are DRAM, node3 is PMEM.
>>
>> Global stats:
>> $ grep -E 'demote' /proc/vmstat
>> pgdemote_src_kswapd 130155
>> pgdemote_src_direct 113497
>> pgdemote_src_khugepaged 0
>> pgdemote_dst_kswapd 130155
>> pgdemote_dst_direct 113497
>> pgdemote_dst_khugepaged 0
>>
>> Per-node stats:
>> $ grep demote /sys/devices/system/node/node0/vmstat
>> pgdemote_src_kswapd 68454
>> pgdemote_src_direct 83431
>> pgdemote_src_khugepaged 0
>> pgdemote_dst_kswapd 0
>> pgdemote_dst_direct 0
>> pgdemote_dst_khugepaged 0
>>
>> $ grep demote /sys/devices/system/node/node1/vmstat
>> pgdemote_src_kswapd 185834
>> pgdemote_src_direct 30066
>> pgdemote_src_khugepaged 0
>> pgdemote_dst_kswapd 0
>> pgdemote_dst_direct 0
>> pgdemote_dst_khugepaged 0
>>
>> $ grep demote /sys/devices/system/node/node3/vmstat
>> pgdemote_src_kswapd 0
>> pgdemote_src_direct 0
>> pgdemote_src_khugepaged 0
>> pgdemote_dst_kswapd 254288
>> pgdemote_dst_direct 113497
>> pgdemote_dst_khugepaged 0
>>
>>  From the above stats, we know that node3 is the demotion destination to which
>> node0 and node1 demote.
> 
> Why do we need these information?  Do you have some use case?

I recall our customers mentioning that they want to know how much memory is demoted
to the CXL memory device in a specific period.
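As a rough sketch of that measurement, one could sample the per-node counters twice and diff them. The snapshot strings below are made up for illustration; on a real system they would come from /sys/devices/system/node/nodeN/vmstat, with the counter names proposed in this patch.

```python
def parse_vmstat(text):
    """Parse 'name value' lines as found in nodeN/vmstat."""
    return {k: int(v) for k, v in (line.split() for line in text.splitlines())}

def demoted_in_period(before, after, prefix="pgdemote_dst_"):
    """Pages demoted *to* this node between two snapshots."""
    return sum(after[k] - before[k] for k in after if k.startswith(prefix))

# Made-up snapshots (values loosely based on the examples in this thread).
t0 = parse_vmstat("pgdemote_dst_kswapd 254288\n"
                  "pgdemote_dst_direct 113497\n"
                  "pgdemote_dst_khugepaged 0")
t1 = parse_vmstat("pgdemote_dst_kswapd 260288\n"
                  "pgdemote_dst_direct 115497\n"
                  "pgdemote_dst_khugepaged 0")
print(demoted_in_period(t0, t1))  # 8000 pages demoted to this node
```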


>>>   	mod_node_page_state(NODE_DATA(target_nid),
>>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);

But if *target_nid* only indicates the preferred node, this accounting may not be accurate.


Thanks
Zhijian

> 
> --
> Best Regards,
> Huang, Ying
> 
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>> RFC: their names are open to discussion, maybe pgdemote_from/to_*
>> Another defect of this patch is that SUM(pgdemote_src_*) is always the same
>> as SUM(pgdemote_dst_*) in the global stats; shall we hide one of them?
>> ---
>>   include/linux/mmzone.h |  9 ++++++---
>>   mm/vmscan.c            | 13 ++++++++++---
>>   mm/vmstat.c            |  9 ++++++---
>>   3 files changed, 22 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index ad0309eea850..a6140d894bec 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -207,9 +207,12 @@ enum node_stat_item {
>>   	PGPROMOTE_SUCCESS,	/* promote successfully */
>>   	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
>>   	/* PGDEMOTE_*: pages demoted */
>> -	PGDEMOTE_KSWAPD,
>> -	PGDEMOTE_DIRECT,
>> -	PGDEMOTE_KHUGEPAGED,
>> +	PGDEMOTE_SRC_KSWAPD,
>> +	PGDEMOTE_SRC_DIRECT,
>> +	PGDEMOTE_SRC_KHUGEPAGED,
>> +	PGDEMOTE_DST_KSWAPD,
>> +	PGDEMOTE_DST_DIRECT,
>> +	PGDEMOTE_DST_KHUGEPAGED,
>>   #endif
>>   	NR_VM_NODE_STAT_ITEMS
>>   };
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2f1fb4ec3235..55d2287d7150 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1111,13 +1111,18 @@ void drop_slab(void)
>>   static int reclaimer_offset(void)
>>   {
>>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
>> -			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
>> +			PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
>>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
>>   			PGSCAN_DIRECT - PGSCAN_KSWAPD);
>>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
>> -			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
>> +			PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
>>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
>>   			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
>> +	BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
>> +			PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
>> +	BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
>> +			PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
>> +
>>   
>>   	if (current_is_kswapd())
>>   		return 0;
>> @@ -1678,8 +1683,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>>   		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>>   		      &nr_succeeded);
>>   
>> +	mod_node_page_state(pgdat,
>> +		    PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
>>   	mod_node_page_state(NODE_DATA(target_nid),
>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
>>   
>>   	return nr_succeeded;
>>   }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index f141c48c39e4..63f106a5e008 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1244,9 +1244,12 @@ const char * const vmstat_text[] = {
>>   #ifdef CONFIG_NUMA_BALANCING
>>   	"pgpromote_success",
>>   	"pgpromote_candidate",
>> -	"pgdemote_kswapd",
>> -	"pgdemote_direct",
>> -	"pgdemote_khugepaged",
>> +	"pgdemote_src_kswapd",
>> +	"pgdemote_src_direct",
>> +	"pgdemote_src_khugepaged",
>> +	"pgdemote_dst_kswapd",
>> +	"pgdemote_dst_direct",
>> +	"pgdemote_dst_khugepaged",
>>   #endif
>>   
>>   	/* enum writeback_stat_item counters */


* Re: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  6:34     ` Zhijian Li (Fujitsu)
@ 2023-11-02  6:56       ` Huang, Ying
  2023-11-02  7:38       ` Yasunori Gotou (Fujitsu)
  1 sibling, 0 replies; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  6:56 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel

"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 02/11/2023 13:45, Huang, Ying wrote:
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>> 
>>> pgdemote_src_*: pages demoted from this node.
>>> pgdemote_dst_*: pages demoted to this node.
>>>
>>> So that we are able to know their demotion per-node stats by checking this.
>>>
>>> In the environment, node0 and node1 are DRAM, node3 is PMEM.
>>>
>>> Global stats:
>>> $ grep -E 'demote' /proc/vmstat
>>> pgdemote_src_kswapd 130155
>>> pgdemote_src_direct 113497
>>> pgdemote_src_khugepaged 0
>>> pgdemote_dst_kswapd 130155
>>> pgdemote_dst_direct 113497
>>> pgdemote_dst_khugepaged 0
>>>
>>> Per-node stats:
>>> $ grep demote /sys/devices/system/node/node0/vmstat
>>> pgdemote_src_kswapd 68454
>>> pgdemote_src_direct 83431
>>> pgdemote_src_khugepaged 0
>>> pgdemote_dst_kswapd 0
>>> pgdemote_dst_direct 0
>>> pgdemote_dst_khugepaged 0
>>>
>>> $ grep demote /sys/devices/system/node/node1/vmstat
>>> pgdemote_src_kswapd 185834
>>> pgdemote_src_direct 30066
>>> pgdemote_src_khugepaged 0
>>> pgdemote_dst_kswapd 0
>>> pgdemote_dst_direct 0
>>> pgdemote_dst_khugepaged 0
>>>
>>> $ grep demote /sys/devices/system/node/node3/vmstat
>>> pgdemote_src_kswapd 0
>>> pgdemote_src_direct 0
>>> pgdemote_src_khugepaged 0
>>> pgdemote_dst_kswapd 254288
>>> pgdemote_dst_direct 113497
>>> pgdemote_dst_khugepaged 0
>>>
>>>  From above stats, we know node3 is the demotion destination which one
>>> the node0 and node1 will demote to.
>> 
>> Why do we need these information?  Do you have some use case?
>
> I recall our customers have mentioned that they want to know how much the memory is demoted
> to the CXL memory device in a specific period.

This doesn't sound like a use case.  Can you elaborate on it?  What can
only be tuned with the help of the added stats?

--
Best Regards,
Huang, Ying

>
>>>>   	mod_node_page_state(NODE_DATA(target_nid),
>>>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>>>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
>
> But if the *target_nid* is only indicate the preferred node, this accounting maybe not accurate.
>
>
> Thanks
> Zhijian
>
>> 
>> --
>> Best Regards,
>> Huang, Ying
>> 
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>> ---
>>> RFC: their names are open to discussion, maybe pgdemote_from/to_*
>>> Another defect of this patch is that, SUM(pgdemote_src_*) is always same
>>> as SUM(pgdemote_dst_*) in the global stats, shall we hide one of them.
>>> ---
>>>   include/linux/mmzone.h |  9 ++++++---
>>>   mm/vmscan.c            | 13 ++++++++++---
>>>   mm/vmstat.c            |  9 ++++++---
>>>   3 files changed, 22 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index ad0309eea850..a6140d894bec 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -207,9 +207,12 @@ enum node_stat_item {
>>>   	PGPROMOTE_SUCCESS,	/* promote successfully */
>>>   	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
>>>   	/* PGDEMOTE_*: pages demoted */
>>> -	PGDEMOTE_KSWAPD,
>>> -	PGDEMOTE_DIRECT,
>>> -	PGDEMOTE_KHUGEPAGED,
>>> +	PGDEMOTE_SRC_KSWAPD,
>>> +	PGDEMOTE_SRC_DIRECT,
>>> +	PGDEMOTE_SRC_KHUGEPAGED,
>>> +	PGDEMOTE_DST_KSWAPD,
>>> +	PGDEMOTE_DST_DIRECT,
>>> +	PGDEMOTE_DST_KHUGEPAGED,
>>>   #endif
>>>   	NR_VM_NODE_STAT_ITEMS
>>>   };
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 2f1fb4ec3235..55d2287d7150 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1111,13 +1111,18 @@ void drop_slab(void)
>>>   static int reclaimer_offset(void)
>>>   {
>>>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
>>> -			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
>>> +			PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
>>>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
>>>   			PGSCAN_DIRECT - PGSCAN_KSWAPD);
>>>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
>>> -			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
>>> +			PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
>>>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
>>>   			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
>>> +	BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
>>> +			PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
>>> +	BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
>>> +			PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
>>> +
>>>   
>>>   	if (current_is_kswapd())
>>>   		return 0;
>>> @@ -1678,8 +1683,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>>>   		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>>>   		      &nr_succeeded);
>>>   
>>> +	mod_node_page_state(pgdat,
>>> +		    PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
>>>   	mod_node_page_state(NODE_DATA(target_nid),
>>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
>>>   
>>>   	return nr_succeeded;
>>>   }
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index f141c48c39e4..63f106a5e008 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1244,9 +1244,12 @@ const char * const vmstat_text[] = {
>>>   #ifdef CONFIG_NUMA_BALANCING
>>>   	"pgpromote_success",
>>>   	"pgpromote_candidate",
>>> -	"pgdemote_kswapd",
>>> -	"pgdemote_direct",
>>> -	"pgdemote_khugepaged",
>>> +	"pgdemote_src_kswapd",
>>> +	"pgdemote_src_direct",
>>> +	"pgdemote_src_khugepaged",
>>> +	"pgdemote_dst_kswapd",
>>> +	"pgdemote_dst_direct",
>>> +	"pgdemote_dst_khugepaged",
>>>   #endif
>>>   
>>>   	/* enum writeback_stat_item counters */


* RE: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  6:34     ` Zhijian Li (Fujitsu)
  2023-11-02  6:56       ` Huang, Ying
@ 2023-11-02  7:38       ` Yasunori Gotou (Fujitsu)
  2023-11-02  7:46         ` Huang, Ying
  1 sibling, 1 reply; 34+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-11-02  7:38 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	linux-kernel, Zhijian Li (Fujitsu)

Hello,

> On 02/11/2023 13:45, Huang, Ying wrote:
> > Li Zhijian <lizhijian@fujitsu.com> writes:
> >
> >> pgdemote_src_*: pages demoted from this node.
> >> pgdemote_dst_*: pages demoted to this node.
> >>
> >> So that we are able to know their demotion per-node stats by checking this.
> >>
> >> In the environment, node0 and node1 are DRAM, node3 is PMEM.
> >>
> >> Global stats:
> >> $ grep -E 'demote' /proc/vmstat
> >> pgdemote_src_kswapd 130155
> >> pgdemote_src_direct 113497
> >> pgdemote_src_khugepaged 0
> >> pgdemote_dst_kswapd 130155
> >> pgdemote_dst_direct 113497
> >> pgdemote_dst_khugepaged 0
> >>
> >> Per-node stats:
> >> $ grep demote /sys/devices/system/node/node0/vmstat
> >> pgdemote_src_kswapd 68454
> >> pgdemote_src_direct 83431
> >> pgdemote_src_khugepaged 0
> >> pgdemote_dst_kswapd 0
> >> pgdemote_dst_direct 0
> >> pgdemote_dst_khugepaged 0
> >>
> >> $ grep demote /sys/devices/system/node/node1/vmstat
> >> pgdemote_src_kswapd 185834
> >> pgdemote_src_direct 30066
> >> pgdemote_src_khugepaged 0
> >> pgdemote_dst_kswapd 0
> >> pgdemote_dst_direct 0
> >> pgdemote_dst_khugepaged 0
> >>
> >> $ grep demote /sys/devices/system/node/node3/vmstat
> >> pgdemote_src_kswapd 0
> >> pgdemote_src_direct 0
> >> pgdemote_src_khugepaged 0
> >> pgdemote_dst_kswapd 254288
> >> pgdemote_dst_direct 113497
> >> pgdemote_dst_khugepaged 0
> >>
> >>  From above stats, we know node3 is the demotion destination which one
> >> the node0 and node1 will demote to.
> >
> > Why do we need these information?  Do you have some use case?
> 
> I recall our customers have mentioned that they want to know how much the
> memory is demoted to the CXL memory device in a specific period.

Let me elaborate on that a bit more.

I had a conversation with one of our customers. He expressed a desire for more detailed
profiling information to analyze the behavior of demotion (and promotion) when
his workloads are executed.
If the results are not satisfactory, he wants to use these profiles to tune
his servers for his workloads.
Additionally, depending on the results, he may want to change his server configuration.
For example, he may want to buy more expensive DDR memory rather than cheaper CXL memory.

My impression is that our customers do not yet consider CXL memory to be as reliable as
DDR memory. Therefore, they want to prepare for the new world that CXL will bring, and
they want as much profiling information as possible to support that preparation.

Is this enough to answer your question?

Thanks,

> 
> 
> >>>   	mod_node_page_state(NODE_DATA(target_nid),
> >>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> >>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
> 
> But if the *target_nid* is only indicate the preferred node, this accounting
> maybe not accurate.
> 
> 
> Thanks
> Zhijian
> 
> >
> > --
> > Best Regards,
> > Huang, Ying
> >
> >> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> >> ---
> >> RFC: their names are open to discussion, maybe pgdemote_from/to_*
> >> Another defect of this patch is that, SUM(pgdemote_src_*) is always same
> >> as SUM(pgdemote_dst_*) in the global stats, shall we hide one of them.
> >> ---
> >>   include/linux/mmzone.h |  9 ++++++---
> >>   mm/vmscan.c            | 13 ++++++++++---
> >>   mm/vmstat.c            |  9 ++++++---
> >>   3 files changed, 22 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index ad0309eea850..a6140d894bec 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -207,9 +207,12 @@ enum node_stat_item {
> >>   	PGPROMOTE_SUCCESS,	/* promote successfully */
> >>   	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
> >>   	/* PGDEMOTE_*: pages demoted */
> >> -	PGDEMOTE_KSWAPD,
> >> -	PGDEMOTE_DIRECT,
> >> -	PGDEMOTE_KHUGEPAGED,
> >> +	PGDEMOTE_SRC_KSWAPD,
> >> +	PGDEMOTE_SRC_DIRECT,
> >> +	PGDEMOTE_SRC_KHUGEPAGED,
> >> +	PGDEMOTE_DST_KSWAPD,
> >> +	PGDEMOTE_DST_DIRECT,
> >> +	PGDEMOTE_DST_KHUGEPAGED,
> >>   #endif
> >>   	NR_VM_NODE_STAT_ITEMS
> >>   };
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 2f1fb4ec3235..55d2287d7150 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1111,13 +1111,18 @@ void drop_slab(void)
> >>   static int reclaimer_offset(void)
> >>   {
> >>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> >> -			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
> >> +			PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
> >>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> >>   			PGSCAN_DIRECT - PGSCAN_KSWAPD);
> >>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> >> -			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
> >> +			PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
> >>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> >>   			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
> >> +	BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
> >> +			PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
> >> +	BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
> >> +			PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
> >> +
> >>
> >>   	if (current_is_kswapd())
> >>   		return 0;
> >> @@ -1678,8 +1683,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> >>   		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> >>   		      &nr_succeeded);
> >>
> >> +	mod_node_page_state(pgdat,
> >> +		    PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
> >>   	mod_node_page_state(NODE_DATA(target_nid),
> >> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> >> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
> >>
> >>   	return nr_succeeded;
> >>   }
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index f141c48c39e4..63f106a5e008 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1244,9 +1244,12 @@ const char * const vmstat_text[] = {
> >>   #ifdef CONFIG_NUMA_BALANCING
> >>   	"pgpromote_success",
> >>   	"pgpromote_candidate",
> >> -	"pgdemote_kswapd",
> >> -	"pgdemote_direct",
> >> -	"pgdemote_khugepaged",
> >> +	"pgdemote_src_kswapd",
> >> +	"pgdemote_src_direct",
> >> +	"pgdemote_src_khugepaged",
> >> +	"pgdemote_dst_kswapd",
> >> +	"pgdemote_dst_direct",
> >> +	"pgdemote_dst_khugepaged",
> >>   #endif
> >>
> >>   	/* enum writeback_stat_item counters */


* Re: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  7:38       ` Yasunori Gotou (Fujitsu)
@ 2023-11-02  7:46         ` Huang, Ying
  2023-11-02  9:45           ` Yasunori Gotou (Fujitsu)
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2023-11-02  7:46 UTC (permalink / raw)
  To: Yasunori Gotou (Fujitsu)
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	linux-kernel, Zhijian Li (Fujitsu)

"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com> writes:

> Hello,
>
>> On 02/11/2023 13:45, Huang, Ying wrote:
>> > Li Zhijian <lizhijian@fujitsu.com> writes:
>> >
>> >> pgdemote_src_*: pages demoted from this node.
>> >> pgdemote_dst_*: pages demoted to this node.
>> >>
>> >> So that we are able to know their demotion per-node stats by checking this.
>> >>
>> >> In the environment, node0 and node1 are DRAM, node3 is PMEM.
>> >>
>> >> Global stats:
>> >> $ grep -E 'demote' /proc/vmstat
>> >> pgdemote_src_kswapd 130155
>> >> pgdemote_src_direct 113497
>> >> pgdemote_src_khugepaged 0
>> >> pgdemote_dst_kswapd 130155
>> >> pgdemote_dst_direct 113497
>> >> pgdemote_dst_khugepaged 0
>> >>
>> >> Per-node stats:
>> >> $ grep demote /sys/devices/system/node/node0/vmstat
>> >> pgdemote_src_kswapd 68454
>> >> pgdemote_src_direct 83431
>> >> pgdemote_src_khugepaged 0
>> >> pgdemote_dst_kswapd 0
>> >> pgdemote_dst_direct 0
>> >> pgdemote_dst_khugepaged 0
>> >>
>> >> $ grep demote /sys/devices/system/node/node1/vmstat
>> >> pgdemote_src_kswapd 185834
>> >> pgdemote_src_direct 30066
>> >> pgdemote_src_khugepaged 0
>> >> pgdemote_dst_kswapd 0
>> >> pgdemote_dst_direct 0
>> >> pgdemote_dst_khugepaged 0
>> >>
>> >> $ grep demote /sys/devices/system/node/node3/vmstat
>> >> pgdemote_src_kswapd 0
>> >> pgdemote_src_direct 0
>> >> pgdemote_src_khugepaged 0
>> >> pgdemote_dst_kswapd 254288
>> >> pgdemote_dst_direct 113497
>> >> pgdemote_dst_khugepaged 0
>> >>
>> >>  From above stats, we know node3 is the demotion destination which one
>> >> the node0 and node1 will demote to.
>> >
>> > Why do we need these information?  Do you have some use case?
>> 
>> I recall our customers have mentioned that they want to know how much the
>> memory is demoted to the CXL memory device in a specific period.
>
> I'll mention about it more.
>
> I had a conversation with one of our customers. He expressed a desire for more detailed
> profile information to analyze the behavior of demotion (and promotion) when
> his workloads are executed. 
> If the results are not satisfactory for his workloads, he wants to tune his servers for his workloads
> with these profiles.
> Additionally, depending on the results, he may want to change his server configuration. 
> For example, he may want to buy more expensive DDR memories rather than cheaper CXL memory.
>
> In my impression, our customers seems to think that CXL memory is NOT as reliable as DDR memory yet.
> Therefore, they want to prepare for the new world that CXL will bring, and want to have a method 
> for the preparation by profiling information as much as possible.
>
> it this enough for your question?

I'd like some more detailed information about how these stats would be used.
Why isn't a per-node pgdemote_xxx counter enough?

--
Best Regards,
Huang, Ying

> Thanks,
>
>> 
>> 
>> >>>   	mod_node_page_state(NODE_DATA(target_nid),
>> >>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>> >>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
>> 
>> But if the *target_nid* is only indicate the preferred node, this accounting
>> maybe not accurate.
>> 
>> 
>> Thanks
>> Zhijian
>> 
>> >
>> > --
>> > Best Regards,
>> > Huang, Ying
>> >
>> >> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> >> ---
>> >> RFC: their names are open to discussion, maybe pgdemote_from/to_*
>> >> Another defect of this patch is that SUM(pgdemote_src_*) is always the
>> >> same as SUM(pgdemote_dst_*) in the global stats; shall we hide one of them?
>> >> ---
>> >>   include/linux/mmzone.h |  9 ++++++---
>> >>   mm/vmscan.c            | 13 ++++++++++---
>> >>   mm/vmstat.c            |  9 ++++++---
>> >>   3 files changed, 22 insertions(+), 9 deletions(-)
>> >>
>> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> >> index ad0309eea850..a6140d894bec 100644
>> >> --- a/include/linux/mmzone.h
>> >> +++ b/include/linux/mmzone.h
>> >> @@ -207,9 +207,12 @@ enum node_stat_item {
>> >>   	PGPROMOTE_SUCCESS,	/* promote successfully */
>> >>   	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
>> >>   	/* PGDEMOTE_*: pages demoted */
>> >> -	PGDEMOTE_KSWAPD,
>> >> -	PGDEMOTE_DIRECT,
>> >> -	PGDEMOTE_KHUGEPAGED,
>> >> +	PGDEMOTE_SRC_KSWAPD,
>> >> +	PGDEMOTE_SRC_DIRECT,
>> >> +	PGDEMOTE_SRC_KHUGEPAGED,
>> >> +	PGDEMOTE_DST_KSWAPD,
>> >> +	PGDEMOTE_DST_DIRECT,
>> >> +	PGDEMOTE_DST_KHUGEPAGED,
>> >>   #endif
>> >>   	NR_VM_NODE_STAT_ITEMS
>> >>   };
>> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> index 2f1fb4ec3235..55d2287d7150 100644
>> >> --- a/mm/vmscan.c
>> >> +++ b/mm/vmscan.c
>> >> @@ -1111,13 +1111,18 @@ void drop_slab(void)
>> >>   static int reclaimer_offset(void)
>> >>   {
>> >>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
>> >> -			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
>> >> +			PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
>> >>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
>> >>   			PGSCAN_DIRECT - PGSCAN_KSWAPD);
>> >>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
>> >> -			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
>> >> +			PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
>> >>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
>> >>   			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
>> >> +	BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
>> >> +			PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
>> >> +	BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
>> >> +			PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
>> >> +
>> >>
>> >>   	if (current_is_kswapd())
>> >>   		return 0;
>> >> @@ -1678,8 +1683,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>> >>   		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>> >>   		      &nr_succeeded);
>> >>
>> >> +	mod_node_page_state(pgdat,
>> >> +		    PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
>> >>   	mod_node_page_state(NODE_DATA(target_nid),
>> >> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>> >> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
>> >>
>> >>   	return nr_succeeded;
>> >>   }
>> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> >> index f141c48c39e4..63f106a5e008 100644
>> >> --- a/mm/vmstat.c
>> >> +++ b/mm/vmstat.c
>> >> @@ -1244,9 +1244,12 @@ const char * const vmstat_text[] = {
>> >>   #ifdef CONFIG_NUMA_BALANCING
>> >>   	"pgpromote_success",
>> >>   	"pgpromote_candidate",
>> >> -	"pgdemote_kswapd",
>> >> -	"pgdemote_direct",
>> >> -	"pgdemote_khugepaged",
>> >> +	"pgdemote_src_kswapd",
>> >> +	"pgdemote_src_direct",
>> >> +	"pgdemote_src_khugepaged",
>> >> +	"pgdemote_dst_kswapd",
>> >> +	"pgdemote_dst_direct",
>> >> +	"pgdemote_dst_khugepaged",
>> >>   #endif
>> >>
>> >>   	/* enum writeback_stat_item counters */

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 4/4] drivers/base/node: add demote_src and demote_dst to numastat
  2023-11-02  5:40   ` Greg Kroah-Hartman
@ 2023-11-02  8:15     ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-11-02  8:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Andrew Morton, rafael, linux-mm, ying.huang,
	Yasunori Gotou (Fujitsu),
	linux-kernel



On 02/11/2023 13:40, Greg Kroah-Hartman wrote:
>> index 27e8502548a7..d3fc70599b6a 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -496,20 +496,32 @@ static DEVICE_ATTR(meminfo, 0444, node_read_meminfo, NULL);
>>   static ssize_t node_read_numastat(struct device *dev,
>>   				  struct device_attribute *attr, char *buf)
>>   {
>> +	struct pglist_data *pgdat = NODE_DATA(dev->id);
>> +	unsigned long demote_src, demote_dst;
>> +
>>   	fold_vm_numa_events();
>> +	demote_src = node_page_state_pages(pgdat, PGDEMOTE_SRC_KSWAPD) +
>> +		     node_page_state_pages(pgdat, PGDEMOTE_SRC_DIRECT) +
>> +		     node_page_state_pages(pgdat, PGDEMOTE_SRC_KHUGEPAGED);
>> +	demote_dst = node_page_state_pages(pgdat, PGDEMOTE_DST_KSWAPD) +
>> +		     node_page_state_pages(pgdat, PGDEMOTE_DST_DIRECT) +
>> +		     node_page_state_pages(pgdat, PGDEMOTE_DST_KHUGEPAGED);
>>   	return sysfs_emit(buf,
>>   			  "numa_hit %lu\n"
>>   			  "numa_miss %lu\n"
>>   			  "numa_foreign %lu\n"
>>   			  "interleave_hit %lu\n"
>>   			  "local_node %lu\n"
>> -			  "other_node %lu\n",
>> +			  "other_node %lu\n"
>> +			  "demote_src %lu\n"
>> +			  "demote_dst %lu\n",
> This sysfs file is already a total abuse of sysfs so please, do NOT make
> it worse by adding more fields, that's just wrong and something I can
> not take at all for obvious reasons.
> 

Alright, thank you for your feedback. We will reconsider other options if necessary.

Thanks
Zhijian


> thanks,
> 
> greg k-h


* RE: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  7:46         ` Huang, Ying
@ 2023-11-02  9:45           ` Yasunori Gotou (Fujitsu)
  2023-11-03  6:14             ` Huang, Ying
  0 siblings, 1 reply; 34+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-11-02  9:45 UTC (permalink / raw)
  To: 'Huang, Ying'
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	linux-kernel, Zhijian Li (Fujitsu)

> > Hello,
> >
> >> On 02/11/2023 13:45, Huang, Ying wrote:
> >> > Li Zhijian <lizhijian@fujitsu.com> writes:
> >> >
> >> >> pgdemote_src_*: pages demoted from this node.
> >> >> pgdemote_dst_*: pages demoted to this node.
> >> >>
> >> >> So that we are able to know their demotion per-node stats by checking
> this.
> >> >>
> >> >> In the environment, node0 and node1 are DRAM, node3 is PMEM.
> >> >>
> >> >> Global stats:
> >> >> $ grep -E 'demote' /proc/vmstat
> >> >> pgdemote_src_kswapd 130155
> >> >> pgdemote_src_direct 113497
> >> >> pgdemote_src_khugepaged 0
> >> >> pgdemote_dst_kswapd 130155
> >> >> pgdemote_dst_direct 113497
> >> >> pgdemote_dst_khugepaged 0
> >> >>
> >> >> Per-node stats:
> >> >> $ grep demote /sys/devices/system/node/node0/vmstat
> >> >> pgdemote_src_kswapd 68454
> >> >> pgdemote_src_direct 83431
> >> >> pgdemote_src_khugepaged 0
> >> >> pgdemote_dst_kswapd 0
> >> >> pgdemote_dst_direct 0
> >> >> pgdemote_dst_khugepaged 0
> >> >>
> >> >> $ grep demote /sys/devices/system/node/node1/vmstat
> >> >> pgdemote_src_kswapd 185834
> >> >> pgdemote_src_direct 30066
> >> >> pgdemote_src_khugepaged 0
> >> >> pgdemote_dst_kswapd 0
> >> >> pgdemote_dst_direct 0
> >> >> pgdemote_dst_khugepaged 0
> >> >>
> >> >> $ grep demote /sys/devices/system/node/node3/vmstat
> >> >> pgdemote_src_kswapd 0
> >> >> pgdemote_src_direct 0
> >> >> pgdemote_src_khugepaged 0
> >> >> pgdemote_dst_kswapd 254288
> >> >> pgdemote_dst_direct 113497
> >> >> pgdemote_dst_khugepaged 0
> >> >>
> >> >>  From the above stats, we know that node3 is the demotion
> >> >> destination to which node0 and node1 demote.
> >> >
> >> > Why do we need this information?  Do you have a use case?
> >>
> >> I recall that our customers have mentioned wanting to know how much
> >> memory is demoted to the CXL memory device in a specific period.
> >
> > I'll explain it a bit more.
> >
> > I had a conversation with one of our customers. He expressed a desire
> > for more detailed profiling information to analyze the behavior of
> > demotion (and promotion) when his workloads are executed.
> > If the results are not satisfactory for his workloads, he wants to use
> > these profiles to tune his servers for them.
> > Additionally, depending on the results, he may want to change his
> > server configuration.
> > For example, he may want to buy more expensive DDR memory rather than
> > cheaper CXL memory.
> >
> > My impression is that our customers do not yet consider CXL memory as
> > reliable as DDR memory.
> > Therefore, they want to prepare for the new world that CXL will bring,
> > and want a way to prepare by collecting as much profiling information
> > as possible.
> >
> > Is this enough to answer your question?
> 
> I'd like some more detailed information about how these stats are used.
> Why isn't a per-node pgdemote_xxx counter enough?

I rechecked the customer's original request.

- If a memory area is demoted to a CXL memory node, he wants to analyze how that
  affects the performance of his workload, such as latency. He wants to use CXL
  node memory usage as basic information for the analysis.
- If he notices that demotion happens frequently on a server and CXL memory stays
  around 85% utilization, he may want to add DDR DRAM or choose some other way
  to avoid demotion. (His mental model is likely swap free/used.)
  IIRC, the demotion target does not spread across all CXL memory nodes, right?
  Then he needs to know how much CXL memory is occupied by demoted pages.
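
A rough sketch of that utilization check, assuming the `pgdemote_dst_*` counter
names this RFC proposes (and noting that demoted pages may later be freed or
promoted back, so this is only an upper bound on occupancy, not an exact
resident figure):

```shell
# Percentage of a node's pages that have been demoted into it, given its
# vmstat file and its total number of pages.  Upper bound only: the dst
# counters are cumulative and never decrease when pages leave the node.
demote_occupancy_pct() {
    awk -v total="$2" '/^pgdemote_dst_/ { s += $2 }
        END { printf "%d\n", s * 100 / total }' "$1"
}
```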

If I have misunderstood something, or you have a better idea,
please let us know. I'll talk with him again (it will be next week...).

Thanks,

> 
> --
> Best Regards,
> Huang, Ying
> 
> > Thanks,
> >
> >>
> >>
> >> >>>   	mod_node_page_state(NODE_DATA(target_nid),
> >> >>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> >> >>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
> >>
> >> But if *target_nid* only indicates the preferred node, this
> >> accounting may not be accurate.
> >>
> >>
> >> Thanks
> >> Zhijian
> >>
> >> >
> >> > --
> >> > Best Regards,
> >> > Huang, Ying
> >> >
> >> >> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> >> >> ---
> >> >> RFC: their names are open to discussion, maybe pgdemote_from/to_*
> >> >> Another defect of this patch is that SUM(pgdemote_src_*) is always
> >> >> the same as SUM(pgdemote_dst_*) in the global stats; shall we hide
> >> >> one of them?
> >> >> ---
> >> >>   include/linux/mmzone.h |  9 ++++++---
> >> >>   mm/vmscan.c            | 13 ++++++++++---
> >> >>   mm/vmstat.c            |  9 ++++++---
> >> >>   3 files changed, 22 insertions(+), 9 deletions(-)
> >> >>
> >> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> >> index ad0309eea850..a6140d894bec 100644
> >> >> --- a/include/linux/mmzone.h
> >> >> +++ b/include/linux/mmzone.h
> >> >> @@ -207,9 +207,12 @@ enum node_stat_item {
> >> >>   	PGPROMOTE_SUCCESS,	/* promote successfully */
> >> >>   	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
> >> >>   	/* PGDEMOTE_*: pages demoted */
> >> >> -	PGDEMOTE_KSWAPD,
> >> >> -	PGDEMOTE_DIRECT,
> >> >> -	PGDEMOTE_KHUGEPAGED,
> >> >> +	PGDEMOTE_SRC_KSWAPD,
> >> >> +	PGDEMOTE_SRC_DIRECT,
> >> >> +	PGDEMOTE_SRC_KHUGEPAGED,
> >> >> +	PGDEMOTE_DST_KSWAPD,
> >> >> +	PGDEMOTE_DST_DIRECT,
> >> >> +	PGDEMOTE_DST_KHUGEPAGED,
> >> >>   #endif
> >> >>   	NR_VM_NODE_STAT_ITEMS
> >> >>   };
> >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >> index 2f1fb4ec3235..55d2287d7150 100644
> >> >> --- a/mm/vmscan.c
> >> >> +++ b/mm/vmscan.c
> >> >> @@ -1111,13 +1111,18 @@ void drop_slab(void)
> >> >>   static int reclaimer_offset(void)
> >> >>   {
> >> >>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> >> >> -			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
> >> >> +			PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
> >> >>   	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> >> >>   			PGSCAN_DIRECT - PGSCAN_KSWAPD);
> >> >>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> >> >> -			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
> >> >> +			PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
> >> >>   	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> >> >>   			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
> >> >> +	BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
> >> >> +			PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
> >> >> +	BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
> >> >> +			PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
> >> >> +
> >> >>
> >> >>   	if (current_is_kswapd())
> >> >>   		return 0;
> >> >> @@ -1678,8 +1683,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> >> >>   		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> >> >>   		      &nr_succeeded);
> >> >>
> >> >> +	mod_node_page_state(pgdat,
> >> >> +		    PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
> >> >>   	mod_node_page_state(NODE_DATA(target_nid),
> >> >> -		    PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> >> >> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
> >> >>
> >> >>   	return nr_succeeded;
> >> >>   }
> >> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> >> index f141c48c39e4..63f106a5e008 100644
> >> >> --- a/mm/vmstat.c
> >> >> +++ b/mm/vmstat.c
> >> >> @@ -1244,9 +1244,12 @@ const char * const vmstat_text[] = {
> >> >>   #ifdef CONFIG_NUMA_BALANCING
> >> >>   	"pgpromote_success",
> >> >>   	"pgpromote_candidate",
> >> >> -	"pgdemote_kswapd",
> >> >> -	"pgdemote_direct",
> >> >> -	"pgdemote_khugepaged",
> >> >> +	"pgdemote_src_kswapd",
> >> >> +	"pgdemote_src_direct",
> >> >> +	"pgdemote_src_khugepaged",
> >> >> +	"pgdemote_dst_kswapd",
> >> >> +	"pgdemote_dst_direct",
> >> >> +	"pgdemote_dst_khugepaged",
> >> >>   #endif
> >> >>
> >> >>   	/* enum writeback_stat_item counters */


* Re: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  2:56 ` [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_* Li Zhijian
  2023-11-02  5:45   ` Huang, Ying
@ 2023-11-02 17:16   ` kernel test robot
  1 sibling, 0 replies; 34+ messages in thread
From: kernel test robot @ 2023-11-02 17:16 UTC (permalink / raw)
  To: Li Zhijian; +Cc: llvm, oe-kbuild-all

Hi Li,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on driver-core/driver-core-next driver-core/driver-core-linus staging/staging-testing staging/staging-next staging/staging-linus linus/master v6.6 next-20231102]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-vmstat-Move-pgdemote_-to-per-node-stats/20231102-135849
base:   driver-core/driver-core-testing
patch link:    https://lore.kernel.org/r/20231102025648.1285477-4-lizhijian%40fujitsu.com
patch subject: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20231103/202311030137.Vu2ki6zm-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231103/202311030137.Vu2ki6zm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311030137.Vu2ki6zm-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/vmscan.c:19:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from mm/vmscan.c:19:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from mm/vmscan.c:19:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     692 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     700 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     708 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     717 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     726 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     735 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
>> mm/vmscan.c:1114:4: error: use of undeclared identifier 'PGDEMOTE_SRC_DIRECT'
    1114 |                         PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
         |                         ^
>> mm/vmscan.c:1114:26: error: use of undeclared identifier 'PGDEMOTE_SRC_KSWAPD'
    1114 |                         PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
         |                                               ^
>> mm/vmscan.c:1118:4: error: use of undeclared identifier 'PGDEMOTE_SRC_KHUGEPAGED'
    1118 |                         PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
         |                         ^
   mm/vmscan.c:1118:30: error: use of undeclared identifier 'PGDEMOTE_SRC_KSWAPD'
    1118 |                         PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
         |                                                   ^
   mm/vmscan.c:1121:15: error: use of undeclared identifier 'PGDEMOTE_SRC_DIRECT'
    1121 |         BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
         |                      ^
   mm/vmscan.c:1121:37: error: use of undeclared identifier 'PGDEMOTE_SRC_KSWAPD'
    1121 |         BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
         |                                            ^
>> mm/vmscan.c:1122:4: error: use of undeclared identifier 'PGDEMOTE_DST_DIRECT'
    1122 |                         PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
         |                         ^
>> mm/vmscan.c:1122:26: error: use of undeclared identifier 'PGDEMOTE_DST_KSWAPD'
    1122 |                         PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
         |                                               ^
   mm/vmscan.c:1123:15: error: use of undeclared identifier 'PGDEMOTE_SRC_KHUGEPAGED'
    1123 |         BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
         |                      ^
   mm/vmscan.c:1123:41: error: use of undeclared identifier 'PGDEMOTE_SRC_KSWAPD'
    1123 |         BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
         |                                                ^
>> mm/vmscan.c:1124:4: error: use of undeclared identifier 'PGDEMOTE_DST_KHUGEPAGED'
    1124 |                         PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
         |                         ^
   mm/vmscan.c:1124:30: error: use of undeclared identifier 'PGDEMOTE_DST_KSWAPD'
    1124 |                         PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
         |                                                   ^
   mm/vmscan.c:1687:7: error: use of undeclared identifier 'PGDEMOTE_SRC_KSWAPD'
    1687 |                     PGDEMOTE_SRC_KSWAPD + reclaimer_offset(), nr_succeeded);
         |                     ^
   mm/vmscan.c:1689:7: error: use of undeclared identifier 'PGDEMOTE_DST_KSWAPD'
    1689 |                     PGDEMOTE_DST_KSWAPD + reclaimer_offset(), nr_succeeded);
         |                     ^
   12 warnings and 14 errors generated.


vim +/PGDEMOTE_SRC_DIRECT +1114 mm/vmscan.c

  1110	
  1111	static int reclaimer_offset(void)
  1112	{
  1113		BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> 1114				PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD);
  1115		BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
  1116				PGSCAN_DIRECT - PGSCAN_KSWAPD);
  1117		BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> 1118				PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD);
  1119		BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
  1120				PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
> 1121		BUILD_BUG_ON(PGDEMOTE_SRC_DIRECT - PGDEMOTE_SRC_KSWAPD !=
> 1122				PGDEMOTE_DST_DIRECT - PGDEMOTE_DST_KSWAPD);
  1123		BUILD_BUG_ON(PGDEMOTE_SRC_KHUGEPAGED - PGDEMOTE_SRC_KSWAPD !=
> 1124				PGDEMOTE_DST_KHUGEPAGED - PGDEMOTE_DST_KSWAPD);
  1125	
  1126	
  1127		if (current_is_kswapd())
  1128			return 0;
  1129		if (current_is_khugepaged())
  1130			return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD;
  1131		return PGSTEAL_DIRECT - PGSTEAL_KSWAPD;
  1132	}
  1133	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2023-11-02  2:56 ` [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface Li Zhijian
  2023-11-02  3:17   ` Huang, Ying
@ 2023-11-03  2:21   ` kernel test robot
  1 sibling, 0 replies; 34+ messages in thread
From: kernel test robot @ 2023-11-03  2:21 UTC (permalink / raw)
  To: Li Zhijian; +Cc: oe-kbuild-all

Hi Li,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on driver-core/driver-core-next driver-core/driver-core-linus staging/staging-testing staging/staging-next staging/staging-linus linus/master v6.6 next-20231102]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-vmstat-Move-pgdemote_-to-per-node-stats/20231102-135849
base:   driver-core/driver-core-testing
patch link:    https://lore.kernel.org/r/20231102025648.1285477-2-lizhijian%40fujitsu.com
patch subject: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
config: x86_64-buildonly-randconfig-005-20231103 (https://download.01.org/0day-ci/archive/20231103/202311031052.cwUKB84l-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231103/202311031052.cwUKB84l-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311031052.cwUKB84l-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/base/node.c:10:
>> include/linux/memory-tiers.h:50:15: error: unknown type name 'next_demotion_nodes'
      50 | static inline next_demotion_nodes next_demotion_nodes(int node)
         |               ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/mmzone.h:18,
                    from include/linux/gfp.h:7,
                    from include/linux/umh.h:4,
                    from include/linux/kmod.h:9,
                    from include/linux/module.h:17,
                    from drivers/base/node.c:6:
   include/linux/memory-tiers.h: In function 'next_demotion_nodes':
>> include/linux/nodemask.h:333:1: error: incompatible types when returning type 'nodemask_t' but 'int' was expected
     333 | ((nodemask_t) { {                                                       \
         | ^
   include/linux/memory-tiers.h:52:16: note: in expansion of macro 'NODE_MASK_NONE'
      52 |         return NODE_MASK_NONE;
         |                ^~~~~~~~~~~~~~
   drivers/base/node.c: In function 'demotion_nodes_show':
>> drivers/base/node.c:577:28: error: invalid initializer
     577 |         nodemask_t nmask = next_demotion_nodes(dev->id);
         |                            ^~~~~~~~~~~~~~~~~~~


vim +/next_demotion_nodes +50 include/linux/memory-tiers.h

    49	
  > 50	static inline next_demotion_nodes next_demotion_nodes(int node)
    51	{
    52		return NODE_MASK_NONE;
    53	}
    54	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2023-11-02  5:58           ` Huang, Ying
@ 2023-11-03  3:05             ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2023-11-03  3:05 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel



On 02/11/2023 13:58, Huang, Ying wrote:
> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
> 
>> On 02/11/2023 13:18, Huang, Ying wrote:
>>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>>>
>>>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>>>> already.  A node in a higher tier can demote to any node in the lower
>>>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>>
>>>> IIRC, they are not the same. In memory_tier[number], the number is shared by
>>>> memory using the same memory driver (dax/kmem etc.). It does not reflect the
>>>> actual distance across nodes (different distances may be grouped into the same
>>>> memory_tier), but demotion will only select the nearest nodelist to demote to.
>>>
>>> In the following patchset, we will use the performance information from
>>> HMAT to place nodes using the same memory driver into different memory
>>> tiers.
>>>
>>> https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@intel.com/
>>
>> Thanks for your reminder. It seems like I've fallen behind the world by months.
>> I will rebase on it later if this patch is still needed.
>>
>>>
>>> The patch is in mm-stable tree.
>>>
>>>> Below is an example: node0 and node1 are DRAM, node2 and node3 are PMEM,
>>>> but their distances to the DRAM nodes differ.
>>>>    
>>>> # numactl -H
>>>> available: 4 nodes (0-3)
>>>> node 0 cpus: 0
>>>> node 0 size: 964 MB
>>>> node 0 free: 746 MB
>>>> node 1 cpus: 1
>>>> node 1 size: 685 MB
>>>> node 1 free: 455 MB
>>>> node 2 cpus:
>>>> node 2 size: 896 MB
>>>> node 2 free: 897 MB
>>>> node 3 cpus:
>>>> node 3 size: 896 MB
>>>> node 3 free: 896 MB
>>>> node distances:
>>>> node   0   1   2   3
>>>>     0:  10  20  20  25
>>>>     1:  20  10  25  20
>>>>     2:  20  25  10  20
>>>>     3:  25  20  20  10
>>>> # cat /sys/devices/system/node/node0/demotion_nodes
>>>> 2
>>>
>>> node 2 is only the preferred demotion target.  In fact, memory in node 0
>>> can be demoted to node 2,3.  Please check demote_folio_list() for
>>> details.
>>
>> Have I missed something? At least on the master tree, nd->preferred only includes
>> the nearest nodes (chosen by a specific algorithm), so in the above NUMA topology,
>> the nd->preferred of node0 is node2 only; node0's distance to node3 is 25, greater
>> than its distance to node2 (20).
>>
>>> 1657         int target_nid = next_demotion_node(pgdat->node_id);
>>
>> So target_nid cannot be node3 IIUC.
>>
>> (I cooked up these patches weeks ago; maybe something has changed. I will also take a deeper look later.)
>>
>> 1650 /*
>> 1651  * Take folios on @demote_folios and attempt to demote them to another node.
>> 1652  * Folios which are not demoted are left on @demote_folios.
>> 1653  */
>> 1654 static unsigned int demote_folio_list(struct list_head *demote_folios,
>> 1655                                      struct pglist_data *pgdat)
>> 1656 {
>> 1657         int target_nid = next_demotion_node(pgdat->node_id);
>> 1658         unsigned int nr_succeeded;
>> 1659         nodemask_t allowed_mask;
>> 1660
>> 1661         struct migration_target_control mtc = {
>> 1662                 /*
>> 1663                  * Allocate from 'node', or fail quickly and quietly.
>> 1664                  * When this happens, 'page' will likely just be discarded
>> 1665                  * instead of migrated.
>> 1666                  */
>> 1667                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
>> 1668                         __GFP_NOMEMALLOC | GFP_NOWAIT,
>> 1669                 .nid = target_nid,
>> 1670                 .nmask = &allowed_mask
>> 1671         };
>> 1672
>> 1673         if (list_empty(demote_folios))
>> 1674                 return 0;
>> 1675
>> 1676         if (target_nid == NUMA_NO_NODE)
>> 1677                 return 0;
>> 1678
>> 1679         node_get_allowed_targets(pgdat, &allowed_mask);
>> 1680
>> 1681         /* Demotion ignores all cpuset and mempolicy settings */
>> 1682         migrate_pages(demote_folios, alloc_demote_folio, NULL,
>> 1683                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>> 1684                       &nr_succeeded);
>>
> 
> In alloc_demote_folio(), target_nid is tried first. Then, if the
> allocation fails, any node in allowed_mask will be tried.


Many thanks for your kind explanation. You are right.
Let me re-think whether it's still needed...


BTW, I will split PATCH2 out as a separate patch first.

Thanks
Zhijian


> 
> --
> Best Regards,
> Huang, Ying
> 
>>>
>>>> # cat /sys/devices/system/node/node1/demotion_nodes
>>>> 3
>>>> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
>>>> 2-3
>>>>
>>>> Thanks
>>>> Zhijian
>>>>
>>>> (I hate the outlook native reply composition format.)
>>>> ________________________________________
>>>> From: Huang, Ying <ying.huang@intel.com>
>>>> Sent: Thursday, November 2, 2023 11:17
>>>> To: Li, Zhijian/李 智坚
>>>> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@kernel.org; linux-mm@kvack.org; Gotou, Yasunori/五島 康文; linux-kernel@vger.kernel.org
>>>> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
>>>>
>>>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>>>
>>>>> It shows the demotion target nodes of a node. Export this information to
>>>>> user directly.
>>>>>
>>>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>>>    <show nothing>
>>>>> - After node3 is online as kmem
>>>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>>>> [
>>>>>     {
>>>>>       "chardev":"dax0.0",
>>>>>       "size":1054867456,
>>>>>       "target_node":3,
>>>>>       "align":2097152,
>>>>>       "mode":"system-ram",
>>>>>       "online_memblocks":0,
>>>>>       "total_memblocks":7
>>>>>     }
>>>>> ]
>>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>>> 3
>>>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>>>> 3
>>>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>>>    <show nothing>
>>>>
>>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>>> already.  A node in a higher tier can demote to any node in the lower
>>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>>
>>>> --
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>>>> ---
>>>>>    drivers/base/node.c          | 13 +++++++++++++
>>>>>    include/linux/memory-tiers.h |  6 ++++++
>>>>>    mm/memory-tiers.c            |  8 ++++++++
>>>>>    3 files changed, 27 insertions(+)
>>>>>
>>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>>> index 493d533f8375..27e8502548a7 100644
>>>>> --- a/drivers/base/node.c
>>>>> +++ b/drivers/base/node.c
>>>>> @@ -7,6 +7,7 @@
>>>>>    #include <linux/init.h>
>>>>>    #include <linux/mm.h>
>>>>>    #include <linux/memory.h>
>>>>> +#include <linux/memory-tiers.h>
>>>>>    #include <linux/vmstat.h>
>>>>>    #include <linux/notifier.h>
>>>>>    #include <linux/node.h>
>>>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>>>    }
>>>>>    static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>>>
>>>>> +static ssize_t demotion_nodes_show(struct device *dev,
>>>>> +                          struct device_attribute *attr, char *buf)
>>>>> +{
>>>>> +     int ret;
>>>>> +     nodemask_t nmask = next_demotion_nodes(dev->id);
>>>>> +
>>>>> +     ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>>>> +     return ret;
>>>>> +}
>>>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>>>> +
>>>>>    static struct attribute *node_dev_attrs[] = {
>>>>>         &dev_attr_meminfo.attr,
>>>>>         &dev_attr_numastat.attr,
>>>>>         &dev_attr_distance.attr,
>>>>>         &dev_attr_vmstat.attr,
>>>>> +     &dev_attr_demotion_nodes.attr,
>>>>>         NULL
>>>>>    };
>>>>>
>>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>>> index 437441cdf78f..8eb04923f965 100644
>>>>> --- a/include/linux/memory-tiers.h
>>>>> +++ b/include/linux/memory-tiers.h
>>>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>>>    void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>>>    #ifdef CONFIG_MIGRATION
>>>>>    int next_demotion_node(int node);
>>>>> +nodemask_t next_demotion_nodes(int node);
>>>>>    void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>>>    bool node_is_toptier(int node);
>>>>>    #else
>>>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>>>         return NUMA_NO_NODE;
>>>>>    }
>>>>>
>>>>> +static inline nodemask_t next_demotion_nodes(int node)
>>>>> +{
>>>>> +     return NODE_MASK_NONE;
>>>>> +}
>>>>> +
>>>>>    static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>>    {
>>>>>         *targets = NODE_MASK_NONE;
>>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>>> index 37a4f59d9585..90047f37d98a 100644
>>>>> --- a/mm/memory-tiers.c
>>>>> +++ b/mm/memory-tiers.c
>>>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>>         rcu_read_unlock();
>>>>>    }
>>>>>
>>>>> +nodemask_t next_demotion_nodes(int node)
>>>>> +{
>>>>> +     if (!node_demotion)
>>>>> +             return NODE_MASK_NONE;
>>>>> +
>>>>> +     return node_demotion[node].preferred;
>>>>> +}
>>>>> +
>>>>>    /**
>>>>>     * next_demotion_node() - Get the next node in the demotion path
>>>>>     * @node: The starting node to lookup the next node

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-02  9:45           ` Yasunori Gotou (Fujitsu)
@ 2023-11-03  6:14             ` Huang, Ying
  2023-11-06  5:02               ` Yasunori Gotou (Fujitsu)
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2023-11-03  6:14 UTC (permalink / raw)
  To: Yasunori Gotou (Fujitsu)
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	linux-kernel, Zhijian Li (Fujitsu)

"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com> writes:

>> > Hello,
>> >
>> >> On 02/11/2023 13:45, Huang, Ying wrote:
>> >> > Li Zhijian <lizhijian@fujitsu.com> writes:
>> >> >
>> >> >> pgdemote_src_*: pages demoted from this node.
>> >> >> pgdemote_dst_*: pages demoted to this node.
>> >> >>
>> >> >> So that we are able to know their per-node demotion stats by
>> >> >> checking this.
>> >> >>
>> >> >> In the environment, node0 and node1 are DRAM, node3 is PMEM.
>> >> >>
>> >> >> Global stats:
>> >> >> $ grep -E 'demote' /proc/vmstat
>> >> >> pgdemote_src_kswapd 130155
>> >> >> pgdemote_src_direct 113497
>> >> >> pgdemote_src_khugepaged 0
>> >> >> pgdemote_dst_kswapd 130155
>> >> >> pgdemote_dst_direct 113497
>> >> >> pgdemote_dst_khugepaged 0
>> >> >>
>> >> >> Per-node stats:
>> >> >> $ grep demote /sys/devices/system/node/node0/vmstat
>> >> >> pgdemote_src_kswapd 68454
>> >> >> pgdemote_src_direct 83431
>> >> >> pgdemote_src_khugepaged 0
>> >> >> pgdemote_dst_kswapd 0
>> >> >> pgdemote_dst_direct 0
>> >> >> pgdemote_dst_khugepaged 0
>> >> >>
>> >> >> $ grep demote /sys/devices/system/node/node1/vmstat
>> >> >> pgdemote_src_kswapd 185834
>> >> >> pgdemote_src_direct 30066
>> >> >> pgdemote_src_khugepaged 0
>> >> >> pgdemote_dst_kswapd 0
>> >> >> pgdemote_dst_direct 0
>> >> >> pgdemote_dst_khugepaged 0
>> >> >>
>> >> >> $ grep demote /sys/devices/system/node/node3/vmstat
>> >> >> pgdemote_src_kswapd 0
>> >> >> pgdemote_src_direct 0
>> >> >> pgdemote_src_khugepaged 0
>> >> >> pgdemote_dst_kswapd 254288
>> >> >> pgdemote_dst_direct 113497
>> >> >> pgdemote_dst_khugepaged 0
>> >> >>
>> >> >>  From the above stats, we know node3 is the demotion destination
>> >> >> to which node0 and node1 will demote.
>> >> >
>> >> > Why do we need these information?  Do you have some use case?
>> >>
>> >> I recall our customers have mentioned that they want to know how much
>> >> memory is demoted to the CXL memory device in a specific period.
>> >
>> > I'll mention more about it.
>> >
>> > I had a conversation with one of our customers. He expressed a desire
>> > for more detailed profile information to analyze the behavior of
>> > demotion (and promotion) when his workloads are executed.
>> > If the results are not satisfactory for his workloads, he wants to
>> > tune his servers for his workloads with these profiles.
>> > Additionally, depending on the results, he may want to change his
>> > server configuration. For example, he may want to buy more expensive
>> > DDR memory rather than cheaper CXL memory.
>> >
>> > In my impression, our customers seem to think that CXL memory is NOT
>> > as reliable as DDR memory yet. Therefore, they want to prepare for
>> > the new world that CXL will bring, and want to have a method for that
>> > preparation by profiling as much information as possible.
>> >
>> > Is this enough for your question?
>> 
>> I want some more detailed information about how these stats are used.
>> Why isn't a per-node pgdemote_xxx counter enough?
>
> I rechecked the customer's original request.
>
> - If a memory area is demoted to a CXL memory node, he wanted to analyze
>   how it affects the performance of his workload, such as latency. He
>   wanted to use CXL node memory usage as basic information for the analysis.
>
> - If he notices that demotion occurs frequently on a server and CXL memory
>   is constantly 85% used, he may want to add DDR DRAM or choose some other
>   way to avoid demotion. (His mental model is likely swap free/used.)
>   IIRC, demotion targets are not spread across all of the CXL memory nodes,
>   right? Then, he needs to know how much CXL memory is occupied by demoted
>   memory.
>
> If I have misunderstood something, or you have a better idea,
> please let us know. I'll talk with him again. (It will be next week...)


To check CXL memory usage, /proc/PID/numa_maps,
/sys/fs/cgroup/CGROUP/memory.numa_stat, and
/sys/devices/system/node/nodeN/meminfo can be used for process, cgroup,
and NUMA node respectively.  Is this enough?

--
Best Regards,
Huang, Ying

>> >
>> >>
>> >>
>> >> >>>   	mod_node_page_state(NODE_DATA(target_nid),
>> >> >>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(),
>> >> nr_succeeded);
>> >> >>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(),
>> >> nr_succeeded);
>> >>
>> >> But if *target_nid* only indicates the preferred node, this
>> >> accounting may not be accurate.
>> >>

[snip]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_*
  2023-11-03  6:14             ` Huang, Ying
@ 2023-11-06  5:02               ` Yasunori Gotou (Fujitsu)
  0 siblings, 0 replies; 34+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-11-06  5:02 UTC (permalink / raw)
  To: 'Huang, Ying'
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	linux-kernel, Zhijian Li (Fujitsu)

> >> > Hello,
> >> >
> >> >> On 02/11/2023 13:45, Huang, Ying wrote:
> >> >> > Li Zhijian <lizhijian@fujitsu.com> writes:
> >> >> >
> >> >> >> pgdemote_src_*: pages demoted from this node.
> >> >> >> pgdemote_dst_*: pages demoted to this node.
> >> >> >>
> >> >> >> So that we are able to know their demotion per-node stats by
> >> >> >> checking this.
> >> >> >>
> >> >> >> In the environment, node0 and node1 are DRAM, node3 is PMEM.
> >> >> >>
> >> >> >> Global stats:
> >> >> >> $ grep -E 'demote' /proc/vmstat
> >> >> >> pgdemote_src_kswapd 130155
> >> >> >> pgdemote_src_direct 113497
> >> >> >> pgdemote_src_khugepaged 0
> >> >> >> pgdemote_dst_kswapd 130155
> >> >> >> pgdemote_dst_direct 113497
> >> >> >> pgdemote_dst_khugepaged 0
> >> >> >>
> >> >> >> Per-node stats:
> >> >> >> $ grep demote /sys/devices/system/node/node0/vmstat
> >> >> >> pgdemote_src_kswapd 68454
> >> >> >> pgdemote_src_direct 83431
> >> >> >> pgdemote_src_khugepaged 0
> >> >> >> pgdemote_dst_kswapd 0
> >> >> >> pgdemote_dst_direct 0
> >> >> >> pgdemote_dst_khugepaged 0
> >> >> >>
> >> >> >> $ grep demote /sys/devices/system/node/node1/vmstat
> >> >> >> pgdemote_src_kswapd 185834
> >> >> >> pgdemote_src_direct 30066
> >> >> >> pgdemote_src_khugepaged 0
> >> >> >> pgdemote_dst_kswapd 0
> >> >> >> pgdemote_dst_direct 0
> >> >> >> pgdemote_dst_khugepaged 0
> >> >> >>
> >> >> >> $ grep demote /sys/devices/system/node/node3/vmstat
> >> >> >> pgdemote_src_kswapd 0
> >> >> >> pgdemote_src_direct 0
> >> >> >> pgdemote_src_khugepaged 0
> >> >> >> pgdemote_dst_kswapd 254288
> >> >> >> pgdemote_dst_direct 113497
> >> >> >> pgdemote_dst_khugepaged 0
> >> >> >>
> >> >> >>  From the above stats, we know node3 is the demotion
> >> >> >> destination to which node0 and node1 will demote.
> >> >> >
> >> >> > Why do we need these information?  Do you have some use case?
> >> >>
> >> >> I recall our customers have mentioned that they want to know how
> >> >> much memory is demoted to the CXL memory device in a specific
> >> >> period.
> >> >
> >> > I'll mention about it more.
> >> >
> >> > I had a conversation with one of our customers. He expressed a
> >> > desire for more detailed profile information to analyze the
> >> > behavior of demotion (and promotion) when his workloads are executed.
> >> > If the results are not satisfactory for his workloads, he wants to
> >> > tune his servers for his workloads with these profiles.
> >> > Additionally, depending on the results, he may want to change his
> >> > server configuration.
> >> > For example, he may want to buy more expensive DDR memories rather
> >> > than cheaper CXL memory.
> >> >
> >> > In my impression, our customers seem to think that CXL memory is
> >> > NOT as reliable as DDR memory yet.
> >> > Therefore, they want to prepare for the new world that CXL will
> >> > bring, and want to have a method for the preparation by profiling
> >> > information as much as possible.
> >> >
> >> > Is this enough for your question?
> >>
> >> I want some more detailed information about how these stats are used.
> >> Why isn't a per-node pgdemote_xxx counter enough?
> >
> > I rechecked the customer's original request.
> >
> > - If a memory area is demoted to a CXL memory node, he wanted to
> >   analyze how it affects performance of their workload, such as
> >   latency. He wanted to use CXL Node memory usage as basic information
> >   for the analysis.
> >
> > - If he notices that demotion occurs well on a server and CXL memories
> >   are used 85% constantly, he may want to add DDR DRAM or select some
> >   other ways to avoid demotion.
> >   (His image is likely Swap free/used.)
> >   IIRC, demotion target is not spread to all of the CXL memory node, right?
> >   Then, he needs to know how CXL memory is occupied by demoted memory.
> >
> > If I misunderstand something, or you have any better idea, please let
> > us know. I'll talk with him again. (It will be next week...)
> 
> 
> To check CXL memory usage, /proc/PID/numa_maps,
> /sys/fs/cgroup/CGROUP/memory.numa_stat, and
> /sys/devices/system/node/nodeN/meminfo can be used for process, cgroup,
> and NUMA node respectively.  Is this enough?

Thank you for your idea.
We will investigate it and talk with our customer.
Please wait.

Thanks,
---
Yasunori Goto


> 
> --
> Best Regards,
> Huang, Ying
> 
> >> >
> >> >>
> >> >>
> >> >> >>>   	mod_node_page_state(NODE_DATA(target_nid),
> >> >> >>> -		    PGDEMOTE_KSWAPD + reclaimer_offset(),
> >> >> nr_succeeded);
> >> >> >>> +		    PGDEMOTE_DST_KSWAPD + reclaimer_offset(),
> >> >> nr_succeeded);
> >> >>
> >> >> >> But if *target_nid* only indicates the preferred node, this
> >> >> >> accounting may not be accurate.
> >> >>
> 
> [snip]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2023-11-02  3:17   ` Huang, Ying
  2023-11-02  3:39     ` Zhijian Li (Fujitsu)
@ 2024-01-30  8:53     ` Li Zhijian
  2024-01-31  1:13       ` Huang, Ying
  1 sibling, 1 reply; 34+ messages in thread
From: Li Zhijian @ 2024-01-30  8:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm, y-goto,
	linux-kernel

Hi Ying


I need to pick up this thread/patch again.

> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> already.  A node in a higher tier can demote to any node in the lower
> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
> 

Yes, it's believed that /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
is intended to show the nodes in memory_tierN. But IMHO, it's not enough,
especially for the preferred demotion node(s).

Currently, when a demotion occurs, it will prioritize selecting a node
from the preferred nodes as the destination node for the demotion. If
the preferred nodes do not meet the requirements, it will try all of
the lower memory tier nodes until it finds a suitable demotion destination
node or ultimately fails.
                                                                                 
However, currently it only lists the nodes of each tier. If
administrators want to know all the possible demotion destinations for a
given node, they need to calculate it themselves:
Step 1: find the memory tier where the given node is located.
Step 2: list all nodes under all of its lower tiers.
                                                                                    
It will be even more difficult to know the preferred nodes, which depend on
more factors, such as distance. In the following example, we have 6 nodes
split into three memory tiers.

Emulated HMAT NUMA topology:
> $ numactl -H                                                                  
> available: 6 nodes (0-5)                                                      
> node 0 cpus: 0                                                                
> node 0 size: 1974 MB                                                          
> node 0 free: 1767 MB                                                             
> node 1 cpus: 1                                                                
> node 1 size: 1694 MB                                                          
> node 1 free: 1454 MB                                                          
> node 2 cpus:                                                                  
> node 2 size: 896 MB                                                           
> node 2 free: 896 MB                                                           
> node 3 cpus:                                                                  
> node 3 size: 896 MB                                                           
> node 3 free: 896 MB                                                           
> node 4 cpus:                                                                  
> node 4 size: 896 MB                                                           
> node 4 free: 896 MB                                                           
> node 5 cpus:                                                                  
> node 5 size: 896 MB                                                           
> node 5 free: 896 MB                                                           
> node distances:                                                               
> node   0   1   2   3   4   5                                                  
>   0:  10  31  21  41  21  41                                                  
>   1:  31  10  41  21  41  21                                                  
>   2:  21  41  10  51  21  51                                                  
>   3:  31  21  51  10  51  21                                                  
>   4:  21  41  21  51  10  51                                                  
>   5:  31  21  51  21  51  10                                                  
>                                                                               
> $ cat memory_tier4/nodelist                                                   
> 0-1                                                                           
> $ cat memory_tier12/nodelist                                                  
> 2,5
> $ cat memory_tier54/nodelist                                                  
> 3-4                                                                           
                                                                                 
For above topology, memory-tier will build the demotion path for each node
like this:
node[0].preferred = 2
node[0].demotion_targets = 2-5
node[1].preferred = 5
node[1].demotion_targets = 2-5
node[2].preferred = 4
node[2].demotion_targets = 3-4
node[3].preferred = <empty>
node[3].demotion_targets = <empty>
node[4].preferred = <empty>
node[4].demotion_targets = <empty>
node[5].preferred = 3
node[5].demotion_targets = 3-4
                                                                          
But this demotion path is not explicitly known to the administrator. And with
the feedback from our customers, they also think it is helpful to know the
demotion path built by the kernel in order to understand demotion behavior.

So I think we should have 2 new interfaces for each node:

/sys/devices/system/node/nodeN/demotion_allowed_nodes
/sys/devices/system/node/nodeN/demotion_preferred_nodes

I value your opinion, and I'd like to know what you think.


Thanks
Zhijian


On 02/11/2023 11:17, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
> 
>> It shows the demotion target nodes of a node. Export this information to
>> user directly.
>>
>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>   <show nothing>
>> - After node3 is online as kmem
>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>> [
>>    {
>>      "chardev":"dax0.0",
>>      "size":1054867456,
>>      "target_node":3,
>>      "align":2097152,
>>      "mode":"system-ram",
>>      "online_memblocks":0,
>>      "total_memblocks":7
>>    }
>> ]
>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> 3
>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> 3
>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>   <show nothing>
> 
> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> already.  A node in a higher tier can demote to any node in the lower
> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
> 
> --
> Best Regards,
> Huang, Ying
> 
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>   drivers/base/node.c          | 13 +++++++++++++
>>   include/linux/memory-tiers.h |  6 ++++++
>>   mm/memory-tiers.c            |  8 ++++++++
>>   3 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 493d533f8375..27e8502548a7 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -7,6 +7,7 @@
>>   #include <linux/init.h>
>>   #include <linux/mm.h>
>>   #include <linux/memory.h>
>> +#include <linux/memory-tiers.h>
>>   #include <linux/vmstat.h>
>>   #include <linux/notifier.h>
>>   #include <linux/node.h>
>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>   }
>>   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>   
>> +static ssize_t demotion_nodes_show(struct device *dev,
>> +			     struct device_attribute *attr, char *buf)
>> +{
>> +	int ret;
>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>> +
>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> +	return ret;
>> +}
>> +static DEVICE_ATTR_RO(demotion_nodes);
>> +
>>   static struct attribute *node_dev_attrs[] = {
>>   	&dev_attr_meminfo.attr,
>>   	&dev_attr_numastat.attr,
>>   	&dev_attr_distance.attr,
>>   	&dev_attr_vmstat.attr,
>> +	&dev_attr_demotion_nodes.attr,
>>   	NULL
>>   };
>>   
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 437441cdf78f..8eb04923f965 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>   void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>   #ifdef CONFIG_MIGRATION
>>   int next_demotion_node(int node);
>> +nodemask_t next_demotion_nodes(int node);
>>   void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>   bool node_is_toptier(int node);
>>   #else
>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>   	return NUMA_NO_NODE;
>>   }
>>   
>>> +static inline nodemask_t next_demotion_nodes(int node)
>> +{
>> +	return NODE_MASK_NONE;
>> +}
>> +
>>   static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>   {
>>   	*targets = NODE_MASK_NONE;
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 37a4f59d9585..90047f37d98a 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>   	rcu_read_unlock();
>>   }
>>   
>> +nodemask_t next_demotion_nodes(int node)
>> +{
>> +	if (!node_demotion)
>> +		return NODE_MASK_NONE;
>> +
>> +	return node_demotion[node].preferred;
>> +}
>> +
>>   /**
>>    * next_demotion_node() - Get the next node in the demotion path
>>    * @node: The starting node to lookup the next node

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2024-01-30  8:53     ` Li Zhijian
@ 2024-01-31  1:13       ` Huang, Ying
  2024-01-31  3:18         ` Zhijian Li (Fujitsu)
  2024-01-31  6:23         ` Yasunori Gotou (Fujitsu)
  0 siblings, 2 replies; 34+ messages in thread
From: Huang, Ying @ 2024-01-31  1:13 UTC (permalink / raw)
  To: Li Zhijian
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm, y-goto,
	linux-kernel

Li Zhijian <lizhijian@fujitsu.com> writes:

> Hi Ying
>
>
> I need to pick up this thread/patch again.
>
>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> already.  A node in a higher tier can demote to any node in the lower
>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>> 
>
> Yes, it's believed that /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
> are intended to show nodes in memory_tierN. But IMHO, it's not enough, especially
> for the preferred demotion node(s).
>
> Currently, when a demotion occurs, it will prioritize selecting a node
> from the preferred nodes as the destination node for the demotion. If
> the preferred nodes does not meet the requirements, it will try from all
> the lower memory tier nodes until it finds a suitable demotion destination
> node or ultimately fails.
>                                                                                 However,
> currently it only lists the nodes of each tier. If the
> administrators want to know all the possible demotion destinations for a
> given node, they need to calculate it themselves:
> Step 1, find the memory tier where the given node is located
> Step 2, list all nodes under all its lower tiers
>
> It will be even more difficult to know the preferred nodes which depend
> on
> more factors, distance etc. For the following example, we may have 6 nodes
> splitting into three memory tiers.
>
> For emulated hmat numa topology example:
>> $ numactl -H
>> available: 6 nodes (0-5)
>> node 0 cpus: 0
>> node 0 size: 1974 MB
>> node 0 free: 1767 MB
>> node 1 cpus: 1
>> node 1 size: 1694 MB
>> node 1 free: 1454 MB
>> node 2 cpus:
>> node 2 size: 896 MB
>> node 2 free: 896 MB
>> node 3 cpus:
>> node 3 size: 896 MB
>> node 3 free: 896 MB
>> node 4 cpus:
>> node 4 size: 896 MB
>> node 4 free: 896 MB
>> node 5 cpus:
>> node 5 size: 896 MB
>> node 5 free: 896 MB
>> node distances:
>> node   0   1   2   3   4   5
>> 0:  10  31  21  41  21  41
>> 1:  31  10  41  21  41  21
>> 2:  21  41  10  51  21  51
>> 3:  31  21  51  10  51  21
>> 4:  21  41  21  51  10  51
>> 5:  31  21  51  21  51  10
>> $ cat memory_tier4/nodelist
>> 0-1
>> $ cat memory_tier12/nodelist
>> 2,5
>> $ cat memory_tier54/nodelist
>> 3-4
>
> For above topology, memory-tier will build the demotion path for each node
> like this:
> node[0].preferred = 2
> node[0].demotion_targets = 2-5
> node[1].preferred = 5
> node[1].demotion_targets = 2-5
> node[2].preferred = 4
> node[2].demotion_targets = 3-4
> node[3].preferred = <empty>
> node[3].demotion_targets = <empty>
> node[4].preferred = <empty>
> node[4].demotion_targets = <empty>
> node[5].preferred = 3
> node[5].demotion_targets = 3-4
>
> But this demotion path is not explicitly known to administrator. And
> with the
> feedback from our customers, they also think it is helpful to know demotion
> path built by kernel to understand the demotion behaviors.
>
> So i think we should have 2 new interfaces for each node:
>
> /sys/devices/system/node/nodeN/demotion_allowed_nodes
> /sys/devices/system/node/nodeN/demotion_preferred_nodes
>
> I value your opinion, and I'd like to know what you think about...

Per my understanding, we will not expose everything inside the kernel to
user space.  For page placement in a tiered memory system, demotion is
just a part of the story.  For example, if the DRAM of a system becomes
full, new page allocation will fall back to the CXL memory.  Have we
exposed the default page allocation fallback order to user space?

All in all, in my opinion, we should expose as little as possible to user
space, because we need to maintain the ABI forever.

--
Best Regards,
Huang, Ying

>
> On 02/11/2023 11:17, Huang, Ying wrote:
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>> 
>>> It shows the demotion target nodes of a node. Export this information to
>>> user directly.
>>>
>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>   <show nothing>
>>> - After node3 is online as kmem
>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>> [
>>>    {
>>>      "chardev":"dax0.0",
>>>      "size":1054867456,
>>>      "target_node":3,
>>>      "align":2097152,
>>>      "mode":"system-ram",
>>>      "online_memblocks":0,
>>>      "total_memblocks":7
>>>    }
>>> ]
>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>> 3
>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>> 3
>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>   <show nothing>
>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> already.  A node in a higher tier can demote to any node in the lower
>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>> --
>> Best Regards,
>> Huang, Ying
>> 
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>> ---
>>>   drivers/base/node.c          | 13 +++++++++++++
>>>   include/linux/memory-tiers.h |  6 ++++++
>>>   mm/memory-tiers.c            |  8 ++++++++
>>>   3 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index 493d533f8375..27e8502548a7 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -7,6 +7,7 @@
>>>   #include <linux/init.h>
>>>   #include <linux/mm.h>
>>>   #include <linux/memory.h>
>>> +#include <linux/memory-tiers.h>
>>>   #include <linux/vmstat.h>
>>>   #include <linux/notifier.h>
>>>   #include <linux/node.h>
>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>   }
>>>   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>   +static ssize_t demotion_nodes_show(struct device *dev,
>>> +			     struct device_attribute *attr, char *buf)
>>> +{
>>> +	int ret;
>>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>>> +
>>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>> +	return ret;
>>> +}
>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>> +
>>>   static struct attribute *node_dev_attrs[] = {
>>>   	&dev_attr_meminfo.attr,
>>>   	&dev_attr_numastat.attr,
>>>   	&dev_attr_distance.attr,
>>>   	&dev_attr_vmstat.attr,
>>> +	&dev_attr_demotion_nodes.attr,
>>>   	NULL
>>>   };
>>>   diff --git a/include/linux/memory-tiers.h
>>> b/include/linux/memory-tiers.h
>>> index 437441cdf78f..8eb04923f965 100644
>>> --- a/include/linux/memory-tiers.h
>>> +++ b/include/linux/memory-tiers.h
>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>   void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>   #ifdef CONFIG_MIGRATION
>>>   int next_demotion_node(int node);
>>> +nodemask_t next_demotion_nodes(int node);
>>>   void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>   bool node_is_toptier(int node);
>>>   #else
>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>   	return NUMA_NO_NODE;
>>>   }
>>>   +static inline nodemask_t next_demotion_nodes(int node)
>>> +{
>>> +	return NODE_MASK_NONE;
>>> +}
>>> +
>>>   static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>   {
>>>   	*targets = NODE_MASK_NONE;
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> index 37a4f59d9585..90047f37d98a 100644
>>> --- a/mm/memory-tiers.c
>>> +++ b/mm/memory-tiers.c
>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>   	rcu_read_unlock();
>>>   }
>>>   +nodemask_t next_demotion_nodes(int node)
>>> +{
>>> +	if (!node_demotion)
>>> +		return NODE_MASK_NONE;
>>> +
>>> +	return node_demotion[node].preferred;
>>> +}
>>> +
>>>   /**
>>>    * next_demotion_node() - Get the next node in the demotion path
>>>    * @node: The starting node to lookup the next node

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2024-01-31  1:13       ` Huang, Ying
@ 2024-01-31  3:18         ` Zhijian Li (Fujitsu)
  2024-02-02  7:43           ` Zhijian Li (Fujitsu)
  2024-01-31  6:23         ` Yasunori Gotou (Fujitsu)
  1 sibling, 1 reply; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-01-31  3:18 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel, Jagdish Gediya


+CC Jagdish,

who may also still be interested in this interface.
You once tried to add such an interface[1], but memory-tier was introduced afterwards.

[1]: [PATCH v3 6/7] mm: demotion: expose per-node demotion targets via sysfs
https://lore.kernel.org/all/20220422195516.10769-7-jvgediya@linux.ibm.com/


On 31/01/2024 09:13, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
> 
>> Hi Ying
>>
>>
>> I need to pick up this thread/patch again.
>>
>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>> already.  A node in a higher tier can demote to any node in the lower
>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>
>>
>> Yes, it's believed that /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
>> are intended to show nodes in memory_tierN. But IMHO, it's not enough, especially
>> for the preferred demotion node(s).
>>
>> Currently, when a demotion occurs, it will prioritize selecting a node
>> from the preferred nodes as the destination node for the demotion. If
>> the preferred nodes does not meet the requirements, it will try from all
>> the lower memory tier nodes until it finds a suitable demotion destination
>> node or ultimately fails.
>>
>> However, currently it only lists the nodes of each tier. If the
>> administrators want to know all the possible demotion destinations for a
>> given node, they need to calculate it themselves:
>> Step 1, find the memory tier where the given node is located
>> Step 2, list all nodes under all its lower tiers
>>
>> It will be even more difficult to know the preferred nodes which depend on
>> more factors, distance etc. For the following example, we may have 6 nodes
>> splitting into three memory tiers.
>>
>> For emulated hmat numa topology example:
>>> $ numactl -H
>>> available: 6 nodes (0-5)
>>> node 0 cpus: 0
>>> node 0 size: 1974 MB
>>> node 0 free: 1767 MB
>>> node 1 cpus: 1
>>> node 1 size: 1694 MB
>>> node 1 free: 1454 MB
>>> node 2 cpus:
>>> node 2 size: 896 MB
>>> node 2 free: 896 MB
>>> node 3 cpus:
>>> node 3 size: 896 MB
>>> node 3 free: 896 MB
>>> node 4 cpus:
>>> node 4 size: 896 MB
>>> node 4 free: 896 MB
>>> node 5 cpus:
>>> node 5 size: 896 MB
>>> node 5 free: 896 MB
>>> node distances:
>>> node   0   1   2   3   4   5
>>> 0:  10  31  21  41  21  41
>>> 1:  31  10  41  21  41  21
>>> 2:  21  41  10  51  21  51
>>> 3:  31  21  51  10  51  21
>>> 4:  21  41  21  51  10  51
>>> 5:  31  21  51  21  51  10
>>> $ cat memory_tier4/nodelist
>>> 0-1
>>> $ cat memory_tier12/nodelist
>>> 2,5
>>> $ cat memory_tier54/nodelist
>>> 3-4
>>
>> For above topology, memory-tier will build the demotion path for each node
>> like this:
>> node[0].preferred = 2
>> node[0].demotion_targets = 2-5
>> node[1].preferred = 5
>> node[1].demotion_targets = 2-5
>> node[2].preferred = 4
>> node[2].demotion_targets = 3-4
>> node[3].preferred = <empty>
>> node[3].demotion_targets = <empty>
>> node[4].preferred = <empty>
>> node[4].demotion_targets = <empty>
>> node[5].preferred = 3
>> node[5].demotion_targets = 3-4
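
The per-node paths quoted above can be cross-checked with a short script. The model below is a simplification of what mm/memory-tiers.c builds, under two assumptions stated here rather than taken from the patch: allowed targets are all nodes in all lower tiers, and the preferred target is the closest next-lower-tier node by NUMA distance. The tier and distance data are copied from the quoted example.

```python
# Simplified, illustrative model of the demotion paths for the quoted
# 6-node topology.  Assumptions (not the kernel's actual code): the
# preferred target is the closest node (by SLIT distance) in the next
# lower tier; allowed targets are all nodes in all lower tiers.

# Tiers ordered from highest (fastest) to lowest, from the quoted
# memory_tier4 / memory_tier12 / memory_tier54 nodelists.
tiers = [{0, 1}, {2, 5}, {3, 4}]

# NUMA distance matrix from the quoted `numactl -H` output.
distance = [
    [10, 31, 21, 41, 21, 41],
    [31, 10, 41, 21, 41, 21],
    [21, 41, 10, 51, 21, 51],
    [31, 21, 51, 10, 51, 21],
    [21, 41, 21, 51, 10, 51],
    [31, 21, 51, 21, 51, 10],
]

def demotion_path(node):
    """Return (preferred, allowed) demotion targets for a node."""
    tier = next(i for i, nodes in enumerate(tiers) if node in nodes)
    allowed = set().union(*tiers[tier + 1:])
    if not allowed:  # lowest tier: nowhere to demote to
        return None, set()
    preferred = min(tiers[tier + 1], key=lambda t: distance[node][t])
    return preferred, allowed

for n in range(6):
    pref, targets = demotion_path(n)
    print(f"node[{n}].preferred = {pref}, targets = {sorted(targets)}")
```

This reproduces the node[0..5] listing above; the real kernel keeps `preferred` as a nodemask that may contain several nodes, so treat this only as a reading aid.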
>>
>> But this demotion path is not explicitly known to administrator. And with
>> the feedback from our customers, they also think it is helpful to know demotion
>> path built by kernel to understand the demotion behaviors.
>>
>> So i think we should have 2 new interfaces for each node:
>>
>> /sys/devices/system/node/nodeN/demotion_allowed_nodes
>> /sys/devices/system/node/nodeN/demotion_preferred_nodes
>>
>> I value your opinion, and I'd like to know what you think about...
> 
> Per my understanding, we will not expose everything inside kernel to
> user space.  For page placement in a tiered memory system, demotion is
> just a part of the story.  For example, if the DRAM of a system becomes
> full, new page allocation will fall back to the CXL memory.  Have we
> exposed the default page allocation fallback order to user space?

Good question, I have no answer yet, but I think we can get the fallback order
from dmesg now.

As a further step, we will also try to improve user-space tools, such as numactl,
to show the demotion path with the help of this exposed information.


Thanks
Zhijian

> 
> All in all, in my opinion, we only expose as little as possible to user
> space because we need to maintain the ABI for ever.

> 
> --
> Best Regards,
> Huang, Ying
> 
>>
>> On 02/11/2023 11:17, Huang, Ying wrote:
>>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>>
>>>> It shows the demotion target nodes of a node. Export this information to
>>>> user directly.
>>>>
>>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>>    <show nothing>
>>>> - After node3 is online as kmem
>>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>>> [
>>>>     {
>>>>       "chardev":"dax0.0",
>>>>       "size":1054867456,
>>>>       "target_node":3,
>>>>       "align":2097152,
>>>>       "mode":"system-ram",
>>>>       "online_memblocks":0,
>>>>       "total_memblocks":7
>>>>     }
>>>> ]
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>>    <show nothing>
>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>> already.  A node in a higher tier can demote to any node in the lower
>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>> --
>>> Best Regards,
>>> Huang, Ying
>>>
>>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>>> ---
>>>>    drivers/base/node.c          | 13 +++++++++++++
>>>>    include/linux/memory-tiers.h |  6 ++++++
>>>>    mm/memory-tiers.c            |  8 ++++++++
>>>>    3 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index 493d533f8375..27e8502548a7 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -7,6 +7,7 @@
>>>>    #include <linux/init.h>
>>>>    #include <linux/mm.h>
>>>>    #include <linux/memory.h>
>>>> +#include <linux/memory-tiers.h>
>>>>    #include <linux/vmstat.h>
>>>>    #include <linux/notifier.h>
>>>>    #include <linux/node.h>
>>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>>    }
>>>>    static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>>    +static ssize_t demotion_nodes_show(struct device *dev,
>>>> +			     struct device_attribute *attr, char *buf)
>>>> +{
>>>> +	int ret;
>>>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>>>> +
>>>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>>> +	return ret;
>>>> +}
>>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>>> +
>>>>    static struct attribute *node_dev_attrs[] = {
>>>>    	&dev_attr_meminfo.attr,
>>>>    	&dev_attr_numastat.attr,
>>>>    	&dev_attr_distance.attr,
>>>>    	&dev_attr_vmstat.attr,
>>>> +	&dev_attr_demotion_nodes.attr,
>>>>    	NULL
>>>>    };
>>>>    diff --git a/include/linux/memory-tiers.h
>>>> b/include/linux/memory-tiers.h
>>>> index 437441cdf78f..8eb04923f965 100644
>>>> --- a/include/linux/memory-tiers.h
>>>> +++ b/include/linux/memory-tiers.h
>>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>>    void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>>    #ifdef CONFIG_MIGRATION
>>>>    int next_demotion_node(int node);
>>>> +nodemask_t next_demotion_nodes(int node);
>>>>    void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>>    bool node_is_toptier(int node);
>>>>    #else
>>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>>    	return NUMA_NO_NODE;
>>>>    }
>>>>    +static inline nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +	return NODE_MASK_NONE;
>>>> +}
>>>> +
>>>>    static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>    {
>>>>    	*targets = NODE_MASK_NONE;
>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>> index 37a4f59d9585..90047f37d98a 100644
>>>> --- a/mm/memory-tiers.c
>>>> +++ b/mm/memory-tiers.c
>>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>    	rcu_read_unlock();
>>>>    }
>>>>    +nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +	if (!node_demotion)
>>>> +		return NODE_MASK_NONE;
>>>> +
>>>> +	return node_demotion[node].preferred;
>>>> +}
>>>> +
>>>>    /**
>>>>     * next_demotion_node() - Get the next node in the demotion path
>>>>     * @node: The starting node to lookup the next node


* RE: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2024-01-31  1:13       ` Huang, Ying
  2024-01-31  3:18         ` Zhijian Li (Fujitsu)
@ 2024-01-31  6:23         ` Yasunori Gotou (Fujitsu)
  2024-01-31  6:52           ` Huang, Ying
  1 sibling, 1 reply; 34+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2024-01-31  6:23 UTC (permalink / raw)
  To: 'Huang, Ying', Zhijian Li (Fujitsu)
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm, linux-kernel

Hello,

> Li Zhijian <lizhijian@fujitsu.com> writes:
> 
> > Hi Ying
> >
> > I need to pick up this thread/patch again.
> >
> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> >> already.  A node in a higher tier can demote to any node in the lower
> >> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
> >>
> >
> > Yes, it's believed that
> > /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
> > are intended to show nodes in memory_tierN. But IMHO, it's not enough,
> > especially for the preferred demotion node(s).
> >
> > Currently, when a demotion occurs, it will prioritize selecting a node
> > from the preferred nodes as the destination node for the demotion. If
> > the preferred nodes does not meet the requirements, it will try from
> > all the lower memory tier nodes until it finds a suitable demotion
> > destination node or ultimately fails.
> >
> > However, currently it only lists the nodes of each tier. If the
> > administrators want to know all the possible demotion destinations for
> > a given node, they need to calculate it themselves:
> > Step 1, find the memory tier where the given node is located Step 2,
> > list all nodes under all its lower tiers
> >
> > It will be even more difficult to know the preferred nodes which
> > depend on more factors, distance etc. For the following example, we
> > may have 6 nodes splitting into three memory tiers.
> >
> > For emulated hmat numa topology example:
> >> $ numactl -H
> >> available: 6 nodes (0-5)
> >> node 0 cpus: 0
> >> node 0 size: 1974 MB
> >> node 0 free: 1767 MB
> >> node 1 cpus: 1
> >> node 1 size: 1694 MB
> >> node 1 free: 1454 MB
> >> node 2 cpus:
> >> node 2 size: 896 MB
> >> node 2 free: 896 MB
> >> node 3 cpus:
> >> node 3 size: 896 MB
> >> node 3 free: 896 MB
> >> node 4 cpus:
> >> node 4 size: 896 MB
> >> node 4 free: 896 MB
> >> node 5 cpus:
> >> node 5 size: 896 MB
> >> node 5 free: 896 MB
> >> node distances:
> >> node   0   1   2   3   4   5
> >> 0:  10  31  21  41  21  41
> >> 1:  31  10  41  21  41  21
> >> 2:  21  41  10  51  21  51
> >> 3:  31  21  51  10  51  21
> >> 4:  21  41  21  51  10  51
> >> 5:  31  21  51  21  51  10
> >> $ cat memory_tier4/nodelist
> >> 0-1
> >> $ cat memory_tier12/nodelist
> >> 2,5
> >> $ cat memory_tier54/nodelist
> >> 3-4
> >
> > For above topology, memory-tier will build the demotion path for each
> > node like this:
> > node[0].preferred = 2
> > node[0].demotion_targets = 2-5
> > node[1].preferred = 5
> > node[1].demotion_targets = 2-5
> > node[2].preferred = 4
> > node[2].demotion_targets = 3-4
> > node[3].preferred = <empty>
> > node[3].demotion_targets = <empty>
> > node[4].preferred = <empty>
> > node[4].demotion_targets = <empty>
> > node[5].preferred = 3
> > node[5].demotion_targets = 3-4
> >
> > But this demotion path is not explicitly known to administrator. And
> > with the feedback from our customers, they also think it is helpful to
> > know demotion path built by kernel to understand the demotion
> > behaviors.
> >
> > So i think we should have 2 new interfaces for each node:
> >
> > /sys/devices/system/node/nodeN/demotion_allowed_nodes
> > /sys/devices/system/node/nodeN/demotion_preferred_nodes
> >
> > I value your opinion, and I'd like to know what you think about...
> 
> Per my understanding, we will not expose everything inside kernel to user
> space.  For page placement in a tiered memory system, demotion is just a part
> of the story.  For example, if the DRAM of a system becomes full, new page
> allocation will fall back to the CXL memory.  Have we exposed the default page
> allocation fallback order to user space?

In extreme terms, users want to analyze all the memory behaviors of memory management
while executing their workload, and want to trace ALL of them if possible.
Of course, that is impossible due to the heavy load, so users want other ways as
a compromise. Our request, the demotion target information, is just one of them.

In my impression, users worry about the impact of a CXL memory device on their workload,
and want to have a way to understand that impact.
If there is no information to relieve their anxiety, they may avoid buying CXL memory.

In addition, our support team also needs clues to solve users' performance problems.
Even if new page allocation falls back to the CXL memory, we need to explain why it
happened, for accountability.

> 
> All in all, in my opinion, we only expose as little as possible to user space
> because we need to maintain the ABI for ever.

I can understand there is a compatibility problem with our proposal, and the kernel may
change its logic in the future. This is a tug-of-war between kernel developers
and users or support engineers. I suppose it often occurs in many places...

Hmm... I hope there is a new idea to solve this situation even if our proposal is rejected.
Anyone?

Thanks,
----
Yasunori Goto

> 
> --
> Best Regards,
> Huang, Ying
> 
> >
> > On 02/11/2023 11:17, Huang, Ying wrote:
> >> Li Zhijian <lizhijian@fujitsu.com> writes:
> >>
> >>> It shows the demotion target nodes of a node. Export this
> >>> information to user directly.
> >>>
> >>> Below is an example where node0 node1 are DRAM, node3 is a PMEM
> node.
> >>> - Before PMEM is online, no demotion_nodes for node0 and node1.
> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
> >>>   <show nothing>
> >>> - After node3 is online as kmem
> >>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 &&
> >>> daxctl online-memory dax0.0 [
> >>>    {
> >>>      "chardev":"dax0.0",
> >>>      "size":1054867456,
> >>>      "target_node":3,
> >>>      "align":2097152,
> >>>      "mode":"system-ram",
> >>>      "online_memblocks":0,
> >>>      "total_memblocks":7
> >>>    }
> >>> ]
> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
> >>> 3
> >>> $ cat /sys/devices/system/node/node1/demotion_nodes
> >>> 3
> >>> $ cat /sys/devices/system/node/node3/demotion_nodes
> >>>   <show nothing>
> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> >> already.  A node in a higher tier can demote to any node in the lower
> >> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
> >> --
> >> Best Regards,
> >> Huang, Ying
> >>
> >>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> >>> ---
> >>>   drivers/base/node.c          | 13 +++++++++++++
> >>>   include/linux/memory-tiers.h |  6 ++++++
> >>>   mm/memory-tiers.c            |  8 ++++++++
> >>>   3 files changed, 27 insertions(+)
> >>>
> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c index
> >>> 493d533f8375..27e8502548a7 100644
> >>> --- a/drivers/base/node.c
> >>> +++ b/drivers/base/node.c
> >>> @@ -7,6 +7,7 @@
> >>>   #include <linux/init.h>
> >>>   #include <linux/mm.h>
> >>>   #include <linux/memory.h>
> >>> +#include <linux/memory-tiers.h>
> >>>   #include <linux/vmstat.h>
> >>>   #include <linux/notifier.h>
> >>>   #include <linux/node.h>
> >>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device
> *dev,
> >>>   }
> >>>   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
> >>>   +static ssize_t demotion_nodes_show(struct device *dev,
> >>> +			     struct device_attribute *attr, char *buf) {
> >>> +	int ret;
> >>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
> >>> +
> >>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
> >>> +	return ret;
> >>> +}
> >>> +static DEVICE_ATTR_RO(demotion_nodes);
> >>> +
> >>>   static struct attribute *node_dev_attrs[] = {
> >>>   	&dev_attr_meminfo.attr,
> >>>   	&dev_attr_numastat.attr,
> >>>   	&dev_attr_distance.attr,
> >>>   	&dev_attr_vmstat.attr,
> >>> +	&dev_attr_demotion_nodes.attr,
> >>>   	NULL
> >>>   };
> >>>   diff --git a/include/linux/memory-tiers.h
> >>> b/include/linux/memory-tiers.h index 437441cdf78f..8eb04923f965
> >>> 100644
> >>> --- a/include/linux/memory-tiers.h
> >>> +++ b/include/linux/memory-tiers.h
> >>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct
> memory_dev_type *default_type);
> >>>   void clear_node_memory_type(int node, struct memory_dev_type
> *memtype);
> >>>   #ifdef CONFIG_MIGRATION
> >>>   int next_demotion_node(int node);
> >>> +nodemask_t next_demotion_nodes(int node);
> >>>   void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t
> *targets);
> >>>   bool node_is_toptier(int node);
> >>>   #else
> >>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
> >>>   	return NUMA_NO_NODE;
> >>>   }
> >>> +static inline nodemask_t next_demotion_nodes(int node)
> >>> +{
> >>> +	return NODE_MASK_NONE;
> >>> +}
> >>> +
> >>>   static inline void node_get_allowed_targets(pg_data_t *pgdat,
> nodemask_t *targets)
> >>>   {
> >>>   	*targets = NODE_MASK_NONE;
> >>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index
> >>> 37a4f59d9585..90047f37d98a 100644
> >>> --- a/mm/memory-tiers.c
> >>> +++ b/mm/memory-tiers.c
> >>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat,
> nodemask_t *targets)
> >>>   	rcu_read_unlock();
> >>>   }
> >>>   +nodemask_t next_demotion_nodes(int node)
> >>> +{
> >>> +	if (!node_demotion)
> >>> +		return NODE_MASK_NONE;
> >>> +
> >>> +	return node_demotion[node].preferred; }
> >>> +
> >>>   /**
> >>>    * next_demotion_node() - Get the next node in the demotion path
> >>>    * @node: The starting node to lookup the next node


* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
  2024-01-31  6:23         ` Yasunori Gotou (Fujitsu)
@ 2024-01-31  6:52           ` Huang, Ying
  0 siblings, 0 replies; 34+ messages in thread
From: Huang, Ying @ 2024-01-31  6:52 UTC (permalink / raw)
  To: Yasunori Gotou (Fujitsu)
  Cc: Zhijian Li (Fujitsu),
	Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	linux-kernel

"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com> writes:

> Hello,
>
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>> 
>> > Hi Ying
>> >
>> > I need to pick up this thread/patch again.
>> >
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already.  A node in a higher tier can demote to any node in the lower
>> >> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>> >>
>> >
>> > Yes, it's believed that
>> > /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
>> > are intended to show nodes in memory_tierN. But IMHO, it's not enough,
>> > especially for the preferred demotion node(s).
>> >
>> > Currently, when a demotion occurs, it will prioritize selecting a node
>> > from the preferred nodes as the destination node for the demotion. If
>> > the preferred nodes does not meet the requirements, it will try from
>> > all the lower memory tier nodes until it finds a suitable demotion
>> > destination node or ultimately fails.
>> >
>> > However, currently it only lists the nodes of each tier. If the
>> > administrators want to know all the possible demotion destinations for
>> > a given node, they need to calculate it themselves:
>> > Step 1, find the memory tier where the given node is located Step 2,
>> > list all nodes under all its lower tiers
>> >
>> > It will be even more difficult to know the preferred nodes which
>> > depend on more factors, distance etc. For the following example, we
>> > may have 6 nodes splitting into three memory tiers.
>> >
>> > For emulated hmat numa topology example:
>> >> $ numactl -H
>> >> available: 6 nodes (0-5)
>> >> node 0 cpus: 0
>> >> node 0 size: 1974 MB
>> >> node 0 free: 1767 MB
>> >> node 1 cpus: 1
>> >> node 1 size: 1694 MB
>> >> node 1 free: 1454 MB
>> >> node 2 cpus:
>> >> node 2 size: 896 MB
>> >> node 2 free: 896 MB
>> >> node 3 cpus:
>> >> node 3 size: 896 MB
>> >> node 3 free: 896 MB
>> >> node 4 cpus:
>> >> node 4 size: 896 MB
>> >> node 4 free: 896 MB
>> >> node 5 cpus:
>> >> node 5 size: 896 MB
>> >> node 5 free: 896 MB
>> >> node distances:
>> >> node   0   1   2   3   4   5
>> >> 0:  10  31  21  41  21  41
>> >> 1:  31  10  41  21  41  21
>> >> 2:  21  41  10  51  21  51
>> >> 3:  31  21  51  10  51  21
>> >> 4:  21  41  21  51  10  51
>> >> 5:  31  21  51  21  51  10
>> >> $ cat memory_tier4/nodelist
>> >> 0-1
>> >> $ cat memory_tier12/nodelist
>> >> 2,5
>> >> $ cat memory_tier54/nodelist
>> >> 3-4
>> >
>> > For above topology, memory-tier will build the demotion path for each
>> > node like this:
>> > node[0].preferred = 2
>> > node[0].demotion_targets = 2-5
>> > node[1].preferred = 5
>> > node[1].demotion_targets = 2-5
>> > node[2].preferred = 4
>> > node[2].demotion_targets = 3-4
>> > node[3].preferred = <empty>
>> > node[3].demotion_targets = <empty>
>> > node[4].preferred = <empty>
>> > node[4].demotion_targets = <empty>
>> > node[5].preferred = 3
>> > node[5].demotion_targets = 3-4
>> >
>> > But this demotion path is not explicitly known to administrator. And
>> > with the feedback from our customers, they also think it is helpful to
>> > know demotion path built by kernel to understand the demotion
>> > behaviors.
>> >
>> > So i think we should have 2 new interfaces for each node:
>> >
>> > /sys/devices/system/node/nodeN/demotion_allowed_nodes
>> > /sys/devices/system/node/nodeN/demotion_preferred_nodes
>> >
>> > I value your opinion, and I'd like to know what you think about...
>> 
>> Per my understanding, we will not expose everything inside kernel to user
>> space.  For page placement in a tiered memory system, demotion is just a part
>> of the story.  For example, if the DRAM of a system becomes full, new page
>> allocation will fall back to the CXL memory.  Have we exposed the default page
>> allocation fallback order to user space?
>
> In extreme terms, users want to analyze all the memory behaviors of memory management
> while executing their workload, and want to trace ALL of them if possible.
> Of course, it is impossible due to the heavy load, then users want to have other ways as
> a compromise. Our request, the demotion target information, is just one of them.
>
> In my impression, users worry about the impact of the CXL memory device on their workload, 
> and want to have a way to understand the impact.
> If they know there is no information to remove their anxious, they may avoid to buy CXL memory.
>
> In addition, our support team also needs to have clues to solve users' performance problems. 
> Even if new page allocation will fall back to the CXL memory, we need to explain why it would
> happen as accountability.

I guess

/proc/<PID>/numa_maps
/sys/fs/cgroup/<CGNAME>/memory.numa_stat

may help to understand system behavior.
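
As a reading aid for the interfaces mentioned above: the per-node spread of a
mapping is reported by the `N<node>=<pages>` fields of /proc/<PID>/numa_maps
(field format per proc(5)). The sample line below is made up for illustration,
not captured from a real system:

```python
# Hedged sketch: extract the per-node page counts ("N<node>=<pages>")
# from one /proc/<PID>/numa_maps line.  The sample line is illustrative;
# the field format follows proc(5).
sample = "7f4a00000000 default anon=1024 dirty=1024 N0=512 N3=512 kernelpagesize_kB=4"

def numa_maps_node_pages(line):
    """Return {node: pages} parsed from one numa_maps line."""
    counts = {}
    for field in line.split():
        if field.startswith("N") and "=" in field:
            node, _, pages = field.partition("=")
            if node[1:].isdigit():
                counts[int(node[1:])] = int(pages)
    return counts

print(numa_maps_node_pages(sample))  # → {0: 512, 3: 512}
```

On a tiered system this shows, per mapping, how many pages sit on each node
(e.g. DRAM vs. CXL), which is close to what the demotion statistics in this
series aggregate per node.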

--
Best Regards,
Huang, Ying

>> 
>> All in all, in my opinion, we only expose as little as possible to user space
>> because we need to maintain the ABI for ever.
>
> I can understand there is a compatibility problem by our propose, and kernel may
> change its logic in future. This is a tug-of-war situation between kernel developers
> and users or support engineers. I suppose It often occurs in many place...
>
> Hmm... I hope there is a new idea to solve this situation even if our proposal is rejected..
> Anyone?
>
> Thanks,
> ----
> Yasunori Goto
>
>> 
>> --
>> Best Regards,
>> Huang, Ying
>> 
>> >
>> > On 02/11/2023 11:17, Huang, Ying wrote:
>> >> Li Zhijian <lizhijian@fujitsu.com> writes:
>> >>
>> >>> It shows the demotion target nodes of a node. Export this
>> >>> information to user directly.
>> >>>
>> >>> Below is an example where node0 node1 are DRAM, node3 is a PMEM
>> node.
>> >>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>>   <show nothing>
>> >>> - After node3 is online as kmem
>> >>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 &&
>> >>> daxctl online-memory dax0.0 [
>> >>>    {
>> >>>      "chardev":"dax0.0",
>> >>>      "size":1054867456,
>> >>>      "target_node":3,
>> >>>      "align":2097152,
>> >>>      "mode":"system-ram",
>> >>>      "online_memblocks":0,
>> >>>      "total_memblocks":7
>> >>>    }
>> >>> ]
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node3/demotion_nodes
>> >>>   <show nothing>
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already.  A node in a higher tier can demote to any node in the lower
>> >> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>> >> --
>> >> Best Regards,
>> >> Huang, Ying
>> >>
>> >>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> >>> ---
>> >>>   drivers/base/node.c          | 13 +++++++++++++
>> >>>   include/linux/memory-tiers.h |  6 ++++++
>> >>>   mm/memory-tiers.c            |  8 ++++++++
>> >>>   3 files changed, 27 insertions(+)
>> >>>
>> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c index
>> >>> 493d533f8375..27e8502548a7 100644
>> >>> --- a/drivers/base/node.c
>> >>> +++ b/drivers/base/node.c
>> >>> @@ -7,6 +7,7 @@
>> >>>   #include <linux/init.h>
>> >>>   #include <linux/mm.h>
>> >>>   #include <linux/memory.h>
>> >>> +#include <linux/memory-tiers.h>
>> >>>   #include <linux/vmstat.h>
>> >>>   #include <linux/notifier.h>
>> >>>   #include <linux/node.h>
>> >>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device
>> *dev,
>> >>>   }
>> >>>   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>> >>>   +static ssize_t demotion_nodes_show(struct device *dev,
>> >>> +			     struct device_attribute *attr, char *buf) {
>> >>> +	int ret;
>> >>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>> >>> +
>> >>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> >>> +	return ret;
>> >>> +}
>> >>> +static DEVICE_ATTR_RO(demotion_nodes);
>> >>> +
>> >>>   static struct attribute *node_dev_attrs[] = {
>> >>>   	&dev_attr_meminfo.attr,
>> >>>   	&dev_attr_numastat.attr,
>> >>>   	&dev_attr_distance.attr,
>> >>>   	&dev_attr_vmstat.attr,
>> >>> +	&dev_attr_demotion_nodes.attr,
>> >>>   	NULL
>> >>>   };
>> >>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> >>> index 437441cdf78f..8eb04923f965 100644
>> >>> --- a/include/linux/memory-tiers.h
>> >>> +++ b/include/linux/memory-tiers.h
>> >>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>> >>>   void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>> >>>   #ifdef CONFIG_MIGRATION
>> >>>   int next_demotion_node(int node);
>> >>> +nodemask_t next_demotion_nodes(int node);
>> >>>   void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> >>>   bool node_is_toptier(int node);
>> >>>   #else
>> >>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>> >>>   	return NUMA_NO_NODE;
>> >>>   }
>> >>>   +static inline nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	return NODE_MASK_NONE;
>> >>> +}
>> >>> +
>> >>>   static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>   {
>> >>>   	*targets = NODE_MASK_NONE;
>> >>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> >>> index 37a4f59d9585..90047f37d98a 100644
>> >>> --- a/mm/memory-tiers.c
>> >>> +++ b/mm/memory-tiers.c
>> >>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>   	rcu_read_unlock();
>> >>>   }
>> >>>   +nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	if (!node_demotion)
>> >>> +		return NODE_MASK_NONE;
>> >>> +
>> >>> +	return node_demotion[node].preferred;
>> >>> +}
>> >>> +
>> >>>   /**
>> >>>    * next_demotion_node() - Get the next node in the demotion path
>> >>>    * @node: The starting node to lookup the next node

^ permalink raw reply	[flat|nested] 34+ messages in thread
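[Editorial note: not part of the patch above. The `demotion_nodes` file is emitted with the kernel's `%*pbl` ranged-list format (`3`, `2-5,7`, or an empty string when a node has no demotion targets). A minimal sketch of how a user-space consumer might parse that back into node IDs — `parse_nodelist` is a hypothetical helper name, not an existing API:]

```python
def parse_nodelist(text):
    """Parse a kernel ranged node list ("3", "2-5", "2-5,7", or "")
    as emitted by sysfs files such as nodeX/demotion_nodes."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:          # empty file: no demotion targets
            continue
        if "-" in part:       # a range such as "2-5"
            lo, hi = part.split("-")
            nodes.update(range(int(lo), int(hi) + 1))
        else:                 # a single node such as "3"
            nodes.add(int(part))
    return nodes

print(parse_nodelist("2-5,7"))
```

[With the proposed interface, `parse_nodelist(open("/sys/devices/system/node/node0/demotion_nodes").read())` would yield node 0's demotion targets as a set.]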

* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
  2024-01-31  3:18         ` Zhijian Li (Fujitsu)
@ 2024-02-02  7:43           ` Zhijian Li (Fujitsu)
  2024-02-02  8:19             ` Huang, Ying
  0 siblings, 1 reply; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-02-02  7:43 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel, Jagdish Gediya


On 31/01/2024 11:17, Li Zhijian wrote:
>>> node[0].preferred = 2
>>> node[0].demotion_targets = 2-5
>>> node[1].preferred = 5
>>> node[1].demotion_targets = 2-5
>>> node[2].preferred = 4
>>> node[2].demotion_targets = 3-4
>>> node[3].preferred = <empty>
>>> node[3].demotion_targets = <empty>
>>> node[4].preferred = <empty>
>>> node[4].demotion_targets = <empty>
>>> node[5].preferred = 3
>>> node[5].demotion_targets = 3-4
>>> But this demotion path is not explicitly known to the administrator. And with
>>> the feedback from our customers, they also think it is helpful to know the
>>> demotion path built by the kernel to understand the demotion behaviors.
>>>
>>> So I think we should have 2 new interfaces for each node:
>>>
>>> /sys/devices/system/node/nodeN/demotion_allowed_nodes
>>> /sys/devices/system/node/nodeN/demotion_preferred_nodes
>>>
>>> I value your opinion, and I'd like to know what you think about...
>>
>> Per my understanding, we will not expose everything inside kernel to
>> user space.  For page placement in a tiered memory system, demotion is
>> just a part of the story.  For example, if the DRAM of a system becomes
>> full, new page allocation will fall back to the CXL memory.  Have we
>> exposed the default page allocation fallback order to user space?


Back to our initial requirement:
"When demotion is enabled, what is the demotion path, especially the preferred node?
Is it consistent with the administrator's expectations?"

It seems there is no direct answer. But the kernel already knows
this information, and IMHO exposing it to users is not a bad choice.

This information can help them adjust/tune the machine before they really
deploy their workloads.

If the sysfs approach isn't good enough, is it possible to have another, more
user-friendly way to convey this information? For example, as is done for the
allocation fallback order, simply print it to dmesg?


Thanks
Zhijian


> 
> Good question, I have no answer yet, but I think we can get the fallback order
> from the dmesg now.
> 
> The further action for us is that we will also try to improve user-space tools,
> such as numactl, to show the demotion path with the help of this exposed information.
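
[Editorial note: a sketch of what such a user-space tool could do with the exposed data. `demotion_chain` is a hypothetical helper, not part of any posted patch; the table is the per-node preferred-target example quoted earlier in this thread:]

```python
def demotion_chain(preferred, start):
    """Follow each node's preferred demotion target until a node with
    no target (lowest tier) is reached; stop on an unexpected cycle."""
    chain = [start]
    node = start
    while preferred.get(node) is not None:
        node = preferred[node]
        if node in chain:     # defensive: the kernel should not build cycles
            break
        chain.append(node)
    return chain

# The example table from this thread:
# node[0].preferred = 2, node[1].preferred = 5, node[2].preferred = 4,
# node[3]/node[4] have no target, node[5].preferred = 3.
preferred = {0: 2, 1: 5, 2: 4, 3: None, 4: None, 5: 3}
print(demotion_chain(preferred, 0))  # [0, 2, 4]
```

[On a real system, `preferred` would be filled from the proposed nodeN/demotion_preferred_nodes files.]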


* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
  2024-02-02  7:43           ` Zhijian Li (Fujitsu)
@ 2024-02-02  8:19             ` Huang, Ying
  2024-02-05  7:31               ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2024-02-02  8:19 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel, Jagdish Gediya

"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 31/01/2024 11:17, Li Zhijian wrote:
>>>> node[0].preferred = 2
>>>> node[0].demotion_targets = 2-5
>>>> node[1].preferred = 5
>>>> node[1].demotion_targets = 2-5
>>>> node[2].preferred = 4
>>>> node[2].demotion_targets = 3-4
>>>> node[3].preferred = <empty>
>>>> node[3].demotion_targets = <empty>
>>>> node[4].preferred = <empty>
>>>> node[4].demotion_targets = <empty>
>>>> node[5].preferred = 3
>>>> node[5].demotion_targets = 3-4
>>>> But this demotion path is not explicitly known to the administrator. And with
>>>> the feedback from our customers, they also think it is helpful to know the
>>>> demotion path built by the kernel to understand the demotion behaviors.
>>>>
>>>> So I think we should have 2 new interfaces for each node:
>>>>
>
>>>> /sys/devices/system/node/nodeN/demotion_allowed_nodes
>>>> /sys/devices/system/node/nodeN/demotion_preferred_nodes
>>>>
>>>> I value your opinion, and I'd like to know what you think about...
>>>
>>> Per my understanding, we will not expose everything inside kernel to
>>> user space.  For page placement in a tiered memory system, demotion is
>>> just a part of the story.  For example, if the DRAM of a system becomes
>>> full, new page allocation will fall back to the CXL memory.  Have we
>>> exposed the default page allocation fallback order to user space?
>
>
> Back to our initial requirement:
> "When demotion is enabled, what is the demotion path, especially the preferred node?
> Is it consistent with the administrator's expectations?"
>
> It seems there is no direct answer. But the kernel already knows
> this information, and IMHO exposing it to users is not a bad choice.
>
> This information can help them adjust/tune the machine before they really
> deploy their workloads.
>
> If the sysfs approach isn't good enough, is it possible to have another, more
> user-friendly way to convey this information? For example, as is done for the
> allocation fallback order, simply print it to dmesg?

I have no objection to printing some demotion information in dmesg.

--
Best Regards,
Huang, Ying

>
>> 
>> Good question, I have no answer yet, but I think we can get the fallback order
>> from the dmesg now.
>> 
>> The further action for us is that we will also try to improve user-space tools,
>> such as numactl, to show the demotion path with the help of this exposed information.


* Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
  2024-02-02  8:19             ` Huang, Ying
@ 2024-02-05  7:31               ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 34+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-02-05  7:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Greg Kroah-Hartman, rafael, linux-mm,
	Yasunori Gotou (Fujitsu),
	linux-kernel, Jagdish Gediya



On 02/02/2024 16:19, Huang, Ying wrote:
>> Back to our initial requirement:
>> "When demotion is enabled, what is the demotion path, especially the preferred node?
>> Is it consistent with the administrator's expectations?"
>>
>> It seems there is no direct answer. But the kernel already knows
>> this information, and IMHO exposing it to users is not a bad choice.
>>
>> This information can help them adjust/tune the machine before they really
>> deploy their workloads.
>>
>> If the sysfs approach isn't good enough, is it possible to have another, more
>> user-friendly way to convey this information? For example, as is done for the
>> allocation fallback order, simply print it to dmesg?
> I have no objection to printing some demotion information in dmesg.
> 

Thank you for sharing your thoughts and feedback on this.
I will attempt to do so.


Thanks
Zhijian


end of thread, other threads:[~2024-02-05  7:32 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-02  2:56 [PATCH RFC 0/4] Demotion Profiling Improvements Li Zhijian
2023-11-02  2:56 ` [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface Li Zhijian
2023-11-02  3:17   ` Huang, Ying
2023-11-02  3:39     ` Zhijian Li (Fujitsu)
2023-11-02  5:18       ` Huang, Ying
2023-11-02  5:54         ` Zhijian Li (Fujitsu)
2023-11-02  5:58           ` Huang, Ying
2023-11-03  3:05             ` Zhijian Li (Fujitsu)
2024-01-30  8:53     ` Li Zhijian
2024-01-31  1:13       ` Huang, Ying
2024-01-31  3:18         ` Zhijian Li (Fujitsu)
2024-02-02  7:43           ` Zhijian Li (Fujitsu)
2024-02-02  8:19             ` Huang, Ying
2024-02-05  7:31               ` Zhijian Li (Fujitsu)
2024-01-31  6:23         ` Yasunori Gotou (Fujitsu)
2024-01-31  6:52           ` Huang, Ying
2023-11-03  2:21   ` kernel test robot
2023-11-02  2:56 ` [PATCH RFC 2/4] mm/vmstat: Move pgdemote_* to per-node stats Li Zhijian
2023-11-02  4:56   ` Huang, Ying
2023-11-02  5:43   ` Huang, Ying
2023-11-02  5:57     ` Zhijian Li (Fujitsu)
2023-11-02  2:56 ` [PATCH RFC 3/4] mm/vmstat: rename pgdemote_* to pgdemote_dst_* and add pgdemote_src_* Li Zhijian
2023-11-02  5:45   ` Huang, Ying
2023-11-02  6:34     ` Zhijian Li (Fujitsu)
2023-11-02  6:56       ` Huang, Ying
2023-11-02  7:38       ` Yasunori Gotou (Fujitsu)
2023-11-02  7:46         ` Huang, Ying
2023-11-02  9:45           ` Yasunori Gotou (Fujitsu)
2023-11-03  6:14             ` Huang, Ying
2023-11-06  5:02               ` Yasunori Gotou (Fujitsu)
2023-11-02 17:16   ` kernel test robot
2023-11-02  2:56 ` [PATCH RFC 4/4] drivers/base/node: add demote_src and demote_dst to numastat Li Zhijian
2023-11-02  5:40   ` Greg Kroah-Hartman
2023-11-02  8:15     ` Zhijian Li (Fujitsu)
