* [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
@ 2022-01-26 17:00 Jonghyeon Kim
2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
0 siblings, 2 replies; 13+ messages in thread
From: Jonghyeon Kim @ 2022-01-26 17:00 UTC (permalink / raw)
To: dan.j.williams
Cc: vishal.l.verma, dave.jiang, akpm, nvdimm, linux-kernel, linux-mm,
Jonghyeon Kim
Export shrink_zone_span() and update_pgdat_span() functions to head
file. We need to update real number of spanned pages for NUMA nodes and
zones when we add memory device node such as device dax memory.
Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
---
include/linux/memory_hotplug.h | 3 +++
mm/memory_hotplug.c | 6 ++++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index be48e003a518..25c7f60c317e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
extern void remove_pfn_range_from_zone(struct zone *zone,
unsigned long start_pfn,
unsigned long nr_pages);
+extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn);
+extern void update_pgdat_span(struct pglist_data *pgdat);
extern bool is_memblock_offlined(struct memory_block *mem);
extern int sparse_add_section(int nid, unsigned long pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..38f46a9ef853 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
return 0;
}
-static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
unsigned long end_pfn)
{
unsigned long pfn;
@@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
}
}
}
+EXPORT_SYMBOL_GPL(shrink_zone_span);
-static void update_pgdat_span(struct pglist_data *pgdat)
+void update_pgdat_span(struct pglist_data *pgdat)
{
unsigned long node_start_pfn = 0, node_end_pfn = 0;
struct zone *zone;
@@ -456,6 +457,7 @@ static void update_pgdat_span(struct pglist_data *pgdat)
pgdat->node_start_pfn = node_start_pfn;
pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
}
+EXPORT_SYMBOL_GPL(update_pgdat_span);
void __ref remove_pfn_range_from_zone(struct zone *zone,
unsigned long start_pfn,
--
2.17.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
2022-01-26 17:00 [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node Jonghyeon Kim
@ 2022-01-26 17:00 ` Jonghyeon Kim
2022-01-27 0:29 ` kernel test robot
2022-01-27 5:29 ` kernel test robot
2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
1 sibling, 2 replies; 13+ messages in thread
From: Jonghyeon Kim @ 2022-01-26 17:00 UTC (permalink / raw)
To: dan.j.williams
Cc: vishal.l.verma, dave.jiang, akpm, nvdimm, linux-kernel, linux-mm,
Jonghyeon Kim
When device memory adds to the online NUMA node, the number of spanned
pages of the original device NUMA node should be updated.
By this patch, we can monitor the current spanned pages of each node
more accurately.
Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
---
drivers/dax/kmem.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..f63a739ac790 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/memory_hotplug.h>
#include "dax-private.h"
#include "bus.h"
@@ -48,6 +49,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
struct dax_kmem_data *data;
int i, rc, mapped = 0;
int numa_node;
+ int dev_node;
/*
* Ensure good NUMA information for the persistent memory.
@@ -147,6 +149,18 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
dev_set_drvdata(dev, data);
+ /* Update spanned_pages of the device numa node */
+ dev_node = dev_to_node(dev);
+ if (dev_node != numa_node && dev_node < numa_node) {
+ struct pglist_data *pgdat = NODE_DATA(dev_node);
+ struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
+ unsigned long start_pfn = zone->zone_start_pfn;
+ unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
+
+ shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
+ update_pgdat_span(pgdat);
+ }
+
return 0;
err_request_mem:
--
2.17.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
2022-01-26 17:00 [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node Jonghyeon Kim
2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
@ 2022-01-26 17:04 ` David Hildenbrand
2022-01-27 9:41 ` Jonghyeon Kim
1 sibling, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-01-26 17:04 UTC (permalink / raw)
To: Jonghyeon Kim, dan.j.williams
Cc: vishal.l.verma, dave.jiang, akpm, nvdimm, linux-kernel, linux-mm
On 26.01.22 18:00, Jonghyeon Kim wrote:
> Export shrink_zone_span() and update_pgdat_span() functions to head
> file. We need to update real number of spanned pages for NUMA nodes and
> zones when we add memory device node such as device dax memory.
>
Can you elaborate a bit more what you intend to fix?
Memory onlining/offlining is reponsible for updating the node/zone span,
and that's triggered when the dax/kmem mamory gets onlined/offlined.
> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> ---
> include/linux/memory_hotplug.h | 3 +++
> mm/memory_hotplug.c | 6 ++++--
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index be48e003a518..25c7f60c317e 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> extern void remove_pfn_range_from_zone(struct zone *zone,
> unsigned long start_pfn,
> unsigned long nr_pages);
> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> + unsigned long end_pfn);
> +extern void update_pgdat_span(struct pglist_data *pgdat);
> extern bool is_memblock_offlined(struct memory_block *mem);
> extern int sparse_add_section(int nid, unsigned long pfn,
> unsigned long nr_pages, struct vmem_altmap *altmap);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a9627dc784c..38f46a9ef853 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> return 0;
> }
>
> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> unsigned long end_pfn)
> {
> unsigned long pfn;
> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> }
> }
> }
> +EXPORT_SYMBOL_GPL(shrink_zone_span);
Exporting both as symbols feels very wrong. This is memory
onlining/offlining internal stuff.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
@ 2022-01-27 0:29 ` kernel test robot
2022-01-27 5:29 ` kernel test robot
1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-01-27 0:29 UTC (permalink / raw)
To: Jonghyeon Kim, dan.j.williams
Cc: kbuild-all, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm, Jonghyeon Kim
Hi Jonghyeon,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
url: https://github.com/0day-ci/linux/commits/Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
base: https://github.com/hnaz/linux-mm master
config: x86_64-randconfig-a002-20220124 (https://download.01.org/0day-ci/archive/20220127/202201270836.H8feaOM9-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
git checkout ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
# save the config file to linux build tree
mkdir build_dir
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/dax/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
drivers/dax/kmem.c: In function 'dev_dax_kmem_probe':
>> drivers/dax/kmem.c:156:42: error: 'ZONE_DEVICE' undeclared (first use in this function)
156 | struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
| ^~~~~~~~~~~
drivers/dax/kmem.c:156:42: note: each undeclared identifier is reported only once for each function it appears in
vim +/ZONE_DEVICE +156 drivers/dax/kmem.c
44
45 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
46 {
47 struct device *dev = &dev_dax->dev;
48 unsigned long total_len = 0;
49 struct dax_kmem_data *data;
50 int i, rc, mapped = 0;
51 int numa_node;
52 int dev_node;
53
54 /*
55 * Ensure good NUMA information for the persistent memory.
56 * Without this check, there is a risk that slow memory
57 * could be mixed in a node with faster memory, causing
58 * unavoidable performance issues.
59 */
60 numa_node = dev_dax->target_node;
61 if (numa_node < 0) {
62 dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
63 numa_node);
64 return -EINVAL;
65 }
66
67 for (i = 0; i < dev_dax->nr_range; i++) {
68 struct range range;
69
70 rc = dax_kmem_range(dev_dax, i, &range);
71 if (rc) {
72 dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
73 i, range.start, range.end);
74 continue;
75 }
76 total_len += range_len(&range);
77 }
78
79 if (!total_len) {
80 dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
81 return -EINVAL;
82 }
83
84 data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
85 if (!data)
86 return -ENOMEM;
87
88 rc = -ENOMEM;
89 data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
90 if (!data->res_name)
91 goto err_res_name;
92
93 rc = memory_group_register_static(numa_node, total_len);
94 if (rc < 0)
95 goto err_reg_mgid;
96 data->mgid = rc;
97
98 for (i = 0; i < dev_dax->nr_range; i++) {
99 struct resource *res;
100 struct range range;
101
102 rc = dax_kmem_range(dev_dax, i, &range);
103 if (rc)
104 continue;
105
106 /* Region is permanently reserved if hotremove fails. */
107 res = request_mem_region(range.start, range_len(&range), data->res_name);
108 if (!res) {
109 dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
110 i, range.start, range.end);
111 /*
112 * Once some memory has been onlined we can't
113 * assume that it can be un-onlined safely.
114 */
115 if (mapped)
116 continue;
117 rc = -EBUSY;
118 goto err_request_mem;
119 }
120 data->res[i] = res;
121
122 /*
123 * Set flags appropriate for System RAM. Leave ..._BUSY clear
124 * so that add_memory() can add a child resource. Do not
125 * inherit flags from the parent since it may set new flags
126 * unknown to us that will break add_memory() below.
127 */
128 res->flags = IORESOURCE_SYSTEM_RAM;
129
130 /*
131 * Ensure that future kexec'd kernels will not treat
132 * this as RAM automatically.
133 */
134 rc = add_memory_driver_managed(data->mgid, range.start,
135 range_len(&range), kmem_name, MHP_NID_IS_MGID);
136
137 if (rc) {
138 dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
139 i, range.start, range.end);
140 release_resource(res);
141 kfree(res);
142 data->res[i] = NULL;
143 if (mapped)
144 continue;
145 goto err_request_mem;
146 }
147 mapped++;
148 }
149
150 dev_set_drvdata(dev, data);
151
152 /* Update spanned_pages of the device numa node */
153 dev_node = dev_to_node(dev);
154 if (dev_node != numa_node && dev_node < numa_node) {
155 struct pglist_data *pgdat = NODE_DATA(dev_node);
> 156 struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
157 unsigned long start_pfn = zone->zone_start_pfn;
158 unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
159
160 shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
161 update_pgdat_span(pgdat);
162 }
163
164 return 0;
165
166 err_request_mem:
167 memory_group_unregister(data->mgid);
168 err_reg_mgid:
169 kfree(data->res_name);
170 err_res_name:
171 kfree(data);
172 return rc;
173 }
174
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
@ 2022-01-27 0:29 ` kernel test robot
0 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-01-27 0:29 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 6111 bytes --]
Hi Jonghyeon,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
url: https://github.com/0day-ci/linux/commits/Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
base: https://github.com/hnaz/linux-mm master
config: x86_64-randconfig-a002-20220124 (https://download.01.org/0day-ci/archive/20220127/202201270836.H8feaOM9-lkp(a)intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
git checkout ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
# save the config file to linux build tree
mkdir build_dir
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/dax/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
drivers/dax/kmem.c: In function 'dev_dax_kmem_probe':
>> drivers/dax/kmem.c:156:42: error: 'ZONE_DEVICE' undeclared (first use in this function)
156 | struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
| ^~~~~~~~~~~
drivers/dax/kmem.c:156:42: note: each undeclared identifier is reported only once for each function it appears in
vim +/ZONE_DEVICE +156 drivers/dax/kmem.c
44
45 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
46 {
47 struct device *dev = &dev_dax->dev;
48 unsigned long total_len = 0;
49 struct dax_kmem_data *data;
50 int i, rc, mapped = 0;
51 int numa_node;
52 int dev_node;
53
54 /*
55 * Ensure good NUMA information for the persistent memory.
56 * Without this check, there is a risk that slow memory
57 * could be mixed in a node with faster memory, causing
58 * unavoidable performance issues.
59 */
60 numa_node = dev_dax->target_node;
61 if (numa_node < 0) {
62 dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
63 numa_node);
64 return -EINVAL;
65 }
66
67 for (i = 0; i < dev_dax->nr_range; i++) {
68 struct range range;
69
70 rc = dax_kmem_range(dev_dax, i, &range);
71 if (rc) {
72 dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
73 i, range.start, range.end);
74 continue;
75 }
76 total_len += range_len(&range);
77 }
78
79 if (!total_len) {
80 dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
81 return -EINVAL;
82 }
83
84 data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
85 if (!data)
86 return -ENOMEM;
87
88 rc = -ENOMEM;
89 data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
90 if (!data->res_name)
91 goto err_res_name;
92
93 rc = memory_group_register_static(numa_node, total_len);
94 if (rc < 0)
95 goto err_reg_mgid;
96 data->mgid = rc;
97
98 for (i = 0; i < dev_dax->nr_range; i++) {
99 struct resource *res;
100 struct range range;
101
102 rc = dax_kmem_range(dev_dax, i, &range);
103 if (rc)
104 continue;
105
106 /* Region is permanently reserved if hotremove fails. */
107 res = request_mem_region(range.start, range_len(&range), data->res_name);
108 if (!res) {
109 dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
110 i, range.start, range.end);
111 /*
112 * Once some memory has been onlined we can't
113 * assume that it can be un-onlined safely.
114 */
115 if (mapped)
116 continue;
117 rc = -EBUSY;
118 goto err_request_mem;
119 }
120 data->res[i] = res;
121
122 /*
123 * Set flags appropriate for System RAM. Leave ..._BUSY clear
124 * so that add_memory() can add a child resource. Do not
125 * inherit flags from the parent since it may set new flags
126 * unknown to us that will break add_memory() below.
127 */
128 res->flags = IORESOURCE_SYSTEM_RAM;
129
130 /*
131 * Ensure that future kexec'd kernels will not treat
132 * this as RAM automatically.
133 */
134 rc = add_memory_driver_managed(data->mgid, range.start,
135 range_len(&range), kmem_name, MHP_NID_IS_MGID);
136
137 if (rc) {
138 dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
139 i, range.start, range.end);
140 release_resource(res);
141 kfree(res);
142 data->res[i] = NULL;
143 if (mapped)
144 continue;
145 goto err_request_mem;
146 }
147 mapped++;
148 }
149
150 dev_set_drvdata(dev, data);
151
152 /* Update spanned_pages of the device numa node */
153 dev_node = dev_to_node(dev);
154 if (dev_node != numa_node && dev_node < numa_node) {
155 struct pglist_data *pgdat = NODE_DATA(dev_node);
> 156 struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
157 unsigned long start_pfn = zone->zone_start_pfn;
158 unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
159
160 shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
161 update_pgdat_span(pgdat);
162 }
163
164 return 0;
165
166 err_request_mem:
167 memory_group_unregister(data->mgid);
168 err_reg_mgid:
169 kfree(data->res_name);
170 err_res_name:
171 kfree(data);
172 return rc;
173 }
174
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
@ 2022-01-27 5:29 ` kernel test robot
2022-01-27 5:29 ` kernel test robot
1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-01-27 5:29 UTC (permalink / raw)
To: Jonghyeon Kim, dan.j.williams
Cc: llvm, kbuild-all, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm, Jonghyeon Kim
Hi Jonghyeon,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
url: https://github.com/0day-ci/linux/commits/Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
base: https://github.com/hnaz/linux-mm master
config: s390-randconfig-r044-20220124 (https://download.01.org/0day-ci/archive/20220127/202201271342.1w9oD4VP-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project f400a6012c668dfaa73462caf067ceb074e66c47)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install s390 cross compiling tool for clang build
# apt-get install binutils-s390x-linux-gnu
# https://github.com/0day-ci/linux/commit/ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
git checkout ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/dax/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
>> drivers/dax/kmem.c:156:42: error: use of undeclared identifier 'ZONE_DEVICE'
struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
^
1 error generated.
vim +/ZONE_DEVICE +156 drivers/dax/kmem.c
44
45 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
46 {
47 struct device *dev = &dev_dax->dev;
48 unsigned long total_len = 0;
49 struct dax_kmem_data *data;
50 int i, rc, mapped = 0;
51 int numa_node;
52 int dev_node;
53
54 /*
55 * Ensure good NUMA information for the persistent memory.
56 * Without this check, there is a risk that slow memory
57 * could be mixed in a node with faster memory, causing
58 * unavoidable performance issues.
59 */
60 numa_node = dev_dax->target_node;
61 if (numa_node < 0) {
62 dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
63 numa_node);
64 return -EINVAL;
65 }
66
67 for (i = 0; i < dev_dax->nr_range; i++) {
68 struct range range;
69
70 rc = dax_kmem_range(dev_dax, i, &range);
71 if (rc) {
72 dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
73 i, range.start, range.end);
74 continue;
75 }
76 total_len += range_len(&range);
77 }
78
79 if (!total_len) {
80 dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
81 return -EINVAL;
82 }
83
84 data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
85 if (!data)
86 return -ENOMEM;
87
88 rc = -ENOMEM;
89 data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
90 if (!data->res_name)
91 goto err_res_name;
92
93 rc = memory_group_register_static(numa_node, total_len);
94 if (rc < 0)
95 goto err_reg_mgid;
96 data->mgid = rc;
97
98 for (i = 0; i < dev_dax->nr_range; i++) {
99 struct resource *res;
100 struct range range;
101
102 rc = dax_kmem_range(dev_dax, i, &range);
103 if (rc)
104 continue;
105
106 /* Region is permanently reserved if hotremove fails. */
107 res = request_mem_region(range.start, range_len(&range), data->res_name);
108 if (!res) {
109 dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
110 i, range.start, range.end);
111 /*
112 * Once some memory has been onlined we can't
113 * assume that it can be un-onlined safely.
114 */
115 if (mapped)
116 continue;
117 rc = -EBUSY;
118 goto err_request_mem;
119 }
120 data->res[i] = res;
121
122 /*
123 * Set flags appropriate for System RAM. Leave ..._BUSY clear
124 * so that add_memory() can add a child resource. Do not
125 * inherit flags from the parent since it may set new flags
126 * unknown to us that will break add_memory() below.
127 */
128 res->flags = IORESOURCE_SYSTEM_RAM;
129
130 /*
131 * Ensure that future kexec'd kernels will not treat
132 * this as RAM automatically.
133 */
134 rc = add_memory_driver_managed(data->mgid, range.start,
135 range_len(&range), kmem_name, MHP_NID_IS_MGID);
136
137 if (rc) {
138 dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
139 i, range.start, range.end);
140 release_resource(res);
141 kfree(res);
142 data->res[i] = NULL;
143 if (mapped)
144 continue;
145 goto err_request_mem;
146 }
147 mapped++;
148 }
149
150 dev_set_drvdata(dev, data);
151
152 /* Update spanned_pages of the device numa node */
153 dev_node = dev_to_node(dev);
154 if (dev_node != numa_node && dev_node < numa_node) {
155 struct pglist_data *pgdat = NODE_DATA(dev_node);
> 156 struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
157 unsigned long start_pfn = zone->zone_start_pfn;
158 unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
159
160 shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
161 update_pgdat_span(pgdat);
162 }
163
164 return 0;
165
166 err_request_mem:
167 memory_group_unregister(data->mgid);
168 err_reg_mgid:
169 kfree(data->res_name);
170 err_res_name:
171 kfree(data);
172 return rc;
173 }
174
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
@ 2022-01-27 5:29 ` kernel test robot
0 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-01-27 5:29 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 6325 bytes --]
Hi Jonghyeon,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
url: https://github.com/0day-ci/linux/commits/Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
base: https://github.com/hnaz/linux-mm master
config: s390-randconfig-r044-20220124 (https://download.01.org/0day-ci/archive/20220127/202201271342.1w9oD4VP-lkp(a)intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project f400a6012c668dfaa73462caf067ceb074e66c47)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install s390 cross compiling tool for clang build
# apt-get install binutils-s390x-linux-gnu
# https://github.com/0day-ci/linux/commit/ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
git checkout ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/dax/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
>> drivers/dax/kmem.c:156:42: error: use of undeclared identifier 'ZONE_DEVICE'
struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
^
1 error generated.
vim +/ZONE_DEVICE +156 drivers/dax/kmem.c
44
45 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
46 {
47 struct device *dev = &dev_dax->dev;
48 unsigned long total_len = 0;
49 struct dax_kmem_data *data;
50 int i, rc, mapped = 0;
51 int numa_node;
52 int dev_node;
53
54 /*
55 * Ensure good NUMA information for the persistent memory.
56 * Without this check, there is a risk that slow memory
57 * could be mixed in a node with faster memory, causing
58 * unavoidable performance issues.
59 */
60 numa_node = dev_dax->target_node;
61 if (numa_node < 0) {
62 dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
63 numa_node);
64 return -EINVAL;
65 }
66
67 for (i = 0; i < dev_dax->nr_range; i++) {
68 struct range range;
69
70 rc = dax_kmem_range(dev_dax, i, &range);
71 if (rc) {
72 dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
73 i, range.start, range.end);
74 continue;
75 }
76 total_len += range_len(&range);
77 }
78
79 if (!total_len) {
80 dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
81 return -EINVAL;
82 }
83
84 data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
85 if (!data)
86 return -ENOMEM;
87
88 rc = -ENOMEM;
89 data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
90 if (!data->res_name)
91 goto err_res_name;
92
93 rc = memory_group_register_static(numa_node, total_len);
94 if (rc < 0)
95 goto err_reg_mgid;
96 data->mgid = rc;
97
98 for (i = 0; i < dev_dax->nr_range; i++) {
99 struct resource *res;
100 struct range range;
101
102 rc = dax_kmem_range(dev_dax, i, &range);
103 if (rc)
104 continue;
105
106 /* Region is permanently reserved if hotremove fails. */
107 res = request_mem_region(range.start, range_len(&range), data->res_name);
108 if (!res) {
109 dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
110 i, range.start, range.end);
111 /*
112 * Once some memory has been onlined we can't
113 * assume that it can be un-onlined safely.
114 */
115 if (mapped)
116 continue;
117 rc = -EBUSY;
118 goto err_request_mem;
119 }
120 data->res[i] = res;
121
122 /*
123 * Set flags appropriate for System RAM. Leave ..._BUSY clear
124 * so that add_memory() can add a child resource. Do not
125 * inherit flags from the parent since it may set new flags
126 * unknown to us that will break add_memory() below.
127 */
128 res->flags = IORESOURCE_SYSTEM_RAM;
129
130 /*
131 * Ensure that future kexec'd kernels will not treat
132 * this as RAM automatically.
133 */
134 rc = add_memory_driver_managed(data->mgid, range.start,
135 range_len(&range), kmem_name, MHP_NID_IS_MGID);
136
137 if (rc) {
138 dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
139 i, range.start, range.end);
140 release_resource(res);
141 kfree(res);
142 data->res[i] = NULL;
143 if (mapped)
144 continue;
145 goto err_request_mem;
146 }
147 mapped++;
148 }
149
150 dev_set_drvdata(dev, data);
151
152 /* Update spanned_pages of the device numa node */
153 dev_node = dev_to_node(dev);
154 if (dev_node != numa_node && dev_node < numa_node) {
155 struct pglist_data *pgdat = NODE_DATA(dev_node);
> 156 struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
157 unsigned long start_pfn = zone->zone_start_pfn;
158 unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
159
160 shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
161 update_pgdat_span(pgdat);
162 }
163
164 return 0;
165
166 err_request_mem:
167 memory_group_unregister(data->mgid);
168 err_reg_mgid:
169 kfree(data->res_name);
170 err_res_name:
171 kfree(data);
172 return rc;
173 }
174
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
@ 2022-01-27 9:41 ` Jonghyeon Kim
2022-01-27 9:54 ` David Hildenbrand
0 siblings, 1 reply; 13+ messages in thread
From: Jonghyeon Kim @ 2022-01-27 9:41 UTC (permalink / raw)
To: David Hildenbrand
Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm
On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
> On 26.01.22 18:00, Jonghyeon Kim wrote:
> > Export shrink_zone_span() and update_pgdat_span() functions to head
> > file. We need to update real number of spanned pages for NUMA nodes and
> > zones when we add memory device node such as device dax memory.
> >
>
> Can you elaborate a bit more what you intend to fix?
>
> Memory onlining/offlining is reponsible for updating the node/zone span,
> and that's triggered when the dax/kmem mamory gets onlined/offlined.
>
Sure, sorry for the lack of explanation of the intended fix.
Before onlining nvdimm memory using dax(devdax or fsdax), these memory belong to
cpu NUMA nodes, which extends span pages of node/zone as a ZONE_DEVICE. So there
is no problem because node/zone contain these additional non-visible memory
devices to the system.
But, if we online dax-memory, zone[ZONE_DEVICE] of CPU NUMA node is hot-plugged
to new NUMA node(but CPU-less). I think there is no need to hold
zone[ZONE_DEVICE] pages on the original node.
Additionally, spanned pages are also used to calculate the end pfn of a node.
Thus, it is needed to maintain accurate page stats for node/zone.
My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
(DC persistent memory modules) with App-Direct mode.
Below are my test results.
Before memory onlining:
# ndctl create-namespace --mode=devdax
# ndctl create-namespace --mode=devdax
# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
Node 0, zone DMA spanned 4095
Node 0, zone DMA32 spanned 1044480
Node 0, zone Normal spanned 7864320
Node 0, zone Movable spanned 0
Node 0, zone Device spanned 66060288
Node 1, zone DMA spanned 0
Node 1, zone DMA32 spanned 0
Node 1, zone Normal spanned 8388608
Node 1, zone Movable spanned 0
Node 1, zone Device spanned 66060288
After memory onlining:
# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
# daxctl reconfigure-device --mode=system-ram --no-online dax1.0
# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
Node 0, zone DMA spanned 4095
Node 0, zone DMA32 spanned 1044480
Node 0, zone Normal spanned 7864320
Node 0, zone Movable spanned 0
Node 0, zone Device spanned 66060288
Node 1, zone DMA spanned 0
Node 1, zone DMA32 spanned 0
Node 1, zone Normal spanned 8388608
Node 1, zone Movable spanned 0
Node 1, zone Device spanned 66060288
Node 2, zone DMA spanned 0
Node 2, zone DMA32 spanned 0
Node 2, zone Normal spanned 65011712
Node 2, zone Movable spanned 0
Node 2, zone Device spanned 0
Node 3, zone DMA spanned 0
Node 3, zone DMA32 spanned 0
Node 3, zone Normal spanned 65011712
Node 3, zone Movable spanned 0
Node 3, zone Device spanned 0
As we can see, Node 0 and 1 still have zone_device pages after memory onlining.
This causes problem that Node 0 and Node 2 have same end of pfn values, also
Node 1 and Node 3 have same problem.
> > Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> > ---
> > include/linux/memory_hotplug.h | 3 +++
> > mm/memory_hotplug.c | 6 ++++--
> > 2 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index be48e003a518..25c7f60c317e 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> > extern void remove_pfn_range_from_zone(struct zone *zone,
> > unsigned long start_pfn,
> > unsigned long nr_pages);
> > +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> > + unsigned long end_pfn);
> > +extern void update_pgdat_span(struct pglist_data *pgdat);
> > extern bool is_memblock_offlined(struct memory_block *mem);
> > extern int sparse_add_section(int nid, unsigned long pfn,
> > unsigned long nr_pages, struct vmem_altmap *altmap);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 2a9627dc784c..38f46a9ef853 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> > return 0;
> > }
> >
> > -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> > +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> > unsigned long end_pfn)
> > {
> > unsigned long pfn;
> > @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> > }
> > }
> > }
> > +EXPORT_SYMBOL_GPL(shrink_zone_span);
>
> Exporting both as symbols feels very wrong. This is memory
> onlining/offlining internal stuff.
I agree with you that your comment. I will find another approach to avoid
directly using onlining/offlining internal stuff while updating node/zone span.
Thanks,
Jonghyeon
>
>
>
> --
> Thanks,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
2022-01-27 9:41 ` Jonghyeon Kim
@ 2022-01-27 9:54 ` David Hildenbrand
2022-01-28 4:19 ` Jonghyeon Kim
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-01-27 9:54 UTC (permalink / raw)
To: Jonghyeon Kim
Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm
On 27.01.22 10:41, Jonghyeon Kim wrote:
> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
>> On 26.01.22 18:00, Jonghyeon Kim wrote:
>>> Export shrink_zone_span() and update_pgdat_span() functions to head
>>> file. We need to update real number of spanned pages for NUMA nodes and
>>> zones when we add memory device node such as device dax memory.
>>>
>>
>> Can you elaborate a bit more what you intend to fix?
>>
>> Memory onlining/offlining is reponsible for updating the node/zone span,
>> and that's triggered when the dax/kmem mamory gets onlined/offlined.
>>
> Sure, sorry for the lack of explanation of the intended fix.
>
> Before onlining nvdimm memory using dax(devdax or fsdax), these memory belong to
> cpu NUMA nodes, which extends span pages of node/zone as a ZONE_DEVICE. So there
> is no problem because node/zone contain these additional non-visible memory
> devices to the system.
> But, if we online dax-memory, zone[ZONE_DEVICE] of CPU NUMA node is hot-plugged
> to new NUMA node(but CPU-less). I think there is no need to hold
> zone[ZONE_DEVICE] pages on the original node.
>
> Additionally, spanned pages are also used to calculate the end pfn of a node.
> Thus, it is needed to maintain accurate page stats for node/zone.
>
> My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
> (DC persistent memory modules) with App-Direct mode.
>
> Below are my test results.
>
> Before memory onlining:
>
> # ndctl create-namespace --mode=devdax
> # ndctl create-namespace --mode=devdax
> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> Node 0, zone DMA spanned 4095
> Node 0, zone DMA32 spanned 1044480
> Node 0, zone Normal spanned 7864320
> Node 0, zone Movable spanned 0
> Node 0, zone Device spanned 66060288
> Node 1, zone DMA spanned 0
> Node 1, zone DMA32 spanned 0
> Node 1, zone Normal spanned 8388608
> Node 1, zone Movable spanned 0
> Node 1, zone Device spanned 66060288
>
> After memory onlining:
>
> # daxctl reconfigure-device --mode=system-ram --no-online dax0.0
> # daxctl reconfigure-device --mode=system-ram --no-online dax1.0
>
> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> Node 0, zone DMA spanned 4095
> Node 0, zone DMA32 spanned 1044480
> Node 0, zone Normal spanned 7864320
> Node 0, zone Movable spanned 0
> Node 0, zone Device spanned 66060288
> Node 1, zone DMA spanned 0
> Node 1, zone DMA32 spanned 0
> Node 1, zone Normal spanned 8388608
> Node 1, zone Movable spanned 0
> Node 1, zone Device spanned 66060288
> Node 2, zone DMA spanned 0
> Node 2, zone DMA32 spanned 0
> Node 2, zone Normal spanned 65011712
> Node 2, zone Movable spanned 0
> Node 2, zone Device spanned 0
> Node 3, zone DMA spanned 0
> Node 3, zone DMA32 spanned 0
> Node 3, zone Normal spanned 65011712
> Node 3, zone Movable spanned 0
> Node 3, zone Device spanned 0
>
> As we can see, Node 0 and 1 still have zone_device pages after memory onlining.
> This causes problem that Node 0 and Node 2 have same end of pfn values, also
> Node 1 and Node 3 have same problem.
Thanks for the information, that makes it clearer.
While this unfortunate, the node/zone span is something fairly
unreliable/unusable for user space. Nodes and zones can overlap just easily.
What counts are present/managed pages in the node/zone.
So at least I don't count this as something that "needs fixing",
it's more something that's nice to handle better if easily possible.
See below.
>
>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
>>> ---
>>> include/linux/memory_hotplug.h | 3 +++
>>> mm/memory_hotplug.c | 6 ++++--
>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>> index be48e003a518..25c7f60c317e 100644
>>> --- a/include/linux/memory_hotplug.h
>>> +++ b/include/linux/memory_hotplug.h
>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>>> extern void remove_pfn_range_from_zone(struct zone *zone,
>>> unsigned long start_pfn,
>>> unsigned long nr_pages);
>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>> + unsigned long end_pfn);
>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
>>> extern bool is_memblock_offlined(struct memory_block *mem);
>>> extern int sparse_add_section(int nid, unsigned long pfn,
>>> unsigned long nr_pages, struct vmem_altmap *altmap);
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 2a9627dc784c..38f46a9ef853 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>>> return 0;
>>> }
>>>
>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>> unsigned long end_pfn)
>>> {
>>> unsigned long pfn;
>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>> }
>>> }
>>> }
>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
>>
>> Exporting both as symbols feels very wrong. This is memory
>> onlining/offlining internal stuff.
>
> I agree with you that your comment. I will find another approach to avoid
> directly using onlining/offlining internal stuff while updating node/zone span.
IIRC, to handle what you intend to handle properly want to look into teaching
remove_pfn_range_from_zone() to handle zone_is_zone_device().
There is a big fat comment:
/*
* Zone shrinking code cannot properly deal with ZONE_DEVICE. So
* we will not try to shrink the zones - which is okay as
* set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
*/
if (zone_is_zone_device(zone))
return;
Similarly, try_offline_node() spells this out:
/*
* If the node still spans pages (especially ZONE_DEVICE), don't
* offline it. A node spans memory after move_pfn_range_to_zone(),
* e.g., after the memory block was onlined.
*/
if (pgdat->node_spanned_pages)
return;
So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
try_offline_node() implicitly.
Trying to update the node span manually without teaching node/zone shrinking code how to
handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
won't work if the range of interest is still surrounded by other ranges.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
2022-01-27 9:54 ` David Hildenbrand
@ 2022-01-28 4:19 ` Jonghyeon Kim
2022-01-28 8:10 ` David Hildenbrand
0 siblings, 1 reply; 13+ messages in thread
From: Jonghyeon Kim @ 2022-01-28 4:19 UTC (permalink / raw)
To: David Hildenbrand
Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm
On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
> On 27.01.22 10:41, Jonghyeon Kim wrote:
> > On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
> >> On 26.01.22 18:00, Jonghyeon Kim wrote:
> >>> Export shrink_zone_span() and update_pgdat_span() functions to head
> >>> file. We need to update real number of spanned pages for NUMA nodes and
> >>> zones when we add memory device node such as device dax memory.
> >>>
> >>
> >> Can you elaborate a bit more what you intend to fix?
> >>
> >> Memory onlining/offlining is reponsible for updating the node/zone span,
> >> and that's triggered when the dax/kmem mamory gets onlined/offlined.
> >>
> > Sure, sorry for the lack of explanation of the intended fix.
> >
> > Before onlining nvdimm memory using dax(devdax or fsdax), these memory belong to
> > cpu NUMA nodes, which extends span pages of node/zone as a ZONE_DEVICE. So there
> > is no problem because node/zone contain these additional non-visible memory
> > devices to the system.
> > But, if we online dax-memory, zone[ZONE_DEVICE] of CPU NUMA node is hot-plugged
> > to new NUMA node(but CPU-less). I think there is no need to hold
> > zone[ZONE_DEVICE] pages on the original node.
> >
> > Additionally, spanned pages are also used to calculate the end pfn of a node.
> > Thus, it is needed to maintain accurate page stats for node/zone.
> >
> > My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
> > (DC persistent memory modules) with App-Direct mode.
> >
> > Below are my test results.
> >
> > Before memory onlining:
> >
> > # ndctl create-namespace --mode=devdax
> > # ndctl create-namespace --mode=devdax
> > # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> > Node 0, zone DMA spanned 4095
> > Node 0, zone DMA32 spanned 1044480
> > Node 0, zone Normal spanned 7864320
> > Node 0, zone Movable spanned 0
> > Node 0, zone Device spanned 66060288
> > Node 1, zone DMA spanned 0
> > Node 1, zone DMA32 spanned 0
> > Node 1, zone Normal spanned 8388608
> > Node 1, zone Movable spanned 0
> > Node 1, zone Device spanned 66060288
> >
> > After memory onlining:
> >
> > # daxctl reconfigure-device --mode=system-ram --no-online dax0.0
> > # daxctl reconfigure-device --mode=system-ram --no-online dax1.0
> >
> > # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> > Node 0, zone DMA spanned 4095
> > Node 0, zone DMA32 spanned 1044480
> > Node 0, zone Normal spanned 7864320
> > Node 0, zone Movable spanned 0
> > Node 0, zone Device spanned 66060288
> > Node 1, zone DMA spanned 0
> > Node 1, zone DMA32 spanned 0
> > Node 1, zone Normal spanned 8388608
> > Node 1, zone Movable spanned 0
> > Node 1, zone Device spanned 66060288
> > Node 2, zone DMA spanned 0
> > Node 2, zone DMA32 spanned 0
> > Node 2, zone Normal spanned 65011712
> > Node 2, zone Movable spanned 0
> > Node 2, zone Device spanned 0
> > Node 3, zone DMA spanned 0
> > Node 3, zone DMA32 spanned 0
> > Node 3, zone Normal spanned 65011712
> > Node 3, zone Movable spanned 0
> > Node 3, zone Device spanned 0
> >
> > As we can see, Node 0 and 1 still have zone_device pages after memory onlining.
> > This causes problem that Node 0 and Node 2 have same end of pfn values, also
> > Node 1 and Node 3 have same problem.
>
> Thanks for the information, that makes it clearer.
>
> While this unfortunate, the node/zone span is something fairly
> unreliable/unusable for user space. Nodes and zones can overlap just easily.
>
> What counts are present/managed pages in the node/zone.
>
> So at least I don't count this as something that "needs fixing",
> it's more something that's nice to handle better if easily possible.
>
> See below.
>
> >
> >>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> >>> ---
> >>> include/linux/memory_hotplug.h | 3 +++
> >>> mm/memory_hotplug.c | 6 ++++--
> >>> 2 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> >>> index be48e003a518..25c7f60c317e 100644
> >>> --- a/include/linux/memory_hotplug.h
> >>> +++ b/include/linux/memory_hotplug.h
> >>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >>> extern void remove_pfn_range_from_zone(struct zone *zone,
> >>> unsigned long start_pfn,
> >>> unsigned long nr_pages);
> >>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>> + unsigned long end_pfn);
> >>> +extern void update_pgdat_span(struct pglist_data *pgdat);
> >>> extern bool is_memblock_offlined(struct memory_block *mem);
> >>> extern int sparse_add_section(int nid, unsigned long pfn,
> >>> unsigned long nr_pages, struct vmem_altmap *altmap);
> >>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>> index 2a9627dc784c..38f46a9ef853 100644
> >>> --- a/mm/memory_hotplug.c
> >>> +++ b/mm/memory_hotplug.c
> >>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> >>> return 0;
> >>> }
> >>>
> >>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>> unsigned long end_pfn)
> >>> {
> >>> unsigned long pfn;
> >>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>> }
> >>> }
> >>> }
> >>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
> >>
> >> Exporting both as symbols feels very wrong. This is memory
> >> onlining/offlining internal stuff.
> >
> > I agree with you that your comment. I will find another approach to avoid
> > directly using onlining/offlining internal stuff while updating node/zone span.
>
> IIRC, to handle what you intend to handle properly want to look into teaching
> remove_pfn_range_from_zone() to handle zone_is_zone_device().
>
> There is a big fat comment:
>
> /*
> * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
> * we will not try to shrink the zones - which is okay as
> * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
> */
> if (zone_is_zone_device(zone))
> return;
>
>
> Similarly, try_offline_node() spells this out:
>
> /*
> * If the node still spans pages (especially ZONE_DEVICE), don't
> * offline it. A node spans memory after move_pfn_range_to_zone(),
> * e.g., after the memory block was onlined.
> */
> if (pgdat->node_spanned_pages)
> return;
>
>
> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
> try_offline_node() implicitly.
>
> Trying to update the node span manually without teaching node/zone shrinking code how to
> handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
> won't work if the range of interest is still surrounded by other ranges.
>
Thanks for your pointing out, I missed those comments.
I will keep trying to handle node/zone span updating process.
> --
> Thanks,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
2022-01-28 4:19 ` Jonghyeon Kim
@ 2022-01-28 8:10 ` David Hildenbrand
2022-02-03 2:22 ` Jonghyeon Kim
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-01-28 8:10 UTC (permalink / raw)
To: Jonghyeon Kim
Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm
On 28.01.22 05:19, Jonghyeon Kim wrote:
> On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
>> On 27.01.22 10:41, Jonghyeon Kim wrote:
>>> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
>>>> On 26.01.22 18:00, Jonghyeon Kim wrote:
>>>>> Export shrink_zone_span() and update_pgdat_span() functions to head
>>>>> file. We need to update real number of spanned pages for NUMA nodes and
>>>>> zones when we add memory device node such as device dax memory.
>>>>>
>>>>
>>>> Can you elaborate a bit more what you intend to fix?
>>>>
>>>> Memory onlining/offlining is reponsible for updating the node/zone span,
>>>> and that's triggered when the dax/kmem mamory gets onlined/offlined.
>>>>
>>> Sure, sorry for the lack of explanation of the intended fix.
>>>
>>> Before onlining nvdimm memory using dax(devdax or fsdax), these memory belong to
>>> cpu NUMA nodes, which extends span pages of node/zone as a ZONE_DEVICE. So there
>>> is no problem because node/zone contain these additional non-visible memory
>>> devices to the system.
>>> But, if we online dax-memory, zone[ZONE_DEVICE] of CPU NUMA node is hot-plugged
>>> to new NUMA node(but CPU-less). I think there is no need to hold
>>> zone[ZONE_DEVICE] pages on the original node.
>>>
>>> Additionally, spanned pages are also used to calculate the end pfn of a node.
>>> Thus, it is needed to maintain accurate page stats for node/zone.
>>>
>>> My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
>>> (DC persistent memory modules) with App-Direct mode.
>>>
>>> Below are my test results.
>>>
>>> Before memory onlining:
>>>
>>> # ndctl create-namespace --mode=devdax
>>> # ndctl create-namespace --mode=devdax
>>> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>> Node 0, zone DMA spanned 4095
>>> Node 0, zone DMA32 spanned 1044480
>>> Node 0, zone Normal spanned 7864320
>>> Node 0, zone Movable spanned 0
>>> Node 0, zone Device spanned 66060288
>>> Node 1, zone DMA spanned 0
>>> Node 1, zone DMA32 spanned 0
>>> Node 1, zone Normal spanned 8388608
>>> Node 1, zone Movable spanned 0
>>> Node 1, zone Device spanned 66060288
>>>
>>> After memory onlining:
>>>
>>> # daxctl reconfigure-device --mode=system-ram --no-online dax0.0
>>> # daxctl reconfigure-device --mode=system-ram --no-online dax1.0
>>>
>>> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>> Node 0, zone DMA spanned 4095
>>> Node 0, zone DMA32 spanned 1044480
>>> Node 0, zone Normal spanned 7864320
>>> Node 0, zone Movable spanned 0
>>> Node 0, zone Device spanned 66060288
>>> Node 1, zone DMA spanned 0
>>> Node 1, zone DMA32 spanned 0
>>> Node 1, zone Normal spanned 8388608
>>> Node 1, zone Movable spanned 0
>>> Node 1, zone Device spanned 66060288
>>> Node 2, zone DMA spanned 0
>>> Node 2, zone DMA32 spanned 0
>>> Node 2, zone Normal spanned 65011712
>>> Node 2, zone Movable spanned 0
>>> Node 2, zone Device spanned 0
>>> Node 3, zone DMA spanned 0
>>> Node 3, zone DMA32 spanned 0
>>> Node 3, zone Normal spanned 65011712
>>> Node 3, zone Movable spanned 0
>>> Node 3, zone Device spanned 0
>>>
>>> As we can see, Node 0 and 1 still have zone_device pages after memory onlining.
>>> This causes problem that Node 0 and Node 2 have same end of pfn values, also
>>> Node 1 and Node 3 have same problem.
>>
>> Thanks for the information, that makes it clearer.
>>
>> While this unfortunate, the node/zone span is something fairly
>> unreliable/unusable for user space. Nodes and zones can overlap just easily.
>>
>> What counts are present/managed pages in the node/zone.
>>
>> So at least I don't count this as something that "needs fixing",
>> it's more something that's nice to handle better if easily possible.
>>
>> See below.
>>
>>>
>>>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
>>>>> ---
>>>>> include/linux/memory_hotplug.h | 3 +++
>>>>> mm/memory_hotplug.c | 6 ++++--
>>>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>>>> index be48e003a518..25c7f60c317e 100644
>>>>> --- a/include/linux/memory_hotplug.h
>>>>> +++ b/include/linux/memory_hotplug.h
>>>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>>>>> extern void remove_pfn_range_from_zone(struct zone *zone,
>>>>> unsigned long start_pfn,
>>>>> unsigned long nr_pages);
>>>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>> + unsigned long end_pfn);
>>>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
>>>>> extern bool is_memblock_offlined(struct memory_block *mem);
>>>>> extern int sparse_add_section(int nid, unsigned long pfn,
>>>>> unsigned long nr_pages, struct vmem_altmap *altmap);
>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>> index 2a9627dc784c..38f46a9ef853 100644
>>>>> --- a/mm/memory_hotplug.c
>>>>> +++ b/mm/memory_hotplug.c
>>>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>> unsigned long end_pfn)
>>>>> {
>>>>> unsigned long pfn;
>>>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>> }
>>>>> }
>>>>> }
>>>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
>>>>
>>>> Exporting both as symbols feels very wrong. This is memory
>>>> onlining/offlining internal stuff.
>>>
>>> I agree with you that your comment. I will find another approach to avoid
>>> directly using onlining/offlining internal stuff while updating node/zone span.
>>
>> IIRC, to handle what you intend to handle properly want to look into teaching
>> remove_pfn_range_from_zone() to handle zone_is_zone_device().
>>
>> There is a big fat comment:
>>
>> /*
>> * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
>> * we will not try to shrink the zones - which is okay as
>> * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
>> */
>> if (zone_is_zone_device(zone))
>> return;
>>
>>
>> Similarly, try_offline_node() spells this out:
>>
>> /*
>> * If the node still spans pages (especially ZONE_DEVICE), don't
>> * offline it. A node spans memory after move_pfn_range_to_zone(),
>> * e.g., after the memory block was onlined.
>> */
>> if (pgdat->node_spanned_pages)
>> return;
>>
>>
>> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
>> try_offline_node() implicitly.
>>
>> Trying to update the node span manually without teaching node/zone shrinking code how to
>> handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
>> won't work if the range of interest is still surrounded by other ranges.
>>
>
> Thanks for your pointing out, I missed those comments.
> I will keep trying to handle node/zone span updating process.
The only safe thing right now for on ZONE_DEVICE in
remove_pfn_range_from_zone() would be removing the given range from the
start/end of the zone range, but we must not scan using the existing
functions.
As soon as we start actual *scanning* via find_smallest...
find_biggest... in shrink_zone_span() we would mistakenly skip other
ZONE_DEVICE ranges and mess up.
Assume you would have a ZONE_DEVICE layout like
[ DEV 0 | Hole | DEV 1 | Hole | DEV 2 ]
What we actually want to do when removing
* DEV 0 is scanning low->high until we find DEV 1
* DEV 1 is doing nothing, because we cannot shrink
* DEV 2 is scanning high -> low until we find DEV 1
I assume we'd want to call in shrink_zone_span() two new functions for
ZONE_DEVICE:
find_smallest_zone_device_pfn
find_biggest_zone_device_pfn
Which would be able to do exactly that scanning, eventually, using
get_dev_pagemap() or some similar source of information.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
2022-01-28 8:10 ` David Hildenbrand
@ 2022-02-03 2:22 ` Jonghyeon Kim
2022-02-03 8:19 ` David Hildenbrand
0 siblings, 1 reply; 13+ messages in thread
From: Jonghyeon Kim @ 2022-02-03 2:22 UTC (permalink / raw)
To: David Hildenbrand
Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm
On Fri, Jan 28, 2022 at 09:10:21AM +0100, David Hildenbrand wrote:
> On 28.01.22 05:19, Jonghyeon Kim wrote:
> > On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
> >> On 27.01.22 10:41, Jonghyeon Kim wrote:
> >>> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
> >>>> On 26.01.22 18:00, Jonghyeon Kim wrote:
> >>>>> Export shrink_zone_span() and update_pgdat_span() functions to head
> >>>>> file. We need to update real number of spanned pages for NUMA nodes and
> >>>>> zones when we add memory device node such as device dax memory.
> >>>>>
> >>>>
> >>>> Can you elaborate a bit more what you intend to fix?
> >>>>
> >>>> Memory onlining/offlining is reponsible for updating the node/zone span,
> >>>> and that's triggered when the dax/kmem mamory gets onlined/offlined.
> >>>>
> >>> Sure, sorry for the lack of explanation of the intended fix.
> >>>
> >>> Before onlining nvdimm memory using dax(devdax or fsdax), these memory belong to
> >>> cpu NUMA nodes, which extends span pages of node/zone as a ZONE_DEVICE. So there
> >>> is no problem because node/zone contain these additional non-visible memory
> >>> devices to the system.
> >>> But, if we online dax-memory, zone[ZONE_DEVICE] of CPU NUMA node is hot-plugged
> >>> to new NUMA node(but CPU-less). I think there is no need to hold
> >>> zone[ZONE_DEVICE] pages on the original node.
> >>>
> >>> Additionally, spanned pages are also used to calculate the end pfn of a node.
> >>> Thus, it is needed to maintain accurate page stats for node/zone.
> >>>
> >>> My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
> >>> (DC persistent memory modules) with App-Direct mode.
> >>>
> >>> Below are my test results.
> >>>
> >>> Before memory onlining:
> >>>
> >>> # ndctl create-namespace --mode=devdax
> >>> # ndctl create-namespace --mode=devdax
> >>> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> >>> Node 0, zone DMA spanned 4095
> >>> Node 0, zone DMA32 spanned 1044480
> >>> Node 0, zone Normal spanned 7864320
> >>> Node 0, zone Movable spanned 0
> >>> Node 0, zone Device spanned 66060288
> >>> Node 1, zone DMA spanned 0
> >>> Node 1, zone DMA32 spanned 0
> >>> Node 1, zone Normal spanned 8388608
> >>> Node 1, zone Movable spanned 0
> >>> Node 1, zone Device spanned 66060288
> >>>
> >>> After memory onlining:
> >>>
> >>> # daxctl reconfigure-device --mode=system-ram --no-online dax0.0
> >>> # daxctl reconfigure-device --mode=system-ram --no-online dax1.0
> >>>
> >>> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> >>> Node 0, zone DMA spanned 4095
> >>> Node 0, zone DMA32 spanned 1044480
> >>> Node 0, zone Normal spanned 7864320
> >>> Node 0, zone Movable spanned 0
> >>> Node 0, zone Device spanned 66060288
> >>> Node 1, zone DMA spanned 0
> >>> Node 1, zone DMA32 spanned 0
> >>> Node 1, zone Normal spanned 8388608
> >>> Node 1, zone Movable spanned 0
> >>> Node 1, zone Device spanned 66060288
> >>> Node 2, zone DMA spanned 0
> >>> Node 2, zone DMA32 spanned 0
> >>> Node 2, zone Normal spanned 65011712
> >>> Node 2, zone Movable spanned 0
> >>> Node 2, zone Device spanned 0
> >>> Node 3, zone DMA spanned 0
> >>> Node 3, zone DMA32 spanned 0
> >>> Node 3, zone Normal spanned 65011712
> >>> Node 3, zone Movable spanned 0
> >>> Node 3, zone Device spanned 0
> >>>
> >>> As we can see, Node 0 and 1 still have zone_device pages after memory onlining.
> >>> This causes problem that Node 0 and Node 2 have same end of pfn values, also
> >>> Node 1 and Node 3 have same problem.
> >>
> >> Thanks for the information, that makes it clearer.
> >>
> >> While this unfortunate, the node/zone span is something fairly
> >> unreliable/unusable for user space. Nodes and zones can overlap just easily.
> >>
> >> What counts are present/managed pages in the node/zone.
> >>
> >> So at least I don't count this as something that "needs fixing",
> >> it's more something that's nice to handle better if easily possible.
> >>
> >> See below.
> >>
> >>>
> >>>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> >>>>> ---
> >>>>> include/linux/memory_hotplug.h | 3 +++
> >>>>> mm/memory_hotplug.c | 6 ++++--
> >>>>> 2 files changed, 7 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> >>>>> index be48e003a518..25c7f60c317e 100644
> >>>>> --- a/include/linux/memory_hotplug.h
> >>>>> +++ b/include/linux/memory_hotplug.h
> >>>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >>>>> extern void remove_pfn_range_from_zone(struct zone *zone,
> >>>>> unsigned long start_pfn,
> >>>>> unsigned long nr_pages);
> >>>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>> + unsigned long end_pfn);
> >>>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
> >>>>> extern bool is_memblock_offlined(struct memory_block *mem);
> >>>>> extern int sparse_add_section(int nid, unsigned long pfn,
> >>>>> unsigned long nr_pages, struct vmem_altmap *altmap);
> >>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>>>> index 2a9627dc784c..38f46a9ef853 100644
> >>>>> --- a/mm/memory_hotplug.c
> >>>>> +++ b/mm/memory_hotplug.c
> >>>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> >>>>> return 0;
> >>>>> }
> >>>>>
> >>>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>> unsigned long end_pfn)
> >>>>> {
> >>>>> unsigned long pfn;
> >>>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>> }
> >>>>> }
> >>>>> }
> >>>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
> >>>>
> >>>> Exporting both as symbols feels very wrong. This is memory
> >>>> onlining/offlining internal stuff.
> >>>
> >>> I agree with you that your comment. I will find another approach to avoid
> >>> directly using onlining/offlining internal stuff while updating node/zone span.
> >>
> >> IIRC, to handle what you intend to handle properly want to look into teaching
> >> remove_pfn_range_from_zone() to handle zone_is_zone_device().
> >>
> >> There is a big fat comment:
> >>
> >> /*
> >> * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
> >> * we will not try to shrink the zones - which is okay as
> >> * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
> >> */
> >> if (zone_is_zone_device(zone))
> >> return;
> >>
> >>
> >> Similarly, try_offline_node() spells this out:
> >>
> >> /*
> >> * If the node still spans pages (especially ZONE_DEVICE), don't
> >> * offline it. A node spans memory after move_pfn_range_to_zone(),
> >> * e.g., after the memory block was onlined.
> >> */
> >> if (pgdat->node_spanned_pages)
> >> return;
> >>
> >>
> >> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
> >> try_offline_node() implicitly.
> >>
> >> Trying to update the node span manually without teaching node/zone shrinking code how to
> >> handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
> >> won't work if the range of interest is still surrounded by other ranges.
> >>
> >
> > Thanks for your pointing out, I missed those comments.
> > I will keep trying to handle node/zone span updating process.
>
> The only safe thing right now for on ZONE_DEVICE in
> remove_pfn_range_from_zone() would be removing the given range from the
> start/end of the zone range, but we must not scan using the existing
> functions.
>
> As soon as we start actual *scanning* via find_smallest...
> find_biggest... in shrink_zone_span() we would mistakenly skip other
> ZONE_DEVICE ranges and mess up.
>
> Assume you would have a ZONE_DEVICE layout like
>
> [ DEV 0 | Hole | DEV 1 | Hole | DEV 2 ]
>
IIUC, you assumed situation that multiple ZONE_DEVICE in single node, and there
are holes among them, right?
> What we actually want to do when removing
>
> * DEV 0 is scanning low->high until we find DEV 1
> * DEV 1 is doing nothing, because we cannot shrink
> * DEV 2 is scanning high -> low until we find DEV 1
>
>
> I assume we'd want to call in shrink_zone_span() two new functions for
> ZONE_DEVICE:
> find_smallest_zone_device_pfn
> find_biggest_zone_device_pfn
>
> Which would be able to do exactly that scanning, eventually, using
> get_dev_pagemap() or some similar source of information.
>
I agree with your suggestion. It might cleanly deal with ZONE_DEVICE with actual
page scanning for checking holes.
> --
> Thanks,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
2022-02-03 2:22 ` Jonghyeon Kim
@ 2022-02-03 8:19 ` David Hildenbrand
0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2022-02-03 8:19 UTC (permalink / raw)
To: Jonghyeon Kim
Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
linux-kernel, linux-mm
On 03.02.22 03:22, Jonghyeon Kim wrote:
> On Fri, Jan 28, 2022 at 09:10:21AM +0100, David Hildenbrand wrote:
>> On 28.01.22 05:19, Jonghyeon Kim wrote:
>>> On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
>>>> On 27.01.22 10:41, Jonghyeon Kim wrote:
>>>>> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
>>>>>> On 26.01.22 18:00, Jonghyeon Kim wrote:
>>>>>>> Export shrink_zone_span() and update_pgdat_span() functions to head
>>>>>>> file. We need to update real number of spanned pages for NUMA nodes and
>>>>>>> zones when we add memory device node such as device dax memory.
>>>>>>>
>>>>>>
>>>>>> Can you elaborate a bit more what you intend to fix?
>>>>>>
>>>>>> Memory onlining/offlining is reponsible for updating the node/zone span,
>>>>>> and that's triggered when the dax/kmem mamory gets onlined/offlined.
>>>>>>
>>>>> Sure, sorry for the lack of explanation of the intended fix.
>>>>>
>>>>> Before onlining nvdimm memory using dax(devdax or fsdax), these memory belong to
>>>>> cpu NUMA nodes, which extends span pages of node/zone as a ZONE_DEVICE. So there
>>>>> is no problem because node/zone contain these additional non-visible memory
>>>>> devices to the system.
>>>>> But, if we online dax-memory, zone[ZONE_DEVICE] of CPU NUMA node is hot-plugged
>>>>> to new NUMA node(but CPU-less). I think there is no need to hold
>>>>> zone[ZONE_DEVICE] pages on the original node.
>>>>>
>>>>> Additionally, spanned pages are also used to calculate the end pfn of a node.
>>>>> Thus, it is needed to maintain accurate page stats for node/zone.
>>>>>
>>>>> My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
>>>>> (DC persistent memory modules) with App-Direct mode.
>>>>>
>>>>> Below are my test results.
>>>>>
>>>>> Before memory onlining:
>>>>>
>>>>> # ndctl create-namespace --mode=devdax
>>>>> # ndctl create-namespace --mode=devdax
>>>>> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>>>> Node 0, zone DMA spanned 4095
>>>>> Node 0, zone DMA32 spanned 1044480
>>>>> Node 0, zone Normal spanned 7864320
>>>>> Node 0, zone Movable spanned 0
>>>>> Node 0, zone Device spanned 66060288
>>>>> Node 1, zone DMA spanned 0
>>>>> Node 1, zone DMA32 spanned 0
>>>>> Node 1, zone Normal spanned 8388608
>>>>> Node 1, zone Movable spanned 0
>>>>> Node 1, zone Device spanned 66060288
>>>>>
>>>>> After memory onlining:
>>>>>
>>>>> # daxctl reconfigure-device --mode=system-ram --no-online dax0.0
>>>>> # daxctl reconfigure-device --mode=system-ram --no-online dax1.0
>>>>>
>>>>> # cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>>>> Node 0, zone DMA spanned 4095
>>>>> Node 0, zone DMA32 spanned 1044480
>>>>> Node 0, zone Normal spanned 7864320
>>>>> Node 0, zone Movable spanned 0
>>>>> Node 0, zone Device spanned 66060288
>>>>> Node 1, zone DMA spanned 0
>>>>> Node 1, zone DMA32 spanned 0
>>>>> Node 1, zone Normal spanned 8388608
>>>>> Node 1, zone Movable spanned 0
>>>>> Node 1, zone Device spanned 66060288
>>>>> Node 2, zone DMA spanned 0
>>>>> Node 2, zone DMA32 spanned 0
>>>>> Node 2, zone Normal spanned 65011712
>>>>> Node 2, zone Movable spanned 0
>>>>> Node 2, zone Device spanned 0
>>>>> Node 3, zone DMA spanned 0
>>>>> Node 3, zone DMA32 spanned 0
>>>>> Node 3, zone Normal spanned 65011712
>>>>> Node 3, zone Movable spanned 0
>>>>> Node 3, zone Device spanned 0
>>>>>
>>>>> As we can see, Node 0 and 1 still have zone_device pages after memory onlining.
>>>>> This causes problem that Node 0 and Node 2 have same end of pfn values, also
>>>>> Node 1 and Node 3 have same problem.
>>>>
>>>> Thanks for the information, that makes it clearer.
>>>>
>>>> While this unfortunate, the node/zone span is something fairly
>>>> unreliable/unusable for user space. Nodes and zones can overlap just easily.
>>>>
>>>> What counts are present/managed pages in the node/zone.
>>>>
>>>> So at least I don't count this as something that "needs fixing",
>>>> it's more something that's nice to handle better if easily possible.
>>>>
>>>> See below.
>>>>
>>>>>
>>>>>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
>>>>>>> ---
>>>>>>> include/linux/memory_hotplug.h | 3 +++
>>>>>>> mm/memory_hotplug.c | 6 ++++--
>>>>>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>>>>>> index be48e003a518..25c7f60c317e 100644
>>>>>>> --- a/include/linux/memory_hotplug.h
>>>>>>> +++ b/include/linux/memory_hotplug.h
>>>>>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>>>>>>> extern void remove_pfn_range_from_zone(struct zone *zone,
>>>>>>> unsigned long start_pfn,
>>>>>>> unsigned long nr_pages);
>>>>>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>> + unsigned long end_pfn);
>>>>>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
>>>>>>> extern bool is_memblock_offlined(struct memory_block *mem);
>>>>>>> extern int sparse_add_section(int nid, unsigned long pfn,
>>>>>>> unsigned long nr_pages, struct vmem_altmap *altmap);
>>>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>>>> index 2a9627dc784c..38f46a9ef853 100644
>>>>>>> --- a/mm/memory_hotplug.c
>>>>>>> +++ b/mm/memory_hotplug.c
>>>>>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>>>>>>> return 0;
>>>>>>> }
>>>>>>>
>>>>>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>> unsigned long end_pfn)
>>>>>>> {
>>>>>>> unsigned long pfn;
>>>>>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>> }
>>>>>>> }
>>>>>>> }
>>>>>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
>>>>>>
>>>>>> Exporting both as symbols feels very wrong. This is memory
>>>>>> onlining/offlining internal stuff.
>>>>>
>>>>> I agree with you that your comment. I will find another approach to avoid
>>>>> directly using onlining/offlining internal stuff while updating node/zone span.
>>>>
>>>> IIRC, to handle what you intend to handle properly want to look into teaching
>>>> remove_pfn_range_from_zone() to handle zone_is_zone_device().
>>>>
>>>> There is a big fat comment:
>>>>
>>>> /*
>>>> * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
>>>> * we will not try to shrink the zones - which is okay as
>>>> * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
>>>> */
>>>> if (zone_is_zone_device(zone))
>>>> return;
>>>>
>>>>
>>>> Similarly, try_offline_node() spells this out:
>>>>
>>>> /*
>>>> * If the node still spans pages (especially ZONE_DEVICE), don't
>>>> * offline it. A node spans memory after move_pfn_range_to_zone(),
>>>> * e.g., after the memory block was onlined.
>>>> */
>>>> if (pgdat->node_spanned_pages)
>>>> return;
>>>>
>>>>
>>>> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
>>>> try_offline_node() implicitly.
>>>>
>>>> Trying to update the node span manually without teaching node/zone shrinking code how to
>>>> handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
>>>> won't work if the range of interest is still surrounded by other ranges.
>>>>
>>>
>>> Thanks for your pointing out, I missed those comments.
>>> I will keep trying to handle node/zone span updating process.
>>
>> The only safe thing right now for on ZONE_DEVICE in
>> remove_pfn_range_from_zone() would be removing the given range from the
>> start/end of the zone range, but we must not scan using the existing
>> functions.
>>
>> As soon as we start actual *scanning* via find_smallest...
>> find_biggest... in shrink_zone_span() we would mistakenly skip other
>> ZONE_DEVICE ranges and mess up.
>>
>> Assume you would have a ZONE_DEVICE layout like
>>
>> [ DEV 0 | Hole | DEV 1 | Hole | DEV 2 ]
>>
>
> IIUC, you assumed situation that multiple ZONE_DEVICE in single node, and there
> are holes among them, right?
Exactly. But the holes are just an example to show what we'd have to do
when holes are around. Having multiple devices is the important part.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2022-02-03 8:19 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-26 17:00 [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node Jonghyeon Kim
2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
2022-01-27 0:29 ` kernel test robot
2022-01-27 0:29 ` kernel test robot
2022-01-27 5:29 ` kernel test robot
2022-01-27 5:29 ` kernel test robot
2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
2022-01-27 9:41 ` Jonghyeon Kim
2022-01-27 9:54 ` David Hildenbrand
2022-01-28 4:19 ` Jonghyeon Kim
2022-01-28 8:10 ` David Hildenbrand
2022-02-03 2:22 ` Jonghyeon Kim
2022-02-03 8:19 ` David Hildenbrand
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.