linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
@ 2022-01-26 17:00 Jonghyeon Kim
  2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
  2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
  0 siblings, 2 replies; 11+ messages in thread
From: Jonghyeon Kim @ 2022-01-26 17:00 UTC
  To: dan.j.williams
  Cc: vishal.l.verma, dave.jiang, akpm, nvdimm, linux-kernel, linux-mm,
	Jonghyeon Kim

Export the shrink_zone_span() and update_pgdat_span() functions to the
header file. We need to update the real number of spanned pages for NUMA
nodes and zones when we add a memory device node such as device dax memory.

Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
---
 include/linux/memory_hotplug.h | 3 +++
 mm/memory_hotplug.c            | 6 ++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index be48e003a518..25c7f60c317e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long start_pfn,
 				       unsigned long nr_pages);
+extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+			     unsigned long end_pfn);
+extern void update_pgdat_span(struct pglist_data *pgdat);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..38f46a9ef853 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 	return 0;
 }
 
-static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			     unsigned long end_pfn)
 {
 	unsigned long pfn;
@@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		}
 	}
 }
+EXPORT_SYMBOL_GPL(shrink_zone_span);
 
-static void update_pgdat_span(struct pglist_data *pgdat)
+void update_pgdat_span(struct pglist_data *pgdat)
 {
 	unsigned long node_start_pfn = 0, node_end_pfn = 0;
 	struct zone *zone;
@@ -456,6 +457,7 @@ static void update_pgdat_span(struct pglist_data *pgdat)
 	pgdat->node_start_pfn = node_start_pfn;
 	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
 }
+EXPORT_SYMBOL_GPL(update_pgdat_span);
 
 void __ref remove_pfn_range_from_zone(struct zone *zone,
 				      unsigned long start_pfn,
-- 
2.17.1



* [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
  2022-01-26 17:00 [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node Jonghyeon Kim
@ 2022-01-26 17:00 ` Jonghyeon Kim
  2022-01-27  0:29   ` kernel test robot
  2022-01-27  5:29   ` kernel test robot
  2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
  1 sibling, 2 replies; 11+ messages in thread
From: Jonghyeon Kim @ 2022-01-26 17:00 UTC
  To: dan.j.williams
  Cc: vishal.l.verma, dave.jiang, akpm, nvdimm, linux-kernel, linux-mm,
	Jonghyeon Kim

When device memory is added to an online NUMA node, the number of spanned
pages of the original device NUMA node should be updated.

With this patch, we can monitor the current spanned pages of each node
more accurately.

Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
---
 drivers/dax/kmem.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..f63a739ac790 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/memory_hotplug.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -48,6 +49,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	struct dax_kmem_data *data;
 	int i, rc, mapped = 0;
 	int numa_node;
+	int dev_node;
 
 	/*
 	 * Ensure good NUMA information for the persistent memory.
@@ -147,6 +149,18 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
+	/* Update spanned_pages of the device numa node */
+	dev_node = dev_to_node(dev);
+	if (dev_node != numa_node && dev_node < numa_node) {
+		struct pglist_data *pgdat = NODE_DATA(dev_node);
+		struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
+		unsigned long start_pfn = zone->zone_start_pfn;
+		unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
+
+		shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
+		update_pgdat_span(pgdat);
+	}
+
 	return 0;
 
 err_request_mem:
-- 
2.17.1



* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
  2022-01-26 17:00 [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node Jonghyeon Kim
  2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
@ 2022-01-26 17:04 ` David Hildenbrand
  2022-01-27  9:41   ` Jonghyeon Kim
  1 sibling, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2022-01-26 17:04 UTC
  To: Jonghyeon Kim, dan.j.williams
  Cc: vishal.l.verma, dave.jiang, akpm, nvdimm, linux-kernel, linux-mm

On 26.01.22 18:00, Jonghyeon Kim wrote:
> Export the shrink_zone_span() and update_pgdat_span() functions to the
> header file. We need to update the real number of spanned pages for NUMA
> nodes and zones when we add a memory device node such as device dax memory.
> 

Can you elaborate a bit more what you intend to fix?

Memory onlining/offlining is responsible for updating the node/zone span,
and that's triggered when the dax/kmem memory gets onlined/offlined.

> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> ---
>  include/linux/memory_hotplug.h | 3 +++
>  mm/memory_hotplug.c            | 6 ++++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index be48e003a518..25c7f60c317e 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>  extern void remove_pfn_range_from_zone(struct zone *zone,
>  				       unsigned long start_pfn,
>  				       unsigned long nr_pages);
> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> +			     unsigned long end_pfn);
> +extern void update_pgdat_span(struct pglist_data *pgdat);
>  extern bool is_memblock_offlined(struct memory_block *mem);
>  extern int sparse_add_section(int nid, unsigned long pfn,
>  		unsigned long nr_pages, struct vmem_altmap *altmap);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a9627dc784c..38f46a9ef853 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>  	return 0;
>  }
>  
> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  			     unsigned long end_pfn)
>  {
>  	unsigned long pfn;
> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  		}
>  	}
>  }
> +EXPORT_SYMBOL_GPL(shrink_zone_span);

Exporting both as symbols feels very wrong. This is memory
onlining/offlining internal stuff.



-- 
Thanks,

David / dhildenb



* Re: [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
  2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
@ 2022-01-27  0:29   ` kernel test robot
  2022-01-27  5:29   ` kernel test robot
  1 sibling, 0 replies; 11+ messages in thread
From: kernel test robot @ 2022-01-27  0:29 UTC
  To: Jonghyeon Kim, dan.j.williams
  Cc: kbuild-all, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm, Jonghyeon Kim

Hi Jonghyeon,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]

url:    https://github.com/0day-ci/linux/commits/Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
base:   https://github.com/hnaz/linux-mm master
config: x86_64-randconfig-a002-20220124 (https://download.01.org/0day-ci/archive/20220127/202201270836.H8feaOM9-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
        git checkout ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/dax/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/dax/kmem.c: In function 'dev_dax_kmem_probe':
>> drivers/dax/kmem.c:156:42: error: 'ZONE_DEVICE' undeclared (first use in this function)
     156 |   struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
         |                                          ^~~~~~~~~~~
   drivers/dax/kmem.c:156:42: note: each undeclared identifier is reported only once for each function it appears in


vim +/ZONE_DEVICE +156 drivers/dax/kmem.c

    44	
    45	static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
    46	{
    47		struct device *dev = &dev_dax->dev;
    48		unsigned long total_len = 0;
    49		struct dax_kmem_data *data;
    50		int i, rc, mapped = 0;
    51		int numa_node;
    52		int dev_node;
    53	
    54		/*
    55		 * Ensure good NUMA information for the persistent memory.
    56		 * Without this check, there is a risk that slow memory
    57		 * could be mixed in a node with faster memory, causing
    58		 * unavoidable performance issues.
    59		 */
    60		numa_node = dev_dax->target_node;
    61		if (numa_node < 0) {
    62			dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
    63					numa_node);
    64			return -EINVAL;
    65		}
    66	
    67		for (i = 0; i < dev_dax->nr_range; i++) {
    68			struct range range;
    69	
    70			rc = dax_kmem_range(dev_dax, i, &range);
    71			if (rc) {
    72				dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
    73						i, range.start, range.end);
    74				continue;
    75			}
    76			total_len += range_len(&range);
    77		}
    78	
    79		if (!total_len) {
    80			dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
    81			return -EINVAL;
    82		}
    83	
    84		data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
    85		if (!data)
    86			return -ENOMEM;
    87	
    88		rc = -ENOMEM;
    89		data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
    90		if (!data->res_name)
    91			goto err_res_name;
    92	
    93		rc = memory_group_register_static(numa_node, total_len);
    94		if (rc < 0)
    95			goto err_reg_mgid;
    96		data->mgid = rc;
    97	
    98		for (i = 0; i < dev_dax->nr_range; i++) {
    99			struct resource *res;
   100			struct range range;
   101	
   102			rc = dax_kmem_range(dev_dax, i, &range);
   103			if (rc)
   104				continue;
   105	
   106			/* Region is permanently reserved if hotremove fails. */
   107			res = request_mem_region(range.start, range_len(&range), data->res_name);
   108			if (!res) {
   109				dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
   110						i, range.start, range.end);
   111				/*
   112				 * Once some memory has been onlined we can't
   113				 * assume that it can be un-onlined safely.
   114				 */
   115				if (mapped)
   116					continue;
   117				rc = -EBUSY;
   118				goto err_request_mem;
   119			}
   120			data->res[i] = res;
   121	
   122			/*
   123			 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
   124			 * so that add_memory() can add a child resource.  Do not
   125			 * inherit flags from the parent since it may set new flags
   126			 * unknown to us that will break add_memory() below.
   127			 */
   128			res->flags = IORESOURCE_SYSTEM_RAM;
   129	
   130			/*
   131			 * Ensure that future kexec'd kernels will not treat
   132			 * this as RAM automatically.
   133			 */
   134			rc = add_memory_driver_managed(data->mgid, range.start,
   135					range_len(&range), kmem_name, MHP_NID_IS_MGID);
   136	
   137			if (rc) {
   138				dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
   139						i, range.start, range.end);
   140				release_resource(res);
   141				kfree(res);
   142				data->res[i] = NULL;
   143				if (mapped)
   144					continue;
   145				goto err_request_mem;
   146			}
   147			mapped++;
   148		}
   149	
   150		dev_set_drvdata(dev, data);
   151	
   152		/* Update spanned_pages of the device numa node */
   153		dev_node = dev_to_node(dev);
   154		if (dev_node != numa_node && dev_node < numa_node) {
   155			struct pglist_data *pgdat = NODE_DATA(dev_node);
 > 156			struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
   157			unsigned long start_pfn = zone->zone_start_pfn;
   158			unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
   159	
   160			shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
   161			update_pgdat_span(pgdat);
   162		}
   163	
   164		return 0;
   165	
   166	err_request_mem:
   167		memory_group_unregister(data->mgid);
   168	err_reg_mgid:
   169		kfree(data->res_name);
   170	err_res_name:
   171		kfree(data);
   172		return rc;
   173	}
   174	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* Re: [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node
  2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
  2022-01-27  0:29   ` kernel test robot
@ 2022-01-27  5:29   ` kernel test robot
  1 sibling, 0 replies; 11+ messages in thread
From: kernel test robot @ 2022-01-27  5:29 UTC
  To: Jonghyeon Kim, dan.j.williams
  Cc: llvm, kbuild-all, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm, Jonghyeon Kim

Hi Jonghyeon,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]

url:    https://github.com/0day-ci/linux/commits/Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
base:   https://github.com/hnaz/linux-mm master
config: s390-randconfig-r044-20220124 (https://download.01.org/0day-ci/archive/20220127/202201271342.1w9oD4VP-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project f400a6012c668dfaa73462caf067ceb074e66c47)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/0day-ci/linux/commit/ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jonghyeon-Kim/mm-memory_hotplug-Export-shrink-span-functions-for-zone-and-node/20220127-010219
        git checkout ef33cc7f7380ddd07a3fedb42f35c1f81de401a4
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/dax/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/dax/kmem.c:156:42: error: use of undeclared identifier 'ZONE_DEVICE'
                   struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
                                                          ^
   1 error generated.


vim +/ZONE_DEVICE +156 drivers/dax/kmem.c

    44	
    45	static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
    46	{
    47		struct device *dev = &dev_dax->dev;
    48		unsigned long total_len = 0;
    49		struct dax_kmem_data *data;
    50		int i, rc, mapped = 0;
    51		int numa_node;
    52		int dev_node;
    53	
    54		/*
    55		 * Ensure good NUMA information for the persistent memory.
    56		 * Without this check, there is a risk that slow memory
    57		 * could be mixed in a node with faster memory, causing
    58		 * unavoidable performance issues.
    59		 */
    60		numa_node = dev_dax->target_node;
    61		if (numa_node < 0) {
    62			dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
    63					numa_node);
    64			return -EINVAL;
    65		}
    66	
    67		for (i = 0; i < dev_dax->nr_range; i++) {
    68			struct range range;
    69	
    70			rc = dax_kmem_range(dev_dax, i, &range);
    71			if (rc) {
    72				dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
    73						i, range.start, range.end);
    74				continue;
    75			}
    76			total_len += range_len(&range);
    77		}
    78	
    79		if (!total_len) {
    80			dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
    81			return -EINVAL;
    82		}
    83	
    84		data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
    85		if (!data)
    86			return -ENOMEM;
    87	
    88		rc = -ENOMEM;
    89		data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
    90		if (!data->res_name)
    91			goto err_res_name;
    92	
    93		rc = memory_group_register_static(numa_node, total_len);
    94		if (rc < 0)
    95			goto err_reg_mgid;
    96		data->mgid = rc;
    97	
    98		for (i = 0; i < dev_dax->nr_range; i++) {
    99			struct resource *res;
   100			struct range range;
   101	
   102			rc = dax_kmem_range(dev_dax, i, &range);
   103			if (rc)
   104				continue;
   105	
   106			/* Region is permanently reserved if hotremove fails. */
   107			res = request_mem_region(range.start, range_len(&range), data->res_name);
   108			if (!res) {
   109				dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
   110						i, range.start, range.end);
   111				/*
   112				 * Once some memory has been onlined we can't
   113				 * assume that it can be un-onlined safely.
   114				 */
   115				if (mapped)
   116					continue;
   117				rc = -EBUSY;
   118				goto err_request_mem;
   119			}
   120			data->res[i] = res;
   121	
   122			/*
   123			 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
   124			 * so that add_memory() can add a child resource.  Do not
   125			 * inherit flags from the parent since it may set new flags
   126			 * unknown to us that will break add_memory() below.
   127			 */
   128			res->flags = IORESOURCE_SYSTEM_RAM;
   129	
   130			/*
   131			 * Ensure that future kexec'd kernels will not treat
   132			 * this as RAM automatically.
   133			 */
   134			rc = add_memory_driver_managed(data->mgid, range.start,
   135					range_len(&range), kmem_name, MHP_NID_IS_MGID);
   136	
   137			if (rc) {
   138				dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
   139						i, range.start, range.end);
   140				release_resource(res);
   141				kfree(res);
   142				data->res[i] = NULL;
   143				if (mapped)
   144					continue;
   145				goto err_request_mem;
   146			}
   147			mapped++;
   148		}
   149	
   150		dev_set_drvdata(dev, data);
   151	
   152		/* Update spanned_pages of the device numa node */
   153		dev_node = dev_to_node(dev);
   154		if (dev_node != numa_node && dev_node < numa_node) {
   155			struct pglist_data *pgdat = NODE_DATA(dev_node);
 > 156			struct zone *zone = &pgdat->node_zones[ZONE_DEVICE];
   157			unsigned long start_pfn = zone->zone_start_pfn;
   158			unsigned long nr_pages = NODE_DATA(numa_node)->node_spanned_pages;
   159	
   160			shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
   161			update_pgdat_span(pgdat);
   162		}
   163	
   164		return 0;
   165	
   166	err_request_mem:
   167		memory_group_unregister(data->mgid);
   168	err_reg_mgid:
   169		kfree(data->res_name);
   170	err_res_name:
   171		kfree(data);
   172		return rc;
   173	}
   174	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
  2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
@ 2022-01-27  9:41   ` Jonghyeon Kim
  2022-01-27  9:54     ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Jonghyeon Kim @ 2022-01-27  9:41 UTC
  To: David Hildenbrand
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm

On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
> On 26.01.22 18:00, Jonghyeon Kim wrote:
> > Export the shrink_zone_span() and update_pgdat_span() functions to the
> > header file. We need to update the real number of spanned pages for NUMA
> > nodes and zones when we add a memory device node such as device dax memory.
> > 
> 
> Can you elaborate a bit more what you intend to fix?
> 
> Memory onlining/offlining is responsible for updating the node/zone span,
> and that's triggered when the dax/kmem memory gets onlined/offlined.
> 
Sure, sorry for the lack of explanation of the intended fix.

Before onlining nvdimm memory using dax (devdax or fsdax), this memory
belongs to the CPU NUMA nodes, which extends the spanned pages of the
node/zone as ZONE_DEVICE. So there is no problem, because the node/zone
spans contain these additional memory devices that are not visible to
the system.
But if we online the dax memory, the zone[ZONE_DEVICE] range of the CPU
NUMA node is hot-plugged to a new (but CPU-less) NUMA node. I think
there is no need to keep the zone[ZONE_DEVICE] pages on the original
node.

Additionally, spanned pages are also used to calculate the end pfn of a
node. Thus, it is necessary to maintain accurate page stats for the
node/zone.
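
For reference, the end pfn helpers in include/linux/mmzone.h derive it
directly from the span:

	static inline unsigned long zone_end_pfn(const struct zone *zone)
	{
		return zone->zone_start_pfn + zone->spanned_pages;
	}

	static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
	{
		return pgdat->node_start_pfn + pgdat->node_spanned_pages;
	}

So a stale node_spanned_pages directly skews the reported end pfn of a
node.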

My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
(DC persistent memory modules) with App-Direct mode.

Below are my test results.

Before memory onlining:

	# ndctl create-namespace --mode=devdax
	# ndctl create-namespace --mode=devdax
	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
	Node 0, zone      DMA	        spanned  4095
	Node 0, zone    DMA32	        spanned  1044480
	Node 0, zone   Normal	        spanned  7864320
	Node 0, zone  Movable	        spanned  0
	Node 0, zone   Device	        spanned  66060288
	Node 1, zone      DMA	        spanned  0
	Node 1, zone    DMA32	        spanned  0
	Node 1, zone   Normal	        spanned  8388608
	Node 1, zone  Movable	        spanned  0
	Node 1, zone   Device	        spanned  66060288

After memory onlining:

	# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
	# daxctl reconfigure-device --mode=system-ram --no-online dax1.0

	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
	Node 0, zone      DMA	        spanned  4095
	Node 0, zone    DMA32	        spanned  1044480
	Node 0, zone   Normal	        spanned  7864320
	Node 0, zone  Movable	        spanned  0
	Node 0, zone   Device	        spanned  66060288
	Node 1, zone      DMA	        spanned  0
	Node 1, zone    DMA32	        spanned  0
	Node 1, zone   Normal	        spanned  8388608
	Node 1, zone  Movable	        spanned  0
	Node 1, zone   Device	        spanned  66060288
	Node 2, zone      DMA	        spanned  0
	Node 2, zone    DMA32	        spanned  0
	Node 2, zone   Normal	        spanned  65011712
	Node 2, zone  Movable	        spanned  0
	Node 2, zone   Device	        spanned  0
	Node 3, zone      DMA	        spanned  0
	Node 3, zone    DMA32	        spanned  0
	Node 3, zone   Normal	        spanned  65011712
	Node 3, zone  Movable	        spanned  0
	Node 3, zone   Device	        spanned  0

As we can see, Nodes 0 and 1 still have ZONE_DEVICE pages after memory
onlining. This causes the problem that Node 0 and Node 2 have the same
end pfn value, and Node 1 and Node 3 have the same problem.

> > Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> > ---
> >  include/linux/memory_hotplug.h | 3 +++
> >  mm/memory_hotplug.c            | 6 ++++--
> >  2 files changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index be48e003a518..25c7f60c317e 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >  extern void remove_pfn_range_from_zone(struct zone *zone,
> >  				       unsigned long start_pfn,
> >  				       unsigned long nr_pages);
> > +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> > +			     unsigned long end_pfn);
> > +extern void update_pgdat_span(struct pglist_data *pgdat);
> >  extern bool is_memblock_offlined(struct memory_block *mem);
> >  extern int sparse_add_section(int nid, unsigned long pfn,
> >  		unsigned long nr_pages, struct vmem_altmap *altmap);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 2a9627dc784c..38f46a9ef853 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> >  	return 0;
> >  }
> >  
> > -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> > +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >  			     unsigned long end_pfn)
> >  {
> >  	unsigned long pfn;
> > @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >  		}
> >  	}
> >  }
> > +EXPORT_SYMBOL_GPL(shrink_zone_span);
> 
> Exporting both as symbols feels very wrong. This is memory
> onlining/offlining internal stuff.

I agree with your comment. I will find another approach that avoids
directly using onlining/offlining internal stuff while updating the
node/zone span.


Thanks,
Jonghyeon
> 
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 


* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
  2022-01-27  9:41   ` Jonghyeon Kim
@ 2022-01-27  9:54     ` David Hildenbrand
  2022-01-28  4:19       ` Jonghyeon Kim
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2022-01-27  9:54 UTC
  To: Jonghyeon Kim
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm

On 27.01.22 10:41, Jonghyeon Kim wrote:
> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
>> On 26.01.22 18:00, Jonghyeon Kim wrote:
>>> Export the shrink_zone_span() and update_pgdat_span() functions to the
>>> header file. We need to update the real number of spanned pages for NUMA
>>> nodes and zones when we add a memory device node such as device dax memory.
>>>
>>
>> Can you elaborate a bit more what you intend to fix?
>>
>> Memory onlining/offlining is responsible for updating the node/zone span,
>> and that's triggered when the dax/kmem memory gets onlined/offlined.
>>
> Sure, sorry for the lack of explanation of the intended fix.
> 
> Before onlining nvdimm memory using dax (devdax or fsdax), this memory
> belongs to the CPU NUMA nodes, which extends the spanned pages of the
> node/zone as ZONE_DEVICE. So there is no problem, because the node/zone
> spans contain these additional memory devices that are not visible to
> the system.
> But if we online the dax memory, the zone[ZONE_DEVICE] range of the CPU
> NUMA node is hot-plugged to a new (but CPU-less) NUMA node. I think
> there is no need to keep the zone[ZONE_DEVICE] pages on the original
> node.
> 
> Additionally, spanned pages are also used to calculate the end pfn of a
> node. Thus, it is necessary to maintain accurate page stats for the
> node/zone.
> 
> My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
> (DC persistent memory modules) with App-Direct mode.
> 
> Below are my test results.
> 
> Before memory onlining:
> 
> 	# ndctl create-namespace --mode=devdax
> 	# ndctl create-namespace --mode=devdax
> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> 	Node 0, zone      DMA	        spanned  4095
> 	Node 0, zone    DMA32	        spanned  1044480
> 	Node 0, zone   Normal	        spanned  7864320
> 	Node 0, zone  Movable	        spanned  0
> 	Node 0, zone   Device	        spanned  66060288
> 	Node 1, zone      DMA	        spanned  0
> 	Node 1, zone    DMA32	        spanned  0
> 	Node 1, zone   Normal	        spanned  8388608
> 	Node 1, zone  Movable	        spanned  0
> 	Node 1, zone   Device	        spanned  66060288
> 
> After memory onlining:
> 
> 	# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
> 	# daxctl reconfigure-device --mode=system-ram --no-online dax1.0
> 
> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> 	Node 0, zone      DMA	        spanned  4095
> 	Node 0, zone    DMA32	        spanned  1044480
> 	Node 0, zone   Normal	        spanned  7864320
> 	Node 0, zone  Movable	        spanned  0
> 	Node 0, zone   Device	        spanned  66060288
> 	Node 1, zone      DMA	        spanned  0
> 	Node 1, zone    DMA32	        spanned  0
> 	Node 1, zone   Normal	        spanned  8388608
> 	Node 1, zone  Movable	        spanned  0
> 	Node 1, zone   Device	        spanned  66060288
> 	Node 2, zone      DMA	        spanned  0
> 	Node 2, zone    DMA32	        spanned  0
> 	Node 2, zone   Normal	        spanned  65011712
> 	Node 2, zone  Movable	        spanned  0
> 	Node 2, zone   Device	        spanned  0
> 	Node 3, zone      DMA	        spanned  0
> 	Node 3, zone    DMA32	        spanned  0
> 	Node 3, zone   Normal	        spanned  65011712
> 	Node 3, zone  Movable	        spanned  0
> 	Node 3, zone   Device	        spanned  0
> 
> As we can see, Nodes 0 and 1 still have ZONE_DEVICE pages after memory
> onlining. This causes the problem that Node 0 and Node 2 have the same
> end pfn value, and Node 1 and Node 3 have the same problem.

Thanks for the information, that makes it clearer.

While this is unfortunate, the node/zone span is something fairly
unreliable/unusable for user space. Nodes and zones can overlap quite
easily.

What counts are present/managed pages in the node/zone.
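
Those can be checked the same way, e.g.:

	# cat /proc/zoneinfo | grep -E "Node|present|managed" | paste - - -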

So at least I don't count this as something that "needs fixing",
it's more something that's nice to handle better if easily possible.

See below.

> 
>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
>>> ---
>>>  include/linux/memory_hotplug.h | 3 +++
>>>  mm/memory_hotplug.c            | 6 ++++--
>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>> index be48e003a518..25c7f60c317e 100644
>>> --- a/include/linux/memory_hotplug.h
>>> +++ b/include/linux/memory_hotplug.h
>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>>>  extern void remove_pfn_range_from_zone(struct zone *zone,
>>>  				       unsigned long start_pfn,
>>>  				       unsigned long nr_pages);
>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>> +			     unsigned long end_pfn);
>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
>>>  extern bool is_memblock_offlined(struct memory_block *mem);
>>>  extern int sparse_add_section(int nid, unsigned long pfn,
>>>  		unsigned long nr_pages, struct vmem_altmap *altmap);
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 2a9627dc784c..38f46a9ef853 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>>>  	return 0;
>>>  }
>>>  
>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>  			     unsigned long end_pfn)
>>>  {
>>>  	unsigned long pfn;
>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>  		}
>>>  	}
>>>  }
>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
>>
>> Exporting both as symbols feels very wrong. This is memory
>> onlining/offlining internal stuff.
> 
> I agree with your comment. I will find another approach that avoids
> directly using onlining/offlining internal stuff while updating the
> node/zone span.

IIRC, to handle what you intend to handle properly, you want to look into
teaching remove_pfn_range_from_zone() to handle zone_is_zone_device().

There is a big fat comment:

	/*
	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
	 * we will not try to shrink the zones - which is okay as
	 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
	 */
	if (zone_is_zone_device(zone))
		return;


Similarly, try_offline_node() spells this out:

	/*
	 * If the node still spans pages (especially ZONE_DEVICE), don't
	 * offline it. A node spans memory after move_pfn_range_to_zone(),
	 * e.g., after the memory block was onlined.
	 */
	if (pgdat->node_spanned_pages)
		return;


So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
try_offline_node() implicitly.

Trying to update the node span manually without teaching node/zone shrinking code how to
handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
won't work if the range of interest is still surrounded by other ranges.

-- 
Thanks,

David / dhildenb



* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
  2022-01-27  9:54     ` David Hildenbrand
@ 2022-01-28  4:19       ` Jonghyeon Kim
  2022-01-28  8:10         ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Jonghyeon Kim @ 2022-01-28  4:19 UTC
  To: David Hildenbrand
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm

On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
> On 27.01.22 10:41, Jonghyeon Kim wrote:
> > On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
> >> On 26.01.22 18:00, Jonghyeon Kim wrote:
> >>> Export the shrink_zone_span() and update_pgdat_span() functions to the
> >>> header file. We need to update the real number of spanned pages for NUMA
> >>> nodes and zones when we add a memory device node such as device dax memory.
> >>>
> >>
> >> Can you elaborate a bit more what you intend to fix?
> >>
> >> Memory onlining/offlining is responsible for updating the node/zone span,
> >> and that's triggered when the dax/kmem memory gets onlined/offlined.
> >>
> > Sure, sorry for the lack of explanation of the intended fix.
> > 
> > Before onlining nvdimm memory using dax (devdax or fsdax), this memory
> > belongs to the CPU NUMA nodes, which extends the spanned pages of the
> > node/zone as ZONE_DEVICE. So there is no problem, because the node/zone
> > spans contain these additional memory devices that are not visible to
> > the system.
> > But if we online the dax memory, the zone[ZONE_DEVICE] range of the CPU
> > NUMA node is hot-plugged to a new (but CPU-less) NUMA node. I think
> > there is no need to keep the zone[ZONE_DEVICE] pages on the original
> > node.
> > 
> > Additionally, spanned pages are also used to calculate the end pfn of a
> > node. Thus, it is necessary to maintain accurate page stats for the
> > node/zone.
> > 
> > My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
> > (DC persistent memory modules) with App-Direct mode.
> > 
> > Below are my test results.
> > 
> > Before memory onlining:
> > 
> > 	# ndctl create-namespace --mode=devdax
> > 	# ndctl create-namespace --mode=devdax
> > 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> > 	Node 0, zone      DMA	        spanned  4095
> > 	Node 0, zone    DMA32	        spanned  1044480
> > 	Node 0, zone   Normal	        spanned  7864320
> > 	Node 0, zone  Movable	        spanned  0
> > 	Node 0, zone   Device	        spanned  66060288
> > 	Node 1, zone      DMA	        spanned  0
> > 	Node 1, zone    DMA32	        spanned  0
> > 	Node 1, zone   Normal	        spanned  8388608
> > 	Node 1, zone  Movable	        spanned  0
> > 	Node 1, zone   Device	        spanned  66060288
> > 
> > After memory onlining:
> > 
> > 	# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
> > 	# daxctl reconfigure-device --mode=system-ram --no-online dax1.0
> > 
> > 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> > 	Node 0, zone      DMA	        spanned  4095
> > 	Node 0, zone    DMA32	        spanned  1044480
> > 	Node 0, zone   Normal	        spanned  7864320
> > 	Node 0, zone  Movable	        spanned  0
> > 	Node 0, zone   Device	        spanned  66060288
> > 	Node 1, zone      DMA	        spanned  0
> > 	Node 1, zone    DMA32	        spanned  0
> > 	Node 1, zone   Normal	        spanned  8388608
> > 	Node 1, zone  Movable	        spanned  0
> > 	Node 1, zone   Device	        spanned  66060288
> > 	Node 2, zone      DMA	        spanned  0
> > 	Node 2, zone    DMA32	        spanned  0
> > 	Node 2, zone   Normal	        spanned  65011712
> > 	Node 2, zone  Movable	        spanned  0
> > 	Node 2, zone   Device	        spanned  0
> > 	Node 3, zone      DMA	        spanned  0
> > 	Node 3, zone    DMA32	        spanned  0
> > 	Node 3, zone   Normal	        spanned  65011712
> > 	Node 3, zone  Movable	        spanned  0
> > 	Node 3, zone   Device	        spanned  0
> > 
> > As we can see, Nodes 0 and 1 still have ZONE_DEVICE pages after memory
> > onlining. This causes the problem that Node 0 and Node 2 have the same
> > end pfn value, and Node 1 and Node 3 have the same problem.
> 
> Thanks for the information, that makes it clearer.
> 
> While this is unfortunate, the node/zone span is something fairly
> unreliable/unusable for user space. Nodes and zones can overlap quite
> easily.
> 
> What counts are present/managed pages in the node/zone.
> 
> So at least I don't count this as something that "needs fixing",
> it's more something that's nice to handle better if easily possible.
> 
> See below.
> 
> > 
> >>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> >>> ---
> >>>  include/linux/memory_hotplug.h | 3 +++
> >>>  mm/memory_hotplug.c            | 6 ++++--
> >>>  2 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> >>> index be48e003a518..25c7f60c317e 100644
> >>> --- a/include/linux/memory_hotplug.h
> >>> +++ b/include/linux/memory_hotplug.h
> >>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >>>  extern void remove_pfn_range_from_zone(struct zone *zone,
> >>>  				       unsigned long start_pfn,
> >>>  				       unsigned long nr_pages);
> >>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>> +			     unsigned long end_pfn);
> >>> +extern void update_pgdat_span(struct pglist_data *pgdat);
> >>>  extern bool is_memblock_offlined(struct memory_block *mem);
> >>>  extern int sparse_add_section(int nid, unsigned long pfn,
> >>>  		unsigned long nr_pages, struct vmem_altmap *altmap);
> >>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>> index 2a9627dc784c..38f46a9ef853 100644
> >>> --- a/mm/memory_hotplug.c
> >>> +++ b/mm/memory_hotplug.c
> >>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> >>>  	return 0;
> >>>  }
> >>>  
> >>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>  			     unsigned long end_pfn)
> >>>  {
> >>>  	unsigned long pfn;
> >>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>  		}
> >>>  	}
> >>>  }
> >>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
> >>
> >> Exporting both as symbols feels very wrong. This is memory
> >> onlining/offlining internal stuff.
> > 
> > I agree with your comment. I will find another approach that avoids
> > directly using onlining/offlining internal stuff while updating the
> > node/zone span.
> 
> IIRC, to handle what you intend to handle properly, you want to look into
> teaching remove_pfn_range_from_zone() to handle zone_is_zone_device().
> 
> There is a big fat comment:
> 
> 	/*
> 	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
> 	 * we will not try to shrink the zones - which is okay as
> 	 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
> 	 */
> 	if (zone_is_zone_device(zone))
> 		return;
> 
> 
> Similarly, try_offline_node() spells this out:
> 
> 	/*
> 	 * If the node still spans pages (especially ZONE_DEVICE), don't
> 	 * offline it. A node spans memory after move_pfn_range_to_zone(),
> 	 * e.g., after the memory block was onlined.
> 	 */
> 	if (pgdat->node_spanned_pages)
> 		return;
> 
> 
> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
> try_offline_node() implicitly.
> 
> Trying to update the node span manually without teaching node/zone shrinking code how to
> handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
> won't work if the range of interest is still surrounded by other ranges.
> 

Thanks for pointing that out, I missed those comments.
I will keep trying to handle the node/zone span updating process.

> -- 
> Thanks,
> 
> David / dhildenb
> 


* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
  2022-01-28  4:19       ` Jonghyeon Kim
@ 2022-01-28  8:10         ` David Hildenbrand
  2022-02-03  2:22           ` Jonghyeon Kim
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2022-01-28  8:10 UTC
  To: Jonghyeon Kim
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm

On 28.01.22 05:19, Jonghyeon Kim wrote:
> On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
>> On 27.01.22 10:41, Jonghyeon Kim wrote:
>>> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
>>>> On 26.01.22 18:00, Jonghyeon Kim wrote:
>>>>> Export the shrink_zone_span() and update_pgdat_span() functions to the
>>>>> header file. We need to update the real number of spanned pages for NUMA
>>>>> nodes and zones when we add a memory device node such as device dax memory.
>>>>>
>>>>
>>>> Can you elaborate a bit more what you intend to fix?
>>>>
>>>> Memory onlining/offlining is responsible for updating the node/zone span,
>>>> and that's triggered when the dax/kmem memory gets onlined/offlined.
>>>>
>>> Sure, sorry for the lack of explanation of the intended fix.
>>>
>>> Before onlining nvdimm memory using dax (devdax or fsdax), this memory
>>> belongs to the CPU NUMA nodes, which extends the spanned pages of the
>>> node/zone as ZONE_DEVICE. So there is no problem, because the node/zone
>>> spans contain these additional memory devices that are not visible to
>>> the system.
>>> But if we online the dax memory, the zone[ZONE_DEVICE] range of the CPU
>>> NUMA node is hot-plugged to a new (but CPU-less) NUMA node. I think
>>> there is no need to keep the zone[ZONE_DEVICE] pages on the original
>>> node.
>>>
>>> Additionally, spanned pages are also used to calculate the end pfn of a
>>> node. Thus, it is necessary to maintain accurate page stats for the
>>> node/zone.
>>>
>>> My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
>>> (DC persistent memory modules) with App-Direct mode.
>>>
>>> Below are my test results.
>>>
>>> Before memory onlining:
>>>
>>> 	# ndctl create-namespace --mode=devdax
>>> 	# ndctl create-namespace --mode=devdax
>>> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>> 	Node 0, zone      DMA	        spanned  4095
>>> 	Node 0, zone    DMA32	        spanned  1044480
>>> 	Node 0, zone   Normal	        spanned  7864320
>>> 	Node 0, zone  Movable	        spanned  0
>>> 	Node 0, zone   Device	        spanned  66060288
>>> 	Node 1, zone      DMA	        spanned  0
>>> 	Node 1, zone    DMA32	        spanned  0
>>> 	Node 1, zone   Normal	        spanned  8388608
>>> 	Node 1, zone  Movable	        spanned  0
>>> 	Node 1, zone   Device	        spanned  66060288
>>>
>>> After memory onlining:
>>>
>>> 	# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
>>> 	# daxctl reconfigure-device --mode=system-ram --no-online dax1.0
>>>
>>> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>> 	Node 0, zone      DMA	        spanned  4095
>>> 	Node 0, zone    DMA32	        spanned  1044480
>>> 	Node 0, zone   Normal	        spanned  7864320
>>> 	Node 0, zone  Movable	        spanned  0
>>> 	Node 0, zone   Device	        spanned  66060288
>>> 	Node 1, zone      DMA	        spanned  0
>>> 	Node 1, zone    DMA32	        spanned  0
>>> 	Node 1, zone   Normal	        spanned  8388608
>>> 	Node 1, zone  Movable	        spanned  0
>>> 	Node 1, zone   Device	        spanned  66060288
>>> 	Node 2, zone      DMA	        spanned  0
>>> 	Node 2, zone    DMA32	        spanned  0
>>> 	Node 2, zone   Normal	        spanned  65011712
>>> 	Node 2, zone  Movable	        spanned  0
>>> 	Node 2, zone   Device	        spanned  0
>>> 	Node 3, zone      DMA	        spanned  0
>>> 	Node 3, zone    DMA32	        spanned  0
>>> 	Node 3, zone   Normal	        spanned  65011712
>>> 	Node 3, zone  Movable	        spanned  0
>>> 	Node 3, zone   Device	        spanned  0
>>>
>>> As we can see, Nodes 0 and 1 still have ZONE_DEVICE pages after memory
>>> onlining. This causes the problem that Node 0 and Node 2 have the same
>>> end pfn value, and Node 1 and Node 3 have the same problem.
>>
>> Thanks for the information, that makes it clearer.
>>
>> While this is unfortunate, the node/zone span is something fairly
>> unreliable/unusable for user space. Nodes and zones can overlap quite
>> easily.
>>
>> What counts are present/managed pages in the node/zone.
>>
>> So at least I don't count this as something that "needs fixing",
>> it's more something that's nice to handle better if easily possible.
>>
>> See below.
>>
>>>
>>>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
>>>>> ---
>>>>>  include/linux/memory_hotplug.h | 3 +++
>>>>>  mm/memory_hotplug.c            | 6 ++++--
>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>>>> index be48e003a518..25c7f60c317e 100644
>>>>> --- a/include/linux/memory_hotplug.h
>>>>> +++ b/include/linux/memory_hotplug.h
>>>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>>>>>  extern void remove_pfn_range_from_zone(struct zone *zone,
>>>>>  				       unsigned long start_pfn,
>>>>>  				       unsigned long nr_pages);
>>>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>> +			     unsigned long end_pfn);
>>>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
>>>>>  extern bool is_memblock_offlined(struct memory_block *mem);
>>>>>  extern int sparse_add_section(int nid, unsigned long pfn,
>>>>>  		unsigned long nr_pages, struct vmem_altmap *altmap);
>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>> index 2a9627dc784c..38f46a9ef853 100644
>>>>> --- a/mm/memory_hotplug.c
>>>>> +++ b/mm/memory_hotplug.c
>>>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>>>>>  	return 0;
>>>>>  }
>>>>>  
>>>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>  			     unsigned long end_pfn)
>>>>>  {
>>>>>  	unsigned long pfn;
>>>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>  		}
>>>>>  	}
>>>>>  }
>>>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
>>>>
>>>> Exporting both as symbols feels very wrong. This is memory
>>>> onlining/offlining internal stuff.
>>>
>>> I agree with your comment. I will find another approach that avoids
>>> directly using onlining/offlining internal stuff while updating the
>>> node/zone span.
>>
>> IIRC, to handle what you intend to handle properly, you want to look into
>> teaching remove_pfn_range_from_zone() to handle zone_is_zone_device().
>>
>> There is a big fat comment:
>>
>> 	/*
>> 	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
>> 	 * we will not try to shrink the zones - which is okay as
>> 	 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
>> 	 */
>> 	if (zone_is_zone_device(zone))
>> 		return;
>>
>>
>> Similarly, try_offline_node() spells this out:
>>
>> 	/*
>> 	 * If the node still spans pages (especially ZONE_DEVICE), don't
>> 	 * offline it. A node spans memory after move_pfn_range_to_zone(),
>> 	 * e.g., after the memory block was onlined.
>> 	 */
>> 	if (pgdat->node_spanned_pages)
>> 		return;
>>
>>
>> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
>> try_offline_node() implicitly.
>>
>> Trying to update the node span manually without teaching node/zone shrinking code how to
>> handle ZONE_DEVICE properly is just a hack that will only sometimes work. Especially, it
>> won't work if the range of interest is still surrounded by other ranges.
>>
> 
> Thanks for pointing that out, I missed those comments.
> I will keep trying to handle the node/zone span updating process.

The only safe thing right now for ZONE_DEVICE in
remove_pfn_range_from_zone() would be removing the given range from the
start/end of the zone range, but we must not scan using the existing
functions.
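
For illustration, such an end-trim could look roughly like this (a
sketch only, with a made-up helper name, not existing kernel code):

	static void trim_zone_device_span(struct zone *zone,
					  unsigned long start_pfn,
					  unsigned long nr_pages)
	{
		unsigned long end_pfn = start_pfn + nr_pages;

		if (start_pfn == zone->zone_start_pfn) {
			/* Range sits at the low end: move the start up. */
			zone->zone_start_pfn = end_pfn;
			zone->spanned_pages -= nr_pages;
		} else if (end_pfn == zone_end_pfn(zone)) {
			/* Range sits at the high end: shrink the span. */
			zone->spanned_pages -= nr_pages;
		}
		/* A range in the middle must leave the span untouched. */
	}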

As soon as we start actual *scanning* via find_smallest...
find_biggest... in shrink_zone_span() we would mistakenly skip other
ZONE_DEVICE ranges and mess up.

Assume you would have a ZONE_DEVICE layout like

[  DEV 0 | Hole | DEV 1 | Hole | DEV 2 ]

What we actually want to do when removing

* DEV 0 is scanning low->high until we find DEV 1
* DEV 1 is doing nothing, because we cannot shrink
* DEV 2 is scanning high -> low until we find DEV 1


I assume we'd want to call in shrink_zone_span() two new functions for
ZONE_DEVICE:
find_smallest_zone_device_pfn
find_biggest_zone_device_pfn

Which would be able to do exactly that scanning, eventually, using
get_dev_pagemap() or some similar source of information.
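
A sketch of the low -> high variant (untested; the function name is the
hypothetical one from above, and the high -> low variant would mirror it):

	static unsigned long find_smallest_zone_device_pfn(unsigned long start_pfn,
							   unsigned long end_pfn)
	{
		struct dev_pagemap *pgmap;
		unsigned long pfn;

		/*
		 * Scan for the first pfn that still belongs to some
		 * ZONE_DEVICE pagemap, so that holes between device
		 * ranges are not mistaken for the end of the zone.
		 */
		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
			pgmap = get_dev_pagemap(pfn, NULL);
			if (pgmap) {
				put_dev_pagemap(pgmap);
				return pfn;
			}
		}
		return 0; /* no ZONE_DEVICE range left in the zone */
	}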

-- 
Thanks,

David / dhildenb



* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
  2022-01-28  8:10         ` David Hildenbrand
@ 2022-02-03  2:22           ` Jonghyeon Kim
  2022-02-03  8:19             ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Jonghyeon Kim @ 2022-02-03  2:22 UTC
  To: David Hildenbrand
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm

On Fri, Jan 28, 2022 at 09:10:21AM +0100, David Hildenbrand wrote:
> On 28.01.22 05:19, Jonghyeon Kim wrote:
> > On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
> >> On 27.01.22 10:41, Jonghyeon Kim wrote:
> >>> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
> >>>> On 26.01.22 18:00, Jonghyeon Kim wrote:
> >>>>> Export the shrink_zone_span() and update_pgdat_span() functions to the
> >>>>> header file. We need to update the real number of spanned pages for NUMA
> >>>>> nodes and zones when we add a memory device node such as device dax memory.
> >>>>>
> >>>>
> >>>> Can you elaborate a bit more what you intend to fix?
> >>>>
> >>>> Memory onlining/offlining is responsible for updating the node/zone span,
> >>>> and that's triggered when the dax/kmem memory gets onlined/offlined.
> >>>>
> >>> Sure, sorry for the lack of explanation of the intended fix.
> >>>
> >>> Before onlining nvdimm memory using dax (devdax or fsdax), this memory
> >>> belongs to the CPU NUMA nodes, which extends the spanned pages of the
> >>> node/zone as ZONE_DEVICE. So there is no problem, because the node/zone
> >>> spans contain these additional memory devices that are not visible to
> >>> the system.
> >>> But if we online the dax memory, the zone[ZONE_DEVICE] range of the CPU
> >>> NUMA node is hot-plugged to a new (but CPU-less) NUMA node. I think
> >>> there is no need to keep the zone[ZONE_DEVICE] pages on the original
> >>> node.
> >>>
> >>> Additionally, spanned pages are also used to calculate the end pfn of a
> >>> node. Thus, it is necessary to maintain accurate page stats for the
> >>> node/zone.
> >>>
> >>> My machine contains two CPU-socket consisting of DRAM and Intel DCPMM
> >>> (DC persistent memory modules) with App-Direct mode.
> >>>
> >>> Below are my test results.
> >>>
> >>> Before memory onlining:
> >>>
> >>> 	# ndctl create-namespace --mode=devdax
> >>> 	# ndctl create-namespace --mode=devdax
> >>> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> >>> 	Node 0, zone      DMA	        spanned  4095
> >>> 	Node 0, zone    DMA32	        spanned  1044480
> >>> 	Node 0, zone   Normal	        spanned  7864320
> >>> 	Node 0, zone  Movable	        spanned  0
> >>> 	Node 0, zone   Device	        spanned  66060288
> >>> 	Node 1, zone      DMA	        spanned  0
> >>> 	Node 1, zone    DMA32	        spanned  0
> >>> 	Node 1, zone   Normal	        spanned  8388608
> >>> 	Node 1, zone  Movable	        spanned  0
> >>> 	Node 1, zone   Device	        spanned  66060288
> >>>
> >>> After memory onlining:
> >>>
> >>> 	# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
> >>> 	# daxctl reconfigure-device --mode=system-ram --no-online dax1.0
> >>>
> >>> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
> >>> 	Node 0, zone      DMA	        spanned  4095
> >>> 	Node 0, zone    DMA32	        spanned  1044480
> >>> 	Node 0, zone   Normal	        spanned  7864320
> >>> 	Node 0, zone  Movable	        spanned  0
> >>> 	Node 0, zone   Device	        spanned  66060288
> >>> 	Node 1, zone      DMA	        spanned  0
> >>> 	Node 1, zone    DMA32	        spanned  0
> >>> 	Node 1, zone   Normal	        spanned  8388608
> >>> 	Node 1, zone  Movable	        spanned  0
> >>> 	Node 1, zone   Device	        spanned  66060288
> >>> 	Node 2, zone      DMA	        spanned  0
> >>> 	Node 2, zone    DMA32	        spanned  0
> >>> 	Node 2, zone   Normal	        spanned  65011712
> >>> 	Node 2, zone  Movable	        spanned  0
> >>> 	Node 2, zone   Device	        spanned  0
> >>> 	Node 3, zone      DMA	        spanned  0
> >>> 	Node 3, zone    DMA32	        spanned  0
> >>> 	Node 3, zone   Normal	        spanned  65011712
> >>> 	Node 3, zone  Movable	        spanned  0
> >>> 	Node 3, zone   Device	        spanned  0
> >>>
> >>> As we can see, Nodes 0 and 1 still have ZONE_DEVICE pages after memory
> >>> onlining. This causes the problem that Node 0 and Node 2 have the same
> >>> end pfn value, and Node 1 and Node 3 have the same problem.
> >>
> >> Thanks for the information, that makes it clearer.
> >>
> >> While this is unfortunate, the node/zone span is something fairly
> >> unreliable/unusable for user space. Nodes and zones can overlap quite
> >> easily.
> >>
> >> What counts are present/managed pages in the node/zone.
> >>
> >> So at least I don't count this as something that "needs fixing",
> >> it's more something that's nice to handle better if easily possible.
> >>
> >> See below.
> >>
> >>>
> >>>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
> >>>>> ---
> >>>>>  include/linux/memory_hotplug.h | 3 +++
> >>>>>  mm/memory_hotplug.c            | 6 ++++--
> >>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> >>>>> index be48e003a518..25c7f60c317e 100644
> >>>>> --- a/include/linux/memory_hotplug.h
> >>>>> +++ b/include/linux/memory_hotplug.h
> >>>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >>>>>  extern void remove_pfn_range_from_zone(struct zone *zone,
> >>>>>  				       unsigned long start_pfn,
> >>>>>  				       unsigned long nr_pages);
> >>>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>> +			     unsigned long end_pfn);
> >>>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
> >>>>>  extern bool is_memblock_offlined(struct memory_block *mem);
> >>>>>  extern int sparse_add_section(int nid, unsigned long pfn,
> >>>>>  		unsigned long nr_pages, struct vmem_altmap *altmap);
> >>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>>>> index 2a9627dc784c..38f46a9ef853 100644
> >>>>> --- a/mm/memory_hotplug.c
> >>>>> +++ b/mm/memory_hotplug.c
> >>>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> >>>>>  	return 0;
> >>>>>  }
> >>>>>  
> >>>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>>  			     unsigned long end_pfn)
> >>>>>  {
> >>>>>  	unsigned long pfn;
> >>>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> >>>>>  		}
> >>>>>  	}
> >>>>>  }
> >>>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
> >>>>
> >>>> Exporting both as symbols feels very wrong. This is memory
> >>>> onlining/offlining internal stuff.
> >>>
> >>> I agree with your comment. I will find another approach that avoids
> >>> directly using onlining/offlining internal stuff while updating the node/zone span.
> >>
> >> IIRC, to handle what you intend to handle properly, you want to look into teaching
> >> remove_pfn_range_from_zone() to handle zone_is_zone_device().
> >>
> >> There is a big fat comment:
> >>
> >> 	/*
> >> 	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
> >> 	 * we will not try to shrink the zones - which is okay as
> >> 	 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
> >> 	 */
> >> 	if (zone_is_zone_device(zone))
> >> 		return;
> >>
> >>
> >> Similarly, try_offline_node() spells this out:
> >>
> >> 	/*
> >> 	 * If the node still spans pages (especially ZONE_DEVICE), don't
> >> 	 * offline it. A node spans memory after move_pfn_range_to_zone(),
> >> 	 * e.g., after the memory block was onlined.
> >> 	 */
> >> 	if (pgdat->node_spanned_pages)
> >> 		return;
> >>
> >>
> >> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
> >> try_offline_node() implicitly.
> >>
> >> Trying to update the node span manually without teaching the node/zone shrinking code how to
> >> handle ZONE_DEVICE properly is just a hack that will only sometimes work. In particular, it
> >> won't work if the range of interest is still surrounded by other ranges.
> >>
> > 
> > Thanks for pointing that out, I missed those comments.
> > I will keep trying to handle the node/zone span updating process.
> 
> The only safe thing right now for ZONE_DEVICE in
> remove_pfn_range_from_zone() would be removing the given range from the
> start/end of the zone range, but we must not scan using the existing
> functions.
> 
> As soon as we start actual *scanning* via find_smallest...
> find_biggest... in shrink_zone_span() we would mistakenly skip other
> ZONE_DEVICE ranges and mess up.
> 
> Assume you would have a ZONE_DEVICE layout like
> 
> [  DEV 0 | Hole | DEV 1 | Hole | DEV 2 ]
> 

IIUC, you assumed a situation with multiple ZONE_DEVICE ranges in a single node,
with holes among them, right?

> What we actually want to do when removing
> 
> * DEV 0 is scanning low->high until we find DEV 1
> * DEV 1 is doing nothing, because we cannot shrink
> * DEV 2 is scanning high -> low until we find DEV 1
> 
> 
> I assume we'd want shrink_zone_span() to call two new functions for
> ZONE_DEVICE:
> find_smallest_zone_device_pfn
> find_biggest_zone_device_pfn
> 
> These would be able to do exactly that scanning, eventually using
> get_dev_pagemap() or some similar source of information.
> 

I agree with your suggestion. That would deal with ZONE_DEVICE cleanly by
actually scanning pages to check for holes.
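
Completely untested, but to make the idea concrete, I imagine the low->high
scan could look roughly like below. find_smallest_zone_device_pfn() is just
the name from your suggestion; I am assuming that get_dev_pagemap() /
put_dev_pagemap() are a suitable source of information here and that
ZONE_DEVICE ranges are at least subsection aligned:

	/*
	 * Rough sketch: find the lowest pfn in [start_pfn, end_pfn) that is
	 * still backed by a dev_pagemap, i.e., that still belongs to some
	 * ZONE_DEVICE range.
	 */
	static unsigned long find_smallest_zone_device_pfn(unsigned long start_pfn,
							   unsigned long end_pfn)
	{
		struct dev_pagemap *pgmap;
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SUBSECTION) {
			pgmap = get_dev_pagemap(pfn, NULL);
			if (pgmap) {
				put_dev_pagemap(pgmap);
				return pfn;
			}
		}
		return 0;
	}

find_biggest_zone_device_pfn() would be the same scan going from high to low.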

> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node
  2022-02-03  2:22           ` Jonghyeon Kim
@ 2022-02-03  8:19             ` David Hildenbrand
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2022-02-03  8:19 UTC (permalink / raw)
  To: Jonghyeon Kim
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, akpm, nvdimm,
	linux-kernel, linux-mm

On 03.02.22 03:22, Jonghyeon Kim wrote:
> On Fri, Jan 28, 2022 at 09:10:21AM +0100, David Hildenbrand wrote:
>> On 28.01.22 05:19, Jonghyeon Kim wrote:
>>> On Thu, Jan 27, 2022 at 10:54:23AM +0100, David Hildenbrand wrote:
>>>> On 27.01.22 10:41, Jonghyeon Kim wrote:
>>>>> On Wed, Jan 26, 2022 at 06:04:50PM +0100, David Hildenbrand wrote:
>>>>>> On 26.01.22 18:00, Jonghyeon Kim wrote:
>>>>>>> Export the shrink_zone_span() and update_pgdat_span() functions in the header
>>>>>>> file. We need to update the real number of spanned pages for NUMA nodes and
>>>>>>> zones when we add a memory device node such as device dax memory.
>>>>>>>
>>>>>>
>>>>>> Can you elaborate a bit more on what you intend to fix?
>>>>>>
>>>>>> Memory onlining/offlining is responsible for updating the node/zone span,
>>>>>> and that's triggered when the dax/kmem memory gets onlined/offlined.
>>>>>>
>>>>> Sure, sorry for the lack of explanation of the intended fix.
>>>>>
>>>>> Before onlining nvdimm memory using dax (devdax or fsdax), this memory belongs
>>>>> to the CPU NUMA nodes, which extends the spanned pages of the node/zone as
>>>>> ZONE_DEVICE. So there is no problem, because the node/zone span these additional
>>>>> memory devices that are not visible to the system.
>>>>> But if we online dax memory, the zone[ZONE_DEVICE] pages of a CPU NUMA node are
>>>>> hot-plugged to a new (but CPU-less) NUMA node. I think there is no need to keep
>>>>> zone[ZONE_DEVICE] pages on the original node.
>>>>>
>>>>> Additionally, spanned pages are also used to calculate the end pfn of a node.
>>>>> Thus, accurate page stats need to be maintained for the node/zone.
>>>>>
>>>>> My machine contains two CPU sockets, each with DRAM and Intel DCPMM
>>>>> (DC persistent memory modules) in App Direct mode.
>>>>>
>>>>> Below are my test results.
>>>>>
>>>>> Before memory onlining:
>>>>>
>>>>> 	# ndctl create-namespace --mode=devdax
>>>>> 	# ndctl create-namespace --mode=devdax
>>>>> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>>>> 	Node 0, zone      DMA	        spanned  4095
>>>>> 	Node 0, zone    DMA32	        spanned  1044480
>>>>> 	Node 0, zone   Normal	        spanned  7864320
>>>>> 	Node 0, zone  Movable	        spanned  0
>>>>> 	Node 0, zone   Device	        spanned  66060288
>>>>> 	Node 1, zone      DMA	        spanned  0
>>>>> 	Node 1, zone    DMA32	        spanned  0
>>>>> 	Node 1, zone   Normal	        spanned  8388608
>>>>> 	Node 1, zone  Movable	        spanned  0
>>>>> 	Node 1, zone   Device	        spanned  66060288
>>>>>
>>>>> After memory onlining:
>>>>>
>>>>> 	# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
>>>>> 	# daxctl reconfigure-device --mode=system-ram --no-online dax1.0
>>>>>
>>>>> 	# cat /proc/zoneinfo | grep -E "Node|spanned" | paste - -
>>>>> 	Node 0, zone      DMA	        spanned  4095
>>>>> 	Node 0, zone    DMA32	        spanned  1044480
>>>>> 	Node 0, zone   Normal	        spanned  7864320
>>>>> 	Node 0, zone  Movable	        spanned  0
>>>>> 	Node 0, zone   Device	        spanned  66060288
>>>>> 	Node 1, zone      DMA	        spanned  0
>>>>> 	Node 1, zone    DMA32	        spanned  0
>>>>> 	Node 1, zone   Normal	        spanned  8388608
>>>>> 	Node 1, zone  Movable	        spanned  0
>>>>> 	Node 1, zone   Device	        spanned  66060288
>>>>> 	Node 2, zone      DMA	        spanned  0
>>>>> 	Node 2, zone    DMA32	        spanned  0
>>>>> 	Node 2, zone   Normal	        spanned  65011712
>>>>> 	Node 2, zone  Movable	        spanned  0
>>>>> 	Node 2, zone   Device	        spanned  0
>>>>> 	Node 3, zone      DMA	        spanned  0
>>>>> 	Node 3, zone    DMA32	        spanned  0
>>>>> 	Node 3, zone   Normal	        spanned  65011712
>>>>> 	Node 3, zone  Movable	        spanned  0
>>>>> 	Node 3, zone   Device	        spanned  0
>>>>>
>>>>> As we can see, Node 0 and 1 still have ZONE_DEVICE pages after memory onlining.
>>>>> This causes a problem: Node 0 and Node 2 have the same end pfn value, and
>>>>> Node 1 and Node 3 have the same problem.
>>>>
>>>> Thanks for the information, that makes it clearer.
>>>>
>>>> While this is unfortunate, the node/zone span is something fairly
>>>> unreliable/unusable for user space. Nodes and zones can overlap quite easily.
>>>>
>>>> What counts are present/managed pages in the node/zone.
>>>>
>>>> So at least I don't count this as something that "needs fixing";
>>>> it's more something that's nice to handle better if easily possible.
>>>>
>>>> See below.
>>>>
>>>>>
>>>>>>> Signed-off-by: Jonghyeon Kim <tome01@ajou.ac.kr>
>>>>>>> ---
>>>>>>>  include/linux/memory_hotplug.h | 3 +++
>>>>>>>  mm/memory_hotplug.c            | 6 ++++--
>>>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>>>>>> index be48e003a518..25c7f60c317e 100644
>>>>>>> --- a/include/linux/memory_hotplug.h
>>>>>>> +++ b/include/linux/memory_hotplug.h
>>>>>>> @@ -337,6 +337,9 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>>>>>>>  extern void remove_pfn_range_from_zone(struct zone *zone,
>>>>>>>  				       unsigned long start_pfn,
>>>>>>>  				       unsigned long nr_pages);
>>>>>>> +extern void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>> +			     unsigned long end_pfn);
>>>>>>> +extern void update_pgdat_span(struct pglist_data *pgdat);
>>>>>>>  extern bool is_memblock_offlined(struct memory_block *mem);
>>>>>>>  extern int sparse_add_section(int nid, unsigned long pfn,
>>>>>>>  		unsigned long nr_pages, struct vmem_altmap *altmap);
>>>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>>>> index 2a9627dc784c..38f46a9ef853 100644
>>>>>>> --- a/mm/memory_hotplug.c
>>>>>>> +++ b/mm/memory_hotplug.c
>>>>>>> @@ -389,7 +389,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>>>>>>>  	return 0;
>>>>>>>  }
>>>>>>>  
>>>>>>> -static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>> +void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>>  			     unsigned long end_pfn)
>>>>>>>  {
>>>>>>>  	unsigned long pfn;
>>>>>>> @@ -428,8 +428,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>>>>>>  		}
>>>>>>>  	}
>>>>>>>  }
>>>>>>> +EXPORT_SYMBOL_GPL(shrink_zone_span);
>>>>>>
>>>>>> Exporting both as symbols feels very wrong. This is memory
>>>>>> onlining/offlining internal stuff.
>>>>>
>>>>> I agree with your comment. I will find another approach that avoids
>>>>> directly using onlining/offlining internal stuff while updating the node/zone span.
>>>>
>>>> IIRC, to handle what you intend to handle properly, you want to look into teaching
>>>> remove_pfn_range_from_zone() to handle zone_is_zone_device().
>>>>
>>>> There is a big fat comment:
>>>>
>>>> 	/*
>>>> 	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
>>>> 	 * we will not try to shrink the zones - which is okay as
>>>> 	 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
>>>> 	 */
>>>> 	if (zone_is_zone_device(zone))
>>>> 		return;
>>>>
>>>>
>>>> Similarly, try_offline_node() spells this out:
>>>>
>>>> 	/*
>>>> 	 * If the node still spans pages (especially ZONE_DEVICE), don't
>>>> 	 * offline it. A node spans memory after move_pfn_range_to_zone(),
>>>> 	 * e.g., after the memory block was onlined.
>>>> 	 */
>>>> 	if (pgdat->node_spanned_pages)
>>>> 		return;
>>>>
>>>>
>>>> So once you handle remove_pfn_range_from_zone() cleanly, you'll cleanly handle
>>>> try_offline_node() implicitly.
>>>>
>>>> Trying to update the node span manually without teaching the node/zone shrinking code how to
>>>> handle ZONE_DEVICE properly is just a hack that will only sometimes work. In particular, it
>>>> won't work if the range of interest is still surrounded by other ranges.
>>>>
>>>
>>> Thanks for pointing that out, I missed those comments.
>>> I will keep trying to handle the node/zone span updating process.
>>
>> The only safe thing right now for ZONE_DEVICE in
>> remove_pfn_range_from_zone() would be removing the given range from the
>> start/end of the zone range, but we must not scan using the existing
>> functions.
>>
>> As soon as we start actual *scanning* via find_smallest...
>> find_biggest... in shrink_zone_span() we would mistakenly skip other
>> ZONE_DEVICE ranges and mess up.
>>
>> Assume you would have a ZONE_DEVICE layout like
>>
>> [  DEV 0 | Hole | DEV 1 | Hole | DEV 2 ]
>>
> 
> IIUC, you assumed a situation with multiple ZONE_DEVICE ranges in a single node,
> with holes among them, right?

Exactly. But the holes are just an example to show what we'd have to do
when holes are around. Having multiple devices is the important part.
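
Just to spell out what I mean by removing the given range only from the
start/end of the zone span, a completely untested sketch of the ZONE_DEVICE
case in remove_pfn_range_from_zone() could look like this (ignoring locking,
and where exactly the update_pgdat_span() call would have to go):

	if (zone_is_zone_device(zone)) {
		unsigned long end_pfn = start_pfn + nr_pages;
		unsigned long zone_end = zone_end_pfn(zone);

		if (start_pfn == zone->zone_start_pfn && end_pfn == zone_end) {
			/* the whole span goes away */
			zone->zone_start_pfn = 0;
			zone->spanned_pages = 0;
		} else if (start_pfn == zone->zone_start_pfn) {
			/* trim the front of the span */
			zone->zone_start_pfn = end_pfn;
			zone->spanned_pages = zone_end - end_pfn;
		} else if (end_pfn == zone_end) {
			/* trim the tail of the span */
			zone->spanned_pages = start_pfn - zone->zone_start_pfn;
		}
		/* a range in the middle must not shrink the span without scanning */
		update_pgdat_span(zone->zone_pgdat);
		return;
	}

Once scanning helpers as discussed above exist, the two "trim" branches could
use them instead of only handling exact boundary matches.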


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-02-03  8:19 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-26 17:00 [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node Jonghyeon Kim
2022-01-26 17:00 ` [PATCH 2/2] dax/kmem: Update spanned page stat of origin device node Jonghyeon Kim
2022-01-27  0:29   ` kernel test robot
2022-01-27  5:29   ` kernel test robot
2022-01-26 17:04 ` [PATCH 1/2] mm/memory_hotplug: Export shrink span functions for zone and node David Hildenbrand
2022-01-27  9:41   ` Jonghyeon Kim
2022-01-27  9:54     ` David Hildenbrand
2022-01-28  4:19       ` Jonghyeon Kim
2022-01-28  8:10         ` David Hildenbrand
2022-02-03  2:22           ` Jonghyeon Kim
2022-02-03  8:19             ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).