linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC] mm, proc: add PcpFree to meminfo
@ 2022-08-16  8:44 Kefeng Wang
  2022-08-16  8:48 ` huang ying
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Kefeng Wang @ 2022-08-16  8:44 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman, linux-mm
  Cc: linux-kernel, Liu Shixin, Kefeng Wang

From: Liu Shixin <liushixin2@huawei.com>

The page on pcplist could be used, but not counted into memory free or
avaliable, and pcp_free is only showed by show_mem(). Since commit
d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
significant decrease in the display of free memory, with a large number
of cpus and nodes, the number of pages in the percpu list can be very
large, so it is better to let user to know the pcp count.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 drivers/base/node.c | 14 +++++++++++++-
 fs/proc/meminfo.c   |  9 +++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb0f43784c2b..846864e45db6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 	struct sysinfo i;
 	unsigned long sreclaimable, sunreclaimable;
 	unsigned long swapcached = 0;
+	unsigned long free_pcp = 0;
+	struct zone *zone;
+	int cpu;
 
 	si_meminfo_node(&i, nid);
 	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
@@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_SWAP
 	swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
 #endif
+	for_each_populated_zone(zone) {
+		if (zone_to_nid(zone) != nid)
+			continue;
+		for_each_online_cpu(cpu)
+			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+	}
+
 	len = sysfs_emit_at(buf, len,
 			    "Node %d MemTotal:       %8lu kB\n"
 			    "Node %d MemFree:        %8lu kB\n"
+			    "Node %d PcpFree:        %8lu kB\n"
 			    "Node %d MemUsed:        %8lu kB\n"
 			    "Node %d SwapCached:     %8lu kB\n"
 			    "Node %d Active:         %8lu kB\n"
@@ -397,7 +408,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 			    "Node %d Mlocked:        %8lu kB\n",
 			    nid, K(i.totalram),
 			    nid, K(i.freeram),
-			    nid, K(i.totalram - i.freeram),
+			    nid, K(free_pcp),
+			    nid, K(i.totalram - i.freeram - free_pcp),
 			    nid, K(swapcached),
 			    nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
 				   node_page_state(pgdat, NR_ACTIVE_FILE)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6e89f0e2fd20..672c784dfc8a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -38,6 +38,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	unsigned long pages[NR_LRU_LISTS];
 	unsigned long sreclaimable, sunreclaim;
 	int lru;
+	unsigned long free_pcp = 0;
+	struct zone *zone;
+	int cpu;
 
 	si_meminfo(&i);
 	si_swapinfo(&i);
@@ -55,8 +58,14 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
 	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
 
+	for_each_populated_zone(zone) {
+		for_each_online_cpu(cpu)
+			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+	}
+
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
+	show_val_kb(m, "PcpFree:        ", free_pcp);
 	show_val_kb(m, "MemAvailable:   ", available);
 	show_val_kb(m, "Buffers:        ", i.bufferram);
 	show_val_kb(m, "Cached:         ", cached);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16  8:44 [PATCH RFC] mm, proc: add PcpFree to meminfo Kefeng Wang
@ 2022-08-16  8:48 ` huang ying
  2022-08-16  9:24   ` Kefeng Wang
  2022-08-16  9:16 ` Greg Kroah-Hartman
  2022-08-17  7:16 ` Kefeng Wang
  2 siblings, 1 reply; 14+ messages in thread
From: huang ying @ 2022-08-16  8:48 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Andrew Morton, Greg Kroah-Hartman, linux-mm, linux-kernel,
	Liu Shixin, Huang Ying

On Tue, Aug 16, 2022 at 4:38 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> From: Liu Shixin <liushixin2@huawei.com>
>
> The page on pcplist could be used, but not counted into memory free or
> avaliable, and pcp_free is only showed by show_mem(). Since commit
> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> significant decrease in the display of free memory, with a large number
> of cpus and nodes, the number of pages in the percpu list can be very
> large, so it is better to let user to know the pcp count.

Can you show some data?

Another choice is to count PCP free pages in MemFree.  Is that OK for
your use case too?

Best Regards,
Huang, Ying

> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  drivers/base/node.c | 14 +++++++++++++-
>  fs/proc/meminfo.c   |  9 +++++++++
>  2 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb0f43784c2b..846864e45db6 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>         struct sysinfo i;
>         unsigned long sreclaimable, sunreclaimable;
>         unsigned long swapcached = 0;
> +       unsigned long free_pcp = 0;
> +       struct zone *zone;
> +       int cpu;
>
>         si_meminfo_node(&i, nid);
>         sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>  #ifdef CONFIG_SWAP
>         swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>  #endif
> +       for_each_populated_zone(zone) {
> +               if (zone_to_nid(zone) != nid)
> +                       continue;
> +               for_each_online_cpu(cpu)
> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> +       }
> +
>         len = sysfs_emit_at(buf, len,
>                             "Node %d MemTotal:       %8lu kB\n"
>                             "Node %d MemFree:        %8lu kB\n"
> +                           "Node %d PcpFree:        %8lu kB\n"
>                             "Node %d MemUsed:        %8lu kB\n"
>                             "Node %d SwapCached:     %8lu kB\n"
>                             "Node %d Active:         %8lu kB\n"
> @@ -397,7 +408,8 @@ static ssize_t node_read_meminfo(struct device *dev,
>                             "Node %d Mlocked:        %8lu kB\n",
>                             nid, K(i.totalram),
>                             nid, K(i.freeram),
> -                           nid, K(i.totalram - i.freeram),
> +                           nid, K(free_pcp),
> +                           nid, K(i.totalram - i.freeram - free_pcp),
>                             nid, K(swapcached),
>                             nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
>                                    node_page_state(pgdat, NR_ACTIVE_FILE)),
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 6e89f0e2fd20..672c784dfc8a 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -38,6 +38,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>         unsigned long pages[NR_LRU_LISTS];
>         unsigned long sreclaimable, sunreclaim;
>         int lru;
> +       unsigned long free_pcp = 0;
> +       struct zone *zone;
> +       int cpu;
>
>         si_meminfo(&i);
>         si_swapinfo(&i);
> @@ -55,8 +58,14 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>         sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
>         sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
>
> +       for_each_populated_zone(zone) {
> +               for_each_online_cpu(cpu)
> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> +       }
> +
>         show_val_kb(m, "MemTotal:       ", i.totalram);
>         show_val_kb(m, "MemFree:        ", i.freeram);
> +       show_val_kb(m, "PcpFree:        ", free_pcp);
>         show_val_kb(m, "MemAvailable:   ", available);
>         show_val_kb(m, "Buffers:        ", i.bufferram);
>         show_val_kb(m, "Cached:         ", cached);
> --
> 2.35.3
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16  8:44 [PATCH RFC] mm, proc: add PcpFree to meminfo Kefeng Wang
  2022-08-16  8:48 ` huang ying
@ 2022-08-16  9:16 ` Greg Kroah-Hartman
  2022-08-16 10:11   ` Kefeng Wang
  2022-08-17  7:16 ` Kefeng Wang
  2 siblings, 1 reply; 14+ messages in thread
From: Greg Kroah-Hartman @ 2022-08-16  9:16 UTC (permalink / raw)
  To: Kefeng Wang; +Cc: Andrew Morton, linux-mm, linux-kernel, Liu Shixin

On Tue, Aug 16, 2022 at 04:44:26PM +0800, Kefeng Wang wrote:
> From: Liu Shixin <liushixin2@huawei.com>
> 
> The page on pcplist could be used, but not counted into memory free or
> avaliable, and pcp_free is only showed by show_mem(). Since commit
> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> significant decrease in the display of free memory, with a large number
> of cpus and nodes, the number of pages in the percpu list can be very
> large, so it is better to let user to know the pcp count.
> 
> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  drivers/base/node.c | 14 +++++++++++++-
>  fs/proc/meminfo.c   |  9 +++++++++
>  2 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb0f43784c2b..846864e45db6 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>  	struct sysinfo i;
>  	unsigned long sreclaimable, sunreclaimable;
>  	unsigned long swapcached = 0;
> +	unsigned long free_pcp = 0;
> +	struct zone *zone;
> +	int cpu;
>  
>  	si_meminfo_node(&i, nid);
>  	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>  #ifdef CONFIG_SWAP
>  	swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>  #endif
> +	for_each_populated_zone(zone) {
> +		if (zone_to_nid(zone) != nid)
> +			continue;
> +		for_each_online_cpu(cpu)
> +			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> +	}
> +
>  	len = sysfs_emit_at(buf, len,
>  			    "Node %d MemTotal:       %8lu kB\n"
>  			    "Node %d MemFree:        %8lu kB\n"
> +			    "Node %d PcpFree:        %8lu kB\n"

First off, this sysfs file is a huge violation of the normal sysfs
rules, so I will not allow any new entries to be added.  In fact, the
whole thing should just be removed and multiple files created in its
place.

Can you send a patch to do that instead please?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16  8:48 ` huang ying
@ 2022-08-16  9:24   ` Kefeng Wang
  2022-08-19  7:40     ` Aaron Lu
  0 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2022-08-16  9:24 UTC (permalink / raw)
  To: huang ying
  Cc: Andrew Morton, Greg Kroah-Hartman, linux-mm, linux-kernel,
	Liu Shixin, Huang Ying


On 2022/8/16 16:48, huang ying wrote:
> On Tue, Aug 16, 2022 at 4:38 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>> From: Liu Shixin <liushixin2@huawei.com>
>>
>> The page on pcplist could be used, but not counted into memory free or
>> avaliable, and pcp_free is only showed by show_mem(). Since commit
>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>> significant decrease in the display of free memory, with a large number
>> of cpus and nodes, the number of pages in the percpu list can be very
>> large, so it is better to let user to know the pcp count.
> Can you show some data?

80M+ with 72cpus/2node

>
> Another choice is to count PCP free pages in MemFree.  Is that OK for
> your use case too?

Yes, the user will make policy according to MemFree, we think count PCP 
free pages

in MemFree is better, but don't know whether it is right way.

>
> Best Regards,
> Huang, Ying
>
>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>>   drivers/base/node.c | 14 +++++++++++++-
>>   fs/proc/meminfo.c   |  9 +++++++++
>>   2 files changed, 22 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index eb0f43784c2b..846864e45db6 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>>          struct sysinfo i;
>>          unsigned long sreclaimable, sunreclaimable;
>>          unsigned long swapcached = 0;
>> +       unsigned long free_pcp = 0;
>> +       struct zone *zone;
>> +       int cpu;
>>
>>          si_meminfo_node(&i, nid);
>>          sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
>> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>>   #ifdef CONFIG_SWAP
>>          swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>>   #endif
>> +       for_each_populated_zone(zone) {
>> +               if (zone_to_nid(zone) != nid)
>> +                       continue;
>> +               for_each_online_cpu(cpu)
>> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>> +       }
>> +
>>          len = sysfs_emit_at(buf, len,
>>                              "Node %d MemTotal:       %8lu kB\n"
>>                              "Node %d MemFree:        %8lu kB\n"
>> +                           "Node %d PcpFree:        %8lu kB\n"
>>                              "Node %d MemUsed:        %8lu kB\n"
>>                              "Node %d SwapCached:     %8lu kB\n"
>>                              "Node %d Active:         %8lu kB\n"
>> @@ -397,7 +408,8 @@ static ssize_t node_read_meminfo(struct device *dev,
>>                              "Node %d Mlocked:        %8lu kB\n",
>>                              nid, K(i.totalram),
>>                              nid, K(i.freeram),
>> -                           nid, K(i.totalram - i.freeram),
>> +                           nid, K(free_pcp),
>> +                           nid, K(i.totalram - i.freeram - free_pcp),
>>                              nid, K(swapcached),
>>                              nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
>>                                     node_page_state(pgdat, NR_ACTIVE_FILE)),
>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>> index 6e89f0e2fd20..672c784dfc8a 100644
>> --- a/fs/proc/meminfo.c
>> +++ b/fs/proc/meminfo.c
>> @@ -38,6 +38,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>          unsigned long pages[NR_LRU_LISTS];
>>          unsigned long sreclaimable, sunreclaim;
>>          int lru;
>> +       unsigned long free_pcp = 0;
>> +       struct zone *zone;
>> +       int cpu;
>>
>>          si_meminfo(&i);
>>          si_swapinfo(&i);
>> @@ -55,8 +58,14 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>          sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
>>          sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
>>
>> +       for_each_populated_zone(zone) {
>> +               for_each_online_cpu(cpu)
>> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>> +       }
>> +
>>          show_val_kb(m, "MemTotal:       ", i.totalram);
>>          show_val_kb(m, "MemFree:        ", i.freeram);
>> +       show_val_kb(m, "PcpFree:        ", free_pcp);
>>          show_val_kb(m, "MemAvailable:   ", available);
>>          show_val_kb(m, "Buffers:        ", i.bufferram);
>>          show_val_kb(m, "Cached:         ", cached);
>> --
>> 2.35.3
>>
>>
> .

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16  9:16 ` Greg Kroah-Hartman
@ 2022-08-16 10:11   ` Kefeng Wang
  2022-08-16 10:50     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2022-08-16 10:11 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: Andrew Morton, linux-mm, linux-kernel, Liu Shixin


On 2022/8/16 17:16, Greg Kroah-Hartman wrote:
> On Tue, Aug 16, 2022 at 04:44:26PM +0800, Kefeng Wang wrote:
>> From: Liu Shixin <liushixin2@huawei.com>
>>
>> The page on pcplist could be used, but not counted into memory free or
>> avaliable, and pcp_free is only showed by show_mem(). Since commit
>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>> significant decrease in the display of free memory, with a large number
>> of cpus and nodes, the number of pages in the percpu list can be very
>> large, so it is better to let user to know the pcp count.
>>
>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>>   drivers/base/node.c | 14 +++++++++++++-
>>   fs/proc/meminfo.c   |  9 +++++++++
>>   2 files changed, 22 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index eb0f43784c2b..846864e45db6 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>>   	struct sysinfo i;
>>   	unsigned long sreclaimable, sunreclaimable;
>>   	unsigned long swapcached = 0;
>> +	unsigned long free_pcp = 0;
>> +	struct zone *zone;
>> +	int cpu;
>>   
>>   	si_meminfo_node(&i, nid);
>>   	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
>> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>>   #ifdef CONFIG_SWAP
>>   	swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>>   #endif
>> +	for_each_populated_zone(zone) {
>> +		if (zone_to_nid(zone) != nid)
>> +			continue;
>> +		for_each_online_cpu(cpu)
>> +			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>> +	}
>> +
>>   	len = sysfs_emit_at(buf, len,
>>   			    "Node %d MemTotal:       %8lu kB\n"
>>   			    "Node %d MemFree:        %8lu kB\n"
>> +			    "Node %d PcpFree:        %8lu kB\n"
> First off, this sysfs file is a huge violation of the normal sysfs
> rules, so I will not allow any new entries to be added.  In fact, the
> whole thing should just be removed and multiple files created in its
> place.

Hi Greg, do you mean to remove all /sys/devices/system/node/node*/meminfo,

but this will beak ABI, is it acceptable?

>
> Can you send a patch to do that instead please?
>
> thanks,
>
> greg k-h
> .

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16 10:11   ` Kefeng Wang
@ 2022-08-16 10:50     ` Greg Kroah-Hartman
  2022-08-16 12:03       ` Kefeng Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Greg Kroah-Hartman @ 2022-08-16 10:50 UTC (permalink / raw)
  To: Kefeng Wang; +Cc: Andrew Morton, linux-mm, linux-kernel, Liu Shixin

On Tue, Aug 16, 2022 at 06:11:18PM +0800, Kefeng Wang wrote:
> 
> On 2022/8/16 17:16, Greg Kroah-Hartman wrote:
> > On Tue, Aug 16, 2022 at 04:44:26PM +0800, Kefeng Wang wrote:
> > > From: Liu Shixin <liushixin2@huawei.com>
> > > 
> > > The page on pcplist could be used, but not counted into memory free or
> > > avaliable, and pcp_free is only showed by show_mem(). Since commit
> > > d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> > > significant decrease in the display of free memory, with a large number
> > > of cpus and nodes, the number of pages in the percpu list can be very
> > > large, so it is better to let user to know the pcp count.
> > > 
> > > Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> > > Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> > > ---
> > >   drivers/base/node.c | 14 +++++++++++++-
> > >   fs/proc/meminfo.c   |  9 +++++++++
> > >   2 files changed, 22 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > index eb0f43784c2b..846864e45db6 100644
> > > --- a/drivers/base/node.c
> > > +++ b/drivers/base/node.c
> > > @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
> > >   	struct sysinfo i;
> > >   	unsigned long sreclaimable, sunreclaimable;
> > >   	unsigned long swapcached = 0;
> > > +	unsigned long free_pcp = 0;
> > > +	struct zone *zone;
> > > +	int cpu;
> > >   	si_meminfo_node(&i, nid);
> > >   	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
> > > @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
> > >   #ifdef CONFIG_SWAP
> > >   	swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
> > >   #endif
> > > +	for_each_populated_zone(zone) {
> > > +		if (zone_to_nid(zone) != nid)
> > > +			continue;
> > > +		for_each_online_cpu(cpu)
> > > +			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> > > +	}
> > > +
> > >   	len = sysfs_emit_at(buf, len,
> > >   			    "Node %d MemTotal:       %8lu kB\n"
> > >   			    "Node %d MemFree:        %8lu kB\n"
> > > +			    "Node %d PcpFree:        %8lu kB\n"
> > First off, this sysfs file is a huge violation of the normal sysfs
> > rules, so I will not allow any new entries to be added.  In fact, the
> > whole thing should just be removed and multiple files created in its
> > place.
> 
> Hi Greg, do you mean to remove all /sys/devices/system/node/node*/meminfo,
> 
> but this will beak ABI, is it acceptable?

I do not know, what tool relies on this file?  Any userspace tool should
always be able to handle a sysfs file being removed, so you should
probably work with the tool authors to fix this up before removing it.

thanks,

gre gk-h

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16 10:50     ` Greg Kroah-Hartman
@ 2022-08-16 12:03       ` Kefeng Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Kefeng Wang @ 2022-08-16 12:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Andrew Morton, linux-mm, linux-kernel, Liu Shixin,
	Vlastimil Babka, David Hildenbrand, Shakeel Butt


On 2022/8/16 18:50, Greg Kroah-Hartman wrote:
> On Tue, Aug 16, 2022 at 06:11:18PM +0800, Kefeng Wang wrote:
>> On 2022/8/16 17:16, Greg Kroah-Hartman wrote:
>>> On Tue, Aug 16, 2022 at 04:44:26PM +0800, Kefeng Wang wrote:
>>>> From: Liu Shixin <liushixin2@huawei.com>
>>>>
>>>> The page on pcplist could be used, but not counted into memory free or
>>>> avaliable, and pcp_free is only showed by show_mem(). Since commit
>>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>>>> significant decrease in the display of free memory, with a large number
>>>> of cpus and nodes, the number of pages in the percpu list can be very
>>>> large, so it is better to let user to know the pcp count.
>>>>
>>>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>>    drivers/base/node.c | 14 +++++++++++++-
>>>>    fs/proc/meminfo.c   |  9 +++++++++
>>>>    2 files changed, 22 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index eb0f43784c2b..846864e45db6 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>    	struct sysinfo i;
>>>>    	unsigned long sreclaimable, sunreclaimable;
>>>>    	unsigned long swapcached = 0;
>>>> +	unsigned long free_pcp = 0;
>>>> +	struct zone *zone;
>>>> +	int cpu;
>>>>    	si_meminfo_node(&i, nid);
>>>>    	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
>>>> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>    #ifdef CONFIG_SWAP
>>>>    	swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>>>>    #endif
>>>> +	for_each_populated_zone(zone) {
>>>> +		if (zone_to_nid(zone) != nid)
>>>> +			continue;
>>>> +		for_each_online_cpu(cpu)
>>>> +			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>>>> +	}
>>>> +
>>>>    	len = sysfs_emit_at(buf, len,
>>>>    			    "Node %d MemTotal:       %8lu kB\n"
>>>>    			    "Node %d MemFree:        %8lu kB\n"
>>>> +			    "Node %d PcpFree:        %8lu kB\n"
>>> First off, this sysfs file is a huge violation of the normal sysfs
>>> rules, so I will not allow any new entries to be added.  In fact, the
>>> whole thing should just be removed and multiple files created in its
>>> place.
>> Hi Greg, do you mean to remove all /sys/devices/system/node/node*/meminfo,
>>
>> but this will beak ABI, is it acceptable?
> I do not know, what tool relies on this file?  Any userspace tool should
> always be able to handle a sysfs file being removed, so you should
> probably work with the tool authors to fix this up before removing it.

https://github.com/numactl/numactl/blob/93867c59b0bb29470873a427dc7f06ebaf305221/numastat.c#L841

It seems numastat use this file, but there are maybe more tools,

Cc'ed more mm experts to see more comments.

>
> thanks,
>
> gre gk-h
> .

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16  8:44 [PATCH RFC] mm, proc: add PcpFree to meminfo Kefeng Wang
  2022-08-16  8:48 ` huang ying
  2022-08-16  9:16 ` Greg Kroah-Hartman
@ 2022-08-17  7:16 ` Kefeng Wang
  2022-08-18 21:07   ` Dave Hansen
  2 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2022-08-17  7:16 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman, linux-mm, Aaron Lu,
	Michal Hocko, Jesper Dangaard Brouer, Vlastimil Babka,
	Dave Hansen, Kemi Wang
  Cc: linux-kernel, Liu Shixin

On 2022/8/16 16:44, Kefeng Wang wrote:
> From: Liu Shixin <liushixin2@huawei.com>
>
> The page on pcplist could be used, but not counted into memory free or
> avaliable, and pcp_free is only showed by show_mem(). Since commit
> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> significant decrease in the display of free memory, with a large number
> of cpus and nodes, the number of pages in the percpu list can be very
> large, so it is better to let user to know the pcp count.
Add more experts according to commit d8a759b57035,
any advice would be much appreciated,thanks.
> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   drivers/base/node.c | 14 +++++++++++++-
>   fs/proc/meminfo.c   |  9 +++++++++
>   2 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb0f43784c2b..846864e45db6 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>   	struct sysinfo i;
>   	unsigned long sreclaimable, sunreclaimable;
>   	unsigned long swapcached = 0;
> +	unsigned long free_pcp = 0;
> +	struct zone *zone;
> +	int cpu;
>   
>   	si_meminfo_node(&i, nid);
>   	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>   #ifdef CONFIG_SWAP
>   	swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>   #endif
> +	for_each_populated_zone(zone) {
> +		if (zone_to_nid(zone) != nid)
> +			continue;
> +		for_each_online_cpu(cpu)
> +			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> +	}
> +
>   	len = sysfs_emit_at(buf, len,
>   			    "Node %d MemTotal:       %8lu kB\n"
>   			    "Node %d MemFree:        %8lu kB\n"
> +			    "Node %d PcpFree:        %8lu kB\n"
>   			    "Node %d MemUsed:        %8lu kB\n"
>   			    "Node %d SwapCached:     %8lu kB\n"
>   			    "Node %d Active:         %8lu kB\n"
> @@ -397,7 +408,8 @@ static ssize_t node_read_meminfo(struct device *dev,
>   			    "Node %d Mlocked:        %8lu kB\n",
>   			    nid, K(i.totalram),
>   			    nid, K(i.freeram),
> -			    nid, K(i.totalram - i.freeram),
> +			    nid, K(free_pcp),
> +			    nid, K(i.totalram - i.freeram - free_pcp),
>   			    nid, K(swapcached),
>   			    nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
>   				   node_page_state(pgdat, NR_ACTIVE_FILE)),
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 6e89f0e2fd20..672c784dfc8a 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -38,6 +38,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>   	unsigned long pages[NR_LRU_LISTS];
>   	unsigned long sreclaimable, sunreclaim;
>   	int lru;
> +	unsigned long free_pcp = 0;
> +	struct zone *zone;
> +	int cpu;
>   
>   	si_meminfo(&i);
>   	si_swapinfo(&i);
> @@ -55,8 +58,14 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>   	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
>   	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
>   
> +	for_each_populated_zone(zone) {
> +		for_each_online_cpu(cpu)
> +			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> +	}
> +
>   	show_val_kb(m, "MemTotal:       ", i.totalram);
>   	show_val_kb(m, "MemFree:        ", i.freeram);
> +	show_val_kb(m, "PcpFree:        ", free_pcp);
>   	show_val_kb(m, "MemAvailable:   ", available);
>   	show_val_kb(m, "Buffers:        ", i.bufferram);
>   	show_val_kb(m, "Cached:         ", cached);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-17  7:16 ` Kefeng Wang
@ 2022-08-18 21:07   ` Dave Hansen
  2022-08-19  1:06     ` Kefeng Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2022-08-18 21:07 UTC (permalink / raw)
  To: Kefeng Wang, Andrew Morton, Greg Kroah-Hartman, linux-mm,
	Aaron Lu, Michal Hocko, Jesper Dangaard Brouer, Vlastimil Babka,
	Kemi Wang
  Cc: linux-kernel, Liu Shixin

On 8/17/22 00:16, Kefeng Wang wrote:
> On 2022/8/16 16:44, Kefeng Wang wrote:
>> From: Liu Shixin <liushixin2@huawei.com>
>>
>> The page on pcplist could be used, but not counted into memory free or
>> avaliable, and pcp_free is only showed by show_mem(). Since commit
>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>> significant decrease in the display of free memory, with a large number
>> of cpus and nodes, the number of pages in the percpu list can be very
>> large, so it is better to let user to know the pcp count.
> Add more experts according to commit d8a759b57035,
> any advice would be much appreciated,thanks.

Adding a new meminfo field seems like overkill.  I'd just make this a
part of MemFree if anything.

Also, some actual data would be nice here.  Like:

	Before commit d8a759b57035, the maximum amount of pages in the
	pcp lists was theoretically $FOO MB.  After the patch, the lists
	can hold $BAR MB.  It has been observed to be $BAZ MB in
	practice.

	This was all on a system with $X memory NUMA nodes and $Y CPUs.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-18 21:07   ` Dave Hansen
@ 2022-08-19  1:06     ` Kefeng Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Kefeng Wang @ 2022-08-19  1:06 UTC (permalink / raw)
  To: Dave Hansen, Andrew Morton, Greg Kroah-Hartman, linux-mm,
	Aaron Lu, Michal Hocko, Jesper Dangaard Brouer, Vlastimil Babka,
	Kemi Wang
  Cc: linux-kernel, Liu Shixin


On 2022/8/19 5:07, Dave Hansen wrote:
> On 8/17/22 00:16, Kefeng Wang wrote:
>> On 2022/8/16 16:44, Kefeng Wang wrote:
>>> From: Liu Shixin <liushixin2@huawei.com>
>>>
>>> The page on pcplist could be used, but not counted into memory free or
>>> avaliable, and pcp_free is only showed by show_mem(). Since commit
>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>>> significant decrease in the display of free memory, with a large number
>>> of cpus and nodes, the number of pages in the percpu list can be very
>>> large, so it is better to let user to know the pcp count.
>> Add more experts according to commit d8a759b57035,
>> any advice would be much appreciated,thanks.
> Adding a new meminfo field seems like overkill.  I'd just make this a
> part of MemFree if anything.
I like this way too.
> Also, some actual data would be nice here.  Like:
>
> 	Before commit d8a759b57035, the maximum amount of pages in the
> 	pcp lists was theoretically $FOO MB.  After the patch, the lists
> 	can hold $BAR MB.  It has been observed to be $BAZ MB in
> 	practice.
>
> 	This was all on a system with $X memory NUMA nodes and $Y CPUs.
>
> .

Same question in [1],  will repost with update, thanks.

[1] 
https://lore.kernel.org/linux-mm/03f465ca-cf8e-bbd1-1083-099fd2ce026d@huawei.com/t/#m50cf15911d9d203bd97238512fa2ae9ba1bd9e1e


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-16  9:24   ` Kefeng Wang
@ 2022-08-19  7:40     ` Aaron Lu
  2022-08-19  9:53       ` Liu Shixin
  0 siblings, 1 reply; 14+ messages in thread
From: Aaron Lu @ 2022-08-19  7:40 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: huang ying, Andrew Morton, Greg Kroah-Hartman, linux-mm,
	linux-kernel, Liu Shixin, Huang Ying

On Tue, Aug 16, 2022 at 05:24:07PM +0800, Kefeng Wang wrote:
> 
> On 2022/8/16 16:48, huang ying wrote:
> > On Tue, Aug 16, 2022 at 4:38 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> > > From: Liu Shixin <liushixin2@huawei.com>
> > > 
> > > The page on pcplist could be used, but not counted into memory free or
> > > avaliable, and pcp_free is only showed by show_mem(). Since commit
> > > d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> > > significant decrease in the display of free memory, with a large number
> > > of cpus and nodes, the number of pages in the percpu list can be very
> > > large, so it is better to let user to know the pcp count.
> > Can you show some data?
> 
> 80M+ with 72cpus/2node
> 

80M+ for a 2 node system doesn't sound like a significant number.

> > 
> > Another choice is to count PCP free pages in MemFree.  Is that OK for
> > your use case too?
> 
> Yes, the user will make policy according to MemFree, we think count PCP free
> pages
> 
> in MemFree is better, but don't know whether it is right way.
> 

Is there a real problem where user makes a sub-optimal policy due to the
not accounted 80M+ free memory?

Counting PCP pages as free seems natural, since they are indeed free
pages. One concern is, there might be much more calls of
__mod_zone_freepage_state() if you do free page counting for PCP pages,
not sure if that would hurt performance. Also, you will need to
differentiate in __free_one_page() whether counting for free pages are
still needed since some pages are freed through PCP(and thus already
counted) while some are not.

BTW, since commit b92ca18e8ca59("mm/page_alloc: disassociate the pcp->high
from pcp->batch"), pcp size is no longer associated with batch size. Is
it that you are testing on an older kernel?

Thanks,
Aaron

> > > Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> > > Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> > > ---
> > >   drivers/base/node.c | 14 +++++++++++++-
> > >   fs/proc/meminfo.c   |  9 +++++++++
> > >   2 files changed, 22 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > index eb0f43784c2b..846864e45db6 100644
> > > --- a/drivers/base/node.c
> > > +++ b/drivers/base/node.c
> > > @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
> > >          struct sysinfo i;
> > >          unsigned long sreclaimable, sunreclaimable;
> > >          unsigned long swapcached = 0;
> > > +       unsigned long free_pcp = 0;
> > > +       struct zone *zone;
> > > +       int cpu;
> > > 
> > >          si_meminfo_node(&i, nid);
> > >          sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
> > > @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
> > >   #ifdef CONFIG_SWAP
> > >          swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
> > >   #endif
> > > +       for_each_populated_zone(zone) {
> > > +               if (zone_to_nid(zone) != nid)
> > > +                       continue;
> > > +               for_each_online_cpu(cpu)
> > > +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> > > +       }
> > > +
> > >          len = sysfs_emit_at(buf, len,
> > >                              "Node %d MemTotal:       %8lu kB\n"
> > >                              "Node %d MemFree:        %8lu kB\n"
> > > +                           "Node %d PcpFree:        %8lu kB\n"
> > >                              "Node %d MemUsed:        %8lu kB\n"
> > >                              "Node %d SwapCached:     %8lu kB\n"
> > >                              "Node %d Active:         %8lu kB\n"
> > > @@ -397,7 +408,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> > >                              "Node %d Mlocked:        %8lu kB\n",
> > >                              nid, K(i.totalram),
> > >                              nid, K(i.freeram),
> > > -                           nid, K(i.totalram - i.freeram),
> > > +                           nid, K(free_pcp),
> > > +                           nid, K(i.totalram - i.freeram - free_pcp),
> > >                              nid, K(swapcached),
> > >                              nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
> > >                                     node_page_state(pgdat, NR_ACTIVE_FILE)),
> > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > index 6e89f0e2fd20..672c784dfc8a 100644
> > > --- a/fs/proc/meminfo.c
> > > +++ b/fs/proc/meminfo.c
> > > @@ -38,6 +38,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >          unsigned long pages[NR_LRU_LISTS];
> > >          unsigned long sreclaimable, sunreclaim;
> > >          int lru;
> > > +       unsigned long free_pcp = 0;
> > > +       struct zone *zone;
> > > +       int cpu;
> > > 
> > >          si_meminfo(&i);
> > >          si_swapinfo(&i);
> > > @@ -55,8 +58,14 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >          sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
> > >          sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
> > > 
> > > +       for_each_populated_zone(zone) {
> > > +               for_each_online_cpu(cpu)
> > > +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> > > +       }
> > > +
> > >          show_val_kb(m, "MemTotal:       ", i.totalram);
> > >          show_val_kb(m, "MemFree:        ", i.freeram);
> > > +       show_val_kb(m, "PcpFree:        ", free_pcp);
> > >          show_val_kb(m, "MemAvailable:   ", available);
> > >          show_val_kb(m, "Buffers:        ", i.bufferram);
> > >          show_val_kb(m, "Cached:         ", cached);
> > > --
> > > 2.35.3
> > > 
> > > 
> > .

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-19  7:40     ` Aaron Lu
@ 2022-08-19  9:53       ` Liu Shixin
  2022-08-19 10:02         ` Aaron Lu
  2022-08-22  7:27         ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Liu Shixin @ 2022-08-19  9:53 UTC (permalink / raw)
  To: Aaron Lu, Kefeng Wang
  Cc: huang ying, Andrew Morton, Greg Kroah-Hartman, linux-mm,
	linux-kernel, Huang Ying



On 2022/8/19 15:40, Aaron Lu wrote:
> On Tue, Aug 16, 2022 at 05:24:07PM +0800, Kefeng Wang wrote:
>> On 2022/8/16 16:48, huang ying wrote:
>>> On Tue, Aug 16, 2022 at 4:38 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>> From: Liu Shixin <liushixin2@huawei.com>
>>>>
>>>> The page on pcplist could be used, but not counted into memory free or
>>>> avaliable, and pcp_free is only showed by show_mem(). Since commit
>>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>>>> significant decrease in the display of free memory, with a large number
>>>> of cpus and nodes, the number of pages in the percpu list can be very
>>>> large, so it is better to let user to know the pcp count.
>>> Can you show some data?
>> 80M+ with 72cpus/2node
>>
> 80M+ for a 2 node system doesn't sound like a significant number.
>
>>> Another choice is to count PCP free pages in MemFree.  Is that OK for
>>> your use case too?
>> Yes, the user will make policy according to MemFree, we think count PCP free
>> pages
>>
>> in MemFree is better, but don't know whether it is right way.
>>
> Is there a real problem where user makes a sub-optimal policy due to the
> not accounted 80M+ free memory?
I need to explain that 80M+ is the increased after patch d8a759b57035. Actually in my test,
the pcplist is about 114M after system startup, and in high load scenarios, the pcplist memory
can reach 300M+.
The downstream has sensed the memory change after the kernel is updated, which has an
actual impact on them. That's why I sent this patch to ask if should count this
part of memory.

> Counting PCP pages as free seems natural, since they are indeed free
> pages. One concern is, there might be much more calls of
> __mod_zone_freepage_state() if you do free page counting for PCP pages,
> not sure if that would hurt performance. Also, you will need to
> differentiate in __free_one_page() whether counting for free pages are
> still needed since some pages are freed through PCP(and thus already
> counted) while some are not.
I prefer to add this part of memory into free only when calculating MemFree and
MemAvailable, without modifying other statistics to avoid directly hurt performance
or cause other performance problems. How about this way?

> BTW, since commit b92ca18e8ca59("mm/page_alloc: disassociate the pcp->high
> from pcp->batch"), pcp size is no longer associated with batch size. Is
> it that you are testing on an older kernel?
I met the problem on stable-5.10. I think this patch can't fix the problem I met. Futher,
this series of patch make pcp->high to be greater. So the problem becomes even more
acute in mainline.

Thanks,
>
> Thanks,
> Aaron
>
>>>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>>   drivers/base/node.c | 14 +++++++++++++-
>>>>   fs/proc/meminfo.c   |  9 +++++++++
>>>>   2 files changed, 22 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index eb0f43784c2b..846864e45db6 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>          struct sysinfo i;
>>>>          unsigned long sreclaimable, sunreclaimable;
>>>>          unsigned long swapcached = 0;
>>>> +       unsigned long free_pcp = 0;
>>>> +       struct zone *zone;
>>>> +       int cpu;
>>>>
>>>>          si_meminfo_node(&i, nid);
>>>>          sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
>>>> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>   #ifdef CONFIG_SWAP
>>>>          swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>>>>   #endif
>>>> +       for_each_populated_zone(zone) {
>>>> +               if (zone_to_nid(zone) != nid)
>>>> +                       continue;
>>>> +               for_each_online_cpu(cpu)
>>>> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>>>> +       }
>>>> +
>>>>          len = sysfs_emit_at(buf, len,
>>>>                              "Node %d MemTotal:       %8lu kB\n"
>>>>                              "Node %d MemFree:        %8lu kB\n"
>>>> +                           "Node %d PcpFree:        %8lu kB\n"
>>>>                              "Node %d MemUsed:        %8lu kB\n"
>>>>                              "Node %d SwapCached:     %8lu kB\n"
>>>>                              "Node %d Active:         %8lu kB\n"
>>>> @@ -397,7 +408,8 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>                              "Node %d Mlocked:        %8lu kB\n",
>>>>                              nid, K(i.totalram),
>>>>                              nid, K(i.freeram),
>>>> -                           nid, K(i.totalram - i.freeram),
>>>> +                           nid, K(free_pcp),
>>>> +                           nid, K(i.totalram - i.freeram - free_pcp),
>>>>                              nid, K(swapcached),
>>>>                              nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
>>>>                                     node_page_state(pgdat, NR_ACTIVE_FILE)),
>>>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>>>> index 6e89f0e2fd20..672c784dfc8a 100644
>>>> --- a/fs/proc/meminfo.c
>>>> +++ b/fs/proc/meminfo.c
>>>> @@ -38,6 +38,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>>>          unsigned long pages[NR_LRU_LISTS];
>>>>          unsigned long sreclaimable, sunreclaim;
>>>>          int lru;
>>>> +       unsigned long free_pcp = 0;
>>>> +       struct zone *zone;
>>>> +       int cpu;
>>>>
>>>>          si_meminfo(&i);
>>>>          si_swapinfo(&i);
>>>> @@ -55,8 +58,14 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>>>          sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
>>>>          sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
>>>>
>>>> +       for_each_populated_zone(zone) {
>>>> +               for_each_online_cpu(cpu)
>>>> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>>>> +       }
>>>> +
>>>>          show_val_kb(m, "MemTotal:       ", i.totalram);
>>>>          show_val_kb(m, "MemFree:        ", i.freeram);
>>>> +       show_val_kb(m, "PcpFree:        ", free_pcp);
>>>>          show_val_kb(m, "MemAvailable:   ", available);
>>>>          show_val_kb(m, "Buffers:        ", i.bufferram);
>>>>          show_val_kb(m, "Cached:         ", cached);
>>>> --
>>>> 2.35.3
>>>>
>>>>
>>> .
> .
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-19  9:53       ` Liu Shixin
@ 2022-08-19 10:02         ` Aaron Lu
  2022-08-22  7:27         ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Aaron Lu @ 2022-08-19 10:02 UTC (permalink / raw)
  To: Liu Shixin, Kefeng Wang
  Cc: huang ying, Andrew Morton, Greg Kroah-Hartman, linux-mm,
	linux-kernel, Huang Ying

On 8/19/2022 5:53 PM, Liu Shixin wrote:
> 
> 
> On 2022/8/19 15:40, Aaron Lu wrote:
>> On Tue, Aug 16, 2022 at 05:24:07PM +0800, Kefeng Wang wrote:
>>> On 2022/8/16 16:48, huang ying wrote:
>>>> On Tue, Aug 16, 2022 at 4:38 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>> From: Liu Shixin <liushixin2@huawei.com>
>>>>>
>>>>> The page on pcplist could be used, but not counted into memory free or
>>>>> avaliable, and pcp_free is only showed by show_mem(). Since commit
>>>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>>>>> significant decrease in the display of free memory, with a large number
>>>>> of cpus and nodes, the number of pages in the percpu list can be very
>>>>> large, so it is better to let user to know the pcp count.
>>>> Can you show some data?
>>> 80M+ with 72cpus/2node
>>>
>> 80M+ for a 2 node system doesn't sound like a significant number.
>>
>>>> Another choice is to count PCP free pages in MemFree.  Is that OK for
>>>> your use case too?
>>> Yes, the user will make policy according to MemFree, we think count PCP free
>>> pages
>>>
>>> in MemFree is better, but don't know whether it is right way.
>>>
>> Is there a real problem where user makes a sub-optimal policy due to the
>> not accounted 80M+ free memory?
> I need to explain that 80M+ is the increased after patch d8a759b57035. Actually in my test,
> the pcplist is about 114M after system startup, and in high load scenarios, the pcplist memory
> can reach 300M+.
> The downstream has sensed the memory change after the kernel is updated, which has an
> actual impact on them. That's why I sent this patch to ask if should count this
> part of memory.
> 
>> Counting PCP pages as free seems natural, since they are indeed free
>> pages. One concern is, there might be much more calls of
>> __mod_zone_freepage_state() if you do free page counting for PCP pages,
>> not sure if that would hurt performance. Also, you will need to
>> differentiate in __free_one_page() whether counting for free pages are
>> still needed since some pages are freed through PCP(and thus already
>> counted) while some are not.
> I prefer to add this part of memory into free only when calculating MemFree and
> MemAvailable, without modifying other statistics to avoid directly hurt performance
> or cause other performance problems. How about this way?
>

That sounds workable to me.

Thanks,
Aaron

>> BTW, since commit b92ca18e8ca59("mm/page_alloc: disassociate the pcp->high
>> from pcp->batch"), pcp size is no longer associated with batch size. Is
>> it that you are testing on an older kernel?
> I met the problem on stable-5.10. I think this patch can't fix the problem I met. Futher,
> this series of patch make pcp->high to be greater. So the problem becomes even more
> acute in mainline.
> 
> Thanks,
>>
>> Thanks,
>> Aaron
>>
>>>>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> ---
>>>>>   drivers/base/node.c | 14 +++++++++++++-
>>>>>   fs/proc/meminfo.c   |  9 +++++++++
>>>>>   2 files changed, 22 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>>> index eb0f43784c2b..846864e45db6 100644
>>>>> --- a/drivers/base/node.c
>>>>> +++ b/drivers/base/node.c
>>>>> @@ -375,6 +375,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>>          struct sysinfo i;
>>>>>          unsigned long sreclaimable, sunreclaimable;
>>>>>          unsigned long swapcached = 0;
>>>>> +       unsigned long free_pcp = 0;
>>>>> +       struct zone *zone;
>>>>> +       int cpu;
>>>>>
>>>>>          si_meminfo_node(&i, nid);
>>>>>          sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
>>>>> @@ -382,9 +385,17 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>>   #ifdef CONFIG_SWAP
>>>>>          swapcached = node_page_state_pages(pgdat, NR_SWAPCACHE);
>>>>>   #endif
>>>>> +       for_each_populated_zone(zone) {
>>>>> +               if (zone_to_nid(zone) != nid)
>>>>> +                       continue;
>>>>> +               for_each_online_cpu(cpu)
>>>>> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>>>>> +       }
>>>>> +
>>>>>          len = sysfs_emit_at(buf, len,
>>>>>                              "Node %d MemTotal:       %8lu kB\n"
>>>>>                              "Node %d MemFree:        %8lu kB\n"
>>>>> +                           "Node %d PcpFree:        %8lu kB\n"
>>>>>                              "Node %d MemUsed:        %8lu kB\n"
>>>>>                              "Node %d SwapCached:     %8lu kB\n"
>>>>>                              "Node %d Active:         %8lu kB\n"
>>>>> @@ -397,7 +408,8 @@ static ssize_t node_read_meminfo(struct device *dev,
>>>>>                              "Node %d Mlocked:        %8lu kB\n",
>>>>>                              nid, K(i.totalram),
>>>>>                              nid, K(i.freeram),
>>>>> -                           nid, K(i.totalram - i.freeram),
>>>>> +                           nid, K(free_pcp),
>>>>> +                           nid, K(i.totalram - i.freeram - free_pcp),
>>>>>                              nid, K(swapcached),
>>>>>                              nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
>>>>>                                     node_page_state(pgdat, NR_ACTIVE_FILE)),
>>>>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>>>>> index 6e89f0e2fd20..672c784dfc8a 100644
>>>>> --- a/fs/proc/meminfo.c
>>>>> +++ b/fs/proc/meminfo.c
>>>>> @@ -38,6 +38,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>>>>          unsigned long pages[NR_LRU_LISTS];
>>>>>          unsigned long sreclaimable, sunreclaim;
>>>>>          int lru;
>>>>> +       unsigned long free_pcp = 0;
>>>>> +       struct zone *zone;
>>>>> +       int cpu;
>>>>>
>>>>>          si_meminfo(&i);
>>>>>          si_swapinfo(&i);
>>>>> @@ -55,8 +58,14 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>>>>          sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
>>>>>          sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
>>>>>
>>>>> +       for_each_populated_zone(zone) {
>>>>> +               for_each_online_cpu(cpu)
>>>>> +                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>>>>> +       }
>>>>> +
>>>>>          show_val_kb(m, "MemTotal:       ", i.totalram);
>>>>>          show_val_kb(m, "MemFree:        ", i.freeram);
>>>>> +       show_val_kb(m, "PcpFree:        ", free_pcp);
>>>>>          show_val_kb(m, "MemAvailable:   ", available);
>>>>>          show_val_kb(m, "Buffers:        ", i.bufferram);
>>>>>          show_val_kb(m, "Cached:         ", cached);
>>>>> --
>>>>> 2.35.3
>>>>>
>>>>>
>>>> .
>> .
>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC] mm, proc: add PcpFree to meminfo
  2022-08-19  9:53       ` Liu Shixin
  2022-08-19 10:02         ` Aaron Lu
@ 2022-08-22  7:27         ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2022-08-22  7:27 UTC (permalink / raw)
  To: Liu Shixin
  Cc: Aaron Lu, Kefeng Wang, huang ying, Andrew Morton,
	Greg Kroah-Hartman, linux-mm, linux-kernel, Huang Ying

On Fri 19-08-22 17:53:27, Liu Shixin wrote:
> 
> 
> On 2022/8/19 15:40, Aaron Lu wrote:
> > On Tue, Aug 16, 2022 at 05:24:07PM +0800, Kefeng Wang wrote:
> >> On 2022/8/16 16:48, huang ying wrote:
> >>> On Tue, Aug 16, 2022 at 4:38 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>> From: Liu Shixin <liushixin2@huawei.com>
> >>>>
> >>>> The page on pcplist could be used, but not counted into memory free or
> >>>> avaliable, and pcp_free is only showed by show_mem(). Since commit
> >>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> >>>> significant decrease in the display of free memory, with a large number
> >>>> of cpus and nodes, the number of pages in the percpu list can be very
> >>>> large, so it is better to let user to know the pcp count.
> >>> Can you show some data?
> >> 80M+ with 72cpus/2node

I would expect that such system would have quite some memory as well and
80MB wouldn't be a really noticeable. What is that amount in %tage 

> > 80M+ for a 2 node system doesn't sound like a significant number.
> >
> >>> Another choice is to count PCP free pages in MemFree.  Is that OK for
> >>> your use case too?
> >> Yes, the user will make policy according to MemFree, we think count PCP free
> >> pages
> >>
> >> in MemFree is better, but don't know whether it is right way.
> >>
> > Is there a real problem where user makes a sub-optimal policy due to the
> > not accounted 80M+ free memory?
> I need to explain that 80M+ is the increased after patch d8a759b57035. Actually in my test,
> the pcplist is about 114M after system startup, and in high load scenarios, the pcplist memory
> can reach 300M+.
> The downstream has sensed the memory change after the kernel is updated, which has an
> actual impact on them. That's why I sent this patch to ask if should count this
> part of memory.

It would be really great to be more explicit about this. Because if this
is really runtime noticeable then we might need to consider an improved
tunning or way to manually configure the pcp batch sizes. Reporting the
amount on its own is unlikely going to help without being able to do
anything about that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-08-22  7:27 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-16  8:44 [PATCH RFC] mm, proc: add PcpFree to meminfo Kefeng Wang
2022-08-16  8:48 ` huang ying
2022-08-16  9:24   ` Kefeng Wang
2022-08-19  7:40     ` Aaron Lu
2022-08-19  9:53       ` Liu Shixin
2022-08-19 10:02         ` Aaron Lu
2022-08-22  7:27         ` Michal Hocko
2022-08-16  9:16 ` Greg Kroah-Hartman
2022-08-16 10:11   ` Kefeng Wang
2022-08-16 10:50     ` Greg Kroah-Hartman
2022-08-16 12:03       ` Kefeng Wang
2022-08-17  7:16 ` Kefeng Wang
2022-08-18 21:07   ` Dave Hansen
2022-08-19  1:06     ` Kefeng Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).