Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
@ 2020-07-14 17:39 Roman Gushchin
  2020-07-20  8:03 ` Michal Hocko
  2020-07-30  3:45 ` Hugh Dickins
  0 siblings, 2 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-07-14 17:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin, Hugh Dickins

I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
-5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
hosts. The numbers of these warnings were relatively low and stable,
so it didn't look like we are systematically leaking the counters.
The corresponding vmstat counters also looked sane.

These warnings are generated by the vmstat_refresh() function, which
assumes that atomic zone and numa counters can't go below zero.
However, on a SMP machine it's not quite right: due to per-cpu
caching it can in theory be as low as -(zone threshold) * NR_CPUs.

For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
reached 0. Then we've reclaimed a small number of cma pages on each
CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
slightly positive (the atomic counter is still 0). Then somebody on
CPU0 consumes all these pages. The number of pages can easily exceed
the threshold and a negative value will be committed to the atomic
counter.

To fix the problem and avoid generating false warnings, let's just
relax the condition and warn only if the value is less than minus
the maximum theoretically possible drift value, which is 125 *
number of online CPUs. It will still allow to catch systematic leaks,
but will not generate bogus warnings.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Hugh Dickins <hughd@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  4 ++--
 mm/vmstat.c                             | 30 ++++++++++++++++---------
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4b9d2e8e9142..95fb80d0c606 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
 
 As a side-effect, it also checks for negative totals (elsewhere reported
 as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
-(At time of writing, a few stats are known sometimes to be found negative,
-with no ill effects: errors and warnings on these stats are suppressed.)
+(On a SMP machine some stats can temporarily become negative, with no ill
+effects: errors and warnings on these stats are suppressed.)
 
 
 numa_stat
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a21140373edb..8f0ef8aaf8ee 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
 
+#define MAX_THRESHOLD 125
+
 int calculate_pressure_threshold(struct zone *zone)
 {
 	int threshold;
@@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone)
 	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
 
 	/*
-	 * Maximum threshold is 125
+	 * Threshold is capped by MAX_THRESHOLD
 	 */
-	threshold = min(125, threshold);
-
-	return threshold;
+	return min(MAX_THRESHOLD, threshold);
 }
 
 int calculate_normal_threshold(struct zone *zone)
@@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item)
 }
 EXPORT_SYMBOL(dec_node_page_state);
 #else
+
+#define MAX_THRESHOLD 0
+
 /*
  * Use interrupt disable to serialize counter updates
  */
@@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work)
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
-	long val;
+	long val, max_drift;
 	int err;
 	int i;
 
@@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write,
 	 * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
 	 * which can equally be echo'ed to or cat'ted from (by root),
 	 * can be used to update the stats just before reading them.
-	 *
-	 * Oh, and since global_zone_page_state() etc. are so careful to hide
-	 * transiently negative values, report an error here if any of
-	 * the stats is negative, so we know to go looking for imbalance.
 	 */
 	err = schedule_on_each_cpu(refresh_vm_stats);
 	if (err)
 		return err;
+
+	/*
+	 * Since global_zone_page_state() etc. are so careful to hide
+	 * transiently negative values, report an error here if any of
+	 * the stats is negative and are less than the maximum drift value,
+	 * so we know to go looking for imbalance.
+	 */
+	max_drift = num_online_cpus() * MAX_THRESHOLD;
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
-		if (val < 0) {
+		if (val < -max_drift) {
 			pr_warn("%s: %s %ld\n",
 				__func__, zone_stat_name(i), val);
 			err = -EINVAL;
@@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < 0) {
+		if (val < -max_drift) {
 			pr_warn("%s: %s %ld\n",
 				__func__, numa_stat_name(i), val);
 			err = -EINVAL;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
  2020-07-14 17:39 [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings Roman Gushchin
@ 2020-07-20  8:03 ` Michal Hocko
  2020-07-20 20:20   ` Roman Gushchin
  2020-07-30  3:45 ` Hugh Dickins
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2020-07-20  8:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, linux-mm, kernel-team,
	linux-kernel, Hugh Dickins

On Tue 14-07-20 10:39:20, Roman Gushchin wrote:
> I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> hosts. The numbers of these warnings were relatively low and stable,
> so it didn't look like we are systematically leaking the counters.
> The corresponding vmstat counters also looked sane.
> 
> These warnings are generated by the vmstat_refresh() function, which
> assumes that atomic zone and numa counters can't go below zero.
> However, on a SMP machine it's not quite right: due to per-cpu
> caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> 
> For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> reached 0. Then we've reclaimed a small number of cma pages on each
> CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> slightly positive (the atomic counter is still 0). Then somebody on
> CPU0 consumes all these pages. The number of pages can easily exceed
> the threshold and a negative value will be committed to the atomic
> counter.
> 
> To fix the problem and avoid generating false warnings, let's just
> relax the condition and warn only if the value is less than minus
> the maximum theoretically possible drift value, which is 125 *
> number of online CPUs. It will still allow to catch systematic leaks,
> but will not generate bogus warnings.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

One minor nit which can be handled by a separate patch but now that you
are touching this code...

> ---
>  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
>  mm/vmstat.c                             | 30 ++++++++++++++++---------
>  2 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 4b9d2e8e9142..95fb80d0c606 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
>  
>  As a side-effect, it also checks for negative totals (elsewhere reported
>  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> -(At time of writing, a few stats are known sometimes to be found negative,
> -with no ill effects: errors and warnings on these stats are suppressed.)
> +(On a SMP machine some stats can temporarily become negative, with no ill
> +effects: errors and warnings on these stats are suppressed.)
>  
>  
>  numa_stat
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index a21140373edb..8f0ef8aaf8ee 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
>  
>  #ifdef CONFIG_SMP
>  
> +#define MAX_THRESHOLD 125

This would deserve a comment. 88f5acf88ae6a didn't really explain why
this specific value has been selected but the specific value shouldn't
really matter much. I would go with the following at least.
"
Maximum sync threshold for per-cpu vmstat counters. 
"
> +
>  int calculate_pressure_threshold(struct zone *zone)
>  {
>  	int threshold;
> @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone)
>  	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
>  
>  	/*
> -	 * Maximum threshold is 125
> +	 * Threshold is capped by MAX_THRESHOLD
>  	 */
> -	threshold = min(125, threshold);
> -
> -	return threshold;
> +	return min(MAX_THRESHOLD, threshold);
>  }
>  
>  int calculate_normal_threshold(struct zone *zone)
> @@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item)
>  }
>  EXPORT_SYMBOL(dec_node_page_state);
>  #else
> +
> +#define MAX_THRESHOLD 0
> +
>  /*
>   * Use interrupt disable to serialize counter updates
>   */
> @@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work)
>  int vmstat_refresh(struct ctl_table *table, int write,
>  		   void *buffer, size_t *lenp, loff_t *ppos)
>  {
> -	long val;
> +	long val, max_drift;
>  	int err;
>  	int i;
>  
> @@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	 * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
>  	 * which can equally be echo'ed to or cat'ted from (by root),
>  	 * can be used to update the stats just before reading them.
> -	 *
> -	 * Oh, and since global_zone_page_state() etc. are so careful to hide
> -	 * transiently negative values, report an error here if any of
> -	 * the stats is negative, so we know to go looking for imbalance.
>  	 */
>  	err = schedule_on_each_cpu(refresh_vm_stats);
>  	if (err)
>  		return err;
> +
> +	/*
> +	 * Since global_zone_page_state() etc. are so careful to hide
> +	 * transiently negative values, report an error here if any of
> +	 * the stats is negative and are less than the maximum drift value,
> +	 * so we know to go looking for imbalance.
> +	 */
> +	max_drift = num_online_cpus() * MAX_THRESHOLD;
> +
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, zone_stat_name(i), val);
>  			err = -EINVAL;
> @@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  #ifdef CONFIG_NUMA
>  	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_numa_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, numa_stat_name(i), val);
>  			err = -EINVAL;
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
  2020-07-20  8:03 ` Michal Hocko
@ 2020-07-20 20:20   ` Roman Gushchin
  0 siblings, 0 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-07-20 20:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, linux-mm, kernel-team,
	linux-kernel, Hugh Dickins

On Mon, Jul 20, 2020 at 10:03:49AM +0200, Michal Hocko wrote:
> On Tue 14-07-20 10:39:20, Roman Gushchin wrote:
> > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> > hosts. The numbers of these warnings were relatively low and stable,
> > so it didn't look like we are systematically leaking the counters.
> > The corresponding vmstat counters also looked sane.
> > 
> > These warnings are generated by the vmstat_refresh() function, which
> > assumes that atomic zone and numa counters can't go below zero.
> > However, on a SMP machine it's not quite right: due to per-cpu
> > caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> > 
> > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> > reached 0. Then we've reclaimed a small number of cma pages on each
> > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> > slightly positive (the atomic counter is still 0). Then somebody on
> > CPU0 consumes all these pages. The number of pages can easily exceed
> > the threshold and a negative value will be committed to the atomic
> > counter.
> > 
> > To fix the problem and avoid generating false warnings, let's just
> > relax the condition and warn only if the value is less than minus
> > the maximum theoretically possible drift value, which is 125 *
> > number of online CPUs. It will still allow to catch systematic leaks,
> > but will not generate bogus warnings.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Hugh Dickins <hughd@google.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> One minor nit which can be handled by a separate patch but now that you
> are touching this code...

Thank you!

> 
> > ---
> >  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
> >  mm/vmstat.c                             | 30 ++++++++++++++++---------
> >  2 files changed, 21 insertions(+), 13 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 4b9d2e8e9142..95fb80d0c606 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
> >  
> >  As a side-effect, it also checks for negative totals (elsewhere reported
> >  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> > -(At time of writing, a few stats are known sometimes to be found negative,
> > -with no ill effects: errors and warnings on these stats are suppressed.)
> > +(On a SMP machine some stats can temporarily become negative, with no ill
> > +effects: errors and warnings on these stats are suppressed.)
> >  
> >  
> >  numa_stat
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index a21140373edb..8f0ef8aaf8ee 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +#define MAX_THRESHOLD 125
> 
> This would deserve a comment. 88f5acf88ae6a didn't really explain why
> this specific value has been selected but the specific value shouldn't
> really matter much. I would go with the following at least.
> "
> Maximum sync threshold for per-cpu vmstat counters. 
> "

Agree. Below is the diff to be squashed into the original patch.

Thanks!

--

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 08e415e0a15d..ddc59b533599 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -167,6 +167,9 @@ EXPORT_SYMBOL(vm_zone_stat);
 EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
+/*
+ * Maximum sync threshold for per-cpu vmstat counters.
+ */
 #ifdef CONFIG_SMP
 #define MAX_THRESHOLD 125
 #else


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
  2020-07-14 17:39 [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings Roman Gushchin
  2020-07-20  8:03 ` Michal Hocko
@ 2020-07-30  3:45 ` Hugh Dickins
  2020-07-30 16:23   ` Roman Gushchin
  1 sibling, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2020-07-30  3:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, kernel-team, linux-kernel, Hugh Dickins

On Tue, 14 Jul 2020, Roman Gushchin wrote:

> I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> hosts. The numbers of these warnings were relatively low and stable,
> so it didn't look like we are systematically leaking the counters.
> The corresponding vmstat counters also looked sane.

nr_zone_write_pending: yes, I've looked at our machines, and see that
showing up for us too (-49 was the worst I saw).  Not at all common,
but seen.  And not followed by increasingly worse numbers, so a state
that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
but never nr_writeback, which you'd expect to go along with those.

I wish I could explain that (I've wondered if somewhere delays
incrementing the stat, and can be interrupted by the decrementer of
the stat before it has reached incrementing), but have not succeeded.
Perhaps it is all down to the vmstat_refresh() skid that you hide in
this patch; but I'm not convinced.

> 
> These warnings are generated by the vmstat_refresh() function, which
> assumes that atomic zone and numa counters can't go below zero.
> However, on a SMP machine it's not quite right: due to per-cpu
> caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> 
> For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> reached 0. Then we've reclaimed a small number of cma pages on each
> CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> slightly positive (the atomic counter is still 0). Then somebody on
> CPU0 consumes all these pages. The number of pages can easily exceed
> the threshold and a negative value will be committed to the atomic
> counter.
> 
> To fix the problem and avoid generating false warnings, let's just
> relax the condition and warn only if the value is less than minus
> the maximum theoretically possible drift value, which is 125 *
> number of online CPUs. It will still allow to catch systematic leaks,
> but will not generate bogus warnings.

Sorry, but despite the acks of others, I want to NAK this in its
present form.

You're right that there's a possibility of a spurious warning,
but is that so terrible?

I'm imagining a threshold of 125, and 128 cpus, and the following
worst case.  Yes, it is possible that when vmstat_refresh() schedules
its refresh on all the cpus, that the first 127 cpus to complete all
sync a count of 0, but immediately after each allocates 125 of whatever
(raising their per-cpu counters without touching the global counter);
and then, before the 128th cpu starts its sync, somehow that 128th
gets to be the lucky cpu to free all of those 127*125 items,
so arriving at a mistaken count of -15875 for that stat.

And I guess you could even devise a test which conspires to do that.
But is it so likely, that it's worth throwing away the warning when
we leak (or perhaps it's unleak) 16000 huge pages?  I don't think so:
I think it makes those warnings pretty much useless, and it would be
better just to delete the code that issues them.

But there's other things you could do, that we can probably agree on.

When stat_refresh first went in, there were two stats (NR_ALLOC_BATCH
and NR_PAGES_SCANNED) which were peculiarly accounted, and gave rise
to negative reports: the original commit just filtered those cases out
in a switch.  Maybe you should filter out NR_ZONE_WRITE_PENDING and
NR_FREE_CMA_PAGES, if there's nothing to fix in their accounting.

I'm not sure exactly what your objection is to the warning: would
you prefer pr_info or pr_debug to pr_warn?  I'd prefer to leave it
as pr_warn, but can compromise.

(IIRC xfstests can fail a test if an unexpected message appears
in the log; but xfstests does not use /proc/sys/vm/stat_refresh.)

But a better idea is perhaps to redefine the behavior of
"echo >/proc/sys/vm/stat_refresh".  What if
"echo someparticularstring >/proc/sys/vm/stat_refresh" were to
disable or enable the warning (permanently? or just that time?):
disable would be more "back-compatible", but I think it's okay
if you prefer enable.  Or "someparticularstring" could actually
specify the warning threshold you want to use - you might echo
125 or 16000, I might echo 0.  We can haggle over the default.

And there's a simpler change we already made internally: we didn't
mind the warning at all, but the accompanying -EINVALs became very
annoying.  A lot of testing scripts have "set -e" in them, and for
test B of feature B to fail because earlier test A of feature A
had tickled a bug in A that wrapped some stat negative, that
was very irritating.  We deleted those "err = -EINVAL;"s -
which might be what's actually most annoying you too?

Nit in this patch called out below.

Hugh

> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Hugh Dickins <hughd@google.com>
> ---
>  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
>  mm/vmstat.c                             | 30 ++++++++++++++++---------
>  2 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 4b9d2e8e9142..95fb80d0c606 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
>  
>  As a side-effect, it also checks for negative totals (elsewhere reported
>  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> -(At time of writing, a few stats are known sometimes to be found negative,
> -with no ill effects: errors and warnings on these stats are suppressed.)
> +(On a SMP machine some stats can temporarily become negative, with no ill
> +effects: errors and warnings on these stats are suppressed.)
>  
>  
>  numa_stat
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index a21140373edb..8f0ef8aaf8ee 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
>  
>  #ifdef CONFIG_SMP
>  
> +#define MAX_THRESHOLD 125
> +
>  int calculate_pressure_threshold(struct zone *zone)
>  {
>  	int threshold;
> @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone)
>  	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
>  
>  	/*
> -	 * Maximum threshold is 125
> +	 * Threshold is capped by MAX_THRESHOLD
>  	 */
> -	threshold = min(125, threshold);
> -
> -	return threshold;
> +	return min(MAX_THRESHOLD, threshold);
>  }
>  
>  int calculate_normal_threshold(struct zone *zone)

calculate_normal_threshold() also contains a 125:
better change that to MAX_THRESHOLD too, if you do pursue this.

> @@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item)
>  }
>  EXPORT_SYMBOL(dec_node_page_state);
>  #else
> +
> +#define MAX_THRESHOLD 0
> +
>  /*
>   * Use interrupt disable to serialize counter updates
>   */
> @@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work)
>  int vmstat_refresh(struct ctl_table *table, int write,
>  		   void *buffer, size_t *lenp, loff_t *ppos)
>  {
> -	long val;
> +	long val, max_drift;
>  	int err;
>  	int i;
>  
> @@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	 * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
>  	 * which can equally be echo'ed to or cat'ted from (by root),
>  	 * can be used to update the stats just before reading them.
> -	 *
> -	 * Oh, and since global_zone_page_state() etc. are so careful to hide
> -	 * transiently negative values, report an error here if any of
> -	 * the stats is negative, so we know to go looking for imbalance.
>  	 */
>  	err = schedule_on_each_cpu(refresh_vm_stats);
>  	if (err)
>  		return err;
> +
> +	/*
> +	 * Since global_zone_page_state() etc. are so careful to hide
> +	 * transiently negative values, report an error here if any of
> +	 * the stats is negative and are less than the maximum drift value,
> +	 * so we know to go looking for imbalance.
> +	 */
> +	max_drift = num_online_cpus() * MAX_THRESHOLD;
> +
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, zone_stat_name(i), val);
>  			err = -EINVAL;
> @@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  #ifdef CONFIG_NUMA
>  	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_numa_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, numa_stat_name(i), val);
>  			err = -EINVAL;
> -- 
> 2.26.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
  2020-07-30  3:45 ` Hugh Dickins
@ 2020-07-30 16:23   ` Roman Gushchin
  2020-07-31  4:06     ` Hugh Dickins
  0 siblings, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-07-30 16:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, kernel-team, linux-kernel

On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote:
> On Tue, 14 Jul 2020, Roman Gushchin wrote:
> 
> > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> > hosts. The numbers of these warnings were relatively low and stable,
> > so it didn't look like we are systematically leaking the counters.
> > The corresponding vmstat counters also looked sane.
> 
> nr_zone_write_pending: yes, I've looked at our machines, and see that
> showing up for us too (-49 was the worst I saw).  Not at all common,
> but seen.  And not followed by increasingly worse numbers, so a state
> that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
> but never nr_writeback, which you'd expect to go along with those.

NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?

> 
> I wish I could explain that (I've wondered if somewhere delays
> incrementing the stat, and can be interrupted by the decrementer of
> the stat before it has reached incrementing), but have not succeeded.
> Perhaps it is all down to the vmstat_refresh() skid that you hide in
> this patch; but I'm not convinced.
> 
> > 
> > These warnings are generated by the vmstat_refresh() function, which
> > assumes that atomic zone and numa counters can't go below zero.
> > However, on a SMP machine it's not quite right: due to per-cpu
> > caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> > 
> > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> > reached 0. Then we've reclaimed a small number of cma pages on each
> > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> > slightly positive (the atomic counter is still 0). Then somebody on
> > CPU0 consumes all these pages. The number of pages can easily exceed
> > the threshold and a negative value will be committed to the atomic
> > counter.
> > 
> > To fix the problem and avoid generating false warnings, let's just
> > relax the condition and warn only if the value is less than minus
> > the maximum theoretically possible drift value, which is 125 *
> > number of online CPUs. It will still allow to catch systematic leaks,
> > but will not generate bogus warnings.
> 
> Sorry, but despite the acks of others, I want to NAK this in its
> present form.

Sorry to hear this.
> 
> You're right that there's a possibility of a spurious warning,
> but is that so terrible?

We do collect all warnings fleet-wide and false warnings are
creating a noise, which makes it easier to miss real problems.
Of course, we can filter out these particular warnings, but then
what's the point of generating them?

In my opinion such warnings with a bad signal/ratio aren't good because
initially they cause somebody to look into the "problem" and waste
their time, and later they are usually just ignored, even when the real
problem appears.

In this particular case I was testing some cma-related changes,
and when I saw a bunch a new warnings (cma was not used on these hosts
before), I was concerned that something's wrong with my changes.

> 
> I'm imagining a threshold of 125, and 128 cpus, and the following
> worst case.  Yes, it is possible that when vmstat_refresh() schedules
> its refresh on all the cpus, that the first 127 cpus to complete all
> sync a count of 0, but immediately after each allocates 125 of whatever
> (raising their per-cpu counters without touching the global counter);
> and then, before the 128th cpu starts its sync, somehow that 128th
> gets to be the lucky cpu to free all of those 127*125 items,
> so arriving at a mistaken count of -15875 for that stat.

First, I have to agree, 125 * number of cpus is definitely a very high
number, so it's extremely unlikely that any vmstat value will reach it
in the real life. I'm totally happy to go with a (much) lower limit,
I just have no good idea how to justify any particular number below.
I you can suggest something, I'd appreciate it a lot.
I like the number 400 :)

> 
> And I guess you could even devise a test which conspires to do that.
> But is it so likely, that it's worth throwing away the warning when
> we leak (or perhaps it's unleak) 16000 huge pages?  I don't think so:
> I think it makes those warnings pretty much useless, and it would be
> better just to delete the code that issues them.

Of course, it's an option too (delete the warnings at all).

> 
> But there's other things you could do, that we can probably agree on.
> 
> When stat_refresh first went in, there were two stats (NR_ALLOC_BATCH
> and NR_PAGES_SCANNED) which were peculiarly accounted, and gave rise
> to negative reports: the original commit just filtered those cases out
> in a switch.  Maybe you should filter out NR_ZONE_WRITE_PENDING and
> NR_FREE_CMA_PAGES, if there's nothing to fix in their accounting.

In fact, there is nothing specific to NR_ZONE_WRITE_PENDING and
NR_FREE_CMA_PAGES, any value which is often bouncing around 0 can go
negative, and it's not an indication of any issues.

> 
> I'm not sure exactly what your objection is to the warning: would
> you prefer pr_info or pr_debug to pr_warn?  I'd prefer to leave it
> as pr_warn, but can compromise.

Yeah, we can go with pr_debug as well.

> 
> (IIRC xfstests can fail a test if an unexpected message appears
> in the log; but xfstests does not use /proc/sys/vm/stat_refresh.)
> 
> But a better idea is perhaps to redefine the behavior of
> "echo >/proc/sys/vm/stat_refresh".  What if
> "echo someparticularstring >/proc/sys/vm/stat_refresh" were to
> disable or enable the warning (permanently? or just that time?):
> disable would be more "back-compatible", but I think it's okay
> if you prefer enable.  Or "someparticularstring" could actually
> specify the warning threshold you want to use - you might echo
> 125 or 16000, I might echo 0.  We can haggle over the default.

May I ask you, what kind of problems you have in your in mind,
which can be revealed by these warnings? Or maybe there is some
history attached?

If it's all about some particular counters, which are known to be
strictly positive, maybe we should do the opposite, and check only
those counters? Because in general it's not an indication of a problem.

> 
> And there's a simpler change we already made internally: we didn't
> mind the warning at all, but the accompanying -EINVALs became very
> annoying.  A lot of testing scripts have "set -e" in them, and for
> test B of feature B to fail because earlier test A of feature A
> had tickled a bug in A that wrapped some stat negative, that
> was very irritating.  We deleted those "err = -EINVAL;"s -
> which might be what's actually most annoying you too?

> 
> Nit in this patch called out below.
> 
> Hugh
> 
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > ---
> >  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
> >  mm/vmstat.c                             | 30 ++++++++++++++++---------
> >  2 files changed, 21 insertions(+), 13 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 4b9d2e8e9142..95fb80d0c606 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
> >  
> >  As a side-effect, it also checks for negative totals (elsewhere reported
> >  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> > -(At time of writing, a few stats are known sometimes to be found negative,
> > -with no ill effects: errors and warnings on these stats are suppressed.)
> > +(On a SMP machine some stats can temporarily become negative, with no ill
> > +effects: errors and warnings on these stats are suppressed.)
> >  
> >  
> >  numa_stat
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index a21140373edb..8f0ef8aaf8ee 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +#define MAX_THRESHOLD 125
> > +
> >  int calculate_pressure_threshold(struct zone *zone)
> >  {
> >  	int threshold;
> > @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone)
> >  	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> >  
> >  	/*
> > -	 * Maximum threshold is 125
> > +	 * Threshold is capped by MAX_THRESHOLD
> >  	 */
> > -	threshold = min(125, threshold);
> > -
> > -	return threshold;
> > +	return min(MAX_THRESHOLD, threshold);
> >  }
> >  
> >  int calculate_normal_threshold(struct zone *zone)
> 
> calculate_normal_threshold() also contains a 125:
> better change that to MAX_THRESHOLD too, if you do pursue this.

Totally agree, good catch.

Thanks!

Roman


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
  2020-07-30 16:23   ` Roman Gushchin
@ 2020-07-31  4:06     ` Hugh Dickins
  2020-08-01  1:18       ` Roman Gushchin
  0 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2020-07-31  4:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vlastimil Babka, linux-mm, kernel-team, linux-kernel

On Thu, 30 Jul 2020, Roman Gushchin wrote:
> On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote:
> > 
> > But a better idea is perhaps to redefine the behavior of
> > "echo >/proc/sys/vm/stat_refresh".  What if
> > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to
> > disable or enable the warning (permanently? or just that time?):
> > disable would be more "back-compatible", but I think it's okay
> > if you prefer enable.  Or "someparticularstring" could actually
> > specify the warning threshold you want to use - you might echo
> > 125 or 16000, I might echo 0.  We can haggle over the default.
> 
> May I ask you, what kind of problems you have in your in mind,
> which can be revealed by these warnings? Or maybe there is some
> history attached?

Yes: 52b6f46bc163 mentions finding a bug of mine in NR_ISOLATED_FILE
accounting, but IIRC (though I might be making this up) there was
also a bug in the NR_ACTIVE or NR_INACTIVE FILE or ANON accounting.

When one of the stats used for balancing or limiting in vmscan.c
trends increasingly negative, it becomes increasingly difficult
for those heuristics (adding on to others, comparing with others)
to do what they're intended to do: they behave increasingly weirdly.

Now the same (or the opposite) is true if one of those stats trends
increasingly positive: but if it leaks positive, it's visible in
/proc/vmstat; whereas if it leaks negative, it's presented there as 0.

And most of the time (when unsynchronized) showing 0 is much better
than showing a transient negative.  But to help fix bugs, we do need
some way of seeing the negatives, and vm/stat_refresh provides an
opportunity to do so, when it synchronizes.

I'd be glad not to show the transients if I knew them: set a flag
on any that go negative, and only show if negative twice or more
in a row?  Perhaps, but I don't relish adding that, and think it
would be over-engineering.

It does sound to me like echoing the warning threshold into
/proc/sys/vm/stat_refresh is the best way to satisfy us both.

Though another alternative did occur to me overnight: we could
scrap the logged warning, and show "nr_whatever -53" as output
from /proc/sys/vm/stat_refresh: that too would be acceptable
to me, and you redirect to /dev/null.

(Why did I choose -53 in my example?  An in-joke: when I looked
through our machines for these warnings, on old kernels with my
old shmem hugepage implementation, there were a striking number
with "nr_shmem_freeholes -53"; but I'm a few years too late to
investigate what was going on there.)

> 
> If it's all about some particular counters, which are known to be
> strictly positive, maybe we should do the opposite, and check only
> those counters? Because in general it's not an indication of a problem.

Yet it's very curious how few stats ever generate such warnings:
you're convinced they're just transient noise, and you're probably right;
but I am a little suspicious of whether they are accounted correctly.

Hugh


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
  2020-07-31  4:06     ` Hugh Dickins
@ 2020-08-01  1:18       ` Roman Gushchin
  2020-08-01  2:17         ` Hugh Dickins
  0 siblings, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-08-01  1:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, kernel-team, linux-kernel

On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> On Thu, 30 Jul 2020, Roman Gushchin wrote:
> > On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote:
> > > 
> > > But a better idea is perhaps to redefine the behavior of
> > > "echo >/proc/sys/vm/stat_refresh".  What if
> > > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to
> > > disable or enable the warning (permanently? or just that time?):
> > > disable would be more "back-compatible", but I think it's okay
> > > if you prefer enable.  Or "someparticularstring" could actually
> > > specify the warning threshold you want to use - you might echo
> > > 125 or 16000, I might echo 0.  We can haggle over the default.
> > 
> > May I ask you, what kind of problems you have in your in mind,
> > which can be revealed by these warnings? Or maybe there is some
> > history attached?
> 
> Yes: 52b6f46bc163 mentions finding a bug of mine in NR_ISOLATED_FILE
> accounting, but IIRC (though I might be making this up) there was
> also a bug in the NR_ACTIVE or NR_INACTIVE FILE or ANON accounting.
> 
> When one of the stats used for balancing or limiting in vmscan.c
> trends increasingly negative, it becomes increasingly difficult
> for those heuristics (adding on to others, comparing with others)
> to do what they're intended to do: they behave increasingly weirdly.
> 
> Now the same (or the opposite) is true if one of those stats trends
> increasingly positive: but if it leaks positive, it's visible in
> /proc/vmstat; whereas if it leaks negative, it's presented there as 0.
> 
> And most of the time (when unsynchronized) showing 0 is much better
> than showing a transient negative.  But to help fix bugs, we do need
> some way of seeing the negatives, and vm/stat_refresh provides an
> opportunity to do so, when it synchronizes.
> 
> I'd be glad not to show the transients if I knew them: set a flag
> on any that go negative, and only show if negative twice or more
> in a row?  Perhaps, but I don't relish adding that, and think it
> would be over-engineering.
> 
> It does sound to me like echoing the warning threshold into
> /proc/sys/vm/stat_refresh is the best way to satisfy us both.
> 
> Though another alternative did occur to me overnight: we could
> scrap the logged warning, and show "nr_whatever -53" as output
> from /proc/sys/vm/stat_refresh: that too would be acceptable
> to me, and you redirect to /dev/null.

It sounds like a good idea to me. Do you want me to prepare a patch?

> 
> (Why did I choose -53 in my example?  An in-joke: when I looked
> through our machines for these warnings, on old kernels with my
> old shmem hugepage implementation, there were a striking number
> with "nr_shmem_freeholes -53"; but I'm a few years too late to
> investigate what was going on there.)

:)

> 
> > 
> > If it's all about some particular counters, which are known to be
> > strictly positive, maybe we should do the opposite, and check only
> > those counters? Because in general it's not an indication of a problem.
> 
> Yet it's very curious how few stats ever generate such warnings:
> you're convinced they're just transient noise, and you're probably right;
> but I am a little suspicious of whether they are accounted correctly.

Yeah, I was initially very suspicious too, but I didn't find any issues
and still think it's not an indication of a problem.

Thank you!


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
  2020-08-01  1:18       ` Roman Gushchin
@ 2020-08-01  2:17         ` Hugh Dickins
  0 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2020-08-01  2:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vlastimil Babka, linux-mm, kernel-team, linux-kernel

On Fri, 31 Jul 2020, Roman Gushchin wrote:
> On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > 
> > Though another alternative did occur to me overnight: we could
> > scrap the logged warning, and show "nr_whatever -53" as output
> > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > to me, and you redirect to /dev/null.
> 
> It sounds like a good idea to me. Do you want me to prepare a patch?

Yes, if you like that one best, please do prepare a patch - thanks!

Hugh


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, back to index

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-14 17:39 [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings Roman Gushchin
2020-07-20  8:03 ` Michal Hocko
2020-07-20 20:20   ` Roman Gushchin
2020-07-30  3:45 ` Hugh Dickins
2020-07-30 16:23   ` Roman Gushchin
2020-07-31  4:06     ` Hugh Dickins
2020-08-01  1:18       ` Roman Gushchin
2020-08-01  2:17         ` Hugh Dickins

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git