[PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh
@ 2021-02-25 23:10 Hugh Dickins
  2021-02-25 23:12 ` [PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh Hugh Dickins
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-02-25 23:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

v4.7 52b6f46bc163 ("mm: /proc/sys/vm/stat_refresh to force vmstat update")
introduced vmstat_refresh(), with its vmstat underflow checking; then
v4.8 75ef71840539 ("mm, vmstat: add infrastructure for per-node vmstats")
split NR_VM_NODE_STAT_ITEMS out of NR_VM_ZONE_STAT_ITEMS without updating
vmstat_refresh(): so it has been missing out much of the vmstat underflow
checking ever since. Reinstate it. Thanks to Roman Gushchin <guro@fb.com>
for tangentially pointing this out.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/vmstat.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- 5.12-rc/mm/vmstat.c	2021-02-24 12:03:55.000000000 -0800
+++ vmstat1/mm/vmstat.c	2021-02-25 11:50:36.000000000 -0800
@@ -1857,6 +1857,14 @@ int vmstat_refresh(struct ctl_table *tab
 		}
 	}
 #endif
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		val = atomic_long_read(&vm_node_stat[i]);
+		if (val < 0) {
+			pr_warn("%s: %s %ld\n",
+				__func__, node_stat_name(i), val);
+			err = -EINVAL;
+		}
+	}
 	if (err)
 		return err;
 	if (write)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh
  2021-02-25 23:10 [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Hugh Dickins
@ 2021-02-25 23:12 ` Hugh Dickins
  2021-03-01  0:38   ` Roman Gushchin
  2021-02-25 23:14 ` [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats Hugh Dickins
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2021-02-25 23:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

EINVAL was good for drawing the refresher's attention to a warning in
dmesg, but became very tiresome when running test suites scripted with
"set -e": an underflow from a bug in one feature would cause unrelated
tests much later to fail, just because their /proc/sys/vm/stat_refresh
touch failed with that error. Stop doing that.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/vmstat.c |    5 -----
 1 file changed, 5 deletions(-)

--- vmstat1/mm/vmstat.c	2021-02-25 11:50:36.000000000 -0800
+++ vmstat2/mm/vmstat.c	2021-02-25 11:56:18.000000000 -0800
@@ -1844,7 +1844,6 @@ int vmstat_refresh(struct ctl_table *tab
 		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, zone_stat_name(i), val);
-			err = -EINVAL;
 		}
 	}
 #ifdef CONFIG_NUMA
@@ -1853,7 +1852,6 @@ int vmstat_refresh(struct ctl_table *tab
 		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, numa_stat_name(i), val);
-			err = -EINVAL;
 		}
 	}
 #endif
@@ -1862,11 +1860,8 @@ int vmstat_refresh(struct ctl_table *tab
 		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, node_stat_name(i), val);
-			err = -EINVAL;
 		}
 	}
-	if (err)
-		return err;
 	if (write)
 		*ppos += *lenp;
 	else


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh
  2021-02-25 23:12 ` [PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh Hugh Dickins
@ 2021-03-01  0:38   ` Roman Gushchin
  0 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2021-03-01  0:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

On Thu, Feb 25, 2021 at 03:12:10PM -0800, Hugh Dickins wrote:
> EINVAL was good for drawing the refresher's attention to a warning in
> dmesg, but became very tiresome when running test suites scripted with
> "set -e": an underflow from a bug in one feature would cause unrelated
> tests much later to fail, just because their /proc/sys/vm/stat_refresh
> touch failed with that error. Stop doing that.

Totally agree!

> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Roman Gushchin <guro@fb.com>

> ---
> 
>  mm/vmstat.c |    5 -----
>  1 file changed, 5 deletions(-)
> 
> --- vmstat1/mm/vmstat.c	2021-02-25 11:50:36.000000000 -0800
> +++ vmstat2/mm/vmstat.c	2021-02-25 11:56:18.000000000 -0800
> @@ -1844,7 +1844,6 @@ int vmstat_refresh(struct ctl_table *tab
>  		if (val < 0) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, zone_stat_name(i), val);
> -			err = -EINVAL;
>  		}
>  	}
>  #ifdef CONFIG_NUMA
> @@ -1853,7 +1852,6 @@ int vmstat_refresh(struct ctl_table *tab
>  		if (val < 0) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, numa_stat_name(i), val);
> -			err = -EINVAL;
>  		}
>  	}
>  #endif
> @@ -1862,11 +1860,8 @@ int vmstat_refresh(struct ctl_table *tab
>  		if (val < 0) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, node_stat_name(i), val);
> -			err = -EINVAL;
>  		}
>  	}
> -	if (err)
> -		return err;
>  	if (write)
>  		*ppos += *lenp;
>  	else


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats
  2021-02-25 23:10 [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Hugh Dickins
  2021-02-25 23:12 ` [PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh Hugh Dickins
@ 2021-02-25 23:14 ` Hugh Dickins
  2021-03-01  0:53   ` Roman Gushchin
  2021-02-25 23:15 ` [PATCH 4/4] mm: /proc//sys/vm/stat_refresh stop checking monotonic numa stats Hugh Dickins
  2021-03-01  0:37 ` [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Roman Gushchin
  3 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2021-02-25 23:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

vmstat_refresh() can occasionally catch nr_zone_write_pending and
nr_writeback when they are transiently negative.  The reason is partly
that the interrupt which decrements them in test_clear_page_writeback()
can come in before __test_set_page_writeback() got to increment them;
but transient negatives are still seen even when that is prevented, and
we have not yet resolved why (Roman believes that it is an unavoidable
consequence of the refresh scheduled on each cpu).  But those stats are
not buggy, they have never been seen to drift away from 0 permanently:
so just avoid the annoyance of showing a warning on them.

Similarly avoid showing a warning on nr_free_cma: CMA users have seen
that one reported negative from /proc/sys/vm/stat_refresh too, but it
does drift away permanently: I believe that's because its incrementation
and decrementation are decided by page migratetype, but the migratetype
of a pageblock is not guaranteed to be constant.

Use switch statements so we can most easily add or remove cases later.

Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-guro@fb.com/
Reported-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/vmstat.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

--- vmstat2/mm/vmstat.c	2021-02-25 11:56:18.000000000 -0800
+++ vmstat3/mm/vmstat.c	2021-02-25 12:42:15.000000000 -0800
@@ -1840,6 +1840,14 @@ int vmstat_refresh(struct ctl_table *tab
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
+		/*
+		 * Skip checking stats known to go negative occasionally.
+		 */
+		switch (i) {
+		case NR_ZONE_WRITE_PENDING:
+		case NR_FREE_CMA_PAGES:
+			continue;
+		}
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
@@ -1856,6 +1864,13 @@ int vmstat_refresh(struct ctl_table *tab
 	}
 #endif
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		/*
+		 * Skip checking stats known to go negative occasionally.
+		 */
+		switch (i) {
+		case NR_WRITEBACK:
+			continue;
+		}
 		val = atomic_long_read(&vm_node_stat[i]);
 		if (val < 0) {
 			pr_warn("%s: %s %ld\n",

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats
  2021-02-25 23:14 ` [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats Hugh Dickins
@ 2021-03-01  0:53   ` Roman Gushchin
  2021-03-01 22:08     ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Gushchin @ 2021-03-01  0:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

On Thu, Feb 25, 2021 at 03:14:03PM -0800, Hugh Dickins wrote:
> vmstat_refresh() can occasionally catch nr_zone_write_pending and
> nr_writeback when they are transiently negative.  The reason is partly
> that the interrupt which decrements them in test_clear_page_writeback()
> can come in before __test_set_page_writeback() got to increment them;
> but transient negatives are still seen even when that is prevented, and
> we have not yet resolved why (Roman believes that it is an unavoidable
> consequence of the refresh scheduled on each cpu).  But those stats are
> not buggy, they have never been seen to drift away from 0 permanently:
> so just avoid the annoyance of showing a warning on them.
> 
> Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> that one reported negative from /proc/sys/vm/stat_refresh too, but it
> does drift away permanently: I believe that's because its incrementation
> and decrementation are decided by page migratetype, but the migratetype
> of a pageblock is not guaranteed to be constant.
> 
> Use switch statements so we can most easily add or remove cases later.

I'm OK with the code, but I can't fully agree with the commit log. I don't think
there is any mystery around negative values. Let me copy-paste the explanation
from my original patch:

    These warnings* are generated by the vmstat_refresh() function, which
    assumes that atomic zone and numa counters can't go below zero.  However,
    on a SMP machine it's not quite right: due to per-cpu caching it can in
    theory be as low as -(zone threshold) * NR_CPUs.
    
    For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
    reached 0.  Then we've reclaimed a small number of cma pages on each CPU
    except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are slightly
    positive (the atomic counter is still 0).  Then somebody on CPU0 consumes
    all these pages.  The number of pages can easily exceed the threshold and
    a negative value will be committed to the atomic counter.

    * warnings about negative NR_FREE_CMA_PAGES

Actually, the same is almost true for ANY other counter. What differs CMA, dirty
and write pending counters is that they can reach 0 value under normal conditions.
Other counters are usually not reaching values small enough to see negative values
on a reasonable sized machine.

Does it makes sense?

> 
> Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-guro@fb.com/
> Reported-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> 
>  mm/vmstat.c |   15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> --- vmstat2/mm/vmstat.c	2021-02-25 11:56:18.000000000 -0800
> +++ vmstat3/mm/vmstat.c	2021-02-25 12:42:15.000000000 -0800
> @@ -1840,6 +1840,14 @@ int vmstat_refresh(struct ctl_table *tab
>  	if (err)
>  		return err;
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
> +		/*
> +		 * Skip checking stats known to go negative occasionally.
> +		 */
> +		switch (i) {
> +		case NR_ZONE_WRITE_PENDING:
> +		case NR_FREE_CMA_PAGES:
> +			continue;
> +		}
>  		val = atomic_long_read(&vm_zone_stat[i]);
>  		if (val < 0) {
>  			pr_warn("%s: %s %ld\n",
> @@ -1856,6 +1864,13 @@ int vmstat_refresh(struct ctl_table *tab
>  	}
>  #endif
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> +		/*
> +		 * Skip checking stats known to go negative occasionally.
> +		 */
> +		switch (i) {
> +		case NR_WRITEBACK:
> +			continue;
> +		}
>  		val = atomic_long_read(&vm_node_stat[i]);
>  		if (val < 0) {
>  			pr_warn("%s: %s %ld\n",


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats
  2021-03-01  0:53   ` Roman Gushchin
@ 2021-03-01 22:08     ` Hugh Dickins
  2021-03-02  0:34       ` Roman Gushchin
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2021-03-01 22:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vlastimil Babka, linux-mm, linux-kernel

On Sun, 28 Feb 2021, Roman Gushchin wrote:
> On Thu, Feb 25, 2021 at 03:14:03PM -0800, Hugh Dickins wrote:
> > vmstat_refresh() can occasionally catch nr_zone_write_pending and
> > nr_writeback when they are transiently negative.  The reason is partly
> > that the interrupt which decrements them in test_clear_page_writeback()
> > can come in before __test_set_page_writeback() got to increment them;
> > but transient negatives are still seen even when that is prevented, and
> > we have not yet resolved why (Roman believes that it is an unavoidable
> > consequence of the refresh scheduled on each cpu).  But those stats are
> > not buggy, they have never been seen to drift away from 0 permanently:
> > so just avoid the annoyance of showing a warning on them.
> > 
> > Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> > that one reported negative from /proc/sys/vm/stat_refresh too, but it
> > does drift away permanently: I believe that's because its incrementation
> > and decrementation are decided by page migratetype, but the migratetype
> > of a pageblock is not guaranteed to be constant.
> > 
> > Use switch statements so we can most easily add or remove cases later.
> 
> I'm OK with the code, but I can't fully agree with the commit log. I don't think
> there is any mystery around negative values. Let me copy-paste the explanation
> from my original patch:
> 
>     These warnings* are generated by the vmstat_refresh() function, which
>     assumes that atomic zone and numa counters can't go below zero.  However,
>     on a SMP machine it's not quite right: due to per-cpu caching it can in
>     theory be as low as -(zone threshold) * NR_CPUs.
>     
>     For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
>     reached 0.  Then we've reclaimed a small number of cma pages on each CPU
>     except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are slightly
>     positive (the atomic counter is still 0).  Then somebody on CPU0 consumes
>     all these pages.  The number of pages can easily exceed the threshold and
>     a negative value will be committed to the atomic counter.
> 
>     * warnings about negative NR_FREE_CMA_PAGES

Hi Roman, thanks for your Acks on the others - and indeed this
is the one on which disagreement was more to be expected.

I certainly wanted (and included below) a Link to your original patch;
and even wondered whether to paste your description into mine.
But I read it again and still have issues with it.

Mainly, it does not convey at all, that touching stat_refresh adds the
per-cpu counts into the global atomics, resetting per-cpu counts to 0.
Which does not invalidate your explanation: races might still manage
to underflow; but it does take the "easily" out of "can easily exceed".

Since I don't use CMA on any machine, I cannot be sure, but it looked
like a bad example to rely upon, because of its migratetype-based
accounting.  If you use /proc/sys/vm/stat_refresh frequently enough,
without suppressing the warning, I guess that uncertainty could be
resolved by checking whether nr_free_cma is seen with negative value
in consecutive refreshes - which would tend to support my migratetype
theory - or only singly - which would support your raciness theory.

> 
> Actually, the same is almost true for ANY other counter. What differs CMA, dirty
> and write pending counters is that they can reach 0 value under normal conditions.
> Other counters are usually not reaching values small enough to see negative values
> on a reasonable sized machine.

Looking through /proc/vmstat now, yes, I can see that there are fewer
counters which hover near 0 than I had imagined: more have a positive
bias, or are monotonically increasing.  And I'd be lying if I said I'd
never seen any others than nr_writeback or nr_zone_write_pending caught
negative.  But what are you asking for?  Should the patch be changed, to
retry the refresh_vm_stats() before warning, if it sees any negative?
Depends on how terrible one line in dmesg is considered!

> 
> Does it makes sense?

I'm not sure: you were not asking for the patch to be changed, but
its commit log: and I better not say "Roman believes that it is an
unavoidable consequence of the refresh scheduled on each cpu" if
that's untrue (or unclear: now it reads to me as if we're accusing
the refresh of messing things up, whereas it's the non-atomic nature
of the refresh which leaves it vulnerable to races).

Hugh

> 
> > 
> > Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-guro@fb.com/
> > Reported-by: Roman Gushchin <guro@fb.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> > 
> >  mm/vmstat.c |   15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > --- vmstat2/mm/vmstat.c	2021-02-25 11:56:18.000000000 -0800
> > +++ vmstat3/mm/vmstat.c	2021-02-25 12:42:15.000000000 -0800
> > @@ -1840,6 +1840,14 @@ int vmstat_refresh(struct ctl_table *tab
> >  	if (err)
> >  		return err;
> >  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
> > +		/*
> > +		 * Skip checking stats known to go negative occasionally.
> > +		 */
> > +		switch (i) {
> > +		case NR_ZONE_WRITE_PENDING:
> > +		case NR_FREE_CMA_PAGES:
> > +			continue;
> > +		}
> >  		val = atomic_long_read(&vm_zone_stat[i]);
> >  		if (val < 0) {
> >  			pr_warn("%s: %s %ld\n",
> > @@ -1856,6 +1864,13 @@ int vmstat_refresh(struct ctl_table *tab
> >  	}
> >  #endif
> >  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> > +		/*
> > +		 * Skip checking stats known to go negative occasionally.
> > +		 */
> > +		switch (i) {
> > +		case NR_WRITEBACK:
> > +			continue;
> > +		}
> >  		val = atomic_long_read(&vm_node_stat[i]);
> >  		if (val < 0) {
> >  			pr_warn("%s: %s %ld\n",
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats
  2021-03-01 22:08     ` Hugh Dickins
@ 2021-03-02  0:34       ` Roman Gushchin
  2021-03-02  6:03         ` [PATCH v2 " Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Gushchin @ 2021-03-02  0:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

Mon, Mar 01, 2021 at 02:08:17PM -0800, Hugh Dickins wrote:
> On Sun, 28 Feb 2021, Roman Gushchin wrote:
> > On Thu, Feb 25, 2021 at 03:14:03PM -0800, Hugh Dickins wrote:
> > > vmstat_refresh() can occasionally catch nr_zone_write_pending and
> > > nr_writeback when they are transiently negative.  The reason is partly
> > > that the interrupt which decrements them in test_clear_page_writeback()
> > > can come in before __test_set_page_writeback() got to increment them;
> > > but transient negatives are still seen even when that is prevented, and
> > > we have not yet resolved why (Roman believes that it is an unavoidable
> > > consequence of the refresh scheduled on each cpu).  But those stats are
> > > not buggy, they have never been seen to drift away from 0 permanently:
> > > so just avoid the annoyance of showing a warning on them.
> > > 
> > > Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> > > that one reported negative from /proc/sys/vm/stat_refresh too, but it
> > > does drift away permanently: I believe that's because its incrementation
> > > and decrementation are decided by page migratetype, but the migratetype
> > > of a pageblock is not guaranteed to be constant.
> > > 
> > > Use switch statements so we can most easily add or remove cases later.
> > 
> > I'm OK with the code, but I can't fully agree with the commit log. I don't think
> > there is any mystery around negative values. Let me copy-paste the explanation
> > from my original patch:
> > 
> >     These warnings* are generated by the vmstat_refresh() function, which
> >     assumes that atomic zone and numa counters can't go below zero.  However,
> >     on a SMP machine it's not quite right: due to per-cpu caching it can in
> >     theory be as low as -(zone threshold) * NR_CPUs.
> >     
> >     For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> >     reached 0.  Then we've reclaimed a small number of cma pages on each CPU
> >     except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are slightly
> >     positive (the atomic counter is still 0).  Then somebody on CPU0 consumes
> >     all these pages.  The number of pages can easily exceed the threshold and
> >     a negative value will be committed to the atomic counter.
> > 
> >     * warnings about negative NR_FREE_CMA_PAGES
> 
> Hi Roman, thanks for your Acks on the others - and indeed this
> is the one on which disagreement was more to be expected.
> 
> I certainly wanted (and included below) a Link to your original patch;
> and even wondered whether to paste your description into mine.
> But I read it again and still have issues with it.
> 
> Mainly, it does not convey at all, that touching stat_refresh adds the
> per-cpu counts into the global atomics, resetting per-cpu counts to 0.
> Which does not invalidate your explanation: races might still manage
> to underflow; but it does take the "easily" out of "can easily exceed".

Hi Hugh!

It could be that "easily" simple comes from the scale (number of machines).

> 
> Since I don't use CMA on any machine, I cannot be sure, but it looked
> like a bad example to rely upon, because of its migratetype-based
> accounting.  If you use /proc/sys/vm/stat_refresh frequently enough,
> without suppressing the warning, I guess that uncertainty could be
> resolved by checking whether nr_free_cma is seen with negative value
> in consecutive refreshes - which would tend to support my migratetype
> theory - or only singly - which would support your raciness theory.
> 
> > 
> > Actually, the same is almost true for ANY other counter. What differs CMA, dirty
> > and write pending counters is that they can reach 0 value under normal conditions.
> > Other counters are usually not reaching values small enough to see negative values
> > on a reasonable sized machine.
> 
> Looking through /proc/vmstat now, yes, I can see that there are fewer
> counters which hover near 0 than I had imagined: more have a positive
> bias, or are monotonically increasing.  And I'd be lying if I said I'd
> never seen any others than nr_writeback or nr_zone_write_pending caught
> negative.  But what are you asking for?  Should the patch be changed, to
> retry the refresh_vm_stats() before warning, if it sees any negative?
> Depends on how terrible one line in dmesg is considered!
> 
> > 
> > Does it makes sense?
> 
> I'm not sure: you were not asking for the patch to be changed, but
> its commit log: and I better not say "Roman believes that it is an
> unavoidable consequence of the refresh scheduled on each cpu" if
> that's untrue (or unclear: now it reads to me as if we're accusing
> the refresh of messing things up, whereas it's the non-atomic nature
> of the refresh which leaves it vulnerable to races).

I think we both agree that for some counters going slightly into negative
is possible and isn't an indication of an error, if only they don't become
too negative. For other counters it's unlikely: so you see a value in
raising a warning when they do. I don't think there is any disagreement here.

So the only question is how we encode the list of counters which we're
comparing to 0 (we can list them or list all others, as in your version),
and what we do with the rest (we can ignore them completely or compare
with the maximum drift value, as in my original patch). I actually don't
care that much how exactly it's implemented, if only we're not generating
too many false warnings.

How about putting something like this into the commit log (I'm sure you
can put it better than me. Please, do!):

"For performance reasons vmstat counters are incremented and decremented
using percpu batches. vmstat_refresh() flushes per-cpu batches on all CPUs
to get as accurate values, as possible. However, because this process
is not atomic, the resulting value is not exactly precise. As a consequence,
for some counters, which real value tend to oscillate around 0, it's possible
to obtain a slightly negative value. If the value is relatively small and
the state is transient, it's not an indication on an error."

Thanks!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats
  2021-03-02  0:34       ` Roman Gushchin
@ 2021-03-02  6:03         ` Hugh Dickins
  2021-03-04  2:25           ` Roman Gushchin
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2021-03-02  6:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Hugh Dickins, Johannes Weiner, Michal Hocko,
	Vlastimil Babka, linux-mm, linux-kernel

vmstat_refresh() can occasionally catch nr_zone_write_pending and
nr_writeback when they are transiently negative.  The reason is partly
that the interrupt which decrements them in test_clear_page_writeback()
can come in before __test_set_page_writeback() got to increment them;
but transient negatives are still seen even when that is prevented, and
I am not yet certain why (but see Roman's note below).  Those stats are
not buggy, they have never been seen to drift away from 0 permanently:
so just avoid the annoyance of showing a warning on them.

Similarly avoid showing a warning on nr_free_cma: CMA users have seen
that one reported negative from /proc/sys/vm/stat_refresh too, but it
does drift away permanently: I believe that's because its incrementation
and decrementation are decided by page migratetype, but the migratetype
of a pageblock is not guaranteed to be constant.

Roman Gushchin points out:
For performance reasons, vmstat counters are incremented and decremented
using per-cpu batches.  vmstat_refresh() flushes the per-cpu batches on
all CPUs, to get values as accurate as possible; but this method is not
atomic, so the resulting value is not always precise.  As a consequence,
for those counters whose actual value is close to 0, a small negative
value may occasionally be reported.  If the value is small and the state
is transient, it is not an indication of an error.

Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-guro@fb.com/
Reported-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/vmstat.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

--- vmstat2/mm/vmstat.c	2021-02-25 11:56:18.000000000 -0800
+++ vmstat3/mm/vmstat.c	2021-02-25 12:42:15.000000000 -0800
@@ -1840,6 +1840,14 @@ int vmstat_refresh(struct ctl_table *tab
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
+		/*
+		 * Skip checking stats known to go negative occasionally.
+		 */
+		switch (i) {
+		case NR_ZONE_WRITE_PENDING:
+		case NR_FREE_CMA_PAGES:
+			continue;
+		}
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
@@ -1856,6 +1864,13 @@ int vmstat_refresh(struct ctl_table *tab
 	}
 #endif
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		/*
+		 * Skip checking stats known to go negative occasionally.
+		 */
+		switch (i) {
+		case NR_WRITEBACK:
+			continue;
+		}
 		val = atomic_long_read(&vm_node_stat[i]);
 		if (val < 0) {
 			pr_warn("%s: %s %ld\n",

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats
  2021-03-02  6:03         ` [PATCH v2 " Hugh Dickins
@ 2021-03-04  2:25           ` Roman Gushchin
  0 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2021-03-04  2:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

On Mon, Mar 01, 2021 at 10:03:26PM -0800, Hugh Dickins wrote:
> vmstat_refresh() can occasionally catch nr_zone_write_pending and
> nr_writeback when they are transiently negative.  The reason is partly
> that the interrupt which decrements them in test_clear_page_writeback()
> can come in before __test_set_page_writeback() got to increment them;
> but transient negatives are still seen even when that is prevented, and
> I am not yet certain why (but see Roman's note below).  Those stats are
> not buggy, they have never been seen to drift away from 0 permanently:
> so just avoid the annoyance of showing a warning on them.
> 
> Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> that one reported negative from /proc/sys/vm/stat_refresh too, but it
> does drift away permanently: I believe that's because its incrementation
> and decrementation are decided by page migratetype, but the migratetype
> of a pageblock is not guaranteed to be constant.
> 
> Roman Gushchin points out:
> For performance reasons, vmstat counters are incremented and decremented
> using per-cpu batches.  vmstat_refresh() flushes the per-cpu batches on
> all CPUs, to get values as accurate as possible; but this method is not
> atomic, so the resulting value is not always precise.  As a consequence,
> for those counters whose actual value is close to 0, a small negative
> value may occasionally be reported.  If the value is small and the state
> is transient, it is not an indication of an error.
> 
> Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-guro@fb.com/
> Reported-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---

Oh, sorry, it looks like I missed to ack it. Thank you for updating
the commit log!

Acked-by: Roman Gushchin <guro@fb.com>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 4/4] mm: /proc//sys/vm/stat_refresh stop checking monotonic numa stats
  2021-02-25 23:10 [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Hugh Dickins
  2021-02-25 23:12 ` [PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh Hugh Dickins
  2021-02-25 23:14 ` [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats Hugh Dickins
@ 2021-02-25 23:15 ` Hugh Dickins
  2021-03-01  0:53   ` Roman Gushchin
  2021-03-01  0:37 ` [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Roman Gushchin
  3 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2021-02-25 23:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

All of the VM NUMA stats are event counts, incremented never decremented:
it is not very useful for vmstat_refresh() to check them throughout their
first aeon, then warn on them throughout their next.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/vmstat.c |    9 ---------
 1 file changed, 9 deletions(-)

--- vmstat3/mm/vmstat.c	2021-02-25 12:42:15.000000000 -0800
+++ vmstat4/mm/vmstat.c	2021-02-25 12:44:20.000000000 -0800
@@ -1854,15 +1854,6 @@ int vmstat_refresh(struct ctl_table *tab
 				__func__, zone_stat_name(i), val);
 		}
 	}
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < 0) {
-			pr_warn("%s: %s %ld\n",
-				__func__, numa_stat_name(i), val);
-		}
-	}
-#endif
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] mm: /proc//sys/vm/stat_refresh stop checking monotonic numa stats
  2021-02-25 23:15 ` [PATCH 4/4] mm: /proc//sys/vm/stat_refresh stop checking monotonic numa stats Hugh Dickins
@ 2021-03-01  0:53   ` Roman Gushchin
  0 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2021-03-01  0:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

On Thu, Feb 25, 2021 at 03:15:42PM -0800, Hugh Dickins wrote:
> All of the VM NUMA stats are event counts, incremented never decremented:
> it is not very useful for vmstat_refresh() to check them throughout their
> first aeon, then warn on them throughout their next.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

> ---
> 
>  mm/vmstat.c |    9 ---------
>  1 file changed, 9 deletions(-)
> 
> --- vmstat3/mm/vmstat.c	2021-02-25 12:42:15.000000000 -0800
> +++ vmstat4/mm/vmstat.c	2021-02-25 12:44:20.000000000 -0800
> @@ -1854,15 +1854,6 @@ int vmstat_refresh(struct ctl_table *tab
>  				__func__, zone_stat_name(i), val);
>  		}
>  	}
> -#ifdef CONFIG_NUMA
> -	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
> -		val = atomic_long_read(&vm_numa_stat[i]);
> -		if (val < 0) {
> -			pr_warn("%s: %s %ld\n",
> -				__func__, numa_stat_name(i), val);
> -		}
> -	}
> -#endif
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>  		/*
>  		 * Skip checking stats known to go negative occasionally.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh
  2021-02-25 23:10 [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Hugh Dickins
                   ` (2 preceding siblings ...)
  2021-02-25 23:15 ` [PATCH 4/4] mm: /proc//sys/vm/stat_refresh stop checking monotonic numa stats Hugh Dickins
@ 2021-03-01  0:37 ` Roman Gushchin
  3 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2021-03-01  0:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

On Thu, Feb 25, 2021 at 03:10:09PM -0800, Hugh Dickins wrote:
> v4.7 52b6f46bc163 ("mm: /proc/sys/vm/stat_refresh to force vmstat update")
> introduced vmstat_refresh(), with its vmstat underflow checking; then
> v4.8 75ef71840539 ("mm, vmstat: add infrastructure for per-node vmstats")
> split NR_VM_NODE_STAT_ITEMS out of NR_VM_ZONE_STAT_ITEMS without updating
> vmstat_refresh(): so it has been missing out much of the vmstat underflow
> checking ever since. Reinstate it. Thanks to Roman Gushchin <guro@fb.com>
> for tangentially pointing this out.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

> ---
> 
>  mm/vmstat.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- 5.12-rc/mm/vmstat.c	2021-02-24 12:03:55.000000000 -0800
> +++ vmstat1/mm/vmstat.c	2021-02-25 11:50:36.000000000 -0800
> @@ -1857,6 +1857,14 @@ int vmstat_refresh(struct ctl_table *tab
>  		}
>  	}
>  #endif
> +	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> +		val = atomic_long_read(&vm_node_stat[i]);
> +		if (val < 0) {
> +			pr_warn("%s: %s %ld\n",
> +				__func__, node_stat_name(i), val);
> +			err = -EINVAL;
> +		}
> +	}
>  	if (err)
>  		return err;
>  	if (write)


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-03-04  2:25 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-25 23:10 [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Hugh Dickins
2021-02-25 23:12 ` [PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh Hugh Dickins
2021-03-01  0:38   ` Roman Gushchin
2021-02-25 23:14 ` [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats Hugh Dickins
2021-03-01  0:53   ` Roman Gushchin
2021-03-01 22:08     ` Hugh Dickins
2021-03-02  0:34       ` Roman Gushchin
2021-03-02  6:03         ` [PATCH v2 " Hugh Dickins
2021-03-04  2:25           ` Roman Gushchin
2021-02-25 23:15 ` [PATCH 4/4] mm: /proc//sys/vm/stat_refresh stop checking monotonic numa stats Hugh Dickins
2021-03-01  0:53   ` Roman Gushchin
2021-03-01  0:37 ` [PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).