* [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Rik van Riel @ 2021-08-27  2:01 UTC
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, stable, Michal Hocko,
	Chris Down, Johannes Weiner

Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.

When oomd decides to kill a cgroup, it will force the cgroup memory
to be reclaimed after killing the tasks, by writing to the memory.max
file for that cgroup, forcing the remaining page cache and reclaimable
slab to be reclaimed down to zero.

Previously, on cgroups with some memory.low protection, that write would
result in the memory being reclaimed only down to the memory.low limit,
or likely not at all, with the page cache reclaimed asynchronously later.

With f56ce412a59d the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.

This patch implements the obvious fix.

Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel <riel@surriel.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6bc532..f1782b816c98 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			cgroup_size = max(cgroup_size, protection);
 
 			scan = lruvec_size - lruvec_size * protection /
-				cgroup_size;
+				(cgroup_size + 1);
 
 			/*
 			 * Minimally target SWAP_CLUSTER_MAX pages to keep

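For the curious, the failure is easy to reproduce in miniature. Below is a
minimal userspace sketch of the pre-patch arithmetic, with made-up page
counts standing in for the real lruvec/memcg values: once the cgroup's
usage has raced down to zero and min == 0 gets picked as the protection,
the final division is 0 / 0 and traps with SIGFPE, which in the kernel is
the reported crash.

#include <stdio.h>

int main(void)
{
	unsigned long lruvec_size = 1024;	/* pages on this LRU list */
	unsigned long protection  = 0;		/* min == 0 used as protection */
	unsigned long cgroup_size = 0;		/* usage already raced to zero */
	unsigned long scan;

	/* cgroup_size = max(cgroup_size, protection) leaves it at 0 */
	if (protection > cgroup_size)
		cgroup_size = protection;

	/* the pre-patch expression: divides by zero here (SIGFPE) */
	scan = lruvec_size - lruvec_size * protection / cgroup_size;

	printf("scan = %lu\n", scan);
	return 0;
}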



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Roman Gushchin @ 2021-08-27 16:28 UTC
  To: Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
	Michal Hocko, Chris Down, Johannes Weiner

On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
> 
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
> 
> Previously, on cgroups with some memory.low protection, that write would
> result in the memory being reclaimed only down to the memory.low limit,
> or likely not at all, with the page cache reclaimed asynchronously later.
> 
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
> 
> This patch implements the obvious fix.
> 
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>

It looks like we discover a new divide-by-zero corner case after every
functional change to the memory protection code :)

Acked-by: Roman Gushchin <guro@fb.com>

Thanks, Rik!



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Michal Hocko @ 2021-08-30 11:33 UTC
  To: Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
	Chris Down, Johannes Weiner

On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
> 
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
> 
> Previously, on cgroups with some memory.low protection, that write would
> result in the memory being reclaimed only down to the memory.low limit,
> or likely not at all, with the page cache reclaimed asynchronously later.
> 
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
> 
> This patch implements the obvious fix.

I must be missing something, but how can cgroup_size ever be 0 when it is
max(cgroup_size, protection) and protection != 0?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Rik van Riel @ 2021-08-30 13:24 UTC
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
	Chris Down, Johannes Weiner


On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
> On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> > Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> > proportional memory.low reclaim") introduced a divide by zero corner
> > case when oomd is being used in combination with cgroup memory.low
> > protection.
> > 
> > When oomd decides to kill a cgroup, it will force the cgroup memory
> > to be reclaimed after killing the tasks, by writing to the memory.max
> > file for that cgroup, forcing the remaining page cache and reclaimable
> > slab to be reclaimed down to zero.
> > 
> > Previously, on cgroups with some memory.low protection, that write would
> > result in the memory being reclaimed only down to the memory.low limit,
> > or likely not at all, with the page cache reclaimed asynchronously later.
> > 
> > With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> > way down to zero, which may race with another reclaimer, to the point of
> > ending up with the divide by zero below.
> > 
> > This patch implements the obvious fix.
> 
> I must be missing something, but how can cgroup_size ever be 0 when it is
> max(cgroup_size, protection) and protection != 0?

Going into the conditional we test if (min || low), where it is
possible to have low > 0 && min == 0.

Inside the conditional, when sc->memcg_low_reclaim is set, we can
end up testing against min.
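
In code form, the path after f56ce412a59d looks roughly like this
(paraphrased from get_scan_count(), not a verbatim quote of the tree):

	if (min || low) {			/* taken: low > 0 */
		unsigned long cgroup_size = mem_cgroup_size(memcg);
		unsigned long protection;

		if (!sc->memcg_low_reclaim && low > min) {
			protection = low;
			sc->memcg_low_skipped = 1;
		} else {
			protection = min;	/* low reclaim: min == 0 */
		}

		cgroup_size = max(cgroup_size, protection);	/* max(0, 0) */

		scan = lruvec_size - lruvec_size * protection /
			cgroup_size;		/* 0 / 0 */
	}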

-- 
All Rights Reversed.



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Michal Hocko @ 2021-08-30 13:41 UTC
  To: Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
	Chris Down, Johannes Weiner

On Mon 30-08-21 09:24:22, Rik van Riel wrote:
> On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
[...]
> > I must be missing something, but how can cgroup_size ever be 0 when it is
> > max(cgroup_size, protection) and protection != 0?
> 
> Going into the conditional we test if (min || low), where it is
> possible to have low > 0 && min == 0.
> 
> Inside the conditional, when sc->memcg_low_reclaim is set, we can
> end up testing against min.

Dang, I was looking at the tree without f56ce412a59d7 applied. My bad!
Personally I would consider the following slightly easier to follow:
			scan = lruvec_size - lruvec_size * protection /
				max(cgroup_size, 1);

The code is quite tricky already, and if you asked me what kind of effect
cgroup_size + 1 has there I would just shrug my shoulders...
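
For what it's worth, the two variants only drift apart at the margins.
A quick userspace comparison with made-up numbers (a fully protected
cgroup, usage == protection == 100 pages) shows the rounding effect:

#include <stdio.h>

int main(void)
{
	unsigned long lruvec_size = 100, protection = 100, cgroup_size = 100;
	unsigned long div = cgroup_size > 1 ? cgroup_size : 1;

	/* +1 variant: 100 - 100*100/101 = 100 - 99 = 1 page scanned */
	printf("+1:    scan = %lu\n",
	       lruvec_size - lruvec_size * protection / (cgroup_size + 1));

	/* max() variant: 100 - 100*100/100 = 0 pages scanned */
	printf("max(): scan = %lu\n",
	       lruvec_size - lruvec_size * protection / div);

	return 0;
}

So max() keeps a fully protected cgroup at scan = 0, while the +1 trick
lets a token page through; either way the 0 / 0 case is gone.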

Anyway your fix will prevent the reported problem and I cannot see any
obvious problem with it either so
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Johannes Weiner @ 2021-08-30 20:48 UTC
  To: Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
	Michal Hocko, Chris Down

On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
> 
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
> 
> Previously, on cgroups with some memory.low protection, that write would
> result in the memory being reclaimed only down to the memory.low limit,
> or likely not at all, with the page cache reclaimed asynchronously later.
> 
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
> 
> This patch implements the obvious fix.
> 
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>

That took me a second.

Before the patch, that sc->memcg_low_reclaim test was outside of that
whole proportional reclaim branch. So if we were in low reclaim mode
we wouldn't even check if a low setting is in place; if min is zero,
we don't enter the proportional branch.

Now we enter if low is set but ignored, and then end up with
cgroup_size == min == 0 == divide by black hole.

Good catch.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeae2f6bc532..f1782b816c98 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			cgroup_size = max(cgroup_size, protection);
>  
>  			scan = lruvec_size - lruvec_size * protection /
> -				cgroup_size;
> +				(cgroup_size + 1);

I have no overly strong preferences, but if Michal prefers max(), how about:

	cgroup_size = max3(cgroup_size, protection, 1);

Or go back to not taking the branch in the first place when there is
no protection in effect...

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6247f6f4469a..9c200bb3ae51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
 				      &min, &low);
 
-		if (min || low) {
+		if (min || (!sc->memcg_low_reclaim && low)) {
 			/*
 			 * Scale a cgroup's reclaim pressure by proportioning
 			 * its current usage to its memory.low or memory.min



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Michal Hocko @ 2021-08-31  9:59 UTC
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	stable, Chris Down

On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
[...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index eeae2f6bc532..f1782b816c98 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  			cgroup_size = max(cgroup_size, protection);
> >  
> >  			scan = lruvec_size - lruvec_size * protection /
> > -				cgroup_size;
> > +				(cgroup_size + 1);
> 
> I have no overly strong preferences, but if Michal prefers max(), how about:
> 
> 	cgroup_size = max3(cgroup_size, protection, 1);

Yes this is better.

> Or go back to not taking the branch in the first place when there is
> no protection in effect...
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6247f6f4469a..9c200bb3ae51 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
>  				      &min, &low);
>  
> -		if (min || low) {
> +		if (min || (!sc->memcg_low_reclaim && low)) {
>  			/*
>  			 * Scale a cgroup's reclaim pressure by proportioning
>  			 * its current usage to its memory.low or memory.min

This is slightly more complex to read, but it is also better than the
+1 trick.

Either of the two works for me.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Chris Down @ 2021-08-31 12:58 UTC
  To: Rik van Riel
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
	Michal Hocko, Johannes Weiner

Rik van Riel writes:
>Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
>proportional memory.low reclaim") introduced a divide by zero corner
>case when oomd is being used in combination with cgroup memory.low
>protection.
>
>When oomd decides to kill a cgroup, it will force the cgroup memory
>to be reclaimed after killing the tasks, by writing to the memory.max
>file for that cgroup, forcing the remaining page cache and reclaimable
>slab to be reclaimed down to zero.
>
>Previously, on cgroups with some memory.low protection, that write would
>result in the memory being reclaimed only down to the memory.low limit,
>or likely not at all, with the page cache reclaimed asynchronously later.
>
>With f56ce412a59d the oomd write to memory.max tries to reclaim all the
>way down to zero, which may race with another reclaimer, to the point of
>ending up with the divide by zero below.
>
>This patch implements the obvious fix.
>
>Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
>Signed-off-by: Rik van Riel <riel@surriel.com>

Thanks, good catch. No strong preference on this vs. max3(), so feel free to 
keep my ack either way.

Acked-by: Chris Down <chris@chrisdown.name>

>
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index eeae2f6bc532..f1782b816c98 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> 			cgroup_size = max(cgroup_size, protection);
>
> 			scan = lruvec_size - lruvec_size * protection /
>-				cgroup_size;
>+				(cgroup_size + 1);
>
> 			/*
> 			 * Minimally target SWAP_CLUSTER_MAX pages to keep
>
>



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Rik van Riel @ 2021-08-31 15:48 UTC
  To: Michal Hocko, Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable, Chris Down


On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
> On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> 
> 
> > Or go back to not taking the branch in the first place when there
> > is no protection in effect...
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6247f6f4469a..9c200bb3ae51 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> >  				      &min, &low);
> >  
> > -		if (min || low) {
> > +		if (min || (!sc->memcg_low_reclaim && low)) {
> >  			/*
> >  			 * Scale a cgroup's reclaim pressure by proportioning
> >  			 * its current usage to its memory.low or memory.min
> 
> This is slightly more complex to read, but it is also better than the
> +1 trick.

We could also fold it into the helper function, with
mem_cgroup_protection deciding whether to use low or
min for the protection limit, and then we key the rest
of our decisions off that.

Wait a minute, that's pretty much what mem_cgroup_protection
looked like before f56ce412a59d ("mm: memcontrol: fix occasional
OOMs due to proportional memory.low reclaim").

Now I'm confused how that changeset works.

Before f56ce412a59d, mem_cgroup_protection would return
memcg->memory.emin if sc->memcg_low_reclaim is true, and
memcg->memory.elow when not.

After f56ce412a59d, we still do the same thing. We just
also set sc->memcg_low_skipped so we know to come back
if we could not hit our target without skipping groups
with memory.low protection...
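
From memory (so this is a paraphrase, stripped to the part that matters
here, not the actual tree), the old helper was essentially:

static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
						  bool in_low_reclaim)
{
	if (in_low_reclaim)
		return READ_ONCE(memcg->memory.emin);

	return READ_ONCE(memcg->memory.elow);
}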

-- 
All Rights Reversed.



* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
From: Johannes Weiner @ 2021-09-01 19:40 UTC
  To: Rik van Riel
  Cc: Michal Hocko, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	stable, Chris Down

On Tue, Aug 31, 2021 at 11:48:28AM -0400, Rik van Riel wrote:
> On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
> > On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> > 
> > 
> > > Or go back to not taking the branch in the first place when there
> > > is no protection in effect...
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 6247f6f4469a..9c200bb3ae51 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >  		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> > >  				      &min, &low);
> > >  
> > > -		if (min || low) {
> > > +		if (min || (!sc->memcg_low_reclaim && low)) {
> > >  			/*
> > >  			 * Scale a cgroup's reclaim pressure by proportioning
> > >  			 * its current usage to its memory.low or memory.min
> > 
> > This is slightly more complex to read, but it is also better than the
> > +1 trick.
> 
> We could also fold it into the helper function, with
> mem_cgroup_protection deciding whether to use low or
> min for the protection limit, and then we key the rest
> of our decisions off that.
> 
> Wait a minute, that's pretty much what mem_cgroup_protection
> looked like before f56ce412a59d ("mm: memcontrol: fix occasional
> OOMs due to proportional memory.low reclaim").
> 
> Now I'm confused how that changeset works.
> 
> Before f56ce412a59d, mem_cgroup_protection would return
> memcg->memory.emin if sc->memcg_low_reclaim is true, and
> memcg->memory.elow when not.
> 
> After f56ce412a59d, we still do the same thing. We just
> also set sc->memcg_low_skipped so we know to come back
> if we could not hit our target without skipping groups
> with memory.low protection...

Yeah, I just bubbled the sc->memcg_low_reclaim test up into vmscan.c
so we can modify sc->memcg_low_skipped based on it. scan_control is
vmscan.c-scoped, and I tried to retain that and avoid doing things
like passing &sc->memcg_low_skipped into memcg code.


