* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-27 2:01 [PATCH] mm,vmscan: fix divide by zero in get_scan_count Rik van Riel
@ 2021-08-27 16:28 ` Roman Gushchin
2021-08-30 11:33 ` Michal Hocko
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2021-08-27 16:28 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
Michal Hocko, Chris Down, Johannes Weiner
On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
>
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
>
> Previously, on cgroups with some memory.low protection that would result
> in the memory being reclaimed down to the memory.low limit, or likely not
> at all, having the page cache reclaimed asynchronously later.
>
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
>
> This patch implements the obvious fix.
>
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>
It looks like we discover a new divide-by-zero corner case after every
functional change to the memory protection code :)
Acked-by: Roman Gushchin <guro@fb.com>
Thanks, Rik!
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-27 2:01 [PATCH] mm,vmscan: fix divide by zero in get_scan_count Rik van Riel
2021-08-27 16:28 ` Roman Gushchin
@ 2021-08-30 11:33 ` Michal Hocko
2021-08-30 13:24 ` Rik van Riel
2021-08-30 20:48 ` Johannes Weiner
2021-08-31 12:58 ` Chris Down
3 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2021-08-30 11:33 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
Chris Down, Johannes Weiner
On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
>
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
>
> Previously, on cgroups with some memory.low protection that would result
> in the memory being reclaimed down to the memory.low limit, or likely not
> at all, having the page cache reclaimed asynchronously later.
>
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
>
> This patch implements the obvious fix.
I must be missing something but how can cgroup_size be ever 0 when it is
max(cgroup_size, protection) and protection != 0?
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-30 11:33 ` Michal Hocko
@ 2021-08-30 13:24 ` Rik van Riel
2021-08-30 13:41 ` Michal Hocko
0 siblings, 1 reply; 10+ messages in thread
From: Rik van Riel @ 2021-08-30 13:24 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
Chris Down, Johannes Weiner
On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
> On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> > Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> > proportional memory.low reclaim") introduced a divide by zero corner
> > case when oomd is being used in combination with cgroup memory.low
> > protection.
> >
> > When oomd decides to kill a cgroup, it will force the cgroup memory
> > to be reclaimed after killing the tasks, by writing to the memory.max
> > file for that cgroup, forcing the remaining page cache and reclaimable
> > slab to be reclaimed down to zero.
> >
> > Previously, on cgroups with some memory.low protection that would result
> > in the memory being reclaimed down to the memory.low limit, or likely not
> > at all, having the page cache reclaimed asynchronously later.
> >
> > With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> > way down to zero, which may race with another reclaimer, to the point of
> > ending up with the divide by zero below.
> >
> > This patch implements the obvious fix.
>
> I must be missing something but how can cgroup_size be ever 0 when it is
> max(cgroup_size, protection) and protection != 0?
Going into the condition we use if (low || min), where
it is possible for low > 0 && min == 0.
Inside the conditional, we can end up testing against
min.
--
All Rights Reversed.
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-30 13:24 ` Rik van Riel
@ 2021-08-30 13:41 ` Michal Hocko
0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2021-08-30 13:41 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
Chris Down, Johannes Weiner
On Mon 30-08-21 09:24:22, Rik van Riel wrote:
> On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
[...]
> > I must be missing something but how can cgroup_size be ever 0 when it is
> > max(cgroup_size, protection) and protection != 0?
>
> Going into the condition we use if (low || min), where
> it is possible for low > 0 && min == 0.
>
> Inside the conditional, we can end up testing against
> min.
Dang, I was looking at the tree without f56ce412a59d7 applied. My bad!
Personally I would consider the following slightly easier to follow
scan = lruvec_size - lruvec_size * protection /
max(cgroup_size, 1);
The code is quite tricky already, and if you asked me what kind of effect
cgroup_size + 1 has there, I would just shrug my shoulders...
Anyway your fix will prevent the reported problem and I cannot see any
obvious problem with it either so
Acked-by: Michal Hocko <mhocko@suse.com>
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-27 2:01 [PATCH] mm,vmscan: fix divide by zero in get_scan_count Rik van Riel
2021-08-27 16:28 ` Roman Gushchin
2021-08-30 11:33 ` Michal Hocko
@ 2021-08-30 20:48 ` Johannes Weiner
2021-08-31 9:59 ` Michal Hocko
2021-08-31 12:58 ` Chris Down
3 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2021-08-30 20:48 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
Michal Hocko, Chris Down
On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
>
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
>
> Previously, on cgroups with some memory.low protection that would result
> in the memory being reclaimed down to the memory.low limit, or likely not
> at all, having the page cache reclaimed asynchronously later.
>
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
>
> This patch implements the obvious fix.
>
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>
That took me a second.
Before the patch, that sc->memcg_low_reclaim test was outside of that
whole proportional reclaim branch. So if we were in low reclaim mode
we wouldn't even check if a low setting is in place; if min is zero,
we don't enter the proportional branch.
Now we enter if low is set but ignored, and then end up with
cgroup_size == min == 0 == divide by black hole.
Good catch.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeae2f6bc532..f1782b816c98 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> cgroup_size = max(cgroup_size, protection);
>
> scan = lruvec_size - lruvec_size * protection /
> - cgroup_size;
> + (cgroup_size + 1);
I have no overly strong preferences, but if Michal prefers max(), how about:
cgroup_size = max3(cgroup_size, protection, 1);
Or go back to not taking the branch in the first place when there is
no protection in effect...
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6247f6f4469a..9c200bb3ae51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
mem_cgroup_protection(sc->target_mem_cgroup, memcg,
&min, &low);
- if (min || low) {
+ if (min || (!sc->memcg_low_reclaim && low)) {
/*
* Scale a cgroup's reclaim pressure by proportioning
* its current usage to its memory.low or memory.min
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-30 20:48 ` Johannes Weiner
@ 2021-08-31 9:59 ` Michal Hocko
2021-08-31 15:48 ` Rik van Riel
0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2021-08-31 9:59 UTC (permalink / raw)
To: Johannes Weiner
Cc: Rik van Riel, Andrew Morton, linux-mm, linux-kernel, kernel-team,
stable, Chris Down
On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
[...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index eeae2f6bc532..f1782b816c98 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > cgroup_size = max(cgroup_size, protection);
> >
> > scan = lruvec_size - lruvec_size * protection /
> > - cgroup_size;
> > + (cgroup_size + 1);
>
> I have no overly strong preferences, but if Michal prefers max(), how about:
>
> cgroup_size = max3(cgroup_size, protection, 1);
Yes this is better.
> Or go back to not taking the branch in the first place when there is
> no protection in effect...
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6247f6f4469a..9c200bb3ae51 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> &min, &low);
>
> - if (min || low) {
> + if (min || (!sc->memcg_low_reclaim && low)) {
> /*
> * Scale a cgroup's reclaim pressure by proportioning
> * its current usage to its memory.low or memory.min
This is slightly more complex to read but it is also better than +1
trick.
Either of the two work for me.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-31 9:59 ` Michal Hocko
@ 2021-08-31 15:48 ` Rik van Riel
2021-09-01 19:40 ` Johannes Weiner
0 siblings, 1 reply; 10+ messages in thread
From: Rik van Riel @ 2021-08-31 15:48 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable, Chris Down
On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
> On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
>
>
> > Or go back to not taking the branch in the first place when there is
> > no protection in effect...
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6247f6f4469a..9c200bb3ae51 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > 	mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> > 			      &min, &low);
> >
> > -	if (min || low) {
> > +	if (min || (!sc->memcg_low_reclaim && low)) {
> > 	/*
> > 	 * Scale a cgroup's reclaim pressure by proportioning
> > 	 * its current usage to its memory.low or memory.min
>
> This is slightly more complex to read but it is also better than +1
> trick.
We could also fold it into the helper function, with
mem_cgroup_protection deciding whether to use low or
min for the protection limit, and then we key the rest
of our decisions off that.
Wait a minute, that's pretty much what mem_cgroup_protection
looked like before f56ce412a59d ("mm: memcontrol: fix occasional
OOMs due to proportional memory.low reclaim")
Now I'm confused how that changeset works.
Before f56ce412a59d, mem_cgroup_protection would return
memcg->memory.emin if sc->memcg_low_reclaim is true, and
memcg->memory.elow when not.
After f56ce412a59d, we still do the same thing. We just
also set sc->memcg_low_skipped so we know to come back
if we could not hit our target without skipping groups
with memory.low protection...
--
All Rights Reversed.
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-31 15:48 ` Rik van Riel
@ 2021-09-01 19:40 ` Johannes Weiner
0 siblings, 0 replies; 10+ messages in thread
From: Johannes Weiner @ 2021-09-01 19:40 UTC (permalink / raw)
To: Rik van Riel
Cc: Michal Hocko, Andrew Morton, linux-mm, linux-kernel, kernel-team,
stable, Chris Down
On Tue, Aug 31, 2021 at 11:48:28AM -0400, Rik van Riel wrote:
> On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
> > On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> >
> >
> > > Or go back to not taking the branch in the first place when there is
> > > no protection in effect...
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 6247f6f4469a..9c200bb3ae51 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > > 	mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> > > 			      &min, &low);
> > >
> > > -	if (min || low) {
> > > +	if (min || (!sc->memcg_low_reclaim && low)) {
> > > 	/*
> > > 	 * Scale a cgroup's reclaim pressure by proportioning
> > > 	 * its current usage to its memory.low or memory.min
> >
> > This is slightly more complex to read but it is also better than +1
> > trick.
>
> We could also fold it into the helper function, with
> mem_cgroup_protection deciding whether to use low or
> min for the protection limit, and then we key the rest
> of our decisions off that.
>
> Wait a minute, that's pretty much what mem_cgroup_protection
> looked like before f56ce412a59d ("mm: memcontrol: fix occasional
> OOMs due to proportional memory.low reclaim")
>
> Now I'm confused how that changeset works.
>
> Before f56ce412a59d, mem_cgroup_protection would return
> memcg->memory.emin if sc->memcg_low_reclaim is true, and
> memcg->memory.elow when not.
>
> After f56ce412a59d, we still do the same thing. We just
> also set sc->memcg_low_skipped so we know to come back
> if we could not hit our target without skipping groups
> with memory.low protection...
Yeah, I just bubbled the sc->memcg_low_reclaim test up into vmscan.c
so we can modify sc->memcg_low_skipped based on it. scan_control is
vmscan.c-scoped, and I tried to retain that and avoid doing things like
passing &sc->memcg_low_skipped into memcg code.
* Re: [PATCH] mm,vmscan: fix divide by zero in get_scan_count
2021-08-27 2:01 [PATCH] mm,vmscan: fix divide by zero in get_scan_count Rik van Riel
` (2 preceding siblings ...)
2021-08-30 20:48 ` Johannes Weiner
@ 2021-08-31 12:58 ` Chris Down
3 siblings, 0 replies; 10+ messages in thread
From: Chris Down @ 2021-08-31 12:58 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, stable,
Michal Hocko, Johannes Weiner
Rik van Riel writes:
>Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
>proportional memory.low reclaim") introduced a divide by zero corner
>case when oomd is being used in combination with cgroup memory.low
>protection.
>
>When oomd decides to kill a cgroup, it will force the cgroup memory
>to be reclaimed after killing the tasks, by writing to the memory.max
>file for that cgroup, forcing the remaining page cache and reclaimable
>slab to be reclaimed down to zero.
>
>Previously, on cgroups with some memory.low protection that would result
>in the memory being reclaimed down to the memory.low limit, or likely not
>at all, having the page cache reclaimed asynchronously later.
>
>With f56ce412a59d the oomd write to memory.max tries to reclaim all the
>way down to zero, which may race with another reclaimer, to the point of
>ending up with the divide by zero below.
>
>This patch implements the obvious fix.
>
>Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
>Signed-off-by: Rik van Riel <riel@surriel.com>
Thanks, good catch. No strong preference on this vs. max3(), so feel free to
keep my ack either way.
Acked-by: Chris Down <chris@chrisdown.name>
>
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index eeae2f6bc532..f1782b816c98 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> cgroup_size = max(cgroup_size, protection);
>
> scan = lruvec_size - lruvec_size * protection /
>- cgroup_size;
>+ (cgroup_size + 1);
>
> /*
> * Minimally target SWAP_CLUSTER_MAX pages to keep
>
>