* Re: [PATCH] Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
2021-01-22 18:43 [PATCH] Revert "mm: memcontrol: avoid workload stalls when lowering memory.high" Johannes Weiner
@ 2021-01-22 19:52 ` Roman Gushchin
2021-01-22 20:57 ` Shakeel Butt
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Roman Gushchin @ 2021-01-22 19:52 UTC
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Shakeel Butt, Michal Koutný,
Tejun Heo, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Jan 22, 2021 at 01:43:41PM -0500, Johannes Weiner wrote:
> This reverts commit 536d3bf261a2fc3b05b3e91e7eef7383443015cf, as it
> can cause writers to memory.high to get stuck in the kernel forever,
> performing page reclaim and consuming excessive amounts of CPU cycles.
>
> Before the patch, a write to memory.high would first put the new limit
> in place for the workload, and then reclaim the requested delta. After
> the patch, the kernel tries to reclaim the delta before putting the
> new limit into place, in order to not overwhelm the workload with a
> sudden, large excess over the limit. However, if reclaim is actively
> racing with new allocations from the uncurbed workload, it can keep
> the write() working inside the kernel indefinitely.
>
> This is causing problems in Facebook production. A privileged
> system-level daemon that adjusts memory.high for various workloads
> running on a host can get unexpectedly stuck in the kernel and
> essentially turn into a sort of involuntary kswapd for one of the
> workloads. We've observed that daemon busy-spin in a write() for
> minutes at a time, neglecting its other duties on the system, and
> expending privileged system resources on behalf of a workload.
>
> To remedy this, we have first considered changing the reclaim logic to
> break out after a couple of loops - whether the workload has converged
> to the new limit or not - and bound the write() call this way.
> However, the root cause that inspired the sequence change in the first
> place has been fixed through other means, and so a revert back to the
> proven limit-setting sequence, also used by memory.max, is preferable.
>
> The sequence was changed to avoid extreme latencies in the workload
> when the limit was lowered: the sudden, large excess created by the
> limit lowering would erroneously trigger the penalty sleeping code
> that is meant to throttle excessive growth from below. Allocating
> threads could end up sleeping long after the write() had already
> reclaimed the delta for which they were being punished.
>
> However, erroneous throttling also caused problems in other scenarios
> at around the same time. This resulted in commit b3ff92916af3 ("mm,
> memcg: reclaim more aggressively before high allocator throttling"),
> included in the same release as the offending commit. When allocating
> threads now encounter large excess caused by a racing write() to
> memory.high, instead of entering punitive sleeps, they will simply be
> tasked with helping reclaim down the excess, and will be held no
> longer than it takes to accomplish that. This is in line with regular
> limit enforcement - i.e. if the workload allocates up against or over
> an otherwise unchanged limit from below.
>
> With the patch breaking userspace, and the root cause addressed by
> other means already, revert it again.
>
> Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> Cc: <stable@vger.kernel.org> # 5.8+
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
* Re: [PATCH] Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
From: Shakeel Butt @ 2021-01-22 20:57 UTC
To: Johannes Weiner
Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Michal Koutný,
Tejun Heo, Linux MM, Cgroups, LKML, Kernel Team
On Fri, Jan 22, 2021 at 10:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> [...]
Reviewed-by: Shakeel Butt <shakeelb@google.com>
* Re: [PATCH] Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
From: Michal Hocko @ 2021-01-25 8:58 UTC
To: Johannes Weiner
Cc: Andrew Morton, Roman Gushchin, Shakeel Butt, Michal Koutný,
Tejun Heo, linux-mm, cgroups, linux-kernel, kernel-team
On Fri 22-01-21 13:43:41, Johannes Weiner wrote:
> [...]
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks for extending the changelog to describe the scenario in more
detail.
> ---
> mm/memcontrol.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> Andrew, this is a replacement for
> mm-memcontrol-prevent-starvation-when-writing-memoryhigh.patch
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 605f671203ef..a8611a62bafd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6273,6 +6273,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  	if (err)
>  		return err;
>
> +	page_counter_set_high(&memcg->memory, high);
> +
>  	for (;;) {
>  		unsigned long nr_pages = page_counter_read(&memcg->memory);
>  		unsigned long reclaimed;
> @@ -6296,10 +6298,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  			break;
>  	}
>
> -	page_counter_set_high(&memcg->memory, high);
> -
>  	memcg_wb_domain_size_changed(memcg);
> -
>  	return nbytes;
>  }
>
> --
> 2.30.0
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH] Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
From: Chris Down @ 2021-01-25 14:00 UTC
To: Johannes Weiner
Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Shakeel Butt, Michal Koutný,
Tejun Heo, linux-mm, cgroups, linux-kernel, kernel-team
Johannes Weiner writes:
>[...]
Acked-by: Chris Down <chris@chrisdown.name>