All of lore.kernel.org
* [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-09-10 12:39 Vasily Averin
  2021-09-10 13:04 ` Tetsuo Handa
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Vasily Averin @ 2021-09-10 12:39 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Tetsuo Handa
  Cc: cgroups, linux-mm, linux-kernel

The kernel currently allows dying tasks to exceed the memcg limits.
The allocation is expected to be the last one and the occupied memory
will be freed soon.
This is not always true because it can be part of a huge vmalloc
allocation. Allowed once, such allocations will repeat over and over again.
Moreover, the lifetime of the allocated object can differ from
the lifetime of the dying task.
Multiple such allocations running concurrently can not only overuse
the memcg limit, but can also lead to a global out of memory and,
in the worst case, cause the host to panic.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 389b5766e74f..67195fcfbddf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		return OOM_ASYNC;
 	}
 
+	if (should_force_charge())
+		return OOM_SKIPPED;
+
 	mem_cgroup_mark_under_oom(memcg);
 
 	locked = mem_cgroup_oom_trylock(memcg);
@@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2688,9 +2682,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
-
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
 	 * a forward progress or bypass the charge if the oom killer
@@ -2698,15 +2689,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
+	} else if (oom_status == OOM_FAILED)
 		goto force;
-	default:
-		goto nomem;
-	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-10 12:39 [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
@ 2021-09-10 13:04 ` Tetsuo Handa
  2021-09-10 13:20   ` Vasily Averin
  2021-09-10 13:07 ` [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Tetsuo Handa @ 2021-09-10 13:04 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, linux-mm, linux-kernel, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Andrew Morton

On 2021/09/10 21:39, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover, the lifetime of the allocated object can differ from
> the lifetime of the dying task.

Can't we add fatal_signal_pending(current) test to vmalloc() loop?

> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-10 12:39 [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
  2021-09-10 13:04 ` Tetsuo Handa
@ 2021-09-10 13:07 ` Vasily Averin
  2021-09-13  7:51 ` Vasily Averin
  2021-09-13  8:53 ` Michal Hocko
  3 siblings, 0 replies; 34+ messages in thread
From: Vasily Averin @ 2021-09-10 13:07 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Tetsuo Handa
  Cc: cgroups, linux-mm, linux-kernel

On 9/10/21 3:39 PM, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover, the lifetime of the allocated object can differ from
> the lifetime of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.

btw, the should_force_charge() function name becomes wrong with this.
Does it make sense to replace it with something like is_task_dying()?

> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		return OOM_ASYNC;
>  	}
>  
> +	if (should_force_charge())
> +		return OOM_SKIPPED;
> +
>  	mem_cgroup_mark_under_oom(memcg);
>  
>  	locked = mem_cgroup_oom_trylock(memcg);
> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>  
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2688,9 +2682,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_RETRY_MAYFAIL)
>  		goto nomem;
>  
> -	if (fatal_signal_pending(current))
> -		goto force;
> -
>  	/*
>  	 * keep retrying as long as the memcg oom killer is able to make
>  	 * a forward progress or bypass the charge if the oom killer
> @@ -2698,15 +2689,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> -	switch (oom_status) {
> -	case OOM_SUCCESS:
> +	if (oom_status == OOM_SUCCESS) {
>  		nr_retries = MAX_RECLAIM_RETRIES;
>  		goto retry;
> -	case OOM_FAILED:
> +	} else if (oom_status == OOM_FAILED)
>  		goto force;
> -	default:
> -		goto nomem;
> -	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
> 



* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-10 13:04 ` Tetsuo Handa
@ 2021-09-10 13:20   ` Vasily Averin
  2021-09-10 14:55     ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-10 13:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: cgroups, linux-mm, linux-kernel, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Andrew Morton

On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> On 2021/09/10 21:39, Vasily Averin wrote:
>> The kernel currently allows dying tasks to exceed the memcg limits.
>> The allocation is expected to be the last one and the occupied memory
>> will be freed soon.
>> This is not always true because it can be part of the huge vmalloc
>> allocation. Allowed once, they will repeat over and over again.
>> Moreover, the lifetime of the allocated object can differ from
>> the lifetime of the dying task.
> 
> Can't we add fatal_signal_pending(current) test to vmalloc() loop?

1) this has been done in the past but was reverted later.
2) any vmalloc changes will affect non-memcg allocations too.
 If we're doing memcg-related checks, it's better to do them in one place.
3) it is not a vmalloc-only issue. A huge number of kmalloc page allocations
from N concurrent threads will lead to the same problem.

>> Multiple such allocations running concurrently can not only overuse
>> the memcg limit, but can lead to a global out of memory and,
>> in the worst case, cause the host to panic.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>



* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-10 13:20   ` Vasily Averin
@ 2021-09-10 14:55     ` Michal Hocko
  2021-09-13  8:29       ` Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-10 14:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Tetsuo Handa, cgroups, linux-mm, linux-kernel, Johannes Weiner,
	Vladimir Davydov, Andrew Morton

On Fri 10-09-21 16:20:58, Vasily Averin wrote:
> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> > On 2021/09/10 21:39, Vasily Averin wrote:
> >> The kernel currently allows dying tasks to exceed the memcg limits.
> >> The allocation is expected to be the last one and the occupied memory
> >> will be freed soon.
> >> This is not always true because it can be part of the huge vmalloc
> >> allocation. Allowed once, they will repeat over and over again.
> >> Moreover, the lifetime of the allocated object can differ from
> >> the lifetime of the dying task.
> > 
> > Can't we add fatal_signal_pending(current) test to vmalloc() loop?

We can and we should.

> 1) this has been done in the past but has been reverted later.

The reason for that should be addressed IIRC.

> 2) any vmalloc changes will affect non-memcg allocations too.
>  If we're doing memcg-related checks it's better to do it in one place.

I think those two things are just orthogonal. Bailing out from vmalloc
early sounds reasonable to me on its own. Allocating a large thing that
is likely to go away with the allocating context is just a waste of
resources and a potential source of disruption to others.

> 3) it is not a vmalloc-only issue. A huge number of kmalloc page allocations
> from N concurrent threads will lead to the same problem.

Agreed. I do not think it is viable to sprinkle fatal_signal_pending or
similar checks all over the code. This is better limited to the
allocators and the charging function.

The assumption described in the code simply doesn't hold, and it is
close to impossible to audit all charging paths to make them bail out
properly, so I think we should drop that optimistic attitude and not
force charges unless that is absolutely necessary (e.g. __GFP_NOFAIL) or
impractical to avoid (e.g. atomic allocations), and keep any overrun bounded.

I didn't get to review the patch yet. This is a tricky area with many
non-obvious corner cases. I will try to find some time next week.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-10 12:39 [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
  2021-09-10 13:04 ` Tetsuo Handa
  2021-09-10 13:07 ` [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
@ 2021-09-13  7:51 ` Vasily Averin
  2021-09-13  8:39   ` Michal Hocko
  2021-09-13  8:53 ` Michal Hocko
  3 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-13  7:51 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Tetsuo Handa
  Cc: cgroups, linux-mm, linux-kernel

On 9/10/21 3:39 PM, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>  
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -

Should we perhaps keep the current behaviour for the (current->flags & PF_EXITING) case?
The flag is set inside do_exit() only and (I hope) cannot trigger huge vmalloc allocations.

Thank you,
	Vasily Averin


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-10 14:55     ` Michal Hocko
@ 2021-09-13  8:29       ` Vasily Averin
  2021-09-13  8:42         ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-13  8:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, cgroups, linux-mm, linux-kernel, Johannes Weiner,
	Vladimir Davydov, Andrew Morton

On 9/10/21 5:55 PM, Michal Hocko wrote:
> On Fri 10-09-21 16:20:58, Vasily Averin wrote:
>> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
>>> Can't we add fatal_signal_pending(current) test to vmalloc() loop?
> 
> We can and we should.
> 
>> 1) this has been done in the past but has been reverted later.
> 
> The reason for that should be addressed IIRC.

I don't know the details of this, and I need some time to investigate it.

>> 2) any vmalloc changes will affect non-memcg allocations too.
>>  If we're doing memcg-related checks it's better to do it in one place.
> 
> I think those two things are just orthogonal. Bailing out from vmalloc
> early sounds reasonable to me on its own. Allocating a large thing that
> is likely to go away with the allocating context is just a waste of
> resources and a potential source of disruption to others.

I doubt that a fatal signal should block all vmalloc allocations.
I assume there are situations where the rollback of some cancelled operation uses vmalloc,
or where saving a coredump to some remote storage uses vmalloc.

However, for me it is abnormal that even the OOM killer cannot cancel a huge vmalloc allocation.
So I think a tsk_is_oom_victim(current) check should be added to vm_area_alloc_pages()
to break the vmalloc cycle.

Thank you,
	Vasily Averin


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-13  7:51 ` Vasily Averin
@ 2021-09-13  8:39   ` Michal Hocko
  2021-09-13  9:37     ` Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-13  8:39 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel

On Mon 13-09-21 10:51:37, Vasily Averin wrote:
> On 9/10/21 3:39 PM, Vasily Averin wrote:
> > The kernel currently allows dying tasks to exceed the memcg limits.
> > The allocation is expected to be the last one and the occupied memory
> > will be freed soon.
> > This is not always true because it can be part of the huge vmalloc
> > allocation. Allowed once, they will repeat over and over again.
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 389b5766e74f..67195fcfbddf 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (gfp_mask & __GFP_ATOMIC)
> >  		goto force;
> >  
> > -	/*
> > -	 * Unlike in global OOM situations, memcg is not in a physical
> > -	 * memory shortage.  Allow dying and OOM-killed tasks to
> > -	 * bypass the last charges so that they can exit quickly and
> > -	 * free their memory.
> > -	 */
> > -	if (unlikely(should_force_charge()))
> > -		goto force;
> > -
> 
> Should we keep current behaviour for (current->flags & PF_EXITING) case perhaps?

Why?

> It is set inside do_exit only and (I hope) cannot trigger huge vmalloc allocations.

Allocations in this code path should be rare, but it is not like they are
non-existent. This is a rather hard-to-review area spread over many places,
so if we are deciding to make the existing model simpler (no bypassing)
then I would rather have no exceptions unless they are really necessary,
and document them if they are.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-13  8:29       ` Vasily Averin
@ 2021-09-13  8:42         ` Michal Hocko
  2021-09-17  8:06           ` [PATCH mm] vmalloc: back off when the current task is OOM-killed Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-13  8:42 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Tetsuo Handa, cgroups, linux-mm, linux-kernel, Johannes Weiner,
	Vladimir Davydov, Andrew Morton

On Mon 13-09-21 11:29:37, Vasily Averin wrote:
> On 9/10/21 5:55 PM, Michal Hocko wrote:
> > On Fri 10-09-21 16:20:58, Vasily Averin wrote:
> >> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> >>> Can't we add fatal_signal_pending(current) test to vmalloc() loop?
> > 
> > We can and we should.
> > 
> >> 1) this has been done in the past but has been reverted later.
> > 
> > The reason for that should be addressed IIRC.
> 
> I don't know the details of this, and I need some time to investigate it.

b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed"")
should give a good insight and references.

> >> 2) any vmalloc changes will affect non-memcg allocations too.
> >>  If we're doing memcg-related checks it's better to do it in one place.
> > 
> > I think those two things are just orthogonal. Bailing out from vmalloc
> > early sounds reasonable to me on its own. Allocating a large thing that
> > is likely to go away with the allocating context is just a waste of
> > resources and a potential source of disruption to others.
> 
> I doubt that fatal signal should block any vmalloc allocations.
> I assume there are situations where rollback of some cancelled operation uses vmalloc.
> Or coredump saving on some remote storage can use vmalloc.

If there really are any such requirements then this should be really
documented. 

> However for me it's abnormal that even OOM-killer cannot cancel huge vmalloc allocation.
> So I think tsk_is_oom_victim(current) check should be added to vm_area_alloc_pages() 
> to break vmalloc cycle.

Why should an OOM-killed task behave any differently from any other task
killed without a way to handle the signal?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-10 12:39 [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
                   ` (2 preceding siblings ...)
  2021-09-13  7:51 ` Vasily Averin
@ 2021-09-13  8:53 ` Michal Hocko
  2021-09-13 10:35   ` Vasily Averin
  3 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-13  8:53 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel

On Fri 10-09-21 15:39:28, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover, the lifetime of the allocated object can differ from
> the lifetime of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		return OOM_ASYNC;
>  	}
>  
> +	if (should_force_charge())
> +		return OOM_SKIPPED;

mem_cgroup_out_of_memory() already checks for the bypass; now you are
duplicating that check with a different answer to the caller. This is
really messy. One of the two has to go away.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-13  8:39   ` Michal Hocko
@ 2021-09-13  9:37     ` Vasily Averin
  2021-09-13 10:10       ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-13  9:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel

On 9/13/21 11:39 AM, Michal Hocko wrote:
> On Mon 13-09-21 10:51:37, Vasily Averin wrote:
>> On 9/10/21 3:39 PM, Vasily Averin wrote:
>>> The kernel currently allows dying tasks to exceed the memcg limits.
>>> The allocation is expected to be the last one and the occupied memory
>>> will be freed soon.
>>> This is not always true because it can be part of the huge vmalloc
>>> allocation. Allowed once, they will repeat over and over again.
>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 389b5766e74f..67195fcfbddf 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	if (gfp_mask & __GFP_ATOMIC)
>>>  		goto force;
>>>  
>>> -	/*
>>> -	 * Unlike in global OOM situations, memcg is not in a physical
>>> -	 * memory shortage.  Allow dying and OOM-killed tasks to
>>> -	 * bypass the last charges so that they can exit quickly and
>>> -	 * free their memory.
>>> -	 */
>>> -	if (unlikely(should_force_charge()))
>>> -		goto force;
>>> -
>>
>> Should we keep current behaviour for (current->flags & PF_EXITING) case perhaps?
> 
> Why?

At this stage the task really dies and mostly releases the resources it has taken.
It can still allocate though, and such an allocation can hit the memcg limit due to
the activity of parallel threads in the same memcg.

Nothing bad should happen if we reject this allocation,
because the same thing can happen in the non-memcg case too.
However, I doubt that misuse is possible here, and we have the possibility to allow a graceful shutdown.

In other words: we are not obliged to allow such allocations, but we CAN do it because
we hope that it is safe and cannot be misused.

>> It is set inside do_exit only and (I hope) cannot trigger huge vmalloc allocations.
> 
> Allocations in this code path should be rare but it is not like they are
> non-existent. This is rather hard to review area spread at many places
> so if we are deciding to make the existing model simpler (no bypassing)
> then I would rather have no exceptions unless they are reaaly necessary
> and document them if they are.

Thank you,
	Vasily Averin


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-13  9:37     ` Vasily Averin
@ 2021-09-13 10:10       ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-09-13 10:10 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel

On Mon 13-09-21 12:37:56, Vasily Averin wrote:
> On 9/13/21 11:39 AM, Michal Hocko wrote:
> > On Mon 13-09-21 10:51:37, Vasily Averin wrote:
> >> On 9/10/21 3:39 PM, Vasily Averin wrote:
> >>> The kernel currently allows dying tasks to exceed the memcg limits.
> >>> The allocation is expected to be the last one and the occupied memory
> >>> will be freed soon.
> >>> This is not always true because it can be part of the huge vmalloc
> >>> allocation. Allowed once, they will repeat over and over again.
> >>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 389b5766e74f..67195fcfbddf 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  	if (gfp_mask & __GFP_ATOMIC)
> >>>  		goto force;
> >>>  
> >>> -	/*
> >>> -	 * Unlike in global OOM situations, memcg is not in a physical
> >>> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> >>> -	 * bypass the last charges so that they can exit quickly and
> >>> -	 * free their memory.
> >>> -	 */
> >>> -	if (unlikely(should_force_charge()))
> >>> -		goto force;
> >>> -
> >>
> >> Should we keep current behaviour for (current->flags & PF_EXITING) case perhaps?
> > 
> > Why?
> 
> On this stage task really dies and mostly releases taken resources.
> It can allocate though, and this allocation can reach memcg limit due to the activity
> of parallel memcg threads.
> 
> Nothing bad should happen if we reject this allocation,
> because the same thing can happen in non-memcg case too.
> However I doubt misuse is possible here and we have possibility to allow graceful shutdown here.
> 
> In other words: we are not obliged to allow such allocations, but we CAN do it because
> we hope that it is safe and cannot be misused.

That is a lot of hoping, which has turned out to be a bad strategy in the
existing code. So let's stop hoping, and if we are shown that an
exit path really benefits from special treatment then we can add it
with good reasoning rather than "we hope it's gonna be ok".
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-13  8:53 ` Michal Hocko
@ 2021-09-13 10:35   ` Vasily Averin
  2021-09-13 10:55     ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-13 10:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel

On 9/13/21 11:53 AM, Michal Hocko wrote:
> On Fri 10-09-21 15:39:28, Vasily Averin wrote:
>> The kernel currently allows dying tasks to exceed the memcg limits.
>> The allocation is expected to be the last one and the occupied memory
>> will be freed soon.
>> This is not always true because it can be part of the huge vmalloc
>> allocation. Allowed once, they will repeat over and over again.
>> Moreover, the lifetime of the allocated object can differ from
>> the lifetime of the dying task.
>> Multiple such allocations running concurrently can not only overuse
>> the memcg limit, but can lead to a global out of memory and,
>> in the worst case, cause the host to panic.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/memcontrol.c | 23 +++++------------------
>>  1 file changed, 5 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 389b5766e74f..67195fcfbddf 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>>  		return OOM_ASYNC;
>>  	}
>>  
>> +	if (should_force_charge())
>> +		return OOM_SKIPPED;
> 
> mem_cgroup_out_of_memory already check for the bypass, now you are
> duplicating that check with a different answer to the caller. This is
> really messy. One of the two has to go away.

In this case mem_cgroup_out_of_memory() takes locks and mutexes while doing nothing
useful, and its success causes try_charge_memcg() to repeat the loop unnecessarily.

I cannot change the mem_cgroup_out_of_memory() internals, because it is used in other
places too. The check inside mem_cgroup_out_of_memory() is still required because the
situation can change after the check added into mem_cgroup_oom().

Though I got your argument, and I will think about how to improve the patch.
Anyway, we'll need to do something about the name of the should_force_charge() function,
which will no longer lead to a forced charge.

Thank you,
	Vasily Averin



* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-13 10:35   ` Vasily Averin
@ 2021-09-13 10:55     ` Michal Hocko
  2021-09-14 10:01       ` Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-13 10:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel

On Mon 13-09-21 13:35:00, Vasily Averin wrote:
> On 9/13/21 11:53 AM, Michal Hocko wrote:
> > On Fri 10-09-21 15:39:28, Vasily Averin wrote:
> >> The kernel currently allows dying tasks to exceed the memcg limits.
> >> The allocation is expected to be the last one and the occupied memory
> >> will be freed soon.
> >> This is not always true because it can be part of the huge vmalloc
> >> allocation. Allowed once, they will repeat over and over again.
> >> Moreover, the lifetime of the allocated object can differ from
> >> the lifetime of the dying task.
> >> Multiple such allocations running concurrently can not only overuse
> >> the memcg limit, but can lead to a global out of memory and,
> >> in the worst case, cause the host to panic.
> >>
> >> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> >> ---
> >>  mm/memcontrol.c | 23 +++++------------------
> >>  1 file changed, 5 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 389b5766e74f..67195fcfbddf 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> >>  		return OOM_ASYNC;
> >>  	}
> >>  
> >> +	if (should_force_charge())
> >> +		return OOM_SKIPPED;
> > 
> > mem_cgroup_out_of_memory already check for the bypass, now you are
> > duplicating that check with a different answer to the caller. This is
> > really messy. One of the two has to go away.
> 
> In this case mem_cgroup_out_of_memory() takes locks and mutexes but doing nothing
> useful and its success causes try_charge_memcg() to repeat the loop unnecessarily.
> 
> I cannot change mem_cgroup_out_of_memory internals, because it is used in other places too.The check inside mem_cgroup_out_of_memory is required because situation can be changed after
> check added into mem_cgroup_oom().
> 
> Though I got your argument, and will think about how to improve the patch.
> Anyway, we'll need to do something about the name of the should_force_charge()
> function, since its use will NOT lead to a forced charge anymore.

Here is what I would do.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 702a81dfe72d..58269721d438 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2588,6 +2588,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2622,15 +2623,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2688,8 +2680,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && should_force_charge())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2698,14 +2691,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-13 10:55     ` Michal Hocko
@ 2021-09-14 10:01       ` Vasily Averin
  2021-09-14 10:10         ` [PATCH memcg v2] " Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-14 10:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel

On 9/13/21 1:55 PM, Michal Hocko wrote:
> Here is what I would do.

Your patch fixes the problem. I just want to add a rename of the should_force_charge()
helper to task_is_dying(), and will submit a v2 version soon.

I think this patch is not optimal, but its improvements can be discussed later.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH memcg v2] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-14 10:01       ` Vasily Averin
@ 2021-09-14 10:10         ` Vasily Averin
  2021-09-16 12:55           ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-14 10:10 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Tetsuo Handa
  Cc: cgroups, linux-mm, linux-kernel, kernel

The kernel currently allows dying tasks to exceed the memcg limits.
The allocation is expected to be the last one and the occupied memory
will be freed soon.
This is not always true because it can be part of a huge vmalloc
allocation. Allowed once, such allocations will repeat over and over again.
Moreover, the lifetime of the allocated object can differ from the lifetime
of the dying task.
Multiple such allocations running concurrently can not only overuse
the memcg limit, but can lead to a global out of memory and,
in the worst case, cause the host to panic.

This patch removes the checks that force exceeding the memcg limit for dying
tasks. It also breaks the endless loop for tasks bypassed by the OOM killer.
In addition, it renames the should_force_charge() helper to task_is_dying(),
because its use no longer leads to a forced charge.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 389b5766e74f..707f6640edda 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -234,7 +234,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1607,7 +1607,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2588,6 +2588,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2622,15 +2623,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2688,8 +2680,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2698,14 +2691,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH memcg v2] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-14 10:10         ` [PATCH memcg v2] " Vasily Averin
@ 2021-09-16 12:55           ` Michal Hocko
  2021-10-05 13:52             ` [PATCH memcg v3] " Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-16 12:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 14-09-21 13:10:04, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
>
> This is not always true because it can be part of a huge vmalloc
> allocation. Allowed once, such allocations will repeat over and over again.
> Moreover, the lifetime of the allocated object can differ from the lifetime
> of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
> 
> This patch removes the checks that force exceeding the memcg limit for dying
> tasks. It also breaks the endless loop for tasks bypassed by the OOM killer.
> In addition, it renames the should_force_charge() helper to task_is_dying(),
> because its use no longer leads to a forced charge.

I would rephrase the changelog as follows to give a broader picture.
"
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This resembles the heuristic for the global OOM situation
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level, so the memcg heuristic is more relaxed.

The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by b8c8a338f75e (Revert "vmalloc: back off when the
current task is killed"). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time. 

It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.
"
feel free to use parts or whole of it.
 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 27 ++++++++-------------------
>  1 file changed, 8 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..707f6640edda 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -234,7 +234,7 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>  		(current->flags & PF_EXITING);
> @@ -1607,7 +1607,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);

The task_is_dying check will prevent the oom killer for dying tasks. There
is an additional bail-out at the out_of_memory layer. These checks now
lead to a completely different behavior. Currently we simply use
"unlimited" reserves and therefore we do not have to kill any task. Now
the charge fails without using all reclaim measures. So I believe we
should drop those checks for memcg oom paths. I have to think about this
some more because I might be missing some other side effects.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-13  8:42         ` Michal Hocko
@ 2021-09-17  8:06           ` Vasily Averin
  2021-09-19 23:31             ` Andrew Morton
  2021-09-22 12:27             ` Michal Hocko
  0 siblings, 2 replies; 34+ messages in thread
From: Vasily Averin @ 2021-09-17  8:06 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Tetsuo Handa
  Cc: cgroups, linux-mm, linux-kernel, kernel

A huge vmalloc allocation on a heavily loaded node can lead to a global
memory shortage. A task that called vmalloc can have the worst badness
and be chosen by the OOM-killer; however, the received fatal signal and
oom victim mark do not interrupt the allocation cycle. Vmalloc will
continue allocating pages over and over again, exacerbating the crisis
and consuming the memory freed up by other killed tasks.

This patch allows the OOM-killer to break the vmalloc cycle, makes OOM more
effective and avoids a host panic.

Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed"") because some vmalloc callers did not handle
failures properly. The issues found then were resolved; however, there may
be other similar places.

Such failures may be acceptable for emergencies, such as OOM. On the other
hand, we would like to detect them earlier. However they are quite rare,
and will be hidden by OOM messages, so I'm afraid they will have quite a
small chance of being noticed and reported.

To improve the detection of such places, this patch also interrupts the vmalloc
allocation cycle for all fatal signals. The checks are hidden under the DEBUG_VM
config option so as not to break unaware production kernels.

Vmalloc uses the new alloc_pages_bulk subsystem, so the newly added checks can
affect other users of this subsystem.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/page_alloc.c | 5 +++++
 mm/vmalloc.c    | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..133d52e507ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5288,6 +5288,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			continue;
 		}
 
+		if (tsk_is_oom_victim(current) ||
+		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
+		     fatal_signal_pending(current)))
+			break;
+
 		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
 								pcp, pcp_list);
 		if (unlikely(!page)) {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c3b8e3e5cfc5..04b291076726 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -38,6 +38,7 @@
 #include <linux/pgtable.h>
 #include <linux/uaccess.h>
 #include <linux/hugetlb.h>
+#include <linux/oom.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -2860,6 +2861,11 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		struct page *page;
 		int i;
 
+		if (tsk_is_oom_victim(current) ||
+		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
+		     fatal_signal_pending(current)))
+			break;
+
 		page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-17  8:06           ` [PATCH mm] vmalloc: back off when the current task is OOM-killed Vasily Averin
@ 2021-09-19 23:31             ` Andrew Morton
  2021-09-20  1:22               ` Tetsuo Handa
  2021-09-22 12:27             ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2021-09-19 23:31 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel, Uladzislau Rezki (Sony)

On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:

> A huge vmalloc allocation on a heavily loaded node can lead to a global
> memory shortage. A task that called vmalloc can have the worst badness
> and be chosen by the OOM-killer; however, the received fatal signal and
> oom victim mark do not interrupt the allocation cycle. Vmalloc will
> continue allocating pages over and over again, exacerbating the crisis
> and consuming the memory freed up by other killed tasks.
> 
> This patch allows the OOM-killer to break the vmalloc cycle, makes OOM more
> effective and avoids a host panic.
> 
> Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
> cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed"") because some vmalloc callers did not handle
> failures properly. The issues found then were resolved; however, there may
> be other similar places.

Well that was lame of us.

I believe that at least one of the kernel testbots can utilize fault
injection.  If we were to wire up vmalloc (as we have done with slab
and pagealloc) then this will help to locate such buggy vmalloc callers.

> Such failures may be acceptable for emergencies, such as OOM. On the other
> hand, we would like to detect them earlier. However they are quite rare,
> and will be hidden by OOM messages, so I'm afraid they will have quite a
> small chance of being noticed and reported.
> 
> To improve the detection of such places, this patch also interrupts the vmalloc
> allocation cycle for all fatal signals. The checks are hidden under the DEBUG_VM
> config option so as not to break unaware production kernels.

This sounds like a pretty sad half-measure?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-19 23:31             ` Andrew Morton
@ 2021-09-20  1:22               ` Tetsuo Handa
  2021-09-20 10:59                 ` Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Tetsuo Handa @ 2021-09-20  1:22 UTC (permalink / raw)
  To: Andrew Morton, Vasily Averin
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, cgroups,
	linux-mm, linux-kernel, kernel, Uladzislau Rezki (Sony)

On 2021/09/20 8:31, Andrew Morton wrote:
> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
> 
>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>> memory shortage. A task that called vmalloc can have the worst badness
>> and be chosen by the OOM-killer; however, the received fatal signal and
>> oom victim mark do not interrupt the allocation cycle. Vmalloc will
>> continue allocating pages over and over again, exacerbating the crisis
>> and consuming the memory freed up by other killed tasks.
>>
>> This patch allows the OOM-killer to break the vmalloc cycle, makes OOM more
>> effective and avoids a host panic.
>>
>> Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
>> cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
>> the current task is killed"") because some vmalloc callers did not handle
>> failures properly. The issues found then were resolved; however, there may
>> be other similar places.
> 
> Well that was lame of us.
> 
> I believe that at least one of the kernel testbots can utilize fault
> injection.  If we were to wire up vmalloc (as we have done with slab
> and pagealloc) then this will help to locate such buggy vmalloc callers.

__alloc_pages_bulk() has three callers.

  alloc_pages_bulk_list() => No in-tree users.

  alloc_pages_bulk_array() => Used by xfs_buf_alloc_pages(), __page_pool_alloc_pages_slow(), svc_alloc_arg().

    xfs_buf_alloc_pages() => Might retry forever until all pages are allocated (i.e. effectively __GFP_NOFAIL). This patch can cause an infinite-loop problem.

    __page_pool_alloc_pages_slow() => Will not retry if allocation failed. This patch might help.

    svc_alloc_arg() => Will not retry if signal pending. This patch might help only if allocating a lot of pages.

  alloc_pages_bulk_array_node() => Used by vm_area_alloc_pages().

vm_area_alloc_pages() => Used by __vmalloc_area_node() from __vmalloc_node_range() from vmalloc functions. Needs !__GFP_NOFAIL check?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-20  1:22               ` Tetsuo Handa
@ 2021-09-20 10:59                 ` Vasily Averin
  2021-09-21 18:55                   ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-20 10:59 UTC (permalink / raw)
  To: Tetsuo Handa, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, cgroups,
	linux-mm, linux-kernel, kernel, Uladzislau Rezki (Sony)

On 9/20/21 4:22 AM, Tetsuo Handa wrote:
> On 2021/09/20 8:31, Andrew Morton wrote:
>> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>>> memory shortage. A task that called vmalloc can have the worst badness
>>> and be chosen by the OOM-killer; however, the received fatal signal and
>>> oom victim mark do not interrupt the allocation cycle. Vmalloc will
>>> continue allocating pages over and over again, exacerbating the crisis
>>> and consuming the memory freed up by other killed tasks.
>>>
>>> This patch allows the OOM-killer to break the vmalloc cycle, makes OOM more
>>> effective and avoids a host panic.
>>>
>>> Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
>>> cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
>>> the current task is killed"") because some vmalloc callers did not handle
>>> failures properly. The issues found then were resolved; however, there may
>>> be other similar places.
>>
>> Well that was lame of us.
>>
>> I believe that at least one of the kernel testbots can utilize fault
>> injection.  If we were to wire up vmalloc (as we have done with slab
>> and pagealloc) then this will help to locate such buggy vmalloc callers.

Andrew, could you please clarify how we can do it?
Do you mean we can use the existing allocation fault injection infrastructure to trigger
this kind of issue? Unfortunately I found no way to reach this goal.
It allows emulating single faults with small probability; however, that is not enough:
we need to completely disable all vmalloc allocations.
I've tried to extend the fault injection infrastructure, but found that it is not trivial.

That's why I've added a direct fatal_signal_pending() check into my patch.
 
> __alloc_pages_bulk() has three callers.
> 
>   alloc_pages_bulk_list() => No in-tree users.
> 
>   alloc_pages_bulk_array() => Used by xfs_buf_alloc_pages(), __page_pool_alloc_pages_slow(), svc_alloc_arg().
> 
>     xfs_buf_alloc_pages() => Might retry forever until all pages are allocated (i.e. effectively __GFP_NOFAIL). This patch can cause an infinite-loop problem.

You are right, I had missed it.
However, __alloc_pages_bulk() can return no new pages without my patch too:
- due to fault injection inside prepare_alloc_pages(),
- if __rmqueue_pcplist() returns NULL and the array already had some assigned pages,
- if both __rmqueue_pcplist() and the following __alloc_pages(0) cannot get any page.
On the other hand, I cannot say that it is a 100% xfs-related issue; it looks strange,
but they have some chance to get a page after a few attempts.

So I think I can change 'break' to 'goto failed_irq', call __alloc_pages(0) and
return 1 page. This seems to be handled correctly in all callers too.

>     __page_pool_alloc_pages_slow() => Will not retry if allocation failed. This patch might help.
> 
>     svc_alloc_arg() => Will not retry if signal pending. This patch might help only if allocating a lot of pages.
> 
>   alloc_pages_bulk_array_node() => Used by vm_area_alloc_pages().
> 
> vm_area_alloc_pages() => Used by __vmalloc_area_node() from __vmalloc_node_range() from vmalloc functions. Needs !__GFP_NOFAIL check?

The comments describing __vmalloc_node() and kvmalloc() claim that __GFP_NOFAIL is not supported,
and I did not find any other callers using this flag.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-20 10:59                 ` Vasily Averin
@ 2021-09-21 18:55                   ` Andrew Morton
  2021-09-22  6:18                     ` Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2021-09-21 18:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Tetsuo Handa, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm, linux-kernel, kernel, Uladzislau Rezki (Sony)

On Mon, 20 Sep 2021 13:59:35 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:

> On 9/20/21 4:22 AM, Tetsuo Handa wrote:
> > On 2021/09/20 8:31, Andrew Morton wrote:
> >> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >>> A huge vmalloc allocation on a heavily loaded node can lead to a global
> >>> memory shortage. A task that called vmalloc can have the worst badness
> >>> and be chosen by the OOM-killer; however, the received fatal signal and
> >>> oom victim mark do not interrupt the allocation cycle. Vmalloc will
> >>> continue allocating pages over and over again, exacerbating the crisis
> >>> and consuming the memory freed up by other killed tasks.
> >>>
> >>> This patch allows the OOM-killer to break the vmalloc cycle, makes OOM more
> >>> effective and avoids a host panic.
> >>>
> >>> Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
> >>> cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> >>> the current task is killed"") because some vmalloc callers did not handle
> >>> failures properly. The issues found then were resolved; however, there may
> >>> be other similar places.
> >>
> >> Well that was lame of us.
> >>
> >> I believe that at least one of the kernel testbots can utilize fault
> >> injection.  If we were to wire up vmalloc (as we have done with slab
> >> and pagealloc) then this will help to locate such buggy vmalloc callers.
> 
> Andrew, could you please clarify how we can do it?
> Do you mean we can use the existing allocation fault injection infrastructure to trigger
> this kind of issue? Unfortunately I found no way to reach this goal.
> It allows emulating single faults with small probability; however, that is not enough:
> we need to completely disable all vmalloc allocations.

I don't see why there's a problem?  You're saying "there might still be
vmalloc() callers which don't correctly handle allocation failures",
yes?

I'm suggesting that we use fault injection to cause a small proportion
of vmalloc() calls to artificially fail, so such buggy callers will
eventually be found and fixed.  Why does such a scheme require that
*all* vmalloc() calls fail?



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-21 18:55                   ` Andrew Morton
@ 2021-09-22  6:18                     ` Vasily Averin
  0 siblings, 0 replies; 34+ messages in thread
From: Vasily Averin @ 2021-09-22  6:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm, linux-kernel, kernel, Uladzislau Rezki (Sony)

On 9/21/21 9:55 PM, Andrew Morton wrote:
> On Mon, 20 Sep 2021 13:59:35 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
> 
>> On 9/20/21 4:22 AM, Tetsuo Handa wrote:
>>> On 2021/09/20 8:31, Andrew Morton wrote:
>>>> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>
>>>>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>>>>> memory shortage. A task that called vmalloc can have the worst badness
>>>>> and be chosen by the OOM-killer; however, the received fatal signal and
>>>>> oom victim mark do not interrupt the allocation cycle. Vmalloc will
>>>>> continue allocating pages over and over again, exacerbating the crisis
>>>>> and consuming the memory freed up by other killed tasks.
>>>>>
>>>>> This patch allows the OOM-killer to break the vmalloc cycle, makes OOM more
>>>>> effective and avoids a host panic.
>>>>>
>>>>> Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
>>>>> cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
>>>>> the current task is killed"") because some vmalloc callers did not handle
>>>>> failures properly. The issues found then were resolved; however, there may
>>>>> be other similar places.
>>>>
>>>> Well that was lame of us.
>>>>
>>>> I believe that at least one of the kernel testbots can utilize fault
>>>> injection.  If we were to wire up vmalloc (as we have done with slab
>>>> and pagealloc) then this will help to locate such buggy vmalloc callers.
>>
>> Andrew, could you please clarify how we can do it?
>> Do you mean we can use the existing allocation fault injection infrastructure to trigger
>> this kind of issue? Unfortunately I found no way to reach this goal.
>> It allows emulating single faults with small probability; however, that is not enough:
>> we need to completely disable all vmalloc allocations.
> 
> I don't see why there's a problem?  You're saying "there might still be
> vmalloc() callers which don't correctly handle allocation failures",
> yes?
> 
> I'm suggesting that we use fault injection to cause a small proportion
> of vmalloc() calls to artificially fail, so such buggy callers will
> eventually be found and fixed.  Why does such a scheme require that
> *all* vmalloc() calls fail?

Let me explain.
1) It is not trivial to use the current allocation fault injection to cause
a small proportion of vmalloc() calls to artificially fail.

vmalloc
 __vmalloc_node
  __vmalloc_node_range
   __vmalloc_area_node
    vm_area_alloc_pages
 
vm_area_alloc_pages() uses the new __alloc_pages_bulk subsystem, requesting up to 100 pages per cycle.
__alloc_pages_bulk() can be interrupted by allocation fault injection; however, in this case
vm_area_alloc_pages() will fall back to the old-style page allocation cycle.
In the general case it successfully finishes the allocation, and vmalloc itself will not fail.

To fail vmalloc we need to fail both alloc_pages_bulk_array_node() and alloc_pages_node() together.

2) Failing a single vmalloc is not enough.
I would remind you that we want to emulate the reaction to a fatal signal.
However, I am afraid a dying task can execute quite a complex rollback procedure.
This rollback can call another vmalloc, and that one will fail
again on the fatal_signal_pending check.

To emulate this behavior in fault injection we need to disable all following
vmalloc calls of our victim, the pseudo-"dying" task.

I doubt both these goals can be reached with the current allocation fault
injection subsystem; I do not understand how to configure it accordingly.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-17  8:06           ` [PATCH mm] vmalloc: back off when the current task is OOM-killed Vasily Averin
  2021-09-19 23:31             ` Andrew Morton
@ 2021-09-22 12:27             ` Michal Hocko
  2021-09-23  6:49               ` Vasily Averin
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-22 12:27 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On Fri 17-09-21 11:06:49, Vasily Averin wrote:
> Huge vmalloc allocation on heavy loaded node can lead to a global
> memory shortage. A task called vmalloc can have the worst badness
> and be chosen by OOM-killer, however received fatal signal and
> oom victim mark does not interrupt allocation cycle. Vmalloc will
> continue allocating pages over and over again, exacerbating the crisis
> and consuming the memory freed up by another killed tasks.
> 
> This patch allows OOM-killer to break vmalloc cycle, makes OOM more
> effective and avoid host panic.
> 
> Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
> cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed"") because some vmalloc callers did not handle
> failures properly. The issues found back then were resolved; however, there
> may be other similar places.
> 
> Such failures may be acceptable for emergencies, such as OOM. On the other
> hand, we would like to detect them earlier. However, they are quite rare
> and will be hidden by OOM messages, so I'm afraid they will have quite a
> small chance of being noticed and reported.
> 
> To improve the detection of such places, this patch also interrupts the
> vmalloc allocation cycle for all fatal signals. The checks are hidden under
> the DEBUG_VM config option so as not to break unaware production kernels.

I really dislike this. We shouldn't have semantically different
behavior for a debugging kernel.

Is there any technical reason to not do fatal_signal_pending bailout
unconditionally? OOM victim based check will make it less likely and
therefore any potential bugs are just hidden more. So I think we should
really go with fatal_signal_pending check here.

> Vmalloc uses the new alloc_pages_bulk subsystem, so the newly added checks
> can affect other users of this subsystem.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/page_alloc.c | 5 +++++
>  mm/vmalloc.c    | 6 ++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b37435c274cf..133d52e507ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5288,6 +5288,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  			continue;
>  		}
>  
> +		if (tsk_is_oom_victim(current) ||
> +		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
> +		     fatal_signal_pending(current)))
> +			break;

This allocator interface is used in some real hot paths. It is also
meant to be fail fast interface (e.g. it only allocates from pcp
allocator) so it shouldn't bring any additional risk to memory depletion
under heavy memory pressure.

In other words I do not see any reason to bail out in this code path.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-22 12:27             ` Michal Hocko
@ 2021-09-23  6:49               ` Vasily Averin
  2021-09-24  7:55                 ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-23  6:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On 9/22/21 3:27 PM, Michal Hocko wrote:
> On Fri 17-09-21 11:06:49, Vasily Averin wrote:
>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>> memory shortage. A task that called vmalloc can have the worst badness
>> and be chosen by the OOM killer; however, neither the received fatal
>> signal nor the OOM-victim mark interrupts the allocation cycle. Vmalloc
>> will continue allocating pages over and over again, exacerbating the
>> crisis and consuming the memory freed up by other killed tasks.
>>
>> This patch allows the OOM killer to break the vmalloc cycle, makes OOM
>> more effective and avoids host panics.
>>
>> Unfortunately it is not 100% safe. A previous attempt to break the vmalloc
>> cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
>> the current task is killed"") because some vmalloc callers did not handle
>> failures properly. The issues found back then were resolved; however, there
>> may be other similar places.
>>
>> Such failures may be acceptable for emergencies, such as OOM. On the other
>> hand, we would like to detect them earlier. However, they are quite rare
>> and will be hidden by OOM messages, so I'm afraid they will have quite a
>> small chance of being noticed and reported.
>>
>> To improve the detection of such places, this patch also interrupts the
>> vmalloc allocation cycle for all fatal signals. The checks are hidden under
>> the DEBUG_VM config option so as not to break unaware production kernels.
> 
> I really dislike this. We shouldn't have semantically different
> behavior for a debugging kernel.

Yes, you're right, thank you.

> Is there any technical reason to not do fatal_signal_pending bailout
> unconditionally? OOM victim based check will make it less likely and
> therefore any potential bugs are just hidden more. So I think we should
> really go with fatal_signal_pending check here.

I agree: oom_victim == fatal_signal_pending.
I agree that vmalloc callers should expect and handle single vmalloc failures.
I think it is acceptable to enable the fatal_signal_pending check to quickly
detect this kind of issue.
However, the fatal_signal_pending check can cause serial vmalloc failures,
and I doubt that is acceptable.

A rollback after a failed vmalloc can issue new vmalloc calls that will fail
too; even if properly handled, such serial failures can cause trouble.

Hypothetically, a cancelled vmalloc called inside some filesystem's
transaction forces its rollback, which in turn can call its own vmalloc.
Any failure on this path can break the filesystem.
I doubt that is acceptable, especially for non-OOM fatal signals.
On the other hand, I cannot say that it is a 100% bug.

Another scenario:
as you know, a failed vmalloc calls pr_warn. The corresponding message may
be sent to a remote terminal or netconsole. I'm not sure about the execution
context; however, if this is done in task context it may call vmalloc either
in the terminal or in the network subsystem. Even when handled, such failures
are not fatal, but this behaviour is at least unexpected.

Should we perhaps interrupt the first vmalloc only?

>> Vmalloc uses the new alloc_pages_bulk subsystem, so the newly added checks
>> can affect other users of this subsystem.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/page_alloc.c | 5 +++++
>>  mm/vmalloc.c    | 6 ++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b37435c274cf..133d52e507ff 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5288,6 +5288,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>>  			continue;
>>  		}
>>  
>> +		if (tsk_is_oom_victim(current) ||
>> +		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
>> +		     fatal_signal_pending(current)))
>> +			break;
> 
> This allocator interface is used in some real hot paths. It is also
> meant to be fail fast interface (e.g. it only allocates from pcp
> allocator) so it shouldn't bring any additional risk to memory depletion
> under heavy memory pressure.
> 
> In other words I do not see any reason to bail out in this code path.

Thank you for the explanation, let's drop this check altogether.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-23  6:49               ` Vasily Averin
@ 2021-09-24  7:55                 ` Michal Hocko
  2021-09-27  9:36                   ` Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-24  7:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On Thu 23-09-21 09:49:57, Vasily Averin wrote:
[...]
> I agree that vmalloc callers should expect and handle single vmalloc failures.
> I think it is acceptable to enable the fatal_signal_pending check to quickly
> detect this kind of issue.
> However, the fatal_signal_pending check can cause serial vmalloc failures,
> and I doubt that is acceptable.
> 
> A rollback after a failed vmalloc can issue new vmalloc calls that will fail
> too; even if properly handled, such serial failures can cause trouble.

Could you be more specific? Also, how would this be any different from
similar failures for an OOM victim? Except that the latter is less likely,
so (as already mentioned) any potential bugs would just be lurking there
for a longer time.

> Hypothetically, a cancelled vmalloc called inside some filesystem's
> transaction forces its rollback, which in turn can call its own vmalloc.

Do you have any specific example?

> Any failure on this path can break the filesystem.
> I doubt that is acceptable, especially for non-OOM fatal signals.
> On the other hand, I cannot say that it is a 100% bug.
> 
> Another scenario:
> as you know, a failed vmalloc calls pr_warn. The corresponding message may
> be sent to a remote terminal or netconsole. I'm not sure about the execution
> context; however, if this is done in task context it may call vmalloc either
> in the terminal or in the network subsystem. Even when handled, such failures
> are not fatal, but this behaviour is at least unexpected.

I do not think we want to shape the vmalloc behavior based on
printk/console behavior.

> Should we perhaps interrupt the first vmalloc only?

This doesn't make much sense to me TBH. It doesn't address the very
problem you are describing in the changelog.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-24  7:55                 ` Michal Hocko
@ 2021-09-27  9:36                   ` Vasily Averin
  2021-09-27 11:08                     ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-09-27  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On 9/24/21 10:55 AM, Michal Hocko wrote:
> On Thu 23-09-21 09:49:57, Vasily Averin wrote:
> [...]
>> I agree that vmalloc callers should expect and handle single vmalloc failures.
>> I think it is acceptable to enable the fatal_signal_pending check to quickly
>> detect this kind of issue.
>> However, the fatal_signal_pending check can cause serial vmalloc failures,
>> and I doubt that is acceptable.
>>
>> A rollback after a failed vmalloc can issue new vmalloc calls that will fail
>> too; even if properly handled, such serial failures can cause trouble.
> 
> Could you be more specific? Also, how would this be any different from
> similar failures for an OOM victim? Except that the latter is less likely,
> so (as already mentioned) any potential bugs would just be lurking there
> for a longer time.
> 
>> Hypothetically, a cancelled vmalloc called inside some filesystem's
>> transaction forces its rollback, which in turn can call its own vmalloc.
> 
> Do you have any specific example?

No, it was a purely hypothetical assumption.
I was thinking about it over the weekend and decided that:
a) this kind of issue (i.e. a vmalloc call in a rollback after a failed
   vmalloc) is very unlikely;
b) if it still exists, it must have its own fallback with kmalloc(NOFAIL),
   or just accept/ignore such a failure, and it should not lead to critical
   failures. If this still happens, it is a bug; we should detect and fix
   it ASAP.

>> Should we perhaps interrupt the first vmalloc only?
> 
> This doesn't make much sense to me TBH. It doesn't address the very
> problem you are describing in the changelog.

Last question:
what do you think: should we perhaps, instead, detect such vmallocs
(called in a rollback after a failed vmalloc) and generate warnings,
to prevent this kind of problem in the future?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm] vmalloc: back off when the current task is OOM-killed
  2021-09-27  9:36                   ` Vasily Averin
@ 2021-09-27 11:08                     ` Michal Hocko
  2021-10-05 13:52                       ` [PATCH mm v2] " Vasily Averin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-27 11:08 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On Mon 27-09-21 12:36:15, Vasily Averin wrote:
> On 9/24/21 10:55 AM, Michal Hocko wrote:
> > On Thu 23-09-21 09:49:57, Vasily Averin wrote:
[...]
> >> Hypothetically, a cancelled vmalloc called inside some filesystem's
> >> transaction forces its rollback, which in turn can call its own vmalloc.
> > 
> > Do you have any specific example?
> 
> No, it was a purely hypothetical assumption.
> I was thinking about it over the weekend and decided that:
> a) this kind of issue (i.e. a vmalloc call in a rollback after a failed
>    vmalloc) is very unlikely;
> b) if it still exists, it must have its own fallback with kmalloc(NOFAIL),
>    or just accept/ignore such a failure, and it should not lead to critical
>    failures. If this still happens, it is a bug; we should detect and fix
>    it ASAP.

I would even argue that nobody should rely on vmalloc succeeding. The
purpose of the allocator is to allow larger allocations, and we do not
guarantee anything even for small requests.

> >> Should we perhaps interrupt the first vmalloc only?
> > 
> > This doesn't make much sense to me TBH. It doesn't address the very
> > problem you are describing in the changelog.
> 
> Last question:
> what do you think: should we perhaps, instead, detect such vmallocs
> (called in a rollback after a failed vmalloc) and generate warnings,
> to prevent this kind of problem in the future?

We do provide an allocation failure splat unless the request is
explicitly __GFP_NOWARN IIRC.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH memcg v3] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-09-16 12:55           ` Michal Hocko
@ 2021-10-05 13:52             ` Vasily Averin
  2021-10-05 14:55               ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Vasily Averin @ 2021-10-05 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This resembles the heuristic for the global OOM situation,
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level, so the memcg heuristic is more relaxed.

The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time.

It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.

In addition, this patch renames the should_force_charge() helper
to task_is_dying() because its use is no longer associated with forced
charging.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v3: no functional changes, just improved patch description
v2: switched to the patch version proposed by mhocko@
---
 mm/memcontrol.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..87e41c3cac10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH mm v2] vmalloc: back off when the current task is OOM-killed
  2021-09-27 11:08                     ` Michal Hocko
@ 2021-10-05 13:52                       ` Vasily Averin
  2021-10-05 14:00                         ` Vasily Averin
                                           ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Vasily Averin @ 2021-10-05 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

A huge vmalloc allocation on a heavily loaded node can lead to a global
memory shortage. A task that called vmalloc can have the worst badness
and be selected by the OOM killer; however, the received fatal signal
does not interrupt the allocation cycle. Vmalloc repeats page allocations
again and again, exacerbating the crisis and consuming the memory
freed up by other killed tasks.

After successful completion of the allocation procedure, the fatal
signal will be processed and the task will finally be destroyed.
However, that may not release the consumed memory, since the allocated
object may have a lifetime unrelated to the completed task.
In the worst case, this can lead to a host panic
due to "Out of memory and no killable processes..."

This patch allows the OOM killer to break the vmalloc cycle, makes OOM
more effective and avoids host panics. It does not check the OOM condition
directly; instead, it breaks the page allocation cycle when a fatal signal
has been received.

This may trigger some hidden problems, when a caller does not handle
vmalloc failures, or when a rollback after a failed vmalloc calls its own
vmallocs inside. However, all of these scenarios are incorrect:
vmalloc does not guarantee successful allocation, it has never supported
__GFP_NOFAIL, and therefore it either should not be used for any rollbacks
or should handle such errors correctly and not lead to critical failures.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v2: tsk_is_oom_victim() check replaced by fatal_signal_pending(current),
    removed check inside __alloc_pages_bulk(),
    according to feedback from mhocko@.
    Updated patch description.
---
 mm/vmalloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..71706f5447f0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2860,6 +2860,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		struct page *page;
 		int i;
 
+		if (fatal_signal_pending(current))
+			break;
+
 		page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm v2] vmalloc: back off when the current task is OOM-killed
  2021-10-05 13:52                       ` [PATCH mm v2] " Vasily Averin
@ 2021-10-05 14:00                         ` Vasily Averin
  2021-10-07 10:47                         ` Michal Hocko
  2021-10-07 19:55                         ` Andrew Morton
  2 siblings, 0 replies; 34+ messages in thread
From: Vasily Averin @ 2021-10-05 14:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On 10/5/21 4:52 PM, Vasily Averin wrote:
> A huge vmalloc allocation on a heavily loaded node can lead to a global
> memory shortage. A task that called vmalloc can have the worst badness
> and be selected by the OOM killer; however, the received fatal signal
> does not interrupt the allocation cycle. Vmalloc repeats page allocations
> again and again, exacerbating the crisis and consuming the memory
> freed up by other killed tasks.
> 
> After successful completion of the allocation procedure, the fatal
> signal will be processed and the task will finally be destroyed.
> However, that may not release the consumed memory, since the allocated
> object may have a lifetime unrelated to the completed task.
> In the worst case, this can lead to a host panic
> due to "Out of memory and no killable processes..."
> 
> This patch allows the OOM killer to break the vmalloc cycle, makes OOM
> more effective and avoids host panics. It does not check the OOM condition
> directly; instead, it breaks the page allocation cycle when a fatal signal
> has been received.
> 
> This may trigger some hidden problems, when a caller does not handle
> vmalloc failures, or when a rollback after a failed vmalloc calls its own
> vmallocs inside. However, all of these scenarios are incorrect:
> vmalloc does not guarantee successful allocation, it has never supported
> __GFP_NOFAIL, and therefore it either should not be used for any rollbacks
> or should handle such errors correctly and not lead to critical failures.

I briefly tested this patch together with
 "memcg v3: prohibit unconditional exceeding the limit of dying tasks"
 on top of v5.15-rc4.
I executed LTP on the host; all oom, cgroup and memcg tests finished
successfully. I then experimented with memcg-limited LXC containers
and did not notice any trouble on my test node.
Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH memcg v3] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-10-05 13:52             ` [PATCH memcg v3] " Vasily Averin
@ 2021-10-05 14:55               ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-10-05 14:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 05-10-21 16:52:31, Vasily Averin wrote:
> v3: no functional changes, just improved patch description

You haven't addressed my review feedback regarding the oom invocation.
Let me paste it here again:
: > @@ -1607,7 +1607,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
: >        * A few threads which were not waiting at mutex_lock_killable() can
: >        * fail to bail out. Therefore, check again after holding oom_lock.
: >        */
: > -     ret = should_force_charge() || out_of_memory(&oc);
: > +     ret = task_is_dying() || out_of_memory(&oc);
: 
: task_is_dying check will prevent the oom killer for dying tasks. There
: is an additional bail out at out_of_memory layer. These checks are now
: leading to a completely different behavior. Currently we simply use
: "unlimited" reserves and therefore we do not have to kill any task. Now
: the charge fails without using all reclaim measures. So I believe we
: should drop those checks for memcg oom paths. I have to think about this
: some more because I might be missing some other side effects.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm v2] vmalloc: back off when the current task is OOM-killed
  2021-10-05 13:52                       ` [PATCH mm v2] " Vasily Averin
  2021-10-05 14:00                         ` Vasily Averin
@ 2021-10-07 10:47                         ` Michal Hocko
  2021-10-07 19:55                         ` Andrew Morton
  2 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-10-07 10:47 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 05-10-21 16:52:40, Vasily Averin wrote:
> A huge vmalloc allocation on a heavily loaded node can lead to a global
> memory shortage. A task that called vmalloc can have the worst badness
> and be selected by the OOM killer; however, the received fatal signal
> does not interrupt the allocation cycle. Vmalloc repeats page allocations
> again and again, exacerbating the crisis and consuming the memory
> freed up by other killed tasks.
> 
> After successful completion of the allocation procedure, the fatal
> signal will be processed and the task will finally be destroyed.
> However, that may not release the consumed memory, since the allocated
> object may have a lifetime unrelated to the completed task.
> In the worst case, this can lead to a host panic
> due to "Out of memory and no killable processes..."
> 
> This patch allows the OOM killer to break the vmalloc cycle, makes OOM
> more effective and avoids host panics. It does not check the OOM condition
> directly; instead, it breaks the page allocation cycle when a fatal signal
> has been received.

This will also allow interrupting a user-space request which happens to
trigger a large vmalloc, hence the reason for going with
fatal_signal_pending rather than an OOM-specific condition.

> This may trigger some hidden problems, when a caller does not handle
> vmalloc failures, or when a rollback after a failed vmalloc calls its own
> vmallocs inside. However, all of these scenarios are incorrect:
> vmalloc does not guarantee successful allocation, it has never supported
> __GFP_NOFAIL, and therefore it either should not be used for any rollbacks
> or should handle such errors correctly and not lead to critical failures.

__GFP_NOFAIL semantic is explicitly not supported for vmalloc.

> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Michal Hocko <mhocko@suse.com>

I would keep it sitting in the linux-next for some time to sort out
potential fallouts and have them fixed before this one gets merged.

Thanks!
> ---
> v2: tsk_is_oom_victim() check replaced by fatal_signal_pending(current),
>     removed check inside __alloc_pages_bulk(),
>     according to feedback from mhocko@.
>     Updated patch description.
> ---
>  mm/vmalloc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..71706f5447f0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2860,6 +2860,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		struct page *page;
>  		int i;
>  
> +		if (fatal_signal_pending(current))
> +			break;
> +
>  		page = alloc_pages_node(nid, gfp, order);
>  		if (unlikely(!page))
>  			break;
> -- 
> 2.31.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH mm v2] vmalloc: back off when the current task is OOM-killed
  2021-10-05 13:52                       ` [PATCH mm v2] " Vasily Averin
  2021-10-05 14:00                         ` Vasily Averin
  2021-10-07 10:47                         ` Michal Hocko
@ 2021-10-07 19:55                         ` Andrew Morton
  2 siblings, 0 replies; 34+ messages in thread
From: Andrew Morton @ 2021-10-07 19:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
	cgroups, linux-mm, linux-kernel, kernel

On Tue, 5 Oct 2021 16:52:40 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:

> A huge vmalloc allocation on a heavily loaded node can lead to a global
> memory shortage. A task that called vmalloc can have the worst badness
> and be selected by the OOM killer; however, the received fatal signal
> does not interrupt the allocation cycle. Vmalloc repeats page allocations
> again and again, exacerbating the crisis and consuming the memory
> freed up by other killed tasks.
> 
> After successful completion of the allocation procedure, the fatal
> signal will be processed and the task will finally be destroyed.
> However, that may not release the consumed memory, since the allocated
> object may have a lifetime unrelated to the completed task.
> In the worst case, this can lead to a host panic
> due to "Out of memory and no killable processes..."
> 
> This patch allows the OOM killer to break the vmalloc cycle, makes OOM
> more effective and avoids host panics. It does not check the OOM condition
> directly; instead, it breaks the page allocation cycle when a fatal signal
> has been received.
> 
> This may trigger some hidden problems, when a caller does not handle
> vmalloc failures, or when a rollback after a failed vmalloc calls its own
> vmallocs inside. However, all of these scenarios are incorrect:
> vmalloc does not guarantee successful allocation, it has never supported
> __GFP_NOFAIL, and therefore it either should not be used for any rollbacks
> or should handle such errors correctly and not lead to critical failures.
> 

This needed a little rework due to
https://lkml.kernel.org/r/20210928121040.2547407-1-chenwandun@huawei.com.
Please check and retest sometime?

--- a/mm/vmalloc.c~vmalloc-back-off-when-the-current-task-is-oom-killed
+++ a/mm/vmalloc.c
@@ -2887,6 +2887,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 
 	page = NULL;
 	while (nr_allocated < nr_pages) {
+		if (fatal_signal_pending(current))
+			break;
+
 		if (nid == NUMA_NO_NODE)
 			page = alloc_pages(gfp, order);
 		else
_


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-10-07 19:55 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-10 12:39 [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-09-10 13:04 ` Tetsuo Handa
2021-09-10 13:20   ` Vasily Averin
2021-09-10 14:55     ` Michal Hocko
2021-09-13  8:29       ` Vasily Averin
2021-09-13  8:42         ` Michal Hocko
2021-09-17  8:06           ` [PATCH mm] vmalloc: back off when the current task is OOM-killed Vasily Averin
2021-09-19 23:31             ` Andrew Morton
2021-09-20  1:22               ` Tetsuo Handa
2021-09-20 10:59                 ` Vasily Averin
2021-09-21 18:55                   ` Andrew Morton
2021-09-22  6:18                     ` Vasily Averin
2021-09-22 12:27             ` Michal Hocko
2021-09-23  6:49               ` Vasily Averin
2021-09-24  7:55                 ` Michal Hocko
2021-09-27  9:36                   ` Vasily Averin
2021-09-27 11:08                     ` Michal Hocko
2021-10-05 13:52                       ` [PATCH mm v2] " Vasily Averin
2021-10-05 14:00                         ` Vasily Averin
2021-10-07 10:47                         ` Michal Hocko
2021-10-07 19:55                         ` Andrew Morton
2021-09-10 13:07 ` [PATCH memcg] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-09-13  7:51 ` Vasily Averin
2021-09-13  8:39   ` Michal Hocko
2021-09-13  9:37     ` Vasily Averin
2021-09-13 10:10       ` Michal Hocko
2021-09-13  8:53 ` Michal Hocko
2021-09-13 10:35   ` Vasily Averin
2021-09-13 10:55     ` Michal Hocko
2021-09-14 10:01       ` Vasily Averin
2021-09-14 10:10         ` [PATCH memcg v2] " Vasily Averin
2021-09-16 12:55           ` Michal Hocko
2021-10-05 13:52             ` [PATCH memcg v3] " Vasily Averin
2021-10-05 14:55               ` Michal Hocko
