* [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-22  7:05 ` Haifeng Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Haifeng Xu @ 2023-09-22  7:05 UTC (permalink / raw)
  To: mhocko; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm, Haifeng Xu

When an application in userland receives an oom notification from the
kernel and reads the oom_control file, it's confusing that under_oom
is 0 even though the oom killer hasn't finished. The reason is that
under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
move the clearing of under_oom to after oom handling completes. This
way, the value of under_oom won't mislead users.

Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..0b6ed63504ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1970,8 +1970,8 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	if (locked)
 		mem_cgroup_oom_notify(memcg);
 
-	mem_cgroup_unmark_under_oom(memcg);
 	ret = mem_cgroup_out_of_memory(memcg, mask, order);
+	mem_cgroup_unmark_under_oom(memcg);
 
 	if (locked)
 		mem_cgroup_oom_unlock(memcg);
-- 
2.25.1
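
For context, memory.oom_control is the cgroup v1 file the commit message
refers to. A sketch of reading it (the group name and counter values are
made-up examples; the oom_kill counter only exists on newer kernels):

  # cat /sys/fs/cgroup/memory/example/memory.oom_control
  oom_kill_disable 0
  under_oom 1
  oom_kill 3

With this patch, under_oom keeps reading 1 until mem_cgroup_out_of_memory()
returns, instead of flipping back to 0 just before the kill starts.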



* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-22 23:17   ` Roman Gushchin
  0 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2023-09-22 23:17 UTC (permalink / raw)
  To: Haifeng Xu; +Cc: mhocko, hannes, shakeelb, cgroups, linux-mm

On Fri, Sep 22, 2023 at 07:05:28AM +0000, Haifeng Xu wrote:
> When an application in userland receives an oom notification from the
> kernel and reads the oom_control file, it's confusing that under_oom
> is 0 even though the oom killer hasn't finished. The reason is that
> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
> move the clearing of under_oom to after oom handling completes. This
> way, the value of under_oom won't mislead users.
> 
> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>

Makes sense to me.

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!

> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8ca4bdcb03c..0b6ed63504ca 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1970,8 +1970,8 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	if (locked)
>  		mem_cgroup_oom_notify(memcg);
>  
> -	mem_cgroup_unmark_under_oom(memcg);
>  	ret = mem_cgroup_out_of_memory(memcg, mask, order);
> +	mem_cgroup_unmark_under_oom(memcg);
>  
>  	if (locked)
>  		mem_cgroup_oom_unlock(memcg);
> -- 
> 2.25.1
> 


* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-23  8:05     ` Haifeng Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Haifeng Xu @ 2023-09-23  8:05 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: mhocko, hannes, shakeelb, cgroups, linux-mm



On 2023/9/23 07:17, Roman Gushchin wrote:
> On Fri, Sep 22, 2023 at 07:05:28AM +0000, Haifeng Xu wrote:
>> When an application in userland receives an oom notification from the
>> kernel and reads the oom_control file, it's confusing that under_oom
>> is 0 even though the oom killer hasn't finished. The reason is that
>> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
>> move the clearing of under_oom to after oom handling completes. This
>> way, the value of under_oom won't mislead users.
>>
>> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> 
> Makes sense to me.
> 
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> 
> Thanks!

OK, thanks. But I forgot to cc the mailing list and akpm. I'll resend a new mail later.

> 
>> ---
>>  mm/memcontrol.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e8ca4bdcb03c..0b6ed63504ca 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1970,8 +1970,8 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>>  	if (locked)
>>  		mem_cgroup_oom_notify(memcg);
>>  
>> -	mem_cgroup_unmark_under_oom(memcg);
>>  	ret = mem_cgroup_out_of_memory(memcg, mask, order);
>> +	mem_cgroup_unmark_under_oom(memcg);
>>  
>>  	if (locked)
>>  		mem_cgroup_oom_unlock(memcg);
>> -- 
>> 2.25.1
>>


* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-25  7:57     ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2023-09-25  7:57 UTC (permalink / raw)
  To: Haifeng Xu; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm

On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
> When an application in userland receives an oom notification from the
> kernel and reads the oom_control file, it's confusing that under_oom
> is 0 even though the oom killer hasn't finished. The reason is that
> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
> move the clearing of under_oom to after oom handling completes. This
> way, the value of under_oom won't mislead users.

I do not really remember why we are doing it this way but trying to track
this down shows that we have been doing that since fb2a6fc56be6 ("mm:
memcg: rework and document OOM waiting and wakeup"). So this is an
established behavior for 10 years now. Do we really need to change it
now? The interface is legacy and hopefully no new workloads are
emerging.

I agree that the placement is surprising but I would rather not change
that unless there is a very good reason for that. Do you have any actual
workload which depends on the ordering? And if so, how do you deal with
the timing when the consumer of the notification only gets woken up after
mem_cgroup_out_of_memory completes?

> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8ca4bdcb03c..0b6ed63504ca 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1970,8 +1970,8 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	if (locked)
>  		mem_cgroup_oom_notify(memcg);
>  
> -	mem_cgroup_unmark_under_oom(memcg);
>  	ret = mem_cgroup_out_of_memory(memcg, mask, order);
> +	mem_cgroup_unmark_under_oom(memcg);
>  
>  	if (locked)
>  		mem_cgroup_oom_unlock(memcg);
> -- 
> 2.25.1

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-25  9:03         ` Haifeng Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Haifeng Xu @ 2023-09-25  9:03 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm



On 2023/9/25 15:57, Michal Hocko wrote:
> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
>> When an application in userland receives an oom notification from the
>> kernel and reads the oom_control file, it's confusing that under_oom
>> is 0 even though the oom killer hasn't finished. The reason is that
>> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
>> move the clearing of under_oom to after oom handling completes. This
>> way, the value of under_oom won't mislead users.
> 
> I do not really remember why we are doing it this way but trying to track
> this down shows that we have been doing that since fb2a6fc56be6 ("mm:
> memcg: rework and document OOM waiting and wakeup"). So this is an
> established behavior for 10 years now. Do we really need to change it
> now? The interface is legacy and hopefully no new workloads are
> emerging.
> 
> I agree that the placement is surprising but I would rather not change
> that unless there is a very good reason for that. Do you have any actual
> workload which depends on the ordering? And if so, how do you deal with
> the timing when the consumer of the notification only gets woken up after
> mem_cgroup_out_of_memory completes?

Yes. When the oom event is triggered, we check under_oom every 10 seconds. If it
is cleared, we then create a new process with a smaller memory allocation to avoid oom again.
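
A minimal userland sketch of that scheme (not something from this thread:
error handling is omitted and the cgroup path is a made-up example), using
the eventfd-based oom notification documented for the v1 memory controller:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/eventfd.h>

#define MEMCG "/sys/fs/cgroup/memory/example"	/* hypothetical group */

/* Parse the under_oom field out of memory.oom_control. */
static int read_under_oom(void)
{
	char buf[256];
	char *p;
	ssize_t n;
	int fd = open(MEMCG "/memory.oom_control", O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	buf[n] = '\0';
	p = strstr(buf, "under_oom ");
	return p ? atoi(p + strlen("under_oom ")) : -1;
}

int main(void)
{
	char line[64];
	uint64_t events;
	int efd = eventfd(0, 0);
	int ocfd = open(MEMCG "/memory.oom_control", O_RDONLY);
	int ecfd = open(MEMCG "/cgroup.event_control", O_WRONLY);

	/* writing "<event_fd> <oom_control_fd>" arms the oom notification */
	snprintf(line, sizeof(line), "%d %d", efd, ocfd);
	write(ecfd, line, strlen(line));

	read(efd, &events, sizeof(events));	/* blocks until an oom event fires */

	while (read_under_oom() == 1)		/* the 10s polling described above */
		sleep(10);

	/* under_oom is clear: relaunch the workload with a smaller allocation */
	return 0;
}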

> 
>> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
>> ---
>>  mm/memcontrol.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e8ca4bdcb03c..0b6ed63504ca 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1970,8 +1970,8 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>>  	if (locked)
>>  		mem_cgroup_oom_notify(memcg);
>>  
>> -	mem_cgroup_unmark_under_oom(memcg);
>>  	ret = mem_cgroup_out_of_memory(memcg, mask, order);
>> +	mem_cgroup_unmark_under_oom(memcg);
>>  
>>  	if (locked)
>>  		mem_cgroup_oom_unlock(memcg);
>> -- 
>> 2.25.1
> 

* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-25 11:38             ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2023-09-25 11:38 UTC (permalink / raw)
  To: Haifeng Xu; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm

On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
> 
> 
> On 2023/9/25 15:57, Michal Hocko wrote:
> > On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
> >> When an application in userland receives an oom notification from the
> >> kernel and reads the oom_control file, it's confusing that under_oom
> >> is 0 even though the oom killer hasn't finished. The reason is that
> >> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
> >> move the clearing of under_oom to after oom handling completes. This
> >> way, the value of under_oom won't mislead users.
> > 
> > I do not really remember why we are doing it this way but trying to track
> > this down shows that we have been doing that since fb2a6fc56be6 ("mm:
> > memcg: rework and document OOM waiting and wakeup"). So this is an
> > established behavior for 10 years now. Do we really need to change it
> > now? The interface is legacy and hopefully no new workloads are
> > emerging.
> > 
> > I agree that the placement is surprising but I would rather not change
> > that unless there is a very good reason for that. Do you have any actual
> > workload which depends on the ordering? And if so, how do you deal with
> > the timing when the consumer of the notification only gets woken up after
> > mem_cgroup_out_of_memory completes?
> 
> Yes. When the oom event is triggered, we check under_oom every 10 seconds. If it
> is cleared, we then create a new process with a smaller memory allocation to avoid oom again.

OK, I do understand what you mean and I could have made myself
clearer previously. Even if the state is cleared _after_
mem_cgroup_out_of_memory, you won't get what you need, I am
afraid. The memcg stays under OOM until memory is freed (uncharged)
from that memcg. mem_cgroup_out_of_memory itself doesn't really free
any memory on its own. It relies on the task to wake up and die or the
oom_reaper to do the work on its behalf. All of that is time dependent.
under_oom would have to be reimplemented to be cleared when memory is
uncharged to meet your demands, something that has never really been
the semantic.

Btw. is this something new that you are developing on top of v1? And if
so, why don't you use v2?

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-25 12:28                 ` Haifeng Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Haifeng Xu @ 2023-09-25 12:28 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm



On 2023/9/25 19:38, Michal Hocko wrote:
> On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
>>
>>
>> On 2023/9/25 15:57, Michal Hocko wrote:
>>> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
>>>> When an application in userland receives an oom notification from the
>>>> kernel and reads the oom_control file, it's confusing that under_oom
>>>> is 0 even though the oom killer hasn't finished. The reason is that
>>>> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
>>>> move the clearing of under_oom to after oom handling completes. This
>>>> way, the value of under_oom won't mislead users.
>>>
>>> I do not really remember why we are doing it this way but trying to track
>>> this down shows that we have been doing that since fb2a6fc56be6 ("mm:
>>> memcg: rework and document OOM waiting and wakeup"). So this is an
>>> established behavior for 10 years now. Do we really need to change it
>>> now? The interface is legacy and hopefully no new workloads are
>>> emerging.
>>>
>>> I agree that the placement is surprising but I would rather not change
>>> that unless there is a very good reason for that. Do you have any actual
>>> workload which depends on the ordering? And if so, how do you deal with
>>> the timing when the consumer of the notification only gets woken up after
>>> mem_cgroup_out_of_memory completes?
>>
>> Yes. When the oom event is triggered, we check under_oom every 10 seconds. If it
>> is cleared, we then create a new process with a smaller memory allocation to avoid oom again.
> 
> OK, I do understand what you mean and I could have made myself
> clearer previously. Even if the state is cleared _after_
> mem_cgroup_out_of_memory, you won't get what you need, I am
> afraid. The memcg stays under OOM until memory is freed (uncharged)
> from that memcg. mem_cgroup_out_of_memory itself doesn't really free
> any memory on its own. It relies on the task to wake up and die or the
> oom_reaper to do the work on its behalf. All of that is time dependent.
> under_oom would have to be reimplemented to be cleared when memory is
> uncharged to meet your demands, something that has never really been
> the semantic.
> 

Yes, but at least before we create the new process, there is a better chance that some memory has been freed.

> Btw. is this something new that you are developing on top of v1? And if
> so, why don't you use v2?
> 

Yes, v2 doesn't have the "cgroup.event_control" file.

* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-25 12:37                     ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2023-09-25 12:37 UTC (permalink / raw)
  To: Haifeng Xu; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm

On Mon 25-09-23 20:28:02, Haifeng Xu wrote:
> 
> 
> On 2023/9/25 19:38, Michal Hocko wrote:
> > On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
> >>
> >>
> >> On 2023/9/25 15:57, Michal Hocko wrote:
> >>> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
> >>>> When an application in userland receives an oom notification from the
> >>>> kernel and reads the oom_control file, it's confusing that under_oom
> >>>> is 0 even though the oom killer hasn't finished. The reason is that
> >>>> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
> >>>> move the clearing of under_oom to after oom handling completes. This
> >>>> way, the value of under_oom won't mislead users.
> >>>
> >>> I do not really remember why we are doing it this way but trying to track
> >>> this down shows that we have been doing that since fb2a6fc56be6 ("mm:
> >>> memcg: rework and document OOM waiting and wakeup"). So this is an
> >>> established behavior for 10 years now. Do we really need to change it
> >>> now? The interface is legacy and hopefully no new workloads are
> >>> emerging.
> >>>
> >>> I agree that the placement is surprising but I would rather not change
> >>> that unless there is a very good reason for that. Do you have any actual
> >>> workload which depends on the ordering? And if so, how do you deal with
> >>> the timing when the consumer of the notification only gets woken up after
> >>> mem_cgroup_out_of_memory completes?
> >>
> >> Yes. When the oom event is triggered, we check under_oom every 10 seconds. If it
> >> is cleared, we then create a new process with a smaller memory allocation to avoid oom again.
> > 
> > OK, I do understand what you mean and I could have made myself
> > clearer previously. Even if the state is cleared _after_
> > mem_cgroup_out_of_memory, you won't get what you need, I am
> > afraid. The memcg stays under OOM until memory is freed (uncharged)
> > from that memcg. mem_cgroup_out_of_memory itself doesn't really free
> > any memory on its own. It relies on the task to wake up and die or the
> > oom_reaper to do the work on its behalf. All of that is time dependent.
> > under_oom would have to be reimplemented to be cleared when memory is
> > uncharged to meet your demands, something that has never really been
> > the semantic.
> > 
> 
> Yes, but at least before we create the new process, there is a better chance that some memory has been freed.

The time window we are talking about is the call of
mem_cgroup_out_of_memory which, depending on the number of evaluated
processes, could be a very short time. So what kind of practical
difference does this have on your workload? Is this measurable in any
way?

> > Btw. is this something new that you are developing on top of v1? And if
> > so, why don't you use v2?
> > 
> 
> Yes, v2 doesn't have the "cgroup.event_control" file.

Yes, it doesn't. But why is it necessary? Relying on v1 just for this is
far from ideal as v1 is deprecated and mostly frozen. Why do you need to
rely on the oom notifications (or oom behavior in general) in the first
place? Could you share more about your workload and your requirements?

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
@ 2023-09-26 14:39                         ` Haifeng Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Haifeng Xu @ 2023-09-26 14:39 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm



On 2023/9/25 20:37, Michal Hocko wrote:
> On Mon 25-09-23 20:28:02, Haifeng Xu wrote:
>>
>>
>> On 2023/9/25 19:38, Michal Hocko wrote:
>>> On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
>>>>
>>>>
>>>> On 2023/9/25 15:57, Michal Hocko wrote:
>>>>> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
>>>>>> When an application in userland receives an oom notification from the
>>>>>> kernel and reads the oom_control file, it's confusing that under_oom
>>>>>> is 0 even though the oom killer hasn't finished. The reason is that
>>>>>> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
>>>>>> move the clearing of under_oom to after oom handling completes. This
>>>>>> way, the value of under_oom won't mislead users.
>>>>>
>>>>> I do not really remember why we are doing it this way but trying to track
>>>>> this down shows that we have been doing that since fb2a6fc56be6 ("mm:
>>>>> memcg: rework and document OOM waiting and wakeup"). So this is an
>>>>> established behavior for 10 years now. Do we really need to change it
>>>>> now? The interface is legacy and hopefully no new workloads are
>>>>> emerging.
>>>>>
>>>>> I agree that the placement is surprising but I would rather not change
>>>>> that unless there is a very good reason for that. Do you have any actual
>>>>> workload which depends on the ordering? And if so, how do you deal with
>>>>> the timing when the consumer of the notification only gets woken up after
>>>>> mem_cgroup_out_of_memory completes?
>>>>
>>>> Yes. When the oom event is triggered, we check under_oom every 10 seconds. If it
>>>> is cleared, we then create a new process with a smaller memory allocation to avoid oom again.
>>>
>>> OK, I do understand what you mean and I could have made myself
>>> clearer previously. Even if the state is cleared _after_
>>> mem_cgroup_out_of_memory, you won't get what you need, I am
>>> afraid. The memcg stays under OOM until memory is freed (uncharged)
>>> from that memcg. mem_cgroup_out_of_memory itself doesn't really free
>>> any memory on its own. It relies on the task to wake up and die or the
>>> oom_reaper to do the work on its behalf. All of that is time dependent.
>>> under_oom would have to be reimplemented to be cleared when memory is
>>> uncharged to meet your demands, something that has never really been
>>> the semantic.
>>>
>>
>> Yes, but at least before we create the new process, there is a better chance that some memory has been freed.
> 
> The time window we are talking about is the call of
> mem_cgroup_out_of_memory which, depending on the number of evaluated
> processes, could be a very short time. So what kind of practical
> difference does this have on your workload? Is this measurable in any
> way?

The oom events in this group seem fewer than before.

> 
>>> Btw. is this something new that you are developing on top of v1? And if
>>> so, why don't you use v2?
>>>
>>
>> Yes, v2 doesn't have the "cgroup.event_control" file.
> 
> Yes, it doesn't. But why is it necessary? Relying on v1 just for this is
> far from ideal as v1 is deprecated and mostly frozen. Why do you need to
> rely on the oom notifications (or oom behavior in general) in the first
> place? Could you share more about your workload and your requirements?
> 

For example, we want to run processes in the group, but the parameters related to
memory allocation are hard to decide, so we use the notifications to inform us that we
need to adjust the parameters automatically, rather than creating the new processes
manually.

* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
  2023-09-26 14:39                         ` Haifeng Xu
  (?)
@ 2023-09-27 13:36                         ` Michal Hocko
  2023-09-28  3:03                           ` Haifeng Xu
  -1 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-09-27 13:36 UTC (permalink / raw)
  To: Haifeng Xu; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm

On Tue 26-09-23 22:39:11, Haifeng Xu wrote:
> 
> 
> On 2023/9/25 20:37, Michal Hocko wrote:
> > On Mon 25-09-23 20:28:02, Haifeng Xu wrote:
> >>
> >>
> >> On 2023/9/25 19:38, Michal Hocko wrote:
> >>> On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
> >>>>
> >>>>
> >>>> On 2023/9/25 15:57, Michal Hocko wrote:
> >>>>> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
> >>>>>> When an application in userland receives an oom notification from the
> >>>>>> kernel and reads the oom_control file, it's confusing that under_oom
> >>>>>> is 0 even though the oom killer hasn't finished. The reason is that
> >>>>>> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
> >>>>>> move the clearing of under_oom to after oom handling completes. This
> >>>>>> way, the value of under_oom won't mislead users.
> >>>>>
> >>>>> I do not really remember why we are doing it this way but trying to track
> >>>>> this down shows that we have been doing that since fb2a6fc56be6 ("mm:
> >>>>> memcg: rework and document OOM waiting and wakeup"). So this is an
> >>>>> established behavior for 10 years now. Do we really need to change it
> >>>>> now? The interface is legacy and hopefully no new workloads are
> >>>>> emerging.
> >>>>>
> >>>>> I agree that the placement is surprising but I would rather not change
> >>>>> that unless there is a very good reason for that. Do you have any actual
> >>>>> workload which depends on the ordering? And if so, how do you deal with
> >>>>> the timing when the consumer of the notification only gets woken up after
> >>>>> mem_cgroup_out_of_memory completes?
> >>>>
> >>>> Yes. When the oom event is triggered, we check under_oom every 10 seconds. If it
> >>>> is cleared, we then create a new process with a smaller memory allocation to avoid oom again.
> >>>
> >>> OK, I do understand what you mean and I could have made myself
> >>> clearer previously. Even if the state is cleared _after_
> >>> mem_cgroup_out_of_memory, you won't get what you need, I am
> >>> afraid. The memcg stays under OOM until memory is freed (uncharged)
> >>> from that memcg. mem_cgroup_out_of_memory itself doesn't really free
> >>> any memory on its own. It relies on the task to wake up and die or the
> >>> oom_reaper to do the work on its behalf. All of that is time dependent.
> >>> under_oom would have to be reimplemented to be cleared when memory is
> >>> uncharged to meet your demands, something that has never really been
> >>> the semantic.
> >>>
> >>
> >> Yes, but at least before we create the new process, there is a better chance that some memory has been freed.
> > 
> > The time window we are talking about is the call of
> > mem_cgroup_out_of_memory which, depending on the number of evaluated
> > processes, could be a very short time. So what kind of practical
> > difference does this have on your workload? Is this measurable in any
> > way?
> 
> The oom events in this group seem fewer than before.

Let me see if I follow. You are launching new workloads after oom
happens, as soon as under_oom becomes 0. With the patch applied you see
fewer oom invocations, which implies that fewer re-launchings hit the
still-under-oom situation? I would also expect that those are compared
over the same time period. Do you have any actual numbers to present?
Are they statistically representative?

I really have to say that I am skeptical about the presented usecase.
Optimizing over oom events seems just like a very wrong way to scale the
workload. The timing of oom handling is subject to change at any time and
what you are optimizing for might change.

That being said, I do not see any obvious problem with the patch. IMO we
should rather not apply it because it slightly changes a long-term
behavior for something that is in a legacy mode now. But I will not Nack
it either as it is just a trivial thing. I just do not like the idea of
changing the timing of under_oom clearing just to fine-tune
some workloads.
 
> >>> Btw. is this something new that you are developing on top of v1? And if
> >>> so, why don't you use v2?
> >>>
> >>
> >> Yes, v2 doesn't have the "cgroup.event_control" file.
> > 
> > Yes, it doesn't. But why is it necessary? Relying on v1 just for this is
> > far from ideal as v1 is deprecated and mostly frozen. Why do you need to
> > rely on the oom notifications (or oom behavior in general) in the first
> > place? Could you share more about your workload and your requirements?
> 
> For example, we want to run processes in the group, but the parameters related to
> memory allocation are hard to decide, so we use the notifications to inform us that we
> need to adjust the parameters automatically, rather than creating the new processes
> manually.
> manually.

I do understand that but OOM is just way too late to tune anything
upon. Cgroup v2 has a notion of a high limit which can throttle memory
allocations way before the hard limit is hit, and this along with PSI
metrics could give you much better insight into the memory pressure
in a memcg.
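
To make that concrete, a sketch of the v2 interface (group name and numbers
are made-up examples): writing a limit such as "512M" into the group's
memory.high file enables throttling well before memory.max is hit, and the
group's memory.pressure file exposes the PSI data in the form:

  some avg10=0.34 avg60=0.21 avg300=0.05 total=1234567
  full avg10=0.00 avg60=0.01 avg300=0.00 total=45678

The avg fields give the percentage of time, over the trailing 10/60/300
seconds, that some task (or, for "full", every non-idle task) in the group
was stalled on memory.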

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
  2023-09-27 13:36                         ` Michal Hocko
@ 2023-09-28  3:03                           ` Haifeng Xu
  2023-10-03  7:50                             ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Haifeng Xu @ 2023-09-28  3:03 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm



On 2023/9/27 21:36, Michal Hocko wrote:
> On Tue 26-09-23 22:39:11, Haifeng Xu wrote:
>>
>>
>> On 2023/9/25 20:37, Michal Hocko wrote:
>>> On Mon 25-09-23 20:28:02, Haifeng Xu wrote:
>>>>
>>>>
>>>> On 2023/9/25 19:38, Michal Hocko wrote:
>>>>> On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 2023/9/25 15:57, Michal Hocko wrote:
>>>>>>> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
>>>>>>>> When an application in userland receives an oom notification from the
>>>>>>>> kernel and reads the oom_control file, it's confusing that under_oom
>>>>>>>> is 0 even though the oom killer hasn't finished. The reason is that
>>>>>>>> under_oom is cleared before invoking mem_cgroup_out_of_memory(), so
>>>>>>>> move the clearing of under_oom to after oom handling completes. This
>>>>>>>> way, the value of under_oom won't mislead users.
>>>>>>>
>>>>>>> I do not really remember why we are doing it this way but trying to track
>>>>>>> this down shows that we have been doing that since fb2a6fc56be6 ("mm:
>>>>>>> memcg: rework and document OOM waiting and wakeup"). So this is an
>>>>>>> established behavior for 10 years now. Do we really need to change it
>>>>>>> now? The interface is legacy and hopefully no new workloads are
>>>>>>> emerging.
>>>>>>>
>>>>>>> I agree that the placement is surprising but I would rather not change
>>>>>>> that unless there is a very good reason for that. Do you have any actual
>>>>>>> workload which depends on the ordering? And if so, how do you deal with
>>>>>>> the timing when the consumer of the notification only gets woken up after
>>>>>>> mem_cgroup_out_of_memory completes?
>>>>>>
>>>>>> Yes. When the oom event is triggered, we check under_oom every 10 seconds. If it
>>>>>> is cleared, we then create a new process with a smaller memory allocation to avoid oom again.
>>>>>
>>>>> OK, I do understand what you mean and I could have made myself
>>>>> clearer previously. Even if the state is cleared _after_
>>>>> mem_cgroup_out_of_memory, you won't get what you need, I am
>>>>> afraid. The memcg stays under OOM until memory is freed (uncharged)
>>>>> from that memcg. mem_cgroup_out_of_memory itself doesn't really free
>>>>> any memory on its own. It relies on the task to wake up and die or the
>>>>> oom_reaper to do the work on its behalf. All of that is time dependent.
>>>>> under_oom would have to be reimplemented to be cleared when memory is
>>>>> uncharged to meet your demands, something that has never really been
>>>>> the semantic.
>>>>>
>>>>
>>>> Yes, but at least before we create the new process, there is a better chance that some memory has been freed.
>>>
>>> The time window we are talking about is the call of
>>> mem_cgroup_out_of_memory which, depending on the number of evaluated
>>> processes, could be a very short time. So what kind of practical
>>> difference does this have on your workload? Is this measurable in any
>>> way?
>>
>> The oom events in this group seem fewer than before.
> 
> Let me see if I follow. You are launching new workloads after oom
> happens, as soon as under_oom becomes 0. With the patch applied you see
> fewer oom invocations, which implies that fewer re-launchings hit the
> still-under-oom situation? I would also expect that those are compared
> over the same time period. Do you have any actual numbers to present?
> Are they statistically representative?
> 
> I really have to say that I am skeptical about the presented usecase.
> Optimizing over oom events seems just like a very wrong way to scale the
> workload. The timing of oom handling is subject to change at any time and
> what you are optimizing for might change.

I think the improvement may be because of that: if we see under_oom is 1, we'll
sleep for 10s and check again. The sleep time is enough to complete oom handling,
so the time window is much larger than the time spent in mem_cgroup_out_of_memory().

> 
> That being said, I do not see any obvious problem with the patch. IMO we
> should rather not apply it because it slightly changes a long-term
> behavior for something that is in a legacy mode now. But I will not Nack
> it either as it is just a trivial thing. I just do not like the idea of
> changing the timing of under_oom clearing just to fine-tune
> some workloads.
>  
>>>>> Btw. is this something new that you are developing on top of v1? And if
>>>>> so, why don't you use v2?
>>>>>
>>>>
>>>> Yes, v2 doesn't have the "cgroup.event_control" file.
>>>
>>> Yes, it doesn't. But why is it necessary? Relying on v1 just for this is
>>> far from ideal as v1 is deprecated and mostly frozen. Why do you need to
>>> rely on the oom notifications (or oom behavior in general) in the first
>>> place? Could you share more about your workload and your requirements?
>>
>> For example, we want to run processes in the group, but the parameters related to
>> memory allocation are hard to decide, so we use the notifications to inform us that we
>> need to adjust the parameters automatically, rather than creating the new processes
>> manually.
> 
> I do understand that but OOM is just way too late to tune anything
> upon. Cgroup v2 has a notion of a high limit which can throttle memory
> allocations way before the hard limit is hit, and this along with PSI
> metrics could give you much better insight into the memory pressure
> in a memcg.
> 

Thank you for your suggestion. We will try to use memory.high instead.


* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
  2023-09-28  3:03                           ` Haifeng Xu
@ 2023-10-03  7:50                             ` Michal Hocko
  2023-10-11  1:59                               ` Haifeng Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-10-03  7:50 UTC (permalink / raw)
  To: Haifeng Xu; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm

On Thu 28-09-23 11:03:23, Haifeng Xu wrote:
[...]
> >> For example, we want to run processes in the group, but the parameters related to
> >> memory allocation are hard to decide, so we use the notifications to inform us that we
> >> need to adjust the parameters automatically, rather than creating the new processes
> >> manually.
> > 
> > I do understand that but OOM is just way too late to tune anything
> > upon. Cgroup v2 has a notion of a high limit which can throttle memory
> > allocations way before the hard limit is hit, and this along with PSI
> > metrics could give you much better insight into the memory pressure
> > in a memcg.
> > 
> 
> Thank you for your suggestion. We will try to use memory.high instead.

OK, is the patch still required? As I've said I am not strongly opposed,
it is just that the justification is rather weak.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
  2023-10-03  7:50                             ` Michal Hocko
@ 2023-10-11  1:59                               ` Haifeng Xu
  2023-10-25 21:48                                 ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Haifeng Xu @ 2023-10-11  1:59 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, roman.gushchin, shakeelb, cgroups, linux-mm



On 2023/10/3 15:50, Michal Hocko wrote:
> On Thu 28-09-23 11:03:23, Haifeng Xu wrote:
> [...]
>>>> For example, we want to run processes in the group, but the parameters related to
>>>> memory allocation are hard to decide, so we use the notifications to inform us that we
>>>> need to adjust the parameters automatically, rather than creating the new processes
>>>> manually.
>>>
>>> I do understand that but OOM is just way too late to tune anything
>>> upon. Cgroup v2 has a notion of a high limit which can throttle memory
>>> allocations way before the hard limit is hit, and this along with PSI
>>> metrics could give you much better insight into the memory pressure
>>> in a memcg.
>>>
>>
>> Thank you for your suggestion. We will try to use memory.high instead.
> 
> OK, is the patch still required? As I've said I am not strongly opposed,
> it is just that the justification is rather weak.

Yes.


* Re: [PATCH 1/2] memcg, oom: unmark under_oom after the oom killer is done
  2023-10-11  1:59                               ` Haifeng Xu
@ 2023-10-25 21:48                                 ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2023-10-25 21:48 UTC (permalink / raw)
  To: Haifeng Xu
  Cc: Michal Hocko, hannes, roman.gushchin, shakeelb, cgroups, linux-mm

On Wed, 11 Oct 2023 09:59:25 +0800 Haifeng Xu <haifeng.xu@shopee.com> wrote:

> 
> 
> On 2023/10/3 15:50, Michal Hocko wrote:
> > On Thu 28-09-23 11:03:23, Haifeng Xu wrote:
> > [...]
> >>>> For example, we want to run processes in the group, but the parameters related to
> >>>> memory allocation are hard to decide, so we use the notifications to inform us that we
> >>>> need to adjust the parameters automatically, rather than creating the new processes
> >>>> manually.
> >>>
> >>> I do understand that but OOM is just way too late to tune anything
> >>> upon. Cgroup v2 has a notion of a high limit which can throttle memory
> >>> allocations way before the hard limit is hit, and this along with PSI
> >>> metrics could give you much better insight into the memory pressure
> >>> in a memcg.
> >>>
> >>
> >> Thank you for your suggestion. We will try to use memory.high instead.
> > 
> > OK, is the patch still required? As I've said I am not strongly opposed,
> > it is just that the justification is rather weak.
> 
> Yes.

I'm confused.  You (Haifeng Xu) are looking at using memory.high for
your requirement, yet you believe that this patch is still required? 
This seems contradictory.

Oh well.  I think I'll drop this patch for now.  If you believe that
kernel changes are still required, please propose something for
6.7-rcX.

