* [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
@ 2012-10-16 10:12 ` Sha Zhengju
  0 siblings, 0 replies; 43+ messages in thread
From: Sha Zhengju @ 2012-10-16 10:12 UTC (permalink / raw)
  To: linux-mm, cgroups, kamezawa.hiroyu, akpm, mhocko
  Cc: linux-kernel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

The oom_kill_allocating_task sysctl enables or disables killing the
OOM-triggering task in out-of-memory situations, but it only takes effect for
system-wide OOM. It is also a useful hint for memcg, so take it into
consideration when an OOM happens in a memcg. Other sysctls such as
panic_on_oom are already memcg-aware.


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/memcontrol.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4e9b18..c329940 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1486,6 +1486,15 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+	if (sysctl_oom_kill_allocating_task && current->mm &&
+	    !oom_unkillable_task(current, memcg, NULL) &&
+	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+		get_task_struct(current);
+		oom_kill_process(current, gfp_mask, order, 0, totalpages, memcg, NULL,
+				 "Memory cgroup out of memory (oom_kill_allocating_task)");
+		return;
+	}
+
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct cgroup *cgroup = iter->css.cgroup;
 		struct cgroup_iter it;
-- 
1.7.6.1
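
Editor's sketch: the condition the patch adds can be modeled in user space
as a single predicate. This is only an illustration, not the kernel code.
struct model_task and kill_allocating_task() are hypothetical names, and the
kernel checks (current->mm, oom_unkillable_task(), oom_score_adj) are reduced
to plain fields here.

```c
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* same value the kernel uses */

/* Hypothetical user-space stand-in for the fields the patch inspects. */
struct model_task {
	bool has_mm;		/* models current->mm != NULL */
	bool unkillable;	/* models oom_unkillable_task() returning true */
	int oom_score_adj;	/* models current->signal->oom_score_adj */
};

/*
 * Returns true when the allocating task itself should be killed,
 * false when the caller should fall back to scanning the memcg.
 */
static bool kill_allocating_task(bool sysctl_enabled,
				 const struct model_task *cur)
{
	return sysctl_enabled && cur->has_mm && !cur->unkillable &&
	       cur->oom_score_adj != OOM_SCORE_ADJ_MIN;
}
```

When the predicate is false, the patch leaves the existing
for_each_mem_cgroup_tree() scan untouched, so the knob only adds a fast path.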


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16 10:12 ` Sha Zhengju
@ 2012-10-16 10:20   ` Ni zhan Chen
  -1 siblings, 0 replies; 43+ messages in thread
From: Ni zhan Chen @ 2012-10-16 10:20 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, mhocko, linux-kernel,
	Sha Zhengju

On 10/16/2012 06:12 PM, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> Sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering
> task in out-of-memory situations, but it only works on overall system-wide oom.
> But it's also a useful indication in memcg so we take it into consideration
> while oom happening in memcg. Other sysctl such as panic_on_oom has already
> been memcg-aware.

Is this a resend or a new version? If it is a new version, could you add a
changelog?

>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>   mm/memcontrol.c |    9 +++++++++
>   1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e4e9b18..c329940 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1486,6 +1486,15 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   
>   	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
>   	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	if (sysctl_oom_kill_allocating_task && current->mm &&
> +	    !oom_unkillable_task(current, memcg, NULL) &&
> +	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> +		get_task_struct(current);
> +		oom_kill_process(current, gfp_mask, order, 0, totalpages, memcg, NULL,
> +				 "Memory cgroup out of memory (oom_kill_allocating_task)");
> +		return;
> +	}
> +
>   	for_each_mem_cgroup_tree(iter, memcg) {
>   		struct cgroup *cgroup = iter->css.cgroup;
>   		struct cgroup_iter it;


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16 10:20   ` Ni zhan Chen
  (?)
@ 2012-10-16 10:41     ` Sha Zhengju
  -1 siblings, 0 replies; 43+ messages in thread
From: Sha Zhengju @ 2012-10-16 10:41 UTC (permalink / raw)
  To: Ni zhan Chen
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, mhocko, linux-kernel,
	Sha Zhengju

On 10/16/2012 06:20 PM, Ni zhan Chen wrote:
> On 10/16/2012 06:12 PM, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering
>> task in out-of-memory situations, but it only works on overall system-wide oom.
>> But it's also a useful indication in memcg so we take it into consideration
>> while oom happening in memcg. Other sysctl such as panic_on_oom has already
>> been memcg-aware.
>
> Is it the resend one or new version, could you add changelog if it is 
> the last case?

Sorry, I forgot to mention that this patch is an updated version rebased on
the since-3.6 branch of mhocko's mm tree.
The first one was against an old kernel; please ignore it. :-)


Thanks,
Sha


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16 10:12 ` Sha Zhengju
  (?)
@ 2012-10-16 13:34   ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2012-10-16 13:34 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel,
	Sha Zhengju, David Rientjes

On Tue 16-10-12 18:12:08, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering
> task in out-of-memory situations, but it only works on overall system-wide oom.
> But it's also a useful indication in memcg so we take it into consideration
> while oom happening in memcg. Other sysctl such as panic_on_oom has already
> been memcg-aware.

Could you be more specific about the motivation for this patch? Is it
"let's be consistent with the global oom", or do you have a real use case
for this knob?

The primary motivation for oom_kill_allocating_task AFAIU was to reduce
the search over huge tasklists and reduce task_lock holding times. I am not
sure whether the original concern is still valid since 6b0c81b (mm,
oom: reduce dependency on tasklist_lock), as tasklist_lock usage has
been reduced considerably in favor of RCU read locks, but maybe
even that can be too disruptive?
David?

Moreover, the memcg oom killer doesn't iterate over the tasklist (it uses
cgroup_iter*), so this shouldn't cause a performance problem like
in the global case.
On the other hand, we take css_set_lock for reading for the whole
iteration, which might cause some issues as well, but those should
be described in the changelog.

> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  mm/memcontrol.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e4e9b18..c329940 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1486,6 +1486,15 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
>  	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	if (sysctl_oom_kill_allocating_task && current->mm &&
> +	    !oom_unkillable_task(current, memcg, NULL) &&
> +	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> +		get_task_struct(current);
> +		oom_kill_process(current, gfp_mask, order, 0, totalpages, memcg, NULL,
> +				 "Memory cgroup out of memory (oom_kill_allocating_task)");
> +		return;
> +	}
> +
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct cgroup *cgroup = iter->css.cgroup;
>  		struct cgroup_iter it;
> -- 
> 1.7.6.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16 13:34   ` Michal Hocko
  (?)
  (?)
@ 2012-10-16 17:14   ` Sha Zhengju
  2012-10-18 11:56       ` Michal Hocko
  -1 siblings, 1 reply; 43+ messages in thread
From: Sha Zhengju @ 2012-10-16 17:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel,
	Sha Zhengju, David Rientjes

On Tuesday, October 16, 2012, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 16-10-12 18:12:08, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering
>> task in out-of-memory situations, but it only works on overall system-wide oom.
>> But it's also a useful indication in memcg so we take it into consideration
>> while oom happening in memcg. Other sysctl such as panic_on_oom has already
>> been memcg-aware.
>
> Could you be more specific about the motivation for this patch? Is it
> "let's be consistent with the global oom" or you have a real use case
> for this knob.
>

In our environment (RHEL6), we encountered a memcg oom 'deadlock' problem.
Simply put, suppose process A is selected to be killed by the memcg oom
killer, but A is sleeping uninterruptibly on a page lock. What's worse, that
page lock is held by another memcg process B, which is stuck in
mem_cgroup_oom_lock (this turns out to be a livelock). A then cannot exit
to free the memory, and neither of them can make progress.
Indeed, we should dig into these locks to find the solution, and in fact
37b23e05 (x86, mm: make pagefault killable) and 7d9fdac (Memcg: make
oom_lock 0 and 1 based other than counter) have already solved the problem.
But if oom_kill_allocating_task were memcg-aware, enabling this suicide-style
oom behavior would be a simpler workaround. What's more, enabling the sysctl
can avoid other potential oom problems to some extent.


> The primary motivation for oom_kill_allocating_task AFAIU was to reduce
> search over huge tasklists and reduce task_lock holding times. I am not
> sure whether the original concern is still valid since 6b0c81b (mm,
> oom: reduce dependency on tasklist_lock) as the tasklist_lock usage has
> been reduced considerably in favor of RCU read locks, but maybe
> even that can be too disruptive?
> David?


On the other hand, the semantics of oom_kill_allocating_task imply allowing
suicide-like oom, which has no obvious relationship to performance problems
(such as huge task lists or task_lock holding time). So it would be better
either to make the sysctl consistent with the global oom, or to add a
separate option for memcg oom, just as panic_on_oom does.


> Moreover memcg oom killer doesn't iterate over tasklist (it uses
> cgroup_iter*) so this shouldn't cause the performance problem like
> for the global case.
> On the other hand we are taking css_set_lock for reading for the whole
> iteration which might cause some issues as well but those should better
> be described in the changelog.
>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>>  mm/memcontrol.c |    9 +++++++++
>>  1 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e4e9b18..c329940 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1486,6 +1486,15 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>
>>       check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
>>       totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
>> +     if (sysctl_oom_kill_allocating_task && current->mm &&
>> +         !oom_unkillable_task(current, memcg, NULL) &&
>> +         current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>> +             get_task_struct(current);
>> +             oom_kill_process(current, gfp_mask, order, 0, totalpages, memcg, NULL,
>> +                              "Memory cgroup out of memory (oom_kill_allocating_task)");
>> +             return;
>> +     }
>> +
>>       for_each_mem_cgroup_tree(iter, memcg) {
>>               struct cgroup *cgroup = iter->css.cgroup;
>>               struct cgroup_iter it;
>> --
>> 1.7.6.1
>
> --
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16 13:34   ` Michal Hocko
  (?)
@ 2012-10-16 18:39     ` David Rientjes
  -1 siblings, 0 replies; 43+ messages in thread
From: David Rientjes @ 2012-10-16 18:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sha Zhengju, linux-mm, cgroups, kamezawa.hiroyu, akpm,
	linux-kernel, Sha Zhengju

On Tue, 16 Oct 2012, Michal Hocko wrote:

> The primary motivation for oom_kill_allocating_task AFAIU was to reduce
> search over huge tasklists and reduce task_lock holding times. I am not
> sure whether the original concern is still valid since 6b0c81b (mm,
> oom: reduce dependency on tasklist_lock) as the tasklist_lock usage has
> been reduced considerably in favor of RCU read locks, but maybe
> even that can be too disruptive?
> David?
> 

When the oom killer became serialized, the folks from SGI requested this 
tunable to be able to avoid the expensive tasklist scan on their systems 
and to be able to avoid killing threads that aren't allocating memory at 
all in a steady state.  It wasn't necessarily about tasklist_lock holding 
time but rather the expensive iteration over such a large number of 
processes.

> Moreover memcg oom killer doesn't iterate over tasklist (it uses
> cgroup_iter*) so this shouldn't cause the performance problem like
> for the global case.

Depends on how many threads are attached to a memcg.
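
Editor's sketch: the cost being discussed can be illustrated with a toy
scan over an array of scores. pick_highest_score() and the points array are
hypothetical stand-ins, not kernel structures. The point is only that
choosing the "worst" task visits every candidate, whereas killing the
allocating task visits none.

```c
#include <stddef.h>

/*
 * Scan n hypothetical badness scores and return the index of the highest
 * one (the first such index on ties). *examined reports how many entries
 * were visited, which grows linearly with the number of tasks in the group.
 * The caller is expected to pass n > 0; for n == 0 the return value 0 is
 * not a valid index.
 */
static size_t pick_highest_score(const unsigned int *points, size_t n,
				 size_t *examined)
{
	size_t best = 0;

	*examined = 0;
	for (size_t i = 0; i < n; i++) {
		(*examined)++;
		if (points[i] > points[best])
			best = i;
	}
	return best;
}
```

With a memcg holding only a handful of tasks the scan is cheap; with many
thousands of threads attached, it approaches the global case described above.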

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16 10:12 ` Sha Zhengju
  (?)
@ 2012-10-16 18:44   ` David Rientjes
  -1 siblings, 0 replies; 43+ messages in thread
From: David Rientjes @ 2012-10-16 18:44 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, mhocko, linux-kernel,
	Sha Zhengju

On Tue, 16 Oct 2012, Sha Zhengju wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e4e9b18..c329940 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1486,6 +1486,15 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
>  	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	if (sysctl_oom_kill_allocating_task && current->mm &&
> +	    !oom_unkillable_task(current, memcg, NULL) &&
> +	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> +		get_task_struct(current);
> +		oom_kill_process(current, gfp_mask, order, 0, totalpages, memcg, NULL,
> +				 "Memory cgroup out of memory (oom_kill_allocating_task)");
> +		return;
> +	}
> +
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct cgroup *cgroup = iter->css.cgroup;
>  		struct cgroup_iter it;

Please try to compile your patches and run scripts/checkpatch.pl on them 
before proposing them.

You'll also need to update Documentation/sysctl/vm.txt.
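
For readers following the quoted hunk, the condition it adds can be
modelled in userspace roughly as below. This is only a sketch: struct
model_task and should_kill_allocating() are hypothetical names, the
boolean fields stand in for current->mm and the result of
oom_unkillable_task(), and the value of OOM_SCORE_ADJ_MIN is restated
locally as -1000 to match the kernel's definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN	(-1000)	/* restated here for the sketch */

/* Hypothetical userspace model of the fields the hunk looks at. */
struct model_task {
	bool has_mm;		/* models current->mm != NULL */
	bool unkillable;	/* models oom_unkillable_task() returning true */
	int oom_score_adj;	/* models current->signal->oom_score_adj */
};

/*
 * Mirrors the condition in the quoted hunk: kill the allocating task only
 * if the sysctl is set, the task still has an mm, it is killable within
 * this memcg, and it has not been exempted via OOM_SCORE_ADJ_MIN.
 */
static bool should_kill_allocating(bool sysctl_enabled,
				   const struct model_task *cur)
{
	assert(cur != NULL);
	return sysctl_enabled && cur->has_mm && !cur->unkillable &&
	       cur->oom_score_adj != OOM_SCORE_ADJ_MIN;
}
```

If any of the checks fails, the quoted patch falls through to the normal
per-memcg victim scan.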

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16 17:14   ` Sha Zhengju
  2012-10-18 11:56       ` Michal Hocko
@ 2012-10-18 11:56       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2012-10-18 11:56 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel,
	Sha Zhengju, David Rientjes

On Wed 17-10-12 01:14:48, Sha Zhengju wrote:
> On Tuesday, October 16, 2012, Michal Hocko <mhocko@suse.cz> wrote:
[...]
> > Could you be more specific about the motivation for this patch? Is it
> > "let's be consistent with the global oom" or you have a real use case
> > for this knob.
> >
> 
> In our environment (RHEL6), we encountered a memcg oom 'deadlock'
> problem.  Simply put, suppose process A is selected to be killed
> by the memcg oom killer, but A is in uninterruptible sleep on a page
> lock. What's worse, that very page lock is held by another memcg
> process B, which is trapped in mem_cgroup_oom_lock (this proves to be
> a livelock).

Hmm, this is strange. How can you get down that road with the page lock
held? Is it possible this is related to the issue fixed by: 1d65f86d
(mm: preallocate page before lock_page() at filemap COW)?

> Then A cannot exit to free the memory, and neither of them can make
> progress.

> Indeed, we should dig into these locks to find the solution, and
> in fact 37b23e05 (x86, mm: make pagefault killable) and
> 7d9fdac (Memcg: make oom_lock 0 and 1 based other than counter) have
> already solved the problem, but if oom_kill_allocating_task is
> memcg aware, enabling this suicide oom behavior will be a simpler
> workaround. What's more, enabling the sysctl can avoid other potential
> oom problems to some extent.

As I said, I am not against this but I really want to see a valid use
case first. So far I haven't seen any because what you mention above is
a clear bug which should be fixed. I can imagine that a huge number of
tasks in the group could be a problem as well, but I would like to see
what those problems are first.

> > The primary motivation for oom_kill_allocating_task AFAIU was to reduce
> > search over huge tasklists and reduce task_lock holding times. I am not
> > sure whether the original concern is still valid since 6b0c81b (mm,
> > oom: reduce dependency on tasklist_lock) as the tasklist_lock usage has
> > been reduced considerably in favor of RCU read locks, but maybe
> > even that can be too disruptive?
> > David?
> 
> 
> On the other hand, the semantics of oom_kill_allocating_task imply
> allowing suicide-like oom, which has no obvious relationship
> with performance problems (such as huge task lists or task_lock holding
> time).

I guess that suicide-like oom in fact means "kill the poor soul that
happened to charge the last". I do not see any use case for this off
the top of my head (apart from the performance benefits of course).

> So making the sysctl consistent with global oom would be better, or else
> adding an individual option for memcg oom, just as panic_on_oom does.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-18 11:56       ` Michal Hocko
  (?)
@ 2012-10-18 13:51         ` Sha Zhengju
  -1 siblings, 0 replies; 43+ messages in thread
From: Sha Zhengju @ 2012-10-18 13:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel,
	Sha Zhengju, David Rientjes

On 10/18/2012 07:56 PM, Michal Hocko wrote:
> On Wed 17-10-12 01:14:48, Sha Zhengju wrote:
>> On Tuesday, October 16, 2012, Michal Hocko<mhocko@suse.cz>  wrote:
> [...]
>>> Could you be more specific about the motivation for this patch? Is it
>>> "let's be consistent with the global oom" or you have a real use case
>>> for this knob.
>>>
>> In our environment (RHEL6), we encountered a memcg oom 'deadlock'
>> problem.  Simply put, suppose process A is selected to be killed
>> by the memcg oom killer, but A is in uninterruptible sleep on a page
>> lock. What's worse, that very page lock is held by another memcg
>> process B, which is trapped in mem_cgroup_oom_lock (this proves to be
>> a livelock).
> Hmm, this is strange. How can you get down that road with the page lock
> held? Is it possible this is related to the issue fixed by: 1d65f86d
> (mm: preallocate page before lock_page() at filemap COW)?

No, it has nothing to do with the COW page. Checking the stack of process A,
the one selected to be killed (in uninterruptible sleep), shows it was
stuck at:
__do_fault->filemap_fault->__lock_page_or_retry->wait_on_page_bit (D state).
Process B, which holds that very page lock, is on the following path:
__do_fault->filemap_fault->__do_page_cache_readahead->..->mpage_readpages
->add_to_page_cache_locked ---- >(in memcg oom and cannot exit)
In mpage_readpages, B tries to read in a dozen or so pages: for each page
it does the locking and charging, and then sends out one big bio. A is
waiting for one of those pages and is stuck.

As I said, 37b23e05 has made the page fault killable by changing the
uninterruptible sleep to a killable sleep. So A can be woken up to exit
successfully and free its memory, which in turn helps B get past the
memcg charging stage.

(By the way, it seems commits 37b23e05 and 7d9fdac need to be backported
to the -stable tree for the benefit of RHEL users. ;-) )
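
The dependency described above can be modelled with a tiny userspace state
machine. This is a sketch of the scenario as reported in this message, not
kernel code: struct oom_model, kill_victim(), retry_charge() and resolves()
are hypothetical names, and the model assumes that B can only get past its
charge once A has exited and freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How victim A sleeps while waiting for the page lock held by B. */
enum sleep_kind { SLEEP_UNINTERRUPTIBLE, SLEEP_KILLABLE };

struct oom_model {
	enum sleep_kind victim_sleep;
	bool victim_exited;		/* A has exited and freed its memory */
	bool charger_progressed;	/* B got past the memcg charge */
};

/* The OOM killer sends SIGKILL to A; only a killable sleep reacts to it. */
static void kill_victim(struct oom_model *m)
{
	assert(m != NULL);
	if (m->victim_sleep == SLEEP_KILLABLE)
		m->victim_exited = true;
}

/* B retries its charge; in this model it succeeds only after A exited. */
static void retry_charge(struct oom_model *m)
{
	assert(m != NULL);
	if (m->victim_exited)
		m->charger_progressed = true;
}

/* Returns true if the A/B dependency cycle is broken for this sleep kind. */
static bool resolves(enum sleep_kind kind)
{
	struct oom_model m = { kind, false, false };

	kill_victim(&m);
	retry_charge(&m);
	return m.charger_progressed;
}
```

In this model resolves(SLEEP_UNINTERRUPTIBLE) stays false however often the
charge is retried, while resolves(SLEEP_KILLABLE) becomes true, which is the
effect attributed to 37b23e05 above.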

>> Then A cannot exit to free the memory, and neither of them can make
>> progress.
>> Indeed, we should dig into these locks to find the solution, and
>> in fact 37b23e05 (x86, mm: make pagefault killable) and
>> 7d9fdac (Memcg: make oom_lock 0 and 1 based other than counter) have
>> already solved the problem, but if oom_kill_allocating_task is
>> memcg aware, enabling this suicide oom behavior will be a simpler
>> workaround. What's more, enabling the sysctl can avoid other potential
>> oom problems to some extent.
> As I said, I am not against this but I really want to see a valid use
> case first. So far I haven't seen any because what you mention above is
> a clear bug which should be fixed. I can imagine that a huge number of
> tasks in the group could be a problem as well, but I would like to see
> what those problems are first.
>

For consistency with global oom and for the performance benefit, I suggest
we may as well enable it for memcg oom, as there's no obvious harm.
As for the bug I mentioned, the real fix is obviously the two patches
above, but considering other *potential* memcg oom bugs, the sysctl may
serve as a temporary workaround to some extent... but it's just a
workaround.


Thanks,
Sha

>>> The primary motivation for oom_kill_allocating_task AFAIU was to reduce
>>> search over huge tasklists and reduce task_lock holding times. I am not
>>> sure whether the original concern is still valid since 6b0c81b (mm,
>>> oom: reduce dependency on tasklist_lock) as the tasklist_lock usage has
>>> been reduced considerably in favor of RCU read locks, but maybe
>>> even that can be too disruptive?
>>> David?
>>
>> On the other hand, the semantics of oom_kill_allocating_task imply
>> allowing suicide-like oom, which has no obvious relationship
>> with performance problems (such as huge task lists or task_lock holding
>> time).
> I guess that suicide-like oom in fact means "kill the poor soul that
> happened to charge the last". I do not see any use case for this off
> the top of my head (apart from the performance benefits of course).
>
>> So making the sysctl consistent with global oom would be better, or else
>> adding an individual option for memcg oom, just as panic_on_oom does.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-18 13:51         ` Sha Zhengju
  (?)
@ 2012-10-18 15:32           ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2012-10-18 15:32 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel,
	Sha Zhengju, David Rientjes

On Thu 18-10-12 21:51:57, Sha Zhengju wrote:
> On 10/18/2012 07:56 PM, Michal Hocko wrote:
> >On Wed 17-10-12 01:14:48, Sha Zhengju wrote:
> >>On Tuesday, October 16, 2012, Michal Hocko<mhocko@suse.cz>  wrote:
> >[...]
> >>>Could you be more specific about the motivation for this patch? Is it
> >>>"let's be consistent with the global oom" or you have a real use case
> >>>for this knob.
> >>>
> >>In our environment (RHEL6), we encountered a memcg oom 'deadlock'
> >>problem.  Simply put, suppose process A is selected to be killed
> >>by the memcg oom killer, but A is in uninterruptible sleep on a page
> >>lock. What's worse, that very page lock is held by another memcg
> >>process B, which is trapped in mem_cgroup_oom_lock (this proves to be
> >>a livelock).
> >Hmm, this is strange. How can you get down that road with the page lock
> >held? Is it possible this is related to the issue fixed by: 1d65f86d
> >(mm: preallocate page before lock_page() at filemap COW)?
> 
> No, it has nothing to do with the COW page. Checking the stack of
> process A, the one selected to be killed (in uninterruptible sleep),
> shows it was stuck at:
> __do_fault->filemap_fault->__lock_page_or_retry->wait_on_page_bit (D
> state).
> Process B, which holds that very page lock, is on the following path:
> __do_fault->filemap_fault->__do_page_cache_readahead->..->mpage_readpages
> ->add_to_page_cache_locked ---- >(in memcg oom and cannot exit)

Hmm, filemap_fault locks the page only after the readahead has already
been triggered, so it doesn't call mpage_readpages with any page locked -
add_to_page_cache_lru is called without any page locked.
This is at least the current code. It might be different in RHEL6, but
calling memcg charging with a page lock held is definitely a bug.

> In mpage_readpages, B tries to read in a dozen or so pages: for each
> page it does the locking and charging, and then sends out one big bio.
> A is waiting for one of those pages and is stuck.
> 
> As I said, 37b23e05 has made the page fault killable by changing the
> uninterruptible sleep to a killable sleep. So A can be woken up to
> exit successfully and free its memory, which in turn helps B get past
> the memcg charging stage.
> 
> (By the way, it seems commits 37b23e05 and 7d9fdac need to be

79dfdaccd1d5 you mean, right? That one just helps when there are too
many tasks trashing oom killer so it is not related to what you are
trying to achieve. Besides that make sure you take 23751be0 if you
take it.

> backported to --stable tree to deliver RHEL users. ;-) )

I am not sure the first one qualifies for stable tree inclusion as it is
a feature.

> >>Then A can not exit successfully to free the memory and both of them
> >>can not moving on.
> >>Indeed, we should dig into these locks to find the solution and
> >>in fact the 37b23e05 (x86, mm: make pagefault killable) and
> >>7d9fdac(Memcg: make oom_lock 0 and 1 based other than counter) have
> >>already solved the problem, but if oom_kill_allocating_task is
> >>memcg aware, enabling this suicide oom behavior will be a simpler
> >>workaround. What's more, enabling the sysctl can avoid other potential
> >>oom problems to some extent.
> >As I said, I am not against this but I really want to see a valid use
> >case first. So far I haven't seen any because what you mention above is
> >a clear bug which should be fixed. I can imagine the huge number of
> >tasks in the group could be a problem as well but I would like to see
> >what are those problems first.
> >
> 
> In view of consistent with global oom and performance benefit, I suggest
> we may as well open it in memcg oom as there's no obvious harm.

I am not sure about the "no obvious harm" part. The policy could, e.g.,
differ between groups, and then the global knob could be really
misleading. But the question is: is it worth having this per group? To
be honest, I do not like the global knob either and I am not entirely
keen on spreading it out into memcg unless there is a real use case for
it.

> As refer to the bug I mentioned, obviously the key solution is the above two
> patchset, but considering other *potential* memcg oom bugs, the sysctl may
> be a role of temporary workaround to some extent... but it's just a
> workaround.

We shouldn't add something like that just to work around obvious bugs.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-18 15:32           ` Michal Hocko
  (?)
@ 2012-10-19  4:11             ` Sha Zhengju
  -1 siblings, 0 replies; 43+ messages in thread
From: Sha Zhengju @ 2012-10-19  4:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel,
	Sha Zhengju, David Rientjes

On 10/18/2012 11:32 PM, Michal Hocko wrote:
> On Thu 18-10-12 21:51:57, Sha Zhengju wrote:
>> On 10/18/2012 07:56 PM, Michal Hocko wrote:
>>> On Wed 17-10-12 01:14:48, Sha Zhengju wrote:
>>>> On Tuesday, October 16, 2012, Michal Hocko<mhocko@suse.cz>   wrote:
>>> [...]
>>>>> Could you be more specific about the motivation for this patch? Is it
>>>>> "let's be consistent with the global oom" or you have a real use case
>>>>> for this knob.
>>>>>
>>>> In our environment(rhel6), we encounter a memcg oom 'deadlock'
>>>> problem.  Simply speaking, suppose process A is selected to be killed
>>>> by memcg oom killer, but A is uninterruptible sleeping on a page
>>>> lock. What's worse, the exact page lock is holding by another memcg
>>>> process B which is trapped in mem_cgroup_oom_lock(proves to be a
>>>> livelock).
>>> Hmm, this is strange. How can you get down that road with the page lock
>>> held? Is it possible this is related to the issue fixed by: 1d65f86d
>>> (mm: preallocate page before lock_page() at filemap COW)?
>> No, it has nothing with the cow page. By checking stack of the process A
>> selected to be killed(uninterruptible sleeping), it was stuck at:
>> __do_fault->filemap_fault->__lock_page_or_retry->wait_on_page_bit--(D
>> state).
>> The person B holding the exactly page lock is on the following path:
>> __do_fault->filemap_fault->__do_page_cache_readahead->..->mpage_readpages
>> ->add_to_page_cache_locked ---->(in memcg oom and cannot exit)
> Hmm filemap_fault locks the page after the read ahead is triggered
> already so it doesn't call mpage_readpages with any page locked - the
> add_to_page_cache_lru is called without any page locked.

It's not the page being faulted in filemap_fault that causes the problem,
but the pages handled by readahead. To clarify the point, the more
detailed call stack is:
filemap_fault->do_async/sync_mmap_readahead->ondemand_readahead->
__do_page_cache_readahead->read_pages->ext3/4_readpages->*mpage_readpages*

The risk comes from mpage_readpages:
for each of the readahead pages
      (1)add_to_page_cache_lru (--> *will lock the page and go through
         memcg charging*)
      add the page to a big bio
submit_bio (so those locked pages are only unlocked in end_bio, once the
I/O completes)

So if a page is being charged in step (1) and cannot get out of memcg oom
successfully (I'll explain the reason below), submit_bio is postponed
indefinitely while the page locks of the previous pages are still held.
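
The loop described above can be sketched as a user-space toy model. This
is only an illustration under stated assumptions, not kernel code: the
names toy_page, toy_memcg, toy_charge and toy_readpages are hypothetical,
and a failed charge simply returns here, whereas in the reported case the
task stays stuck in the memcg oom path. What it shows is that pages
locked earlier in the loop stay locked when a later charge cannot
complete, because the bio is only submitted after the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the lock bit is modelled. */
struct toy_page {
	bool locked;
};

/* Hypothetical stand-in for a memcg with a hard limit, counted in pages. */
struct toy_memcg {
	size_t usage;
	size_t limit;
};

/* Charge one page. Returns false where the real task would enter memcg oom. */
bool toy_charge(struct toy_memcg *cg)
{
	if (cg->usage >= cg->limit)
		return false;
	cg->usage++;
	return true;
}

/*
 * Model of the readahead loop: lock and charge each page, and "submit the
 * bio" (which is what eventually unlocks the pages) only after the loop.
 * Returns the number of pages left locked; 0 means the bio went out and
 * I/O completion unlocked everything.
 */
size_t toy_readpages(struct toy_memcg *cg, struct toy_page *pages, size_t nr)
{
	size_t i;

	assert(pages != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		pages[i].locked = true;		/* lock taken before charging */
		if (!toy_charge(cg))
			return i + 1;		/* stuck: bio never submitted */
	}

	/* submit_bio + end_bio: all queued pages get unlocked. */
	for (i = 0; i < nr; i++)
		pages[i].locked = false;
	return 0;
}
```

In this model, a group with limit 2 asked to read 4 pages leaves 3 pages
locked, so any task waiting on one of them would sleep. Calling
toy_readpages again after raising the limit returns 0 with every page
unlocked (the toy simply re-charges and does not track per-page charges).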

> This is at least the current code. It might be different in rhel6 but
> calling memcg charging with a page lock is definitely a bug.
>

The current code (mm repo, since-3.6) is unchanged here. Though we may
need to take care of page locking and memcg charging in mpage_readpages,
that dives down to the fs level. Besides, 37b23e05 has already fixed the
deadlock from the other side: a process can still be killed even while
waiting for a page lock. But considering other potential problems, we may
as well do something in mpage_readpages to avoid calling
add_to_page_cache_lru with any page locked.

>> In mpage_readpages, B tends to read a dozen of pages in: for each of
>> page will do
>> locking, charging, and then send out a big bio. And A is waiting for
>> one of the pages
>> and stuck.
>>
>> As I said, 37b23e05 has made pagefault killable by changing
>> uninterruptible sleeping to killable sleeping. So A can be woke up to
>> exit successfully and free the memory which can in turn help B pass
>> memcg charging period.
>>
>> (By the way, it seems commit 37b23e05 and 7d9fdac need to be
> 79dfdaccd1d5 you mean, right? That one just helps when there are too
> many tasks trashing oom killer so it is not related to what you are
> trying to achieve. Besides that make sure you take 23751be0 if you
> take it.
>

Here is the reason why I said a process may go through memcg oom and not
be able to exit. It's just the phenomenon described in the commit log of
79dfdaccd: the old version of the memcg oom lock can lead to serious
starvation and make many tasks thrash the oom killer without getting
anything useful done.


These two reasons together cause the bug and can make the memcg unusable
(sys time up to almost 100%) for hours or even days... Once we give the
memcg some extra memory (e.g. by raising the hard limit a little), the
processes stuck in the oom killer pass the charging and eventually send
the bio out, which unlocks those pages and wakes up the D-state sleeper.


>> backported to --stable tree to deliver RHEL users. ;-) )
> I am not sure the first one qualifies the stable tree inclusion as it is
> a feature.
>

When debugging the problem, we indeed found that 37b23e05 is the key
fix for the deadlock bug.


>>>> Then A can not exit successfully to free the memory and both of them
>>>> can not moving on.
>>>> Indeed, we should dig into these locks to find the solution and
>>>> in fact the 37b23e05 (x86, mm: make pagefault killable) and
>>>> 7d9fdac(Memcg: make oom_lock 0 and 1 based other than counter) have
>>>> already solved the problem, but if oom_kill_allocating_task is
>>>> memcg aware, enabling this suicide oom behavior will be a simpler
>>>> workaround. What's more, enabling the sysctl can avoid other potential
>>>> oom problems to some extent.
>>> As I said, I am not against this but I really want to see a valid use
>>> case first. So far I haven't seen any because what you mention above is
>>> a clear bug which should be fixed. I can imagine the huge number of
>>> tasks in the group could be a problem as well but I would like to see
>>> what are those problems first.
>>>
>> In view of consistent with global oom and performance benefit, I suggest
>> we may as well open it in memcg oom as there's no obvious harm.
> I am not sure about "no obvious harm" part. The policy could be
> different in different groups e.g. and the global knob could be really
> misleading. But the question is. Is it worth having this per group? To
> be honest, I do not like the global knob either and I am not entirely
> keen on spreading it out into memcg unless there is a real use case for
> it.
>

Okay... then let's set it aside for now. We may use it as an in-house
patch. :-)


Thanks,
Sha

>> As refer to the bug I mentioned, obviously the key solution is the above two
>> patchset, but considering other *potential* memcg oom bugs, the sysctl may
>> be a role of temporary workaround to some extent... but it's just a
>> workaround.
> We shouldn't add something like that just to workaround obvious bugs.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
@ 2012-10-19  4:11             ` Sha Zhengju
  0 siblings, 0 replies; 43+ messages in thread
From: Sha Zhengju @ 2012-10-19  4:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju, David Rientjes

On 10/18/2012 11:32 PM, Michal Hocko wrote:
> On Thu 18-10-12 21:51:57, Sha Zhengju wrote:
>> On 10/18/2012 07:56 PM, Michal Hocko wrote:
>>> On Wed 17-10-12 01:14:48, Sha Zhengju wrote:
>>>> On Tuesday, October 16, 2012, Michal Hocko<mhocko-AlSwsSmVLrQ@public.gmane.org>   wrote:
>>> [...]
>>>>> Could you be more specific about the motivation for this patch? Is it
>>>>> "let's be consistent with the global oom" or you have a real use case
>>>>> for this knob.
>>>>>
>>>> In our environment(rhel6), we encounter a memcg oom 'deadlock'
>>>> problem.  Simply speaking, suppose process A is selected to be killed
>>>> by memcg oom killer, but A is uninterruptible sleeping on a page
>>>> lock. What's worse, the exact page lock is holding by another memcg
>>>> process B which is trapped in mem_croup_oom_lock(proves to be a
>>>> livelock).
>>> Hmm, this is strange. How can you get down that road with the page lock
>>> held? Is it possible this is related to the issue fixed by: 1d65f86d
>>> (mm: preallocate page before lock_page() at filemap COW)?
>> No, it has nothing with the cow page. By checking stack of the process A
>> selected to be killed(uninterruptible sleeping), it was stuck at:
>> __do_fault->filemap_fault->__lock_page_or_retry->wait_on_page_bit--(D
>> state).
>> The person B holding the exactly page lock is on the following path:
>> __do_fault->filemap_fault->__do_page_cache_readahead->..->mpage_readpages
>> ->add_to_page_cache_locked ---->(in memcg oom and cannot exit)
> Hmm filemap_fault locks the page after the read ahead is triggered
> already so it doesn't call mpage_readpages with any page locked - the
> add_to_page_cache_lru is called without any page locked.

It's not the page being fault in filemap_fault that causing the problem, 
but those
pages handling by readhead. To clarify the point, the more detailed call 
stack is:
filemap_fault->do_async/sync_mmap_readahead->ondemand_readahead->
__do_page_cache_readahead->read_pages->ext3/4_readpages->*mpage_readpages*

It is because mpage_readpages that bring the risk:
for each of readahead pages
      (1)add_to_page_cache_lru (--> *will lock page and go through memcg 
charging*)
      add the page to a big bio
submit_bio (So those locked pages will be unlocked in end_bio after swapin)

So if a page is being charged and cannot exit from memcg oom successfully
(following I'll explain the reason) in step (1), it will cause the 
submit_bio indefinitely
postponed while holding the PageLock of previous pages.

> This is at least the current code. It might be different in rhel6 but
> calling memcg charging with a page lock is definitely a bug.
>

The current code (mm repo since-3.6) here remains unchanged. Through we 
may need
to take care of page lock and memcg charging in mpage_readpages, it 
dives to fs level.
Besides 37b23e05 have already fixed the deadlock from the other side: 
process still can be
killed even waiting for pagelock. But considering other potential 
problem, we may as well do
something in mpage_readpages to avoid calling add_to_page_cache_lru with 
any page locked.

>> In mpage_readpages, B tends to read a dozen of pages in: for each of
>> page will do
>> locking, charging, and then send out a big bio. And A is waiting for
>> one of the pages
>> and stuck.
>>
>> As I said, 37b23e05 has made pagefault killable by changing
>> uninterruptible sleeping to killable sleeping. So A can be woke up to
>> exit successfully and free the memory which can in turn help B pass
>> memcg charging period.
>>
>> (By the way, it seems commit 37b23e05 and 7d9fdac need to be
> 79dfdaccd1d5 you mean, right? That one just helps when there are too
> many tasks trashing oom killer so it is not related to what you are
> trying to achieve. Besides that make sure you take 23751be0 if you
> take it.
>

Here is the reason why I said a process may go though memcg oom and cannot
exit. It's just the phenomenon described in the commit log of 79dfdaccd:
the old version of memcg oom lock can lead to serious starvation and make
many tasks trash oom killer but nothing useful can be done.


It is for these two reasons that cause the bug and can make the memcg 
unusable
(sys up to almost 100%)for hours even days... Once we give some extra memory
to the memcg(such as increase hardlimit a little), the processes tending 
into oom killer
will pass the charging and send bio out eventually, which will unlock 
those pages and
wake up the D sleeper.


>> backported to --stable tree to deliver RHEL users. ;-) )
> I am not sure the first one qualifies the stable tree inclusion as it is
> a feature.
>

When debugging the problem, we indeed found 37b23e05 is the key
enemy of the deadlock bug.


>>>> Then A can not exit successfully to free the memory and both of them
>>>> can not moving on.
>>>> Indeed, we should dig into these locks to find the solution and
>>>> in fact the 37b23e05 (x86, mm: make pagefault killable) and
>>>> 7d9fdac(Memcg: make oom_lock 0 and 1 based other than counter) have
>>>> already solved the problem, but if oom_killing_allocating_task is
>>>> memcg aware, enabling this suicide oom behavior will be a simpler
>>>> workaround. What's more, enabling the sysctl can avoid other potential
>>>> oom problems to some extent.
>>> As I said, I am not against this but I really want to see a valid use
>>> case first. So far I haven't seen any because what you mention above is
>>> a clear bug which should be fixed. I can imagine the huge number of
>>> tasks in the group could be a problem as well but I would like to see
>>> what are those problems first.
>>>
>> In view of consistent with global oom and performance benefit, I suggest
>> we may as well open it in memcg oom as there's no obvious harm.
> I am not sure about "no obvious harm" part. The policy could be
> different in different groups e.g. and the global knob could be really
> misleading. But the question is. Is it worth having this per group? To
> be honest, I do not like the global knob either and I am not entirely
> keen on spreading it out into memcg unless there is a real use case for
> it.
>

Okay... then let's shelve it. We may use it as an in-house patch. :-)


Thanks,
Sha

>> As refer to the bug I mentioned, obviously the key solution is the above two
>> patchset, but considing other *potential* memcg oom bugs, the sysctl may
>> be a role of temporary workaround to some extent... but it's just a
>> workaround.
> We shouldn't add something like that just to workaround obvious bugs.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-19  4:11             ` Sha Zhengju
  (?)
@ 2012-10-19  9:52               ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2012-10-19  9:52 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel,
	Sha Zhengju, David Rientjes

On Fri 19-10-12 12:11:52, Sha Zhengju wrote:
> On 10/18/2012 11:32 PM, Michal Hocko wrote:
> >On Thu 18-10-12 21:51:57, Sha Zhengju wrote:
> >>On 10/18/2012 07:56 PM, Michal Hocko wrote:
> >>>On Wed 17-10-12 01:14:48, Sha Zhengju wrote:
> >>>>On Tuesday, October 16, 2012, Michal Hocko<mhocko@suse.cz>   wrote:
> >>>[...]
> >>>>>Could you be more specific about the motivation for this patch? Is it
> >>>>>"let's be consistent with the global oom" or you have a real use case
> >>>>>for this knob.
> >>>>>
> >>>>In our environment(rhel6), we encounter a memcg oom 'deadlock'
> >>>>problem.  Simply speaking, suppose process A is selected to be killed
> >>>>by memcg oom killer, but A is uninterruptible sleeping on a page
> >>>>lock. What's worse, the exact page lock is holding by another memcg
> >>>>process B which is trapped in mem_croup_oom_lock(proves to be a
> >>>>livelock).
> >>>Hmm, this is strange. How can you get down that road with the page lock
> >>>held? Is it possible this is related to the issue fixed by: 1d65f86d
> >>>(mm: preallocate page before lock_page() at filemap COW)?
> >>No, it has nothing with the cow page. By checking stack of the process A
> >>selected to be killed(uninterruptible sleeping), it was stuck at:
> >>__do_fault->filemap_fault->__lock_page_or_retry->wait_on_page_bit--(D
> >>state).
> >>The person B holding the exactly page lock is on the following path:
> >>__do_fault->filemap_fault->__do_page_cache_readahead->..->mpage_readpages
> >>->add_to_page_cache_locked ---->(in memcg oom and cannot exit)
> >Hmm filemap_fault locks the page after the read ahead is triggered
> >already so it doesn't call mpage_readpages with any page locked - the
> >add_to_page_cache_lru is called without any page locked.

And I was probably blind yesterday, because if I had looked inside
add_to_page_cache_lru I would have found out that we lock the page
before charging it. /me stupid. Sorry about the confusion.
That one is OK, though, because the page is freshly allocated and not
visible when we charge it. This is not related to your problem, more on
that below.

> It's not the page being fault in filemap_fault that causing the
> problem, but those pages handling by readhead. To clarify the point,
> the more detailed call stack is:
> filemap_fault->do_async/sync_mmap_readahead->ondemand_readahead->
> __do_page_cache_readahead->read_pages->ext3/4_readpages->*mpage_readpages*
> 
> It is because mpage_readpages that bring the risk:
> for each of readahead pages
>      (1)add_to_page_cache_lru (--> *will lock page and go through
> memcg charging*) add the page to a big bio submit_bio (So those locked
> pages will be unlocked in end_bio after swapin)
> 
> So if a page is being charged and cannot exit from memcg oom
> successfully (following I'll explain the reason) in step (1), it will
> cause the submit_bio indefinitely postponed while holding the PageLock
> of previous pages.

OK, I think I am finally seeing what you are trying to say. But you
are wrong here. Previously locked and charged pages were already
submitted (every do_mpage_readpage submits the given page), so the IO
will finish eventually and those pages get unlocked.
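
The two readings can be put side by side in a toy userspace model. This
is only a sketch and not kernel code: the function, the enum and the
"stall" parameter are hypothetical names, and the only assumption taken
from the discussion is the ordering (lock, charge, submit). It counts
how many pages locked before the stalling charge are still waiting to be
submitted, under per-page submission as described above and under the
batched submission assumed in the quoted mail.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the kernel. */
enum submit_policy {
	SUBMIT_PER_PAGE,	/* each page is submitted right after its charge */
	SUBMIT_BATCHED,		/* pages are collected and submitted after the loop */
};

/*
 * Walk nr_pages readahead pages and stall forever in the charge of page
 * stall_at (pass stall_at >= nr_pages for "no stall").  Returns the
 * number of earlier pages that are locked but not yet submitted, i.e.
 * pages whose lock no I/O completion is going to release.
 */
size_t locked_unsubmitted_before_stall(size_t nr_pages, size_t stall_at,
				       enum submit_policy policy)
{
	size_t pending = 0;	/* locked and charged, waiting for submission */

	assert(policy == SUBMIT_PER_PAGE || policy == SUBMIT_BATCHED);

	for (size_t i = 0; i < nr_pages; i++) {
		/* page i is locked here, before it is charged */
		if (i == stall_at)
			return pending;	/* the charge never returns (memcg oom) */

		/* per-page: already in flight; batched: keeps waiting */
		if (policy == SUBMIT_BATCHED)
			pending++;
	}

	return 0;	/* loop finished, the final submit flushed everything */
}
```

With a stall at page 5 of 8, the per-page policy leaves no earlier page
locked and unsubmitted, while the batched policy leaves five.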
 
[...]
> >>As I said, 37b23e05 has made pagefault killable by changing
> >>uninterruptible sleeping to killable sleeping. So A can be woke up to
> >>exit successfully and free the memory which can in turn help B pass
> >>memcg charging period.
> >>
> >>(By the way, it seems commit 37b23e05 and 7d9fdac need to be
> >79dfdaccd1d5 you mean, right? That one just helps when there are too
> >many tasks trashing oom killer so it is not related to what you are
> >trying to achieve. Besides that make sure you take 23751be0 if you
> >take it.
> >
> 
> Here is the reason why I said a process may go though memcg oom and cannot
> exit. It's just the phenomenon described in the commit log of 79dfdaccd:
> the old version of memcg oom lock can lead to serious starvation and make
> many tasks trash oom killer but nothing useful can be done.

Yes, thrashing on oom is certainly possible without that patch, and it
seems that this is the culprit of the problem you are describing.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16  6:32     ` Sha Zhengju
@ 2012-10-16  7:03       ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2012-10-16  7:03 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: David Rientjes, Sha Zhengju, linux-mm, cgroups, kamezawa.hiroyu,
	akpm, linux-kernel

On Tue 16-10-12 14:32:05, Sha Zhengju wrote:
[...]
> Thanks for reminding!  Yes, I cooked it on memcg-devel git repo but
> a out-of-date
> since-3.2 branch... But I notice the latest branch is since-3.5(not
> seeing 3.6/3.7), does
> it okay to working on this branch?

The tree has moved to
http://git.kernel.org/?p=linux/kernel/git/mhocko/mm.git;a=summary.
Please use that tree.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16  6:12   ` David Rientjes
@ 2012-10-16  6:32     ` Sha Zhengju
  -1 siblings, 0 replies; 43+ messages in thread
From: Sha Zhengju @ 2012-10-16  6:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Sha Zhengju, linux-mm, cgroups, kamezawa.hiroyu, akpm, mhocko,
	linux-kernel

On 10/16/2012 02:12 PM, David Rientjes wrote:
> On Tue, 16 Oct 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju<handai.szj@taobao.com>
>>
>> Sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering
>> task in out-of-memory situations, but it only works on overall system-wide oom.
>> But it's also a useful indication in memcg so we take it into consideration
>> while oom happening in memcg. Other sysctl such as panic_on_oom has already
>> been memcg-ware.
>>
> You're working on an old kernel, mem_cgroup_out_of_memory() has moved to
> mm/memcontrol.c.  Please rebase on 3.7-rc1 and send an updated patch,
> which otherwise looks good.

Thanks for the reminder! Yes, I cooked it on the memcg-devel git repo,
but on an out-of-date since-3.2 branch... I notice the latest branch is
since-3.5 (I don't see 3.6/3.7). Is it okay to work on this branch?


Thanks,
Sha

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
  2012-10-16  6:10 ` Sha Zhengju
  (?)
@ 2012-10-16  6:12   ` David Rientjes
  -1 siblings, 0 replies; 43+ messages in thread
From: David Rientjes @ 2012-10-16  6:12 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, mhocko, linux-kernel

On Tue, 16 Oct 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering
> task in out-of-memory situations, but it only works on overall system-wide oom.
> But it's also a useful indication in memcg so we take it into consideration
> while oom happening in memcg. Other sysctl such as panic_on_oom has already
> been memcg-ware.
> 

You're working on an old kernel, mem_cgroup_out_of_memory() has moved to 
mm/memcontrol.c.  Please rebase on 3.7-rc1 and send an updated patch, 
which otherwise looks good.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
@ 2012-10-16  6:10 ` Sha Zhengju
  0 siblings, 0 replies; 43+ messages in thread
From: Sha Zhengju @ 2012-10-16  6:10 UTC (permalink / raw)
  To: linux-mm, cgroups, kamezawa.hiroyu, akpm, mhocko
  Cc: linux-kernel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Sysctl oom_kill_allocating_task enables or disables killing the
OOM-triggering task in out-of-memory situations, but it only applies to
system-wide oom. It is also a useful hint for memcg, so take it into
consideration when oom happens in a memcg. Other sysctls such as
panic_on_oom are already memcg-aware.


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/oom_kill.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 38129e3..2a176af 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -574,6 +574,18 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask)
        check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
        limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT;
        read_lock(&tasklist_lock);
+       if (sysctl_oom_kill_allocating_task &&
+           !oom_unkillable_task(current, memcg, NULL) &&
+           current->mm) {
+               /*
+                * oom_kill_process() needs tasklist_lock held.  If it returns
+                * non-zero, current could not be killed so we must fallback to
+                * the tasklist scan.
+                */
+               if (!oom_kill_process(current, gfp_mask, 0, 0, limit, memcg, NULL,
+                               "Memory cgroup out of memory (oom_kill_allocating_task)"))
+                       goto out;
+       }
 retry:
        p = select_bad_process(&points, limit, memcg, NULL);
        if (!p || PTR_ERR(p) == -1UL)
--
1.7.6.1
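
The check added by the hunk above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: struct
task_model, enum oom_victim and memcg_oom_pick() are hypothetical names,
and the three booleans merely stand in for current->mm,
oom_unkillable_task() and a non-zero return from oom_kill_process().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the allocating task; every field is a stand-in. */
struct task_model {
	bool has_mm;		/* stands in for current->mm != NULL */
	bool unkillable;	/* stands in for oom_unkillable_task() */
	bool kill_fails;	/* stands in for oom_kill_process() != 0 */
};

enum oom_victim {
	VICTIM_CURRENT,		/* the allocating task itself is killed */
	VICTIM_SCANNED,		/* fall back to the tasklist scan */
};

/*
 * Mirror the order of the tests in the patch: the sysctl must be set,
 * the task must be killable and own an mm, and the kill must succeed.
 * Otherwise the usual victim selection runs.
 */
enum oom_victim memcg_oom_pick(bool kill_allocating_task,
			       const struct task_model *current_task)
{
	assert(current_task != NULL);

	if (kill_allocating_task &&
	    !current_task->unkillable &&
	    current_task->has_mm &&
	    !current_task->kill_fails)
		return VICTIM_CURRENT;

	return VICTIM_SCANNED;
}
```

The fallback matters: when the sysctl is off, or the allocating task
cannot be killed, the behaviour is the same as before the patch.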


^ permalink raw reply related	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2012-10-19  9:52 UTC | newest]

Thread overview: 43+ messages
2012-10-16 10:12 [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening Sha Zhengju
2012-10-16 10:12 ` Sha Zhengju
2012-10-16 10:12 ` Sha Zhengju
2012-10-16 10:20 ` Ni zhan Chen
2012-10-16 10:20   ` Ni zhan Chen
2012-10-16 10:41   ` Sha Zhengju
2012-10-16 10:41     ` Sha Zhengju
2012-10-16 10:41     ` Sha Zhengju
2012-10-16 13:34 ` Michal Hocko
2012-10-16 13:34   ` Michal Hocko
2012-10-16 13:34   ` Michal Hocko
2012-10-16 17:14   ` Sha Zhengju
2012-10-18 11:56     ` Michal Hocko
2012-10-18 11:56       ` Michal Hocko
2012-10-18 11:56       ` Michal Hocko
2012-10-18 13:51       ` Sha Zhengju
2012-10-18 13:51         ` Sha Zhengju
2012-10-18 13:51         ` Sha Zhengju
2012-10-18 15:32         ` Michal Hocko
2012-10-18 15:32           ` Michal Hocko
2012-10-18 15:32           ` Michal Hocko
2012-10-19  4:11           ` Sha Zhengju
2012-10-19  4:11             ` Sha Zhengju
2012-10-19  4:11             ` Sha Zhengju
2012-10-19  9:52             ` Michal Hocko
2012-10-19  9:52               ` Michal Hocko
2012-10-19  9:52               ` Michal Hocko
2012-10-16 18:39   ` David Rientjes
2012-10-16 18:39     ` David Rientjes
2012-10-16 18:39     ` David Rientjes
2012-10-16 18:44 ` David Rientjes
2012-10-16 18:44   ` David Rientjes
2012-10-16 18:44   ` David Rientjes
  -- strict thread matches above, loose matches on Subject: below --
2012-10-16  6:10 Sha Zhengju
2012-10-16  6:10 ` Sha Zhengju
2012-10-16  6:10 ` Sha Zhengju
2012-10-16  6:12 ` David Rientjes
2012-10-16  6:12   ` David Rientjes
2012-10-16  6:12   ` David Rientjes
2012-10-16  6:32   ` Sha Zhengju
2012-10-16  6:32     ` Sha Zhengju
2012-10-16  7:03     ` Michal Hocko
2012-10-16  7:03       ` Michal Hocko
