All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-05 23:26 ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-05 23:26 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots -- first
in select_bad_process and then a further refinement by looking for
child processes in oom_kill_process. Since this is a two step process,
it is possible that the process selected by select_bad_process may get a
SIGKILL just before oom_kill_process executes. If this were to happen,
__unhash_process deletes this process from the thread_group list. This
then results in oom_kill_process getting stuck in an infinite loop when
traversing the thread_group list of the selected process.

Fix this race by holding the tasklist_lock across the calls to both
select_bad_process and oom_kill_process.

Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..7bd3587 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -436,7 +436,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,7 +455,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
 	rcu_read_lock();
 	p = find_lock_task_mm(victim);
@@ -641,6 +639,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
 	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
 
+	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
@@ -663,6 +662,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		killed = 1;
 	}
 out:
+	read_unlock(&tasklist_lock);
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
 	 * allocate memory again.
-- 
1.8.4.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-05 23:26 ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-05 23:26 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots -- first
in select_bad_process and then a further refinement by looking for
child processes in oom_kill_process. Since this is a two step process,
it is possible that the process selected by select_bad_process may get a
SIGKILL just before oom_kill_process executes. If this were to happen,
__unhash_process deletes this process from the thread_group list. This
then results in oom_kill_process getting stuck in an infinite loop when
traversing the thread_group list of the selected process.

Fix this race by holding the tasklist_lock across the calls to both
select_bad_process and oom_kill_process.

Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..7bd3587 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -436,7 +436,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,7 +455,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
 	rcu_read_lock();
 	p = find_lock_task_mm(victim);
@@ -641,6 +639,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
 	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
 
+	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
@@ -663,6 +662,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		killed = 1;
 	}
 out:
+	read_unlock(&tasklist_lock);
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
 	 * allocate memory again.
-- 
1.8.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-05 23:26 ` Sameer Nanda
@ 2013-11-06  1:18   ` David Rientjes
  -1 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-06  1:18 UTC (permalink / raw)
  To: Sameer Nanda; +Cc: akpm, mhocko, hannes, rusty, linux-mm, linux-kernel

On Tue, 5 Nov 2013, Sameer Nanda wrote:

> The selection of the process to be killed happens in two spots -- first
> in select_bad_process and then a further refinement by looking for
> child processes in oom_kill_process. Since this is a two step process,
> it is possible that the process selected by select_bad_process may get a
> SIGKILL just before oom_kill_process executes. If this were to happen,
> __unhash_process deletes this process from the thread_group list. This
> then results in oom_kill_process getting stuck in an infinite loop when
> traversing the thread_group list of the selected process.
> 
> Fix this race by holding the tasklist_lock across the calls to both
> select_bad_process and oom_kill_process.
> 
> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
> Signed-off-by: Sameer Nanda <snanda@chromium.org>

Nack, we had to avoid taking tasklist_lock for this duration since it 
stalls out forks and exits on other cpus trying to take the writeside with 
irqs disabled to avoid watchdog problems.

What kernel version are you patching?  If you check the latest Linus tree, 
we hold a reference to the task_struct of the chosen process before 
calling oom_kill_process() so the hypothesis would seem incorrect.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-06  1:18   ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-06  1:18 UTC (permalink / raw)
  To: Sameer Nanda; +Cc: akpm, mhocko, hannes, rusty, linux-mm, linux-kernel

On Tue, 5 Nov 2013, Sameer Nanda wrote:

> The selection of the process to be killed happens in two spots -- first
> in select_bad_process and then a further refinement by looking for
> child processes in oom_kill_process. Since this is a two step process,
> it is possible that the process selected by select_bad_process may get a
> SIGKILL just before oom_kill_process executes. If this were to happen,
> __unhash_process deletes this process from the thread_group list. This
> then results in oom_kill_process getting stuck in an infinite loop when
> traversing the thread_group list of the selected process.
> 
> Fix this race by holding the tasklist_lock across the calls to both
> select_bad_process and oom_kill_process.
> 
> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
> Signed-off-by: Sameer Nanda <snanda@chromium.org>

Nack, we had to avoid taking tasklist_lock for this duration since it 
stalls out forks and exits on other cpus trying to take the writeside with 
irqs disabled to avoid watchdog problems.

What kernel version are you patching?  If you check the latest Linus tree, 
we hold a reference to the task_struct of the chosen process before 
calling oom_kill_process() so the hypothesis would seem incorrect.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06  1:18   ` David Rientjes
@ 2013-11-06  1:25     ` Luigi Semenzato
  -1 siblings, 0 replies; 51+ messages in thread
From: Luigi Semenzato @ 2013-11-06  1:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty,
	linux-mm, linux-kernel

It's not enough to hold a reference to the task struct, because it can
still be taken out of the circular list of threads.  The RCU
assumptions don't hold in that case.

On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 5 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots -- first
>> in select_bad_process and then a further refinement by looking for
>> child processes in oom_kill_process. Since this is a two step process,
>> it is possible that the process selected by select_bad_process may get a
>> SIGKILL just before oom_kill_process executes. If this were to happen,
>> __unhash_process deletes this process from the thread_group list. This
>> then results in oom_kill_process getting stuck in an infinite loop when
>> traversing the thread_group list of the selected process.
>>
>> Fix this race by holding the tasklist_lock across the calls to both
>> select_bad_process and oom_kill_process.
>>
>> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>
> Nack, we had to avoid taking tasklist_lock for this duration since it
> stalls out forks and exits on other cpus trying to take the writeside with
> irqs disabled to avoid watchdog problems.
>
> What kernel version are you patching?  If you check the latest Linus tree,
> we hold a reference to the task_struct of the chosen process before
> calling oom_kill_process() so the hypothesis would seem incorrect.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-06  1:25     ` Luigi Semenzato
  0 siblings, 0 replies; 51+ messages in thread
From: Luigi Semenzato @ 2013-11-06  1:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty,
	linux-mm, linux-kernel

It's not enough to hold a reference to the task struct, because it can
still be taken out of the circular list of threads.  The RCU
assumptions don't hold in that case.

On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 5 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots -- first
>> in select_bad_process and then a further refinement by looking for
>> child processes in oom_kill_process. Since this is a two step process,
>> it is possible that the process selected by select_bad_process may get a
>> SIGKILL just before oom_kill_process executes. If this were to happen,
>> __unhash_process deletes this process from the thread_group list. This
>> then results in oom_kill_process getting stuck in an infinite loop when
>> traversing the thread_group list of the selected process.
>>
>> Fix this race by holding the tasklist_lock across the calls to both
>> select_bad_process and oom_kill_process.
>>
>> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>
> Nack, we had to avoid taking tasklist_lock for this duration since it
> stalls out forks and exits on other cpus trying to take the writeside with
> irqs disabled to avoid watchdog problems.
>
> What kernel version are you patching?  If you check the latest Linus tree,
> we hold a reference to the task_struct of the chosen process before
> calling oom_kill_process() so the hypothesis would seem incorrect.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06  1:25     ` Luigi Semenzato
@ 2013-11-06  1:27       ` David Rientjes
  -1 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-06  1:27 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty,
	linux-mm, linux-kernel

On Tue, 5 Nov 2013, Luigi Semenzato wrote:

> It's not enough to hold a reference to the task struct, because it can
> still be taken out of the circular list of threads.  The RCU
> assumptions don't hold in that case.
> 

Could you please post a proper bug report that isolates this at the cause?

Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-06  1:27       ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-06  1:27 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty,
	linux-mm, linux-kernel

On Tue, 5 Nov 2013, Luigi Semenzato wrote:

> It's not enough to hold a reference to the task struct, because it can
> still be taken out of the circular list of threads.  The RCU
> assumptions don't hold in that case.
> 

Could you please post a proper bug report that isolates this at the cause?

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06  1:27       ` David Rientjes
@ 2013-11-06  3:00         ` Vladimir Murzin
  -1 siblings, 0 replies; 51+ messages in thread
From: Vladimir Murzin @ 2013-11-06  3:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, Sameer Nanda, Andrew Morton, mhocko,
	Johannes Weiner, rusty, linux-mm, linux-kernel, oleg, dserrg

Ccing Oleg and Sergey

On Tue, Nov 05, 2013 at 05:27:29PM -0800, David Rientjes wrote:
> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
> 
> > It's not enough to hold a reference to the task struct, because it can
> > still be taken out of the circular list of threads.  The RCU
> > assumptions don't hold in that case.
> > 
> 
> Could you please post a proper bug report that isolates this at the cause?
>
Hi David!

I think it has already been reported[1] and actively discussed. Oleg has
confirmed that while_each_thread() is not safe under rcu_read_lock()[2].

Oleg, any news about your activity for fixing that?

[1] http://www.spinics.net/lists/linux-mm/msg54836.html
[2] http://marc.info/?l=linux-kernel&m=127688978121665
> Thanks.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-06  3:00         ` Vladimir Murzin
  0 siblings, 0 replies; 51+ messages in thread
From: Vladimir Murzin @ 2013-11-06  3:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, Sameer Nanda, Andrew Morton, mhocko,
	Johannes Weiner, rusty, linux-mm, linux-kernel, oleg, dserrg

Ccing Oleg and Sergey

On Tue, Nov 05, 2013 at 05:27:29PM -0800, David Rientjes wrote:
> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
> 
> > It's not enough to hold a reference to the task struct, because it can
> > still be taken out of the circular list of threads.  The RCU
> > assumptions don't hold in that case.
> > 
> 
> Could you please post a proper bug report that isolates this at the cause?
>
Hi David!

I think it has already been reported[1] and actively discussed. Oleg has
confirmed that while_each_thread() is not safe under rcu_read_lock()[2].

Oleg, any news about your activity for fixing that?

[1] http://www.spinics.net/lists/linux-mm/msg54836.html
[2] http://marc.info/?l=linux-kernel&m=127688978121665
> Thanks.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06  1:27       ` David Rientjes
  (?)
  (?)
@ 2013-11-06  3:04       ` Sameer Nanda
  2013-11-06  4:45           ` Luigi Semenzato
  -1 siblings, 1 reply; 51+ messages in thread
From: Sameer Nanda @ 2013-11-06  3:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, Andrew Morton, mhocko, Johannes Weiner, rusty,
	linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 696 bytes --]

On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote:

> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>
> > It's not enough to hold a reference to the task struct, because it can
> > still be taken out of the circular list of threads.  The RCU
> > assumptions don't hold in that case.
> >
>
> Could you please post a proper bug report that isolates this at the cause?
>

We've been running into this issue on Chrome OS. crbug.com/256326 has
additional
details.  The issue manifests itself as a soft lockup.

The kernel we've been seeing this on is 3.8.

We have a pretty consistent repro currently.  Happy to try out other
suggestions
for a fix.


>
> Thanks.
>



-- 
Sameer

[-- Attachment #2: Type: text/html, Size: 1514 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06  3:04       ` Sameer Nanda
@ 2013-11-06  4:45           ` Luigi Semenzato
  0 siblings, 0 replies; 51+ messages in thread
From: Luigi Semenzato @ 2013-11-06  4:45 UTC (permalink / raw)
  To: Sameer Nanda, msb
  Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, linux-mm, linux-kernel

It's interesting that this was known for 3+ years, but nobody bothered
adding a small warning to the code.

We noticed this because it's actually happening on Chromebooks in the
field.  We try to minimize OOM kills, but we can deal with them.  Of
course, a hung kernel we cannot deal with.

On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote:
>
>
>
> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote:
>>
>> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>>
>> > It's not enough to hold a reference to the task struct, because it can
>> > still be taken out of the circular list of threads.  The RCU
>> > assumptions don't hold in that case.
>> >
>>
>> Could you please post a proper bug report that isolates this at the cause?
>
>
> We've been running into this issue on Chrome OS. crbug.com/256326 has
> additional
> details.  The issue manifests itself as a soft lockup.
>
> The kernel we've been seeing this on is 3.8.
>
> We have a pretty consistent repro currently.  Happy to try out other
> suggestions
> for a fix.
>
>>
>>
>> Thanks.
>
>
>
>
> --
> Sameer

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-06  4:45           ` Luigi Semenzato
  0 siblings, 0 replies; 51+ messages in thread
From: Luigi Semenzato @ 2013-11-06  4:45 UTC (permalink / raw)
  To: Sameer Nanda, msb
  Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, linux-mm, linux-kernel

It's interesting that this was known for 3+ years, but nobody bothered
adding a small warning to the code.

We noticed this because it's actually happening on Chromebooks in the
field.  We try to minimize OOM kills, but we can deal with them.  Of
course, a hung kernel we cannot deal with.

On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote:
>
>
>
> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote:
>>
>> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>>
>> > It's not enough to hold a reference to the task struct, because it can
>> > still be taken out of the circular list of threads.  The RCU
>> > assumptions don't hold in that case.
>> >
>>
>> Could you please post a proper bug report that isolates this at the cause?
>
>
> We've been running into this issue on Chrome OS. crbug.com/256326 has
> additional
> details.  The issue manifests itself as a soft lockup.
>
> The kernel we've been seeing this on is 3.8.
>
> We have a pretty consistent repro currently.  Happy to try out other
> suggestions
> for a fix.
>
>>
>>
>> Thanks.
>
>
>
>
> --
> Sameer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06  4:45           ` Luigi Semenzato
@ 2013-11-06  7:17             ` Luigi Semenzato
  -1 siblings, 0 replies; 51+ messages in thread
From: Luigi Semenzato @ 2013-11-06  7:17 UTC (permalink / raw)
  To: Sameer Nanda, msb
  Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, linux-mm, linux-kernel

Regarding other fixes: would it be possible to have the thread
iterator insert a dummy marker element in the thread list before the
scan?  There would be one such dummy element per CPU, so that multiple
CPUs can scan the list in parallel.  The loop would skip such
elements, and each dummy element would be removed at the end of each
scan.

I think this would work, i.e. it would have all the right properties,
but I don't have a sense of whether the performance impact is
acceptable.  Probably not, or it would have been proposed earlier.



On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote:
> It's interesting that this was known for 3+ years, but nobody bothered
> adding a small warning to the code.
>
> We noticed this because it's actually happening on Chromebooks in the
> field.  We try to minimize OOM kills, but we can deal with them.  Of
> course, a hung kernel we cannot deal with.
>
> On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote:
>>
>>
>>
>> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote:
>>>
>>> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>>>
>>> > It's not enough to hold a reference to the task struct, because it can
>>> > still be taken out of the circular list of threads.  The RCU
>>> > assumptions don't hold in that case.
>>> >
>>>
>>> Could you please post a proper bug report that isolates this at the cause?
>>
>>
>> We've been running into this issue on Chrome OS. crbug.com/256326 has
>> additional
>> details.  The issue manifests itself as a soft lockup.
>>
>> The kernel we've been seeing this on is 3.8.
>>
>> We have a pretty consistent repro currently.  Happy to try out other
>> suggestions
>> for a fix.
>>
>>>
>>>
>>> Thanks.
>>
>>
>>
>>
>> --
>> Sameer

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-06  7:17             ` Luigi Semenzato
  0 siblings, 0 replies; 51+ messages in thread
From: Luigi Semenzato @ 2013-11-06  7:17 UTC (permalink / raw)
  To: Sameer Nanda, msb
  Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, linux-mm, linux-kernel

Regarding other fixes: would it be possible to have the thread
iterator insert a dummy marker element in the thread list before the
scan?  There would be one such dummy element per CPU, so that multiple
CPUs can scan the list in parallel.  The loop would skip such
elements, and each dummy element would be removed at the end of each
scan.

I think this would work, i.e. it would have all the right properties,
but I don't have a sense of whether the performance impact is
acceptable.  Probably not, or it would have been proposed earlier.



On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote:
> It's interesting that this was known for 3+ years, but nobody bothered
> adding a small warning to the code.
>
> We noticed this because it's actually happening on Chromebooks in the
> field.  We try to minimize OOM kills, but we can deal with them.  Of
> course, a hung kernel we cannot deal with.
>
> On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote:
>>
>>
>>
>> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote:
>>>
>>> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>>>
>>> > It's not enough to hold a reference to the task struct, because it can
>>> > still be taken out of the circular list of threads.  The RCU
>>> > assumptions don't hold in that case.
>>> >
>>>
>>> Could you please post a proper bug report that isolates this at the cause?
>>
>>
>> We've been running into this issue on Chrome OS. crbug.com/256326 has
>> additional
>> details.  The issue manifests itself as a soft lockup.
>>
>> The kernel we've been seeing this on is 3.8.
>>
>> We have a pretty consistent repro currently.  Happy to try out other
>> suggestions
>> for a fix.
>>
>>>
>>>
>>> Thanks.
>>
>>
>>
>>
>> --
>> Sameer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06  7:17             ` Luigi Semenzato
@ 2013-11-06 16:58               ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-06 16:58 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: msb, David Rientjes, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, linux-mm, linux-kernel

(adding back context from thread history)
On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 5 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots -- first
>> in select_bad_process and then a further refinement by looking for
>> child processes in oom_kill_process. Since this is a two step process,
>> it is possible that the process selected by select_bad_process may get a
>> SIGKILL just before oom_kill_process executes. If this were to happen,
>> __unhash_process deletes this process from the thread_group list. This
>> then results in oom_kill_process getting stuck in an infinite loop when
>> traversing the thread_group list of the selected process.
>>
>> Fix this race by holding the tasklist_lock across the calls to both
>> select_bad_process and oom_kill_process.
>>
>> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>
> Nack, we had to avoid taking tasklist_lock for this duration since it
> stalls out forks and exits on other cpus trying to take the writeside with
> irqs disabled to avoid watchdog problems.
>

David -- I think we can make the duration that the tasklist_lock is
held smaller by consolidating the process selection logic that is
currently split across select_bad_process and oom_kill_process into
one place in select_bad_process.  The tasklist_lock would then need to
be held only when the thread lists are being traversed.  Would you be
ok with that?  I can re-spin the patch if that sounds like a workable
option.

> What kernel version are you patching?  If you check the latest Linus tree,
> we hold a reference to the task_struct of the chosen process before
> calling oom_kill_process() so the hypothesis would seem incorrect.


On Tue, Nov 5, 2013 at 11:17 PM, Luigi Semenzato <semenzato@google.com> wrote:
> Regarding other fixes: would it be possible to have the thread
> iterator insert a dummy marker element in the thread list before the
> scan?  There would be one such dummy element per CPU, so that multiple
> CPUs can scan the list in parallel.  The loop would skip such
> elements, and each dummy element would be removed at the end of each
> scan.
>
> I think this would work, i.e. it would have all the right properties,
> but I don't have a sense of whether the performance impact is
> acceptable.  Probably not, or it would have been proposed earlier.
>
>
>
> On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote:
>> It's interesting that this was known for 3+ years, but nobody bothered
>> adding a small warning to the code.
>>
>> We noticed this because it's actually happening on Chromebooks in the
>> field.  We try to minimize OOM kills, but we can deal with them.  Of
>> course, a hung kernel we cannot deal with.
>>
>> On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote:
>>>
>>>
>>>
>>> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote:
>>>>
>>>> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>>>>
>>>> > It's not enough to hold a reference to the task struct, because it can
>>>> > still be taken out of the circular list of threads.  The RCU
>>>> > assumptions don't hold in that case.
>>>> >
>>>>
>>>> Could you please post a proper bug report that isolates this at the cause?
>>>
>>>
>>> We've been running into this issue on Chrome OS. crbug.com/256326 has
>>> additional
>>> details.  The issue manifests itself as a soft lockup.
>>>
>>> The kernel we've been seeing this on is 3.8.
>>>
>>> We have a pretty consistent repro currently.  Happy to try out other
>>> suggestions
>>> for a fix.
>>>
>>>>
>>>>
>>>> Thanks.
>>>
>>>
>>>
>>>
>>> --
>>> Sameer



-- 
Sameer

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-06 16:58               ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-06 16:58 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: msb, David Rientjes, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, linux-mm, linux-kernel

(adding back context from thread history)
On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 5 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots -- first
>> in select_bad_process and then a further refinement by looking for
>> child processes in oom_kill_process. Since this is a two step process,
>> it is possible that the process selected by select_bad_process may get a
>> SIGKILL just before oom_kill_process executes. If this were to happen,
>> __unhash_process deletes this process from the thread_group list. This
>> then results in oom_kill_process getting stuck in an infinite loop when
>> traversing the thread_group list of the selected process.
>>
>> Fix this race by holding the tasklist_lock across the calls to both
>> select_bad_process and oom_kill_process.
>>
>> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>
> Nack, we had to avoid taking tasklist_lock for this duration since it
> stalls out forks and exits on other cpus trying to take the writeside with
> irqs disabled to avoid watchdog problems.
>

David -- I think we can make the duration that the tasklist_lock is
held smaller by consolidating the process selection logic that is
currently split across select_bad_process and oom_kill_process into
one place in select_bad_process.  The tasklist_lock would then need to
be held only when the thread lists are being traversed.  Would you be
ok with that?  I can re-spin the patch if that sounds like a workable
option.

> What kernel version are you patching?  If you check the latest Linus tree,
> we hold a reference to the task_struct of the chosen process before
> calling oom_kill_process() so the hypothesis would seem incorrect.


On Tue, Nov 5, 2013 at 11:17 PM, Luigi Semenzato <semenzato@google.com> wrote:
> Regarding other fixes: would it be possible to have the thread
> iterator insert a dummy marker element in the thread list before the
> scan?  There would be one such dummy element per CPU, so that multiple
> CPUs can scan the list in parallel.  The loop would skip such
> elements, and each dummy element would be removed at the end of each
> scan.
>
> I think this would work, i.e. it would have all the right properties,
> but I don't have a sense of whether the performance impact is
> acceptable.  Probably not, or it would have been proposed earlier.
>
>
>
> On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote:
>> It's interesting that this was known for 3+ years, but nobody bothered
>> adding a small warning to the code.
>>
>> We noticed this because it's actually happening on Chromebooks in the
>> field.  We try to minimize OOM kills, but we can deal with them.  Of
>> course, a hung kernel we cannot deal with.
>>
>> On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote:
>>>
>>>
>>>
>>> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote:
>>>>
>>>> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>>>>
>>>> > It's not enough to hold a reference to the task struct, because it can
>>>> > still be taken out of the circular list of threads.  The RCU
>>>> > assumptions don't hold in that case.
>>>> >
>>>>
>>>> Could you please post a proper bug report that isolates this at the cause?
>>>
>>>
>>> We've been running into this issue on Chrome OS. crbug.com/256326 has
>>> additional
>>> details.  The issue manifests itself as a soft lockup.
>>>
>>> The kernel we've been seeing this on is 3.8.
>>>
>>> We have a pretty consistent repro currently.  Happy to try out other
>>> suggestions
>>> for a fix.
>>>
>>>>
>>>>
>>>> Thanks.
>>>
>>>
>>>
>>>
>>> --
>>> Sameer



-- 
Sameer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-06 16:58               ` Sameer Nanda
@ 2013-11-07  0:35                 ` David Rientjes
  -1 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-07  0:35 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, oleg, linux-mm, linux-kernel

On Wed, 6 Nov 2013, Sameer Nanda wrote:

> David -- I think we can make the duration that the tasklist_lock is
> held smaller by consolidating the process selection logic that is
> currently split across select_bad_process and oom_kill_process into
> one place in select_bad_process.  The tasklist_lock would then need to
> be held only when the thread lists are being traversed.  Would you be
> ok with that?  I can re-spin the patch if that sounds like a workable
> option.
> 

No, this caused hundreds of machines to hit soft lockups for Google 
because there's no synchronization that prevents dozens of cpus to take 
tasklist_lock in the oom killer during parallel memcg oom conditions and 
never allow the write_lock_irq() on fork() or exit() to make progress.  We 
absolutely must hold tasklist_lock for as little time as possible in the 
oom killer.

That said, I've never actually seen your reported bug manifest in our 
production environment so let's see if Oleg has any ideas.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-07  0:35                 ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-07  0:35 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, oleg, linux-mm, linux-kernel

On Wed, 6 Nov 2013, Sameer Nanda wrote:

> David -- I think we can make the duration that the tasklist_lock is
> held smaller by consolidating the process selection logic that is
> currently split across select_bad_process and oom_kill_process into
> one place in select_bad_process.  The tasklist_lock would then need to
> be held only when the thread lists are being traversed.  Would you be
> ok with that?  I can re-spin the patch if that sounds like a workable
> option.
> 

No, this caused hundreds of machines to hit soft lockups for Google 
because there's no synchronization that prevents dozens of cpus to take 
tasklist_lock in the oom killer during parallel memcg oom conditions and 
never allow the write_lock_irq() on fork() or exit() to make progress.  We 
absolutely must hold tasklist_lock for as little time as possible in the 
oom killer.

That said, I've never actually seen your reported bug manifest in our 
production environment so let's see if Oleg has any ideas.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
  2013-11-07  0:35                 ` David Rientjes
@ 2013-11-07 19:34                   ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-07 19:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, oleg, linux-mm, linux-kernel

On Wed, Nov 6, 2013 at 4:35 PM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 6 Nov 2013, Sameer Nanda wrote:
>
>> David -- I think we can make the duration that the tasklist_lock is
>> held smaller by consolidating the process selection logic that is
>> currently split across select_bad_process and oom_kill_process into
>> one place in select_bad_process.  The tasklist_lock would then need to
>> be held only when the thread lists are being traversed.  Would you be
>> ok with that?  I can re-spin the patch if that sounds like a workable
>> option.
>>
>
> No, this caused hundreds of machines to hit soft lockups for Google
> because there's no synchronization that prevents dozens of cpus to take
> tasklist_lock in the oom killer during parallel memcg oom conditions and
> never allow the write_lock_irq() on fork() or exit() to make progress.  We
> absolutely must hold tasklist_lock for as little time as possible in the
> oom killer.
>
> That said, I've never actually seen your reported bug manifest in our
> production environment so let's see if Oleg has any ideas.

Is the path you are referring to mem_cgroup_out_of_memory calling
oom_kill_process?  If so, then that path doesn't appear to suffer from
the two step select_bad_process, oom_kill_process race since
mem_cgroup_out_of_memory directly calls oom_kill_process without going
through select_bad_process.  This also means that the patch I sent is
incorrect since it removes the existing tasklist_lock protection in
oom_kill_process.

Respinning patch to take care of this case.

-- 
Sameer

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: Fix race when selecting process to kill
@ 2013-11-07 19:34                   ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-07 19:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner,
	Rusty Russell, oleg, linux-mm, linux-kernel

On Wed, Nov 6, 2013 at 4:35 PM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 6 Nov 2013, Sameer Nanda wrote:
>
>> David -- I think we can make the duration that the tasklist_lock is
>> held smaller by consolidating the process selection logic that is
>> currently split across select_bad_process and oom_kill_process into
>> one place in select_bad_process.  The tasklist_lock would then need to
>> be held only when the thread lists are being traversed.  Would you be
>> ok with that?  I can re-spin the patch if that sounds like a workable
>> option.
>>
>
> No, this caused hundreds of machines to hit soft lockups for Google
> because there's no synchronization that prevents dozens of cpus to take
> tasklist_lock in the oom killer during parallel memcg oom conditions and
> never allow the write_lock_irq() on fork() or exit() to make progress.  We
> absolutely must hold tasklist_lock for as little time as possible in the
> oom killer.
>
> That said, I've never actually seen your reported bug manifest in our
> production environment so let's see if Oleg has any ideas.

Is the path you are referring to mem_cgroup_out_of_memory calling
oom_kill_process?  If so, then that path doesn't appear to suffer from
the two step select_bad_process, oom_kill_process race since
mem_cgroup_out_of_memory directly calls oom_kill_process without going
through select_bad_process.  This also means that the patch I sent is
incorrect since it removes the existing tasklist_lock protection in
oom_kill_process.

Respinning patch to take care of this case.

-- 
Sameer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2] mm, oom: Fix race when selecting process to kill
  2013-11-07  0:35                 ` David Rientjes
@ 2013-11-08 18:07                   ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-08 18:07 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Change-Id: I865c64486ccfc0e4818e7045a8fa3353e2fb63f8
Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..af99b1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
+	read_lock(&tasklist_lock);
+
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
@@ -436,7 +439,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
-- 
1.8.4.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2] mm, oom: Fix race when selecting process to kill
@ 2013-11-08 18:07                   ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-08 18:07 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Change-Id: I865c64486ccfc0e4818e7045a8fa3353e2fb63f8
Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..af99b1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
+	read_lock(&tasklist_lock);
+
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
@@ -436,7 +439,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
-- 
1.8.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2] mm, oom: Fix race when selecting process to kill
  2013-11-08 18:07                   ` Sameer Nanda
@ 2013-11-08 18:45                     ` Oleg Nesterov
  -1 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-08 18:45 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

Sorry.

I didn't have time to answer other emails, will try to do later.

And yes, yes, while_each_thread() should be fixed, still on my
TODO list... But just in case, whatever we do with while_each_thread()
we should also fix some users.

Until then,

On 11/08, Sameer Nanda wrote:
>
> @@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
>
> +	read_lock(&tasklist_lock);
> +
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING || !pid_alive(p)) {

OK.

> -	read_lock(&tasklist_lock);

But you should also move read_unlock_down(), at least after
find_lock_task_mm().

And of course, this doesn't fix other users in oom_kill.c.

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2] mm, oom: Fix race when selecting process to kill
@ 2013-11-08 18:45                     ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-08 18:45 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

Sorry.

I didn't have time to answer other emails, will try to do later.

And yes, yes, while_each_thread() should be fixed, still on my
TODO list... But just in case, whatever we do with while_each_thread()
we should also fix some users.

Until then,

On 11/08, Sameer Nanda wrote:
>
> @@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
>
> +	read_lock(&tasklist_lock);
> +
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING || !pid_alive(p)) {

OK.

> -	read_lock(&tasklist_lock);

But you should also move read_unlock_down(), at least after
find_lock_task_mm().

And of course, this doesn't fix other users in oom_kill.c.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v3] mm, oom: Fix race when selecting process to kill
  2013-11-08 18:45                     ` Oleg Nesterov
@ 2013-11-08 19:49                       ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-08 19:49 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..7b28d9f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 					      DEFAULT_RATELIMIT_BURST);
 
 	/*
+	 * while_each_thread is currently not RCU safe. Lets hold the
+	 * tasklist_lock across all invocations of while_each_thread (including
+	 * the one in find_lock_task_mm) in this function.
+	 */
+	read_lock(&tasklist_lock);
+
+	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
@@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
 	rcu_read_lock();
+
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
 		rcu_read_unlock();
 		put_task_struct(victim);
-- 
1.8.4.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3] mm, oom: Fix race when selecting process to kill
@ 2013-11-08 19:49                       ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-08 19:49 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..7b28d9f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 					      DEFAULT_RATELIMIT_BURST);
 
 	/*
+	 * while_each_thread is currently not RCU safe. Lets hold the
+	 * tasklist_lock across all invocations of while_each_thread (including
+	 * the one in find_lock_task_mm) in this function.
+	 */
+	read_lock(&tasklist_lock);
+
+	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
@@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
 	rcu_read_lock();
+
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
 		rcu_read_unlock();
 		put_task_struct(victim);
-- 
1.8.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3] mm, oom: Fix race when selecting process to kill
  2013-11-08 19:49                       ` Sameer Nanda
@ 2013-11-09 15:16                         ` Oleg Nesterov
  -1 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-09 15:16 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

On 11/08, Sameer Nanda wrote:
>
> @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  					      DEFAULT_RATELIMIT_BURST);
> @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			}
>  		}
>  	} while_each_thread(p, t);
> -	read_unlock(&tasklist_lock);
>  
>  	rcu_read_lock();
> +
>  	p = find_lock_task_mm(victim);
> +
> +	/*
> +	 * Since while_each_thread is currently not RCU safe, this unlock of
> +	 * tasklist_lock may need to be moved further down if any additional
> +	 * while_each_thread loops get added to this function.
> +	 */
> +	read_unlock(&tasklist_lock);

Well, ack... but with this change find_lock_task_mm() relies on tasklist,
so it makes sense to move rcu_read_lock() down before for_each_process().
Otherwise this looks confusing, but I won't insist.

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3] mm, oom: Fix race when selecting process to kill
@ 2013-11-09 15:16                         ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-09 15:16 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

On 11/08, Sameer Nanda wrote:
>
> @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  					      DEFAULT_RATELIMIT_BURST);
> @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			}
>  		}
>  	} while_each_thread(p, t);
> -	read_unlock(&tasklist_lock);
>  
>  	rcu_read_lock();
> +
>  	p = find_lock_task_mm(victim);
> +
> +	/*
> +	 * Since while_each_thread is currently not RCU safe, this unlock of
> +	 * tasklist_lock may need to be moved further down if any additional
> +	 * while_each_thread loops get added to this function.
> +	 */
> +	read_unlock(&tasklist_lock);

Well, ack... but with this change find_lock_task_mm() relies on tasklist,
so it makes sense to move rcu_read_lock() down before for_each_process().
Otherwise this looks confusing, but I won't insist.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3] mm, oom: Fix race when selecting process to kill
  2013-11-09 15:16                         ` Oleg Nesterov
  (?)
@ 2013-11-11 23:15                         ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-11 23:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, mhocko, David Rientjes, Johannes Weiner,
	Rusty Russell, Luigi Semenzato, murzin.v, dserrg, msb, linux-mm,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1189 bytes --]

On Sat, Nov 9, 2013 at 7:16 AM, Oleg Nesterov <oleg@redhat.com> wrote:

> On 11/08, Sameer Nanda wrote:
> >
> > @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t
> gfp_mask, int order,
> >                                             DEFAULT_RATELIMIT_BURST);
> > @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t
> gfp_mask, int order,
> >                       }
> >               }
> >       } while_each_thread(p, t);
> > -     read_unlock(&tasklist_lock);
> >
> >       rcu_read_lock();
> > +
> >       p = find_lock_task_mm(victim);
> > +
> > +     /*
> > +      * Since while_each_thread is currently not RCU safe, this unlock
> of
> > +      * tasklist_lock may need to be moved further down if any
> additional
> > +      * while_each_thread loops get added to this function.
> > +      */
> > +     read_unlock(&tasklist_lock);
>
> Well, ack... but with this change find_lock_task_mm() relies on tasklist,
> so it makes sense to move rcu_read_lock() down before for_each_process().
> Otherwise this looks confusing, but I won't insist.
>

Agreed that this looks a bit confusing.  I will respin the patch.


>
> Oleg.
>
>


-- 
Sameer

[-- Attachment #2: Type: text/html, Size: 2027 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v4] mm, oom: Fix race when selecting process to kill
  2013-11-09 15:16                         ` Oleg Nesterov
@ 2013-11-12  0:21                           ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-12  0:21 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..57638ef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 					      DEFAULT_RATELIMIT_BURST);
 
 	/*
+	 * while_each_thread is currently not RCU safe. Lets hold the
+	 * tasklist_lock across all invocations of while_each_thread (including
+	 * the one in find_lock_task_mm) in this function.
+	 */
+	read_lock(&tasklist_lock);
+
+	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
@@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,12 +463,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
-	rcu_read_lock();
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
-		rcu_read_unlock();
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
@@ -478,6 +490,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);
 
+	rcu_read_lock();
+
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
 	 * any.  They don't get access to memory reserves, though, to avoid
-- 
1.8.4.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4] mm, oom: Fix race when selecting process to kill
@ 2013-11-12  0:21                           ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-12  0:21 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..57638ef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 					      DEFAULT_RATELIMIT_BURST);
 
 	/*
+	 * while_each_thread is currently not RCU safe. Lets hold the
+	 * tasklist_lock across all invocations of while_each_thread (including
+	 * the one in find_lock_task_mm) in this function.
+	 */
+	read_lock(&tasklist_lock);
+
+	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
@@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,12 +463,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
-	rcu_read_lock();
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
-		rcu_read_unlock();
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
@@ -478,6 +490,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);
 
+	rcu_read_lock();
+
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
 	 * any.  They don't get access to memory reserves, though, to avoid
-- 
1.8.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill
  2013-11-12  0:21                           ` Sameer Nanda
@ 2013-11-12 15:13                             ` Michal Hocko
  -1 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2013-11-12 15:13 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg,
	msb, linux-mm, linux-kernel

On Mon 11-11-13 16:21:57, Sameer Nanda wrote:
> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
> 
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.
> 
> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
>  mm/oom_kill.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6738c47..57638ef 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  					      DEFAULT_RATELIMIT_BURST);
>  
>  	/*
> +	 * while_each_thread is currently not RCU safe. Lets hold the
> +	 * tasklist_lock across all invocations of while_each_thread (including
> +	 * the one in find_lock_task_mm) in this function.
> +	 */
> +	read_lock(&tasklist_lock);
> +
> +	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING || !pid_alive(p)) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		put_task_struct(p);
> +		read_unlock(&tasklist_lock);
>  		return;
>  	}

show_mem used to be one of a bottleneck but now that we have Mel's "mm:
do not walk all of system memory during show_mem" it shouldn't be a big
deal anymore.
The real trouble is with dump_tasks which might be zillions of tasks and
we do not want to hold tasklist_lock for that long.

So no this would regress on the huge machines and yes we have seen
reports like that and explicit requests to backport 6b0c81b3be114 (mm,
oom: reduce dependency on tasklist_lock) so this would be a step
backwards although I see there is a real problem that it tries to fix.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill
@ 2013-11-12 15:13                             ` Michal Hocko
  0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2013-11-12 15:13 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg,
	msb, linux-mm, linux-kernel

On Mon 11-11-13 16:21:57, Sameer Nanda wrote:
> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
> 
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.
> 
> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
>  mm/oom_kill.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6738c47..57638ef 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  					      DEFAULT_RATELIMIT_BURST);
>  
>  	/*
> +	 * while_each_thread is currently not RCU safe. Lets hold the
> +	 * tasklist_lock across all invocations of while_each_thread (including
> +	 * the one in find_lock_task_mm) in this function.
> +	 */
> +	read_lock(&tasklist_lock);
> +
> +	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING || !pid_alive(p)) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		put_task_struct(p);
> +		read_unlock(&tasklist_lock);
>  		return;
>  	}

show_mem used to be one of a bottleneck but now that we have Mel's "mm:
do not walk all of system memory during show_mem" it shouldn't be a big
deal anymore.
The real trouble is with dump_tasks which might be zillions of tasks and
we do not want to hold tasklist_lock for that long.

So no this would regress on the huge machines and yes we have seen
reports like that and explicit requests to backport 6b0c81b3be114 (mm,
oom: reduce dependency on tasklist_lock) so this would be a step
backwards although I see there is a real problem that it tries to fix.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill
  2013-11-12  0:21                           ` Sameer Nanda
@ 2013-11-12 20:01                             ` Oleg Nesterov
  -1 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-12 20:01 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

On 11/11, Sameer Nanda wrote:
>
> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
>
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.

OK, looks correct to me. Thanks.


Yes, this is a step backwards, hopefully we will revert this patch soon.
I am starting to think something like while_each_thread_lame_but_safe()
makes sense before we really fix this nasty (and afaics not simple)
problem with with while_each_thread() (which should die).

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill
@ 2013-11-12 20:01                             ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-12 20:01 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

On 11/11, Sameer Nanda wrote:
>
> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
>
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.

OK, looks correct to me. Thanks.


Yes, this is a step backwards, hopefully we will revert this patch soon.
I am starting to think something like while_each_thread_lame_but_safe()
makes sense before we really fix this nasty (and afaics not simple)
problem with with while_each_thread() (which should die).

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill
  2013-11-12 20:01                             ` Oleg Nesterov
@ 2013-11-12 20:08                               ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-12 20:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, mhocko, David Rientjes, Johannes Weiner,
	Rusty Russell, Luigi Semenzato, murzin.v, dserrg, msb, linux-mm,
	linux-kernel

On Tue, Nov 12, 2013 at 12:01 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 11/11, Sameer Nanda wrote:
>>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>
> OK, looks correct to me. Thanks.
>
>
> Yes, this is a step backwards, hopefully we will revert this patch soon.
> I am starting to think something like while_each_thread_lame_but_safe()
> makes sense before we really fix this nasty (and afaics not simple)
> problem with with while_each_thread() (which should die).

Looking forward to a real fix for the nasty problems with
while_each_thread.  In the meanwhile, let me float one more
(hopefully, the last) version of this patch that should address
Michal's concern.  Thanks for your feedback!

>
> Oleg.
>



-- 
Sameer

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill
@ 2013-11-12 20:08                               ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-12 20:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, mhocko, David Rientjes, Johannes Weiner,
	Rusty Russell, Luigi Semenzato, murzin.v, dserrg, msb, linux-mm,
	linux-kernel

On Tue, Nov 12, 2013 at 12:01 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 11/11, Sameer Nanda wrote:
>>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>
> OK, looks correct to me. Thanks.
>
>
> Yes, this is a step backwards, hopefully we will revert this patch soon.
> I am starting to think something like while_each_thread_lame_but_safe()
> makes sense before we really fix this nasty (and afaics not simple)
> problem with with while_each_thread() (which should die).

Looking forward to a real fix for the nasty problems with
while_each_thread.  In the meanwhile, let me float one more
(hopefully, the last) version of this patch that should address
Michal's concern.  Thanks for your feedback!

>
> Oleg.
>



-- 
Sameer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v5] mm, oom: Fix race when selecting process to kill
  2013-11-12 20:08                               ` Sameer Nanda
@ 2013-11-12 20:23                                 ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-12 20:23 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31
Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 42 +++++++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..5108c2b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
+	if (__ratelimit(&oom_rs))
+		dump_header(p, gfp_mask, order, memcg, nodemask);
+
+	task_lock(p);
+	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+	task_unlock(p);
+
+	/*
+	 * while_each_thread is currently not RCU safe. Lets hold the
+	 * tasklist_lock across all invocations of while_each_thread (including
+	 * the one in find_lock_task_mm) in this function.
+	 */
+	read_lock(&tasklist_lock);
+
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
+		pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n",
+			message, task_pid_nr(p));
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
-	if (__ratelimit(&oom_rs))
-		dump_header(p, gfp_mask, order, memcg, nodemask);
-
-	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
-
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
 	 * the one with the highest oom_badness() score is sacrificed for its
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
-	rcu_read_lock();
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
-		rcu_read_unlock();
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
@@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);
 
+	rcu_read_lock();
+
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
 	 * any.  They don't get access to memory reserves, though, to avoid
-- 
1.8.4.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5] mm, oom: Fix race when selecting process to kill
@ 2013-11-12 20:23                                 ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-12 20:23 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31
Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 mm/oom_kill.c | 42 +++++++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..5108c2b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
+	if (__ratelimit(&oom_rs))
+		dump_header(p, gfp_mask, order, memcg, nodemask);
+
+	task_lock(p);
+	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+	task_unlock(p);
+
+	/*
+	 * while_each_thread is currently not RCU safe. Lets hold the
+	 * tasklist_lock across all invocations of while_each_thread (including
+	 * the one in find_lock_task_mm) in this function.
+	 */
+	read_lock(&tasklist_lock);
+
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
+		pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n",
+			message, task_pid_nr(p));
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
-	if (__ratelimit(&oom_rs))
-		dump_header(p, gfp_mask, order, memcg, nodemask);
-
-	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
-
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
 	 * the one with the highest oom_badness() score is sacrificed for its
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
-	rcu_read_lock();
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
-		rcu_read_unlock();
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
@@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
 	task_unlock(victim);
 
+	rcu_read_lock();
+
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
 	 * any.  They don't get access to memory reserves, though, to avoid
-- 
1.8.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill
  2013-11-12 20:23                                 ` Sameer Nanda
@ 2013-11-13  2:33                                   ` David Rientjes
  -1 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-13  2:33 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, hannes, rusty, semenzato, murzin.v, oleg, dserrg,
	msb, linux-mm, linux-kernel

On Tue, 12 Nov 2013, Sameer Nanda wrote:

> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
> 
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.
> 
> Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31

Is this needed?

> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
>  mm/oom_kill.c | 42 +++++++++++++++++++++++++++++-------------
>  1 file changed, 29 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6738c47..5108c2b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
>  
> +	if (__ratelimit(&oom_rs))
> +		dump_header(p, gfp_mask, order, memcg, nodemask);
> +
> +	task_lock(p);
> +	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
> +		message, task_pid_nr(p), p->comm, points);
> +	task_unlock(p);
> +
> +	/*
> +	 * while_each_thread is currently not RCU safe. Lets hold the
> +	 * tasklist_lock across all invocations of while_each_thread (including
> +	 * the one in find_lock_task_mm) in this function.
> +	 */
> +	read_lock(&tasklist_lock);
> +
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING || !pid_alive(p)) {
> +		pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n",
> +			message, task_pid_nr(p));

That makes no sense in the kernel log to have

	Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child
	Out of Memory: Not killing process 1234, just setting TIF_MEMDIE

Those are contradictory statements (and will actually mess with kernel log 
parsing at Google) and nobody other than kernel developers are going to 
know what TIF_MEMDIE is.

>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		put_task_struct(p);
> +		read_unlock(&tasklist_lock);
>  		return;
>  	}
>  
> -	if (__ratelimit(&oom_rs))
> -		dump_header(p, gfp_mask, order, memcg, nodemask);
> -
> -	task_lock(p);
> -	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
> -		message, task_pid_nr(p), p->comm, points);
> -	task_unlock(p);
> -
>  	/*
>  	 * If any of p's children has a different mm and is eligible for kill,
>  	 * the one with the highest oom_badness() score is sacrificed for its
>  	 * parent.  This attempts to lose the minimal amount of work done while
>  	 * still freeing memory.
>  	 */
> -	read_lock(&tasklist_lock);
>  	do {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
> @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			}
>  		}
>  	} while_each_thread(p, t);
> -	read_unlock(&tasklist_lock);
>  
> -	rcu_read_lock();
>  	p = find_lock_task_mm(victim);
> +
> +	/*
> +	 * Since while_each_thread is currently not RCU safe, this unlock of
> +	 * tasklist_lock may need to be moved further down if any additional
> +	 * while_each_thread loops get added to this function.
> +	 */

This comment should be moved to sched.h to indicate how 
while_each_thread() needs to be handled with respect to tasklist_lock, 
it's not specific to the oom killer.

> +	read_unlock(&tasklist_lock);
> +
>  	if (!p) {
> -		rcu_read_unlock();
>  		put_task_struct(victim);
>  		return;
>  	} else if (victim != p) {
> @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>  	task_unlock(victim);
>  
> +	rcu_read_lock();
> +
>  	/*
>  	 * Kill all user processes sharing victim->mm in other thread groups, if
>  	 * any.  They don't get access to memory reserves, though, to avoid

Please move this rcu_read_lock() to be immediatley before the 
for_each_process() instead of before the comment.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill
@ 2013-11-13  2:33                                   ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2013-11-13  2:33 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, hannes, rusty, semenzato, murzin.v, oleg, dserrg,
	msb, linux-mm, linux-kernel

On Tue, 12 Nov 2013, Sameer Nanda wrote:

> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
> 
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.
> 
> Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31

Is this needed?

> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
>  mm/oom_kill.c | 42 +++++++++++++++++++++++++++++-------------
>  1 file changed, 29 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6738c47..5108c2b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
>  
> +	if (__ratelimit(&oom_rs))
> +		dump_header(p, gfp_mask, order, memcg, nodemask);
> +
> +	task_lock(p);
> +	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
> +		message, task_pid_nr(p), p->comm, points);
> +	task_unlock(p);
> +
> +	/*
> +	 * while_each_thread is currently not RCU safe. Lets hold the
> +	 * tasklist_lock across all invocations of while_each_thread (including
> +	 * the one in find_lock_task_mm) in this function.
> +	 */
> +	read_lock(&tasklist_lock);
> +
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING || !pid_alive(p)) {
> +		pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n",
> +			message, task_pid_nr(p));

That makes no sense in the kernel log to have

	Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child
	Out of Memory: Not killing process 1234, just setting TIF_MEMDIE

Those are contradictory statements (and will actually mess with kernel log 
parsing at Google) and nobody other than kernel developers are going to 
know what TIF_MEMDIE is.

>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		put_task_struct(p);
> +		read_unlock(&tasklist_lock);
>  		return;
>  	}
>  
> -	if (__ratelimit(&oom_rs))
> -		dump_header(p, gfp_mask, order, memcg, nodemask);
> -
> -	task_lock(p);
> -	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
> -		message, task_pid_nr(p), p->comm, points);
> -	task_unlock(p);
> -
>  	/*
>  	 * If any of p's children has a different mm and is eligible for kill,
>  	 * the one with the highest oom_badness() score is sacrificed for its
>  	 * parent.  This attempts to lose the minimal amount of work done while
>  	 * still freeing memory.
>  	 */
> -	read_lock(&tasklist_lock);
>  	do {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
> @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			}
>  		}
>  	} while_each_thread(p, t);
> -	read_unlock(&tasklist_lock);
>  
> -	rcu_read_lock();
>  	p = find_lock_task_mm(victim);
> +
> +	/*
> +	 * Since while_each_thread is currently not RCU safe, this unlock of
> +	 * tasklist_lock may need to be moved further down if any additional
> +	 * while_each_thread loops get added to this function.
> +	 */

This comment should be moved to sched.h to indicate how 
while_each_thread() needs to be handled with respect to tasklist_lock, 
it's not specific to the oom killer.

> +	read_unlock(&tasklist_lock);
> +
>  	if (!p) {
> -		rcu_read_unlock();
>  		put_task_struct(victim);
>  		return;
>  	} else if (victim != p) {
> @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>  	task_unlock(victim);
>  
> +	rcu_read_lock();
> +
>  	/*
>  	 * Kill all user processes sharing victim->mm in other thread groups, if
>  	 * any.  They don't get access to memory reserves, though, to avoid

Please move this rcu_read_lock() to be immediatley before the 
for_each_process() instead of before the comment.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill
  2013-11-13  2:33                                   ` David Rientjes
@ 2013-11-13 16:46                                     ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-13 16:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, mhocko, Johannes Weiner, Rusty Russell,
	Luigi Semenzato, Vladimir Murzin, Oleg Nesterov, Sergey Dyasly,
	msb, linux-mm, linux-kernel

On Tue, Nov 12, 2013 at 6:33 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 12 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>>
>> Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31
>
> Is this needed?

No, it's not.  It's a side effect of using Chrome OS tools to manage
the patches.  I will make sure to remove it on the next version of the
patch.

>
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>> ---
>>  mm/oom_kill.c | 42 +++++++++++++++++++++++++++++-------------
>>  1 file changed, 29 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6738c47..5108c2b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>>       static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>>                                             DEFAULT_RATELIMIT_BURST);
>>
>> +     if (__ratelimit(&oom_rs))
>> +             dump_header(p, gfp_mask, order, memcg, nodemask);
>> +
>> +     task_lock(p);
>> +     pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> +             message, task_pid_nr(p), p->comm, points);
>> +     task_unlock(p);
>> +
>> +     /*
>> +      * while_each_thread is currently not RCU safe. Lets hold the
>> +      * tasklist_lock across all invocations of while_each_thread (including
>> +      * the one in find_lock_task_mm) in this function.
>> +      */
>> +     read_lock(&tasklist_lock);
>> +
>>       /*
>>        * If the task is already exiting, don't alarm the sysadmin or kill
>>        * its children or threads, just set TIF_MEMDIE so it can die quickly
>>        */
>> -     if (p->flags & PF_EXITING) {
>> +     if (p->flags & PF_EXITING || !pid_alive(p)) {
>> +             pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n",
>> +                     message, task_pid_nr(p));
>
> That makes no sense in the kernel log to have
>
>         Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child
>         Out of Memory: Not killing process 1234, just setting TIF_MEMDIE
>
> Those are contradictory statements (and will actually mess with kernel log
> parsing at Google) and nobody other than kernel developers are going to
> know what TIF_MEMDIE is.

Since the "Kill process" printk has now moved above the (p->flags &
PF_EXITING || !pid_alive(p)) check, it is possible that
oom_kill_process will emit the "Kill process" message but will not
actually try to kill the process or its child.  The new "Not killing
process" printk helps disambiguate this case from when a process is
actually killed by oom_kill_process.  However, since you are finding
it confusing, let me remove it.

>
>>               set_tsk_thread_flag(p, TIF_MEMDIE);
>>               put_task_struct(p);
>> +             read_unlock(&tasklist_lock);
>>               return;
>>       }
>>
>> -     if (__ratelimit(&oom_rs))
>> -             dump_header(p, gfp_mask, order, memcg, nodemask);
>> -
>> -     task_lock(p);
>> -     pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> -             message, task_pid_nr(p), p->comm, points);
>> -     task_unlock(p);
>> -
>>       /*
>>        * If any of p's children has a different mm and is eligible for kill,
>>        * the one with the highest oom_badness() score is sacrificed for its
>>        * parent.  This attempts to lose the minimal amount of work done while
>>        * still freeing memory.
>>        */
>> -     read_lock(&tasklist_lock);
>>       do {
>>               list_for_each_entry(child, &t->children, sibling) {
>>                       unsigned int child_points;
>> @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>>                       }
>>               }
>>       } while_each_thread(p, t);
>> -     read_unlock(&tasklist_lock);
>>
>> -     rcu_read_lock();
>>       p = find_lock_task_mm(victim);
>> +
>> +     /*
>> +      * Since while_each_thread is currently not RCU safe, this unlock of
>> +      * tasklist_lock may need to be moved further down if any additional
>> +      * while_each_thread loops get added to this function.
>> +      */
>
> This comment should be moved to sched.h to indicate how
> while_each_thread() needs to be handled with respect to tasklist_lock,
> it's not specific to the oom killer.

OK.
>
>> +     read_unlock(&tasklist_lock);
>> +
>>       if (!p) {
>> -             rcu_read_unlock();
>>               put_task_struct(victim);
>>               return;
>>       } else if (victim != p) {
>> @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>>               K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>>       task_unlock(victim);
>>
>> +     rcu_read_lock();
>> +
>>       /*
>>        * Kill all user processes sharing victim->mm in other thread groups, if
>>        * any.  They don't get access to memory reserves, though, to avoid
>
> Please move this rcu_read_lock() to be immediatley before the
> for_each_process() instead of before the comment.

OK.



-- 
Sameer

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill
@ 2013-11-13 16:46                                     ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-13 16:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, mhocko, Johannes Weiner, Rusty Russell,
	Luigi Semenzato, Vladimir Murzin, Oleg Nesterov, Sergey Dyasly,
	msb, linux-mm, linux-kernel

On Tue, Nov 12, 2013 at 6:33 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 12 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>>
>> Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31
>
> Is this needed?

No, it's not.  It's a side effect of using Chrome OS tools to manage
the patches.  I will make sure to remove it on the next version of the
patch.

>
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>> ---
>>  mm/oom_kill.c | 42 +++++++++++++++++++++++++++++-------------
>>  1 file changed, 29 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6738c47..5108c2b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>>       static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>>                                             DEFAULT_RATELIMIT_BURST);
>>
>> +     if (__ratelimit(&oom_rs))
>> +             dump_header(p, gfp_mask, order, memcg, nodemask);
>> +
>> +     task_lock(p);
>> +     pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> +             message, task_pid_nr(p), p->comm, points);
>> +     task_unlock(p);
>> +
>> +     /*
>> +      * while_each_thread is currently not RCU safe. Lets hold the
>> +      * tasklist_lock across all invocations of while_each_thread (including
>> +      * the one in find_lock_task_mm) in this function.
>> +      */
>> +     read_lock(&tasklist_lock);
>> +
>>       /*
>>        * If the task is already exiting, don't alarm the sysadmin or kill
>>        * its children or threads, just set TIF_MEMDIE so it can die quickly
>>        */
>> -     if (p->flags & PF_EXITING) {
>> +     if (p->flags & PF_EXITING || !pid_alive(p)) {
>> +             pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n",
>> +                     message, task_pid_nr(p));
>
> That makes no sense in the kernel log to have
>
>         Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child
>         Out of Memory: Not killing process 1234, just setting TIF_MEMDIE
>
> Those are contradictory statements (and will actually mess with kernel log
> parsing at Google) and nobody other than kernel developers are going to
> know what TIF_MEMDIE is.

Since the "Kill process" printk has now moved above the (p->flags &
PF_EXITING || !pid_alive(p)) check, it is possible that
oom_kill_process will emit the "Kill process" message but will not
actually try to kill the process or its child.  The new "Not killing
process" printk helps disambiguate this case from when a process is
actually killed by oom_kill_process.  However, since you are finding
it confusing, let me remove it.

>
>>               set_tsk_thread_flag(p, TIF_MEMDIE);
>>               put_task_struct(p);
>> +             read_unlock(&tasklist_lock);
>>               return;
>>       }
>>
>> -     if (__ratelimit(&oom_rs))
>> -             dump_header(p, gfp_mask, order, memcg, nodemask);
>> -
>> -     task_lock(p);
>> -     pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> -             message, task_pid_nr(p), p->comm, points);
>> -     task_unlock(p);
>> -
>>       /*
>>        * If any of p's children has a different mm and is eligible for kill,
>>        * the one with the highest oom_badness() score is sacrificed for its
>>        * parent.  This attempts to lose the minimal amount of work done while
>>        * still freeing memory.
>>        */
>> -     read_lock(&tasklist_lock);
>>       do {
>>               list_for_each_entry(child, &t->children, sibling) {
>>                       unsigned int child_points;
>> @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>>                       }
>>               }
>>       } while_each_thread(p, t);
>> -     read_unlock(&tasklist_lock);
>>
>> -     rcu_read_lock();
>>       p = find_lock_task_mm(victim);
>> +
>> +     /*
>> +      * Since while_each_thread is currently not RCU safe, this unlock of
>> +      * tasklist_lock may need to be moved further down if any additional
>> +      * while_each_thread loops get added to this function.
>> +      */
>
> This comment should be moved to sched.h to indicate how
> while_each_thread() needs to be handled with respect to tasklist_lock,
> it's not specific to the oom killer.

OK.
>
>> +     read_unlock(&tasklist_lock);
>> +
>>       if (!p) {
>> -             rcu_read_unlock();
>>               put_task_struct(victim);
>>               return;
>>       } else if (victim != p) {
>> @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>>               K(get_mm_counter(victim->mm, MM_FILEPAGES)));
>>       task_unlock(victim);
>>
>> +     rcu_read_lock();
>> +
>>       /*
>>        * Kill all user processes sharing victim->mm in other thread groups, if
>>        * any.  They don't get access to memory reserves, though, to avoid
>
> Please move this rcu_read_lock() to be immediatley before the
> for_each_process() instead of before the comment.

OK.



-- 
Sameer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v6] mm, oom: Fix race when selecting process to kill
  2013-11-13 16:46                                     ` Sameer Nanda
@ 2013-11-13 17:18                                       ` Sameer Nanda
  -1 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-13 17:18 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 include/linux/sched.h |  5 +++++
 mm/oom_kill.c         | 34 +++++++++++++++++++++-------------
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e27baee..8975dbb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void);
 #define do_each_thread(g, t) \
 	for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
 
+/*
+ * Careful: while_each_thread is not RCU safe. Callers should hold
+ * read_lock(tasklist_lock) across while_each_thread loops.
+ */
+
 #define while_each_thread(g, t) \
 	while ((t = next_thread(t)) != g)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..0d1f804 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
+	if (__ratelimit(&oom_rs))
+		dump_header(p, gfp_mask, order, memcg, nodemask);
+
+	task_lock(p);
+	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+	task_unlock(p);
+
+	read_lock(&tasklist_lock);
+
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
-	if (__ratelimit(&oom_rs))
-		dump_header(p, gfp_mask, order, memcg, nodemask);
-
-	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
-
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
 	 * the one with the highest oom_badness() score is sacrificed for its
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
-	rcu_read_lock();
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
-		rcu_read_unlock();
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
@@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * That thread will now get access to memory reserves since it has a
 	 * pending fatal signal.
 	 */
+	rcu_read_lock();
 	for_each_process(p)
 		if (p->mm == mm && !same_thread_group(p, victim) &&
 		    !(p->flags & PF_KTHREAD)) {
-- 
1.8.4.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v6] mm, oom: Fix race when selecting process to kill
@ 2013-11-13 17:18                                       ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-13 17:18 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg,
	dserrg, msb
  Cc: linux-mm, linux-kernel, Sameer Nanda

The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
 include/linux/sched.h |  5 +++++
 mm/oom_kill.c         | 34 +++++++++++++++++++++-------------
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e27baee..8975dbb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void);
 #define do_each_thread(g, t) \
 	for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
 
+/*
+ * Careful: while_each_thread is not RCU safe. Callers should hold
+ * read_lock(tasklist_lock) across while_each_thread loops.
+ */
+
 #define while_each_thread(g, t) \
 	while ((t = next_thread(t)) != g)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..0d1f804 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
+	if (__ratelimit(&oom_rs))
+		dump_header(p, gfp_mask, order, memcg, nodemask);
+
+	task_lock(p);
+	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+	task_unlock(p);
+
+	read_lock(&tasklist_lock);
+
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING || !pid_alive(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
+		read_unlock(&tasklist_lock);
 		return;
 	}
 
-	if (__ratelimit(&oom_rs))
-		dump_header(p, gfp_mask, order, memcg, nodemask);
-
-	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
-
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
 	 * the one with the highest oom_badness() score is sacrificed for its
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
-	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			}
 		}
 	} while_each_thread(p, t);
-	read_unlock(&tasklist_lock);
 
-	rcu_read_lock();
 	p = find_lock_task_mm(victim);
+
+	/*
+	 * Since while_each_thread is currently not RCU safe, this unlock of
+	 * tasklist_lock may need to be moved further down if any additional
+	 * while_each_thread loops get added to this function.
+	 */
+	read_unlock(&tasklist_lock);
+
 	if (!p) {
-		rcu_read_unlock();
 		put_task_struct(victim);
 		return;
 	} else if (victim != p) {
@@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * That thread will now get access to memory reserves since it has a
 	 * pending fatal signal.
 	 */
+	rcu_read_lock();
 	for_each_process(p)
 		if (p->mm == mm && !same_thread_group(p, victim) &&
 		    !(p->flags & PF_KTHREAD)) {
-- 
1.8.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill
  2013-11-13 17:18                                       ` Sameer Nanda
@ 2013-11-13 17:29                                         ` Oleg Nesterov
  -1 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-13 17:29 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

On 11/13, Sameer Nanda wrote:
>
> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
>
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.

I am fine with this patch as well, but honestly I'd prefer the previous
v5. I won't argue though.

> +/*
> + * Careful: while_each_thread is not RCU safe. Callers should hold
> + * read_lock(tasklist_lock) across while_each_thread loops.
> + */

(tasklist_lock or siglock, in fact but this doesn't matter).

This is not that simple, even tasklist_lock can't help if the task is
already dead.

Oh. Yes, sorry. I promised to send the patches "soon" many times, but
still didn't find the time.

Perhaps I should try to start with the "make this all less buggy" changes,
the "complete" fix needs to change the callers as well.

Oleg.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill
@ 2013-11-13 17:29                                         ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2013-11-13 17:29 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v,
	dserrg, msb, linux-mm, linux-kernel

On 11/13, Sameer Nanda wrote:
>
> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
>
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.

I am fine with this patch as well, but honestly I'd prefer the previous
v5. I won't argue though.

> +/*
> + * Careful: while_each_thread is not RCU safe. Callers should hold
> + * read_lock(tasklist_lock) across while_each_thread loops.
> + */

(tasklist_lock or siglock, in fact but this doesn't matter).

This is not that simple, even tasklist_lock can't help if the task is
already dead.

Oh. Yes, sorry. I promised to send the patches "soon" many times, but
still didn't find the time.

Perhaps I should try to start with the "make this all less buggy" changes,
the "complete" fix needs to change the callers as well.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill
  2013-11-13 17:18                                       ` Sameer Nanda
  (?)
  (?)
@ 2013-11-14 13:43                                       ` dserrg
  2013-11-14 17:03                                           ` Sameer Nanda
  -1 siblings, 1 reply; 51+ messages in thread
From: dserrg @ 2013-11-14 13:43 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: rusty, hannes, msb, oleg,
	Мурзин
	Владимир,
	linux-mm, rientjes, mhocko, akpm, semenzato, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5193 bytes --]

(sorry for html)

Why do we even bother with locking?
Why not just merge my original patch? (The link is in Vladimir's message)
It provides much more elegant (and working!) solution for this problem.
David, how did you miss it in the first place?

Oh.. and by the way. I was hitting the same bug in other
while_each_thread loops in oom_kill.c. Anyway, goodluck ;)
14 нояб. 2013 г. 2:18 пользователь "Sameer Nanda" <snanda@chromium.org>
написал:

> The selection of the process to be killed happens in two spots:
> first in select_bad_process and then a further refinement by
> looking for child processes in oom_kill_process. Since this is
> a two step process, it is possible that the process selected by
> select_bad_process may get a SIGKILL just before oom_kill_process
> executes. If this were to happen, __unhash_process deletes this
> process from the thread_group list. This results in oom_kill_process
> getting stuck in an infinite loop when traversing the thread_group
> list of the selected process.
>
> Fix this race by adding a pid_alive check for the selected process
> with tasklist_lock held in oom_kill_process.
>
> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
>  include/linux/sched.h |  5 +++++
>  mm/oom_kill.c         | 34 +++++++++++++++++++++-------------
>  2 files changed, 26 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e27baee..8975dbb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void);
>  #define do_each_thread(g, t) \
>         for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; )
> do
>
> +/*
> + * Careful: while_each_thread is not RCU safe. Callers should hold
> + * read_lock(tasklist_lock) across while_each_thread loops.
> + */
> +
>  #define while_each_thread(g, t) \
>         while ((t = next_thread(t)) != g)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6738c47..0d1f804 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t
> gfp_mask, int order,
>         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>                                               DEFAULT_RATELIMIT_BURST);
>
> +       if (__ratelimit(&oom_rs))
> +               dump_header(p, gfp_mask, order, memcg, nodemask);
> +
> +       task_lock(p);
> +       pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
> +               message, task_pid_nr(p), p->comm, points);
> +       task_unlock(p);
> +
> +       read_lock(&tasklist_lock);
> +
>         /*
>          * If the task is already exiting, don't alarm the sysadmin or kill
>          * its children or threads, just set TIF_MEMDIE so it can die
> quickly
>          */
> -       if (p->flags & PF_EXITING) {
> +       if (p->flags & PF_EXITING || !pid_alive(p)) {
>                 set_tsk_thread_flag(p, TIF_MEMDIE);
>                 put_task_struct(p);
> +               read_unlock(&tasklist_lock);
>                 return;
>         }
>
> -       if (__ratelimit(&oom_rs))
> -               dump_header(p, gfp_mask, order, memcg, nodemask);
> -
> -       task_lock(p);
> -       pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
> -               message, task_pid_nr(p), p->comm, points);
> -       task_unlock(p);
> -
>         /*
>          * If any of p's children has a different mm and is eligible for
> kill,
>          * the one with the highest oom_badness() score is sacrificed for
> its
>          * parent.  This attempts to lose the minimal amount of work done
> while
>          * still freeing memory.
>          */
> -       read_lock(&tasklist_lock);
>         do {
>                 list_for_each_entry(child, &t->children, sibling) {
>                         unsigned int child_points;
> @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t
> gfp_mask, int order,
>                         }
>                 }
>         } while_each_thread(p, t);
> -       read_unlock(&tasklist_lock);
>
> -       rcu_read_lock();
>         p = find_lock_task_mm(victim);
> +
> +       /*
> +        * Since while_each_thread is currently not RCU safe, this unlock
> of
> +        * tasklist_lock may need to be moved further down if any
> additional
> +        * while_each_thread loops get added to this function.
> +        */
> +       read_unlock(&tasklist_lock);
> +
>         if (!p) {
> -               rcu_read_unlock();
>                 put_task_struct(victim);
>                 return;
>         } else if (victim != p) {
> @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t
> gfp_mask, int order,
>          * That thread will now get access to memory reserves since it has
> a
>          * pending fatal signal.
>          */
> +       rcu_read_lock();
>         for_each_process(p)
>                 if (p->mm == mm && !same_thread_group(p, victim) &&
>                     !(p->flags & PF_KTHREAD)) {
> --
> 1.8.4.1
>
>

[-- Attachment #2: Type: text/html, Size: 5852 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill
  2013-11-14 13:43                                       ` dserrg
@ 2013-11-14 17:03                                           ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-14 17:03 UTC (permalink / raw)
  To: dserrg
  Cc: Rusty Russell, Johannes Weiner, msb, Oleg Nesterov,
	Мурзин
	Владимир,
	linux-mm, David Rientjes, mhocko, Andrew Morton, Luigi Semenzato,
	linux-kernel

On Thu, Nov 14, 2013 at 5:43 AM, dserrg <dserrg@gmail.com> wrote:
> (sorry for html)
>
> Why do we even bother with locking?
> Why not just merge my original patch? (The link is in Vladimir's message)
> It provides much more elegant (and working!) solution for this problem.

As Oleg alluded to in that thread, that patch makes the race window
smaller, but doesn't close it completely.  Imagine if a SIGKILL gets
sent to the task p immediately after the fatal_signal_pending check.
In that case, the infinite loop in while_each_thread will still happen
since  __unhash_process would delete the task p from the thread_group
list while while_each_thread loop is in progress on another CPU.  This
is precisely why we need to hold read_lock(&tasklist_lock) _before_
checking the state of the process p and entering the while_each_thread
loop.

> David, how did you miss it in the first place?
>
> Oh.. and by the way. I was hitting the same bug in other
> while_each_thread loops in oom_kill.c.

> Anyway, goodluck ;)

Thanks!

>
> 14 нояб. 2013 г. 2:18 пользователь "Sameer Nanda" <snanda@chromium.org>
> написал:
>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>>
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>> ---
>>  include/linux/sched.h |  5 +++++
>>  mm/oom_kill.c         | 34 +++++++++++++++++++++-------------
>>  2 files changed, 26 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index e27baee..8975dbb 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void);
>>  #define do_each_thread(g, t) \
>>         for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; )
>> do
>>
>> +/*
>> + * Careful: while_each_thread is not RCU safe. Callers should hold
>> + * read_lock(tasklist_lock) across while_each_thread loops.
>> + */
>> +
>>  #define while_each_thread(g, t) \
>>         while ((t = next_thread(t)) != g)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6738c47..0d1f804 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t
>> gfp_mask, int order,
>>         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>>                                               DEFAULT_RATELIMIT_BURST);
>>
>> +       if (__ratelimit(&oom_rs))
>> +               dump_header(p, gfp_mask, order, memcg, nodemask);
>> +
>> +       task_lock(p);
>> +       pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> +               message, task_pid_nr(p), p->comm, points);
>> +       task_unlock(p);
>> +
>> +       read_lock(&tasklist_lock);
>> +
>>         /*
>>          * If the task is already exiting, don't alarm the sysadmin or
>> kill
>>          * its children or threads, just set TIF_MEMDIE so it can die
>> quickly
>>          */
>> -       if (p->flags & PF_EXITING) {
>> +       if (p->flags & PF_EXITING || !pid_alive(p)) {
>>                 set_tsk_thread_flag(p, TIF_MEMDIE);
>>                 put_task_struct(p);
>> +               read_unlock(&tasklist_lock);
>>                 return;
>>         }
>>
>> -       if (__ratelimit(&oom_rs))
>> -               dump_header(p, gfp_mask, order, memcg, nodemask);
>> -
>> -       task_lock(p);
>> -       pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> -               message, task_pid_nr(p), p->comm, points);
>> -       task_unlock(p);
>> -
>>         /*
>>          * If any of p's children has a different mm and is eligible for
>> kill,
>>          * the one with the highest oom_badness() score is sacrificed for
>> its
>>          * parent.  This attempts to lose the minimal amount of work done
>> while
>>          * still freeing memory.
>>          */
>> -       read_lock(&tasklist_lock);
>>         do {
>>                 list_for_each_entry(child, &t->children, sibling) {
>>                         unsigned int child_points;
>> @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t
>> gfp_mask, int order,
>>                         }
>>                 }
>>         } while_each_thread(p, t);
>> -       read_unlock(&tasklist_lock);
>>
>> -       rcu_read_lock();
>>         p = find_lock_task_mm(victim);
>> +
>> +       /*
>> +        * Since while_each_thread is currently not RCU safe, this unlock
>> of
>> +        * tasklist_lock may need to be moved further down if any
>> additional
>> +        * while_each_thread loops get added to this function.
>> +        */
>> +       read_unlock(&tasklist_lock);
>> +
>>         if (!p) {
>> -               rcu_read_unlock();
>>                 put_task_struct(victim);
>>                 return;
>>         } else if (victim != p) {
>> @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t
>> gfp_mask, int order,
>>          * That thread will now get access to memory reserves since it has
>> a
>>          * pending fatal signal.
>>          */
>> +       rcu_read_lock();
>>         for_each_process(p)
>>                 if (p->mm == mm && !same_thread_group(p, victim) &&
>>                     !(p->flags & PF_KTHREAD)) {
>> --
>> 1.8.4.1
>>
>



-- 
Sameer

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill
@ 2013-11-14 17:03                                           ` Sameer Nanda
  0 siblings, 0 replies; 51+ messages in thread
From: Sameer Nanda @ 2013-11-14 17:03 UTC (permalink / raw)
  To: dserrg
  Cc: Rusty Russell, Johannes Weiner, msb, Oleg Nesterov,
	Мурзин
	Владимир,
	linux-mm, David Rientjes, mhocko, Andrew Morton, Luigi Semenzato,
	linux-kernel

On Thu, Nov 14, 2013 at 5:43 AM, dserrg <dserrg@gmail.com> wrote:
> (sorry for html)
>
> Why do we even bother with locking?
> Why not just merge my original patch? (The link is in Vladimir's message)
> It provides much more elegant (and working!) solution for this problem.

As Oleg alluded to in that thread, that patch makes the race window
smaller, but doesn't close it completely.  Imagine if a SIGKILL gets
sent to the task p immediately after the fatal_signal_pending check.
In that case, the infinite loop in while_each_thread will still happen
since  __unhash_process would delete the task p from the thread_group
list while while_each_thread loop is in progress on another CPU.  This
is precisely why we need to hold read_lock(&tasklist_lock) _before_
checking the state of the process p and entering the while_each_thread
loop.

> David, how did you miss it in the first place?
>
> Oh.. and by the way. I was hitting the same bug in other
> while_each_thread loops in oom_kill.c.

> Anyway, goodluck ;)

Thanks!

>
> 14 нояб. 2013 г. 2:18 пользователь "Sameer Nanda" <snanda@chromium.org>
> написал:
>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>>
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>> ---
>>  include/linux/sched.h |  5 +++++
>>  mm/oom_kill.c         | 34 +++++++++++++++++++++-------------
>>  2 files changed, 26 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index e27baee..8975dbb 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void);
>>  #define do_each_thread(g, t) \
>>         for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; )
>> do
>>
>> +/*
>> + * Careful: while_each_thread is not RCU safe. Callers should hold
>> + * read_lock(tasklist_lock) across while_each_thread loops.
>> + */
>> +
>>  #define while_each_thread(g, t) \
>>         while ((t = next_thread(t)) != g)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6738c47..0d1f804 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t
>> gfp_mask, int order,
>>         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>>                                               DEFAULT_RATELIMIT_BURST);
>>
>> +       if (__ratelimit(&oom_rs))
>> +               dump_header(p, gfp_mask, order, memcg, nodemask);
>> +
>> +       task_lock(p);
>> +       pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> +               message, task_pid_nr(p), p->comm, points);
>> +       task_unlock(p);
>> +
>> +       read_lock(&tasklist_lock);
>> +
>>         /*
>>          * If the task is already exiting, don't alarm the sysadmin or
>> kill
>>          * its children or threads, just set TIF_MEMDIE so it can die
>> quickly
>>          */
>> -       if (p->flags & PF_EXITING) {
>> +       if (p->flags & PF_EXITING || !pid_alive(p)) {
>>                 set_tsk_thread_flag(p, TIF_MEMDIE);
>>                 put_task_struct(p);
>> +               read_unlock(&tasklist_lock);
>>                 return;
>>         }
>>
>> -       if (__ratelimit(&oom_rs))
>> -               dump_header(p, gfp_mask, order, memcg, nodemask);
>> -
>> -       task_lock(p);
>> -       pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> -               message, task_pid_nr(p), p->comm, points);
>> -       task_unlock(p);
>> -
>>         /*
>>          * If any of p's children has a different mm and is eligible for
>> kill,
>>          * the one with the highest oom_badness() score is sacrificed for
>> its
>>          * parent.  This attempts to lose the minimal amount of work done
>> while
>>          * still freeing memory.
>>          */
>> -       read_lock(&tasklist_lock);
>>         do {
>>                 list_for_each_entry(child, &t->children, sibling) {
>>                         unsigned int child_points;
>> @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t
>> gfp_mask, int order,
>>                         }
>>                 }
>>         } while_each_thread(p, t);
>> -       read_unlock(&tasklist_lock);
>>
>> -       rcu_read_lock();
>>         p = find_lock_task_mm(victim);
>> +
>> +       /*
>> +        * Since while_each_thread is currently not RCU safe, this unlock
>> of
>> +        * tasklist_lock may need to be moved further down if any
>> additional
>> +        * while_each_thread loops get added to this function.
>> +        */
>> +       read_unlock(&tasklist_lock);
>> +
>>         if (!p) {
>> -               rcu_read_unlock();
>>                 put_task_struct(victim);
>>                 return;
>>         } else if (victim != p) {
>> @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t
>> gfp_mask, int order,
>>          * That thread will now get access to memory reserves since it has
>> a
>>          * pending fatal signal.
>>          */
>> +       rcu_read_lock();
>>         for_each_process(p)
>>                 if (p->mm == mm && !same_thread_group(p, victim) &&
>>                     !(p->flags & PF_KTHREAD)) {
>> --
>> 1.8.4.1
>>
>



-- 
Sameer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2013-11-14 17:04 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-05 23:26 [PATCH] mm, oom: Fix race when selecting process to kill Sameer Nanda
2013-11-05 23:26 ` Sameer Nanda
2013-11-06  1:18 ` David Rientjes
2013-11-06  1:18   ` David Rientjes
2013-11-06  1:25   ` Luigi Semenzato
2013-11-06  1:25     ` Luigi Semenzato
2013-11-06  1:27     ` David Rientjes
2013-11-06  1:27       ` David Rientjes
2013-11-06  3:00       ` Vladimir Murzin
2013-11-06  3:00         ` Vladimir Murzin
2013-11-06  3:04       ` Sameer Nanda
2013-11-06  4:45         ` Luigi Semenzato
2013-11-06  4:45           ` Luigi Semenzato
2013-11-06  7:17           ` Luigi Semenzato
2013-11-06  7:17             ` Luigi Semenzato
2013-11-06 16:58             ` Sameer Nanda
2013-11-06 16:58               ` Sameer Nanda
2013-11-07  0:35               ` David Rientjes
2013-11-07  0:35                 ` David Rientjes
2013-11-07 19:34                 ` Sameer Nanda
2013-11-07 19:34                   ` Sameer Nanda
2013-11-08 18:07                 ` [PATCH v2] " Sameer Nanda
2013-11-08 18:07                   ` Sameer Nanda
2013-11-08 18:45                   ` Oleg Nesterov
2013-11-08 18:45                     ` Oleg Nesterov
2013-11-08 19:49                     ` [PATCH v3] " Sameer Nanda
2013-11-08 19:49                       ` Sameer Nanda
2013-11-09 15:16                       ` Oleg Nesterov
2013-11-09 15:16                         ` Oleg Nesterov
2013-11-11 23:15                         ` Sameer Nanda
2013-11-12  0:21                         ` [PATCH v4] " Sameer Nanda
2013-11-12  0:21                           ` Sameer Nanda
2013-11-12 15:13                           ` Michal Hocko
2013-11-12 15:13                             ` Michal Hocko
2013-11-12 20:01                           ` Oleg Nesterov
2013-11-12 20:01                             ` Oleg Nesterov
2013-11-12 20:08                             ` Sameer Nanda
2013-11-12 20:08                               ` Sameer Nanda
2013-11-12 20:23                               ` [PATCH v5] " Sameer Nanda
2013-11-12 20:23                                 ` Sameer Nanda
2013-11-13  2:33                                 ` David Rientjes
2013-11-13  2:33                                   ` David Rientjes
2013-11-13 16:46                                   ` Sameer Nanda
2013-11-13 16:46                                     ` Sameer Nanda
2013-11-13 17:18                                     ` [PATCH v6] " Sameer Nanda
2013-11-13 17:18                                       ` Sameer Nanda
2013-11-13 17:29                                       ` Oleg Nesterov
2013-11-13 17:29                                         ` Oleg Nesterov
2013-11-14 13:43                                       ` dserrg
2013-11-14 17:03                                         ` Sameer Nanda
2013-11-14 17:03                                           ` Sameer Nanda

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.