* [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-05 23:26 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-05 23:26 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots -- first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This then results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by holding the tasklist_lock across the calls to both select_bad_process and oom_kill_process. Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..7bd3587 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -436,7 +436,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,7 +455,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); rcu_read_lock(); p = find_lock_task_mm(victim); @@ -641,6 +639,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL; check_panic_on_oom(constraint, gfp_mask, order, mpol_mask); + read_lock(&tasklist_lock); if (sysctl_oom_kill_allocating_task && current->mm && !oom_unkillable_task(current, NULL, nodemask) && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { @@ -663,6 +662,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, killed = 1; } out: + read_unlock(&tasklist_lock); /* * Give the killed threads a good chance of exiting before trying to * allocate memory again. -- 1.8.4.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-05 23:26 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-05 23:26 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots -- first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This then results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by holding the tasklist_lock across the calls to both select_bad_process and oom_kill_process. Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..7bd3587 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -436,7 +436,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,7 +455,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); rcu_read_lock(); p = find_lock_task_mm(victim); @@ -641,6 +639,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL; check_panic_on_oom(constraint, gfp_mask, order, mpol_mask); + read_lock(&tasklist_lock); if (sysctl_oom_kill_allocating_task && current->mm && !oom_unkillable_task(current, NULL, nodemask) && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { @@ -663,6 +662,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, killed = 1; } out: + read_unlock(&tasklist_lock); /* * Give the killed threads a good chance of exiting before trying to * allocate memory again. -- 1.8.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-05 23:26 ` Sameer Nanda @ 2013-11-06 1:18 ` David Rientjes -1 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-06 1:18 UTC (permalink / raw) To: Sameer Nanda; +Cc: akpm, mhocko, hannes, rusty, linux-mm, linux-kernel On Tue, 5 Nov 2013, Sameer Nanda wrote: > The selection of the process to be killed happens in two spots -- first > in select_bad_process and then a further refinement by looking for > child processes in oom_kill_process. Since this is a two step process, > it is possible that the process selected by select_bad_process may get a > SIGKILL just before oom_kill_process executes. If this were to happen, > __unhash_process deletes this process from the thread_group list. This > then results in oom_kill_process getting stuck in an infinite loop when > traversing the thread_group list of the selected process. > > Fix this race by holding the tasklist_lock across the calls to both > select_bad_process and oom_kill_process. > > Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 > Signed-off-by: Sameer Nanda <snanda@chromium.org> Nack, we had to avoid taking tasklist_lock for this duration since it stalls out forks and exits on other cpus trying to take the writeside with irqs disabled to avoid watchdog problems. What kernel version are you patching? If you check the latest Linus tree, we hold a reference to the task_struct of the chosen process before calling oom_kill_process() so the hypothesis would seem incorrect. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-06 1:18 ` David Rientjes 0 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-06 1:18 UTC (permalink / raw) To: Sameer Nanda; +Cc: akpm, mhocko, hannes, rusty, linux-mm, linux-kernel On Tue, 5 Nov 2013, Sameer Nanda wrote: > The selection of the process to be killed happens in two spots -- first > in select_bad_process and then a further refinement by looking for > child processes in oom_kill_process. Since this is a two step process, > it is possible that the process selected by select_bad_process may get a > SIGKILL just before oom_kill_process executes. If this were to happen, > __unhash_process deletes this process from the thread_group list. This > then results in oom_kill_process getting stuck in an infinite loop when > traversing the thread_group list of the selected process. > > Fix this race by holding the tasklist_lock across the calls to both > select_bad_process and oom_kill_process. > > Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 > Signed-off-by: Sameer Nanda <snanda@chromium.org> Nack, we had to avoid taking tasklist_lock for this duration since it stalls out forks and exits on other cpus trying to take the writeside with irqs disabled to avoid watchdog problems. What kernel version are you patching? If you check the latest Linus tree, we hold a reference to the task_struct of the chosen process before calling oom_kill_process() so the hypothesis would seem incorrect. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 1:18 ` David Rientjes @ 2013-11-06 1:25 ` Luigi Semenzato -1 siblings, 0 replies; 51+ messages in thread From: Luigi Semenzato @ 2013-11-06 1:25 UTC (permalink / raw) To: David Rientjes Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty, linux-mm, linux-kernel It's not enough to hold a reference to the task struct, because it can still be taken out of the circular list of threads. The RCU assumptions don't hold in that case. On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 5 Nov 2013, Sameer Nanda wrote: > >> The selection of the process to be killed happens in two spots -- first >> in select_bad_process and then a further refinement by looking for >> child processes in oom_kill_process. Since this is a two step process, >> it is possible that the process selected by select_bad_process may get a >> SIGKILL just before oom_kill_process executes. If this were to happen, >> __unhash_process deletes this process from the thread_group list. This >> then results in oom_kill_process getting stuck in an infinite loop when >> traversing the thread_group list of the selected process. >> >> Fix this race by holding the tasklist_lock across the calls to both >> select_bad_process and oom_kill_process. >> >> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 >> Signed-off-by: Sameer Nanda <snanda@chromium.org> > > Nack, we had to avoid taking tasklist_lock for this duration since it > stalls out forks and exits on other cpus trying to take the writeside with > irqs disabled to avoid watchdog problems. > > What kernel version are you patching? If you check the latest Linus tree, > we hold a reference to the task_struct of the chosen process before > calling oom_kill_process() so the hypothesis would seem incorrect. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-06 1:25 ` Luigi Semenzato 0 siblings, 0 replies; 51+ messages in thread From: Luigi Semenzato @ 2013-11-06 1:25 UTC (permalink / raw) To: David Rientjes Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty, linux-mm, linux-kernel It's not enough to hold a reference to the task struct, because it can still be taken out of the circular list of threads. The RCU assumptions don't hold in that case. On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 5 Nov 2013, Sameer Nanda wrote: > >> The selection of the process to be killed happens in two spots -- first >> in select_bad_process and then a further refinement by looking for >> child processes in oom_kill_process. Since this is a two step process, >> it is possible that the process selected by select_bad_process may get a >> SIGKILL just before oom_kill_process executes. If this were to happen, >> __unhash_process deletes this process from the thread_group list. This >> then results in oom_kill_process getting stuck in an infinite loop when >> traversing the thread_group list of the selected process. >> >> Fix this race by holding the tasklist_lock across the calls to both >> select_bad_process and oom_kill_process. >> >> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 >> Signed-off-by: Sameer Nanda <snanda@chromium.org> > > Nack, we had to avoid taking tasklist_lock for this duration since it > stalls out forks and exits on other cpus trying to take the writeside with > irqs disabled to avoid watchdog problems. > > What kernel version are you patching? If you check the latest Linus tree, > we hold a reference to the task_struct of the chosen process before > calling oom_kill_process() so the hypothesis would seem incorrect. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 1:25 ` Luigi Semenzato @ 2013-11-06 1:27 ` David Rientjes -1 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-06 1:27 UTC (permalink / raw) To: Luigi Semenzato Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty, linux-mm, linux-kernel On Tue, 5 Nov 2013, Luigi Semenzato wrote: > It's not enough to hold a reference to the task struct, because it can > still be taken out of the circular list of threads. The RCU > assumptions don't hold in that case. > Could you please post a proper bug report that isolates this at the cause? Thanks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-06 1:27 ` David Rientjes 0 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-06 1:27 UTC (permalink / raw) To: Luigi Semenzato Cc: Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty, linux-mm, linux-kernel On Tue, 5 Nov 2013, Luigi Semenzato wrote: > It's not enough to hold a reference to the task struct, because it can > still be taken out of the circular list of threads. The RCU > assumptions don't hold in that case. > Could you please post a proper bug report that isolates this at the cause? Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 1:27 ` David Rientjes @ 2013-11-06 3:00 ` Vladimir Murzin -1 siblings, 0 replies; 51+ messages in thread From: Vladimir Murzin @ 2013-11-06 3:00 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty, linux-mm, linux-kernel, oleg, dserrg Ccing Oleg and Sergey On Tue, Nov 05, 2013 at 05:27:29PM -0800, David Rientjes wrote: > On Tue, 5 Nov 2013, Luigi Semenzato wrote: > > > It's not enough to hold a reference to the task struct, because it can > > still be taken out of the circular list of threads. The RCU > > assumptions don't hold in that case. > > > > Could you please post a proper bug report that isolates this at the cause? > Hi David! I think it has already been reported[1] and actively discussed. Oleg has confirmed that while_each_thread() is not safe under rcu_read_lock()[2]. Oleg, any news about your activity for fixing that? [1] http://www.spinics.net/lists/linux-mm/msg54836.html [2] http://marc.info/?l=linux-kernel&m=127688978121665 > Thanks. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-06 3:00 ` Vladimir Murzin 0 siblings, 0 replies; 51+ messages in thread From: Vladimir Murzin @ 2013-11-06 3:00 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, Sameer Nanda, Andrew Morton, mhocko, Johannes Weiner, rusty, linux-mm, linux-kernel, oleg, dserrg Ccing Oleg and Sergey On Tue, Nov 05, 2013 at 05:27:29PM -0800, David Rientjes wrote: > On Tue, 5 Nov 2013, Luigi Semenzato wrote: > > > It's not enough to hold a reference to the task struct, because it can > > still be taken out of the circular list of threads. The RCU > > assumptions don't hold in that case. > > > > Could you please post a proper bug report that isolates this at the cause? > Hi David! I think it has already been reported[1] and actively discussed. Oleg has confirmed that while_each_thread() is not safe under rcu_read_lock()[2]. Oleg, any news about your activity for fixing that? [1] http://www.spinics.net/lists/linux-mm/msg54836.html [2] http://marc.info/?l=linux-kernel&m=127688978121665 > Thanks. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 1:27 ` David Rientjes (?) (?) @ 2013-11-06 3:04 ` Sameer Nanda 2013-11-06 4:45 ` Luigi Semenzato -1 siblings, 1 reply; 51+ messages in thread From: Sameer Nanda @ 2013-11-06 3:04 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, Andrew Morton, mhocko, Johannes Weiner, rusty, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 696 bytes --] On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 5 Nov 2013, Luigi Semenzato wrote: > > > It's not enough to hold a reference to the task struct, because it can > > still be taken out of the circular list of threads. The RCU > > assumptions don't hold in that case. > > > > Could you please post a proper bug report that isolates this at the cause? > We've been running into this issue on Chrome OS. crbug.com/256326 has additional details. The issue manifests itself as a soft lockup. The kernel we've been seeing this on is 3.8. We have a pretty consistent repro currently. Happy to try out other suggestions for a fix. > > Thanks. > -- Sameer [-- Attachment #2: Type: text/html, Size: 1514 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 3:04 ` Sameer Nanda @ 2013-11-06 4:45 ` Luigi Semenzato 0 siblings, 0 replies; 51+ messages in thread From: Luigi Semenzato @ 2013-11-06 4:45 UTC (permalink / raw) To: Sameer Nanda, msb Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, linux-mm, linux-kernel It's interesting that this was known for 3+ years, but nobody bothered adding a small warning to the code. We noticed this because it's actually happening on Chromebooks in the field. We try to minimize OOM kills, but we can deal with them. Of course, a hung kernel we cannot deal with. On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote: > > > > On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote: >> >> On Tue, 5 Nov 2013, Luigi Semenzato wrote: >> >> > It's not enough to hold a reference to the task struct, because it can >> > still be taken out of the circular list of threads. The RCU >> > assumptions don't hold in that case. >> > >> >> Could you please post a proper bug report that isolates this at the cause? > > > We've been running into this issue on Chrome OS. crbug.com/256326 has > additional > details. The issue manifests itself as a soft lockup. > > The kernel we've been seeing this on is 3.8. > > We have a pretty consistent repro currently. Happy to try out other > suggestions > for a fix. > >> >> >> Thanks. > > > > > -- > Sameer ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-06 4:45 ` Luigi Semenzato 0 siblings, 0 replies; 51+ messages in thread From: Luigi Semenzato @ 2013-11-06 4:45 UTC (permalink / raw) To: Sameer Nanda, msb Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, linux-mm, linux-kernel It's interesting that this was known for 3+ years, but nobody bothered adding a small warning to the code. We noticed this because it's actually happening on Chromebooks in the field. We try to minimize OOM kills, but we can deal with them. Of course, a hung kernel we cannot deal with. On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote: > > > > On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote: >> >> On Tue, 5 Nov 2013, Luigi Semenzato wrote: >> >> > It's not enough to hold a reference to the task struct, because it can >> > still be taken out of the circular list of threads. The RCU >> > assumptions don't hold in that case. >> > >> >> Could you please post a proper bug report that isolates this at the cause? > > > We've been running into this issue on Chrome OS. crbug.com/256326 has > additional > details. The issue manifests itself as a soft lockup. > > The kernel we've been seeing this on is 3.8. > > We have a pretty consistent repro currently. Happy to try out other > suggestions > for a fix. > >> >> >> Thanks. > > > > > -- > Sameer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 4:45 ` Luigi Semenzato @ 2013-11-06 7:17 ` Luigi Semenzato -1 siblings, 0 replies; 51+ messages in thread From: Luigi Semenzato @ 2013-11-06 7:17 UTC (permalink / raw) To: Sameer Nanda, msb Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, linux-mm, linux-kernel Regarding other fixes: would it be possible to have the thread iterator insert a dummy marker element in the thread list before the scan? There would be one such dummy element per CPU, so that multiple CPUs can scan the list in parallel. The loop would skip such elements, and each dummy element would be removed at the end of each scan. I think this would work, i.e. it would have all the right properties, but I don't have a sense of whether the performance impact is acceptable. Probably not, or it would have been proposed earlier. On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote: > It's interesting that this was known for 3+ years, but nobody bothered > adding a small warning to the code. > > We noticed this because it's actually happening on Chromebooks in the > field. We try to minimize OOM kills, but we can deal with them. Of > course, a hung kernel we cannot deal with. > > On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote: >> >> >> >> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote: >>> >>> On Tue, 5 Nov 2013, Luigi Semenzato wrote: >>> >>> > It's not enough to hold a reference to the task struct, because it can >>> > still be taken out of the circular list of threads. The RCU >>> > assumptions don't hold in that case. >>> > >>> >>> Could you please post a proper bug report that isolates this at the cause? >> >> >> We've been running into this issue on Chrome OS. crbug.com/256326 has >> additional >> details. The issue manifests itself as a soft lockup. >> >> The kernel we've been seeing this on is 3.8. >> >> We have a pretty consistent repro currently. Happy to try out other >> suggestions >> for a fix. >> >>> >>> >>> Thanks. >> >> >> >> >> -- >> Sameer ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-06 7:17 ` Luigi Semenzato 0 siblings, 0 replies; 51+ messages in thread From: Luigi Semenzato @ 2013-11-06 7:17 UTC (permalink / raw) To: Sameer Nanda, msb Cc: David Rientjes, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, linux-mm, linux-kernel Regarding other fixes: would it be possible to have the thread iterator insert a dummy marker element in the thread list before the scan? There would be one such dummy element per CPU, so that multiple CPUs can scan the list in parallel. The loop would skip such elements, and each dummy element would be removed at the end of each scan. I think this would work, i.e. it would have all the right properties, but I don't have a sense of whether the performance impact is acceptable. Probably not, or it would have been proposed earlier. On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote: > It's interesting that this was known for 3+ years, but nobody bothered > adding a small warning to the code. > > We noticed this because it's actually happening on Chromebooks in the > field. We try to minimize OOM kills, but we can deal with them. Of > course, a hung kernel we cannot deal with. > > On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote: >> >> >> >> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote: >>> >>> On Tue, 5 Nov 2013, Luigi Semenzato wrote: >>> >>> > It's not enough to hold a reference to the task struct, because it can >>> > still be taken out of the circular list of threads. The RCU >>> > assumptions don't hold in that case. >>> > >>> >>> Could you please post a proper bug report that isolates this at the cause? >> >> >> We've been running into this issue on Chrome OS. crbug.com/256326 has >> additional >> details. The issue manifests itself as a soft lockup. >> >> The kernel we've been seeing this on is 3.8. >> >> We have a pretty consistent repro currently. Happy to try out other >> suggestions >> for a fix. >> >>> >>> >>> Thanks. >> >> >> >> >> -- >> Sameer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 7:17 ` Luigi Semenzato @ 2013-11-06 16:58 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-06 16:58 UTC (permalink / raw) To: Luigi Semenzato Cc: msb, David Rientjes, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, linux-mm, linux-kernel (adding back context from thread history) On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 5 Nov 2013, Sameer Nanda wrote: > >> The selection of the process to be killed happens in two spots -- first >> in select_bad_process and then a further refinement by looking for >> child processes in oom_kill_process. Since this is a two step process, >> it is possible that the process selected by select_bad_process may get a >> SIGKILL just before oom_kill_process executes. If this were to happen, >> __unhash_process deletes this process from the thread_group list. This >> then results in oom_kill_process getting stuck in an infinite loop when >> traversing the thread_group list of the selected process. >> >> Fix this race by holding the tasklist_lock across the calls to both >> select_bad_process and oom_kill_process. >> >> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 >> Signed-off-by: Sameer Nanda <snanda@chromium.org> > > Nack, we had to avoid taking tasklist_lock for this duration since it > stalls out forks and exits on other cpus trying to take the writeside with > irqs disabled to avoid watchdog problems. > David -- I think we can make the duration that the tasklist_lock is held smaller by consolidating the process selection logic that is currently split across select_bad_process and oom_kill_process into one place in select_bad_process. The tasklist_lock would then need to be held only when the thread lists are being traversed. Would you be ok with that? I can re-spin the patch if that sounds like a workable option. > What kernel version are you patching? If you check the latest Linus tree, > we hold a reference to the task_struct of the chosen process before > calling oom_kill_process() so the hypothesis would seem incorrect. On Tue, Nov 5, 2013 at 11:17 PM, Luigi Semenzato <semenzato@google.com> wrote: > Regarding other fixes: would it be possible to have the thread > iterator insert a dummy marker element in the thread list before the > scan? There would be one such dummy element per CPU, so that multiple > CPUs can scan the list in parallel. The loop would skip such > elements, and each dummy element would be removed at the end of each > scan. > > I think this would work, i.e. it would have all the right properties, > but I don't have a sense of whether the performance impact is > acceptable. Probably not, or it would have been proposed earlier. > > > > On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote: >> It's interesting that this was known for 3+ years, but nobody bothered >> adding a small warning to the code. >> >> We noticed this because it's actually happening on Chromebooks in the >> field. We try to minimize OOM kills, but we can deal with them. Of >> course, a hung kernel we cannot deal with. >> >> On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote: >>> >>> >>> >>> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote: >>>> >>>> On Tue, 5 Nov 2013, Luigi Semenzato wrote: >>>> >>>> > It's not enough to hold a reference to the task struct, because it can >>>> > still be taken out of the circular list of threads. The RCU >>>> > assumptions don't hold in that case. >>>> > >>>> >>>> Could you please post a proper bug report that isolates this at the cause? >>> >>> >>> We've been running into this issue on Chrome OS. crbug.com/256326 has >>> additional >>> details. The issue manifests itself as a soft lockup. >>> >>> The kernel we've been seeing this on is 3.8. >>> >>> We have a pretty consistent repro currently. Happy to try out other >>> suggestions >>> for a fix. >>> >>>> >>>> >>>> Thanks. >>> >>> >>> >>> >>> -- >>> Sameer -- Sameer ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-06 16:58 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-06 16:58 UTC (permalink / raw) To: Luigi Semenzato Cc: msb, David Rientjes, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, linux-mm, linux-kernel (adding back context from thread history) On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 5 Nov 2013, Sameer Nanda wrote: > >> The selection of the process to be killed happens in two spots -- first >> in select_bad_process and then a further refinement by looking for >> child processes in oom_kill_process. Since this is a two step process, >> it is possible that the process selected by select_bad_process may get a >> SIGKILL just before oom_kill_process executes. If this were to happen, >> __unhash_process deletes this process from the thread_group list. This >> then results in oom_kill_process getting stuck in an infinite loop when >> traversing the thread_group list of the selected process. >> >> Fix this race by holding the tasklist_lock across the calls to both >> select_bad_process and oom_kill_process. >> >> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60 >> Signed-off-by: Sameer Nanda <snanda@chromium.org> > > Nack, we had to avoid taking tasklist_lock for this duration since it > stalls out forks and exits on other cpus trying to take the writeside with > irqs disabled to avoid watchdog problems. > David -- I think we can make the duration that the tasklist_lock is held smaller by consolidating the process selection logic that is currently split across select_bad_process and oom_kill_process into one place in select_bad_process. The tasklist_lock would then need to be held only when the thread lists are being traversed. Would you be ok with that? I can re-spin the patch if that sounds like a workable option. > What kernel version are you patching? If you check the latest Linus tree, > we hold a reference to the task_struct of the chosen process before > calling oom_kill_process() so the hypothesis would seem incorrect. On Tue, Nov 5, 2013 at 11:17 PM, Luigi Semenzato <semenzato@google.com> wrote: > Regarding other fixes: would it be possible to have the thread > iterator insert a dummy marker element in the thread list before the > scan? There would be one such dummy element per CPU, so that multiple > CPUs can scan the list in parallel. The loop would skip such > elements, and each dummy element would be removed at the end of each > scan. > > I think this would work, i.e. it would have all the right properties, > but I don't have a sense of whether the performance impact is > acceptable. Probably not, or it would have been proposed earlier. > > > > On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato <semenzato@google.com> wrote: >> It's interesting that this was known for 3+ years, but nobody bothered >> adding a small warning to the code. >> >> We noticed this because it's actually happening on Chromebooks in the >> field. We try to minimize OOM kills, but we can deal with them. Of >> course, a hung kernel we cannot deal with. >> >> On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda <snanda@chromium.org> wrote: >>> >>> >>> >>> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes <rientjes@google.com> wrote: >>>> >>>> On Tue, 5 Nov 2013, Luigi Semenzato wrote: >>>> >>>> > It's not enough to hold a reference to the task struct, because it can >>>> > still be taken out of the circular list of threads. The RCU >>>> > assumptions don't hold in that case. >>>> > >>>> >>>> Could you please post a proper bug report that isolates this at the cause? >>> >>> >>> We've been running into this issue on Chrome OS. crbug.com/256326 has >>> additional >>> details. The issue manifests itself as a soft lockup. >>> >>> The kernel we've been seeing this on is 3.8. >>> >>> We have a pretty consistent repro currently. Happy to try out other >>> suggestions >>> for a fix. >>> >>>> >>>> >>>> Thanks. >>> >>> >>> >>> >>> -- >>> Sameer -- Sameer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-06 16:58 ` Sameer Nanda @ 2013-11-07 0:35 ` David Rientjes -1 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-07 0:35 UTC (permalink / raw) To: Sameer Nanda Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, oleg, linux-mm, linux-kernel On Wed, 6 Nov 2013, Sameer Nanda wrote: > David -- I think we can make the duration that the tasklist_lock is > held smaller by consolidating the process selection logic that is > currently split across select_bad_process and oom_kill_process into > one place in select_bad_process. The tasklist_lock would then need to > be held only when the thread lists are being traversed. Would you be > ok with that? I can re-spin the patch if that sounds like a workable > option. > No, this caused hundreds of machines to hit soft lockups for Google because there's no synchronization that prevents dozens of cpus to take tasklist_lock in the oom killer during parallel memcg oom conditions and never allow the write_lock_irq() on fork() or exit() to make progress. We absolutely must hold tasklist_lock for as little time as possible in the oom killer. That said, I've never actually seen your reported bug manifest in our production environment so let's see if Oleg has any ideas. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-07 0:35 ` David Rientjes 0 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-07 0:35 UTC (permalink / raw) To: Sameer Nanda Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, oleg, linux-mm, linux-kernel On Wed, 6 Nov 2013, Sameer Nanda wrote: > David -- I think we can make the duration that the tasklist_lock is > held smaller by consolidating the process selection logic that is > currently split across select_bad_process and oom_kill_process into > one place in select_bad_process. The tasklist_lock would then need to > be held only when the thread lists are being traversed. Would you be > ok with that? I can re-spin the patch if that sounds like a workable > option. > No, this caused hundreds of machines to hit soft lockups for Google because there's no synchronization that prevents dozens of cpus to take tasklist_lock in the oom killer during parallel memcg oom conditions and never allow the write_lock_irq() on fork() or exit() to make progress. We absolutely must hold tasklist_lock for as little time as possible in the oom killer. That said, I've never actually seen your reported bug manifest in our production environment so let's see if Oleg has any ideas. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill 2013-11-07 0:35 ` David Rientjes @ 2013-11-07 19:34 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-07 19:34 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, oleg, linux-mm, linux-kernel On Wed, Nov 6, 2013 at 4:35 PM, David Rientjes <rientjes@google.com> wrote: > On Wed, 6 Nov 2013, Sameer Nanda wrote: > >> David -- I think we can make the duration that the tasklist_lock is >> held smaller by consolidating the process selection logic that is >> currently split across select_bad_process and oom_kill_process into >> one place in select_bad_process. The tasklist_lock would then need to >> be held only when the thread lists are being traversed. Would you be >> ok with that? I can re-spin the patch if that sounds like a workable >> option. >> > > No, this caused hundreds of machines to hit soft lockups for Google > because there's no synchronization that prevents dozens of cpus to take > tasklist_lock in the oom killer during parallel memcg oom conditions and > never allow the write_lock_irq() on fork() or exit() to make progress. We > absolutely must hold tasklist_lock for as little time as possible in the > oom killer. > > That said, I've never actually seen your reported bug manifest in our > production environment so let's see if Oleg has any ideas. Is the path you are referring to mem_cgroup_out_of_memory calling oom_kill_process? If so, then that path doesn't appear to suffer from the two step select_bad_process, oom_kill_process race since mem_cgroup_out_of_memory directly calls oom_kill_process without going through select_bad_process. This also means that the patch I sent is incorrect since it removes the existing tasklist_lock protection in oom_kill_process. Respinning patch to take care of this case. -- Sameer ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: Fix race when selecting process to kill @ 2013-11-07 19:34 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-07 19:34 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, msb, Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, oleg, linux-mm, linux-kernel On Wed, Nov 6, 2013 at 4:35 PM, David Rientjes <rientjes@google.com> wrote: > On Wed, 6 Nov 2013, Sameer Nanda wrote: > >> David -- I think we can make the duration that the tasklist_lock is >> held smaller by consolidating the process selection logic that is >> currently split across select_bad_process and oom_kill_process into >> one place in select_bad_process. The tasklist_lock would then need to >> be held only when the thread lists are being traversed. Would you be >> ok with that? I can re-spin the patch if that sounds like a workable >> option. >> > > No, this caused hundreds of machines to hit soft lockups for Google > because there's no synchronization that prevents dozens of cpus to take > tasklist_lock in the oom killer during parallel memcg oom conditions and > never allow the write_lock_irq() on fork() or exit() to make progress. We > absolutely must hold tasklist_lock for as little time as possible in the > oom killer. > > That said, I've never actually seen your reported bug manifest in our > production environment so let's see if Oleg has any ideas. Is the path you are referring to mem_cgroup_out_of_memory calling oom_kill_process? If so, then that path doesn't appear to suffer from the two step select_bad_process, oom_kill_process race since mem_cgroup_out_of_memory directly calls oom_kill_process without going through select_bad_process. This also means that the patch I sent is incorrect since it removes the existing tasklist_lock protection in oom_kill_process. Respinning patch to take care of this case. -- Sameer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v2] mm, oom: Fix race when selecting process to kill 2013-11-07 0:35 ` David Rientjes @ 2013-11-08 18:07 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-08 18:07 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Change-Id: I865c64486ccfc0e4818e7045a8fa3353e2fb63f8 Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..af99b1a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); + read_lock(&tasklist_lock); + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } @@ -436,7 +439,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; -- 1.8.4.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2] mm, oom: Fix race when selecting process to kill @ 2013-11-08 18:07 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-08 18:07 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Change-Id: I865c64486ccfc0e4818e7045a8fa3353e2fb63f8 Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..af99b1a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); + read_lock(&tasklist_lock); + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } @@ -436,7 +439,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; -- 1.8.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v2] mm, oom: Fix race when selecting process to kill 2013-11-08 18:07 ` Sameer Nanda @ 2013-11-08 18:45 ` Oleg Nesterov -1 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-08 18:45 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel Sorry. I didn't have time to answer other emails, will try to do later. And yes, yes, while_each_thread() should be fixed, still on my TODO list... But just in case, whatever we do with while_each_thread() we should also fix some users. Until then, On 11/08, Sameer Nanda wrote: > > @@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, > DEFAULT_RATELIMIT_BURST); > > + read_lock(&tasklist_lock); > + > /* > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING || !pid_alive(p)) { OK. > - read_lock(&tasklist_lock); But you should also move read_unlock_down(), at least after find_lock_task_mm(). And of course, this doesn't fix other users in oom_kill.c. Oleg. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2] mm, oom: Fix race when selecting process to kill @ 2013-11-08 18:45 ` Oleg Nesterov 0 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-08 18:45 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel Sorry. I didn't have time to answer other emails, will try to do later. And yes, yes, while_each_thread() should be fixed, still on my TODO list... But just in case, whatever we do with while_each_thread() we should also fix some users. Until then, On 11/08, Sameer Nanda wrote: > > @@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, > DEFAULT_RATELIMIT_BURST); > > + read_lock(&tasklist_lock); > + > /* > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING || !pid_alive(p)) { OK. > - read_lock(&tasklist_lock); But you should also move read_unlock_down(), at least after find_lock_task_mm(). And of course, this doesn't fix other users in oom_kill.c. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v3] mm, oom: Fix race when selecting process to kill 2013-11-08 18:45 ` Oleg Nesterov @ 2013-11-08 19:49 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-08 19:49 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..7b28d9f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, DEFAULT_RATELIMIT_BURST); /* + * while_each_thread is currently not RCU safe. Lets hold the + * tasklist_lock across all invocations of while_each_thread (including + * the one in find_lock_task_mm) in this function. + */ + read_lock(&tasklist_lock); + + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } @@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); rcu_read_lock(); + p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { rcu_read_unlock(); put_task_struct(victim); -- 1.8.4.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3] mm, oom: Fix race when selecting process to kill @ 2013-11-08 19:49 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-08 19:49 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..7b28d9f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, DEFAULT_RATELIMIT_BURST); /* + * while_each_thread is currently not RCU safe. Lets hold the + * tasklist_lock across all invocations of while_each_thread (including + * the one in find_lock_task_mm) in this function. + */ + read_lock(&tasklist_lock); + + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } @@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); rcu_read_lock(); + p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { rcu_read_unlock(); put_task_struct(victim); -- 1.8.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3] mm, oom: Fix race when selecting process to kill 2013-11-08 19:49 ` Sameer Nanda @ 2013-11-09 15:16 ` Oleg Nesterov -1 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-09 15:16 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On 11/08, Sameer Nanda wrote: > > @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > DEFAULT_RATELIMIT_BURST); > @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > } > } > } while_each_thread(p, t); > - read_unlock(&tasklist_lock); > > rcu_read_lock(); > + > p = find_lock_task_mm(victim); > + > + /* > + * Since while_each_thread is currently not RCU safe, this unlock of > + * tasklist_lock may need to be moved further down if any additional > + * while_each_thread loops get added to this function. > + */ > + read_unlock(&tasklist_lock); Well, ack... but with this change find_lock_task_mm() relies on tasklist, so it makes sense to move rcu_read_lock() down before for_each_process(). Otherwise this looks confusing, but I won't insist. Oleg. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3] mm, oom: Fix race when selecting process to kill @ 2013-11-09 15:16 ` Oleg Nesterov 0 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-09 15:16 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On 11/08, Sameer Nanda wrote: > > @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > DEFAULT_RATELIMIT_BURST); > @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > } > } > } while_each_thread(p, t); > - read_unlock(&tasklist_lock); > > rcu_read_lock(); > + > p = find_lock_task_mm(victim); > + > + /* > + * Since while_each_thread is currently not RCU safe, this unlock of > + * tasklist_lock may need to be moved further down if any additional > + * while_each_thread loops get added to this function. > + */ > + read_unlock(&tasklist_lock); Well, ack... but with this change find_lock_task_mm() relies on tasklist, so it makes sense to move rcu_read_lock() down before for_each_process(). Otherwise this looks confusing, but I won't insist. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3] mm, oom: Fix race when selecting process to kill 2013-11-09 15:16 ` Oleg Nesterov (?) @ 2013-11-11 23:15 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-11 23:15 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, mhocko, David Rientjes, Johannes Weiner, Rusty Russell, Luigi Semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1189 bytes --] On Sat, Nov 9, 2013 at 7:16 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 11/08, Sameer Nanda wrote: > > > > @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t > gfp_mask, int order, > > DEFAULT_RATELIMIT_BURST); > > @@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t > gfp_mask, int order, > > } > > } > > } while_each_thread(p, t); > > - read_unlock(&tasklist_lock); > > > > rcu_read_lock(); > > + > > p = find_lock_task_mm(victim); > > + > > + /* > > + * Since while_each_thread is currently not RCU safe, this unlock > of > > + * tasklist_lock may need to be moved further down if any > additional > > + * while_each_thread loops get added to this function. > > + */ > > + read_unlock(&tasklist_lock); > > Well, ack... but with this change find_lock_task_mm() relies on tasklist, > so it makes sense to move rcu_read_lock() down before for_each_process(). > Otherwise this looks confusing, but I won't insist. > Agreed that this looks a bit confusing. I will respin the patch. > > Oleg. > > -- Sameer [-- Attachment #2: Type: text/html, Size: 2027 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v4] mm, oom: Fix race when selecting process to kill 2013-11-09 15:16 ` Oleg Nesterov @ 2013-11-12 0:21 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-12 0:21 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..57638ef 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, DEFAULT_RATELIMIT_BURST); /* + * while_each_thread is currently not RCU safe. Lets hold the + * tasklist_lock across all invocations of while_each_thread (including + * the one in find_lock_task_mm) in this function. + */ + read_lock(&tasklist_lock); + + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } @@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,12 +463,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); - rcu_read_lock(); p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { - rcu_read_unlock(); put_task_struct(victim); return; } else if (victim != p) { @@ -478,6 +490,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, K(get_mm_counter(victim->mm, MM_FILEPAGES))); task_unlock(victim); + rcu_read_lock(); + /* * Kill all user processes sharing victim->mm in other thread groups, if * any. They don't get access to memory reserves, though, to avoid -- 1.8.4.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v4] mm, oom: Fix race when selecting process to kill @ 2013-11-12 0:21 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-12 0:21 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..57638ef 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, DEFAULT_RATELIMIT_BURST); /* + * while_each_thread is currently not RCU safe. Lets hold the + * tasklist_lock across all invocations of while_each_thread (including + * the one in find_lock_task_mm) in this function. + */ + read_lock(&tasklist_lock); + + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } @@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,12 +463,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); - rcu_read_lock(); p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { - rcu_read_unlock(); put_task_struct(victim); return; } else if (victim != p) { @@ -478,6 +490,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, K(get_mm_counter(victim->mm, MM_FILEPAGES))); task_unlock(victim); + rcu_read_lock(); + /* * Kill all user processes sharing victim->mm in other thread groups, if * any. They don't get access to memory reserves, though, to avoid -- 1.8.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill 2013-11-12 0:21 ` Sameer Nanda @ 2013-11-12 15:13 ` Michal Hocko -1 siblings, 0 replies; 51+ messages in thread From: Michal Hocko @ 2013-11-12 15:13 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb, linux-mm, linux-kernel On Mon 11-11-13 16:21:57, Sameer Nanda wrote: > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. > > Signed-off-by: Sameer Nanda <snanda@chromium.org> > --- > mm/oom_kill.c | 24 +++++++++++++++++++----- > 1 file changed, 19 insertions(+), 5 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 6738c47..57638ef 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > DEFAULT_RATELIMIT_BURST); > > /* > + * while_each_thread is currently not RCU safe. Lets hold the > + * tasklist_lock across all invocations of while_each_thread (including > + * the one in find_lock_task_mm) in this function. > + */ > + read_lock(&tasklist_lock); > + > + /* > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING || !pid_alive(p)) { > set_tsk_thread_flag(p, TIF_MEMDIE); > put_task_struct(p); > + read_unlock(&tasklist_lock); > return; > } show_mem used to be one of a bottleneck but now that we have Mel's "mm: do not walk all of system memory during show_mem" it shouldn't be a big deal anymore. The real trouble is with dump_tasks which might be zillions of tasks and we do not want to hold tasklist_lock for that long. So no this would regress on the huge machines and yes we have seen reports like that and explicit requests to backport 6b0c81b3be114 (mm, oom: reduce dependency on tasklist_lock) so this would be a step backwards although I see there is a real problem that it tries to fix. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill @ 2013-11-12 15:13 ` Michal Hocko 0 siblings, 0 replies; 51+ messages in thread From: Michal Hocko @ 2013-11-12 15:13 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb, linux-mm, linux-kernel On Mon 11-11-13 16:21:57, Sameer Nanda wrote: > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. > > Signed-off-by: Sameer Nanda <snanda@chromium.org> > --- > mm/oom_kill.c | 24 +++++++++++++++++++----- > 1 file changed, 19 insertions(+), 5 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 6738c47..57638ef 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > DEFAULT_RATELIMIT_BURST); > > /* > + * while_each_thread is currently not RCU safe. Lets hold the > + * tasklist_lock across all invocations of while_each_thread (including > + * the one in find_lock_task_mm) in this function. > + */ > + read_lock(&tasklist_lock); > + > + /* > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING || !pid_alive(p)) { > set_tsk_thread_flag(p, TIF_MEMDIE); > put_task_struct(p); > + read_unlock(&tasklist_lock); > return; > } show_mem used to be one of a bottleneck but now that we have Mel's "mm: do not walk all of system memory during show_mem" it shouldn't be a big deal anymore. The real trouble is with dump_tasks which might be zillions of tasks and we do not want to hold tasklist_lock for that long. So no this would regress on the huge machines and yes we have seen reports like that and explicit requests to backport 6b0c81b3be114 (mm, oom: reduce dependency on tasklist_lock) so this would be a step backwards although I see there is a real problem that it tries to fix. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill 2013-11-12 0:21 ` Sameer Nanda @ 2013-11-12 20:01 ` Oleg Nesterov -1 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-12 20:01 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On 11/11, Sameer Nanda wrote: > > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. OK, looks correct to me. Thanks. Yes, this is a step backwards, hopefully we will revert this patch soon. I am starting to think something like while_each_thread_lame_but_safe() makes sense before we really fix this nasty (and afaics not simple) problem with with while_each_thread() (which should die). Oleg. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill @ 2013-11-12 20:01 ` Oleg Nesterov 0 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-12 20:01 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On 11/11, Sameer Nanda wrote: > > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. OK, looks correct to me. Thanks. Yes, this is a step backwards, hopefully we will revert this patch soon. I am starting to think something like while_each_thread_lame_but_safe() makes sense before we really fix this nasty (and afaics not simple) problem with with while_each_thread() (which should die). Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill 2013-11-12 20:01 ` Oleg Nesterov @ 2013-11-12 20:08 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-12 20:08 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, mhocko, David Rientjes, Johannes Weiner, Rusty Russell, Luigi Semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On Tue, Nov 12, 2013 at 12:01 PM, Oleg Nesterov <oleg@redhat.com> wrote: > On 11/11, Sameer Nanda wrote: >> >> The selection of the process to be killed happens in two spots: >> first in select_bad_process and then a further refinement by >> looking for child processes in oom_kill_process. Since this is >> a two step process, it is possible that the process selected by >> select_bad_process may get a SIGKILL just before oom_kill_process >> executes. If this were to happen, __unhash_process deletes this >> process from the thread_group list. This results in oom_kill_process >> getting stuck in an infinite loop when traversing the thread_group >> list of the selected process. >> >> Fix this race by adding a pid_alive check for the selected process >> with tasklist_lock held in oom_kill_process. > > OK, looks correct to me. Thanks. > > > Yes, this is a step backwards, hopefully we will revert this patch soon. > I am starting to think something like while_each_thread_lame_but_safe() > makes sense before we really fix this nasty (and afaics not simple) > problem with with while_each_thread() (which should die). Looking forward to a real fix for the nasty problems with while_each_thread. In the meanwhile, let me float one more (hopefully, the last) version of this patch that should address Michal's concern. Thanks for your feedback! > > Oleg. > -- Sameer ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4] mm, oom: Fix race when selecting process to kill @ 2013-11-12 20:08 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-12 20:08 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, mhocko, David Rientjes, Johannes Weiner, Rusty Russell, Luigi Semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On Tue, Nov 12, 2013 at 12:01 PM, Oleg Nesterov <oleg@redhat.com> wrote: > On 11/11, Sameer Nanda wrote: >> >> The selection of the process to be killed happens in two spots: >> first in select_bad_process and then a further refinement by >> looking for child processes in oom_kill_process. Since this is >> a two step process, it is possible that the process selected by >> select_bad_process may get a SIGKILL just before oom_kill_process >> executes. If this were to happen, __unhash_process deletes this >> process from the thread_group list. This results in oom_kill_process >> getting stuck in an infinite loop when traversing the thread_group >> list of the selected process. >> >> Fix this race by adding a pid_alive check for the selected process >> with tasklist_lock held in oom_kill_process. > > OK, looks correct to me. Thanks. > > > Yes, this is a step backwards, hopefully we will revert this patch soon. > I am starting to think something like while_each_thread_lame_but_safe() > makes sense before we really fix this nasty (and afaics not simple) > problem with with while_each_thread() (which should die). Looking forward to a real fix for the nasty problems with while_each_thread. In the meanwhile, let me float one more (hopefully, the last) version of this patch that should address Michal's concern. Thanks for your feedback! > > Oleg. > -- Sameer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v5] mm, oom: Fix race when selecting process to kill 2013-11-12 20:08 ` Sameer Nanda @ 2013-11-12 20:23 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-12 20:23 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31 Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 42 +++++++++++++++++++++++++++++------------- 1 file changed, 29 insertions(+), 13 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..5108c2b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); + if (__ratelimit(&oom_rs)) + dump_header(p, gfp_mask, order, memcg, nodemask); + + task_lock(p); + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + task_unlock(p); + + /* + * while_each_thread is currently not RCU safe. Lets hold the + * tasklist_lock across all invocations of while_each_thread (including + * the one in find_lock_task_mm) in this function. + */ + read_lock(&tasklist_lock); + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { + pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n", + message, task_pid_nr(p)); set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } - if (__ratelimit(&oom_rs)) - dump_header(p, gfp_mask, order, memcg, nodemask); - - task_lock(p); - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - task_unlock(p); - /* * If any of p's children has a different mm and is eligible for kill, * the one with the highest oom_badness() score is sacrificed for its * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); - rcu_read_lock(); p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { - rcu_read_unlock(); put_task_struct(victim); return; } else if (victim != p) { @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, K(get_mm_counter(victim->mm, MM_FILEPAGES))); task_unlock(victim); + rcu_read_lock(); + /* * Kill all user processes sharing victim->mm in other thread groups, if * any. They don't get access to memory reserves, though, to avoid -- 1.8.4.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v5] mm, oom: Fix race when selecting process to kill @ 2013-11-12 20:23 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-12 20:23 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31 Signed-off-by: Sameer Nanda <snanda@chromium.org> --- mm/oom_kill.c | 42 +++++++++++++++++++++++++++++------------- 1 file changed, 29 insertions(+), 13 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..5108c2b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); + if (__ratelimit(&oom_rs)) + dump_header(p, gfp_mask, order, memcg, nodemask); + + task_lock(p); + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + task_unlock(p); + + /* + * while_each_thread is currently not RCU safe. Lets hold the + * tasklist_lock across all invocations of while_each_thread (including + * the one in find_lock_task_mm) in this function. + */ + read_lock(&tasklist_lock); + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { + pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n", + message, task_pid_nr(p)); set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } - if (__ratelimit(&oom_rs)) - dump_header(p, gfp_mask, order, memcg, nodemask); - - task_lock(p); - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - task_unlock(p); - /* * If any of p's children has a different mm and is eligible for kill, * the one with the highest oom_badness() score is sacrificed for its * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); - rcu_read_lock(); p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { - rcu_read_unlock(); put_task_struct(victim); return; } else if (victim != p) { @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, K(get_mm_counter(victim->mm, MM_FILEPAGES))); task_unlock(victim); + rcu_read_lock(); + /* * Kill all user processes sharing victim->mm in other thread groups, if * any. They don't get access to memory reserves, though, to avoid -- 1.8.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill 2013-11-12 20:23 ` Sameer Nanda @ 2013-11-13 2:33 ` David Rientjes -1 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-13 2:33 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb, linux-mm, linux-kernel On Tue, 12 Nov 2013, Sameer Nanda wrote: > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. > > Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31 Is this needed? > Signed-off-by: Sameer Nanda <snanda@chromium.org> > --- > mm/oom_kill.c | 42 +++++++++++++++++++++++++++++------------- > 1 file changed, 29 insertions(+), 13 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 6738c47..5108c2b 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, > DEFAULT_RATELIMIT_BURST); > > + if (__ratelimit(&oom_rs)) > + dump_header(p, gfp_mask, order, memcg, nodemask); > + > + task_lock(p); > + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", > + message, task_pid_nr(p), p->comm, points); > + task_unlock(p); > + > + /* > + * while_each_thread is currently not RCU safe. Lets hold the > + * tasklist_lock across all invocations of while_each_thread (including > + * the one in find_lock_task_mm) in this function. > + */ > + read_lock(&tasklist_lock); > + > /* > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING || !pid_alive(p)) { > + pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n", > + message, task_pid_nr(p)); That makes no sense in the kernel log to have Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child Out of Memory: Not killing process 1234, just setting TIF_MEMDIE Those are contradictory statements (and will actually mess with kernel log parsing at Google) and nobody other than kernel developers are going to know what TIF_MEMDIE is. > set_tsk_thread_flag(p, TIF_MEMDIE); > put_task_struct(p); > + read_unlock(&tasklist_lock); > return; > } > > - if (__ratelimit(&oom_rs)) > - dump_header(p, gfp_mask, order, memcg, nodemask); > - > - task_lock(p); > - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", > - message, task_pid_nr(p), p->comm, points); > - task_unlock(p); > - > /* > * If any of p's children has a different mm and is eligible for kill, > * the one with the highest oom_badness() score is sacrificed for its > * parent. This attempts to lose the minimal amount of work done while > * still freeing memory. > */ > - read_lock(&tasklist_lock); > do { > list_for_each_entry(child, &t->children, sibling) { > unsigned int child_points; > @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > } > } > } while_each_thread(p, t); > - read_unlock(&tasklist_lock); > > - rcu_read_lock(); > p = find_lock_task_mm(victim); > + > + /* > + * Since while_each_thread is currently not RCU safe, this unlock of > + * tasklist_lock may need to be moved further down if any additional > + * while_each_thread loops get added to this function. > + */ This comment should be moved to sched.h to indicate how while_each_thread() needs to be handled with respect to tasklist_lock, it's not specific to the oom killer. > + read_unlock(&tasklist_lock); > + > if (!p) { > - rcu_read_unlock(); > put_task_struct(victim); > return; > } else if (victim != p) { > @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > K(get_mm_counter(victim->mm, MM_FILEPAGES))); > task_unlock(victim); > > + rcu_read_lock(); > + > /* > * Kill all user processes sharing victim->mm in other thread groups, if > * any. They don't get access to memory reserves, though, to avoid Please move this rcu_read_lock() to be immediatley before the for_each_process() instead of before the comment. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill @ 2013-11-13 2:33 ` David Rientjes 0 siblings, 0 replies; 51+ messages in thread From: David Rientjes @ 2013-11-13 2:33 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb, linux-mm, linux-kernel On Tue, 12 Nov 2013, Sameer Nanda wrote: > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. > > Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31 Is this needed? > Signed-off-by: Sameer Nanda <snanda@chromium.org> > --- > mm/oom_kill.c | 42 +++++++++++++++++++++++++++++------------- > 1 file changed, 29 insertions(+), 13 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 6738c47..5108c2b 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, > DEFAULT_RATELIMIT_BURST); > > + if (__ratelimit(&oom_rs)) > + dump_header(p, gfp_mask, order, memcg, nodemask); > + > + task_lock(p); > + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", > + message, task_pid_nr(p), p->comm, points); > + task_unlock(p); > + > + /* > + * while_each_thread is currently not RCU safe. Lets hold the > + * tasklist_lock across all invocations of while_each_thread (including > + * the one in find_lock_task_mm) in this function. > + */ > + read_lock(&tasklist_lock); > + > /* > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING || !pid_alive(p)) { > + pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n", > + message, task_pid_nr(p)); That makes no sense in the kernel log to have Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child Out of Memory: Not killing process 1234, just setting TIF_MEMDIE Those are contradictory statements (and will actually mess with kernel log parsing at Google) and nobody other than kernel developers are going to know what TIF_MEMDIE is. > set_tsk_thread_flag(p, TIF_MEMDIE); > put_task_struct(p); > + read_unlock(&tasklist_lock); > return; > } > > - if (__ratelimit(&oom_rs)) > - dump_header(p, gfp_mask, order, memcg, nodemask); > - > - task_lock(p); > - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", > - message, task_pid_nr(p), p->comm, points); > - task_unlock(p); > - > /* > * If any of p's children has a different mm and is eligible for kill, > * the one with the highest oom_badness() score is sacrificed for its > * parent. This attempts to lose the minimal amount of work done while > * still freeing memory. > */ > - read_lock(&tasklist_lock); > do { > list_for_each_entry(child, &t->children, sibling) { > unsigned int child_points; > @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > } > } > } while_each_thread(p, t); > - read_unlock(&tasklist_lock); > > - rcu_read_lock(); > p = find_lock_task_mm(victim); > + > + /* > + * Since while_each_thread is currently not RCU safe, this unlock of > + * tasklist_lock may need to be moved further down if any additional > + * while_each_thread loops get added to this function. > + */ This comment should be moved to sched.h to indicate how while_each_thread() needs to be handled with respect to tasklist_lock, it's not specific to the oom killer. > + read_unlock(&tasklist_lock); > + > if (!p) { > - rcu_read_unlock(); > put_task_struct(victim); > return; > } else if (victim != p) { > @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > K(get_mm_counter(victim->mm, MM_FILEPAGES))); > task_unlock(victim); > > + rcu_read_lock(); > + > /* > * Kill all user processes sharing victim->mm in other thread groups, if > * any. They don't get access to memory reserves, though, to avoid Please move this rcu_read_lock() to be immediatley before the for_each_process() instead of before the comment. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill 2013-11-13 2:33 ` David Rientjes @ 2013-11-13 16:46 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-13 16:46 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, Luigi Semenzato, Vladimir Murzin, Oleg Nesterov, Sergey Dyasly, msb, linux-mm, linux-kernel On Tue, Nov 12, 2013 at 6:33 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 12 Nov 2013, Sameer Nanda wrote: > >> The selection of the process to be killed happens in two spots: >> first in select_bad_process and then a further refinement by >> looking for child processes in oom_kill_process. Since this is >> a two step process, it is possible that the process selected by >> select_bad_process may get a SIGKILL just before oom_kill_process >> executes. If this were to happen, __unhash_process deletes this >> process from the thread_group list. This results in oom_kill_process >> getting stuck in an infinite loop when traversing the thread_group >> list of the selected process. >> >> Fix this race by adding a pid_alive check for the selected process >> with tasklist_lock held in oom_kill_process. >> >> Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31 > > Is this needed? No, it's not. It's a side effect of using Chrome OS tools to manage the patches. I will make sure to remove it on the next version of the patch. > >> Signed-off-by: Sameer Nanda <snanda@chromium.org> >> --- >> mm/oom_kill.c | 42 +++++++++++++++++++++++++++++------------- >> 1 file changed, 29 insertions(+), 13 deletions(-) >> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 6738c47..5108c2b 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, >> static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, >> DEFAULT_RATELIMIT_BURST); >> >> + if (__ratelimit(&oom_rs)) >> + dump_header(p, gfp_mask, order, memcg, nodemask); >> + >> + task_lock(p); >> + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> + message, task_pid_nr(p), p->comm, points); >> + task_unlock(p); >> + >> + /* >> + * while_each_thread is currently not RCU safe. Lets hold the >> + * tasklist_lock across all invocations of while_each_thread (including >> + * the one in find_lock_task_mm) in this function. >> + */ >> + read_lock(&tasklist_lock); >> + >> /* >> * If the task is already exiting, don't alarm the sysadmin or kill >> * its children or threads, just set TIF_MEMDIE so it can die quickly >> */ >> - if (p->flags & PF_EXITING) { >> + if (p->flags & PF_EXITING || !pid_alive(p)) { >> + pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n", >> + message, task_pid_nr(p)); > > That makes no sense in the kernel log to have > > Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child > Out of Memory: Not killing process 1234, just setting TIF_MEMDIE > > Those are contradictory statements (and will actually mess with kernel log > parsing at Google) and nobody other than kernel developers are going to > know what TIF_MEMDIE is. Since the "Kill process" printk has now moved above the (p->flags & PF_EXITING || !pid_alive(p)) check, it is possible that oom_kill_process will emit the "Kill process" message but will not actually try to kill the process or its child. The new "Not killing process" printk helps disambiguate this case from when a process is actually killed by oom_kill_process. However, since you are finding it confusing, let me remove it. > >> set_tsk_thread_flag(p, TIF_MEMDIE); >> put_task_struct(p); >> + read_unlock(&tasklist_lock); >> return; >> } >> >> - if (__ratelimit(&oom_rs)) >> - dump_header(p, gfp_mask, order, memcg, nodemask); >> - >> - task_lock(p); >> - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> - message, task_pid_nr(p), p->comm, points); >> - task_unlock(p); >> - >> /* >> * If any of p's children has a different mm and is eligible for kill, >> * the one with the highest oom_badness() score is sacrificed for its >> * parent. This attempts to lose the minimal amount of work done while >> * still freeing memory. >> */ >> - read_lock(&tasklist_lock); >> do { >> list_for_each_entry(child, &t->children, sibling) { >> unsigned int child_points; >> @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, >> } >> } >> } while_each_thread(p, t); >> - read_unlock(&tasklist_lock); >> >> - rcu_read_lock(); >> p = find_lock_task_mm(victim); >> + >> + /* >> + * Since while_each_thread is currently not RCU safe, this unlock of >> + * tasklist_lock may need to be moved further down if any additional >> + * while_each_thread loops get added to this function. >> + */ > > This comment should be moved to sched.h to indicate how > while_each_thread() needs to be handled with respect to tasklist_lock, > it's not specific to the oom killer. OK. > >> + read_unlock(&tasklist_lock); >> + >> if (!p) { >> - rcu_read_unlock(); >> put_task_struct(victim); >> return; >> } else if (victim != p) { >> @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, >> K(get_mm_counter(victim->mm, MM_FILEPAGES))); >> task_unlock(victim); >> >> + rcu_read_lock(); >> + >> /* >> * Kill all user processes sharing victim->mm in other thread groups, if >> * any. They don't get access to memory reserves, though, to avoid > > Please move this rcu_read_lock() to be immediatley before the > for_each_process() instead of before the comment. OK. -- Sameer ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v5] mm, oom: Fix race when selecting process to kill @ 2013-11-13 16:46 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-13 16:46 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, mhocko, Johannes Weiner, Rusty Russell, Luigi Semenzato, Vladimir Murzin, Oleg Nesterov, Sergey Dyasly, msb, linux-mm, linux-kernel On Tue, Nov 12, 2013 at 6:33 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 12 Nov 2013, Sameer Nanda wrote: > >> The selection of the process to be killed happens in two spots: >> first in select_bad_process and then a further refinement by >> looking for child processes in oom_kill_process. Since this is >> a two step process, it is possible that the process selected by >> select_bad_process may get a SIGKILL just before oom_kill_process >> executes. If this were to happen, __unhash_process deletes this >> process from the thread_group list. This results in oom_kill_process >> getting stuck in an infinite loop when traversing the thread_group >> list of the selected process. >> >> Fix this race by adding a pid_alive check for the selected process >> with tasklist_lock held in oom_kill_process. >> >> Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31 > > Is this needed? No, it's not. It's a side effect of using Chrome OS tools to manage the patches. I will make sure to remove it on the next version of the patch. > >> Signed-off-by: Sameer Nanda <snanda@chromium.org> >> --- >> mm/oom_kill.c | 42 +++++++++++++++++++++++++++++------------- >> 1 file changed, 29 insertions(+), 13 deletions(-) >> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 6738c47..5108c2b 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, >> static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, >> DEFAULT_RATELIMIT_BURST); >> >> + if (__ratelimit(&oom_rs)) >> + dump_header(p, gfp_mask, order, memcg, nodemask); >> + >> + task_lock(p); >> + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> + message, task_pid_nr(p), p->comm, points); >> + task_unlock(p); >> + >> + /* >> + * while_each_thread is currently not RCU safe. Lets hold the >> + * tasklist_lock across all invocations of while_each_thread (including >> + * the one in find_lock_task_mm) in this function. >> + */ >> + read_lock(&tasklist_lock); >> + >> /* >> * If the task is already exiting, don't alarm the sysadmin or kill >> * its children or threads, just set TIF_MEMDIE so it can die quickly >> */ >> - if (p->flags & PF_EXITING) { >> + if (p->flags & PF_EXITING || !pid_alive(p)) { >> + pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n", >> + message, task_pid_nr(p)); > > That makes no sense in the kernel log to have > > Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child > Out of Memory: Not killing process 1234, just setting TIF_MEMDIE > > Those are contradictory statements (and will actually mess with kernel log > parsing at Google) and nobody other than kernel developers are going to > know what TIF_MEMDIE is. Since the "Kill process" printk has now moved above the (p->flags & PF_EXITING || !pid_alive(p)) check, it is possible that oom_kill_process will emit the "Kill process" message but will not actually try to kill the process or its child. The new "Not killing process" printk helps disambiguate this case from when a process is actually killed by oom_kill_process. However, since you are finding it confusing, let me remove it. > >> set_tsk_thread_flag(p, TIF_MEMDIE); >> put_task_struct(p); >> + read_unlock(&tasklist_lock); >> return; >> } >> >> - if (__ratelimit(&oom_rs)) >> - dump_header(p, gfp_mask, order, memcg, nodemask); >> - >> - task_lock(p); >> - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> - message, task_pid_nr(p), p->comm, points); >> - task_unlock(p); >> - >> /* >> * If any of p's children has a different mm and is eligible for kill, >> * the one with the highest oom_badness() score is sacrificed for its >> * parent. This attempts to lose the minimal amount of work done while >> * still freeing memory. >> */ >> - read_lock(&tasklist_lock); >> do { >> list_for_each_entry(child, &t->children, sibling) { >> unsigned int child_points; >> @@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, >> } >> } >> } while_each_thread(p, t); >> - read_unlock(&tasklist_lock); >> >> - rcu_read_lock(); >> p = find_lock_task_mm(victim); >> + >> + /* >> + * Since while_each_thread is currently not RCU safe, this unlock of >> + * tasklist_lock may need to be moved further down if any additional >> + * while_each_thread loops get added to this function. >> + */ > > This comment should be moved to sched.h to indicate how > while_each_thread() needs to be handled with respect to tasklist_lock, > it's not specific to the oom killer. OK. > >> + read_unlock(&tasklist_lock); >> + >> if (!p) { >> - rcu_read_unlock(); >> put_task_struct(victim); >> return; >> } else if (victim != p) { >> @@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, >> K(get_mm_counter(victim->mm, MM_FILEPAGES))); >> task_unlock(victim); >> >> + rcu_read_lock(); >> + >> /* >> * Kill all user processes sharing victim->mm in other thread groups, if >> * any. They don't get access to memory reserves, though, to avoid > > Please move this rcu_read_lock() to be immediatley before the > for_each_process() instead of before the comment. OK. -- Sameer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v6] mm, oom: Fix race when selecting process to kill 2013-11-13 16:46 ` Sameer Nanda @ 2013-11-13 17:18 ` Sameer Nanda -1 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-13 17:18 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Signed-off-by: Sameer Nanda <snanda@chromium.org> --- include/linux/sched.h | 5 +++++ mm/oom_kill.c | 34 +++++++++++++++++++++------------- 2 files changed, 26 insertions(+), 13 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index e27baee..8975dbb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void); #define do_each_thread(g, t) \ for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do +/* + * Careful: while_each_thread is not RCU safe. Callers should hold + * read_lock(tasklist_lock) across while_each_thread loops. + */ + #define while_each_thread(g, t) \ while ((t = next_thread(t)) != g) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..0d1f804 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); + if (__ratelimit(&oom_rs)) + dump_header(p, gfp_mask, order, memcg, nodemask); + + task_lock(p); + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + task_unlock(p); + + read_lock(&tasklist_lock); + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } - if (__ratelimit(&oom_rs)) - dump_header(p, gfp_mask, order, memcg, nodemask); - - task_lock(p); - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - task_unlock(p); - /* * If any of p's children has a different mm and is eligible for kill, * the one with the highest oom_badness() score is sacrificed for its * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); - rcu_read_lock(); p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { - rcu_read_unlock(); put_task_struct(victim); return; } else if (victim != p) { @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * That thread will now get access to memory reserves since it has a * pending fatal signal. */ + rcu_read_lock(); for_each_process(p) if (p->mm == mm && !same_thread_group(p, victim) && !(p->flags & PF_KTHREAD)) { -- 1.8.4.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v6] mm, oom: Fix race when selecting process to kill @ 2013-11-13 17:18 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-13 17:18 UTC (permalink / raw) To: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, oleg, dserrg, msb Cc: linux-mm, linux-kernel, Sameer Nanda The selection of the process to be killed happens in two spots: first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just before oom_kill_process executes. If this were to happen, __unhash_process deletes this process from the thread_group list. This results in oom_kill_process getting stuck in an infinite loop when traversing the thread_group list of the selected process. Fix this race by adding a pid_alive check for the selected process with tasklist_lock held in oom_kill_process. Signed-off-by: Sameer Nanda <snanda@chromium.org> --- include/linux/sched.h | 5 +++++ mm/oom_kill.c | 34 +++++++++++++++++++++------------- 2 files changed, 26 insertions(+), 13 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index e27baee..8975dbb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void); #define do_each_thread(g, t) \ for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do +/* + * Careful: while_each_thread is not RCU safe. Callers should hold + * read_lock(tasklist_lock) across while_each_thread loops. + */ + #define while_each_thread(g, t) \ while ((t = next_thread(t)) != g) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6738c47..0d1f804 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); + if (__ratelimit(&oom_rs)) + dump_header(p, gfp_mask, order, memcg, nodemask); + + task_lock(p); + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + task_unlock(p); + + read_lock(&tasklist_lock); + /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING || !pid_alive(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + read_unlock(&tasklist_lock); return; } - if (__ratelimit(&oom_rs)) - dump_header(p, gfp_mask, order, memcg, nodemask); - - task_lock(p); - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - task_unlock(p); - /* * If any of p's children has a different mm and is eligible for kill, * the one with the highest oom_badness() score is sacrificed for its * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ - read_lock(&tasklist_lock); do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } } while_each_thread(p, t); - read_unlock(&tasklist_lock); - rcu_read_lock(); p = find_lock_task_mm(victim); + + /* + * Since while_each_thread is currently not RCU safe, this unlock of + * tasklist_lock may need to be moved further down if any additional + * while_each_thread loops get added to this function. + */ + read_unlock(&tasklist_lock); + if (!p) { - rcu_read_unlock(); put_task_struct(victim); return; } else if (victim != p) { @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * That thread will now get access to memory reserves since it has a * pending fatal signal. */ + rcu_read_lock(); for_each_process(p) if (p->mm == mm && !same_thread_group(p, victim) && !(p->flags & PF_KTHREAD)) { -- 1.8.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill 2013-11-13 17:18 ` Sameer Nanda @ 2013-11-13 17:29 ` Oleg Nesterov -1 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-13 17:29 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On 11/13, Sameer Nanda wrote: > > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. I am fine with this patch as well, but honestly I'd prefer the previous v5. I won't argue though. > +/* > + * Careful: while_each_thread is not RCU safe. Callers should hold > + * read_lock(tasklist_lock) across while_each_thread loops. > + */ (tasklist_lock or siglock, in fact but this doesn't matter). This is not that simple, even tasklist_lock can't help if the task is already dead. Oh. Yes, sorry. I promised to send the patches "soon" many times, but still didn't find the time. Perhaps I should try to start with the "make this all less buggy" changes, the "complete" fix needs to change the callers as well. Oleg. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill @ 2013-11-13 17:29 ` Oleg Nesterov 0 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2013-11-13 17:29 UTC (permalink / raw) To: Sameer Nanda Cc: akpm, mhocko, rientjes, hannes, rusty, semenzato, murzin.v, dserrg, msb, linux-mm, linux-kernel On 11/13, Sameer Nanda wrote: > > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. I am fine with this patch as well, but honestly I'd prefer the previous v5. I won't argue though. > +/* > + * Careful: while_each_thread is not RCU safe. Callers should hold > + * read_lock(tasklist_lock) across while_each_thread loops. > + */ (tasklist_lock or siglock, in fact but this doesn't matter). This is not that simple, even tasklist_lock can't help if the task is already dead. Oh. Yes, sorry. I promised to send the patches "soon" many times, but still didn't find the time. Perhaps I should try to start with the "make this all less buggy" changes, the "complete" fix needs to change the callers as well. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill 2013-11-13 17:18 ` Sameer Nanda (?) (?) @ 2013-11-14 13:43 ` dserrg 2013-11-14 17:03 ` Sameer Nanda -1 siblings, 1 reply; 51+ messages in thread From: dserrg @ 2013-11-14 13:43 UTC (permalink / raw) To: Sameer Nanda Cc: rusty, hannes, msb, oleg, Мурзин Владимир, linux-mm, rientjes, mhocko, akpm, semenzato, linux-kernel [-- Attachment #1: Type: text/plain, Size: 5193 bytes --] (sorry for html) Why do we even bother with locking? Why not just merge my original patch? (The link is in Vladimir's message) It provides much more elegant (and working!) solution for this problem. David, how did you miss it in the first place? Oh.. and by the way. I was hitting the same bug in other while_each_thread loops in oom_kill.c. Anyway, goodluck ;) 14 нояб. 2013 г. 2:18 пользователь "Sameer Nanda" <snanda@chromium.org> написал: > The selection of the process to be killed happens in two spots: > first in select_bad_process and then a further refinement by > looking for child processes in oom_kill_process. Since this is > a two step process, it is possible that the process selected by > select_bad_process may get a SIGKILL just before oom_kill_process > executes. If this were to happen, __unhash_process deletes this > process from the thread_group list. This results in oom_kill_process > getting stuck in an infinite loop when traversing the thread_group > list of the selected process. > > Fix this race by adding a pid_alive check for the selected process > with tasklist_lock held in oom_kill_process. > > Signed-off-by: Sameer Nanda <snanda@chromium.org> > --- > include/linux/sched.h | 5 +++++ > mm/oom_kill.c | 34 +++++++++++++++++++++------------- > 2 files changed, 26 insertions(+), 13 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index e27baee..8975dbb 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void); > #define do_each_thread(g, t) \ > for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) > do > > +/* > + * Careful: while_each_thread is not RCU safe. Callers should hold > + * read_lock(tasklist_lock) across while_each_thread loops. > + */ > + > #define while_each_thread(g, t) \ > while ((t = next_thread(t)) != g) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 6738c47..0d1f804 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t > gfp_mask, int order, > static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, > DEFAULT_RATELIMIT_BURST); > > + if (__ratelimit(&oom_rs)) > + dump_header(p, gfp_mask, order, memcg, nodemask); > + > + task_lock(p); > + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", > + message, task_pid_nr(p), p->comm, points); > + task_unlock(p); > + > + read_lock(&tasklist_lock); > + > /* > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die > quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING || !pid_alive(p)) { > set_tsk_thread_flag(p, TIF_MEMDIE); > put_task_struct(p); > + read_unlock(&tasklist_lock); > return; > } > > - if (__ratelimit(&oom_rs)) > - dump_header(p, gfp_mask, order, memcg, nodemask); > - > - task_lock(p); > - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", > - message, task_pid_nr(p), p->comm, points); > - task_unlock(p); > - > /* > * If any of p's children has a different mm and is eligible for > kill, > * the one with the highest oom_badness() score is sacrificed for > its > * parent. This attempts to lose the minimal amount of work done > while > * still freeing memory. > */ > - read_lock(&tasklist_lock); > do { > list_for_each_entry(child, &t->children, sibling) { > unsigned int child_points; > @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t > gfp_mask, int order, > } > } > } while_each_thread(p, t); > - read_unlock(&tasklist_lock); > > - rcu_read_lock(); > p = find_lock_task_mm(victim); > + > + /* > + * Since while_each_thread is currently not RCU safe, this unlock > of > + * tasklist_lock may need to be moved further down if any > additional > + * while_each_thread loops get added to this function. > + */ > + read_unlock(&tasklist_lock); > + > if (!p) { > - rcu_read_unlock(); > put_task_struct(victim); > return; > } else if (victim != p) { > @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t > gfp_mask, int order, > * That thread will now get access to memory reserves since it has > a > * pending fatal signal. > */ > + rcu_read_lock(); > for_each_process(p) > if (p->mm == mm && !same_thread_group(p, victim) && > !(p->flags & PF_KTHREAD)) { > -- > 1.8.4.1 > > [-- Attachment #2: Type: text/html, Size: 5852 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill 2013-11-14 13:43 ` dserrg @ 2013-11-14 17:03 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-14 17:03 UTC (permalink / raw) To: dserrg Cc: Rusty Russell, Johannes Weiner, msb, Oleg Nesterov, Мурзин Владимир, linux-mm, David Rientjes, mhocko, Andrew Morton, Luigi Semenzato, linux-kernel On Thu, Nov 14, 2013 at 5:43 AM, dserrg <dserrg@gmail.com> wrote: > (sorry for html) > > Why do we even bother with locking? > Why not just merge my original patch? (The link is in Vladimir's message) > It provides much more elegant (and working!) solution for this problem. As Oleg alluded to in that thread, that patch makes the race window smaller, but doesn't close it completely. Imagine if a SIGKILL gets sent to the task p immediately after the fatal_signal_pending check. In that case, the infinite loop in while_each_thread will still happen since __unhash_process would delete the task p from the thread_group list while while_each_thread loop is in progress on another CPU. This is precisely why we need to hold read_lock(&tasklist_lock) _before_ checking the state of the process p and entering the while_each_thread loop. > David, how did you miss it in the first place? > > Oh.. and by the way. I was hitting the same bug in other > while_each_thread loops in oom_kill.c. > Anyway, goodluck ;) Thanks! > > 14 нояб. 2013 г. 2:18 пользователь "Sameer Nanda" <snanda@chromium.org> > написал: > >> The selection of the process to be killed happens in two spots: >> first in select_bad_process and then a further refinement by >> looking for child processes in oom_kill_process. Since this is >> a two step process, it is possible that the process selected by >> select_bad_process may get a SIGKILL just before oom_kill_process >> executes. If this were to happen, __unhash_process deletes this >> process from the thread_group list. This results in oom_kill_process >> getting stuck in an infinite loop when traversing the thread_group >> list of the selected process. >> >> Fix this race by adding a pid_alive check for the selected process >> with tasklist_lock held in oom_kill_process. >> >> Signed-off-by: Sameer Nanda <snanda@chromium.org> >> --- >> include/linux/sched.h | 5 +++++ >> mm/oom_kill.c | 34 +++++++++++++++++++++------------- >> 2 files changed, 26 insertions(+), 13 deletions(-) >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index e27baee..8975dbb 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void); >> #define do_each_thread(g, t) \ >> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) >> do >> >> +/* >> + * Careful: while_each_thread is not RCU safe. Callers should hold >> + * read_lock(tasklist_lock) across while_each_thread loops. >> + */ >> + >> #define while_each_thread(g, t) \ >> while ((t = next_thread(t)) != g) >> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 6738c47..0d1f804 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t >> gfp_mask, int order, >> static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, >> DEFAULT_RATELIMIT_BURST); >> >> + if (__ratelimit(&oom_rs)) >> + dump_header(p, gfp_mask, order, memcg, nodemask); >> + >> + task_lock(p); >> + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> + message, task_pid_nr(p), p->comm, points); >> + task_unlock(p); >> + >> + read_lock(&tasklist_lock); >> + >> /* >> * If the task is already exiting, don't alarm the sysadmin or >> kill >> * its children or threads, just set TIF_MEMDIE so it can die >> quickly >> */ >> - if (p->flags & PF_EXITING) { >> + if (p->flags & PF_EXITING || !pid_alive(p)) { >> set_tsk_thread_flag(p, TIF_MEMDIE); >> put_task_struct(p); >> + read_unlock(&tasklist_lock); >> return; >> } >> >> - if (__ratelimit(&oom_rs)) >> - dump_header(p, gfp_mask, order, memcg, nodemask); >> - >> - task_lock(p); >> - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> - message, task_pid_nr(p), p->comm, points); >> - task_unlock(p); >> - >> /* >> * If any of p's children has a different mm and is eligible for >> kill, >> * the one with the highest oom_badness() score is sacrificed for >> its >> * parent. This attempts to lose the minimal amount of work done >> while >> * still freeing memory. >> */ >> - read_lock(&tasklist_lock); >> do { >> list_for_each_entry(child, &t->children, sibling) { >> unsigned int child_points; >> @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t >> gfp_mask, int order, >> } >> } >> } while_each_thread(p, t); >> - read_unlock(&tasklist_lock); >> >> - rcu_read_lock(); >> p = find_lock_task_mm(victim); >> + >> + /* >> + * Since while_each_thread is currently not RCU safe, this unlock >> of >> + * tasklist_lock may need to be moved further down if any >> additional >> + * while_each_thread loops get added to this function. >> + */ >> + read_unlock(&tasklist_lock); >> + >> if (!p) { >> - rcu_read_unlock(); >> put_task_struct(victim); >> return; >> } else if (victim != p) { >> @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t >> gfp_mask, int order, >> * That thread will now get access to memory reserves since it has >> a >> * pending fatal signal. >> */ >> + rcu_read_lock(); >> for_each_process(p) >> if (p->mm == mm && !same_thread_group(p, victim) && >> !(p->flags & PF_KTHREAD)) { >> -- >> 1.8.4.1 >> > -- Sameer ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v6] mm, oom: Fix race when selecting process to kill @ 2013-11-14 17:03 ` Sameer Nanda 0 siblings, 0 replies; 51+ messages in thread From: Sameer Nanda @ 2013-11-14 17:03 UTC (permalink / raw) To: dserrg Cc: Rusty Russell, Johannes Weiner, msb, Oleg Nesterov, Мурзин Владимир, linux-mm, David Rientjes, mhocko, Andrew Morton, Luigi Semenzato, linux-kernel On Thu, Nov 14, 2013 at 5:43 AM, dserrg <dserrg@gmail.com> wrote: > (sorry for html) > > Why do we even bother with locking? > Why not just merge my original patch? (The link is in Vladimir's message) > It provides much more elegant (and working!) solution for this problem. As Oleg alluded to in that thread, that patch makes the race window smaller, but doesn't close it completely. Imagine if a SIGKILL gets sent to the task p immediately after the fatal_signal_pending check. In that case, the infinite loop in while_each_thread will still happen since __unhash_process would delete the task p from the thread_group list while while_each_thread loop is in progress on another CPU. This is precisely why we need to hold read_lock(&tasklist_lock) _before_ checking the state of the process p and entering the while_each_thread loop. > David, how did you miss it in the first place? > > Oh.. and by the way. I was hitting the same bug in other > while_each_thread loops in oom_kill.c. > Anyway, goodluck ;) Thanks! > > 14 нояб. 2013 г. 2:18 пользователь "Sameer Nanda" <snanda@chromium.org> > написал: > >> The selection of the process to be killed happens in two spots: >> first in select_bad_process and then a further refinement by >> looking for child processes in oom_kill_process. Since this is >> a two step process, it is possible that the process selected by >> select_bad_process may get a SIGKILL just before oom_kill_process >> executes. If this were to happen, __unhash_process deletes this >> process from the thread_group list. This results in oom_kill_process >> getting stuck in an infinite loop when traversing the thread_group >> list of the selected process. >> >> Fix this race by adding a pid_alive check for the selected process >> with tasklist_lock held in oom_kill_process. >> >> Signed-off-by: Sameer Nanda <snanda@chromium.org> >> --- >> include/linux/sched.h | 5 +++++ >> mm/oom_kill.c | 34 +++++++++++++++++++++------------- >> 2 files changed, 26 insertions(+), 13 deletions(-) >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index e27baee..8975dbb 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void); >> #define do_each_thread(g, t) \ >> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) >> do >> >> +/* >> + * Careful: while_each_thread is not RCU safe. Callers should hold >> + * read_lock(tasklist_lock) across while_each_thread loops. >> + */ >> + >> #define while_each_thread(g, t) \ >> while ((t = next_thread(t)) != g) >> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 6738c47..0d1f804 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t >> gfp_mask, int order, >> static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, >> DEFAULT_RATELIMIT_BURST); >> >> + if (__ratelimit(&oom_rs)) >> + dump_header(p, gfp_mask, order, memcg, nodemask); >> + >> + task_lock(p); >> + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> + message, task_pid_nr(p), p->comm, points); >> + task_unlock(p); >> + >> + read_lock(&tasklist_lock); >> + >> /* >> * If the task is already exiting, don't alarm the sysadmin or >> kill >> * its children or threads, just set TIF_MEMDIE so it can die >> quickly >> */ >> - if (p->flags & PF_EXITING) { >> + if (p->flags & PF_EXITING || !pid_alive(p)) { >> set_tsk_thread_flag(p, TIF_MEMDIE); >> put_task_struct(p); >> + read_unlock(&tasklist_lock); >> return; >> } >> >> - if (__ratelimit(&oom_rs)) >> - dump_header(p, gfp_mask, order, memcg, nodemask); >> - >> - task_lock(p); >> - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", >> - message, task_pid_nr(p), p->comm, points); >> - task_unlock(p); >> - >> /* >> * If any of p's children has a different mm and is eligible for >> kill, >> * the one with the highest oom_badness() score is sacrificed for >> its >> * parent. This attempts to lose the minimal amount of work done >> while >> * still freeing memory. >> */ >> - read_lock(&tasklist_lock); >> do { >> list_for_each_entry(child, &t->children, sibling) { >> unsigned int child_points; >> @@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t >> gfp_mask, int order, >> } >> } >> } while_each_thread(p, t); >> - read_unlock(&tasklist_lock); >> >> - rcu_read_lock(); >> p = find_lock_task_mm(victim); >> + >> + /* >> + * Since while_each_thread is currently not RCU safe, this unlock >> of >> + * tasklist_lock may need to be moved further down if any >> additional >> + * while_each_thread loops get added to this function. >> + */ >> + read_unlock(&tasklist_lock); >> + >> if (!p) { >> - rcu_read_unlock(); >> put_task_struct(victim); >> return; >> } else if (victim != p) { >> @@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t >> gfp_mask, int order, >> * That thread will now get access to memory reserves since it has >> a >> * pending fatal signal. >> */ >> + rcu_read_lock(); >> for_each_process(p) >> if (p->mm == mm && !same_thread_group(p, victim) && >> !(p->flags & PF_KTHREAD)) { >> -- >> 1.8.4.1 >> > -- Sameer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2013-11-14 17:04 UTC | newest] Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-11-05 23:26 [PATCH] mm, oom: Fix race when selecting process to kill Sameer Nanda 2013-11-05 23:26 ` Sameer Nanda 2013-11-06 1:18 ` David Rientjes 2013-11-06 1:18 ` David Rientjes 2013-11-06 1:25 ` Luigi Semenzato 2013-11-06 1:25 ` Luigi Semenzato 2013-11-06 1:27 ` David Rientjes 2013-11-06 1:27 ` David Rientjes 2013-11-06 3:00 ` Vladimir Murzin 2013-11-06 3:00 ` Vladimir Murzin 2013-11-06 3:04 ` Sameer Nanda 2013-11-06 4:45 ` Luigi Semenzato 2013-11-06 4:45 ` Luigi Semenzato 2013-11-06 7:17 ` Luigi Semenzato 2013-11-06 7:17 ` Luigi Semenzato 2013-11-06 16:58 ` Sameer Nanda 2013-11-06 16:58 ` Sameer Nanda 2013-11-07 0:35 ` David Rientjes 2013-11-07 0:35 ` David Rientjes 2013-11-07 19:34 ` Sameer Nanda 2013-11-07 19:34 ` Sameer Nanda 2013-11-08 18:07 ` [PATCH v2] " Sameer Nanda 2013-11-08 18:07 ` Sameer Nanda 2013-11-08 18:45 ` Oleg Nesterov 2013-11-08 18:45 ` Oleg Nesterov 2013-11-08 19:49 ` [PATCH v3] " Sameer Nanda 2013-11-08 19:49 ` Sameer Nanda 2013-11-09 15:16 ` Oleg Nesterov 2013-11-09 15:16 ` Oleg Nesterov 2013-11-11 23:15 ` Sameer Nanda 2013-11-12 0:21 ` [PATCH v4] " Sameer Nanda 2013-11-12 0:21 ` Sameer Nanda 2013-11-12 15:13 ` Michal Hocko 2013-11-12 15:13 ` Michal Hocko 2013-11-12 20:01 ` Oleg Nesterov 2013-11-12 20:01 ` Oleg Nesterov 2013-11-12 20:08 ` Sameer Nanda 2013-11-12 20:08 ` Sameer Nanda 2013-11-12 20:23 ` [PATCH v5] " Sameer Nanda 2013-11-12 20:23 ` Sameer Nanda 2013-11-13 2:33 ` David Rientjes 2013-11-13 2:33 ` David Rientjes 2013-11-13 16:46 ` Sameer Nanda 2013-11-13 16:46 ` Sameer Nanda 2013-11-13 17:18 ` [PATCH v6] " Sameer Nanda 2013-11-13 17:18 ` Sameer Nanda 2013-11-13 17:29 ` Oleg Nesterov 2013-11-13 17:29 ` Oleg Nesterov 2013-11-14 13:43 ` dserrg 2013-11-14 17:03 ` Sameer Nanda 2013-11-14 17:03 ` Sameer Nanda
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.