From: Michal Hocko <mhocko@suse.cz>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Minchan Kim <minchan@kernel.org>,
	linux-mm@kvack.org, cgroups@vger.kernel.org
Subject: Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
Date: Tue, 26 Jun 2012 11:58:32 +0200	[thread overview]
Message-ID: <20120626095832.GC9566@tiehlicka.suse.cz> (raw)
In-Reply-To: <alpine.DEB.2.00.1206251847180.24838@chino.kir.corp.google.com>

On Mon 25-06-12 18:47:53, David Rientjes wrote:
> The global oom killer is serialized by the zonelist being used in the
> page allocation.  Concurrent oom kills are thus a rare event and only
> occur in systems using mempolicies and with a large number of nodes.
> 
> Memory controller oom kills, however, can frequently be concurrent since
> there is no serialization once the oom killer is called for oom
> conditions in several different memcgs in parallel.
> 
> This creates a massive contention on tasklist_lock since the oom killer
> requires the readside for the tasklist iteration.  If several memcgs are
> calling the oom killer, this lock can be held for a substantial amount
> of time, especially if threads continue to enter it as other threads
> are exiting.
> 
> Since the exit path grabs the writeside of the lock with irqs disabled
> in a few different places, this can cause a soft lockup on cpus as a
> result of tasklist_lock starvation.
> 
> The kernel lacks unfair writelocks, and successful calls to the oom
> killer usually result in at least one thread entering the exit path, so
> an alternative solution is needed.
> 
> This patch introduces a separate oom handler for memcgs so that they do
> not require tasklist_lock for as much time.  Instead, it iterates only
> over the threads attached to the oom memcg and grabs a reference to the
> selected thread before calling oom_kill_process() to ensure it doesn't
> prematurely exit.
> 
> This still requires tasklist_lock for the tasklist dump, iterating
> children of the selected process, and killing all other threads on the
> system sharing the same memory as the selected victim.  So while this
> isn't a complete solution to tasklist_lock starvation, it significantly
> reduces the amount of time that it is held.

There is an issue with memcg ref. counting but I like the approach in
general.

> 
> Signed-off-by: David Rientjes <rientjes@google.com>

After the things below are fixed
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |    9 ++-----
>  include/linux/oom.h        |   16 ++++++++++++
>  mm/memcontrol.c            |   62 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/oom_kill.c              |   48 +++++++++++-----------------------
>  4 files changed, 94 insertions(+), 41 deletions(-)
> 
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1470,6 +1470,66 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +				int order)
> +{
> +	struct mem_cgroup *iter;
> +	unsigned long chosen_points = 0;
> +	unsigned long totalpages;
> +	unsigned int points = 0;
> +	struct task_struct *chosen = NULL;
> +	struct task_struct *task;
> +
> +	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		struct cgroup *cgroup = iter->css.cgroup;
> +		struct cgroup_iter it;
> +
> +		cgroup_iter_start(cgroup, &it);

I guess this should protect from task move and exit, right?

> +		while ((task = cgroup_iter_next(cgroup, &it))) {
> +			switch (oom_scan_process_thread(task, totalpages, NULL,
> +							false)) {
> +			case OOM_SCAN_SELECT:
> +				if (chosen)
> +					put_task_struct(chosen);
> +				chosen = task;
> +				chosen_points = ULONG_MAX;
> +				get_task_struct(chosen);
> +				/* fall through */
> +			case OOM_SCAN_CONTINUE:
> +				continue;
> +			case OOM_SCAN_ABORT:
> +				cgroup_iter_end(cgroup, &it);
> +				if (chosen)
> +					put_task_struct(chosen);

You need mem_cgroup_iter_break here to have ref. counting correct.

> +				return;
> +			case OOM_SCAN_OK:
> +				break;
> +			};
> +			points = oom_badness(task, memcg, NULL, totalpages);
> +			if (points > chosen_points) {
> +				if (chosen)
> +					put_task_struct(chosen);
> +				chosen = task;
> +				chosen_points = points;
> +				get_task_struct(chosen);
> +			}
> +		}
> +		cgroup_iter_end(cgroup, &it);
> +		if (!memcg->use_hierarchy)
> +			break;

And this is not necessary, because for_each_mem_cgroup_tree is
hierarchy aware.

[...]
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
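To make the two review points above concrete, the fix could look roughly like the following. This is an untested sketch, not part of the posted patch; it assumes mem_cgroup_iter_break() with its usual (root, prev) signature from the hierarchy-walk helpers in mm/memcontrol.c, which drops the css reference held by an interrupted for_each_mem_cgroup_tree walk:

```
 			case OOM_SCAN_ABORT:
 				cgroup_iter_end(cgroup, &it);
+				/* Returning from inside the tree walk leaks
+				 * the css reference taken by
+				 * for_each_mem_cgroup_tree, so break out of
+				 * the walk properly first. */
+				mem_cgroup_iter_break(memcg, iter);
 				if (chosen)
 					put_task_struct(chosen);
 				return;
@@
 		cgroup_iter_end(cgroup, &it);
-		if (!memcg->use_hierarchy)
-			break;
 	}
```

The second hunk simply drops the use_hierarchy check, since for_each_mem_cgroup_tree already confines the walk to the subtree rooted at memcg.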