From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935699Ab3IEMn6 (ORCPT ); Thu, 5 Sep 2013 08:43:58 -0400 Received: from cantor2.suse.de ([195.135.220.15]:41305 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757780Ab3IEMn4 (ORCPT ); Thu, 5 Sep 2013 08:43:56 -0400 Date: Thu, 5 Sep 2013 14:43:52 +0200 From: Michal Hocko To: Johannes Weiner Cc: azurIt , Andrew Morton , David Rientjes , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [patch 0/7] improve memcg oom killer robustness v2 Message-ID: <20130905124352.GB13666@dhcp22.suse.cz> References: <1375549200-19110-1-git-send-email-hannes@cmpxchg.org> <20130803170831.GB23319@cmpxchg.org> <20130830215852.3E5D3D66@pobox.sk> <20130902123802.5B8E8CB1@pobox.sk> <20130903204850.GA1412@cmpxchg.org> <20130904101852.58E70042@pobox.sk> <20130905115430.GB856@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130905115430.GB856@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 05-09-13 07:54:30, Johannes Weiner wrote: > On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote: > > Ok, i see this message several times in my syslog logs, one of them > > is also for this unremovable cgroup (but maybe all of them cannot > > be removed, should i try?). Example of the log is here (don't know > > where exactly it starts and ends so here is the full kernel log): > > http://watchdog.sk/lkml/oom_syslog.gz > > There is an unfinished OOM invocation here: > > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from: > Aug 22 13:15:21 server01 kernel: [1251422.715191] [] T.1154+0x622/0x8f0 > Aug 22 13:15:21 server01 kernel: [1251422.715274] [] mem_cgroup_cache_charge+0xbe/0xe0 > Aug 22 13:15:21 server01 kernel: [1251422.715357] [] add_to_page_cache_locked+0x4c/0x140 > Aug 22 13:15:21 server01 kernel: [1251422.715443] [] add_to_page_cache_lru+0x22/0x50 > Aug 22 13:15:21 server01 kernel: [1251422.715526] [] find_or_create_page+0x73/0xb0 > Aug 22 13:15:21 server01 kernel: [1251422.715608] [] __getblk+0xea/0x2c0 > Aug 22 13:15:21 server01 kernel: [1251422.715692] [] __bread+0x13/0xc0 > Aug 22 13:15:21 server01 kernel: [1251422.715774] [] ext3_get_branch+0x98/0x140 > Aug 22 13:15:21 server01 kernel: [1251422.715859] [] ext3_get_blocks_handle+0xd7/0xdc0 > Aug 22 13:15:21 server01 kernel: [1251422.715942] [] ext3_get_block+0xc4/0x120 > Aug 22 13:15:21 server01 kernel: [1251422.716023] [] do_mpage_readpage+0x38a/0x690 > Aug 22 13:15:21 server01 kernel: [1251422.716107] [] mpage_readpage+0x4f/0x70 > Aug 22 13:15:21 server01 kernel: [1251422.716188] [] ext3_readpage+0x28/0x60 > Aug 22 13:15:21 server01 kernel: [1251422.716268] [] filemap_fault+0x308/0x560 > Aug 22 13:15:21 server01 kernel: [1251422.716350] [] __do_fault+0x78/0x5a0 > Aug 22 13:15:21 server01 kernel: [1251422.716433] [] handle_pte_fault+0x84/0x940 > > __getblk() has this weird loop where it tries to instantiate the page, > frees memory on failure, then retries. If the memcg goes OOM, the OOM > path might be entered multiple times and each time leak the memcg > reference of the respective previous OOM invocation. Very well spotted, Johannes! > There are a few more find_or_create() sites that do not propagate an > error and it's incredibly hard to find out whether they are even taken > during a page fault. It's not practical to annotate them all with > memcg OOM toggles, so let's just catch all OOM contexts at the end of > handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating > this like an error. > > azur, here is a patch on top of your modified 3.2. Note that Michal > might be onto something and we are looking at multiple issues here, > but the log excert above suggests this fix is required either way. > > --- > From: Johannes Weiner > Subject: [patch] mm: memcg: handle non-error OOM situations more gracefully > > Many places that can trigger a memcg OOM situation return gracefully > and don't propagate VM_FAULT_OOM up the fault stack. > > It's not practical to annotate all of them to disable the memcg OOM > killer. Instead, just clean up any set OOM state without warning in > case the fault is not returning VM_FAULT_OOM. > > Also fail charges immediately when the current task already is in an > OOM context. Otherwise, the previous context gets overwritten and the > memcg reference is leaked. This is getting way more trickier than I've expected and hoped for. The above should work although I cannot say I love it. I am afraid we do not have many choices left without polluting the every single place which can charge, though :/ > Signed-off-by: Johannes Weiner I guess this should be correct but I have to think about it some more. Two minor comments bellow. > --- > include/linux/memcontrol.h | 40 ++++++---------------------------------- > include/linux/sched.h | 3 --- > mm/filemap.c | 11 +---------- > mm/memcontrol.c | 15 ++++++++------- > mm/memory.c | 8 ++------ > mm/oom_kill.c | 2 +- > 6 files changed, 18 insertions(+), 61 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index b113c0f..7c43903 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -120,39 +120,16 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > > -/** > - * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task > - * @new: true to enable, false to disable > - * > - * Toggle whether a failed memcg charge should invoke the OOM killer > - * or just return -ENOMEM. Returns the previous toggle state. > - * > - * NOTE: Any path that enables the OOM killer before charging must > - * call mem_cgroup_oom_synchronize() afterward to finalize the > - * OOM handling and clean up. > - */ > -static inline bool mem_cgroup_toggle_oom(bool new) > -{ > - bool old; > - > - old = current->memcg_oom.may_oom; > - current->memcg_oom.may_oom = new; > - > - return old; > -} I will not miss this guy. [...] > diff --git a/mm/memory.c b/mm/memory.c > index cdbe41b..cdad471 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -57,7 +57,6 @@ > #include > #include > #include > -#include > > #include > #include > @@ -3521,11 +3520,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, > if (flags & FAULT_FLAG_USER) > mem_cgroup_disable_oom(); > > - if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) { > - printk("Fixing unhandled memcg OOM context set up from:\n"); > - print_stack_trace(¤t->memcg_oom.trace, 0); > - mem_cgroup_oom_synchronize(); > - } > + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM)) > + mem_cgroup_oom_synchronize(false); This deserves a fat comment /me thinks [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [patch 0/7] improve memcg oom killer robustness v2 Date: Thu, 5 Sep 2013 14:43:52 +0200 Message-ID: <20130905124352.GB13666@dhcp22.suse.cz> References: <1375549200-19110-1-git-send-email-hannes@cmpxchg.org> <20130803170831.GB23319@cmpxchg.org> <20130830215852.3E5D3D66@pobox.sk> <20130902123802.5B8E8CB1@pobox.sk> <20130903204850.GA1412@cmpxchg.org> <20130904101852.58E70042@pobox.sk> <20130905115430.GB856@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <20130905115430.GB856@cmpxchg.org> Sender: owner-linux-mm@kvack.org To: Johannes Weiner Cc: azurIt , Andrew Morton , David Rientjes , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org List-Id: linux-arch.vger.kernel.org On Thu 05-09-13 07:54:30, Johannes Weiner wrote: > On Wed, Sep 04, 2013 at 10:18:52AM +0200, azurIt wrote: > > Ok, i see this message several times in my syslog logs, one of them > > is also for this unremovable cgroup (but maybe all of them cannot > > be removed, should i try?). Example of the log is here (don't know > > where exactly it starts and ends so here is the full kernel log): > > http://watchdog.sk/lkml/oom_syslog.gz > > There is an unfinished OOM invocation here: > > Aug 22 13:15:21 server01 kernel: [1251422.715112] Fixing unhandled memcg OOM context set up from: > Aug 22 13:15:21 server01 kernel: [1251422.715191] [] T.1154+0x622/0x8f0 > Aug 22 13:15:21 server01 kernel: [1251422.715274] [] mem_cgroup_cache_charge+0xbe/0xe0 > Aug 22 13:15:21 server01 kernel: [1251422.715357] [] add_to_page_cache_locked+0x4c/0x140 > Aug 22 13:15:21 server01 kernel: [1251422.715443] [] add_to_page_cache_lru+0x22/0x50 > Aug 22 13:15:21 server01 kernel: [1251422.715526] [] find_or_create_page+0x73/0xb0 > Aug 22 13:15:21 server01 kernel: [1251422.715608] [] __getblk+0xea/0x2c0 > Aug 22 13:15:21 server01 kernel: [1251422.715692] [] __bread+0x13/0xc0 > Aug 22 13:15:21 server01 kernel: [1251422.715774] [] ext3_get_branch+0x98/0x140 > Aug 22 13:15:21 server01 kernel: [1251422.715859] [] ext3_get_blocks_handle+0xd7/0xdc0 > Aug 22 13:15:21 server01 kernel: [1251422.715942] [] ext3_get_block+0xc4/0x120 > Aug 22 13:15:21 server01 kernel: [1251422.716023] [] do_mpage_readpage+0x38a/0x690 > Aug 22 13:15:21 server01 kernel: [1251422.716107] [] mpage_readpage+0x4f/0x70 > Aug 22 13:15:21 server01 kernel: [1251422.716188] [] ext3_readpage+0x28/0x60 > Aug 22 13:15:21 server01 kernel: [1251422.716268] [] filemap_fault+0x308/0x560 > Aug 22 13:15:21 server01 kernel: [1251422.716350] [] __do_fault+0x78/0x5a0 > Aug 22 13:15:21 server01 kernel: [1251422.716433] [] handle_pte_fault+0x84/0x940 > > __getblk() has this weird loop where it tries to instantiate the page, > frees memory on failure, then retries. If the memcg goes OOM, the OOM > path might be entered multiple times and each time leak the memcg > reference of the respective previous OOM invocation. Very well spotted, Johannes! > There are a few more find_or_create() sites that do not propagate an > error and it's incredibly hard to find out whether they are even taken > during a page fault. It's not practical to annotate them all with > memcg OOM toggles, so let's just catch all OOM contexts at the end of > handle_mm_fault() and clear them if !VM_FAULT_OOM instead of treating > this like an error. > > azur, here is a patch on top of your modified 3.2. Note that Michal > might be onto something and we are looking at multiple issues here, > but the log excert above suggests this fix is required either way. > > --- > From: Johannes Weiner > Subject: [patch] mm: memcg: handle non-error OOM situations more gracefully > > Many places that can trigger a memcg OOM situation return gracefully > and don't propagate VM_FAULT_OOM up the fault stack. > > It's not practical to annotate all of them to disable the memcg OOM > killer. Instead, just clean up any set OOM state without warning in > case the fault is not returning VM_FAULT_OOM. > > Also fail charges immediately when the current task already is in an > OOM context. Otherwise, the previous context gets overwritten and the > memcg reference is leaked. This is getting way more trickier than I've expected and hoped for. The above should work although I cannot say I love it. I am afraid we do not have many choices left without polluting the every single place which can charge, though :/ > Signed-off-by: Johannes Weiner I guess this should be correct but I have to think about it some more. Two minor comments bellow. > --- > include/linux/memcontrol.h | 40 ++++++---------------------------------- > include/linux/sched.h | 3 --- > mm/filemap.c | 11 +---------- > mm/memcontrol.c | 15 ++++++++------- > mm/memory.c | 8 ++------ > mm/oom_kill.c | 2 +- > 6 files changed, 18 insertions(+), 61 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index b113c0f..7c43903 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -120,39 +120,16 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > > -/** > - * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task > - * @new: true to enable, false to disable > - * > - * Toggle whether a failed memcg charge should invoke the OOM killer > - * or just return -ENOMEM. Returns the previous toggle state. > - * > - * NOTE: Any path that enables the OOM killer before charging must > - * call mem_cgroup_oom_synchronize() afterward to finalize the > - * OOM handling and clean up. > - */ > -static inline bool mem_cgroup_toggle_oom(bool new) > -{ > - bool old; > - > - old = current->memcg_oom.may_oom; > - current->memcg_oom.may_oom = new; > - > - return old; > -} I will not miss this guy. [...] > diff --git a/mm/memory.c b/mm/memory.c > index cdbe41b..cdad471 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -57,7 +57,6 @@ > #include > #include > #include > -#include > > #include > #include > @@ -3521,11 +3520,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, > if (flags & FAULT_FLAG_USER) > mem_cgroup_disable_oom(); > > - if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))) { > - printk("Fixing unhandled memcg OOM context set up from:\n"); > - print_stack_trace(¤t->memcg_oom.trace, 0); > - mem_cgroup_oom_synchronize(); > - } > + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM)) > + mem_cgroup_oom_synchronize(false); This deserves a fat comment /me thinks [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org