From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753084Ab2KYNzq (ORCPT );
	Sun, 25 Nov 2012 08:55:46 -0500
Received: from mail-ea0-f174.google.com ([209.85.215.174]:48669 "EHLO
	mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752659Ab2KYNzp (ORCPT );
	Sun, 25 Nov 2012 08:55:45 -0500
Date: Sun, 25 Nov 2012 14:55:42 +0100
From: Michal Hocko
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups mailinglist, KAMEZAWA Hiroyuki
Subject: Re: memory-cgroup bug
Message-ID: <20121125135542.GE10623@dhcp22.suse.cz>
References: <20121121200207.01068046@pobox.sk>
	<20121122152441.GA9609@dhcp22.suse.cz>
	<20121122190526.390C7A28@pobox.sk>
	<20121122214249.GA20319@dhcp22.suse.cz>
	<20121122233434.3D5E35E6@pobox.sk>
	<20121123074023.GA24698@dhcp22.suse.cz>
	<20121123102137.10D6D653@pobox.sk>
	<20121123100438.GF24698@dhcp22.suse.cz>
	<20121125011047.7477BB5E@pobox.sk>
	<20121125120524.GB10623@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121125120524.GB10623@dhcp22.suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun 25-11-12 13:05:24, Michal Hocko wrote:
> [Adding Kamezawa into CC]
>
> On Sun 25-11-12 01:10:47, azurIt wrote:
> > >Could you take few snapshots over time?
> >
> >
> > Here it is, now from different server, snapshot was taken every second
> > for 10 minutes (hope it's enough):
> > www.watchdog.sk/lkml/memcg-bug-2.tar.gz
>
> Hmm, interesting:
> $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
> min:16281 max:224048 avg:18818.943119
>
> So there are a lot of attempts to allocate which fail, every second!
> Will get to that later.
>
> The number of tasks in the group is stable (20):
> $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
>     546 20
>
> And no task has been killed or spawned:
> $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
> 24495
> 24762
> 24774
> 24796
> 24798
> 24805
> 24813
> 24827
> 24831
> 24841
> 24842
> 24863
> 24892
> 24924
> 24931
> 25130
> 25131
> 25192
> 25193
> 25243
>
> $ for stack in [0-9]*/[0-9]*
>   do
>     head -n1 $stack/stack
>   done | sort | uniq -c
>    9841 [] mem_cgroup_handle_oom+0x241/0x3b0
>     546 [] do_truncate+0x58/0xa0
>     533 [] 0xffffffffffffffff
>
> This tells us that the stacks are pretty much stable.
> $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
>     546 24495
>
> So 24495 is stuck in do_truncate:
> [] do_truncate+0x58/0xa0
> [] do_last+0x250/0xa30
> [] path_openat+0xd7/0x440
> [] do_filp_open+0x49/0xa0
> [] do_sys_open+0x106/0x240
> [] sys_open+0x20/0x30
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> I suspect it is waiting for i_mutex. Who is holding that lock?
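[ For reference, roughly what do_truncate does. This is a heavily
  trimmed sketch from memory, not a verbatim copy of fs/open.c, because
  only the locking matters here:

	int do_truncate(struct dentry *dentry, loff_t length,
			unsigned int time_attrs, struct file *filp)
	{
		struct iattr newattrs;
		int ret;

		newattrs.ia_size = length;
		newattrs.ia_valid = ATTR_SIZE | time_attrs;
		/* ... filp and suid/sgid handling trimmed ... */

		/* blocks here if somebody else holds i_mutex */
		mutex_lock(&dentry->d_inode->i_mutex);
		ret = notify_change(dentry, &newattrs);
		mutex_unlock(&dentry->d_inode->i_mutex);
		return ret;
	}

  so a task parked in do_truncate for the whole sampling period is
  consistent with it waiting for i_mutex in mutex_lock. ]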
> Other tasks are blocked in mem_cgroup_handle_oom, either coming from
> the page fault path (so i_mutex can be excluded) or from vfs_write
> (24796), and that one is interesting:
> [] mem_cgroup_handle_oom+0x241/0x3b0
> [] T.1146+0x5ab/0x5c0
> [] mem_cgroup_cache_charge+0xbe/0xe0
> [] add_to_page_cache_locked+0x4c/0x140
> [] add_to_page_cache_lru+0x22/0x50
> [] grab_cache_page_write_begin+0x8b/0xe0
> [] ext3_write_begin+0x88/0x270
> [] generic_file_buffered_write+0x116/0x290
> [] __generic_file_aio_write+0x27c/0x480
> [] generic_file_aio_write+0x76/0xf0	# takes &inode->i_mutex
> [] do_sync_write+0xea/0x130
> [] vfs_write+0xf3/0x1f0
> [] sys_write+0x51/0x90
> [] system_call_fastpath+0x18/0x1d
> [] 0xffffffffffffffff
>
> This smells like a deadlock, but a kind of strange one. The rapidly
> increasing failcnt suggests that somebody still tries to allocate, but
> who, when all of them are hung in mem_cgroup_handle_oom? This can be
> explained, though.
> The memcg OOM killer lets only one process (the one which is able to
> lock the hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory
> and kill a process, while the others are waiting on the wait queue.
> Once the killer is done it calls memcg_wakeup_oom, which wakes up the
> other tasks waiting on the queue. Those retry the charge in the hope
> that some memory was freed in the meantime, which hasn't happened, so
> they get into OOM again (and again and again).
> This all usually works out, except that in this particular case I would
> bet my hat that the OOM-selected task is pid 24495, which is blocked on
> the mutex held by one of the tasks sitting in the OOM path, so it
> cannot finish and thus cannot free any memory.
>
> It seems that the current Linus' tree is affected as well.
>
> I will have to think about a solution, but it sounds really tricky. It
> is not just ext3 that is affected.
>
> I guess we need to tell mem_cgroup_cache_charge that it should never
> reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
> me. On the other hand it is really weird that an excessive writer might
> trigger the memcg OOM killer.

This is hackish, but it should help you in this case. Kamezawa, what do
you think about it? Should we generalize this and prepare something like
mem_cgroup_cache_charge_locked, which would add __GFP_NORETRY
automatically, and use that function whenever we charge in a locked
context? To be honest I do not like this very much, but nothing more
sensible (without touching non-memcg paths) comes to my mind.
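To illustrate the idea, a rough and completely untested sketch of what
such a helper could look like. The name comes from the proposal above,
nothing like this exists in the tree, and the exact placement of the
masking is up for discussion:

	/*
	 * Sketch only. Charge page cache from a context which holds locks
	 * that the memcg OOM victim might need (e.g. i_mutex). We must not
	 * sit in mem_cgroup_handle_oom here, so force __GFP_NORETRY and
	 * let the charge fail instead of waiting for the OOM killer.
	 */
	static inline int
	mem_cgroup_cache_charge_locked(struct page *page, struct mm_struct *mm,
				       gfp_t gfp_mask)
	{
		return mem_cgroup_cache_charge(page, mm,
					       gfp_mask | __GFP_NORETRY);
	}

add_to_page_cache_locked (and any other caller which charges while
holding a lock the OOM victim might block on) would then use this
instead of calling mem_cgroup_cache_charge directly.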
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..da50c83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+					(gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
 
-- 
Michal Hocko
SUSE Labs