From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754637Ab3GOJOc (ORCPT );
        Mon, 15 Jul 2013 05:14:32 -0400
Received: from cantor2.suse.de ([195.135.220.15]:57318 "EHLO mx2.suse.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1754434Ab3GOJOb (ORCPT );
        Mon, 15 Jul 2013 05:14:31 -0400
Date: Mon, 15 Jul 2013 11:14:28 +0200
From: Michal Hocko
To: Dave Chinner
Cc: Glauber Costa, Andrew Morton, linux-mm@kvack.org, LKML
Subject: Re: linux-next: slab shrinkers: BUG at mm/list_lru.c:92
Message-ID: <20130715091428.GA26199@dhcp22.suse.cz>
References: <20130629025509.GG9047@dastard>
        <20130630183349.GA23731@dhcp22.suse.cz>
        <20130701012558.GB27780@dastard>
        <20130701075005.GA28765@dhcp22.suse.cz>
        <20130701081056.GA4072@dastard>
        <20130702092200.GB16815@dhcp22.suse.cz>
        <20130702121947.GE14996@dastard>
        <20130702124427.GG16815@dhcp22.suse.cz>
        <20130703112403.GP14996@dastard>
        <20130704163643.GF7833@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130704163643.GF7833@dhcp22.suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > [...]
> > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > info, Michal, it's time to go look at the code....
> > > >
> > > OK, just in case we will need it, I am keeping the machine in this state
> > > for now. So we still can play with crash and check all the juicy
> > > internals.
> >
> > My current suspect is the LRU_RETRY code. I don't think what it is
> > doing is at all valid - list_for_each_safe() is not safe if you drop
> > the lock that protects the list. i.e. there is nothing that protects
> > the stored next pointer from being removed from the list by someone
> > else. Hence what I think is occurring is this:
> >
> >
> > thread 1                        thread 2
> > lock(lru)
> > list_for_each_safe(lru)         lock(lru)
> >   isolate                       ......
> >     lock(i_lock)
> >     has buffers
> >       __iget
> >       unlock(i_lock)
> >       unlock(lru)
> >       .....                     (gets lru lock)
> >                                 list_for_each_safe(lru)
> >                                   walks all the inodes
> >                                   finds inode being isolated by other thread
> >                                   isolate
> >                                     i_count > 0
> >                                       list_del_init(i_lru)
> >                                       return LRU_REMOVED;
> >                                   moves to next inode, inode that
> >                                   other thread has stored as next
> >                                   isolate
> >                                     i_state |= I_FREEING
> >                                     list_move(dispose_list)
> >                                     return LRU_REMOVED
> >                                   ....
> >                                   unlock(lru)
> >       lock(lru)
> >       return LRU_RETRY;
> >   if (!first_pass)
> >     ....
> >   --nr_to_scan
> >   (loop again using next, which has already been removed from the
> >   LRU by the other thread!)
> >   isolate
> >     lock(i_lock)
> >     if (i_state & ~I_REFERENCED)
> >       list_del_init(i_lru)      <<<<< inode is on dispose list!
> >                                 <<<<< inode is now isolated, with I_FREEING set
> >       return LRU_REMOVED;
> >
> > That fits the corpse left on your machine, Michal. One thread has
> > moved the inode to a dispose list, the other thread thinks it is
> > still on the LRU and should be removed, and removes it.
> >
> > This also explains the lru item count going negative - the same item
> > is being removed from the lru twice. So it seems like all the
> > problems you've been seeing are caused by this one problem....
> >
> > Patch below that should fix this.
>
> Good news! The test was running since morning and it didn't hang nor
> crashed. So this really looks like the right fix. It will run also
> during weekend to be 100% sure. But I guess it is safe to say
>
> Tested-by: Michal Hocko

And I can finally confirm this after over weekend testing on ext3.

Thanks a lot for your help Dave!
-- 
Michal Hocko
SUSE Labs
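
For reference, the interleaving quoted above can be replayed
deterministically in user space. What follows is only a minimal,
single-threaded sketch: struct node, struct result and replay() are
hypothetical names made up for illustration, not kernel code, and the
"other thread" is simply inlined at the point where the walker would
have dropped the lock. Restarting the walk from the list head after
retaking the lock is shown as one possible remedy; that is an
assumption here, since the patch itself is not quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct node { struct node *prev, *next; };

static void list_init(struct node *n) { n->prev = n->next = n; }

static void list_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del_init(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

struct result {
    long nr_items;        /* LRU item counter after the retry */
    bool dispose_intact;  /* is item[1] still on the dispose list? */
};

/* Hypothetical replay of the race; no real locking or threads involved. */
struct result replay(bool restart_after_relock)
{
    struct node lru, dispose, item[2];
    long nr_items = 0;

    list_init(&lru);
    list_init(&dispose);
    for (int i = 0; i < 2; i++) {
        list_add_tail(&item[i], &lru);
        nr_items++;
    }

    /* Walker visits item[0] and saves next, as list_for_each_safe() does. */
    struct node *cur = lru.next;
    struct node *next = cur->next;
    assert(next == &item[1]);

    /* Walker drops the lock here. The other thread removes item[0] from
     * the LRU and moves item[1] to its private dispose list. */
    list_del_init(&item[0]);
    nr_items--;
    list_del_init(&item[1]);
    list_add_tail(&item[1], &dispose);
    nr_items--;

    /* Walker retakes the lock and retries with one more entry. */
    cur = restart_after_relock ? lru.next : next;
    if (cur != &lru) {
        list_del_init(cur);  /* "isolate": assumes cur is still on the LRU */
        nr_items--;
    }

    struct result r = { nr_items, dispose.next == &item[1] };
    return r;
}
```

With the stale next pointer, replay(false) unlinks item[1] from the
dispose list and drives the counter to -1, which matches both symptoms
described above (the leaked dispose-list entry and the negative item
count). replay(true) finds the LRU empty, leaves the dispose list
alone, and ends with a counter of 0.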