From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1750818AbWAJGsR (ORCPT ); Tue, 10 Jan 2006 01:48:17 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1750832AbWAJGsR (ORCPT ); Tue, 10 Jan 2006 01:48:17 -0500
Received: from 217-133-42-200.b2b.tiscali.it ([217.133.42.200]:40010
	"EHLO opteron.random") by vger.kernel.org with ESMTP
	id S1750818AbWAJGsR (ORCPT ); Tue, 10 Jan 2006 01:48:17 -0500
Date: Tue, 10 Jan 2006 07:48:12 +0100
From: Andrea Arcangeli
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org, hugh@veritas.com
Subject: Re: smp race fix between invalidate_inode_pages* and do_no_page
Message-ID: <20060110064812.GB15897@opteron.random>
References: <20051213193735.GE3092@opteron.random>
	<20051213130227.2efac51e.akpm@osdl.org>
	<20051213211441.GH3092@opteron.random>
	<20051216135147.GV5270@opteron.random>
	<20060110062425.GA15897@opteron.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20060110062425.GA15897@opteron.random>
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 10, 2006 at 07:24:25AM +0100, Andrea Arcangeli wrote:
> On Fri, Dec 16, 2005 at 02:51:47PM +0100, Andrea Arcangeli wrote:
> > There was a minor buglet in the previous patch an update is here:
> >
> > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.15-rc5/seqschedlock-2
>
> JFYI: I got a few hours ago positive confirmation of the fix from the
> customer that was capable of reproducing this. I guess this is good
> enough for production use (it's at the very least certainly better than
> the previous code and it's guaranteed not to hurt the scalability of the
> fast path in smp, so it's the least intrusive fix I could imagine).
>
> So we can start to think if we should using this new primitive I
> created, and if to replace the yield() with a proper waitqueue (and
> how). Or if to take the risk of hitting a bit of scalability in the
> nopage page faults of processes, by rewriting the fix with a
> find_lock_page in the do_no_page handler, that would avoid the need of
> my new locking primitive.

Another possible way to fix this is to put the page_count check back in
invalidate_*_* under the tree_lock (exactly like the VM does in
shrink_caches and exactly like 2.4 does too!), and to stop removing
pages from pagecache if their page_count is > 1 (we would go back to
clearing PG_uptodate instead).

But then we'd once again have a problem with the PG_uptodate bitflag
being cleared by invalidate_*_* while do_no_page is running, and in turn
a mapped page could have PG_uptodate clear 8). Forbidding that is the
invariant that led us to start zapping the ptes and dropping pagecache
in the first place. So we could then stop zapping the ptes too, like
2.4, and allow PG_uptodate to be clear on a mapped page (there's nothing
fundamentally wrong with that, as long as the buffers' BH_uptodate bits
are clear too).

Side note: in 2.6 invalidate_mapping_pages has been smp racy at least
since 2.6.5; it basically broke when the page_count check was replaced
with a page_mapping check long ago. But the race is probably so tiny and
so infrequent that it never triggered there. The first bugchecks in the
sles VM (bugchecks that were unfortunately removed when merging into
mainline) started to trigger only very recently, when we started zapping
ptes and dropping pages from invalidate_inode_pages2 like mainline does.
The bug was invisible in invalidate_mapping_pages (apparently only jffs2
uses invalidate_mapping_pages in a way that could oops; even nfs uses
invalidate_inode_pages2).
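
To make the page_count alternative above concrete, here is a rough
userspace sketch, not kernel code. Every name in it (toy_page,
toy_mapping, toy_find_get_page, toy_put_page, toy_invalidate_page) is
made up for illustration, a pthread mutex stands in for the tree_lock,
and a single pointer stands in for the radix tree. It assumes the
pagecache itself holds one reference, so a count > 1 means somebody else
(for example a fault in progress) has the page pinned.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are real kernel APIs. */
struct toy_page {
	int count;      /* 1 == referenced by the pagecache only */
	bool in_cache;  /* still reachable through the mapping */
	bool uptodate;  /* stands in for PG_uptodate */
};

struct toy_mapping {
	pthread_mutex_t tree_lock;  /* stands in for the mapping's tree_lock */
	struct toy_page *page;      /* one-slot stand-in for the radix tree */
};

/* Fault side: look the page up and pin it while holding tree_lock. */
static struct toy_page *toy_find_get_page(struct toy_mapping *m)
{
	struct toy_page *p;

	pthread_mutex_lock(&m->tree_lock);
	p = m->page;
	if (p)
		p->count++;
	pthread_mutex_unlock(&m->tree_lock);
	return p;
}

/* Drop a pin taken by toy_find_get_page(). */
static void toy_put_page(struct toy_mapping *m, struct toy_page *p)
{
	pthread_mutex_lock(&m->tree_lock);
	p->count--;
	pthread_mutex_unlock(&m->tree_lock);
}

/*
 * Invalidate side: the count check happens under the same lock as the
 * lookup, so a pinned page is never removed from the cache; it only
 * loses its uptodate bit. Returns true if the page was removed.
 */
static bool toy_invalidate_page(struct toy_mapping *m)
{
	bool removed = false;
	struct toy_page *p;

	pthread_mutex_lock(&m->tree_lock);
	p = m->page;
	if (p) {
		if (p->count > 1) {
			p->uptodate = false;    /* in use: keep it cached */
		} else {
			m->page = NULL;         /* unused: safe to drop */
			p->in_cache = false;
			p->count--;             /* the pagecache's reference */
			removed = true;
		}
	}
	pthread_mutex_unlock(&m->tree_lock);
	return removed;
}
```

Because the lookup and the count check are serialized by one lock, the
invalidate side either sees the fault's pin and leaves the page in the
cache, or removes the page before the fault can find it. The price is
exactly the one discussed above: a pinned (and possibly mapped) page can
end up with its uptodate bit clear.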