From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Chinner
Subject: Re: [RFC] How to fix broken freezing?
Date: Thu, 12 Jan 2012 09:36:31 +1100
Message-ID: <20120111223631.GL24410@dastard>
References: <20120106140931.GD20291@quack.suse.cz> <4F0E04D4.6040108@sandeen.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, Dave Chinner, Al Viro
To: Eric Sandeen
Return-path:
Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:11362
	"EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S932254Ab2AKWge (ORCPT );
	Wed, 11 Jan 2012 17:36:34 -0500
Content-Disposition: inline
In-Reply-To: <4F0E04D4.6040108@sandeen.net>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Wed, Jan 11, 2012 at 03:53:24PM -0600, Eric Sandeen wrote:
> On 1/6/12 8:09 AM, Jan Kara wrote:
> >   Hello,
> >
> >   I was looking at what causes filesystem to have dirty data after it is
> > frozen. After some thought I realized freezing code is inherently racy and
> > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem.
> >
> > The race is basically following:
> > Task 1                            Task 2
> > freeze_super()                    __generic_file_aio_write()
> >   ...                               vfs_check_frozen(sb, SB_FREEZE_WRITE)
> >   sb->s_frozen = SB_FREEZE_WRITE;
> >   sync_filesystem(sb);
> >                                     do the write
> >                                       /* Here we create dirty data
> >                                        * which is left on frozen fs */
> >   sb->s_frozen = SB_FREEZE_TRANS;
> >   ...
> >   ->freeze_fs()
> >
> > The problem is that you can never make checking for frozen filesystem
> > race-free with the current s_frozen scheme - the filesystem can always be
> > frozen the instant after you check for it and you end up creating dirty
> > data on frozen filesystem.
> >
> > The question is what to do with this problem. I outline the possibilities
> > that come to my mind below:
> > 1) Ignore the problem - depending on the exact fs details this could lead to
> >    fs snapshot being corrupted, also flusher thread can hang on the frozen
> >    filesystem (e.g. because of sync(1)) creating all sorts of secondary
> >    issues. So I don't think this is really an option.
> > 2) Have a rwlock in the superblock that is held for writing while
> >    filesystem freezing is in progress and held for reading by the filesystem
> >    while a transaction is running except for transactions that are required
> >    to do writeback. This is kind of ugly but at least for ext3/4 relatively
> >    easy to implement.
>
> This is as far as I had gotten while independently thinking about it ;)
>
> But talking with dchinner, he had concerns about the scalability of any
> rwlock, and I think we (ok, mostly Dave) came up with another idea.
>
> What if we had 2 counters in the superblock, one for the equivalent of
> SB_FREEZE_WRITE and one for SB_FREEZE_TRANS. These would use similar
> infrastructure to mnt_want_write et al.
>
> Everywhere we currently vfs_check_frozen() we'd have a better-named function
> which increments the counter, then checks the freeze level. If we are
> being frozen, we drop the counter & wait. If not frozen, we continue;
> like this pseudo-code:
>
> void super_freeze_wait(sb, level) {
> 	while (1) {
> 		level_ref++;
> 		if (!frozen(sb, level))
> 			return; /* not freezing */
> 		level_ref--;
> 		wait_unfrozen(sb, level);
> 	}
> }
>
> There would also be new code to drop the counter when the dirtying is complete.
>
> The freezing functions then just have to wait until the counters hit zero
> before they can consider themselves done, and freezing is complete. That way if
> someone sneaks in while the freeze level is being set, they have already
> notified their intent, and freeze can wait for it anyway before returning.

Just to clarify, freeze_super would do:

	sb->s_frozen = SB_FREEZE_WRITE;
	smp_wmb();
	while (sb->s_active_write_cnt > 0)
		wait;

	/* no new or existing dirtying writers now, safe to sync */
	sync_filesystem(sb);

	sb->s_frozen = SB_FREEZE_TRANS;
	smp_wmb();
	while (sb->s_active_trans_cnt > 0)
		wait;

	/* no new or existing transactions in progress now, so freeze */
	sb->s_op->freeze_fs(sb);

The counter implementations will need to scale (e.g. per-cpu counters)
and we could probably use a generic waitqueue, but I think this can
all be implemented at the superblock level and we only need to call
the inc/dec helper functions in the correct places to make it all
work.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
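
For illustration, the inc/dec helpers and the freezer-side check could be
modelled in userspace roughly as below. This is only a sketch under stated
assumptions, not kernel code: struct sb_model, sb_try_start(), sb_end(),
sb_set_freeze_level() and sb_level_quiesced() are made-up names, the enum
values are assumed to be ordered UNFROZEN < WRITE < TRANS, and plain C11
atomics stand in for the per-cpu counters and waitqueue mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Freeze levels; the numeric ordering is an assumption of this sketch. */
enum freeze_level { SB_UNFROZEN = 0, SB_FREEZE_WRITE = 1, SB_FREEZE_TRANS = 2 };

/* Hypothetical userspace stand-in for the superblock fields discussed. */
struct sb_model {
	atomic_int s_frozen;           /* current freeze level */
	atomic_int s_active_write_cnt; /* writers past the freeze check */
	atomic_int s_active_trans_cnt; /* transactions past the freeze check */
};

static atomic_int *level_counter(struct sb_model *sb, enum freeze_level level)
{
	return level == SB_FREEZE_WRITE ? &sb->s_active_write_cnt
					: &sb->s_active_trans_cnt;
}

/*
 * Caller side: announce intent first, then check the freeze level.
 * Returns true with the counter held if the caller may go ahead, or
 * false with the counter dropped if a freeze at this level has begun.
 * A real caller would sleep until unfrozen and retry.
 */
static bool sb_try_start(struct sb_model *sb, enum freeze_level level)
{
	atomic_int *cnt = level_counter(sb, level);

	/* Default seq_cst ordering: this side stores, then loads s_frozen. */
	atomic_fetch_add(cnt, 1);
	if (atomic_load(&sb->s_frozen) < (int)level)
		return true;
	atomic_fetch_sub(cnt, 1);
	return false;
}

/* Caller side: drop the counter once the dirtying is complete. */
static void sb_end(struct sb_model *sb, enum freeze_level level)
{
	int old = atomic_fetch_sub(level_counter(sb, level), 1);

	assert(old > 0);
	(void)old;
}

/* Freezer side: publish the new freeze level. */
static void sb_set_freeze_level(struct sb_model *sb, enum freeze_level level)
{
	atomic_store(&sb->s_frozen, level);
}

/*
 * Freezer side: true once every caller that got in before the level
 * became visible has finished. A real freezer would sleep until then.
 */
static bool sb_level_quiesced(struct sb_model *sb, enum freeze_level level)
{
	return atomic_load(level_counter(sb, level)) == 0;
}
```

A freezer would call sb_set_freeze_level(sb, SB_FREEZE_WRITE), wait until
sb_level_quiesced() reports true, sync, and then repeat the same two steps
for SB_FREEZE_TRANS before calling into the filesystem. A writer that loses
the race sees sb_try_start() fail and has to wait, so it never dirties data
behind the freezer's back.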