From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Chinner <david@fromorbit.com>
Subject: Re: [RFC] How to fix broken freezing?
Date: Thu, 12 Jan 2012 09:36:31 +1100
Message-ID: <20120111223631.GL24410@dastard>
References: <20120106140931.GD20291@quack.suse.cz>
 <4F0E04D4.6040108@sandeen.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara <jack@suse.cz>, linux-fsdevel@vger.kernel.org,
	Surbhi Palande <csurbhi@gmail.com>,
	Kamal Mostafa <kamal@canonical.com>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <dchinner@redhat.com>,
	Al Viro <viro@ZenIV.linux.org.uk>
To: Eric Sandeen <sandeen@sandeen.net>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:11362 "EHLO
	ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S932254Ab2AKWge (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 11 Jan 2012 17:36:34 -0500
Content-Disposition: inline
In-Reply-To: <4F0E04D4.6040108@sandeen.net>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Wed, Jan 11, 2012 at 03:53:24PM -0600, Eric Sandeen wrote:
> On 1/6/12 8:09 AM, Jan Kara wrote:
> >   Hello,
> > 
> >   I was looking at what causes filesystem to have dirty data after it is
> > frozen. After some thought I realized freezing code is inherently racy and
> > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem.
> > 
> > The race is basically following:
> > 	Task 1					Task 2
> > freeze_super()				__generic_file_aio_write()
> >   ...					  vfs_check_frozen(sb, SB_FREEZE_WRITE)
> >   sb->s_frozen = SB_FREEZE_WRITE;
> >   sync_filesystem(sb);
> > 					  do the write
> > 					    /* Here we create dirty data
> > 					     * which is left on frozen fs */
> >   sb->s_frozen = SB_FREEZE_TRANS;
> >   ...
> >   ->freeze_fs()
> > 
> > The problem is that you can never make checking for frozen filesystem
> > race-free with the current s_frozen scheme - the filesystem can always be
> > frozen the instant after you check for it and you end up creating dirty
> > data on frozen filesystem.
> > 
> > The question is what to do with this problem. I outline the possibilities
> > that come to my mind below:
> > 1) Ignore the problem - depending on the exact fs details this could lead to
> >    fs snapshot being corrupted, also flusher thread can hang on the frozen
> >    filesystem (e.g. because of sync(1)) creating all sorts of secondary
> >    issues. So I don't think this is really an option.
> > 2) Have a rwlock in the superblock that is held for writing while
> >    filesystem freezing is in progress and held for reading by the filesystem
> >    while a transaction is running except for transactions that are required
> >    to do writeback. This is kind of ugly but at least for ext3/4 relatively
> >    easy to implement.
> 
> This is as far as I had gotten while independently thinking about it ;)
> 
> But talking with dchinner, he had concerns about the scalability of any
> rwlock, and I think we (ok, mostly Dave) came up with another idea.
> 
> What if we had 2 counters in the superblock, one for the equivalent of
> SB_FREEZE_WRITE and one for SB_FREEZE_TRANS.  These would use similar
> infrastructure to mnt_want_write et al.
> 
> Everywhere we currently vfs_check_frozen() we'd have a better-named function
> which increments the counter, then checks the freeze level.  If we are
> being frozen, we drop the counter & wait.  If not frozen, we continue;
> like this pseudo-code:
> 
> void super_freeze_wait(sb, level) {
> 	while (1) {
> 		level_ref++;
> 		if (!frozen(sb, level))
> 			return;	/* not freezing */
> 		level_ref--;
> 		wait_unfrozen(sb, level);
> 	}
> }
> 
> There would also be new code to drop the counter when the dirtying is complete.
> 
> The freezing functions then just have to wait until the counters hit zero
> before they can consider themselves done, and freezing is complete.  That way if
> someone sneaks in while the freeze level is being set, they have already
> notified their intent, and freeze can wait for it anyway before returning.

Just to clarify, freeze_super would do:

	sb->s_frozen = SB_FREEZE_WRITE;
	smp_wmb();

	while (sb->s_active_write_cnt > 0)
		wait;

	/* no new or existing dirtying writers now, safe to sync */
	sync_filesystem(sb);

	sb->s_frozen = SB_FREEZE_TRANS;
	smp_wmb();

	while (sb->s_active_trans_cnt > 0)
		wait;

	/* no new or existing transactions in progress now, so freeze */
	sb->s_op->freeze_fs(sb);

The counter implementations will need to scale (e.g. per-cpu
counters) and we could probably use a generic waitqueue, but I think
this can all be implemented at the superblock level and we only need
to call the inc/dec helper functions in the correct places to make
it all work.
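
For illustration only, here is a minimal userspace sketch of how the
two halves of this scheme could fit together. It is not kernel code and
rests on several assumptions: struct sb_sketch and the helpers sb_init,
sb_try_start, sb_start, sb_end, sb_freeze_to and sb_thaw are
hypothetical names, plain C11 atomics stand in for per-cpu counters,
sequentially consistent atomic operations stand in for the explicit
barriers, and sched_yield() stands in for sleeping on a waitqueue. The
freeze level values are assumed to be ordered UNFROZEN < WRITE < TRANS.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed ordering of freeze levels: higher means "more frozen". */
enum { SB_UNFROZEN = 0, SB_FREEZE_WRITE = 1, SB_FREEZE_TRANS = 2 };

/* Hypothetical stand-in for the relevant superblock fields. */
struct sb_sketch {
	atomic_int s_frozen;
	atomic_int s_active_write_cnt;
	atomic_int s_active_trans_cnt;
};

static void sb_init(struct sb_sketch *sb)
{
	atomic_init(&sb->s_frozen, SB_UNFROZEN);
	atomic_init(&sb->s_active_write_cnt, 0);
	atomic_init(&sb->s_active_trans_cnt, 0);
}

static atomic_int *level_counter(struct sb_sketch *sb, int level)
{
	return level == SB_FREEZE_WRITE ? &sb->s_active_write_cnt
					: &sb->s_active_trans_cnt;
}

/*
 * Announce intent by taking a reference first, then check the freeze
 * level. Returns false (with the reference dropped again) if the
 * filesystem is frozen at or beyond this level.
 */
static bool sb_try_start(struct sb_sketch *sb, int level)
{
	atomic_int *cnt = level_counter(sb, level);

	atomic_fetch_add(cnt, 1);
	if (atomic_load(&sb->s_frozen) < level)
		return true;		/* not freezing at this level */
	atomic_fetch_sub(cnt, 1);	/* back off so the freezer can proceed */
	return false;
}

/* Blocking variant: retry until the filesystem is thawed far enough. */
static void sb_start(struct sb_sketch *sb, int level)
{
	while (!sb_try_start(sb, level))
		sched_yield();		/* placeholder for a waitqueue sleep */
}

/* Drop the reference once the dirtying operation is complete. */
static void sb_end(struct sb_sketch *sb, int level)
{
	int old = atomic_fetch_sub(level_counter(sb, level), 1);

	assert(old > 0);
	(void)old;
}

/* Freezer side: raise the level, then wait for that level's counter to drain. */
static void sb_freeze_to(struct sb_sketch *sb, int level)
{
	atomic_store(&sb->s_frozen, level);
	while (atomic_load(level_counter(sb, level)) > 0)
		sched_yield();		/* placeholder for a waitqueue sleep */
}

static void sb_thaw(struct sb_sketch *sb)
{
	atomic_store(&sb->s_frozen, SB_UNFROZEN);
}
```

In this sketch a freezer would call sb_freeze_to(sb, SB_FREEZE_WRITE),
sync, then call sb_freeze_to(sb, SB_FREEZE_TRANS) before freezing the
filesystem proper. Because a writer increments the counter before it
looks at s_frozen, a writer that slips in around the level change is
either seen by the freezer's drain loop or sees the new level and backs
off.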

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com