From: Dave Chinner <david@fromorbit.com>
To: Eric Sandeen <sandeen@sandeen.net>
Cc: Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org, Surbhi Palande <csurbhi@gmail.com>,
Kamal Mostafa <kamal@canonical.com>,
Christoph Hellwig <hch@infradead.org>,
Dave Chinner <dchinner@redhat.com>,
Al Viro <viro@ZenIV.linux.org.uk>
Subject: Re: [RFC] How to fix broken freezing?
Date: Thu, 12 Jan 2012 09:36:31 +1100 [thread overview]
Message-ID: <20120111223631.GL24410@dastard> (raw)
In-Reply-To: <4F0E04D4.6040108@sandeen.net>
On Wed, Jan 11, 2012 at 03:53:24PM -0600, Eric Sandeen wrote:
> On 1/6/12 8:09 AM, Jan Kara wrote:
> > Hello,
> >
> > I was looking at what causes filesystem to have dirty data after it is
> > frozen. After some thought I realized freezing code is inherently racy and
> > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem.
> >
> > The race is basically following:
> > Task 1 Task 2
> > freeze_super() __generic_file_aio_write()
> > ... vfs_check_frozen(sb, SB_FREEZE_WRITE)
> > sb->s_frozen = SB_FREEZE_WRITE;
> > sync_filesystem(sb);
> > do the write
> > /* Here we create dirty data
> > * which is left on frozen fs */
> > sb->s_frozen = SB_FREEZE_TRANS;
> > ...
> > ->freeze_fs()
> >
> > The problem is that you can never make checking for frozen filesystem
> > race-free with the current s_frozen scheme - the filesystem can always be
> > frozen the instant after you check for it and you end up creating dirty
> > data on frozen filesystem.
> >
> > The question is what to do with this problem. I outline the possibilities
> > that come to my mind below:
> > 1) Ignore the problem - depending on the exact fs details this could lead to
> > fs snapshot being corrupted, also flusher thread can hang on the frozen
> > filesystem (e.g. because of sync(1)) creating all sorts of secondary
> > issues. So I don't think this is really an option.
> > 2) Have a rwlock in the superblock that is held for writing while
> > filesystem freezing is in progress and held for reading by the filesystem
> > while a transaction is running except for transactions that are required
> > to do writeback. This is kind of ugly but at least for ext3/4 relatively
> > easy to implement.
>
> This is as far as I had gotten while independently thinking about it ;)
>
> But talking with dchinner, he had concerns about the scalability of any
> rwlock, and I think we (ok, mostly Dave) came up with another idea.
>
> What if we had 2 counters in the superblock, one for the equivalent of
> SB_FREEZE_WRITE and one for SB_FREEZE_TRANS. These would use similar
> infrastructure to mnt_want_write et al.
>
> Everywhere we currently vfs_check_frozen() we'd have a better-named function
> which increments the counter, then checks the freeze level. If we are
> being frozen, we drop the counter & wait. If not frozen, we continue;
> like this pseudo-code:
>
> void super_freeze_wait(sb, level) {
> while (1) {
> level_ref++;
> if (!frozen(sb, level))
> return; /* not freezing */
> level_ref--;
> wait_unfrozen(sb, level);
> }
> }
>
> There would also be new code to drop the counter when the dirtying is complete.
>
> The freezing functions then just have to wait until the counters hit zero
> before they can consider themselves done, and freezing is complete. That way if
> someone sneaks in while the freeze level is being set, they have already
> notified their intent, and freeze can wait for it anyway before returning.
Just to clarify, freeze_super would do:
sb->s_frozen = SB_FREEZE_WRITE;
smp_wmb();
while (sb->s_active_write_cnt > 0)
wait;
/* no new or existing dirtying writers now, safe to sync */
sync_filesystem(sb);
sb->s_frozen = SB_FREEZE_TRANS;
smp_wmb();
while (sb->s_active_trans_cnt > 0)
wait;
/* no new or existing transactions in progress now, so freeze */
sb->s_op->freeze_fs(sb);
The counter implemetations will need to scale (e.g. per-cpu
counters) and we could probably use a generic waitqueue, but I think
this can all be implemented at the superblock level and we only need
to call the inc/dec helper functions in the correct places to make
it all work.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2012-01-11 22:36 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-01-06 14:09 [RFC] How to fix broken freezing? Jan Kara
2012-01-11 21:53 ` Eric Sandeen
2012-01-11 22:36 ` Dave Chinner [this message]
2012-01-12 1:09 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120111223631.GL24410@dastard \
--to=david@fromorbit.com \
--cc=csurbhi@gmail.com \
--cc=dchinner@redhat.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=kamal@canonical.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=sandeen@sandeen.net \
--cc=viro@ZenIV.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).