From: Dave Chinner <david@fromorbit.com>
To: Badari Pulavarty <pbadari@gmail.com>
Cc: linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, sandeen@sandeen.net
Subject: Re: [PATCH] dio: track and serialise unaligned direct IO
Date: Sat, 31 Jul 2010 09:13:37 +1000	[thread overview]
Message-ID: <20100730231337.GD2126@dastard> (raw)
In-Reply-To: <1280511789.16484.18.camel@badari-desktop>

On Fri, Jul 30, 2010 at 10:43:09AM -0700, Badari Pulavarty wrote:
> On Fri, 2010-07-30 at 08:45 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If we get two unaligned direct IOs to the same filesystem block
> > that is marked as a new allocation (i.e. buffer_new), then both IOs will
> > zero the portion of the block they are not writing data to. As a
> > result, when the IOs complete there will be a portion of the block
> > that contains zeros from the last IO to complete rather than the
> > data that should be there.
> > 
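[Illustration, not part of the patch: a minimal Python model of the race. The block, offsets and payloads are made up; the point is that each unaligned write to a buffer_new block zeroes the part of the block it does not cover, so the last IO to complete clobbers the other's data.]

```python
# Hypothetical model of the buffer_new sub-block zeroing race.
BLOCK_SIZE = 4096

def unaligned_write(block, offset, data):
    """Model of a DIO write to a newly allocated block: write the
    payload, then zero the rest of the block (the sub-block zeroing
    the generic direct IO code performs for buffer_new blocks)."""
    block[offset:offset + len(data)] = data
    block[:offset] = b"\0" * offset
    block[offset + len(data):] = b"\0" * (BLOCK_SIZE - offset - len(data))

block = bytearray(BLOCK_SIZE)
unaligned_write(block, 0, b"A" * 512)    # IO 1 writes bytes 0..511
unaligned_write(block, 512, b"B" * 512)  # IO 2 completes last, zeroes 0..511

# IO 2's sub-block zeroing has overwritten IO 1's data with zeros:
print(block[:512] == b"\0" * 512)        # True - corruption
print(block[512:1024] == b"B" * 512)     # True
```

With serialisation the second write would wait for the first to finish and then skip its own zeroing, leaving both payloads intact.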
> > This is easily manifested by qemu using aio+dio with an unaligned
> > guest filesystem - every IO is unaligned and filesystem corruption is
> > encountered in the guest filesystem. xfstest 240 (from Eric Sandeen)
> > is also a simple reproducer.
> > 
> > To avoid this problem, track unaligned IO that triggers sub-block zeroing and
> > check new incoming unaligned IOs that require sub-block zeroing against that
> > list. If we get an overlap where the start and end of unaligned IOs hit the
> > same filesystem block, then we need to block the incoming IOs until the IO that
> > is zeroing the block completes. The blocked IO can then continue without
> > needing to do any zeroing and hence won't overwrite valid data with zeros.
> > 
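[Illustration, not the patch itself: a hedged Python sketch of the serialisation scheme described above — track blocks with sub-block zeroing in flight, make later overlapping unaligned IOs wait, and let the waiters skip the zeroing. The dict, function names and Event-based wakeup are all my own stand-ins for the kernel's list and wait queue.]

```python
import threading

# block number -> Event that is set when that block's zeroing IO completes
zeroing_in_flight = {}
lock = threading.Lock()

def begin_unaligned_io(block_nr):
    """Return True if this IO must do the sub-block zeroing itself,
    False if another IO already zeroed the block (we waited for it)."""
    with lock:
        ev = zeroing_in_flight.get(block_nr)
        if ev is None:
            zeroing_in_flight[block_nr] = threading.Event()
            return True           # we own the zeroing for this block
    ev.wait()                     # block until the zeroing IO completes
    return False                  # block already zeroed; don't redo it

def end_unaligned_io(block_nr):
    """Called at IO completion: drop the tracking entry and wake waiters."""
    with lock:
        ev = zeroing_in_flight.pop(block_nr, None)
    if ev is not None:
        ev.set()
```

A waiter that returns False continues without zeroing, so it cannot overwrite the owner's data with zeros.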
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> I can confirm that it fixes corruption of my VM images when using AIO
> +DIO (cache=none,aio=native). I haven't reviewed the patch closely, but
> 
> 1) can we do this only for the AIO+DIO combination? For regular DIO, we
> should have all the IOs serialized by i_mutex anyway..

Not for filesystems that do their own locking. In most cases XFS
does not take the i_mutex during DIO writes, and when it does it
drops it long before we call into the generic direct IO code that
does the sub-block zeroing. So the i_mutex does not guarantee any
form of serialisation in direct IO writes at all.

> 2) Having a single global list (for all devices) might cause scaling
> issues.

Unaligned direct IO is undesirable in the first place. While we
should behave correctly in this corner case, I don't see any need
for it to be particularly efficient, as the real fix for performance
problems with unaligned DIO is to not issue it in the first place.

> 3) Are you dropping i_mutex when you are waiting for the zero-out to
> finish?

For XFS we're not holding the i_mutex - and we can't take the
i_mutex either as that will cause lock inversion issues. We don't
know what locks are held, we don't know whether it is safe to drop
and take locks, we don't even have the context to operate on
filesystem specific locks to avoid ordering problems. If we can't
sleep with the locks we already have held at this point, the DIO is
already broken for that filesystem.

Besides, if the i_mutex is already held for some filesystem when we
zero blocks, then we can't very well have concurrent block zeroing
in progress, and therefore can't hit this bug, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


