* XFS and write barriers.
@ 2007-03-23  1:26 Neil Brown
  2007-03-23  5:30 ` David Chinner
  2007-03-23  6:20 ` Timothy Shimmin
  0 siblings, 2 replies; 21+ messages in thread
From: Neil Brown @ 2007-03-23  1:26 UTC (permalink / raw)
  To: xfs


Hi,
 I have two concerns related to XFS and write barrier support that I'm
 hoping can be resolved.

Firstly in xfs_mountfs_check_barriers in fs/xfs/linux-2.6/xfs_super.c,
it tests ....->queue->ordered to see if that is QUEUE_ORDERED_NONE.
If it is, then barriers are disabled.

I think this is a layering violation - xfs really has no business
looking that deeply into the device.
For dm and md devices, ->ordered is never used and so never set, so
xfs will never use barriers on those devices (as the default value is
0 or NONE).  It is true that md and dm could set ->ordered to some
non-zero value just to please XFS, but that would be telling a lie and
there is no possible value that is relevant to a layered device.

I think this test should just be removed and the xfs_barrier_test
should be the main mechanism for seeing if barriers work.

Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
retried without the barrier (after possibly waiting for dependant
requests to complete).  This is what other filesystems do, but I
cannot find the code in xfs which does this.
The approach taken by xfs_barrier_test seems to suggest that xfs does
do this... could someone please point me to the code ?

This is particularly important for md/raid1 as it is quite possible
that barriers will be supported at first, but after a failure a
different device on a different controller could be swapped in that
does not support barriers.
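
For reference, what the other filesystems do amounts to something like the
following - only a sketch of the ext3/jbd-style fallback using 2.6
buffer_head helpers, with "barrier_ok" standing in for a per-mount flag and
most error handling trimmed:

#include <linux/buffer_head.h>

/*
 * Sketch only: write a "barrier" block, falling back to a plain write
 * if the device reports -EOPNOTSUPP, and remember the result so that
 * future writes skip the barrier.
 */
static int write_barrier_block(struct buffer_head *bh, int *barrier_ok)
{
	int ret;

	set_buffer_dirty(bh);
	if (*barrier_ok) {
		set_buffer_ordered(bh);		/* ask for barrier semantics */
		ret = sync_dirty_buffer(bh);	/* submit and wait */
		clear_buffer_ordered(bh);
		if (ret == -EOPNOTSUPP) {
			/* device no longer honours barriers: note it, retry plain */
			*barrier_ok = 0;
			set_buffer_uptodate(bh);
			set_buffer_dirty(bh);
			ret = sync_dirty_buffer(bh);
		}
	} else {
		ret = sync_dirty_buffer(bh);
	}
	return ret;
}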

Thanks for your time,
NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  1:26 XFS and write barriers Neil Brown
@ 2007-03-23  5:30 ` David Chinner
  2007-03-23  7:49   ` Neil Brown
  2007-03-23  9:50   ` Christoph Hellwig
  2007-03-23  6:20 ` Timothy Shimmin
  1 sibling, 2 replies; 21+ messages in thread
From: David Chinner @ 2007-03-23  5:30 UTC (permalink / raw)
  To: Neil Brown; +Cc: xfs, hch

On Fri, Mar 23, 2007 at 12:26:31PM +1100, Neil Brown wrote:
> 
> Hi,
>  I have two concerns related to XFS and write barrier support that I'm
>  hoping can be resolved.
> 
> Firstly in xfs_mountfs_check_barriers in fs/xfs/linux-2.6/xfs_super.c,
> it tests ....->queue->ordered to see if that is QUEUE_ORDERED_NONE.
> If it is, then barriers are disabled.
> 
> I think this is a layering violation - xfs really has no business
> looking that deeply into the device.

Except that the device behaviour determines what XFS needs to do
and there used to be no other way to find out.

Christoph, any reason for needing this check anymore? I can't see
any particular reason for needing to do this as __make_request()
will check it for us when we test now.

> I think this test should just be removed and the xfs_barrier_test
> should be the main mechanism for seeing if barriers work.

Yup.

> Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
> retried without the barrier (after possibly waiting for dependant
> requests to complete).  This is what other filesystems do, but I
> cannot find the code in xfs which does this.

XFS doesn't handle this - I was unaware that the barrier status of the
underlying block device could change....

OOC, when did this behaviour get introduced?

> The approach taken by xfs_barrier_test seems to suggest that xfs does
> do this... could someone please point me to the code ?

We test at mount time if barriers are supported, and the decision
lasts the life of the mount.

> This is particularly important for md/raid1 as it is quite possible
> that barriers will be supported at first, but after a failure a
> different device on a different controller could be swapped in that
> does not support barriers.

I/O errors are not the way this should be handled. What happens if
the opposite happens? A drive that needs barriers is used as a
replacement on a filesystem that has barriers disabled because they
weren't needed? Now a crash can result in filesystem corruption, but
the filesystem has not been able to warn the admin that this
situation occurred. 

/waves hands

At the recent FS/IO workshop in San Jose I raised the issue of how
we can get the I/O layers to tell the filesystems about changes in
status of the block layer that can affect filesystem behaviour. This
is a perfect example of the sort of communication that is needed....

In the mean time, we'll need to do something like the untested
patch below.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_buf.c   |   13 ++++++++++++-
 fs/xfs/linux-2.6/xfs_super.c |    8 --------
 fs/xfs/xfs_log.c             |   13 +++++++++++++
 3 files changed, 25 insertions(+), 9 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-02-07 15:51:09.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2007-03-23 16:19:19.790517132 +1100
@@ -1000,7 +1000,18 @@ xfs_buf_iodone_work(
 	xfs_buf_t		*bp =
 		container_of(work, xfs_buf_t, b_iodone_work);
 
-	if (bp->b_iodone)
+	/*
+	 * We can get an EOPNOTSUPP to ordered writes.  Here we clear the
+	 * ordered flag and reissue them.  Because we can't tell the higher
+	 * layers directly that they should not issue ordered I/O anymore, they
+	 * need to check if the ordered flag was cleared during I/O completion.
+	 */
+	if ((bp->b_error == EOPNOTSUPP) &&
+	    (bp->b_flags & (XBF_ORDERED|XBF_ASYNC)) == (XBF_ORDERED|XBF_ASYNC)) {
+		XB_TRACE(bp, "ordered_retry", bp->b_iodone);
+		bp->b_flags &= ~XBF_ORDERED;
+		xfs_buf_iorequest(bp);
+	} else if (bp->b_iodone)
 		(*(bp->b_iodone))(bp);
 	else if (bp->b_flags & XBF_ASYNC)
 		xfs_buf_relse(bp);
Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c	2007-03-23 15:00:05.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_log.c	2007-03-23 16:19:40.355818889 +1100
@@ -961,6 +961,19 @@ xlog_iodone(xfs_buf_t *bp)
 	l = iclog->ic_log;
 
 	/*
+	 * If the ordered flag has been removed by a lower
+	 * layer, it means the underlying device no longer supports
+	 * barrier I/O. Warn loudly and turn off barriers.
+	 */
+	if ((l->l_mp->m_flags & XFS_MOUNT_BARRIER) && !XFS_BUF_ORDERED(bp)) {
+		l->l_mp->m_flags &= ~XFS_MOUNT_BARRIER;
+		xfs_fs_cmn_err(CE_WARN, l->l_mp,
+				"xlog_iodone: Barriers are no longer supported"
+				" by device. Disabling barriers\n");
+		xfs_buftrace("XLOG_IODONE BARRIERS OFF", bp);
+	}
+
+	/*
 	 * Race to shutdown the filesystem if we see an error.
 	 */
 	if (XFS_TEST_ERROR((XFS_BUF_GETERROR(bp)), l->l_mp,
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c	2007-03-16 12:48:54.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c	2007-03-23 16:24:26.998220227 +1100
@@ -314,14 +314,6 @@ xfs_mountfs_check_barriers(xfs_mount_t *
 		return;
 	}
 
-	if (mp->m_ddev_targp->bt_bdev->bd_disk->queue->ordered ==
-					QUEUE_ORDERED_NONE) {
-		xfs_fs_cmn_err(CE_NOTE, mp,
-		  "Disabling barriers, not supported by the underlying device");
-		mp->m_flags &= ~XFS_MOUNT_BARRIER;
-		return;
-	}
-
 	if (xfs_readonly_buftarg(mp->m_ddev_targp)) {
 		xfs_fs_cmn_err(CE_NOTE, mp,
 		  "Disabling barriers, underlying device is readonly");

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  1:26 XFS and write barriers Neil Brown
  2007-03-23  5:30 ` David Chinner
@ 2007-03-23  6:20 ` Timothy Shimmin
  2007-03-23  8:00   ` Neil Brown
  1 sibling, 1 reply; 21+ messages in thread
From: Timothy Shimmin @ 2007-03-23  6:20 UTC (permalink / raw)
  To: Neil Brown, xfs

Hi Neil,


--On 23 March 2007 12:26:31 PM +1100 Neil Brown <neilb@suse.de> wrote:

>
> Hi,
>  I have two concerns related to XFS and write barrier support that I'm
>  hoping can be resolved.
>

1.
> Firstly in xfs_mountfs_check_barriers in fs/xfs/linux-2.6/xfs_super.c,
> it tests ....->queue->ordered to see if that is QUEUE_ORDERED_NONE.
> If it is, then barriers are disabled.
>
> I think this is a layering violation - xfs really has no business
> looking that deeply into the device.
> For dm and md devices, ->ordered is never used and so never set, so
> xfs will never use barriers on those devices (as the default value is
> 0 or NONE).  It is true that md and dm could set ->ordered to some
> non-zero value just to please XFS, but that would be telling a lie and
> there is no possible value that is relevant to a layered device.
>
> I think this test should just be removed and the xfs_barrier_test
> should be the main mechanism for seeing if barriers work.
>
Oh okay.
This is all Christoph's (hch) code, so it would be good for him to comment here.
The external log and readonly tests can stay though.

2.
> Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
> retried without the barrier (after possibly waiting for dependant
> requests to complete).  This is what other filesystems do, but I
> cannot find the code in xfs which does this.
> The approach taken by xfs_barrier_test seems to suggest that xfs does
> do this... could someone please point me to the code ?
>
You got me confused here.
I was wondering why the test write of the superblock (in xfs_barrier_test)
should be retried without barriers :)
But you were referring to the writing of the log buffers using barriers.
Yeah, if we get an EOPNOTSUPP then AFAIK we will report the error and shut down
the filesystem (xlog_iodone()). This will happen when I/O for one of our (up to
8) incore log buffers completes and the xlog_iodone handler is called.
I don't believe we have a notion of barrier'ness changing for us, and
we just test it at mount time.
Which bit of code led you to believe we do a retry?

> This is particularly important for md/raid1 as it is quite possible
> that barriers will be supported at first, but after a failure a
> different device on a different controller could be swapped in that
> does not support barriers.
>

Oh okay, I see. And then later one that supported them can be swapped back in?
So the other FSs are doing a sync'ed write out and then if there is an
EOPNOTSUPP they retry and disable barrier support henceforth?
Yeah, I guess we could do that in xlog_iodone() on failed completion and retry the write without
the ORDERED flag on EOPNOTSUPP error case (and turn off the flag).
Dave (dgc) can you see a problem with that?

> Thanks for your time,
Thanks for pointing it out.

--Tim

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  5:30 ` David Chinner
@ 2007-03-23  7:49   ` Neil Brown
  2007-03-25  4:17     ` David Chinner
  2007-03-23  9:50   ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Neil Brown @ 2007-03-23  7:49 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, hch

On Friday March 23, dgc@sgi.com wrote:
> On Fri, Mar 23, 2007 at 12:26:31PM +1100, Neil Brown wrote:
> > Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
> > retried without the barrier (after possibly waiting for dependant
> > requests to complete).  This is what other filesystems do, but I
> > cannot find the code in xfs which does this.
> 
> XFS doesn't handle this - I was unaware that the barrier status of the
> underlying block device could change....
> 
> OOC, when did this behaviour get introduced?

Probably when md/raid1 started supporting barriers....

The problem is that this interface is (as far as I can see) undocumented
and not fully specified.

Barriers only make sense inside drive firmware.  Trying to emulate it
in the md layer doesn't make any sense as the filesystem is in a much
better position to do any emulation required.
So as the devices can change underneath md/raid1, it must be able to
fail a barrier request at any point.

The first file systems to use barriers (ext3, reiserfs) submit a
barrier request and if that fails they decide that barriers don't work
any more and use the fall-back mechanism.

That seemed to mesh perfectly with what I needed for md, so I assumed
it was an intended feature of the interface and made md/raid1 depend
on it.


> > This is particularly important for md/raid1 as it is quite possible
> > that barriers will be supported at first, but after a failure a
> > different device on a different controller could be swapped in that
> > does not support barriers.
> 
> I/O errors are not the way this should be handled. What happens if
> the opposite happens? A drive that needs barriers is used as a
> replacement on a filesystem that has barriers disabled because they
> weren't needed? Now a crash can result in filesystem corruption, but
> the filesystem has not been able to warn the admin that this
> situation occurred. 

There should never be a possibility of filesystem corruption.
If a barrier request fails, the filesystem should:
  wait for any dependant request to complete
  call blkdev_issue_flush
  schedule the write of the 'barrier' block
  call blkdev_issue_flush again.

My understanding is that that sequence is as safe as a barrier, but maybe
not as fast.
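
Spelled out with 2.6 block layer calls, that sequence would look something
like the sketch below; wait_for_dependent_io() and write_block_sync() are
hypothetical stand-ins for whatever primitives the filesystem already uses
for ordering and synchronous writes:

#include <linux/blkdev.h>
#include <linux/buffer_head.h>

/*
 * Sketch of the four-step fallback: drain dependent I/O, flush the
 * device cache, write the "barrier" block, flush again.
 */
static int emulate_barrier(struct block_device *bdev, struct buffer_head *bh)
{
	int err;

	wait_for_dependent_io(bh);		/* step 1: dependent writes done */

	err = blkdev_issue_flush(bdev, NULL);	/* step 2: drain the write cache */
	if (err)
		return err;

	err = write_block_sync(bh);		/* step 3: the "barrier" block */
	if (err)
		return err;

	return blkdev_issue_flush(bdev, NULL);	/* step 4: make it stable too */
}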

The patch looks at least believable.  As you can imagine it is awkward
to test thoroughly.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  6:20 ` Timothy Shimmin
@ 2007-03-23  8:00   ` Neil Brown
  2007-03-25  3:19     ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Brown @ 2007-03-23  8:00 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: xfs

On Friday March 23, tes@sgi.com wrote:
> >
> > I think this test should just be removed and the xfs_barrier_test
> > should be the main mechanism for seeing if barriers work.
> >
> Oh okay.
> This is all Christoph's (hch) code, so it would be good for him to comment here.
> The external log and readonly tests can stay though.
> 

Why no barriers on an external log device??? Not important, just
curious.

> 2.
> > Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
> > retried without the barrier (after possibly waiting for dependant
> > requests to complete).  This is what other filesystems do, but I
> > cannot find the code in xfs which does this.
> > The approach taken by xfs_barrier_test seems to suggest that xfs does
> > do this... could someone please point me to the code ?
> >
> You got me confused here.
> I was wondering why the test write of the superblock (in xfs_barrier_test)
> should be retried without barriers :)
> But you were referring to the writing of the log buffers using barriers.
> Yeah, if we get an EOPNOTSUPP then AFAIK we will report the error and shut down
> the filesystem (xlog_iodone()). This will happen when I/O for one of our (up to
> 8) incore log buffers completes and the xlog_iodone handler is called.
> I don't believe we have a notion of barrier'ness changing for us, and
> we just test it at mount time.
> Which bit of code led you to believe we do a retry?

Uhmm.. I think I just got confused reading xfs_barrier_test; I cannot
see it anymore (I think I didn't see the error return and so assumed
some lower layer must be setting some state flag).

> 
> > This is particularly important for md/raid1 as it is quite possible
> > that barriers will be supported at first, but after a failure a
> > different device on a different controller could be swapped in that
> > does not support barriers.
> >
> 
> Oh okay, I see. And then later one that supported them can be swapped back in?
> So the other FSs are doing a sync'ed write out and then if there is an
> EOPNOTSUPP they retry and disable barrier support henceforth?
> Yeah, I guess we could do that in xlog_iodone() on failed completion and retry the write without
> the ORDERED flag on EOPNOTSUPP error case (and turn off the flag).
> Dave (dgc) can you see a problem with that?

If an md/raid1 disables barriers and subsequently is restored to a
state where all drives support barriers, it currently does *not*
re-enable them device-wide.  This would probably be quite easy to
achieve, but as no existing filesystem would ever try barriers
again.....

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  5:30 ` David Chinner
  2007-03-23  7:49   ` Neil Brown
@ 2007-03-23  9:50   ` Christoph Hellwig
  2007-03-25  3:51     ` David Chinner
  2007-03-26  1:11     ` Neil Brown
  1 sibling, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2007-03-23  9:50 UTC (permalink / raw)
  To: David Chinner; +Cc: Neil Brown, xfs, hch

On Fri, Mar 23, 2007 at 04:30:43PM +1100, David Chinner wrote:
> On Fri, Mar 23, 2007 at 12:26:31PM +1100, Neil Brown wrote:
> > 
> > Hi,
> >  I have two concerns related to XFS and write barrier support that I'm
> >  hoping can be resolved.
> > 
> > Firstly in xfs_mountfs_check_barriers in fs/xfs/linux-2.6/xfs_super.c,
> > it tests ....->queue->ordered to see if that is QUEUE_ORDERED_NONE.
> > If it is, then barriers are disabled.
> > 
> > I think this is a layering violation - xfs really has no business
> > looking that deeply into the device.
> 
> Except that the device behaviour determines what XFS needs to do
> and there used to be no other way to find out.
> 
> Christoph, any reason for needing this check anymore? I can't see
> any particular reason for needing to do this as __make_request()
> will check it for us when we test now.

When I first implemented it I really disliked the idea of having requests
fail asynchronously due to the lack of barriers.  Then someone (Jens?)
told me we need to do this check anyway because devices might lie to
us, at which point I implemented the test superblock writeback to
check if it actually works.

So yes, we could probably get rid of the check now, although I'd
prefer the block layer exporting an API to the filesystem to tell
it whether there is any point in trying to use barriers.
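
Something as simple as the following would do - purely illustrative, not an
existing kernel interface; it just wraps the queue->ordered peek inside the
block layer so filesystems don't have to reach into the queue themselves:

#include <linux/blkdev.h>

/*
 * Illustrative only: the sort of query the block layer could export.
 */
static inline int bdev_barriers_worth_trying(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	return q && q->ordered != QUEUE_ORDERED_NONE;
}

Of course, as Neil argues further down, even with a hint like this the
filesystem would still have to cope with a barrier write failing later.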

> > Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
> > retried without the barrier (after possibly waiting for dependant
> > requests to complete).  This is what other filesystems do, but I
> > cannot find the code in xfs which does this.
> 
> XFS doesn't handle this - I was unaware that the barrier status of the
> underlying block device could change....
> 
> OOC, when did this behaviour get introduced?

That would be really bad.  XFS metadata buffers can have multiple bios
and retrying a single one would be rather difficult.

> +	/*
> +	 * We can get an EOPNOTSUPP to ordered writes.  Here we clear the
> +	 * ordered flag and reissue them.  Because we can't tell the higher
> +	 * layers directly that they should not issue ordered I/O anymore, they
> +	 * need to check if the ordered flag was cleared during I/O completion.
> +	 */
> +	if ((bp->b_error == EOPNOTSUPP) &&
> +	    (bp->b_flags & (XBF_ORDERED|XBF_ASYNC)) == (XBF_ORDERED|XBF_ASYNC)) {
> +		XB_TRACE(bp, "ordered_retry", bp->b_iodone);
> +		bp->b_flags &= ~XBF_ORDERED;
> +		xfs_buf_iorequest(bp);
> +	} else if (bp->b_iodone)
>  		(*(bp->b_iodone))(bp);
>  	else if (bp->b_flags & XBF_ASYNC)
>  		xfs_buf_relse(bp);

So you're retrying the whole I/O, this is probably better than trying
to handle this at the bio level.  I still don't quite like doing another
I/O from the I/O completion handler.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  8:00   ` Neil Brown
@ 2007-03-25  3:19     ` David Chinner
  2007-03-26  0:01       ` Neil Brown
  2007-03-27  3:58       ` Timothy Shimmin
  0 siblings, 2 replies; 21+ messages in thread
From: David Chinner @ 2007-03-25  3:19 UTC (permalink / raw)
  To: Neil Brown; +Cc: Timothy Shimmin, xfs

On Fri, Mar 23, 2007 at 07:00:46PM +1100, Neil Brown wrote:
> On Friday March 23, tes@sgi.com wrote:
> > >
> > > I think this test should just be removed and the xfs_barrier_test
> > > should be the main mechanism for seeing if barriers work.
> > >
> > Oh okay.
> > This is all Christoph's (hch) code, so it would be good for him to comment here.
> > The external log and readonly tests can stay though.
> > 
> 
> Why no barriers on an external log device??? Not important, just
> curious.

because we need to synchronize across 2 devices, not one, so issuing
barriers on an external log device does nothing to order the metadata
written to the other device...

> > > This is particularly important for md/raid1 as it is quite possible
> > > that barriers will be supported at first, but after a failure a
> > > different device on a different controller could be swapped in that
> > > does not support barriers.
> > >
> > 
> > Oh okay, I see. And then later one that supported them can be swapped back in?
> > So the other FSs are doing a sync'ed write out and then if there is an
> > EOPNOTSUPP they retry and disable barrier support henceforth?
> > Yeah, I guess we could do that in xlog_iodone() on failed completion and retry the write without
> > the ORDERED flag on EOPNOTSUPP error case (and turn off the flag).
> > Dave (dgc) can you see a problem with that?
> 
> If an md/raid1 disables barriers and subsequently is restored to a
> state where all drives support barriers, it currently does *not*
> re-enable them device-wide.  This would probably be quite easy to
> achieve, but as no existing filesystem would ever try barriers
> again.....

And this is exactly why I think we need a block->fs communications
channel for these sorts of things. Think of something like the CPU
hotplug notifier mechanisms as a rough example framework....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  9:50   ` Christoph Hellwig
@ 2007-03-25  3:51     ` David Chinner
  2007-03-25 23:58       ` Neil Brown
  2007-03-26  1:11     ` Neil Brown
  1 sibling, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-03-25  3:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: David Chinner, Neil Brown, xfs

On Fri, Mar 23, 2007 at 09:50:55AM +0000, Christoph Hellwig wrote:
> On Fri, Mar 23, 2007 at 04:30:43PM +1100, David Chinner wrote:
> > On Fri, Mar 23, 2007 at 12:26:31PM +1100, Neil Brown wrote:
> > > 
> > > Hi,
> > >  I have two concerns related to XFS and write barrier support that I'm
> > >  hoping can be resolved.
> > > 
> > > Firstly in xfs_mountfs_check_barriers in fs/xfs/linux-2.6/xfs_super.c,
> > > it tests ....->queue->ordered to see if that is QUEUE_ORDERED_NONE.
> > > If it is, then barriers are disabled.
> > > 
> > > I think this is a layering violation - xfs really has no business
> > > looking that deeply into the device.
> > 
> > Except that the device behaviour determines what XFS needs to do
> > and there used to be no other way to find out.
> > 
> > Christoph, any reason for needing this check anymore? I can't see
> > any particular reason for needing to do this as __make_request()
> > will check it for us when we test now.
> 
> When I first implemented it I really disliked the idea of having requests
> fail asynchronously due to the lack of barriers.  Then someone (Jens?)
> told me we need to do this check anyway because devices might lie to
> us, at which point I implemented the test superblock writeback to
> check if it actually works.
> 
> So yes, we could probably get rid of the check now, although I'd
> prefer the block layer exporting an API to the filesystem to tell
> it whether there is any point in trying to use barriers.

Ditto.

> > > Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
> > > retried without the barrier (after possibly waiting for dependant
> > > requests to complete).  This is what other filesystems do, but I
> > > cannot find the code in xfs which does this.
> > 
> > XFS doesn't handle this - I was unaware that the barrier status of the
> > underlying block device could change....
> > 
> > OOC, when did this behaviour get introduced?
> 
> That would be really bad.  XFS metadata buffers can have multiple bios
> and retrying a single one would be rather difficult.
> 
> > +	/*
> > +	 * We can get an EOPNOTSUPP to ordered writes.  Here we clear the
> > +	 * ordered flag and reissue them.  Because we can't tell the higher
> > +	 * layers directly that they should not issue ordered I/O anymore, they
> > +	 * need to check if the ordered flag was cleared during I/O completion.
> > +	 */
> > +	if ((bp->b_error == EOPNOTSUPP) &&
> > +	    (bp->b_flags & (XBF_ORDERED|XBF_ASYNC)) == (XBF_ORDERED|XBF_ASYNC)) {
> > +		XB_TRACE(bp, "ordered_retry", bp->b_iodone);
> > +		bp->b_flags &= ~XBF_ORDERED;
> > +		xfs_buf_iorequest(bp);
> > +	} else if (bp->b_iodone)
> >  		(*(bp->b_iodone))(bp);
> >  	else if (bp->b_flags & XBF_ASYNC)
> >  		xfs_buf_relse(bp);
> 
> So you're retrying the whole I/O, this is probably better than trying
> to handle this at the bio level.  I still don't quite like doing another
> I/O from the I/O completion handler.

You're not the only one, Christoph. This may be better than trying
to handle it at lower layers, and far better than having to handle
it at every point in the higher layers where we may issue barrier
I/Os. 

But I *seriously dislike* having to reissue async I/Os in this
manner and then having to rely on a higher layer's I/O completion
handler to detect the fact that the I/O was retried to change the
way the filesystem issues I/Os in the future. It's a really crappy
way of communicating between layers....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  7:49   ` Neil Brown
@ 2007-03-25  4:17     ` David Chinner
  2007-03-25 23:21       ` Neil Brown
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-03-25  4:17 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, xfs, hch

On Fri, Mar 23, 2007 at 06:49:50PM +1100, Neil Brown wrote:
> On Friday March 23, dgc@sgi.com wrote:
> > On Fri, Mar 23, 2007 at 12:26:31PM +1100, Neil Brown wrote:
> > > Secondly, if a barrier write fails due to EOPNOTSUPP, it should be
> > > retried without the barrier (after possibly waiting for dependant
> > > requests to complete).  This is what other filesystems do, but I
> > > cannot find the code in xfs which does this.
> > 
> > XFS doesn't handle this - I was unaware that the barrier status of the
> > underlying block device could change....
> > 
> > OOC, when did this behaviour get introduced?
> 
> Probably when md/raid1 started supporting barriers....
> 
> The problem is that this interface is (as far as I can see) undocumented
> and not fully specified.

And not communicated very far, either.

> Barriers only make sense inside drive firmware.

I disagree. e.g. Barriers have to be handled by the block layer to
prevent reordering of I/O in the request queues as well. The
block layer is responsible for ensuring barrier I/Os, as
indicated by the filesystem, act as real barriers.

> Trying to emulate it
> in the md layer doesn't make any sense as the filesystem is in a much
> better position to do any emulation required.

You're saying that the emulation of block layer functionality is the
responsibility of layers above the block layer. Why is this not
considered a layering violation?

> > > This is particularly important for md/raid1 as it is quite possible
> > > that barriers will be supported at first, but after a failure a
> > > different device on a different controller could be swapped in that
> > > does not support barriers.
> > 
> > I/O errors are not the way this should be handled. What happens if
> > the opposite happens? A drive that needs barriers is used as a
> > replacement on a filesystem that has barriers disabled because they
> > weren't needed? Now a crash can result in filesystem corruption, but
> > the filesystem has not been able to warn the admin that this
> > situation occurred. 
> 
> There should never be a possibility of filesystem corruption.
> If a barrier request fails, the filesystem should:
>   wait for any dependant request to complete
>   call blkdev_issue_flush
>   schedule the write of the 'barrier' block
>   call blkdev_issue_flush again.

IOWs, the filesystem has to use block device calls to emulate a block device
barrier I/O. Why can't the block layer, on reception of a barrier write
and detecting that barriers are no longer supported by the underlying
device (i.e. in MD), do:

	wait for all queued I/Os to complete
	call blkdev_issue_flush
	schedule the write of the 'barrier' block
	call blkdev_issue_flush again.

And not involve the filesystem at all? i.e. why should the filesystem
have to do this?

> My understanding is that that sequence is as safe as a barrier, but maybe
> not as fast.

Yes, and my understanding is that the block device is perfectly capable
of implementing this just as safely as the filesystem.

> The patch looks at least believable.  As you can imagine it is awkward
> to test thoroughly.

As well as being pretty much impossible to test reliably with an
automated testing framework. Hence ongoing test coverage will
approach zero.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-25  4:17     ` David Chinner
@ 2007-03-25 23:21       ` Neil Brown
  2007-03-26  3:14         ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Brown @ 2007-03-25 23:21 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, hch

On Sunday March 25, dgc@sgi.com wrote:
> > Barriers only make sense inside drive firmware.
> 
> I disagree. e.g. Barriers have to be handled by the block layer to
> prevent reordering of I/O in the request queues as well. The
> block layer is responsible for ensuring barrier I/Os, as
> indicated by the filesystem, act as real barriers.

Absolutely.  The block layer needs to understand about barriers and
allow them to do their job, which means not re-ordering requests
around barriers.
My point was that if the functionality cannot be provided in the
lowest-level firmware (as it cannot for raid0 as there is no single
lowest-level firmware), then it should be implemented at the
filesystem level.  Implementing barriers in md or dm doesn't make any
sense (though passing barriers through can in some situations).

> 
> > Trying to emulate it
> > in the md layer doesn't make any sense as the filesystem is in a much
> > better position to do any emulation required.
> 
> You're saying that the emulation of block layer functionality is the
> responsibility of layers above the block layer. Why is this not
> considered a layering violation?

:-)
Maybe it depends on your perspective.  I think this is filesystem
layer functionality.  Making sure blocks are written in the right
order sounds like something that the filesystem should be primarily
responsible for.

The most straight-forward way to implement this is to make sure all
preceding blocks have been written before writing the barrier block.
All filesystems should be able to do this (if it is important to
them).

Because block IO tends to have long pipelines and because this
operation will stall the pipeline, it makes sense for a block IO
subsystem to provide the possibility of implementing this sequencing
without a complete stall, and the 'barrier' flag makes that possible.
But that doesn't mean it is block-layer functionality.  It means (to
me) it is common fs functionality that the block layer is helping out
with.

> > 
> > There should never be a possibility of filesystem corruption.
> > If a barrier request fails, the filesystem should:
> >   wait for any dependant request to complete
> >   call blkdev_issue_flush
> >   schedule the write of the 'barrier' block
> >   call blkdev_issue_flush again.
> 
> IOWs, the filesystem has to use block device calls to emulate a block device
> barrier I/O. Why can't the block layer, on reception of a barrier write
> and detecting that barriers are no longer supported by the underlying
> device (i.e. in MD), do:
> 
> 	wait for all queued I/Os to complete
> 	call blkdev_issue_flush
> 	schedule the write of the 'barrier' block
> 	call blkdev_issue_flush again.
> 
> And not involve the filesystem at all? i.e. why should the filesystem
> have to do this?

Certainly it could.
However
 a/ The block layer would have to wait for *all* queued I/O,
    where-as the filesystem would only have to wait for queued IO
    which has a semantic dependence on the barrier block.  So the
    filesystem can potentially perform the operation more efficiently.
 b/ Some block devices don't support barriers, so the filesystem needs
    to have the mechanisms in place to do this already.  Why duplicate
    it in the block layer?
(c/ md/raid0 doesn't track all the outstanding requests...:-)

I think the block device should support barriers when it can do so
more efficiently than the filesystem.  For a single SCSI drive, it
can.  For a logical volume striped over multiple physical devices, it
cannot.

> 
> > My understanding is that that sequence is as safe as a barrier, but maybe
> > not as fast.
> 
> Yes, and my understanding is that the block device is perfectly capable
> of implementing this just as safely as the filesystem.
> 

But possibly not as efficiently...

What did XFS do before the block layer supported barriers?


> > The patch looks at least believable.  As you can imagine it is awkward
> > to test thoroughly.
> 
> As well as being pretty much impossible to test reliably with an
> automated testing framework. Hence ongoing test coverage will
> approach zero.....

This is a problem with barriers in general.... it is very hard to test
that the data is encoded on the platter at any given time :-(

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-25  3:51     ` David Chinner
@ 2007-03-25 23:58       ` Neil Brown
  0 siblings, 0 replies; 21+ messages in thread
From: Neil Brown @ 2007-03-25 23:58 UTC (permalink / raw)
  To: David Chinner; +Cc: Christoph Hellwig, xfs

On Sunday March 25, dgc@sgi.com wrote:
> On Fri, Mar 23, 2007 at 09:50:55AM +0000, Christoph Hellwig wrote:
> > 
> > So yes, we could probably get rid of the check now, although I'd
> > prefer the block layer exporting an API to the filesystem to tell
> > it whether there is any point in trying to use barriers.
> 
> Ditto.

What would be the point of that interface?

If it only says "It might be worth testing", then you still have to
test.  And if you have to test, where is the value in asking in
advance.
The is no important difference between "the device said 'don't bother
trying'" and "We tried and the device said 'no'".

> > 
> > So you're retrying the whole I/O, this is probably better than trying
> > to handle this at the bio level.  I still don't quite like doing another
> > I/O from the I/O completion handler.
> 
> You're not the only one, Christoph. This may be better than trying
> to handle it at lower layers, and far better than having to handle
> it at every point in the higher layers where we may issue barrier
> I/Os. 

But I think that has to be where it is handled.
What other filesystems do is something like:

   if (barriers_supported) {
       submit barrier request;
       wait for completion
       if (fail with -EOPNOTSUPP)
             barriers_supported = 0;
   }
   if (!barriers_supported) {
        wait for other requests to complete;
        submit non-barrier request;
        wait for completion
   }
   handle_error

Obviously if you are going to issue barrier writes from multiple
places you would put this in a function...
I'm not sure that other filesystems call blkdev_issue_flush.... As you
said elsewhere, not a very effectively communicated interface.


> 
> But I *seriously dislike* having to reissue async I/Os in this
> manner and then having to rely on a higher layer's I/O completion
> handler to detect the fact that the I/O was retried to change the
> way the filesystem issues I/Os in the future. It's a really crappy
> way of communicating between layers....

md/dm do add extra complexity to the blockdev interface that I don't
think was fully considered when the interface was designed.

We would really like a client to say "I'm starting to build a bio"
so that the device can either block that until a reconfiguration
completes, or can block any reconfiguration until the bio is fully
built and submitted (or aborted).
Once you have that bio-being-built handle, it would probably make
sense to test 'are barriers supported' for that bio without having to
submit an IO..

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-25  3:19     ` David Chinner
@ 2007-03-26  0:01       ` Neil Brown
  2007-03-26  3:58         ` David Chinner
  2007-03-27  3:58       ` Timothy Shimmin
  1 sibling, 1 reply; 21+ messages in thread
From: Neil Brown @ 2007-03-26  0:01 UTC (permalink / raw)
  To: David Chinner; +Cc: Neil Brown, Timothy Shimmin, xfs

On Sunday March 25, dgc@sgi.com wrote:
> On Fri, Mar 23, 2007 at 07:00:46PM +1100, Neil Brown wrote:
> > 
> > Why no barriers on an external log device??? Not important, just
> > curious.
> 
> because we need to synchronize across 2 devices, not one, so issuing
> barriers on an external log device does nothing to order the metadata
> written to the other device...

Right, of course.  Just like over a raid0.

So you must have code to wait for all writes to the main device before
writing the commit block on the journal.   How hard is it to fall-back
to that if the barrier fails?

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-23  9:50   ` Christoph Hellwig
  2007-03-25  3:51     ` David Chinner
@ 2007-03-26  1:11     ` Neil Brown
  1 sibling, 0 replies; 21+ messages in thread
From: Neil Brown @ 2007-03-26  1:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: David Chinner, xfs

On Friday March 23, hch@infradead.org wrote:
> 
> That would be really bad.  XFS metadata buffers can have multiple bios
> and retrying a single one would be rather difficult.
> 

But would you have multiple bios for a write that had BIO_RW_BARRIER
set?  That would seem .... odd.

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-25 23:21       ` Neil Brown
@ 2007-03-26  3:14         ` David Chinner
  2007-03-26  4:27           ` Neil Brown
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-03-26  3:14 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, xfs, hch

On Mon, Mar 26, 2007 at 09:21:43AM +1000, Neil Brown wrote:
> My point was that if the functionality cannot be provided in the
> lowest-level firmware (as it cannot for raid0 as there is no single
> lowest-level firmware), then it should be implemented at the
> filesystem level.  Implementing barriers in md or dm doesn't make any
> sense (though passing barriers through can in some situations).

Hold on - you've said that the barrier support in a block device
can change because of MD doing hot swap. Now you're saying
there is no barrier implementation in md. Can you explain
*exactly* what barrier support there is in MD?

> > > Trying to emulate it
> > > in the md layer doesn't make any sense as the filesystem is in a much
> > > better position to do any emulation required.
> > 
> > You're saying that the emulation of block layer functionality is the
> > responsibility of layers above the block layer. Why is this not
> > considered a layering violation?
> 
> :-)
> Maybe it depends on your perspective.  I think this is filesystem
> layer functionality.  Making sure blocks are written in the right
> order sounds like something that the filesystem should be primarily
> responsible for.

Sure, but the filesystem requires the block layer to provide
those ordering semantics to it, e.g. via barrier I/Os.

Remember, different filesystem have different levels of
data+metadata safety and many of them do nothing to guarantee
write ordering.

> The most straight-forward way to implement this is to make sure all
> preceding blocks have been written before writing the barrier block.
> All filesystems should be able to do this (if it is important to them).
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^

And that is the key point - XFS provides no guarantee that your
data is on spinning rust other than I/O barriers when you have
volatile write caches.

IOWs, if you turn barriers off, we provide *no guarantees*
about the consistency of your filesystem after a power failure
if you are using volatile write caching. This mode is for
use with non-cached disks or disks with NVRAM caches where there
is no need for barriers.

> Because block IO tends to have long pipelines and because this
> operation will stall the pipeline, it makes sense for a block IO
> subsystem to provide the possibility of implementing this sequencing
> without a complete stall, and the 'barrier' flag makes that possible.
> But that doesn't mean it is block-layer functionality.  It means (to
> me) it is common fs functionality that the block layer is helping out
> with.

I disagree - it is a function supported and defined by the block
layer. Errors returned to the filesystem are directly defined
in the block layer, the ordering guarantees are provided by the
block layer and changes in semantics appear to be defined by
the block layer......

> > 	wait for all queued I/Os to complete
> > 	call blkdev_issue_flush
> > 	schedule the write of the 'barrier' block
> > 	call blkdev_issue_flush again.
> > 
> > And not involve the filesystem at all? i.e. why should the filesystem
> > have to do this?
> 
> Certainly it could.
> However
>  a/ The the block layer would have to wait for *all* queued I/O,
>     where-as the filesystem would only have to wait for queued IO
>     which has a semantic dependence on the barrier block.  So the
>     filesystem can potentially perform the operation more efficiently.

Assuming the filesystem can do it more efficiently. What if it
can't? What if, like XFS, when barriers are turned off, the
filesystem provides *no* guarantees?

>  b/ Some block devices don't support barriers, so the filesystem needs
>     to have the mechanisms in place to do this already.

No, you turn write caching off on the drive. This is an
especially important consideration given that many older drives
lied about cache flushes being complete (i.e. they were
implemented as no-ops).

> (c/ md/raid0 doesn't track all the outstanding requests...:-)

XFS doesn't track all outstanding requests either....

> What did XFS do before the block layer supported barriers?

Either turn off write caching or use non-volatile write caches.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-26  0:01       ` Neil Brown
@ 2007-03-26  3:58         ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-03-26  3:58 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, Timothy Shimmin, xfs

On Mon, Mar 26, 2007 at 10:01:01AM +1000, Neil Brown wrote:
> On Sunday March 25, dgc@sgi.com wrote:
> > On Fri, Mar 23, 2007 at 07:00:46PM +1100, Neil Brown wrote:
> > > 
> > > Why no barriers on an external log device??? Not important, just
> > > curious.
> > 
> > because we need to synchronize across 2 devices, not one, so issuing
> > barriers on an external log device does nothing to order the metadata
> > written to the other device...
> 
> Right, of course.  Just like over a raid0.
> 
> So you must have code to wait for all writes to the main device before
> writing the commit block on the journal.

Forget about what you know about journalling from ext3, XFS is
vastly different and much more complex..... ;)

We wait for space in the log to become available during transaction
reservation; we don't wait for specific I/Os to complete because we
just push a bunch out. Once we have a reservation, we know we have
space in the log for our transaction commit and so we don't have to
wait for any I/O to complete when we do our transaction commit.
Hence we don't wait for the I/Os we may have issued to make space
available; another thread's push may have made enough space for our
reservation. IOWs, we've got *no idea* what the dependent I/Os are
when writing the transaction commit to disk because we have no clue
as to what we are overwriting in the journal.

This journalling method assumes that we either have no drive level
caching, non-volatile caching, or barrier-based log I/Os to prevent
corruption on drive power loss.  Hence with external logs on XFS
you have the option of no caching or non-volatile caching....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-26  3:14         ` David Chinner
@ 2007-03-26  4:27           ` Neil Brown
  2007-03-26  9:04             ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Brown @ 2007-03-26  4:27 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, hch

On Monday March 26, dgc@sgi.com wrote:
> On Mon, Mar 26, 2007 at 09:21:43AM +1000, Neil Brown wrote:
> > My point was that if the functionality cannot be provided in the
> > lowest-level firmware (as it cannot for raid0 as there is no single
> > lowest-level firmware), then it should be implemented at the
> > filesystem level.  Implementing barriers in md or dm doesn't make any
> > sense (though passing barriers through can in some situations).
> 
> Hold on - you've said that the barrier support in a block device
> can change because of MD doing hot swap. Now you're saying
> there is no barrier implementation in md. Can you explain
> *exactly* what barrier support there is in MD?

For all levels other than md/raid1, md rejects bio_barrier() requests
as -EOPNOTSUPP.

For raid1 it tests barrier support when writing the superblock and
if all devices support barriers, then md/raid1 will allow
bio_barrier() down.  If it gets an unexpected failure it just rewrites
it without the barrier flag and fails any future write requests (which
isn't ideal, but is the best available, and should happen effectively
never).

So md/raid1 barrier support is completely dependant on the underlying
devices.  md/raid1 is aware of barriers but does not *implement*
them.  Does that make it clearer?
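
For the other levels the rejection amounts to something like this
(illustrative only, not the md source; barriers_supported_by_members() is a
made-up placeholder and bio_endio() here is the current three-argument
form):

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch: a stacking driver declining barrier requests it cannot honour.
 */
static int example_make_request(struct request_queue *q, struct bio *bio)
{
	if (bio_barrier(bio) && !barriers_supported_by_members(q)) {
		bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
		return 0;
	}
	/* ... map and forward the bio to the member devices ... */
	return 0;
}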

> > The most straight-forward way to implement this is to make sure all
> > preceding blocks have been written before writing the barrier block.
> > All filesystems should be able to do this (if it is important to them).
>                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> And that is the key point - XFS provides no guarantee that your
> data is on spinning rust other than I/O barriers when you have
> volatile write caches.
> 
> IOWs, if you turn barriers off, we provide *no guarantees*
> about the consistency of your filesystem after a power failure
> if you are using volatile write caching. This mode is for
> use with non-cached disks or disks with NVRAM caches where there
> is no need for barriers.

But.... as the block layer can re-order writes, even non-cached disks
could get the writes in a different order to the one in which you sent
them.

I have a report of xfs over md/raid1 going about 10% faster once we
managed to let barrier writes through, so presumably XFS does
something different if barriers are not enabled ???  What does it do
differently?


> 
> > Because block IO tends to have long pipelines and because this
> > operation will stall the pipeline, it makes sense for a block IO
> > subsystem to provide the possibility of implementing this sequencing
> > without a complete stall, and the 'barrier' flag makes that possible.
> > But that doesn't mean it is block-layer functionality.  It means (to
> > me) it is common fs functionality that the block layer is helping out
> > with.
> 
> I disagree - it is a function supported and defined by the block
> layer. Errors returned to the filesystem are directly defined
> in the block layer, the ordering guarantees are provided by the
> block layer and changes in semantics appear to be defined by
> the block layer......

chuckle....
You can tell we are on different sides of the fence, can't you ?

There is certainly some validity in your position...

> 
> > > 	wait for all queued I/Os to complete
> > > 	call blkdev_issue_flush
> > > 	schedule the write of the 'barrier' block
> > > 	call blkdev_issue_flush again.
> > > 
> > > And not involve the filesystem at all? i.e. why should the filesystem
> > > have to do this?
> > 
> > Certainly it could.
> > However
> >  a/ The block layer would have to wait for *all* queued I/O,
> >     where-as the filesystem would only have to wait for queued IO
> >     which has a semantic dependence on the barrier block.  So the
> >     filesystem can potentially perform the operation more efficiently.
> 
> Assuming the filesystem can do it more efficiently. What if it
> can't? What if, like XFS, when barriers are turned off, the
> filesystem provides *no* guarantees?

(Yes.... Ted Ts'o likes casting aspersions on XFS... I guess this is
why :-)

Is there some mount flag to say "cope without barriers" or "require
barriers" ??
I can imagine implementing barriers in raid5 (which keeps careful
track of everything) but I suspect it would be a performance hit.  It
might be nice if the sysadmin has to explicitly ask...

For that matter, I could get raid1 to reject replacement devices that
didn't support barriers, if there was a way for the filesystem to
explicitly ask for them.  I think we are getting back to interface
issues, aren't we? 

> 
> > (c/ md/raid0 doesn't track all the outstanding requests...:-)
> 
> XFS doesn't track all outstanding requests either....

That surprises me... but maybe it shouldn't.


Thanks.
NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-26  4:27           ` Neil Brown
@ 2007-03-26  9:04             ` David Chinner
  2007-03-29 14:56               ` Martin Steigerwald
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-03-26  9:04 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, xfs, hch

On Mon, Mar 26, 2007 at 02:27:24PM +1000, Neil Brown wrote:
> On Monday March 26, dgc@sgi.com wrote:
> > On Mon, Mar 26, 2007 at 09:21:43AM +1000, Neil Brown wrote:
> > > My point was that if the functionality cannot be provided in the
> > > lowest-level firmware (as it cannot for raid0 as there is no single
> > > lowest-level firmware), then it should be implemented at the
> > > filesystem level.  Implementing barriers in md or dm doesn't make any
> > > sense (though passing barriers through can in some situations).
> > 
> > Hold on - you've said that the barrier support in a block device
> > can change because of MD doing hot swap. Now you're saying
> > there is no barrier implementation in md. Can you explain
> > *exactly* what barrier support there is in MD?
> 
> For all levels other than md/raid1, md rejects bio_barrier() requests
> as -EOPNOTSUPP.
> 
> For raid1 it tests barrier support when writing the superblock and
> if all devices support barriers, then md/raid1 will allow
> bio_barrier() down.  If it gets an unexpected failure it just rewrites
> it without the barrier flag and fails any future write requests (which
> isn't ideal, but is the best available, and should happen effectively
> never).
> 
> So md/raid1 barrier support is completely dependant on the underlying
> devices.  md/raid1 is aware of barriers but does not *implement*
> them.  Does that make it clearer?

Ah, that clears up the picture - thanks Neil.

> > > The most straight-forward way to implement this is to make sure all
> > > preceding blocks have been written before writing the barrier block.
> > > All filesystems should be able to do this (if it is important to them).
> >                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > And that is the key point - XFS provides no guarantee that your
> > data is on spinning rust other than I/O barriers when you have
> > volatile write caches.
> > 
> > IOWs, if you turn barriers off, we provide *no guarantees*
> > about the consistency of your filesystem after a power failure
> > if you are using volatile write caching. This mode is for
> > use with non-cached disks or disks with NVRAM caches where there
> > is no need for barriers.
> 
> But.... as the block layer can re-order writes, even non-cached disks
> could get the writes in a different order to the one in which you sent
> them.

But on a non-cached disk we have to have received an I/O completion
before the tail of the log moves, and hence the metadata is on stable
storage. The problem arises when volatile write caches are used and
I/O completion no longer means "data on stable storage".

> I have a report of xfs over md/raid1 going about 10% faster once we
> managed to let barrier writes through, so presumably XFS does
> something different if barriers are not enabled ???  What does it do
> differently?

I bet that the disk doesn't have its write cache turned on. For
disks with write cache turned on, barriers can slow down XFS by a factor
of 5. Safety, not speed, was all we were after with barriers.

> > > Because block IO tends to have long pipelines and because this
> > > operation will stall the pipeline, it makes sense for a block IO
> > > subsystem to provide the possibility of implementing this sequencing
> > > without a complete stall, and the 'barrier' flag makes that possible.
> > > But that doesn't mean it is block-layer functionality.  It means (to
> > > me) it is common fs functionality that the block layer is helping out
> > > with.
> > 
> > I disagree - it is a function supported and defined by the block
> > layer. Errors returned to the filesystem are directly defined
> > in the block layer, the ordering guarantees are provided by the
> > block layer and changes in semantics appear to be defined by
> > the block layer......
> 
> chuckle....
> You can tell we are on different sides of the fence, can't you ?

Yup - no fence sitting here ;)

> There is certainly some validity in your position...

And likewise yours - I just don't think the responsibility here
is quite so black and white...

> > > > 	wait for all queued I/Os to complete
> > > > 	call blkdev_issue_flush
> > > > 	schedule the write of the 'barrier' block
> > > > 	call blkdev_issue_flush again.
> > > > 
> > > > And not involve the filesystem at all? i.e. why should the filesystem
> > > > have to do this?
> > > 
> > > Certainly it could.
> > > However
> > >  a/ The block layer would have to wait for *all* queued I/O,
> > >     where-as the filesystem would only have to wait for queued IO
> > >     which has a semantic dependence on the barrier block.  So the
> > >     filesystem can potentially perform the operation more efficiently.
> > 
> > Assuming the filesystem can do it more efficiently. What if it
> > can't? What if, like XFS, when barriers are turned off, the
> > filesystem provides *no* guarantees?
> 
> (Yes.... Ted Ts'o likes casting aspersions on XFS... I guess this is
> why :-)

Different design criteria. ext3 is great doing what it was designed
for, and the same can be said for XFS. Take them outside their
comfort area (like putting XFS on commodity disks with volatile
write caches or putting millions of files into a single directory in
ext3) and you get problems. It's just that they were designed for
different purposes, and that includes data resilience during failure
conditions.

That being said, we are doing a lot in XFS to address some of these
shortcomings - it's just that ordered writes can be very difficult
to retrofit to an existing filesystem....

> Is there some mount flag to say "cope without barriers" or "require
> barriers" ??

XFS has "-o nobarrier" to say don't use barriers, and this is
*not* the default. If barriers don't work, we drop back to "-o nobarrier"
after leaving a loud warning in the log....

> I can imagine implementing barriers in raid5 (which keeps careful
> track of everything) but I suspect it would be a performance hit.  It
> might be nice if the sysadmin has to explicitly ask...
> 
> For that matter, I could get raid1 to reject replacement devices that
> didn't support barriers, if there was a way for the filesystem to
> explicitly ask for them.  I think we are getting back to interface
> issues, aren't we? 

Yeah, very much so. If you need the filesystem to be aware of smart
things the block device can do or tell it, then we really don't
want to have to communicate them via mount options ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-25  3:19     ` David Chinner
  2007-03-26  0:01       ` Neil Brown
@ 2007-03-27  3:58       ` Timothy Shimmin
  1 sibling, 0 replies; 21+ messages in thread
From: Timothy Shimmin @ 2007-03-27  3:58 UTC (permalink / raw)
  To: David Chinner, Neil Brown; +Cc: xfs

--On 25 March 2007 2:19:27 PM +1100 David Chinner <dgc@sgi.com> wrote:

> On Fri, Mar 23, 2007 at 07:00:46PM +1100, Neil Brown wrote:
>> On Friday March 23, tes@sgi.com wrote:
>> > >
>> > > I think this test should just be removed and the xfs_barrier_test
>> > > should be the main mechanism for seeing if barriers work.
>> > >
>> > Oh okay.
>> > This is all Christoph's (hch) code, so it would be good for him to comment here.
>> > The external log and readonly tests can stay though.
>> >
>>
>> Why no barriers on an external log device??? Not important, just
>> curious.
>
> because we need to synchronize across 2 devices, not one, so issuing
> barriers on an external log device does nothing to order the metadata
> written to the other device...
>

I have wondered in the past (sgi-bug#954969) about doing a blkdev_issue_flush
on the metadata device at xlog_sync time prior to the log write on
the log device.
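
Roughly, what I have in mind is something like the sketch below - an
illustration only, with hypothetical types and helpers rather than the
real xlog_sync() code; the pv write-up that follows has the original
reasoning.

/*
 * Hedged sketch of the pv#954969 idea, not code from xlog_sync().
 * With an external log, a barrier on the log device says nothing
 * about the separate metadata device, so that device would get an
 * explicit cache flush first.  Types and helpers are hypothetical.
 */
struct mylog;

extern int has_external_log(struct mylog *log);
extern int flush_metadata_device(struct mylog *log);    /* blkdev_issue_flush-style call */
extern int barrier_write_log_buffer(struct mylog *log); /* the log write, with barrier   */

static int sync_log_buffer(struct mylog *log)
{
	int error;

	/*
	 * Make sure the metadata whose I/O completion allowed the tail
	 * to move really is on stable storage before issuing the log
	 * write that may overwrite the old log contents.
	 */
	if (has_external_log(log)) {
		error = flush_metadata_device(log);
		if (error)
			return error;
	}

	return barrier_write_log_buffer(log);
}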

  27/July/06 - pv#954969
  Currently, if one uses external logs then the barrier support is turned off.
  The reason for this is that a write barrier is normally only done on the data
  device, which also holds the log.
  With an external log it means that a write barrier on a log device will not
  do any flushing on the metadata device.
  This pv is opened to explore the possibility of issuing an explicit metadata
  device flush at xlog_sync time before doing a write barrier on the log data
  to the log device.
  This would guarantee that if the tail moved because a piece of metadata
  was thought to be really on disk, that would now be true, as we would have
  flushed its device. Then we could do our log write without worrying that
  it will overwrite log data whose metadata hadn't really made it to disk.
  Perhaps I'm missing something.
  Dave (dgc) said he'd think about it.
  I haven't heard back from Christoph yet, and he added the
  code for our barrier support in xfs.
  --Tim

--Tim

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-26  9:04             ` David Chinner
@ 2007-03-29 14:56               ` Martin Steigerwald
  2007-03-29 15:18                 ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Steigerwald @ 2007-03-29 14:56 UTC (permalink / raw)
  To: linux-xfs

On Monday 26 March 2007, David Chinner wrote:

> > Is there some mount flag to say "cope without barriers" or "require
> > barriers" ??
>
> XFS has "-o nobarrier" to say don't use barriers, and this is
> *not* the default. If barriers don't work, we drop back to "-o
> nobarrier" after leaving a loud warning in the log....

Hello David!

Just a thought, maybe it shouldn't do that automatically, but require the 
sysadmin to explicitly state "-o nobarrier" in that case. The safest default 
behavior IMHO would be either not to mount at all without "-o nobarrier" 
if the device has no barrier support, or to disable the write cache of that 
device. The latter can be considered a layering violation in itself.
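
Something like the stricter policy below is what I mean - just a
sketch with hypothetical flag and helper names, not anything XFS
actually does today.

/*
 * Hedged sketch of a "refuse to mount instead of falling back"
 * policy.  Flag and helper names are hypothetical, not real XFS
 * interfaces.
 */
#define MOUNT_NOBARRIER	0x2	/* hypothetical "-o nobarrier" flag */

struct mount;

extern int  try_barrier_write(struct mount *mp); /* nonzero if barriers fail */
extern void fs_warn(struct mount *mp, const char *msg);

static int check_barriers_strict(struct mount *mp, unsigned int flags)
{
	if (flags & MOUNT_NOBARRIER)
		return 0;	/* the admin has taken responsibility */

	if (try_barrier_write(mp) != 0) {
		fs_warn(mp, "device has no barrier support; "
			    "mount with -o nobarrier to override");
		return -1;	/* refuse the mount */
	}
	return 0;
}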

BTW, XFS has been coping really well here with commodity hardware such as my 
ThinkPads with 2.5 inch notebook hard disks *since* 2.6.17.7.

But right now I am wondering about barrier support on USB-connected devices. I 
have to check whether XFS does barriers on those. Does the USB mass 
storage driver support barriers?

Regards,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-29 14:56               ` Martin Steigerwald
@ 2007-03-29 15:18                 ` David Chinner
  2007-03-29 16:49                   ` Martin Steigerwald
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-03-29 15:18 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-xfs

On Thu, Mar 29, 2007 at 04:56:21PM +0200, Martin Steigerwald wrote:
> On Monday 26 March 2007, David Chinner wrote:
> 
> > > Is there some mount flag to say "cope without barriers" or "require
> > > barriers" ??
> >
> > XFS has "-o nobarrier" to say don't use barriers, and this is
> > *not* the default. If barriers don't work, we drop back to "-o
> > nobarrier" after leaving a loud warning in the log....
> 
> Hello David!
> 
> Just a thought, maybe it shouldn't do that automatically, but require the 
> sysadmin to explicitly state "-o nobarrier" in that case. 

And prevent most existing XFS filesystems from mounting after
a kernel upgrade? Think about the problems that might cause
with XFS root filesystems on hardware/software that doesn't
support barriers....

Default behaviour is tolerant - it tries the safest method
known, and if it can't use that, it tells you and then continues
onwards. That's a good default to have.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS and write barriers.
  2007-03-29 15:18                 ` David Chinner
@ 2007-03-29 16:49                   ` Martin Steigerwald
  0 siblings, 0 replies; 21+ messages in thread
From: Martin Steigerwald @ 2007-03-29 16:49 UTC (permalink / raw)
  To: linux-xfs; +Cc: David Chinner

On Thursday 29 March 2007, David Chinner wrote:
> On Thu, Mar 29, 2007 at 04:56:21PM +0200, Martin Steigerwald wrote:
> > On Monday 26 March 2007, David Chinner wrote:
> > > > Is there some mount flag to say "cope without barriers" or
> > > > "require barriers" ??
> > >
> > > XFS has "-o nobarrier" to say don't use barriers, and this is
> > > *not* the default. If barriers don't work, we drop back to "-o
> > > nobarrier" after leaving a loud warning in the log....
> >
> > Hello David!
> >
> > Just a thought, maybe it shouldn't do that automatically, but require
> > the sysadmin to explicitly state "-o nobarrier" in that case.
>
> And prevent most existing XFS filesystems from mounting after
> a kernel upgrade? Think about the problems that might cause
> with XFS root filesystems on hardware/software that doesn't
> support barriers....

Hello David!

Granted. So it might turn out to be a decision between "does not boot" and 
"is not totally safe across power outages or crashes". I see no easy default 
answer to that.

So, while it is probably a layering violation, at least trying to disable 
the write cache on devices without cache flush support, unless "-o 
nobarrier" (as in "I know what I am doing") is given, might help safety. 
But this adds complexity and a possible source of bugs. And maybe trying 
to disable the write cache isn't safe on all setups?

Regards,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2007-03-29 16:49 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-23  1:26 XFS and write barriers Neil Brown
2007-03-23  5:30 ` David Chinner
2007-03-23  7:49   ` Neil Brown
2007-03-25  4:17     ` David Chinner
2007-03-25 23:21       ` Neil Brown
2007-03-26  3:14         ` David Chinner
2007-03-26  4:27           ` Neil Brown
2007-03-26  9:04             ` David Chinner
2007-03-29 14:56               ` Martin Steigerwald
2007-03-29 15:18                 ` David Chinner
2007-03-29 16:49                   ` Martin Steigerwald
2007-03-23  9:50   ` Christoph Hellwig
2007-03-25  3:51     ` David Chinner
2007-03-25 23:58       ` Neil Brown
2007-03-26  1:11     ` Neil Brown
2007-03-23  6:20 ` Timothy Shimmin
2007-03-23  8:00   ` Neil Brown
2007-03-25  3:19     ` David Chinner
2007-03-26  0:01       ` Neil Brown
2007-03-26  3:58         ` David Chinner
2007-03-27  3:58       ` Timothy Shimmin
