* [RFC] relaxed barrier semantics
@ 2010-07-27 16:56 Christoph Hellwig
  2010-07-27 17:54 ` Jan Kara
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-27 16:56 UTC (permalink / raw)
  To: jaxboe, tj, James.Bottomley
  Cc: linux-fsdevel, linux-scsi, jack, tytso, chris.mason, swhiteho,
	konishi.ryusuke

I've been dealing with reports of massive slowdowns due to the barrier
option if used with storage arrays that do not actually have a
volatile write cache.

The reason for that is that sd.c by default sets the ordered mode to
QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
with Documentation/block/barrier.txt but missed out on an important
point: most filesystems (at least all mainstream ones) couldn't care
less about the ordering semantics barrier operations provide.  In fact
they are actively harmful as they cause us to stall the whole I/O
queue while otherwise we'd only have to wait for a rather limited
amount of I/O.

The simplest fix is to not use write barriers for devices that do not
have a volatile write cache, by specifying the nobarrier option.  This
has the huge disadvantage that it requires manual user interaction instead
of simply working out of the box.  There are three better automatic
options:

 (1) if a filesystem detects the QUEUE_ORDERED_DRAIN mode, but doesn't
     actually need the barrier semantics, it simply disables all calls
     to blkdev_issue_flush and never sets the REQ_HARDBARRIER flag
     on writes.  This is a relatively safe option, but it requires
     code in all filesystems, as well as in the raid / device mapper
     modules so that they can cope with it.
 (2) never set the QUEUE_ORDERED_DRAIN, and remove the code related to
     it after auditing that no filesystem actually relies on this
     behaviour.  Currently the block layer fails REQ_HARDBARRIER
     if QUEUE_ORDERED_NONE is set, so we'd have to fix that as well.
 (3) introduce a new QUEUE_ORDERED_REALLY_NONE which is set by
     drivers that know no barrier handling is needed.  It's equivalent
     to QUEUE_ORDERED_NONE except for not failing barrier requests.
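
For illustration, a minimal sketch of what (3) could look like in
sd_revalidate_disk(), reusing the existing WCE-based selection (the
flag name is from this proposal; the value is made up):

/* hypothetical mode: accept REQ_HARDBARRIER requests but treat them
 * as plain writes - no drain, no cache flush */
#define QUEUE_ORDERED_REALLY_NONE       0x100

        if (sdkp->WCE)
                ordered = sdkp->DPOFUA
                        ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
        else
                ordered = QUEUE_ORDERED_REALLY_NONE;

        blk_queue_ordered(sdkp->disk->queue, ordered);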

I'm tempted to go for variant (2) above, and could use some help
auditing the filesystems for their use of the barrier semantics.

So far I've only found an explicit dependency on this behaviour in
reiserfs, and there it is guarded by the barrier mount option, so
we could easily disable it when we know we don't have the full
barrier semantics.


* Re: [RFC] relaxed barrier semantics
  2010-07-27 16:56 [RFC] relaxed barrier semantics Christoph Hellwig
@ 2010-07-27 17:54 ` Jan Kara
  2010-07-27 18:35   ` Vivek Goyal
                     ` (2 more replies)
  0 siblings, 3 replies; 155+ messages in thread
From: Jan Kara @ 2010-07-27 17:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, jack,
	tytso, chris.mason, swhiteho, konishi.ryusuke

  Hi,

On Tue 27-07-10 18:56:27, Christoph Hellwig wrote:
> I've been dealing with reports of massive slowdowns due to the barrier
> option if used with storage arrays that do not actually have a
> volatile write cache.
> 
> The reason for that is that sd.c by default sets the ordered mode to
> QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
> with Documentation/block/barrier.txt but missed out on an important
> point: most filesystems (at least all mainstream ones) couldn't care
> less about the ordering semantics barrier operations provide.  In fact
> they are actively harmful as they cause us to stall the whole I/O
> queue while otherwise we'd only have to wait for a rather limited
> amount of I/O.
  OK, let me understand one thing. So the storage arrays have some caches
and queues of requests and QUEUE_ORDERED_DRAIN forces them to flush all this
to the platter, right?
  So can it happen that they somehow lose the requests that were already
issued to them (e.g. because of power failure)?

> The simplest fix is to not use write barriers for devices that do not
> have a volatile write cache, by specifying the nobarrier option.  This
> has the huge disadvantage that it requires manual user interaction instead
> of simply working out of the box.  There are three better automatic
> options:
> 
>  (1) if a filesystem detects the QUEUE_ORDERED_DRAIN mode, but doesn't
>      actually need the barrier semantics, it simply disables all calls
>      to blkdev_issue_flush and never sets the REQ_HARDBARRIER flag
>      on writes.  This is a relatively safe option, but it requires
>      code in all filesystems, as well as in the raid / device mapper
>      modules so that they can cope with it.
>  (2) never set the QUEUE_ORDERED_DRAIN, and remove the code related to
>      it after auditing that no filesystem actually relies on this
>      behaviour.  Currently the block layer fails REQ_HARDBARRIER
>      if QUEUE_ORDERED_NONE is set, so we'd have to fix that as well.
>  (3) introduce a new QUEUE_ORDERED_REALLY_NONE which is set by
>      drivers that know no barrier handling is needed.  It's equivalent
>      to QUEUE_ORDERED_NONE except for not failing barrier requests.
> 
> I'm tempted to go for variant (2) above, and could use some help
> auditing the filesystems for their use of the barrier semantics.
> 
> So far I've only found an explicit dependency on this behaviour in
> reiserfs, and there it is guarded by the barrier mount option, so
> we could easily disable it when we know we don't have the full
> barrier semantics.
  Also JBD2 relies on the ordering semantics if
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT is set (it's used by ext4 if asked to).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC] relaxed barrier semantics
  2010-07-27 17:54 ` Jan Kara
@ 2010-07-27 18:35   ` Vivek Goyal
  2010-07-27 18:42     ` James Bottomley
                       ` (2 more replies)
  2010-07-27 19:37   ` Christoph Hellwig
  2010-08-03 18:49   ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig
  2 siblings, 3 replies; 155+ messages in thread
From: Vivek Goyal @ 2010-07-27 18:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, jaxboe, tj, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, Jul 27, 2010 at 07:54:19PM +0200, Jan Kara wrote:
>   Hi,
> 
> On Tue 27-07-10 18:56:27, Christoph Hellwig wrote:
> > I've been dealing with reports of massive slowdowns due to the barrier
> > option if used with storage arrays that do not actually have a
> > volatile write cache.
> > 
> > The reason for that is that sd.c by default sets the ordered mode to
> > QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
> > with Documentation/block/barrier.txt but missed out on an important
> > point: most filesystems (at least all mainstream ones) couldn't care
> > less about the ordering semantics barrier operations provide.  In fact
> > they are actively harmful as they cause us to stall the whole I/O
> > queue while otherwise we'd only have to wait for a rather limited
> > amount of I/O.
>   OK, let me understand one thing. So the storage arrays have some caches
> and queues of requests and QUEUE_ORDERED_DRAIN forces them to flush all this
> to the platter, right?

IIUC, QUEUE_ORDERED_DRAIN will be set only for storage which either does
not support write caches or which advertises itself as having no write
caches (it has write caches but they are battery backed and it is capable
of flushing requests upon power failure).

IIUC, what Christoph is trying to address is that if the write cache is
not enabled then we don't need flushing semantics. We can get rid of the
need for request ordering semantics by waiting on the dependent request
to finish instead of issuing a barrier. That way we will issue neither
barriers nor request queue drains, and that will possibly help with
throughput.

Vivek 
 
>   So can it happen that they somehow lose the requests that were already
> issued to them (e.g. because of power failure)?
> 
> > The simplest fix is to not use write barriers for devices that do not
> > have a volatile write cache, by specifying the nobarrier option.  This
> > has the huge disadvantage that it requires manual user interaction instead
> > of simply working out of the box.  There are three better automatic
> > options:
> > 
> >  (1) if a filesystem detects the QUEUE_ORDERED_DRAIN mode, but doesn't
> >      actually need the barrier semantics, it simply disables all calls
> >      to blkdev_issue_flush and never sets the REQ_HARDBARRIER flag
> >      on writes.  This is a relatively safe option, but it requires
> >      code in all filesystems, as well as in the raid / device mapper
> >      modules so that they can cope with it.
> >  (2) never set the QUEUE_ORDERED_DRAIN, and remove the code related to
> >      it after auditing that no filesystem actually relies on this
> >      behaviour.  Currently the block layer fails REQ_HARDBARRIER
> >      if QUEUE_ORDERED_NONE is set, so we'd have to fix that as well.
> >  (3) introduce a new QUEUE_ORDERED_REALLY_NONE which is set by
> >      drivers that know no barrier handling is needed.  It's equivalent
> >      to QUEUE_ORDERED_NONE except for not failing barrier requests.
> > 
> > I'm tempted to go for variant (2) above, and could use some help
> > auditing the filesystems for their use of the barrier semantics.
> > 
> > So far I've only found an explicit dependency on this behaviour in
> > reiserfs, and there it is guarded by the barrier mount option, so
> > we could easily disable it when we know we don't have the full
> > barrier semantics.
>   Also JBD2 relies on the ordering semantics if
> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT is set (it's used by ext4 if asked to).
> 
> 									Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


* Re: [RFC] relaxed barrier semantics
  2010-07-27 18:35   ` Vivek Goyal
@ 2010-07-27 18:42     ` James Bottomley
  2010-07-27 18:51       ` Ric Wheeler
  2010-07-27 19:43       ` Christoph Hellwig
  2010-07-27 19:38     ` Christoph Hellwig
  2010-07-28  8:08     ` Tejun Heo
  2 siblings, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-07-27 18:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Christoph Hellwig, jaxboe, tj, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, 2010-07-27 at 14:35 -0400, Vivek Goyal wrote:
> On Tue, Jul 27, 2010 at 07:54:19PM +0200, Jan Kara wrote:
> >   Hi,
> > 
> > On Tue 27-07-10 18:56:27, Christoph Hellwig wrote:
> > > I've been dealing with reports of massive slowdowns due to the barrier
> > > option if used with storage arrays that do not actually have a
> > > volatile write cache.
> > > 
> > > The reason for that is that sd.c by default sets the ordered mode to
> > > QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
> > > with Documentation/block/barrier.txt but missed out on an important
> > > point: most filesystems (at least all mainstream ones) couldn't care
> > > less about the ordering semantics barrier operations provide.  In fact
> > > they are actively harmful as they cause us to stall the whole I/O
> > > queue while otherwise we'd only have to wait for a rather limited
> > > amount of I/O.
> >   OK, let me understand one thing. So the storage arrays have some caches
> > and queues of requests and QUEUE_ORDERED_DRAIN forces them to flush all this
> > to the platter, right?
> 
> IIUC, QUEUE_ORDERED_DRAIN will be set only for storage which either does
> not support write caches or which advertises itself as having no write
> caches (it has write caches but they are battery backed and it is capable
> of flushing requests upon power failure).
> 
> IIUC, what Christoph is trying to address is that if the write cache is
> not enabled then we don't need flushing semantics. We can get rid of the
> need for request ordering semantics by waiting on the dependent request
> to finish instead of issuing a barrier. That way we will issue neither
> barriers nor request queue drains, and that will possibly help with
> throughput.

I hope not ... I hope that if the drive reports write through or no
cache that we don't enable (flush) barriers by default.

The problem case is NV cache arrays (usually an array with a battery
backed cache).  There's no consistency issue since the array will
destage the cache on power fail, but it reports a write-back cache and we
try to use barriers.  This is wrong because we don't need barriers for
consistency and they really damage throughput.

James




* Re: [RFC] relaxed barrier semantics
  2010-07-27 18:42     ` James Bottomley
@ 2010-07-27 18:51       ` Ric Wheeler
  2010-07-27 19:43       ` Christoph Hellwig
  1 sibling, 0 replies; 155+ messages in thread
From: Ric Wheeler @ 2010-07-27 18:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, tj,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On 07/27/2010 02:42 PM, James Bottomley wrote:
> On Tue, 2010-07-27 at 14:35 -0400, Vivek Goyal wrote:
>    
>> On Tue, Jul 27, 2010 at 07:54:19PM +0200, Jan Kara wrote:
>>      
>>>    Hi,
>>>
>>> On Tue 27-07-10 18:56:27, Christoph Hellwig wrote:
>>>        
>>>> I've been dealing with reports of massive slowdowns due to the barrier
>>>> option if used with storage arrays that do not actually have a
>>>> volatile write cache.
>>>>
>>>> The reason for that is that sd.c by default sets the ordered mode to
>>>> QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
>>>> with Documentation/block/barrier.txt but missed out on an important
>>>> point: most filesystems (at least all mainstream ones) couldn't care
>>>> less about the ordering semantics barrier operations provide.  In fact
>>>> they are actively harmful as they cause us to stall the whole I/O
>>>> queue while otherwise we'd only have to wait for a rather limited
>>>> amount of I/O.
>>>>          
>>>    OK, let me understand one thing. So the storage arrays have some caches
>>> and queues of requests and QUEUE_ORDERED_DRAIN forces them to flush all this
>>> to the platter, right?
>>>        
>> IIUC, QUEUE_ORDERED_DRAIN will be set only for storage which either does
>> not support write caches or which advertises itself as having no write
>> caches (it has write caches but they are battery backed and it is capable
>> of flushing requests upon power failure).
>>
>> IIUC, what Christoph is trying to address is that if the write cache is
>> not enabled then we don't need flushing semantics. We can get rid of the
>> need for request ordering semantics by waiting on the dependent request
>> to finish instead of issuing a barrier. That way we will issue neither
>> barriers nor request queue drains, and that will possibly help with
>> throughput.
>>      
> I hope not ... I hope that if the drive reports write through or no
> cache that we don't enable (flush) barriers by default.
>
> The problem case is NV cache arrays (usually an array with a battery
> backed cache).  There's no consistency issue since the array will
> destage the cache on power fail, but it reports a write-back cache and we
> try to use barriers.  This is wrong because we don't need barriers for
> consistency and they really damage throughput.
>
> James
>
>    

This is the case we are trying to address. Some (most?) of these NV 
cache arrays hopefully advertise write through caches and we can 
automate disabling the unneeded bits here....

ric



* Re: [RFC] relaxed barrier semantics
  2010-07-27 17:54 ` Jan Kara
  2010-07-27 18:35   ` Vivek Goyal
@ 2010-07-27 19:37   ` Christoph Hellwig
  2010-08-03 18:49   ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig
  2 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-27 19:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, jaxboe, tj, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, Jul 27, 2010 at 07:54:19PM +0200, Jan Kara wrote:
>   OK, let me understand one thing. So the storage arrays have some caches
> and queues of requests and QUEUE_ORDERED_DRAIN forces them flush all this
> to the platter, right?

Not quite.  QUEUE_ORDERED_DRAIN does not interact with the target at
all, it's entirely initiator (Linux) side.  What it does is to make
sure we drain the whole queue in the I/O scheduler (elevator) and in
flight to the device (command queueing) by waiting for all I/O before
the barrier to finish, then issue the barrier command and only then
allow any newly arriving requests to proceed.

>   So can it happen that they somehow lose the requests that were already
> issued to them (e.g. because of power failure)?

We can lose requests that are already on the wire but not yet completed.
That's why log writes wait for all preceding log writes (or things like
the I/Os required to push the tail) and fsync waits for all I/O
completions manually.
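
To illustrate, that manual ordering is just a completion wait, along
these lines (a sketch with made-up helper names, error handling
omitted; the completion of the first bio is the ordering point):

#include <linux/fs.h>
#include <linux/bio.h>
#include <linux/completion.h>

static void ordered_end_io(struct bio *bio, int error)
{
        complete(bio->bi_private);      /* wake up the waiter */
}

/* issue 'first', wait for it to complete, only then issue 'second';
 * no REQ_HARDBARRIER and no queue drain involved */
static void submit_ordered_pair(struct bio *first, struct bio *second)
{
        DECLARE_COMPLETION_ONSTACK(done);

        first->bi_end_io = ordered_end_io;
        first->bi_private = &done;
        submit_bio(WRITE, first);
        wait_for_completion(&done);
        submit_bio(WRITE, second);
}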



* Re: [RFC] relaxed barrier semantics
  2010-07-27 18:35   ` Vivek Goyal
  2010-07-27 18:42     ` James Bottomley
@ 2010-07-27 19:38     ` Christoph Hellwig
  2010-07-28  8:08     ` Tejun Heo
  2 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-27 19:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Christoph Hellwig, jaxboe, tj, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On Tue, Jul 27, 2010 at 02:35:46PM -0400, Vivek Goyal wrote:
> IIUC, QUEUE_ORDERED_DRAIN will be set only for storage which either does
> not support write caches or which advertises himself as having no write
> caches (it has write caches but is batter backed up and is capable of 
> flushing requests upon power failure).

More or less.  We set it for scsi devices without the write cache enable
(WCE) bit, which is only set if there is a volatile write cache that
needs flushing.  Some historic arrays used to set it despite having
a non-volatile write cache, but that doesn't happen anymore with any
of the modern ones I have access to.



* Re: [RFC] relaxed barrier semantics
  2010-07-27 18:42     ` James Bottomley
  2010-07-27 18:51       ` Ric Wheeler
@ 2010-07-27 19:43       ` Christoph Hellwig
  1 sibling, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-27 19:43 UTC (permalink / raw)
  To: James Bottomley
  Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, tj,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On Tue, Jul 27, 2010 at 01:42:45PM -0500, James Bottomley wrote:
> I hope not ... I hope that if the drive reports write through or no
> cache that we don't enable (flush) barriers by default.

drivers/scsi/sd.c:sd_revalidate_disk()

	if (sdkp->WCE)
		ordered = sdkp->DPOFUA
			? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
	else
		ordered = QUEUE_ORDERED_DRAIN;

	blk_queue_ordered(sdkp->disk->queue, ordered);

Documentation/block/barrier.txt:

QUEUE_ORDERED_DRAIN
        Requests are ordered by draining the request queue and cache
        flushing isn't needed.

        Sequence: drain => barrier


> The problem case is NV cache arrays (usually an array with a battery
> backed cache).  There's no consistency issue since the array will
> destage the cache on power fail but it reports a write back cache and we
> try to use barriers.  This is wrong because we don't need barriers for
> consistency and they really damage throughput.

The arrays I have access to (various Netapp, IBM and LSI) never report
write cache enabled.  I've only heard about the above issue from
historic tales.



* Re: [RFC] relaxed barrier semantics
  2010-07-27 18:35   ` Vivek Goyal
  2010-07-27 18:42     ` James Bottomley
  2010-07-27 19:38     ` Christoph Hellwig
@ 2010-07-28  8:08     ` Tejun Heo
  2010-07-28  8:20       ` Tejun Heo
  2010-07-28  8:24       ` Christoph Hellwig
  2 siblings, 2 replies; 155+ messages in thread
From: Tejun Heo @ 2010-07-28  8:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

Hello,

On 07/27/2010 08:35 PM, Vivek Goyal wrote:
> IIUC, what Christoph is trying to address is that if write cache is
> not enabled then we don't need flushing semantics. We can get rid of
> need of request ordering semantics by waiting on dependent request to
> finish instead of issuing a barrier. That way we will not issue barriers
> no request queue drains and that possibly will help with throughput.

What I don't get here is if filesystems order requests already by
waiting for completions, why do they use barriers at all?  All they
need is a flush request after all the preceding requests are known to be
complete.

Having a writeback cache or not doesn't make any difference
w.r.t. request ordering requirements.  If filesystems don't need the
heavy-handed ordering provided by barriers, they should just use flush
instead of barriers.  If a filesystem needs the barrier ordering, whether
the device in question is battery backed and costs more than a house
doesn't make any difference.
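
i.e. something like this instead of a barrier write (a sketch; the
four-argument blkdev_issue_flush() signature is the one current at
the time of writing and has changed between releases):

#include <linux/blkdev.h>

/* order by waiting for the writes you depend on yourself, then make
 * them durable with a single cache flush instead of a barrier */
static int wait_then_flush(struct block_device *bdev)
{
        return blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
}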

Thanks.

-- 
tejun


* Re: [RFC] relaxed barrier semantics
  2010-07-28  8:08     ` Tejun Heo
@ 2010-07-28  8:20       ` Tejun Heo
  2010-07-28 13:55         ` Vladislav Bolkhovitin
  2010-07-28  8:24       ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-07-28  8:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On 07/28/2010 10:08 AM, Tejun Heo wrote:
> Having a writeback cache or not doesn't make any difference
> w.r.t. request ordering requirements.  If filesystems don't need the
> heavy-handed ordering provided by barriers, they should just use flush
> instead of barriers.  If a filesystem needs the barrier ordering, whether
> the device in question is battery backed and costs more than a house
> doesn't make any difference.

BTW, if filesystems already have code to order the requests they're
issuing, it would be *great* to phase out barriers and replace them with
a simple in-stream, non-ordering flush request.  There have been several
different suggestions about how to improve barriers, most revolving
around how to transfer more information from the filesystem to the block
layer so that the block layer can use more relaxed ordering, but the more
I think about it, the clearer it becomes that this doesn't belong in the
block layer at all.

The only benefit of doing it in the block layer, and probably the
reason why it was done this way at all, is making use of the advanced
ordering features of some devices - ordered tags and linked commands.
The latter is deprecated and the former is fundamentally broken in
error handling anyway.  Furthermore, although they do relax ordering
requirements on the device queue side, the level of flexibility is
significantly lower compared to what filesystems can do themselves.

So, yeah, let's phase it out if it isn't too difficult.

Thanks.

-- 
tejun


* Re: [RFC] relaxed barrier semantics
  2010-07-28  8:08     ` Tejun Heo
  2010-07-28  8:20       ` Tejun Heo
@ 2010-07-28  8:24       ` Christoph Hellwig
  2010-07-28  8:40         ` Tejun Heo
  1 sibling, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28  8:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 10:08:44AM +0200, Tejun Heo wrote:
> What I don't get here is if filesystems order requests already by
> waiting for completions, why do they use barriers at all?  All they
> need is a flush request after all the preceding requests are known to be
> complete.

In fact for XFS I'm working on doing some bit of that, too, but it's not
actually that easy.  For one, we don't actually have a non-barrier cache
flush primitive currently, although the conversion of cache flushes
to FS requests and the addition of REQ_FLUSH helps greatly with it.
Second, the usual primitive for log writes actually is a WRITE_FUA,
that is a WRITE that needs to go to disk, without consequences to
the rest of the cache.  I've started implementing that, including
proper emulation for devices only supporting cache flushes, but got
stuck with the barrier machinery.


* Re: [RFC] relaxed barrier semantics
  2010-07-28  8:24       ` Christoph Hellwig
@ 2010-07-28  8:40         ` Tejun Heo
  2010-07-28  8:50           ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-07-28  8:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Hello,

On 07/28/2010 10:24 AM, Christoph Hellwig wrote:
> In fact for XFS I'm working on doing some bit of that, too, but it's not
> actually that easy.  For one, we don't actually have a non-barrier cache
> flush primitive currently, although the conversion of cache flushes
> to FS requests and the addition of REQ_FLUSH helps greatly with it.
> Second, the usual primitive for log writes actually is a WRITE_FUA,
> that is a WRITE that needs to go to disk, without consequences to
> the rest of the cache.  I've started implementing that, including
> proper emulation for devices only supporting cache flushes, but got
> stuck with the barrier machinery.

The barrier machinery can be easily changed to drop the DRAIN and
ordering stages, so all we need is an interface for the
filesystem to tell the barrier implementation that it will take care
of ordering itself, and barriers (a bit of a misnomer but, well, it isn't
too bad) can be handled as FUA writes which get executed after all
previous commands are committed to NV media.  On a write-through device
w/ FUA support, it will simply become a FUA write.  On a device w/
write-back cache and w/o FUA support, it will become a flush, write,
flush sequence.  On a device inbetween, flush, FUA write.  Would that
be enough for filesystems?  If so, the transition would be pretty
painless: md already splits barriers correctly and the modification is
confined to the barrier implementation itself and filesystems which want to
use more relaxed ordering.
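
Spelled out (a sketch, all names made up):

enum fua_emulation {
        SEQ_FUA_WRITE,          /* write-through + FUA: plain FUA write */
        SEQ_FLUSH_FUA_WRITE,    /* write-back + FUA: flush, FUA write */
        SEQ_FLUSH_WRITE_FLUSH,  /* write-back, no FUA: flush, write, flush */
};

static enum fua_emulation pick_sequence(bool volatile_cache, bool has_fua)
{
        if (!volatile_cache)
                return SEQ_FUA_WRITE;
        return has_fua ? SEQ_FLUSH_FUA_WRITE : SEQ_FLUSH_WRITE_FLUSH;
}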

Thanks.

-- 
tejun


* Re: [RFC] relaxed barrier semantics
  2010-07-28  8:40         ` Tejun Heo
@ 2010-07-28  8:50           ` Christoph Hellwig
  2010-07-28  8:58             ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28  8:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 10:40:30AM +0200, Tejun Heo wrote:
> The barrier machinery can be easily changed to drop the DRAIN and
> ordering stages,

Maybe you're smarter than me, but so far I had real trouble with that.
The problem is that we actually still need the drain colouring to
keep out other "barrier" requests given that we have the state for
the pre- and post- flush requests in struct request.  This is where
I'm still struggling with the even more relaxed barriers
I had been working on for a while.  They work perfectly on devices
supporting the FUA bit, but nothing inbetween.

> so all we need is an interface for the
> filesystem to tell the barrier implementation that it will take care
> of ordering itself, and barriers (a bit of a misnomer but, well, it isn't
> too bad) can be handled as FUA writes which get executed after all
> previous commands are committed to NV media.  On a write-through device
> w/ FUA support, it will simply become a FUA write.

If the device is write through there is no need for the FUA bit to
start with.

> On a device w/
> write-back cache and w/o FUA support, it will become a flush, write,
> flush sequence.  On a device inbetween, flush, FUA write.  Would that
> be enough for filesystems?  If so, the transition would be pretty
> painless: md already splits barriers correctly and the modification is
> confined to the barrier implementation itself and filesystems which want to
> use more relaxed ordering.

The above is a good start.  But at least for XFS we'll eventually
want writes without the pre flush, too.  We'll only need the pre-flush
for a specific class of log writes (when we had an extending write or
need to push the log tail), otherwise plain FUA semantics are enough.
Just going for the pre-flush / FUA semantics as a start has the
big advantage of making the transition a lot simpler, though.
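
With the unified flags that policy would boil down to something like
this (a sketch; the two predicates are made-up names):

/* flags for an XFS log write: plain FUA semantics by default, add
 * the pre-flush only for the cases that actually need it */
static int xfs_log_write_flags(bool extending_write, bool pushing_tail)
{
        int flags = WRITE | REQ_FUA;

        if (extending_write || pushing_tail)
                flags |= REQ_FLUSH;
        return flags;
}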




* Re: [RFC] relaxed barrier semantics
  2010-07-28  8:50           ` Christoph Hellwig
@ 2010-07-28  8:58             ` Tejun Heo
  2010-07-28  9:00               ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-07-28  8:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Hello,

On 07/28/2010 10:50 AM, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 10:40:30AM +0200, Tejun Heo wrote:
>> The barrier machinery can be easily changed to drop the DRAIN and
>> ordering stages,
> 
> Maybe you're smarter than me, but so far I had real trouble with that.

It's more likely that I was just blowing out hot air as I haven't
looked at the code for a couple of years now.  So, well, yeah, let's
drop "easily" from the original sentence.  :-)

> The problem is that we actually still need the drain colouring to
> keep out other "barrier" requests given that we have the state for
> the pre- and post- flush requests in struct request.  This is where
> I'm still struggling with the even more relaxed barriers
> I had been working on for a while.  They work perfectly on devices
> supporting the FUA bit, but nothing inbetween.
>
>> so all we need is an interface for the
>> filesystem to tell the barrier implementation that it will take care
>> of ordering itself, and barriers (a bit of a misnomer but, well, it isn't
>> too bad) can be handled as FUA writes which get executed after all
>> previous commands are committed to NV media.  On a write-through device
>> w/ FUA support, it will simply become a FUA write.
> 
>> If the device is write through there is no need for the FUA bit to
> start with.

Oh, right.

>> On a device w/
>> write-back cache and w/o FUA support, it will become a flush, write,
>> flush sequence.  On a device inbetween, flush, FUA write.  Would that
>> be enough for filesystems?  If so, the transition would be pretty
>> painless: md already splits barriers correctly and the modification is
>> confined to the barrier implementation itself and filesystems which want to
>> use more relaxed ordering.
> 
> The above is a good start.  But at least for XFS we'll eventually
> want writes without the pre flush, too.  We'll only need the pre-flush
> for a specific class of log writes (when we had an extending write or
> need to push the log tail), otherwise plain FUA semantics are enough.
> Just going for the pre-flush / FUA semantics as a start has the
> big advantage of making the transition a lot simpler, though.

I see.  It probably would be good to have ordering requirements
carried in the bio / request, so that filesystems can mix and match
barriers of different strengths as necessary.  As you seem to be
already working on it, are you interested in pursuing that direction?

Thanks.

-- 
tejun


* Re: [RFC] relaxed barrier semantics
  2010-07-28  8:58             ` Tejun Heo
@ 2010-07-28  9:00               ` Christoph Hellwig
  2010-07-28  9:11                 ` Hannes Reinecke
                                   ` (2 more replies)
  0 siblings, 3 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28  9:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
> I see.  It probably would be good to have ordering requirements
> carried in the bio / request, so that filesystems can mix and match
> barriers of different strengths as necessary.  As you seem to be
> already working on it, are you interested in pursuing that direction?

I've been working on that for a while, but it got a lot more urgent
as there's been an application hit particularly hard by the barrier
semantics on cache-less devices and people started getting angry
about it.  That's why fixing this for cache-less devices has become
a higher priority than solving the big picture.



* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:00               ` Christoph Hellwig
@ 2010-07-28  9:11                 ` Hannes Reinecke
  2010-07-28  9:16                   ` Christoph Hellwig
  2010-07-28  9:28                   ` Steven Whitehouse
  2010-07-28  9:17                 ` Tejun Heo
  2010-07-28 14:42                 ` Vivek Goyal
  2 siblings, 2 replies; 155+ messages in thread
From: Hannes Reinecke @ 2010-07-28  9:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
>> I see.  It probably would be good to have ordering requirements
>> carried in the bio / request, so that filesystems can mix and match
>> barriers of different strengths as necessary.  As you seem to be
>> already working on it, are you interested in pursuing that direction?
> 
> I've been working on that for a while, but it got a lot more urgent
> as there's been an application hit particularly hard by the barrier
> semantics on cache-less devices and people started getting angry
> about it.  That's why fixing this for cache-less devices has become
> a higher priority than solving the big picture.
> 
My idea here is to use the 'META' request tag to emulate FUA.
From what I've seen, the META request tag is only ever used in gfs2,
and even that is using it for tagging journal requests on write.

Once you've tagged all bios/requests correctly it is trivial to
set the FUA bit.
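
Something like this, roughly (a sketch assuming the unified flag names):

        /* promote correctly tagged journal writes to FUA */
        if (bio->bi_rw & REQ_META)
                bio->bi_rw |= REQ_FUA;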

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:11                 ` Hannes Reinecke
@ 2010-07-28  9:16                   ` Christoph Hellwig
  2010-07-28  9:24                     ` Tejun Heo
  2010-07-28  9:28                   ` Steven Whitehouse
  1 sibling, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28  9:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 11:11:08AM +0200, Hannes Reinecke wrote:
> My idea here is to use the 'META' request tag to emulate FUA.
> From what I've seen, the META request tag is only ever used in gfs2,
> and even that is using it for tagging journal requests on write.

Please don't overload META even more, it's already overloaded with
at least two meanings.
We do in fact already have a REQ_FUA flag, and now that I have unified the
bio and request flags we can easily set it from filesystems.  The problem
is to emulate it properly on devices that do not actually support the FUA bit.
Of which we unfortunately have a lot, given that libata by default disables
the FUA support even if the device supports it.


* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:00               ` Christoph Hellwig
  2010-07-28  9:11                 ` Hannes Reinecke
@ 2010-07-28  9:17                 ` Tejun Heo
  2010-07-28  9:28                   ` Christoph Hellwig
  2010-07-28 13:56                   ` Vladislav Bolkhovitin
  2010-07-28 14:42                 ` Vivek Goyal
  2 siblings, 2 replies; 155+ messages in thread
From: Tejun Heo @ 2010-07-28  9:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On 07/28/2010 11:00 AM, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
>> I see.  It probably would be good to have ordering requirements
>> carried in the bio / request, so that filesystems can mix and match
>> barriers of different strengths as necessary.  As you seem to be
>> already working on it, are you interested in pursuing that direction?
> 
> I've been working on that for a while, but it got a lot more urgent
> as there's been an application hit particularly hard by the barrier
> semantics on cache-less devices and people started getting angry
> about it.  That's why fixing this for cache-less devices has become
> a higher priority than solving the big picture.

Well, if disabling barrier works around the problem for them (which is
basically what was suggested in the first message), that's not too
bad for short term, I think.  At least, there's a handy workaround.
I'll re-read barrier code and see how hard it would be to implement a
proper solution.

Thanks.

-- 
tejun


* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:16                   ` Christoph Hellwig
@ 2010-07-28  9:24                     ` Tejun Heo
  2010-07-28  9:38                       ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-07-28  9:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hannes Reinecke, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On 07/28/2010 11:16 AM, Christoph Hellwig wrote:
> The problem is to emulate it properly on devices that do not actually
> support the FUA bit.  Of which we unfortunately have a lot, given
> that libata by default disables the FUA support even if the device
> supports it.

These were the reasons.

* Some controllers puke for FUA commands whether the device supports
  it or not.

* With the traditional strong barriers, it doesn't make much
  difference whether FUA is used or not.  The full queue has already
  been stalled and flushed by the time the barrier write is issued and all
  that we save is overhead for a single command which doesn't make any
  difference to actual timing of completion.

* Low confidence in drives reporting FUA support.  New features in the ATA
  world seldom work well and I'm fairly sure there are devices which
  report FUA support and handle FUA writes exactly the same way as
  regular writes.  :-(

So, until now, it just wasn't worth the effort / risk.  If filesystems
can make use of more relaxed ordering, including avoiding the full flush
completely, it might make sense to revisit it.  But, in general, I
think most barriers, even when relaxed, would at least involve a single
flush before the FUA write, and in that case I'm pretty skeptical how
useful a FUA write for the barrier itself would be.

Thanks.

-- 
tejun


* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:11                 ` Hannes Reinecke
  2010-07-28  9:16                   ` Christoph Hellwig
@ 2010-07-28  9:28                   ` Steven Whitehouse
  2010-07-28  9:35                     ` READ_META semantics, was " Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Steven Whitehouse @ 2010-07-28  9:28 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	konishi.ryusuke

Hi,

On Wed, 2010-07-28 at 11:11 +0200, Hannes Reinecke wrote:
> Christoph Hellwig wrote:
> > On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
> >> I see.  It probably would be good to have ordering requirements
> >> carried in the bio / request, so that filesystems can mix and match
> >> barriers of different strengths as necessary.  As you seem to be
> >> already working on it, are you interested in pursuing that direction?
> > 
> > I've been working on that for a while, but it got a lot more urgent
> > as there's been an application hit particularly hard by the barrier
> > semantics on cache-less devices and people started getting angry
> > about it.  That's why fixing this for cache-less devices has become
> > a higher priority than solving the big picture.
> > 
> My idea here is to use the 'META' request tag to emulate FUA.
> From what I've seen, the META request tag is only ever used in gfs2,
> and even that is using it for tagging journal requests on write.
> 
> Once you've tagged all bios/requests correctly it is trivial to
> set the FUA bit.
> 
> Cheers,
> 
> Hannes

The META tag is used in GFS2 for tagging all metadata whether to the
journal or otherwise. Is there some reason why this isn't correct? My
understanding was that it was more or less an informational hint to
those watching blktrace,

Steve.




* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:17                 ` Tejun Heo
@ 2010-07-28  9:28                   ` Christoph Hellwig
  2010-07-28  9:48                     ` Tejun Heo
                                       ` (5 more replies)
  2010-07-28 13:56                   ` Vladislav Bolkhovitin
  1 sibling, 6 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28  9:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote:
> Well, if disabling barrier works around the problem for them (which is
> basically what was suggested in the first message), that's not too
> bad for short term, I think.

It's a pretty horrible workaround.  Requiring manual mount options to
get performance out of a setup which could trivially work out of the
box is a bad workaround.

> I'll re-read barrier code and see how hard it would be to implement a
> proper solution.

If we move all filesystems to non-draining barriers with pre- and post-
flushes that might actually be a relatively easy first step.  We don't
have the complications to deal with multiple types of barriers to
start with, and it'll fix the issue for devices without volatile write
caches completely.

I just need some help from the filesystem folks to determine if they
are safe with them.

I know for sure that ext3 and xfs are safe, from looking through them.  And
I know reiserfs is if we make sure it doesn't hit the code path that
relies on it that is currently enabled by the barrier option.

I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
That already ends our small list of barrier supporting filesystems, and
possibly ocfs2, too - although the barrier implementation there seems
incomplete as it doesn't seem to flush caches in fsync.


* READ_META semantics, was Re: [RFC] relaxed barrier semantics
  2010-07-28  9:28                   ` Steven Whitehouse
@ 2010-07-28  9:35                     ` Christoph Hellwig
  2010-07-28 13:52                       ` Jeff Moyer
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28  9:35 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Hannes Reinecke, Christoph Hellwig, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.mason, konishi.ryusuke

On Wed, Jul 28, 2010 at 10:28:55AM +0100, Steven Whitehouse wrote:
> The META tag is used in GFS2 for tagging all metadata whether to the
> journal or otherwise. Is there some reason why this isn't correct? My
> understanding was that it was more or less an informational hint to
> those watching blktrace,

Unfortunately the META flag is overloaded in the CFQ I/O scheduler.
It gives META requests a boost over others, including synchronous
requests.  From all I could gather so far it's intended to give
desktops better interactivity by boosting some metadata reads, while
it should in that form never be used for writes.

So far I've failed badly in getting a clarification of which read requests
need to be tagged, and if we should not apply this boost to write requests
marked META so that they can be used for blktrace tagging.  Unless
we really want to boost all reads, separating the META from the BOOST
flag might be a good option, but I really need to understand better
how it's supposed to be used.

Except for gfs2's big hammer tagging, it's used in ext3/ext4 for all
reads on directories, the quota file and for reading the actual inode
structure.  It's not used for indirect blocks, symlinks, the superblock
and allocation bitmaps.

XFS appears to set the META flag for both reads and writes, but that
code is unreachable currently.  I haven't removed it yet as I'm still
wondering if it could be used correctly instead.


* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:24                     ` Tejun Heo
@ 2010-07-28  9:38                       ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28  9:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Hannes Reinecke, Vivek Goyal, Jan Kara,
	jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso,
	chris.mason, swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 11:24:11AM +0200, Tejun Heo wrote:
> On 07/28/2010 11:16 AM, Christoph Hellwig wrote:
> > The problem is to emulate it properly on devices that do no actually
> > support the FUA bit.  Of which we unfortunately have a lot given
> > that libata by default disables the FUA support even if the device
> > supports.
> 
> These were the reasons.
> 
> * Some controllers puke for FUA commands whether the device supports
>   it or not.
> 
> * Low confidence in drives reporting FUA support.  New features in the ATA
>   world seldom work well and I'm fairly sure there are devices which
>   report FUA support and handle FUA writes exactly the same way as
>   regular writes.  :-(

Jens recently mentioned that Windows seems to send lots of FUA requests
these days, which should really have helped shaking it out.

> completely, it might make sense to revisit it.  But, in general, I
> think most barriers, even when relaxed, would at least involve a single
> flush before the FUA write, and in that case I'm pretty skeptical how
> useful a FUA write for the barrier itself would be.

At least for XFS we should be able to get away with almost no full
flushes at all for special workloads (no fsyncs/syncs, no appending file
writes).  With more normal workloads that get an fsync/sync once in
a while we'd almost always do a full flush for every log write, though.


* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:28                   ` Christoph Hellwig
@ 2010-07-28  9:48                     ` Tejun Heo
  2010-07-28 10:19                     ` Steven Whitehouse
                                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-07-28  9:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On 07/28/2010 11:28 AM, Christoph Hellwig wrote:
> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step.  We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
> 
> I just need some help from the filesystem folks to determine if they
> are safe with them.

Agreed, if all filesystems can agree on the relaxed semantics, things
would be much simpler.

Thanks.

-- 
tejun


* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:28                   ` Christoph Hellwig
  2010-07-28  9:48                     ` Tejun Heo
@ 2010-07-28 10:19                     ` Steven Whitehouse
  2010-07-28 11:45                       ` Christoph Hellwig
  2010-07-28 12:47                     ` Jan Kara
                                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 155+ messages in thread
From: Steven Whitehouse @ 2010-07-28 10:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, konishi.ryusuke

Hi,

On Wed, 2010-07-28 at 11:28 +0200, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote:
> > Well, if disabling barrier works around the problem for them (which is
> > basically what was suggested in the first message), that's not too
> > bad for short term, I think.
> 
> It's a pretty horrible workaround.  Requiring manual mount options to
> get performance out of a setup which could trivially work out of the
> box is a bad workaround.
> 
> > I'll re-read barrier code and see how hard it would be to implement a
> > proper solution.
> 
> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step.  We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
> 
> I just need some help from the filesystem folks to determine if they
> are safe with them.
> 
> I know for sure that ext3 and xfs are safe, from looking through them.  And
> I know reiserfs is if we make sure it doesn't hit the code path that
> relies on it that is currently enabled by the barrier option.
> 
> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
> That already ends our small list of barrier supporting filesystems, and
> possibly ocfs2, too - although the barrier implementation there seems
> incomplete as it doesn't seem to flush caches in fsync.

GFS2 uses barriers only on journal flushing. There are three reasons for
flushing the journal:

1. It's full and we need more space (or the periodic timer has expired,
and there is at least one transaction to flush)
2. We are doing fsync or a full fs sync
3. We need to release a glock to another node, and that glock has some
journaled blocks associated with it

In case #1, I don't think there is any need to actually issue a flush
along with the barrier - the fs will always be correct in case of a (for
example) power failure and it is only the amount of data which might be
lost which depends on the write cache size. This is basically the same
for any local filesystem.

In case #2 we must always flush

In case #3 we need to be certain that all I/O up to and including the
barrier (and subsequent written back in-place metadata, if any) has
reached the storage device (and is not still lurking in the I/O
elevator) before we release the lock, but there is no actual need to
flush the write cache of the device itself. In other words, we need to
flush the non-shared bit of the stack, but not the shared bit on the
device itself. The same caveats about the amount of data which may be
lost on power failure apply as per case #1.

I have also made the assumption that a barrier issued from one node to
the shared device will affect I/O from all nodes equally. If that is not
the case, then the above will not apply and we must always flush in case
#3.

Currently the code is also waiting for I/O to drain in cases #1 and #3
as well as case #2 since it was simpler to implement all cases the same,
at least to start with.

Also in case #3, if we were to implement a non-flushing barrier, then we
would need to add a barrier after the in-place metadata writeback of the
inode that is being released I think, in order to be sure cross-node
ordering was correct. Hmmm. Maybe we should be doing that anyway....
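
To summarise the three cases in code form (a sketch, names made up):

enum gfs2_log_flush_reason { LOG_FULL, FSYNC_OR_SYNC, GLOCK_DEMOTE };

/* does this journal flush need the device's write cache flushed too? */
static bool need_device_cache_flush(enum gfs2_log_flush_reason why)
{
        /* #1 and #3 only need the I/O out of the initiator-side queues;
         * only #2 must force the volatile cache to media */
        return why == FSYNC_OR_SYNC;
}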

Steve.




* Re: [RFC] relaxed barrier semantics
  2010-07-28 10:19                     ` Steven Whitehouse
@ 2010-07-28 11:45                       ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28 11:45 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	konishi.ryusuke

On Wed, Jul 28, 2010 at 11:19:57AM +0100, Steven Whitehouse wrote:
> In case #1, I don't think there is any need to actually issue a flush
> along with the barrier - the fs will always be correct in case of a (for
> example) power failure and it is only the amount of data which might be
> lost which depends on the write cache size. This is basically the same
> for any local filesystem.

For now we're mostly talking about removing the _ordering_, not the
flushing.  Eventually I'd like to relax some of the flushing
requirements, too - but that is secondary priority.

So for now I'm mostly interested in whether gfs2 relies on the ordering
semantics from barriers.  Given that it's been around for a while
and primarily used on devices without any kind of barrier support
I'm inclined to think it is, but I'd really prefer to get this from the
horse's mouth.

> I have also made the assumption that a barrier issued from one node to
> the shared device will affect I/O from all nodes equally. If that is not
> the case, then the above will not apply and we must always flush in case
> #3.

There is absolutely no ordering vs other nodes.  The volatile write
cache, if present, is per-target state, so it will be flushed for all
nodes.

> Currently the code is also waiting for I/O to drain in cases #1 and #3
> as well as case #2 since it was simpler to implement all cases the same,
> at least to start with.

Aka gfs2 waits for the I/O completion by itself.  That sounds like it
is the answer to my original question.



* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:28                   ` Christoph Hellwig
  2010-07-28  9:48                     ` Tejun Heo
  2010-07-28 10:19                     ` Steven Whitehouse
@ 2010-07-28 12:47                     ` Jan Kara
  2010-07-28 23:00                       ` Christoph Hellwig
  2010-07-29  1:44                     ` Ted Ts'o
                                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 155+ messages in thread
From: Jan Kara @ 2010-07-28 12:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On Wed 28-07-10 11:28:59, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote:
> > Well, if disabling barrier works around the problem for them (which is
> > basically what was suggested in the first message), that's not too
> > bad for short term, I think.
> 
> It's a pretty horrible workaround.  Requiring manual mount options to
> get performance out of a setup which could trivially work out of the
> box is a bad workaround.
> 
> > I'll re-read barrier code and see how hard it would be to implement a
> > proper solution.
> 
> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step.  We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
> 
> I just need some help from the filesystem folks to determine if they
> are safe with them.
> 
> I know for sure that ext3 and xfs are from looking through them.  And
  Yes, ext3 is safe.

> I know reiserfs is if we make sure it doesn't hit the code path that
> relies on it that is currently enabled by the barrier option.
  Yes, just always writing the commit buffer at the place where we
currently do it in the !barrier case should be enough for reiserfs.

> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
  As I wrote in some other email, ext4/jbd2 is OK, unless you mount the
filesystem with the async_commit mount option. With that option it does
the same thing as reiserfs in the barrier case - i.e., it needs ordering.

> That already ends our small list of barrier supporting filesystems, and
> possibly ocfs2, too - although the barrier implementation there seems
> incomplete as it doesn't seem to flush caches in fsync.
  Well, ocfs2 uses jbd2 for journaling, so it supports barriers out of
the box and does not need the ordering. ocfs2_sync_file is actually
correct (although maybe slightly inefficient) because it does
jbd2_journal_force_commit(), which creates and immediately commits a
transaction, and that implies a barrier.
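  To make that concrete, an fsync path leaning on jbd2 for the barrier
looks roughly like this (a minimal sketch, not the actual ocfs2 code;
error handling omitted):

	#include <linux/jbd2.h>

	static int example_sync_file(journal_t *journal)
	{
		/*
		 * Creates a transaction and commits it synchronously;
		 * with barriers enabled the commit block is issued as
		 * a barrier write, so no separate cache flush is
		 * needed here.
		 */
		return jbd2_journal_force_commit(journal);
	}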

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: READ_META semantics, was Re: [RFC] relaxed barrier semantics
  2010-07-28  9:35                     ` READ_META semantics, was " Christoph Hellwig
@ 2010-07-28 13:52                       ` Jeff Moyer
  0 siblings, 0 replies; 155+ messages in thread
From: Jeff Moyer @ 2010-07-28 13:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Steven Whitehouse, Hannes Reinecke, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.mason, konishi.ryusuke

Christoph Hellwig <hch@lst.de> writes:

> On Wed, Jul 28, 2010 at 10:28:55AM +0100, Steven Whitehouse wrote:
>> The META tag is used in GFS2 for tagging all metadata whether to the
>> journal or otherwise. Is there some reason why this isn't correct? My
>> understanding was that it was more or less an informational hint to
>> those watching blktrace,
>
> Unfortunately the META flag is overloaded in the CFQ I/O scheduler.
> It gives META requests a boost over others, including synchronous
> requests.

Within a single process, when choosing the next request to be serviced,
if both requests are synchronous and one is tagged as metadata, then the
metadata request is chosen.

Also, as you mention, a request tagged as metadata will allow the
issuing process to preempt another process that currently holds the I/O
scheduler's active queue.  Note that this isn't the intention of the
code; it's actually a bug, I think:

	/*
	 * So both queues are sync. Let the new request get disk time if
	 * it's a metadata request and the current queue is doing regular IO.
	 */
	if (rq_is_meta(rq) && !cfqq->meta_pending)
		return true;

But, it seems to me that there is no guarantee that both cfq_queues are
synchronous at this point!  Probably some code reshuffling has caused
this to happen.
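
One possible shape of the fix (an untested sketch against the cfq code
of this era, not a submitted patch) is to test the sync state explicitly
instead of assuming it:

	/*
	 * Sketch: only apply the meta boost when the new request is
	 * sync and the currently active queue is sync as well.
	 */
	if (rq_is_sync(rq) && cfq_cfqq_sync(cfqq) &&
	    rq_is_meta(rq) && !cfqq->meta_pending)
		return true;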

> From all I could gather so far it's intended to give desktops better
> interactivity by boosting some metadata reads, while it should in that
> form never be used for writes.

Unfortunately, I don't know the history of this code.  The commit
messages are too vague to be useful:

    cfq-iosched: fix bad return value cfq_should_preempt()
    
    Commit a6151c3a5c8e1ff5a28450bc8d6a99a2a0add0a7 inadvertently
    reversed a preempt condition check, potentially causing a
    performance regression.  Make the meta check correct again.

It's anyone's guess as to what the performance regression "potentially"
was.

commit 374f84ac39ec7829a57a66efd5125d3561ff0e00
Author: Jens Axboe <axboe@suse.de>
Date:   Sun Jul 23 01:42:19 2006 +0200

    [PATCH] cfq-iosched: use metadata read flag
    
    Give meta data reads preference over regular reads, as the process
    often needs to get that out of the way to do the io it was actually
    interested in.
    
    Signed-off-by: Jens Axboe <axboe@suse.de>

Again, no idea what the affected workloads are.  I have to admit,
though, it sounds like a good idea.  ;-)  Jens, if you know what types of
workloads are affected, then I can put together some tests and submit a
patch to fix the above logic.

> So far I've failed badly in getting a clarification of which read
> requests need to be tagged, and whether we should avoid applying this
> boost to write requests marked META so that they can be used for
> blktrace tagging.  Unless we really want to boost all reads, separating
> the META from the BOOST flag might be a good option, but I really need
> to understand better how it's supposed to be used.

I think it makes sense to split out the flag into two: one for blktrace
annotation and the other for boosted I/O priority.  Hopefully we can
come up with some real world use cases that show the benefits of the
latter.
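
To make the split concrete, it could look something like this (names
invented for illustration, nothing like this exists yet):

	/* A bit that is purely a blktrace/annotation hint ... */
	#define EXAMPLE_REQ_META	(1U << 0)
	/* ... and a bit that the I/O scheduler actually boosts. */
	#define EXAMPLE_REQ_BOOST	(1U << 1)

Filesystems could then keep tagging metadata for blktrace without side
effects on scheduling, and CFQ would key its preemption logic off the
boost bit only.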

Cheers,
Jeff

> Except for gfs2 big hammer tagging it's used in ext3/ext4 for all
> reads on directories, the quota file and for reading the actual inode
> structure.  It's not used for indirect blocks, symlinks, the superblock
> and allocation bitmaps.
>
> XFS appears to set the META flag for both reads and writes, but that
> code is unreachable currently.  I haven't removed it yet as I'm still
> wondering if it could be used correctly instead.
> --

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28  8:20       ` Tejun Heo
@ 2010-07-28 13:55         ` Vladislav Bolkhovitin
  2010-07-28 14:23           ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-28 13:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

Tejun Heo, on 07/28/2010 12:20 PM wrote:
> On 07/28/2010 10:08 AM, Tejun Heo wrote:
>> Having writeback cache or not doesn't make any difference
>> w.r.t. request ordering requirements.  If filesystems don't need the
>> heavy handed ordering provided by barrier, it should just use flush
>> instead of barrier.  If filesystem needs the barrier ordering, whether
>> the device in question is battery backed and costs more than a house
>> doesn't make any difference.
>
> BTW, if filesystems already have code to order the requests they're
> issuing, it would be *great* to phase out barrier and replace it with
> simple in-stream, non-ordering flush request.  There have been several
> different suggestions about how to improve barrier and most revolved
> around how to transfer more information from filesystem to block layer
> so that the block layer can use more relaxed ordering, but the more I
> think about it, it becomes clear that it doesn't belong to block layer
> at all.
>
> The only benefit of doing it in the block layer, and probably the
> reason why it was done this way at all, is making use of advanced
> ordering features of some devices - ordered tag and linked commands.
> The latter is deprecated and the former is fundamentally broken in
> error handling anyway.

Why? SCSI provides ACA and UA_INTLCK, which provide all the needed
facilities for error handling in deep ordered queues.

> Furthermore, although they do relax ordering
> requirements from the device queue side, the level of flexibility is
> significantly lower compared to what filesystems can do themselves.

Can you elaborate on what is not sufficiently flexible in SCSI ordered
commands, please?

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:17                 ` Tejun Heo
  2010-07-28  9:28                   ` Christoph Hellwig
@ 2010-07-28 13:56                   ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-28 13:56 UTC (permalink / raw)
  To: Tejun Heo, Christoph Hellwig
  Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Tejun Heo, on 07/28/2010 01:17 PM wrote:
> On 07/28/2010 11:00 AM, Christoph Hellwig wrote:
>> On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
>>> I see.  It probably would be good to have ordering requirements
>>> carried in the bio / request, so that filesystems can mix and match
>>> barriers of different strengths as necessary.  As you seem to be
>>> already working on it, are you interested in pursuing that direction?
>>
>> I've been working on that for a while, but it got a lot more urgent
>> as there's been an application hit particularly hard by the barrier
>> semantics on cache less devices and people started getting angry
>> about it.  That's why fixing this for cache less devices has become
>> a higher priority than solving the big picture.
>
> Well, if disabling barrier works around the problem for them (which is
> basically what was suggested in the first message), that's not too
> bad for short term, I think.  At least, there's a handy workaround.
> I'll re-read barrier code and see how hard it would be to implement a
> proper solution.

For all the people working on barriers I'd recommend using a
Linux-based software SCSI device implemented with the SCST framework
(http://scst.sourceforge.net). This isn't an advertisement; SCST is
really handy for such tasks. With it you can make your device write
through/write back/FUA/NV cache/etc., you can fully see the flow of
commands sent by your Linux initiator, you can insert filters on some of
them, perform various failure injections to check how robust your
implementation is, etc. SCST fully processes ORDERED commands as
required by SAM.

You can start with the iSCSI target and the vdisk backend dev handler.
For example, to see the full flow of commands you should run (via the
proc interface) "echo "add scsi" >/proc/scsi_tgt/trace_level"; to see
FUA/sync cache commands only, "echo "add order" >/proc/scsi_tgt/vdisk/trace_level".
The output will be in the kernel log, so you may need to increase
CONFIG_LOG_BUF_SHIFT.

For 1.0.1.x I have a patch implementing ACA, developed by one company
using SCST, which is going to be integrated into the trunk in v2.1. This
patch was needed for AIX to work at full performance and is now used in
production. With it the implementation of UA_INTLCK is trivial and I can
do it upon request.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 13:55         ` Vladislav Bolkhovitin
@ 2010-07-28 14:23           ` Tejun Heo
  2010-07-28 14:37             ` James Bottomley
  2010-07-28 16:16             ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 155+ messages in thread
From: Tejun Heo @ 2010-07-28 14:23 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

Hello,

On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote:
>> The only benefit of doing it in the block layer, and probably the
>> reason why it was done this way at all, is making use of advanced
>> ordering features of some devices - ordered tag and linked commands.
>> The latter is deprecated and the former is fundamentally broken in
>> error handling anyway.
> 
> Why? SCSI provides ACA and UA_INTLCK which provide all needed
> facilities for errors handling in deep ordered queues.

I don't remember all the details now, but IIRC what was necessary was an
earlier write failure failing all commands scheduled as ordered.  Does
ACA / UA_INTLCK or whatever allow that?

>> Furthermore, although they do relax ordering
>> requirements from the device queue side, the level of flexibility is
>> significantly lower compared to what filesystems can do themselves.
> 
> Can you elaborate more what is not sufficiently flexible in SCSI
> ordered commands, please?

File systems are not communicating enough ordering info to the block
layer already, so we lose a lot of ordering information there, and
SCSI ordered queueing is also pretty restricted in what kind of
ordering it can represent.  The end result is that we don't gain much
by using ordered queueing.  It may cut down command latencies among
the commands used for a barrier sequence, but if you compare it to the
level of parallelism filesystem code can exploit by ordering requests
themselves...  Another thing is coverage.  We have had ordered queueing
for quite some time now, but there are only a couple of drivers which
actually support it.
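
To illustrate, "ordering requests themselves" amounts to something like
the following buffer-head sketch (setup, locking and error handling
elided):

	#include <linux/buffer_head.h>

	static void example_ordered_commit(struct buffer_head **data, int n,
					   struct buffer_head *commit_bh)
	{
		int i;

		for (i = 0; i < n; i++)
			submit_bh(WRITE, data[i]);  /* device may reorder these */
		for (i = 0; i < n; i++)
			wait_on_buffer(data[i]);    /* ordering enforced here,  */
						    /* not by ordered tags      */
		submit_bh(WRITE, commit_bh);
		wait_on_buffer(commit_bh);
	}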

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 14:23           ` Tejun Heo
@ 2010-07-28 14:37             ` James Bottomley
  2010-07-28 14:44               ` Tejun Heo
  2010-07-28 16:17               ` Vladislav Bolkhovitin
  2010-07-28 16:16             ` Vladislav Bolkhovitin
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-07-28 14:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladislav Bolkhovitin, Vivek Goyal, Jan Kara, Christoph Hellwig,
	jaxboe, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On Wed, 2010-07-28 at 16:23 +0200, Tejun Heo wrote:
> Hello,
> 
> On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote:
> >> The only benefit of doing it in the block layer, and probably the
> >> reason why it was done this way at all, is making use of advanced
> >> ordering features of some devices - ordered tag and linked commands.
> >> The latter is deprecated and the former is fundamentally broken in
> >> error handling anyway.
> > 
> > Why? SCSI provides ACA and UA_INTLCK which provide all needed
> > facilities for errors handling in deep ordered queues.
> 
> I don't remember all the details now but IIRC what was necessary was
> earlier write failure failing all commands scheduled as ordered.  Does
> ACA / UA_INTLCK or whatever allow that?

No.  That requires support for QErr ... which is in the same mode page.

The real reason we have difficulty is that BUSY/QUEUE_FULL can cause
reordering in the issue queue, which is a driver problem and not in the
SCSI standards.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:00               ` Christoph Hellwig
  2010-07-28  9:11                 ` Hannes Reinecke
  2010-07-28  9:17                 ` Tejun Heo
@ 2010-07-28 14:42                 ` Vivek Goyal
  2 siblings, 0 replies; 155+ messages in thread
From: Vivek Goyal @ 2010-07-28 14:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 11:00:25AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
> > I see.  It probably would be good to have ordering requirements
> > carried in the bio / request, so that filesystems can mix and match
> > barriers of different strengths as necessary.  As you seem to be
> > already working on it, are you interested in pursuing that direction?
> 
> I've been working on that for a while, but it got a lot more urgent
> as there's been an application hit particularly hard by the barrier
> semantics on cache less devices and people started getting angry
> about it.  That's why fixing this for cache less devices has become
> a higher priority than solving the big picture.

And in the process the IO controller cgroup stuff will also benefit;
otherwise excessive draining on the request queue takes away any service
differentiation CFQ provides among groups.

Vivek

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 14:37             ` James Bottomley
@ 2010-07-28 14:44               ` Tejun Heo
  2010-07-28 16:17                 ` Vladislav Bolkhovitin
  2010-07-28 16:17               ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-07-28 14:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Vladislav Bolkhovitin, Vivek Goyal, Jan Kara, Christoph Hellwig,
	jaxboe, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

Hello,

On 07/28/2010 04:37 PM, James Bottomley wrote:
>> I don't remember all the details now but IIRC what was necessary was
>> earlier write failure failing all commands scheduled as ordered.  Does
>> ACA / UA_INTLCK or whatever allow that?
> 
> No.  That requires support for QErr ... which is in the same mode page.

I see.

> The real reason we have difficulty is that BUSY/QUEUE_FULL can cause
> reordering in the issue queue, which is a driver problem and not in the
> SCSI standards.

Ah yeah, right.  ISTR discussions about this years ago.  But one way or
the other, given the limited amount of ordering information available
under the block layer, I doubt the benefit of doing so would be anything
significant.  If it can be done w/o too much complexity, sure, but
otherwise...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 14:23           ` Tejun Heo
  2010-07-28 14:37             ` James Bottomley
@ 2010-07-28 16:16             ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-28 16:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

Tejun Heo, on 07/28/2010 06:23 PM wrote:
> Hello,
>
> On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote:
>>> The only benefit of doing it in the block layer, and probably the
>>> reason why it was done this way at all, is making use of advanced
>>> ordering features of some devices - ordered tag and linked commands.
>>> The latter is deprecated and the former is fundamentally broken in
>>> error handling anyway.
>>
>> Why? SCSI provides ACA and UA_INTLCK which provide all needed
>> facilities for errors handling in deep ordered queues.
>
> I don't remember all the details now but IIRC what was necessary was
> earlier write failure failing all commands scheduled as ordered.  Does
> ACA / UA_INTLCK or whatever allow that?

Basically, ACA suspends the whole queue in case a command at the head
finishes with CHECK CONDITION status. The queue should be resumed later
by the CLEAR ACA task management function. During ACA one or more new
commands can be sent at the head of the queue. This allows, e.g.,
restarting the failed command.

UA_INTLCK allows establishing a Unit Attention if a command at the head
finishes with an error other than CHECK CONDITION status. The next
command will then finish with CHECK CONDITION, and ACA comes into
action.

Overall, they look like a complete facility for effective error recovery
in ordered queues.

>>> Furthermore, although they do relax ordering
>>> requirements from the device queue side, the level of flexibility is
>>> significantly lower compared to what filesystems can do themselves.
>>
>> Can you elaborate more what is not sufficiently flexible in SCSI
>> ordered commands, please?
>
> File systems are not communicating enough ordering info to the block
> layer already, so we lose a lot of ordering information there, and
> SCSI ordered queueing is also pretty restricted in what kind of
> ordering it can represent.

What restrictions do you mean?

> The end result is that we don't gain much
> by using ordered queueing.  It may cut down command latencies among
> the commands used for a barrier sequence, but if you compare it to the
> level of parallelism filesystem code can exploit by ordering requests
> themselves...  Another thing is coverage.  We have had ordered queueing
> for quite some time now, but there are only a couple of drivers which
> actually support it.

Agreed, file systems should provide full ordering info to the block
level. The block level should then do its best to provide the needed
ordering using the available hardware facilities.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 14:37             ` James Bottomley
  2010-07-28 14:44               ` Tejun Heo
@ 2010-07-28 16:17               ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-28 16:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

James Bottomley, on 07/28/2010 06:37 PM wrote:
> On Wed, 2010-07-28 at 16:23 +0200, Tejun Heo wrote:
>> Hello,
>>
>> On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote:
>>>> The only benefit of doing it in the block layer, and probably the
>>>> reason why it was done this way at all, is making use of advanced
>>>> ordering features of some devices - ordered tag and linked commands.
>>>> The latter is deprecated and the former is fundamentally broken in
>>>> error handling anyway.
>>>
>>> Why? SCSI provides ACA and UA_INTLCK which provide all needed
>>> facilities for errors handling in deep ordered queues.
>>
>> I don't remember all the details now but IIRC what was necessary was
>> earlier write failure failing all commands scheduled as ordered.  Does
>> ACA / UA_INTLCK or whatever allow that?
>
> No.  That requires support for QErr ... which is in the same mode page.
>
> The real reason we have difficulty is that BUSY/QUEUE_FULL can cause
> reordering in the issue queue, which is a driver problem and not in the
> SCSI standards.

BTW, I have for a long time wondered why low-level drivers should
process BUSY/QUEUE_FULL and adjust the queue depth themselves. Isn't
that common to all the drivers, and so should be performed at the higher
(SCSI) level? That level would provide a facility to prevent reordering,
if needed, and the driver would communicate with it transparently.

I mean the following. A driver always deals with a single command at a
time. It either sends the command to the device, or passes the command's
status/sense from the device up to the SCSI level. The SCSI level then
decides whether to send another command to the driver or to perform the
necessary recovery, e.g., adjusting the queue depth or restarting the
QUEUE_FULL'ed command using ACA.
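
To illustrate, a toy sketch of such a dispatch model (all names
invented; 0x02 and 0x28 are the SCSI CHECK CONDITION and TASK SET FULL
status codes):

	enum example_verdict { SEND_NEXT, RETRY_HEAD, SHRINK_QUEUE };

	/* Called by the SCSI level for every completion the driver
	 * passes up; the driver itself stays a dumb one-command pipe. */
	static enum example_verdict example_scsi_level_complete(int status)
	{
		switch (status) {
		case 0x28:	/* TASK SET FULL: adjust depth centrally */
			return SHRINK_QUEUE;
		case 0x02:	/* CHECK CONDITION: ACA, restart at head */
			return RETRY_HEAD;
		default:
			return SEND_NEXT;
		}
	}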

In this architecture there would be no need to update all the drivers to
provide ordering guarantees and ACA-based recovery, as seems necessary
now.

Or, am I missing something?

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 14:44               ` Tejun Heo
@ 2010-07-28 16:17                 ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-28 16:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Vivek Goyal, Jan Kara, Christoph Hellwig,
	jaxboe, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

Tejun Heo, on 07/28/2010 06:44 PM wrote:
> Hello,
>
> On 07/28/2010 04:37 PM, James Bottomley wrote:
>>> I don't remember all the details now but IIRC what was necessary was
>>> earlier write failure failing all commands scheduled as ordered.  Does
>>> ACA / UA_INTLCK or whatever allow that?
>>
>> No.  That requires support for QErr ... which is in the same mode page.
>
> I see.
>
>> The real reason we have difficulty is that BUSY/QUEUE_FULL can cause
>> reordering in the issue queue, which is a driver problem and not in the
>> SCSI standards.
>
> Ah yeah right.  ISTR discussions about this years ago.  But one way or
> the other, given the limited amount of ordering information available
> under the block layer, I doubt the benefit of doing would be anything
> significant.  If it can be done w/o too much complexity, sure, but
> otherwise...

Hmm, this thread was started from the need to avoid queue draining,
because it is a big performance hit. The use of ordered commands allows
queue draining to be eliminated _completely_. That looks like a
significant benefit, worth some additional complexity.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 12:47                     ` Jan Kara
@ 2010-07-28 23:00                       ` Christoph Hellwig
  2010-07-29 10:45                         ` Jan Kara
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-28 23:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 02:47:20PM +0200, Jan Kara wrote:
>   Well, ocfs2 uses jbd2 for journaling so it supports barriers out of the
> box and does not need the ordering. ocfs2_sync_file is actually correct
> (although maybe slightly inefficient) because it does
> jbd2_journal_force_commit() which creates and immediately commits a
> transaction and that implies a barrier.

I don't think that's correct.  ocfs2_sync_file first does
ocfs2_sync_inode, which does a completely superfluous filemap_fdatawrite,
and from what I can see a just as superfluous sync_mapping_buffers (given
that ocfs2 doesn't use mark_buffer_dirty_inode), and then might return
early in case we do fdatasync but the inode isn't marked
I_DIRTY_DATASYNC.  In that case we might need a cache flush given
that the data might still be dirty.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:28                   ` Christoph Hellwig
                                       ` (2 preceding siblings ...)
  2010-07-28 12:47                     ` Jan Kara
@ 2010-07-29  1:44                     ` Ted Ts'o
  2010-07-29  2:43                       ` Vivek Goyal
                                         ` (4 more replies)
  2010-08-02 16:47                     ` Ryusuke Konishi
  2010-08-02 17:39                     ` Chris Mason
  5 siblings, 5 replies; 155+ messages in thread
From: Ted Ts'o @ 2010-07-29  1:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote:
> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step.  We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
> 
> I just need some help from the filesystem folks to determine if they
> are safe with them.
> 
> I know for sure that ext3 and xfs are from looking through them.  And
> I know reiserfs is if we make sure it doesn't hit the code path that
> relies on it that is currently enabled by the barrier option.
> 
> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
> That already ends our small list of barrier supporting filesystems, and
> possibly ocfs2, too - although the barrier implementation there seems
> incomplete as it doesn't seem to flush caches in fsync.

Define "are safe" --- what interface we planning on using for the
non-draining barrier?  At least for ext3, when we write the commit
record using set_buffer_ordered(bh), it assumes that this will do a
flush of all previous writes and that the commit will hit the disk
before any subsequent writes are sent to the disk.  So turning the
write of a buffer head marked with set_buffered_ordered() into a FUA
write would _not_ be safe for ext3.

For ext4, if we don't use journal checksums, then we have the same
requirements as ext3, and the same method of requesting it.  If we do
use journal checksums, what ext4 needs is a way of assuring that no
writes after the commit are reordered with respect to the disk platter
before the commit record --- but any of the writes before that,
including the commit, can be reordered, because we rely on the checksum
in the commit record to know at replay time whether the last commit is
valid or not.  We do that right now by calling blkdev_issue_flush()
with BLKDEV_IFL_WAIT after submitting the write of the commit block.
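
Concretely, the sequence is roughly (a sketch only; journal bookkeeping
and error handling elided, blkdev_issue_flush() signature as of this
kernel series):

	submit_bh(WRITE, commit_bh);	/* checksummed commit record      */
	wait_on_buffer(commit_bh);	/* transferred, maybe only into   */
					/* the volatile cache             */
	blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
					/* nothing written after this can */
					/* pass the commit on the platter */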

					- Ted

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29  1:44                     ` Ted Ts'o
  2010-07-29  2:43                       ` Vivek Goyal
@ 2010-07-29  2:43                       ` Vivek Goyal
  2010-07-29  8:42                         ` Christoph Hellwig
  2010-07-29  8:31                       ` [RFC] relaxed barrier semantics Christoph Hellwig
                                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 155+ messages in thread
From: Vivek Goyal @ 2010-07-29  2:43 UTC (permalink / raw)
  To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel

On Wed, Jul 28, 2010 at 09:44:31PM -0400, Ted Ts'o wrote:
> On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote:
> > If we move all filesystems to non-draining barriers with pre- and post-
> > flushes that might actually be a relatively easy first step.  We don't
> > have the complications to deal with multiple types of barriers to
> > start with, and it'll fix the issue for devices without volatile write
> > caches completely.
> > 
> > I just need some help from the filesystem folks to determine if they
> > are safe with them.
> > 
> > I know for sure that ext3 and xfs are from looking through them.  And
> > I know reiserfs is if we make sure it doesn't hit the code path that
> > relies on it that is currently enabled by the barrier option.
> > 
> > I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
> > That already ends our small list of barrier supporting filesystems, and
> > possibly ocfs2, too - although the barrier implementation there seems
> > incomplete as it doesn't seem to flush caches in fsync.
> 
> Define "are safe" --- what interface we planning on using for the
> non-draining barrier?  At least for ext3, when we write the commit
> record using set_buffer_ordered(bh), it assumes that this will do a
> flush of all previous writes and that the commit will hit the disk
> before any subsequent writes are sent to the disk.  So turning the
> write of a buffer head marked with set_buffered_ordered() into a FUA
> write would _not_ be safe for ext3.
> 

I guess we will require something like a set_buffer_preflush_fua() kind
of operation, so that we preflush the cache to make sure everything
before the commit block is on the platter, and then do the commit block
write with FUA to make sure the commit block is on the platter.

This is assuming that before issuing the commit block request we have
waited for completion of the rest of the journal data. This will make
sure none of that journal data is in the request queue. Then if we
issue the commit with preflush and FUA, it should make sure all the
journal blocks are on disk and then the commit block is on disk.

So as long as we wait in the filesystem for completion of the requests
the commit block depends on before we issue the commit request, we
should not require a request queue drain, and a preflush + FUA write
should probably be fine.
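
As a sketch, the submission I have in mind would look like this
(WRITE_PREFLUSH_FUA is an invented flag name standing for "flush the
volatile cache, then write this block with FUA"; no such flag exists):

	/* Precondition: we already waited for all journal data the
	 * commit depends on, so none of it sits in the request queue
	 * and no drain is needed. */
	submit_bh(WRITE_PREFLUSH_FUA, commit_bh);
	wait_on_buffer(commit_bh);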

> For ext4, if we don't use journal checksums, then we have the same
> requirements as ext3, and the same method of requesting it.  If we do
> use journal checksums, what ext4 needs is a way of assuring that no
> writes after the commit are reordered with respect to the disk platter
> before the commit record --- but any of the writes before that,
> including the commit, can be reordered, because we rely on the checksum
> in the commit record to know at replay time whether the last commit is
> valid or not.  We do that right now by calling blkdev_issue_flush()
> with BLKDEV_IFL_WAIT after submitting the write of the commit block.

IIUC, blkdev_issue_flush() is just a hard barrier and will drain the
queue and flush the cache. I guess what we need is only a flush, and not
a drain, after we have waited for completion of the commit record as
well as the requests issued before it. That should make sure any WRITE
after the commit record does not get reordered w.r.t. the previous
commit. So we probably need a blkdev_issue_flush_only() which will just
flush the caches and not drain the request queue.
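
Such an interface might be as simple as (hypothetical, it does not
exist today):

	/* Flush the device's volatile cache without inserting a
	 * queue-draining barrier. */
	int blkdev_issue_flush_only(struct block_device *bdev, gfp_t gfp_mask);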

This is all based on my very primitive knowledge. Please ignore if it is
all rubbish.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29  1:44                     ` Ted Ts'o
  2010-07-29  2:43                       ` Vivek Goyal
  2010-07-29  2:43                       ` Vivek Goyal
@ 2010-07-29  8:31                       ` Christoph Hellwig
  2010-07-29 11:16                         ` Jan Kara
  2010-07-29 13:00                         ` extfs reliability Vladislav Bolkhovitin
  2010-07-29 19:44                       ` [RFC] relaxed barrier semantics Ric Wheeler
  2010-07-29 19:44                       ` Ric Wheeler
  4 siblings, 2 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29  8:31 UTC (permalink / raw)
  To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley

On Wed, Jul 28, 2010 at 09:44:31PM -0400, Ted Ts'o wrote:
> Define "are safe" --- what interface we planning on using for the
> non-draining barrier?  At least for ext3, when we write the commit
> record using set_buffer_ordered(bh), it assumes that this will do a
> flush of all previous writes and that the commit will hit the disk
> before any subsequent writes are sent to the disk.  So turning the
> write of a buffer head marked with set_buffer_ordered() into a FUA
> write would _not_ be safe for ext3.

Please be careful with your wording.  Do you really mean
"all previous writes" or "all previous writes that were completed"?

My reading of the ext3/jbd code is that we explicitly wait on I/O
completion of dependent writes, and only require those to actually be
stable by issuing a flush.   If that weren't the case, the default ext3
barriers-off behaviour would not only be dangerous on devices with
volatile write caches, but also on devices that do not have them,
which, in addition to the reading of the code, is not what we've seen
in actual power-fail testing, where ext3 does well as long as there
is no volatile write cache.

Anyway, the pre-flush semantics are what the relaxed barriers will
preserve.  REQ_FUA is a separate interface, which we actually already
have inside the block layer; we'll just need to emulate it for devices
without the FUA bit and handle it in dm and md.
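
The emulation for devices without the FUA bit can be as simple as
write + wait + flush (a sketch with invented naming; the real code would
live in the block layer and dm/md):

	static void example_emulate_fua(struct block_device *bdev,
					struct buffer_head *bh)
	{
		submit_bh(WRITE, bh);
		wait_on_buffer(bh);
		/* make the just-completed write stable */
		blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
	}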

> For ext4, if we don't use journal checksums, then we have the same
> requirements as ext3, and the same method of requesting it.  If we do
> use journal checksums, what ext4 needs is a way of assuring that no
> writes after the commit are reordered with respect to the disk platter
> before the commit record --- but any of the writes before that,
> including the commit, can be reordered, because we rely on the checksum
> in the commit record to know at replay time whether the last commit is
> valid or not.  We do that right now by calling blkdev_issue_flush()
> with BLKDEV_IFL_WAIT after submitting the write of the commit block.

blkdev_issue_flush is just an empty barrier, and the current barriers
prevent any kind of reordering.  I'd rather avoid adding a one-way
reordering prevention.

Given that we don't appear to actually need the full reordering
prevention even without the journal checksums, why do you have stricter
requirements when they are enabled?

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29  2:43                       ` Vivek Goyal
@ 2010-07-29  8:42                         ` Christoph Hellwig
  2010-07-29 20:02                           ` Vivek Goyal
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29  8:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ted Ts'o, Christoph Hellwig, Tejun Heo, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 10:43:34PM -0400, Vivek Goyal wrote:
> I guess we will require something like set_buffer_preflush_fua() kind of
> operation so that we preflush the cache to make sure everything before
> commit block is on platter and then do commit block write with FUA
> to make sure commit block is on platter.

No more messing with buffer flags for barriers / cache flush options
please.  It's a flag for the I/O submission, not buffer state.  See
my patch from June to remove BH_Ordered if you're interested.

> This is assuming that before issuing the commit block request we have
> waited for completion of the rest of the journal data. This will make
> sure none of that journal data is in the request queue. Then if we
> issue the commit with preflush and FUA, it should make sure all the
> journal blocks are on disk and then the commit block is on disk.
> 
> So as long as we wait in the filesystem for completion of the requests
> the commit block depends on before we issue the commit request, we
> should not require a request queue drain, and a preflush + FUA write
> should probably be fine.

We do not require the drain for that case.  The flush is more difficult,
because it's entirely possible that we have state that we require to be
on disk before writing out a log buffer.  For XFS that's two cases:

 (1) we require the actual file data to be on disk before logging the
     file size update, to avoid stale data exposure in case the log
     buffer hits the disk before the data
 (2) we require that the buffers writing back metadata actually made it
     to disk before pushing the log tail

(1) means we'll always need a pre-flush when a log buffer contains a
size update from an appending write.
(2) means we need more complicated tracking of the tail lsn, e.g.
by caching it somewhere and only updating the cached value after a
cache flush happened, with a way to force one if needed; a sketch is
below.

All that is at least as complicated as it sounds.  While I have a
working prototype, just going with the relaxed barriers as a first step
is probably easier.
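
For (2), the cached tail tracking I have in mind looks roughly like
this (all names invented, a sketch of the idea only):

	struct example_log {
		u64 tail_lsn;		/* tail we may actually expose     */
		u64 pending_tail_lsn;	/* valid only after the next flush */
	};

	/* called on completion of a cache flush */
	static void example_cache_flush_done(struct example_log *log)
	{
		/* metadata writeback is now stable; the tail may advance */
		log->tail_lsn = log->pending_tail_lsn;
	}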

> IIUC, blkdev_issue_flush() is just a hard barrier and will drain the
> queue and flush the cache.

Exactly.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28 23:00                       ` Christoph Hellwig
@ 2010-07-29 10:45                         ` Jan Kara
  2010-07-29 16:54                           ` Joel Becker
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Kara @ 2010-07-29 10:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On Thu 29-07-10 01:00:10, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 02:47:20PM +0200, Jan Kara wrote:
> >   Well, ocfs2 uses jbd2 for journaling so it supports barriers out of the
> > box and does not need the ordering. ocfs2_sync_file is actually correct
> > (although maybe slightly inefficient) because it does
> > jbd2_journal_force_commit() which creates and immediately commits a
> > transaction and that implies a barrier.
> 
> I don't think that's correct.  ocfs2_sync_file first does
> ocfs2_sync_inode, which does a completely superfluous filemap_fdatawrite,
> and from what I can see a just as superfluous sync_mapping_buffers (given
> that ocfs2 doesn't use mark_buffer_dirty_inode), and then might return
> early in case we do fdatasync but the inode isn't marked
> I_DIRTY_DATASYNC.  In that case we might need a cache flush given
> that the data might still be dirty.
  Ah, I see. You're right, the fdatasync case is buggy. I'll send Joel a
fix.
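
  The shape of the fix would be roughly the following (a sketch, not the
actual patch; err and the bail label are stand-ins for the local
conventions in ocfs2_sync_file): on the fdatasync early-return path,
still flush the device cache so plain data writes are made stable.

	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) {
		/* data may still sit in the volatile write cache */
		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
					 NULL, BLKDEV_IFL_WAIT);
		goto bail;
	}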

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29  8:31                       ` [RFC] relaxed barrier semantics Christoph Hellwig
@ 2010-07-29 11:16                         ` Jan Kara
  2010-07-29 13:00                         ` extfs reliability Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Jan Kara @ 2010-07-29 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Thu 29-07-10 10:31:42, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 09:44:31PM -0400, Ted Ts'o wrote:
> > Define "are safe" --- what interface are we planning on using for the
> > non-draining barrier?  At least for ext3, when we write the commit
> > record using set_buffer_ordered(bh), it assumes that this will do a
> > flush of all previous writes and that the commit will hit the disk
> > before any subsequent writes are sent to the disk.  So turning the
> > write of a buffer head marked with set_buffer_ordered() into a FUA
> > write would _not_ be safe for ext3.
> 
> Please be careful with your wording.  Do you really mean
> "all previous writes" or "all previous writes that were completed"?
> 
> My reading of the ext3/jbd code is that we explicitly wait on I/O
> completion of dependent writes, and only require those to actually be
> stable by issuing a flush.   If that weren't the case, the default ext3
> barriers-off behaviour would not only be dangerous on devices with
> volatile write caches, but also on devices that do not have them,
> which, in addition to the reading of the code, is not what we've seen
> in actual power-fail testing, where ext3 does well as long as there
> is no volatile write cache.
  Yes, ext3 waits for all buffers it needs before writing the commit
block with the ordered flag to disk. So a preflush + FUA write of the
commit block is OK for ext3. Note: we really rely on the commit block
being on disk before the transaction commit finishes, because at that
moment we allow reallocation of blocks freed by the committed
transaction. And if they are reallocated for data, they can get
overwritten as soon as they are reallocated, so we have to be sure they
are perceived as free even after journal replay.

> Anyway, the pre-flush semantics are what the relaxed barriers will
> preserve.  REQ_FUA is a separate interface, which we actually already
> have inside the block layer; we'll just need to emulate it for devices
> without the FUA bit and handle it in dm and md.
> 
> > For ext4, if we don't use journal checksums, then we have the same
> > requirements as ext3, and the same method of requesting it.  If we do
> > use journal checksums, what ext4 needs is a way of assuring that no
> > writes after the commit are reordered with respect to the disk platter
> > before the commit record --- but any of the writes before that,
> > including the commit, can be reordered, because we rely on the checksum
> > in the commit record to know at replay time whether the last commit is
> > valid or not.  We do that right now by calling blkdev_issue_flush()
> > with BLKDEV_IFL_WAIT after submitting the write of the commit block.
> 
> blkdev_issue_flush is just an empty barrier, and the current barriers
> prevent any kind of reordering.  I'd rather avoid adding a one-way
> reordering prevention.
> 
> Given that we don't appear to actually need the full reordering
> prevention even without the journal checksums, why do you have stricter
> requirements when they are enabled?
  Because Ted found out it actually improves performance - see the
message of commit 0e3d2a6313d03413d93327202a60256d1d726fdc. At that time
we thought it was because the latency of forcing the commit block to the
platter after flushing the caches is still noticeable. But maybe it's
something else.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* extfs reliability
  2010-07-29  8:31                       ` [RFC] relaxed barrier semantics Christoph Hellwig
  2010-07-29 11:16                         ` Jan Kara
@ 2010-07-29 13:00                         ` Vladislav Bolkhovitin
  2010-07-29 13:08                           ` Christoph Hellwig
                                             ` (2 more replies)
  1 sibling, 3 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-29 13:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs


Christoph Hellwig, on 07/29/2010 12:31 PM wrote:
> My reading of the ext3/jbd code we explicitly wait on I/O completion
> of dependent writes, and only require those to actually be stable
> by issueing a flush.   If that wasn't the case the default ext3
> barriers off behaviour would not only be dangerous on devices with
> volatile write caches, but also on devices that do not have them,
> which in addition to the reading of the code is not what we've seen
> in actual power fail testing, where ext3 does well as long as there
> is no volatile write cache.

Basically, it is so, but, unfortunately, not absolutely. I've just tried two tests on ext4 with iSCSI:

# uname -a
Linux ini 2.6.32-22-386 #36-Ubuntu SMP Fri Jun 4 00:27:09 UTC 2010 i686 GNU/Linux

# e2fsck -f -y /dev/sdb
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb: 49/640000 files (0.0% non-contiguous), 56496/1280000 blocks
root@ini:~# mount -t ext4 -o barrier=1 /dev/sdb /mnt
root@ini:~# cd /mnt/dbench-mod/
root@ini:/mnt/dbench-mod# ./dbench 50
50 clients started
...
<-- Pull cable
<-- After sometime a lot of warnings like:
(22002) open CLIENTS/CLIENT44/~DMTMP/COREL/CDRBARS.CFG failed for handle 4235 (Read-only file system)
(22004) open CLIENTS/CLIENT44/~DMTMP/COREL/ARTISTIC.ACL failed for handle 4236 (Read-only file system)
(22010) open CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed for handle 4237 (Read-only file system)
(22011) nb_close: handle 4237 was not open
(22014) unlink CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed (Read-only file system)
(22018) open CLIENTS/CLIENT44/~DMTMP/COREL/CORELDRW.CDT failed for handle 4238 (Read-only file system)
(22021) nb_close: handle 4218 was not open
(22032) open CLIENTS/CLIENT44/~DMTMP/COREL/GRAPHIC1.CDR failed for handle 4239 (Read-only file system)
(22050) open CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed for handle 4240 (Read-only file system)
(22051) nb_close: handle 4240 was not open
(22054) unlink CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed (Read-only file system)
(22057) nb_close: handle 4228 was not open
(22061) nb_close: handle 4182 was not open
(22065) nb_close: handle 4234 was not open
(22078) open CLIENTS/CLIENT44/~DMTMP/COREL/GRAPH1.CDR failed for handle 4242 (Read-only file system)^C^C^C^C^C^C
root@ini:/mnt/dbench-mod# ^C
root@ini:/mnt/dbench-mod# ^C
root@ini:~# umount /mnt
Segmentation fault

Kernel log:

Jul 29 19:55:35 ini kernel: [ 3044.722313] c2c28e40: 00023740 00023741 00023742 00023743  @7..A7..B7..C7..
Jul 29 19:55:35 ini kernel: [ 3044.722320] c2c28e50: 00023744 00023745 00023746 00023747  D7..E7..F7..G7..
Jul 29 19:55:35 ini kernel: [ 3044.722327] c2c28e60: 00023748 00023749 0002374a 0002374b  H7..I7..J7..K7..
Jul 29 19:55:35 ini kernel: [ 3044.722334] c2c28e70: 0002372c 00000000 00000000 00000000  ,7..............
Jul 29 19:55:35 ini kernel: [ 3044.722341] c2c28e80: 00000000 00000000 00000000 00000002  ................
Jul 29 19:55:35 ini kernel: [ 3044.722346] c2c28e90: 00000000 00000000 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722354] c2c28ea0: c2c28ea0 c2c28ea0 c307f138 c307f138  ........8...8...
Jul 29 19:55:35 ini kernel: [ 3044.722360] c2c28eb0: 0003f800 00000000 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722366] c2c28ec0: c2c28ec0 c2c28ec0 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722373] c2c28ed0: 00100100 00200200 c2c28ed8 c2c28ed8  ...... .........
Jul 29 19:55:35 ini kernel: [ 3044.722379] c2c28ee0: c2c28ee0 c2c28ee0 0000800b 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722384] c2c28ef0: 00000001 00000000 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722391] c2c28f00: 00000001 00000000 0003f800 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722398] c2c28f10: 00000002 4c51a3cc 00000000 4c51a3cc  ......QL......QL
Jul 29 19:55:35 ini kernel: [ 3044.722404] c2c28f20: 00000000 4c51a3cc 00000000 00000208  ......QL........
Jul 29 19:55:35 ini kernel: [ 3044.722410] c2c28f30: 00000000 0000000c 81800000 00000101  ................
Jul 29 19:55:35 ini kernel: [ 3044.722416] c2c28f40: 00000001 00000000 c2c28f48 c2c28f48  ........H...H...
Jul 29 19:55:35 ini kernel: [ 3044.722422] c2c28f50: 00000000 00000000 00000000 c2c28f5c  ............\...
Jul 29 19:55:35 ini kernel: [ 3044.722428] c2c28f60: c2c28f5c c0593440 c05933c0 ca228a00  \...@4Y..3Y...".
Jul 29 19:55:35 ini kernel: [ 3044.722434] c2c28f70: 00000000 c2c28f78 c2c28ec8 00000000  ....x...........
Jul 29 19:55:35 ini kernel: [ 3044.722440] c2c28f80: 00000020 00000000 00000505 00000000   ...............
Jul 29 19:55:35 ini kernel: [ 3044.722446] c2c28f90: 00000000 00010001 c2c28f98 c2c28f98  ................
Jul 29 19:55:35 ini kernel: [ 3044.722451] c2c28fa0: 00000000 00000000 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722457] c2c28fb0: c0593680 000200da cdcc104c 00000202  .6Y.....L.......
Jul 29 19:55:35 ini kernel: [ 3044.722463] c2c28fc0: c2c28fc0 c2c28fc0 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722469] c2c28fd0: 00000000 c2c28fd4 c2c28fd4 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722475] c2c28fe0: 0623225b 00000000 00000000 c2c28fec  ["#.............
Jul 29 19:55:35 ini kernel: [ 3044.722481] c2c28ff0: c2c28fec 00000001 00000000 c2c28ffc  ................
Jul 29 19:55:35 ini kernel: [ 3044.722487] c2c29000: c2c28ffc 00000000 00000040 00000000  ........@.......
Jul 29 19:55:35 ini kernel: [ 3044.722493] c2c29010: 00000000 00000000 00000000 ffffffff  ................
Jul 29 19:55:35 ini kernel: [ 3044.722499] c2c29020: ffffffff 00000000 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722505] c2c29030: c2c29030 c2c29030 c2c28ec8 00000000  0...0...........
Jul 29 19:55:35 ini kernel: [ 3044.722510] c2c29040: 00000000 00000000 00000000 00000000  ................
Jul 29 19:55:35 ini kernel: [ 3044.722516] c2c29050: 00000000 4c51a3d8 00000000 c2c2905c  ......QL....\...
Jul 29 19:55:35 ini kernel: [ 3044.722522] c2c29060: c2c2905c 00000101 ffffffff 00000000  \...............
Jul 29 19:55:35 ini kernel: [ 3044.722528] c2c29070: 00000000 00000000 00000000 00000101  ................
Jul 29 19:55:35 ini kernel: [ 3044.722534] c2c29080: 00000000 00000000 c2c29088 c2c29088  ................
Jul 29 19:55:35 ini kernel: [ 3044.722540] c2c29090: 00000000 00005be2 00005be2  .....[...[..
Jul 29 19:55:35 ini kernel: [ 3044.722546] Pid: 1299, comm: umount Not tainted 2.6.32-22-386 #36-Ubuntu
Jul 29 19:55:35 ini kernel: [ 3044.722550] Call Trace:
Jul 29 19:55:35 ini kernel: [ 3044.722567]  [<c0291731>] ext4_destroy_inode+0x91/0xa0
Jul 29 19:55:35 ini kernel: [ 3044.722577]  [<c020ecb4>] destroy_inode+0x24/0x40
Jul 29 19:55:35 ini kernel: [ 3044.722583]  [<c020f11e>] dispose_list+0x8e/0x100
Jul 29 19:55:35 ini kernel: [ 3044.722588]  [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 19:55:35 ini kernel: [ 3044.722598]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 19:55:35 ini kernel: [ 3044.722606]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 19:55:35 ini kernel: [ 3044.722612]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 19:55:35 ini kernel: [ 3044.722618]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 19:55:35 ini kernel: [ 3044.722625]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 19:55:35 ini kernel: [ 3044.722631]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 19:55:35 ini kernel: [ 3044.722636]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 19:55:35 ini kernel: [ 3044.722643]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 19:55:35 ini kernel: [ 3044.731043] sd 6:0:0:0: [sdb] Unhandled error code
Jul 29 19:55:35 ini kernel: [ 3044.731049] sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Jul 29 19:55:35 ini kernel: [ 3044.731056] sd 6:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 00 00 00 01 00
Jul 29 19:55:35 ini kernel: [ 3044.743469] __ratelimit: 37 callbacks suppressed
Jul 29 19:55:35 ini kernel: [ 3044.755695] lost page write due to I/O error on sdb
Jul 29 19:55:36 ini kernel: [ 3044.823044] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp edac_core lp parport qla2xxx ohci1394 scsi_transport_fc r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy
Jul 29 19:55:36 ini kernel: [ 3044.823044] 
Jul 29 19:55:36 ini kernel: [ 3044.823044] Pid: 1299, comm: umount Not tainted (2.6.32-22-386 #36-Ubuntu) X5DPA
Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP: 0060:[<c0293c2a>] EFLAGS: 00010206 CPU: 0
Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP is at ext4_put_super+0x2ea/0x350
Jul 29 19:55:36 ini kernel: [ 3044.823044] EAX: c2c28ea8 EBX: c307f000 ECX: ffffff52 EDX: c307f138
Jul 29 19:55:36 ini kernel: [ 3044.823044] ESI: ca228a00 EDI: c307f0fc EBP: cec6ff30 ESP: cec6fefc
Jul 29 19:55:36 ini kernel: [ 3044.823044]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Jul 29 19:55:36 ini kernel: [ 3044.823044]  c06bb054 ca228b64 0000800b c2c28ec8 00008180 00000001 00000000 c307f138
Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> c307f138 c307f138 ca228a00 c0593c80 c023b310 cec6ff48 c01fc60d ca228ac0
Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> cec6ff44 cf328400 00000003 cec6ff58 c01fc6ca ca228a00 c0759d80 cec6ff6c
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fc60d>] ? generic_shutdown_super+0x4d/0xe0
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fc6ca>] ? kill_block_super+0x2a/0x50
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fd4e4>] ? deactivate_super+0x64/0x90
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c021282f>] ? mntput_no_expire+0x8f/0xe0
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c0212e47>] ? sys_umount+0x47/0xa0
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c0212ebe>] ? sys_oldumount+0x1e/0x20
Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01033ec>] ? syscall_call+0x7/0xb
Jul 29 19:55:36 ini kernel: [ 3045.299442] ---[ end trace 426db011a0289db3 ]---
Jul 29 19:55:36 ini kernel: [ 3045.310429] ------------[ cut here ]------------
Jul 29 19:55:36 ini kernel: [ 3045.321086] WARNING: at /build/buildd/linux-2.6.32/kernel/exit.c:895 do_exit+0x2f9/0x300()
Jul 29 19:55:36 ini kernel: [ 3045.342153] Hardware name: X5DPA
Jul 29 19:55:36 ini kernel: [ 3045.352697] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp edac_core lp parport qla2xxx ohci1394 scsi_transport_fc r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy
Jul 29 19:55:36 ini kernel: [ 3045.422317] Pid: 1299, comm: umount Tainted: G      D    2.6.32-22-386 #36-Ubuntu
Jul 29 19:55:36 ini kernel: [ 3045.444158] Call Trace:
Jul 29 19:55:36 ini kernel: [ 3045.454755]  [<c01487a2>] warn_slowpath_common+0x72/0xa0
Jul 29 19:55:36 ini kernel: [ 3045.465152]  [<c014ca49>] ? do_exit+0x2f9/0x300
Jul 29 19:55:36 ini kernel: [ 3045.475281]  [<c014ca49>] ? do_exit+0x2f9/0x300
Jul 29 19:55:36 ini kernel: [ 3045.485296]  [<c01487ea>] warn_slowpath_null+0x1a/0x20
Jul 29 19:55:36 ini kernel: [ 3045.495432]  [<c014ca49>] do_exit+0x2f9/0x300
Jul 29 19:55:36 ini kernel: [ 3045.505640]  [<c014856f>] ? print_oops_end_marker+0x2f/0x40
Jul 29 19:55:36 ini kernel: [ 3045.516012]  [<c0579fc5>] oops_end+0x95/0xd0
Jul 29 19:55:36 ini kernel: [ 3045.526394]  [<c01068a4>] die+0x54/0x80
Jul 29 19:55:36 ini kernel: [ 3045.536808]  [<c0579716>] do_trap+0x96/0xc0
Jul 29 19:55:36 ini kernel: [ 3045.547268]  [<c0104980>] ? do_invalid_op+0x0/0xa0
Jul 29 19:55:36 ini kernel: [ 3045.557756]  [<c0104a0b>] do_invalid_op+0x8b/0xa0
Jul 29 19:55:36 ini kernel: [ 3045.568296]  [<c0293c2a>] ? ext4_put_super+0x2ea/0x350
Jul 29 19:55:36 ini kernel: [ 3045.578561]  [<c0149291>] ? vprintk+0x191/0x3f0
Jul 29 19:55:36 ini kernel: [ 3045.588708]  [<c0579493>] error_code+0x73/0x80
Jul 29 19:55:36 ini kernel: [ 3045.598076]  [<c0293c2a>] ? ext4_put_super+0x2ea/0x350
Jul 29 19:55:36 ini kernel: [ 3045.607381]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 19:55:36 ini kernel: [ 3045.616499]  [<c01fc60d>] generic_shutdown_super+0x4d/0xe0
Jul 29 19:55:36 ini kernel: [ 3045.625688]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 19:55:36 ini kernel: [ 3045.634777]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 19:55:36 ini kernel: [ 3045.643744]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 19:55:36 ini kernel: [ 3045.652782]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 19:55:36 ini kernel: [ 3045.661514]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 19:55:36 ini kernel: [ 3045.670139]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 19:55:36 ini kernel: [ 3045.678566] ---[ end trace 426db011a0289db4 ]---

Another test. Everything is as before, except that I did not pull the cable but instead deleted the corresponding LUN on the target, so all commands from that moment on failed. Then, on umount, the system rebooted. Kernel log:

Jul 29 20:20:42 ini kernel: [ 1320.251393] umount        D 00478e55     0  1234    924 0x00000000
Jul 29 20:20:42 ini kernel: [ 1320.251403]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
Jul 29 20:20:42 ini kernel: [ 1320.251415]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
Jul 29 20:20:42 ini kernel: [ 1320.251425]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
Jul 29 20:20:42 ini kernel: [ 1320.251436] Call Trace:
Jul 29 20:20:42 ini kernel: [ 1320.251452]  [<c057745a>] io_schedule+0x3a/0x60
Jul 29 20:20:42 ini kernel: [ 1320.251463]  [<c01bd95d>] sync_page+0x3d/0x50
Jul 29 20:20:42 ini kernel: [ 1320.251470]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
Jul 29 20:20:42 ini kernel: [ 1320.251476]  [<c01bd920>] ? sync_page+0x0/0x50
Jul 29 20:20:42 ini kernel: [ 1320.251483]  [<c01bd8ee>] __lock_page+0x7e/0x90
Jul 29 20:20:42 ini kernel: [ 1320.251491]  [<c01624d0>] ? wake_bit_function+0x0/0x50
Jul 29 20:20:42 ini kernel: [ 1320.251499]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
Jul 29 20:20:42 ini kernel: [ 1320.251510]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
Jul 29 20:20:42 ini kernel: [ 1320.251517]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
Jul 29 20:20:42 ini kernel: [ 1320.251523]  [<c020f15c>] dispose_list+0xcc/0x100
Jul 29 20:20:42 ini kernel: [ 1320.251529]  [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 20:20:42 ini kernel: [ 1320.251538]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 20:20:42 ini kernel: [ 1320.251546]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 20:20:42 ini kernel: [ 1320.251553]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 20:20:42 ini kernel: [ 1320.251559]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 20:20:42 ini kernel: [ 1320.251566]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 20:20:42 ini kernel: [ 1320.251573]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 20:20:42 ini kernel: [ 1320.251579]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 20:20:42 ini kernel: [ 1320.251586]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:22:42 ini kernel: [ 1440.285910] umount        D 00478e55     0  1234    924 0x00000004
Jul 29 20:22:42 ini kernel: [ 1440.285919]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
Jul 29 20:22:42 ini kernel: [ 1440.285931]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
Jul 29 20:22:42 ini kernel: [ 1440.285942]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
Jul 29 20:22:42 ini kernel: [ 1440.285953] Call Trace:
Jul 29 20:22:42 ini kernel: [ 1440.285969]  [<c057745a>] io_schedule+0x3a/0x60
Jul 29 20:22:42 ini kernel: [ 1440.285980]  [<c01bd95d>] sync_page+0x3d/0x50
Jul 29 20:22:42 ini kernel: [ 1440.285987]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
Jul 29 20:22:42 ini kernel: [ 1440.285994]  [<c01bd920>] ? sync_page+0x0/0x50
Jul 29 20:22:42 ini kernel: [ 1440.286001]  [<c01bd8ee>] __lock_page+0x7e/0x90
Jul 29 20:22:42 ini kernel: [ 1440.286010]  [<c01624d0>] ? wake_bit_function+0x0/0x50
Jul 29 20:22:42 ini kernel: [ 1440.286018]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
Jul 29 20:22:42 ini kernel: [ 1440.286028]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
Jul 29 20:22:42 ini kernel: [ 1440.286035]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
Jul 29 20:22:42 ini kernel: [ 1440.286041]  [<c020f15c>] dispose_list+0xcc/0x100
Jul 29 20:22:42 ini kernel: [ 1440.286047]  [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 20:22:42 ini kernel: [ 1440.286056]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 20:22:42 ini kernel: [ 1440.286064]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 20:22:42 ini kernel: [ 1440.286071]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 20:22:42 ini kernel: [ 1440.286077]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 20:22:42 ini kernel: [ 1440.286084]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 20:22:42 ini kernel: [ 1440.286091]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 20:22:42 ini kernel: [ 1440.286097]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 20:22:42 ini kernel: [ 1440.286104]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:24:42 ini kernel: [ 1560.321709] umount        D 00478e55     0  1234    924 0x00000004
Jul 29 20:24:42 ini kernel: [ 1560.321718]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
Jul 29 20:24:42 ini kernel: [ 1560.321730]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
Jul 29 20:24:42 ini kernel: [ 1560.321741]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
Jul 29 20:24:42 ini kernel: [ 1560.321751] Call Trace:
Jul 29 20:24:42 ini kernel: [ 1560.321767]  [<c057745a>] io_schedule+0x3a/0x60
Jul 29 20:24:42 ini kernel: [ 1560.321777]  [<c01bd95d>] sync_page+0x3d/0x50
Jul 29 20:24:42 ini kernel: [ 1560.321784]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
Jul 29 20:24:42 ini kernel: [ 1560.321791]  [<c01bd920>] ? sync_page+0x0/0x50
Jul 29 20:24:42 ini kernel: [ 1560.321797]  [<c01bd8ee>] __lock_page+0x7e/0x90
Jul 29 20:24:42 ini kernel: [ 1560.321805]  [<c01624d0>] ? wake_bit_function+0x0/0x50
Jul 29 20:24:42 ini kernel: [ 1560.321814]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
Jul 29 20:24:42 ini kernel: [ 1560.321824]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
Jul 29 20:24:42 ini kernel: [ 1560.321831]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
Jul 29 20:24:42 ini kernel: [ 1560.321837]  [<c020f15c>] dispose_list+0xcc/0x100
Jul 29 20:24:42 ini kernel: [ 1560.321845]  [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 20:24:42 ini kernel: [ 1560.321855]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 20:24:42 ini kernel: [ 1560.321864]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 20:24:42 ini kernel: [ 1560.321870]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 20:24:42 ini kernel: [ 1560.321877]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 20:24:42 ini kernel: [ 1560.321885]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 20:24:42 ini kernel: [ 1560.321892]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 20:24:42 ini kernel: [ 1560.321898]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 20:24:42 ini kernel: [ 1560.321905]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:24:42 ini kernel: [ 1560.358795] sync          D 0004beb0     0  1265   1255 0x00000004
Jul 29 20:24:42 ini kernel: [ 1560.358803]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
Jul 29 20:24:42 ini kernel: [ 1560.358815]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
Jul 29 20:24:42 ini kernel: [ 1560.358826]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
Jul 29 20:24:42 ini kernel: [ 1560.358837] Call Trace:
Jul 29 20:24:42 ini kernel: [ 1560.358845]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
Jul 29 20:24:42 ini kernel: [ 1560.358852]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
Jul 29 20:24:42 ini kernel: [ 1560.358858]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
Jul 29 20:24:42 ini kernel: [ 1560.358863]  [<c057850c>] ? down_read+0x1c/0x20
Jul 29 20:24:42 ini kernel: [ 1560.358870]  [<c021cb6d>] sync_filesystems+0xbd/0x110
Jul 29 20:24:42 ini kernel: [ 1560.358876]  [<c021cc16>] sys_sync+0x16/0x40
Jul 29 20:24:42 ini kernel: [ 1560.358881]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:26:42 ini kernel: [ 1680.392190] umount        D 00478e55     0  1234    924 0x00000004
Jul 29 20:26:42 ini kernel: [ 1680.392200]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
Jul 29 20:26:42 ini kernel: [ 1680.392212]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
Jul 29 20:26:42 ini kernel: [ 1680.392223]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
Jul 29 20:26:42 ini kernel: [ 1680.392233] Call Trace:
Jul 29 20:26:42 ini kernel: [ 1680.392250]  [<c057745a>] io_schedule+0x3a/0x60
Jul 29 20:26:42 ini kernel: [ 1680.392260]  [<c01bd95d>] sync_page+0x3d/0x50
Jul 29 20:26:42 ini kernel: [ 1680.392267]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
Jul 29 20:26:42 ini kernel: [ 1680.392274]  [<c01bd920>] ? sync_page+0x0/0x50
Jul 29 20:26:42 ini kernel: [ 1680.392280]  [<c01bd8ee>] __lock_page+0x7e/0x90
Jul 29 20:26:42 ini kernel: [ 1680.392289]  [<c01624d0>] ? wake_bit_function+0x0/0x50
Jul 29 20:26:42 ini kernel: [ 1680.392298]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
Jul 29 20:26:42 ini kernel: [ 1680.392308]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
Jul 29 20:26:42 ini kernel: [ 1680.392314]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
Jul 29 20:26:42 ini kernel: [ 1680.392321]  [<c020f15c>] dispose_list+0xcc/0x100
Jul 29 20:26:42 ini kernel: [ 1680.392327]  [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 20:26:42 ini kernel: [ 1680.392336]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 20:26:42 ini kernel: [ 1680.392344]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 20:26:42 ini kernel: [ 1680.392351]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 20:26:42 ini kernel: [ 1680.392357]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 20:26:42 ini kernel: [ 1680.392364]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 20:26:42 ini kernel: [ 1680.392371]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 20:26:42 ini kernel: [ 1680.392378]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 20:26:42 ini kernel: [ 1680.392384]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:26:42 ini kernel: [ 1680.427874] sync          D 0004beb0     0  1265   1255 0x00000004
Jul 29 20:26:42 ini kernel: [ 1680.427883]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
Jul 29 20:26:42 ini kernel: [ 1680.427894]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
Jul 29 20:26:42 ini kernel: [ 1680.427904]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
Jul 29 20:26:42 ini kernel: [ 1680.427915] Call Trace:
Jul 29 20:26:42 ini kernel: [ 1680.427922]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
Jul 29 20:26:42 ini kernel: [ 1680.427929]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
Jul 29 20:26:42 ini kernel: [ 1680.427935]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
Jul 29 20:26:42 ini kernel: [ 1680.427940]  [<c057850c>] ? down_read+0x1c/0x20
Jul 29 20:26:42 ini kernel: [ 1680.427947]  [<c021cb6d>] sync_filesystems+0xbd/0x110
Jul 29 20:26:42 ini kernel: [ 1680.427953]  [<c021cc16>] sys_sync+0x16/0x40
Jul 29 20:26:42 ini kernel: [ 1680.427958]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:28:42 ini kernel: [ 1800.458856] umount        D 00478e55     0  1234    924 0x00000004
Jul 29 20:28:42 ini kernel: [ 1800.458866]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
Jul 29 20:28:42 ini kernel: [ 1800.458877]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
Jul 29 20:28:42 ini kernel: [ 1800.458888]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
Jul 29 20:28:42 ini kernel: [ 1800.458899] Call Trace:
Jul 29 20:28:42 ini kernel: [ 1800.458915]  [<c057745a>] io_schedule+0x3a/0x60
Jul 29 20:28:42 ini kernel: [ 1800.458925]  [<c01bd95d>] sync_page+0x3d/0x50
Jul 29 20:28:42 ini kernel: [ 1800.458932]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
Jul 29 20:28:42 ini kernel: [ 1800.458938]  [<c01bd920>] ? sync_page+0x0/0x50
Jul 29 20:28:42 ini kernel: [ 1800.458945]  [<c01bd8ee>] __lock_page+0x7e/0x90
Jul 29 20:28:42 ini kernel: [ 1800.458953]  [<c01624d0>] ? wake_bit_function+0x0/0x50
Jul 29 20:28:42 ini kernel: [ 1800.458961]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
Jul 29 20:28:42 ini kernel: [ 1800.458971]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
Jul 29 20:28:42 ini kernel: [ 1800.458978]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
Jul 29 20:28:42 ini kernel: [ 1800.458984]  [<c020f15c>] dispose_list+0xcc/0x100
Jul 29 20:28:42 ini kernel: [ 1800.458991]  [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 20:28:42 ini kernel: [ 1800.458999]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 20:28:42 ini kernel: [ 1800.459007]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 20:28:42 ini kernel: [ 1800.459013]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 20:28:42 ini kernel: [ 1800.459020]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 20:28:42 ini kernel: [ 1800.459027]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 20:28:42 ini kernel: [ 1800.459033]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 20:28:42 ini kernel: [ 1800.459039]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 20:28:42 ini kernel: [ 1800.459046]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:28:42 ini kernel: [ 1800.493768] sync          D 0004beb0     0  1265   1255 0x00000004
Jul 29 20:28:42 ini kernel: [ 1800.493777]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
Jul 29 20:28:42 ini kernel: [ 1800.493788]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
Jul 29 20:28:42 ini kernel: [ 1800.493798]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
Jul 29 20:28:42 ini kernel: [ 1800.493809] Call Trace:
Jul 29 20:28:42 ini kernel: [ 1800.493816]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
Jul 29 20:28:42 ini kernel: [ 1800.493823]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
Jul 29 20:28:42 ini kernel: [ 1800.493828]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
Jul 29 20:28:42 ini kernel: [ 1800.493834]  [<c057850c>] ? down_read+0x1c/0x20
Jul 29 20:28:42 ini kernel: [ 1800.493841]  [<c021cb6d>] sync_filesystems+0xbd/0x110
Jul 29 20:28:42 ini kernel: [ 1800.493847]  [<c021cc16>] sys_sync+0x16/0x40
Jul 29 20:28:42 ini kernel: [ 1800.493853]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:30:42 ini kernel: [ 1920.526729] umount        D 00478e55     0  1234    924 0x00000004
Jul 29 20:30:42 ini kernel: [ 1920.526739]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
Jul 29 20:30:42 ini kernel: [ 1920.526750]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
Jul 29 20:30:42 ini kernel: [ 1920.526761]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
Jul 29 20:30:42 ini kernel: [ 1920.526772] Call Trace:
Jul 29 20:30:42 ini kernel: [ 1920.526788]  [<c057745a>] io_schedule+0x3a/0x60
Jul 29 20:30:42 ini kernel: [ 1920.526798]  [<c01bd95d>] sync_page+0x3d/0x50
Jul 29 20:30:42 ini kernel: [ 1920.526805]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
Jul 29 20:30:42 ini kernel: [ 1920.526813]  [<c01bd920>] ? sync_page+0x0/0x50
Jul 29 20:30:42 ini kernel: [ 1920.526819]  [<c01bd8ee>] __lock_page+0x7e/0x90
Jul 29 20:30:42 ini kernel: [ 1920.526827]  [<c01624d0>] ? wake_bit_function+0x0/0x50
Jul 29 20:30:42 ini kernel: [ 1920.526836]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
Jul 29 20:30:42 ini kernel: [ 1920.526845]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
Jul 29 20:30:42 ini kernel: [ 1920.526853]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
Jul 29 20:30:42 ini kernel: [ 1920.526859]  [<c020f15c>] dispose_list+0xcc/0x100
Jul 29 20:30:42 ini kernel: [ 1920.526866]  [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 20:30:42 ini kernel: [ 1920.526874]  [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 20:30:42 ini kernel: [ 1920.526882]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 20:30:42 ini kernel: [ 1920.526889]  [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 20:30:42 ini kernel: [ 1920.526895]  [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 20:30:42 ini kernel: [ 1920.526902]  [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 20:30:42 ini kernel: [ 1920.526908]  [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 20:30:42 ini kernel: [ 1920.526915]  [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 20:30:42 ini kernel: [ 1920.526922]  [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:30:42 ini kernel: [ 1920.563739] sync          D 0004beb0     0  1265   1255 0x00000004
Jul 29 20:30:42 ini kernel: [ 1920.563747]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
Jul 29 20:30:42 ini kernel: [ 1920.563758]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
Jul 29 20:30:42 ini kernel: [ 1920.563768]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
Jul 29 20:30:42 ini kernel: [ 1920.563779] Call Trace:
Jul 29 20:30:42 ini kernel: [ 1920.563787]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
Jul 29 20:30:42 ini kernel: [ 1920.563793]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
Jul 29 20:30:42 ini kernel: [ 1920.563799]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
Jul 29 20:30:42 ini kernel: [ 1920.563804]  [<c057850c>] ? down_read+0x1c/0x20
Jul 29 20:30:42 ini kernel: [ 1920.563812]  [<c021cb6d>] sync_filesystems+0xbd/0x110
Jul 29 20:30:42 ini kernel: [ 1920.563817]  [<c021cc16>] sys_sync+0x16/0x40
Jul 29 20:30:42 ini kernel: [ 1920.563823]  [<c01033ec>] syscall_call+0x7/0xb
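
The periodic backtraces above look like the kernel's hung task detector
firing at its default 120 second interval. For reference, a hedged sketch
of requesting such dumps on demand, assuming SysRq is enabled:

echo 1 > /proc/sys/kernel/sysrq      # enable all SysRq functions
echo w > /proc/sysrq-trigger         # dump blocked (D-state) tasks
echo t > /proc/sysrq-trigger         # dump all tasks (very verbose)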

Although in both cases the FS remained consistent:

root@ini:~# mount -t ext4 /dev/sdb /mnt
root@ini:~# umount /mnt
root@ini:~# e2fsck -f -y /dev/sdb
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdb: 4194/640000 files (74.2% non-contiguous), 334774/1280000 blocks

You can find full kernel logs starting from the iSCSI load in the attachments.

I already reported such issues some time ago, but my reports were not very well received, so I gave up. Anyway, anybody can easily run my tests at any time. They need no special hardware, just two Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The results are generic for other transports as well; you can see there is nothing iSCSI-specific in the traces.
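
For reference, a minimal reproduction sketch, assuming open-iscsi on the
initiator; the target-side LUN export is tool-specific (e.g. SCST or IET),
and /dev/sdb is illustrative:

# initiator (test box): discover and log in to the target
iscsiadm -m discovery -t sendtargets -p <target-ip>
iscsiadm -m node --login
mount -t ext4 -o barrier=1 /dev/sdb /mnt
cd /mnt/dbench-mod && ./dbench 50
# while dbench runs: pull the cable, or delete the LUN on the target
umount /mnt                  # this is where the oops / hang shows up
# after recovering the device:
e2fsck -f -y /dev/sdb        # verify on-disk consistency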

Vlad

[-- Attachment #2: m.bz2 --]
[-- Type: application/x-bzip, Size: 24364 bytes --]

[-- Attachment #3: m1.bz2 --]
[-- Type: application/x-bzip, Size: 45322 bytes --]

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 13:00                         ` extfs reliability Vladislav Bolkhovitin
@ 2010-07-29 13:08                           ` Christoph Hellwig
  2010-07-29 14:12                             ` Vladislav Bolkhovitin
  2010-07-29 14:26                           ` Jan Kara
  2010-07-29 18:58                           ` Ted Ts'o
  2 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29 13:08 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke, linux-kernel,
	kernel-bugs

On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote:
> You can find full kernel logs starting from the iSCSI load in the attachments.
> 
> I already reported such issues some time ago, but my reports were not very well received, so I gave up. Anyway, anybody can easily run my tests at any time. They need no special hardware, just two Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The results are generic for other transports as well; you can see there is nothing iSCSI-specific in the traces.

I was only talking about ext3.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 13:08                           ` Christoph Hellwig
@ 2010-07-29 14:12                             ` Vladislav Bolkhovitin
  2010-07-29 14:34                               ` Jan Kara
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-29 14:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs


Christoph Hellwig, on 07/29/2010 05:08 PM wrote:
> On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote:
>> You can find full kernel logs starting from the iSCSI load in the attachments.
>>
>> I already reported such issues some time ago, but my reports were not very well received, so I gave up. Anyway, anybody can easily run my tests at any time. They need no special hardware, just two Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The results are generic for other transports as well; you can see there is nothing iSCSI-specific in the traces.
> 
> I was only talking about ext3.

Yes, ext3 is a lot more reliable now. The only way I was able to confuse it was:

...
(2197) nb_write: handle 4272 was not open size=65475 ofs=0
(2199) nb_write: handle 4272 was not open size=65475 ofs=65534
(2201) nb_write: handle 4272 was not open size=65475 ofs=131068
(2203) nb_write: handle 4272 was not open size=65475 ofs=196602
(2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
^C
root@ini:/mnt/dbench-mod# ^C
root@ini:/mnt/dbench-mod# ^C
root@ini:/mnt/dbench-mod# cd
root@ini:~# umount /mnt

<- recover device

root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"

root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
root@ini:~#

Kernel log:

Jul 29 22:05:54 ini kernel: [ 2927.832893] kjournald starting.  Commit interval 5 seconds
Jul 29 22:05:54 ini kernel: [ 2927.833430] EXT3 FS on sdb, internal journal
Jul 29 22:05:54 ini kernel: [ 2927.833499] EXT3-fs: sdb: 1 orphan inode deleted
Jul 29 22:05:54 ini kernel: [ 2927.833503] EXT3-fs: recovery complete.
Jul 29 22:05:54 ini kernel: [ 2927.838122] EXT3-fs: mounted filesystem with ordered data mode.

But it still remained consistent:

root@ini:~# umount /mnt
root@ini:~# e2fsck -f -y /dev/sdb
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb: 3504/320000 files (21.1% non-contiguous), 307034/1280000 blocks
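
As a hedged aside, the journal state behind such a failed-then-successful
recovery can be inspected with standard e2fsprogs tools (device name
illustrative):

dumpe2fs -h /dev/sdb | grep -i feature    # look for has_journal/needs_recovery
debugfs -R 'logdump' /dev/sdb | head -50  # peek at the journal contents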

Good progress since my original reports for kernels around 2.6.27!

Vlad


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 13:00                         ` extfs reliability Vladislav Bolkhovitin
  2010-07-29 13:08                           ` Christoph Hellwig
@ 2010-07-29 14:26                           ` Jan Kara
  2010-07-29 18:20                             ` Vladislav Bolkhovitin
  2010-07-29 18:58                           ` Ted Ts'o
  2 siblings, 1 reply; 155+ messages in thread
From: Jan Kara @ 2010-07-29 14:26 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke, linux-kernel,
	kernel-bugs

On Thu 29-07-10 17:00:10, Vladislav Bolkhovitin wrote:
> Christoph Hellwig, on 07/29/2010 12:31 PM wrote:
> > My reading of the ext3/jbd code we explicitly wait on I/O completion
> > of dependent writes, and only require those to actually be stable
> > by issueing a flush.   If that wasn't the case the default ext3
> > barriers off behaviour would not only be dangerous on devices with
> > volatile write caches, but also on devices that do not have them,
> > which in addition to the reading of the code is not what we've seen
> > in actual power fail testing, where ext3 does well as long as there
> > is no volatile write cache.
> 
> Basically, it is so, but, unfortunately, not absolutely. I've just tried
> 2 tests on ext4 with iSCSI:
> 
> # uname -a
> Linux ini 2.6.32-22-386 #36-Ubuntu SMP Fri Jun 4 00:27:09 UTC 2010 i686 GNU/Linux
> 
> # e2fsck -f -y /dev/sdb
> e2fsck 1.41.11 (14-Mar-2010)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/sdb: 49/640000 files (0.0% non-contiguous), 56496/1280000 blocks
> root@ini:~# mount -t ext4 -o barrier=1 /dev/sdb /mnt
> root@ini:~# cd /mnt/dbench-mod/
> root@ini:/mnt/dbench-mod# ./dbench 50
> 50 clients started
> ...
> <-- Pull cable
> <-- After sometime a lot of warnings like:
> (22002) open CLIENTS/CLIENT44/~DMTMP/COREL/CDRBARS.CFG failed for handle 4235 (Read-only file system)
> (22004) open CLIENTS/CLIENT44/~DMTMP/COREL/ARTISTIC.ACL failed for handle 4236 (Read-only file system)
  ...
  These are OK. You pulled a cable and now you start getting EIO from the
kernel.

> root@ini:/mnt/dbench-mod# ^C
> root@ini:/mnt/dbench-mod# ^C
> root@ini:~# umount /mnt
> Segmentation fault
  This isn't OK of course ;)

> Kernel log:
> 
> Jul 29 19:55:35 ini kernel: [ 3044.722313] c2c28e40: 00023740 00023741 00023742 00023743  @7..A7..B7..C7..
> Jul 29 19:55:35 ini kernel: [ 3044.722320] c2c28e50: 00023744 00023745 00023746 00023747  D7..E7..F7..G7..
> Jul 29 19:55:35 ini kernel: [ 3044.722327] c2c28e60: 00023748 00023749 0002374a 0002374b  H7..I7..J7..K7..
> Jul 29 19:55:35 ini kernel: [ 3044.722334] c2c28e70: 0002372c 00000000 00000000 00000000  ,7..............
> Jul 29 19:55:35 ini kernel: [ 3044.722341] c2c28e80: 00000000 00000000 00000000 00000002  ................
...
Sadly, these messages seem to have overwritten the beginning of the
message below. Hmm, but maybe it's just a warning about an inode still
being on the orphan list, since the next oops still shows an untainted
kernel.

> Jul 29 19:55:35 ini kernel: [ 3044.722546] Pid: 1299, comm: umount Not tainted 2.6.32-22-386 #36-Ubuntu
> Jul 29 19:55:35 ini kernel: [ 3044.722550] Call Trace:
> Jul 29 19:55:35 ini kernel: [ 3044.722567]  [<c0291731>] ext4_destroy_inode+0x91/0xa0
> Jul 29 19:55:35 ini kernel: [ 3044.722577]  [<c020ecb4>] destroy_inode+0x24/0x40
> Jul 29 19:55:35 ini kernel: [ 3044.722583]  [<c020f11e>] dispose_list+0x8e/0x100
> Jul 29 19:55:35 ini kernel: [ 3044.722588]  [<c020f534>] invalidate_inodes+0xf4/0x120
> Jul 29 19:55:35 ini kernel: [ 3044.722598]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 19:55:35 ini kernel: [ 3044.722606]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
> Jul 29 19:55:35 ini kernel: [ 3044.722612]  [<c01fc6ca>] kill_block_super+0x2a/0x50
> Jul 29 19:55:35 ini kernel: [ 3044.722618]  [<c01fd4e4>] deactivate_super+0x64/0x90
> Jul 29 19:55:35 ini kernel: [ 3044.722625]  [<c021282f>] mntput_no_expire+0x8f/0xe0
> Jul 29 19:55:35 ini kernel: [ 3044.722631]  [<c0212e47>] sys_umount+0x47/0xa0
> Jul 29 19:55:35 ini kernel: [ 3044.722636]  [<c0212ebe>] sys_oldumount+0x1e/0x20
> Jul 29 19:55:35 ini kernel: [ 3044.722643]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 19:55:35 ini kernel: [ 3044.731043] sd 6:0:0:0: [sdb] Unhandled error code
> Jul 29 19:55:35 ini kernel: [ 3044.731049] sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> Jul 29 19:55:35 ini kernel: [ 3044.731056] sd 6:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 00 00 00 01 00
> Jul 29 19:55:35 ini kernel: [ 3044.743469] __ratelimit: 37 callbacks suppressed
> Jul 29 19:55:35 ini kernel: [ 3044.755695] lost page write due to I/O error on sdb
> Jul 29 19:55:36 ini kernel: [ 3044.823044] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp edac_core lp parport qla2xxx ohci1394 scsi_transport_fc r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy
  So here is probably where the real oops starts. But sadly we are missing
the beginning as well. Can you send me a disassembly of your ext4_put_super?
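
For reference, a hedged sketch of obtaining such a disassembly; ext4
appears to be built into this kernel (it is absent from the module list
above), so the vmlinux route applies, and paths are illustrative:

# with a vmlinux that carries debug symbols (e.g. a -dbgsym package):
gdb -batch -ex 'disassemble ext4_put_super' vmlinux
# or, on kernels where ext4 is modular:
objdump -d /lib/modules/$(uname -r)/kernel/fs/ext4/ext4.ko | \
    sed -n '/<ext4_put_super>:/,/^$/p'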

> Jul 29 19:55:36 ini kernel: [ 3044.823044] 
> Jul 29 19:55:36 ini kernel: [ 3044.823044] Pid: 1299, comm: umount Not tainted (2.6.32-22-386 #36-Ubuntu) X5DPA
> Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP: 0060:[<c0293c2a>] EFLAGS: 00010206 CPU: 0
> Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP is at ext4_put_super+0x2ea/0x350
> Jul 29 19:55:36 ini kernel: [ 3044.823044] EAX: c2c28ea8 EBX: c307f000 ECX: ffffff52 EDX: c307f138
> Jul 29 19:55:36 ini kernel: [ 3044.823044] ESI: ca228a00 EDI: c307f0fc EBP: cec6ff30 ESP: cec6fefc
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  c06bb054 ca228b64 0000800b c2c28ec8 00008180 00000001 00000000 c307f138
> Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> c307f138 c307f138 ca228a00 c0593c80 c023b310 cec6ff48 c01fc60d ca228ac0
> Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> cec6ff44 cf328400 00000003 cec6ff58 c01fc6ca ca228a00 c0759d80 cec6ff6c
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fc60d>] ? generic_shutdown_super+0x4d/0xe0
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fc6ca>] ? kill_block_super+0x2a/0x50
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fd4e4>] ? deactivate_super+0x64/0x90
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c021282f>] ? mntput_no_expire+0x8f/0xe0
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c0212e47>] ? sys_umount+0x47/0xa0
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c0212ebe>] ? sys_oldumount+0x1e/0x20
> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01033ec>] ? syscall_call+0x7/0xb
> Jul 29 19:55:36 ini kernel: [ 3045.299442] ---[ end trace 426db011a0289db3 ]---
...
> Another test. Everything is as before, except that I did not pull the
> cable but instead deleted the corresponding LUN on the target, so all
> commands from that moment on failed. Then, on umount, the system
> rebooted. Kernel log:

  Nasty. But the log actually contains only traces of processes in D state
(generally waiting for a page to be unlocked). Do you have any sort of
watchdog which might have rebooted the machine?
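
A few quick, hedged checks for an automatic reboot source on the test box:

cat /proc/sys/kernel/panic             # nonzero: reboot N seconds after a panic
cat /proc/sys/kernel/hung_task_panic   # 1: panic when a blocked task times out
dmesg | grep -i -e watchdog -e softdog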

> Jul 29 20:20:42 ini kernel: [ 1320.251393] umount        D 00478e55     0  1234    924 0x00000000
> Jul 29 20:20:42 ini kernel: [ 1320.251403]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
> Jul 29 20:20:42 ini kernel: [ 1320.251415]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
> Jul 29 20:20:42 ini kernel: [ 1320.251425]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
> Jul 29 20:20:42 ini kernel: [ 1320.251436] Call Trace:
> Jul 29 20:20:42 ini kernel: [ 1320.251452]  [<c057745a>] io_schedule+0x3a/0x60
> Jul 29 20:20:42 ini kernel: [ 1320.251463]  [<c01bd95d>] sync_page+0x3d/0x50
> Jul 29 20:20:42 ini kernel: [ 1320.251470]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
> Jul 29 20:20:42 ini kernel: [ 1320.251476]  [<c01bd920>] ? sync_page+0x0/0x50
> Jul 29 20:20:42 ini kernel: [ 1320.251483]  [<c01bd8ee>] __lock_page+0x7e/0x90
> Jul 29 20:20:42 ini kernel: [ 1320.251491]  [<c01624d0>] ? wake_bit_function+0x0/0x50
> Jul 29 20:20:42 ini kernel: [ 1320.251499]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
> Jul 29 20:20:42 ini kernel: [ 1320.251510]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
> Jul 29 20:20:42 ini kernel: [ 1320.251517]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
> Jul 29 20:20:42 ini kernel: [ 1320.251523]  [<c020f15c>] dispose_list+0xcc/0x100
> Jul 29 20:20:42 ini kernel: [ 1320.251529]  [<c020f534>] invalidate_inodes+0xf4/0x120
> Jul 29 20:20:42 ini kernel: [ 1320.251538]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 20:20:42 ini kernel: [ 1320.251546]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
> Jul 29 20:20:42 ini kernel: [ 1320.251553]  [<c01fc6ca>] kill_block_super+0x2a/0x50
> Jul 29 20:20:42 ini kernel: [ 1320.251559]  [<c01fd4e4>] deactivate_super+0x64/0x90
> Jul 29 20:20:42 ini kernel: [ 1320.251566]  [<c021282f>] mntput_no_expire+0x8f/0xe0
> Jul 29 20:20:42 ini kernel: [ 1320.251573]  [<c0212e47>] sys_umount+0x47/0xa0
> Jul 29 20:20:42 ini kernel: [ 1320.251579]  [<c0212ebe>] sys_oldumount+0x1e/0x20
> Jul 29 20:20:42 ini kernel: [ 1320.251586]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:22:42 ini kernel: [ 1440.285910] umount        D 00478e55     0  1234    924 0x00000004
> Jul 29 20:22:42 ini kernel: [ 1440.285919]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
> Jul 29 20:22:42 ini kernel: [ 1440.285931]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
> Jul 29 20:22:42 ini kernel: [ 1440.285942]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
> Jul 29 20:22:42 ini kernel: [ 1440.285953] Call Trace:
> Jul 29 20:22:42 ini kernel: [ 1440.285969]  [<c057745a>] io_schedule+0x3a/0x60
> Jul 29 20:22:42 ini kernel: [ 1440.285980]  [<c01bd95d>] sync_page+0x3d/0x50
> Jul 29 20:22:42 ini kernel: [ 1440.285987]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
> Jul 29 20:22:42 ini kernel: [ 1440.285994]  [<c01bd920>] ? sync_page+0x0/0x50
> Jul 29 20:22:42 ini kernel: [ 1440.286001]  [<c01bd8ee>] __lock_page+0x7e/0x90
> Jul 29 20:22:42 ini kernel: [ 1440.286010]  [<c01624d0>] ? wake_bit_function+0x0/0x50
> Jul 29 20:22:42 ini kernel: [ 1440.286018]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
> Jul 29 20:22:42 ini kernel: [ 1440.286028]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
> Jul 29 20:22:42 ini kernel: [ 1440.286035]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
> Jul 29 20:22:42 ini kernel: [ 1440.286041]  [<c020f15c>] dispose_list+0xcc/0x100
> Jul 29 20:22:42 ini kernel: [ 1440.286047]  [<c020f534>] invalidate_inodes+0xf4/0x120
> Jul 29 20:22:42 ini kernel: [ 1440.286056]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 20:22:42 ini kernel: [ 1440.286064]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
> Jul 29 20:22:42 ini kernel: [ 1440.286071]  [<c01fc6ca>] kill_block_super+0x2a/0x50
> Jul 29 20:22:42 ini kernel: [ 1440.286077]  [<c01fd4e4>] deactivate_super+0x64/0x90
> Jul 29 20:22:42 ini kernel: [ 1440.286084]  [<c021282f>] mntput_no_expire+0x8f/0xe0
> Jul 29 20:22:42 ini kernel: [ 1440.286091]  [<c0212e47>] sys_umount+0x47/0xa0
> Jul 29 20:22:42 ini kernel: [ 1440.286097]  [<c0212ebe>] sys_oldumount+0x1e/0x20
> Jul 29 20:22:42 ini kernel: [ 1440.286104]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:24:42 ini kernel: [ 1560.321709] umount        D 00478e55     0  1234    924 0x00000004
> Jul 29 20:24:42 ini kernel: [ 1560.321718]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
> Jul 29 20:24:42 ini kernel: [ 1560.321730]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
> Jul 29 20:24:42 ini kernel: [ 1560.321741]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
> Jul 29 20:24:42 ini kernel: [ 1560.321751] Call Trace:
> Jul 29 20:24:42 ini kernel: [ 1560.321767]  [<c057745a>] io_schedule+0x3a/0x60
> Jul 29 20:24:42 ini kernel: [ 1560.321777]  [<c01bd95d>] sync_page+0x3d/0x50
> Jul 29 20:24:42 ini kernel: [ 1560.321784]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
> Jul 29 20:24:42 ini kernel: [ 1560.321791]  [<c01bd920>] ? sync_page+0x0/0x50
> Jul 29 20:24:42 ini kernel: [ 1560.321797]  [<c01bd8ee>] __lock_page+0x7e/0x90
> Jul 29 20:24:42 ini kernel: [ 1560.321805]  [<c01624d0>] ? wake_bit_function+0x0/0x50
> Jul 29 20:24:42 ini kernel: [ 1560.321814]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
> Jul 29 20:24:42 ini kernel: [ 1560.321824]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
> Jul 29 20:24:42 ini kernel: [ 1560.321831]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
> Jul 29 20:24:42 ini kernel: [ 1560.321837]  [<c020f15c>] dispose_list+0xcc/0x100
> Jul 29 20:24:42 ini kernel: [ 1560.321845]  [<c020f534>] invalidate_inodes+0xf4/0x120
> Jul 29 20:24:42 ini kernel: [ 1560.321855]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 20:24:42 ini kernel: [ 1560.321864]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
> Jul 29 20:24:42 ini kernel: [ 1560.321870]  [<c01fc6ca>] kill_block_super+0x2a/0x50
> Jul 29 20:24:42 ini kernel: [ 1560.321877]  [<c01fd4e4>] deactivate_super+0x64/0x90
> Jul 29 20:24:42 ini kernel: [ 1560.321885]  [<c021282f>] mntput_no_expire+0x8f/0xe0
> Jul 29 20:24:42 ini kernel: [ 1560.321892]  [<c0212e47>] sys_umount+0x47/0xa0
> Jul 29 20:24:42 ini kernel: [ 1560.321898]  [<c0212ebe>] sys_oldumount+0x1e/0x20
> Jul 29 20:24:42 ini kernel: [ 1560.321905]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:24:42 ini kernel: [ 1560.358795] sync          D 0004beb0     0  1265   1255 0x00000004
> Jul 29 20:24:42 ini kernel: [ 1560.358803]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
> Jul 29 20:24:42 ini kernel: [ 1560.358815]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
> Jul 29 20:24:42 ini kernel: [ 1560.358826]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
> Jul 29 20:24:42 ini kernel: [ 1560.358837] Call Trace:
> Jul 29 20:24:42 ini kernel: [ 1560.358845]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
> Jul 29 20:24:42 ini kernel: [ 1560.358852]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
> Jul 29 20:24:42 ini kernel: [ 1560.358858]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
> Jul 29 20:24:42 ini kernel: [ 1560.358863]  [<c057850c>] ? down_read+0x1c/0x20
> Jul 29 20:24:42 ini kernel: [ 1560.358870]  [<c021cb6d>] sync_filesystems+0xbd/0x110
> Jul 29 20:24:42 ini kernel: [ 1560.358876]  [<c021cc16>] sys_sync+0x16/0x40
> Jul 29 20:24:42 ini kernel: [ 1560.358881]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:26:42 ini kernel: [ 1680.392190] umount        D 00478e55     0  1234    924 0x00000004
> Jul 29 20:26:42 ini kernel: [ 1680.392200]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
> Jul 29 20:26:42 ini kernel: [ 1680.392212]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
> Jul 29 20:26:42 ini kernel: [ 1680.392223]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
> Jul 29 20:26:42 ini kernel: [ 1680.392233] Call Trace:
> Jul 29 20:26:42 ini kernel: [ 1680.392250]  [<c057745a>] io_schedule+0x3a/0x60
> Jul 29 20:26:42 ini kernel: [ 1680.392260]  [<c01bd95d>] sync_page+0x3d/0x50
> Jul 29 20:26:42 ini kernel: [ 1680.392267]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
> Jul 29 20:26:42 ini kernel: [ 1680.392274]  [<c01bd920>] ? sync_page+0x0/0x50
> Jul 29 20:26:42 ini kernel: [ 1680.392280]  [<c01bd8ee>] __lock_page+0x7e/0x90
> Jul 29 20:26:42 ini kernel: [ 1680.392289]  [<c01624d0>] ? wake_bit_function+0x0/0x50
> Jul 29 20:26:42 ini kernel: [ 1680.392298]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
> Jul 29 20:26:42 ini kernel: [ 1680.392308]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
> Jul 29 20:26:42 ini kernel: [ 1680.392314]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
> Jul 29 20:26:42 ini kernel: [ 1680.392321]  [<c020f15c>] dispose_list+0xcc/0x100
> Jul 29 20:26:42 ini kernel: [ 1680.392327]  [<c020f534>] invalidate_inodes+0xf4/0x120
> Jul 29 20:26:42 ini kernel: [ 1680.392336]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 20:26:42 ini kernel: [ 1680.392344]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
> Jul 29 20:26:42 ini kernel: [ 1680.392351]  [<c01fc6ca>] kill_block_super+0x2a/0x50
> Jul 29 20:26:42 ini kernel: [ 1680.392357]  [<c01fd4e4>] deactivate_super+0x64/0x90
> Jul 29 20:26:42 ini kernel: [ 1680.392364]  [<c021282f>] mntput_no_expire+0x8f/0xe0
> Jul 29 20:26:42 ini kernel: [ 1680.392371]  [<c0212e47>] sys_umount+0x47/0xa0
> Jul 29 20:26:42 ini kernel: [ 1680.392378]  [<c0212ebe>] sys_oldumount+0x1e/0x20
> Jul 29 20:26:42 ini kernel: [ 1680.392384]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:26:42 ini kernel: [ 1680.427874] sync          D 0004beb0     0  1265   1255 0x00000004
> Jul 29 20:26:42 ini kernel: [ 1680.427883]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
> Jul 29 20:26:42 ini kernel: [ 1680.427894]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
> Jul 29 20:26:42 ini kernel: [ 1680.427904]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
> Jul 29 20:26:42 ini kernel: [ 1680.427915] Call Trace:
> Jul 29 20:26:42 ini kernel: [ 1680.427922]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
> Jul 29 20:26:42 ini kernel: [ 1680.427929]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
> Jul 29 20:26:42 ini kernel: [ 1680.427935]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
> Jul 29 20:26:42 ini kernel: [ 1680.427940]  [<c057850c>] ? down_read+0x1c/0x20
> Jul 29 20:26:42 ini kernel: [ 1680.427947]  [<c021cb6d>] sync_filesystems+0xbd/0x110
> Jul 29 20:26:42 ini kernel: [ 1680.427953]  [<c021cc16>] sys_sync+0x16/0x40
> Jul 29 20:26:42 ini kernel: [ 1680.427958]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:28:42 ini kernel: [ 1800.458856] umount        D 00478e55     0  1234    924 0x00000004
> Jul 29 20:28:42 ini kernel: [ 1800.458866]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
> Jul 29 20:28:42 ini kernel: [ 1800.458877]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
> Jul 29 20:28:42 ini kernel: [ 1800.458888]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
> Jul 29 20:28:42 ini kernel: [ 1800.458899] Call Trace:
> Jul 29 20:28:42 ini kernel: [ 1800.458915]  [<c057745a>] io_schedule+0x3a/0x60
> Jul 29 20:28:42 ini kernel: [ 1800.458925]  [<c01bd95d>] sync_page+0x3d/0x50
> Jul 29 20:28:42 ini kernel: [ 1800.458932]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
> Jul 29 20:28:42 ini kernel: [ 1800.458938]  [<c01bd920>] ? sync_page+0x0/0x50
> Jul 29 20:28:42 ini kernel: [ 1800.458945]  [<c01bd8ee>] __lock_page+0x7e/0x90
> Jul 29 20:28:42 ini kernel: [ 1800.458953]  [<c01624d0>] ? wake_bit_function+0x0/0x50
> Jul 29 20:28:42 ini kernel: [ 1800.458961]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
> Jul 29 20:28:42 ini kernel: [ 1800.458971]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
> Jul 29 20:28:42 ini kernel: [ 1800.458978]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
> Jul 29 20:28:42 ini kernel: [ 1800.458984]  [<c020f15c>] dispose_list+0xcc/0x100
> Jul 29 20:28:42 ini kernel: [ 1800.458991]  [<c020f534>] invalidate_inodes+0xf4/0x120
> Jul 29 20:28:42 ini kernel: [ 1800.458999]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 20:28:42 ini kernel: [ 1800.459007]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
> Jul 29 20:28:42 ini kernel: [ 1800.459013]  [<c01fc6ca>] kill_block_super+0x2a/0x50
> Jul 29 20:28:42 ini kernel: [ 1800.459020]  [<c01fd4e4>] deactivate_super+0x64/0x90
> Jul 29 20:28:42 ini kernel: [ 1800.459027]  [<c021282f>] mntput_no_expire+0x8f/0xe0
> Jul 29 20:28:42 ini kernel: [ 1800.459033]  [<c0212e47>] sys_umount+0x47/0xa0
> Jul 29 20:28:42 ini kernel: [ 1800.459039]  [<c0212ebe>] sys_oldumount+0x1e/0x20
> Jul 29 20:28:42 ini kernel: [ 1800.459046]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:28:42 ini kernel: [ 1800.493768] sync          D 0004beb0     0  1265   1255 0x00000004
> Jul 29 20:28:42 ini kernel: [ 1800.493777]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
> Jul 29 20:28:42 ini kernel: [ 1800.493788]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
> Jul 29 20:28:42 ini kernel: [ 1800.493798]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
> Jul 29 20:28:42 ini kernel: [ 1800.493809] Call Trace:
> Jul 29 20:28:42 ini kernel: [ 1800.493816]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
> Jul 29 20:28:42 ini kernel: [ 1800.493823]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
> Jul 29 20:28:42 ini kernel: [ 1800.493828]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
> Jul 29 20:28:42 ini kernel: [ 1800.493834]  [<c057850c>] ? down_read+0x1c/0x20
> Jul 29 20:28:42 ini kernel: [ 1800.493841]  [<c021cb6d>] sync_filesystems+0xbd/0x110
> Jul 29 20:28:42 ini kernel: [ 1800.493847]  [<c021cc16>] sys_sync+0x16/0x40
> Jul 29 20:28:42 ini kernel: [ 1800.493853]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:30:42 ini kernel: [ 1920.526729] umount        D 00478e55     0  1234    924 0x00000004
> Jul 29 20:30:42 ini kernel: [ 1920.526739]  ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330
> Jul 29 20:30:42 ini kernel: [ 1920.526750]  c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000
> Jul 29 20:30:42 ini kernel: [ 1920.526761]  0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54
> Jul 29 20:30:42 ini kernel: [ 1920.526772] Call Trace:
> Jul 29 20:30:42 ini kernel: [ 1920.526788]  [<c057745a>] io_schedule+0x3a/0x60
> Jul 29 20:30:42 ini kernel: [ 1920.526798]  [<c01bd95d>] sync_page+0x3d/0x50
> Jul 29 20:30:42 ini kernel: [ 1920.526805]  [<c0577aa7>] __wait_on_bit_lock+0x47/0x90
> Jul 29 20:30:42 ini kernel: [ 1920.526813]  [<c01bd920>] ? sync_page+0x0/0x50
> Jul 29 20:30:42 ini kernel: [ 1920.526819]  [<c01bd8ee>] __lock_page+0x7e/0x90
> Jul 29 20:30:42 ini kernel: [ 1920.526827]  [<c01624d0>] ? wake_bit_function+0x0/0x50
> Jul 29 20:30:42 ini kernel: [ 1920.526836]  [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
> Jul 29 20:30:42 ini kernel: [ 1920.526845]  [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
> Jul 29 20:30:42 ini kernel: [ 1920.526853]  [<c01c724f>] truncate_inode_pages+0x1f/0x30
> Jul 29 20:30:42 ini kernel: [ 1920.526859]  [<c020f15c>] dispose_list+0xcc/0x100
> Jul 29 20:30:42 ini kernel: [ 1920.526866]  [<c020f534>] invalidate_inodes+0xf4/0x120
> Jul 29 20:30:42 ini kernel: [ 1920.526874]  [<c023b310>] ? vfs_quota_off+0x0/0x20
> Jul 29 20:30:42 ini kernel: [ 1920.526882]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
> Jul 29 20:30:42 ini kernel: [ 1920.526889]  [<c01fc6ca>] kill_block_super+0x2a/0x50
> Jul 29 20:30:42 ini kernel: [ 1920.526895]  [<c01fd4e4>] deactivate_super+0x64/0x90
> Jul 29 20:30:42 ini kernel: [ 1920.526902]  [<c021282f>] mntput_no_expire+0x8f/0xe0
> Jul 29 20:30:42 ini kernel: [ 1920.526908]  [<c0212e47>] sys_umount+0x47/0xa0
> Jul 29 20:30:42 ini kernel: [ 1920.526915]  [<c0212ebe>] sys_oldumount+0x1e/0x20
> Jul 29 20:30:42 ini kernel: [ 1920.526922]  [<c01033ec>] syscall_call+0x7/0xb
> Jul 29 20:30:42 ini kernel: [ 1920.563739] sync          D 0004beb0     0  1265   1255 0x00000004
> Jul 29 20:30:42 ini kernel: [ 1920.563747]  cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
> Jul 29 20:30:42 ini kernel: [ 1920.563758]  c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
> Jul 29 20:30:42 ini kernel: [ 1920.563768]  00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
> Jul 29 20:30:42 ini kernel: [ 1920.563779] Call Trace:
> Jul 29 20:30:42 ini kernel: [ 1920.563787]  [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
> Jul 29 20:30:42 ini kernel: [ 1920.563793]  [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
> Jul 29 20:30:42 ini kernel: [ 1920.563799]  [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
> Jul 29 20:30:42 ini kernel: [ 1920.563804]  [<c057850c>] ? down_read+0x1c/0x20
> Jul 29 20:30:42 ini kernel: [ 1920.563812]  [<c021cb6d>] sync_filesystems+0xbd/0x110
> Jul 29 20:30:42 ini kernel: [ 1920.563817]  [<c021cc16>] sys_sync+0x16/0x40
> Jul 29 20:30:42 ini kernel: [ 1920.563823]  [<c01033ec>] syscall_call+0x7/0xb
> 
> Although in both cases the FS remained consistent:
  Yes, at least something positive in the end ;).

> root@ini:~# mount -t ext4 /dev/sdb /mnt
> root@ini:~# umount /mnt
> root@ini:~# e2fsck -f -y /dev/sdb
> e2fsck 1.41.11 (14-Mar-2010)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> 
> /dev/sdb: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/sdb: 4194/640000 files (74.2% non-contiguous), 334774/1280000 blocks
> 
> You can find full kernel logs starting from the iSCSI load in the attachments.
> 
> I already reported such issues some time ago, but my reports were not very well received, so I gave up. Anyway, anybody can easily run my tests at any time. They need no special hardware, just two Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The results are generic for other transports as well; you can see there is nothing iSCSI-specific in the traces.
  Thanks for running the test.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 14:12                             ` Vladislav Bolkhovitin
@ 2010-07-29 14:34                               ` Jan Kara
  2010-07-29 18:20                                 ` Vladislav Bolkhovitin
  2010-07-29 18:49                                 ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 155+ messages in thread
From: Jan Kara @ 2010-07-29 14:34 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke, linux-kernel,
	kernel-bugs

On Thu 29-07-10 18:12:29, Vladislav Bolkhovitin wrote:
> 
> Christoph Hellwig, on 07/29/2010 05:08 PM wrote:
> > On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote:
>> You can find full kernel logs starting from the iSCSI load in the attachments.
>>
>> I already reported such issues some time ago, but my reports were not very well received, so I gave up. Anyway, anybody can easily run my tests at any time. They need no special hardware, just two Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The results are generic for other transports as well; you can see there is nothing iSCSI-specific in the traces.
> > 
> > I was only talking about ext3.
> 
> Yes, ext3 is a lot more reliable now. The only way I was able to confuse it was:
> 
> ...
> (2197) nb_write: handle 4272 was not open size=65475 ofs=0
> (2199) nb_write: handle 4272 was not open size=65475 ofs=65534
> (2201) nb_write: handle 4272 was not open size=65475 ofs=131068
> (2203) nb_write: handle 4272 was not open size=65475 ofs=196602
> (2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
> ^C
> root@ini:/mnt/dbench-mod# ^C
> root@ini:/mnt/dbench-mod# ^C
> root@ini:/mnt/dbench-mod# cd
> root@ini:~# umount /mnt
> 
> <- recover device
> 
> root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>        missing codepage or helper program, or other error
>        In some cases useful info is found in syslog - try
>        dmesg | tail  or so
> 
> Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"
  Hmm, this is strange. Are there more messages around this one?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 10:45                         ` Jan Kara
@ 2010-07-29 16:54                           ` Joel Becker
  2010-07-29 17:02                             ` Christoph Hellwig
  2010-07-29 17:02                             ` Christoph Hellwig
  0 siblings, 2 replies; 155+ messages in thread
From: Joel Becker @ 2010-07-29 16:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason,
	swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 12:45:30PM +0200, Jan Kara wrote:
> On Thu 29-07-10 01:00:10, Christoph Hellwig wrote:
> > On Wed, Jul 28, 2010 at 02:47:20PM +0200, Jan Kara wrote:
> > >   Well, ocfs2 uses jbd2 for journaling so it supports barriers out of the
> > > box and does not need the ordering. ocfs2_sync_file is actually correct
> > > (although maybe slightly inefficient) because it does
> > > jbd2_journal_force_commit() which creates and immediately commits a
> > > transaction and that implies a barrier.
> > 
> > I don't think that's correct.  ocfs2_sync_file first does
> > ocfs2_sync_inode, which does a completely superflous filemap_fdatawrite,
> > and from what I can see a just as superflous sync_mapping_buffers (given
> > that ocfs doesn't use mark_buffer_dirty_inode) and then might return
> > early in case we do fdatasync but the inode isn't marked
> > I_DIRTY_DATASYNC.  In that case we might need a cache flush given
> > that the data might still be dirty.
>   Ah, I see. You're right, the fdatasync case is buggy. I'll send Joel a fix.

	I can certainly see our code being inefficient if the
handled-for-us behaviors of sync have changed.  If the VFS is already
doing some work for us, maybe we don't need to do it.  But we have to be
sure that these calls are always going through those paths.  We sync our
files to disk when we drop cluster locks, regardless of whether there is
a userspace fsync().
	I guess I never knew that data could be dirty without the
I_DIRTY_DATASYNC bit.

Joel

-- 

"Copy from one, it's plagiarism; copy from two, it's research."
        - Wilson Mizner

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 16:54                           ` Joel Becker
  2010-07-29 17:02                             ` Christoph Hellwig
@ 2010-07-29 17:02                             ` Christoph Hellwig
  1 sibling, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29 17:02 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig, Tejun Heo, Vivek Goyal, jaxboe,
	James.Bottomley, linux-fsdevel

On Thu, Jul 29, 2010 at 09:54:50AM -0700, Joel Becker wrote:
> handled-for-us behaviors of sync have changed.  If the VFS is already
> doing some work for us, maybe we don't need to do it.  But we have to be
> sure that these calls are always going through those paths.  We sync our
> files to disk when we drop cluster locks, regardless of whether there is
> a userspace fsync().

ocfs2_sync_file only gets called through the fsync inode operation, so
that doesn't happen here.  And if it did the filemap_fdatawrite would
not help at all, given that it only starts writeout, but never waits for
it to finish.
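
A minimal sketch of what a safe ->fsync would look like here, assuming
the 2.6.35-era fsync prototype and blkdev_issue_flush() signature (an
illustration of the pattern under discussion, not the actual ocfs2 fix):

#include <linux/fs.h>
#include <linux/blkdev.h>

static int example_fsync(struct file *file, int datasync)
{
	struct inode *inode = file->f_mapping->host;
	int err;

	/* Write out dirty pages and wait for them; a bare
	 * filemap_fdatawrite() would only start the writeout. */
	err = filemap_write_and_wait(file->f_mapping);
	if (err)
		return err;

	/* Even with no I_DIRTY_DATASYNC metadata, the data may still sit
	 * in the device's volatile write cache, so flush it explicitly. */
	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
		return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
					  NULL, BLKDEV_IFL_WAIT);

	/* ... otherwise force a journal commit, which implies a flush ... */
	return 0;
}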


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 14:26                           ` Jan Kara
@ 2010-07-29 18:20                             ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-29 18:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke, linux-kernel

Jan Kara, on 07/29/2010 06:26 PM wrote:
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:~# umount /mnt
>> Segmentation fault
>    This isn't OK of course ;)
> 
>> Kernel log:
>>
>> Jul 29 19:55:35 ini kernel: [ 3044.722313] c2c28e40: 00023740 00023741 00023742 00023743  @7..A7..B7..C7..
>> Jul 29 19:55:35 ini kernel: [ 3044.722320] c2c28e50: 00023744 00023745 00023746 00023747  D7..E7..F7..G7..
>> Jul 29 19:55:35 ini kernel: [ 3044.722327] c2c28e60: 00023748 00023749 0002374a 0002374b  H7..I7..J7..K7..
>> Jul 29 19:55:35 ini kernel: [ 3044.722334] c2c28e70: 0002372c 00000000 00000000 00000000  ,7..............
>> Jul 29 19:55:35 ini kernel: [ 3044.722341] c2c28e80: 00000000 00000000 00000000 00000002  ................
> ...
> Sadly these messages above seem to have overwritten the beginning of the
> message below. Hmm, but maybe it's just a warning message about an inode still
> being on the orphan list, because the next oops still shows an untainted kernel.

You can find the previous messages in the attachments to the report. They are big (500K and 1M), so I compressed them before attaching.
 
>> Jul 29 19:55:35 ini kernel: [ 3044.722546] Pid: 1299, comm: umount Not tainted 2.6.32-22-386 #36-Ubuntu
>> Jul 29 19:55:35 ini kernel: [ 3044.722550] Call Trace:
>> Jul 29 19:55:35 ini kernel: [ 3044.722567]  [<c0291731>] ext4_destroy_inode+0x91/0xa0
>> Jul 29 19:55:35 ini kernel: [ 3044.722577]  [<c020ecb4>] destroy_inode+0x24/0x40
>> Jul 29 19:55:35 ini kernel: [ 3044.722583]  [<c020f11e>] dispose_list+0x8e/0x100
>> Jul 29 19:55:35 ini kernel: [ 3044.722588]  [<c020f534>] invalidate_inodes+0xf4/0x120
>> Jul 29 19:55:35 ini kernel: [ 3044.722598]  [<c023b310>] ? vfs_quota_off+0x0/0x20
>> Jul 29 19:55:35 ini kernel: [ 3044.722606]  [<c01fc602>] generic_shutdown_super+0x42/0xe0
>> Jul 29 19:55:35 ini kernel: [ 3044.722612]  [<c01fc6ca>] kill_block_super+0x2a/0x50
>> Jul 29 19:55:35 ini kernel: [ 3044.722618]  [<c01fd4e4>] deactivate_super+0x64/0x90
>> Jul 29 19:55:35 ini kernel: [ 3044.722625]  [<c021282f>] mntput_no_expire+0x8f/0xe0
>> Jul 29 19:55:35 ini kernel: [ 3044.722631]  [<c0212e47>] sys_umount+0x47/0xa0
>> Jul 29 19:55:35 ini kernel: [ 3044.722636]  [<c0212ebe>] sys_oldumount+0x1e/0x20
>> Jul 29 19:55:35 ini kernel: [ 3044.722643]  [<c01033ec>] syscall_call+0x7/0xb
>> Jul 29 19:55:35 ini kernel: [ 3044.731043] sd 6:0:0:0: [sdb] Unhandled error code
>> Jul 29 19:55:35 ini kernel: [ 3044.731049] sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
>> Jul 29 19:55:35 ini kernel: [ 3044.731056] sd 6:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 00 00 00 01 00
>> Jul 29 19:55:35 ini kernel: [ 3044.743469] __ratelimit: 37 callbacks suppressed
>> Jul 29 19:55:35 ini kernel: [ 3044.755695] lost page write due to I/O error on sdb
>> Jul 29 19:55:36 ini kernel: [ 3044.823044] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp edac_core lp parport qla2xxx ohci1394 scsi_transport_fc r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy
>    So this is probably where the real oops starts.

It isn't yet an oops; it's dump_stack() from ext4_destroy_inode() together with a hex dump:

static void ext4_destroy_inode(struct inode *inode)
{
	if (!list_empty(&(EXT4_I(inode)->i_orphan))) {
		ext4_msg(inode->i_sb, KERN_ERR,
			 "Inode %lu (%p): orphan list check failed!",
			 inode->i_ino, EXT4_I(inode));
		print_hex_dump(KERN_INFO, "", DUMP_PREFIX_ADDRESS, 16, 4,
				EXT4_I(inode), sizeof(struct ext4_inode_info),
				true);
		dump_stack();
	}
	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
}

> But sadly we are missing the
> beginning as well.

It was also in the attached file.

> Can you send me disassembly of your ext4_put_super?

In System.map-2.6.32-22-386:

c0293940 t ext4_put_super
c0293c90 t ext4_quota_write

$ objdump -d --start-address=0xc0293940 vmlinux >ext4_put_super
^C
$ cat ext4_put_super 

vmlinux:     file format elf32-i386


Disassembly of section .text:

c0293940 <.text+0x193940>:
c0293940:	55                   	push   %ebp
c0293941:	89 e5                	mov    %esp,%ebp
c0293943:	57                   	push   %edi
c0293944:	56                   	push   %esi
c0293945:	53                   	push   %ebx
c0293946:	83 ec 28             	sub    $0x28,%esp
c0293949:	e8 02 07 e7 ff       	call   0xc0104050
c029394e:	8b 98 84 01 00 00    	mov    0x184(%eax),%ebx
c0293954:	89 c6                	mov    %eax,%esi
c0293956:	8b 83 2c 02 00 00    	mov    0x22c(%ebx),%eax
c029395c:	8b 7b 38             	mov    0x38(%ebx),%edi
c029395f:	e8 ac b5 ec ff       	call   0xc015ef10
c0293964:	8b 83 2c 02 00 00    	mov    0x22c(%ebx),%eax
c029396a:	e8 41 af ec ff       	call   0xc015e8b0
c029396f:	89 f0                	mov    %esi,%eax
c0293971:	e8 5a 86 f6 ff       	call   0xc01fbfd0
c0293976:	e8 05 5a 2e 00       	call   0xc0579380
c029397b:	80 7e 11 00          	cmpb   $0x0,0x11(%esi)
c029397f:	0f 85 0b 02 00 00    	jne    0xc0293b90
c0293985:	8b 83 34 01 00 00    	mov    0x134(%ebx),%eax
c029398b:	85 c0                	test   %eax,%eax
c029398d:	74 17                	je     0xc02939a6
c029398f:	e8 ec 64 02 00       	call   0xc02b9e80
c0293994:	c7 83 34 01 00 00 00 	movl   $0x0,0x134(%ebx)
c029399b:	00 00 00 
c029399e:	85 c0                	test   %eax,%eax
c02939a0:	0f 88 ff 01 00 00    	js     0xc0293ba5
c02939a6:	89 f0                	mov    %esi,%eax
c02939a8:	e8 c3 22 01 00       	call   0xc02a5c70
c02939ad:	89 f0                	mov    %esi,%eax
c02939af:	e8 0c dc 00 00       	call   0xc02a15c0
c02939b4:	89 f0                	mov    %esi,%eax
c02939b6:	e8 c5 51 00 00       	call   0xc0298b80
c02939bb:	89 f0                	mov    %esi,%eax
c02939bd:	e8 de 45 01 00       	call   0xc02a7fa0
c02939c2:	f6 46 30 01          	testb  $0x1,0x30(%esi)
c02939c6:	0f 84 9c 01 00 00    	je     0xc0293b68
c02939cc:	8b 93 f8 00 00 00    	mov    0xf8(%ebx),%edx
c02939d2:	85 d2                	test   %edx,%edx
c02939d4:	74 11                	je     0xc02939e7
c02939d6:	8b 15 c8 8a 8a c0    	mov    0xc08a8ac8,%edx
c02939dc:	8d 86 64 01 00 00    	lea    0x164(%esi),%eax
c02939e2:	e8 59 fd fa ff       	call   0xc0243740
c02939e7:	8d bb fc 00 00 00    	lea    0xfc(%ebx),%edi
c02939ed:	89 f8                	mov    %edi,%eax
c02939ef:	e8 7c 5f 0a 00       	call   0xc0339970
c02939f4:	8b 43 14             	mov    0x14(%ebx),%eax
c02939f7:	85 c0                	test   %eax,%eax
c02939f9:	0f 84 c3 01 00 00    	je     0xc0293bc2
c02939ff:	31 d2                	xor    %edx,%edx
c0293a01:	8b 4b 3c             	mov    0x3c(%ebx),%ecx
c0293a04:	31 c0                	xor    %eax,%eax
c0293a06:	89 75 f0             	mov    %esi,-0x10(%ebp)
c0293a09:	89 de                	mov    %ebx,%esi
c0293a0b:	89 d3                	mov    %edx,%ebx
c0293a0d:	8d 76 00             	lea    0x0(%esi),%esi
c0293a10:	8b 04 81             	mov    (%ecx,%eax,4),%eax
c0293a13:	85 c0                	test   %eax,%eax
c0293a15:	74 08                	je     0xc0293a1f
c0293a17:	e8 54 ab f8 ff       	call   0xc021e570
c0293a1c:	8b 4e 3c             	mov    0x3c(%esi),%ecx
c0293a1f:	83 c3 01             	add    $0x1,%ebx
c0293a22:	39 5e 14             	cmp    %ebx,0x14(%esi)
c0293a25:	89 d8                	mov    %ebx,%eax
c0293a27:	77 e7                	ja     0xc0293a10
c0293a29:	89 f3                	mov    %esi,%ebx
c0293a2b:	8b 75 f0             	mov    -0x10(%ebp),%esi
c0293a2e:	89 c8                	mov    %ecx,%eax
c0293a30:	e8 fb bb f5 ff       	call   0xc01ef630
c0293a35:	8b 15 2c 53 8a c0    	mov    0xc08a532c,%edx
c0293a3b:	8b 83 28 02 00 00    	mov    0x228(%ebx),%eax
c0293a41:	81 c2 00 00 80 00    	add    $0x800000,%edx
c0293a47:	39 d0                	cmp    %edx,%eax
c0293a49:	72 20                	jb     0xc0293a6b
c0293a4b:	8b 15 c0 17 75 c0    	mov    0xc07517c0,%edx
c0293a51:	81 ea 00 20 60 00    	sub    $0x602000,%edx
c0293a57:	81 e2 00 00 c0 ff    	and    $0xffc00000,%edx
c0293a5d:	81 ea 00 20 00 00    	sub    $0x2000,%edx
c0293a63:	39 d0                	cmp    %edx,%eax
c0293a65:	0f 82 ed 00 00 00    	jb     0xc0293b58
c0293a6b:	e8 c0 bb f5 ff       	call   0xc01ef630
c0293a70:	8d 83 94 00 00 00    	lea    0x94(%ebx),%eax
c0293a76:	e8 c5 42 0b 00       	call   0xc0347d40
c0293a7b:	8d 83 ac 00 00 00    	lea    0xac(%ebx),%eax
c0293a81:	e8 ba 42 0b 00       	call   0xc0347d40
c0293a86:	8d 83 c4 00 00 00    	lea    0xc4(%ebx),%eax
c0293a8c:	e8 af 42 0b 00       	call   0xc0347d40
c0293a91:	8d 83 dc 00 00 00    	lea    0xdc(%ebx),%eax
c0293a97:	e8 a4 42 0b 00       	call   0xc0347d40
c0293a9c:	8b 43 34             	mov    0x34(%ebx),%eax
c0293a9f:	85 c0                	test   %eax,%eax
c0293aa1:	74 05                	je     0xc0293aa8
c0293aa3:	e8 c8 aa f8 ff       	call   0xc021e570
c0293aa8:	8b 83 78 01 00 00    	mov    0x178(%ebx),%eax
c0293aae:	e8 7d bb f5 ff       	call   0xc01ef630
c0293ab3:	8b 83 7c 01 00 00    	mov    0x17c(%ebx),%eax
c0293ab9:	e8 72 bb f5 ff       	call   0xc01ef630
c0293abe:	8d 93 38 01 00 00    	lea    0x138(%ebx),%edx
c0293ac4:	3b 93 38 01 00 00    	cmp    0x138(%ebx),%edx
c0293aca:	0f 85 fa 00 00 00    	jne    0xc0293bca
c0293ad0:	8b 86 94 00 00 00    	mov    0x94(%esi),%eax
c0293ad6:	e8 65 b4 f8 ff       	call   0xc021ef40
c0293adb:	8b 83 74 01 00 00    	mov    0x174(%ebx),%eax
c0293ae1:	85 c0                	test   %eax,%eax
c0293ae3:	74 31                	je     0xc0293b16
c0293ae5:	3b 86 94 00 00 00    	cmp    0x94(%esi),%eax
c0293aeb:	74 29                	je     0xc0293b16
c0293aed:	e8 4e 0f f9 ff       	call   0xc0224a40
c0293af2:	8b 83 74 01 00 00    	mov    0x174(%ebx),%eax
c0293af8:	e8 43 b4 f8 ff       	call   0xc021ef40
c0293afd:	8b 83 74 01 00 00    	mov    0x174(%ebx),%eax
c0293b03:	85 c0                	test   %eax,%eax
c0293b05:	74 0f                	je     0xc0293b16
c0293b07:	e8 64 d7 ff ff       	call   0xc0291270
c0293b0c:	c7 83 74 01 00 00 00 	movl   $0x0,0x174(%ebx)
c0293b13:	00 00 00 
c0293b16:	c7 86 84 01 00 00 00 	movl   $0x0,0x184(%esi)
c0293b1d:	00 00 00 
c0293b20:	e8 2b 58 2e 00       	call   0xc0579350
c0293b25:	89 f0                	mov    %esi,%eax
c0293b27:	e8 c4 84 f6 ff       	call   0xc01fbff0
c0293b2c:	89 f8                	mov    %edi,%eax
c0293b2e:	e8 7d 5d 0a 00       	call   0xc03398b0
c0293b33:	8d 83 20 01 00 00    	lea    0x120(%ebx),%eax
c0293b39:	e8 d2 3b 2e 00       	call   0xc0577710
c0293b3e:	8b 83 f4 00 00 00    	mov    0xf4(%ebx),%eax
c0293b44:	e8 e7 ba f5 ff       	call   0xc01ef630
c0293b49:	89 d8                	mov    %ebx,%eax
c0293b4b:	e8 e0 ba f5 ff       	call   0xc01ef630
c0293b50:	83 c4 28             	add    $0x28,%esp
c0293b53:	5b                   	pop    %ebx
c0293b54:	5e                   	pop    %esi
c0293b55:	5f                   	pop    %edi
c0293b56:	5d                   	pop    %ebp
c0293b57:	c3                   	ret    
c0293b58:	e8 e3 db f4 ff       	call   0xc01e1740
c0293b5d:	8d 76 00             	lea    0x0(%esi),%esi
c0293b60:	e9 0b ff ff ff       	jmp    0xc0293a70
c0293b65:	8d 76 00             	lea    0x0(%esi),%esi
c0293b68:	8b 86 84 01 00 00    	mov    0x184(%esi),%eax
c0293b6e:	ba 01 00 00 00       	mov    $0x1,%edx
c0293b73:	8b 40 38             	mov    0x38(%eax),%eax
c0293b76:	83 60 60 fb          	andl   $0xfffffffb,0x60(%eax)
c0293b7a:	0f b7 43 58          	movzwl 0x58(%ebx),%eax
c0293b7e:	66 89 47 3a          	mov    %ax,0x3a(%edi)
c0293b82:	89 f0                	mov    %esi,%eax
c0293b84:	e8 67 e5 ff ff       	call   0xc02920f0
c0293b89:	e9 3e fe ff ff       	jmp    0xc02939cc
c0293b8e:	66 90                	xchg   %ax,%ax
c0293b90:	ba 01 00 00 00       	mov    $0x1,%edx
c0293b95:	89 f0                	mov    %esi,%eax
c0293b97:	e8 54 e5 ff ff       	call   0xc02920f0
c0293b9c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
c0293ba0:	e9 e0 fd ff ff       	jmp    0xc0293985
c0293ba5:	89 34 24             	mov    %esi,(%esp)
c0293ba8:	c7 44 24 08 bf 9c 6c 	movl   $0xc06c9cbf,0x8(%esp)
c0293baf:	c0 
c0293bb0:	c7 44 24 04 64 3e 59 	movl   $0xc0593e64,0x4(%esp)
c0293bb7:	c0 
c0293bb8:	e8 d3 f1 ff ff       	call   0xc0292d90
c0293bbd:	e9 e4 fd ff ff       	jmp    0xc02939a6
c0293bc2:	8b 4b 3c             	mov    0x3c(%ebx),%ecx
c0293bc5:	e9 64 fe ff ff       	jmp    0xc0293a2e
c0293bca:	8b 43 38             	mov    0x38(%ebx),%eax
c0293bcd:	8b 80 e8 00 00 00    	mov    0xe8(%eax),%eax
c0293bd3:	89 55 e8             	mov    %edx,-0x18(%ebp)
c0293bd6:	c7 44 24 08 be a9 6c 	movl   $0xc06ca9be,0x8(%esp)
c0293bdd:	c0 
c0293bde:	c7 44 24 04 b9 0c 6a 	movl   $0xc06a0cb9,0x4(%esp)
c0293be5:	c0 
c0293be6:	89 44 24 0c          	mov    %eax,0xc(%esp)
c0293bea:	89 34 24             	mov    %esi,(%esp)
c0293bed:	e8 be d8 ff ff       	call   0xc02914b0
c0293bf2:	c7 04 24 f6 9c 6c c0 	movl   $0xc06c9cf6,(%esp)
c0293bf9:	e8 4d 2e 2e 00       	call   0xc0576a4b
c0293bfe:	8b 83 38 01 00 00    	mov    0x138(%ebx),%eax
c0293c04:	8b 55 e8             	mov    -0x18(%ebp),%edx
c0293c07:	89 45 f0             	mov    %eax,-0x10(%ebp)
c0293c0a:	89 55 ec             	mov    %edx,-0x14(%ebp)
c0293c0d:	8b 55 f0             	mov    -0x10(%ebp),%edx
c0293c10:	8b 02                	mov    (%edx),%eax
c0293c12:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
c0293c16:	39 55 ec             	cmp    %edx,-0x14(%ebp)
c0293c19:	75 13                	jne    0xc0293c2e
c0293c1b:	8b 55 ec             	mov    -0x14(%ebp),%edx
c0293c1e:	3b 93 38 01 00 00    	cmp    0x138(%ebx),%edx
c0293c24:	0f 84 a6 fe ff ff    	je     0xc0293ad0
c0293c2a:	0f 0b                	ud2a   
c0293c2c:	eb fe                	jmp    0xc0293c2c
c0293c2e:	8b 55 f0             	mov    -0x10(%ebp),%edx
c0293c31:	8b 45 f0             	mov    -0x10(%ebp),%eax
c0293c34:	83 c2 20             	add    $0x20,%edx
c0293c37:	8b 4a c0             	mov    -0x40(%edx),%ecx
c0293c3a:	83 e8 68             	sub    $0x68,%eax
c0293c3d:	89 4c 24 18          	mov    %ecx,0x18(%esp)
c0293c41:	8b 88 b0 00 00 00    	mov    0xb0(%eax),%ecx
c0293c47:	89 4c 24 14          	mov    %ecx,0x14(%esp)
c0293c4b:	0f b7 88 fa 00 00 00 	movzwl 0xfa(%eax),%ecx
c0293c52:	89 54 24 0c          	mov    %edx,0xc(%esp)
c0293c56:	89 4c 24 10          	mov    %ecx,0x10(%esp)
c0293c5a:	8b 90 a8 00 00 00    	mov    0xa8(%eax),%edx
c0293c60:	89 54 24 08          	mov    %edx,0x8(%esp)
c0293c64:	8b 80 2c 01 00 00    	mov    0x12c(%eax),%eax
c0293c6a:	c7 04 24 54 b0 6b c0 	movl   $0xc06bb054,(%esp)
c0293c71:	05 64 01 00 00       	add    $0x164,%eax
c0293c76:	89 44 24 04          	mov    %eax,0x4(%esp)
c0293c7a:	e8 cc 2d 2e 00       	call   0xc0576a4b
c0293c7f:	8b 55 f0             	mov    -0x10(%ebp),%edx
c0293c82:	8b 12                	mov    (%edx),%edx
c0293c84:	89 55 f0             	mov    %edx,-0x10(%ebp)
c0293c87:	eb 84                	jmp    0xc0293c0d
c0293c89:	8d b4 26 00 00 00 00 	lea    0x0(%esi,%eiz,1),%esi

The rest snipped.
 
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]
>> Jul 29 19:55:36 ini kernel: [ 3044.823044] Pid: 1299, comm: umount Not tainted (2.6.32-22-386 #36-Ubuntu) X5DPA
>> Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP: 0060:[<c0293c2a>] EFLAGS: 00010206 CPU: 0
>> Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP is at ext4_put_super+0x2ea/0x350
>> Jul 29 19:55:36 ini kernel: [ 3044.823044] EAX: c2c28ea8 EBX: c307f000 ECX: ffffff52 EDX: c307f138
>> Jul 29 19:55:36 ini kernel: [ 3044.823044] ESI: ca228a00 EDI: c307f0fc EBP: cec6ff30 ESP: cec6fefc
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  c06bb054 ca228b64 0000800b c2c28ec8 00008180 00000001 00000000 c307f138
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]<0>  c307f138 c307f138 ca228a00 c0593c80 c023b310 cec6ff48 c01fc60d ca228ac0
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]<0>  cec6ff44 cf328400 00000003 cec6ff58 c01fc6ca ca228a00 c0759d80 cec6ff6c
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c023b310>] ? vfs_quota_off+0x0/0x20
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fc60d>] ? generic_shutdown_super+0x4d/0xe0
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fc6ca>] ? kill_block_super+0x2a/0x50
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01fd4e4>] ? deactivate_super+0x64/0x90
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c021282f>] ? mntput_no_expire+0x8f/0xe0
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c0212e47>] ? sys_umount+0x47/0xa0
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c0212ebe>] ? sys_oldumount+0x1e/0x20
>> Jul 29 19:55:36 ini kernel: [ 3044.823044]  [<c01033ec>] ? syscall_call+0x7/0xb
>> Jul 29 19:55:36 ini kernel: [ 3045.299442] ---[ end trace 426db011a0289db3 ]---
> ...
>> Another test. Everything is as before, only I did not pull the cable, but
>> deleted the corresponding LUN on the target, so all the commands starting
>> from this moment failed. Then on umount the system rebooted. Kernel log:
> 
>    Nasty. But the log actually contains only traces of processes in D state
> (generally waiting for a page to be unlocked). Do you have any sort of
> watchdog which might have rebooted the machine?

I didn't configure it ;). This is an unmodified Ubuntu Server 10.04, only with a non-PAE kernel.

The reboot wasn't immediate. I even tried to check a few things in another ssh session.

Again, you can find more logs attached to the original message.
 
>    Thanks for running the test.

Thanks for looking at the results!

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 14:34                               ` Jan Kara
@ 2010-07-29 18:20                                 ` Vladislav Bolkhovitin
  2010-07-29 18:49                                 ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-29 18:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke, linux-kernel

Jan Kara, on 07/29/2010 06:34 PM wrote:
> On Thu 29-07-10 18:12:29, Vladislav Bolkhovitin wrote:
>>
>> Christoph Hellwig, on 07/29/2010 05:08 PM wrote:
>>> On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote:
>>>> You can find full kernel logs starting from iSCSI load in the attachments.
>>>>
>>>> I already reported such issues some time ago, but my reports were not well received, so I gave up. Anyway, anybody can easily run my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The tests are generic for other transports as well; you can see there's nothing iSCSI-specific in the traces.
>>>
>>> I was only talking about ext3.
>>
>> Yes, now ext3 is a lot more reliable. The only way I was able to confuse it was:
>>
>> ...
>> (2197) nb_write: handle 4272 was not open size=65475 ofs=0
>> (2199) nb_write: handle 4272 was not open size=65475 ofs=65534
>> (2201) nb_write: handle 4272 was not open size=65475 ofs=131068
>> (2203) nb_write: handle 4272 was not open size=65475 ofs=196602
>> (2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
>> ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# cd
>> root@ini:~# umount /mnt
>>
>> <- recover device
>>
>> root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
>> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>>         missing codepage or helper program, or other error
>>         In some cases useful info is found in syslog - try
>>         dmesg | tail  or so
>>
>> Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"
>    Hmm, this is strange. Are there more messages around this one?

Not really:

Jul 29 22:02:05 ini kernel: [ 2698.488446] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.505470] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.505480] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.505488] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.505497] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.555147] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.555157] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.555165] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.555175] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.582241] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.582251] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.582259] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.582268] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.614789] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.614799] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.614807] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.614817] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:45 ini kernel: [ 2738.474386] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474529] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474536] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474570] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474583] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474603] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474615] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474621] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474633] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474659] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed
Jul 29 22:05:54 ini kernel: [ 2927.832893] kjournald starting.  Commit interval 5 seconds
Jul 29 22:05:54 ini kernel: [ 2927.833430] EXT3 FS on sdb, internal journal
Jul 29 22:05:54 ini kernel: [ 2927.833499] EXT3-fs: sdb: 1 orphan inode deleted
Jul 29 22:05:54 ini kernel: [ 2927.833503] EXT3-fs: recovery complete.
Jul 29 22:05:54 ini kernel: [ 2927.838122] EXT3-fs: mounted filesystem with ordered data mode.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 14:34                               ` Jan Kara
  2010-07-29 18:20                                 ` Vladislav Bolkhovitin
@ 2010-07-29 18:49                                 ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-29 18:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs

Jan Kara, on 07/29/2010 06:34 PM wrote:
> On Thu 29-07-10 18:12:29, Vladislav Bolkhovitin wrote:
>>
>> Christoph Hellwig, on 07/29/2010 05:08 PM wrote:
>>> On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote:
>>>> You can find full kernel logs starting from iSCSI load in the attachments.
>>>>
>>>> I already reported such issues some time ago, but my reports were not well received, so I gave up. Anyway, anybody can easily run my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The tests are generic for other transports as well; you can see there's nothing iSCSI-specific in the traces.
>>>
>>> I was only talking about ext3.
>>
>> Yes, now ext3 is a lot more reliable. The only way I was able to confuse it was:
>>
>> ...
>> (2197) nb_write: handle 4272 was not open size=65475 ofs=0
>> (2199) nb_write: handle 4272 was not open size=65475 ofs=65534
>> (2201) nb_write: handle 4272 was not open size=65475 ofs=131068
>> (2203) nb_write: handle 4272 was not open size=65475 ofs=196602
>> (2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
>> ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# cd
>> root@ini:~# umount /mnt
>>
>> <- recover device
>>
>> root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
>> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>>         missing codepage or helper program, or other error
>>         In some cases useful info is found in syslog - try
>>         dmesg | tail  or so
>>
>> Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"
>    Hmm, this is strange. Are there more messages around this one?

I'd encourage you to reproduce a similar setup and perform various failure 
injection tests. I promise you'll find a lot of strange and interesting 
things ;). Software devices give unique opportunities for that.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: extfs reliability
  2010-07-29 13:00                         ` extfs reliability Vladislav Bolkhovitin
  2010-07-29 13:08                           ` Christoph Hellwig
  2010-07-29 14:26                           ` Jan Kara
@ 2010-07-29 18:58                           ` Ted Ts'o
  2 siblings, 0 replies; 155+ messages in thread
From: Ted Ts'o @ 2010-07-29 18:58 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs

On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote:
> Christoph Hellwig, on 07/29/2010 12:31 PM wrote:
> > My reading of the ext3/jbd code is that we explicitly wait on I/O
> > completion of dependent writes, and only require those to actually be
> > stable by issuing a flush.  If that wasn't the case, the default ext3
> > barriers-off behaviour would not only be dangerous on devices with
> > volatile write caches, but also on devices that do not have them,
> > which, in addition to the reading of the code, is not what we've seen
> > in actual power-fail testing, where ext3 does well as long as there
> > is no volatile write cache.
> 
> Basically that is so, but unfortunately not absolutely. I've just tried 2 tests on ext4 with iSCSI:

Well, this thread was talking about something else (which is how
various file systems handle barriers), and not bugs about what happens
when a disk disappears from a system due to attachment failure --- but
that's fine, we can deal with that here.

> Segmentation fault

OK, I've looked at your kernel messages, and it looks like the problem
comes from this:

	/* Debugging code just in case the in-memory inode orphan list
	 * isn't empty.  The on-disk one can be non-empty if we've
	 * detected an error and taken the fs readonly, but the
	 * in-memory list had better be clean by this point. */
	if (!list_empty(&sbi->s_orphan))
		dump_orphan_list(sb, sbi);
	J_ASSERT(list_empty(&sbi->s_orphan));   <====

This is a "should never happen situation", and we crash so we can
figure out how we got there.  For production kernels, arguably it
would probably be better to print a message and a WARN_ON(1), and then
not force a crash from a BUG_ON (which is what J_ASSERT is defined to
use).
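
A hypothetical softer variant of that assertion (illustration only, not
an actual ext4 patch) could look like:

	if (!list_empty(&sbi->s_orphan)) {
		dump_orphan_list(sb, sbi);
		/* Complain loudly, but let the unmount finish. */
		WARN(1, "EXT4-fs: in-memory orphan list non-empty on unmount\n");
	}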

Looking at your messages and the ext4_delete_inode() warning, I think
I know what caused it.  Can you try this patch (attached below) and
see if it fixes things for you?

> I already reported such issues some time ago, but my reports were
> not well received, so I gave up. Anyway, anybody can easily run
> my tests at any time.

My apologies.  I've gone through the linux-ext4 mailing list logs, and
I can't find any mention of this problem from any username @vlnb.net.
I'm not sure where you reported it, and I'm sorry we dropped your bug
report.  All I can say is that we do the best that we can, and our
team is relatively small and short-handed.

							- Ted

From a190d0386e601d58db6d2a6cbf00dc1c17d02136 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Thu, 29 Jul 2010 14:54:48 -0400
Subject: [PATCH] patch explicitly-drop-inode-from-orphan-list-on-ext4_delete_inode-failure

---
 fs/ext4/inode.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a52d5af..533b607 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -221,6 +221,7 @@ void ext4_delete_inode(struct inode *inode)
 				     "couldn't extend journal (err %d)", err);
 		stop_handle:
 			ext4_journal_stop(handle);
+			ext4_orphan_del(NULL, inode);
 			goto no_delete;
 		}
 	}
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29  1:44                     ` Ted Ts'o
                                         ` (2 preceding siblings ...)
  2010-07-29  8:31                       ` [RFC] relaxed barrier semantics Christoph Hellwig
@ 2010-07-29 19:44                       ` Ric Wheeler
  2010-07-29 19:49                         ` Christoph Hellwig
  2010-07-31  0:35                         ` Jan Kara
  2010-07-29 19:44                       ` Ric Wheeler
  4 siblings, 2 replies; 155+ messages in thread
From: Ric Wheeler @ 2010-07-29 19:44 UTC (permalink / raw)
  To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley

On 07/28/2010 09:44 PM, Ted Ts'o wrote:
> On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote:
>    
>> If we move all filesystems to non-draining barriers with pre- and post-
>> flushes that might actually be a relatively easy first step.  We don't
>> have the complications to deal with multiple types of barriers to
>> start with, and it'll fix the issue for devices without volatile write
>> caches completely.
>>
>> I just need some help from the filesystem folks to determine if they
>> are safe with them.
>>
>> I know for sure that ext3 and xfs are from looking through them.  And
>> I know reiserfs is if we make sure it doesn't hit the code path that
>> relies on it that is currently enabled by the barrier option.
>>
>> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
>> That already ends our small list of barrier supporting filesystems, and
>> possibly ocfs2, too - although the barrier implementation there seems
>> incomplete as it doesn't seem to flush caches in fsync.
>>      
> Define "are safe" --- what interface we planning on using for the
> non-draining barrier?  At least for ext3, when we write the commit
> record using set_buffer_ordered(bh), it assumes that this will do a
> flush of all previous writes and that the commit will hit the disk
> before any subsequent writes are sent to the disk.  So turning the
> write of a buffer head marked with set_buffer_ordered() into a FUA
> write would _not_ be safe for ext3.
>    

I confess that I am a bit fuzzy on FUA, but I think that it means that any 
FUA tagged IO will go down to persistent store before returning.

If so, then all order dependent IO would need to be issued in order and 
tagged with FUA. It would not suffice to tag just the commit record as 
FUA, or do I misunderstand what FUA does?

(Looking to set a record for how many times I can use FUA in an email.)

ric

> For ext4, if we don't use journal checksums, then we have the same
> requirements as ext3, and the same method of requesting it.  If we do
> use journal checksums, what ext4 needs is a way of assuring that no
> writes after the commit are reordered so that they reach the disk
> platter before the commit record --- but any of the writes before that,
> including the commit, can be reordered, because we rely on the checksum
> in the commit record to know at replay time whether the last commit is
> valid or not.  We do that right now by calling blkdev_issue_flush()
> with BLKDEV_IFL_WAIT after submitting the write of the commit block.
>
> 					- Ted
>
>    


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 19:44                       ` [RFC] relaxed barrier semantics Ric Wheeler
@ 2010-07-29 19:49                         ` Christoph Hellwig
  2010-07-29 19:56                           ` Ric Wheeler
  2010-07-31  0:35                         ` Jan Kara
  1 sibling, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29 19:49 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote:
> I confess that I am a bit fuzzy on FUA, but I think that it means that any 
> FUA tagged IO will go down to persistent store before returning.

Exactly.

> If so, then all order dependent IO would need to be issued in order and 
> tagged with FUA. It would not suffice to tag just the commit record as 
> FUA, or do I misunderstand what FUA does?

The commit record is ext3/4-specific terminology.  In xfs we just have
one type of log buffer, and we could tag that as FUA.  There is very
little other dependent I/O, but if that is present we need a pre-flush
for it anyway.
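
As a sketch, the pre-flush + FUA pattern for such a log buffer might
look like the following.  REQ_FUA is an assumed flag name (at the time
of writing the block layer only exposes barriers), and
blkdev_issue_flush() is used with its 2.6.35-era signature:

/*
 * Sketch only, not actual xfs or jbd2 code.  Assumes the caller has
 * already waited for completion of every write the commit record
 * depends on, so no queue drain is needed.
 */
static void commit_with_preflush_fua(struct block_device *bdev,
				     struct bio *commit_bio)
{
	/* Pre-flush: force the already-completed dependent writes out
	 * of the volatile write cache. */
	blkdev_issue_flush(bdev, GFP_NOFS, NULL, BLKDEV_IFL_WAIT);

	/* FUA write: the commit record reaches stable storage before
	 * the bio completes, so it needs no post-flush of its own. */
	submit_bio(WRITE | REQ_FUA, commit_bio);
}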


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 19:49                         ` Christoph Hellwig
@ 2010-07-29 19:56                           ` Ric Wheeler
  2010-07-29 19:59                             ` James Bottomley
  2010-07-29 22:30                             ` Andreas Dilger
  0 siblings, 2 replies; 155+ messages in thread
From: Ric Wheeler @ 2010-07-29 19:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On 07/29/2010 03:49 PM, Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote:
>    
>> I confess that I am a bit fuzzy on FUA, but I think that it means that any
>> FUA tagged IO will go down to persistent store before returning.
>>      
> Exactly.
>
>    
>> If so, then all order dependent IO would need to be issued in order and
>> tagged with FUA. It would not suffice to tag just the commit record as
>> FUA, or do I misunderstand what FUA does?
>>      
> The commit record is ext3/4-specific terminology.  In xfs we just have
> one type of log buffer, and we could tag that as FUA.  There is very
> little other dependent I/O, but if that is present we need a pre-flush
> for it anyway.
>
>    

I assume that for ext3 it would get more complicated depending on the 
journal mode. In ordered or data journal mode, we would have to write 
the dependent non-journal data tagged with FUA, then the FUA tagged 
transaction and finally the FUA tagged commit block.

Not sure how FUA performs, but writing lots of small tagged writes is 
probably not good for performance...

Ric


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 19:56                           ` Ric Wheeler
@ 2010-07-29 19:59                             ` James Bottomley
  2010-07-29 20:03                               ` Christoph Hellwig
  2010-07-29 20:58                               ` Ric Wheeler
  2010-07-29 22:30                             ` Andreas Dilger
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-07-29 19:59 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Thu, 2010-07-29 at 15:56 -0400, Ric Wheeler wrote:
> On 07/29/2010 03:49 PM, Christoph Hellwig wrote:
> > On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote:
> >    
> >> I confess that I am a bit fuzzy on FUA, but I think that it means that any
> >> FUA tagged IO will go down to persistent store before returning.
> >>      
> > Exactly.
> >
> >    
> >> If so, then all order dependent IO would need to be issued in order and
> >> tagged with FUA. It would not suffice to tag just the commit record as
> >> FUA, or do I misunderstand what FUA does?
> >>      
> > The commit record is ext3/4-specific terminology.  In xfs we just have
> > one type of log buffer, and we could tag that as FUA.  There is very
> > little other dependent I/O, but if that is present we need a pre-flush
> > for it anyway.
> >
> >    
> 
> I assume that for ext3 it would get more complicated depending on the 
> journal mode. In ordered or data journal mode, we would have to write 
> the dependent non-journal data tagged with FUA, then the FUA tagged 
> transaction and finally the FUA tagged commit block.
> 
> Not sure how FUA performs, but writing lots of small tagged writes is 
> probably not good for performance...

That's basically everything FUA ... you might just as well switch your
cache to write through and have done.

This, by the way, is one area I'm hoping to have researched on SCSI
(where most devices do obey the caching directives).  Actually see if
write through without flush barriers is faster than writeback with flush
barriers.  I really suspect it is.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29  8:42                         ` Christoph Hellwig
@ 2010-07-29 20:02                           ` Vivek Goyal
  2010-07-29 20:06                             ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vivek Goyal @ 2010-07-29 20:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Thu, Jul 29, 2010 at 10:42:25AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 10:43:34PM -0400, Vivek Goyal wrote:
> > I guess we will require something like a set_buffer_preflush_fua()
> > operation, so that we preflush the cache to make sure everything before
> > the commit block is on the platter and then do the commit block write
> > with FUA to make sure the commit block is on the platter.
> 
> No more messing with buffer flags for barriers / cache flush options
> please.  It's a flag for the I/O submission, not buffer state.  See
> my patch from June to remove BH_Ordered if you're interested.

> 
> > This is assuming that before issuing commit block request we have waited
> > for completion of rest of the journal data. This will make sure none of
> > that journal data is in request queue. Then if we issue commit with 
> > preflush and FUA, it should make sure all the journal blocks are on
> > disk and then commit block is on disk.
> > 
> > So as long as we wait in filesystem for completion of the requests commit
> > block is dependent on, before we issue commit request, we should not
> > require request queue drain and preflush and FUA write probably should
> > be fine.
> 
> We do not require the drain for that case.  The flush is more difficult,
> because it's entirely possible that we have state that we require to be
> on disk before writing out a log buffer.  For XFS that's two cases:
> 
>  (1) we require the actual file data to be on disk before logging the
>      file size update to avoid stale data exposure in case the log
>      buffer hits the disk before the data
>  (2) we require that the buffers writing back metadata actually made it
>      to disk before pushing the log tail
> 
> (1) means we'll always need a pre-flush when a log buffer contains a size
> update from an appending write.
> (2) means we need more complicated tracking of the tail lsn, e.g.
> by caching it somewhere and only updating the cached value after a
> cache flush has happened, with a way to force one if needed.
> 
> All that is at least as complicated as it sounds.  While I have a
> working prototype, just going with the relaxed barriers as a first step
> is probably better.

There are so many mails on this topic now that I am kind of lost. I guess
this has already been asked but I will ask one more time.

Looks like you still want to go with option 2, where you will scan the file
system code for requirements on the DRAIN semantics and, if everything is fine,
then for devices not supporting volatile caches you will mark the request queue
as NONE.

This solves the problem on devices with WCE=0, but what about devices with
WCE=1? If file systems don't require DRAIN semantics anyway, then we
should not require them on devices with WCE=1 either.

If yes, then why not go with another variant of barriers which doesn't
perform DRAIN and just does PREFLUSH + FUA (or a post-flush for devices not
supporting FUA)? File systems could then slowly move to this non-draining
barrier wherever appropriate.

The advantage here is that it should save us the request queue DRAIN even
on devices with WCE=1.

Am I missing something very obvious here?

Vivek

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 19:59                             ` James Bottomley
@ 2010-07-29 20:03                               ` Christoph Hellwig
  2010-07-29 20:07                                 ` James Bottomley
  2010-07-30 12:46                                 ` Vladislav Bolkhovitin
  2010-07-29 20:58                               ` Ric Wheeler
  1 sibling, 2 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29 20:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Christoph Hellwig, Ted Ts'o, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 02:59:51PM -0500, James Bottomley wrote:
> That's basically everything FUA ... you might just as well switch your
> cache to write through and have done.
> 
> This, by the way, is one area I'm hoping to have researched on SCSI
> (where most devices do obey the caching directives).  Actually see if
> write through without flush barriers is faster than writeback with flush
> barriers.  I really suspect it is.

We have done the research, and at least for XFS a write through cache
actually is faster for many workloads.  Ric always has workloads where
the cache is faster, though - mostly setups doing lots of small file
writes.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 20:02                           ` Vivek Goyal
@ 2010-07-29 20:06                             ` Christoph Hellwig
  2010-07-30  3:17                               ` Vivek Goyal
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29 20:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 04:02:17PM -0400, Vivek Goyal wrote:
> Looks like you still want to go with option 2, where you will scan the file
> system code for requirements on the DRAIN semantics and, if everything is fine,
> then for devices not supporting volatile caches you will mark the request queue
> as NONE.

The filesystem can't simply change the request queue settings. A request
queue is often shared by multiple filesystems that can have very
different requirements.

> This solves the problem on devices with WCE=0, but what about devices with
> WCE=1? If file systems don't require DRAIN semantics anyway, then we
> should not require them on devices with WCE=1 either.

Yes.

> If yes, then why not go with another variant of barriers which doesn't
> perform DRAIN and just does PREFLUSH + FUA (or a post-flush for devices not
> supporting FUA)?

I've been trying to prototype it, but it's in fact rather hard to
get right.  Tejun has done a really good job on the current
barrier implementation, and coming up with something just half as
clever for the relaxed barriers has been driving me mad.

> File systems could then slowly move to this non-draining
> barrier wherever appropriate.

Actually supporting different kinds of barriers at the same time
is even harder.  We'd need two different state machines for them,
including the actual state in the request_queue, and then we'd have to
make sure that different filesystems on the same queue using different
types work well together.  If at all possible, switching the semantics
on a flag day would make life a lot simpler.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 20:03                               ` Christoph Hellwig
@ 2010-07-29 20:07                                 ` James Bottomley
  2010-07-29 20:11                                   ` Christoph Hellwig
  2010-07-30 12:46                                 ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 155+ messages in thread
From: James Bottomley @ 2010-07-29 20:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Thu, 2010-07-29 at 22:03 +0200, Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 02:59:51PM -0500, James Bottomley wrote:
> > That's basically everything FUA ... you might just as well switch your
> > cache to write through and have done.
> > 
> > This, by the way, is one area I'm hoping to have researched on SCSI
> > (where most devices do obey the caching directives).  Actually see if
> > write through without flush barriers is faster than writeback with flush
> > barriers.  I really suspect it is.
> 
> We have done the research, and at least for XFS a write through cache
> actually is faster for many workloads.  Ric always has workloads where
> the cache is faster, though - mostly setups doing lots of small file
> writes.

There's lies, damned lies and benchmarks ... but what I was thinking is:
could we just do the right thing?  SCSI exposes (in sd) the interfaces
to change the cache setting, so if the customer *doesn't* specify
barriers on mount, could we just flip the device to write through?  It
would be more performant in most use cases.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 20:07                                 ` James Bottomley
@ 2010-07-29 20:11                                   ` Christoph Hellwig
  2010-07-30 12:45                                     ` Vladislav Bolkhovitin
  2010-08-04  1:58                                     ` Jamie Lokier
  0 siblings, 2 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29 20:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Ric Wheeler, Ted Ts'o, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 03:07:17PM -0500, James Bottomley wrote:
> There's lies, damned lies and benchmarks ... but what I was thinking is:
> could we just do the right thing?  SCSI exposes (in sd) the interfaces
> to change the cache setting, so if the customer *doesn't* specify
> barriers on mount, could we just flip the device to write through?  It
> would be more performant in most use cases.

We could for SCSI and ATA, but probably not easily for other kinds of
storage.  Except that it's not that simple, as we have partitions and
volume managers in between - different filesystems sitting on the same
device might have very different ideas of what they want.

For SCSI we can at least permanently disable the cache, but ATA devices
keep coming up again with the volatile write cache enabled after a
reboot, or even worse a suspend to ram / resume cycle.  The latter is
what keeps me from just disabling the volatile cache on my laptop,
despite that option giving significantly better performance for typical
kernel developer workloads.
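
For reference, here is a minimal user-space sketch of flipping an ATA
disk to write through -- roughly what "hdparm -W 0" does, using the
HDIO_SET_WCACHE ioctl.  Because the setting is volatile, something
like this would have to be rerun on every boot and resume:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/hdreg.h>

int main(int argc, char **argv)
{
	unsigned long enable = 0;	/* 0 = write through, 1 = write back */
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_NONBLOCK);
	if (fd < 0 || ioctl(fd, HDIO_SET_WCACHE, enable) < 0) {
		perror("HDIO_SET_WCACHE");
		return 1;
	}
	close(fd);
	return 0;
}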


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 19:59                             ` James Bottomley
  2010-07-29 20:03                               ` Christoph Hellwig
@ 2010-07-29 20:58                               ` Ric Wheeler
  1 sibling, 0 replies; 155+ messages in thread
From: Ric Wheeler @ 2010-07-29 20:58 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On 07/29/2010 03:59 PM, James Bottomley wrote:
> On Thu, 2010-07-29 at 15:56 -0400, Ric Wheeler wrote:
>    
>> On 07/29/2010 03:49 PM, Christoph Hellwig wrote:
>>      
>>> On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote:
>>>
>>>        
>>>> I confess that I am a bit fuzzy on FUA, but think that it means that any
>>>> FUA tagged IO will go down to persistent store before returning.
>>>>
>>>>          
>>> Exactly.
>>>
>>>
>>>        
>>>> If so, then all order dependent IO would need to be issued in order and
>>>> tagged with FUA. It would not suffice to tag just the commit record as
>>>> FUA, or do I misunderstand what FUA does?
>>>>
>>>>          
>>> The commit record is ext3/4 specific terminology.  In xfs we just have
>>> one type of log buffer, and we could tag that as FUA.  There is very
>>> little other dependent I/O, but if that is present we need a pre-flush
>>> for it anyway.
>>>
>>>
>>>        
>> I assume that for ext3 it would get more complicated depending on the
>> journal mode. In ordered or data journal mode, we would have to write
>> the dependent non-journal data tagged with FUA, then the FUA tagged
>> transaction and finally the FUA tagged commit block.
>>
>> Not sure how FUA performs, but writing lots of small tagged writes is
>> probably not good for performance...
>>      
> That's basically everything FUA ... you might just as well switch your
> cache to write through and have done.
>    

I think that for data=ordered mode, more or less all of the data would 
get tagged. For data=journal, would we have to send 2x the write 
workload down with tags?  I agree that this would be dubious at best.

Note that using the non-FUA cache flush commands, while brute force, 
does have a clear win on slower devices (S-ATA specifically).  Each time 
I have looked, running with the write cache enabled on S-ATA was a win 
(a big win on streaming write performance, not sure why).

On SAS drives, the flush barriers were not as large a delta (I do not 
remember which won out).

> This, by the way, is one area I'm hoping to have researched on SCSI
> (where most devices do obey the caching directives).  Actually see if
> write through without flush barriers is faster than writeback with flush
> barriers.  I really suspect it is.
>
> James
>    

There are clearly much better ways to do this. Even the flushes, if we 
could flush ranges that matched the partition under the file system, 
would be better than today, where we flush the entire physical device.
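
Something like the following (purely hypothetical -- no such primitive
exists today, and sector_t here stands in for the kernel's type of the
same name) is what a range flush interface might look like:

typedef unsigned long long sector_t;
struct block_device;

/* Flush only the LBA range backing one partition / file system,
 * instead of the whole device cache. */
int blkdev_issue_flush_range(struct block_device *bdev,
			     sector_t start, sector_t nr_sects);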

Ric




^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 19:56                           ` Ric Wheeler
  2010-07-29 19:59                             ` James Bottomley
@ 2010-07-29 22:30                             ` Andreas Dilger
  2010-07-29 23:04                               ` Ted Ts'o
  1 sibling, 1 reply; 155+ messages in thread
From: Andreas Dilger @ 2010-07-29 22:30 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On 2010-07-29, at 13:56, Ric Wheeler wrote:
> I assume that for ext3 it would get more complicated depending on the journal mode. In ordered or data journal mode, we would have to write the dependent non-journal data tagged with FUA, then the FUA tagged transaction and finally the FUA tagged commit block.

Like James wrote, this is basically everything FUA.  It is OK for ordered mode to allow the device to aggregate the normal filesystem and journal IO, but when the commit block is written it should flush all of the previously written data to disk.  This still allows request re-ordering and merging inside the device, but orders the data vs. the commit block.  Having the proposed "flush ranges" interface to the disk would be ideal, since there would be no wasted time flushing data that does not need it (i.e. other partitions).

There is no need to prevent new data from being written during a cache flush, since ext*/jbd will already manage any required data/metadata ordering internally.

There was some proposal (maybe from Eric Sandeen?) about having a device-level IO request counter that numbers every request submitted; with multiple partitions per device, or fsync operations that flush the whole device cache, it is then possible to determine from the request number whether there has already been a cache flush after that request on that device.  This avoids extra cache flushes if one was just done for another file or partition on the same device.
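
A minimal user-space sketch of that idea (all names made up): every
submitted request gets a per-device sequence number, a flush records
the highest number it covered, and a later fsync can skip the flush
if nothing newer was submitted since:

#include <stdbool.h>
#include <stdio.h>

static unsigned long long next_seq;	/* per-device request counter */
static unsigned long long flushed_seq;	/* highest seq covered by the last flush */

static unsigned long long submit_write(void)
{
	return ++next_seq;		/* number every submitted request */
}

static bool flush_needed(unsigned long long my_seq)
{
	return my_seq > flushed_seq;	/* not yet covered by a flush? */
}

static void cache_flush(void)
{
	flushed_seq = next_seq;		/* everything submitted so far is stable */
}

int main(void)
{
	unsigned long long a = submit_write();	/* file A, partition 1 */
	unsigned long long b = submit_write();	/* file B, partition 2 */

	if (flush_needed(a))
		cache_flush();			/* fsync(A) flushes the device */

	/* fsync(B) can now skip the flush: the one above already covered it */
	printf("flush for B needed: %s\n", flush_needed(b) ? "yes" : "no");
	return 0;
}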

Cheers, Andreas






^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 22:30                             ` Andreas Dilger
@ 2010-07-29 23:04                               ` Ted Ts'o
  2010-07-29 23:08                                 ` Ric Wheeler
                                                   ` (6 more replies)
  0 siblings, 7 replies; 155+ messages in thread
From: Ted Ts'o @ 2010-07-29 23:04 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: 
> Like James wrote, this is basically everything FUA.  It is OK for
> ordered mode to allow the device to aggregate the normal filesystem
> and journal IO, but when the commit block is written it should flush
> all of the previously written data to disk.  This still allows
> request re-ordering and merging inside the device, but orders the
> data vs. the commit block.  Having the proposed "flush ranges"
> interface to the disk would be ideal, since there would be no wasted
> time flushing data that does not need it (i.e. other partitions).

My understanding is that "everything FUA" can be a performance
disaster.  That's because it bypasses the track buffer, and things get
written directly to disk.  So there is no possibility to reorder
buffers so that they get written in one disk rotation.  Depending on
the disk, it might even be that if you send N sequential sectors all
tagged with FUA, it could be slower than sending the N sectors
followed by a cache flush or SYNCHRONIZE_CACHE command.

It may be worth doing some experiments to see how big N is for various
disks, but I'm pretty sure that FUA will probably turn out to not be
such a great idea for ext3/ext4.
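
A rough user-space experiment along these lines (a sketch, not a
rigorous benchmark): time N small sequential writes forced out
individually with O_DSYNC against N plain O_DIRECT writes followed by
a single fdatasync().  Whether O_DSYNC actually translates to FUA
depends on the device and the kernel, so treat the numbers as an
approximation only:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define N	256
#define BLKSZ	4096

static double run(const char *path, int oflags, int one_flush)
{
	struct timespec t0, t1;
	void *buf;
	int fd, i;

	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		exit(1);
	memset(buf, 0xab, BLKSZ);

	fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | oflags, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < N; i++) {
		if (pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ) {
			perror("pwrite");
			exit(1);
		}
	}
	if (one_flush && fdatasync(fd) < 0) {	/* one flush instead of N forced writes */
		perror("fdatasync");
		exit(1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(fd);
	free(buf);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	printf("O_DSYNC per write : %.3fs\n", run(argv[1], O_DSYNC, 0));
	printf("writes + fdatasync: %.3fs\n", run(argv[1], 0, 1));
	return 0;
}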

						- Ted

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 23:04                               ` Ted Ts'o
@ 2010-07-29 23:08                                 ` Ric Wheeler
  2010-07-29 23:08                                 ` Ric Wheeler
                                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 155+ messages in thread
From: Ric Wheeler @ 2010-07-29 23:08 UTC (permalink / raw)
  To: Ted Ts'o, Andreas Dilger, Christoph Hellwig, Tejun Heo,
	Vivek Goyal, Jan Kara

On 07/29/2010 07:04 PM, Ted Ts'o wrote:
> On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote:
>    
>> Like James wrote, this is basically everything FUA.  It is OK for
>> ordered mode to allow the device to aggregate the normal filesystem
>> and journal IO, but when the commit block is written it should flush
>> all of the previously written data to disk.  This still allows
>> request re-ordering and merging inside the device, but orders the
>> data vs. the commit block.  Having the proposed "flush ranges"
>> interface to the disk would be ideal, since there would be no wasted
>> time flushing data that does not need it (i.e. other partitions).
>>      
> My understanding is that "everything FUA" can be a performance
> disaster.  That's because it bypasses the track buffer, and things get
> written directly to disk.  So there is no possibility to reorder
> buffers so that they get written in one disk rotation.  Depending on
> the disk, it might even be that if you send N sequential sectors all
> tagged with FUA, it could be slower than sending the N sectors
> followed by a cache flush or SYNCHRONIZE_CACHE command.
>    

You certainly can reorder in a drive with FUA; you just cannot ACK the 
write until the tagged request is on disk.

That clearly depends on the firmware of the device and, if it is an 
uncommon request, firmware people are unlikely to have spent much 
thought and time on doing it right :-)

> It may be worth doing some experiments to see how big N is for various
> disks, but I'm pretty sure that FUA will probably turn out to not be
> such a great idea for ext3/ext4.
>
> 						- Ted
>    

I am also sceptical and would expect a lot of variability in the results,

Ric



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 23:04                               ` Ted Ts'o
  2010-07-29 23:08                                 ` Ric Wheeler
  2010-07-29 23:08                                 ` Ric Wheeler
@ 2010-07-29 23:28                                 ` James Bottomley
  2010-07-29 23:37                                   ` James Bottomley
  2010-07-30 12:56                                   ` Vladislav Bolkhovitin
  2010-07-30  7:11                                 ` Christoph Hellwig
                                                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-07-29 23:28 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On Thu, 2010-07-29 at 19:04 -0400, Ted Ts'o wrote:
> On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: 
> > Like James wrote, this is basically everything FUA.  It is OK for
> > ordered mode to allow the device to aggregate the normal filesystem
> > and journal IO, but when the commit block is written it should flush
> > all of the previously written data to disk.  This still allows
> > request re-ordering and merging inside the device, but orders the
> > data vs. the commit block.  Having the proposed "flush ranges"
> > interface to the disk would be ideal, since there would be no wasted
> > time flushing data that does not need it (i.e. other partitions).
> 
> My understanding is that "everything FUA" can be a performance
> disaster.  That's because it bypasses the track buffer, and things get
> written directly to disk.  So there is no possibility to reorder
> buffers so that they get written in one disk rotation.  Depending on
> the disk, it might even be that if you send N sequential sectors all
> tagged with FUA, it could be slower than sending the N sectors
> followed by a cache flush or SYNCHRONIZE_CACHE command.

I think we're getting into disk differences here.  This certainly isn't
correct for SCSI disks.  The standard enterprise configuration for a
SCSI disk is actually cache set to write through ... so FUA is a nop.
Even for Write Back cache SCSI devices, FUA is just a wait until I/O is
on media, which is pretty much equivalent to the write through case for
the given cache lines.

I can see the problems you describe possibly affecting ATA devices with
less sophisticated caches ... but, realistically, SATA and SAS devices
come from virtually the same manufacturing process ... I'd be really
surprised if they didn't share caching technologies.

> It may be worth doing some experiments to see how big N is for various
> disks, but I'm pretty sure that FUA will probably turn out to not be
> such a great idea for ext3/ext4.

I think we should definitely run the benchmarks.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 23:28                                 ` James Bottomley
@ 2010-07-29 23:37                                   ` James Bottomley
  2010-07-30  0:19                                     ` Ted Ts'o
  2010-07-30 12:56                                   ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 155+ messages in thread
From: James Bottomley @ 2010-07-29 23:37 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On Thu, 2010-07-29 at 18:28 -0500, James Bottomley wrote:
> On Thu, 2010-07-29 at 19:04 -0400, Ted Ts'o wrote:
> > On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: 
> > > Like James wrote, this is basically everything FUA.  It is OK for
> > > ordered mode to allow the device to aggregate the normal filesystem
> > > and journal IO, but when the commit block is written it should flush
> > > all of the previously written data to disk.  This still allows
> > > request re-ordering and merging inside the device, but orders the
> > > data vs. the commit block.  Having the proposed "flush ranges"
> > > interface to the disk would be ideal, since there would be no wasted
> > > time flushing data that does not need it (i.e. other partitions).
> > 
> > My understanding is that "everything FUA" can be a performance
> > disaster.  That's because it bypasses the track buffer, and things get
> > written directly to disk.  So there is no possibility to reorder
> > buffers so that they get written in one disk rotation.  Depending on
> > the disk, it might even be that if you send N sequential sectors all
> > tagged with FUA, it could be slower than sending the N sectors
> > followed by a cache flush or SYNCHRONIZE_CACHE command.
> 
> I think we're getting into disk differences here.  This certainly isn't
> correct for SCSI disks.  The standard enterprise configuration for a
> SCSI disk is actually cache set to write through ... so FUA is a nop.
> Even for Write Back cache SCSI devices, FUA is just a wait until I/O is
> on media, which is pretty much equivalent to the write through case for
> the given cache lines.
> 
> I can see the problems you describe possibly affecting ATA devices with
> less sophisticated caches ... but, realistically, SATA and SAS devices
> come from virtually the same manufacturing process ... I'd be really
> surprised if they didn't share caching technologies.

Actually, just an update on this now that I've taken my SCSI glasses
off.  Anything that does tagging properly ... like SCSI or SATA NCQ
shouldn't have this problem because the multiple outstanding tags hide
the media access latency.  For untagged devices, yes, it will be
painful.

James



^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 23:37                                   ` James Bottomley
@ 2010-07-30  0:19                                     ` Ted Ts'o
  0 siblings, 0 replies; 155+ messages in thread
From: Ted Ts'o @ 2010-07-30  0:19 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 06:37:35PM -0500, James Bottomley wrote:
> Actually, just an update on this now that I've taken my SCSI glasses
> off.  Anything that does tagging properly ... like SCSI or SATA NCQ
> shouldn't have this problem because the multiple outstanding tags hide
> the media access latency.  For untagged devices, yes, it will be
> painful.
> 

Maybe I'm just being too paranoid and not trusting enough of the
competence of firmware authors, but let's do a lot of testing on this
first.  Or let's have some options so we can turn off FUA if it turns
out to be a disaster on a particular device.  I'll have to do some
searching, but I distinctly remember reading an article in Ars
Technica or AnandTech about how FUA wasn't all that useful, based on
what the writer had seen when testing some specific devices.
Maybe that was a while ago and devices have gotten better, and maybe
that writer was on crack, but given that FUA doesn't get used a lot,
I'm nervous....

						- Ted

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 20:06                             ` Christoph Hellwig
@ 2010-07-30  3:17                               ` Vivek Goyal
  2010-07-30  7:07                                 ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vivek Goyal @ 2010-07-30  3:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Thu, Jul 29, 2010 at 10:06:55PM +0200, Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 04:02:17PM -0400, Vivek Goyal wrote:
> > Looks like you still want to go with option 2 where you will scan the file
> > system code for requirement of DRAIN semantics and everything is fine then for
> > devices no supporting volatile caches, you will mark request queue as NONE.
> 
> The filesystem can't simply change the request queue settings. A request
> queue is often shared by multiple filesystems that can have very
> different requirements.
> 
> > This solves the problem on devices with WCE=0 but what about devices with
> > WCE=1. If file systems anyway don't require DRAIN semantics, then we
> > should not require it on devices with WCE=1 also?
> 
> Yes.
> 
> > If yes, then why not go with another variant of barriers which don't
> > perform DRAIN and just do PREFLUSH + FUA (or post flush for devices not
> > supporting FUA).
> 
> I've been trying to prototype it, but it's in fact rather hard to
> get this right.  Tejun has done a really good job at the current
> barrier implementation and coming up with something just half as
> clever for the relaxed barriers has been driving me mad.
> 
> > And then file systems can slowly move to using this non
> > draining barrier usage wherever appropriate.
> 
> Actually supporting different kinds of barriers at the same time
> is even harder.  We'd need two different state machines for them,
> including the actual state in the request_queue.  And then we'd have
> to make sure that different filesystems on the same queue using
> different types work well together.  If at all possible, switching
> the semantics on a flag day would make life a lot simpler.

Hi Christoph,

I was looking at the barrier code and trying to work out how hard it
would be to support a new barrier type which does not implement DRAIN
but only does PREFLUSH + FUA for devices with WCE=1.

To me it looked as if everything is there and it is just a matter of
skipping the elevator draining and request queue draining.

Can you please have a look at the attached patch?  This is not a
complete patch but just a part of it, if we were to implement another
barrier type, say FLUSHBARRIER.  Do you think this will work, or am I
blissfully unaware of the complexity here and oversimplifying things?

Thanks
Vivek

---
 block/blk-barrier.c    |   14 +++++++++++++-
 block/elevator.c       |    3 ++-
 include/linux/blkdev.h |    5 ++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2010-06-19 09:54:32.000000000 -0400
+++ linux-2.6/include/linux/blkdev.h	2010-07-29 22:36:52.000000000 -0400
@@ -97,6 +97,7 @@ enum rq_flag_bits {
 	__REQ_SORTED,		/* elevator knows about this request */
 	__REQ_SOFTBARRIER,	/* may not be passed by ioscheduler */
 	__REQ_HARDBARRIER,	/* may not be passed by drive either */
+	__REQ_FLUSHBARRIER,	/* only flush barrier. no drains required  */
 	__REQ_FUA,		/* forced unit access */
 	__REQ_NOMERGE,		/* don't touch this for merging */
 	__REQ_STARTED,		/* drive already may have started this one */
@@ -126,6 +127,7 @@ enum rq_flag_bits {
 #define REQ_SORTED	(1 << __REQ_SORTED)
 #define REQ_SOFTBARRIER	(1 << __REQ_SOFTBARRIER)
 #define REQ_HARDBARRIER	(1 << __REQ_HARDBARRIER)
+#define REQ_FLUSHBARRIER	(1 << __REQ_FLUSHBARRIER)
 #define REQ_FUA		(1 << __REQ_FUA)
 #define REQ_NOMERGE	(1 << __REQ_NOMERGE)
 #define REQ_STARTED	(1 << __REQ_STARTED)
@@ -626,6 +628,7 @@ enum {
 #define blk_rq_cpu_valid(rq)	((rq)->cpu != -1)
 #define blk_sorted_rq(rq)	((rq)->cmd_flags & REQ_SORTED)
 #define blk_barrier_rq(rq)	((rq)->cmd_flags & REQ_HARDBARRIER)
+#define blk_flush_barrier_rq(rq)	((rq)->cmd_flags & REQ_FLUSHBARRIER)
 #define blk_fua_rq(rq)		((rq)->cmd_flags & REQ_FUA)
 #define blk_discard_rq(rq)	((rq)->cmd_flags & REQ_DISCARD)
 #define blk_bidi_rq(rq)		((rq)->next_rq != NULL)
@@ -681,7 +684,7 @@ static inline void blk_clear_queue_full(
  * it already be started by driver.
  */
 #define RQ_NOMERGE_FLAGS	\
-	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | REQ_FLUSHBARRIER)
 #define rq_mergeable(rq)	\
 	(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
 	 (blk_discard_rq(rq) || blk_fs_request((rq))))
Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2010-06-19 09:54:29.000000000 -0400
+++ linux-2.6/block/blk-barrier.c	2010-07-29 23:02:05.000000000 -0400
@@ -219,7 +219,8 @@ static inline bool start_ordered(struct 
 	} else
 		skip |= QUEUE_ORDSEQ_PREFLUSH;
 
-	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q)
+	    && !blk_flush_barrier_rq(rq))
 		rq = NULL;
 	else
 		skip |= QUEUE_ORDSEQ_DRAIN;
@@ -241,6 +242,17 @@ bool blk_do_ordered(struct request_queue
 	if (!q->ordseq) {
 		if (!is_barrier)
 			return true;
+		/*
+		 * For flush only barriers, nothing has to be done if there is
+		 * no caching happening on the device. The barrier request
+		 * still has to be written to disk, but it can be written as
+		 * a normal rq.
+		 */
+
+		if (blk_flush_barrier_rq(rq)
+		    && (q->ordered & QUEUE_ORDERED_BY_DRAIN
+		        || q->ordered & QUEUE_ORDERED_BY_TAG))
+			return true;
 
 		if (q->next_ordered != QUEUE_ORDERED_NONE)
 			return start_ordered(q, rqp);
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c	2010-06-19 09:54:29.000000000 -0400
+++ linux-2.6/block/elevator.c	2010-07-29 23:06:21.000000000 -0400
@@ -628,7 +628,8 @@ void elv_insert(struct request_queue *q,
 
 	case ELEVATOR_INSERT_BACK:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-		elv_drain_elevator(q);
+		if (!blk_flush_barrier_rq(rq))
+			elv_drain_elevator(q);
 		list_add_tail(&rq->queuelist, &q->queue_head);
 		/*
 		 * We kick the queue here for the following reasons.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30  3:17                               ` Vivek Goyal
@ 2010-07-30  7:07                                 ` Christoph Hellwig
  2010-07-30  7:41                                   ` Vivek Goyal
  2010-08-02 18:28                                   ` [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) Vivek Goyal
  0 siblings, 2 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30  7:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 11:17:21PM -0400, Vivek Goyal wrote:
> To me it looked as if everything is there and it is just a matter of
> skipping the elevator draining and request queue draining.

The problem is that it just appears to be so.  The code blocking only
the next barrier for tagged writes is there, but in that form it doesn't
work and probably never did.  When I try to use it and debug it I always
get my post-flush request issued before the barrier request has
finished.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 23:04                               ` Ted Ts'o
                                                   ` (3 preceding siblings ...)
  2010-07-30  7:11                                 ` Christoph Hellwig
@ 2010-07-30  7:11                                 ` Christoph Hellwig
  2010-07-30 12:56                                 ` Vladislav Bolkhovitin
  2010-07-30 12:56                                 ` Vladislav Bolkhovitin
  6 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30  7:11 UTC (permalink / raw)
  To: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig,
	Tejun Heo, Vivek Goyal

On Thu, Jul 29, 2010 at 07:04:06PM -0400, Ted Ts'o wrote:
> My understanding is that "everything FUA" can be a performance
> disaster.  That's because it bypasses the track buffer, and things get
> written directly to disk.  So there is no possibility to reorder
> buffers so that they get written in one disk rotation.  Depending on
> the disk, it might even be that if you send N sequential sectors all
> tagged with FUA, it could be slower than sending the N sectors
> followed by a cache flush or SYNCHRONIZE_CACHE command.

Not sure why the discussion is drifting in this direction again, but no
one suggested switching everyone to forcefully use a FUA-only primitive.
If we offer a WRITE_FUA primitive to those who can make use of it, it
won't mean that the cache flush primitive will go away - we will need it
to implement fsync anyway.

> 

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30  7:07                                 ` Christoph Hellwig
@ 2010-07-30  7:41                                   ` Vivek Goyal
  2010-08-02 18:28                                   ` [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) Vivek Goyal
  1 sibling, 0 replies; 155+ messages in thread
From: Vivek Goyal @ 2010-07-30  7:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Fri, Jul 30, 2010 at 09:07:32AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 11:17:21PM -0400, Vivek Goyal wrote:
> > To me it looked as if everything is there and it is just a matter of
> > skipping the elevator draining and request queue draining.
> 
> The problem is that it just appears to be so.  The code blocking only
> the next barrier for tagged writes is there, but in that form it doesn't
> work and probably never did.  When I try to use it and debug it I always
> get my post-flush request issued before the barrier request has
> finished.

Are you referring to the following piece of code?

if (q->ordered & QUEUE_ORDERED_BY_TAG) {
	/* Ordered by tag.  Blocking the next barrier is enough. */
	if (is_barrier && rq != &q->bar_rq)
		*rqp = NULL;

If the request queue is ordered by TAG, then isn't it ok to issue the
post-flush immediately after the barrier (without waiting for the
barrier request to finish)?  We just need to block the next barrier (a
new barrier, not the post-flush request of the current barrier).  I
thought that for a tagged queue the controller will take care of making
sure commands finish in order.

If the queue is ordered by DRAIN, then I need to wait for the barrier
to finish before issuing the post-flush, and I thought the following
should take care of it.

        } else {
                /* Ordered by draining.  Wait for turn. */
                WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
                if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
                        *rqp = NULL;
        }

Maybe there is a bug somewhere. I will do some debugging.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 20:11                                   ` Christoph Hellwig
@ 2010-07-30 12:45                                     ` Vladislav Bolkhovitin
  2010-07-30 12:56                                       ` Christoph Hellwig
  2010-08-04  1:58                                     ` Jamie Lokier
  1 sibling, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 12:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 07/30/2010 12:11 AM wrote:
> On Thu, Jul 29, 2010 at 03:07:17PM -0500, James Bottomley wrote:
>> There's lies, damned lies and benchmarks .. but what I was thinking is
>> could we just do the right thing?  SCSI exposes (in sd) the interfaces
>> to change the cache setting, so if the customer *doesn't* specify
>> barriers on mount, could we just flip the device to write through?  It
>> would be more performant in most use cases.
>
> We could for SCSI and ATA, but probably not easily for other kinds of
> storage.  Except that it's not that simple, as we have partitions and
> volume managers in between - different filesystems sitting on the same
> device might have very different ideas of what they want.
>
> For SCSI we can at least permanently disable the cache, but ATA devices
> keep coming up again with the volatile write cache enabled after a
> reboot, or even worse a suspend to ram / resume cycle.  The latter is
> what keeps me from just disabling the volatile cache on my laptop,
> despite that option giving significantly better performance for typical
> kernel developer workloads.

There are also SCSI devices which keep changed settings only until the 
next reset/restart. (The devices might be shared, so other initiators 
can reset them at any time.)

So, to keep the changed settings from being reset, there must be a 
procedure which catches the corresponding notification event (RESET 
Unit Attention for SCSI) and sets the reset settings back to the 
desired values.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 20:03                               ` Christoph Hellwig
  2010-07-29 20:07                                 ` James Bottomley
@ 2010-07-30 12:46                                 ` Vladislav Bolkhovitin
  2010-07-30 12:57                                   ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 12:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 07/30/2010 12:03 AM wrote:
> On Thu, Jul 29, 2010 at 02:59:51PM -0500, James Bottomley wrote:
>> That's basically everything FUA ... you might just as well switch your
>> cache to write through and have done.
>>
>> This, by the way, is one area I'm hoping to have researched on SCSI
>> (where most devices do obey the caching directives).  Actually see if
>> write through without flush barriers is faster than writeback with flush
>> barriers.  I really suspect it is.
>
> We have done the research and at least for XFS a write through cache
> actually is faster for many workloads.  Ric always has workloads where
> the cache is faster, though - mostly doing lots of small file write
> kind of setups.

I suppose that with the write back cache you did the queue drain after 
request(s) with ordered requirements, correct? Did you also do the 
queue drain in the same places with write through caching?

Just in case, to be sure the comparison was fair. I can't see why a 
sequence of [(write command/internal cache sync) .. (write 
command/internal cache sync)] for write through caching should be faster 
than a sequence of [(write command) .. (write command) (cache sync) .. 
(write command) .. (write command) (cache sync)], unless there is 
additional queue flushing (draining) in the latter case. I think we need 
to explain that before doing the next step.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 23:04                               ` Ted Ts'o
                                                   ` (4 preceding siblings ...)
  2010-07-30  7:11                                 ` Christoph Hellwig
@ 2010-07-30 12:56                                 ` Vladislav Bolkhovitin
  2010-07-30 13:07                                   ` Tejun Heo
  2010-07-30 13:09                                   ` Christoph Hellwig
  2010-07-30 12:56                                 ` Vladislav Bolkhovitin
  6 siblings, 2 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 12:56 UTC (permalink / raw)
  To: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig,
	Tejun Heo, Vivek Goyal

Ted Ts'o, on 07/30/2010 03:04 AM wrote:
> On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote:
>> Like James wrote, this is basically everything FUA.  It is OK for
>> ordered mode to allow the device to aggregate the normal filesystem
>> and journal IO, but when the commit block is written it should flush
>> all of the previously written data to disk.  This still allows
>> request re-ordering and merging inside the device, but orders the
>> data vs. the commit block.  Having the proposed "flush ranges"
>> interface to the disk would be ideal, since there would be no wasted
>> time flushing data that does not need it (i.e. other partitions).
>
> My understanding is that "everything FUA" can be a performance
> disaster.  That's because it bypasses the track buffer, and things get
> written directly to disk.  So there is no possibility to reorder
> buffers so that they get written in one disk rotation.  Depending on
> the disk, it might even be that if you send N sequential sectors all
> tagged with FUA, it could be slower than sending the N sectors
> followed by a cache flush or SYNCHRONIZE_CACHE command.

It should be, because it gives the drive the opportunity to better load 
its internal resources and provide data transfer pipelining. Although, 
of course, it's possible to imagine a stupid drive with nearly broken 
caching which would work faster in write through mode.

I used the word "drive", not "disk", above, because I think this 
discussion is not only about disks. Storage might be not only disks, 
but also external arrays and even clusters of arrays. They all look to 
the system like single "disks", but are much more advanced and 
sophisticated in all internal capabilities than dumb (S)ATA disks. Now 
such arrays and clusters are getting more and more commonly used. 
Anybody can make such an array using just a Linux box with any OSS SCSI 
target software and use it with a variety of interfaces: iSCSI, Fibre 
Channel, SAS, InfiniBand and even familiar parallel SCSI (funny, 2 
Linux boxes connected by Wide SCSI :) ).

So, why limit the discussion to low-end disks only? I believe it 
would be more productive if we first determine the set of capabilities 
which should be used for the best performance and which advanced 
storage devices can provide, and then go down to the lower end, 
eliminating the use of the advanced features and sacrificing 
performance. Otherwise, ignoring the "hardware offload" which advanced 
devices provide, we would never achieve the best performance they 
could give.

I'd start the analysis of the best-performance facilities from the following:

1. Full set of SCSI queuing and task management control facilities. Namely:

  - SIMPLE, ORDERED, ACA and, maybe, HEAD OF QUEUE command attributes

  - Never draining the queue to wait for completion of one or more 
commands, except in some rare error recovery cases.

  - ACA and UA_INTRCK for protecting the queue order in case one or 
more commands in it finish abnormally.

  - Use of write back caching by default, switching to write through 
only for "blacklisted" drives.

  - FUA for sequences of a few write commands, where either a 
SYNCHRONIZE_CACHE command is overkill, or there is an internal order 
dependency between the commands, so they must be written to the media 
exactly in the required order.

So, for instance, a naive sequence of meta-data updates with the 
corresponding journal writes would be a chain of commands:

1. 1st journal write command (SIMPLE)

2. 2nd journal write command (SIMPLE)

3. 3rd journal write command (SIMPLE)

4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED)

5. Necessary amount of meta-data update commands (all SIMPLE)

6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED)

7. Command marking the transaction committed in the journal (ORDERED)

That's all. No queue draining anywhere. Plus, sending commands without 
internal order requirements as SIMPLE would allow the drive to better 
schedule execution of them among internal storage (actual disks).

For an error recovery case, consider command (4) finishing abnormally 
because of some external event, like a Unit Attention. The drive would 
then establish an ACA condition and suspend the command queue, with 
the commands from (5) at the head. The system would retry this command 
with the ACA attribute and, when it finished, clear the ACA condition. 
The drive would then resume the queue, and the commands at the head 
((5)) would start being processed.

For a simpler device (a disk without support for ORDERED queuing) the 
same meta-data updates would be:

1. 1st journal write command

2. 2nd journal write command

3. 3rd journal write command

4. The queue draining.

5. SYNCHRONIZE_CACHE

6. The queue draining.

7. Necessary amount of meta-data update commands

8. The queue draining.

9. SYNCHRONIZE_CACHE for blocks written in 7

10. The queue draining.

11. Command marking the transaction committed in the journal

Then we would need to figure out an interface for file systems to let 
them specify the necessary ordering and cache flushing requirements in 
a generic way. The current interface looks almost good, but:

1. In it, the semantics of "barrier" are quite overloaded, hence 
confusing and hard to implement.

2. It doesn't allow binding several requests into an ordered chain (a 
rough sketch of what such an interface might look like follows below).
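
Purely hypothetical sketch of point 2 -- none of these names exist in
the kernel; this is only what such an ordered-chain interface might
look like from a filesystem's point of view:

struct request_queue;
struct bio;
struct ordered_chain;

enum chain_attr {
	CHAIN_SIMPLE,	/* no ordering requirement inside the chain */
	CHAIN_ORDERED,	/* must complete after all earlier chain members */
};

/* Start a chain on a queue; its members may be reordered freely
 * against the rest of the queue, but ORDERED members not against
 * each other. */
struct ordered_chain *ordered_chain_start(struct request_queue *q);

/* Queue a bio as part of the chain with the given attribute. */
int ordered_chain_add(struct ordered_chain *c, struct bio *bio,
		      enum chain_attr attr);

/* Issue the whole chain; the callback fires when the last ORDERED
 * member is stable. */
int ordered_chain_submit(struct ordered_chain *c,
			 void (*done)(struct ordered_chain *c, int error));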

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 23:28                                 ` James Bottomley
  2010-07-29 23:37                                   ` James Bottomley
@ 2010-07-30 12:56                                   ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 12:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

James Bottomley, on 07/30/2010 03:28 AM wrote:
> On Thu, 2010-07-29 at 19:04 -0400, Ted Ts'o wrote:
>> On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote:
>>> Like James wrote, this is basically everything FUA.  It is OK for
>>> ordered mode to allow the device to aggregate the normal filesystem
>>> and journal IO, but when the commit block is written it should flush
>>> all of the previously written data to disk.  This still allows
>>> request re-ordering and merging inside the device, but orders the
>>> data vs. the commit block.  Having the proposed "flush ranges"
>>> interface to the disk would be ideal, since there would be no wasted
>>> time flushing data that does not need it (i.e. other partitions).
>>
>> My understanding is that "everything FUA" can be a performance
>> disaster.  That's because it bypasses the track buffer, and things get
>> written directly to disk.  So there is no possibility to reorder
>> buffers so that they get written in one disk rotation.  Depending on
>> the disk, it might even be that if you send N sequential sectors all
>> tagged with FUA, it could be slower than sending the N sectors
>> followed by a cache flush or SYNCHRONIZE_CACHE command.
>
> I think we're getting into disk differences here.  This certainly isn't
> correct for SCSI disks.  The standard enterprise configuration for a
> SCSI disk is actually cache set to write through ... so FUA is a nop.
> Even for Write Back cache SCSI devices, FUA is just a wait until I/O is
> on media, which is pretty much equivalent to the write through case for
> the given cache lines.
>
> I can see the problems you describe possibly affecting ATA devices with
> less sophisticated caches ... but, realistically, SATA and SAS devices
> come from virtually the same manufacturing process ... I'd be really
> surprised if they didn't share caching technologies.

Please don't limit consideration to local disks only!

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 12:45                                     ` Vladislav Bolkhovitin
@ 2010-07-30 12:56                                       ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30 12:56 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, James Bottomley, Ric Wheeler, Ted Ts'o,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Fri, Jul 30, 2010 at 04:45:00PM +0400, Vladislav Bolkhovitin wrote:
> There are also SCSI devices which keep changed settings only until the 
> next reset/restart. (The devices might be shared, so other initiators 
> can at any time reset them.)

I haven't seen a SCSI device without support for the saved-values
mode pages for years.  But yes, in a shared environment every initiator
could change the settings.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 12:46                                 ` Vladislav Bolkhovitin
@ 2010-07-30 12:57                                   ` Christoph Hellwig
  2010-07-30 13:09                                     ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30 12:57 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, James Bottomley, Ric Wheeler, Ted Ts'o,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Fri, Jul 30, 2010 at 04:46:12PM +0400, Vladislav Bolkhovitin wrote:
>> I suppose that with the write back cache you did the queue drain after
>> request(s) with ordered requirements, correct? Did you also do the
>> queue drain in the same places with write through caching?

Using the queue drains in both cases.  I can only imagine that keeping
the queue drained over the cache flush, instead of just a few small
I/Os, has nasty side effects.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 12:56                                 ` Vladislav Bolkhovitin
@ 2010-07-30 13:07                                   ` Tejun Heo
  2010-07-30 13:22                                     ` Vladislav Bolkhovitin
  2010-07-30 13:09                                   ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-07-30 13:07 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On 07/30/2010 02:56 PM, Vladislav Bolkhovitin wrote:
> 1. 1st journal write command (SIMPLE)
> 
> 2. 2nd journal write command (SIMPLE)
> 
> 3. 3rd journal write command (SIMPLE)
> 
> 4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED)
> 
> 5. Necessary amount of meta-data update commands (all SIMPLE)
> 
> 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED)
> 
> 7. Command marking the transaction committed in the journal (ORDERED)
> 
> That's all. No queue draining anywhere. Plus, sending commands
> without internal order requirements as SIMPLE would allow the drive
> to better schedule execution of them among internal storage (actual
> disks).

Are SIMPLE commands ordered against ORDERED commands?  Aren't ORDERED
ordered among themselves only?

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 12:57                                   ` Christoph Hellwig
@ 2010-07-30 13:09                                     ` Vladislav Bolkhovitin
  2010-07-30 13:12                                       ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 13:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 07/30/2010 04:57 PM wrote:
> On Fri, Jul 30, 2010 at 04:46:12PM +0400, Vladislav Bolkhovitin wrote:
>> I supposed, with write back cache you did the queue drain after
>> request(s) with ordered requirements, correct? Did you also do the queue
>> drain in the same places with write through caching?
>
> Using the queue drains in both cases.  I can only imagine keeping the
> queue drained over the cache flush instead of just a few small I/Os
> has nasty side effects.

Sorry, I can't follow you here. What was the load pattern difference 
between the tests, as the backend device saw it? I thought it was only 
the absence of the cache flush commands (SYNCHRONIZE_CACHE?) in the 
write through case, but it looks like there is something more?

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 12:56                                 ` Vladislav Bolkhovitin
  2010-07-30 13:07                                   ` Tejun Heo
@ 2010-07-30 13:09                                   ` Christoph Hellwig
  2010-07-30 13:25                                     ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30 13:09 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Fri, Jul 30, 2010 at 04:56:31PM +0400, Vladislav Bolkhovitin wrote:
> For a simpler device (a disk without support for ORDERED queuing) the 
> same meta-data updates would be:
> 
> 1. 1st journal write command
> 
> 2. 2d  journal write command
> 
> 3. 3d  journal write command
> 
> 4. The queue draining.

Which is complete overkill.  We have state machines for everything we do
block I/O on (both data and the journal), which allows us to just wait
on the I/O requests we need inside the filesystem instead of draining
the queue or enforcing global ordering using ordered tags.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:09                                     ` Vladislav Bolkhovitin
@ 2010-07-30 13:12                                       ` Christoph Hellwig
  2010-07-30 17:40                                         ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30 13:12 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, James Bottomley, Ric Wheeler, Ted Ts'o,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Fri, Jul 30, 2010 at 05:09:52PM +0400, Vladislav Bolkhovitin wrote:
> Sorry, I can't follow you here. What was the load pattern difference 
> between the tests, as seen by the backend device? I thought it was only 
> the absence of the cache flush commands (SYNCHRONIZE_CACHE?) in the 
> write through case, but it looks like there is some other difference?

The only difference in commands is that we see no SYNCHRONIZE_CACHE.
The big picture difference is that we also only drain the queue just
to undrain it ASAP, instead of keeping it drained over a sequence
of SYNCHRONIZE_CACHE + WRITE + SYNCHRONIZE_CACHE, which can make
a huge difference for a device with very low latencies like the SSD
in my laptop.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:07                                   ` Tejun Heo
@ 2010-07-30 13:22                                     ` Vladislav Bolkhovitin
  2010-07-30 13:27                                       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 13:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

Tejun Heo, on 07/30/2010 05:07 PM wrote:
> On 07/30/2010 02:56 PM, Vladislav Bolkhovitin wrote:
>> 1. 1st journal write command (SIMPLE)
>>
>> 2. 2d  journal write command (SIMPLE)
>>
>> 3. 3d  journal write command (SIMPLE)
>>
>> 4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED)
>>
>> 5. Necessary amount of meta-data update commands (all SIMPLE)
>>
>> 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED)
>>
>> 7. Command marking the transaction committed in the journal (ORDERED)
>>
>> That's all. No queue draining anywhere. Plus, sending commands
>> without internal order requirements as SIMPLE would allow the drive
>> to better schedule execution of them among internal storage (actual
>> disks).
>
> Are SIMPLE commands ordered against ORDERED commands?  Aren't ORDERED
> ordered among themselves only?

About SIMPLE commands SAM says: "The command shall not enter the enabled 
command state until all commands having a HEAD OF QUEUE task attribute 
and older commands having an ORDERED task attribute in the task set have 
completed"

About ORDERED commands: "The command shall not enter the enabled command 
state until all commands having a HEAD OF QUEUE task attribute and all 
older commands in the task set have completed".

In plain language this means that ORDERED commands are ordered against 
all other commands: no SIMPLE command can be executed before the ORDERED 
commands ahead of it have completed, and no ORDERED command can be 
executed before all SIMPLE and ORDERED commands ahead of it have 
completed. (I excluded HEAD OF QUEUE commands from the consideration for 
simplicity.)
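
To make the two quoted rules concrete, here is a toy user-space C model 
of when a command may enter the enabled state (purely illustrative; the 
types and the function are made up, this is not kernel or SAM code):

#include <stdbool.h>

enum task_attr { SIMPLE, ORDERED };	/* HEAD OF QUEUE omitted */

struct task {
	enum task_attr attr;
	bool completed;
};

/* May set[i] enter the enabled state, given older tasks set[0..i-1]? */
static bool may_enable(const struct task *set, int i)
{
	for (int j = 0; j < i; j++) {
		if (set[j].completed)
			continue;
		if (set[j].attr == ORDERED)
			return false;	/* older ORDERED blocks everyone */
		if (set[i].attr == ORDERED)
			return false;	/* ORDERED waits for all older */
	}
	return true;	/* SIMPLE waits only for older ORDERED commands */
}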

Vlad





^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:09                                   ` Christoph Hellwig
@ 2010-07-30 13:25                                     ` Vladislav Bolkhovitin
  2010-07-30 13:34                                       ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 07/30/2010 05:09 PM wrote:
> On Fri, Jul 30, 2010 at 04:56:31PM +0400, Vladislav Bolkhovitin wrote:
>> For a simpler device (a disk without support for ORDERED queuing) the
>> same meta-data updates would be:
>>
>> 1. 1st journal write command
>>
>> 2. 2d  journal write command
>>
>> 3. 3d  journal write command
>>
>> 4. The queue draining.
>
> Which is complete overkill.  We have state machines for everything we do
> block I/O on (both data and the journal), which allows us to just wait
> on the I/O requests we need inside the filesystem instead of draining
> the queue or enforcing global ordering using ordered tags.

Sure. It was only a naive example to illustrate my points. But the FS is 
still waiting for the requests, so it is in effect "draining" its "local 
queue"?

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:22                                     ` Vladislav Bolkhovitin
@ 2010-07-30 13:27                                       ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 13:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

Vladislav Bolkhovitin, on 07/30/2010 05:22 PM wrote:
> Tejun Heo, on 07/30/2010 05:07 PM wrote:
>> On 07/30/2010 02:56 PM, Vladislav Bolkhovitin wrote:
>>> 1. 1st journal write command (SIMPLE)
>>>
>>> 2. 2d  journal write command (SIMPLE)
>>>
>>> 3. 3d  journal write command (SIMPLE)
>>>
>>> 4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED)
>>>
>>> 5. Necessary amount of meta-data update commands (all SIMPLE)
>>>
>>> 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED)
>>>
>>> 7. Command marking the transaction committed in the journal (ORDERED)
>>>
>>> That's all. No queue draining anywhere. Plus, sending commands
>>> without internal order requirements as SIMPLE would allow the drive
>>> to better schedule execution of them among internal storage (actual
>>> disks).
>>
>> Are SIMPLE commands ordered against ORDERED commands?  Aren't ORDERED
>> ordered among themselves only?
>
> About SIMPLE commands SAM says: "The command shall not enter the enabled
> command state until all commands having a HEAD OF QUEUE task attribute
> and older commands having an ORDERED task attribute in the task set have
> completed"
>
> About ORDERED commands: "The command shall not enter the enabled command
> state until all commands having a HEAD OF QUEUE task attribute and all
> older commands in the task set have completed".
>
> In plain language this means that ORDERED commands are ordered against
> all other commands: no SIMPLE command can be executed before the ORDERED
> commands ahead of it have completed, and no ORDERED command can be
> executed before all SIMPLE and ORDERED commands ahead of it have completed.

...and, of course, SIMPLE commands can be freely reordered against 
neighbor SIMPLE commands.

> Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:25                                     ` Vladislav Bolkhovitin
@ 2010-07-30 13:34                                       ` Christoph Hellwig
  2010-07-30 13:44                                         ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30 13:34 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Ted Ts'o, Andreas Dilger, Ric Wheeler,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Fri, Jul 30, 2010 at 05:25:52PM +0400, Vladislav Bolkhovitin wrote:
> Sure. It was only a naive example to illustrate my points. But the FS is 
> still waiting for the requests, so it is in effect "draining" its "local 
> queue"?

Yes, just a much smaller queue in general.

To present a typical case, fsync() on a regular file that has a few
dirty pages on it using XFS.

We use filemap_write_and_wait to write out those few pages and wait
for it.  And after that we only need to issue a SYNCHRONIZE_CACHE
and we'd be done.  Right now the draining semantics of the (empty)
barrier means we also need to wait for all other I/O in the system
to finish, which is rather suboptimal.
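
In (hypothetical) code the wanted path is tiny. A minimal sketch against 
the 2.6.35 interfaces, with a made-up function name:

static int fsync_data_only(struct inode *inode)
{
	/* write out the few dirty pages and wait for them */
	int error = filemap_write_and_wait(inode->i_mapping);

	if (error)
		return error;

	/* one cache flush; no reason to drain the whole queue for it */
	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
				  BLKDEV_IFL_WAIT);
}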


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:34                                       ` Christoph Hellwig
@ 2010-07-30 13:44                                         ` Vladislav Bolkhovitin
  2010-07-30 14:20                                           ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 13:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 07/30/2010 05:34 PM wrote:
> On Fri, Jul 30, 2010 at 05:25:52PM +0400, Vladislav Bolkhovitin wrote:
>> Sure. It was only a naive example to illustrate my points. But the FS is
>> still waiting for the requests, so it is in effect "draining" its "local
>> queue"?
>
> Yes, just a much smaller queue in general.
>
> To present a typical case, fsync() on a regular file that has a few
> dirty pages on it using XFS.
>
> We use filemap_write_and_wait to write out those few pages and wait
> for it.  And after that we only need to issue a SYNCHRONIZE_CACHE
> and we'd be done.  Right now the draining semantics of the (empty)
> barrier means we also need to wait for all other I/O in the system
> to finish, which is rather suboptimal.

Yes, but why not take a step further and allow us to completely eliminate 
the waiting/draining using ORDERED requests? Current advanced storage 
hardware allows that.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:44                                         ` Vladislav Bolkhovitin
@ 2010-07-30 14:20                                           ` Christoph Hellwig
  2010-07-31  0:47                                             ` Jan Kara
  2010-08-02 19:01                                             ` [RFC] relaxed barrier semantics Vladislav Bolkhovitin
  0 siblings, 2 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-30 14:20 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Ted Ts'o, Andreas Dilger, Ric Wheeler,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote:
> Yes, but why not take a step further and allow us to completely eliminate 
> the waiting/draining using ORDERED requests? Current advanced storage 
> hardware allows that.

There are a few cases where we could do that - the fsync without metadata
changes above would be the prime example.  But there's a lot of lower
hanging fruit until we get to the point where it's worth trying.

But in most cases we don't just drain an imaginary queue but actually
need to modify software state before finishing one class of I/O and
submitting the next.

Again, take the example of fsync, but this time we have actually
extended the file and need to log an inode size update, as well
as a modification to the btree blocks.

Now the fsync in XFS looks like this:

1) write out all the data blocks using WRITE
2) wait for these to finish
3) propagate any I/O errors to the inode so we can pick them up
4) update the inode size in the shadow in-memory structure
5) start a transaction to log the inode size
6) flush the write cache to make sure the data really is on disk
7) write out a log buffer containing the inode and btree updates
8) if the FUA bit is not supported, flush the cache again

and yes, the flush in 6) is important so that we don't happen
to log the inode size update before all data has made it to disk
in case the cache flush in 8) is interrupted
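
or, as a hedged sketch (the helpers below are hypothetical stand-ins for 
XFS internals; only filemap_write_and_wait() and blkdev_issue_flush() 
are real interfaces):

static int fsync_with_size_update(struct inode *inode)
{
	int error;

	error = filemap_write_and_wait(inode->i_mapping);	/* 1 + 2 */
	if (error)
		return error;					/* 3 */

	update_shadow_inode_size(inode);	/* 4, hypothetical */
	start_size_transaction(inode);		/* 5, hypothetical */

	/* 6: the data must be stable before the size update is logged */
	blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
			   BLKDEV_IFL_WAIT);

	write_log_buffer(inode);		/* 7, hypothetical */
	if (!log_write_was_fua(inode))		/* 8, hypothetical */
		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
				   BLKDEV_IFL_WAIT);
	return 0;
}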


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 13:12                                       ` Christoph Hellwig
@ 2010-07-30 17:40                                         ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-30 17:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 07/30/2010 05:12 PM wrote:
> On Fri, Jul 30, 2010 at 05:09:52PM +0400, Vladislav Bolkhovitin wrote:
>> Sorry, I can't follow you here. What was the load pattern difference
>> between the tests, as seen by the backend device? I thought it was only
>> the absence of the cache flush commands (SYNCHRONIZE_CACHE?) in the
>> write through case, but it looks like there is some other difference?
>
> The only difference in commands is that we see no SYNCHRONIZE_CACHE.
> The big picture difference is that we also only drain the queue just
> to undrain it ASAP, instead of keeping it drained over a sequence
> of SYNCHRONIZE_CACHE + WRITE + SYNCHRONIZE_CACHE, which can make
> a huge difference for a device with very low latencies like the SSD
> in my laptop.

It's weird. I can only explain it if:

1. The device fully or partially lies about write through mode. By 
"partially" I mean something like the response being returned when the 
writes have only "almost" been sent to the media.

2. The device has a very ineffective SYNCHRONIZE_CACHE implementation. 
For instance, it has a relatively slow internal cache scan (you do a 
complete cache flush, not only of the blocks affected by the previous 
writes, correct?).

It would be good if you performed your test on some software SCSI target 
device, where we can fully control and see what's going on inside.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 19:44                       ` [RFC] relaxed barrier semantics Ric Wheeler
  2010-07-29 19:49                         ` Christoph Hellwig
@ 2010-07-31  0:35                         ` Jan Kara
  1 sibling, 0 replies; 155+ messages in thread
From: Jan Kara @ 2010-07-31  0:35 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

On Thu 29-07-10 15:44:31, Ric Wheeler wrote:
> On 07/28/2010 09:44 PM, Ted Ts'o wrote:
> >On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote:
> >>If we move all filesystems to non-draining barriers with pre- and post-
> >>flushes that might actually be a relatively easy first step.  We don't
> >>have the complications to deal with multiple types of barriers to
> >>start with, and it'll fix the issue for devices without volatile write
> >>caches completely.
> >>
> >>I just need some help from the filesystem folks to determine if they
> >>are safe with them.
> >>
> >>I know for sure that ext3 and xfs are from looking through them.  And
> >>I know reiserfs is if we make sure it doesn't hit the code path that
> >>relies on it that is currently enabled by the barrier option.
> >>
> >>I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
> >>That already ends our small list of barrier supporting filesystems, and
> >>possibly ocfs2, too - although the barrier implementation there seems
> >>incomplete as it doesn't seem to flush caches in fsync.
> >Define "are safe" --- what interface we planning on using for the
> >non-draining barrier?  At least for ext3, when we write the commit
> >record using set_buffer_ordered(bh), it assumes that this will do a
> >flush of all previous writes and that the commit will hit the disk
> >before any subsequent writes are sent to the disk.  So turning the
> >write of a buffer head marked with set_buffered_ordered() into a FUA
> >write would _not_ be safe for ext3.
> 
> I confess that I am a bit fuzzy on FUA, but think that it means that
> any FUA tagged IO will go down to persistent store before returning.
> 
> If so, then all order dependent IO would need to be issued in order
> and tagged with FUA. It would not suffice to tag just the commit
> record as FUA, or do I misunderstand what FUA does?
  Ric, I think you misunderstood it a bit. I think the proposal for ext3
was to write ordered data + metadata to the journal except for the
transaction commit block, then issue SYNCHRONIZE_CACHE, and then write the
transaction commit block either with the FUA bit set, or without it and
with a SYNCHRONIZE_CACHE call after it.
  The difference from the current behavior would be that we save the queue
draining we do these days...
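
  As a hedged sketch of that sequence (all helpers below are hypothetical;
only blkdev_issue_flush() is a real interface):

static void commit_without_draining(journal_t *journal,
				    struct block_device *bdev)
{
	write_journal_blocks(journal);	/* everything but the commit block */
	wait_for_journal_blocks(journal);
	blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);

	write_commit_block(journal);	/* with the FUA bit if supported */
	wait_for_commit_block(journal);
	if (!commit_block_was_fua(journal))
		blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
	/* no queue draining anywhere in this sequence */
}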

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 14:20                                           ` Christoph Hellwig
@ 2010-07-31  0:47                                             ` Jan Kara
  2010-07-31  9:12                                               ` Christoph Hellwig
  2010-08-02 10:38                                               ` Vladislav Bolkhovitin
  2010-08-02 19:01                                             ` [RFC] relaxed barrier semantics Vladislav Bolkhovitin
  1 sibling, 2 replies; 155+ messages in thread
From: Jan Kara @ 2010-07-31  0:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vladislav Bolkhovitin, Ted Ts'o, Andreas Dilger, Ric Wheeler,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Fri 30-07-10 16:20:25, Christoph Hellwig wrote:
> On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote:
> > Yes, but why not take a step further and allow us to completely eliminate 
> > the waiting/draining using ORDERED requests? Current advanced storage 
> > hardware allows that.
> 
> There are a few cases where we could do that - the fsync without metadata
> changes above would be the prime example.  But there's a lot of lower
> hanging fruit until we get to the point where it's worth trying.
  Umm, I don't understand you. I think that fsync in particular is an
example where you have to wait and issue a cache flush if the drive has a
volatile write cache. Otherwise you cannot promise the user that the data
will really be on disk in case of a crash. So no ordering helps you.
  And if you are speaking about a drive without a volatile write cache, then
fsync without metadata changes is just trivial and you don't need any
ordering.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-31  0:47                                             ` Jan Kara
@ 2010-07-31  9:12                                               ` Christoph Hellwig
  2010-08-02 13:14                                                 ` Jan Kara
  2010-08-02 10:38                                               ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-31  9:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Vladislav Bolkhovitin, Ted Ts'o,
	Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Sat, Jul 31, 2010 at 02:47:57AM +0200, Jan Kara wrote:
> > There are a few cases where we could do that - the fsync without metadata
> > changes above would be the prime example.  But there's a lot of lower
> > hanging fruit until we get to the point where it's worth trying.
>   Umm, I don't understand you. I think that fsync in particular is an
> example where you have to wait and issue a cache flush if the drive has a
> volatile write cache.

Of course.  What makes you believe anyone said something else?


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-31  0:47                                             ` Jan Kara
  2010-07-31  9:12                                               ` Christoph Hellwig
@ 2010-08-02 10:38                                               ` Vladislav Bolkhovitin
  2010-08-02 12:48                                                 ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-02 10:38 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo,
	Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

Jan Kara, on 07/31/2010 04:47 AM wrote:
> On Fri 30-07-10 16:20:25, Christoph Hellwig wrote:
>> On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote:
>>> Yes, but why not take a step further and allow us to completely eliminate
>>> the waiting/draining using ORDERED requests? Current advanced storage
>>> hardware allows that.
>>
>> There are a few cases where we could do that - the fsync without metadata
>> changes above would be the prime example.  But there's a lot of lower
>> hanging fruit until we get to the point where it's worth trying.
>    Umm, I don't understand you. I think that fsync in particular is an
> example where you have to wait and issue a cache flush if the drive has a
> volatile write cache. Otherwise you cannot promise the user that the data
> will really be on disk in case of a crash. So no ordering helps you.

Isn't there a second wait for the journal update?

>    And if you are speaking about a drive without a volatile write cache, then
> fsync without metadata changes is just trivial and you don't need any
> ordering.

A drive can reorder queued SIMPLE requests at any time, no matter whether 
it has a volatile write cache or not. So, if you expect in-order request 
execution (and with journal updates you do?), you need to enforce that 
order either by ORDERED requests or by (local) queue draining.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-02 10:38                                               ` Vladislav Bolkhovitin
@ 2010-08-02 12:48                                                 ` Christoph Hellwig
  2010-08-02 19:03                                                   ` xfs rm performance Vladislav Bolkhovitin
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-02 12:48 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jan Kara, Christoph Hellwig, Ted Ts'o, Andreas Dilger,
	Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Mon, Aug 02, 2010 at 02:38:18PM +0400, Vladislav Bolkhovitin wrote:
> >   Umm, I don't understand you. I think that fsync in particular is an
> >example where you have to wait and issue a cache flush if the drive has a
> >volatile write cache. Otherwise you cannot promise the user that the data
> >will really be on disk in case of a crash. So no ordering helps you.
> 
> Isn't there a second wait for the journal update?

Yes.

> A drive can reorder queued SIMPLE requests at any time, no matter whether 
> it has a volatile write cache or not.

I know.

> So, if you expect in-order request 
> execution (and with journal updates you do?), you need to enforce that 
> order either by ORDERED requests or by (local) queue draining.

Yes, exactly what I say.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-31  9:12                                               ` Christoph Hellwig
@ 2010-08-02 13:14                                                 ` Jan Kara
  0 siblings, 0 replies; 155+ messages in thread
From: Jan Kara @ 2010-08-02 13:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Vladislav Bolkhovitin, Ted Ts'o, Andreas Dilger,
	Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Sat 31-07-10 11:12:46, Christoph Hellwig wrote:
> On Sat, Jul 31, 2010 at 02:47:57AM +0200, Jan Kara wrote:
> > > There are a few cases where we could do that - the fsync without metadata
> > > changes above would be the prime example.  But there's a lot of lower
> > > hanging fruit until we get to the point where it's worth trying.
> >   Umm, I don't understand you. I think that fsync in particular is an
> > example where you have to wait and issue cache flush if the drive has
> > volatile write cache.
> 
> Of course.  What makes you believe anyone said something else?
  Ok, then I just misunderstood which requests you wanted to send ORDERED.
Never mind, I think we agree on what needs to / can be done.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:28                   ` Christoph Hellwig
                                       ` (3 preceding siblings ...)
  2010-07-29  1:44                     ` Ted Ts'o
@ 2010-08-02 16:47                     ` Ryusuke Konishi
  2010-08-02 17:39                     ` Chris Mason
  5 siblings, 0 replies; 155+ messages in thread
From: Ryusuke Konishi @ 2010-08-02 16:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke

On Wed, 28 Jul 2010 11:28:59 +0200, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote:
> > I'll re-read barrier code and see how hard it would be to implement a
> > proper solution.
> 
> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step.  We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
> 
> I just need some help from the filesystem folks to determine if they
> are safe with them.
> 
> I know for sure that ext3 and xfs are from looking through them.  And
> I know reiserfs is if we make sure it doesn't hit the code path that
> relies on it that is currently enabled by the barrier option.
> 
> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.

With regard to nilfs, the barrier is applied to writeback of the super
block, since the super block saves the position of a recent log and this
log needs to be written to the platter prior to the super block.

And so, I think a pre-flush + a FUA write can be used instead of
draining for the barrier use in nilfs.
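
That is, roughly (the helper name is hypothetical; blkdev_issue_flush()
is the real interface):

	/* make sure the recent log is on the platter ... */
	blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
	/* ... then persist the super block that points at it */
	nilfs_write_super_fua(sb);	/* hypothetical: WRITE with FUA bit */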

Thanks,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-28  9:28                   ` Christoph Hellwig
                                       ` (4 preceding siblings ...)
  2010-08-02 16:47                     ` Ryusuke Konishi
@ 2010-08-02 17:39                     ` Chris Mason
  2010-08-05 13:11                       ` Vladislav Bolkhovitin
  2010-08-05 13:11                       ` Vladislav Bolkhovitin
  5 siblings, 2 replies; 155+ messages in thread
From: Chris Mason @ 2010-08-02 17:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote:
> > Well, if disabling barrier works around the problem for them (which is
> > basically what was suggested in the first message), that's not too
> > bad for short term, I think.
> 
> It's a pretty horrible workaround.  Requiring manual mount options to
> get performance out of a setup which could trivially work out of the
> box is a bad workaround.
> 
> > I'll re-read barrier code and see how hard it would be to implement a
> > proper solution.
> 
> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step.  We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
> 
> I just need some help from the filesystem folks to determine if they
> are safe with them.
> 
> I know for sure that ext3 and xfs are from looking through them.  And
> I know reiserfs is if we make sure it doesn't hit the code path that
> relies on it that is currently enabled by the barrier option.
> 
> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
> That already ends our small list of barrier supporting filesystems, and
> possibly ocfs2, too - although the barrier implementation there seems
> incomplete as it doesn't seem to flush caches in fsync.

Btrfs is going to be similar to xfs, except because of COW we have to
always pretend someone is extending the file (or filling a hole).

The short answer is that a preflush of the disk cache, followed by FUA
for commits, is fine.  Btrfs explicitly waits for all the bios it sends
down without trusting other layers for silent ordering.

The long answer is that the btrfs commit is basically:

wait for bio completion of a bunch of different things
write new super block pointing to new tree roots with barrier

Everything we waited for must be fully on disk before the new super
block, and the new super must be fully on disk after we wait for the bh.
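
As a hedged sketch of that shape (the helpers are hypothetical):

	/* wait for bio completion of a bunch of different things */
	wait_for_commit_bios(root);

	/* preflush: everything waited for above must reach stable media */
	blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);

	/* ... and only then the super pointing to the new tree roots,
	 * written with FUA or followed by one more flush */
	write_new_super(root);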

I regret putting the ordering into the original barrier code...it
definitely did help reiserfs back in the day but it stinks of magic and
voodoo.

When it goes wrong, we'll only notice .000000001% of the time, and even
then it'll only be when people report some random corruption which we'll
blindly blame on either axboe or the drive.

-chris


^ permalink raw reply	[flat|nested] 155+ messages in thread

* [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics)
  2010-07-30  7:07                                 ` Christoph Hellwig
  2010-07-30  7:41                                   ` Vivek Goyal
@ 2010-08-02 18:28                                   ` Vivek Goyal
  2010-08-03 13:03                                     ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Vivek Goyal @ 2010-08-02 18:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Fri, Jul 30, 2010 at 09:07:32AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 11:17:21PM -0400, Vivek Goyal wrote:
> > To me it looked as if everything is there and it is just a matter
> > of skipping elevator draining and request queue draining.
> 
> The problem is that it just appears to be so.  The code blocking only
> the next barrier for tagged writes is there, but in that form it doesn't
> work and probably never did.  When I try to use it and debug it I always
> get my post-flush request issued before the barrier request has
> finished.

Hi Christoph,

Please find attached a new version of the patch where I am trying to
implement flush only barriers. Why do that? I was thinking that it would
be nice to avoid elevator drains with WCE=1.

Here I have a DRAIN queue and I seem to be issuing the post-flush only
after the barrier has finished. I still need to find some device with a
TAG queue to test as well.

This is still a very crude patch where I need to do a lot of testing to
see if things are working. For the time being I have just hooked up ext3
to use the flush barrier and verified that in the WCE=0 case we don't
issue a barrier and in the WCE=1 case we do issue a barrier with a
pre-flush and a post-flush.

I have not yet found a device with FUA and tagging support to verify
that functionality.

I looked at your BH_ordered kill patch. For the time being I have
introduced another flag BH_Flush_Ordered along the lines of BH_Ordered.
But it can be easily replaced once your kill patch is in.

Thanks
Vivek


o Implement flush only barriers. These do not implement any drain semantics.
  File system needs to wait for completion of all the dependent IO.

o On storage with no write cache, these barriers should just do nothing.
  Empty barrier request returns immediately and a write request with
  barrier is processed as normal request. No drains, no flushing.

o On storage with write cache, for an empty barrier, only a pre-flush is done.
  For a barrier request with some data, one of the following should happen
  depending on queue capability.

	Draining queue
	--------------
	preflush ==> barrier (FUA)
	preflush ==> barrier ===> postflush
	
	Ordered Queue
	-------------
	preflush-->barrier (FUA)
	preflush --> barrier ---> postflush

	===> Wait for previous request to finish
	---> Issue an ordered request in Tagged queue.

o For the write cache enabled case, we are not completely drain free.

  - I don't try to drain request queue for dispatching pre flush request.

  - But after dispatching pre flush, I wait for it to finish before actual
    barrier request goes in. So if controller re-orders the pre-flush and
    executes it ahead of other request, full draining will be avoided
    otherwise it will take place.

  - Similarly post-flush will wait for previous barrier request to finish
    and this will ultimately lead to draining the queue if drive is not
    re-ordering the requests.

  - So what did we gain by this patch in the WCE=1 case? I think primarily
    we avoided elevator draining, which can be useful for the IO controller
    where we provide service differentiation in the elevator.

  - Not sure how to avoid this drain. Trying to allow other non-barrier
    requests to dispatch while we wait for pre-flush/flush barrier to finish
    will make code more complicated.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Makefile                    |    2 -
 block/blk-barrier.c         |   67 ++++++++++++++++++++++++++++++++++++++++----
 block/blk-core.c            |    9 +++--
 block/elevator.c            |    9 +++--
 fs/buffer.c                 |    3 +
 fs/ext3/fsync.c             |    2 -
 fs/jbd/commit.c             |    2 -
 include/linux/bio.h         |    7 +++-
 include/linux/blkdev.h      |    9 ++++-
 include/linux/buffer_head.h |    3 +
 include/linux/fs.h          |    1 
 kernel/trace/blktrace.c     |    2 -
 12 files changed, 97 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/include/linux/blkdev.h	2010-08-02 14:01:17.000000000 -0400
@@ -97,6 +97,7 @@ enum rq_flag_bits {
 	__REQ_SORTED,		/* elevator knows about this request */
 	__REQ_SOFTBARRIER,	/* may not be passed by ioscheduler */
 	__REQ_HARDBARRIER,	/* may not be passed by drive either */
+	__REQ_FLUSHBARRIER,	/* only flush barrier. no drains required  */
 	__REQ_FUA,		/* forced unit access */
 	__REQ_NOMERGE,		/* don't touch this for merging */
 	__REQ_STARTED,		/* drive already may have started this one */
@@ -126,6 +127,7 @@ enum rq_flag_bits {
 #define REQ_SORTED	(1 << __REQ_SORTED)
 #define REQ_SOFTBARRIER	(1 << __REQ_SOFTBARRIER)
 #define REQ_HARDBARRIER	(1 << __REQ_HARDBARRIER)
+#define REQ_FLUSHBARRIER	(1 << __REQ_FLUSHBARRIER)
 #define REQ_FUA		(1 << __REQ_FUA)
 #define REQ_NOMERGE	(1 << __REQ_NOMERGE)
 #define REQ_STARTED	(1 << __REQ_STARTED)
@@ -625,7 +627,8 @@ enum {
 
 #define blk_rq_cpu_valid(rq)	((rq)->cpu != -1)
 #define blk_sorted_rq(rq)	((rq)->cmd_flags & REQ_SORTED)
-#define blk_barrier_rq(rq)	((rq)->cmd_flags & REQ_HARDBARRIER)
+#define blk_barrier_rq(rq)	((rq)->cmd_flags & REQ_HARDBARRIER || (rq)->cmd_flags & REQ_FLUSHBARRIER)
+#define blk_flush_barrier_rq(rq)	((rq)->cmd_flags & REQ_FLUSHBARRIER)
 #define blk_fua_rq(rq)		((rq)->cmd_flags & REQ_FUA)
 #define blk_discard_rq(rq)	((rq)->cmd_flags & REQ_DISCARD)
 #define blk_bidi_rq(rq)		((rq)->next_rq != NULL)
@@ -681,7 +684,7 @@ static inline void blk_clear_queue_full(
  * it already be started by driver.
  */
 #define RQ_NOMERGE_FLAGS	\
-	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | REQ_FLUSHBARRIER)
 #define rq_mergeable(rq)	\
 	(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
 	 (blk_discard_rq(rq) || blk_fs_request((rq))))
@@ -1006,9 +1009,11 @@ static inline struct request *blk_map_qu
 enum{
 	BLKDEV_WAIT,	/* wait for completion */
 	BLKDEV_BARRIER,	/*issue request with barrier */
+	BLKDEV_FLUSHBARRIER,	/*issue request with flush barrier. no drains */
 };
 #define BLKDEV_IFL_WAIT		(1 << BLKDEV_WAIT)
 #define BLKDEV_IFL_BARRIER	(1 << BLKDEV_BARRIER)
+#define BLKDEV_IFL_FLUSHBARRIER	(1 << BLKDEV_FLUSHBARRIER)
 extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *,
 			unsigned long);
 extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/block/blk-barrier.c	2010-08-02 14:01:17.000000000 -0400
@@ -129,7 +129,7 @@ static void post_flush_end_io(struct req
 	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
 }
 
-static void queue_flush(struct request_queue *q, unsigned which)
+static void queue_flush(struct request_queue *q, unsigned which, bool ordered)
 {
 	struct request *rq;
 	rq_end_io_fn *end_io;
@@ -143,7 +143,17 @@ static void queue_flush(struct request_q
 	}
 
 	blk_rq_init(q, rq);
-	rq->cmd_flags = REQ_HARDBARRIER;
+
+	/*
+	 * Does this flush request have to be ordered? In case of FLUSHBARRIERS
+	 * we don't need the PREFLUSH to be ordered. The POSTFLUSH needs to be
+	 * ordered if the device does not support FUA.
+	 */
+	if (ordered)
+		rq->cmd_flags = REQ_HARDBARRIER;
+	else
+		rq->cmd_flags = REQ_FLUSHBARRIER;
+
 	rq->rq_disk = q->bar_rq.rq_disk;
 	rq->end_io = end_io;
 	q->prepare_flush_fn(q, rq);
@@ -192,7 +202,7 @@ static inline bool start_ordered(struct 
 	 * request gets inbetween ordered sequence.
 	 */
 	if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
+		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH, 1);
 		rq = &q->post_flush_rq;
 	} else
 		skip |= QUEUE_ORDSEQ_POSTFLUSH;
@@ -207,6 +217,17 @@ static inline bool start_ordered(struct 
 		if (q->ordered & QUEUE_ORDERED_DO_FUA)
 			rq->cmd_flags |= REQ_FUA;
 		init_request_from_bio(rq, q->orig_bar_rq->bio);
+
+		/*
+		 * For flush barriers, we want these to be ordered w.r.t
+		 * preflush hence mark them as HARDBARRIER here.
+		 *
+		 * Note: init_request_from_bio() call above will mark it
+		 * as FLUSHBARRIER
+		 */
+		if (blk_flush_barrier_rq(q->orig_bar_rq))
+			rq->cmd_flags |= REQ_HARDBARRIER;
+
 		rq->end_io = bar_end_io;
 
 		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -214,12 +235,21 @@ static inline bool start_ordered(struct 
 		skip |= QUEUE_ORDSEQ_BAR;
 
 	if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
+		/*
+		 * For a flush only barrier, we don't care about ordering the
+		 * preflush request w.r.t. other requests in the controller queue.
+		 */
+		if (blk_flush_barrier_rq(q->orig_bar_rq))
+			queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH, 0);
+		else
+			queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH, 1);
+
 		rq = &q->pre_flush_rq;
 	} else
 		skip |= QUEUE_ORDSEQ_PREFLUSH;
 
-	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q)
+	    && !blk_flush_barrier_rq(q->orig_bar_rq))
 		rq = NULL;
 	else
 		skip |= QUEUE_ORDSEQ_DRAIN;
@@ -241,6 +271,29 @@ bool blk_do_ordered(struct request_queue
 	if (!q->ordseq) {
 		if (!is_barrier)
 			return true;
+		/*
+		 * For flush only barriers, nothing has to be done if there is
+		 * no caching happening on the device. The barrier request
+		 * still has to be written to disk but it can be written as a
+		 * normal rq.
+		 */
+
+		if (blk_flush_barrier_rq(rq)
+		    && (q->ordered == QUEUE_ORDERED_DRAIN
+		        || q->ordered == QUEUE_ORDERED_TAG)) {
+			if (!blk_rq_sectors(rq)) {
+				/*
+				 * Empty barrier. Device is write through.
+				 * Nothing has to be done. Return success.
+				 */
+				blk_dequeue_request(rq);
+				__blk_end_request_all(rq, 0);
+				*rqp = NULL;
+				return false;
+			} else
+				/* Process as normal rq. */
+				return true;
+		}
 
 		if (q->next_ordered != QUEUE_ORDERED_NONE)
 			return start_ordered(q, rqp);
@@ -311,6 +364,8 @@ int blkdev_issue_flush(struct block_devi
 	struct request_queue *q;
 	struct bio *bio;
 	int ret = 0;
+	int type = flags & BLKDEV_IFL_FLUSHBARRIER ? WRITE_FLUSHBARRIER
+				: WRITE_BARRIER;
 
 	if (bdev->bd_disk == NULL)
 		return -ENXIO;
@@ -326,7 +381,7 @@ int blkdev_issue_flush(struct block_devi
 		bio->bi_private = &wait;
 
 	bio_get(bio);
-	submit_bio(WRITE_BARRIER, bio);
+	submit_bio(type, bio);
 	if (test_bit(BLKDEV_WAIT, &flags)) {
 		wait_for_completion(&wait);
 		/*
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/block/elevator.c	2010-08-02 13:19:02.000000000 -0400
@@ -424,7 +424,8 @@ void elv_dispatch_sort(struct request_qu
 	q->nr_sorted--;
 
 	boundary = q->end_sector;
-	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
+	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED
+			| REQ_FLUSHBARRIER;
 	list_for_each_prev(entry, &q->queue_head) {
 		struct request *pos = list_entry_rq(entry);
 
@@ -628,7 +629,8 @@ void elv_insert(struct request_queue *q,
 
 	case ELEVATOR_INSERT_BACK:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-		elv_drain_elevator(q);
+		if (!blk_flush_barrier_rq(rq))
+			elv_drain_elevator(q);
 		list_add_tail(&rq->queuelist, &q->queue_head);
 		/*
 		 * We kick the queue here for the following reasons.
@@ -712,7 +714,8 @@ void __elv_add_request(struct request_qu
 	if (q->ordcolor)
 		rq->cmd_flags |= REQ_ORDERED_COLOR;
 
-	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
+	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER |
+		REQ_FLUSHBARRIER)) {
 		/*
 		 * toggle ordered color
 		 */
Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/include/linux/bio.h	2010-08-02 14:01:17.000000000 -0400
@@ -161,6 +161,10 @@ struct bio {
  *	Don't want driver retries for any fast fail whatever the reason.
  * bit 10 -- Tell the IO scheduler not to wait for more requests after this
 	one has been submitted, even if it is a SYNC request.
+ * bit 11 -- This is a flush only barrier and does not perform drain operations.
+ * 	     A user of this should make sure all the requests one is
+ * 	     dependent on have completed and then use this barrier to flush
+ * 	     the cache and also do a FUA write if it is a non-empty barrier.
  */
 enum bio_rw_flags {
 	BIO_RW,
@@ -175,6 +179,7 @@ enum bio_rw_flags {
 	BIO_RW_META,
 	BIO_RW_DISCARD,
 	BIO_RW_NOIDLE,
+	BIO_RW_FLUSHBARRIER,
 };
 
 /*
@@ -211,7 +216,7 @@ static inline bool bio_rw_flagged(struct
 #define bio_offset(bio)		bio_iovec((bio))->bv_offset
 #define bio_segments(bio)	((bio)->bi_vcnt - (bio)->bi_idx)
 #define bio_sectors(bio)	((bio)->bi_size >> 9)
-#define bio_empty_barrier(bio)	(bio_rw_flagged(bio, BIO_RW_BARRIER) && !bio_has_data(bio) && !bio_rw_flagged(bio, BIO_RW_DISCARD))
+#define bio_empty_barrier(bio)	((bio_rw_flagged(bio, BIO_RW_BARRIER) || bio_rw_flagged(bio, BIO_RW_FLUSHBARRIER)) && !bio_has_data(bio) && !bio_rw_flagged(bio, BIO_RW_DISCARD))
 
 static inline unsigned int bio_cur_bytes(struct bio *bio)
 {
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/block/blk-core.c	2010-08-02 14:01:17.000000000 -0400
@@ -1153,6 +1153,8 @@ void init_request_from_bio(struct reques
 		req->cmd_flags |= REQ_DISCARD;
 	if (bio_rw_flagged(bio, BIO_RW_BARRIER))
 		req->cmd_flags |= REQ_HARDBARRIER;
+	if (bio_rw_flagged(bio, BIO_RW_FLUSHBARRIER))
+		req->cmd_flags |= REQ_FLUSHBARRIER;
 	if (bio_rw_flagged(bio, BIO_RW_SYNCIO))
 		req->cmd_flags |= REQ_RW_SYNC;
 	if (bio_rw_flagged(bio, BIO_RW_META))
@@ -1185,9 +1187,10 @@ static int __make_request(struct request
 	const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int rw_flags;
+	const bool is_barrier = (bio_rw_flagged(bio, BIO_RW_BARRIER)
+				|| bio_rw_flagged(bio, BIO_RW_FLUSHBARRIER));
 
-	if (bio_rw_flagged(bio, BIO_RW_BARRIER) &&
-	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
+	if (is_barrier && (q->next_ordered == QUEUE_ORDERED_NONE)) {
 		bio_endio(bio, -EOPNOTSUPP);
 		return 0;
 	}
@@ -1200,7 +1203,7 @@ static int __make_request(struct request
 
 	spin_lock_irq(q->queue_lock);
 
-	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q))
+	if (unlikely(is_barrier) || elv_queue_empty(q))
 		goto get_rq;
 
 	el_ret = elv_merge(q, &req, bio);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/include/linux/fs.h	2010-08-02 13:19:02.000000000 -0400
@@ -160,6 +160,7 @@ struct inodes_stat_t {
 			(SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
 #define SWRITE_SYNC	(SWRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
 #define WRITE_BARRIER	(WRITE | (1 << BIO_RW_BARRIER))
+#define WRITE_FLUSHBARRIER	(WRITE | (1 << BIO_RW_FLUSHBARRIER))
 
 /*
  * These aren't really reads or writes, they pass down information about
Index: linux-2.6/Makefile
===================================================================
--- linux-2.6.orig/Makefile	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/Makefile	2010-08-02 13:19:02.000000000 -0400
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 35
-EXTRAVERSION = -rc6
+EXTRAVERSION = -rc6-flush-barriers
 NAME = Sheep on Meth
 
 # *DOCUMENTATION*
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/fs/buffer.c	2010-08-02 14:01:17.000000000 -0400
@@ -3026,6 +3026,9 @@ int submit_bh(int rw, struct buffer_head
 	if (buffer_ordered(bh) && (rw & WRITE))
 		rw |= WRITE_BARRIER;
 
+	if (buffer_flush_ordered(bh) && (rw & WRITE))
+		rw |= WRITE_FLUSHBARRIER;
+
 	/*
 	 * Only clear out a write error when rewriting
 	 */
Index: linux-2.6/fs/ext3/fsync.c
===================================================================
--- linux-2.6.orig/fs/ext3/fsync.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/fs/ext3/fsync.c	2010-08-02 14:01:17.000000000 -0400
@@ -91,6 +91,6 @@ int ext3_sync_file(struct file *file, in
 	 */
 	if (needs_barrier)
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
-				BLKDEV_IFL_WAIT);
+				BLKDEV_IFL_WAIT | BLKDEV_IFL_FLUSHBARRIER);
 	return ret;
 }
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/fs/jbd/commit.c	2010-08-02 14:01:17.000000000 -0400
@@ -138,7 +138,7 @@ static int journal_write_commit_record(j
 	JBUFFER_TRACE(descriptor, "write commit block");
 	set_buffer_dirty(bh);
 	if (journal->j_flags & JFS_BARRIER) {
-		set_buffer_ordered(bh);
+		set_buffer_flush_ordered(bh);
 		barrier_done = 1;
 	}
 	ret = sync_dirty_buffer(bh);
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/include/linux/buffer_head.h	2010-08-02 14:01:17.000000000 -0400
@@ -33,6 +33,8 @@ enum bh_state_bits {
 	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
+	BH_Flush_Ordered,/* ordered write. Ordered w.r.t contents in write
+			    cache */
 	BH_Eopnotsupp,	/* operation not supported (barrier) */
 	BH_Unwritten,	/* Buffer is allocated on disk but not written */
 	BH_Quiet,	/* Buffer Error Prinks to be quiet */
@@ -126,6 +128,7 @@ BUFFER_FNS(Delay, delay)
 BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
+BUFFER_FNS(Flush_Ordered, flush_ordered)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
 BUFFER_FNS(Unwritten, unwritten)
 
Index: linux-2.6/kernel/trace/blktrace.c
===================================================================
--- linux-2.6.orig/kernel/trace/blktrace.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/kernel/trace/blktrace.c	2010-08-02 14:01:17.000000000 -0400
@@ -1764,7 +1764,7 @@ void blk_fill_rwbs(char *rwbs, u32 rw, i
 
 	if (rw & 1 << BIO_RW_AHEAD)
 		rwbs[i++] = 'A';
-	if (rw & 1 << BIO_RW_BARRIER)
+	if (rw & 1 << BIO_RW_BARRIER || rw & 1 << BIO_RW_FLUSHBARRIER)
 		rwbs[i++] = 'B';
 	if (rw & 1 << BIO_RW_SYNCIO)
 		rwbs[i++] = 'S';

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-30 14:20                                           ` Christoph Hellwig
  2010-07-31  0:47                                             ` Jan Kara
@ 2010-08-02 19:01                                             ` Vladislav Bolkhovitin
  2010-08-02 19:26                                               ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-02 19:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 07/30/2010 06:20 PM wrote:
> On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote:
>> Yes, but why not take a step further and allow us to completely eliminate
>> the waiting/draining using ORDERED requests? Current advanced storage
>> hardware allows that.
>
> There are a few cases where we could do that - the fsync without metadata
> changes above would be the prime example.  But there's a lot of lower
> hanging fruit until we get to the point where it's worth trying.

Yes, but, since an interface and file system update is coming anyway, 
why not design the interface now and then gradually fill it in with the 
implementation?

All barrier discussions are always very hot. It definitely means the 
current approach doesn't satisfy many people, from FS developers to 
storage vendors and users. I believe this is because the whole barrier 
ideology is not natural, hence there is too much trouble fitting it to 
real life. Apparently, this approach needs some redesign to get into a 
more acceptable form.

IMHO, all that is needed is:

1. Allow requests to be optionally combined into groups, and allow 
optional properties to be set per group: caching and ordering modes (see 
below). Each group would reflect a higher level operation.

2. Allow request groups to be chained. Each chain would reflect an order 
dependency between groups, i.e. between higher level operations.

This interface is a natural extension of the current interface. Natural 
for storage too. In the extreme, when a group is empty, it could be 
implemented as a barrier, although, since there would be no dependencies 
between unchained groups, they could be freely reordered against each 
other.

We would need request grouping sooner or later anyway, because otherwise 
it is impossible to implement selective cache flushing instead of 
flushing the cache for the whole device as we do now. This is a highly 
demanded feature, especially for shared and distributed devices.

The caching properties would be:

  - None (default) - no cache flushing needed.

  - "Flush after each request". It would be translated to FUA on write 
back devices with FUA, (write, sync_cache) sequence on write back 
devices without FUA, and to nothing on write through devices.

  - "Flush at once after all finished". It would be translated to one or 
more SYNC_CACHE commands, executed after all done and syncing _only_ 
what was modified in the group, not the whole device as now.

The order properties would be:

  - None (default) - there are no order dependency between requests in 
the group.

  - ORDERED - all requests in the group must be executed in order.

Additionally, if the backend device supported ORDERED commands, this 
facility would be used to eliminate extra queue draining. For instance, 
"flush after each request" on WB devices without FUA would be a sequence 
of ORDERED commands: [(write, sync_cache) ... (write, sync_cache) wait]. 
Compare to the [(write, wait, sync_cache, wait) ... (write, wait, 
sync_cache, wait)] needed to achieve the same without ORDERED command support.

For instance, your example of the fsync in XFS would be:

1) Write out all the data blocks as a group with no caching and ordering 
properties.

2) Wait that group to finish

3) Propagate any I/O errors to the inode so we can pick them up

4) Update the inode size in the shadow in-memory structure

5. Start a transaction to log the inode size in a new group with the 
properties "Flush at once after all finished" and no ordering (or, if 
necessary, ORDERED; it isn't clear from your text which is needed).

6) Write out a log buffer containing the inode and btree updates in the 
new group in a chain after the group from (5) with necessary cache 
flushing and ordering properties.

I believe it can be implemented acceptably simply and effectively, 
including at the I/O scheduler level, and I have some ideas for that.
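
As a purely hypothetical sketch of such an interface (none of these 
types or functions exist; they only illustrate groups with properties 
and chaining, using the fsync example above):

enum grp_cache { GRP_CACHE_NONE, GRP_FLUSH_EACH, GRP_FLUSH_AT_END };
enum grp_order { GRP_ORDER_NONE, GRP_ORDERED };

struct bio_group *data, *size_upd, *log;

/* (1): plain data writeout, no caching/ordering properties */
data = blk_group_alloc(GRP_CACHE_NONE, GRP_ORDER_NONE);
blk_group_add_bio(data, data_bio);
blk_group_submit(data);
blk_group_wait(data);		/* (2)-(4): propagate errors, update inode */

/* (5): the inode size transaction, flushed at once when finished */
size_upd = blk_group_alloc(GRP_FLUSH_AT_END, GRP_ORDER_NONE);

/* (6): the log buffer, chained so it starts only after (5) completes */
log = blk_group_alloc(GRP_FLUSH_AT_END, GRP_ORDERED);
blk_group_chain(size_upd, log);

blk_group_add_bio(size_upd, inode_bio);
blk_group_add_bio(log, log_bio);
blk_group_submit(size_upd);
blk_group_submit(log);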

Just my 5c from the storage vendors side.

> But in most cases we don't just drain an imaginary queue but actually
> need to modify software state before finishing one class of I/O and
> submitting the next.
>
> Again, take the example of fsync, but this time we have actually
> extended the file and need to log an inode size update, as well
> as a modification to the btree blocks.
>
> Now the fsync in XFS looks like this:
>
> 1) write out all the data blocks using WRITE
> 2) wait for these to finish
> 3) propagate any I/O errors to the inode so we can pick them up
> 4) update the inode size in the shadow in-memory structure
> 5) start a transaction to log the inode size
> 6) flush the write cache to make sure the data really is on disk

Here there should be a "6.1) wait for it to finish", which could be 
eliminated if the requests were sent ordered, correct?

> 7) write out a log buffer containing the inode and btree updates
> 8) if the FUA bit is not support flush the cache again
>
> and yes, the flush in 6) is important so that we don't happen
> to log the inode size update before all data has made it to disk
> in case the cache flush in 8) is interrupted

^ permalink raw reply	[flat|nested] 155+ messages in thread

* xfs rm performance
  2010-08-02 12:48                                                 ` Christoph Hellwig
@ 2010-08-02 19:03                                                   ` Vladislav Bolkhovitin
  2010-08-02 19:18                                                     ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-02 19:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo,
	Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

This is somewhat related to the discussion, so I think it is relevant to share some of my observations here.

One of the tests I use to verify the performance of SCST is the io_thrash utility. This utility emulates DB-like access. For more details see http://lkml.org/lkml/2008/11/17/444.

In particular, I'm running io_thrash with the following parameters: "2 2 ./ 500000000 50000000 10  4096 4096 300000 10 90 0 10" over a 5GB XFS iSCSI drive. The backend for this drive is a 5GB file on a 15K RPM Wide SCSI HDD. The initiator has 256MB of memory, the target 2GB. The kernel on the initiator is Ubuntu 2.6.32-22-386.

In this mode io_thrash creates sparse files and fills them in a transactional, DB-like manner. After it finishes it leaves 4 files:

# ls -l
total 1448548
-rw-r--r-- 1 root root 2048000000000 2010-08-03 01:13 _0.db
-rw-r--r-- 1 root root     124596224 2010-08-03 01:13 _0.jnl
-rw-r--r-- 1 root root 2048000000000 2010-08-03 01:13 _1.db
-rw-r--r-- 1 root root     124592128 2010-08-03 01:13 _1.jnl
-rwxr-xr-x 1 root root         24141 2008-11-19 19:29 io_thrash

The problem is:

# time rm _*

real	4m3.769s
user	0m0.000s
sys	0m25.594s

4(!) minutes to delete 4 files! For comparison, ext4 does it in a few seconds.

I traced what XFS is doing during that time. The initiator is sending the following pattern, a _single command at a time_:

kernel: [12703.146464] [4021]: scst_cmd_init_done:286:Receiving CDB:
kernel: [12703.146477]  (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
kernel: [12703.146490]    0: 2a 00 00 09 cc ee 00 00 08 00 00 00 00 00 00 00   *...............
kernel: [12703.146513] [4021]: scst: scst_parse_cmd:601:op_name <WRITE(10)> (cmd d6b4a000), direction=1 (expected 1, set yes), bufflen=32768, out_bufflen=0, (expected len 32768, out expected len 0), flags=111
kernel: [12703.148201] [4112]: scst: scst_cmd_done_local:1598:cmd d6b4a000, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
kernel: [12703.149195] [4021]: scst: scst_cmd_init_done:284:tag=112, lun=0, CDB len=16, queue_type=1 (cmd d6b4a000)
kernel: [12703.149216] [4021]: scst_cmd_init_done:286:Receiving CDB:
kernel: [12703.149228]  (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
kernel: [12703.149242]    0: 2a 00 00 09 cc f6 00 00 08 00 00 00 00 00 00 00   *...............
kernel: [12703.149266] [4021]: scst: scst_parse_cmd:601:op_name <WRITE(10)> (cmd d6b4a000), direction=1 (expected 1, set yes), bufflen=32768, out_bufflen=0, (expected len 32768, out expected len 0), flags=111
kernel: [12703.150852] [4112]: scst: scst_cmd_done_local:1598:cmd d6b4a000, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
kernel: [12703.151887] [4021]: scst: scst_cmd_init_done:284:tag=12, lun=0, CDB len=16, queue_type=1 (cmd d6b4a000)
kernel: [12703.151908] [4021]: scst_cmd_init_done:286:Receiving CDB:
kernel: [12703.151920]  (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
kernel: [12703.151934]    0: 2a 00 00 09 cc fe 00 00 08 00 00 00 00 00 00 00   *...............
kernel: [12703.151955] [4021]: scst: scst_parse_cmd:601:op_name <WRITE(10)> (cmd d6b4a000), direction=1 (expected 1, set yes), bufflen=32768, out_bufflen=0, (expected len 32768, out expected len 0), flags=111
kernel: [12703.153622] [4112]: scst: scst_cmd_done_local:1598:cmd d6b4a000, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
kernel: [12703.154655] [4021]: scst: scst_cmd_init_done:284:tag=15, lun=0, CDB len=16, queue_type=1 (cmd d6b4a000)

"Scst_cmd_init_done" means new coming command, "scst_cmd_done_local" means it's finished. See the 1ms gap between previous command finished and new came. You can see that if XFS was sending many commands at time, it would finish the job several (5-10) times faster.

Is it possible to improve that and make XFS fully fill the device's queue during rm'ing?

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: xfs rm performance
  2010-08-02 19:03                                                   ` xfs rm performance Vladislav Bolkhovitin
@ 2010-08-02 19:18                                                     ` Christoph Hellwig
  2010-08-05 19:31                                                       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-02 19:18 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Jan Kara, Ted Ts'o, Andreas Dilger,
	Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Mon, Aug 02, 2010 at 11:03:00PM +0400, Vladislav Bolkhovitin wrote:
> I traced what XFS is doing during that time. The initiator is sending the following pattern, a _single command at a time_:

That's exactly the queue draining we're talking about here.  To see
how the pattern gets better, use the nobarrier option.

Even with that XFS traditionally has a bad I/O pattern for metadata
intensive workloads due to the amount of log I/O needed for it.
Starting from Linux 2.6.35 the delayed logging code fixes this, and
we hope to enable it by default after about 10 to 12 months of
extensive testing.

Try to re-run your test with

	-o delaylog,logbsize=262144

to see a better log I/O pattern.  If your target doesn't present a
volatile write cache, also add the nobarrier option mentioned above.
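
For example, assuming the filesystem sits on /dev/sdX (device and
mount point are placeholders):

	# mount -t xfs -o delaylog,logbsize=262144,nobarrier /dev/sdX /mnt/test

Drop nobarrier again if the target does present a volatile write cache.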

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-02 19:01                                             ` [RFC] relaxed barrier semantics Vladislav Bolkhovitin
@ 2010-08-02 19:26                                               ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-02 19:26 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Ted Ts'o, Andreas Dilger, Ric Wheeler,
	Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Mon, Aug 02, 2010 at 11:01:53PM +0400, Vladislav Bolkhovitin wrote:
> IMHO, all is needed are:

What we need first is a simple interface that

 a) guarantees data integrity
 b) doesn't cause massive slowdowns

and then we can optimize it later.

What we absolutely don't need is a large number of different
interfaces that no one understands and that all are buggy in some way.

> >Now the fsync in XFS looks like this:
> >
> >1) write out all the data blocks using WRITE
> >2) wait for these to finish
> >3) propagate any I/O error to the inode so we can pick them up
> >4) update the inode size in the shadow in-memory structure
> >5) start a transaction to log the inode size
> >6) flush the write cache to make sure the data really is on disk
> 
> Here should be "6.1) wait for it to finish"

yes

> which can be eliminated if 
> requests sent ordered, correct?

not really - if the cache flush returns an error we shouldn't even send
the log update.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics)
  2010-08-02 18:28                                   ` [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) Vivek Goyal
@ 2010-08-03 13:03                                     ` Christoph Hellwig
  2010-08-04 15:29                                       ` Vivek Goyal
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-03 13:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Mon, Aug 02, 2010 at 02:28:04PM -0400, Vivek Goyal wrote:
> Hi Christoph,
> 
> Please find attached a new version of patch where I am trying to implement
> flush only barriers. Why do that? I was thinking that it would be nice to avoid
> elevator drains with WCE=1.
> 
> Here I have a DRAIN queue and I seem to be issuing post-flush only after
> barrier has finished. Need to find some device with TAG queue also to test.
> 
> This is still a very crude patch where I need to do lot of testing to see if
> things are working. For the time being I have just hooked up ext3 to use
> flush barrier and verified that in case of WCE=0 we don't issue barrier
> and in case of WCE=1 we do issue barrier with pre flush and postflush.
> 
> I haven't yet found a device with FUA and tagging support to verify 
> that functionality.

There are no devices that use the tagging support.  Only brd and virtio
ever use the QUEUE_ORDERED_TAG type.  For brd Nick chose it at random,
and it really doesn't matter when we're dealing with a ramdisk.  For
virtio-blk it's only used by lguest, which only allows a single
outstanding command anyway.  In short, we can just remove it once we
stop draining for the other modes.

> o On storage with write cache, for empty barrier, only pre-flush is done.
>   For barrier request with some data one of following should happen depending
>   on queue capability.
> 
> 	Draining queue
> 	--------------
> 	preflush ==> barrier (FUA)
> 	preflush ==> barrier ===> postflush
> 	
> 	Ordered Queue
> 	-------------
> 	preflush-->barrier (FUA)
> 	preflush --> barrier ---> postflush
> 
> 	===> Wait for previous request to finish
> 	---> Issue an ordered request in Tagged queue.

with ordered you mean the unused _TAG mode?

>   - Not sure how to avoid this drain. Trying to allow other non-barrier
>     requests to dispatch while we wait for pre-flush/flush barrier to finish
>     will make code more complicated.

That's pretty much where I got stuck, too.  Thanks for doing this, but
I'd be surprised if it really gives us all that much benefit for real
life workloads.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* [PATCH, RFC 1/2] relaxed cache flushes
  2010-07-27 17:54 ` Jan Kara
  2010-07-27 18:35   ` Vivek Goyal
  2010-07-27 19:37   ` Christoph Hellwig
@ 2010-08-03 18:49   ` Christoph Hellwig
  2010-08-03 18:51     ` [PATCH, RFC 2/2] dm: support REQ_FLUSH directly Christoph Hellwig
  2010-08-06 16:04     ` [PATCH, RFC] relaxed barriers Tejun Heo
  2 siblings, 2 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-03 18:49 UTC (permalink / raw)
  To: Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.maso

So instead of cracking my head on the relaxed barriers I've decided to
do the easiest part first.  That is, relaxing the explicit cache flushes
done by blkdev_issue_flush.  These days those are handled as an
empty barrier, which is complete overkill.  Instead take advantage
of the way we now handle flushes, that is as REQ_FLUSH FS requests.

Do a few updates to the block layer so that we handle REQ_FLUSH
correctly and we can make blkdev_issue_flush submit them directly.

All request based block drivers should just work with it, but bio
based remappers will need some additional work.  The next patch
will do this for DM, but I haven't quite grasped the barrier code
in MD yet.  Despite doing a lot of REQ_HARDBARRIER tests, DRBD doesn't
actually advertise any ordered mode, so it's not affected.  The
barrier handling in the loop driver is currently broken anyway,
and I'm still undecided if I want to fix it before or after
this conversion.


Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2010-08-03 20:26:50.259005954 +0200
+++ linux-2.6/block/blk-barrier.c	2010-08-03 20:33:39.580266216 +0200
@@ -151,25 +151,7 @@ static inline bool start_ordered(struct
 	q->ordered = q->next_ordered;
 	q->ordseq |= QUEUE_ORDSEQ_STARTED;
 
-	/*
-	 * For an empty barrier, there's no actual BAR request, which
-	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
-	 */
-	if (!blk_rq_sectors(rq)) {
-		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
-				QUEUE_ORDERED_DO_POSTFLUSH);
-		/*
-		 * Empty barrier on a write-through device w/ ordered
-		 * tag has no command to issue and without any command
-		 * to issue, ordering by tag can't be used.  Drain
-		 * instead.
-		 */
-		if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
-		    !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) {
-			q->ordered &= ~QUEUE_ORDERED_BY_TAG;
-			q->ordered |= QUEUE_ORDERED_BY_DRAIN;
-		}
-	}
+	BUG_ON(!blk_rq_sectors(rq));
 
 	/* stash away the original request */
 	blk_dequeue_request(rq);
@@ -311,6 +293,9 @@ int blkdev_issue_flush(struct block_devi
 	if (!q)
 		return -ENXIO;
 
+	if (!(q->next_ordered & QUEUE_ORDERED_DO_PREFLUSH))
+		return 0;
+
 	/*
 	 * some block devices may not have their queue correctly set up here
 	 * (e.g. loop device without a backing file) and so issuing a flush
@@ -327,7 +312,7 @@ int blkdev_issue_flush(struct block_devi
 		bio->bi_private = &wait;
 
 	bio_get(bio);
-	submit_bio(WRITE_BARRIER, bio);
+	submit_bio(WRITE_SYNC | REQ_FLUSH, bio);
 	if (test_bit(BLKDEV_WAIT, &flags)) {
 		wait_for_completion(&wait);
 		/*
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c	2010-08-03 20:26:50.268024322 +0200
+++ linux-2.6/block/elevator.c	2010-08-03 20:32:11.949256478 +0200
@@ -423,7 +423,8 @@ void elv_dispatch_sort(struct request_qu
 	q->nr_sorted--;
 
 	boundary = q->end_sector;
-	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
+	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED | \
+		     REQ_FLUSH;
 	list_for_each_prev(entry, &q->queue_head) {
 		struct request *pos = list_entry_rq(entry);
 
Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h	2010-08-03 20:26:50.298255570 +0200
+++ linux-2.6/include/linux/bio.h	2010-08-03 20:46:48.367257736 +0200
@@ -153,6 +153,7 @@ enum rq_flag_bits {
 	__REQ_META,		/* metadata io request */
 	__REQ_DISCARD,		/* request to discard sectors */
 	__REQ_NOIDLE,		/* don't anticipate more IO after this one */
+	__REQ_FLUSH,		/* request for cache flush */
 
 	/* bio only flags */
 	__REQ_UNPLUG,		/* unplug the immediately after submission */
@@ -174,7 +175,6 @@ enum rq_flag_bits {
 	__REQ_ALLOCED,		/* request came from our alloc pool */
 	__REQ_COPY_USER,	/* contains copies of user pages */
 	__REQ_INTEGRITY,	/* integrity metadata has been remapped */
-	__REQ_FLUSH,		/* request for cache flush */
 	__REQ_IO_STAT,		/* account I/O stat */
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
 	__REQ_NR_BITS,		/* stops here */
@@ -189,12 +189,13 @@ enum rq_flag_bits {
 #define REQ_META		(1 << __REQ_META)
 #define REQ_DISCARD		(1 << __REQ_DISCARD)
 #define REQ_NOIDLE		(1 << __REQ_NOIDLE)
+#define REQ_FLUSH		(1 << __REQ_FLUSH)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
-	 REQ_META| REQ_DISCARD | REQ_NOIDLE)
+	 REQ_META| REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH)
 
 #define REQ_UNPLUG		(1 << __REQ_UNPLUG)
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
@@ -214,7 +215,6 @@ enum rq_flag_bits {
 #define REQ_ALLOCED		(1 << __REQ_ALLOCED)
 #define REQ_COPY_USER		(1 << __REQ_COPY_USER)
 #define REQ_INTEGRITY		(1 << __REQ_INTEGRITY)
-#define REQ_FLUSH		(1 << __REQ_FLUSH)
 #define REQ_IO_STAT		(1 << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1 << __REQ_MIXED_MERGE)
 
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2010-08-03 20:26:50.311003929 +0200
+++ linux-2.6/include/linux/blkdev.h	2010-08-03 20:32:11.956036684 +0200
@@ -589,7 +589,8 @@ static inline void blk_clear_queue_full(
  * it already be started by driver.
  */
 #define RQ_NOMERGE_FLAGS	\
-	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | \
+	 REQ_FLUSH)
 #define rq_mergeable(rq)	\
 	(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
 	 (((rq)->cmd_flags & REQ_DISCARD) || \
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2010-08-03 20:26:50.275003649 +0200
+++ linux-2.6/block/blk-core.c	2010-08-03 20:32:11.960004138 +0200
@@ -1203,7 +1203,7 @@ static int __make_request(struct request
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int rw_flags;
 
-	if ((bio->bi_rw & REQ_HARDBARRIER) &&
+	if ((bio->bi_rw & (REQ_HARDBARRIER|REQ_FLUSH)) &&
 	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
 		bio_endio(bio, -EOPNOTSUPP);
 		return 0;
@@ -1217,7 +1217,7 @@ static int __make_request(struct request
 
 	spin_lock_irq(q->queue_lock);
 
-	if (unlikely((bio->bi_rw & REQ_HARDBARRIER)) || elv_queue_empty(q))
+	if ((bio->bi_rw & (REQ_HARDBARRIER|REQ_FLUSH)) || elv_queue_empty(q))
 		goto get_rq;
 
 	el_ret = elv_merge(q, &req, bio);

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-03 18:49   ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig
@ 2010-08-03 18:51     ` Christoph Hellwig
  2010-08-04  4:57       ` Kiyoshi Ueda
  2010-08-06 16:04     ` [PATCH, RFC] relaxed barriers Tejun Heo
  1 sibling, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-03 18:51 UTC (permalink / raw)
  To: Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.maso

Adapt device-mapper to the new world order where even bio based devices
get simple REQ_FLUSH requests for cache flushes, and need to submit
them downwards for implementing barriers.

Note that I've removed the unlikely statements around the REQ_FLUSH
checks.  While these generally aren't as common as normal reads/writes,
they are common enough that statically mispredicting them is a really
bad idea.

Tested with simple linear LVM volumes only so far.


Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c	2010-08-03 20:26:49.629254174 +0200
+++ linux-2.6/drivers/md/dm-crypt.c	2010-08-03 20:36:59.279003929 +0200
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t
 	struct dm_crypt_io *io;
 	struct crypt_config *cc;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		cc = ti->private;
 		bio->bi_bdev = cc->dev->bdev;
 		return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c	2010-08-03 20:26:49.641003999 +0200
+++ linux-2.6/drivers/md/dm-raid1.c	2010-08-03 20:36:59.280003649 +0200
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *
 	struct dm_io_region io[ms->nr_mirrors], *dest = io;
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+		.bi_rw = WRITE | (bio->bi_rw & (WRITE_BARRIER|REQ_FLUSH)),
 		.mem.type = DM_IO_BVEC,
 		.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
 		.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
 	bio_list_init(&requeue);
 
 	while ((bio = bio_list_pop(writes))) {
-		if (unlikely(bio_empty_barrier(bio))) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			bio_list_add(&sync, bio);
 			continue;
 		}
@@ -1199,12 +1199,14 @@ static int mirror_end_io(struct dm_targe
 	struct dm_bio_details *bd = NULL;
 	struct dm_raid1_read_record *read_record = map_context->ptr;
 
+	if (bio->bi_rw & REQ_FLUSH)
+		return error;
+
 	/*
 	 * We need to dec pending if this was a write.
 	 */
 	if (rw == WRITE) {
-		if (likely(!bio_empty_barrier(bio)))
-			dm_rh_dec(ms->rh, map_context->ll);
+		dm_rh_dec(ms->rh, map_context->ll);
 		return error;
 	}
 
Index: linux-2.6/drivers/md/dm-region-hash.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-region-hash.c	2010-08-03 20:26:49.650023346 +0200
+++ linux-2.6/drivers/md/dm-region-hash.c	2010-08-03 20:36:59.285025649 +0200
@@ -399,7 +399,7 @@ void dm_rh_mark_nosync(struct dm_region_
 	region_t region = dm_rh_bio_to_region(rh, bio);
 	int recovering = 0;
 
-	if (bio_empty_barrier(bio)) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		rh->barrier_failure = 1;
 		return;
 	}
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_
 	struct bio *bio;
 
 	for (bio = bios->head; bio; bio = bio->bi_next) {
-		if (bio_empty_barrier(bio))
+		if (bio->bi_rw & REQ_FLUSH)
 			continue;
 		rh_inc(rh, dm_rh_bio_to_region(rh, bio));
 	}
Index: linux-2.6/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-snap.c	2010-08-03 20:26:49.656003091 +0200
+++ linux-2.6/drivers/md/dm-snap.c	2010-08-03 20:36:59.290023135 +0200
@@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		bio->bi_bdev = s->cow->bdev;
 		return DM_MAPIO_REMAPPED;
 	}
@@ -1685,7 +1685,7 @@ static int snapshot_merge_map(struct dm_
 	int r = DM_MAPIO_REMAPPED;
 	chunk_t chunk;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		if (!map_context->flush_request)
 			bio->bi_bdev = s->origin->bdev;
 		else
@@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target *
 	struct dm_dev *dev = ti->private;
 	bio->bi_bdev = dev->bdev;
 
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio->bi_rw & REQ_FLUSH)
 		return DM_MAPIO_REMAPPED;
 
 	/* Only tell snapshots if this is a write */
Index: linux-2.6/drivers/md/dm-stripe.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-stripe.c	2010-08-03 20:26:49.663003301 +0200
+++ linux-2.6/drivers/md/dm-stripe.c	2010-08-03 20:36:59.295005744 +0200
@@ -214,7 +214,7 @@ static int stripe_map(struct dm_target *
 	sector_t offset, chunk;
 	uint32_t stripe;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		BUG_ON(map_context->flush_request >= sc->stripes);
 		bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev;
 		return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c	2010-08-03 20:26:49.676004139 +0200
+++ linux-2.6/drivers/md/dm.c	2010-08-03 20:36:59.301005325 +0200
@@ -633,7 +633,7 @@ static void dec_pending(struct dm_io *io
 		io_error = io->error;
 		bio = io->bio;
 
-		if (bio->bi_rw & REQ_HARDBARRIER) {
+		if (bio == &md->barrier_bio) {
 			/*
 			 * There can be just one barrier request so we use
 			 * a per-device variable for error reporting.
@@ -851,7 +851,7 @@ void dm_requeue_unmapped_request(struct
 	struct request_queue *q = rq->q;
 	unsigned long flags;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_HARDBARRIER) {
 		/*
 		 * Barrier clones share an original request.
 		 * Leave it to dm_end_request(), which handles this special
@@ -950,7 +950,7 @@ static void dm_complete_request(struct r
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_HARDBARRIER) {
 		/*
 		 * Barrier clones share an original request.  So can't use
 		 * softirq_done with the original.
@@ -979,7 +979,7 @@ void dm_kill_unmapped_request(struct req
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_HARDBARRIER) {
 		/*
 		 * Barrier clones share an original request.
 		 * Leave it to dm_end_request(), which handles this special
@@ -1208,7 +1208,7 @@ static int __clone_and_map(struct clone_
 	sector_t len = 0, max;
 	struct dm_target_io *tio;
 
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio->bi_rw & REQ_FLUSH)
 		return __clone_and_map_empty_barrier(ci);
 
 	ti = dm_table_find_target(ci->map, ci->sector);
@@ -1308,7 +1308,7 @@ static void __split_and_process_bio(stru
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_HARDBARRIER))
+		if (bio != &md->barrier_bio)
 			bio_io_error(bio);
 		else
 			if (!md->barrier_error)
@@ -1326,7 +1326,7 @@ static void __split_and_process_bio(stru
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
 	ci.sector_count = bio_sectors(bio);
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio->bi_rw & REQ_FLUSH)
 		ci.sector_count = 1;
 	ci.idx = bio->bi_idx;
 
@@ -1421,7 +1421,7 @@ static int _dm_request(struct request_qu
 	 * we have to queue this io for later.
 	 */
 	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+	    unlikely(bio->bi_rw & (REQ_HARDBARRIER|REQ_FLUSH))) {
 		up_read(&md->io_lock);
 
 		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1462,14 +1462,6 @@ static int dm_request(struct request_que
 	return _dm_request(q, bio);
 }
 
-static bool dm_rq_is_flush_request(struct request *rq)
-{
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
-}
-
 void dm_dispatch_request(struct request *rq)
 {
 	int r;
@@ -1517,10 +1509,10 @@ static int setup_clone(struct request *c
 {
 	int r;
 
-	if (dm_rq_is_flush_request(rq)) {
+	if (rq->cmd_flags & REQ_FLUSH) {
 		blk_rq_init(NULL, clone);
 		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
+		clone->cmd_flags |= (WRITE_SYNC | REQ_FLUSH);
 	} else {
 		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
 				      dm_rq_bio_constructor, tio);
@@ -1573,7 +1565,7 @@ static int dm_prep_fn(struct request_que
 	struct mapped_device *md = q->queuedata;
 	struct request *clone;
 
-	if (unlikely(dm_rq_is_flush_request(rq)))
+	if (rq->cmd_flags & REQ_FLUSH)
 		return BLKPREP_OK;
 
 	if (unlikely(rq->special)) {
@@ -1664,7 +1656,7 @@ static void dm_request_fn(struct request
 		if (!rq)
 			goto plug_and_out;
 
-		if (unlikely(dm_rq_is_flush_request(rq))) {
+		if (rq->cmd_flags & REQ_FLUSH) {
 			BUG_ON(md->flush_request);
 			md->flush_request = rq;
 			blk_start_request(rq);
@@ -2239,7 +2231,7 @@ static void dm_flush(struct mapped_devic
 
 	bio_init(&md->barrier_bio);
 	md->barrier_bio.bi_bdev = md->bdev;
-	md->barrier_bio.bi_rw = WRITE_BARRIER;
+	md->barrier_bio.bi_rw = WRITE_SYNC | REQ_FLUSH;
 	__split_and_process_bio(md, &md->barrier_bio);
 
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
@@ -2250,19 +2242,8 @@ static void process_barrier(struct mappe
 	md->barrier_error = 0;
 
 	dm_flush(md);
-
-	if (!bio_empty_barrier(bio)) {
-		__split_and_process_bio(md, bio);
-		dm_flush(md);
-	}
-
-	if (md->barrier_error != DM_ENDIO_REQUEUE)
-		bio_endio(bio, md->barrier_error);
-	else {
-		spin_lock_irq(&md->deferred_lock);
-		bio_list_add_head(&md->deferred, bio);
-		spin_unlock_irq(&md->deferred_lock);
-	}
+	__split_and_process_bio(md, bio);
+	dm_flush(md);
 }
 
 /*
Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h	2010-08-03 20:32:11.951274008 +0200
+++ linux-2.6/include/linux/bio.h	2010-08-03 20:36:59.303005325 +0200
@@ -241,10 +241,6 @@ enum rq_flag_bits {
 #define bio_offset(bio)		bio_iovec((bio))->bv_offset
 #define bio_segments(bio)	((bio)->bi_vcnt - (bio)->bi_idx)
 #define bio_sectors(bio)	((bio)->bi_size >> 9)
-#define bio_empty_barrier(bio) \
-	((bio->bi_rw & REQ_HARDBARRIER) && \
-	 !bio_has_data(bio) && \
-	 !(bio->bi_rw & REQ_DISCARD))
 
 static inline unsigned int bio_cur_bytes(struct bio *bio)
 {
Index: linux-2.6/drivers/md/dm-io.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-io.c	2010-08-03 20:26:49.685023485 +0200
+++ linux-2.6/drivers/md/dm-io.c	2010-08-03 20:36:59.308004417 +0200
@@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned
 	 */
 	for (i = 0; i < num_regions; i++) {
 		*dp = old_pages;
-		if (where[i].count || (rw & REQ_HARDBARRIER))
+		if (where[i].count || (rw & REQ_FLUSH))
 			do_region(rw, i, where + i, dp, io);
 	}
 

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-07-29 20:11                                   ` Christoph Hellwig
  2010-07-30 12:45                                     ` Vladislav Bolkhovitin
@ 2010-08-04  1:58                                     ` Jamie Lokier
  1 sibling, 0 replies; 155+ messages in thread
From: Jamie Lokier @ 2010-08-04  1:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 03:07:17PM -0500, James Bottomley wrote:
> > There's lies, damned lies and benchmarks .. but what I was thinking is
> > could we just do the right thing?  SCSI exposes (in sd) the interfaces
> > to change the cache setting, so if the customer *doesn't* specify
> > barriers on mount, could we just flip the device to write through it
> > would be more performant in most use cases.
> 
> We could for SCSI and ATA, but probably not easily for other kind of
> storage.  Except that it's not that simple as we have partitions and
> volume managers inbetween - different filesystems sitting on the same
> device might have very different ideas of what they want.
> 
> For SCSI we can at least permanently disable the cache, but ATA devices
> keep coming up again with the volatile write cache enabled after a
> reboot, or even worse a suspend to ram / resume cycle.  The latter is
> what keeps me from just disabling the volatile cache on my laptop,
> despite that option giving significanly better performance for typical
> kernel developer workloads.

I have workloads where enabling volatile write cache + barriers is much
faster than disabling the cache.

It is admittedly an ancient 2.4 kernel and PATA on an embedded system,
but still, it's enough of a difference (about a 3x speedup for large
file writes) that it was worth porting SuSE's barrier patches to that
kernel so that I could enable the write cache to get a huge speedup
while remaining powerfail-safe with ext3.

-- Jamie

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-03 18:51     ` [PATCH, RFC 2/2] dm: support REQ_FLUSH directly Christoph Hellwig
@ 2010-08-04  4:57       ` Kiyoshi Ueda
  2010-08-04  8:54         ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Kiyoshi Ueda @ 2010-08-04  4:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel,
	linux-raid

Hi Christoph,

On 08/04/2010 03:51 AM +0900, Christoph Hellwig wrote:
> Adapt device-mapper to the new world order where even bio based devices
> get simple REQ_FLUSH requests for cache flushes, and need to submit
> them downwards for implementing barriers.
<snip>
> Index: linux-2.6/drivers/md/dm.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm.c	2010-08-03 20:26:49.676004139 +0200
> +++ linux-2.6/drivers/md/dm.c	2010-08-03 20:36:59.301005325 +0200
<snip>
> @@ -1573,7 +1565,7 @@ static int dm_prep_fn(struct request_que
>  	struct mapped_device *md = q->queuedata;
>  	struct request *clone;
>  
> -	if (unlikely(dm_rq_is_flush_request(rq)))
> +	if (rq->cmd_flags & REQ_FLUSH)
>  		return BLKPREP_OK;
>  
>  	if (unlikely(rq->special)) {
> @@ -1664,7 +1656,7 @@ static void dm_request_fn(struct request
>  		if (!rq)
>  			goto plug_and_out;
>  
> -		if (unlikely(dm_rq_is_flush_request(rq))) {
> +		if (rq->cmd_flags & REQ_FLUSH) {
>  			BUG_ON(md->flush_request);
>  			md->flush_request = rq;
>  			blk_start_request(rq);

Current request-based device-mapper's flush code depends on
the block layer's barrier behavior, which dispatches only one request
at a time when a flush is needed.
In other words, the current request-based device-mapper can't handle
other requests while a flush request is in progress.

I'll take a look at how I can fix the request-based device-mapper to
cope with it.  I think it'll take time for careful investigation.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-04  4:57       ` Kiyoshi Ueda
@ 2010-08-04  8:54         ` Christoph Hellwig
  2010-08-05  2:16           ` Jun'ichi Nomura
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-04  8:54 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Christoph Hellwig, Jan Kara, jaxboe, tj, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, linux-raid

On Wed, Aug 04, 2010 at 01:57:37PM +0900, Kiyoshi Ueda wrote:
> > -		if (unlikely(dm_rq_is_flush_request(rq))) {
> > +		if (rq->cmd_flags & REQ_FLUSH) {
> >  			BUG_ON(md->flush_request);
> >  			md->flush_request = rq;
> >  			blk_start_request(rq);
> 
> Current request-based device-mapper's flush code depends on
> the block-layer's barrier behavior which dispatches only one request
> at a time when flush is needed.
> In other words, current request-based device-mapper can't handle
> other requests while a flush request is in progress.
> 
> I'll take a look how I can fix the request-based device-mapper to
> cope with it.  I think it'll take time for carefull investigation.

Given that request based device mapper doesn't even look at the
block numbers, from what I can see just removing any special casing
for REQ_FLUSH should probably do it.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics)
  2010-08-03 13:03                                     ` Christoph Hellwig
@ 2010-08-04 15:29                                       ` Vivek Goyal
  2010-08-04 16:21                                         ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vivek Goyal @ 2010-08-04 15:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, chris.mason, swhiteho,
	konishi.ryusuke

On Tue, Aug 03, 2010 at 03:03:47PM +0200, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 02:28:04PM -0400, Vivek Goyal wrote:
> > Hi Christoph,
> > 
> > Please find attached a new version of patch where I am trying to implement
> > flush only barriers. Why do that? I was thinking that it would be nice to avoid
> > elevator drains with WCE=1.
> > 
> > Here I have a DRAIN queue and I seem to be issuing post-flush only after
> > barrier has finished. Need to find some device with TAG queue also to test.
> > 
> > This is still a very crude patch where I need to do lot of testing to see if
> > things are working. For the time being I have just hooked up ext3 to use
> > flush barrier and verified that in case of WCE=0 we don't issue barrier
> > and in case of WCE=1 we do issue barrier with pre flush and postflush.
> > 
> > I haven't yet found a device with FUA and tagging support to verify 
> > that functionality.
> 
> There are not devices that use the tagging support.  Only brd and virtio
> every use the QUEUE_ORDERED_TAG type.  For brd Nick chose it at random,
> and it really doesn't matter when we're dealing with a ramdisk.  For
> virtio-blk it's only used by lguest which only allows a signle
> outstanding command anyway.

What about qemu-kvm? Who imposes this single-request-in-queue limitation?
A quick look at the virtio-blk driver code did not suggest anything like that.

> In short we can just remove it once we
> stop draining for the other modes.
> 
> > o On storage with write cache, for empty barrier, only pre-flush is done.
> >   For barrier request with some data one of following should happen depending
> >   on queue capability.
> > 
> > 	Draining queue
> > 	--------------
> > 	preflush ==> barrier (FUA)
> > 	preflush ==> barrier ===> postflush
> > 	
> > 	Ordered Queue
> > 	-------------
> > 	preflush-->barrier (FUA)
> > 	preflush --> barrier ---> postflush
> > 
> > 	===> Wait for previous request to finish
> > 	---> Issue an ordered request in Tagged queue.
> 
> with ordered you mean the unused _TAG mode?

Yes. If nobody is using it, then we can probably drop it, but some of the
mails in the thread suggested scsi controllers can support tagged/ordered
queues very well. If so, then the whole barrier problem is simplified
a lot without losing performance. That would suggest that instead of
dropping the TAG queue support we should move in the direction of figuring
out how to enable it for scsi devices.

> 
> >   - Not sure how to avoid this drain. Trying to allow other non-barrier
> >     requests to dispatch while we wait for pre-flush/flush barrier to finish
> >     will make code more complicated.
> 
> That's pretty much where I got stuck, too.  Thanks for doing this, but
> I'd be surprised if it really gives us all that much benefits for real
> life workloads.

True. Without getting rid of draining completely, the performance
benefits might not be there.

Maybe file systems can take care of ordering completely. You already
modified blkdev_issue_flush() to convert it to just a flush request, so
it is no longer a barrier. So file systems could always issue the flush
first and then issue the dependent commit request with FUA.

That brings us back to the question of FUA emulation. Can the queue
capability be exposed to file systems so that they issue a post-flush
after the commit block if the device does not support FUA?
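
Something like the following sketch is what I mean, assuming the
unified REQ_* bio flags from this series. blk_queue_fua() is made up
here, it is exactly the capability test I'm asking for, and
wait_for_bio() stands in for the usual bi_end_io/completion dance:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>

/* Sketch of filesystem-side FUA emulation. */
static int fs_write_commit_block(struct block_device *bdev,
				 struct bio *commit_bio)
{
	struct request_queue *q = bdev_get_queue(bdev);
	int err;

	if (blk_queue_fua(q)) {		/* hypothetical capability test */
		/* device honours FUA: the commit write itself reaches
		 * stable storage, no post-flush needed */
		submit_bio(WRITE_SYNC | REQ_FUA, commit_bio);
		return wait_for_bio(commit_bio);	/* hypothetical */
	}

	/* no FUA: plain write, then emulate it with a post-flush */
	submit_bio(WRITE_SYNC, commit_bio);
	err = wait_for_bio(commit_bio);			/* hypothetical */
	if (!err)
		err = blkdev_issue_flush(bdev, GFP_KERNEL, NULL,
					 BLKDEV_IFL_WAIT);
	return err;
}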

Vivek

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics)
  2010-08-04 15:29                                       ` Vivek Goyal
@ 2010-08-04 16:21                                         ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-04 16:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, chris.mason,
	swhiteho, konishi.ryusuke

On Wed, Aug 04, 2010 at 11:29:16AM -0400, Vivek Goyal wrote:
> > There are not devices that use the tagging support.  Only brd and virtio
> > every use the QUEUE_ORDERED_TAG type.  For brd Nick chose it at random,
> > and it really doesn't matter when we're dealing with a ramdisk.  For
> > virtio-blk it's only used by lguest which only allows a signle
> > outstanding command anyway.
> 
> What about qemu-kvm? Who imposes this single request in queue limitation?
> A quick look at virtio-blk driver code did not suggest anything like that. 

qemu never used that mode exactly because it's buggy.  It has no way to
actually send a cache flush request (aka empty barrier), and to
implement ordering by tag properly in a Unix userspace program
we would just have to do the drain we currently do in the host kernel
inside qemu/lguest instead.

> > with ordered you mean the unused _TAG mode?
> 
> Yes. If nobody is using it, then we can probably drop it but some of the
> mails in the thread suggested scsi controllers can support tagged/ordered
> queues very well. If so then whole barrier problem is really simplified
> a lot without losing performance. That would suggest that instead of
> dropping the TAG queue support we should move in the direction of figuring
> out how to enable it for scsi devices.

scsi controllers can in theory, but the scsi layer can't without major
work.  I don't mind using ordering by tag, but I'd rather see an
actually working implementation instead of code that doesn't actually
get used and thus almost by definition gets buggy sooner or later.

> That will bring us back to question of FUA emulation. Can the queue
> capability be exposed to file systems so that they issue a post flush
> after commit block if device does not support FUA. 

Doing the pre and post flushes from the filesystem does mean that

 a) we add a lot of complexity to every single filesystem instead
    of doing it once, and
 b) we get much higher latency, as we need to go through a lot more
    layers compared to the current implementation.  E.g. for XFS,
    moving the log state machine forward means first waking up a
    per-cpu kernel thread.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-04  8:54         ` Christoph Hellwig
@ 2010-08-05  2:16           ` Jun'ichi Nomura
  2010-08-26 22:50             ` Mike Snitzer
  0 siblings, 1 reply; 155+ messages in thread
From: Jun'ichi Nomura @ 2010-08-05  2:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kiyoshi Ueda, Jan Kara, jaxboe, tj, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, linux-raid

Hi Christoph,

(08/04/10 17:54), Christoph Hellwig wrote:
> On Wed, Aug 04, 2010 at 01:57:37PM +0900, Kiyoshi Ueda wrote:
>>> -		if (unlikely(dm_rq_is_flush_request(rq))) {
>>> +		if (rq->cmd_flags & REQ_FLUSH) {
>>>  			BUG_ON(md->flush_request);
>>>  			md->flush_request = rq;
>>>  			blk_start_request(rq);
>>
>> Current request-based device-mapper's flush code depends on
>> the block-layer's barrier behavior which dispatches only one request
>> at a time when flush is needed.
>> In other words, current request-based device-mapper can't handle
>> other requests while a flush request is in progress.
>>
>> I'll take a look how I can fix the request-based device-mapper to
>> cope with it.  I think it'll take time for carefull investigation.
> 
> Given that request based device mapper doesn't even look at the
> block numbers from what I can see just removing any special casing
> for REQ_FLUSH should probably do it.

Special casing is necessary because device-mapper may have to
send multiple copies of a REQ_FLUSH request to multiple
targets, while a normal request is just sent to a single target.
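
Conceptually it has to look something like this for the bio-based case
(the field and helper names are approximate, not the real dm ones, and
completion accounting is omitted):

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * A flush has no sector, so it cannot be mapped to one target like a
 * normal request - it must be cloned and sent to every underlying
 * device.
 */
static void dm_send_flush_to_all_targets(struct mapped_device *md,
					 struct bio *flush_bio)
{
	unsigned i;

	for (i = 0; i < md->nr_targets; i++) {	/* hypothetical field */
		struct bio *clone = bio_clone(flush_bio, GFP_NOIO);

		clone->bi_bdev = dm_target_bdev(md, i);	/* hypothetical */
		submit_bio(WRITE_SYNC | REQ_FLUSH, clone);
	}
	/* waiting for the clones and merging their errors is omitted */
}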

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-02 17:39                     ` Chris Mason
@ 2010-08-05 13:11                       ` Vladislav Bolkhovitin
  2010-08-05 13:32                         ` Chris Mason
  2010-08-05 17:09                         ` Christoph Hellwig
  2010-08-05 13:11                       ` Vladislav Bolkhovitin
  1 sibling, 2 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-05 13:11 UTC (permalink / raw)
  To: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.B

Chris Mason, on 08/02/2010 09:39 PM wrote:
> I regret putting the ordering into the original barrier code...it
> definitely did help reiserfs back in the day but it stinks of magic and
> voodoo.

But if the ordering isn't in the common (block) code, how can we
implement the "hardware offload" for ordering, i.e. ORDERED commands,
in an acceptable way?

I believe the decision was right, but the flags and magic requests
based interface (and, hence, implementation) was wrong. That's what
stinks of magic and voodoo.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 13:11                       ` Vladislav Bolkhovitin
@ 2010-08-05 13:32                         ` Chris Mason
  2010-08-05 14:52                           ` Hannes Reinecke
                                             ` (3 more replies)
  2010-08-05 17:09                         ` Christoph Hellwig
  1 sibling, 4 replies; 155+ messages in thread
From: Chris Mason @ 2010-08-05 13:32 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho,
	konishi.ryusuke

On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
> Chris Mason, on 08/02/2010 09:39 PM wrote:
> >I regret putting the ordering into the original barrier code...it
> >definitely did help reiserfs back in the day but it stinks of magic and
> >voodoo.
> 
> But if the ordering isn't in the common (block) code, how to
> implement the "hardware offload" for ordering, i.e. ORDERED
> commands, in an acceptable way?
> 
> I believe, the decision was right, but the flags and magic requests
> based interface (and, hence, implementation) was wrong. That's it
> which stinks of magic and voodoo.

The interface definitely has flaws.  We didn't expand it because James
popped up with a long list of error handling problems.  Basically, how
do the hardware and the kernel deal with a failed request at the start
of the chain?  Somehow the easy way of failing them all turned out to be
extremely difficult.

Even if that part had been refined, I think trusting the ordering down
to the lower layers was a doomed idea.  The list of ways it could go
wrong is much much longer (and harder to debug) than the list of
benefits.

With all of that said, I did go ahead and benchmark real ordered tags
extensively on a scsi drive in the initial implementation.  There was
very little performance difference.

-chris


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 13:32                         ` Chris Mason
  2010-08-05 14:52                           ` Hannes Reinecke
@ 2010-08-05 14:52                           ` Hannes Reinecke
  2010-08-05 15:17                             ` Chris Mason
  2010-08-05 17:07                             ` Christoph Hellwig
  2010-08-05 19:48                           ` Vladislav Bolkhovitin
  2010-08-05 19:48                           ` Vladislav Bolkhovitin
  3 siblings, 2 replies; 155+ messages in thread
From: Hannes Reinecke @ 2010-08-05 14:52 UTC (permalink / raw)
  To: Chris Mason, Vladislav Bolkhovitin, Christoph Hellwig, Tejun Heo,
	Vivek Goyal, Jan Kara

Chris Mason wrote:
> On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
>> Chris Mason, on 08/02/2010 09:39 PM wrote:
>>> I regret putting the ordering into the original barrier code...it
>>> definitely did help reiserfs back in the day but it stinks of magic and
>>> voodoo.
>> But if the ordering isn't in the common (block) code, how to
>> implement the "hardware offload" for ordering, i.e. ORDERED
>> commands, in an acceptable way?
>>
>> I believe, the decision was right, but the flags and magic requests
>> based interface (and, hence, implementation) was wrong. That's it
>> which stinks of magic and voodoo.
> 
> The interface definitely has flaws.  We didn't expand it because James
> popped up with a long list of error handling problems.  Basically how
> do the hardware and the kernel deal with a failed request at the start
> of the chain.  Somehow the easy way of failing them all turned out to be
> extremely difficult.
> 
> Even if that part had been refined, I think trusting the ordering down
> to the lower layers was a doomed idea.  The list of ways it could go
> wrong is much much longer (and harder to debug) than the list of
> benefits.
> 
> With all of that said, I did go ahead and benchmark real ordered tags
> extensively on a scsi drive in the initial implementation.  There was
> very little performance difference.
> 
Care to dig it up?
I've wanted to give it a try, and if someone has already done some work
in that area it'll make things easier here.

I still think that implementing ordered tags is the correct way of
doing things, implementation details notwithstanding.

It looks better conceptually than using FUA, and would be easier
from the request-queue side of things.
(Of course, as the entire logic is pushed down to the SCSI layer :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 14:52                           ` Hannes Reinecke
@ 2010-08-05 15:17                             ` Chris Mason
  2010-08-05 17:07                             ` Christoph Hellwig
  1 sibling, 0 replies; 155+ messages in thread
From: Chris Mason @ 2010-08-05 15:17 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Vladislav Bolkhovitin, Christoph Hellwig, Tejun Heo, Vivek Goyal,
	Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, swhiteho, konishi.ryusuke

On Thu, Aug 05, 2010 at 04:52:15PM +0200, Hannes Reinecke wrote:
> Chris Mason wrote:
> > On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
> >> Chris Mason, on 08/02/2010 09:39 PM wrote:
> >>> I regret putting the ordering into the original barrier code...it
> >>> definitely did help reiserfs back in the day but it stinks of magic and
> >>> voodoo.
> >> But if the ordering isn't in the common (block) code, how to
> >> implement the "hardware offload" for ordering, i.e. ORDERED
> >> commands, in an acceptable way?
> >>
> >> I believe, the decision was right, but the flags and magic requests
> >> based interface (and, hence, implementation) was wrong. That's it
> >> which stinks of magic and voodoo.
> > 
> > The interface definitely has flaws.  We didn't expand it because James
> > popped up with a long list of error handling problems.  Basically how
> > do the hardware and the kernel deal with a failed request at the start
> > of the chain.  Somehow the easy way of failing them all turned out to be
> > extremely difficult.
> > 
> > Even if that part had been refined, I think trusting the ordering down
> > to the lower layers was a doomed idea.  The list of ways it could go
> > wrong is much much longer (and harder to debug) than the list of
> > benefits.
> > 
> > With all of that said, I did go ahead and benchmark real ordered tags
> > extensively on a scsi drive in the initial implementation.  There was
> > very little performance difference.
> > 
> Care to dig it up?
> I'd wanted to give it a try, and if someone already did some work in
> that area it'll make things easier here.
> 
> I still think that implementing ordered tags is the correct way of
> doing things, implementation details notwithstanding.
> 
> It looks better conceptually than using FUA, and would be easier
> from the request-queue side of things.
> (Or course, as the entire logic is pushed down to the SCSI layer :-)

You see, I'm torn between the dread of giving scsi such great
responsibility and the joy of sending a link for a bitkeeper patch
series from 2.4.x.

http://lwn.net/2002/0214/a/queue-barrier.php3

Have a lot of fun ;)

-chris


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 14:52                           ` Hannes Reinecke
  2010-08-05 15:17                             ` Chris Mason
@ 2010-08-05 17:07                             ` Christoph Hellwig
  1 sibling, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-05 17:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Chris Mason, Vladislav Bolkhovitin, Christoph Hellwig, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, swhiteho, konishi.ryusuke

On Thu, Aug 05, 2010 at 04:52:15PM +0200, Hannes Reinecke wrote:
> I still think that implementing ordered tags is the correct way of
> doing things, implementation details notwithstanding.
> 
> It looks better conceptually than using FUA, and would be easier
> from the request-queue side of things.

Sorry, but ordered tags are in no way a replacement for the FUA bit.
Admittedly the current barrier semantics are confusing because they
mix up two only minimally related things:

 a) cache flushing
 b) ordering

a) is what we really need from the filesystem's point of view.  b) is
something all our filesystems can do themselves.  We could use ordered
tags to offload it, and I'd be happy if someone could prove that
we're getting speedups from it, but it certainly does not replace a).

With enough outstanding tags, be that using ordered tags or software
managed ordering, we could keep the disk busy enough that we don't need
the write cache, but again that'll need a lot of benchmarking.
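
To make the a)/b) split concrete, here's a minimal sketch of a journal
commit that does its own ordering in software and only relies on the
device for cache flushing.  All the helpers (and the journal handle j)
are hypothetical stand-ins for illustration, not kernel APIs:

	/* ordering (b): the filesystem sequences its own writes by
	 * waiting, no barrier needed */
	submit_journal_blocks(j);	/* async writes, any order */
	wait_for_journal_blocks(j);	/* completed != on media */

	/* cache flushing (a): force the completed writes to media */
	issue_cache_flush(j->bdev);

	/* the commit record goes out only after everything before it
	 * is known stable; FUA makes the record itself stable */
	write_commit_record_fua(j);
	wait_for_commit_record(j);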

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 13:11                       ` Vladislav Bolkhovitin
  2010-08-05 13:32                         ` Chris Mason
@ 2010-08-05 17:09                         ` Christoph Hellwig
  2010-08-05 19:32                           ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-05 17:09 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso,
	swhiteho, konishi.ryusuke

On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
> Chris Mason, on 08/02/2010 09:39 PM wrote:
> >I regret putting the ordering into the original barrier code...it
> >definitely did help reiserfs back in the day but it stinks of magic and
> >voodoo.
> 
> But if the ordering isn't in the common (block) code, how to implement 
> the "hardware offload" for ordering, i.e. ORDERED commands, in an 
> acceptable way?

Right now we have no working implementation of actually using ordered
tags for a storage device in Linux.  There's very little need for common
code in that implementation - basically we just need a flag in the bio /
request to make this one an ordered tag, in addition to the existing
reordering prevention in the block queue.
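
A rough sketch of that plumbing.  REQ_ORDERED_TAG, __REQ_ORDERED_TAG
and rq_tag_attr() are invented names purely for illustration; the
MSG_*_TAG constants are the existing SCSI tag message values:

	#define REQ_ORDERED_TAG	(1 << __REQ_ORDERED_TAG)

	/* consumed only where the LLD picks the tag message for the
	 * command; the device then enforces ordering just for the
	 * commands that ask for it */
	static inline int rq_tag_attr(struct request *rq)
	{
		return (rq->cmd_flags & REQ_ORDERED_TAG) ?
			MSG_ORDERED_TAG : MSG_SIMPLE_TAG;
	}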


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: xfs rm performance
  2010-08-02 19:18                                                     ` Christoph Hellwig
@ 2010-08-05 19:31                                                       ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-05 19:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo,
	Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	chris.mason, swhiteho, konishi.ryusuke

Christoph Hellwig, on 08/02/2010 11:18 PM wrote:
> On Mon, Aug 02, 2010 at 11:03:00PM +0400, Vladislav Bolkhovitin wrote:
>> I traced what XFS is doing that time. The initiator is sending by a _single command at time_ the following pattern:
>
> That's exactly the queue draining we're talking about here.  To see
> how the pattern gets better use the nobarrier option.

Yes, with this option it's almost 2 times better and I see a slight queue 
depth (1-3 entries on average, max 8), but the performance is still bad:

# time rm _*

real	3m31.385s
user	0m0.004s
sys	0m26.674s

> Even with that XFS traditionally has a bad I/O pattern for metadata
> intensive workloads due to the amount of log I/O needed for it.
> Starting from Linux 2.6.35 the delayed logging code fixes this, and
> we hope to enable it by default after about 10 to 12 month of
> extensive testing.
>
> Try to re-run your test with
>
> 	-o delaylog,logbsize=262144
>
> to see a better log I/O pattern.  If your target doesn't present a volatile
> write cache also add the nobarrier option mentioned above.

Unfortunately, at the moment I can't run 2.6.35 on that machine, but I 
will try as soon as I can.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 17:09                         ` Christoph Hellwig
@ 2010-08-05 19:32                           ` Vladislav Bolkhovitin
  2010-08-05 19:40                             ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-05 19:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho,
	konishi.ryusuke

Christoph Hellwig, on 08/05/2010 09:09 PM wrote:
> On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
>> Chris Mason, on 08/02/2010 09:39 PM wrote:
>>> I regret putting the ordering into the original barrier code...it
>>> definitely did help reiserfs back in the day but it stinks of magic and
>>> voodoo.
>>
>> But if the ordering isn't in the common (block) code, how to implement
>> the "hardware offload" for ordering, i.e. ORDERED commands, in an
>> acceptable way?
>
> Right now we have no working implementation of actually using ordered
> tags for a storage device in Linux. There's very little need for common
> code in that implementation - basically we just need a flag in the bio /
> request to make this one an ordered tag, in addition to the existing
> reordering prevention in the block queue.

New flag... Easy to add, hard to live with. Aren't you already tired of 
the existing flags hell?

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 19:32                           ` Vladislav Bolkhovitin
@ 2010-08-05 19:40                             ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-05 19:40 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso,
	swhiteho, konishi.ryusuke

On Thu, Aug 05, 2010 at 11:32:04PM +0400, Vladislav Bolkhovitin wrote:
> New flag... Easy to add, hard to live with. Aren't you already tired of 
> the existing flags hell?

I'm tired of flags without a very well defined meaning.  For example
I'm really tired of the current REQ_HARDBARRIER because it means so
many different things.

A must-do-pre-flush or must-do-FUA flag is very different from a
must-not-reorder flag.  Overloading the meaning is what got us into this
mess.
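
For illustration, splitting the overloaded barrier into orthogonal,
single-meaning flags might look like the sketch below; the names and
bit values are made up here (a REQ_FUA flag with the middle meaning
already exists):

	#define REQ_PREFLUSH_X	 (1 << 0) /* flush volatile cache first */
	#define REQ_FUA_X	 (1 << 1) /* on media before completion */
	#define REQ_NO_REORDER_X (1 << 2) /* don't reorder around this */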

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 13:32                         ` Chris Mason
                                             ` (2 preceding siblings ...)
  2010-08-05 19:48                           ` Vladislav Bolkhovitin
@ 2010-08-05 19:48                           ` Vladislav Bolkhovitin
  2010-08-05 19:50                             ` Christoph Hellwig
  3 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-05 19:48 UTC (permalink / raw)
  To: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.B

Chris Mason, on 08/05/2010 05:32 PM wrote:
> On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
>> Chris Mason, on 08/02/2010 09:39 PM wrote:
>>> I regret putting the ordering into the original barrier code...it
>>> definitely did help reiserfs back in the day but it stinks of magic and
>>> voodoo.
>>
>> But if the ordering isn't in the common (block) code, how to
>> implement the "hardware offload" for ordering, i.e. ORDERED
>> commands, in an acceptable way?
>>
>> I believe, the decision was right, but the flags and magic requests
>> based interface (and, hence, implementation) was wrong. That's it
>> which stinks of magic and voodoo.
>
> The interface definitely has flaws.  We didn't expand it because James
> popped up with a long list of error handling problems.

Could you point me to the corresponding message, please? I can't find it 
in my archive.

> Basically how
> do the hardware and the kernel deal with a failed request at the start
> of the chain.  Somehow the easy way of failing them all turned out to be
> extremely difficult.

Have you considered not failing them all, but instead using the SCSI ACA 
facility to suspend the queue, then requeue the failed request, then 
restart processing? I might be missing something, but with this approach 
the recovery of failed requests should look quite simple and, most 
importantly, compact, hence easily audited. Something like below. Sorry, 
since it's low level recovery, it requires some deep SCSI knowledge to 
follow.

We need:

1. A low level driver without an internal queue and without masking of 
the returned status and sense. At first look, many of the existing 
drivers more or less satisfy this requirement, including the drivers in 
my direct interest: qla2xxx, iscsi and ib_srp.

2. A device with support for ORDERED commands as well as the ACA and 
UA_INTLCK facilities in QERR mode 0.

Assume we have N ORDERED requests queued to a device and one of them 
failed. Then submitting new requests to the device would be suspended 
and the recovery thread woken up.

Suppose we have the list of requests queued to the device, in the order 
they were queued. Then the recovery thread would need to deal with the 
following cases:

1. The failed command failed with CHECK_CONDITION and is at the head of 
the queue. (The device has now established ACA and suspended its 
internal queue.) The command should be resent to the device as an ACA 
task and, after it has finished, ACA should be cleared. (The device 
would then restart its queue.) Then submitting new requests to the 
device would also be resumed.

2. The failed command failed with CHECK_CONDITION and isn't at the head 
of the queue.

2.1. The failed command is the last in the queue. ACA should be cleared 
and the failed command should simply be restarted. Then submitting new 
requests to the device would also be resumed.

2.2. The failed command isn't the last in the queue. Then the recovery 
thread would send the ACA command TEST UNIT READY to make sure all 
in-flight commands have reached the device. Then it would abort all the 
commands after the failed one using the ABORT TASK task management 
function. Then ACA should be cleared and the failed command as well as 
all the aborted commands would be resent to the device. Then submitting 
new requests to the device would also be resumed.

3. The failed command failed with a status other than CHECK_CONDITION 
and is at the head of the queue.

3.1. The failed command is the only queued command. Then a TEST UNIT 
READY command should be sent to the device to get the post-UA_INTLCK 
CHECK CONDITION and trigger ACA. Then ACA should be cleared and the 
failed command restarted. Then submitting new requests to the device 
would also be resumed.

3.2. There are other queued commands. Then the recovery thread should 
remember the failed command and exit. The next command would get the 
post-UA_INTLCK CHECK CONDITION and trigger ACA. Then recovery would 
proceed as in (1), except that the 2 failed commands would be restarted 
as ACA commands before clearing ACA.

4. The failed command isn't at the head of the queue and failed with a 
status other than CHECK_CONDITION. This can happen in case of a QUEUE 
FULL (TASK SET FULL) condition. This case would proceed as cases (3.x), 
then (2.2).

That's all. Simple, compact and clear for auditing.
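
To give the shape of it in code, here is a compact sketch of such a
recovery thread, collapsing the cases above into one flow.  Every name
here is a hypothetical stand-in for illustration, not an existing
kernel API:

	static void ordered_queue_recover(struct sdev *dev,
					  struct req *failed)
	{
		suspend_new_requests(dev);

		if (failed->status != CHECK_CONDITION)
			/* cases 3.x/4: provoke the post-UA_INTLCK
			 * CHECK CONDITION so the device establishes
			 * ACA */
			send_tur_and_wait_for_aca(dev);

		if (!at_queue_head(dev, failed)) {
			/* cases 2.2/4: make sure all in-flight
			 * commands arrived, then abort everything
			 * queued after the failed command */
			send_aca_tur(dev);
			abort_tasks_after(dev, failed);
		}

		resend_as_aca_task(dev, failed);
		clear_aca(dev);		   /* device restarts its queue */
		resend_aborted_tasks(dev); /* in their original order */
		resume_new_requests(dev);
	}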

> Even if that part had been refined, I think trusting the ordering down
> to the lower layers was a doomed idea.  The list of ways it could go
> wrong is much much longer (and harder to debug) than the list of
> benefits.

It's hard to debug because it's currently an overloaded-flags nightmare. 
It isn't the idea of trusting the lower levels that is doomed; everybody 
trusts lower levels everywhere in the kernel. What is doomed is the idea 
of providing the requested functionality via a set of flags and 
artificial barrier requests with obscured side effects. Linux just needs 
a clear and _natural_ interface for that, like the one I proposed in 
http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Yes, I am 
proposing to slowly start thinking about moving to a new interface and 
implementation out of the current hell. It's obvious that what Linux 
has now in this area is a dead end. The new flag Christoph is going to 
add makes it even worse.

> With all of that said, I did go ahead and benchmark real ordered tags
> extensively on a scsi drive in the initial implementation.  There was
> very little performance difference.

It isn't a surprise that you didn't see much difference with a local 
(Wide?) SCSI drive. Such drives sit on a low latency link, are simple 
enough to have small internal latencies and dumb enough to not gain much 
from internal reordering. But how about external arrays? Or even 
clusters? Nowadays everybody can build such arrays and clusters from any 
Linux (or other *nix) box using any OSS SCSI target implementation, 
starting with SCST, which I have been developing. Such array/cluster 
devices use links with an order of magnitude higher latency, they are 
very sophisticated inside, so they have much bigger internal latencies 
as well as much bigger opportunities to optimize the I/O pattern by 
internal reordering. All the record numbers I've seen so far were 
reached with deep queues. For instance, the last SCST record (>500K 4K 
IOPS from a single target) was achieved with queue depth 128!

So, I believe, Linux must use that possibility to get full storage 
performance and to finally simplify its storage stack.

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 19:48                           ` Vladislav Bolkhovitin
@ 2010-08-05 19:50                             ` Christoph Hellwig
  2010-08-05 20:05                               ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-05 19:50 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso,
	swhiteho, konishi.ryusuke

On Thu, Aug 05, 2010 at 11:48:19PM +0400, Vladislav Bolkhovitin wrote:
> So, I believe, Linux must use that possibility to get full storage 
> performance and to finally simplify its storage stack.

So instead of talking what about doing a prototype and show us what
improvement it gives?

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 19:50                             ` Christoph Hellwig
@ 2010-08-05 20:05                               ` Vladislav Bolkhovitin
  2010-08-06 14:56                                 ` Hannes Reinecke
  0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-05 20:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho,
	konishi.ryusuke

Christoph Hellwig, on 08/05/2010 11:50 PM wrote:
> On Thu, Aug 05, 2010 at 11:48:19PM +0400, Vladislav Bolkhovitin wrote:
>> So, I believe, Linux must use that possibility to get full storage
>> performance and to finally simplify its storage stack.
>
> So instead of talking what about doing a prototype and show us what
> improvement it gives?

Sure, I'd love to. But, unfortunately, I can't clone myself, so I'm 
helping as best I can with what I have: my storage and SCSI expertise. 
This area is quite special, so I'm trying to explain some 
misunderstandings I see and to illustrate my points with some possible 
work flows and interfaces.

But I can shut up if you'd like.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-05 20:05                               ` Vladislav Bolkhovitin
@ 2010-08-06 14:56                                 ` Hannes Reinecke
  2010-08-06 18:38                                   ` Vladislav Bolkhovitin
  2010-08-06 23:34                                   ` Christoph Hellwig
  0 siblings, 2 replies; 155+ messages in thread
From: Hannes Reinecke @ 2010-08-06 14:56 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Christoph Hellwig, Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara,
	jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso,
	swhiteho, konishi.ryusuke

Vladislav Bolkhovitin wrote:
> Christoph Hellwig, on 08/05/2010 11:50 PM wrote:
>> On Thu, Aug 05, 2010 at 11:48:19PM +0400, Vladislav Bolkhovitin wrote:
>>> So, I believe, Linux must use that possibility to get full storage
>>> performance and to finally simplify its storage stack.
>>
>> So instead of talking what about doing a prototype and show us what
>> improvement it gives?
> 
> Sure, I'd love to. But, unfortunately, I can't clone myself, so I'm
> helping as best I can with what I have: my storage and SCSI expertise.
> This area is quite special, so I'm trying to explain some
> misunderstandings I see and to illustrate my points with some possible
> work flows and interfaces.
> 
I can't, either.

But I can do bonnie runs in no time.
I have done some preliminary benchmarks by just enabling ordered
queueing in sd.c and no other changes.
Bonnie says:

Writing intelligently: 115208 vs.  82739 
Reading intelligently: 134133 vs. 129395

putc() performance suffers, though:
I get 52M vs 90M writing and 50M vs. 65M reading.
No idea why; shouldn't be that harmful here.

But in any case there is some speed improvement
to be had from using ordered tags.

Oh, and that was against an EVA 6400.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC] relaxed barriers
  2010-08-03 18:49   ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig
  2010-08-03 18:51     ` [PATCH, RFC 2/2] dm: support REQ_FLUSH directly Christoph Hellwig
@ 2010-08-06 16:04     ` Tejun Heo
  2010-08-06 23:34       ` Christoph Hellwig
  2010-08-07 10:13       ` [PATCH REPOST " Tejun Heo
  1 sibling, 2 replies; 155+ messages in thread
From: Tejun Heo @ 2010-08-06 16:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel,
	linux-raid

Hello,

So, here's my shot at it.  After this patch, a barrier no longer
dictates the ordering of other requests.  The block layer sequences
the barrier request without interfering with other requests (not even
elevator draining).  Multiple pending barriers are handled by saving
them in a separate queue and servicing them one by one.  Basically,
barrier sequences form a separate FIFO command stream independent of
other requests, and all the ordering between the two streams is the
filesystem's responsibility.

Ordered tag support is dropped as no one seems to be making any
meaningful use of it.  I'm fairly skeptical about its usefulness
anyway.  The only thing an ordered tag saves is the latency between
command completions and issues in barrier sequences, which isn't much
to begin with, and it puts additional ordering restrictions on the
device compared to ordering in software (ordered tag commands will
unnecessarily affect the processing of simple tag commands).

Lightly tested for all three BAR (!WC), FLUSH and FUA cases.  The
multiple pending barrier code path isn't tested yet.

Christoph, does this look like something the filesystems can use or
have I misunderstood something?

Thanks.

NOT_SIGNED_OFF_YET
---
 block/blk-barrier.c          |  253 +++++++++++++++----------------------------
 block/blk-core.c             |   31 ++---
 block/blk.h                  |    5
 block/elevator.c             |   80 +------------
 drivers/block/brd.c          |    2
 drivers/block/loop.c         |    2
 drivers/block/osdblk.c       |    2
 drivers/block/pktcdvd.c      |    1
 drivers/block/ps3disk.c      |    3
 drivers/block/virtio_blk.c   |    4
 drivers/block/xen-blkfront.c |    2
 drivers/ide/ide-disk.c       |    4
 drivers/md/dm.c              |    3
 drivers/mmc/card/queue.c     |    2
 drivers/s390/block/dasd.c    |    2
 drivers/scsi/sd.c            |    8 -
 include/linux/blkdev.h       |   59 +++-------
 include/linux/elevator.h     |    6 -
 18 files changed, 154 insertions(+), 315 deletions(-)

Index: work/block/blk-barrier.c
===================================================================
--- work.orig/block/blk-barrier.c
+++ work/block/blk-barrier.c
@@ -9,6 +9,8 @@

 #include "blk.h"

+static struct request *queue_next_ordseq(struct request_queue *q);
+
 /**
  * blk_queue_ordered - does this queue support ordered writes
  * @q:        the request queue
@@ -31,13 +33,8 @@ int blk_queue_ordered(struct request_que
 		return -EINVAL;
 	}

-	if (ordered != QUEUE_ORDERED_NONE &&
-	    ordered != QUEUE_ORDERED_DRAIN &&
-	    ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
-	    ordered != QUEUE_ORDERED_DRAIN_FUA &&
-	    ordered != QUEUE_ORDERED_TAG &&
-	    ordered != QUEUE_ORDERED_TAG_FLUSH &&
-	    ordered != QUEUE_ORDERED_TAG_FUA) {
+	if (ordered != QUEUE_ORDERED_NONE && ordered != QUEUE_ORDERED_BAR &&
+	    ordered != QUEUE_ORDERED_FLUSH && ordered != QUEUE_ORDERED_FUA) {
 		printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
 		return -EINVAL;
 	}
@@ -60,38 +57,10 @@ unsigned blk_ordered_cur_seq(struct requ
 	return 1 << ffz(q->ordseq);
 }

-unsigned blk_ordered_req_seq(struct request *rq)
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+						unsigned seq, int error)
 {
-	struct request_queue *q = rq->q;
-
-	BUG_ON(q->ordseq == 0);
-
-	if (rq == &q->pre_flush_rq)
-		return QUEUE_ORDSEQ_PREFLUSH;
-	if (rq == &q->bar_rq)
-		return QUEUE_ORDSEQ_BAR;
-	if (rq == &q->post_flush_rq)
-		return QUEUE_ORDSEQ_POSTFLUSH;
-
-	/*
-	 * !fs requests don't need to follow barrier ordering.  Always
-	 * put them at the front.  This fixes the following deadlock.
-	 *
-	 * http://thread.gmane.org/gmane.linux.kernel/537473
-	 */
-	if (!blk_fs_request(rq))
-		return QUEUE_ORDSEQ_DRAIN;
-
-	if ((rq->cmd_flags & REQ_ORDERED_COLOR) ==
-	    (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR))
-		return QUEUE_ORDSEQ_DRAIN;
-	else
-		return QUEUE_ORDSEQ_DONE;
-}
-
-bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
-{
-	struct request *rq;
+	struct request *rq = NULL;

 	if (error && !q->orderr)
 		q->orderr = error;
@@ -99,16 +68,22 @@ bool blk_ordered_complete_seq(struct req
 	BUG_ON(q->ordseq & seq);
 	q->ordseq |= seq;

-	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE)
-		return false;
-
-	/*
-	 * Okay, sequence complete.
-	 */
-	q->ordseq = 0;
-	rq = q->orig_bar_rq;
-	__blk_end_request_all(rq, q->orderr);
-	return true;
+	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+		/* not complete yet, queue the next ordered sequence */
+		rq = queue_next_ordseq(q);
+	} else {
+		/* complete this barrier request */
+		__blk_end_request_all(q->orig_bar_rq, q->orderr);
+		q->orig_bar_rq = NULL;
+		q->ordseq = 0;
+
+		/* dispatch the next barrier if there's one */
+		if (!list_empty(&q->pending_barriers)) {
+			rq = list_entry_rq(q->pending_barriers.next);
+			list_move(&rq->queuelist, &q->queue_head);
+		}
+	}
+	return rq;
 }

 static void pre_flush_end_io(struct request *rq, int error)
@@ -129,21 +104,10 @@ static void post_flush_end_io(struct req
 	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
 }

-static void queue_flush(struct request_queue *q, unsigned which)
+static void queue_flush(struct request_queue *q, struct request *rq,
+			rq_end_io_fn *end_io)
 {
-	struct request *rq;
-	rq_end_io_fn *end_io;
-
-	if (which == QUEUE_ORDERED_DO_PREFLUSH) {
-		rq = &q->pre_flush_rq;
-		end_io = pre_flush_end_io;
-	} else {
-		rq = &q->post_flush_rq;
-		end_io = post_flush_end_io;
-	}
-
 	blk_rq_init(q, rq);
-	rq->cmd_flags = REQ_HARDBARRIER;
 	rq->rq_disk = q->bar_rq.rq_disk;
 	rq->end_io = end_io;
 	q->prepare_flush_fn(q, rq);
@@ -151,130 +115,93 @@ static void queue_flush(struct request_q
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }

-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static struct request *queue_next_ordseq(struct request_queue *q)
 {
-	struct request *rq = *rqp;
-	unsigned skip = 0;
-
-	q->orderr = 0;
-	q->ordered = q->next_ordered;
-	q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
-	/*
-	 * For an empty barrier, there's no actual BAR request, which
-	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
-	 */
-	if (!blk_rq_sectors(rq)) {
-		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
-				QUEUE_ORDERED_DO_POSTFLUSH);
-		/*
-		 * Empty barrier on a write-through device w/ ordered
-		 * tag has no command to issue and without any command
-		 * to issue, ordering by tag can't be used.  Drain
-		 * instead.
-		 */
-		if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
-		    !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
-			q->ordered &= ~QUEUE_ORDERED_BY_TAG;
-	}
-
-	/* stash away the original request */
-	blk_dequeue_request(rq);
-	q->orig_bar_rq = rq;
-	rq = NULL;
-
-	/*
-	 * Queue ordered sequence.  As we stack them at the head, we
-	 * need to queue in reverse order.  Note that we rely on that
-	 * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
-	 * request gets inbetween ordered sequence.
-	 */
-	if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
-		rq = &q->post_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_POSTFLUSH;
+	struct request *rq = &q->bar_rq;

-	if (q->ordered & QUEUE_ORDERED_DO_BAR) {
-		rq = &q->bar_rq;
+	switch (blk_ordered_cur_seq(q)) {
+	case QUEUE_ORDSEQ_PREFLUSH:
+		queue_flush(q, rq, pre_flush_end_io);
+		break;

+	case QUEUE_ORDSEQ_BAR:
 		/* initialize proxy request and queue it */
 		blk_rq_init(q, rq);
-		if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
-			rq->cmd_flags |= REQ_RW;
+		init_request_from_bio(rq, q->orig_bar_rq->bio);
+		rq->cmd_flags &= ~REQ_HARDBARRIER;
 		if (q->ordered & QUEUE_ORDERED_DO_FUA)
 			rq->cmd_flags |= REQ_FUA;
-		init_request_from_bio(rq, q->orig_bar_rq->bio);
 		rq->end_io = bar_end_io;

 		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	} else
-		skip |= QUEUE_ORDSEQ_BAR;
+		break;

-	if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
-		rq = &q->pre_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_PREFLUSH;
+	case QUEUE_ORDSEQ_POSTFLUSH:
+		queue_flush(q, rq, post_flush_end_io);
+		break;

-	if (!(q->ordered & QUEUE_ORDERED_BY_TAG) && queue_in_flight(q))
-		rq = NULL;
-	else
-		skip |= QUEUE_ORDSEQ_DRAIN;
-
-	*rqp = rq;
-
-	/*
-	 * Complete skipped sequences.  If whole sequence is complete,
-	 * return false to tell elevator that this request is gone.
-	 */
-	return !blk_ordered_complete_seq(q, skip, 0);
+	default:
+		BUG();
+	}
+	return rq;
 }

-bool blk_do_ordered(struct request_queue *q, struct request **rqp)
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
 {
-	struct request *rq = *rqp;
-	const int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq);
+	unsigned skip = 0;

-	if (!q->ordseq) {
-		if (!is_barrier)
-			return true;
-
-		if (q->next_ordered != QUEUE_ORDERED_NONE)
-			return start_ordered(q, rqp);
-		else {
-			/*
-			 * Queue ordering not supported.  Terminate
-			 * with prejudice.
-			 */
-			blk_dequeue_request(rq);
-			__blk_end_request_all(rq, -EOPNOTSUPP);
-			*rqp = NULL;
-			return false;
-		}
+	if (!blk_barrier_rq(rq))
+		return rq;
+
+	if (q->ordseq) {
+		/*
+		 * Barrier is already in progress and they can't be
+		 * processed in parallel.  Queue for later processing.
+		 */
+		list_move_tail(&rq->queuelist, &q->pending_barriers);
+		return NULL;
+	}
+
+	if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+		/*
+		 * Queue ordering not supported.  Terminate
+		 * with prejudice.
+		 */
+		blk_dequeue_request(rq);
+		__blk_end_request_all(rq, -EOPNOTSUPP);
+		return NULL;
 	}

 	/*
-	 * Ordered sequence in progress
+	 * Start a new ordered sequence
 	 */
+	q->orderr = 0;
+	q->ordered = q->next_ordered;
+	q->ordseq |= QUEUE_ORDSEQ_STARTED;

-	/* Special requests are not subject to ordering rules. */
-	if (!blk_fs_request(rq) &&
-	    rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
-		return true;
-
-	if (q->ordered & QUEUE_ORDERED_BY_TAG) {
-		/* Ordered by tag.  Blocking the next barrier is enough. */
-		if (is_barrier && rq != &q->bar_rq)
-			*rqp = NULL;
-	} else {
-		/* Ordered by draining.  Wait for turn. */
-		WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
-		if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
-			*rqp = NULL;
-	}
+	/*
+	 * For an empty barrier, there's no actual BAR request, which
+	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
+	 */
+	if (!blk_rq_sectors(rq))
+		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+				QUEUE_ORDERED_DO_POSTFLUSH);
+
+	/* stash away the original request */
+	blk_dequeue_request(rq);
+	q->orig_bar_rq = rq;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+		skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+		skip |= QUEUE_ORDSEQ_BAR;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+		skip |= QUEUE_ORDSEQ_POSTFLUSH;

-	return true;
+	/* complete skipped sequences and return the first sequence */
+	return blk_ordered_complete_seq(q, skip, 0);
 }

 static void bio_end_empty_barrier(struct bio *bio, int err)
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -106,7 +106,6 @@ enum rq_flag_bits {
 	__REQ_FAILED,		/* set if the request failed */
 	__REQ_QUIET,		/* don't worry about errors */
 	__REQ_PREEMPT,		/* set for "ide_preempt" requests */
-	__REQ_ORDERED_COLOR,	/* is before or after barrier */
 	__REQ_RW_SYNC,		/* request is sync (sync write or read) */
 	__REQ_ALLOCED,		/* request came from our alloc pool */
 	__REQ_RW_META,		/* metadata io request */
@@ -135,7 +134,6 @@ enum rq_flag_bits {
 #define REQ_FAILED	(1 << __REQ_FAILED)
 #define REQ_QUIET	(1 << __REQ_QUIET)
 #define REQ_PREEMPT	(1 << __REQ_PREEMPT)
-#define REQ_ORDERED_COLOR	(1 << __REQ_ORDERED_COLOR)
 #define REQ_RW_SYNC	(1 << __REQ_RW_SYNC)
 #define REQ_ALLOCED	(1 << __REQ_ALLOCED)
 #define REQ_RW_META	(1 << __REQ_RW_META)
@@ -437,9 +435,10 @@ struct request_queue
 	 * reserved for flush operations
 	 */
 	unsigned int		ordered, next_ordered, ordseq;
-	int			orderr, ordcolor;
-	struct request		pre_flush_rq, bar_rq, post_flush_rq;
-	struct request		*orig_bar_rq;
+	int			orderr;
+	struct request		bar_rq;
+	struct request          *orig_bar_rq;
+	struct list_head	pending_barriers;

 	struct mutex		sysfs_lock;

@@ -543,47 +542,33 @@ enum {
 	 * Hardbarrier is supported with one of the following methods.
 	 *
 	 * NONE		: hardbarrier unsupported
-	 * DRAIN	: ordering by draining is enough
-	 * DRAIN_FLUSH	: ordering by draining w/ pre and post flushes
-	 * DRAIN_FUA	: ordering by draining w/ pre flush and FUA write
-	 * TAG		: ordering by tag is enough
-	 * TAG_FLUSH	: ordering by tag w/ pre and post flushes
-	 * TAG_FUA	: ordering by tag w/ pre flush and FUA write
-	 */
-	QUEUE_ORDERED_BY_TAG		= 0x02,
-	QUEUE_ORDERED_DO_PREFLUSH	= 0x10,
-	QUEUE_ORDERED_DO_BAR		= 0x20,
-	QUEUE_ORDERED_DO_POSTFLUSH	= 0x40,
-	QUEUE_ORDERED_DO_FUA		= 0x80,
+	 * BAR		: writing out barrier is enough
+	 * FLUSH	: barrier and surrounding pre and post flushes
+	 * FUA		: FUA barrier w/ pre flush
+	 */
+	QUEUE_ORDERED_DO_PREFLUSH	= 1 << 0,
+	QUEUE_ORDERED_DO_BAR		= 1 << 1,
+	QUEUE_ORDERED_DO_POSTFLUSH	= 1 << 2,
+	QUEUE_ORDERED_DO_FUA		= 1 << 3,

-	QUEUE_ORDERED_NONE		= 0x00,
+	QUEUE_ORDERED_NONE		= 0,

-	QUEUE_ORDERED_DRAIN		= QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_DRAIN_FLUSH	= QUEUE_ORDERED_DRAIN |
+	QUEUE_ORDERED_BAR		= QUEUE_ORDERED_DO_BAR,
+	QUEUE_ORDERED_FLUSH		= QUEUE_ORDERED_DO_BAR |
 					  QUEUE_ORDERED_DO_PREFLUSH |
 					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_DRAIN_FUA		= QUEUE_ORDERED_DRAIN |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_FUA,
-
-	QUEUE_ORDERED_TAG		= QUEUE_ORDERED_BY_TAG |
-					  QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_TAG_FLUSH		= QUEUE_ORDERED_TAG |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_TAG_FUA		= QUEUE_ORDERED_TAG |
+	QUEUE_ORDERED_FUA		= QUEUE_ORDERED_DO_BAR |
 					  QUEUE_ORDERED_DO_PREFLUSH |
 					  QUEUE_ORDERED_DO_FUA,

 	/*
 	 * Ordered operation sequence
 	 */
-	QUEUE_ORDSEQ_STARTED	= 0x01,	/* flushing in progress */
-	QUEUE_ORDSEQ_DRAIN	= 0x02,	/* waiting for the queue to be drained */
-	QUEUE_ORDSEQ_PREFLUSH	= 0x04,	/* pre-flushing in progress */
-	QUEUE_ORDSEQ_BAR	= 0x08,	/* original barrier req in progress */
-	QUEUE_ORDSEQ_POSTFLUSH	= 0x10,	/* post-flushing in progress */
-	QUEUE_ORDSEQ_DONE	= 0x20,
+	QUEUE_ORDSEQ_STARTED	= (1 << 0), /* flushing in progress */
+	QUEUE_ORDSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
+	QUEUE_ORDSEQ_BAR	= (1 << 2), /* barrier write in progress */
+	QUEUE_ORDSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
+	QUEUE_ORDSEQ_DONE	= (1 << 4),
 };

 #define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -965,10 +950,8 @@ extern void blk_queue_rq_timed_out(struc
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 extern int blk_queue_ordered(struct request_queue *, unsigned, prepare_flush_fn *);
-extern bool blk_do_ordered(struct request_queue *, struct request **);
 extern unsigned blk_ordered_cur_seq(struct request_queue *);
 extern unsigned blk_ordered_req_seq(struct request *);
-extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);

 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
 extern void blk_dump_rq_flags(struct request *, char *);
Index: work/drivers/block/brd.c
===================================================================
--- work.orig/drivers/block/brd.c
+++ work/drivers/block/brd.c
@@ -479,7 +479,7 @@ static struct brd_device *brd_alloc(int
 	if (!brd->brd_queue)
 		goto out_free_dev;
 	blk_queue_make_request(brd->brd_queue, brd_make_request);
-	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG, NULL);
+	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_BAR, NULL);
 	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
 	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

Index: work/drivers/block/virtio_blk.c
===================================================================
--- work.orig/drivers/block/virtio_blk.c
+++ work/drivers/block/virtio_blk.c
@@ -368,10 +368,10 @@ static int __devinit virtblk_probe(struc

 	/* If barriers are supported, tell block layer that queue is ordered */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
-		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
+		blk_queue_ordered(q, QUEUE_ORDERED_FLUSH,
 				  virtblk_prepare_flush);
 	else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
-		blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL);
+		blk_queue_ordered(q, QUEUE_ORDERED_BAR, NULL);

 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
Index: work/drivers/scsi/sd.c
===================================================================
--- work.orig/drivers/scsi/sd.c
+++ work/drivers/scsi/sd.c
@@ -2103,15 +2103,13 @@ static int sd_revalidate_disk(struct gen

 	/*
 	 * We now have all cache related info, determine how we deal
-	 * with ordered requests.  Note that as the current SCSI
-	 * dispatch function can alter request order, we cannot use
-	 * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+	 * with ordered requests.
 	 */
 	if (sdkp->WCE)
 		ordered = sdkp->DPOFUA
-			? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
+			? QUEUE_ORDERED_FUA : QUEUE_ORDERED_FLUSH;
 	else
-		ordered = QUEUE_ORDERED_DRAIN;
+		ordered = QUEUE_ORDERED_BAR;

 	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush);

Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_no
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
+	INIT_LIST_HEAD(&q->pending_barriers);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);

 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1036,22 +1037,6 @@ void blk_insert_request(struct request_q
 }
 EXPORT_SYMBOL(blk_insert_request);

-/*
- * add-request adds a request to the linked list.
- * queue lock is held and interrupts disabled, as we muck with the
- * request queue list.
- */
-static inline void add_request(struct request_queue *q, struct request *req)
-{
-	drive_stat_acct(req, 1);
-
-	/*
-	 * elevator indicated where it wants this request to be
-	 * inserted at elevator_merge time
-	 */
-	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
-}
-
 static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
@@ -1184,6 +1169,7 @@ static int __make_request(struct request
 	const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO);
 	const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+	int where = ELEVATOR_INSERT_SORT;
 	int rw_flags;

 	if (bio_rw_flagged(bio, BIO_RW_BARRIER) &&
@@ -1191,6 +1177,7 @@ static int __make_request(struct request
 		bio_endio(bio, -EOPNOTSUPP);
 		return 0;
 	}
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1200,7 +1187,12 @@ static int __make_request(struct request

 	spin_lock_irq(q->queue_lock);

-	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q))
+	if (bio_rw_flagged(bio, BIO_RW_BARRIER)) {
+		where = ELEVATOR_INSERT_ORDERED;
+		goto get_rq;
+	}
+
+	if (elv_queue_empty(q))
 		goto get_rq;

 	el_ret = elv_merge(q, &req, bio);
@@ -1297,7 +1289,10 @@ get_rq:
 		req->cpu = blk_cpu_to_group(smp_processor_id());
 	if (queue_should_plug(q) && elv_queue_empty(q))
 		blk_plug_device(q);
-	add_request(q, req);
+
+	/* insert the request into the elevator */
+	drive_stat_acct(req, 1);
+	__elv_add_request(q, req, where, 0);
 out:
 	if (unplug || !queue_should_plug(q))
 		__generic_unplug_device(q);
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c
+++ work/block/elevator.c
@@ -564,7 +564,7 @@ void elv_requeue_request(struct request_

 	rq->cmd_flags &= ~REQ_STARTED;

-	elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE);
+	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }

 void elv_drain_elevator(struct request_queue *q)
@@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queu

 void elv_insert(struct request_queue *q, struct request *rq, int where)
 {
-	struct list_head *pos;
-	unsigned ordseq;
 	int unplug_it = 1;

 	trace_block_rq_insert(q, rq);
@@ -622,10 +620,14 @@ void elv_insert(struct request_queue *q,
 	switch (where) {
 	case ELEVATOR_INSERT_FRONT:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-
 		list_add(&rq->queuelist, &q->queue_head);
 		break;

+	case ELEVATOR_INSERT_ORDERED:
+		rq->cmd_flags |= REQ_SOFTBARRIER;
+		list_add_tail(&rq->queuelist, &q->queue_head);
+		break;
+
 	case ELEVATOR_INSERT_BACK:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
 		elv_drain_elevator(q);
@@ -661,36 +663,6 @@ void elv_insert(struct request_queue *q,
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;

-	case ELEVATOR_INSERT_REQUEUE:
-		/*
-		 * If ordered flush isn't in progress, we do front
-		 * insertion; otherwise, requests should be requeued
-		 * in ordseq order.
-		 */
-		rq->cmd_flags |= REQ_SOFTBARRIER;
-
-		/*
-		 * Most requeues happen because of a busy condition,
-		 * don't force unplug of the queue for that case.
-		 */
-		unplug_it = 0;
-
-		if (q->ordseq == 0) {
-			list_add(&rq->queuelist, &q->queue_head);
-			break;
-		}
-
-		ordseq = blk_ordered_req_seq(rq);
-
-		list_for_each(pos, &q->queue_head) {
-			struct request *pos_rq = list_entry_rq(pos);
-			if (ordseq <= blk_ordered_req_seq(pos_rq))
-				break;
-		}
-
-		list_add_tail(&rq->queuelist, pos);
-		break;
-
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -709,32 +681,14 @@ void elv_insert(struct request_queue *q,
 void __elv_add_request(struct request_queue *q, struct request *rq, int where,
 		       int plug)
 {
-	if (q->ordcolor)
-		rq->cmd_flags |= REQ_ORDERED_COLOR;
-
 	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
-		/*
-		 * toggle ordered color
-		 */
-		if (blk_barrier_rq(rq))
-			q->ordcolor ^= 1;
-
-		/*
-		 * barriers implicitly indicate back insertion
-		 */
-		if (where == ELEVATOR_INSERT_SORT)
-			where = ELEVATOR_INSERT_BACK;
-
-		/*
-		 * this request is scheduling boundary, update
-		 * end_sector
-		 */
+		/* barriers are scheduling boundary, update end_sector */
 		if (blk_fs_request(rq) || blk_discard_rq(rq)) {
 			q->end_sector = rq_end_sector(rq);
 			q->boundary_rq = rq;
 		}
 	} else if (!(rq->cmd_flags & REQ_ELVPRIV) &&
-		    where == ELEVATOR_INSERT_SORT)
+		   where == ELEVATOR_INSERT_SORT)
 		where = ELEVATOR_INSERT_BACK;

 	if (plug)
@@ -846,24 +800,6 @@ void elv_completed_request(struct reques
 		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
 			e->ops->elevator_completed_req_fn(q, rq);
 	}
-
-	/*
-	 * Check if the queue is waiting for fs requests to be
-	 * drained for flush sequence.
-	 */
-	if (unlikely(q->ordseq)) {
-		struct request *next = NULL;
-
-		if (!list_empty(&q->queue_head))
-			next = list_entry_rq(q->queue_head.next);
-
-		if (!queue_in_flight(q) &&
-		    blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN &&
-		    (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) {
-			blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0);
-			__blk_run_queue(q);
-		}
-	}
 }

 #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
Index: work/block/blk.h
===================================================================
--- work.orig/block/blk.h
+++ work/block/blk.h
@@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))

+struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
@@ -58,7 +60,8 @@ static inline struct request *__elv_next
 	while (1) {
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (blk_do_ordered(q, &rq))
+			rq = blk_do_ordered(q, rq);
+			if (rq)
 				return rq;
 		}

Index: work/drivers/block/loop.c
===================================================================
--- work.orig/drivers/block/loop.c
+++ work/drivers/block/loop.c
@@ -831,7 +831,7 @@ static int loop_set_fd(struct loop_devic
 	lo->lo_queue->unplug_fn = loop_unplug;

 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN, NULL);
+		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_BAR, NULL);

 	set_capacity(lo->lo_disk, size);
 	bd_set_size(bdev, size << 9);
Index: work/drivers/block/osdblk.c
===================================================================
--- work.orig/drivers/block/osdblk.c
+++ work/drivers/block/osdblk.c
@@ -446,7 +446,7 @@ static int osdblk_init_disk(struct osdbl
 	blk_queue_stack_limits(q, osd_request_queue(osdev->osd));

 	blk_queue_prep_rq(q, blk_queue_start_tag);
-	blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush);
+	blk_queue_ordered(q, QUEUE_ORDERED_FLUSH, osdblk_prepare_flush);

 	disk->queue = q;

Index: work/drivers/block/ps3disk.c
===================================================================
--- work.orig/drivers/block/ps3disk.c
+++ work/drivers/block/ps3disk.c
@@ -480,8 +480,7 @@ static int __devinit ps3disk_probe(struc
 	blk_queue_dma_alignment(queue, dev->blk_size-1);
 	blk_queue_logical_block_size(queue, dev->blk_size);

-	blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH,
-			  ps3disk_prepare_flush);
+	blk_queue_ordered(queue, QUEUE_ORDERED_FLUSH, ps3disk_prepare_flush);

 	blk_queue_max_segments(queue, -1);
 	blk_queue_max_segment_size(queue, dev->bounce_size);
Index: work/drivers/block/xen-blkfront.c
===================================================================
--- work.orig/drivers/block/xen-blkfront.c
+++ work/drivers/block/xen-blkfront.c
@@ -373,7 +373,7 @@ static int xlvbd_barrier(struct blkfront
 	int err;

 	err = blk_queue_ordered(info->rq,
-				info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE,
+				info->feature_barrier ? QUEUE_ORDERED_BAR : QUEUE_ORDERED_NONE,
 				NULL);

 	if (err)
Index: work/drivers/ide/ide-disk.c
===================================================================
--- work.orig/drivers/ide/ide-disk.c
+++ work/drivers/ide/ide-disk.c
@@ -537,11 +537,11 @@ static void update_ordered(ide_drive_t *
 		       drive->name, barrier ? "" : "not ");

 		if (barrier) {
-			ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+			ordered = QUEUE_ORDERED_FLUSH;
 			prep_fn = idedisk_prepare_flush;
 		}
 	} else
-		ordered = QUEUE_ORDERED_DRAIN;
+		ordered = QUEUE_ORDERED_BAR;

 	blk_queue_ordered(drive->queue, ordered, prep_fn);
 }
Index: work/drivers/md/dm.c
===================================================================
--- work.orig/drivers/md/dm.c
+++ work/drivers/md/dm.c
@@ -1912,8 +1912,7 @@ static struct mapped_device *alloc_dev(i
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
-			  dm_rq_prepare_flush);
+	blk_queue_ordered(md->queue, QUEUE_ORDERED_FLUSH, dm_rq_prepare_flush);

 	md->disk = alloc_disk(1);
 	if (!md->disk)
Index: work/drivers/mmc/card/queue.c
===================================================================
--- work.orig/drivers/mmc/card/queue.c
+++ work/drivers/mmc/card/queue.c
@@ -128,7 +128,7 @@ int mmc_init_queue(struct mmc_queue *mq,
 	mq->req = NULL;

 	blk_queue_prep_rq(mq->queue, mmc_prep_request);
-	blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN, NULL);
+	blk_queue_ordered(mq->queue, QUEUE_ORDERED_BAR, NULL);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);

 #ifdef CONFIG_MMC_BLOCK_BOUNCE
Index: work/drivers/s390/block/dasd.c
===================================================================
--- work.orig/drivers/s390/block/dasd.c
+++ work/drivers/s390/block/dasd.c
@@ -2196,7 +2196,7 @@ static void dasd_setup_queue(struct dasd
 	 */
 	blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
 	blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
-	blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN, NULL);
+	blk_queue_ordered(block->request_queue, QUEUE_ORDERED_BAR, NULL);
 }

 /*
Index: work/include/linux/elevator.h
===================================================================
--- work.orig/include/linux/elevator.h
+++ work/include/linux/elevator.h
@@ -162,9 +162,9 @@ extern struct request *elv_rb_find(struc
  * Insertion selection
  */
 #define ELEVATOR_INSERT_FRONT	1
-#define ELEVATOR_INSERT_BACK	2
-#define ELEVATOR_INSERT_SORT	3
-#define ELEVATOR_INSERT_REQUEUE	4
+#define ELEVATOR_INSERT_ORDERED	2
+#define ELEVATOR_INSERT_BACK	3
+#define ELEVATOR_INSERT_SORT	4

 /*
  * return values from elevator_may_queue_fn
Index: work/drivers/block/pktcdvd.c
===================================================================
--- work.orig/drivers/block/pktcdvd.c
+++ work/drivers/block/pktcdvd.c
@@ -752,7 +752,6 @@ static int pkt_generic_packet(struct pkt

 	rq->timeout = 60*HZ;
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
-	rq->cmd_flags |= REQ_HARDBARRIER;
 	if (cgc->quiet)
 		rq->cmd_flags |= REQ_QUIET;


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-06 14:56                                 ` Hannes Reinecke
@ 2010-08-06 18:38                                   ` Vladislav Bolkhovitin
  2010-08-06 23:38                                     ` Christoph Hellwig
  2010-08-06 23:34                                   ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-06 18:38 UTC (permalink / raw)
  To: Hannes Reinecke, Tejun Heo
  Cc: Christoph Hellwig, Chris Mason, Vivek Goyal, Jan Kara, jaxboe,
	James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho,
	konishi.ryusuke

Hannes Reinecke, on 08/06/2010 06:56 PM wrote:
> But I can do bonnie runs in no time.
> I have done some preliminary benchmarks by just enabling ordered
> queueing in sd.c and no other changes.
> Bonnie says:
>
> Writing intelligently: 115208 vs.  82739
> Reading intelligently: 134133 vs. 129395
>
> putc() performance suffers, though:
> I get 52M vs 90M writing and 50M vs. 65M reading.
> No idea why; shouldn't be that harmful here.
>
> But in any case there is some speed improvement
> to be had from using ordered tags.
>
> Oh, and that was against an EVA 6400.

Here are my numbers. They are taken using:

fio --bs=X --ioengine=aio --buffered=0 --size=128M --rw=read --thread 
--numjobs=1 --loops=100 --group_reporting --gtod_reduce=1 --name=AAA 
--filename=/dev/sdc --iodepth=Y

/dev/sdc is a 1GbE iSCSI device backed on the other side by iSCSI-SCST 
with a single 15K RPM Wide SCSI HDD. All values are in MB/s. The system 
(initiator) is a pretty old 1.7GHz Xeon.

     Y |	1	2	4	8	32
----------------------------------------------------------------------
X     |
4K    |	16	25	32	34	34  (initiator CPU overloaded)
16K   |	25	57	72	85	85  (initiator CPU overloaded)
32K   |	44	72	97	106	106 (initiator CPU overloaded)
64K   |	65	95	114	115	115 (max of 1GbE)
128K  |	80	112	115	115	115 (max of 1GbE)

Are there still any people thinking that tagged queuing doesn't have any 
meaningful use?

Or that a 350% performance increase doesn't matter? (If the system were 
more powerful, the difference would be even bigger.)

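(The biggest jump is in the 16K row: 25 MB/s at queue depth 1 vs. 85 
MB/s at depth 8, i.e. 85/25 = 3.4x.)
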
As you can see, on external storage even 128K commands need at least 2 
requests in the queue to reach full throughput (80 MB/s at queue depth 
1 vs. 112 MB/s at depth 2).

Vlad

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC] relaxed barriers
  2010-08-06 16:04     ` [PATCH, RFC] relaxed barriers Tejun Heo
@ 2010-08-06 23:34       ` Christoph Hellwig
  2010-08-07 10:13       ` [PATCH REPOST " Tejun Heo
  1 sibling, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-06 23:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, linux-raid

> Christoph, does this look like something the filesystems can use or
> have I misunderstood something?

This sounds very useful.  I'll review and test it once I get a bit of time.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-06 14:56                                 ` Hannes Reinecke
  2010-08-06 18:38                                   ` Vladislav Bolkhovitin
@ 2010-08-06 23:34                                   ` Christoph Hellwig
  1 sibling, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-06 23:34 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Vladislav Bolkhovitin, Christoph Hellwig, Chris Mason, Tejun Heo,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, swhiteho, konishi.ryusuke

On Fri, Aug 06, 2010 at 04:56:56PM +0200, Hannes Reinecke wrote:
> But I can do bonnie runs in no time.
> I have done some preliminary benchmarks by just enabling ordered
> queueing in sd.c and no other changes.

Enabled what exactly?


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [RFC] relaxed barrier semantics
  2010-08-06 18:38                                   ` Vladislav Bolkhovitin
@ 2010-08-06 23:38                                     ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-06 23:38 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Hannes Reinecke, Tejun Heo, Christoph Hellwig, Chris Mason,
	Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel,
	linux-scsi, tytso, swhiteho, konishi.ryusuke

On Fri, Aug 06, 2010 at 10:38:46PM +0400, Vladislav Bolkhovitin wrote:
> Are there still any people thinking that tagged queuing doesn't have any 
> meaningful use?
> 
> Or 350% performance increase doesn't matter? (If the system was more 
> powerful, the difference would be even bigger.)
> 
> As you can see on external storage even with 128K commands the queue 
> should have at least 2 entries queued to go with full performance.

Vlad, no one disagrees that draining the queue is really bad for
performance.  That's in fact what started the whole thread.  The
question is whether it's worth dealing with the complexities of using
tagged queueing all the way through the I/O and filesystem stack, or
whether to keep the existing, perfectly working code that waits on
individual I/O requests in the filesystem.  The latter can't keep the
queue filled when a single synchronous writer thread tries to max out
the I/O subsystem, so tagged queueing would be a clear win for that
case.  It's not exactly the typical use case for high end storage,
though - and once you have multiple threads keeping the queue busy the
advantage of the tagging shrinks.

Of course all this is just talk; someone would need to actually do the
work of using tagged queueing in a useful (and non-buggy) way and
benchmark it.

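For reference, "waiting on individual I/O requests in the filesystem"
is just the usual submit-and-wait pattern.  A minimal sketch against
the 2.6.35-era bio API (the helper names are made up, error handling
and bio ownership are omitted):

	#include <linux/bio.h>
	#include <linux/completion.h>
	#include <linux/fs.h>

	/* completion callback: wake up whoever is waiting on this bio */
	static void fs_end_io(struct bio *bio, int error)
	{
		complete(bio->bi_private);
	}

	/* submit a single write bio and sleep until it has completed */
	static void fs_write_and_wait(struct bio *bio)
	{
		DECLARE_COMPLETION_ONSTACK(done);

		bio->bi_private = &done;
		bio->bi_end_io = fs_end_io;
		submit_bio(WRITE_SYNC, bio);
		wait_for_completion(&done);
	}

This keeps ordering entirely in the filesystem, at the cost of never
having more than one such request in flight.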

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH REPOST RFC] relaxed barriers
  2010-08-06 16:04     ` [PATCH, RFC] relaxed barriers Tejun Heo
  2010-08-06 23:34       ` Christoph Hellwig
@ 2010-08-07 10:13       ` Tejun Heo
  2010-08-08 14:31         ` Christoph Hellwig
  1 sibling, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-08-07 10:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel,
	linux-raid

The patch was on top of v2.6.35 but was generated against a dirty tree
and wouldn't apply cleanly.  Here's the proper one.

Thanks.
---
 block/blk-barrier.c          |  255 +++++++++++++++----------------------------
 block/blk-core.c             |   31 ++---
 block/blk.h                  |    5
 block/elevator.c             |   80 +------------
 drivers/block/brd.c          |    2
 drivers/block/loop.c         |    2
 drivers/block/osdblk.c       |    2
 drivers/block/pktcdvd.c      |    1
 drivers/block/ps3disk.c      |    3
 drivers/block/virtio_blk.c   |    4
 drivers/block/xen-blkfront.c |    2
 drivers/ide/ide-disk.c       |    4
 drivers/md/dm.c              |    3
 drivers/mmc/card/queue.c     |    2
 drivers/s390/block/dasd.c    |    2
 drivers/scsi/sd.c            |    8 -
 include/linux/blkdev.h       |   63 +++-------
 include/linux/elevator.h     |    6 -
 18 files changed, 155 insertions(+), 320 deletions(-)

Index: work/block/blk-barrier.c
===================================================================
--- work.orig/block/blk-barrier.c
+++ work/block/blk-barrier.c
@@ -9,6 +9,8 @@

 #include "blk.h"

+static struct request *queue_next_ordseq(struct request_queue *q);
+
 /**
  * blk_queue_ordered - does this queue support ordered writes
  * @q:        the request queue
@@ -31,13 +33,8 @@ int blk_queue_ordered(struct request_que
 		return -EINVAL;
 	}

-	if (ordered != QUEUE_ORDERED_NONE &&
-	    ordered != QUEUE_ORDERED_DRAIN &&
-	    ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
-	    ordered != QUEUE_ORDERED_DRAIN_FUA &&
-	    ordered != QUEUE_ORDERED_TAG &&
-	    ordered != QUEUE_ORDERED_TAG_FLUSH &&
-	    ordered != QUEUE_ORDERED_TAG_FUA) {
+	if (ordered != QUEUE_ORDERED_NONE && ordered != QUEUE_ORDERED_BAR &&
+	    ordered != QUEUE_ORDERED_FLUSH && ordered != QUEUE_ORDERED_FUA) {
 		printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
 		return -EINVAL;
 	}
@@ -60,38 +57,10 @@ unsigned blk_ordered_cur_seq(struct requ
 	return 1 << ffz(q->ordseq);
 }

-unsigned blk_ordered_req_seq(struct request *rq)
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+						unsigned seq, int error)
 {
-	struct request_queue *q = rq->q;
-
-	BUG_ON(q->ordseq == 0);
-
-	if (rq == &q->pre_flush_rq)
-		return QUEUE_ORDSEQ_PREFLUSH;
-	if (rq == &q->bar_rq)
-		return QUEUE_ORDSEQ_BAR;
-	if (rq == &q->post_flush_rq)
-		return QUEUE_ORDSEQ_POSTFLUSH;
-
-	/*
-	 * !fs requests don't need to follow barrier ordering.  Always
-	 * put them at the front.  This fixes the following deadlock.
-	 *
-	 * http://thread.gmane.org/gmane.linux.kernel/537473
-	 */
-	if (!blk_fs_request(rq))
-		return QUEUE_ORDSEQ_DRAIN;
-
-	if ((rq->cmd_flags & REQ_ORDERED_COLOR) ==
-	    (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR))
-		return QUEUE_ORDSEQ_DRAIN;
-	else
-		return QUEUE_ORDSEQ_DONE;
-}
-
-bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
-{
-	struct request *rq;
+	struct request *rq = NULL;

 	if (error && !q->orderr)
 		q->orderr = error;
@@ -99,16 +68,22 @@ bool blk_ordered_complete_seq(struct req
 	BUG_ON(q->ordseq & seq);
 	q->ordseq |= seq;

-	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE)
-		return false;
-
-	/*
-	 * Okay, sequence complete.
-	 */
-	q->ordseq = 0;
-	rq = q->orig_bar_rq;
-	__blk_end_request_all(rq, q->orderr);
-	return true;
+	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+		/* not complete yet, queue the next ordered sequence */
+		rq = queue_next_ordseq(q);
+	} else {
+		/* complete this barrier request */
+		__blk_end_request_all(q->orig_bar_rq, q->orderr);
+		q->orig_bar_rq = NULL;
+		q->ordseq = 0;
+
+		/* dispatch the next barrier if there's one */
+		if (!list_empty(&q->pending_barriers)) {
+			rq = list_entry_rq(q->pending_barriers.next);
+			list_move(&rq->queuelist, &q->queue_head);
+		}
+	}
+	return rq;
 }

 static void pre_flush_end_io(struct request *rq, int error)
@@ -129,21 +104,10 @@ static void post_flush_end_io(struct req
 	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
 }

-static void queue_flush(struct request_queue *q, unsigned which)
+static void queue_flush(struct request_queue *q, struct request *rq,
+			rq_end_io_fn *end_io)
 {
-	struct request *rq;
-	rq_end_io_fn *end_io;
-
-	if (which == QUEUE_ORDERED_DO_PREFLUSH) {
-		rq = &q->pre_flush_rq;
-		end_io = pre_flush_end_io;
-	} else {
-		rq = &q->post_flush_rq;
-		end_io = post_flush_end_io;
-	}
-
 	blk_rq_init(q, rq);
-	rq->cmd_flags = REQ_HARDBARRIER;
 	rq->rq_disk = q->bar_rq.rq_disk;
 	rq->end_io = end_io;
 	q->prepare_flush_fn(q, rq);
@@ -151,132 +115,93 @@ static void queue_flush(struct request_q
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }

-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static struct request *queue_next_ordseq(struct request_queue *q)
 {
-	struct request *rq = *rqp;
-	unsigned skip = 0;
+	struct request *rq = &q->bar_rq;

-	q->orderr = 0;
-	q->ordered = q->next_ordered;
-	q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
-	/*
-	 * For an empty barrier, there's no actual BAR request, which
-	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
-	 */
-	if (!blk_rq_sectors(rq)) {
-		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
-				QUEUE_ORDERED_DO_POSTFLUSH);
-		/*
-		 * Empty barrier on a write-through device w/ ordered
-		 * tag has no command to issue and without any command
-		 * to issue, ordering by tag can't be used.  Drain
-		 * instead.
-		 */
-		if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
-		    !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) {
-			q->ordered &= ~QUEUE_ORDERED_BY_TAG;
-			q->ordered |= QUEUE_ORDERED_BY_DRAIN;
-		}
-	}
-
-	/* stash away the original request */
-	blk_dequeue_request(rq);
-	q->orig_bar_rq = rq;
-	rq = NULL;
-
-	/*
-	 * Queue ordered sequence.  As we stack them at the head, we
-	 * need to queue in reverse order.  Note that we rely on that
-	 * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
-	 * request gets inbetween ordered sequence.
-	 */
-	if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
-		rq = &q->post_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_POSTFLUSH;
-
-	if (q->ordered & QUEUE_ORDERED_DO_BAR) {
-		rq = &q->bar_rq;
+	switch (blk_ordered_cur_seq(q)) {
+	case QUEUE_ORDSEQ_PREFLUSH:
+		queue_flush(q, rq, pre_flush_end_io);
+		break;

+	case QUEUE_ORDSEQ_BAR:
 		/* initialize proxy request and queue it */
 		blk_rq_init(q, rq);
-		if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
-			rq->cmd_flags |= REQ_RW;
+		init_request_from_bio(rq, q->orig_bar_rq->bio);
+		rq->cmd_flags &= ~REQ_HARDBARRIER;
 		if (q->ordered & QUEUE_ORDERED_DO_FUA)
 			rq->cmd_flags |= REQ_FUA;
-		init_request_from_bio(rq, q->orig_bar_rq->bio);
 		rq->end_io = bar_end_io;

 		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	} else
-		skip |= QUEUE_ORDSEQ_BAR;
+		break;

-	if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
-		rq = &q->pre_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_PREFLUSH;
+	case QUEUE_ORDSEQ_POSTFLUSH:
+		queue_flush(q, rq, post_flush_end_io);
+		break;

-	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
-		rq = NULL;
-	else
-		skip |= QUEUE_ORDSEQ_DRAIN;
-
-	*rqp = rq;
-
-	/*
-	 * Complete skipped sequences.  If whole sequence is complete,
-	 * return false to tell elevator that this request is gone.
-	 */
-	return !blk_ordered_complete_seq(q, skip, 0);
+	default:
+		BUG();
+	}
+	return rq;
 }

-bool blk_do_ordered(struct request_queue *q, struct request **rqp)
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
 {
-	struct request *rq = *rqp;
-	const int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq);
+	unsigned skip = 0;

-	if (!q->ordseq) {
-		if (!is_barrier)
-			return true;
-
-		if (q->next_ordered != QUEUE_ORDERED_NONE)
-			return start_ordered(q, rqp);
-		else {
-			/*
-			 * Queue ordering not supported.  Terminate
-			 * with prejudice.
-			 */
-			blk_dequeue_request(rq);
-			__blk_end_request_all(rq, -EOPNOTSUPP);
-			*rqp = NULL;
-			return false;
-		}
+	if (!blk_barrier_rq(rq))
+		return rq;
+
+	if (q->ordseq) {
+		/*
+		 * Barrier is already in progress and they can't be
+		 * processed in parallel.  Queue for later processing.
+		 */
+		list_move_tail(&rq->queuelist, &q->pending_barriers);
+		return NULL;
+	}
+
+	if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+		/*
+		 * Queue ordering not supported.  Terminate
+		 * with prejudice.
+		 */
+		blk_dequeue_request(rq);
+		__blk_end_request_all(rq, -EOPNOTSUPP);
+		return NULL;
 	}

 	/*
-	 * Ordered sequence in progress
+	 * Start a new ordered sequence
 	 */
+	q->orderr = 0;
+	q->ordered = q->next_ordered;
+	q->ordseq |= QUEUE_ORDSEQ_STARTED;

-	/* Special requests are not subject to ordering rules. */
-	if (!blk_fs_request(rq) &&
-	    rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
-		return true;
-
-	if (q->ordered & QUEUE_ORDERED_BY_TAG) {
-		/* Ordered by tag.  Blocking the next barrier is enough. */
-		if (is_barrier && rq != &q->bar_rq)
-			*rqp = NULL;
-	} else {
-		/* Ordered by draining.  Wait for turn. */
-		WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
-		if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
-			*rqp = NULL;
-	}
+	/*
+	 * For an empty barrier, there's no actual BAR request, which
+	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
+	 */
+	if (!blk_rq_sectors(rq))
+		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+				QUEUE_ORDERED_DO_POSTFLUSH);
+
+	/* stash away the original request */
+	blk_dequeue_request(rq);
+	q->orig_bar_rq = rq;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+		skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+		skip |= QUEUE_ORDSEQ_BAR;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+		skip |= QUEUE_ORDSEQ_POSTFLUSH;

-	return true;
+	/* complete skipped sequences and return the first sequence */
+	return blk_ordered_complete_seq(q, skip, 0);
 }

 static void bio_end_empty_barrier(struct bio *bio, int err)
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -106,7 +106,6 @@ enum rq_flag_bits {
 	__REQ_FAILED,		/* set if the request failed */
 	__REQ_QUIET,		/* don't worry about errors */
 	__REQ_PREEMPT,		/* set for "ide_preempt" requests */
-	__REQ_ORDERED_COLOR,	/* is before or after barrier */
 	__REQ_RW_SYNC,		/* request is sync (sync write or read) */
 	__REQ_ALLOCED,		/* request came from our alloc pool */
 	__REQ_RW_META,		/* metadata io request */
@@ -135,7 +134,6 @@ enum rq_flag_bits {
 #define REQ_FAILED	(1 << __REQ_FAILED)
 #define REQ_QUIET	(1 << __REQ_QUIET)
 #define REQ_PREEMPT	(1 << __REQ_PREEMPT)
-#define REQ_ORDERED_COLOR	(1 << __REQ_ORDERED_COLOR)
 #define REQ_RW_SYNC	(1 << __REQ_RW_SYNC)
 #define REQ_ALLOCED	(1 << __REQ_ALLOCED)
 #define REQ_RW_META	(1 << __REQ_RW_META)
@@ -437,9 +435,10 @@ struct request_queue
 	 * reserved for flush operations
 	 */
 	unsigned int		ordered, next_ordered, ordseq;
-	int			orderr, ordcolor;
-	struct request		pre_flush_rq, bar_rq, post_flush_rq;
-	struct request		*orig_bar_rq;
+	int			orderr;
+	struct request		bar_rq;
+	struct request          *orig_bar_rq;
+	struct list_head	pending_barriers;

 	struct mutex		sysfs_lock;

@@ -543,49 +542,33 @@ enum {
 	 * Hardbarrier is supported with one of the following methods.
 	 *
 	 * NONE		: hardbarrier unsupported
-	 * DRAIN	: ordering by draining is enough
-	 * DRAIN_FLUSH	: ordering by draining w/ pre and post flushes
-	 * DRAIN_FUA	: ordering by draining w/ pre flush and FUA write
-	 * TAG		: ordering by tag is enough
-	 * TAG_FLUSH	: ordering by tag w/ pre and post flushes
-	 * TAG_FUA	: ordering by tag w/ pre flush and FUA write
-	 */
-	QUEUE_ORDERED_BY_DRAIN		= 0x01,
-	QUEUE_ORDERED_BY_TAG		= 0x02,
-	QUEUE_ORDERED_DO_PREFLUSH	= 0x10,
-	QUEUE_ORDERED_DO_BAR		= 0x20,
-	QUEUE_ORDERED_DO_POSTFLUSH	= 0x40,
-	QUEUE_ORDERED_DO_FUA		= 0x80,
-
-	QUEUE_ORDERED_NONE		= 0x00,
-
-	QUEUE_ORDERED_DRAIN		= QUEUE_ORDERED_BY_DRAIN |
-					  QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_DRAIN_FLUSH	= QUEUE_ORDERED_DRAIN |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_DRAIN_FUA		= QUEUE_ORDERED_DRAIN |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_FUA,
+	 * BAR		: writing out barrier is enough
+	 * FLUSH	: barrier and surrounding pre and post flushes
+	 * FUA		: FUA barrier w/ pre flush
+	 */
+	QUEUE_ORDERED_DO_PREFLUSH	= 1 << 0,
+	QUEUE_ORDERED_DO_BAR		= 1 << 1,
+	QUEUE_ORDERED_DO_POSTFLUSH	= 1 << 2,
+	QUEUE_ORDERED_DO_FUA		= 1 << 3,
+
+	QUEUE_ORDERED_NONE		= 0,

-	QUEUE_ORDERED_TAG		= QUEUE_ORDERED_BY_TAG |
-					  QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_TAG_FLUSH		= QUEUE_ORDERED_TAG |
+	QUEUE_ORDERED_BAR		= QUEUE_ORDERED_DO_BAR,
+	QUEUE_ORDERED_FLUSH		= QUEUE_ORDERED_DO_BAR |
 					  QUEUE_ORDERED_DO_PREFLUSH |
 					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_TAG_FUA		= QUEUE_ORDERED_TAG |
+	QUEUE_ORDERED_FUA		= QUEUE_ORDERED_DO_BAR |
 					  QUEUE_ORDERED_DO_PREFLUSH |
 					  QUEUE_ORDERED_DO_FUA,

 	/*
 	 * Ordered operation sequence
 	 */
-	QUEUE_ORDSEQ_STARTED	= 0x01,	/* flushing in progress */
-	QUEUE_ORDSEQ_DRAIN	= 0x02,	/* waiting for the queue to be drained */
-	QUEUE_ORDSEQ_PREFLUSH	= 0x04,	/* pre-flushing in progress */
-	QUEUE_ORDSEQ_BAR	= 0x08,	/* original barrier req in progress */
-	QUEUE_ORDSEQ_POSTFLUSH	= 0x10,	/* post-flushing in progress */
-	QUEUE_ORDSEQ_DONE	= 0x20,
+	QUEUE_ORDSEQ_STARTED	= (1 << 0), /* flushing in progress */
+	QUEUE_ORDSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
+	QUEUE_ORDSEQ_BAR	= (1 << 2), /* barrier write in progress */
+	QUEUE_ORDSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
+	QUEUE_ORDSEQ_DONE	= (1 << 4),
 };

 #define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -967,10 +950,8 @@ extern void blk_queue_rq_timed_out(struc
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 extern int blk_queue_ordered(struct request_queue *, unsigned, prepare_flush_fn *);
-extern bool blk_do_ordered(struct request_queue *, struct request **);
 extern unsigned blk_ordered_cur_seq(struct request_queue *);
 extern unsigned blk_ordered_req_seq(struct request *);
-extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);

 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
 extern void blk_dump_rq_flags(struct request *, char *);
Index: work/drivers/block/brd.c
===================================================================
--- work.orig/drivers/block/brd.c
+++ work/drivers/block/brd.c
@@ -479,7 +479,7 @@ static struct brd_device *brd_alloc(int
 	if (!brd->brd_queue)
 		goto out_free_dev;
 	blk_queue_make_request(brd->brd_queue, brd_make_request);
-	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG, NULL);
+	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_BAR, NULL);
 	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
 	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

Index: work/drivers/block/virtio_blk.c
===================================================================
--- work.orig/drivers/block/virtio_blk.c
+++ work/drivers/block/virtio_blk.c
@@ -368,10 +368,10 @@ static int __devinit virtblk_probe(struc

 	/* If barriers are supported, tell block layer that queue is ordered */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
-		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
+		blk_queue_ordered(q, QUEUE_ORDERED_FLUSH,
 				  virtblk_prepare_flush);
 	else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
-		blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL);
+		blk_queue_ordered(q, QUEUE_ORDERED_BAR, NULL);

 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
Index: work/drivers/scsi/sd.c
===================================================================
--- work.orig/drivers/scsi/sd.c
+++ work/drivers/scsi/sd.c
@@ -2103,15 +2103,13 @@ static int sd_revalidate_disk(struct gen

 	/*
 	 * We now have all cache related info, determine how we deal
-	 * with ordered requests.  Note that as the current SCSI
-	 * dispatch function can alter request order, we cannot use
-	 * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+	 * with ordered requests.
 	 */
 	if (sdkp->WCE)
 		ordered = sdkp->DPOFUA
-			? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
+			? QUEUE_ORDERED_FUA : QUEUE_ORDERED_FLUSH;
 	else
-		ordered = QUEUE_ORDERED_DRAIN;
+		ordered = QUEUE_ORDERED_BAR;

 	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush);

Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_no
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
+	INIT_LIST_HEAD(&q->pending_barriers);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);

 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1036,22 +1037,6 @@ void blk_insert_request(struct request_q
 }
 EXPORT_SYMBOL(blk_insert_request);

-/*
- * add-request adds a request to the linked list.
- * queue lock is held and interrupts disabled, as we muck with the
- * request queue list.
- */
-static inline void add_request(struct request_queue *q, struct request *req)
-{
-	drive_stat_acct(req, 1);
-
-	/*
-	 * elevator indicated where it wants this request to be
-	 * inserted at elevator_merge time
-	 */
-	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
-}
-
 static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
@@ -1184,6 +1169,7 @@ static int __make_request(struct request
 	const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO);
 	const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+	int where = ELEVATOR_INSERT_SORT;
 	int rw_flags;

 	if (bio_rw_flagged(bio, BIO_RW_BARRIER) &&
@@ -1191,6 +1177,7 @@ static int __make_request(struct request
 		bio_endio(bio, -EOPNOTSUPP);
 		return 0;
 	}
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1200,7 +1187,12 @@ static int __make_request(struct request

 	spin_lock_irq(q->queue_lock);

-	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q))
+	if (bio_rw_flagged(bio, BIO_RW_BARRIER)) {
+		where = ELEVATOR_INSERT_ORDERED;
+		goto get_rq;
+	}
+
+	if (elv_queue_empty(q))
 		goto get_rq;

 	el_ret = elv_merge(q, &req, bio);
@@ -1297,7 +1289,10 @@ get_rq:
 		req->cpu = blk_cpu_to_group(smp_processor_id());
 	if (queue_should_plug(q) && elv_queue_empty(q))
 		blk_plug_device(q);
-	add_request(q, req);
+
+	/* insert the request into the elevator */
+	drive_stat_acct(req, 1);
+	__elv_add_request(q, req, where, 0);
 out:
 	if (unplug || !queue_should_plug(q))
 		__generic_unplug_device(q);
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c
+++ work/block/elevator.c
@@ -564,7 +564,7 @@ void elv_requeue_request(struct request_

 	rq->cmd_flags &= ~REQ_STARTED;

-	elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE);
+	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }

 void elv_drain_elevator(struct request_queue *q)
@@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queu

 void elv_insert(struct request_queue *q, struct request *rq, int where)
 {
-	struct list_head *pos;
-	unsigned ordseq;
 	int unplug_it = 1;

 	trace_block_rq_insert(q, rq);
@@ -622,10 +620,14 @@ void elv_insert(struct request_queue *q,
 	switch (where) {
 	case ELEVATOR_INSERT_FRONT:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-
 		list_add(&rq->queuelist, &q->queue_head);
 		break;

+	case ELEVATOR_INSERT_ORDERED:
+		rq->cmd_flags |= REQ_SOFTBARRIER;
+		list_add_tail(&rq->queuelist, &q->queue_head);
+		break;
+
 	case ELEVATOR_INSERT_BACK:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
 		elv_drain_elevator(q);
@@ -661,36 +663,6 @@ void elv_insert(struct request_queue *q,
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;

-	case ELEVATOR_INSERT_REQUEUE:
-		/*
-		 * If ordered flush isn't in progress, we do front
-		 * insertion; otherwise, requests should be requeued
-		 * in ordseq order.
-		 */
-		rq->cmd_flags |= REQ_SOFTBARRIER;
-
-		/*
-		 * Most requeues happen because of a busy condition,
-		 * don't force unplug of the queue for that case.
-		 */
-		unplug_it = 0;
-
-		if (q->ordseq == 0) {
-			list_add(&rq->queuelist, &q->queue_head);
-			break;
-		}
-
-		ordseq = blk_ordered_req_seq(rq);
-
-		list_for_each(pos, &q->queue_head) {
-			struct request *pos_rq = list_entry_rq(pos);
-			if (ordseq <= blk_ordered_req_seq(pos_rq))
-				break;
-		}
-
-		list_add_tail(&rq->queuelist, pos);
-		break;
-
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -709,32 +681,14 @@ void elv_insert(struct request_queue *q,
 void __elv_add_request(struct request_queue *q, struct request *rq, int where,
 		       int plug)
 {
-	if (q->ordcolor)
-		rq->cmd_flags |= REQ_ORDERED_COLOR;
-
 	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
-		/*
-		 * toggle ordered color
-		 */
-		if (blk_barrier_rq(rq))
-			q->ordcolor ^= 1;
-
-		/*
-		 * barriers implicitly indicate back insertion
-		 */
-		if (where == ELEVATOR_INSERT_SORT)
-			where = ELEVATOR_INSERT_BACK;
-
-		/*
-		 * this request is scheduling boundary, update
-		 * end_sector
-		 */
+		/* barriers are scheduling boundary, update end_sector */
 		if (blk_fs_request(rq) || blk_discard_rq(rq)) {
 			q->end_sector = rq_end_sector(rq);
 			q->boundary_rq = rq;
 		}
 	} else if (!(rq->cmd_flags & REQ_ELVPRIV) &&
-		    where == ELEVATOR_INSERT_SORT)
+		   where == ELEVATOR_INSERT_SORT)
 		where = ELEVATOR_INSERT_BACK;

 	if (plug)
@@ -846,24 +800,6 @@ void elv_completed_request(struct reques
 		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
 			e->ops->elevator_completed_req_fn(q, rq);
 	}
-
-	/*
-	 * Check if the queue is waiting for fs requests to be
-	 * drained for flush sequence.
-	 */
-	if (unlikely(q->ordseq)) {
-		struct request *next = NULL;
-
-		if (!list_empty(&q->queue_head))
-			next = list_entry_rq(q->queue_head.next);
-
-		if (!queue_in_flight(q) &&
-		    blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN &&
-		    (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) {
-			blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0);
-			__blk_run_queue(q);
-		}
-	}
 }

 #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
Index: work/block/blk.h
===================================================================
--- work.orig/block/blk.h
+++ work/block/blk.h
@@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))

+struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
@@ -58,7 +60,8 @@ static inline struct request *__elv_next
 	while (1) {
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (blk_do_ordered(q, &rq))
+			rq = blk_do_ordered(q, rq);
+			if (rq)
 				return rq;
 		}

Index: work/drivers/block/loop.c
===================================================================
--- work.orig/drivers/block/loop.c
+++ work/drivers/block/loop.c
@@ -831,7 +831,7 @@ static int loop_set_fd(struct loop_devic
 	lo->lo_queue->unplug_fn = loop_unplug;

 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN, NULL);
+		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_BAR, NULL);

 	set_capacity(lo->lo_disk, size);
 	bd_set_size(bdev, size << 9);
Index: work/drivers/block/osdblk.c
===================================================================
--- work.orig/drivers/block/osdblk.c
+++ work/drivers/block/osdblk.c
@@ -446,7 +446,7 @@ static int osdblk_init_disk(struct osdbl
 	blk_queue_stack_limits(q, osd_request_queue(osdev->osd));

 	blk_queue_prep_rq(q, blk_queue_start_tag);
-	blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush);
+	blk_queue_ordered(q, QUEUE_ORDERED_FLUSH, osdblk_prepare_flush);

 	disk->queue = q;

Index: work/drivers/block/ps3disk.c
===================================================================
--- work.orig/drivers/block/ps3disk.c
+++ work/drivers/block/ps3disk.c
@@ -480,8 +480,7 @@ static int __devinit ps3disk_probe(struc
 	blk_queue_dma_alignment(queue, dev->blk_size-1);
 	blk_queue_logical_block_size(queue, dev->blk_size);

-	blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH,
-			  ps3disk_prepare_flush);
+	blk_queue_ordered(queue, QUEUE_ORDERED_FLUSH, ps3disk_prepare_flush);

 	blk_queue_max_segments(queue, -1);
 	blk_queue_max_segment_size(queue, dev->bounce_size);
Index: work/drivers/block/xen-blkfront.c
===================================================================
--- work.orig/drivers/block/xen-blkfront.c
+++ work/drivers/block/xen-blkfront.c
@@ -373,7 +373,7 @@ static int xlvbd_barrier(struct blkfront
 	int err;

 	err = blk_queue_ordered(info->rq,
-				info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE,
+				info->feature_barrier ? QUEUE_ORDERED_BAR : QUEUE_ORDERED_NONE,
 				NULL);

 	if (err)
Index: work/drivers/ide/ide-disk.c
===================================================================
--- work.orig/drivers/ide/ide-disk.c
+++ work/drivers/ide/ide-disk.c
@@ -537,11 +537,11 @@ static void update_ordered(ide_drive_t *
 		       drive->name, barrier ? "" : "not ");

 		if (barrier) {
-			ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+			ordered = QUEUE_ORDERED_FLUSH;
 			prep_fn = idedisk_prepare_flush;
 		}
 	} else
-		ordered = QUEUE_ORDERED_DRAIN;
+		ordered = QUEUE_ORDERED_BAR;

 	blk_queue_ordered(drive->queue, ordered, prep_fn);
 }
Index: work/drivers/md/dm.c
===================================================================
--- work.orig/drivers/md/dm.c
+++ work/drivers/md/dm.c
@@ -1912,8 +1912,7 @@ static struct mapped_device *alloc_dev(i
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
-			  dm_rq_prepare_flush);
+	blk_queue_ordered(md->queue, QUEUE_ORDERED_FLUSH, dm_rq_prepare_flush);

 	md->disk = alloc_disk(1);
 	if (!md->disk)
Index: work/drivers/mmc/card/queue.c
===================================================================
--- work.orig/drivers/mmc/card/queue.c
+++ work/drivers/mmc/card/queue.c
@@ -128,7 +128,7 @@ int mmc_init_queue(struct mmc_queue *mq,
 	mq->req = NULL;

 	blk_queue_prep_rq(mq->queue, mmc_prep_request);
-	blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN, NULL);
+	blk_queue_ordered(mq->queue, QUEUE_ORDERED_BAR, NULL);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);

 #ifdef CONFIG_MMC_BLOCK_BOUNCE
Index: work/drivers/s390/block/dasd.c
===================================================================
--- work.orig/drivers/s390/block/dasd.c
+++ work/drivers/s390/block/dasd.c
@@ -2196,7 +2196,7 @@ static void dasd_setup_queue(struct dasd
 	 */
 	blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
 	blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
-	blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN, NULL);
+	blk_queue_ordered(block->request_queue, QUEUE_ORDERED_BAR, NULL);
 }

 /*
Index: work/include/linux/elevator.h
===================================================================
--- work.orig/include/linux/elevator.h
+++ work/include/linux/elevator.h
@@ -162,9 +162,9 @@ extern struct request *elv_rb_find(struc
  * Insertion selection
  */
 #define ELEVATOR_INSERT_FRONT	1
-#define ELEVATOR_INSERT_BACK	2
-#define ELEVATOR_INSERT_SORT	3
-#define ELEVATOR_INSERT_REQUEUE	4
+#define ELEVATOR_INSERT_ORDERED	2
+#define ELEVATOR_INSERT_BACK	3
+#define ELEVATOR_INSERT_SORT	4

 /*
  * return values from elevator_may_queue_fn
Index: work/drivers/block/pktcdvd.c
===================================================================
--- work.orig/drivers/block/pktcdvd.c
+++ work/drivers/block/pktcdvd.c
@@ -752,7 +752,6 @@ static int pkt_generic_packet(struct pkt

 	rq->timeout = 60*HZ;
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
-	rq->cmd_flags |= REQ_HARDBARRIER;
 	if (cgc->quiet)
 		rq->cmd_flags |= REQ_QUIET;


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH REPOST RFC] relaxed barriers
  2010-08-07 10:13       ` [PATCH REPOST " Tejun Heo
@ 2010-08-08 14:31         ` Christoph Hellwig
  2010-08-09 14:50           ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-08 14:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Jan Kara, jaxboe, James.Bottomley,
	linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, linux-raid

On Sat, Aug 07, 2010 at 12:13:06PM +0200, Tejun Heo wrote:
> The patch was on top of v2.6.35 but was generated against a dirty tree
> and wouldn't apply cleanly.  Here's the proper one.

Here's an updated version:

 (a) ported to Jens' current block tree
 (b) optimized barriers on devices not requiring flushes into no-ops
 (c) redid the blk_queue_ordered interface to just set QUEUE_HAS_FLUSH
     and QUEUE_HAS_FUA flags (usage sketch below)

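With (c) a driver no longer picks an ordered mode; it just declares
which cache control commands the device understands.  Roughly (a
sketch of the intended usage, not part of the diff below):

	/* device with a volatile write cache that also honours FUA */
	blk_queue_cache_features(q, QUEUE_HAS_FLUSH | QUEUE_HAS_FUA);

	/* write-through device: nothing to flush */
	blk_queue_cache_features(q, 0);
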
Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2010-08-07 12:53:23.727479189 -0400
+++ linux-2.6/block/blk-barrier.c	2010-08-07 14:52:21.402479191 -0400
@@ -9,37 +9,36 @@
 
 #include "blk.h"
 
+/*
+ * Ordered operation sequence.
+ */
+enum {
+	QUEUE_ORDSEQ_STARTED	= (1 << 0), /* flushing in progress */
+	QUEUE_ORDSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
+	QUEUE_ORDSEQ_BAR	= (1 << 2), /* barrier write in progress */
+	QUEUE_ORDSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
+	QUEUE_ORDSEQ_DONE	= (1 << 4),
+};
+
+static struct request *queue_next_ordseq(struct request_queue *q);
+
 /**
- * blk_queue_ordered - does this queue support ordered writes
- * @q:        the request queue
- * @ordered:  one of QUEUE_ORDERED_*
- *
- * Description:
- *   For journalled file systems, doing ordered writes on a commit
- *   block instead of explicitly doing wait_on_buffer (which is bad
- *   for performance) can be a big win. Block drivers supporting this
- *   feature should call this function and indicate so.
- *
+ * blk_queue_cache_features - set the supported cache control features
+ * @q:        		the request queue
+ * @cache_features:	the support features
  **/
-int blk_queue_ordered(struct request_queue *q, unsigned ordered)
+int blk_queue_cache_features(struct request_queue *q, unsigned cache_features)
 {
-	if (ordered != QUEUE_ORDERED_NONE &&
-	    ordered != QUEUE_ORDERED_DRAIN &&
-	    ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
-	    ordered != QUEUE_ORDERED_DRAIN_FUA &&
-	    ordered != QUEUE_ORDERED_TAG &&
-	    ordered != QUEUE_ORDERED_TAG_FLUSH &&
-	    ordered != QUEUE_ORDERED_TAG_FUA) {
-		printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
+	if (cache_features & ~(QUEUE_HAS_FLUSH|QUEUE_HAS_FUA)) {
+		printk(KERN_ERR "blk_queue_cache_features: bad value %d\n",
+			cache_features);
 		return -EINVAL;
 	}
 
-	q->ordered = ordered;
-	q->next_ordered = ordered;
-
+	q->cache_features = cache_features;
 	return 0;
 }
-EXPORT_SYMBOL(blk_queue_ordered);
+EXPORT_SYMBOL(blk_queue_cache_features);
 
 /*
  * Cache flushing for ordered writes handling
@@ -51,38 +50,10 @@ unsigned blk_ordered_cur_seq(struct requ
 	return 1 << ffz(q->ordseq);
 }
 
-unsigned blk_ordered_req_seq(struct request *rq)
-{
-	struct request_queue *q = rq->q;
-
-	BUG_ON(q->ordseq == 0);
-
-	if (rq == &q->pre_flush_rq)
-		return QUEUE_ORDSEQ_PREFLUSH;
-	if (rq == &q->bar_rq)
-		return QUEUE_ORDSEQ_BAR;
-	if (rq == &q->post_flush_rq)
-		return QUEUE_ORDSEQ_POSTFLUSH;
-
-	/*
-	 * !fs requests don't need to follow barrier ordering.  Always
-	 * put them at the front.  This fixes the following deadlock.
-	 *
-	 * http://thread.gmane.org/gmane.linux.kernel/537473
-	 */
-	if (rq->cmd_type != REQ_TYPE_FS)
-		return QUEUE_ORDSEQ_DRAIN;
-
-	if ((rq->cmd_flags & REQ_ORDERED_COLOR) ==
-	    (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR))
-		return QUEUE_ORDSEQ_DRAIN;
-	else
-		return QUEUE_ORDSEQ_DONE;
-}
-
-bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+						unsigned seq, int error)
 {
-	struct request *rq;
+	struct request *rq = NULL;
 
 	if (error && !q->orderr)
 		q->orderr = error;
@@ -90,16 +61,22 @@ bool blk_ordered_complete_seq(struct req
 	BUG_ON(q->ordseq & seq);
 	q->ordseq |= seq;
 
-	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE)
-		return false;
-
-	/*
-	 * Okay, sequence complete.
-	 */
-	q->ordseq = 0;
-	rq = q->orig_bar_rq;
-	__blk_end_request_all(rq, q->orderr);
-	return true;
+	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+		/* not complete yet, queue the next ordered sequence */
+		rq = queue_next_ordseq(q);
+	} else {
+		/* complete this barrier request */
+		__blk_end_request_all(q->orig_bar_rq, q->orderr);
+		q->orig_bar_rq = NULL;
+		q->ordseq = 0;
+
+		/* dispatch the next barrier if there's one */
+		if (!list_empty(&q->pending_barriers)) {
+			rq = list_entry_rq(q->pending_barriers.next);
+			list_move(&rq->queuelist, &q->queue_head);
+		}
+	}
+	return rq;
 }
 
 static void pre_flush_end_io(struct request *rq, int error)
@@ -120,155 +97,100 @@ static void post_flush_end_io(struct req
 	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
 }
 
-static void queue_flush(struct request_queue *q, unsigned which)
+static void init_flush_request(struct request_queue *q, struct request *rq)
 {
-	struct request *rq;
-	rq_end_io_fn *end_io;
+	rq->cmd_type = REQ_TYPE_FS;
+	rq->cmd_flags = REQ_FLUSH;
+	rq->rq_disk = q->orig_bar_rq->rq_disk;
+}
 
-	if (which == QUEUE_ORDERED_DO_PREFLUSH) {
-		rq = &q->pre_flush_rq;
-		end_io = pre_flush_end_io;
-	} else {
-		rq = &q->post_flush_rq;
-		end_io = post_flush_end_io;
-	}
+/*
+ * Initialize proxy request and queue it.
+ */
+static struct request *queue_next_ordseq(struct request_queue *q)
+{
+	struct request *rq = &q->bar_rq;
 
 	blk_rq_init(q, rq);
-	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = REQ_HARDBARRIER | REQ_FLUSH;
-	rq->rq_disk = q->orig_bar_rq->rq_disk;
-	rq->end_io = end_io;
+
+	switch (blk_ordered_cur_seq(q)) {
+	case QUEUE_ORDSEQ_PREFLUSH:
+		init_flush_request(q, rq);
+		rq->end_io = pre_flush_end_io;
+		break;
+	case QUEUE_ORDSEQ_BAR:
+		init_request_from_bio(rq, q->orig_bar_rq->bio);
+		rq->cmd_flags &= ~REQ_HARDBARRIER;
+		if (q->cache_features & QUEUE_HAS_FUA)
+			rq->cmd_flags |= REQ_FUA;
+		rq->end_io = bar_end_io;
+		break;
+	case QUEUE_ORDSEQ_POSTFLUSH:
+		init_flush_request(q, rq);
+		rq->end_io = post_flush_end_io;
+		break;
+	default:
+		BUG();
+	}
 
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+	return rq;
 }
 
-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
 {
-	struct request *rq = *rqp;
 	unsigned skip = 0;
 
-	q->orderr = 0;
-	q->ordered = q->next_ordered;
-	q->ordseq |= QUEUE_ORDSEQ_STARTED;
+	if (rq->cmd_type != REQ_TYPE_FS)
+		return rq;
+	if (!(rq->cmd_flags & REQ_HARDBARRIER))
+		return rq;
 
-	/*
-	 * For an empty barrier, there's no actual BAR request, which
-	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
-	 */
-	if (!blk_rq_sectors(rq)) {
-		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
-				QUEUE_ORDERED_DO_POSTFLUSH);
+	if (!(q->cache_features & QUEUE_HAS_FLUSH)) {
 		/*
-		 * Empty barrier on a write-through device w/ ordered
-		 * tag has no command to issue and without any command
-		 * to issue, ordering by tag can't be used.  Drain
-		 * instead.
+		 * No flush required.  We can just send on write requests
+		 * and complete cache flush requests ASAP.
 		 */
-		if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
-		    !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) {
-			q->ordered &= ~QUEUE_ORDERED_BY_TAG;
-			q->ordered |= QUEUE_ORDERED_BY_DRAIN;
+		if (blk_rq_sectors(rq)) {
+			rq->cmd_flags &= ~REQ_HARDBARRIER;
+			return rq;
 		}
+		blk_dequeue_request(rq);
+		__blk_end_request_all(rq, 0);
+		return NULL;
 	}
 
-	/* stash away the original request */
-	blk_dequeue_request(rq);
-	q->orig_bar_rq = rq;
-	rq = NULL;
-
-	/*
-	 * Queue ordered sequence.  As we stack them at the head, we
-	 * need to queue in reverse order.  Note that we rely on that
-	 * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
-	 * request gets inbetween ordered sequence.
-	 */
-	if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
-		rq = &q->post_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_POSTFLUSH;
-
-	if (q->ordered & QUEUE_ORDERED_DO_BAR) {
-		rq = &q->bar_rq;
-
-		/* initialize proxy request and queue it */
-		blk_rq_init(q, rq);
-		if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
-			rq->cmd_flags |= REQ_WRITE;
-		if (q->ordered & QUEUE_ORDERED_DO_FUA)
-			rq->cmd_flags |= REQ_FUA;
-		init_request_from_bio(rq, q->orig_bar_rq->bio);
-		rq->end_io = bar_end_io;
-
-		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	} else
-		skip |= QUEUE_ORDSEQ_BAR;
-
-	if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
-		rq = &q->pre_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_PREFLUSH;
-
-	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
-		rq = NULL;
-	else
-		skip |= QUEUE_ORDSEQ_DRAIN;
+	if (q->ordseq) {
+		/*
+		 * Barrier is already in progress and they can't be
+		 * processed in parallel.  Queue for later processing.
+		 */
+		list_move_tail(&rq->queuelist, &q->pending_barriers);
+		return NULL;
+	}
 
-	*rqp = rq;
 
 	/*
-	 * Complete skipped sequences.  If whole sequence is complete,
-	 * return false to tell elevator that this request is gone.
+	 * Start a new ordered sequence
 	 */
-	return !blk_ordered_complete_seq(q, skip, 0);
-}
-
-bool blk_do_ordered(struct request_queue *q, struct request **rqp)
-{
-	struct request *rq = *rqp;
-	const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
-				(rq->cmd_flags & REQ_HARDBARRIER);
-
-	if (!q->ordseq) {
-		if (!is_barrier)
-			return true;
-
-		if (q->next_ordered != QUEUE_ORDERED_NONE)
-			return start_ordered(q, rqp);
-		else {
-			/*
-			 * Queue ordering not supported.  Terminate
-			 * with prejudice.
-			 */
-			blk_dequeue_request(rq);
-			__blk_end_request_all(rq, -EOPNOTSUPP);
-			*rqp = NULL;
-			return false;
-		}
-	}
+	q->orderr = 0;
+	q->ordseq |= QUEUE_ORDSEQ_STARTED;
 
 	/*
-	 * Ordered sequence in progress
+	 * For an empty barrier, there's no actual BAR request, which
+	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
 	 */
+	if (!blk_rq_sectors(rq))
+		skip |= (QUEUE_ORDSEQ_BAR|QUEUE_ORDSEQ_POSTFLUSH);
+	else if (q->cache_features & QUEUE_HAS_FUA)
+		skip |= QUEUE_ORDSEQ_POSTFLUSH;
 
-	/* Special requests are not subject to ordering rules. */
-	if (rq->cmd_type != REQ_TYPE_FS &&
-	    rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
-		return true;
-
-	if (q->ordered & QUEUE_ORDERED_BY_TAG) {
-		/* Ordered by tag.  Blocking the next barrier is enough. */
-		if (is_barrier && rq != &q->bar_rq)
-			*rqp = NULL;
-	} else {
-		/* Ordered by draining.  Wait for turn. */
-		WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
-		if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
-			*rqp = NULL;
-	}
+	/* stash away the original request */
+	blk_dequeue_request(rq);
+	q->orig_bar_rq = rq;
 
-	return true;
+	/* complete skipped sequences and return the first sequence */
+	return blk_ordered_complete_seq(q, skip, 0);
 }
 
 static void bio_end_empty_barrier(struct bio *bio, int err)
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2010-08-07 12:53:23.774479189 -0400
+++ linux-2.6/include/linux/blkdev.h	2010-08-07 14:51:42.751479190 -0400
@@ -354,13 +354,20 @@ struct request_queue
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	struct blk_trace	*blk_trace;
 #endif
+
+	/*
+	 * Features this queue understands.
+	 */
+	unsigned int		cache_features;
+
 	/*
 	 * reserved for flush operations
 	 */
-	unsigned int		ordered, next_ordered, ordseq;
-	int			orderr, ordcolor;
-	struct request		pre_flush_rq, bar_rq, post_flush_rq;
-	struct request		*orig_bar_rq;
+	unsigned int		ordseq;
+	int			orderr;
+	struct request		bar_rq;
+	struct request          *orig_bar_rq;
+	struct list_head	pending_barriers;
 
 	struct mutex		sysfs_lock;
 
@@ -461,54 +468,12 @@ static inline void queue_flag_clear(unsi
 	__clear_bit(flag, &q->queue_flags);
 }
 
+/*
+ * Possible features to control a volatile write cache.
+ */
 enum {
-	/*
-	 * Hardbarrier is supported with one of the following methods.
-	 *
-	 * NONE		: hardbarrier unsupported
-	 * DRAIN	: ordering by draining is enough
-	 * DRAIN_FLUSH	: ordering by draining w/ pre and post flushes
-	 * DRAIN_FUA	: ordering by draining w/ pre flush and FUA write
-	 * TAG		: ordering by tag is enough
-	 * TAG_FLUSH	: ordering by tag w/ pre and post flushes
-	 * TAG_FUA	: ordering by tag w/ pre flush and FUA write
-	 */
-	QUEUE_ORDERED_BY_DRAIN		= 0x01,
-	QUEUE_ORDERED_BY_TAG		= 0x02,
-	QUEUE_ORDERED_DO_PREFLUSH	= 0x10,
-	QUEUE_ORDERED_DO_BAR		= 0x20,
-	QUEUE_ORDERED_DO_POSTFLUSH	= 0x40,
-	QUEUE_ORDERED_DO_FUA		= 0x80,
-
-	QUEUE_ORDERED_NONE		= 0x00,
-
-	QUEUE_ORDERED_DRAIN		= QUEUE_ORDERED_BY_DRAIN |
-					  QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_DRAIN_FLUSH	= QUEUE_ORDERED_DRAIN |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_DRAIN_FUA		= QUEUE_ORDERED_DRAIN |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_FUA,
-
-	QUEUE_ORDERED_TAG		= QUEUE_ORDERED_BY_TAG |
-					  QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_TAG_FLUSH		= QUEUE_ORDERED_TAG |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_TAG_FUA		= QUEUE_ORDERED_TAG |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_FUA,
-
-	/*
-	 * Ordered operation sequence
-	 */
-	QUEUE_ORDSEQ_STARTED	= 0x01,	/* flushing in progress */
-	QUEUE_ORDSEQ_DRAIN	= 0x02,	/* waiting for the queue to be drained */
-	QUEUE_ORDSEQ_PREFLUSH	= 0x04,	/* pre-flushing in progress */
-	QUEUE_ORDSEQ_BAR	= 0x08,	/* original barrier req in progress */
-	QUEUE_ORDSEQ_POSTFLUSH	= 0x10,	/* post-flushing in progress */
-	QUEUE_ORDSEQ_DONE	= 0x20,
+	QUEUE_HAS_FLUSH		= 1 << 0,	/* supports REQ_FLUSH */
+	QUEUE_HAS_FUA		= 1 << 1,	/* supports REQ_FUA */
 };
 
 #define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -879,11 +844,9 @@ extern void blk_queue_softirq_done(struc
 extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern int blk_queue_ordered(struct request_queue *, unsigned);
-extern bool blk_do_ordered(struct request_queue *, struct request **);
+extern int blk_queue_cache_features(struct request_queue *, unsigned);
 extern unsigned blk_ordered_cur_seq(struct request_queue *);
 extern unsigned blk_ordered_req_seq(struct request *);
-extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
 extern void blk_dump_rq_flags(struct request *, char *);
Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2010-08-07 12:53:23.800479189 -0400
+++ linux-2.6/drivers/block/virtio_blk.c	2010-08-07 14:51:34.198479189 -0400
@@ -388,31 +388,8 @@ static int __devinit virtblk_probe(struc
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;
 
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
-		/*
-		 * If the FLUSH feature is supported we do have support for
-		 * flushing a volatile write cache on the host.  Use that
-		 * to implement write barrier support.
-		 */
-		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
-	} else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
-		/*
-		 * If the BARRIER feature is supported the host expects us
-		 * to order request by tags.  This implies there is not
-		 * volatile write cache on the host, and that the host
-		 * never re-orders outstanding I/O.  This feature is not
-		 * useful for real life scenarious and deprecated.
-		 */
-		blk_queue_ordered(q, QUEUE_ORDERED_TAG);
-	} else {
-		/*
-		 * If the FLUSH feature is not supported we must assume that
-		 * the host does not perform any kind of volatile write
-		 * caching. We still need to drain the queue to provider
-		 * proper barrier semantics.
-		 */
-		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN);
-	}
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+		blk_queue_cache_features(q, QUEUE_HAS_FLUSH);
 
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
Index: linux-2.6/drivers/scsi/sd.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.c	2010-08-07 12:53:23.872479189 -0400
+++ linux-2.6/drivers/scsi/sd.c	2010-08-07 14:54:47.812479189 -0400
@@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gen
 	struct scsi_disk *sdkp = scsi_disk(disk);
 	struct scsi_device *sdp = sdkp->device;
 	unsigned char *buffer;
-	unsigned ordered;
+	unsigned ordered = 0;
 
 	SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
 				      "sd_revalidate_disk\n"));
@@ -2151,17 +2151,14 @@ static int sd_revalidate_disk(struct gen
 
 	/*
 	 * We now have all cache related info, determine how we deal
-	 * with ordered requests.  Note that as the current SCSI
-	 * dispatch function can alter request order, we cannot use
-	 * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+	 * with barriers.
 	 */
-	if (sdkp->WCE)
-		ordered = sdkp->DPOFUA
-			? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
-	else
-		ordered = QUEUE_ORDERED_DRAIN;
-
-	blk_queue_ordered(sdkp->disk->queue, ordered);
+	if (sdkp->WCE) {
+		ordered |= QUEUE_HAS_FLUSH;
+		if (sdkp->DPOFUA)
+			ordered |= QUEUE_HAS_FUA;
+	}
+	blk_queue_cache_features(sdkp->disk->queue, ordered);
 
 	set_capacity(disk, sdkp->capacity);
 	kfree(buffer);
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2010-08-07 12:53:23.744479189 -0400
+++ linux-2.6/block/blk-core.c	2010-08-07 14:56:35.087479189 -0400
@@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_no
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
+	INIT_LIST_HEAD(&q->pending_barriers);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1037,22 +1038,6 @@ void blk_insert_request(struct request_q
 }
 EXPORT_SYMBOL(blk_insert_request);
 
-/*
- * add-request adds a request to the linked list.
- * queue lock is held and interrupts disabled, as we muck with the
- * request queue list.
- */
-static inline void add_request(struct request_queue *q, struct request *req)
-{
-	drive_stat_acct(req, 1);
-
-	/*
-	 * elevator indicated where it wants this request to be
-	 * inserted at elevator_merge time
-	 */
-	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
-}
-
 static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
@@ -1201,13 +1186,9 @@ static int __make_request(struct request
 	const bool sync = (bio->bi_rw & REQ_SYNC);
 	const bool unplug = (bio->bi_rw & REQ_UNPLUG);
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+	int where = ELEVATOR_INSERT_SORT;
 	int rw_flags;
 
-	if ((bio->bi_rw & REQ_HARDBARRIER) &&
-	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
-		bio_endio(bio, -EOPNOTSUPP);
-		return 0;
-	}
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1217,7 +1198,12 @@ static int __make_request(struct request
 
 	spin_lock_irq(q->queue_lock);
 
-	if (unlikely((bio->bi_rw & REQ_HARDBARRIER)) || elv_queue_empty(q))
+	if (bio->bi_rw & REQ_HARDBARRIER) {
+		where = ELEVATOR_INSERT_ORDERED;
+		goto get_rq;
+	}
+
+	if (elv_queue_empty(q))
 		goto get_rq;
 
 	el_ret = elv_merge(q, &req, bio);
@@ -1314,7 +1300,10 @@ get_rq:
 		req->cpu = blk_cpu_to_group(smp_processor_id());
 	if (queue_should_plug(q) && elv_queue_empty(q))
 		blk_plug_device(q);
-	add_request(q, req);
+
+	/* insert the request into the elevator */
+	drive_stat_acct(req, 1);
+	__elv_add_request(q, req, where, 0);
 out:
 	if (unplug || !queue_should_plug(q))
 		__generic_unplug_device(q);
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c	2010-08-07 12:53:23.752479189 -0400
+++ linux-2.6/block/elevator.c	2010-08-07 12:53:53.162479190 -0400
@@ -564,7 +564,7 @@ void elv_requeue_request(struct request_
 
 	rq->cmd_flags &= ~REQ_STARTED;
 
-	elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE);
+	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }
 
 void elv_drain_elevator(struct request_queue *q)
@@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queu
 
 void elv_insert(struct request_queue *q, struct request *rq, int where)
 {
-	struct list_head *pos;
-	unsigned ordseq;
 	int unplug_it = 1;
 
 	trace_block_rq_insert(q, rq);
@@ -622,10 +620,14 @@ void elv_insert(struct request_queue *q,
 	switch (where) {
 	case ELEVATOR_INSERT_FRONT:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-
 		list_add(&rq->queuelist, &q->queue_head);
 		break;
 
+	case ELEVATOR_INSERT_ORDERED:
+		rq->cmd_flags |= REQ_SOFTBARRIER;
+		list_add_tail(&rq->queuelist, &q->queue_head);
+		break;
+
 	case ELEVATOR_INSERT_BACK:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
 		elv_drain_elevator(q);
@@ -662,36 +664,6 @@ void elv_insert(struct request_queue *q,
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;
 
-	case ELEVATOR_INSERT_REQUEUE:
-		/*
-		 * If ordered flush isn't in progress, we do front
-		 * insertion; otherwise, requests should be requeued
-		 * in ordseq order.
-		 */
-		rq->cmd_flags |= REQ_SOFTBARRIER;
-
-		/*
-		 * Most requeues happen because of a busy condition,
-		 * don't force unplug of the queue for that case.
-		 */
-		unplug_it = 0;
-
-		if (q->ordseq == 0) {
-			list_add(&rq->queuelist, &q->queue_head);
-			break;
-		}
-
-		ordseq = blk_ordered_req_seq(rq);
-
-		list_for_each(pos, &q->queue_head) {
-			struct request *pos_rq = list_entry_rq(pos);
-			if (ordseq <= blk_ordered_req_seq(pos_rq))
-				break;
-		}
-
-		list_add_tail(&rq->queuelist, pos);
-		break;
-
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -710,33 +682,15 @@ void elv_insert(struct request_queue *q,
 void __elv_add_request(struct request_queue *q, struct request *rq, int where,
 		       int plug)
 {
-	if (q->ordcolor)
-		rq->cmd_flags |= REQ_ORDERED_COLOR;
-
 	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
-		/*
-		 * toggle ordered color
-		 */
-		if (rq->cmd_flags & REQ_HARDBARRIER)
-			q->ordcolor ^= 1;
-
-		/*
-		 * barriers implicitly indicate back insertion
-		 */
-		if (where == ELEVATOR_INSERT_SORT)
-			where = ELEVATOR_INSERT_BACK;
-
-		/*
-		 * this request is scheduling boundary, update
-		 * end_sector
-		 */
+		/* barriers are scheduling boundary, update end_sector */
 		if (rq->cmd_type == REQ_TYPE_FS ||
 		    (rq->cmd_flags & REQ_DISCARD)) {
 			q->end_sector = rq_end_sector(rq);
 			q->boundary_rq = rq;
 		}
 	} else if (!(rq->cmd_flags & REQ_ELVPRIV) &&
-		    where == ELEVATOR_INSERT_SORT)
+		   where == ELEVATOR_INSERT_SORT)
 		where = ELEVATOR_INSERT_BACK;
 
 	if (plug)
@@ -849,24 +803,6 @@ void elv_completed_request(struct reques
 		    e->ops->elevator_completed_req_fn)
 			e->ops->elevator_completed_req_fn(q, rq);
 	}
-
-	/*
-	 * Check if the queue is waiting for fs requests to be
-	 * drained for flush sequence.
-	 */
-	if (unlikely(q->ordseq)) {
-		struct request *next = NULL;
-
-		if (!list_empty(&q->queue_head))
-			next = list_entry_rq(q->queue_head.next);
-
-		if (!queue_in_flight(q) &&
-		    blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN &&
-		    (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) {
-			blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0);
-			__blk_run_queue(q);
-		}
-	}
 }
 
 #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
Index: linux-2.6/block/blk.h
===================================================================
--- linux-2.6.orig/block/blk.h	2010-08-07 12:53:23.762479189 -0400
+++ linux-2.6/block/blk.h	2010-08-07 12:53:53.171479190 -0400
@@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
@@ -58,7 +60,8 @@ static inline struct request *__elv_next
 	while (1) {
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (blk_do_ordered(q, &rq))
+			rq = blk_do_ordered(q, rq);
+			if (rq)
 				return rq;
 		}
 
Index: linux-2.6/drivers/block/xen-blkfront.c
===================================================================
--- linux-2.6.orig/drivers/block/xen-blkfront.c	2010-08-07 12:53:23.807479189 -0400
+++ linux-2.6/drivers/block/xen-blkfront.c	2010-08-07 14:44:39.564479189 -0400
@@ -417,30 +417,6 @@ static int xlvbd_init_blk_queue(struct g
 	return 0;
 }
 
-
-static int xlvbd_barrier(struct blkfront_info *info)
-{
-	int err;
-	const char *barrier;
-
-	switch (info->feature_barrier) {
-	case QUEUE_ORDERED_DRAIN:	barrier = "enabled (drain)"; break;
-	case QUEUE_ORDERED_TAG:		barrier = "enabled (tag)"; break;
-	case QUEUE_ORDERED_NONE:	barrier = "disabled"; break;
-	default:			return -EINVAL;
-	}
-
-	err = blk_queue_ordered(info->rq, info->feature_barrier);
-
-	if (err)
-		return err;
-
-	printk(KERN_INFO "blkfront: %s: barriers %s\n",
-	       info->gd->disk_name, barrier);
-	return 0;
-}
-
-
 static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
 			       struct blkfront_info *info,
 			       u16 vdisk_info, u16 sector_size)
@@ -516,8 +492,6 @@ static int xlvbd_alloc_gendisk(blkif_sec
 	info->rq = gd->queue;
 	info->gd = gd;
 
-	xlvbd_barrier(info);
-
 	if (vdisk_info & VDISK_READONLY)
 		set_disk_ro(gd, 1);
 
@@ -662,8 +636,6 @@ static irqreturn_t blkif_interrupt(int i
 				printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
 				       info->gd->disk_name);
 				error = -EOPNOTSUPP;
-				info->feature_barrier = QUEUE_ORDERED_NONE;
-				xlvbd_barrier(info);
 			}
 			/* fall through */
 		case BLKIF_OP_READ:
@@ -1073,24 +1045,6 @@ static void blkfront_connect(struct blkf
 			    "feature-barrier", "%lu", &barrier,
 			    NULL);
 
-	/*
-	 * If there's no "feature-barrier" defined, then it means
-	 * we're dealing with a very old backend which writes
-	 * synchronously; draining will do what needs to get done.
-	 *
-	 * If there are barriers, then we can do full queued writes
-	 * with tagged barriers.
-	 *
-	 * If barriers are not supported, then there's no much we can
-	 * do, so just set ordering to NONE.
-	 */
-	if (err)
-		info->feature_barrier = QUEUE_ORDERED_DRAIN;
-	else if (barrier)
-		info->feature_barrier = QUEUE_ORDERED_TAG;
-	else
-		info->feature_barrier = QUEUE_ORDERED_NONE;
-
 	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
 	if (err) {
 		xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
Index: linux-2.6/drivers/ide/ide-disk.c
===================================================================
--- linux-2.6.orig/drivers/ide/ide-disk.c	2010-08-07 12:53:23.889479189 -0400
+++ linux-2.6/drivers/ide/ide-disk.c	2010-08-07 15:00:30.215479189 -0400
@@ -518,12 +518,13 @@ static int ide_do_setfeature(ide_drive_t
 
 static void update_ordered(ide_drive_t *drive)
 {
-	u16 *id = drive->id;
-	unsigned ordered = QUEUE_ORDERED_NONE;
+	unsigned ordered = 0;
 
 	if (drive->dev_flags & IDE_DFLAG_WCACHE) {
+		u16 *id = drive->id;
 		unsigned long long capacity;
 		int barrier;
+
 		/*
 		 * We must avoid issuing commands a drive does not
 		 * understand or we may crash it. We check flush cache
@@ -543,13 +544,18 @@ static void update_ordered(ide_drive_t *
 		       drive->name, barrier ? "" : "not ");
 
 		if (barrier) {
-			ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+			printk(KERN_INFO "%s: cache flushes supported\n",
+				drive->name);
 			blk_queue_prep_rq(drive->queue, idedisk_prep_fn);
+			ordered |= QUEUE_HAS_FLUSH;
+		} else {
+			printk(KERN_INFO
+				"%s: WARNING: cache flushes not supported\n",
+				drive->name);
 		}
-	} else
-		ordered = QUEUE_ORDERED_DRAIN;
+	}
 
-	blk_queue_ordered(drive->queue, ordered);
+	blk_queue_cache_features(drive->queue, ordered);
 }
 
 ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE);
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c	2010-08-07 12:53:23.905479189 -0400
+++ linux-2.6/drivers/md/dm.c	2010-08-07 14:51:38.240479189 -0400
@@ -1908,7 +1908,7 @@ static struct mapped_device *alloc_dev(i
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH);
+	blk_queue_cache_features(md->queue, QUEUE_HAS_FLUSH);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
Index: linux-2.6/drivers/mmc/card/queue.c
===================================================================
--- linux-2.6.orig/drivers/mmc/card/queue.c	2010-08-07 12:53:23.927479189 -0400
+++ linux-2.6/drivers/mmc/card/queue.c	2010-08-07 14:30:09.666479189 -0400
@@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq,
 	mq->req = NULL;
 
 	blk_queue_prep_rq(mq->queue, mmc_prep_request);
-	blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
 
 #ifdef CONFIG_MMC_BLOCK_BOUNCE
Index: linux-2.6/drivers/s390/block/dasd.c
===================================================================
--- linux-2.6.orig/drivers/s390/block/dasd.c	2010-08-07 12:53:23.939479189 -0400
+++ linux-2.6/drivers/s390/block/dasd.c	2010-08-07 14:30:13.307479189 -0400
@@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd
 	 */
 	blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
 	blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
-	blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN);
 }
 
 /*
Index: linux-2.6/include/linux/elevator.h
===================================================================
--- linux-2.6.orig/include/linux/elevator.h	2010-08-07 12:53:23.781479189 -0400
+++ linux-2.6/include/linux/elevator.h	2010-08-07 12:53:53.208479190 -0400
@@ -162,9 +162,9 @@ extern struct request *elv_rb_find(struc
  * Insertion selection
  */
 #define ELEVATOR_INSERT_FRONT	1
-#define ELEVATOR_INSERT_BACK	2
-#define ELEVATOR_INSERT_SORT	3
-#define ELEVATOR_INSERT_REQUEUE	4
+#define ELEVATOR_INSERT_ORDERED	2
+#define ELEVATOR_INSERT_BACK	3
+#define ELEVATOR_INSERT_SORT	4
 
 /*
  * return values from elevator_may_queue_fn
Index: linux-2.6/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.orig/drivers/block/pktcdvd.c	2010-08-07 12:53:23.815479189 -0400
+++ linux-2.6/drivers/block/pktcdvd.c	2010-08-07 12:53:53.211479190 -0400
@@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pkt
 
 	rq->timeout = 60*HZ;
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
-	rq->cmd_flags |= REQ_HARDBARRIER;
 	if (cgc->quiet)
 		rq->cmd_flags |= REQ_QUIET;
 
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c	2010-08-07 12:53:23.825479189 -0400
+++ linux-2.6/drivers/block/brd.c	2010-08-07 14:26:12.293479191 -0400
@@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int
 	if (!brd->brd_queue)
 		goto out_free_dev;
 	blk_queue_make_request(brd->brd_queue, brd_make_request);
-	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG);
 	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
 	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
 
Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c	2010-08-07 12:53:23.836479189 -0400
+++ linux-2.6/drivers/block/loop.c	2010-08-07 14:51:27.937479189 -0400
@@ -831,8 +831,8 @@ static int loop_set_fd(struct loop_devic
 	lo->lo_queue->queuedata = lo;
 	lo->lo_queue->unplug_fn = loop_unplug;
 
-	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN);
+	/* XXX(hch): loop can't properly deal with flush requests currently */
+//	blk_queue_cache_features(lo->lo_queue, QUEUE_HAS_FLUSH);
 
 	set_capacity(lo->lo_disk, size);
 	bd_set_size(bdev, size << 9);
Index: linux-2.6/drivers/block/osdblk.c
===================================================================
--- linux-2.6.orig/drivers/block/osdblk.c	2010-08-07 12:53:23.843479189 -0400
+++ linux-2.6/drivers/block/osdblk.c	2010-08-07 14:51:30.091479189 -0400
@@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdbl
 	blk_queue_stack_limits(q, osd_request_queue(osdev->osd));
 
 	blk_queue_prep_rq(q, blk_queue_start_tag);
-	blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
+	blk_queue_cache_features(q, QUEUE_HAS_FLUSH);
 
 	disk->queue = q;
 
Index: linux-2.6/drivers/block/ps3disk.c
===================================================================
--- linux-2.6.orig/drivers/block/ps3disk.c	2010-08-07 12:53:23.859479189 -0400
+++ linux-2.6/drivers/block/ps3disk.c	2010-08-07 14:51:32.204479189 -0400
@@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struc
 	blk_queue_dma_alignment(queue, dev->blk_size-1);
 	blk_queue_logical_block_size(queue, dev->blk_size);
 
-	blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH);
+	blk_queue_cache_features(queue, QUEUE_HAS_FLUSH);
 
 	blk_queue_max_segments(queue, -1);
 	blk_queue_max_segment_size(queue, dev->bounce_size);
Index: linux-2.6/include/linux/blk_types.h
===================================================================
--- linux-2.6.orig/include/linux/blk_types.h	2010-08-07 12:53:23.793479189 -0400
+++ linux-2.6/include/linux/blk_types.h	2010-08-07 12:53:53.243479190 -0400
@@ -141,7 +141,6 @@ enum rq_flag_bits {
 	__REQ_FAILED,		/* set if the request failed */
 	__REQ_QUIET,		/* don't worry about errors */
 	__REQ_PREEMPT,		/* set for "ide_preempt" requests */
-	__REQ_ORDERED_COLOR,	/* is before or after barrier */
 	__REQ_ALLOCED,		/* request came from our alloc pool */
 	__REQ_COPY_USER,	/* contains copies of user pages */
 	__REQ_INTEGRITY,	/* integrity metadata has been remapped */
@@ -181,7 +180,6 @@ enum rq_flag_bits {
 #define REQ_FAILED		(1 << __REQ_FAILED)
 #define REQ_QUIET		(1 << __REQ_QUIET)
 #define REQ_PREEMPT		(1 << __REQ_PREEMPT)
-#define REQ_ORDERED_COLOR	(1 << __REQ_ORDERED_COLOR)
 #define REQ_ALLOCED		(1 << __REQ_ALLOCED)
 #define REQ_COPY_USER		(1 << __REQ_COPY_USER)
 #define REQ_INTEGRITY		(1 << __REQ_INTEGRITY)
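
For illustration, a minimal sketch of how a driver would advertise its
cache capabilities with the blk_queue_cache_features() interface
introduced above; the QUEUE_HAS_FUA flag is assumed from the patch
description in the follow-up mail, and the driver and device fields
are hypothetical:

	static void exdisk_setup_queue(struct request_queue *q,
				       struct exdisk_device *dev)
	{
		unsigned features = 0;

		/* only devices with a volatile write cache need flushes */
		if (dev->volatile_write_cache)
			features |= QUEUE_HAS_FLUSH;

		/* FUA forces individual writes to non-volatile media */
		if (dev->supports_fua)
			features |= QUEUE_HAS_FUA;

		/*
		 * Passing 0 means barriers degenerate to no-ops instead
		 * of full queue drains, which is the point of the
		 * relaxed semantics.
		 */
		blk_queue_cache_features(q, features);
	}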

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH REPOST RFC] relaxed barriers
  2010-08-08 14:31         ` Christoph Hellwig
@ 2010-08-09 14:50           ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-08-09 14:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi,
	tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel,
	linux-raid

On 08/08/2010 04:31 PM, Christoph Hellwig wrote:
> On Sat, Aug 07, 2010 at 12:13:06PM +0200, Tejun Heo wrote:
>> The patch was on top of v2.6.35 but was generated against dirty tree
>> and wouldn't apply cleanly.  Here's the proper one.
> 
> Here's an updated version:
> 
>  (a) ported to Jens' current block tree
>  (b) optimize barriers to be no-ops on devices not requiring flushes
>  (c) redo the blk_queue_ordered interface to just set QUEUE_HAS_FLUSH
>      and QUEUE_HAS_FUA flags.

Nice.  I'm working on a properly split patchset implementing
REQ_FLUSH/FUA based interface, which replaces REQ_HARDBARRIER.  Empty
request w/ REQ_FLUSH just flushes cache but has no other ordering
restrictions.  REQ_FLUSH + data means preflush + data write.  REQ_FUA
+ data means data would be committed to NV media on completion.
REQ_FLUSH + FUA + data means preflush + NV data write.  All FLUSH/FUA
requests w/ data are ordered only against each other.  I think I'll be
able to post in several days.
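
As a sketch of what that would look like from a filesystem, assuming
the flag names above can be passed directly in the bio rw argument
(the helper names here are illustrative, not an existing API):

	/* empty flush: flush the device cache, no ordering implied */
	static void ex_issue_cache_flush(struct block_device *bdev,
					 bio_end_io_t *end_io)
	{
		struct bio *bio = bio_alloc(GFP_NOIO, 0);

		bio->bi_bdev = bdev;
		bio->bi_end_io = end_io;
		submit_bio(WRITE | REQ_FLUSH, bio);
	}

	/*
	 * Journal commit block: preflush so earlier writes are stable
	 * before the commit block, FUA so the commit block itself is
	 * on non-volatile media when the bio completes.
	 */
	static void ex_write_commit_block(struct bio *bio)
	{
		submit_bio(WRITE | REQ_FLUSH | REQ_FUA, bio);
	}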

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-05  2:16           ` Jun'ichi Nomura
@ 2010-08-26 22:50             ` Mike Snitzer
  2010-08-27  0:40               ` Mike Snitzer
  2010-08-27  1:43               ` Jun'ichi Nomura
  0 siblings, 2 replies; 155+ messages in thread
From: Mike Snitzer @ 2010-08-26 22:50 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe,
	linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj,
	tytso, swhiteho, chris.mason, dm-devel

On Wed, Aug 04 2010 at 10:16pm -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi Christoph,
> 
> (08/04/10 17:54), Christoph Hellwig wrote:
> > On Wed, Aug 04, 2010 at 01:57:37PM +0900, Kiyoshi Ueda wrote:
> >>> -		if (unlikely(dm_rq_is_flush_request(rq))) {
> >>> +		if (rq->cmd_flags & REQ_FLUSH) {
> >>>  			BUG_ON(md->flush_request);
> >>>  			md->flush_request = rq;
> >>>  			blk_start_request(rq);
> >>
> >> Current request-based device-mapper's flush code depends on
> >> the block-layer's barrier behavior which dispatches only one request
> >> at a time when flush is needed.
> >> In other words, current request-based device-mapper can't handle
> >> other requests while a flush request is in progress.
> >>
> >> I'll take a look at how I can fix the request-based device-mapper to
> >> cope with it.  I think it'll take time for careful investigation.
> > 
> > Given that request based device mapper doesn't even look at the
> > block numbers from what I can see just removing any special casing
> > for REQ_FLUSH should probably do it.
> 
> Special casing is necessary because device-mapper may have to
> send multiple copies of a REQ_FLUSH request to multiple
> targets, while a normal request is just sent to a single target.

Yes, request-based DM is meant to have all the same capabilities as
bio-based DM.  So in theory it should support multiple targets but in
practice it doesn't.  DM's multipath target is the only consumer of
request-based DM and it only ever clones a single flush request
(num_flush_requests = 1).
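
For reference, a target declares that capability in its constructor,
roughly as below; the target itself is hypothetical, but
num_flush_requests is the real field under discussion:

	static int ex_target_ctr(struct dm_target *ti,
				 unsigned argc, char **argv)
	{
		/*
		 * One flush clone per flush request, as multipath does.
		 * A target spanning N devices would set this to N and
		 * receive N clones of each flush.
		 */
		ti->num_flush_requests = 1;

		return 0;
	}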

So why not remove all of request-based DM's barrier infrastructure and
simply rely on the revised block layer to sequence the FLUSH+WRITE
request for request-based DM?

Given that we do not have a request-based DM target that requires
cloning multiple FLUSH requests, it's unused code that is delaying DM
support for the new FLUSH+FUA work (NOTE: bio-based DM obviously still
needs work in this area).

Once we have a need for using request-based DM for something other than
multipath we can take a fresh look at implementing rq-based FLUSH+FUA.

Mike

p.s. I know how hard NEC worked on request-based DM's barrier support;
so I'm not suggesting this lightly.  For me it just seems like we're
carrying complexity in DM that hasn't ever been required.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-26 22:50             ` Mike Snitzer
@ 2010-08-27  0:40               ` Mike Snitzer
  2010-08-27  1:20                 ` Jamie Lokier
  2010-08-27  1:43               ` Jun'ichi Nomura
  1 sibling, 1 reply; 155+ messages in thread
From: Mike Snitzer @ 2010-08-27  0:40 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe,
	linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj,
	tytso, swhiteho, chris.mason, dm-devel

On Thu, Aug 26 2010 at  6:50pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> Once we have a need for using request-based DM for something other than
> multipath we can take a fresh look at implementing rq-based FLUSH+FUA.
> 
> Mike
> 
> p.s. I know how hard NEC worked on request-based DM's barrier support;
> so I'm not suggesting this lightly.  For me it just seems like we're
> carrying complexity in DM that hasn't ever been required.

To be clear: the piece that I was saying wasn't required is the need
for request-based DM to clone a FLUSH to send to multiple targets
(saying as much was just a confusing distraction.. please ignore that).

Anyway, my previous email's question still stands.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-27  0:40               ` Mike Snitzer
@ 2010-08-27  1:20                 ` Jamie Lokier
  0 siblings, 0 replies; 155+ messages in thread
From: Jamie Lokier @ 2010-08-27  1:20 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jun'ichi Nomura, Christoph Hellwig, Kiyoshi Ueda, Jan Kara,
	linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley,
	konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel

Mike Snitzer wrote:
> On Thu, Aug 26 2010 at  6:50pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > Once we have a need for using request-based DM for something other than
> > multipath we can take a fresh look at implementing rq-based FLUSH+FUA.
> > 
> > Mike
> > 
> > p.s. I know how hard NEC worked on request-based DM's barrier support;
> > so I'm not suggesting this lightly.  For me it just seems like we're
> > carrying complexity in DM that hasn't ever been required.
> 
> To be clear: the piece that I was saying wasn't required is the need
> for request-based DM to clone a FLUSH to send to multiple targets
> (saying as much was just a confusing distraction.. please ignore that).
> 
> Anyway, my previous email's question still stands.

On a slightly related note: DM suggests a reason for the lower layer, or the
request queues, to implement the trivial optimisation of discarding
FLUSHes if there's been no WRITE since the previous FLUSH.

That was mentioned elsewhere in this big thread as not being worth
even the small effort - because the filesystem is able to make good
decisions anyway.

But once you have something like RAID or striping, it's quite common
for the filesystem to issue a FLUSH when only a subset of the target
devices have received WRITEs through the RAID/striping layer since
they last received a FLUSH.
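
A minimal sketch of that bookkeeping, assuming a per-member dirty flag
maintained by the hypothetical RAID/striping layer:

	struct ex_member {
		struct block_device	*bdev;
		atomic_t		dirty;	/* writes since last flush */
	};

	/* called when a WRITE is issued to this member */
	static void ex_member_write(struct ex_member *m)
	{
		atomic_set(&m->dirty, 1);
	}

	/*
	 * Called when the upper layer wants to forward a FLUSH; claims
	 * the dirty flag, so a member that saw no writes since its last
	 * flush is skipped.  On flush failure the flag would have to be
	 * set again.
	 */
	static bool ex_member_needs_flush(struct ex_member *m)
	{
		return atomic_xchg(&m->dirty, 0) != 0;
	}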

-- Jamie

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-26 22:50             ` Mike Snitzer
  2010-08-27  0:40               ` Mike Snitzer
@ 2010-08-27  1:43               ` Jun'ichi Nomura
  2010-08-27  4:08                 ` Mike Snitzer
  1 sibling, 1 reply; 155+ messages in thread
From: Jun'ichi Nomura @ 2010-08-27  1:43 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe,
	linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj,
	tytso, swhiteho, chris.mason, dm-devel

Hi Mike,

(08/27/10 07:50), Mike Snitzer wrote:
>> Special casing is necessary because device-mapper may have to
>> send multiple copies of a REQ_FLUSH request to multiple
>> targets, while a normal request is just sent to a single target.
> 
> Yes, request-based DM is meant to have all the same capabilities as
> bio-based DM.  So in theory it should support multiple targets but in
> practice it doesn't.  DM's multipath target is the only consumer of
> request-based DM and it only ever clones a single flush request
> (num_flush_requests = 1).

This is correct. But,

> So why not remove all of request-based DM's barrier infrastructure and
> simply rely on the revised block layer to sequence the FLUSH+WRITE
> request for request-based DM?
> 
> Given that we do not have a request-based DM target that requires
> > cloning multiple FLUSH requests, it's unused code that is delaying DM
> support for the new FLUSH+FUA work (NOTE: bio-based DM obviously still
> needs work in this area).

the above-mentioned 'special casing' is not the hard part.
See the attached patch.

The hard part is discerning the error type for flush failure
as discussed in the other thread.
And as Kiyoshi wrote, that's an existing problem, so it can
be worked on as a separate issue from the new FLUSH work.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation


Cope with new sequencing of flush requests in the block layer.

Request-based dm used to depend on the barrier sequencing in the
block layer, which guaranteed that no other requests were in flight
when a flush request was dispatched. So it reused the md->pending
counter for checking completion of cloned flush requests.

This patch introduces a separate pending counter for flush requests
as a preparation for the new FLUSH work, where a flush request can be
dispatched while other normal requests are in flight.

Index: linux-2.6.36-rc2/drivers/md/dm.c
===================================================================
--- linux-2.6.36-rc2.orig/drivers/md/dm.c
+++ linux-2.6.36-rc2/drivers/md/dm.c
@@ -162,6 +162,7 @@ struct mapped_device {
 
 	/* A pointer to the currently processing pre/post flush request */
 	struct request *flush_request;
+	atomic_t flush_pending;
 
 	/*
 	 * The current mapping.
@@ -777,10 +778,16 @@ static void store_barrier_error(struct m
  * the md may be freed in dm_put() at the end of this function.
  * Or do dm_get() before calling this function and dm_put() later.
  */
-static void rq_completed(struct mapped_device *md, int rw, int run_queue)
+static void rq_completed(struct mapped_device *md, int rw, int run_queue, bool is_flush)
 {
 	atomic_dec(&md->pending[rw]);
 
+	if (is_flush) {
+		atomic_dec(&md->flush_pending);
+		if (!atomic_read(&md->flush_pending))
+			wake_up(&md->wait);
+	}
+
 	/* nudge anyone waiting on suspend queue */
 	if (!md_in_flight(md))
 		wake_up(&md->wait);
@@ -837,7 +844,7 @@ static void dm_end_request(struct reques
 	} else
 		blk_end_request_all(rq, error);
 
-	rq_completed(md, rw, run_queue);
+	rq_completed(md, rw, run_queue, is_barrier);
 }
 
 static void dm_unprep_request(struct request *rq)
@@ -880,7 +887,7 @@ void dm_requeue_unmapped_request(struct 
 	blk_requeue_request(q, rq);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	rq_completed(md, rw, 0);
+	rq_completed(md, rw, 0, false);
 }
 EXPORT_SYMBOL_GPL(dm_requeue_unmapped_request);
 
@@ -1993,6 +2000,7 @@ static struct mapped_device *alloc_dev(i
 
 	atomic_set(&md->pending[0], 0);
 	atomic_set(&md->pending[1], 0);
+	atomic_set(&md->flush_pending, 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
 	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
@@ -2375,7 +2383,7 @@ void dm_put(struct mapped_device *md)
 }
 EXPORT_SYMBOL_GPL(dm_put);
 
-static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
+static int dm_wait_for_completion(struct mapped_device *md, int interruptible, bool for_flush)
 {
 	int r = 0;
 	DECLARE_WAITQUEUE(wait, current);
@@ -2388,6 +2396,8 @@ static int dm_wait_for_completion(struct
 		set_current_state(interruptible);
 
 		smp_mb();
+		if (for_flush && !atomic_read(&md->flush_pending))
+			break;
 		if (!md_in_flight(md))
 			break;
 
@@ -2408,14 +2418,14 @@ static int dm_wait_for_completion(struct
 
 static void dm_flush(struct mapped_device *md)
 {
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
+	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE, false);
 
 	bio_init(&md->barrier_bio);
 	md->barrier_bio.bi_bdev = md->bdev;
 	md->barrier_bio.bi_rw = WRITE_BARRIER;
 	__split_and_process_bio(md, &md->barrier_bio);
 
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
+	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE, false);
 }
 
 static void process_barrier(struct mapped_device *md, struct bio *bio)
@@ -2512,11 +2522,12 @@ static int dm_rq_barrier(struct mapped_d
 			clone = clone_rq(md->flush_request, md, GFP_NOIO);
 			dm_rq_set_target_request_nr(clone, j);
 			atomic_inc(&md->pending[rq_data_dir(clone)]);
+			atomic_inc(&md->flush_pending);
 			map_request(ti, clone, md);
 		}
 	}
 
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
+	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE, true);
 	dm_table_put(map);
 
 	return md->barrier_error;
@@ -2705,7 +2716,7 @@ int dm_suspend(struct mapped_device *md,
 	 * We call dm_wait_for_completion to wait for all existing requests
 	 * to finish.
 	 */
-	r = dm_wait_for_completion(md, TASK_INTERRUPTIBLE);
+	r = dm_wait_for_completion(md, TASK_INTERRUPTIBLE, false);
 
 	down_write(&md->io_lock);
 	if (noflush)
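
The counting in the patch is the standard atomic counter plus
waitqueue pattern; stripped of the DM specifics, the idea is roughly
as follows (names hypothetical, using the idiomatic dec_and_test
form rather than the patch's decrement-then-read):

	static atomic_t ex_flush_pending = ATOMIC_INIT(0);
	static DECLARE_WAIT_QUEUE_HEAD(ex_flush_waitq);

	/* completion path of each cloned flush */
	static void ex_flush_clone_done(void)
	{
		if (atomic_dec_and_test(&ex_flush_pending))
			wake_up(&ex_flush_waitq);
	}

	/* dispatch nr clones, then wait for all of them to complete */
	static void ex_dispatch_flush_clones(unsigned nr)
	{
		atomic_add(nr, &ex_flush_pending);
		/* ... map and submit the nr clones here ... */
		wait_event(ex_flush_waitq, !atomic_read(&ex_flush_pending));
	}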

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-27  1:43               ` Jun'ichi Nomura
@ 2010-08-27  4:08                 ` Mike Snitzer
  2010-08-27  5:52                   ` Jun'ichi Nomura
  0 siblings, 1 reply; 155+ messages in thread
From: Mike Snitzer @ 2010-08-27  4:08 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe,
	linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj,
	tytso, swhiteho, chris.mason, dm-devel

On Thu, Aug 26 2010 at  9:43pm -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi Mike,
> 
> (08/27/10 07:50), Mike Snitzer wrote:
> >> Special casing is necessary because device-mapper may have to
> >> send multiple copies of a REQ_FLUSH request to multiple
> >> targets, while a normal request is just sent to a single target.
> > 
> > Yes, request-based DM is meant to have all the same capabilities as
> > bio-based DM.  So in theory it should support multiple targets but in
> > practice it doesn't.  DM's multipath target is the only consumer of
> > request-based DM and it only ever clones a single flush request
> > (num_flush_requests = 1).
> 
> This is correct. But,
> 
> > So why not remove all of request-based DM's barrier infrastructure and
> > simply rely on the revised block layer to sequence the FLUSH+WRITE
> > request for request-based DM?
> > 
> > Given that we do not have a request-based DM target that requires
> > cloning multiple FLUSH requests, it's unused code that is delaying DM
> > support for the new FLUSH+FUA work (NOTE: bio-based DM obviously still
> > needs work in this area).
> 
> the above-mentioned 'special casing' is not the hard part.
> See the attached patch.

Yes, Tejun suggested something like this in one of the threads.  Thanks
for implementing it.

But do you agree that the request-based barrier code (added in commit
d0bcb8786) could be reverted given the new FLUSH work?

We no longer need waiting now that ordering isn't a concern.  Especially
so given rq-based doesn't support multiple targets.  As you know, from
dm_table_set_type:

        /*
         * Request-based dm supports only tables that have a single target now.
         * To support multiple targets, request splitting support is needed,
         * and that needs lots of changes in the block-layer.
         * (e.g. request completion process for partial completion.)
         */

I think we need to at least benchmark the performance of dm-mpath
without any of this extra, soon to be unnecessary, code.

Maybe my concern is overblown...

> The hard part is discerning the error type for flush failure
> as discussed in the other thread.
> And as Kiyoshi wrote, that's an existing problem, so it can
> be worked on as a separate issue from the new FLUSH work.

Right, Mike Christie will be refreshing his patchset that should enable
us to resolve that separate issue.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-27  4:08                 ` Mike Snitzer
@ 2010-08-27  5:52                   ` Jun'ichi Nomura
  2010-08-27 14:13                     ` Mike Snitzer
  0 siblings, 1 reply; 155+ messages in thread
From: Jun'ichi Nomura @ 2010-08-27  5:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe,
	linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj,
	tytso, swhiteho, chris.mason, dm-devel

Hi Mike,

(08/27/10 13:08), Mike Snitzer wrote:
>> the above-mentioned 'special casing' is not the hard part.
>> See the attached patch.
> 
> Yes, Tejun suggested something like this in one of the threads.  Thanks
> for implementing it.
> 
> But do you agree that the request-based barrier code (added in commit
> d0bcb8786) could be reverted given the new FLUSH work?

No, it's a separate thing.
If we don't need to care about the case where multiple clones
of a flush request are necessary, the special casing of flush
requests can be removed regardless of the new FLUSH work.

> We no longer need waiting now that ordering isn't a concern.  Especially

The waiting is not for ordering, but for multiple clones.

> so given rq-based doesn't support multiple targets.  As you know, from
> dm_table_set_type:
> 
>         /*
>          * Request-based dm supports only tables that have a single target now.
>          * To support multiple targets, request splitting support is needed,
>          * and that needs lots of changes in the block-layer.
>          * (e.g. request completion process for partial completion.)
>          */

This comment is about multiple targets.
The special code for barriers is for a single target whose
num_flush_requests > 1. That's a different thing.

> I think we need to at least benchmark the performance of dm-mpath
> without any of this extra, soon to be unnecessary, code.

If there will be no need for supporting a request-based target
with num_flush_requests > 1, the special handling of flush
can be removed.

And since there is no such target in the current tree,
I don't object if you remove that part of code for good reason.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-27  5:52                   ` Jun'ichi Nomura
@ 2010-08-27 14:13                     ` Mike Snitzer
  2010-08-30  4:45                       ` Jun'ichi Nomura
  0 siblings, 1 reply; 155+ messages in thread
From: Mike Snitzer @ 2010-08-27 14:13 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe,
	linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj,
	tytso, swhiteho, chris.mason, dm-devel

On Fri, Aug 27 2010 at  1:52am -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi Mike,
> 
> (08/27/10 13:08), Mike Snitzer wrote:
> > But do you agree that the request-based barrier code (added in commit
> > d0bcb8786) could be reverted given the new FLUSH work?
> 
> No, it's a separate thing.
> If we don't need to care about the case where multiple clones
> of a flush request are necessary, the special casing of flush
> requests can be removed regardless of the new FLUSH work.

Ah, yes, thanks for clarifying.  But we've never cared about multiple
clones of a flush, so it's odd that such elaborate infrastructure was
introduced without a need.

> > We no longer need waiting now that ordering isn't a concern.  Especially
> 
> The waiting is not for ordering, but for multiple clones.
> 
> > so given rq-based doesn't support multiple targets.  As you know, from
> > dm_table_set_type:
> > 
> >         /*
> >          * Request-based dm supports only tables that have a single target now.
> >          * To support multiple targets, request splitting support is needed,
> >          * and that needs lots of changes in the block-layer.
> >          * (e.g. request completion process for partial completion.)
> >          */
> 
> This comment is about multiple targets.
> The special code for barriers is for a single target whose
> num_flush_requests > 1. That's a different thing.

Yes, I need to not send mail just before going to bed..
 
> > I think we need to at least benchmark the performance of dm-mpath
> > without any of this extra, soon to be unnecessary, code.
> 
> If there will be no need for supporting a request-based target
> with num_flush_requests > 1, the special handling of flush
> can be removed.
> 
> And since there is no such target in the current tree,
> I don't object if you remove that part of code for good reason.

OK, certainly something to keep in mind.  But _really_ knowing the
multipath FLUSH+FUA performance difference (extra special-case code vs
none) requires a full FLUSH conversion of request-based DM anyway.

In general, request-based DM's barrier/flush code does carry a certain
maintenance overhead.  It is quite a bit of distracting code in the core
DM which isn't buying us anything.. so we _could_ just remove it and
never look back (until we have some specific need for num_flush_requests
> 1 in rq-based DM).

Mike

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-27 14:13                     ` Mike Snitzer
@ 2010-08-30  4:45                       ` Jun'ichi Nomura
  2010-08-30  8:33                         ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Jun'ichi Nomura @ 2010-08-30  4:45 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe,
	linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj,
	tytso, swhiteho, chris.mason, dm-devel

Hi Mike,

(08/27/10 23:13), Mike Snitzer wrote:
>> If there will be no need for supporting a request-based target
>> with num_flush_requests > 1, the special handling of flush
>> can be removed.
>>
>> And since there is no such target in the current tree,
>> I don't object if you remove that part of code for good reason.
> 
> OK, certainly something to keep in mind.  But _really_ knowing the
> multipath FLUSH+FUA performance difference (extra special-case code vs
> none) requires a full FLUSH conversion of request-based DM anyway.
> 
> In general, request-based DM's barrier/flush code does carry a certain
> maintenance overhead.  It is quite a bit of distracting code in the core
> DM which isn't buying us anything.. so we _could_ just remove it and
> never look back (until we have some specific need for num_flush_requests
>> 1 in rq-based DM).

So, I'm not objecting to your idea.
Could you please create a patch to remove that?

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-30  4:45                       ` Jun'ichi Nomura
@ 2010-08-30  8:33                         ` Tejun Heo
  2010-08-30 12:43                           ` Mike Snitzer
  0 siblings, 1 reply; 155+ messages in thread
From: Tejun Heo @ 2010-08-30  8:33 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Mike Snitzer, Christoph Hellwig, Kiyoshi Ueda, Jan Kara,
	linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley,
	konishi.ryusuke, tytso, swhiteho, chris.mason, dm-devel

On 08/30/2010 06:45 AM, Jun'ichi Nomura wrote:
> Hi Mike,
> 
> (08/27/10 23:13), Mike Snitzer wrote:
>>> If there will be no need for supporting a request-based target
>>> with num_flush_requests > 1, the special handling of flush
>>> can be removed.
>>>
>>> And since there is no such target in the current tree,
>>> I don't object if you remove that part of code for good reason.
>>
>> OK, certainly something to keep in mind.  But _really_ knowing the
>> multipath FLUSH+FUA performance difference (extra special-case code vs
>> none) requires a full FLUSH conversion of request-based DM anyway.
>>
>> In general, request-based DM's barrier/flush code does carry a certain
>> maintenance overhead.  It is quite a bit of distracting code in the core
>> DM which isn't buying us anything.. so we _could_ just remove it and
>> never look back (until we have some specific need for num_flush_requests
>>> 1 in rq-based DM).
> 
> So, I'm not objecting to your idea.
> Could you please create a patch to remove that?

I did that yesterday.  Will post the patch soon.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-30  8:33                         ` Tejun Heo
@ 2010-08-30 12:43                           ` Mike Snitzer
  2010-08-30 12:45                             ` Tejun Heo
  0 siblings, 1 reply; 155+ messages in thread
From: Mike Snitzer @ 2010-08-30 12:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jun'ichi Nomura, Christoph Hellwig, Kiyoshi Ueda, Jan Kara,
	linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley,
	konishi.ryusuke, tytso, swhiteho, chris.mason, dm-devel

On Mon, Aug 30 2010 at  4:33am -0400,
Tejun Heo <tj@kernel.org> wrote:

> On 08/30/2010 06:45 AM, Jun'ichi Nomura wrote:
> > Hi Mike,
> > 
> > (08/27/10 23:13), Mike Snitzer wrote:
> >>> If there will be no need for supporting a request-based target
> >>> with num_flush_requests > 1, the special handling of flush
> >>> can be removed.
> >>>
> >>> And since there is no such target in the current tree,
> >>> I don't object if you remove that part of code for good reason.
> >>
> >> OK, certainly something to keep in mind.  But _really_ knowing the
> >> multipath FLUSH+FUA performance difference (extra special-case code vs
> >> none) requires a full FLUSH conversion of request-based DM anyway.
> >>
> >> In general, request-based DM's barrier/flush code does carry a certain
> >> maintenance overhead.  It is quite a bit of distracting code in the core
> >> DM which isn't buying us anything.. so we _could_ just remove it and
> >> never look back (until we have some specific need for num_flush_requests
> >>> 1 in rq-based DM).
> > 
> > So, I'm not objecting to your idea.
> > Could you please create a patch to remove that?
> 
> I did that yesterday.  Will post the patch soon.

I did it yesterday also; mine builds on your previous DM patchset...

I'll review your recent patchset, from today, to compare and will share
my findings.

I was hoping we could get the current request-based code working with
your new FLUSH+FUA work without removing support for num_flush_requests
(yet).  And then layer in the removal to give us the before and after so
we would know the overhead associated with keeping/dropping
num_flush_requests.  But like I said earlier "we _could_ just remove it
and never look back".

Thanks,
Mike

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly
  2010-08-30 12:43                           ` Mike Snitzer
@ 2010-08-30 12:45                             ` Tejun Heo
  0 siblings, 0 replies; 155+ messages in thread
From: Tejun Heo @ 2010-08-30 12:45 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jun'ichi Nomura, Christoph Hellwig, Kiyoshi Ueda, Jan Kara,
	linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley,
	konishi.ryusuke, tytso, swhiteho, chris.mason, dm-devel

Hello,

On 08/30/2010 02:43 PM, Mike Snitzer wrote:
> I did it yesterday also; mine builds on your previous DM patchset...
> 
> I'll review your recent patchset, from today, to compare and will share
> my findings.

Thanks. :-)

> I was hoping we could get the current request-based code working with
> your new FLUSH+FUA work without removing support for num_flush_requests
> (yet).  And then layer in the removal to give us the before and after so
> we would know the overhead associated with keeping/dropping
> num_flush_requests.  But like I said earlier "we _could_ just remove it
> and never look back".

I tried, but it's not very easy because the original implementation
depended on the block layer suppressing other requests while a flush
sequence is in progress.  The painful part was that the block layer no
longer sorts requeued flush requests in front of other front-inserted
requests, so explicit queue suppression can't be implemented simply.
Another route would be adding separate wait/wakeup logic for flushes
(someone posted a demo patch for that which was almost there but not
quite), but it seemed like an aimless effort to build a new facility
only to rip it out in the next patch.  After all, the whole thing
seemed somewhat pointless given that writes can't be routed to
multiple targets (if writes can't target multiple devices, flushes
won't need to either).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 155+ messages in thread

end of thread

Thread overview: 155+ messages
2010-07-27 16:56 [RFC] relaxed barrier semantics Christoph Hellwig
2010-07-27 17:54 ` Jan Kara
2010-07-27 18:35   ` Vivek Goyal
2010-07-27 18:42     ` James Bottomley
2010-07-27 18:51       ` Ric Wheeler
2010-07-27 19:43       ` Christoph Hellwig
2010-07-27 19:38     ` Christoph Hellwig
2010-07-28  8:08     ` Tejun Heo
2010-07-28  8:20       ` Tejun Heo
2010-07-28 13:55         ` Vladislav Bolkhovitin
2010-07-28 14:23           ` Tejun Heo
2010-07-28 14:37             ` James Bottomley
2010-07-28 14:44               ` Tejun Heo
2010-07-28 16:17                 ` Vladislav Bolkhovitin
2010-07-28 16:17               ` Vladislav Bolkhovitin
2010-07-28 16:16             ` Vladislav Bolkhovitin
2010-07-28  8:24       ` Christoph Hellwig
2010-07-28  8:40         ` Tejun Heo
2010-07-28  8:50           ` Christoph Hellwig
2010-07-28  8:58             ` Tejun Heo
2010-07-28  9:00               ` Christoph Hellwig
2010-07-28  9:11                 ` Hannes Reinecke
2010-07-28  9:16                   ` Christoph Hellwig
2010-07-28  9:24                     ` Tejun Heo
2010-07-28  9:38                       ` Christoph Hellwig
2010-07-28  9:28                   ` Steven Whitehouse
2010-07-28  9:35                     ` READ_META semantics, was " Christoph Hellwig
2010-07-28 13:52                       ` Jeff Moyer
2010-07-28  9:17                 ` Tejun Heo
2010-07-28  9:28                   ` Christoph Hellwig
2010-07-28  9:48                     ` Tejun Heo
2010-07-28 10:19                     ` Steven Whitehouse
2010-07-28 11:45                       ` Christoph Hellwig
2010-07-28 12:47                     ` Jan Kara
2010-07-28 23:00                       ` Christoph Hellwig
2010-07-29 10:45                         ` Jan Kara
2010-07-29 16:54                           ` Joel Becker
2010-07-29 17:02                             ` Christoph Hellwig
2010-07-29 17:02                             ` Christoph Hellwig
2010-07-29  1:44                     ` Ted Ts'o
2010-07-29  2:43                       ` Vivek Goyal
2010-07-29  2:43                       ` Vivek Goyal
2010-07-29  8:42                         ` Christoph Hellwig
2010-07-29 20:02                           ` Vivek Goyal
2010-07-29 20:06                             ` Christoph Hellwig
2010-07-30  3:17                               ` Vivek Goyal
2010-07-30  7:07                                 ` Christoph Hellwig
2010-07-30  7:41                                   ` Vivek Goyal
2010-08-02 18:28                                   ` [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) Vivek Goyal
2010-08-03 13:03                                     ` Christoph Hellwig
2010-08-04 15:29                                       ` Vivek Goyal
2010-08-04 16:21                                         ` Christoph Hellwig
2010-07-29  8:31                       ` [RFC] relaxed barrier semantics Christoph Hellwig
2010-07-29 11:16                         ` Jan Kara
2010-07-29 13:00                         ` extfs reliability Vladislav Bolkhovitin
2010-07-29 13:08                           ` Christoph Hellwig
2010-07-29 14:12                             ` Vladislav Bolkhovitin
2010-07-29 14:34                               ` Jan Kara
2010-07-29 18:20                                 ` Vladislav Bolkhovitin
2010-07-29 18:49                                 ` Vladislav Bolkhovitin
2010-07-29 14:26                           ` Jan Kara
2010-07-29 18:20                             ` Vladislav Bolkhovitin
2010-07-29 18:58                           ` Ted Ts'o
2010-07-29 19:44                       ` [RFC] relaxed barrier semantics Ric Wheeler
2010-07-29 19:49                         ` Christoph Hellwig
2010-07-29 19:56                           ` Ric Wheeler
2010-07-29 19:59                             ` James Bottomley
2010-07-29 20:03                               ` Christoph Hellwig
2010-07-29 20:07                                 ` James Bottomley
2010-07-29 20:11                                   ` Christoph Hellwig
2010-07-30 12:45                                     ` Vladislav Bolkhovitin
2010-07-30 12:56                                       ` Christoph Hellwig
2010-08-04  1:58                                     ` Jamie Lokier
2010-07-30 12:46                                 ` Vladislav Bolkhovitin
2010-07-30 12:57                                   ` Christoph Hellwig
2010-07-30 13:09                                     ` Vladislav Bolkhovitin
2010-07-30 13:12                                       ` Christoph Hellwig
2010-07-30 17:40                                         ` Vladislav Bolkhovitin
2010-07-29 20:58                               ` Ric Wheeler
2010-07-29 22:30                             ` Andreas Dilger
2010-07-29 23:04                               ` Ted Ts'o
2010-07-29 23:08                                 ` Ric Wheeler
2010-07-29 23:08                                 ` Ric Wheeler
2010-07-29 23:28                                 ` James Bottomley
2010-07-29 23:37                                   ` James Bottomley
2010-07-30  0:19                                     ` Ted Ts'o
2010-07-30 12:56                                   ` Vladislav Bolkhovitin
2010-07-30  7:11                                 ` Christoph Hellwig
2010-07-30  7:11                                 ` Christoph Hellwig
2010-07-30 12:56                                 ` Vladislav Bolkhovitin
2010-07-30 13:07                                   ` Tejun Heo
2010-07-30 13:22                                     ` Vladislav Bolkhovitin
2010-07-30 13:27                                       ` Vladislav Bolkhovitin
2010-07-30 13:09                                   ` Christoph Hellwig
2010-07-30 13:25                                     ` Vladislav Bolkhovitin
2010-07-30 13:34                                       ` Christoph Hellwig
2010-07-30 13:44                                         ` Vladislav Bolkhovitin
2010-07-30 14:20                                           ` Christoph Hellwig
2010-07-31  0:47                                             ` Jan Kara
2010-07-31  9:12                                               ` Christoph Hellwig
2010-08-02 13:14                                                 ` Jan Kara
2010-08-02 10:38                                               ` Vladislav Bolkhovitin
2010-08-02 12:48                                                 ` Christoph Hellwig
2010-08-02 19:03                                                   ` xfs rm performance Vladislav Bolkhovitin
2010-08-02 19:18                                                     ` Christoph Hellwig
2010-08-05 19:31                                                       ` Vladislav Bolkhovitin
2010-08-02 19:01                                             ` [RFC] relaxed barrier semantics Vladislav Bolkhovitin
2010-08-02 19:26                                               ` Christoph Hellwig
2010-07-30 12:56                                 ` Vladislav Bolkhovitin
2010-07-31  0:35                         ` Jan Kara
2010-07-29 19:44                       ` Ric Wheeler
2010-08-02 16:47                     ` Ryusuke Konishi
2010-08-02 17:39                     ` Chris Mason
2010-08-05 13:11                       ` Vladislav Bolkhovitin
2010-08-05 13:32                         ` Chris Mason
2010-08-05 14:52                           ` Hannes Reinecke
2010-08-05 14:52                           ` Hannes Reinecke
2010-08-05 15:17                             ` Chris Mason
2010-08-05 17:07                             ` Christoph Hellwig
2010-08-05 19:48                           ` Vladislav Bolkhovitin
2010-08-05 19:48                           ` Vladislav Bolkhovitin
2010-08-05 19:50                             ` Christoph Hellwig
2010-08-05 20:05                               ` Vladislav Bolkhovitin
2010-08-06 14:56                                 ` Hannes Reinecke
2010-08-06 18:38                                   ` Vladislav Bolkhovitin
2010-08-06 23:38                                     ` Christoph Hellwig
2010-08-06 23:34                                   ` Christoph Hellwig
2010-08-05 17:09                         ` Christoph Hellwig
2010-08-05 19:32                           ` Vladislav Bolkhovitin
2010-08-05 19:40                             ` Christoph Hellwig
2010-08-05 13:11                       ` Vladislav Bolkhovitin
2010-07-28 13:56                   ` Vladislav Bolkhovitin
2010-07-28 14:42                 ` Vivek Goyal
2010-07-27 19:37   ` Christoph Hellwig
2010-08-03 18:49   ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig
2010-08-03 18:51     ` [PATCH, RFC 2/2] dm: support REQ_FLUSH directly Christoph Hellwig
2010-08-04  4:57       ` Kiyoshi Ueda
2010-08-04  8:54         ` Christoph Hellwig
2010-08-05  2:16           ` Jun'ichi Nomura
2010-08-26 22:50             ` Mike Snitzer
2010-08-27  0:40               ` Mike Snitzer
2010-08-27  1:20                 ` Jamie Lokier
2010-08-27  1:43               ` Jun'ichi Nomura
2010-08-27  4:08                 ` Mike Snitzer
2010-08-27  5:52                   ` Jun'ichi Nomura
2010-08-27 14:13                     ` Mike Snitzer
2010-08-30  4:45                       ` Jun'ichi Nomura
2010-08-30  8:33                         ` Tejun Heo
2010-08-30 12:43                           ` Mike Snitzer
2010-08-30 12:45                             ` Tejun Heo
2010-08-06 16:04     ` [PATCH, RFC] relaxed barriers Tejun Heo
2010-08-06 23:34       ` Christoph Hellwig
2010-08-07 10:13       ` [PATCH REPOST " Tejun Heo
2010-08-08 14:31         ` Christoph Hellwig
2010-08-09 14:50           ` Tejun Heo
