* Re: commit "xen/blkfront: use tagged queuing for barriers"
       [not found]     ` <4C5AF01C.3040601@goop.org>
@ 2010-08-05 17:19       ` Christoph Hellwig
  2010-08-05 17:46         ` Gerd Hoffmann
                           ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Christoph Hellwig @ 2010-08-05 17:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Christoph Hellwig, Jens Axboe, linux-kernel, Daniel Stodden, kraxel

On Thu, Aug 05, 2010 at 10:08:44AM -0700, Jeremy Fitzhardinge wrote:
>  On 08/04/2010 09:44 AM, Christoph Hellwig wrote:
> >>But either the blkfront patch is wrong and it needs to be fixed,
> >Actually both the old and the new one are wrong, but I'd say the new
> >one is even more wrong.
> >
> >_TAG implies that the device can do ordering by tag.  And at least the
> >qemu xen_disk backend doesn't when it advertizes this feature.
> 
> We don't use qemu at all for block storage; qemu (afaik) doesn't
> have a blkback protocol implementation in it.  I'm guessing xen_disk
> is to allow kvm to be compatible with Xen disk images?  It certainly
> isn't a reference implementation.

Disk images formats have nothing to do with the I/O interface.  I
believe Gerd added it for running unmodified Xen guests in qemu,
but he can explain more of it.

I've only mentioned it here because it's the one I easily have access
to.  Given Xen's about 4 different I/O backends and the various forked
trees it's rather hard to find the official reference.

> >I'm pretty sure most if not all of the original Xen backends do the
> >same.  Given that I have tried to implement tagged ordering in qemu
> >I know that comes down to doing exactly the same draining we already
> >do in the kernel, just duplicated in the virtual disk backend.  That
> >is for a userspace implementation - for a kernel implementation only
> >using block devices we could in theory implement it using barriers,
> >but that would be even more inefficient.  And last time I looked
> >at the in-kernel xen disk backend it didn't do that either.
> 
> blkback - the in-kernel backend - does generate barriers when it
> receives one from the guest.  Could you expand on why passing a
> guest barrier through to the host IO stack would be bad for
> performance?  Isn't this exactly the same as a local writer
> generating a barrier?

If you pass it on it has the same semantics, but given that you'll
usually end up having multiple guest disks on a single volume using
lvm or similar you'll end up draining even more I/O as there is one
queue for all of them.  That way you can easily have one guest starve
others.

Note that we're going to get rid of the draining for common cases
anyway, but that's a separate discussion thread, the "relaxed barriers"
one.

> It's true that a number of the Xen backends end up implementing
> barriers via drain for simplicity's sake, but there's no inherent
> reason why they couldn't implement a more complete tagged model.

If they are in Linux/Posix userspace they can't, because there are
no system calls to achieve that.  And then again there really is
no need to implement all this in the host anyway - the draining
is something we enforced on ourselves in Linux without good reason,
which we're trying to get rid of and which no other OS ever did.

> >Now where both old and new one are buggy is that they don't
> >include the QUEUE_ORDERED_DO_PREFLUSH  and
> >QUEUE_ORDERED_DO_POSTFLUSH/QUEUE_ORDERED_DO_FUA which mean any
> >explicit cache flush (aka empty barrier) is silently dropped, making
> >fsync and co not preserve data integrity.
> 
> Ah, OK, something specific.  What level ends up dropping the empty
> barrier?  Certainly an empty WRITE_BARRIER operation to the backend
> will cause all prior writes to be durable, which should be enough.
> Are you saying that there's an extra flag we should be passing to
> blk_queue_ordered(), or is there some other interface we should be
> implementing for explicit flushes?
> 
> Is there a good reference implementation we can use as a model?

Just read Documentation/block/barriers.txt, it's very well described
there.  Even the naming of the various ORDERED constants should
give enough hints.
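
Roughly, those constants are just combinations of the DO_* flags I
mentioned above (paraphrased from my memory of include/linux/blkdev.h,
so double-check the real header before relying on the exact
definitions):

enum {
        QUEUE_ORDERED_TAG       = QUEUE_ORDERED_BY_TAG | QUEUE_ORDERED_DO_BAR,
        QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG |
                                  QUEUE_ORDERED_DO_PREFLUSH |
                                  QUEUE_ORDERED_DO_POSTFLUSH,
        QUEUE_ORDERED_TAG_FUA   = QUEUE_ORDERED_TAG |
                                  QUEUE_ORDERED_DO_PREFLUSH |
                                  QUEUE_ORDERED_DO_FUA,
};

Plain QUEUE_ORDERED_TAG has neither flush bit set, which is exactly why
an explicit cache flush gets lost with it on a device that has a
volatile write cache.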

> As I said before, the qemu xen backend is irrelevant.

It's one of the many backends written to the protocol specification,
I don't think it's fair to call it irrelevant.  And as mentioned before
I'd be very surprised if the other backends all get it right.  If you
send me pointers to one or two backends you considered "relevant", I'm
happy to look at them.



* Re: commit "xen/blkfront: use tagged queuing for barriers"
  2010-08-05 17:19       ` commit "xen/blkfront: use tagged queuing for barriers" Christoph Hellwig
@ 2010-08-05 17:46         ` Gerd Hoffmann
  2010-08-05 21:07         ` Daniel Stodden
  2010-08-05 23:11         ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 8+ messages in thread
From: Gerd Hoffmann @ 2010-08-05 17:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Fitzhardinge, Jens Axboe, linux-kernel, Daniel Stodden

On 08/05/10 19:19, Christoph Hellwig wrote:
> On Thu, Aug 05, 2010 at 10:08:44AM -0700, Jeremy Fitzhardinge wrote:
>>   On 08/04/2010 09:44 AM, Christoph Hellwig wrote:
>>>> But either the blkfront patch is wrong and it needs to be fixed,
>>> Actually both the old and the new one are wrong, but I'd say the new
>>> one is even more wrong.
>>>
>>> _TAG implies that the device can do ordering by tag.  And at least the
>>> qemu xen_disk backend doesn't when it advertizes this feature.
>>
>> We don't use qemu at all for block storage; qemu (afaik) doesn't
>> have a blkback protocol implementation in it.

Upstream qemu has.

>> I'm guessing xen_disk
>> is to allow kvm to be compatible with Xen disk images?

No, it actually is a blkback implementation.

>> It certainly
>> isn't a reference implementation.

Indeed.  I also haven't tested it for ages, not sure whether it still works.

> Disk images formats have nothing to do with the I/O interface.  I
> believe Gerd added it for running unmodified Xen guests in qemu,
> but he can explain more of it.

Well, you can boot pv kernels with upstream qemu.  qemu must be compiled 
with xen support enabled, you need xen underneath and xenstored must 
run, but nothing else (xend, tapdisk, ...) is required.  qemu will call 
xen libraries to build the domain and run the pv kernel.  qemu provides 
backends for console, framebuffer, network and disk.

There was also a plan to allow xen itself to be emulated, so you could run pv 
kernels in qemu without xen (using tcg or kvm).  Basically xenner merged 
into qemu.  That project was never finished though and I didn't spend 
any time on it for at least one year ...

Hope this clarifies,
   Gerd



* Re: commit "xen/blkfront: use tagged queuing for barriers"
  2010-08-05 17:19       ` commit "xen/blkfront: use tagged queuing for barriers" Christoph Hellwig
  2010-08-05 17:46         ` Gerd Hoffmann
@ 2010-08-05 21:07         ` Daniel Stodden
  2010-08-06 12:43           ` Christoph Hellwig
  2010-08-05 23:11         ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 8+ messages in thread
From: Daniel Stodden @ 2010-08-05 21:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jeremy Fitzhardinge, Jens Axboe, linux-kernel, kraxel

On Thu, 2010-08-05 at 13:19 -0400, Christoph Hellwig wrote:

> > blkback - the in-kernel backend - does generate barriers when it
> > receives one from the guest.  Could you expand on why passing a
> > guest barrier through to the host IO stack would be bad for
> > performance?  Isn't this exactly the same as a local writer
> > generating a barrier?
> 
> If you pass it on it has the same semantics, but given that you'll
> usually end up having multiple guest disks on a single volume using
> lvm or similar you'll end up draining even more I/O as there is one
> queue for all of them.  That way you can easily have one guest starve
> others.

> > >Now where both old and new one are buggy is that they don't
> > >include the QUEUE_ORDERED_DO_PREFLUSH  and
> > >QUEUE_ORDERED_DO_POSTFLUSH/QUEUE_ORDERED_DO_FUA which mean any
> > >explicit cache flush (aka empty barrier) is silently dropped, making
> > >fsync and co not preserve data integrity.
> > 
> > Ah, OK, something specific.  What level ends up dropping the empty
> > barrier?  Certainly an empty WRITE_BARRIER operation to the backend
> > will cause all prior writes to be durable, which should be enough.
> > Are you saying that there's an extra flag we should be passing to
> > blk_queue_ordered(), or is there some other interface we should be
> > implementing for explicit flushes?
> > 
> > Is there a good reference implementation we can use as a model?
> 
> Just read Documentation/block/barriers.txt, it's very well described
> there.  Even the naming of the various ORDERED constant should
> give enough hints.

That one is read and well understood.

I presently don't see a point in having the frontend perform its own
pre or post flushes as long as there's a single queue in the block
layer. But if the kernel drops the plain _TAG mode, there is no problem
with that. Essentially the frontend may drain the queue as much as it
wants. It just won't buy you much if the backend I/O was actually
buffered, other than adding latency to the transport.

The only thing which matters is that the frontend lld gets to see the
actual barrier point, anything else needs to be sorted out next to the
physical layer anyway, so it's better left to the backends.

Not sure if I understand your above comment regarding the flush and fua
bits. Did you mean to indicate that _TAG on the frontend's request_queue
is presently not coming up with the empty barrier request to make
_explicit_ cache flushes happen? That would be something which
definitely needs a workaround in the frontend then. In that case, would
PRE/POSTFLUSH help, to get a call into prepare_flush_fn, which might
insert the tag itself then? It sounds a bit over the top to combine
this with a queue drain on the transport, but I'm rather after
correctness.

Regarding the potential starvation problems when accessing shared
physical storage you mentioned above: Yes, good point, we discussed that
too, although only briefly, and it's a todo which I don't think has been
solved in any present backend. But again, scheduling/merging
drain/flush/fua on shared physical nodes more carefully would be
something better *enforced*. The frontend can't even avoid it.

I wonder if there's a userspace solution for that. Does e.g. fdatasync()
deal with independent invocations other than serializing? Couldn't find
anything which indicates that, but I might not have looked hard enough.
The blktap userspace component presently doesn't buffer, so a _DRAIN is
sufficient. But if it did, then it'd be kinda cool if handled more
carefully. If the kernel does it, all the better.

Thanks,
Daniel




* Re: commit "xen/blkfront: use tagged queuing for barriers"
  2010-08-05 17:19       ` commit "xen/blkfront: use tagged queuing for barriers" Christoph Hellwig
  2010-08-05 17:46         ` Gerd Hoffmann
  2010-08-05 21:07         ` Daniel Stodden
@ 2010-08-05 23:11         ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-05 23:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, Daniel Stodden, kraxel

  On 08/05/2010 10:19 AM, Christoph Hellwig wrote:
>>> I'm pretty sure most if not all of the original Xen backends do the
>>> same.  Given that I have tried to implement tagged ordering in qemu
>>> I know that comes down to doing exactly the same draining we already
>>> do in the kernel, just duplicated in the virtual disk backend.  That
>>> is for a userspace implementation - for a kernel implementation only
>>> using block devices we could in theory implement it using barriers,
>>> but that would be even more inefficient.  And last time I looked
>>> at the in-kernel xen disk backend it didn't do that either.
>> blkback - the in-kernel backend - does generate barriers when it
>> receives one from the guest.  Could you expand on why passing a
>> guest barrier through to the host IO stack would be bad for
>> performance?  Isn't this exactly the same as a local writer
>> generating a barrier?
> If you pass it on it has the same semantics, but given that you'll
> usually end up having multiple guest disks on a single volume using
> lvm or similar you'll end up draining even more I/O as there is one
> queue for all of them.  That way you can easily have one guest starve
> others.

Yes, that's unfortunate.  In the normal case the IO streams would 
actually be independent so they wouldn't need to be serialized with 
respect to each other.  But I don't know if that kind of partial-order 
dependency is possible or on the cards.

> Note that we're going to get rid of the draining for common cases
> anyway, but that's a separate discussion thread the "relaxed barriers"
> one.

Does that mean barriers which enforce ordering without flushing?

>> It's true that a number of the Xen backends end up implementing
>> barriers via drain for simplicity's sake, but there's no inherent
>> reason why they couldn't implement a more complete tagged model.
> If they are in Linux/Posix userspace they can't because there are
> no system calls to achieve that.  And then again there really is
> no need to implement all this in the host anyway - the draining
> is something we enforced on ourselves in Linux without good reason,
> which we're trying to get rid of and no other OS ever did.

Userspace might not be relying on the kernel to do storage (it might 
have its own iscsi implementation or something).

>>> Now where both old and new one are buggy is that they don't
>>> include the QUEUE_ORDERED_DO_PREFLUSH  and
>>> QUEUE_ORDERED_DO_POSTFLUSH/QUEUE_ORDERED_DO_FUA which mean any
>>> explicit cache flush (aka empty barrier) is silently dropped, making
>>> fsync and co not preserve data integrity.
>> Ah, OK, something specific.  What level ends up dropping the empty
>> barrier?  Certainly an empty WRITE_BARRIER operation to the backend
>> will cause all prior writes to be durable, which should be enough.
>> Are you saying that there's an extra flag we should be passing to
>> blk_queue_ordered(), or is there some other interface we should be
>> implementing for explicit flushes?
>>
>> Is there a good reference implementation we can use as a model?
> Just read Documentation/block/barriers.txt, it's very well described
> there.  Even the naming of the various ORDERED constant should
> give enough hints.

I've gone over it a few times.  Since the blkback barriers do both 
ordering and flushing, it seems to me that plain _TAG is the right 
choice; we don't need _TAG_FLUSH or _TAG_FUA.  I still don't understand 
what you mean about "explicit cache flush (aka empty barrier) is 
silently dropped".  Who drops it where?  Do you mean the block subsystem 
will drop an empty write, even if it has a barrier associated with it, 
but if I set PREFLUSH and POSTFLUSH/FUA then those will still come 
through?  If so, isn't dropping a write with a barrier the problem?

> It's one of the many backends written to the protocol specification,
> I don't think it's fair to call it irrelevant.  And as mentioned before
> I'd be very surprised if the other backends all get it right.  If you
> send me pointers to one or two backends you considered "relevant" I'm
> happy to look at them.

You can see the current state in 
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git; 
xen/dom0/backend/blkback is the actual backend part.  It can either 
attach directly to a file/device, or go via blktap for usermode processing.

     J


* Re: commit "xen/blkfront: use tagged queuing for barriers"
  2010-08-05 21:07         ` Daniel Stodden
@ 2010-08-06 12:43           ` Christoph Hellwig
  2010-08-06 21:20             ` Daniel Stodden
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2010-08-06 12:43 UTC (permalink / raw)
  To: Daniel Stodden
  Cc: Christoph Hellwig, Jeremy Fitzhardinge, Jens Axboe, linux-kernel, kraxel

On Thu, Aug 05, 2010 at 02:07:42PM -0700, Daniel Stodden wrote:
> That one is read and well understood.

Given that xen blkfront does not actually implement cache flushes
correctly that doesn't seem to be the case.

> I presently don't see a  point in having the frontend perform its own
> pre or post flushes as long as there's a single queue in the block
> layer. But if the kernel drops the plain _TAG mode, there is no  problem
> with that. Essentially the frontend may drain the queue as much as it
> wants. It just won't buy you much if the backend I/O was actually
> buffered, other than adding latency to the transport.

You do need the _FLUSH or _FUA modes (either with TAGs or DRAIN) to get
the block layer to send you pure cache flush requests (aka "empty
barriers"); without these they don't work.  The way the current barrier
code is implemented means you will always get manual cache flushes
before the actual barrier requests once you implement that.  By using
the _FUA mode you can still do your own post flush.
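
Concretely I'd expect the frontend setup to boil down to something like
this once it's sorted out (just a sketch, assuming the two-argument
blk_queue_ordered() that is left after the prepare_flush_fn removal
mentioned below; xlvbd_set_ordered is a made-up name, info->rq and
info->feature_barrier are the fields I remember from the driver):

/* advertise tag ordering plus cache flushing so the block layer
 * actually generates the empty barriers fsync and friends rely on */
static void xlvbd_set_ordered(struct blkfront_info *info)
{
        unsigned ordered = QUEUE_ORDERED_NONE;

        if (info->feature_barrier)
                ordered = QUEUE_ORDERED_TAG_FLUSH;      /* or _TAG_FUA */

        blk_queue_ordered(info->rq, ordered);
}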

I've been through doing all this, and given how hard it is to do a
semi-efficient drain in a backend driver, and given that non-Linux
guests don't even benefit from it, just leaving the draining to the
guest is the easiest solution.  If you already have the draining around
and are confident that it gets all corner cases right you can of
course keep it and use the QUEUE_ORDERED_TAG_FLUSH/QUEUE_ORDERED_TAG_FUA
modes.  But from dealing with data integrity issues in virtualized
environments I'm not confident that things will just work, both on the
backend side, especially if image formats are around, and also on the
guest side given that QUEUE_ORDERED_TAG* has zero real life testing.

> Not sure if I understand your above comment regarding the flush and fua
> bits. Did you mean to indicate that _TAG on the frontend's request_queue
> is presently not coming up with the empty barrier request to make
> _explicit_ cache flushes happen?

Yes.

> That would be something which
> definitely needs a workaround in the frontend then. In that case, would
> PRE/POSTFLUSH help, to get a call into prepare_flush_fn, which might
> insert the tag itself then? It sounds a bit over the top to combine
> this with a queue drain on the transport, but I'm rather after
> correctness.

prepare_flush_fn is gone now.

> I wonder if there's a userspace solution for that. Does e.g. fdatasync()
> deal with independent invocations other than serializing?

fsync/fdatasync is serialized by i_mutex.

> The blktap userspace component presently doesn't buffer, so a _DRAIN is
> sufficient. But if it did, then it'd be kinda cool if handled more
> carefully. If the kernel does it, all the better.

Doesn't buffer as in using O_SYNC/O_DSYNC or O_DIRECT?  You still need
to call fdatasync for the latter, to flush out transactions for block
allocations in sparse / fallocated images and to flush the volatile
write cache of the host disks.
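
In other words the O_DIRECT path needs to look roughly like this
(minimal userspace sketch, hypothetical file name, error handling
trimmed; buffer, length and offset must be block aligned):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Even though O_DIRECT bypasses the page cache, fdatasync() is still
 * needed to commit allocation metadata for sparse/fallocated images
 * and to flush the volatile write cache of the underlying disk. */
static int write_block_durable(int fd, const void *buf, size_t len, off_t off)
{
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                return -1;
        return fdatasync(fd);
}

int main(void)
{
        void *buf;
        int fd = open("disk.img", O_WRONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;
        memset(buf, 0, 4096);
        return write_block_durable(fd, buf, 4096, 0) ? 1 : 0;
}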



* Re: commit "xen/blkfront: use tagged queuing for barriers"
  2010-08-06 12:43           ` Christoph Hellwig
@ 2010-08-06 21:20             ` Daniel Stodden
  2010-08-07 17:20               ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Stodden @ 2010-08-06 21:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jeremy Fitzhardinge, Jens Axboe, linux-kernel, kraxel

On Fri, 2010-08-06 at 08:43 -0400, Christoph Hellwig wrote:
> On Thu, Aug 05, 2010 at 02:07:42PM -0700, Daniel Stodden wrote:
> > That one is read and well understood.
> 
> Given that xen blkfront does not actually implement cache flushes
> correctly that doesn't seem to be the case.

Well, given the stuff below I might actually go and read it again, maybe
I've been just too optimistic last time I did. :)

> > I presently don't see a  point in having the frontend perform its own
> > pre or post flushes as long as there's a single queue in the block
> > layer. But if the kernel drops the plain _TAG mode, there is no  problem
> > with that. Essentially the frontend may drain the queue as much as it
> > wants. It just won't buy you much if the backend I/O was actually
> > buffered, other than adding latency to the transport.
> 
> You do need the _FLUSH or _FUA modes (either with TAGs or DRAIN) to get
> the block layer to send you pure cache flush requests (aka "empty
> barriers") without this they don't work.  They way the current barrier
> code is implemented means you will always get manual cache flushes
> before the actual barrier requests once you implement that.  By using
> the _FUA mode you can still do your own post flush.

Okay. Well that's a frontend thing, let's absolutely fix it according to
kernel demand, and move on with whatever you guys are doing there.

> I've been through doing all this, and given how hard it is to do a
> semi-efficient drain in a backend driver, and given that non-Linux
> guests don't even benefit from it just leaving the draining to the
> guest is the easiest solution.  

Stop, now that's a different thing: if we want to keep stuff simple (we
really want to) this asks for making draining the default mode for
everyone?

You basically want everybody to commit to a preflush, right? Only? Is
that everything? 

Does the problem relate to merging barrier points, grouping frontends on
shared storage, or am I missing something more general? Because
otherwise it still sounds relatively straightforward. 

I wouldn't be against defaulting frontend barrier users to DRAIN if it's
clearly beneficial for any kind of backend involved. For present blkback
it's a no-brainer because we just map to the blk queue, boring as we
are.

But even considering group synchronization, instead of just dumb
serialization, the need for a guest queue drain doesn't look obvious.
The backend would have to drain everybody else on its own terms anyway
to find a good merge point.

So I'm still wondering. Can you explain a little more what makes your
backend depend on it? 

Otherwise one could always go and impose a couple extra flags on
frontend authors, provided there's a benefit and it doesn't result in
just mapping the entire QUEUE_ORDERED set into the control interface. :P
But either way, that doesn't sound like a preferable solution if we can
spare it.

> If you already have the draining around
> and are confident that it gets all corner cases right you can of
> course keep it and use the QUEUE_ORDERED_TAG_FLUSH/QUEUE_ORDERED_TAG_FUA
> modes.  But from dealing with data integrity issues in virtualized
> environment I'm not confident that things will just work, both on the
> backend side, especially if image formats are around, and also on the
> guest side given that QUEUE_ORDERED_TAG* has zero real life testing.

The image format stuff in Xen servers is not sitting immediately behind
the frontend ring anymore. This one indeed orders with a drain, but
again we use the block layer to take care of that. Works, cheaply for
now.

> > Not sure if I understand your above comment regarding the flush and fua
> > bits. Did you mean to indicate that _TAG on the frontend's request_queue
> > is presently not coming up with the empty barrier request to make
> > _explicit_ cache flushes happen?
> 
> Yes.

Well, I understand that _TAG is the only model in there which doesn't
map easily to the concept of a full cache flush on the normal data path,
after all it's the only one in there where the device wants to deal with
it alone. Which happens to be exactly the reason why we wanted it in
xen-blkfront. If it doesn't really work like that for a linux guest,
tough luck. It certainly works for a backend sitting on the bio layer.

To fully disqualify myself: What are the normal kernel entries going for
a _full_ explicit cache flush? I'm only aware of BLKFLSBUF etc, and that
even kicks down into the driver as far as I see, so I wonder who else
would want empty barriers so badly under plain TAG ordering...

> > That would be something which
> > definitely needs a workaround in the frontend then. In that case, would
> > PRE/POSTFLUSH help, to get a call into prepare_flush_fn, which might
> > insert the tag itself then? It sounds a bit over the top to combine
> > this with a queue drain on the transport, but I'm rather after
> > correctness.
> 
> prepare_flush_fn is gone now.
> 
> > I wonder if there's a userspace solution for that. Does e.g. fdatasync()
> > deal with independent invocations other than serializing?
> 
> fsync/fdatasync is serialized by i_mutex.

That sounds like a fairly similar issue to me, despite starting out from
a different layer. I understand that it's hardly up to a single disk
queue to deal with that, just wondering about the overall impact.

> > The blktap userspace component presently doesn't buffer, so a _DRAIN is
> > sufficient. But if it did, then it'd be kinda cool if handled more
> > carefully. If the kernel does it, all the better.
> 
> Doesn't buffer as in using O_SYNC/O_DSYNC or O_DIRECT?  

Right.

> You still need
> to call fdatasync for the latter, to flush out transactions for block
> allocations in sparse / fallocated images and 

Yup, this happens, most of our file-based backends run block
preallocation synchronously before returning to aio. The rest I'm not
sure about (but don't care much).

> to flush the volatile
> write cache of the host disks.

I take it the RW_BARRIER at least in the fully allocated case is well
taken care of? Just to make sure...

Last time I checked I'm pretty sure I saw our aio writes completing with
a proper hw barrier. Extra rules for sparse regions are already bad
enough for a plain luserland interface, especially since the lio
documentation doesn't exactly seem to cry it out loud and far... :}

Thanks,
Daniel




* Re: commit "xen/blkfront: use tagged queuing for barriers"
  2010-08-06 21:20             ` Daniel Stodden
@ 2010-08-07 17:20               ` Christoph Hellwig
  2010-08-07 23:02                 ` Daniel Stodden
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2010-08-07 17:20 UTC (permalink / raw)
  To: Daniel Stodden
  Cc: Christoph Hellwig, Jeremy Fitzhardinge, Jens Axboe, linux-kernel, kraxel

On Fri, Aug 06, 2010 at 02:20:32PM -0700, Daniel Stodden wrote:
> > I've been through doing all this, and given how hard it is to do a
> > semi-efficient drain in a backend driver, and given that non-Linux
> > guests don't even benefit from it just leaving the draining to the
> > guest is the easiest solution.  
> 
> Stop, now that's a different thing: if we want to keep stuff simple (we
> really want to) this asks for making draining the default mode for
> everyone?
> 
> You basically want everybody to commit to a preflush, right? Only? Is
> that everything? 

With the barrier model we have in current kernels you basically need to
a) do a drain (typically inside the guest) and b) have a cache
flush command if you have volatile write cache semantics.  The cache
flush command will be used for pre-flushes, standalone flushes and,
if you don't have a FUA bit in the protocol, post-flushes.
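
In code terms it is no more than this per barrier request (pseudo-C,
the backend_* helpers and struct names are made up here purely to
illustrate the ordering, not any real backend API):

/* one barrier write against storage with a volatile write cache */
static void submit_barrier(struct backend *be, struct request *barrier_req)
{
        backend_drain(be);              /* a) wait for all in-flight requests */
        backend_cache_flush(be);        /* pre-flush: commit earlier writes   */
        backend_write(be, barrier_req); /* the barrier itself, possibly empty */
        if (!backend_supports_fua(be))  /* post-flush, unless FUA covers it   */
                backend_cache_flush(be);
}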

> So I'm still wondering. Can you explain a little more what makes your
> backend depend on it? 

Which backend?  Currently filesystems can in theory rely on the ordering
semantics, although very few do.  And we've not seen a working
implementation except for draining for it - the _TAG versions exist,
but they are basically untested, and no one has solved the issues of
error handling for it yet.

> Otherwise one could always go and impose a couple extra flags on
> frontend authors, provided there's a benefit and it doesn't result in
> just mapping the entire QUEUE_ORDERED set into the control interface. :P
> But either way, that doesn't sound like a preferable solution if we can
> spare it.

Basically the only thing you need is a cache flush command right now;
that solves everything the Linux kernel needs, as does windows or
possibly other guests.  The draining is something imposed on us by
the current Linux barrier semantics, and I'm confident it will be a
thing of the past by Linux 2.6.37.

> Well, I understand that _TAG is the only model in there which doesn't
> map easily to the concept of a full cache flush on the normal data path,
> after all it's the only one in there where the device wants to deal with
> it alone. Which happens to be exactly the reason why we wanted it in
> xen-blkfront. If it doesn't really work like that for a linux guest,
> tough luck. It certainly works for a backend sitting on the bio layer.

I think you're another victim of the overloaded barrier concept.  What
the Linux barrier flag does is two only slightly related things:

 a) give the filesystem a way to flush volatile write caches and thus
    guarantee data integrity
 b) provide block level ordering loosely modeled after the SCSI ordered
    tag model

Practice has shown that we just need (a) in most cases, there's only
two filesystems that can theoretically take advantage of (b), and even
there I'm sure we could do better without the block level draining.

The _TAG implementation of barriers is related to (b) only - the pure
QUEUE_ORDERED_TAG is only safe if you do not have a volatile write
cache - to do cache flushing you need the QUEUE_ORDERED_TAG_FLUSH or
QUEUE_ORDERED_TAG_FUA modes.   In theory we could also add another
mode that not only integrates the post-flush in the _FUA mode but
also a pre-flush into a single command, but so far there wasn't
any demand, most likely because no on-the-wire storage protocol
implements it.

> To fully disqualify myself: What are the normal kernel entries going for
> a _full_ explicit cache flush?

The typical one is f(data)sync for the case where there have been no
modifications of metadata, or when using an external log device.

No metadata modifications are quite typical for databases or
virtualization images, or other bulk storage that doesn't allocate space
on the fly.

> I'm only aware of BLKFLSBUF etc, and that
> even kicks down into the driver as far as I see, so I wonder who else
> would want empty barriers so badly under plain TAG ordering...

It does issue normal write barriers when you have dirty metadata, else
it sends empty barriers if supported.

> > > The blktap userspace component presently doesn't buffer, so a _DRAIN is
> > > sufficient. But if it did, then it'd be kinda cool if handled more
> > > carefully. If the kernel does it, all the better.
> > 
> > > Doesn't buffer as in using O_SYNC/O_DSYNC or O_DIRECT?  
> 
> Right.

Err, that was a question.  For O_SYNC/O_DSYNC you don't need the
explicit fsync.  For O_DIRECT you do (or use O_SYNC/O_DSYNC in addition).

> > to flush the volatile
> > write cache of the host disks.
> 
> I take it the RW_BARRIER at least in the fully allocated case is well
> taken care for? Just to make sure...

No, if you're using O_DIRECT you still need f(data)sync to flush out
the host disk cache.

> Last time I checked I'm pretty sure I saw our aio writes completing with
> a proper hw barrier. Extra rules for sparse regions are already bad
> enough for a plain luserland interface, especially since the lio
> documentation doesn't exactly seem to cry it out loud and far... :}

All this will depend a lot on the filesystem.  But if you're not
doing any allocation and you're not using O_SYNC/O_DSYNC most
filesystems will not send any barrier at all.  The obvious exception is
btrfs because it has to allocate new blocks anyway due to its copy on
write scheme.



* Re: commit "xen/blkfront: use tagged queuing for barriers"
  2010-08-07 17:20               ` Christoph Hellwig
@ 2010-08-07 23:02                 ` Daniel Stodden
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Stodden @ 2010-08-07 23:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jeremy Fitzhardinge, Jens Axboe, linux-kernel, kraxel

On Sat, 2010-08-07 at 13:20 -0400, Christoph Hellwig wrote:
> On Fri, Aug 06, 2010 at 02:20:32PM -0700, Daniel Stodden wrote:
> > > I've been through doing all this, and given how hard it is to do a
> > > semi-efficient drain in a backend driver, and given that non-Linux
> > > guests don't even benefit from it just leaving the draining to the
> > > guest is the easiest solution.  
> > 
> > Stop, now that's a different thing: if we want to keep stuff simple (we
> > really want to) this asks for making draining the default mode for
> > everyone?
> > 
> > You basically want everybody to commit to a preflush, right? Only? Is
> > that everything? 
> 
> With the barrier model we have in current kernels you basically need to
> a) do a drain (typically inside the guest) and b) have a cache
> flush command if you have volatile write cache semantics.  The cache
> flush command will be used for pre-flushes, standalone flushes and,
> if you don't have a FUA bit in the protocol, post-flushes.

That's all true at the physical layer. I'm rather talking about the virtual one
-- what constitutes the transport between the frontend and backend.

So if the block queue above xen-blkfront wants to jump through a couple
extra hoops, such as declaring TAG_FUA mode, to realize proper
out-of-band cache flushing, fine.

Underneath a backend, whether that's blkback or qemu, that draining and
flushing will happen on the way down to the physical layer, too. Agreed.

That still doesn't mean you have to impose a drain on the transport in
between. The block layer underneath the backend does all the draining
necessary, with a request stream just submitted in-order and barrier
bits set where the guest saw fit. Including an empty one for an explicit
cache flush.

Neither does a backend want to know how the physical layer will deal
with it in detail, or can. Except for the NONE case, of course.

And I still don't see where any backends can claim overall benefit from
requiring the guest to drain. At that level, a "TAG" is the much simpler
and more efficient one. Even if it neither applies to a Linux guest, nor a
caching disk. Especially the ones far below, underneath some image
format.

It maps well to the bio layer, it even maps well to a trivial datasync()
implementation in userspace, and I don't see why it wouldn't  map well
to a non-trivial one either. These aren't just two shorted Linux block
layers.

So far I'd suggest we keep the ring model as TAG vs. NONE, fix
xen-blkfront to keep the empty barrier stuff going, and keep additional
details where they belong, which is on either end, respectively.

On the Linux frontend side, does TAG_FUA sound about right to you?
Because to me that appears to be the one with the least noise around the
actual barrier request. According to barrier.txt, I guess we will then
map the flush to an empty barrier on the ring and in turn drop a
gratuitous empty barrier following that (?). I obviously didn't try that
out yet. Please absolutely correct me so we maybe get it right this
time.
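
To make that concrete, the frontend side of the mapping would be
roughly the sketch below (the blkif_request fields and BLKIF_OP_* names
are how I read blkif.h; fill_ring_request and map_data_segments are
made-up helpers, and the barrier flag test may need adjusting to
whatever the block layer of the day calls it):

static void fill_ring_request(struct blkif_request *ring_req,
                              struct request *req, struct blkfront_info *info)
{
        ring_req->id            = (unsigned long)req;
        ring_req->handle        = info->handle;
        ring_req->sector_number = (blkif_sector_t)blk_rq_pos(req);
        ring_req->operation     = rq_data_dir(req) ?
                                  BLKIF_OP_WRITE : BLKIF_OP_READ;

        if (req->cmd_flags & REQ_HARDBARRIER)
                ring_req->operation = BLKIF_OP_WRITE_BARRIER;

        /* an explicit cache flush is simply a barrier with no payload */
        ring_req->nr_segments = map_data_segments(ring_req, req);
}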

Also, is my understanding correct that on a Linux backend side, the
empty barrier case at the bio layer isn't compromised? Provided all disk
types declare established ordering modes. Which would be non-TAG for
virtually anything I'm currently aware of. I hope that question is
sufficiently silly.

In the backend, we then keep mapping this to the normal data path above
a gendisk. Pending some overdue optimizations for barriers on shared
physical storage, I guess. Gulp. 
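
For reference, that mapping on the backend side is no more than the
sketch below (completion handling and error paths left out;
backend_submit and backend_end_io are made-up names, the rest is the
2.6.3x-era bio interface as I remember it):

/* turn one guest segment, or an empty barrier, into a bio on the
 * backing gendisk; an empty WRITE_BARRIER bio is the explicit flush */
static void backend_submit(struct block_device *bdev, struct page *page,
                           unsigned int len, sector_t sector, bool barrier)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, page ? 1 : 0);

        bio->bi_bdev   = bdev;
        bio->bi_sector = sector;
        bio->bi_end_io = backend_end_io;
        if (page)
                bio_add_page(bio, page, len, 0);

        submit_bio(barrier ? WRITE_BARRIER : WRITE, bio);
}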

> > So I'm still wondering. Can you explain a little more what makes your
> > backend depend on it? 
> 
> Which backend? 

My understanding so far was that you want to have a draining bit
included as the default mode on the frontend/backend link. Maybe I
just got you wrong, in that case correct me and we get back to a proper
frontend patch and drop half of this thread.

If that's still the case, you need to enlighten me, I just don't seem to
get it.

We used to have one which actually was a DRAIN model, which was blktap
v1. That's why the patch submitted has this DRAIN if no barrier mode was
declared at all. We don't have a true application for that, it's mainly
because nobody really wants to fix it. If we had to, we'd still rather
fix it in the backend.

The way I see it, that kind of thing is just not expensive enough to add
any kind of complexity to the disk model beyond the idealized SCSI disk
the Linux block layer has been modeled after, for ages. And still is.

To me, you somewhat sound like you're working toward an entirely
different queue model, but I might be mistaken. Well, that'd be *really*
interesting, but not a particularly hot topic until you're about to get
there. I suspect we'd have to throw the SCSI model overboard entirely if
we wanted to take advantage of it. At which point we'd be basically
starting all over, and that's hardly about a xen-blkfront bugfix then... :P

> Currently filesystems can in theory rely on the ordering
> semantics, although very few do.  And we've not seen a working
> implementation except for draining for it - the _TAG versions exist,
> but they are basically untested, and no one has solved the issues of
> error handling for it yet.
> 
> > Otherwise one could always go and impose a couple extra flags on
> > frontend authors, provided there's a benefit and it doesn't result in
> > just mapping the entire QUEUE_ORDERED set into the control interface. :P
> > But either way, that doesn't sound like a preferrable solution if we can
> > spare it.
> 
> Basically the only thing you need is a cache flush command right now;
> that solves everything the Linux kernel needs, as does windows or
> possibly other guests.  The draining is something imposed on us by
> the current Linux barrier semantics, and I'm confident it will be a
> thing of the past by Linux 2.6.37.
> 
> > Well, I understand that _TAG is the only model in there which doesn't
> > map easily to the concept of a full cache flush on the normal data path,
> > after all it's the only one in there where the device wants to deal with
> > it alone. Which happens to be exactly the reason why we wanted it in
> > xen-blkfront. If it doesn't really work like that for a linux guest,
> > tough luck. It certainly works for a backend sitting on the bio layer.
> 
> I think you're another victim of the overloaded barrier concept.  What
> the Linux barrier flag does is two only slightly related things:
> 
>  a) give the filesystem a way to flush volatile write caches and thus
>     guarantee data integrity
>  b) provide block level ordering loosely modeled after the SCSI ordered
>     tag model

No victim, that's exactly the way I understand it. And without the
overloading we wouldn't have this TAG discussion.

We got ourselves b) as the virtual I/O model. We carried over a) for
explicit cache flushes as well.

Both because it's relatively painless, maps well to Linux dom0, and
quite probably to any other backend design as well. All in turn just
because it's so SCSIish.

> Practice has shown that we just need (a) in most cases, there's only
> two filesystems that can theoretically take advantage of (b), and even
> there I'm sure we could do better without the block level draining.
> 
> The _TAG implementation of barriers is related to (b) only - the pure
> QUEUE_ORDERED_TAG is only safe if you do not have a volatile write
> cache.

That's not true. It just degrades the issue of a cache flush to a don't
care to the driver on a sufficiently expensive disk. I don't mean to
imply it's not broken in blk-* if you state so, but that's the _model_
employed. 

I'm also not sure if even SCSI works that way in the explicit flush
case, because I'm in that fortunate position of never having written a
SCSI lld. I somewhat start to take it that it doesn't? Well, tough. 

I'm only after an idealizing TAG on the virtual layer. On the physical
layer with a Linux dom0 it may have no practical application, but there
it still works quite well for anyone, hence that yet-to-be-fixed
xen-blkfront patch.

So to defend Jeremy and me, barrier.txt didn't exactly state it's plain
broken, right?

> - to do cache flushing you need the QUEUE_ORDERED_TAG_FLUSH or
> QUEUE_ORDERED_TAG_FUA modes.   In theory we could also add another
> mode that not only integrates the post-flush in the _FUA mode but
> also a pre-flush into a single command, but so far there wasn't
> any demand, most likely because no on the wire storage protocol
> implements it.
> 
> > To fully disqualify myself: What are the normal kernel entries going for
> > a _full_ explicit cache flush?
> 
> The typical one is f(data)sync for the case where there have been no
> modifications of metadata, or when using an external log device.
> 
> No metadata modifications are quite typical for databases or
> virtualization images, or other bulk storage that doesn't allocate space
> on the fly.
> 
> > I'm only aware of BLKFLSBUF etc, and that
> > even kicks down into the driver as far as I see, so I wonder who else
> > would want empty barriers so badly under plain TAG ordering...
> 
> It does issue normal write barriers when you have dirty metadata, else
> it sends empty barriers if supported.
> 
> > > > The blktap userspace component presently doesn't buffer, so a _DRAIN is
> > > > sufficient. But if it did, then it'd be kinda cool if handled more
> > > > carefully. If the kernel does it, all the better.
> > > 
> > > > Doesn't buffer as in using O_SYNC/O_DSYNC or O_DIRECT?  
> > 
> > Right.
> 
> Err, that was a question.  For O_SYNC/O_DSYNC you don't need the
> explicit fsync.  For O_DIRECT you do (or use O_SYNC/O_DSYNC in addition).
> 
> > > to flush the volatile
> > > write cache of the host disks.
> > 
> > I take it the RW_BARRIER at least in the fully allocated case is well
> > taken care for? Just to make sure...
> 
> No, if you're using O_DIRECT you still need f(data)sync to flush out
> the host disk cache.
> 
> > Last time I checked I'm pretty sure I saw our aio writes completing with
> > a proper hw barrier. Extra rules for sparse regions are already bad
> enough for a plain luserland interface, especially since the lio
> > documentation doesn't exactly seem to cry it out loud and far... :}
> 
> All this will depend a lot on the filesystem.  But if you're not
> doing any allocation and you're not using O_SYNC/O_DSYNC most
> filesystems will not send any barrier at all.  The obvious exception is
> btrfs because it has to allocate new blocks anyway due to its copy on
> write scheme.

Okay, that all sounds sufficiently terrible to just go on vacation. I'll
be gone during the next week, can I return to bugging you about that
stuff later on?

Thanks a lot :)

Daniel


