linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
@ 2002-01-10  9:55 Miklos Szeredi
  2002-01-13  3:10 ` Pavel Machek
                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Miklos Szeredi @ 2002-01-10  9:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, avfs


FUSE 0.95 is available (download or CVS) from:

   http://sourceforge.net/projects/avf

What's new in 0.95 compared to 0.9

   - Major performance improvements in both the kernel module and the
     library parts.

   - Small number of bugs fixed.  FUSE has been through some stress
     testing and no problems have turned up yet.

   - Library interface simplified.  A simple 'hello world' filesystem
     can now be implemented in less than 100 lines.

   - Python (by Jeff Epler) and Perl (by Mark Glines) bindings are in
     the works, and will be released some time in the future (now
     available through CVS).

Problems still remaining:

   - Security problems when fuse is used by non-privileged users:

       o permissions on mountpoint can only be checked by kernel
         (patch exists)

       o user can intentionally block the page writeback operation,
         causing a system lockup.  I'm not sure this can be solved in
         a truly secure way.  Ideas?

Introduction for newbies:

  FUSE provides a simple interface for userspace programs to export a
  virtual filesystem to the Linux kernel.  FUSE also aims to provide a
  secure method for non privileged users to create and mount their own
  filesystem implementations.

  Fuse is available for the 2.4 (and later) kernel series.
  Installation is easy and does not need a kernel recompile.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
  2002-01-10  9:55 [ANNOUNCE] FUSE: Filesystem in Userspace 0.95 Miklos Szeredi
@ 2002-01-13  3:10 ` Pavel Machek
  2002-01-21 10:18 ` Miklos Szeredi
  2002-01-22 19:07 ` Daniel Phillips
  2 siblings, 0 replies; 73+ messages in thread
From: Pavel Machek @ 2002-01-13  3:10 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, avfs

Hi!

> FUSE 0.95 is available (download or CVS) from:
> 
>    http://sourceforge.net/projects/avf

(I'm offline, that's why I'm asking like that):

> What's new in 0.95 compared to 0.9
> 
>    - Major performance improvements in both the kernel module and the
>      library parts.
> 
>    - Small number of bugs fixed.  FUSE has been through some stress
>      testing and no problems have turned up yet.
> 
>    - Library interface simplified.  A simple 'hello world' filesystem
>      can now be implemented in less than 100 lines.

Are you multithreaded? Like will big ftp download block all FUSE, all ftp,
only one server, or everything?

>    - Python (by Jeff Epler) and Perl (by Mark Glines) bindings are in
>      the works, and will be released some time in the future (now
>      available through CVS).

Nice!

> Problems still remaining:
> 
>    - Security problems when fuse is used by non-privileged users:
> 
>        o user can intentionally block the page writeback operation,
>          causing a system lockup.  I'm not sure this can be solved in
>          a truly secure way.  Ideas?

How does GRUB solve this?

> Introduction for newbies:
> 
>   FUSE provides a simple interface for userspace programs to export a
>   virtual filesystem to the Linux kernel.  FUSE also aims to provide a
>   secure method for non privileged users to create and mount their own
>   filesystem implementations.
> 
>   Fuse is available for the 2.4 (and later) kernel series.
>   Installation is easy and does not need a kernel recompile.

Maybe it could replace sf/coda and fs/intermezzo? Is it powerfull/fast
enough for that?
								Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
  2002-01-10  9:55 [ANNOUNCE] FUSE: Filesystem in Userspace 0.95 Miklos Szeredi
  2002-01-13  3:10 ` Pavel Machek
@ 2002-01-21 10:18 ` Miklos Szeredi
  2002-01-23 10:47   ` Pavel Machek
  2002-01-22 19:07 ` Daniel Phillips
  2 siblings, 1 reply; 73+ messages in thread
From: Miklos Szeredi @ 2002-01-21 10:18 UTC (permalink / raw)
  To: pavel; +Cc: linux-fsdevel, linux-kernel, avfs

Hi Pavel!

> Are you multithreaded? Like will big ftp download block all FUSE, all ftp,
> only one server, or everything?

FUSE and AVFS are both multithreaded.  Specifically ftp is quite well
threaded, and a big download will not block any operation, even on the
same server. 

> >        o user can intentionally block the page writeback operation,
> >          causing a system lockup.  I'm not sure this can be solved in
> >          a truly secure way.  Ideas?
> 
> How does GRUB solve this?

What GRUB? 

> Maybe it could replace sf/coda and fs/intermezzo? Is it powerfull/fast
> enough for that?

No, at the moment there are a few features missing in FUSE, that CODA
needs (access caching, reading directly from files, file
reintegration).  But it is perfectly possible to add these features to
FUSE if somebody wants them. 

Actually having real disk files as a backing store for virtual files
could be the solution to the page writeback problem.  The only
question is how to manage this in an efficient way.

Miklos

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
  2002-01-10  9:55 [ANNOUNCE] FUSE: Filesystem in Userspace 0.95 Miklos Szeredi
  2002-01-13  3:10 ` Pavel Machek
  2002-01-21 10:18 ` Miklos Szeredi
@ 2002-01-22 19:07 ` Daniel Phillips
  2002-01-23  2:33   ` [Avfs] " Justin Mason
  2 siblings, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-01-22 19:07 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel, linux-kernel, avfs

On January 10, 2002 10:55 am, Miklos Szeredi wrote:
> FUSE 0.95 is available (download or CVS) from:
> 
>    http://sourceforge.net/projects/avf

I've been meaning to have a read through this for some time, and I'm finally 
getting around to it.  Random question: you seem to have embedded much of Joe 
Orton's Neon project (http://freshmeat.net/projects/neon/) in your tgz.  Is 
there any particular reason for that?  Isn't this going to turn into a 
maintainance problem eventually?

--
Daniel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Avfs] Re: [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
  2002-01-22 19:07 ` Daniel Phillips
@ 2002-01-23  2:33   ` Justin Mason
  2002-01-23  5:26     ` Daniel Phillips
  0 siblings, 1 reply; 73+ messages in thread
From: Justin Mason @ 2002-01-23  2:33 UTC (permalink / raw)
  To: avfs; +Cc: Miklos Szeredi, linux-fsdevel, linux-kernel


Daniel Phillips said:

> I've been meaning to have a read through this for some time, and I'm finally 
> getting around to it.  Random question: you seem to have embedded much of Joe 
> Orton's Neon project (http://freshmeat.net/projects/neon/) in your tgz.  Is 
> there any particular reason for that?  Isn't this going to turn into a 
> maintainance problem eventually?

It provides an API for remote DAV fs access, which AVFS/FUSE uses for the
dav module.  I wrote the initial DAV support, hence the reply ;)

There are some problems with it, namely that FUSE would work better with a
block-oriented GET/PUT api instead of a file-oriented one which Neon
provides.  So it should probably be replaced with calls to a HTTP client
implementation anyway, instead, at some point.

Not sure how it could be a maintainance problem, though; as long as Neon
is linked into the FUSE userspace daemon statically, it shouldn't collide
with other stuff.

--j.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Avfs] Re: [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
  2002-01-23  2:33   ` [Avfs] " Justin Mason
@ 2002-01-23  5:26     ` Daniel Phillips
  0 siblings, 0 replies; 73+ messages in thread
From: Daniel Phillips @ 2002-01-23  5:26 UTC (permalink / raw)
  To: Justin Mason, avfs; +Cc: Miklos Szeredi, linux-fsdevel, linux-kernel

On January 23, 2002 03:33 am, Justin Mason wrote:
> Daniel Phillips said:
> 
> > I've been meaning to have a read through this for some time, and I'm finally 
> > getting around to it.  Random question: you seem to have embedded much of Joe 
> > Orton's Neon project (http://freshmeat.net/projects/neon/) in your tgz.  Is 
> > there any particular reason for that?  Isn't this going to turn into a 
> > maintainance problem eventually?
> 
> It provides an API for remote DAV fs access, which AVFS/FUSE uses for the
> dav module.  I wrote the initial DAV support, hence the reply ;)
> 
> There are some problems with it, namely that FUSE would work better with a
> block-oriented GET/PUT api instead of a file-oriented one which Neon
> provides.  So it should probably be replaced with calls to a HTTP client
> implementation anyway, instead, at some point.
> 
> Not sure how it could be a maintainance problem, though; as long as Neon
> is linked into the FUSE userspace daemon statically, it shouldn't collide
> with other stuff.

I meant: your version of his code will start to age, or perhaps you will hack
on it, meanwhile he will move ahead with his.  So the two code bases, which
are really the same thing, will start to diverge.  How will you handle that?

--
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
  2002-01-21 10:18 ` Miklos Szeredi
@ 2002-01-23 10:47   ` Pavel Machek
  0 siblings, 0 replies; 73+ messages in thread
From: Pavel Machek @ 2002-01-23 10:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, avfs

Hi!

> > Are you multithreaded? Like will big ftp download block all FUSE, all ftp,
> > only one server, or everything?
> 
> FUSE and AVFS are both multithreaded.  Specifically ftp is quite well
> threaded, and a big download will not block any operation, even on the
> same server. 

Good!

> > >        o user can intentionally block the page writeback operation,
> > >          causing a system lockup.  I'm not sure this can be solved in
> > >          a truly secure way.  Ideas?
> > 
> > How does GRUB solve this?
> 
> What GRUB? 

Sorry, I meant HURD. It has "untrusted" filesystems in userland, too.
									Pavel
-- 
(about SSSCA) "I don't say this lightly.  However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
@ 2002-02-22 15:57 James Bottomley
  2002-02-22 16:10 ` Chris Mason
                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: James Bottomley @ 2002-02-22 15:57 UTC (permalink / raw)
  To: Stephen C. Tweedie, Chris Mason; +Cc: linux-kernel, James.Bottomley

> Most ext3 commits, in practice, are lazy, asynchronous commits, and we
> only nedd BH_Ordered_Tag for that, not *_Flush.  It would be easy
> enough to track whether a given transaction has any synchronous
> waiters, and if not, to use the async *_Tag request for the commit
> block instead of forcing a flush.

> We'd also have to track the sync status of the most recent
> transaction, too, so that on fsync of a non-dirty file/inode, we make
> sure that its data had been forced to disk by at least one synchronous
> flush.  

> But that's really only a win for SCSI, where proper async ordered tags
> are supported.  For IDE, the single BH_Ordered_Flush is quite
> sufficient.

Unfortunately, there's actually a hole in the SCSI spec that means ordered 
tags are actually extremely difficult to use in the way you want (although I 
think this is an accident, conceptually, I think they were supposed to be used 
for this).  For the interested, I attach the details at the bottom.

The easy way out of the problem, I think, is to impose the barrier as an 
effective queue plug in the SCSI mid-layer, so that after the mid-layer 
recevies the barrier, it plugs the device queue from below, drains the drive 
tag queue, sends the barrier and unplugs the device queue on barrier I/O 
completion.

Ordinarily, this would produce extremely poor performance since you're 
effectively starving the device to implement the barrier.  However, in Linux 
it might just work because it will give the elevator more time to coalesce 
requests.

James Bottomley

Problems Using Ordered Tags as a Barrier
========================================

Note, the following is independent of re-ordering on error conditions which 
was discussed in a previous thread.  This discussion pertains to normal device 
operations.

The SCSI tag system allows all devices to have a dynamic queue.  This means 
that there is no a priori guarantee about how many tags the device will accept 
before the queue becomes full.

The problem comes because most drivers issue SCSI commands directly from the 
incoming I/O thread but complete them via the SCSI interrupt routine.  What 
this means is that if the SCSI device decides it has no more resources left, 
the driver won't be told until it recevies an interrupt indicating that the 
queue is full and the particular I/O wasn't queued.  At this point, the user 
threads may have send down several more I/Os, and worse still, the SCSI device 
may have accepted some of the later I/Os because the local conditions causing 
it to signal queue full may have abated.

As I read the standard, there's no way to avoid this problem, since the queue 
full signal entitles the device not to queue the command, and not to act on 
any ordering the command may have had.

The other problem is actually driver related, not SCSI.  Most HBA chips are 
intelligent beasts which can process independently of the host CPU.  
Unfortunately, implementing linked lists tends to be rather beyond their 
capabilities.  For this reason, most low level drivers have a certain number 
of queue slots (per device, per chip or whatever).  The usual act of feeding 
an I/O command to a device involves stuffing it in the first available free 
slot.  This can lead to command re-ordering because even though the HBA is 
sequentially processing slots in a round-robin fashion, you don't often know 
which slot it is currently looking at.  Also, the multi threaded nature of tag 
command queuing means that the slot will remain full long after the HBA has 
started processing it and moved on to the next slot.

One possible get out is to process the queue full signal internally (either in 
the interrupt routine or in the chip driver itself) to force it to re-send of 
the non-queued tag until the drive actually accepts it.  As long as this 
looping is done at a level which prevents the device from accepting any more 
commands.  In general, this is nasty because it is effectively a busy wait 
inside the HBA and will block commands to all other devices until the device 
queue drained sufficiently to accept the tag.

The other possibility would be to treat all pending commands for a particular 
device as queue full errors if we get that for one of them.  This would 
require the interrupt or chip script routine to complete all commands for the 
particular device as queue full, which would still be quite a large amount of 
work for device driver writers.

Finally, I think the driver ordering problem can be solved easily as long as 
an observation I have about your barrier is true.  It seems to me that the 
barrier is only semi permeable, namely its purpose is to complete *after* a 
particular set of commands do.  This means that it doesnt matter if later 
commands move through the barrier, it only matters that earlier commands 
cannot move past it?  If this is true, then we can fix the slot problem simply 
by having a slot dedicated to barrier tags, so the processing engine goes over 
it once per cycle.  However, if it finds the barrier slot full, it doesn't 
issue the command until the *next* cycle, thus ensuring that all commands sent 
down before the barrier (plus a few after) are accepted by the device queue 
before we send the barrier with its ordered tag.




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-22 15:57 [PATCH] 2.4.x write barriers (updated for ext3) James Bottomley
@ 2002-02-22 16:10 ` Chris Mason
  2002-02-22 16:13 ` Stephen C. Tweedie
  2002-02-25 10:57 ` Helge Hafting
  2 siblings, 0 replies; 73+ messages in thread
From: Chris Mason @ 2002-02-22 16:10 UTC (permalink / raw)
  To: James Bottomley, Stephen C. Tweedie; +Cc: linux-kernel



On Friday, February 22, 2002 10:57:51 AM -0500 James Bottomley <James.Bottomley@steeleye.com> wrote:

[ very interesting stuff ]

> Finally, I think the driver ordering problem can be solved easily as long as 
> an observation I have about your barrier is true.  It seems to me that the 
> barrier is only semi permeable, namely its purpose is to complete *after* a 
> particular set of commands do.  

This is my requirement for reiserfs, where I still want to wait on the 
commit block to check for io errors.  sct might have other plans.

> This means that it doesnt matter if later 
> commands move through the barrier, it only matters that earlier commands 
> cannot move past it?  If this is true, then we can fix the slot problem simply 
> by having a slot dedicated to barrier tags, so the processing engine goes over 
> it once per cycle.  However, if it finds the barrier slot full, it doesn't 
> issue the command until the *next* cycle, thus ensuring that all commands sent 
> down before the barrier (plus a few after) are accepted by the device queue 
> before we send the barrier with its ordered tag.

Interesting, certainly sounds good.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-22 15:57 [PATCH] 2.4.x write barriers (updated for ext3) James Bottomley
  2002-02-22 16:10 ` Chris Mason
@ 2002-02-22 16:13 ` Stephen C. Tweedie
  2002-02-22 17:36   ` James Bottomley
  2002-02-25 10:57 ` Helge Hafting
  2 siblings, 1 reply; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-02-22 16:13 UTC (permalink / raw)
  To: James Bottomley; +Cc: Stephen C. Tweedie, Chris Mason, linux-kernel

Hi,

On Fri, Feb 22, 2002 at 10:57:51AM -0500, James Bottomley wrote:

> Finally, I think the driver ordering problem can be solved easily as long as 
> an observation I have about your barrier is true.  It seems to me that the 
> barrier is only semi permeable, namely its purpose is to complete *after* a 
> particular set of commands do.  This means that it doesnt matter if later 
> commands move through the barrier, it only matters that earlier commands 
> cannot move past it?

No.  A commit block must be fully ordered.  

If the commit block fails to be written, then we must be able to roll
the filesystem back to the consistent, pre-commit state, which implies
that any later IOs (which might be writeback IOs updating
now-committed metadata to final locations on disk) must not be allowed
to overtake the commit block.

However, in the current code, we don't assume that ordered queuing
works, so that later writeback will never be scheduled until we get a
positive completion acknowledgement for the commit block.  In other
words, right now, the scenario you describe is not a problem.

But ideally, with ordered queueing we would want to be able to relax
things by allowing writeback to be queued immediately the commit is
queued.  The ordered tag must be honoured in both directions in that
case.

There is a get-out for ext3 --- we can submit new journal IOs without
waiting for the commit IO to complete, but hold back on writeback IOs.
That still has the desired advantage of allowing us to stream to the
journal, but only requires that the commit block be ordered with
respect to older, not newer, IOs.  That gives us most of the benefits
of tagged queuing without any problems in your scenario.

--Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-22 16:13 ` Stephen C. Tweedie
@ 2002-02-22 17:36   ` James Bottomley
  2002-02-22 18:14     ` Chris Mason
  0 siblings, 1 reply; 73+ messages in thread
From: James Bottomley @ 2002-02-22 17:36 UTC (permalink / raw)
  To: Stephen C. Tweedie, Chris Mason; +Cc: James Bottomley, linux-kernel

sct@redhat.com said:
> There is a get-out for ext3 --- we can submit new journal IOs without
> waiting for the commit IO to complete, but hold back on writeback IOs.
> That still has the desired advantage of allowing us to stream to the
> journal, but only requires that the commit block be ordered with
> respect to older, not newer, IOs.  That gives us most of the benefits
> of tagged queuing without any problems in your scenario. 

Actually, I intended the tagged queueing discussion to be discouraging.  The 
amount of work that would have to be done to implement it is huge, touching, 
as it does, every low level driver's interrupt routine.  For the drivers that 
require scripting changes to the chip engine, it's even worse: only someone 
with specialised knowledge can actually make the changes.

It's feasible, but I think we'd have to demonstrate some quite significant 
performance or other improvements before changes on this scale would fly.

Neither of you commented on the original suggestion.  What I was wondering is 
if we could benchmark (or preferably improve on) it:

James.Bottomley@SteelEye.com said:
> The easy way out of the problem, I think, is to impose the barrier as
> an  effective queue plug in the SCSI mid-layer, so that after the
> mid-layer  recevies the barrier, it plugs the device queue from below,
> drains the drive  tag queue, sends the barrier and unplugs the device
> queue on barrier I/O  completion. 

If you need strict barrier ordering, then the queue is double plugged since 
the barrier has to be sent down and waited for on its own.  If you allow the 
discussed permiability, the queue is only single plugged since the barrier can 
be sent down along with the subsequent writes.

I can take a look at implementing this in the SCSI mid-layer and you could see 
what the benchmark figures look like with it in place.  If it really is the 
performance pig it looks like, then we could go back to the linux-scsi list 
with the tag change suggestions.

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-22 17:36   ` James Bottomley
@ 2002-02-22 18:14     ` Chris Mason
  2002-02-28 15:36       ` James Bottomley
  0 siblings, 1 reply; 73+ messages in thread
From: Chris Mason @ 2002-02-22 18:14 UTC (permalink / raw)
  To: James Bottomley, Stephen C. Tweedie; +Cc: linux-kernel



On Friday, February 22, 2002 12:36:22 PM -0500 James Bottomley <James.Bottomley@steeleye.com> wrote:

> sct@redhat.com said:
>> There is a get-out for ext3 --- we can submit new journal IOs without
>> waiting for the commit IO to complete, but hold back on writeback IOs.
>> That still has the desired advantage of allowing us to stream to the
>> journal, but only requires that the commit block be ordered with
>> respect to older, not newer, IOs.  That gives us most of the benefits
>> of tagged queuing without any problems in your scenario. 
> 
> Actually, I intended the tagged queueing discussion to be discouraging.  

;-) 

> The 
> amount of work that would have to be done to implement it is huge, touching, 
> as it does, every low level driver's interrupt routine.  For the drivers that 
> require scripting changes to the chip engine, it's even worse: only someone 
> with specialised knowledge can actually make the changes.
> 
> It's feasible, but I think we'd have to demonstrate some quite significant 
> performance or other improvements before changes on this scale would fly.

Very true.  At best, we pick one card we know it could work on, and
one target that we know is smart about tags, and try to demonstrate
the improvement.

> 
> Neither of you commented on the original suggestion.  What I was wondering is 
> if we could benchmark (or preferably improve on) it:
> 
> James.Bottomley@SteelEye.com said:
>> The easy way out of the problem, I think, is to impose the barrier as
>> an  effective queue plug in the SCSI mid-layer, so that after the
>> mid-layer  recevies the barrier, it plugs the device queue from below,
>> drains the drive  tag queue, sends the barrier and unplugs the device
>> queue on barrier I/O  completion. 

The main way the barriers could help performance is by allowing the
drive to write all the transaction and commit blocks at once.  Your
idea increases the chance the drive heads will still be correctly 
positioned to write the commit block, but doesn't let the drive 
stream things better.

The big advantage to using wait_on_buffer() instead is that it doesn't
order against data writes at all (from bdflush, or some other proc
other than a commit), allowing the drive to optimize those
at the same time it is writing the commit.  Using ordered tags has the
same problem, it might just be that wait_on_buffer is the best way to 
go.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-22 15:57 [PATCH] 2.4.x write barriers (updated for ext3) James Bottomley
  2002-02-22 16:10 ` Chris Mason
  2002-02-22 16:13 ` Stephen C. Tweedie
@ 2002-02-25 10:57 ` Helge Hafting
  2002-02-25 15:04   ` James Bottomley
  2 siblings, 1 reply; 73+ messages in thread
From: Helge Hafting @ 2002-02-25 10:57 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
[...]
> Unfortunately, there's actually a hole in the SCSI spec that means ordered
> tags are actually extremely difficult to use in the way you want (although I
> think this is an accident, conceptually, I think they were supposed to be used
> for this).  For the interested, I attach the details at the bottom.
> 
[...]
> The SCSI tag system allows all devices to have a dynamic queue.  This means
> that there is no a priori guarantee about how many tags the device will accept
> before the queue becomes full.
> 

I just wonder - isn't the amount of outstanding requests a device
can handle constant?  If so, the user could determine this (from spec or
by running an utility that generates "too much" traffic.)  

The max number of requests may then be compiled in or added as
a kernel boot parameter.  The kernel would honor this and never ever
have more outstanding requests than it believes the device
can handle.  

Those who don't want to bother can use some low default or accept the
risk.

Helge Hafting

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-25 10:57 ` Helge Hafting
@ 2002-02-25 15:04   ` James Bottomley
  0 siblings, 0 replies; 73+ messages in thread
From: James Bottomley @ 2002-02-25 15:04 UTC (permalink / raw)
  To: Helge Hafting; +Cc: James Bottomley, linux-kernel

helgehaf@aitel.hist.no said:
> I just wonder - isn't the amount of outstanding requests a device can
> handle constant?  If so, the user could determine this (from spec or
> by running an utility that generates "too much" traffic.)   

The spec doesn't make any statements about this, so the devices are allowed to 
do whatever seems best.  Although it is undoubtedly implemented as a fixed 
queue on a few devices, there are others whose queue depth depends on the 
available resources (most disk arrays function this way---they tend to juggle 
tag queue depth dynamically per lun).

Even if the queue depth is fixed, you have to probe it dynamically because it 
will be different for each device.  Even worse, on a SAN or other shared bus, 
you might not be the only initiator using the device queue, so even for a 
device with a fixed queue depth you don't own all the slots so the queue depth 
you see varies.

The bottom line is that you have to treat the queue full return as a normal 
part of I/O flow control to SCSI devices.

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-22 18:14     ` Chris Mason
@ 2002-02-28 15:36       ` James Bottomley
  2002-02-28 15:55         ` Chris Mason
                           ` (3 more replies)
  0 siblings, 4 replies; 73+ messages in thread
From: James Bottomley @ 2002-02-28 15:36 UTC (permalink / raw)
  To: Chris Mason, Stephen C. Tweedie; +Cc: James Bottomley, linux-kernel, linux-scsi

Doug Gilbert prompted me to re-examine my notions about SCSI drive caching, 
and sure enough the standard says (and all the drives I've looked at so far 
come with) write back caching enabled by default.

Since this is a threat to the integrity of Journalling FS in power failure 
situations now, I think it needs to be addressed with some urgency.

The "quick fix" would obviously be to get the sd driver to do a mode select at 
probe time to turn off the WCE and RCD bits (this will place the cache into 
write through mode), which would match the assumptions all the JFSs currently 
make.  I'll see if I can code up a quick patch to do this.

A longer term solution might be to keep the writeback cache but send down a 
SYNCHRONIZE CACHE command as part of the back end completion of a barrier 
write, so the fs wouldn't get a completion until the write was done and all 
the dirty cache blocks flushed to the medium.

Clearly, there would also have to be a mechanism to flush the cache on 
unmount, so if this were done by ioctl, would you prefer that the filesystem 
be in charge of flushing the cache on barrier writes, or would you like the sd 
device to do it transparently?

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-28 15:36       ` James Bottomley
@ 2002-02-28 15:55         ` Chris Mason
  2002-02-28 17:58           ` Mike Anderson
  2002-02-28 18:12         ` Chris Mason
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 73+ messages in thread
From: Chris Mason @ 2002-02-28 15:55 UTC (permalink / raw)
  To: James Bottomley, Stephen C. Tweedie; +Cc: linux-kernel, linux-scsi



On Thursday, February 28, 2002 09:36:52 AM -0600 James Bottomley <James.Bottomley@steeleye.com> wrote:

> Doug Gilbert prompted me to re-examine my notions about SCSI drive caching, 
> and sure enough the standard says (and all the drives I've looked at so far 
> come with) write back caching enabled by default.

Really.  Has it always been this way?

> 
> Since this is a threat to the integrity of Journalling FS in power failure 
> situations now, I think it needs to be addressed with some urgency.
> 
> The "quick fix" would obviously be to get the sd driver to do a mode select at 
> probe time to turn off the WCE and RCD bits (this will place the cache into 
> write through mode), which would match the assumptions all the JFSs currently 
> make.  I'll see if I can code up a quick patch to do this.

Ok.

> 
> A longer term solution might be to keep the writeback cache but send down a 
> SYNCHRONIZE CACHE command as part of the back end completion of a barrier 
> write, so the fs wouldn't get a completion until the write was done and all 
> the dirty cache blocks flushed to the medium.

Right, they could just implement ORDERED_FLUSH in the barrier patch.

> 
> Clearly, there would also have to be a mechanism to flush the cache on 
> unmount, so if this were done by ioctl, would you prefer that the filesystem 
> be in charge of flushing the cache on barrier writes, or would you like the sd 
> device to do it transparently?

How about triggered by closing the block device.  That would also cover
people like oracle that do stuff to the raw device.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-28 15:55         ` Chris Mason
@ 2002-02-28 17:58           ` Mike Anderson
  0 siblings, 0 replies; 73+ messages in thread
From: Mike Anderson @ 2002-02-28 17:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: James Bottomley, Stephen C. Tweedie, linux-kernel, linux-scsi

Chris Mason [mason@suse.com] wrote:
> 
> ..snip..
> > 
> > Clearly, there would also have to be a mechanism to flush the cache on 
> > unmount, so if this were done by ioctl, would you prefer that the filesystem 
> > be in charge of flushing the cache on barrier writes, or would you like the sd 
> > device to do it transparently?
> 
> How about triggered by closing the block device.  That would also cover
> people like oracle that do stuff to the raw device.
> 
> -chris

Doing something in sd_release should be covered in the raw case. 
raw_release->blkdev_put->bdev->bd_op->release "sd_release".

At least from what I understand of the raw release call path :-).
-Mike
-- 
Michael Anderson
andmike@us.ibm.com


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-28 15:36       ` James Bottomley
  2002-02-28 15:55         ` Chris Mason
@ 2002-02-28 18:12         ` Chris Mason
  2002-03-01  2:08           ` James Bottomley
  2002-03-03 22:11         ` Daniel Phillips
  2002-03-04  3:34         ` Chris Mason
  3 siblings, 1 reply; 73+ messages in thread
From: Chris Mason @ 2002-02-28 18:12 UTC (permalink / raw)
  To: James Bottomley, Stephen C. Tweedie; +Cc: linux-kernel, linux-scsi



On Thursday, February 28, 2002 10:55:46 AM -0500 Chris Mason <mason@suse.com> wrote:

>> 
>> A longer term solution might be to keep the writeback cache but send down a 
>> SYNCHRONIZE CACHE command as part of the back end completion of a barrier 
>> write, so the fs wouldn't get a completion until the write was done and all 
>> the dirty cache blocks flushed to the medium.
> 
> Right, they could just implement ORDERED_FLUSH in the barrier patch.

So, a little testing with scsi_info shows my scsi drives do have
writeback cache on.  great.  What's interesting is they
must be doing additional work for ordered tags.  If they were treating
the block as written once in cache, using the tags should not change 
performance at all.  But, I can clearly show the tags changing
performance, and hear the drive write pattern change when tags are on.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-28 18:12         ` Chris Mason
@ 2002-03-01  2:08           ` James Bottomley
  0 siblings, 0 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-01  2:08 UTC (permalink / raw)
  To: Chris Mason; +Cc: James Bottomley, Stephen C. Tweedie, linux-kernel, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]

mason@suse.com said:
> So, a little testing with scsi_info shows my scsi drives do have
> writeback cache on.  great.  What's interesting is they must be doing
> additional work for ordered tags.  If they were treating the block as
> written once in cache, using the tags should not change  performance
> at all.  But, I can clearly show the tags changing performance, and
> hear the drive write pattern change when tags are on. 

I checked all mine and they're write through.  However, I inherited all my 
drives from an enterprise vendor so this might not be that surprising.

I can surmise why ordered tags kill performance on your drive, since an 
ordered tag is required to affect the ordering of the write to the medium, not 
the cache, it is probably implemented with an implicit cache flush.

Anyway, the attached patch against 2.4.18 (and I know it's rather gross code) 
will probe the cache type and try to set it to write through on boot.  See 
what this does to your performance ordinarily, and also to your tagged write 
barrier performance.

James



[-- Attachment #2: sd-cache.diff --]
[-- Type: text/plain , Size: 3973 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.166   -> 1.167  
#	   drivers/scsi/sd.c	1.18    -> 1.19   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/28	jejb@malley.il.steeleye.com	1.167
# changes in sd driver
# 
# Drive cache set to write back if possible.
# --------------------------------------------
#
diff -Nru a/drivers/scsi/sd.c b/drivers/scsi/sd.c
--- a/drivers/scsi/sd.c	Thu Feb 28 20:04:49 2002
+++ b/drivers/scsi/sd.c	Thu Feb 28 20:04:49 2002
@@ -741,7 +741,7 @@
 	char nbuff[6];
 	unsigned char *buffer;
 	unsigned long spintime_value = 0;
-	int the_result, retries, spintime;
+	int the_result, retries, spintime, mode_retries;
 	int sector_size;
 	Scsi_Request *SRpnt;
 
@@ -858,6 +858,105 @@
 		else
 			printk("ready\n");
 	}
+
+	mode_retries = 2;	/* make two attempts to change the cache type */
+
+ retry_mode_select:
+	retries = 3;
+	do {
+
+		memset((void *) &cmd[0], 0, 10);
+		cmd[0] = MODE_SENSE;
+		cmd[1] = (rscsi_disks[i].device->scsi_level <= SCSI_2) ?
+			 ((rscsi_disks[i].device->lun << 5) & 0xe0) : 0;
+		cmd[1] |= 0x08;	/* DBD */
+		cmd[2] = 0x08;	/* current values, cache page */
+		cmd[4] = 24;	/* allocation length */
+
+
+		memset((void *) buffer, 0, 24);
+		SRpnt->sr_cmd_len = 0;
+		SRpnt->sr_sense_buffer[0] = 0;
+		SRpnt->sr_sense_buffer[2] = 0;
+
+		SRpnt->sr_data_direction = SCSI_DATA_READ;
+		scsi_wait_req(SRpnt, (void *) cmd, (void *) buffer,
+			    24, SD_TIMEOUT, MAX_RETRIES);
+
+		the_result = SRpnt->sr_result;
+		retries--;
+
+	} while (the_result && retries);
+
+	if (the_result) {
+		printk("%s : MODE SENSE failed.\n"
+		       "%s : status = %x, message = %02x, host = %d, driver = %02x \n",
+		       nbuff, nbuff,
+		       status_byte(the_result),
+		       msg_byte(the_result),
+		       host_byte(the_result),
+		       driver_byte(the_result)
+		    );
+		if (driver_byte(the_result) & DRIVER_SENSE)
+			print_req_sense("sd", SRpnt);
+		else
+			printk("%s : sense not available. \n", nbuff);
+	} else {
+		const char *types[] = { "write through", "none", "write back", "write back, no read (daft)" };
+		int ct = 0;
+
+		ct = (buffer[6] & 0x01 /* RCD */) | ((buffer[6] & 0x04 /* WCE */) >> 1);
+
+		printk("%s : checking drive cache: %s \n", nbuff, types[ct]);
+		if(ct != 0x0 && mode_retries-- == 0) {
+			printk("%s : FAILED to change cache to write back, continuing\n", nbuff);
+		}
+		else if(ct != 0x0) {
+			retries = 3;
+			buffer[6] &= (~0x05); /* clear RCD and WCE */
+			do {
+				memset((void *) &cmd[0], 0, 10);
+				cmd[0] = MODE_SELECT;
+				cmd[1] = (rscsi_disks[i].device->scsi_level <= SCSI_2) ?
+					((rscsi_disks[i].device->lun << 5) & 0xe0) : 0;
+				cmd[1] |= 0x10;	/* PF */
+				cmd[4] = 24;	/* allocation length */
+				
+				
+				SRpnt->sr_cmd_len = 0;
+				SRpnt->sr_sense_buffer[0] = 0;
+				SRpnt->sr_sense_buffer[2] = 0;
+				
+				SRpnt->sr_data_direction = SCSI_DATA_WRITE;
+				scsi_wait_req(SRpnt, (void *) cmd, (void *) buffer,
+					      24, SD_TIMEOUT, MAX_RETRIES);
+
+				the_result = SRpnt->sr_result;
+				retries--;
+
+			} while (the_result && retries);
+
+			if (the_result) {
+				printk("%s : MODE SELECT failed.\n"
+				       "%s : status = %x, message = %02x, host = %d, driver = %02x \n",
+				       nbuff, nbuff,
+				       status_byte(the_result),
+				       msg_byte(the_result),
+				       host_byte(the_result),
+				       driver_byte(the_result)
+				       );
+				if (driver_byte(the_result) & DRIVER_SENSE)
+					print_req_sense("sd", SRpnt);
+				else
+					printk("%s : sense not available. \n", nbuff);
+			} else {
+				printk("%s : changing drive cache to write through\n", nbuff);
+			}
+			goto retry_mode_select;
+		}
+		
+	}
+
 	retries = 3;
 	do {
 		cmd[0] = READ_CAPACITY;

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-28 15:36       ` James Bottomley
  2002-02-28 15:55         ` Chris Mason
  2002-02-28 18:12         ` Chris Mason
@ 2002-03-03 22:11         ` Daniel Phillips
  2002-03-04  4:21           ` Jeremy Higdon
  2002-03-04 14:48           ` James Bottomley
  2002-03-04  3:34         ` Chris Mason
  3 siblings, 2 replies; 73+ messages in thread
From: Daniel Phillips @ 2002-03-03 22:11 UTC (permalink / raw)
  To: James Bottomley, Chris Mason, Stephen C. Tweedie
  Cc: James Bottomley, linux-kernel, linux-scsi

On February 28, 2002 04:36 pm, James Bottomley wrote:
> Doug Gilbert prompted me to re-examine my notions about SCSI drive caching, 
> and sure enough the standard says (and all the drives I've looked at so far 
> come with) write back caching enabled by default.
> 
> Since this is a threat to the integrity of Journalling FS in power failure 
> situations now, I think it needs to be addressed with some urgency.
> 
> The "quick fix" would obviously be to get the sd driver to do a mode select at 
> probe time to turn off the WCE and RCD bits (this will place the cache into 
> write through mode), which would match the assumptions all the JFSs currently 
> make.  I'll see if I can code up a quick patch to do this.
> 
> A longer term solution might be to keep the writeback cache but send down a 
> SYNCHRONIZE CACHE command as part of the back end completion of a barrier 
> write, so the fs wouldn't get a completion until the write was done and all 
> the dirty cache blocks flushed to the medium.

I've been following the thread, I hope I haven't missed anything fundamental.
A better long term solution is to have ordered tags work as designed.  It's 
not broken by design is it, just implementation?

I have a standing offer from at least one engineer to make firmware changes 
to the drives if it makes Linux work better.  So a reasonable plan is: first 
know what's ideal, second ask for it.  Coupled with that, we'd need a way of 
identifying drives that don't work in the ideal way, and require a fallback.

In my opinion, the only correct behavior is a write barrier that completes
when data is on the platter, and that does this even when write-back is
enabled.  Surely this is not rocket science at the disk firmware level.  Is
this or is this not the way ordered tags were supposed to work?

> Clearly, there would also have to be a mechanism to flush the cache on 
> unmount, so if this were done by ioctl, would you prefer that the filesystem 
> be in charge of flushing the cache on barrier writes, or would you like the sd 
> device to do it transparently?

The filesystem should just say 'this request is a write barrier' and the 
lower layers, whether that's scsi or bio, should do what's necessary to make
it come true.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-28 15:36       ` James Bottomley
                           ` (2 preceding siblings ...)
  2002-03-03 22:11         ` Daniel Phillips
@ 2002-03-04  3:34         ` Chris Mason
  2002-03-04  5:05           ` Daniel Phillips
                             ` (2 more replies)
  3 siblings, 3 replies; 73+ messages in thread
From: Chris Mason @ 2002-03-04  3:34 UTC (permalink / raw)
  To: Daniel Phillips, James Bottomley, Stephen C. Tweedie
  Cc: linux-kernel, linux-scsi



On Sunday, March 03, 2002 11:11:44 PM +0100 Daniel Phillips <phillips@bonn-fries.net> wrote:

> I have a standing offer from at least one engineer to make firmware changes 
> to the drives if it makes Linux work better.  So a reasonable plan is: first 
> know what's ideal, second ask for it.  Coupled with that, we'd need a way of 
> identifying drives that don't work in the ideal way, and require a fallback.
> 
> In my opinion, the only correct behavior is a write barrier that completes
> when data is on the platter, and that does this even when write-back is
> enabled.  

With a battery backup, we want the raid controller (or whatever) to 
pretend the barrier is done right away.  It should be as safe, and 
allow the target to merge the writes.

> Surely this is not rocket science at the disk firmware level.  Is
> this or is this not the way ordered tags were supposed to work?

There are many issues at play in this thread, here's an attempt at
a summary (please correct any mistakes).

1) The drivers would need to be changed to properly keep tag ordering 
in place on resets, and error conditions.

2) ordered tags force ordering of all writes the drive is processing.
For some workloads, it will be forced to order stuff the journal code
doesn't care about at all, perhaps leading to lower performance than
the simple wait_on_buffer() we're using now.

2a) Are the filesystems asking for something impossible?  Can drives
really write block N and N+1, making sure to commit N to media before
N+1 (including an abort on N+1 if N fails), but still keeping up a 
nice seek free stream of writes?

3) Some drives may not be very smart about ordered tags.  We need
to figure out which is faster, using the ordered tag or using a
simple cache flush (when writeback is on).  The good news about
the cache flush is that it doesn't require major surgery in the
scsi error handlers.

4) If some scsi drives come with writeback on by default, do they also
turn it on under high load like IDE drives do?

> 
>> Clearly, there would also have to be a mechanism to flush the cache on 
>> unmount, so if this were done by ioctl, would you prefer that the filesystem 
>> be in charge of flushing the cache on barrier writes, or would you like the sd 
>> device to do it transparently?
> 
> The filesystem should just say 'this request is a write barrier' and the 
> lower layers, whether that's scsi or bio, should do what's necessary to make
> it come true.

That's the goal.  The current 2.4 patch differentiates between ordered
barriers and flush barriers just so I can make the flush the default
on IDE, and enable the ordered stuff when I want to experiment on scsi.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-03 22:11         ` Daniel Phillips
@ 2002-03-04  4:21           ` Jeremy Higdon
  2002-03-04  5:31             ` Daniel Phillips
  2002-03-04 14:48           ` James Bottomley
  1 sibling, 1 reply; 73+ messages in thread
From: Jeremy Higdon @ 2002-03-04  4:21 UTC (permalink / raw)
  To: Daniel Phillips, James Bottomley, Chris Mason, Stephen C. Tweedie
  Cc: linux-kernel, linux-scsi

On Mar 3, 11:11pm, Daniel Phillips wrote:
> I have a standing offer from at least one engineer to make firmware changes 
> to the drives if it makes Linux work better.  So a reasonable plan is: first 
> know what's ideal, second ask for it.  Coupled with that, we'd need a way of 
> identifying drives that don't work in the ideal way, and require a fallback.
> 
> In my opinion, the only correct behavior is a write barrier that completes
> when data is on the platter, and that does this even when write-back is
> enabled.  Surely this is not rocket science at the disk firmware level.  Is
> this or is this not the way ordered tags were supposed to work?


Ordered tags just specify ordering in the command stream.  The WCE bit
specifies when the write command is complete.  I have never heard of
any implied requirement to flush to media when a drive receives an
ordered tag and WCE is set.  It does seem like a useful feature to have
in the standard, but I don't think it's there.

So if one vendor implements those semantics, but the others don't where
does that leave us?

jeremy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  3:34         ` Chris Mason
@ 2002-03-04  5:05           ` Daniel Phillips
  2002-03-04 15:03             ` James Bottomley
  2002-03-04  8:19           ` Helge Hafting
  2002-03-04 14:57           ` James Bottomley
  2 siblings, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04  5:05 UTC (permalink / raw)
  To: Chris Mason, James Bottomley, Stephen C. Tweedie; +Cc: linux-kernel, linux-scsi

On March 4, 2002 04:34 am, Chris Mason wrote:
> On Sunday, March 03, 2002 11:11:44 PM +0100 Daniel Phillips <phillips@bonn-fries.net> wrote:
> 
> > I have a standing offer from at least one engineer to make firmware changes 
> > to the drives if it makes Linux work better.  So a reasonable plan is: first 
> > know what's ideal, second ask for it.  Coupled with that, we'd need a way of 
> > identifying drives that don't work in the ideal way, and require a fallback.
> > 
> > In my opinion, the only correct behavior is a write barrier that completes
> > when data is on the platter, and that does this even when write-back is
> > enabled.  
> 
> With a battery backup, we want the raid controller (or whatever) to 
> pretend the barrier is done right away.  It should be as safe, and 
> allow the target to merge the writes.

Agreed, that should count as 'on the platter'.  Unless the battery is flat...

> > Surely this is not rocket science at the disk firmware level.  Is
> > this or is this not the way ordered tags were supposed to work?
> 
> There are many issues at play in this thread, here's an attempt at
> a summary (please correct any mistakes).
> 
> 1) The drivers would need to be changed to properly keep tag ordering 
> in place on resets, and error conditions.

Linux drivers?  Isn't that a simple matter of coding? ;-)

> 2) ordered tags force ordering of all writes the drive is processing.
> For some workloads, it will be forced to order stuff the journal code
> doesn't care about at all, perhaps leading to lower performance than
> the simple wait_on_buffer() we're using now.

OK, thanks for the clear definition of the problem.  This corresponds
to my reading of this document:

   http://www.storage.ibm.com/hardsoft/products/ess/pubs/f2ascsi1.pdf

   Ordered Queue Tag:

   The command begins execution after all previously issued commands
   complete.  Subsequent commands may not begin execution until this
   command completes (unless they are issued with Head of Queue tag
   messages).

But chances are, almost all the IOs ahead of the journal commit belong
to your same filesystem anyway, so you may be worrying too much about
possibly waiting for something on another partition.

In theory, bio could notice the barrier coming down the pipe and hold
back commands on other partitions, if they're too far away physically.

> 2a) Are the filesystems asking for something impossible?  Can drives
> really write block N and N+1, making sure to commit N to media before
> N+1 (including an abort on N+1 if N fails), but still keeping up a 
> nice seek free stream of writes?
> 
> 3) Some drives may not be very smart about ordered tags.  We need
> to figure out which is faster, using the ordered tag or using a
> simple cache flush (when writeback is on).  The good news about
> the cache flush is that it doesn't require major surgery in the
> scsi error handlers.

Everything else seems to be getting major surgery these days, so...

> 4) If some scsi drives come with writeback on by default, do they also
> turn it on under high load like IDE drives do?

It shouldn't matter, if the ordered queue tag is implemented properly.
>From the thread I gather it isn't always, which means we need a
blacklist, or putting on a happier face, a whitelist.

> >> Clearly, there would also have to be a mechanism to flush the cache on 
> >> unmount, so if this were done by ioctl, would you prefer that the filesystem 
> >> be in charge of flushing the cache on barrier writes, or would you like the sd 
> >> device to do it transparently?
> > 
> > The filesystem should just say 'this request is a write barrier' and the 
> > lower layers, whether that's scsi or bio, should do what's necessary to make
> > it come true.
> 
> That's the goal.  The current 2.4 patch differentiates between ordered
> barriers and flush barriers just so I can make the flush the default
> on IDE, and enable the ordered stuff when I want to experiment on scsi.

I should state it more precisely: 'this request is a write barrier for this
partition'.  Is that what you had in mind?

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  4:21           ` Jeremy Higdon
@ 2002-03-04  5:31             ` Daniel Phillips
  2002-03-04  6:09               ` Jeremy Higdon
  0 siblings, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04  5:31 UTC (permalink / raw)
  To: Jeremy Higdon, James Bottomley, Chris Mason, Stephen C. Tweedie
  Cc: linux-kernel, linux-scsi

On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> On Mar 3, 11:11pm, Daniel Phillips wrote:
> > I have a standing offer from at least one engineer to make firmware changes 
> > to the drives if it makes Linux work better.  So a reasonable plan is: first 
> > know what's ideal, second ask for it.  Coupled with that, we'd need a way of 
> > identifying drives that don't work in the ideal way, and require a fallback.
> > 
> > In my opinion, the only correct behavior is a write barrier that completes
> > when data is on the platter, and that does this even when write-back is
> > enabled.  Surely this is not rocket science at the disk firmware level.  Is
> > this or is this not the way ordered tags were supposed to work?
> 
> Ordered tags just specify ordering in the command stream.  The WCE bit
> specifies when the write command is complete.

WCE is per-command?  And 0 means no caching, so the command must complete
when the data is on the media?

> I have never heard of
> any implied requirement to flush to media when a drive receives an
> ordered tag and WCE is set.  It does seem like a useful feature to have
> in the standard, but I don't think it's there.

It seems to be pretty strongly implied that things should work that way.
What is the use of being sure the write with the ordered tag is on media
if you're not sure about the writes that were supposedly supposed to
precede it?  Spelling this out would indeed be helpful.

> So if one vendor implements those semantics, but the others don't where
> does that leave us?

It leaves us with a vendor we want to buy our drives from, if we want our
data to be safe.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  5:31             ` Daniel Phillips
@ 2002-03-04  6:09               ` Jeremy Higdon
  2002-03-04  7:57                 ` Daniel Phillips
  2002-03-04 16:52                 ` Stephen C. Tweedie
  0 siblings, 2 replies; 73+ messages in thread
From: Jeremy Higdon @ 2002-03-04  6:09 UTC (permalink / raw)
  To: Daniel Phillips, James Bottomley, Chris Mason, Stephen C. Tweedie
  Cc: linux-kernel, linux-scsi

On Mar 4,  6:31am, Daniel Phillips wrote:
> On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > On Mar 3, 11:11pm, Daniel Phillips wrote:
> > > I have a standing offer from at least one engineer to make firmware changes 
> > > to the drives if it makes Linux work better.  So a reasonable plan is: first 
> > > know what's ideal, second ask for it.  Coupled with that, we'd need a way of 
> > > identifying drives that don't work in the ideal way, and require a fallback.
> > > 
> > > In my opinion, the only correct behavior is a write barrier that completes
> > > when data is on the platter, and that does this even when write-back is
> > > enabled.  Surely this is not rocket science at the disk firmware level.  Is
> > > this or is this not the way ordered tags were supposed to work?
> > 
> > Ordered tags just specify ordering in the command stream.  The WCE bit
> > specifies when the write command is complete.
> 
> WCE is per-command?  And 0 means no caching, so the command must complete
> when the data is on the media?

My reading is that WCE==1 means that the command is complete when the
data is in the drive buffer.

> > I have never heard of
> > any implied requirement to flush to media when a drive receives an
> > ordered tag and WCE is set.  It does seem like a useful feature to have
> > in the standard, but I don't think it's there.
> 
> It seems to be pretty strongly implied that things should work that way.
> What is the use of being sure the write with the ordered tag is on media
> if you're not sure about the writes that were supposedly supposed to
> precede it?  Spelling this out would indeed be helpful.

WCE==1 and ordered tag means that the data for previous commands is in
the drive buffer before the data for the ordered tag is in the drive
buffer.

> > So if one vendor implements those semantics, but the others don't where
> > does that leave us?
> 
> It leaves us with a vendor we want to buy our drives from, if we want our
> data to be safe.

The point is, do you write code that depends on one vendor's interpretation?
If so, then the vendor needs to be identified.  Perhaps other vendors will
then align themselves.

> Daniel

jeremy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  6:09               ` Jeremy Higdon
@ 2002-03-04  7:57                 ` Daniel Phillips
  2002-03-05  7:09                   ` Jeremy Higdon
  2002-03-04 16:52                 ` Stephen C. Tweedie
  1 sibling, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04  7:57 UTC (permalink / raw)
  To: Jeremy Higdon, James Bottomley, Chris Mason, Stephen C. Tweedie
  Cc: linux-kernel, linux-scsi

On March 4, 2002 07:09 am, Jeremy Higdon wrote:
> On Mar 4,  6:31am, Daniel Phillips wrote:
> > On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > > I have never heard of
> > > any implied requirement to flush to media when a drive receives an
> > > ordered tag and WCE is set.  It does seem like a useful feature to have
> > > in the standard, but I don't think it's there.
> > 
> > It seems to be pretty strongly implied that things should work that way.
> > What is the use of being sure the write with the ordered tag is on media
> > if you're not sure about the writes that were supposedly supposed to
> > precede it?  Spelling this out would indeed be helpful.
> 
> WCE==1 and ordered tag means that the data for previous commands is in
> the drive buffer before the data for the ordered tag is in the drive
> buffer.

Right, and what we're talking about is going further and requiring that WCE=0
and ordered tag means the data for previous commands is *not* in the buffer,
i.e., on the platter, which is the only interpretation that makes sense.

> > > So if one vendor implements those semantics, but the others don't where
> > > does that leave us?
> > 
> > It leaves us with a vendor we want to buy our drives from, if we want our
> > data to be safe.
> 
> The point is, do you write code that depends on one vendor's interpretation?

Yes, that's the idea.  And we need some way of knowing which vendors have
interpreted the scsi spec in the way that maximizes both throughput and
safety.  That's the 'whitelist'.

> If so, then the vendor needs to be identified.  Perhaps other vendors will
> then align themselves.

I'm sure they will.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  3:34         ` Chris Mason
  2002-03-04  5:05           ` Daniel Phillips
@ 2002-03-04  8:19           ` Helge Hafting
  2002-03-04 14:57           ` James Bottomley
  2 siblings, 0 replies; 73+ messages in thread
From: Helge Hafting @ 2002-03-04  8:19 UTC (permalink / raw)
  To: Chris Mason, linux-kernel, linux-scsi

On Sun, Mar 03, 2002 at 10:34:07PM -0500, Chris Mason wrote:
[...]
> 3) Some drives may not be very smart about ordered tags.  We need
> to figure out which is faster, using the ordered tag or using a
> simple cache flush (when writeback is on).  The good news about
> the cache flush is that it doesn't require major surgery in the
> scsi error handlers.

Isn't that a userspace thing?  I.e. use ordered tags in the best
way possible for drives that _are_ smart about ordered tags.
Let the admin change that with a hdparm-like utility
if testing (or specs) confirms that this particular
drive takes a performance hit.  

I thing the days of putting up with any stupid hw is
slowly going away.  Linux is a serious server os these
days, and disk makers will be smart about ordered tags
if some server os do benefit from it.  It won't
really cost them much either.  

Old hw is another story of course - some sort of
fallback might be useful for that. But probably
not for next year's drives. :-)

Helge Hafting

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-03 22:11         ` Daniel Phillips
  2002-03-04  4:21           ` Jeremy Higdon
@ 2002-03-04 14:48           ` James Bottomley
  2002-03-06 13:59             ` Daniel Phillips
  1 sibling, 1 reply; 73+ messages in thread
From: James Bottomley @ 2002-03-04 14:48 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: James Bottomley, Chris Mason, Stephen C. Tweedie, linux-kernel,
	linux-scsi

phillips@bonn-fries.net said:
> I've been following the thread, I hope I haven't missed anything
> fundamental. A better long term solution is to have ordered tags work
> as designed.  It's  not broken by design is it, just implementation? 

There is actually one hole in the design:  A scsi device may accept a command 
with an ordered tag, disconnect and at a later time reconnect and return a 
QUEUE FULL status indicating that the tag must be retried.  In the time 
between the disconnect and reconnect, the standard doesn't require that no 
other tags be accepted, so if the local flow control conditions abate, the 
device is allowed to accept and execute a tag sent down in between the 
disconnect and reconnect.

I think this would introduce a very minor deviation where one tag could 
overtake another, but we may still get a useable implementation even with this.

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  3:34         ` Chris Mason
  2002-03-04  5:05           ` Daniel Phillips
  2002-03-04  8:19           ` Helge Hafting
@ 2002-03-04 14:57           ` James Bottomley
  2002-03-04 17:24             ` Chris Mason
  2002-03-05  7:22             ` Jeremy Higdon
  2 siblings, 2 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-04 14:57 UTC (permalink / raw)
  To: Chris Mason
  Cc: Daniel Phillips, James Bottomley, Stephen C. Tweedie,
	linux-kernel, linux-scsi

mason@suse.com said:
> 1) The drivers would need to be changed to properly keep tag ordering
> in place on resets, and error conditions. 

And actually theres also a problem with  normal operations that disrupt flow 
control like QUEUE FULL returns and contingent allegiance conditions.

Basically, neither the SCSI mid-layer nor the low level drivers were designed 
to keep absolute command ordering.  They take the chaotic I/O approach:  you 
give me a bunch of commands and I tell you when they complete.

> 2) ordered tags force ordering of all writes the drive is processing.
> For some workloads, it will be forced to order stuff the journal code
> doesn't care about at all, perhaps leading to lower performance than
> the simple wait_on_buffer() we're using now.

> 2a) Are the filesystems asking for something impossible?  Can drives
> really write block N and N+1, making sure to commit N to media before
> N+1 (including an abort on N+1 if N fails), but still keeping up a
> nice seek free stream of writes? 

These are the "big" issues.  There's not much point doing all the work to 
implement ordered tags, if the end result is going to be no gain in 
performance.

> 4) If some scsi drives come with writeback on by default, do they also
> turn it on under high load like IDE drives do? 

Finally, an easy one...the answer's "no".  The cache control bits are the only 
way to alter caching behaviour (nothing stops a WCE=1 operating as write 
through if the drive wants to, but a WCE=0 cannot operate write back).

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  5:05           ` Daniel Phillips
@ 2002-03-04 15:03             ` James Bottomley
  2002-03-04 17:04               ` Stephen C. Tweedie
  2002-03-04 17:16               ` Chris Mason
  0 siblings, 2 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-04 15:03 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Chris Mason, James Bottomley, Stephen C. Tweedie, linux-kernel,
	linux-scsi

phillips@bonn-fries.net said:
> But chances are, almost all the IOs ahead of the journal commit belong
> to your same filesystem anyway, so you may be worrying too much about
> possibly waiting for something on another partition. 

My impression is that most modern JFS can work on multiple transactions 
simultaneously.  All you really care about, I believe, is I/O ordering within 
the transaction.  However, separate transactions have no I/O ordering 
requirements with respect to each other (unless they actually overlap).  Using 
ordered tags imposes a global ordering, not just a local transaction ordering, 
so they may not be the most appropriate way to ensure the ordering of writes 
within a single transaction.

I'm not really a JFS expert, so perhaps those who actually develop these 
filesystems could comment?

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  6:09               ` Jeremy Higdon
  2002-03-04  7:57                 ` Daniel Phillips
@ 2002-03-04 16:52                 ` Stephen C. Tweedie
  2002-03-04 18:15                   ` Daniel Phillips
  2002-03-10  5:24                   ` Douglas Gilbert
  1 sibling, 2 replies; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-04 16:52 UTC (permalink / raw)
  To: Jeremy Higdon
  Cc: Daniel Phillips, James Bottomley, Chris Mason,
	Stephen C. Tweedie, linux-kernel, linux-scsi

Hi,

On Sun, Mar 03, 2002 at 10:09:35PM -0800, Jeremy Higdon wrote:

> > WCE is per-command?  And 0 means no caching, so the command must complete
> > when the data is on the media?
> 
> My reading is that WCE==1 means that the command is complete when the
> data is in the drive buffer.

Even if WCE is enabled in the caching mode page, we can still set FUA
(Force Unit Access) in individual write commands to force platter
completion before commands complete.

Of course, it's a good question whether this is honoured properly on
all drives.

FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.

--Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 15:03             ` James Bottomley
@ 2002-03-04 17:04               ` Stephen C. Tweedie
  2002-03-04 17:35                 ` James Bottomley
  2002-03-04 17:16               ` Chris Mason
  1 sibling, 1 reply; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-04 17:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: Daniel Phillips, Chris Mason, Stephen C. Tweedie, linux-kernel,
	linux-scsi

Hi,

On Mon, Mar 04, 2002 at 09:03:31AM -0600, James Bottomley wrote:
> phillips@bonn-fries.net said:
> > But chances are, almost all the IOs ahead of the journal commit belong
> > to your same filesystem anyway, so you may be worrying too much about
> > possibly waiting for something on another partition. 
> 
> My impression is that most modern JFS can work on multiple transactions 
> simultaneously.  All you really care about, I believe, is I/O ordering within 
> the transaction.  However, separate transactions have no I/O ordering 
> requirements with respect to each other (unless they actually overlap).

Generally, that may be true but it's irrelevant.  Internally, the fs
may keep transactions as independent, but as soon as IO is scheduled,
those transactions become serialised.  Given that pure sequential IO
is so much more efficient than random IO, we usually expect
performance to be improved, not degraded, by such serialisation.

I don't know of any filesystems which will be able to recover a
transaction X+1 if transaction X is not complete in the log.  Once you
start writing, the transactions lose their independence.

> Using 
> ordered tags imposes a global ordering, not just a local transaction ordering, 
Actually, ordered tags are in many cases not global enough.  LVM, for
example.

Basically, as far as journal writes are concerned, you just want
things sequential for performance, so serialisation isn't a problem
(and it typically happens anyway).  After the journal write, the
eventual proper writeback of the dirty data to disk has no internal
ordering requirement at all --- it just needs to start strictly after
the commit, and end before the journal records get reused.  Beyond
that, the write order for the writeback data is irrelevant.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 15:03             ` James Bottomley
  2002-03-04 17:04               ` Stephen C. Tweedie
@ 2002-03-04 17:16               ` Chris Mason
  2002-03-04 18:05                 ` Stephen C. Tweedie
  2002-03-04 19:51                 ` Daniel Phillips
  1 sibling, 2 replies; 73+ messages in thread
From: Chris Mason @ 2002-03-04 17:16 UTC (permalink / raw)
  To: Stephen C. Tweedie, James Bottomley
  Cc: Daniel Phillips, linux-kernel, linux-scsi



On Monday, March 04, 2002 05:04:34 PM +0000 "Stephen C. Tweedie" <sct@redhat.com> wrote:

> Basically, as far as journal writes are concerned, you just want
> things sequential for performance, so serialisation isn't a problem
> (and it typically happens anyway).  After the journal write, the
> eventual proper writeback of the dirty data to disk has no internal
> ordering requirement at all --- it just needs to start strictly after
> the commit, and end before the journal records get reused.  Beyond
> that, the write order for the writeback data is irrelevant.
> 

writeback data order is important, mostly because of where the data blocks
are in relation to the log.  If you've got bdflush unloading data blocks
to the disk, and another process doing a commit, the drive's queue
might look like this:

data1, data2, data3, commit1, data4, data5 etc.

If commit1 is an ordered tag, the drive is required to flush 
data1, data2 and data3, then write the commit, then seek back
for data4 and data5.

If commit1 is not an ordered tag, the drive can write all the
data blocks, then seek back to get the commit.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 14:57           ` James Bottomley
@ 2002-03-04 17:24             ` Chris Mason
  2002-03-04 19:02               ` Daniel Phillips
  2002-03-05  7:22             ` Jeremy Higdon
  1 sibling, 1 reply; 73+ messages in thread
From: Chris Mason @ 2002-03-04 17:24 UTC (permalink / raw)
  To: James Bottomley
  Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel, linux-scsi



On Monday, March 04, 2002 08:57:57 AM -0600 James Bottomley <James.Bottomley@steeleye.com> wrote:


>> 2) ordered tags force ordering of all writes the drive is processing.
>> For some workloads, it will be forced to order stuff the journal code
>> doesn't care about at all, perhaps leading to lower performance than
>> the simple wait_on_buffer() we're using now.
> 
>> 2a) Are the filesystems asking for something impossible?  Can drives
>> really write block N and N+1, making sure to commit N to media before
>> N+1 (including an abort on N+1 if N fails), but still keeping up a
>> nice seek free stream of writes? 
> 
> These are the "big" issues.  There's not much point doing all the work to 
> implement ordered tags, if the end result is going to be no gain in 
> performance.

Right, 2a seems to be the show stopper to me.  The good news is 
the existing patches are enough to benchmark the thing and see if
any devices actually benefit.  If we find enough that do, then it
might be worth the extra driver coding required to make the code
correct.

> 
>> 4) If some scsi drives come with writeback on by default, do they also
>> turn it on under high load like IDE drives do? 
> 
> Finally, an easy one...the answer's "no".  The cache control bits are the only 
> way to alter caching behaviour (nothing stops a WCE=1 operating as write 
> through if the drive wants to, but a WCE=0 cannot operate write back).

good to hear, thanks.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 17:04               ` Stephen C. Tweedie
@ 2002-03-04 17:35                 ` James Bottomley
  2002-03-04 17:48                   ` Chris Mason
  2002-03-04 18:09                   ` Stephen C. Tweedie
  0 siblings, 2 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-04 17:35 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: James Bottomley, Daniel Phillips, Chris Mason, linux-kernel, linux-scsi

sct@redhat.com said:
> Generally, that may be true but it's irrelevant.  Internally, the fs
> may keep transactions as independent, but as soon as IO is scheduled,
> those transactions become serialised.  Given that pure sequential IO
> is so much more efficient than random IO, we usually expect
> performance to be improved, not degraded, by such serialisation. 

This is the part I'm struggling with.  Even without error handling and certain 
other changes that would have to be made to give guaranteed integrity to the 
tag ordering, Chris' patch is a very reasonable experimental model of how an 
optimal system for implementing write barriers via ordered tags would work; 
yet when he benchmarks, he sees a performance decrease.

I can dismiss his results as being due to firmware problems with his drives 
making them behave non-optimally for ordered tags, but I really would like to 
see evidence that someone somewhere acutally sees a performance boost with 
Chris' patch.

Have there been any published comparisons of a write barrier implementation 
verses something like the McKusick soft update idea, or even just 
multi-threaded back end completion of the transactions?

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 17:35                 ` James Bottomley
@ 2002-03-04 17:48                   ` Chris Mason
  2002-03-04 18:11                     ` James Bottomley
  2002-03-04 18:09                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 73+ messages in thread
From: Chris Mason @ 2002-03-04 17:48 UTC (permalink / raw)
  To: James Bottomley, Stephen C. Tweedie
  Cc: Daniel Phillips, linux-kernel, linux-scsi



On Monday, March 04, 2002 11:35:24 AM -0600 James Bottomley <James.Bottomley@steeleye.com> wrote:

> sct@redhat.com said:
>> Generally, that may be true but it's irrelevant.  Internally, the fs
>> may keep transactions as independent, but as soon as IO is scheduled,
>> those transactions become serialised.  Given that pure sequential IO
>> is so much more efficient than random IO, we usually expect
>> performance to be improved, not degraded, by such serialisation. 
> 
> This is the part I'm struggling with.  Even without error handling and certain 
> other changes that would have to be made to give guaranteed integrity to the 
> tag ordering, Chris' patch is a very reasonable experimental model of how an 
> optimal system for implementing write barriers via ordered tags would work; 
> yet when he benchmarks, he sees a performance decrease.
> 

Actually most tests I've done show no change at all.  So far, only
lots of O_SYNC writes stress the log enough to show a performance
difference, about 10% faster with tags on.

> I can dismiss his results as being due to firmware problems with his drives 
> making them behave non-optimally for ordered tags, but I really would like to 
> see evidence that someone somewhere acutally sees a performance boost with 
> Chris' patch.

So would I ;-)

> 
> Have there been any published comparisons of a write barrier implementation 
> verses something like the McKusick soft update idea, or even just 
> multi-threaded back end completion of the transactions?

Sorry, what do you mean by multi-threaded back end completion of the
transaction? 

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 17:16               ` Chris Mason
@ 2002-03-04 18:05                 ` Stephen C. Tweedie
  2002-03-04 18:28                   ` James Bottomley
  2002-03-04 19:48                   ` Daniel Phillips
  2002-03-04 19:51                 ` Daniel Phillips
  1 sibling, 2 replies; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-04 18:05 UTC (permalink / raw)
  To: Chris Mason
  Cc: Stephen C. Tweedie, James Bottomley, Daniel Phillips,
	linux-kernel, linux-scsi

Hi,

On Mon, Mar 04, 2002 at 12:16:35PM -0500, Chris Mason wrote:
 
> writeback data order is important, mostly because of where the data blocks
> are in relation to the log.  If you've got bdflush unloading data blocks
> to the disk, and another process doing a commit, the drive's queue
> might look like this:
> 
> data1, data2, data3, commit1, data4, data5 etc.
> 
> If commit1 is an ordered tag, the drive is required to flush 
> data1, data2 and data3, then write the commit, then seek back
> for data4 and data5.

Yes, but that's a performance issue, not a correctness one.

Also, as soon as we have journals on external devices, this whole
thing changes entirely.  We still have to enforce the commit ordering
in the journal, but we also still need the ordering between that
commit and any subsequent writeback, and that obviousy can no longer
be achieved via ordered tags if the two writes are happening on
different devices.

--Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 17:35                 ` James Bottomley
  2002-03-04 17:48                   ` Chris Mason
@ 2002-03-04 18:09                   ` Stephen C. Tweedie
  1 sibling, 0 replies; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-04 18:09 UTC (permalink / raw)
  To: James Bottomley
  Cc: Stephen C. Tweedie, Daniel Phillips, Chris Mason, linux-kernel,
	linux-scsi

Hi,

On Mon, Mar 04, 2002 at 11:35:24AM -0600, James Bottomley wrote:

> Have there been any published comparisons of a write barrier implementation 
> verses something like the McKusick soft update idea

Soft updates are just another mechanism of doing ordered writes.  If
the disk IO subsystem is lying about write ordering or is doing
unexpected writeback caching, soft updates are no more of a cure than
journaling.

> or even just 
> multi-threaded back end completion of the transactions?

ext3 already does the on-disk transaction complete asynchronously
within a separate kjournald thread, independent of writeback IO going
on in the VM's own writeback threads.  Given that it is kernel code
given full access to the kernel's internal lazy IO completion
mechanisms, I'm not sure that it can usefully be given any more
threading.  I think the reiserfs situation is similar.

--Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 17:48                   ` Chris Mason
@ 2002-03-04 18:11                     ` James Bottomley
  2002-03-04 18:41                       ` Chris Mason
  2002-03-04 21:34                       ` Stephen C. Tweedie
  0 siblings, 2 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-04 18:11 UTC (permalink / raw)
  To: Chris Mason
  Cc: James Bottomley, Stephen C. Tweedie, Daniel Phillips,
	linux-kernel, linux-scsi

mason@suse.com said:
> Sorry, what do you mean by multi-threaded back end completion of the
> transaction?  

It's an old idea from databases with fine grained row level locking.  To alter 
data in a single row, you reserve space in the rollback log, take the row 
lock, write the transaction description, write the data, undo the transaction 
description and release the rollback log space and row lock.  These actions 
are sequential, but there may be many such transactions going on in the table 
simultaneously.  The way I've seen a database do this is to set up the actions 
as linked threads which are run as part of the completion routine of the 
previous thread.  Thus, you don't need to wait for the update to complete, you 
just kick off the transaction.   You are prevented from stepping on your own 
transaction because if you want to alter the same row again you have to wait 
for the row lock to be released.  The row locks are the "barriers" in this 
case, but they preserve the concept of transaction independence.  Of course, 
most DB transactions involve many row locks and you don't even want to get 
into what the deadlock detection algorithms look like...

I always imagined a journalled filesystem worked something like this, since 
most of the readers/writers will be acting independently there shouldn't be so 
much deadlock potential.

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 16:52                 ` Stephen C. Tweedie
@ 2002-03-04 18:15                   ` Daniel Phillips
  2002-03-05  7:40                     ` Jens Axboe
  2002-03-10  5:24                   ` Douglas Gilbert
  1 sibling, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04 18:15 UTC (permalink / raw)
  To: Stephen C. Tweedie, Jeremy Higdon
  Cc: James Bottomley, Chris Mason, Stephen C. Tweedie, linux-kernel,
	linux-scsi

On March 4, 2002 05:52 pm, Stephen C. Tweedie wrote:
> Hi,
> 
> On Sun, Mar 03, 2002 at 10:09:35PM -0800, Jeremy Higdon wrote:
> 
> > > WCE is per-command?  And 0 means no caching, so the command must complete
> > > when the data is on the media?
> > 
> > My reading is that WCE==1 means that the command is complete when the
> > data is in the drive buffer.
> 
> Even if WCE is enabled in the caching mode page, we can still set FUA
> (Force Unit Access) in individual write commands to force platter
> completion before commands complete.

Yes, I discovered the FUA bit just after making the previous post, so please
substitute 'FUA' for 'WCE' in the above.

> Of course, it's a good question whether this is honoured properly on
> all drives.
> 
> FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.

I'm having a little trouble seeing the difference between WRITE10, WRITE12
and WRITE16.  WRITE6 seems to be different only in not garaunteeing to 
support the FUA (and one other) bit.  I'm reading the Scsi Block Commands
2 pdf:

   ftp://ftp.t10.org/t10/drafts/sbc2/sbc2r05a.pdf

(Side note: how nice it would be if t10.org got a clue and posted their
docs in html, in addition to the inconvenient, unhyperlinked, proprietary
format pdfs.)

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 18:05                 ` Stephen C. Tweedie
@ 2002-03-04 18:28                   ` James Bottomley
  2002-03-04 19:55                     ` Stephen C. Tweedie
  2002-03-04 19:48                   ` Daniel Phillips
  1 sibling, 1 reply; 73+ messages in thread
From: James Bottomley @ 2002-03-04 18:28 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Chris Mason, James Bottomley, Daniel Phillips, linux-kernel, linux-scsi

sct@redhat.com said:
> Also, as soon as we have journals on external devices, this whole
> thing changes entirely.  We still have to enforce the commit ordering
> in the journal, but we also still need the ordering between that
> commit and any subsequent writeback, and that obviousy can no longer
> be achieved via ordered tags if the two writes are happening on
> different devices. 

Yes, that's a killer: ordered tags aren't going to be able to enforce cross 
device write barriers.

There is one remaining curiosity I have, at least about the benchmarks: Since 
the linux elevator and tag queueing perform essentially similar function 
(except that the disk itself has a better notion of ordering because it knows 
its own geometry).  Might we get better performance by reducing the number of 
tags we allow the device to use, thus forcing the writes to remain longer in 
the linux elevator?

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 18:11                     ` James Bottomley
@ 2002-03-04 18:41                       ` Chris Mason
  2002-03-04 21:34                       ` Stephen C. Tweedie
  1 sibling, 0 replies; 73+ messages in thread
From: Chris Mason @ 2002-03-04 18:41 UTC (permalink / raw)
  To: James Bottomley
  Cc: Stephen C. Tweedie, Daniel Phillips, linux-kernel, linux-scsi



On Monday, March 04, 2002 12:11:27 PM -0600 James Bottomley <James.Bottomley@steeleye.com> wrote:

> mason@suse.com said:
>> Sorry, what do you mean by multi-threaded back end completion of the
>> transaction?  
> 
> It's an old idea from databases with fine grained row level locking.  To alter 
> data in a single row, you reserve space in the rollback log, take the row 
> lock, write the transaction description, write the data, undo the transaction 
> description and release the rollback log space and row lock.  These actions 
> are sequential, but there may be many such transactions going on in the table 
> simultaneously.  The way I've seen a database do this is to set up the actions 
> as linked threads which are run as part of the completion routine of the 
> previous thread.  Thus, you don't need to wait for the update to complete, you 
> just kick off the transaction.

Ok, then, like sct said, we try really hard to have external threads
do log io for us.  It also helps that an atomic unit usually isn't
as small as 'mkdir p'.  Many operations get batched together to
reduce log overhead.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 17:24             ` Chris Mason
@ 2002-03-04 19:02               ` Daniel Phillips
  0 siblings, 0 replies; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04 19:02 UTC (permalink / raw)
  To: Chris Mason, James Bottomley; +Cc: Stephen C. Tweedie, linux-kernel, linux-scsi

On March 4, 2002 06:24 pm, Chris Mason wrote:
> On Monday, March 04, 2002 08:57:57 AM -0600 James Bottomley wrote:
> >> 2a) Are the filesystems asking for something impossible?  Can drives
> >> really write block N and N+1, making sure to commit N to media before
> >> N+1 (including an abort on N+1 if N fails), but still keeping up a
> >> nice seek free stream of writes? 
> > 
> > These are the "big" issues.  There's not much point doing all the work to 
> > implement ordered tags, if the end result is going to be no gain in 
> > performance.
> 
> Right, 2a seems to be the show stopper to me.  The good news is 
> the existing patches are enough to benchmark the thing and see if
> any devices actually benefit.  If we find enough that do, then it
> might be worth the extra driver coding required to make the code
> correct.

Waiting with breathless anticipation.  And once these issues are worked out, 
there's a tough one remaining: enforcing the write barrier through a virtual 
volume with multiple spindles underneath with separate command queues, so 
that the write barrier applies to all.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 18:05                 ` Stephen C. Tweedie
  2002-03-04 18:28                   ` James Bottomley
@ 2002-03-04 19:48                   ` Daniel Phillips
  2002-03-04 19:57                     ` Stephen C. Tweedie
  2002-03-05  7:48                     ` Jens Axboe
  1 sibling, 2 replies; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04 19:48 UTC (permalink / raw)
  To: Stephen C. Tweedie, Chris Mason
  Cc: Stephen C. Tweedie, James Bottomley, linux-kernel, linux-scsi

On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Mar 04, 2002 at 12:16:35PM -0500, Chris Mason wrote:
>  
> > writeback data order is important, mostly because of where the data blocks
> > are in relation to the log.  If you've got bdflush unloading data blocks
> > to the disk, and another process doing a commit, the drive's queue
> > might look like this:
> > 
> > data1, data2, data3, commit1, data4, data5 etc.
> > 
> > If commit1 is an ordered tag, the drive is required to flush 
> > data1, data2 and data3, then write the commit, then seek back
> > for data4 and data5.
> 
> Yes, but that's a performance issue, not a correctness one.
> 
> Also, as soon as we have journals on external devices, this whole
> thing changes entirely.  We still have to enforce the commit ordering
> in the journal, but we also still need the ordering between that
> commit and any subsequent writeback, and that obviousy can no longer
> be achieved via ordered tags if the two writes are happening on
> different devices.

But the bio layer can manage it, by sending a write barrier down all relevant 
queues.  We can send a zero length write barrier command, yes?

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 17:16               ` Chris Mason
  2002-03-04 18:05                 ` Stephen C. Tweedie
@ 2002-03-04 19:51                 ` Daniel Phillips
  2002-03-05  7:42                   ` Jens Axboe
  1 sibling, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04 19:51 UTC (permalink / raw)
  To: Chris Mason, Stephen C. Tweedie, James Bottomley; +Cc: linux-kernel, linux-scsi

On March 4, 2002 06:16 pm, Chris Mason wrote:
> On Monday, March 04, 2002 05:04:34 PM +0000 "Stephen C. Tweedie" <sct@redhat.com> wrote:
> 
> > Basically, as far as journal writes are concerned, you just want
> > things sequential for performance, so serialisation isn't a problem
> > (and it typically happens anyway).  After the journal write, the
> > eventual proper writeback of the dirty data to disk has no internal
> > ordering requirement at all --- it just needs to start strictly after
> > the commit, and end before the journal records get reused.  Beyond
> > that, the write order for the writeback data is irrelevant.
> 
> writeback data order is important, mostly because of where the data blocks
> are in relation to the log.  If you've got bdflush unloading data blocks
> to the disk, and another process doing a commit, the drive's queue
> might look like this:
> 
> data1, data2, data3, commit1, data4, data5 etc.
> 
> If commit1 is an ordered tag, the drive is required to flush 
> data1, data2 and data3, then write the commit, then seek back
> for data4 and data5.
> 
> If commit1 is not an ordered tag, the drive can write all the
> data blocks, then seek back to get the commit.

We can have more than one queue per device I think.  Then we can have reads
unaffected by write barriers, for example.  It never makes sense for a the 
write barrier to wait on a read.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 18:28                   ` James Bottomley
@ 2002-03-04 19:55                     ` Stephen C. Tweedie
  0 siblings, 0 replies; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-04 19:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Stephen C. Tweedie, Chris Mason, Daniel Phillips, linux-kernel,
	linux-scsi

Hi,

On Mon, Mar 04, 2002 at 12:28:19PM -0600, James Bottomley wrote:
 
> There is one remaining curiosity I have, at least about the benchmarks: Since 
> the linux elevator and tag queueing perform essentially similar function 
> (except that the disk itself has a better notion of ordering because it knows 
> its own geometry).  Might we get better performance by reducing the number of 
> tags we allow the device to use, thus forcing the writes to remain longer in 
> the linux elevator?

Possibly, but my gut feeling says no and so do any benchmarks I've
seen regarding queue depths on adaptec controllers (not that I've seen
many).  For relatively-closeby IOs, the disk will always have a better
idea of how to optimise a number of IOs than the Linux elevator can
have, especially if we have multiple IOs spanning multiple heads
within a single cylinder.

--Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 19:48                   ` Daniel Phillips
@ 2002-03-04 19:57                     ` Stephen C. Tweedie
  2002-03-04 21:06                       ` Daniel Phillips
  2002-03-05  7:48                     ` Jens Axboe
  1 sibling, 1 reply; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-04 19:57 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Chris Mason, James Bottomley, linux-kernel,
	linux-scsi

Hi,

On Mon, Mar 04, 2002 at 08:48:02PM +0100, Daniel Phillips wrote:
> On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> > Also, as soon as we have journals on external devices, this whole
> > thing changes entirely.  We still have to enforce the commit ordering
> > in the journal, but we also still need the ordering between that
> > commit and any subsequent writeback, and that obviousy can no longer
> > be achieved via ordered tags if the two writes are happening on
> > different devices.
> 
> But the bio layer can manage it, by sending a write barrier down all relevant 
> queues.  We can send a zero length write barrier command, yes?

Sort of --- there are various flush commands we can use.  However, bio
can't just submit the barriers, it needs to synchronise them, and that
means doing a global wait over all the devices until they have all
acked their barrier op.  That's expensive: you may be as well off just
using the current fs-internal synchronous commands at that point.

--Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 19:57                     ` Stephen C. Tweedie
@ 2002-03-04 21:06                       ` Daniel Phillips
  2002-03-05 14:58                         ` Stephen C. Tweedie
  0 siblings, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-04 21:06 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Stephen C. Tweedie, Chris Mason, James Bottomley, linux-kernel,
	linux-scsi

On March 4, 2002 08:57 pm, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Mar 04, 2002 at 08:48:02PM +0100, Daniel Phillips wrote:
> > On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> > > Also, as soon as we have journals on external devices, this whole
> > > thing changes entirely.  We still have to enforce the commit ordering
> > > in the journal, but we also still need the ordering between that
> > > commit and any subsequent writeback, and that obviousy can no longer
> > > be achieved via ordered tags if the two writes are happening on
> > > different devices.
> > 
> > But the bio layer can manage it, by sending a write barrier down all
> > relevant queues.  We can send a zero length write barrier command, yes?
> 
> Sort of --- there are various flush commands we can use.  However, bio
> can't just submit the barriers, it needs to synchronise them, and that
> means doing a global wait over all the devices until they have all
> acked their barrier op.  That's expensive: you may be as well off just
> using the current fs-internal synchronous commands at that point.

With ordered tags, at least we get the benefit of not having to wait on all 
the commands before the write barrier.

It's annoying to have to let the all the command queues empty, but it's hard 
to see what can be done about that, the synchronization *has* to be global.  
In this case, all we can do is to be sure to respond quickly to the command 
completion interrupt.  So the unavoidable cost is one request's worth of bus 
transfer (is there an advantage in trying to make it a small request?) and 
the latency of the interrupt.  100 uSec?

In the meantime, if I am right about being able to have multiple queues per 
disk, reads can continue.  It's not so bad.

The only way I can imagine of improving this is if there's a way to queue 
some commands on the understanding they're not to be carried out until the 
word is given.  My scsi-fu is not great enough to know if there's a way to do 
this.  Even if we could, it's probably not worth the effort, because all the 
drives will have to wait for the slowest/most loaded anyway.

That's life.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 18:11                     ` James Bottomley
  2002-03-04 18:41                       ` Chris Mason
@ 2002-03-04 21:34                       ` Stephen C. Tweedie
  1 sibling, 0 replies; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-04 21:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: Chris Mason, Stephen C. Tweedie, Daniel Phillips, linux-kernel,
	linux-scsi

Hi,

On Mon, Mar 04, 2002 at 12:11:27PM -0600, James Bottomley wrote:

> The way I've seen a database do this is to set up the actions 
> as linked threads which are run as part of the completion routine of the 
> previous thread.  Thus, you don't need to wait for the update to complete, you 
> just kick off the transaction.   You are prevented from stepping on your own 
> transaction because if you want to alter the same row again you have to wait 
> for the row lock to be released.  The row locks are the "barriers" in this 
> case, but they preserve the concept of transaction independence.

Right, but in the database world we are usually doing synchronous
transactions, so allowing the writeback to be done in parallel is
important; and typically there's a combination of undo and redo
logging, so there is a much more relaxed ordering requirement on the
main data writes.

In filesystems it's much more common just to use redo logging, so we
can't do any file writes before the journal commit; and the IO is
usually done as writeback after the application's syscall has
finished.

Linux already has such fine-grained locking for the actual completion
of the filesystem operations, and in the journaling case,
coarse-grained writeback is usually done because it's far more
efficient to be able to batch up a bunch of updates into a single
transaction in the redo log.

There are some exceptions.  GFS, for example, takes care to maintain
transactional fine grainedness even for writeback, because in a
distributed filesystem you have to be able to release pinned metadata
back to another node on demand as quickly as possible, and you don't
want to force huge compound transactions out to disk when doing so.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04  7:57                 ` Daniel Phillips
@ 2002-03-05  7:09                   ` Jeremy Higdon
  2002-03-05 22:56                     ` Daniel Phillips
  0 siblings, 1 reply; 73+ messages in thread
From: Jeremy Higdon @ 2002-03-05  7:09 UTC (permalink / raw)
  To: Daniel Phillips, James Bottomley, Chris Mason, Stephen C. Tweedie
  Cc: linux-kernel, linux-scsi

On Mar 4,  8:57am, Daniel Phillips wrote:
> 
> On March 4, 2002 07:09 am, Jeremy Higdon wrote:
> > On Mar 4,  6:31am, Daniel Phillips wrote:
> > > On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > > > I have never heard of
> > > > any implied requirement to flush to media when a drive receives an
> > > > ordered tag and WCE is set.  It does seem like a useful feature to have
> > > > in the standard, but I don't think it's there.
> > > 
> > > It seems to be pretty strongly implied that things should work that way.
> > > What is the use of being sure the write with the ordered tag is on media
> > > if you're not sure about the writes that were supposedly supposed to
> > > precede it?  Spelling this out would indeed be helpful.
> > 
> > WCE==1 and ordered tag means that the data for previous commands is in
> > the drive buffer before the data for the ordered tag is in the drive
> > buffer.
> 
> Right, and what we're talking about is going further and requiring that WCE=0
> and ordered tag means the data for previous commands is *not* in the buffer,
> i.e., on the platter, which is the only interpretation that makes sense.


Sorry to be slow here, but if WCE=0, then commands aren't complete until
data is on the media, so since previous commands don't complete until
data is on the media, and they must complete before the ordered tag
command does, what you say would have to be the case.  I thought the idea
was to buffer commands to drive memory (so that the drive could increase
performance by writing back to back commands without losing a rev) and
then issue a command with a "flush" side effect.

Here is an interesting question.  If you use WCE=1 and then send an
ordered tag with FUA=1, does that imply that data from previous
write commands is flushed to media?  I don't think so, though it
would be a useful feature if it did.

jeremy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 14:57           ` James Bottomley
  2002-03-04 17:24             ` Chris Mason
@ 2002-03-05  7:22             ` Jeremy Higdon
  2002-03-05 23:01               ` Daniel Phillips
  1 sibling, 1 reply; 73+ messages in thread
From: Jeremy Higdon @ 2002-03-05  7:22 UTC (permalink / raw)
  To: James Bottomley, Chris Mason
  Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel, linux-scsi

On Mar 4,  8:57am, James Bottomley wrote:
> 
> > 2a) Are the filesystems asking for something impossible?  Can drives
> > really write block N and N+1, making sure to commit N to media before
> > N+1 (including an abort on N+1 if N fails), but still keeping up a
> > nice seek free stream of writes? 
> 
> These are the "big" issues.  There's not much point doing all the work to 
> implement ordered tags, if the end result is going to be no gain in 
> performance.


If a drive does reduced latency writes, then blocks can be written out
of order.  Also, for a trivial case:  with hardware RAIDs, when the
data for a single command is split across multiple drives, you can get
data blocks written out of order, no matter what you do.

I don't think a filesystem can make any assumptions about blocks within
a single command, though with ordered tags (assuming driver and device
support) and no write caching, it can make assumptions between commands.

jeremy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 18:15                   ` Daniel Phillips
@ 2002-03-05  7:40                     ` Jens Axboe
  2002-03-05 22:29                       ` Daniel Phillips
  0 siblings, 1 reply; 73+ messages in thread
From: Jens Axboe @ 2002-03-05  7:40 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Jeremy Higdon, James Bottomley, Chris Mason,
	linux-kernel, linux-scsi

On Mon, Mar 04 2002, Daniel Phillips wrote:
> > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> 
> I'm having a little trouble seeing the difference between WRITE10, WRITE12
> and WRITE16.  WRITE6 seems to be different only in not garaunteeing to 
> support the FUA (and one other) bit.  I'm reading the Scsi Block Commands

WRITE6 was deprecated because there is only one byte available to set
transfer size. Enter WRITE10. WRITE12 allows the use of the streaming
performance settings, that's the only functional difference wrt WRITE10
iirc.

> (Side note: how nice it would be if t10.org got a clue and posted their
> docs in html, in addition to the inconvenient, unhyperlinked, proprietary
> format pdfs.)

See the mtfuji docs as an example for how nicely pdf's can be setup too.
The thought of substituting that for a html version makes me want to
barf.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 19:51                 ` Daniel Phillips
@ 2002-03-05  7:42                   ` Jens Axboe
  0 siblings, 0 replies; 73+ messages in thread
From: Jens Axboe @ 2002-03-05  7:42 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Chris Mason, Stephen C. Tweedie, James Bottomley, linux-kernel,
	linux-scsi

On Mon, Mar 04 2002, Daniel Phillips wrote:
> > writeback data order is important, mostly because of where the data blocks
> > are in relation to the log.  If you've got bdflush unloading data blocks
> > to the disk, and another process doing a commit, the drive's queue
> > might look like this:
> > 
> > data1, data2, data3, commit1, data4, data5 etc.
> > 
> > If commit1 is an ordered tag, the drive is required to flush 
> > data1, data2 and data3, then write the commit, then seek back
> > for data4 and data5.
> > 
> > If commit1 is not an ordered tag, the drive can write all the
> > data blocks, then seek back to get the commit.
> 
> We can have more than one queue per device I think.  Then we can have reads
> unaffected by write barriers, for example.  It never makes sense for a the 
> write barrier to wait on a read.

No, there will always be at most one queue for a device. There might be
more than one device on a queue, though, so yes the implementation at
the block/queue level still leaves something to be desired.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 19:48                   ` Daniel Phillips
  2002-03-04 19:57                     ` Stephen C. Tweedie
@ 2002-03-05  7:48                     ` Jens Axboe
  1 sibling, 0 replies; 73+ messages in thread
From: Jens Axboe @ 2002-03-05  7:48 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Chris Mason, James Bottomley, linux-kernel,
	linux-scsi

On Mon, Mar 04 2002, Daniel Phillips wrote:
> But the bio layer can manage it, by sending a write barrier down all relevant 
> queues.  We can send a zero length write barrier command, yes?

Actually, yes that was indeed one of the things I wanted to achieve with
the block layer rewrite -- the ability to send down other commands than
read/write down the queue. So not exactly bio, but more of a new block
feature.

See, now fs requests have REQ_CMD set in the request flag bits. This
means that it's a "regular" request, which has a string of bios attached
to it. Doing something ala

	struct request *rq = get_request();

	init_request(rq);
	rq->rq_dev = target_dev;
	rq->cmd[0] = GPCMD_FLUSH_CACHE;
	rq->flags = REQ_PC;
	/* additional info... */
	queue_request(rq);

would indeed be possible. The attentive reader will now know where
ide-scsi is headed and why :-)

This would work for any SCSI and psueo-SCSI device, basically all the
stuff out there. For IDE, the request pre-handler would transform this
into an IDE command (or taskfile).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 21:06                       ` Daniel Phillips
@ 2002-03-05 14:58                         ` Stephen C. Tweedie
  0 siblings, 0 replies; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-05 14:58 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Chris Mason, James Bottomley, linux-kernel,
	linux-scsi

Hi,

On Mon, Mar 04, 2002 at 10:06:19PM +0100, Daniel Phillips wrote:
> On March 4, 2002 08:57 pm, Stephen C. Tweedie wrote:
> > On Mon, Mar 04, 2002 at 08:48:02PM +0100, Daniel Phillips wrote:
> > > On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> > > > Also, as soon as we have journals on external devices, this whole
> > > > thing changes entirely. 

> > > We can send a zero length write barrier command, yes?
> > 
> > Sort of --- there are various flush commands we can use.  However, bio
> > can't just submit the barriers, it needs to synchronise them, and that
> > means doing a global wait over all the devices until they have all
> > acked their barrier op.  That's expensive: you may be as well off just
> > using the current fs-internal synchronous commands at that point.
> 
> With ordered tags, at least we get the benefit of not having to wait on all 
> the commands before the write barrier.
> 
> It's annoying to have to let the all the command queues empty, but it's hard 
> to see what can be done about that, the synchronization *has* to be global.  
> In this case, all we can do is to be sure to respond quickly to the command 
> completion interrupt.  So the unavoidable cost is one request's worth of bus 
> transfer (is there an advantage in trying to make it a small request?) and 
> the latency of the interrupt.  100 uSec?

It probably doesn't really matter.  For performance, we want to stream
both the journal writes and the primary disk writeback as much as
possible, but a bit of latency in the synchronisation between the two
ought to be largely irrelevant.

Much more significant than the external-journal case is probably the
stripe case, either with raid5, striped LVM or raid-1+0.  In that
case, even sequential IO to the notionally-sequential journal may have
to be split over multiple disks, and at that point the pipeline stall
in the middle of IO that was supposed to be sequential will really
hurt.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-05  7:40                     ` Jens Axboe
@ 2002-03-05 22:29                       ` Daniel Phillips
  2002-03-12  7:01                         ` Jens Axboe
  0 siblings, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-05 22:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Stephen C. Tweedie, Jeremy Higdon, James Bottomley, Chris Mason,
	linux-kernel, linux-scsi

On March 5, 2002 08:40 am, Jens Axboe wrote:
> On Mon, Mar 04 2002, Daniel Phillips wrote:
> > > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> > 
> > I'm having a little trouble seeing the difference between WRITE10, WRITE12
> > and WRITE16.  WRITE6 seems to be different only in not garaunteeing to 
> > support the FUA (and one other) bit.  I'm reading the Scsi Block Commands
> 
> WRITE6 was deprecated because there is only one byte available to set
> transfer size. Enter WRITE10. WRITE12 allows the use of the streaming
> performance settings, that's the only functional difference wrt WRITE10
> iirc.

Thanks.  This is poorly documented, to say the least.

> > (Side note: how nice it would be if t10.org got a clue and posted their
> > docs in html, in addition to the inconvenient, unhyperlinked, proprietary
> > format pdfs.)
> 
> See the mtfuji docs as an example for how nicely pdf's can be setup too.

Do you have a url?

> The thought of substituting that for a html version makes me want to
> barf.

Who said substitute?  Provide beside, as is reasonable.  For my part,
pdf's tend to cause severe indigestion, if not actually cause
regurgitation.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-05  7:09                   ` Jeremy Higdon
@ 2002-03-05 22:56                     ` Daniel Phillips
  0 siblings, 0 replies; 73+ messages in thread
From: Daniel Phillips @ 2002-03-05 22:56 UTC (permalink / raw)
  To: Jeremy Higdon, James Bottomley, Chris Mason, Stephen C. Tweedie
  Cc: linux-kernel, linux-scsi

On March 5, 2002 08:09 am, Jeremy Higdon wrote:
> On Mar 4,  8:57am, Daniel Phillips wrote:
> > 
> > On March 4, 2002 07:09 am, Jeremy Higdon wrote:
> > > On Mar 4,  6:31am, Daniel Phillips wrote:
> > > > On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > > > > I have never heard of
> > > > > any implied requirement to flush to media when a drive receives an
> > > > > ordered tag and WCE is set.  It does seem like a useful feature to have
> > > > > in the standard, but I don't think it's there.
> > > > 
> > > > It seems to be pretty strongly implied that things should work that way.
> > > > What is the use of being sure the write with the ordered tag is on media
> > > > if you're not sure about the writes that were supposedly supposed to
> > > > precede it?  Spelling this out would indeed be helpful.
> > > 
> > > WCE==1 and ordered tag means that the data for previous commands is in
> > > the drive buffer before the data for the ordered tag is in the drive
> > > buffer.
> > 
> > Right, and what we're talking about is going further and requiring that WCE=0
> > and ordered tag means the data for previous commands is *not* in the buffer,
> > i.e., on the platter, which is the only interpretation that makes sense.
> 
> Sorry to be slow here, but if WCE=0, then commands aren't complete until
> data is on the media,

Sorry, I meant FUA, not WCE.  For this error I offer the apology that there
is a whole new set of TLA's to learn here, and I started yesterday.

> so since previous commands don't complete until
> data is on the media, and they must complete before the ordered tag
> command does, what you say would have to be the case.  I thought the idea
> was to buffer commands to drive memory (so that the drive could increase
> performance by writing back to back commands without losing a rev) and
> then issue a command with a "flush" side effect.
> 
> Here is an interesting question.  If you use WCE=1 and then send an
> ordered tag with FUA=1, does that imply that data from previous
> write commands is flushed to media?  I don't think so, though it
> would be a useful feature if it did.

That's my point all right.  And what I tried to say is, it's useless to
have it otherwise, so we should now start beating up drive makers to do it
this way (I don't think they'll need a lot of convincing actually) and we
should write a test procedure to determine which drives do it correctly,
according to our definition of correctness.  If we agree on what is
correct of course.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-05  7:22             ` Jeremy Higdon
@ 2002-03-05 23:01               ` Daniel Phillips
  0 siblings, 0 replies; 73+ messages in thread
From: Daniel Phillips @ 2002-03-05 23:01 UTC (permalink / raw)
  To: Jeremy Higdon, James Bottomley, Chris Mason
  Cc: Stephen C. Tweedie, linux-kernel, linux-scsi

On March 5, 2002 08:22 am, Jeremy Higdon wrote:
> On Mar 4,  8:57am, James Bottomley wrote:
> > 
> > > 2a) Are the filesystems asking for something impossible?  Can drives
> > > really write block N and N+1, making sure to commit N to media before
> > > N+1 (including an abort on N+1 if N fails), but still keeping up a
> > > nice seek free stream of writes? 
> > 
> > These are the "big" issues.  There's not much point doing all the work to 
> > implement ordered tags, if the end result is going to be no gain in 
> > performance.
> 
> If a drive does reduced latency writes, then blocks can be written out
> of order.  Also, for a trivial case:  with hardware RAIDs, when the
> data for a single command is split across multiple drives, you can get
> data blocks written out of order, no matter what you do.

That's ok, the journal takes care of this.  And hence the need to be so
careful about how the journal commit is handled.

> I don't think a filesystem can make any assumptions about blocks within
> a single command, though with ordered tags (assuming driver and device
> support) and no write caching, it can make assumptions between commands.

We're trying to get rid of the 'no write caching' requirement.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 14:48           ` James Bottomley
@ 2002-03-06 13:59             ` Daniel Phillips
  2002-03-06 14:34               ` James Bottomley
  0 siblings, 1 reply; 73+ messages in thread
From: Daniel Phillips @ 2002-03-06 13:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: James Bottomley, Chris Mason, Stephen C. Tweedie, linux-kernel,
	linux-scsi

On March 4, 2002 03:48 pm, James Bottomley wrote:
> phillips@bonn-fries.net said:
> > I've been following the thread, I hope I haven't missed anything
> > fundamental. A better long term solution is to have ordered tags work
> > as designed.  It's  not broken by design is it, just implementation? 
> 
> There is actually one hole in the design:  A scsi device may accept a command 
> with an ordered tag, disconnect and at a later time reconnect and return a 
> QUEUE FULL status indicating that the tag must be retried.  In the time 
> between the disconnect and reconnect, the standard doesn't require that no 
> other tags be accepted, so if the local flow control conditions abate, the 
> device is allowed to accept and execute a tag sent down in between the 
> disconnect and reconnect.

How can a drive can accept a command while it is disconnected from the bus.
Did you mean that after it reconnects it might refuse the ordered tag and
accept another?  That would be a bug, I'd think.

> I think this would introduce a very minor deviation where one tag could 
> overtake another, but we may still get a useable implementation even with this.

It would mean we would have to wait for completion of the tagged command
before submitting any more commands.  Not nice, but not horribly costly
either.

-- 
Daniel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-06 13:59             ` Daniel Phillips
@ 2002-03-06 14:34               ` James Bottomley
  0 siblings, 0 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-06 14:34 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: James Bottomley, Chris Mason, Stephen C. Tweedie, linux-kernel,
	linux-scsi

phillips@bonn-fries.net said:
> How can a drive can accept a command while it is disconnected from the
> bus. Did you mean that after it reconnects it might refuse the ordered
> tag and accept another?  That would be a bug, I'd think. 

Disconnect is SCSI slang for releasing the bus to other uses, it doesn't imply 
electrical disconnection from it.  The architecture of SCSI is like this, the 
usual (and simplified) operation of a single command is:

- Initiator selects device and sends command and tag information.
- device disconnects
....
- device reselects initiator, presents tag and demands to transfer data (in 
the direction dictated by the command).
- device may disconnect and reselect as many times as it wishes during data 
transfer as dictated by its flow control (at least one block of data must 
transfer for each reselection)
- device disconnects to complete operation
...
- device reselects and presents tag and status (command is now complete)

A tag is like a temporary ticket for identifying the command in progress.

During the (...) phases, the bus is free and the initiator is able to send 
down new commands with different tags.  If the device isn't going to be able 
to accept the command, it is allowed to skip the data transfer phase and go 
straight to status and present a QUEUE FULL status return.  However, there is 
still a disconnected period where the initiator doesn't know the command won't 
be accepted and may send down other tagged commands.

> It would mean we would have to wait for completion of the tagged
> command before submitting any more commands.  Not nice, but not
> horribly costly either. 

But if we must await completion of ordered tags just to close this hole, it 
makes the most sense to do it in the bio layer (or the journal layer, where 
the wait is currently being done anyway) since it is generic to every low 
level driver.

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-04 16:52                 ` Stephen C. Tweedie
  2002-03-04 18:15                   ` Daniel Phillips
@ 2002-03-10  5:24                   ` Douglas Gilbert
  2002-03-11 11:13                     ` Kurt Garloff
                                       ` (2 more replies)
  1 sibling, 3 replies; 73+ messages in thread
From: Douglas Gilbert @ 2002-03-10  5:24 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Jeremy Higdon, Daniel Phillips, James Bottomley, Chris Mason,
	linux-kernel, linux-scsi

"Stephen C. Tweedie" wrote:
> 
> Hi,
> 
> On Sun, Mar 03, 2002 at 10:09:35PM -0800, Jeremy Higdon wrote:
> 
> > > WCE is per-command?  And 0 means no caching, so the command must complete
> > > when the data is on the media?
> >
> > My reading is that WCE==1 means that the command is complete when the
> > data is in the drive buffer.
> 
> Even if WCE is enabled in the caching mode page, we can still set FUA
> (Force Unit Access) in individual write commands to force platter
> completion before commands complete.
> 
> Of course, it's a good question whether this is honoured properly on
> all drives.
> 
> FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.

Stephen,
FUA is also available on WRITE16. The same FUA support pattern
applies to the READ6,10,12 and 16 series. Interestingly if a
WRITE10 is called with FUA==0 followed by a READ10 with FUA=1
on the same block(s) then the READ causes the a flush from the
cache to the platter (if it hasn't already been done). [It
would be pretty ugly otherwise :-)]

Also SYNCHRONIZE CACHE(10) allows a range of blocks to be sent
to the platter but the size of the range is limited to 2**16 - 1
blocks which is probably too small to be useful. If the
"number of blocks" field is set to 0 then the whole disk cache
is flushed to the platter. There is a SYNCHRONIZE CACHE(16)
defined in recent sbc2 drafts that allows a 32 bit range
but it is unlikely to appear on any disk any time soon. There
is also an "Immed"-iate bit on these sync_cache commands
that may be of interest. When set this bit instructs the
target to respond with a good status immediately on receipt
of the command (and thus before the dirty blocks of the disk 
cache are flushed to the platter).

Doug Gilbert


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-10  5:24                   ` Douglas Gilbert
@ 2002-03-11 11:13                     ` Kurt Garloff
  2002-03-12  6:58                       ` Jens Axboe
  2002-03-11 11:34                     ` Stephen C. Tweedie
  2002-03-12  1:17                     ` GOTO Masanori
  2 siblings, 1 reply; 73+ messages in thread
From: Kurt Garloff @ 2002-03-11 11:13 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Stephen C. Tweedie, Jeremy Higdon, Daniel Phillips,
	James Bottomley, Chris Mason, linux-kernel, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 1341 bytes --]

Hi Doug,

On Sun, Mar 10, 2002 at 12:24:12AM -0500, Douglas Gilbert wrote:
> "Stephen C. Tweedie" wrote:
> > Even if WCE is enabled in the caching mode page, we can still set FUA
> > (Force Unit Access) in individual write commands to force platter
> > completion before commands complete.
> > 
> > Of course, it's a good question whether this is honoured properly on
> > all drives.
> > 
> > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> 
> Stephen,
[...]
> 
> Also SYNCHRONIZE CACHE(10) allows a range of blocks to be sent
> to the platter but the size of the range is limited to 2**16 - 1
> blocks which is probably too small to be useful. If the
> "number of blocks" field is set to 0 then the whole disk cache
> is flushed to the platter.

Which I think we should send before shutdown (and possible poweroff) for
disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
SCSI spec also lists SYNCHRONIZE_CACHE for CD-Rom devices
Unfortunately, SYNCHRONIZE CACHE is optional, so we would need to ignore any
errors returned by this command.

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE Linux AG, Nuernberg, DE                            SCSI, Security

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-10  5:24                   ` Douglas Gilbert
  2002-03-11 11:13                     ` Kurt Garloff
@ 2002-03-11 11:34                     ` Stephen C. Tweedie
  2002-03-11 17:15                       ` James Bottomley
  2002-03-12  1:17                     ` GOTO Masanori
  2 siblings, 1 reply; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-03-11 11:34 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Stephen C. Tweedie, Jeremy Higdon, Daniel Phillips,
	James Bottomley, Chris Mason, linux-kernel, linux-scsi

Hi,

On Sun, Mar 10, 2002 at 12:24:12AM -0500, Douglas Gilbert wrote:

> > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> 
> Stephen,
> FUA is also available on WRITE16.

I said WRITE6, not WRITE16. :-)  WRITE6 uses the low 5 bits of the LUN
byte for the top bits of the block number; WRITE10 and later use those
5 bits for DPO/FUA etc.  But WRITE6 is a horribly limited interface:
you only have 21 bits of block number for a start, so it's limited to
1GB on 512-byte-sector devices.  We can probably ignore WRITE6 safely
enough.

--Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-11 11:34                     ` Stephen C. Tweedie
@ 2002-03-11 17:15                       ` James Bottomley
  0 siblings, 0 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-11 17:15 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Douglas Gilbert, Jeremy Higdon, Daniel Phillips, James Bottomley,
	Chris Mason, linux-kernel, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 564 bytes --]

This patch (against 2.4.18) addresses our synchronisation problems with write 
back caches only, not the ordering problem with tags.

It probes the cache type on attach and inserts synchronisation instructions on 
release() (i.e. unmount) or if the reboot notifier is called.

How would you like the cache synchronize instruction plugged into the journal 
writes?  I can do it either by exposing an ioctl which the journal code can 
use, or I can try to use the write barrier (however, the bio layer is going to 
have to ensure the ordering if I do that).

James


[-- Attachment #2: sd-cache-2.4.18.diff --]
[-- Type: text/plain , Size: 7437 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.160   -> 1.162  
#	   drivers/scsi/sd.h	1.1     -> 1.2    
#	   drivers/scsi/sd.c	1.19    -> 1.21   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/03/08	jejb@mulgrave.(none)	1.161
# sd cache control changes
#   
# - detect cache type on boot and store in extra fields in Scsi_Disk.
# - if cache is write back synchronize on device release.
# - if cache is write back also synchronize on shutdown notifier.
# 
# --------------------------------------------
# 02/03/11	jejb@mulgrave.(none)	1.162
# sd cache control
# 
# - bug fixes
# --------------------------------------------
#
diff -Nru a/drivers/scsi/sd.c b/drivers/scsi/sd.c
--- a/drivers/scsi/sd.c	Mon Mar 11 12:09:13 2002
+++ b/drivers/scsi/sd.c	Mon Mar 11 12:09:13 2002
@@ -42,6 +42,7 @@
 #include <linux/errno.h>
 #include <linux/interrupt.h>
 #include <linux/init.h>
+#include <linux/reboot.h>
 
 #include <linux/smp.h>
 
@@ -104,6 +105,10 @@
 static int sd_detect(Scsi_Device *);
 static void sd_detach(Scsi_Device *);
 static int sd_init_command(Scsi_Cmnd *);
+static int sd_synchronize_cache(Scsi_Device *, int);
+static int sd_notifier(struct notifier_block *, unsigned long, void *);
+
+static struct notifier_block sd_notifier_block = {sd_notifier, NULL, 0}; 
 
 static struct Scsi_Device_Template sd_template = {
 	name:"disk",
@@ -549,6 +554,11 @@
 		__MOD_DEC_USE_COUNT(SDev->host->hostt->module);
 	if (sd_template.module)
 		__MOD_DEC_USE_COUNT(sd_template.module);
+
+	/* check that we actually have a write back cache to synchronize */
+	if(rscsi_disks[target].WCE)
+		sd_synchronize_cache(SDev, 1);
+		       
 	return 0;
 }
 
@@ -860,8 +870,6 @@
 	}
 
 	mode_retries = 2;	/* make two attempts to change the cache type */
-
- retry_mode_select:
 	retries = 3;
 	do {
 
@@ -901,60 +909,20 @@
 			print_req_sense("sd", SRpnt);
 		else
 			printk("%s : sense not available. \n", nbuff);
+
+		printk("%s : assuming drive cache: write through\n", nbuff);
+		rscsi_disks[i].WCE = 0;
+		rscsi_disks[i].RCD = 0;
 	} else {
 		const char *types[] = { "write through", "none", "write back", "write back, no read (daft)" };
 		int ct = 0;
 
-		ct = (buffer[6] & 0x01 /* RCD */) | ((buffer[6] & 0x04 /* WCE */) >> 1);
+		rscsi_disks[i].WCE = buffer[6] & 0x04;
+		rscsi_disks[i].RCD = buffer[6] & 0x01;
 
-		printk("%s : checking drive cache: %s \n", nbuff, types[ct]);
-		if(ct != 0x0 && mode_retries-- == 0) {
-			printk("%s : FAILED to change cache to write back, continuing\n", nbuff);
-		}
-		else if(ct != 0x0) {
-			retries = 3;
-			buffer[6] &= (~0x05); /* clear RCD and WCE */
-			do {
-				memset((void *) &cmd[0], 0, 10);
-				cmd[0] = MODE_SELECT;
-				cmd[1] = (rscsi_disks[i].device->scsi_level <= SCSI_2) ?
-					((rscsi_disks[i].device->lun << 5) & 0xe0) : 0;
-				cmd[1] |= 0x10;	/* PF */
-				cmd[4] = 24;	/* allocation length */
-				
-				
-				SRpnt->sr_cmd_len = 0;
-				SRpnt->sr_sense_buffer[0] = 0;
-				SRpnt->sr_sense_buffer[2] = 0;
-				
-				SRpnt->sr_data_direction = SCSI_DATA_WRITE;
-				scsi_wait_req(SRpnt, (void *) cmd, (void *) buffer,
-					      24, SD_TIMEOUT, MAX_RETRIES);
+		ct =  rscsi_disks[i].RCD + 2*rscsi_disks[i].WCE;
 
-				the_result = SRpnt->sr_result;
-				retries--;
-
-			} while (the_result && retries);
-
-			if (the_result) {
-				printk("%s : MODE SELECT failed.\n"
-				       "%s : status = %x, message = %02x, host = %d, driver = %02x \n",
-				       nbuff, nbuff,
-				       status_byte(the_result),
-				       msg_byte(the_result),
-				       host_byte(the_result),
-				       driver_byte(the_result)
-				       );
-				if (driver_byte(the_result) & DRIVER_SENSE)
-					print_req_sense("sd", SRpnt);
-				else
-					printk("%s : sense not available. \n", nbuff);
-			} else {
-				printk("%s : changing drive cache to write through\n", nbuff);
-			}
-			goto retry_mode_select;
-		}
-		
+		printk("%s : drive cache: %s\n", nbuff, types[ct]);
 	}
 
 	retries = 3;
@@ -1491,8 +1459,13 @@
 
 static int __init init_sd(void)
 {
+	int ret;
+
 	sd_template.module = THIS_MODULE;
-	return scsi_register_module(MODULE_SCSI_DEV, &sd_template);
+	ret = scsi_register_module(MODULE_SCSI_DEV, &sd_template);
+	if(ret == 0)
+		register_reboot_notifier(&sd_notifier_block);
+	return ret;
 }
 
 static void __exit exit_sd(void)
@@ -1521,6 +1494,92 @@
 	sd_template.dev_max = 0;
 	if (sd_gendisks != &sd_gendisk)
 		kfree(sd_gendisks);
+
+	unregister_reboot_notifier(&sd_notifier_block);
+}
+
+static int sd_notifier(struct notifier_block *nbt, unsigned long event, void *buf)
+{
+	Scsi_Disk *dpnt;
+	int i;
+
+	if (!(event == SYS_RESTART || event == SYS_HALT 
+	      || event == SYS_POWER_OFF))
+		return NOTIFY_DONE;
+	for (dpnt = rscsi_disks, i = 0; i < sd_template.dev_max; i++, dpnt++) {
+		if (!dpnt->device)
+			continue;
+		if (dpnt->WCE)
+			sd_synchronize_cache(dpnt->device, 1);
+	}
+
+	return NOTIFY_OK;
+}
+
+/* send a SYNCHRONIZE CACHE instruction down to the device through the
+ * normal SCSI command structure.  Wait for the command to complete (must
+ * have user context) */
+static int sd_synchronize_cache(Scsi_Device *SDpnt, int verbose)
+{
+	Scsi_Request *SRpnt;
+	int retries, the_result;
+
+	if(verbose) {
+		/* we actually want an sd name, so this awful lookup
+		 * is only done if verbose is specified */
+		int i;
+		char buf[16];
+		Scsi_Disk *dpnt;
+
+		for (dpnt = rscsi_disks, i = 0; i < sd_template.dev_max; i++, dpnt++) {
+		if (dpnt->device == SDpnt)
+			break;
+		}
+		/* no error checking ! */
+		sd_devname(i, buf);
+
+		printk("%s: synchronizing cache...", buf);
+	}
+
+	SRpnt = scsi_allocate_request(SDpnt);
+	if(!SRpnt) {
+		if(verbose)
+			printk("FAILED\n  No memory for request\n");
+		return 0;
+	}
+		
+
+	for(retries = 3; retries > 0; --retries) {
+		unsigned char cmd[10] = { 0 };
+
+		cmd[0] = SYNCHRONIZE_CACHE;
+		cmd[1] = SDpnt->scsi_level <= SCSI_2 ? (SDpnt->lun << 5) & 0xe0 : 0;
+		/* leave the rest of the command zero to indicate 
+		 * flush everything */
+		scsi_wait_req(SRpnt, (void *)cmd, NULL, 0,
+			      SD_TIMEOUT, MAX_RETRIES);
+
+		if(SRpnt->sr_result == 0)
+			break;
+	}
+
+	the_result = SRpnt->sr_result;
+	scsi_release_request(SRpnt);
+	if(verbose) {
+		if(the_result == 0) {
+			printk("OK\n");
+		} else {
+			printk("FAILED\n  status = %x, message = %02x, host = %d, driver = %02x\n  ",
+			       status_byte(the_result),
+			       msg_byte(the_result),
+			       host_byte(the_result),
+			       driver_byte(the_result));
+			if (driver_byte(the_result) & DRIVER_SENSE)
+				print_req_sense("sd", SRpnt);
+
+		}
+	}
+	return (the_result == 0);
 }
 
 module_init(init_sd);
diff -Nru a/drivers/scsi/sd.h b/drivers/scsi/sd.h
--- a/drivers/scsi/sd.h	Mon Mar 11 12:09:13 2002
+++ b/drivers/scsi/sd.h	Mon Mar 11 12:09:13 2002
@@ -33,6 +33,8 @@
 	unsigned char sector_bit_size;	/* sector_size = 2 to the  bit size power */
 	unsigned char sector_bit_shift;		/* power of 2 sectors per FS block */
 	unsigned has_part_table:1;	/* has partition table */
+        unsigned WCE:1;         /* state of disk WCE bit */
+        unsigned RCD:1;         /* state of disk RCD bit */
 } Scsi_Disk;
 
 extern int revalidate_scsidisk(kdev_t dev, int maxusage);

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-10  5:24                   ` Douglas Gilbert
  2002-03-11 11:13                     ` Kurt Garloff
  2002-03-11 11:34                     ` Stephen C. Tweedie
@ 2002-03-12  1:17                     ` GOTO Masanori
  2 siblings, 0 replies; 73+ messages in thread
From: GOTO Masanori @ 2002-03-12  1:17 UTC (permalink / raw)
  To: garloff, dougg, sct, jeremy, phillips, James.Bottomley, mason,
	linux-kernel, linux-scsi

At Mon, 11 Mar 2002 12:13:00 +0100,
Kurt Garloff <garloff@suse.de> wrote:
> On Sun, Mar 10, 2002 at 12:24:12AM -0500, Douglas Gilbert wrote:
> > "Stephen C. Tweedie" wrote:
> > > Even if WCE is enabled in the caching mode page, we can still set FUA
> > > (Force Unit Access) in individual write commands to force platter
> > > completion before commands complete.
> > > 
> > > Of course, it's a good question whether this is honoured properly on
> > > all drives.
> > > 
> > > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> > 
> > Stephen,
> [...]
> > 
> > Also SYNCHRONIZE CACHE(10) allows a range of blocks to be sent
> > to the platter but the size of the range is limited to 2**16 - 1
> > blocks which is probably too small to be useful. If the
> > "number of blocks" field is set to 0 then the whole disk cache
> > is flushed to the platter.
> 
> Which I think we should send before shutdown (and possible poweroff) for
> disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
> SCSI spec also lists SYNCHRONIZE_CACHE for CD-Rom devices
> Unfortunately, SYNCHRONIZE CACHE is optional, so we would need to ignore any
> errors returned by this command.

I agree.
BTW, power management like suspend/resume needs 
SYNCHRONIZE_CACHE for the broken HDD/controller...?

-- gotom

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-11 11:13                     ` Kurt Garloff
@ 2002-03-12  6:58                       ` Jens Axboe
  2002-03-13 22:37                         ` Peter Osterlund
  0 siblings, 1 reply; 73+ messages in thread
From: Jens Axboe @ 2002-03-12  6:58 UTC (permalink / raw)
  To: Kurt Garloff, Douglas Gilbert, Stephen C. Tweedie, Jeremy Higdon,
	Daniel Phillips, James Bottomley, Chris Mason, linux-kernel,
	linux-scsi

On Mon, Mar 11 2002, Kurt Garloff wrote:
> disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
> SCSI spec also lists SYNCHRONIZE_CACHE for CD-Rom devices

Hey, I use SYNCHRONIZE_CACHE in the packet writing stuff for CD-ROM's
all the time :-). Not all are read-only. In fact, Peter Osterlund
discovered that if you have pending writes on the CD-ROM it's a really
good idea to sync the cache prior to starting reads or they have a nasty
tendency to time out.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-05 22:29                       ` Daniel Phillips
@ 2002-03-12  7:01                         ` Jens Axboe
  0 siblings, 0 replies; 73+ messages in thread
From: Jens Axboe @ 2002-03-12  7:01 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Jeremy Higdon, James Bottomley, Chris Mason,
	linux-kernel, linux-scsi

On Tue, Mar 05 2002, Daniel Phillips wrote:
> > On Mon, Mar 04 2002, Daniel Phillips wrote:
> > > > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> > > 
> > > I'm having a little trouble seeing the difference between WRITE10, WRITE12
> > > and WRITE16.  WRITE6 seems to be different only in not garaunteeing to 
> > > support the FUA (and one other) bit.  I'm reading the Scsi Block Commands
> > 
> > WRITE6 was deprecated because there is only one byte available to set
> > transfer size. Enter WRITE10. WRITE12 allows the use of the streaming
> > performance settings, that's the only functional difference wrt WRITE10
> > iirc.
> 
> Thanks.  This is poorly documented, to say the least.

Maybe in the t10 spec, it's quite nicely explained elsewhere. Try the
Mtfuji spec, it really is better organized and easier to browse through.

> > > (Side note: how nice it would be if t10.org got a clue and posted their
> > > docs in html, in addition to the inconvenient, unhyperlinked, proprietary
> > > format pdfs.)
> > 
> > See the mtfuji docs as an example for how nicely pdf's can be setup too.
> 
> Do you have a url?

ftp.avc-pioneer.com/Mtfuji5/Spec

> > The thought of substituting that for a html version makes me want to
> > barf.
> 
> Who said substitute?  Provide beside, as is reasonable.  For my part,
> pdf's tend to cause severe indigestion, if not actually cause
> regurgitation.

Matter of taste I guess, I find html slow cumbersome to use.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-12  6:58                       ` Jens Axboe
@ 2002-03-13 22:37                         ` Peter Osterlund
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Osterlund @ 2002-03-13 22:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi

Jens Axboe <axboe@suse.de> writes:

> On Mon, Mar 11 2002, Kurt Garloff wrote:
> > disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
> > SCSI spec also lists SYNCHRONIZE_CACHE for CD-Rom devices
> 
> Hey, I use SYNCHRONIZE_CACHE in the packet writing stuff for CD-ROM's
> all the time :-). Not all are read-only. In fact, Peter Osterlund
> discovered that if you have pending writes on the CD-ROM it's a really
> good idea to sync the cache prior to starting reads or they have a nasty
> tendency to time out.

Not only time out, some drives give up immediately with SK/ASC/ASCQ
05/2c/00 "command sequence error" unless you flush the cache first.
After some googling, I found a plausible explanation for that
behaviour here:

http://www.rahul.net/endl/cdaccess/t10/email/mmc/1997/m9703031.txt

-- 
Peter Osterlund - petero2@telia.com
http://w1.894.telia.com/~u89404340

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-03-01 15:26 Dieter Nützel
@ 2002-03-01 16:00 ` James Bottomley
  0 siblings, 0 replies; 73+ messages in thread
From: James Bottomley @ 2002-03-01 16:00 UTC (permalink / raw)
  To: Dieter Nützel; +Cc: James Bottomley, Chris Mason, Linux Kernel List

Dieter.Nuetzel@hamburg.de said:
> How do you checked it? 

I used sginfo from Doug Gilbert's sg utilities (http://www.torque.net/sg)

The version was sg3_utils-0.98

Dieter.Nuetzel@hamburg.de said:
> But when I use "scsi-config" I get under "Cache Control Page": Read
> cache enabled: Yes Write cache enabled: No 

I believe write cache enabled is the state of the WCE bit and read cache 
enabled is the inverse of the RCD bit, so you have a write through cache.

I think that notwithstanding the spec, most drives are write through (purely 
because of the safety aspect).  I suspect certain manufacturers use write back 
caching to try to improve performance figures (at the expense of safety).

James



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
@ 2002-03-01 15:26 Dieter Nützel
  2002-03-01 16:00 ` James Bottomley
  0 siblings, 1 reply; 73+ messages in thread
From: Dieter Nützel @ 2002-03-01 15:26 UTC (permalink / raw)
  To: James Bottomley, Chris Mason; +Cc: Linux Kernel List

James Bottomley wrote:
> mason@suse.com said:
> > So, a little testing with scsi_info shows my scsi drives do have
> > writeback cache on.  great.  What's interesting is they must be doing
> > additional work for ordered tags.  If they were treating the block as
> > written once in cache, using the tags should not change  performance
> > at all.  But, I can clearly show the tags changing performance, and
> > hear the drive write pattern change when tags are on. 

> I checked all mine and they're write through.  However, I inherited all my 
> drives from an enterprise vendor so this might not be that surprising.

How do you checked it?
Which scsi_info version?
Mine gave only the below info:

SunWave1 /home/nuetzel# scsi_info /dev/sda
SCSI_ID="0,0,0"
MODEL="IBM DDYS-T18350N"
FW_REV="S96H"
SunWave1 /home/nuetzel# scsi_info /dev/sdb
SCSI_ID="0,1,0"
MODEL="IBM DDRS-34560D"
FW_REV="DC1B"
SunWave1 /home/nuetzel# scsi_info /dev/sdc
SCSI_ID="0,2,0"
MODEL="IBM DDRS-34560W"
FW_REV="S71D"

But when I use "scsi-config" I get under "Cache Control Page":
Read cache enabled: Yes
Write cache enabled: No

I've tested it with setting this by hand some months ago, but the speed 
doesn't change in anyway (ReiserFS).

> I can surmise why ordered tags kill performance on your drive, since an 
> ordered tag is required to affect the ordering of the write to the medium,
> not the cache, it is probably implemented with an implicit cache flush.
>
> Anyway, the attached patch against 2.4.18 (and I know it's rather gross
> code)  will probe the cache type and try to set it to write through on boot.
>  See what this does to your performance ordinarily, and also to your
> tagged write barrier performance.

Will test it over the weekend on 2.4.19-pre1aa1 with all Reiserfs 
2.4.18.pending patches applied.

Regards,
	Dieter

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-21 23:30 Chris Mason
  2002-02-22 14:19 ` Stephen C. Tweedie
@ 2002-02-22 15:26 ` Chris Mason
  1 sibling, 0 replies; 73+ messages in thread
From: Chris Mason @ 2002-02-22 15:26 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrew Morton, linux-kernel



On Friday, February 22, 2002 02:19:15 PM +0000 "Stephen C. Tweedie" <sct@redhat.com> wrote:

>> There might be additional spots in ext3 where ordering needs to be 
>> enforced, I've included the ext3 code below in hopes of getting 
>> some comments.
> 
> No.  However, there is another optimisation which we can make.
> 
> Most ext3 commits, in practice, are lazy, asynchronous commits, and we
> only nedd BH_Ordered_Tag for that, not *_Flush.  It would be easy
> enough to track whether a given transaction has any synchronous
> waiters, and if not, to use the async *_Tag request for the commit
> block instead of forcing a flush.

Just a note, the scsi code doesn't implement flush at all, flush
either gets ignored or failed (if BH_Ordered_Hard is set), the
assumption being that scsi devices don't write back by default, so
wait_on_buffer() is enough.

The reiserfs code tries to be smart with _Tag, in pratice I haven't
found a device that gains from it, so I didn't want to make the larger
changes to ext3 until I was sure it was worthwhile ;-)

It seems the scsi drives don't do tag ordering as nicely as we'd 
hoped, I'm hoping someone with a big raid controller can help 
benchmark the ordered tag mode on scsi.  Also, check the barrier
threads from last week on how write errors might break the 
ordering with the current scsi code.

-chris


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] 2.4.x write barriers (updated for ext3)
  2002-02-21 23:30 Chris Mason
@ 2002-02-22 14:19 ` Stephen C. Tweedie
  2002-02-22 15:26 ` Chris Mason
  1 sibling, 0 replies; 73+ messages in thread
From: Stephen C. Tweedie @ 2002-02-22 14:19 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel

Hi,

On Thu, Feb 21, 2002 at 06:30:20PM -0500, Chris Mason wrote:
 
> This makes it much easier to add support for ide writeback
> flushing to things like ext3 and lvm, where I want to make
> the minimal possible changes to make things safe.

Nice.

> There might be additional spots in ext3 where ordering needs to be 
> enforced, I've included the ext3 code below in hopes of getting 
> some comments.

No.  However, there is another optimisation which we can make.

Most ext3 commits, in practice, are lazy, asynchronous commits, and we
only nedd BH_Ordered_Tag for that, not *_Flush.  It would be easy
enough to track whether a given transaction has any synchronous
waiters, and if not, to use the async *_Tag request for the commit
block instead of forcing a flush.

We'd also have to track the sync status of the most recent
transaction, too, so that on fsync of a non-dirty file/inode, we make
sure that its data had been forced to disk by at least one synchronous
flush.  

But that's really only a win for SCSI, where proper async ordered tags
are supported.  For IDE, the single BH_Ordered_Flush is quite
sufficient.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH] 2.4.x write barriers (updated for ext3)
@ 2002-02-21 23:30 Chris Mason
  2002-02-22 14:19 ` Stephen C. Tweedie
  2002-02-22 15:26 ` Chris Mason
  0 siblings, 2 replies; 73+ messages in thread
From: Chris Mason @ 2002-02-21 23:30 UTC (permalink / raw)
  To: Andrew Morton, Stephen C. Tweedie; +Cc: linux-kernel


Hi everyone,

I've changed the write barrier code around a little so the block layer 
isn't forced to fail barrier requests the queue can't handle.

This makes it much easier to add support for ide writeback
flushing to things like ext3 and lvm, where I want to make
the minimal possible changes to make things safe.

The full patch is at:
ftp.suse.com/pub/people/mason/patches/2.4.18/queue-barrier-8.diff

There might be additional spots in ext3 where ordering needs to be 
enforced, I've included the ext3 code below in hopes of getting 
some comments.

The only other change was to make reiserfs use the IDE flushing mode
by default.  It falls back to non-ordered calls on scsi.

-chris

--- linus.23/fs/jbd/commit.c Mon, 28 Jan 2002 09:51:50 -0500
+++ linus.23(w)/fs/jbd/commit.c Thu, 21 Feb 2002 17:11:00 -0500
@@ -595,7 +595,15 @@
                struct buffer_head *bh = jh2bh(descriptor);
                clear_bit(BH_Dirty, &bh->b_state);
                bh->b_end_io = journal_end_buffer_io_sync;
+
+               /* if we're on an ide device, setting BH_Ordered_Flush
+                  will force a write cache flush before and after the
+                  commit block.  Otherwise, it'll do nothing.  */
+
+               set_bit(BH_Ordered_Flush, &bh->b_state);
                submit_bh(WRITE, bh);
+               clear_bit(BH_Ordered_Flush, &bh->b_state);
+
                wait_on_buffer(bh);
                put_bh(bh);             /* One for getblk() */
                journal_unlock_journal_head(descriptor);









^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2002-03-13 22:37 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-02-22 15:57 [PATCH] 2.4.x write barriers (updated for ext3) James Bottomley
2002-02-22 16:10 ` Chris Mason
2002-02-22 16:13 ` Stephen C. Tweedie
2002-02-22 17:36   ` James Bottomley
2002-02-22 18:14     ` Chris Mason
2002-02-28 15:36       ` James Bottomley
2002-02-28 15:55         ` Chris Mason
2002-02-28 17:58           ` Mike Anderson
2002-02-28 18:12         ` Chris Mason
2002-03-01  2:08           ` James Bottomley
2002-03-03 22:11         ` Daniel Phillips
2002-03-04  4:21           ` Jeremy Higdon
2002-03-04  5:31             ` Daniel Phillips
2002-03-04  6:09               ` Jeremy Higdon
2002-03-04  7:57                 ` Daniel Phillips
2002-03-05  7:09                   ` Jeremy Higdon
2002-03-05 22:56                     ` Daniel Phillips
2002-03-04 16:52                 ` Stephen C. Tweedie
2002-03-04 18:15                   ` Daniel Phillips
2002-03-05  7:40                     ` Jens Axboe
2002-03-05 22:29                       ` Daniel Phillips
2002-03-12  7:01                         ` Jens Axboe
2002-03-10  5:24                   ` Douglas Gilbert
2002-03-11 11:13                     ` Kurt Garloff
2002-03-12  6:58                       ` Jens Axboe
2002-03-13 22:37                         ` Peter Osterlund
2002-03-11 11:34                     ` Stephen C. Tweedie
2002-03-11 17:15                       ` James Bottomley
2002-03-12  1:17                     ` GOTO Masanori
2002-03-04 14:48           ` James Bottomley
2002-03-06 13:59             ` Daniel Phillips
2002-03-06 14:34               ` James Bottomley
2002-03-04  3:34         ` Chris Mason
2002-03-04  5:05           ` Daniel Phillips
2002-03-04 15:03             ` James Bottomley
2002-03-04 17:04               ` Stephen C. Tweedie
2002-03-04 17:35                 ` James Bottomley
2002-03-04 17:48                   ` Chris Mason
2002-03-04 18:11                     ` James Bottomley
2002-03-04 18:41                       ` Chris Mason
2002-03-04 21:34                       ` Stephen C. Tweedie
2002-03-04 18:09                   ` Stephen C. Tweedie
2002-03-04 17:16               ` Chris Mason
2002-03-04 18:05                 ` Stephen C. Tweedie
2002-03-04 18:28                   ` James Bottomley
2002-03-04 19:55                     ` Stephen C. Tweedie
2002-03-04 19:48                   ` Daniel Phillips
2002-03-04 19:57                     ` Stephen C. Tweedie
2002-03-04 21:06                       ` Daniel Phillips
2002-03-05 14:58                         ` Stephen C. Tweedie
2002-03-05  7:48                     ` Jens Axboe
2002-03-04 19:51                 ` Daniel Phillips
2002-03-05  7:42                   ` Jens Axboe
2002-03-04  8:19           ` Helge Hafting
2002-03-04 14:57           ` James Bottomley
2002-03-04 17:24             ` Chris Mason
2002-03-04 19:02               ` Daniel Phillips
2002-03-05  7:22             ` Jeremy Higdon
2002-03-05 23:01               ` Daniel Phillips
2002-02-25 10:57 ` Helge Hafting
2002-02-25 15:04   ` James Bottomley
  -- strict thread matches above, loose matches on Subject: below --
2002-03-01 15:26 Dieter Nützel
2002-03-01 16:00 ` James Bottomley
2002-02-21 23:30 Chris Mason
2002-02-22 14:19 ` Stephen C. Tweedie
2002-02-22 15:26 ` Chris Mason
2002-01-10  9:55 [ANNOUNCE] FUSE: Filesystem in Userspace 0.95 Miklos Szeredi
2002-01-13  3:10 ` Pavel Machek
2002-01-21 10:18 ` Miklos Szeredi
2002-01-23 10:47   ` Pavel Machek
2002-01-22 19:07 ` Daniel Phillips
2002-01-23  2:33   ` [Avfs] " Justin Mason
2002-01-23  5:26     ` Daniel Phillips

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).