* Thoughts on using fs/jbd from drivers/md
@ 2002-05-16  5:54 Neil Brown
  2002-05-16 15:17 ` Stephen C. Tweedie
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2002-05-16  5:54 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel


Stephen, 
  You mentioned to me some time ago the idea of using jbd to journal
  RAID5 changes so as to improve recovery for raid5 from a crash.   It
  didn't get very high on my list of priorities at the time, but I
  have been thinking about it a bit more lately and thought I would
  share my thoughts with you and the readership of linux-kernel in the
  hope that any misunderstanding might get cleared up and some
  improvements might be suggested.

  I don't know if or when I will get time to implement any of this,
  but one never knows...


The basic idea is to provide journaling for md/RAID arrays.  There
are two reasons that one might want to do this:
 1/ crash recovery.  Both raid1 and raid5 need to reconstruct the
   redundancy after a crash.  For a degraded raid5 array, this is not
   possible and you can suffer undetected data corruption.
   If we have a journal of recent changes we can avoid the
   reconstruction and the risk of corruption.

 2/ latency reduction.  If the journal is on a small, fast device
   (e.g. NVRAM) then you can get greatly reduced latency (like ext3 with
   data=journal).   This could benefit any raid level and would
   effectively provide a write-behind cache.

I think the most interesting version is an NVRAM journal for a RAID5
array, so that is what I will focus on.  If that can be made to work
then any other configuration should fall out.

A/ where to put the journal.
 Presumably JBD doesn't care where the journal is.  Its client just
 provides a mapping from journal offset to dev/sector and JBD just
 calls submit_bh with this information(?).
 The only other requirement that the JBD places would be a correct jbd
 superblock at the start.  Would that be right?

 Having it on a separate device would be easiest, particularly if you
 wanted it to be on NVRAM.
 The md module could - at array configuration time - reserve the
 head (or tail) of the array for a journal.  This wouldn't work for
 raid5 - you would need to reserve the first (or last) few stripes and
 treat them as raid1 so that there is no risk of data loss.  
 I'm not sure how valuable having a journal on the main raid devices
 would be though as it would probably kill performance...

B/ what to journal.

 For raid levels other than 4/5, we obviously just journal all data
 blocks.  There are no interdependencies or anything interesting.

 For raid4/5 we have the parity block to worry about.
 I think we want to write data blocks to the journal ASAP, and then
 once parity has been calculated for a stripe we write the parity
 block to the journal and then are free to write the parity and data
 to the array.
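
 As a sketch, the ordering I have in mind for one stripe would be
 something like this (the fields and helpers here are invented for
 illustration, they are not existing md code):

    /* Rough ordering for one raid5 stripe write. */
    static void journal_stripe_write(struct stripe_head *sh)
    {
            int i;

            /* 1. Journal each new data block as soon as it arrives,
             *    so the request can be acknowledged early. */
            for (i = 0; i < sh->nr_data; i++)
                    if (sh->new_data[i])
                            md_journal_write(sh, sh->new_data[i]);

            /* 2. After any necessary pre-reading, compute parity. */
            compute_parity(sh);

            /* 3. Journal the parity block. */
            md_journal_write(sh, sh->parity);

            /* 4. Only now are data and parity free to go to the array. */
            md_write_stripe_to_array(sh);
    }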

 On journal replay we would collect together data blocks in a stripe
 until we get a parity block for that stripe.
 When we get a parity block we can write the parity block and
 collected data block to the array.  If we hit the end of the journal
 before getting a parity block, then we can assume that the data never
 hit the array and we can schedule writes for the data blocks as
 normal.
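
 In code terms the replay loop would look roughly like this (again,
 purely illustrative names):

    /* Accumulate data blocks per stripe; a parity block "closes"
     * the stripe and lets us write the whole thing out. */
    while ((blk = journal_next_block(j)) != NULL) {
            struct replay_stripe *rs = find_or_create_stripe(blk->stripe);

            if (is_parity_block(blk)) {
                    write_stripe_to_array(rs, blk);   /* data + parity */
                    free_replay_stripe(rs);
            } else {
                    add_data_block(rs, blk);          /* wait for parity */
            }
    }
    /* Any stripe still open never had its parity journalled, so its
     * data never hit the array: just schedule normal writes. */
    for_each_open_replay_stripe(rs)
            schedule_normal_write(rs);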

 The only remaining issue is addressing. The journal presumably
 doesn't know about "parity" or "data" blocks.  It just knows about
 sector addresses.
 I think I would tell the journal that data blocks have the address
 that they have in the array, and parity blocks, which don't have an
 address in the array, have an address which is the address on the
 disc, plus some offset which is at least the size of the array.
 Would it cause JBD any problems if the sector address it is given is
 not a real address on any real device but is something that can be
 adequately interpreted by the client?
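
 Concretely, I am thinking of an encoding along these lines (purely
 illustrative):

    /* parity_offset would be at least the array size in sectors. */
    static unsigned long parity_offset;

    static unsigned long parity_to_journal_addr(unsigned long disk_sector)
    {
            return disk_sector + parity_offset;
    }

    static int journal_addr_is_parity(unsigned long addr)
    {
            return addr >= parity_offset;
    }

    static unsigned long journal_addr_to_disk_sector(unsigned long addr)
    {
            return addr - parity_offset;
    }

 Data blocks would be journalled under their ordinary array address,
 below parity_offset, so the client can tell the two apart on replay.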

C/ data management.

 One big difference between a filesystem using JBD and a device driver
 using JBD is the ownership of buffers.
 It is very important that a buffer which has been written to the
 journal not be changed before it gets written to the main storage, so
 ownership is important.

 As I understand it, the filesystem owns its buffers and can pretty
 much control who writes and when (a possible exception being mem-mapped
 buffers, but they aren't journaled except with data=journal...).
 Thus it can ensure that the same data that was written to the journal
 is written to the device.

 However a device driver does not own the buffers that it uses. It
 cannot control changes and it cannot be sure that the buffer will
 still even exist after it has acknowledged a write.
 RAID5 faces this problem as it needs to be sure that the data used
 for the parity calculation is the same as the data that ends up on
 disc.  To ensure this raid5 makes a copy of the data after doing any
 necessary pre-reading and just before making the final parity block
 calculation. 

 When journaling raid5, we could use the same approach: copy the data
 into a buffer, write it to the journal, and then write it to the main
 array.  Not
 only would this not work for other raid levels, but it would not be
 ideal for raid5 either.  This is because one of our aims is reducing
 latency, and if we had to wait for pre-reading to complete before
 writing to the journal, we would lose that benefit.  We could
 possibly copy to the same buffer earlier, but that would cause other
 problems - when doing read-modify-write parity update, we pre-read
 into the buffer that we will later copy the new data into, so we
 would need to allocate more buffers. (Is that coherent?)

 It seems that we need a generic buffer-cache in front of the md
 driver:
   - A write request gets copied into a buffer from this cache
   - the buffer gets written to the journal
   - the original write request gets returned
   - the buffer gets written to the array

 This would work, but means allocating lots more memory, and adds an
 extra mem-to-mem copy which will slow things down.
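
 A rough sketch of that write path (all of these names are made up,
 and the bh fields are from memory of the 2.4 buffer_head, so treat
 them as approximate):

    static int md_journal_make_request(struct buffer_head *bh)
    {
            struct cache_buf *cb = cache_buf_get(bh->b_rsector, bh->b_size);

            memcpy(cb->data, bh->b_data, bh->b_size); /* 1. copy         */
            md_journal_write(cb);                     /* 2. journal      */
            bh->b_end_io(bh, 1);                      /* 3. ack caller   */
            queue_array_write(cb);                    /* 4. later: array */
            return 0;
    }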

 The only improvement that I can think of would only work with an
 NVRAM journal device.  It involves writing to the journal and then
 acknowledging the write - with minimal latency - and then reading the
 data back in off the journal into a buffer that then gets written to
 the main device.
 This would possibly require less memory and give less latency.  But
 it would be doing an extra DMA over the PCI bus, rather than a
 mem-to-mem copy.  Which causes least overhead?

 A variation of this could be to write to the main storage directly
 out of the NVRAM.  This could only work on devices that can be
 completely mapped into the PCI address space, which some can...

 I feel that the best approach would be to implement two options:
  1/ write straight to the journal and then read-back for writing to
     the device.  This would be used when the journal was on NVRAM and
     would be the only option for raid levels other than raid5.
  2/ Write to the journal after doing a parity calculation and before
     writing a new stripe to disc.  This would only be available with
     raid5 and would (probably) only be used if the journal was on a
     disc drive (or mirrored pair of disc drives).

That's about all my thoughts for now.
All comments welcome.

Now it's probably time for me to go read
   http://lwn.net/2002/0328/a/jbd-doc.php3
or is there something better?

NeilBrown

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-16  5:54 Thoughts on using fs/jbd from drivers/md Neil Brown
@ 2002-05-16 15:17 ` Stephen C. Tweedie
  2002-05-17 18:29   ` Mike Fedyk
  2002-05-26  8:41   ` Daniel Phillips
  0 siblings, 2 replies; 10+ messages in thread
From: Stephen C. Tweedie @ 2002-05-16 15:17 UTC (permalink / raw)
  To: Neil Brown; +Cc: Stephen C. Tweedie, linux-kernel

Hi,

On Thu, May 16, 2002 at 03:54:20PM +1000, Neil Brown wrote:
 
>   You mentioned to me some time ago the idea of using jbd to journal
>   RAID5 changes so as to improve recovery for raid5 from a crash.

Yes.  It just so happens that I've been swamped with maintenance stuff
recently but it looks like that's done with now --- I've got a patch
ready for the new ext3 version --- and I'm about to start on new
development bits and pieces myself.  One of the things on the list for
that is adding the necessary support to allow multiple disks to be
listed in the journal at once (I want that to allow different ext3
filesystems to share the same external journal disk), and that work
would be relevant to raid5 journaling too.

Oh, and my main test box has now got a Cenatek Rocket Drive
solid-state PCI memory card in it, so I can test this with a
zero-seek-cost shared journal, too. :-)

> The basic idea is to provide journaling for md/RAID arrays.  There
> are two reasons that one might want to do this:
>  1/ crash recovery.  Both raid1 and raid5 need to reconstruct the
>    redundancy after a crash.  For a degraded raid5 array, this is not
>    possible and you can suffer undetected data corruption.
>    If we have a journal of recent changes we can avoid the
>    reconstruction and the risk of corruption.

Right.  The ability of soft raid5 to lose data in degraded mode over a
reboot (including data that was not being modified at the time of the
crash) is something that is not nearly as widely understood as it
should be, and I'd love for us to do something about it.

>  2/ latency reduction.  If the journal is on a small, fast device
>    (e.g. NVRAM) then you can get greatly reduced latency (like ext3 with
>    data=journal).   This could benefit any raid level and would
>    effectively provide a write-behind cache.

Yep.  This already works a treat for ext3.

> A/ where to put the journal.
>  Presumably JBD doesn't care where the journal is.

Correct.  There are basically two choices --- you either give it an
inode, and it uses bmap to map the inode to a block device; or you
give it a block device and an offset.  The journal must be logically
contiguous, and its length is encoded in the superblock when you
create the journal.

>  The only other requirement that the JBD places would be a correct jbd
>  superblock at the start.  Would that be right?

Yes, although there is kernel code to create a new journal superblock
from scratch (all of the journal IO code is there already, so
journal_create is a tiny function to add on top of that.)
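
For orientation, the setup sequence would look roughly like this
(md_setup_journal is hypothetical, and the JBD prototypes are
simplified from memory, so check fs/jbd/journal.c for the real ones):

    static journal_t *md_setup_journal(kdev_t journal_dev, kdev_t array_dev,
                                       int start, int len, int blocksize,
                                       int fresh)
    {
            journal_t *j;

            /* External journal: a device plus start block and length.
             * (journal_init_inode() is the other option, for a journal
             * held in a filesystem inode and found via bmap.) */
            j = journal_init_dev(journal_dev, array_dev, start, len, blocksize);
            if (!j)
                    return NULL;

            /* journal_create() writes a fresh journal superblock;
             * journal_load() reads an existing one and replays it. */
            if (fresh ? journal_create(j) : journal_load(j)) {
                    journal_destroy(j);
                    return NULL;
            }
            return j;
    }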

>  The md module could - at array configuration time - reserve the
>  head (or tail) of the array for a journal.  This wouldn't work for
>  raid5 - you would need to reserve the first (or last) few stripes and
>  treat them as raid1 so that there is no risk of data loss.  
>  I'm not sure how valuable having a journal on the main raid devices
>  would be though as it would probably kill performance...

It would depend on the workload.  For raid1 you'd be doubling the
number of IOs for writes, but because the journal writes are
sequential you don't double the number of seeks, which saves a bit.
For raid5, it might depend on how many of the disks you are going to
mirror the journal on.

For the initial development work, though, it would be much easier to
assume that the journal is just on a different device entirely.  Once
we can cleanly support multiple soft raid devices journaling to the
same external device, that restriction becomes much less onerous.

> B/ what to journal.
 
>  For raid4/5 we have the parity block to worry about.
>  I think we want to write data blocks to the journal ASAP, and then
>  once parity has been calculated for a stripe we write the parity
>  block to the journal and then are free to write the parity and data
>  to the array.

>  On journal replay we would collect together data blocks in a stripe
>  until we get a parity block for that stripe.

Actually, a block can appear multiple times in the journal.  In this
case, though, all we really need is a list of all stripes which have
been modified by the journal replay --- then you simply recalculate
parity on all of those stripes.  Until you have done that, we'll keep
the journal marked needs_recovery, so if the parity recalculation gets
interrupted by a crash, we'll just replay the whole lot and do another
full parity recalc on the subsequent reboot.
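
So recovery would look roughly like this (the stripe-walking helpers
are hypothetical, the point is just the ordering):

    journal_recover(j);     /* replay committed blocks; journal_load()
                             * does this internally anyway */

    /* The journal stays marked needs_recovery until every stripe we
     * touched has had its parity recomputed; a crash in here just
     * means we replay and recalculate again on the next boot. */
    for_each_replayed_stripe(array, stripe)
            raid5_recompute_parity(array, stripe);

    journal_mark_recovery_complete(j);  /* hypothetical; clears the flag */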

However, we have degraded mode to worry about, and in degraded mode we
*do* need to journal parity updates for all stripes (except those in
which it's the parity disk which has failed).

>  The only remaining issue is addressing. The journal presumably
>  doesn't know about "parity" or "data" blocks.  It just knows about
>  sector addresses.
>  I think I would tell the journal that data blocks have the address
>  that they have in the array, and parity blocks, which don't have an
>  address in the array, have an address which is the address on the
>  disc, plus some offset which is at least the size of the array.

Why not just journal them as writes to the underlying disks which
comprise the array?  That's the clean abstraction --- it means that
when raid5 schedules its stripe write, the journal deals with that
stripe as an atomic unit, but raid5 still sees it as a physical write.

>  Would it cause JBD any problems if the sector address it is given is
>  not a real address on any real device but is something that can be
>  adequately interpreted by the client?

Yes, because once you've given the block to JBD, it assumes that it is
ultimately responsible for all writeback.  

But that's something we'd need to discuss --- the JBD API would need
some enhancing in any case to cope with raid5's submit_bh regime to
deal with repeated writes to the same stripe (with the fs as a client,
we can assume that such repeated writes come from the same struct
buffer_head, at least until the bh is deleted, which is something
under our own control; or until the bh is explicitly released by JBD
once final writeback has happened.)

> C/ data management.
> 
>  One big difference between a filesystem using JBD and a device driver
>  using JBD is the ownership of buffers.
>  It is very important that a buffer which has been written to the
>  journal not be changed before it gets written to the main storage, so
>  ownership is important.

Not true!  If you write to the same stripe twice, then we can journal
the first version, modify the block, then journal the second version,
all without the write hitting backing store.

There's a second problem implicit in this --- while the stripe
writeback is pending, reads from the block device need to be satisfied
from the copy that is awaiting writeback.  For a filesystem client,
this isn't a problem --- the fs has its own cache and can make sure
that it reads from memory in the case where we've got a local copy
that hasn't yet been written back.  But for a block device client,
there isn't any such automatic caching.

>  As I understand it, the filesystem owns its buffers and can pretty
>  much control who writes and when

Not really; if the buffer is marked dirty, the VFS can write it
whenever it feels like it.

>  However a device driver does not own the buffers that it uses.

Soft raid5 _does_, however, own the temporary bh'es that are used to
schedule writes to the underlying physical devices.

>  It seems that we need a generic buffer-cache in front of the md
>  driver:
>    - A write request gets copied into a buffer from this cache
>    - the buffer gets written to the journal
>    - the original write request gets returned
>    - the buffer gets written to the array
> 
>  This would work, but means allocating lots more memory, and adds an
>  extra mem-to-mem copy which will slow things down.

Right, although I thought that you were already doing such a copy
inside raid5?

Doing zero-copy is essentially impossible in the general case.  The
writes can be coming from shared memory (mmap(MAP_SHARED,
PROT_WRITE)), so there is no guarantee that the incoming buffer_heads
will remain static throughout their progress towards the final raid
stripe.  If you want the parity to be correct, you are pretty much
forced to make a copy (unless we can do copy-on-write tricks to defer
this copy in certain cases, but that gets tricky and requires far more
interaction with the VM than is healthy for a block device driver!)
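
The failure mode is easy to see in miniature (illustrative fragment,
not raid5 code):

    /* Parity is only valid for the bytes that actually reach disk. */
    for (i = 0; i < chunk_bytes; i++)
            parity[i] = d0[i] ^ d1[i] ^ d2[i];

    /* If d1 is an mmap(MAP_SHARED) page, the application can dirty
     * it again after this loop but before d1 is written out; the
     * on-disk stripe then no longer satisfies P = D0 ^ D1 ^ D2, and
     * reconstructing any chunk of it later returns garbage. */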

>  The only improvement that I can think of would only work with an
>  NVRAM journal device.  It involves writing to the journal and then
>  acknowledging the write - with minimal latency - and then reading the
>  data back in off the journal into a buffer that then gets written to
>  the main device.

If part of our objective is to be able to survive power loss plus loss
of a disk on power-return, then I don't think we've got any choice ---
we have to wait for the parity to be available before we commit and
acknowledge the write.

Most applications are not all that bound by write latency.  They
typically care more about read latency and/or write throughput, and
any fancy games which try to minimise write latency at the expense of
correctness feel wrong to me.

Cheers,
 Stephen

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-16 15:17 ` Stephen C. Tweedie
@ 2002-05-17 18:29   ` Mike Fedyk
  2002-05-17 18:34     ` Stephen C. Tweedie
  2002-05-26  8:41   ` Daniel Phillips
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Fedyk @ 2002-05-17 18:29 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Neil Brown, linux-kernel

On Thu, May 16, 2002 at 04:17:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, May 16, 2002 at 03:54:20PM +1000, Neil Brown wrote:
> > The basic idea is to provide journaling for md/RAID arrays.  There
> > are two reasons that one might want to do this:
> >  1/ crash recovery.  Both raid1 and raid5 need to reconstruct the
> >    redundancy after a crash.  For a degraded raid5 array, this is not
> >    possible and you can suffer undetected data corruption.
> >    If we have a journal of recent changes we can avoid the
> >    reconstruction and the risk of corruption.
> 
> Right.  The ability of soft raid5 to lose data in degraded mode over a
> reboot (including data that was not being modified at the time of the
> crash) is something that is not nearly as widely understood as it
> should be, and I'd love for us to do something about it.

Are there workarounds to avoid this problem?

What does it take to trigger the corruption?

I ask because I have used a degraded raid5: the source drive was going to
become a member, but I needed to copy the data off it first.  While doing
so, I had to reboot a couple of times to reconfigure the boot loader.  All
seems to be working fine on the system today, though.

Thanks,

Mike

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-17 18:29   ` Mike Fedyk
@ 2002-05-17 18:34     ` Stephen C. Tweedie
  2002-05-18  1:35       ` Mike Fedyk
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen C. Tweedie @ 2002-05-17 18:34 UTC (permalink / raw)
  To: Stephen C. Tweedie, Neil Brown, linux-kernel

Hi,

On Fri, May 17, 2002 at 11:29:42AM -0700, Mike Fedyk wrote:
> On Thu, May 16, 2002 at 04:17:49PM +0100, Stephen C. Tweedie wrote:

> > Right.  The ability of soft raid5 to lose data in degraded mode over a
> > reboot (including data that was not being modified at the time of the
> > crash) is something that is not nearly as widely understood as it
> > should be, and I'd love for us to do something about it.
> 
> Are there workarounds to avoid this problem?

No.

> What does it take to trigger the corruption?

It just takes degraded mode, an unexpected power cycle, and concurrent
write activity.  

Degraded mode relies on the parity disk being in sync at all times ---
you can't recover data from the missing spindle unless that is true.
However, writes to a stripe are not atomic, and you can get a reboot
when, say, a write to one of the surviving data chunks has succeeded,
but the corresponding write to the parity disk has not.  If this
happens, the parity is no longer in sync, and the data belonging to
the missing spindle in that stripe will be lost forever.
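
A toy demonstration (ordinary user-space C, single bytes standing in
for whole chunks):

    #include <assert.h>

    int main(void)
    {
            unsigned char d0 = 0x12, d1 = 0x34, d2 = 0x56; /* d2: dead disk  */
            unsigned char p  = d0 ^ d1 ^ d2;               /* parity in sync */

            assert((d0 ^ d1 ^ p) == d2);  /* degraded reads of d2 work */

            /* A write updates d0, but the power dies before the
             * matching parity write reaches disk: */
            d0 = 0x99;                    /* new data made it to disk  */
                                          /* ... the new parity didn't */

            assert((d0 ^ d1 ^ p) != d2);  /* d2 now reconstructs wrongly */
            return 0;
    }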

> I ask because I have used a degraded raid5: the source drive was going to
> become a member, but I needed to copy the data off it first.  While doing
> so, I had to reboot a couple of times to reconfigure the boot loader.  All
> seems to be working fine on the system today, though.

If it was a clean shutdown and reboot, you're fine.

Cheers, 
 Stephen

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-17 18:34     ` Stephen C. Tweedie
@ 2002-05-18  1:35       ` Mike Fedyk
  2002-05-18 12:47         ` Stephen C. Tweedie
  2002-05-21  9:03         ` Helge Hafting
  0 siblings, 2 replies; 10+ messages in thread
From: Mike Fedyk @ 2002-05-18  1:35 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Neil Brown, linux-kernel

On Fri, May 17, 2002 at 07:34:10PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, May 17, 2002 at 11:29:42AM -0700, Mike Fedyk wrote:
> > On Thu, May 16, 2002 at 04:17:49PM +0100, Stephen C. Tweedie wrote:
> 
> > > Right.  The ability of soft raid5 to lose data in degraded mode over a
> > > reboot (including data that was not being modified at the time of the
> > > crash) is something that is not nearly as widely understood as it
> > > should be, and I'd love for us to do something about it.
> > 
> > Are there workarounds to avoid this problem?
> 
> No.
> 
> > What does it take to trigger the corruption?
> 
> It just takes degraded mode, an unexpected power cycle, and concurrent
> write activity.  
> 
> Degraded mode relies on the parity disk being in sync at all times ---

Doesn't degraded mode imply that there are not any parity
disk(raid4)/stripe(raid5) updates?

> you can't recover data from the missing spindle unless that is true.
> However, writes to a stripe are not atomic, and you can get a reboot
> when, say, a write to one of the surviving data chunks has succeeded,
> but the corresponding write to the parity disk has not.  If this
> happens, the parity is no longer in sync, and the data belonging to
> the missing spindle in that stripe will be lost forever.
> 
> > I ask because I have used a degraded raid5: the source drive was going to
> > become a member, but I needed to copy the data off it first.  While doing
> > so, I had to reboot a couple of times to reconfigure the boot loader.  All
> > seems to be working fine on the system today, though.
> 
> If it was a clean shutdown and reboot, you're fine.
>

OK, that's good.

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-18  1:35       ` Mike Fedyk
@ 2002-05-18 12:47         ` Stephen C. Tweedie
  2002-05-21  9:03         ` Helge Hafting
  1 sibling, 0 replies; 10+ messages in thread
From: Stephen C. Tweedie @ 2002-05-18 12:47 UTC (permalink / raw)
  To: Stephen C. Tweedie, Neil Brown, linux-kernel

Hi,

On Fri, May 17, 2002 at 06:35:37PM -0700, Mike Fedyk wrote:
> On Fri, May 17, 2002 at 07:34:10PM +0100, Stephen C. Tweedie wrote:
> > Degraded mode relies on the parity disk being in sync at all times ---
> 
> Doesn't degraded mode imply that there are not any parity
> disk(raid4)/stripe(raid5) updates?

Nope, parity updates still occur.  It's more expensive than in
non-degraded mode, but parity still gets updated.  If it wasn't, you
would not be able to write to a degraded array at all, as updating
parity is the only way that you can write to a block which maps to a
failed disk.  By using parity, we only ever fail requests if there are
two or more failed disks in the array.
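
In miniature (single bytes standing in for chunks):

    /* Writing "D2" when the disk that holds D2 has failed: only the
     * parity chunk is physically rewritten. */
    static unsigned char new_parity(unsigned char d0, unsigned char d1,
                                    unsigned char d2_new)
    {
            /* A later degraded read of D2 computes d0 ^ d1 ^ p and
             * gets d2_new back, so the write "took" even though the
             * disk it nominally targets is gone. */
            return d0 ^ d1 ^ d2_new;
    }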

Cheers,
 Stephen

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-18  1:35       ` Mike Fedyk
  2002-05-18 12:47         ` Stephen C. Tweedie
@ 2002-05-21  9:03         ` Helge Hafting
  1 sibling, 0 replies; 10+ messages in thread
From: Helge Hafting @ 2002-05-21  9:03 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: linux-kernel

Mike Fedyk wrote:
>
> Doesn't degraded mode imply that there are not any parity
> disk(raid4)/stripe(raid5) updates?
> 
Degraded mode means one of the (redundant) disks has
failed and isn't used.  In raid-4 that might be
the parity disk - and then you get the no-parity
case.

But it might be a data disk that failed instead;
the missing data will be calculated from
parity when needed, and of course parity is
modified on writes, so information
can still be stored even though some storage is missing.

Raid-5 spreads the parity over all the disks
for performance, so whether a missing disk
translates to a missing parity stripe or a missing
data stripe depends on the exact
block number.
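
For illustration only (real raid5 supports several layouts, so the
exact formula varies):

    /* With a simple rotating layout over n disks, which disk holds
     * parity for a stripe is just a function of the stripe number;
     * raid-4 would return a constant instead. */
    static int parity_disk(unsigned long stripe_nr, int n_disks)
    {
            return stripe_nr % n_disks;
    }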

Helge Hafting

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-16 15:17 ` Stephen C. Tweedie
  2002-05-17 18:29   ` Mike Fedyk
@ 2002-05-26  8:41   ` Daniel Phillips
  2002-05-27 11:34     ` Stephen C. Tweedie
  1 sibling, 1 reply; 10+ messages in thread
From: Daniel Phillips @ 2002-05-26  8:41 UTC (permalink / raw)
  To: Stephen C. Tweedie, Neil Brown; +Cc: linux-kernel

On Thursday 16 May 2002 17:17, Stephen C. Tweedie wrote:
> Most applications are not all that bound by write latency.

But some are.  Transaction processing applications, where each transaction 
has to be safely on disk before it can be acknowledged, care about write 
latency a lot, since it translates more or less directly into throughput.

> They
> typically care more about read latency and/or write throughput, and
> any fancy games which try to minimise write latency at the expense of
> correctness feel wrong to me.

I doubt you'll have trouble convincing anyone that correctness is not 
negotiable.

-- 
Daniel

* Re: Thoughts on using fs/jbd from drivers/md
  2002-05-26  8:41   ` Daniel Phillips
@ 2002-05-27 11:34     ` Stephen C. Tweedie
  0 siblings, 0 replies; 10+ messages in thread
From: Stephen C. Tweedie @ 2002-05-27 11:34 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephen C. Tweedie, Neil Brown, linux-kernel

Hi,

On Sun, May 26, 2002 at 10:41:22AM +0200, Daniel Phillips wrote:
> On Thursday 16 May 2002 17:17, Stephen C. Tweedie wrote:
> > Most applications are not all that bound by write latency.
> 
> But some are.  Transaction processing applications, where each transaction 
> has to be safely on disk before it can be acknowledged, care about write 
> latency a lot, since it translates more or less directly into throughput.

Not really.  They care about throughput, and will happily sacrifice
latency for that.  The postmark stuff showed that very clearly --- by
yielding in transaction commit and allowing multiple transactions to
batch up, Andrew saw an instant improvement of about 3000% in postmark
figures, despite the fact that the yield is obviously only going to
increase the latency of each individual transaction.  Pretty much all
TP benchmarks focus on throughput, not latency.

So while latency is important, if we have to trade off against
throughput, that is normally the right tradeoff on synchronous write
traffic.  For reads, latency is obviously critical in nearly all
cases.

Cheers,
 Stephen

* Re: Thoughts on using fs/jbd from drivers/md
@ 2002-05-27 11:50 Neil Brown
  0 siblings, 0 replies; 10+ messages in thread
From: Neil Brown @ 2002-05-27 11:50 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Daniel Phillips, linux-kernel

On Monday May 27, sct@redhat.com wrote:
> Hi,
> 
> On Sun, May 26, 2002 at 10:41:22AM +0200, Daniel Phillips wrote:
> > On Thursday 16 May 2002 17:17, Stephen C. Tweedie wrote:
> > > Most applications are not all that bound by write latency.
> > 
> > But some are.  Transaction processing applications, where each transaction 
> > has to be safely on disk before it can be acknowledged, care about write 
> > latency a lot, since it translates more or less directly into throughput.
> 
> Not really.  They care about throughput, and will happily sacrifice
> latency for that.

And some aren't...  my main thrust for pursuing this idea was to
present minimal latency to the application.  That is why I want to use
NVRAM for the journal.
My particular application is an NFS server which traditionally suffers
badly if there is too much latency.
Certainly there are situations where a small sacrifice in latency can
improve throughput, but I want to maximise the throughput without any
cost in latency.  And I am willing to spend on the NVRAM to do it.

I see two very different approaches to journalling an MD device as
being significant.
One journals to NVRAM and tries to minimise latency, and works for any
RAID level.  It is basically a write-behind cache.

The other journals to a normal drive and only works for RAID5 (which
is the only level that really needs a journal other than for latency
reasons) and writes to the journal after the stripe parity
calculation and before the data+parity is sent to disc.

They will probably be very different implementations, though they will
hopefully have a very similar interface.

NeilBrown
