All of lore.kernel.org
 help / color / mirror / Atom feed
* Mis-Design of Btrfs?
@ 2011-06-23 10:53 Nico Schottelius
  2011-06-27  6:46 ` NeilBrown
  0 siblings, 1 reply; 39+ messages in thread
From: Nico Schottelius @ 2011-06-23 10:53 UTC (permalink / raw)
  To: LKML

Good morning devs,

I'm wondering whether the raid- and volume-management-builtin of btrfs is
actually a sane idea or not.
Currently we do have md/device-mapper support for raid
already, btrfs lacks raid5 support and re-implements stuff that
has already been done.

I'm aware of the fact that it is very useful to know on which devices
we are in a filesystem. But I'm wondering, whether it wouldn't be
smarter to generalise the information exposure through the VFS layer
instead of replicating functionality:

Physical:   USB-HD   SSD   USB-Flash          | Exposes information to
Raid:       Raid1, Raid5, Raid10, etc.        | higher levels
Crypto:     Luks                              |
LVM:        Groups/Volumes                    |
FS:         xfs/jfs/reiser/ext3               v

Thus a filesystem like ext3 could be aware that it is running
on a USB HD, enable -o sync be default or have the filesystem
to rewrite blocks when running on crypto or optimise for an SSD, ...

Cheers,

Nico

-- 
PGP key: 7ED9 F7D3 6B10 81D7 0EC5  5C09 D7DC C8E4 3187 7DF0

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-06-23 10:53 Mis-Design of Btrfs? Nico Schottelius
@ 2011-06-27  6:46 ` NeilBrown
  2011-06-29  9:29   ` Ric Wheeler
  0 siblings, 1 reply; 39+ messages in thread
From: NeilBrown @ 2011-06-27  6:46 UTC (permalink / raw)
  To: Nico Schottelius; +Cc: LKML

On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
<nico-lkml-20110623@schottelius.org> wrote:

> Good morning devs,
> 
> I'm wondering whether the raid- and volume-management-builtin of btrfs is
> actually a sane idea or not.
> Currently we do have md/device-mapper support for raid
> already, btrfs lacks raid5 support and re-implements stuff that
> has already been done.
> 
> I'm aware of the fact that it is very useful to know on which devices
> we are in a filesystem. But I'm wondering, whether it wouldn't be
> smarter to generalise the information exposure through the VFS layer
> instead of replicating functionality:
> 
> Physical:   USB-HD   SSD   USB-Flash          | Exposes information to
> Raid:       Raid1, Raid5, Raid10, etc.        | higher levels
> Crypto:     Luks                              |
> LVM:        Groups/Volumes                    |
> FS:         xfs/jfs/reiser/ext3               v
> 
> Thus a filesystem like ext3 could be aware that it is running
> on a USB HD, enable -o sync be default or have the filesystem
> to rewrite blocks when running on crypto or optimise for an SSD, ...

I would certainly agree that exposing information to higher levels is a good
idea.  To some extent we do.  But it isn't always as easy as it might sound.
Choosing exactly what information to expose is the challenge.  If you lack
sufficient foresight you might expose something which turns out to be
very specific to just one device, so all those upper levels which make use of
the information find they are really special-casing one specific device,
which isn't a good idea.


However it doesn't follow that RAID5 should not be implemented in BTRFS.
The levels that you have drawn are just one perspective.  While that has
value, it may not be universal.
I could easily argue that the LVM layer is a mistake and that filesystems
should provide that functionality directly.
I could almost argue the same for crypto.
RAID1 can make a lot of sense to be tightly integrated with the FS.
RAID5 ... I'm less convinced, but then I have a vested interest there so that
isn't an objective assessment.

Part of "the way Linux works" is that s/he who writes the code gets to make
the design decisions.   The BTRFS developers might create something truly
awesome, or might end up having to support a RAID feature that they
subsequently think is a bad idea.  But it really is their decision to make.

NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-06-27  6:46 ` NeilBrown
@ 2011-06-29  9:29   ` Ric Wheeler
  2011-06-29 10:47     ` A. James Lewis
  2011-07-14  5:56     ` NeilBrown
  0 siblings, 2 replies; 39+ messages in thread
From: Ric Wheeler @ 2011-06-29  9:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Nico Schottelius, LKML, Chris Mason, linux-btrfs, Alasdair G Kergon

On 06/27/2011 07:46 AM, NeilBrown wrote:
> On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
> <nico-lkml-20110623@schottelius.org>  wrote:
>
>> Good morning devs,
>>
>> I'm wondering whether the raid- and volume-management-builtin of btrfs is
>> actually a sane idea or not.
>> Currently we do have md/device-mapper support for raid
>> already, btrfs lacks raid5 support and re-implements stuff that
>> has already been done.
>>
>> I'm aware of the fact that it is very useful to know on which devices
>> we are in a filesystem. But I'm wondering, whether it wouldn't be
>> smarter to generalise the information exposure through the VFS layer
>> instead of replicating functionality:
>>
>> Physical:   USB-HD   SSD   USB-Flash          | Exposes information to
>> Raid:       Raid1, Raid5, Raid10, etc.        | higher levels
>> Crypto:     Luks                              |
>> LVM:        Groups/Volumes                    |
>> FS:         xfs/jfs/reiser/ext3               v
>>
>> Thus a filesystem like ext3 could be aware that it is running
>> on a USB HD, enable -o sync be default or have the filesystem
>> to rewrite blocks when running on crypto or optimise for an SSD, ...
> I would certainly agree that exposing information to higher levels is a good
> idea.  To some extent we do.  But it isn't always as easy as it might sound.
> Choosing exactly what information to expose is the challenge.  If you lack
> sufficient foresight you might expose something which turns out to be
> very specific to just one device, so all those upper levels which make use of
> the information find they are really special-casing one specific device,
> which isn't a good idea.
>
>
> However it doesn't follow that RAID5 should not be implemented in BTRFS.
> The levels that you have drawn are just one perspective.  While that has
> value, it may not be universal.
> I could easily argue that the LVM layer is a mistake and that filesystems
> should provide that functionality directly.
> I could almost argue the same for crypto.
> RAID1 can make a lot of sense to be tightly integrated with the FS.
> RAID5 ... I'm less convinced, but then I have a vested interest there so that
> isn't an objective assessment.
>
> Part of "the way Linux works" is that s/he who writes the code gets to make
> the design decisions.   The BTRFS developers might create something truly
> awesome, or might end up having to support a RAID feature that they
> subsequently think is a bad idea.  But it really is their decision to make.
>
> NeilBrown
>

One more thing to add here is that I think that we still have a chance to 
increase the sharing between btrfs and the MD stack if we can get those changes 
made. No one likes to duplicate code, but we will need a richer interface 
between the block and file system layer to help close that gap.

Ric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-06-29  9:29   ` Ric Wheeler
@ 2011-06-29 10:47     ` A. James Lewis
  2011-07-14 20:47       ` Erik Jensen
  2011-07-14  5:56     ` NeilBrown
  1 sibling, 1 reply; 39+ messages in thread
From: A. James Lewis @ 2011-06-29 10:47 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 2011-06-29 at 10:29 +0100, Ric Wheeler wrote:
> On 06/27/2011 07:46 AM, NeilBrown wrote:
> > On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
> > <nico-lkml-20110623@schottelius.org>  wrote:
> >
> >> Good morning devs,
> >>
> >> I'm wondering whether the raid- and volume-management-builtin of btrfs is
> >> actually a sane idea or not.
> >> Currently we do have md/device-mapper support for raid
> >> already, btrfs lacks raid5 support and re-implements stuff that
> >> has already been done.
> >>
> >> I'm aware of the fact that it is very useful to know on which devices
> >> we are in a filesystem. But I'm wondering, whether it wouldn't be
> >> smarter to generalise the information exposure through the VFS layer
> >> instead of replicating functionality:
> >>
> >> Physical:   USB-HD   SSD   USB-Flash          | Exposes information to
> >> Raid:       Raid1, Raid5, Raid10, etc.        | higher levels
> >> Crypto:     Luks                              |
> >> LVM:        Groups/Volumes                    |
> >> FS:         xfs/jfs/reiser/ext3               v
> >>
> >> Thus a filesystem like ext3 could be aware that it is running
> >> on a USB HD, enable -o sync be default or have the filesystem
> >> to rewrite blocks when running on crypto or optimise for an SSD, ...
> > I would certainly agree that exposing information to higher levels is a good
> > idea.  To some extent we do.  But it isn't always as easy as it might sound.
> > Choosing exactly what information to expose is the challenge.  If you lack
> > sufficient foresight you might expose something which turns out to be
> > very specific to just one device, so all those upper levels which make use of
> > the information find they are really special-casing one specific device,
> > which isn't a good idea.
> >
> >
> > However it doesn't follow that RAID5 should not be implemented in BTRFS.
> > The levels that you have drawn are just one perspective.  While that has
> > value, it may not be universal.
> > I could easily argue that the LVM layer is a mistake and that filesystems
> > should provide that functionality directly.
> > I could almost argue the same for crypto.
> > RAID1 can make a lot of sense to be tightly integrated with the FS.
> > RAID5 ... I'm less convinced, but then I have a vested interest there so that
> > isn't an objective assessment.
> >
> > Part of "the way Linux works" is that s/he who writes the code gets to make
> > the design decisions.   The BTRFS developers might create something truly
> > awesome, or might end up having to support a RAID feature that they
> > subsequently think is a bad idea.  But it really is their decision to make.
> >
> > NeilBrown
> >
> 
> One more thing to add here is that I think that we still have a chance to 
> increase the sharing between btrfs and the MD stack if we can get those changes 
> made. No one likes to duplicate code, but we will need a richer interface 
> between the block and file system layer to help close that gap.
> 
> Ric
> 
Is there a possibility that one could have a 3 disk RAID5 array, and
then add a 4th disk and then do a balance, growing the RAID5 onto 4
disks and gaining the space still with RAID5?  It seems that to be
consistent, BTRFS would have to do this.

If this is the case, then I think that the BTRFS implementation of RAID5
would have to be quite different to the MD implementation.

James.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-06-29  9:29   ` Ric Wheeler
  2011-06-29 10:47     ` A. James Lewis
@ 2011-07-14  5:56     ` NeilBrown
  2011-07-14  6:02       ` Ric Wheeler
  1 sibling, 1 reply; 39+ messages in thread
From: NeilBrown @ 2011-07-14  5:56 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Nico Schottelius, LKML, Chris Mason, linux-btrfs, Alasdair G Kergon

On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler <rwheeler@redhat.com> wrote:

> On 06/27/2011 07:46 AM, NeilBrown wrote:
> > On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
> > <nico-lkml-20110623@schottelius.org>  wrote:
> >
> >> Good morning devs,
> >>
> >> I'm wondering whether the raid- and volume-management-builtin of btrfs is
> >> actually a sane idea or not.
> >> Currently we do have md/device-mapper support for raid
> >> already, btrfs lacks raid5 support and re-implements stuff that
> >> has already been done.
> >>
> >> I'm aware of the fact that it is very useful to know on which devices
> >> we are in a filesystem. But I'm wondering, whether it wouldn't be
> >> smarter to generalise the information exposure through the VFS layer
> >> instead of replicating functionality:
> >>
> >> Physical:   USB-HD   SSD   USB-Flash          | Exposes information to
> >> Raid:       Raid1, Raid5, Raid10, etc.        | higher levels
> >> Crypto:     Luks                              |
> >> LVM:        Groups/Volumes                    |
> >> FS:         xfs/jfs/reiser/ext3               v
> >>
> >> Thus a filesystem like ext3 could be aware that it is running
> >> on a USB HD, enable -o sync be default or have the filesystem
> >> to rewrite blocks when running on crypto or optimise for an SSD, ...
> > I would certainly agree that exposing information to higher levels is a good
> > idea.  To some extent we do.  But it isn't always as easy as it might sound.
> > Choosing exactly what information to expose is the challenge.  If you lack
> > sufficient foresight you might expose something which turns out to be
> > very specific to just one device, so all those upper levels which make use of
> > the information find they are really special-casing one specific device,
> > which isn't a good idea.
> >
> >
> > However it doesn't follow that RAID5 should not be implemented in BTRFS.
> > The levels that you have drawn are just one perspective.  While that has
> > value, it may not be universal.
> > I could easily argue that the LVM layer is a mistake and that filesystems
> > should provide that functionality directly.
> > I could almost argue the same for crypto.
> > RAID1 can make a lot of sense to be tightly integrated with the FS.
> > RAID5 ... I'm less convinced, but then I have a vested interest there so that
> > isn't an objective assessment.
> >
> > Part of "the way Linux works" is that s/he who writes the code gets to make
> > the design decisions.   The BTRFS developers might create something truly
> > awesome, or might end up having to support a RAID feature that they
> > subsequently think is a bad idea.  But it really is their decision to make.
> >
> > NeilBrown
> >
> 
> One more thing to add here is that I think that we still have a chance to 
> increase the sharing between btrfs and the MD stack if we can get those changes 
> made. No one likes to duplicate code, but we will need a richer interface 
> between the block and file system layer to help close that gap.
> 
> Ric
> 

I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  5:56     ` NeilBrown
@ 2011-07-14  6:02       ` Ric Wheeler
  2011-07-14  6:38         ` NeilBrown
  2011-07-14  6:59         ` Arne Jansen
  0 siblings, 2 replies; 39+ messages in thread
From: Ric Wheeler @ 2011-07-14  6:02 UTC (permalink / raw)
  To: NeilBrown
  Cc: Nico Schottelius, LKML, Chris Mason, linux-btrfs, Alasdair G Kergon

On 07/14/2011 06:56 AM, NeilBrown wrote:
> On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
>
>> On 06/27/2011 07:46 AM, NeilBrown wrote:
>>> On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
>>> <nico-lkml-20110623@schottelius.org>   wrote:
>>>
>>>> Good morning devs,
>>>>
>>>> I'm wondering whether the raid- and volume-management-builtin of btrfs is
>>>> actually a sane idea or not.
>>>> Currently we do have md/device-mapper support for raid
>>>> already, btrfs lacks raid5 support and re-implements stuff that
>>>> has already been done.
>>>>
>>>> I'm aware of the fact that it is very useful to know on which devices
>>>> we are in a filesystem. But I'm wondering, whether it wouldn't be
>>>> smarter to generalise the information exposure through the VFS layer
>>>> instead of replicating functionality:
>>>>
>>>> Physical:   USB-HD   SSD   USB-Flash          | Exposes information to
>>>> Raid:       Raid1, Raid5, Raid10, etc.        | higher levels
>>>> Crypto:     Luks                              |
>>>> LVM:        Groups/Volumes                    |
>>>> FS:         xfs/jfs/reiser/ext3               v
>>>>
>>>> Thus a filesystem like ext3 could be aware that it is running
>>>> on a USB HD, enable -o sync be default or have the filesystem
>>>> to rewrite blocks when running on crypto or optimise for an SSD, ...
>>> I would certainly agree that exposing information to higher levels is a good
>>> idea.  To some extent we do.  But it isn't always as easy as it might sound.
>>> Choosing exactly what information to expose is the challenge.  If you lack
>>> sufficient foresight you might expose something which turns out to be
>>> very specific to just one device, so all those upper levels which make use of
>>> the information find they are really special-casing one specific device,
>>> which isn't a good idea.
>>>
>>>
>>> However it doesn't follow that RAID5 should not be implemented in BTRFS.
>>> The levels that you have drawn are just one perspective.  While that has
>>> value, it may not be universal.
>>> I could easily argue that the LVM layer is a mistake and that filesystems
>>> should provide that functionality directly.
>>> I could almost argue the same for crypto.
>>> RAID1 can make a lot of sense to be tightly integrated with the FS.
>>> RAID5 ... I'm less convinced, but then I have a vested interest there so that
>>> isn't an objective assessment.
>>>
>>> Part of "the way Linux works" is that s/he who writes the code gets to make
>>> the design decisions.   The BTRFS developers might create something truly
>>> awesome, or might end up having to support a RAID feature that they
>>> subsequently think is a bad idea.  But it really is their decision to make.
>>>
>>> NeilBrown
>>>
>> One more thing to add here is that I think that we still have a chance to
>> increase the sharing between btrfs and the MD stack if we can get those changes
>> made. No one likes to duplicate code, but we will need a richer interface
>> between the block and file system layer to help close that gap.
>>
>> Ric
>>
> I'm certainly open to suggestions and collaboration.  Do you have in mind any
> particular way to make the interface richer??
>
> NeilBrown

Hi Neil,

I know that Chris has a very specific set of use cases for btrfs and think that 
Alasdair and others have started to look at what is doable.

The obvious use case is the following:

If a file system uses checksumming or other data corruption detection bits, it 
can detect that it got bad data on a write. If that data was protected by RAID, 
it would like to ask the block layer to try to read from another mirror (for 
raid1) or try to validate/rebuild from parity.

Today, I think that a retry will basically just give us back a random chance of 
getting data from a different mirror or the same one that we got data from on 
the first go.

Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:02       ` Ric Wheeler
@ 2011-07-14  6:38         ` NeilBrown
  2011-07-14  6:57           ` Ric Wheeler
                             ` (4 more replies)
  2011-07-14  6:59         ` Arne Jansen
  1 sibling, 5 replies; 39+ messages in thread
From: NeilBrown @ 2011-07-14  6:38 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Nico Schottelius, LKML, Chris Mason, linux-btrfs, Alasdair G Kergon

On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler <rwheeler@redhat.com> wrote:

> > I'm certainly open to suggestions and collaboration.  Do you have in mind any
> > particular way to make the interface richer??
> >
> > NeilBrown
> 
> Hi Neil,
> 
> I know that Chris has a very specific set of use cases for btrfs and think that 
> Alasdair and others have started to look at what is doable.
> 
> The obvious use case is the following:
> 
> If a file system uses checksumming or other data corruption detection bits, it 
> can detect that it got bad data on a write. If that data was protected by RAID, 
> it would like to ask the block layer to try to read from another mirror (for 
> raid1) or try to validate/rebuild from parity.
> 
> Today, I think that a retry will basically just give us back a random chance of 
> getting data from a different mirror or the same one that we got data from on 
> the first go.
> 
> Chris, Alasdair, was that a good summary of one concern?
> 
> Thanks!
> 
> Ric

I imagine a new field in 'struct bio' which was normally zero but could be
some small integer.  It is only meaningful for read.
When 0 it means "get this data way you like".
When non-zero it means "get this data using method N", where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means "use other data blocks and parity
blocks to reconstruct data.

The default for non RAID devices is to return EINVAL for any N > 0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means "I cannot work with that number, but try the next one".

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...

NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:38         ` NeilBrown
@ 2011-07-14  6:57           ` Ric Wheeler
  2011-07-15  2:32             ` Chris Mason
  2011-07-14  9:37           ` Jan Schmidt
                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Ric Wheeler @ 2011-07-14  6:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Nico Schottelius, LKML, Chris Mason, linux-btrfs, Alasdair G Kergon

On 07/14/2011 07:38 AM, NeilBrown wrote:
> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
>
>>> I'm certainly open to suggestions and collaboration.  Do you have in mind any
>>> particular way to make the interface richer??
>>>
>>> NeilBrown
>> Hi Neil,
>>
>> I know that Chris has a very specific set of use cases for btrfs and think that
>> Alasdair and others have started to look at what is doable.
>>
>> The obvious use case is the following:
>>
>> If a file system uses checksumming or other data corruption detection bits, it
>> can detect that it got bad data on a write. If that data was protected by RAID,
>> it would like to ask the block layer to try to read from another mirror (for
>> raid1) or try to validate/rebuild from parity.
>>
>> Today, I think that a retry will basically just give us back a random chance of
>> getting data from a different mirror or the same one that we got data from on
>> the first go.
>>
>> Chris, Alasdair, was that a good summary of one concern?
>>
>> Thanks!
>>
>> Ric
> I imagine a new field in 'struct bio' which was normally zero but could be
> some small integer.  It is only meaningful for read.
> When 0 it means "get this data way you like".
> When non-zero it means "get this data using method N", where the different
> methods are up to the device.
>
> For a mirrored RAID, method N means read from device N-1.
> For stripe/parity RAID, method 1 means "use other data blocks and parity
> blocks to reconstruct data.
>
> The default for non RAID devices is to return EINVAL for any N>  0.
> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> some thought.
>
> So if btrfs reads a block and the checksum looks wrong, it reads again with
> a larger N.  It continues incrementing N and retrying until it gets a block
> that it likes or it gets EINVAL.  There should probably be an error code
> (EAGAIN?) which means "I cannot work with that number, but try the next one".
>
> It would be trivial for me to implement this for RAID1 and RAID10, and
> relatively easy for RAID5.
> I'd need to give a bit of thought to RAID6 as there are possibly multiple
> ways to reconstruct from different combinations of parity and data.  I'm not
> sure if there would be much point in doing that though.
>
> It might make sense for a device to be able to report what the maximum
> 'N' supported is... that might make stacked raid easier to manage...
>
> NeilBrown
>

I think that the above makes sense. Not sure what your "0" definition is, but I 
would assume that for non-raided devices (i.e., a single s-ata disk), "0" would 
be an indication that there is nothing more that can be tried. The data you got 
is all you are ever going to get :)

For multiple mirrors, you might want to have a scheme where you would be able to 
cycle through the mirrors. You could retry, cycling through the various mirrors 
until you have tried and returned them all at which point you would get a "no 
more" error back or some such thing.

Thanks!

ric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:02       ` Ric Wheeler
  2011-07-14  6:38         ` NeilBrown
@ 2011-07-14  6:59         ` Arne Jansen
  1 sibling, 0 replies; 39+ messages in thread
From: Arne Jansen @ 2011-07-14  6:59 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: NeilBrown, Nico Schottelius, LKML, Chris Mason, linux-btrfs,
	Alasdair G Kergon

On 14.07.2011 08:02, Ric Wheeler wrote:
> On 07/14/2011 06:56 AM, NeilBrown wrote:
>> I'm certainly open to suggestions and collaboration. Do you have in
>> mind any
>> particular way to make the interface richer??
>
> If a file system uses checksumming or other data corruption detection
> bits, it can detect that it got bad data on a write. If that data was
> protected by RAID, it would like to ask the block layer to try to read
> from another mirror (for raid1) or try to validate/rebuild from parity.
>
> Today, I think that a retry will basically just give us back a random
> chance of getting data from a different mirror or the same one that we
> got data from on the first go.

Another case that comes to mind is the 'remove device' operation.
To accomplish this, btrfs just rewrites all the data that reside
on that device to other drives.
Also, scrub and my recently posted readahead patches make heavy
use of the knowledge of how the raid is laid out. Readahead always
uses as many drives as possible in parallel, while trying to
avoid unnecessary seeks on each device.

-Arne

>
> Chris, Alasdair, was that a good summary of one concern?
>
> Thanks!
>
> Ric
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:38         ` NeilBrown
  2011-07-14  6:57           ` Ric Wheeler
@ 2011-07-14  9:37           ` Jan Schmidt
  2011-07-14  9:55             ` NeilBrown
  2011-07-14 16:27           ` Goffredo Baroncelli
                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Jan Schmidt @ 2011-07-14  9:37 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ric Wheeler, Nico Schottelius, LKML, Chris Mason, linux-btrfs,
	Alasdair G Kergon

Hi Neil,

On 14.07.2011 08:38, NeilBrown wrote:
> I imagine a new field in 'struct bio' which was normally zero but could be
> some small integer.  It is only meaningful for read.
> When 0 it means "get this data way you like".
> When non-zero it means "get this data using method N", where the different
> methods are up to the device.
> 
> For a mirrored RAID, method N means read from device N-1.
> For stripe/parity RAID, method 1 means "use other data blocks and parity
> blocks to reconstruct data.
> 
> The default for non RAID devices is to return EINVAL for any N > 0.
> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> some thought.
> 
> So if btrfs reads a block and the checksum looks wrong, it reads again with
> a larger N.  It continues incrementing N and retrying until it gets a block
> that it likes or it gets EINVAL.  There should probably be an error code
> (EAGAIN?) which means "I cannot work with that number, but try the next one".

I like this idea. It comes pretty close to what btrfs is currently doing
(what was the reason for this thread being started, wasn't it?), only
not within struct bio.

The way you describe the extra parameter is input only. I think it would
be a nice add on if we knew which "N" was used when "0" passed for the
request. At least for the RAID1 case, I'd like to see that when I submit
a bio with method (or whatever we call it) "0", it comes back with the
method field set to the value that reflects the method the device
actually used. Obviously, when passing non-zero values, the bio would
have to come back with that value unmodified.

That way, we'll get more control on the retry algorithms and are free to
decide not to use the failed method again, if we like.

Setting "method" on the return path might be valuable not only for
RAID1, but I haven't thought that trough.

-Jan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  9:37           ` Jan Schmidt
@ 2011-07-14  9:55             ` NeilBrown
  0 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2011-07-14  9:55 UTC (permalink / raw)
  To: Jan Schmidt
  Cc: Ric Wheeler, Nico Schottelius, LKML, Chris Mason, linux-btrfs,
	Alasdair G Kergon

On Thu, 14 Jul 2011 11:37:41 +0200 Jan Schmidt <list.btrfs@jan-o-sch.net>
wrote:

> Hi Neil,
> 
> On 14.07.2011 08:38, NeilBrown wrote:
> > I imagine a new field in 'struct bio' which was normally zero but could be
> > some small integer.  It is only meaningful for read.
> > When 0 it means "get this data way you like".
> > When non-zero it means "get this data using method N", where the different
> > methods are up to the device.
> > 
> > For a mirrored RAID, method N means read from device N-1.
> > For stripe/parity RAID, method 1 means "use other data blocks and parity
> > blocks to reconstruct data.
> > 
> > The default for non RAID devices is to return EINVAL for any N > 0.
> > A remapping device (dm-linear, dm-stripe etc) would just pass the number
> > down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> > some thought.
> > 
> > So if btrfs reads a block and the checksum looks wrong, it reads again with
> > a larger N.  It continues incrementing N and retrying until it gets a block
> > that it likes or it gets EINVAL.  There should probably be an error code
> > (EAGAIN?) which means "I cannot work with that number, but try the next one".
> 
> I like this idea. It comes pretty close to what btrfs is currently doing
> (what was the reason for this thread being started, wasn't it?), only
> not within struct bio.
> 
> The way you describe the extra parameter is input only. I think it would
> be a nice add on if we knew which "N" was used when "0" passed for the
> request. At least for the RAID1 case, I'd like to see that when I submit
> a bio with method (or whatever we call it) "0", it comes back with the
> method field set to the value that reflects the method the device
> actually used. Obviously, when passing non-zero values, the bio would
> have to come back with that value unmodified.
> 
> That way, we'll get more control on the retry algorithms and are free to
> decide not to use the failed method again, if we like.
> 
> Setting "method" on the return path might be valuable not only for
> RAID1, but I haven't thought that trough.
> 
> -Jan

That sounds like it would be reasonable...

It would be important not to read too much into the number though.  i.e.
don't think of it as "RAID1" but keep a much more abstract idea, else you
could run into trouble.

A (near) future addition to md is keeping track of 'bad blocks' so we can
fail more gracefully as devices start to fail.
So a read request to a RAID1 might not be served by just one device, but
might be served by one device for some parts, and another device for other
parts, so as to avoid one or more 'bad blocks'.

I think I could still provide a reasonably consistent mapping from 'arbitrary
number' to 'set of devices', but it may not be what you expect.  And the
number '1' may well correspond to different devices for different bi_sector
offsets.

i.e. the abstraction must allow the low level implementation substantial
freedom to choosing how to implement each request.

But yes - I don't see why we couldn't report which strategy was used with the
implication that using that same strategy at the same offset with the same
size would be expected to produce the same result.

NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:38         ` NeilBrown
  2011-07-14  6:57           ` Ric Wheeler
  2011-07-14  9:37           ` Jan Schmidt
@ 2011-07-14 16:27           ` Goffredo Baroncelli
  2011-07-14 16:55           ` Alasdair G Kergon
  2011-07-14 16:55           ` Alasdair G Kergon
  4 siblings, 0 replies; 39+ messages in thread
From: Goffredo Baroncelli @ 2011-07-14 16:27 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ric Wheeler, Nico Schottelius, LKML, Chris Mason, linux-btrfs,
	Alasdair G Kergon

On 07/14/2011 08:38 AM, NeilBrown wrote:
> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler <rwheeler@redhat.com> wrote:
> 
>>> I'm certainly open to suggestions and collaboration.  Do you have in mind any
>>> particular way to make the interface richer??
>>>
>>> NeilBrown
>>
>> Hi Neil,
>>
>> I know that Chris has a very specific set of use cases for btrfs and think that 
>> Alasdair and others have started to look at what is doable.
>>
>> The obvious use case is the following:
>>
>> If a file system uses checksumming or other data corruption detection bits, it 
>> can detect that it got bad data on a write. If that data was protected by RAID, 
>> it would like to ask the block layer to try to read from another mirror (for 
>> raid1) or try to validate/rebuild from parity.
>>
>> Today, I think that a retry will basically just give us back a random chance of 
>> getting data from a different mirror or the same one that we got data from on 
>> the first go.
>>
>> Chris, Alasdair, was that a good summary of one concern?
>>
>> Thanks!
>>
>> Ric
> 
> I imagine a new field in 'struct bio' which was normally zero but could be
> some small integer.  It is only meaningful for read.
> When 0 it means "get this data way you like".
> When non-zero it means "get this data using method N", where the different
> methods are up to the device.

In more general terms, the filesystem should be able to require: try
another read different regarding the previous ones. The term are
important because we should differentiate the case of "wrong data from
disk1, read from disk0" and "wrong data from disk0 read disk1". I prefer
thinking the field as bitmap. Every bit represent a different way of
read. So it is possible to reuse to track which "kind of read" was
already used.

After a 2nd read, the block layer should:
	a) redo the read if possible, otherwise FAIL
	b) pass the data to the filesystem
	c) if the filesystem accepts the new data, replace the wrong
	   data with the correct one or mark the block as broken.
	d) inform the userspace/filesystem of the result

> 
> For a mirrored RAID, method N means read from device N-1.
> For stripe/parity RAID, method 1 means "use other data blocks and parity
> blocks to reconstruct data.
> 
> The default for non RAID devices is to return EINVAL for any N > 0.
> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> some thought.
> 
> So if btrfs reads a block and the checksum looks wrong, it reads again with
> a larger N.  It continues incrementing N and retrying until it gets a block
> that it likes or it gets EINVAL.  There should probably be an error code
> (EAGAIN?) which means "I cannot work with that number, but try the next one".
> 
> It would be trivial for me to implement this for RAID1 and RAID10, and
> relatively easy for RAID5.
> I'd need to give a bit of thought to RAID6 as there are possibly multiple
> ways to reconstruct from different combinations of parity and data.  I'm not
> sure if there would be much point in doing that though.
> 
> It might make sense for a device to be able to report what the maximum
> 'N' supported is... that might make stacked raid easier to manage...
> 
> NeilBrown
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> .
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:38         ` NeilBrown
                             ` (3 preceding siblings ...)
  2011-07-14 16:55           ` Alasdair G Kergon
@ 2011-07-14 16:55           ` Alasdair G Kergon
  4 siblings, 0 replies; 39+ messages in thread
From: Alasdair G Kergon @ 2011-07-14 16:55 UTC (permalink / raw)
  To: NeilBrown, Ric Wheeler, Nico Schottelius, LKML, Chris Mason

On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
> It might make sense for a device to be able to report what the maximum
> 'N' supported is... that might make stacked raid easier to manage...
 
I'll just say that any solution ought to be stackable.
This means understanding both that the number of data access routes may
vary as you move through the stack, and that this number may depend on
the offset within the device.

Alasdair


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:38         ` NeilBrown
                             ` (2 preceding siblings ...)
  2011-07-14 16:27           ` Goffredo Baroncelli
@ 2011-07-14 16:55           ` Alasdair G Kergon
  2011-07-14 19:50             ` John Stoffel
  2011-07-14 16:55           ` Alasdair G Kergon
  4 siblings, 1 reply; 39+ messages in thread
From: Alasdair G Kergon @ 2011-07-14 16:55 UTC (permalink / raw)
  To: NeilBrown, Ric Wheeler, Nico Schottelius, LKML, Chris Mason,
	linux-btrfs, Alasdair G Kergon

On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
> It might make sense for a device to be able to report what the maximum
> 'N' supported is... that might make stacked raid easier to manage...
 
I'll just say that any solution ought to be stackable.
This means understanding both that the number of data access routes may
vary as you move through the stack, and that this number may depend on
the offset within the device.

Alasdair


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14 16:55           ` Alasdair G Kergon
@ 2011-07-14 19:50             ` John Stoffel
  2011-07-14 20:48               ` david
  2011-07-14 20:50               ` Erik Jensen
  0 siblings, 2 replies; 39+ messages in thread
From: John Stoffel @ 2011-07-14 19:50 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: NeilBrown, Ric Wheeler, Nico Schottelius, LKML, Chris Mason, linux-btrfs

>>>>> "Alasdair" == Alasdair G Kergon <agk@redhat.com> writes:

Alasdair> On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
>> It might make sense for a device to be able to report what the maximum
>> 'N' supported is... that might make stacked raid easier to manage...
 
Alasdair> I'll just say that any solution ought to be stackable.

I've been mulling this over too and wondering how you'd handle this,
because upper layers really can't peak down into lower layers easily.
As far as I understand things.

So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks

How does btrfs handle errors (or does it even see them!) from the
raid6 level when a single nbd device goes away?  Or taking the
original example, when btrfs notices a checksum isn't correct, how
would it push down multiple levels to try and find the correct data? 

Alasdair> This means understanding both that the number of data access
Alasdair> routes may vary as you move through the stack, and that this
Alasdair> number may depend on the offset within the device.

It almost seems to me that the retry needs to be done at each level on
it's own, without pushing down or up the stack.  But this doesn't
handle the wrong file checksum issue.  

Hmm... maybe instead of just one number, we need another to count the
levels down you go (or just split 16bit integer in half, bottom half
being count of tries, the upper half being levels down to try that
read?)

It seems to defeat the purpose of layers if you can go down and find
out how many layers there are underneath you....

John

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-06-29 10:47     ` A. James Lewis
@ 2011-07-14 20:47       ` Erik Jensen
  0 siblings, 0 replies; 39+ messages in thread
From: Erik Jensen @ 2011-07-14 20:47 UTC (permalink / raw)
  Cc: linux-btrfs

 On Wed, Jun 29, 2011 at 3:47 AM, A. James Lewis <james@fsck.co.uk> wro=
te:
> Is there a possibility that one could have a 3 disk RAID5 array, and
> then add a 4th disk and then do a balance, growing the RAID5 onto 4
> disks and gaining the space still with RAID5?  It seems that to be
> consistent, BTRFS would have to do this.
>
> If this is the case, then I think that the BTRFS implementation of RA=
ID5
> would have to be quite different to the MD implementation.
>
> James.

 My understanding, gleaned from IRC several months ago, is that Btrfs
 would use the new drive, but not change the stripe size. Each
 allocation would then be written across some selection of three of the
 four drives.

 In other words, if you started with four stripes across three drives:
   AAA
   BBB
   CCC
   DDD
 and then added a drive and balanced, you might get something like:
   AAAB
   BBCC
   CDDD
 which would give you more space, but still use =E2=85=93 of the space =
for parity.

 Trying to remove a drive from the original three-drive configuration
 would be an error, similarly to trying to remove the second to last
 drive in RAID 1, currently.

 Actually changing the stripe size would be done using the same
 mechanism as changing RAID levels.

 Again, this is just an interested but uninvolved person's
 understanding based on an IRC conversation, so please salt to taste.

 -- Erik
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14 19:50             ` John Stoffel
@ 2011-07-14 20:48               ` david
  2011-07-14 20:50               ` Erik Jensen
  1 sibling, 0 replies; 39+ messages in thread
From: david @ 2011-07-14 20:48 UTC (permalink / raw)
  To: John Stoffel
  Cc: Alasdair G Kergon, NeilBrown, Ric Wheeler, Nico Schottelius,
	LKML, Chris Mason, linux-btrfs

On Thu, 14 Jul 2011, John Stoffel wrote:

> Alasdair> I'll just say that any solution ought to be stackable.
>
> I've been mulling this over too and wondering how you'd handle this,
> because upper layers really can't peak down into lower layers easily.
> As far as I understand things.
>
> So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks
>
> How does btrfs handle errors (or does it even see them!) from the
> raid6 level when a single nbd device goes away?  Or taking the
> original example, when btrfs notices a checksum isn't correct, how
> would it push down multiple levels to try and find the correct data?
>
> Alasdair> This means understanding both that the number of data access
> Alasdair> routes may vary as you move through the stack, and that this
> Alasdair> number may depend on the offset within the device.
>
> It almost seems to me that the retry needs to be done at each level on
> it's own, without pushing down or up the stack.  But this doesn't
> handle the wrong file checksum issue.
>
> Hmm... maybe instead of just one number, we need another to count the
> levels down you go (or just split 16bit integer in half, bottom half
> being count of tries, the upper half being levels down to try that
> read?)
>
> It seems to defeat the purpose of layers if you can go down and find
> out how many layers there are underneath you....

this is why just an arbatrary 'method number' rather than a bitmap or 
something like that should be used.

using your example:

> So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks

raid1 has at least 2 values, raid 6 has at least 2 values, the combination 
of the two stacked should have at least 4 values.

if each layer can query the layer below it to find out how many methods it 
supports, it can then combine each of the methods it supports with each of 
the methods supported by the layer below it.

this will stack to an arbatrary number of layers, only limited by how 
large the value is allowed to be limiting the combinational permutations 
of all the layers options.

David Lang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14 19:50             ` John Stoffel
  2011-07-14 20:48               ` david
@ 2011-07-14 20:50               ` Erik Jensen
  1 sibling, 0 replies; 39+ messages in thread
From: Erik Jensen @ 2011-07-14 20:50 UTC (permalink / raw)
  To: John Stoffel
  Cc: Alasdair G Kergon, NeilBrown, Ric Wheeler, Nico Schottelius,
	LKML, Chris Mason, linux-btrfs

 On Thu, Jul 14, 2011 at 12:50 PM, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Alasdair" == Alasdair G Kergon <agk@redhat.com> writes:
>
> Alasdair> On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:
>>> It might make sense for a device to be able to report what the maximum
>>> 'N' supported is... that might make stacked raid easier to manage...
>
> Alasdair> I'll just say that any solution ought to be stackable.
>
> I've been mulling this over too and wondering how you'd handle this,
> because upper layers really can't peak down into lower layers easily.
> As far as I understand things.
>
> So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks
>
> How does btrfs handle errors (or does it even see them!) from the
> raid6 level when a single nbd device goes away?  Or taking the
> original example, when btrfs notices a checksum isn't correct, how
> would it push down multiple levels to try and find the correct data?
>
> Alasdair> This means understanding both that the number of data access
> Alasdair> routes may vary as you move through the stack, and that this
> Alasdair> number may depend on the offset within the device.
>
> It almost seems to me that the retry needs to be done at each level on
> it's own, without pushing down or up the stack.  But this doesn't
> handle the wrong file checksum issue.
>
> Hmm... maybe instead of just one number, we need another to count the
> levels down you go (or just split 16bit integer in half, bottom half
> being count of tries, the upper half being levels down to try that
> read?)
>
> It seems to defeat the purpose of layers if you can go down and find
> out how many layers there are underneath you....
>
> John

A random thought: What if we allow the number to wrap at each level,
and, each time it wraps, increment the number passed to the next lower
level.

A zero would propagate down, letting each level do what it wants:
luks: 0
raid1: 0
raid6: 0
nbd: 0

And higher numbers would indicate the method at each level:

For a 1:
luks: 1
raid1: 1
raid6: 1
nbd: 1

For a 3:
luks: 1 (only one possibility, passes three down)
raid1: 1 (two possibilities, so we wrap back to one and pass two down,
since we wrapped once)
raid6: 2 (not wrapped)
nbd: 1

When the bottom-most level gets an N that it can't handle, it would
return EINVAL, which would be propagated up the stack.

This would allow the same algorithm of incrementing N until we receive
good data or EINVAL, and would exhaust all ways of reading the data at
all levels.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-14  6:57           ` Ric Wheeler
@ 2011-07-15  2:32             ` Chris Mason
  2011-07-15  4:58               ` david
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2011-07-15  2:32 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: NeilBrown, Nico Schottelius, LKML, linux-btrfs, Alasdair G Kergon

Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
> On 07/14/2011 07:38 AM, NeilBrown wrote:
> > On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
> >
> >>> I'm certainly open to suggestions and collaboration.  Do you have in mind any
> >>> particular way to make the interface richer??
> >>>
> >>> NeilBrown
> >> Hi Neil,
> >>
> >> I know that Chris has a very specific set of use cases for btrfs and think that
> >> Alasdair and others have started to look at what is doable.
> >>
> >> The obvious use case is the following:
> >>
> >> If a file system uses checksumming or other data corruption detection bits, it
> >> can detect that it got bad data on a write. If that data was protected by RAID,
> >> it would like to ask the block layer to try to read from another mirror (for
> >> raid1) or try to validate/rebuild from parity.
> >>
> >> Today, I think that a retry will basically just give us back a random chance of
> >> getting data from a different mirror or the same one that we got data from on
> >> the first go.
> >>
> >> Chris, Alasdair, was that a good summary of one concern?
> >>
> >> Thanks!
> >>
> >> Ric
> > I imagine a new field in 'struct bio' which was normally zero but could be
> > some small integer.  It is only meaningful for read.
> > When 0 it means "get this data way you like".
> > When non-zero it means "get this data using method N", where the different
> > methods are up to the device.
> >
> > For a mirrored RAID, method N means read from device N-1.
> > For stripe/parity RAID, method 1 means "use other data blocks and parity
> > blocks to reconstruct data.
> >
> > The default for non RAID devices is to return EINVAL for any N>  0.
> > A remapping device (dm-linear, dm-stripe etc) would just pass the number
> > down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> > some thought.
> >
> > So if btrfs reads a block and the checksum looks wrong, it reads again with
> > a larger N.  It continues incrementing N and retrying until it gets a block
> > that it likes or it gets EINVAL.  There should probably be an error code
> > (EAGAIN?) which means "I cannot work with that number, but try the next one".
> >
> > It would be trivial for me to implement this for RAID1 and RAID10, and
> > relatively easy for RAID5.
> > I'd need to give a bit of thought to RAID6 as there are possibly multiple
> > ways to reconstruct from different combinations of parity and data.  I'm not
> > sure if there would be much point in doing that though.
> >
> > It might make sense for a device to be able to report what the maximum
> > 'N' supported is... that might make stacked raid easier to manage...
> >
> > NeilBrown
> >
> 
> I think that the above makes sense. Not sure what your "0" definition is, but I 
> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would 
> be an indication that there is nothing more that can be tried. The data you got 
> is all you are ever going to get :)
> 
> For multiple mirrors, you might want to have a scheme where you would be able to 
> cycle through the mirrors. You could retry, cycling through the various mirrors 
> until you have tried and returned them all at which point you would get a "no 
> more" error back or some such thing.

Hi everyone,

The mirror number idea is basically what btrfs does today, and I think
it fits in with Neil's idea to have different policies for different
blocks.

Basically what btrfs does:

read_block(block_num, mirror = 0)
if (no io error and not csum error)
	horray()

num_mirrors = get_num_copies(block number)
for (i = 1; i < num_mirrors; i++) {
	read_block(block_num, mirror = i);
}

In a stacked configuration, the get_num_copies function can be smarter,
basically adding up all the copies of the lower levels and finding a way
to combine them.  We could just send down a fake bio that is responsible
for adding up the storage combinations into a bitmask or whatever works.

We could also just keep retrying until the lower layers reported no more
mirror were available.

In btrfs at least, we don't set the data blocks up to date until the crc
has passed, so replacing bogus blocks is easy.  Metadata is a little
more complex, but that's not really related to this topic.

mirror number 0 just means "no mirror preference/pick the fastest
mirror" to the btrfs block mapping code.

Btrfs has the concept of different raid levels for different logical
block numbers, so you get_num_copies might return one answer for block A
and a different answer for block B.

Either way, we could easily make use of a bio field here if it were
exported out.

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15  2:32             ` Chris Mason
@ 2011-07-15  4:58               ` david
  2011-07-15  6:33                 ` NeilBrown
  0 siblings, 1 reply; 39+ messages in thread
From: david @ 2011-07-15  4:58 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ric Wheeler, NeilBrown, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

On Thu, 14 Jul 2011, Chris Mason wrote:

> Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
>> On 07/14/2011 07:38 AM, NeilBrown wrote:
>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
>>>
>>>>> I'm certainly open to suggestions and collaboration.  Do you have in mind any
>>>>> particular way to make the interface richer??
>>>>>
>>>>> NeilBrown
>>>> Hi Neil,
>>>>
>>>> I know that Chris has a very specific set of use cases for btrfs and think that
>>>> Alasdair and others have started to look at what is doable.
>>>>
>>>> The obvious use case is the following:
>>>>
>>>> If a file system uses checksumming or other data corruption detection bits, it
>>>> can detect that it got bad data on a write. If that data was protected by RAID,
>>>> it would like to ask the block layer to try to read from another mirror (for
>>>> raid1) or try to validate/rebuild from parity.
>>>>
>>>> Today, I think that a retry will basically just give us back a random chance of
>>>> getting data from a different mirror or the same one that we got data from on
>>>> the first go.
>>>>
>>>> Chris, Alasdair, was that a good summary of one concern?
>>>>
>>>> Thanks!
>>>>
>>>> Ric
>>> I imagine a new field in 'struct bio' which was normally zero but could be
>>> some small integer.  It is only meaningful for read.
>>> When 0 it means "get this data way you like".
>>> When non-zero it means "get this data using method N", where the different
>>> methods are up to the device.
>>>
>>> For a mirrored RAID, method N means read from device N-1.
>>> For stripe/parity RAID, method 1 means "use other data blocks and parity
>>> blocks to reconstruct data.
>>>
>>> The default for non RAID devices is to return EINVAL for any N>  0.
>>> A remapping device (dm-linear, dm-stripe etc) would just pass the number
>>> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
>>> some thought.
>>>
>>> So if btrfs reads a block and the checksum looks wrong, it reads again with
>>> a larger N.  It continues incrementing N and retrying until it gets a block
>>> that it likes or it gets EINVAL.  There should probably be an error code
>>> (EAGAIN?) which means "I cannot work with that number, but try the next one".
>>>
>>> It would be trivial for me to implement this for RAID1 and RAID10, and
>>> relatively easy for RAID5.
>>> I'd need to give a bit of thought to RAID6 as there are possibly multiple
>>> ways to reconstruct from different combinations of parity and data.  I'm not
>>> sure if there would be much point in doing that though.
>>>
>>> It might make sense for a device to be able to report what the maximum
>>> 'N' supported is... that might make stacked raid easier to manage...
>>>
>>> NeilBrown
>>>
>>
>> I think that the above makes sense. Not sure what your "0" definition is, but I
>> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would
>> be an indication that there is nothing more that can be tried. The data you got
>> is all you are ever going to get :)
>>
>> For multiple mirrors, you might want to have a scheme where you would be able to
>> cycle through the mirrors. You could retry, cycling through the various mirrors
>> until you have tried and returned them all at which point you would get a "no
>> more" error back or some such thing.
>
> Hi everyone,
>
> The mirror number idea is basically what btrfs does today, and I think
> it fits in with Neil's idea to have different policies for different
> blocks.
>
> Basically what btrfs does:
>
> read_block(block_num, mirror = 0)
> if (no io error and not csum error)
> 	horray()
>
> num_mirrors = get_num_copies(block number)
> for (i = 1; i < num_mirrors; i++) {
> 	read_block(block_num, mirror = i);
> }
>
> In a stacked configuration, the get_num_copies function can be smarter,
> basically adding up all the copies of the lower levels and finding a way
> to combine them.  We could just send down a fake bio that is responsible
> for adding up the storage combinations into a bitmask or whatever works.
>
> We could also just keep retrying until the lower layers reported no more
> mirror were available.
>
> In btrfs at least, we don't set the data blocks up to date until the crc
> has passed, so replacing bogus blocks is easy.  Metadata is a little
> more complex, but that's not really related to this topic.
>
> mirror number 0 just means "no mirror preference/pick the fastest
> mirror" to the btrfs block mapping code.
>
> Btrfs has the concept of different raid levels for different logical
> block numbers, so you get_num_copies might return one answer for block A
> and a different answer for block B.
>
> Either way, we could easily make use of a bio field here if it were
> exported out.

you don't want to just pass the value down as that will cause problems 
with layering (especially if the lower layer supports more values than a 
higher layer)

I would suggest that each layer take the value it's give, do an integer 
division by the number of values that layer supports, using the modulo 
value for that layer and pass the rest of the result down to the next 
layer.

this works for just trying values until you get the error that tells you 
there are no more options.


things can get very 'intersesting' if the different options below you have 
different number of options (say a raid1 across a raid5 and a single disk) 
but I can't think of any good way to figure this out and assemble a 
sane order without doing an exaustive search down to find the number of 
options for each layer (and since this can be different for different 
blocks, you would have to do this each time you invoked this option)

David Lang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15  4:58               ` david
@ 2011-07-15  6:33                 ` NeilBrown
  2011-07-15 11:34                   ` Chris Mason
  2011-07-15 16:03                   ` david
  0 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2011-07-15  6:33 UTC (permalink / raw)
  To: david
  Cc: Chris Mason, Ric Wheeler, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote:

> On Thu, 14 Jul 2011, Chris Mason wrote:
> 
> > Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
> >> On 07/14/2011 07:38 AM, NeilBrown wrote:
> >>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
> >>>
> >>>>> I'm certainly open to suggestions and collaboration.  Do you have in mind any
> >>>>> particular way to make the interface richer??
> >>>>>
> >>>>> NeilBrown
> >>>> Hi Neil,
> >>>>
> >>>> I know that Chris has a very specific set of use cases for btrfs and think that
> >>>> Alasdair and others have started to look at what is doable.
> >>>>
> >>>> The obvious use case is the following:
> >>>>
> >>>> If a file system uses checksumming or other data corruption detection bits, it
> >>>> can detect that it got bad data on a write. If that data was protected by RAID,
> >>>> it would like to ask the block layer to try to read from another mirror (for
> >>>> raid1) or try to validate/rebuild from parity.
> >>>>
> >>>> Today, I think that a retry will basically just give us back a random chance of
> >>>> getting data from a different mirror or the same one that we got data from on
> >>>> the first go.
> >>>>
> >>>> Chris, Alasdair, was that a good summary of one concern?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Ric
> >>> I imagine a new field in 'struct bio' which was normally zero but could be
> >>> some small integer.  It is only meaningful for read.
> >>> When 0 it means "get this data way you like".
> >>> When non-zero it means "get this data using method N", where the different
> >>> methods are up to the device.
> >>>
> >>> For a mirrored RAID, method N means read from device N-1.
> >>> For stripe/parity RAID, method 1 means "use other data blocks and parity
> >>> blocks to reconstruct data.
> >>>
> >>> The default for non RAID devices is to return EINVAL for any N>  0.
> >>> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> >>> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> >>> some thought.
> >>>
> >>> So if btrfs reads a block and the checksum looks wrong, it reads again with
> >>> a larger N.  It continues incrementing N and retrying until it gets a block
> >>> that it likes or it gets EINVAL.  There should probably be an error code
> >>> (EAGAIN?) which means "I cannot work with that number, but try the next one".
> >>>
> >>> It would be trivial for me to implement this for RAID1 and RAID10, and
> >>> relatively easy for RAID5.
> >>> I'd need to give a bit of thought to RAID6 as there are possibly multiple
> >>> ways to reconstruct from different combinations of parity and data.  I'm not
> >>> sure if there would be much point in doing that though.
> >>>
> >>> It might make sense for a device to be able to report what the maximum
> >>> 'N' supported is... that might make stacked raid easier to manage...
> >>>
> >>> NeilBrown
> >>>
> >>
> >> I think that the above makes sense. Not sure what your "0" definition is, but I
> >> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would
> >> be an indication that there is nothing more that can be tried. The data you got
> >> is all you are ever going to get :)
> >>
> >> For multiple mirrors, you might want to have a scheme where you would be able to
> >> cycle through the mirrors. You could retry, cycling through the various mirrors
> >> until you have tried and returned them all at which point you would get a "no
> >> more" error back or some such thing.
> >
> > Hi everyone,
> >
> > The mirror number idea is basically what btrfs does today, and I think
> > it fits in with Neil's idea to have different policies for different
> > blocks.
> >
> > Basically what btrfs does:
> >
> > read_block(block_num, mirror = 0)
> > if (no io error and not csum error)
> > 	horray()
> >
> > num_mirrors = get_num_copies(block number)
> > for (i = 1; i < num_mirrors; i++) {
> > 	read_block(block_num, mirror = i);
> > }
> >
> > In a stacked configuration, the get_num_copies function can be smarter,
> > basically adding up all the copies of the lower levels and finding a way
> > to combine them.  We could just send down a fake bio that is responsible
> > for adding up the storage combinations into a bitmask or whatever works.
> >
> > We could also just keep retrying until the lower layers reported no more
> > mirror were available.
> >
> > In btrfs at least, we don't set the data blocks up to date until the crc
> > has passed, so replacing bogus blocks is easy.  Metadata is a little
> > more complex, but that's not really related to this topic.
> >
> > mirror number 0 just means "no mirror preference/pick the fastest
> > mirror" to the btrfs block mapping code.
> >
> > Btrfs has the concept of different raid levels for different logical
> > block numbers, so you get_num_copies might return one answer for block A
> > and a different answer for block B.
> >
> > Either way, we could easily make use of a bio field here if it were
> > exported out.
> 
> you don't want to just pass the value down as that will cause problems 
> with layering (especially if the lower layer supports more values than a 
> higher layer)
> 
> I would suggest that each layer take the value it's give, do an integer 
> division by the number of values that layer supports, using the modulo 
> value for that layer and pass the rest of the result down to the next 
> layer.

I, on the other hand, would suggest that each layer be given the freedom, and
the responsibility, to do whatever it thinks is most appropriate.

This would probably involved checking the lower levels to see how many
strategies each has, and doing some appropriate arithmetic depending on how
it combines those devices.

One problem here is the assumption that the lower levels don't change, and we
know that not to be the case.
However this is already a problem.  It is entirely possible that the request
size parameters of devices can change at any moment, even while a request is
in-flight ... though we try to avoid that or work around it.

The sort of dependencies that we see forming here would probably just make
the problem worse.

Not sure what to do about it though.... maybe just hope it isn't a problem.

NeilBrown


> 
> this works for just trying values until you get the error that tells you 
> there are no more options.
> 
> 
> things can get very 'intersesting' if the different options below you have 
> different number of options (say a raid1 across a raid5 and a single disk) 
> but I can't think of any good way to figure this out and assemble a 
> sane order without doing an exaustive search down to find the number of 
> options for each layer (and since this can be different for different 
> blocks, you would have to do this each time you invoked this option)
> 
> David Lang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15  6:33                 ` NeilBrown
@ 2011-07-15 11:34                   ` Chris Mason
  2011-07-15 12:58                     ` Ric Wheeler
  2011-07-15 16:03                   ` david
  1 sibling, 1 reply; 39+ messages in thread
From: Chris Mason @ 2011-07-15 11:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: david, Ric Wheeler, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

Excerpts from NeilBrown's message of 2011-07-15 02:33:54 -0400:
> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote:
> 
> > On Thu, 14 Jul 2011, Chris Mason wrote:
> > 
> > > Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
> > >> On 07/14/2011 07:38 AM, NeilBrown wrote:
> > >>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
> > >>>
> > >>>>> I'm certainly open to suggestions and collaboration.  Do you have in mind any
> > >>>>> particular way to make the interface richer??
> > >>>>>
> > >>>>> NeilBrown
> > >>>> Hi Neil,
> > >>>>
> > >>>> I know that Chris has a very specific set of use cases for btrfs and think that
> > >>>> Alasdair and others have started to look at what is doable.
> > >>>>
> > >>>> The obvious use case is the following:
> > >>>>
> > >>>> If a file system uses checksumming or other data corruption detection bits, it
> > >>>> can detect that it got bad data on a write. If that data was protected by RAID,
> > >>>> it would like to ask the block layer to try to read from another mirror (for
> > >>>> raid1) or try to validate/rebuild from parity.
> > >>>>
> > >>>> Today, I think that a retry will basically just give us back a random chance of
> > >>>> getting data from a different mirror or the same one that we got data from on
> > >>>> the first go.
> > >>>>
> > >>>> Chris, Alasdair, was that a good summary of one concern?
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> Ric
> > >>> I imagine a new field in 'struct bio' which was normally zero but could be
> > >>> some small integer.  It is only meaningful for read.
> > >>> When 0 it means "get this data way you like".
> > >>> When non-zero it means "get this data using method N", where the different
> > >>> methods are up to the device.
> > >>>
> > >>> For a mirrored RAID, method N means read from device N-1.
> > >>> For stripe/parity RAID, method 1 means "use other data blocks and parity
> > >>> blocks to reconstruct data.
> > >>>
> > >>> The default for non RAID devices is to return EINVAL for any N>  0.
> > >>> A remapping device (dm-linear, dm-stripe etc) would just pass the number
> > >>> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
> > >>> some thought.
> > >>>
> > >>> So if btrfs reads a block and the checksum looks wrong, it reads again with
> > >>> a larger N.  It continues incrementing N and retrying until it gets a block
> > >>> that it likes or it gets EINVAL.  There should probably be an error code
> > >>> (EAGAIN?) which means "I cannot work with that number, but try the next one".
> > >>>
> > >>> It would be trivial for me to implement this for RAID1 and RAID10, and
> > >>> relatively easy for RAID5.
> > >>> I'd need to give a bit of thought to RAID6 as there are possibly multiple
> > >>> ways to reconstruct from different combinations of parity and data.  I'm not
> > >>> sure if there would be much point in doing that though.
> > >>>
> > >>> It might make sense for a device to be able to report what the maximum
> > >>> 'N' supported is... that might make stacked raid easier to manage...
> > >>>
> > >>> NeilBrown
> > >>>
> > >>
> > >> I think that the above makes sense. Not sure what your "0" definition is, but I
> > >> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would
> > >> be an indication that there is nothing more that can be tried. The data you got
> > >> is all you are ever going to get :)
> > >>
> > >> For multiple mirrors, you might want to have a scheme where you would be able to
> > >> cycle through the mirrors. You could retry, cycling through the various mirrors
> > >> until you have tried and returned them all at which point you would get a "no
> > >> more" error back or some such thing.
> > >
> > > Hi everyone,
> > >
> > > The mirror number idea is basically what btrfs does today, and I think
> > > it fits in with Neil's idea to have different policies for different
> > > blocks.
> > >
> > > Basically what btrfs does:
> > >
> > > read_block(block_num, mirror = 0)
> > > if (no io error and not csum error)
> > >     horray()
> > >
> > > num_mirrors = get_num_copies(block number)
> > > for (i = 1; i < num_mirrors; i++) {
> > >     read_block(block_num, mirror = i);
> > > }
> > >
> > > In a stacked configuration, the get_num_copies function can be smarter,
> > > basically adding up all the copies of the lower levels and finding a way
> > > to combine them.  We could just send down a fake bio that is responsible
> > > for adding up the storage combinations into a bitmask or whatever works.
> > >
> > > We could also just keep retrying until the lower layers reported no more
> > > mirror were available.
> > >
> > > In btrfs at least, we don't set the data blocks up to date until the crc
> > > has passed, so replacing bogus blocks is easy.  Metadata is a little
> > > more complex, but that's not really related to this topic.
> > >
> > > mirror number 0 just means "no mirror preference/pick the fastest
> > > mirror" to the btrfs block mapping code.
> > >
> > > Btrfs has the concept of different raid levels for different logical
> > > block numbers, so you get_num_copies might return one answer for block A
> > > and a different answer for block B.
> > >
> > > Either way, we could easily make use of a bio field here if it were
> > > exported out.
> > 
> > you don't want to just pass the value down as that will cause problems 
> > with layering (especially if the lower layer supports more values than a 
> > higher layer)
> > 
> > I would suggest that each layer take the value it's give, do an integer 
> > division by the number of values that layer supports, using the modulo 
> > value for that layer and pass the rest of the result down to the next 
> > layer.
> 
> I, on the other hand, would suggest that each layer be given the freedom, and
> the responsibility, to do whatever it thinks is most appropriate.
> 
> This would probably involved checking the lower levels to see how many
> strategies each has, and doing some appropriate arithmetic depending on how
> it combines those devices.
> 
> One problem here is the assumption that the lower levels don't change, and we
> know that not to be the case.
> However this is already a problem.  It is entirely possible that the request
> size parameters of devices can change at any moment, even while a request is
> in-flight ... though we try to avoid that or work around it.
> 
> The sort of dependencies that we see forming here would probably just make
> the problem worse.
> 
> Not sure what to do about it though.... maybe just hope it isn't a problem.

Agreed, if we want to go beyond best effort for a stacking config, we'll
need to put some state struct in the bio that each layer can play with.
That way each layer knows which mirrors have already been tried.

But, maybe the whole btrfs model is backwards for a generic layer.
Instead of sending down ios and testing when they come back, we could
just set a verification function (or stack of them?).

For metadata, btrfs compares the crc and a few other fields of the
metadata block, so we can easily add a compare function pointer and a
void * to pass in.

The problem is the crc can take a lot of CPU, so btrfs kicks it off to
threading pools so saturate all the cpus on the box.  But there's no
reason we can't make that available lower down.

If we pushed the verification down, the retries could bubble up the
stack instead of the other way around.

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 11:34                   ` Chris Mason
@ 2011-07-15 12:58                     ` Ric Wheeler
  2011-07-15 13:20                       ` Chris Mason
  2011-07-15 13:55                         ` Mike Snitzer
  0 siblings, 2 replies; 39+ messages in thread
From: Ric Wheeler @ 2011-07-15 12:58 UTC (permalink / raw)
  To: Chris Mason
  Cc: NeilBrown, david, Nico Schottelius, LKML, linux-btrfs, Alasdair G Kergon

On 07/15/2011 12:34 PM, Chris Mason wrote:
> Excerpts from NeilBrown's message of 2011-07-15 02:33:54 -0400:
>> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote:
>>
>>> On Thu, 14 Jul 2011, Chris Mason wrote:
>>>
>>>> Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
>>>>> On 07/14/2011 07:38 AM, NeilBrown wrote:
>>>>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>   wrote:
>>>>>>
>>>>>>>> I'm certainly open to suggestions and collaboration.  Do you have in mind any
>>>>>>>> particular way to make the interface richer??
>>>>>>>>
>>>>>>>> NeilBrown
>>>>>>> Hi Neil,
>>>>>>>
>>>>>>> I know that Chris has a very specific set of use cases for btrfs and think that
>>>>>>> Alasdair and others have started to look at what is doable.
>>>>>>>
>>>>>>> The obvious use case is the following:
>>>>>>>
>>>>>>> If a file system uses checksumming or other data corruption detection bits, it
>>>>>>> can detect that it got bad data on a write. If that data was protected by RAID,
>>>>>>> it would like to ask the block layer to try to read from another mirror (for
>>>>>>> raid1) or try to validate/rebuild from parity.
>>>>>>>
>>>>>>> Today, I think that a retry will basically just give us back a random chance of
>>>>>>> getting data from a different mirror or the same one that we got data from on
>>>>>>> the first go.
>>>>>>>
>>>>>>> Chris, Alasdair, was that a good summary of one concern?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Ric
>>>>>> I imagine a new field in 'struct bio' which was normally zero but could be
>>>>>> some small integer.  It is only meaningful for read.
>>>>>> When 0 it means "get this data way you like".
>>>>>> When non-zero it means "get this data using method N", where the different
>>>>>> methods are up to the device.
>>>>>>
>>>>>> For a mirrored RAID, method N means read from device N-1.
>>>>>> For stripe/parity RAID, method 1 means "use other data blocks and parity
>>>>>> blocks to reconstruct data.
>>>>>>
>>>>>> The default for non RAID devices is to return EINVAL for any N>   0.
>>>>>> A remapping device (dm-linear, dm-stripe etc) would just pass the number
>>>>>> down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
>>>>>> some thought.
>>>>>>
>>>>>> So if btrfs reads a block and the checksum looks wrong, it reads again with
>>>>>> a larger N.  It continues incrementing N and retrying until it gets a block
>>>>>> that it likes or it gets EINVAL.  There should probably be an error code
>>>>>> (EAGAIN?) which means "I cannot work with that number, but try the next one".
>>>>>>
>>>>>> It would be trivial for me to implement this for RAID1 and RAID10, and
>>>>>> relatively easy for RAID5.
>>>>>> I'd need to give a bit of thought to RAID6 as there are possibly multiple
>>>>>> ways to reconstruct from different combinations of parity and data.  I'm not
>>>>>> sure if there would be much point in doing that though.
>>>>>>
>>>>>> It might make sense for a device to be able to report what the maximum
>>>>>> 'N' supported is... that might make stacked raid easier to manage...
>>>>>>
>>>>>> NeilBrown
>>>>>>
>>>>> I think that the above makes sense. Not sure what your "0" definition is, but I
>>>>> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would
>>>>> be an indication that there is nothing more that can be tried. The data you got
>>>>> is all you are ever going to get :)
>>>>>
>>>>> For multiple mirrors, you might want to have a scheme where you would be able to
>>>>> cycle through the mirrors. You could retry, cycling through the various mirrors
>>>>> until you have tried and returned them all at which point you would get a "no
>>>>> more" error back or some such thing.
>>>> Hi everyone,
>>>>
>>>> The mirror number idea is basically what btrfs does today, and I think
>>>> it fits in with Neil's idea to have different policies for different
>>>> blocks.
>>>>
>>>> Basically what btrfs does:
>>>>
>>>> read_block(block_num, mirror = 0)
>>>> if (no io error and not csum error)
>>>>      horray()
>>>>
>>>> num_mirrors = get_num_copies(block number)
>>>> for (i = 1; i<  num_mirrors; i++) {
>>>>      read_block(block_num, mirror = i);
>>>> }
>>>>
>>>> In a stacked configuration, the get_num_copies function can be smarter,
>>>> basically adding up all the copies of the lower levels and finding a way
>>>> to combine them.  We could just send down a fake bio that is responsible
>>>> for adding up the storage combinations into a bitmask or whatever works.
>>>>
>>>> We could also just keep retrying until the lower layers reported no more
>>>> mirror were available.
>>>>
>>>> In btrfs at least, we don't set the data blocks up to date until the crc
>>>> has passed, so replacing bogus blocks is easy.  Metadata is a little
>>>> more complex, but that's not really related to this topic.
>>>>
>>>> mirror number 0 just means "no mirror preference/pick the fastest
>>>> mirror" to the btrfs block mapping code.
>>>>
>>>> Btrfs has the concept of different raid levels for different logical
>>>> block numbers, so you get_num_copies might return one answer for block A
>>>> and a different answer for block B.
>>>>
>>>> Either way, we could easily make use of a bio field here if it were
>>>> exported out.
>>> you don't want to just pass the value down as that will cause problems
>>> with layering (especially if the lower layer supports more values than a
>>> higher layer)
>>>
>>> I would suggest that each layer take the value it's give, do an integer
>>> division by the number of values that layer supports, using the modulo
>>> value for that layer and pass the rest of the result down to the next
>>> layer.
>> I, on the other hand, would suggest that each layer be given the freedom, and
>> the responsibility, to do whatever it thinks is most appropriate.
>>
>> This would probably involved checking the lower levels to see how many
>> strategies each has, and doing some appropriate arithmetic depending on how
>> it combines those devices.
>>
>> One problem here is the assumption that the lower levels don't change, and we
>> know that not to be the case.
>> However this is already a problem.  It is entirely possible that the request
>> size parameters of devices can change at any moment, even while a request is
>> in-flight ... though we try to avoid that or work around it.
>>
>> The sort of dependencies that we see forming here would probably just make
>> the problem worse.
>>
>> Not sure what to do about it though.... maybe just hope it isn't a problem.
> Agreed, if we want to go beyond best effort for a stacking config, we'll
> need to put some state struct in the bio that each layer can play with.
> That way each layer knows which mirrors have already been tried.
>
> But, maybe the whole btrfs model is backwards for a generic layer.
> Instead of sending down ios and testing when they come back, we could
> just set a verification function (or stack of them?).
>
> For metadata, btrfs compares the crc and a few other fields of the
> metadata block, so we can easily add a compare function pointer and a
> void * to pass in.
>
> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> threading pools so saturate all the cpus on the box.  But there's no
> reason we can't make that available lower down.
>
> If we pushed the verification down, the retries could bubble up the
> stack instead of the other way around.
>
> -chris

I do like the idea of having the ability to do the verification and retries down 
the stack where you actually have the most context to figure out what is possible...

Why would you need to bubble back up anything other than an error when all 
retries have failed?

Ric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 12:58                     ` Ric Wheeler
@ 2011-07-15 13:20                       ` Chris Mason
  2011-07-15 13:31                         ` Ric Wheeler
  2011-07-15 16:23                         ` david
  2011-07-15 13:55                         ` Mike Snitzer
  1 sibling, 2 replies; 39+ messages in thread
From: Chris Mason @ 2011-07-15 13:20 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: NeilBrown, david, Nico Schottelius, LKML, linux-btrfs, Alasdair G Kergon

Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
> On 07/15/2011 12:34 PM, Chris Mason wrote:

[ triggering IO retries on failed crc or other checks ]

> >
> > But, maybe the whole btrfs model is backwards for a generic layer.
> > Instead of sending down ios and testing when they come back, we could
> > just set a verification function (or stack of them?).
> >
> > For metadata, btrfs compares the crc and a few other fields of the
> > metadata block, so we can easily add a compare function pointer and a
> > void * to pass in.
> >
> > The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> > threading pools so saturate all the cpus on the box.  But there's no
> > reason we can't make that available lower down.
> >
> > If we pushed the verification down, the retries could bubble up the
> > stack instead of the other way around.
> >
> > -chris
> 
> I do like the idea of having the ability to do the verification and retries down 
> the stack where you actually have the most context to figure out what is possible...
> 
> Why would you need to bubble back up anything other than an error when all 
> retries have failed?

By bubble up I mean that if you have multiple layers capable of doing
retries, the lowest levels would retry first.  Basically by the time we
get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
do to help.

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 13:20                       ` Chris Mason
@ 2011-07-15 13:31                         ` Ric Wheeler
  2011-07-15 14:00                           ` Chris Mason
  2011-07-15 16:23                         ` david
  1 sibling, 1 reply; 39+ messages in thread
From: Ric Wheeler @ 2011-07-15 13:31 UTC (permalink / raw)
  To: Chris Mason
  Cc: NeilBrown, david, Nico Schottelius, LKML, linux-btrfs, Alasdair G Kergon

On 07/15/2011 02:20 PM, Chris Mason wrote:
> Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
>> On 07/15/2011 12:34 PM, Chris Mason wrote:
> [ triggering IO retries on failed crc or other checks ]
>
>>> But, maybe the whole btrfs model is backwards for a generic layer.
>>> Instead of sending down ios and testing when they come back, we could
>>> just set a verification function (or stack of them?).
>>>
>>> For metadata, btrfs compares the crc and a few other fields of the
>>> metadata block, so we can easily add a compare function pointer and a
>>> void * to pass in.
>>>
>>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
>>> threading pools so saturate all the cpus on the box.  But there's no
>>> reason we can't make that available lower down.
>>>
>>> If we pushed the verification down, the retries could bubble up the
>>> stack instead of the other way around.
>>>
>>> -chris
>> I do like the idea of having the ability to do the verification and retries down
>> the stack where you actually have the most context to figure out what is possible...
>>
>> Why would you need to bubble back up anything other than an error when all
>> retries have failed?
> By bubble up I mean that if you have multiple layers capable of doing
> retries, the lowest levels would retry first.  Basically by the time we
> get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> do to help.
>
> -chris

Absolutely sounds like the most sane way to go to me, thanks!

Ric


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 12:58                     ` Ric Wheeler
@ 2011-07-15 13:55                         ` Mike Snitzer
  2011-07-15 13:55                         ` Mike Snitzer
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Snitzer @ 2011-07-15 13:55 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, NeilBrown, david, Nico Schottelius, LKML,
	linux-btrfs, Alasdair G Kergon

On Fri, Jul 15, 2011 at 8:58 AM, Ric Wheeler <rwheeler@redhat.com> wrot=
e:
> On 07/15/2011 12:34 PM, Chris Mason wrote:
>>
>> Excerpts from NeilBrown's message of 2011-07-15 02:33:54 -0400:
>>>
>>> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote:
>>>
>>>> On Thu, 14 Jul 2011, Chris Mason wrote:
>>>>
>>>>> Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
>>>>>>
>>>>>> On 07/14/2011 07:38 AM, NeilBrown wrote:
>>>>>>>
>>>>>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.=
com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> I'm certainly open to suggestions and collaboration. =A0Do yo=
u have
>>>>>>>>> in mind any
>>>>>>>>> particular way to make the interface richer??
>>>>>>>>>
>>>>>>>>> NeilBrown
>>>>>>>>
>>>>>>>> Hi Neil,
>>>>>>>>
>>>>>>>> I know that Chris has a very specific set of use cases for btr=
fs and
>>>>>>>> think that
>>>>>>>> Alasdair and others have started to look at what is doable.
>>>>>>>>
>>>>>>>> The obvious use case is the following:
>>>>>>>>
>>>>>>>> If a file system uses checksumming or other data corruption
>>>>>>>> detection bits, it
>>>>>>>> can detect that it got bad data on a write. If that data was
>>>>>>>> protected by RAID,
>>>>>>>> it would like to ask the block layer to try to read from anoth=
er
>>>>>>>> mirror (for
>>>>>>>> raid1) or try to validate/rebuild from parity.
>>>>>>>>
>>>>>>>> Today, I think that a retry will basically just give us back a
>>>>>>>> random chance of
>>>>>>>> getting data from a different mirror or the same one that we g=
ot
>>>>>>>> data from on
>>>>>>>> the first go.
>>>>>>>>
>>>>>>>> Chris, Alasdair, was that a good summary of one concern?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Ric
>>>>>>>
>>>>>>> I imagine a new field in 'struct bio' which was normally zero b=
ut
>>>>>>> could be
>>>>>>> some small integer. =A0It is only meaningful for read.
>>>>>>> When 0 it means "get this data way you like".
>>>>>>> When non-zero it means "get this data using method N", where th=
e
>>>>>>> different
>>>>>>> methods are up to the device.
>>>>>>>
>>>>>>> For a mirrored RAID, method N means read from device N-1.
>>>>>>> For stripe/parity RAID, method 1 means "use other data blocks a=
nd
>>>>>>> parity
>>>>>>> blocks to reconstruct data.
>>>>>>>
>>>>>>> The default for non RAID devices is to return EINVAL for any N>=
 =A0 0.
>>>>>>> A remapping device (dm-linear, dm-stripe etc) would just pass t=
he
>>>>>>> number
>>>>>>> down. =A0I'm not sure how RAID1 over RAID5 would handle it... t=
hat
>>>>>>> might need
>>>>>>> some thought.
>>>>>>>
>>>>>>> So if btrfs reads a block and the checksum looks wrong, it read=
s
>>>>>>> again with
>>>>>>> a larger N. =A0It continues incrementing N and retrying until i=
t gets a
>>>>>>> block
>>>>>>> that it likes or it gets EINVAL. =A0There should probably be an=
 error
>>>>>>> code
>>>>>>> (EAGAIN?) which means "I cannot work with that number, but try =
the
>>>>>>> next one".
>>>>>>>
>>>>>>> It would be trivial for me to implement this for RAID1 and RAID=
10,
>>>>>>> and
>>>>>>> relatively easy for RAID5.
>>>>>>> I'd need to give a bit of thought to RAID6 as there are possibl=
y
>>>>>>> multiple
>>>>>>> ways to reconstruct from different combinations of parity and d=
ata.
>>>>>>> =A0I'm not
>>>>>>> sure if there would be much point in doing that though.
>>>>>>>
>>>>>>> It might make sense for a device to be able to report what the
>>>>>>> maximum
>>>>>>> 'N' supported is... that might make stacked raid easier to mana=
ge...
>>>>>>>
>>>>>>> NeilBrown
>>>>>>>
>>>>>> I think that the above makes sense. Not sure what your "0" defin=
ition
>>>>>> is, but I
>>>>>> would assume that for non-raided devices (i.e., a single s-ata d=
isk),
>>>>>> "0" would
>>>>>> be an indication that there is nothing more that can be tried. T=
he
>>>>>> data you got
>>>>>> is all you are ever going to get :)
>>>>>>
>>>>>> For multiple mirrors, you might want to have a scheme where you =
would
>>>>>> be able to
>>>>>> cycle through the mirrors. You could retry, cycling through the
>>>>>> various mirrors
>>>>>> until you have tried and returned them all at which point you wo=
uld
>>>>>> get a "no
>>>>>> more" error back or some such thing.
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> The mirror number idea is basically what btrfs does today, and I =
think
>>>>> it fits in with Neil's idea to have different policies for differ=
ent
>>>>> blocks.
>>>>>
>>>>> Basically what btrfs does:
>>>>>
>>>>> read_block(block_num, mirror =3D 0)
>>>>> if (no io error and not csum error)
>>>>> =A0 =A0 horray()
>>>>>
>>>>> num_mirrors =3D get_num_copies(block number)
>>>>> for (i =3D 1; i< =A0num_mirrors; i++) {
>>>>> =A0 =A0 read_block(block_num, mirror =3D i);
>>>>> }
>>>>>
>>>>> In a stacked configuration, the get_num_copies function can be sm=
arter,
>>>>> basically adding up all the copies of the lower levels and findin=
g a
>>>>> way
>>>>> to combine them. =A0We could just send down a fake bio that is
>>>>> responsible
>>>>> for adding up the storage combinations into a bitmask or whatever
>>>>> works.
>>>>>
>>>>> We could also just keep retrying until the lower layers reported =
no
>>>>> more
>>>>> mirror were available.
>>>>>
>>>>> In btrfs at least, we don't set the data blocks up to date until =
the
>>>>> crc
>>>>> has passed, so replacing bogus blocks is easy. =A0Metadata is a l=
ittle
>>>>> more complex, but that's not really related to this topic.
>>>>>
>>>>> mirror number 0 just means "no mirror preference/pick the fastest
>>>>> mirror" to the btrfs block mapping code.
>>>>>
>>>>> Btrfs has the concept of different raid levels for different logi=
cal
>>>>> block numbers, so you get_num_copies might return one answer for =
block
>>>>> A
>>>>> and a different answer for block B.
>>>>>
>>>>> Either way, we could easily make use of a bio field here if it we=
re
>>>>> exported out.
>>>>
>>>> you don't want to just pass the value down as that will cause prob=
lems
>>>> with layering (especially if the lower layer supports more values =
than a
>>>> higher layer)
>>>>
>>>> I would suggest that each layer take the value it's give, do an in=
teger
>>>> division by the number of values that layer supports, using the mo=
dulo
>>>> value for that layer and pass the rest of the result down to the n=
ext
>>>> layer.
>>>
>>> I, on the other hand, would suggest that each layer be given the fr=
eedom,
>>> and
>>> the responsibility, to do whatever it thinks is most appropriate.
>>>
>>> This would probably involved checking the lower levels to see how m=
any
>>> strategies each has, and doing some appropriate arithmetic dependin=
g on
>>> how
>>> it combines those devices.
>>>
>>> One problem here is the assumption that the lower levels don't chan=
ge,
>>> and we
>>> know that not to be the case.
>>> However this is already a problem. =A0It is entirely possible that =
the
>>> request
>>> size parameters of devices can change at any moment, even while a r=
equest
>>> is
>>> in-flight ... though we try to avoid that or work around it.
>>>
>>> The sort of dependencies that we see forming here would probably ju=
st
>>> make
>>> the problem worse.
>>>
>>> Not sure what to do about it though.... maybe just hope it isn't a
>>> problem.
>>
>> Agreed, if we want to go beyond best effort for a stacking config, w=
e'll
>> need to put some state struct in the bio that each layer can play wi=
th.
>> That way each layer knows which mirrors have already been tried.
>>
>> But, maybe the whole btrfs model is backwards for a generic layer.
>> Instead of sending down ios and testing when they come back, we coul=
d
>> just set a verification function (or stack of them?).
>>
>> For metadata, btrfs compares the crc and a few other fields of the
>> metadata block, so we can easily add a compare function pointer and =
a
>> void * to pass in.
>>
>> The problem is the crc can take a lot of CPU, so btrfs kicks it off =
to
>> threading pools so saturate all the cpus on the box. =A0But there's =
no
>> reason we can't make that available lower down.
>>
>> If we pushed the verification down, the retries could bubble up the
>> stack instead of the other way around.
>>
>> -chris
>
> I do like the idea of having the ability to do the verification and r=
etries
> down the stack where you actually have the most context to figure out=
 what
> is possible...
>
> Why would you need to bubble back up anything other than an error whe=
n all
> retries have failed?

Right, passing a validator is natural.  We've done the same for
validating DM thinp's metadata (crc, blocknr, etc), btree nodes,
bitmaps, etc.  The dm-thinp code is very much under review but I
figured I'd share.

See:
https://github.com/jthornber/linux-2.6/blob/4c4089de2e5a4f343d9810f7653=
1cb25aa13f91c/drivers/md/dm-thin-metadata.c

Specifically:
static struct dm_block_validator sb_validator_

So we pass a dm_block_validator around when doing block IO.  Each
validator has a .check and a .preparse_for_write method.

If the greater kernel offered a comparable mechanism we'd just switch
over to using it.

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
@ 2011-07-15 13:55                         ` Mike Snitzer
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Snitzer @ 2011-07-15 13:55 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, NeilBrown, david, Nico Schottelius, LKML,
	linux-btrfs, Alasdair G Kergon

On Fri, Jul 15, 2011 at 8:58 AM, Ric Wheeler <rwheeler@redhat.com> wrote:
> On 07/15/2011 12:34 PM, Chris Mason wrote:
>>
>> Excerpts from NeilBrown's message of 2011-07-15 02:33:54 -0400:
>>>
>>> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote:
>>>
>>>> On Thu, 14 Jul 2011, Chris Mason wrote:
>>>>
>>>>> Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
>>>>>>
>>>>>> On 07/14/2011 07:38 AM, NeilBrown wrote:
>>>>>>>
>>>>>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> I'm certainly open to suggestions and collaboration.  Do you have
>>>>>>>>> in mind any
>>>>>>>>> particular way to make the interface richer??
>>>>>>>>>
>>>>>>>>> NeilBrown
>>>>>>>>
>>>>>>>> Hi Neil,
>>>>>>>>
>>>>>>>> I know that Chris has a very specific set of use cases for btrfs and
>>>>>>>> think that
>>>>>>>> Alasdair and others have started to look at what is doable.
>>>>>>>>
>>>>>>>> The obvious use case is the following:
>>>>>>>>
>>>>>>>> If a file system uses checksumming or other data corruption
>>>>>>>> detection bits, it
>>>>>>>> can detect that it got bad data on a write. If that data was
>>>>>>>> protected by RAID,
>>>>>>>> it would like to ask the block layer to try to read from another
>>>>>>>> mirror (for
>>>>>>>> raid1) or try to validate/rebuild from parity.
>>>>>>>>
>>>>>>>> Today, I think that a retry will basically just give us back a
>>>>>>>> random chance of
>>>>>>>> getting data from a different mirror or the same one that we got
>>>>>>>> data from on
>>>>>>>> the first go.
>>>>>>>>
>>>>>>>> Chris, Alasdair, was that a good summary of one concern?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Ric
>>>>>>>
>>>>>>> I imagine a new field in 'struct bio' which was normally zero but
>>>>>>> could be
>>>>>>> some small integer.  It is only meaningful for read.
>>>>>>> When 0 it means "get this data way you like".
>>>>>>> When non-zero it means "get this data using method N", where the
>>>>>>> different
>>>>>>> methods are up to the device.
>>>>>>>
>>>>>>> For a mirrored RAID, method N means read from device N-1.
>>>>>>> For stripe/parity RAID, method 1 means "use other data blocks and
>>>>>>> parity
>>>>>>> blocks to reconstruct data.
>>>>>>>
>>>>>>> The default for non RAID devices is to return EINVAL for any N>   0.
>>>>>>> A remapping device (dm-linear, dm-stripe etc) would just pass the
>>>>>>> number
>>>>>>> down.  I'm not sure how RAID1 over RAID5 would handle it... that
>>>>>>> might need
>>>>>>> some thought.
>>>>>>>
>>>>>>> So if btrfs reads a block and the checksum looks wrong, it reads
>>>>>>> again with
>>>>>>> a larger N.  It continues incrementing N and retrying until it gets a
>>>>>>> block
>>>>>>> that it likes or it gets EINVAL.  There should probably be an error
>>>>>>> code
>>>>>>> (EAGAIN?) which means "I cannot work with that number, but try the
>>>>>>> next one".
>>>>>>>
>>>>>>> It would be trivial for me to implement this for RAID1 and RAID10,
>>>>>>> and
>>>>>>> relatively easy for RAID5.
>>>>>>> I'd need to give a bit of thought to RAID6 as there are possibly
>>>>>>> multiple
>>>>>>> ways to reconstruct from different combinations of parity and data.
>>>>>>>  I'm not
>>>>>>> sure if there would be much point in doing that though.
>>>>>>>
>>>>>>> It might make sense for a device to be able to report what the
>>>>>>> maximum
>>>>>>> 'N' supported is... that might make stacked raid easier to manage...
>>>>>>>
>>>>>>> NeilBrown
>>>>>>>
>>>>>> I think that the above makes sense. Not sure what your "0" definition
>>>>>> is, but I
>>>>>> would assume that for non-raided devices (i.e., a single s-ata disk),
>>>>>> "0" would
>>>>>> be an indication that there is nothing more that can be tried. The
>>>>>> data you got
>>>>>> is all you are ever going to get :)
>>>>>>
>>>>>> For multiple mirrors, you might want to have a scheme where you would
>>>>>> be able to
>>>>>> cycle through the mirrors. You could retry, cycling through the
>>>>>> various mirrors
>>>>>> until you have tried and returned them all at which point you would
>>>>>> get a "no
>>>>>> more" error back or some such thing.
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> The mirror number idea is basically what btrfs does today, and I think
>>>>> it fits in with Neil's idea to have different policies for different
>>>>> blocks.
>>>>>
>>>>> Basically what btrfs does:
>>>>>
>>>>> read_block(block_num, mirror = 0)
>>>>> if (no io error and not csum error)
>>>>>     horray()
>>>>>
>>>>> num_mirrors = get_num_copies(block number)
>>>>> for (i = 1; i<  num_mirrors; i++) {
>>>>>     read_block(block_num, mirror = i);
>>>>> }
>>>>>
>>>>> In a stacked configuration, the get_num_copies function can be smarter,
>>>>> basically adding up all the copies of the lower levels and finding a
>>>>> way
>>>>> to combine them.  We could just send down a fake bio that is
>>>>> responsible
>>>>> for adding up the storage combinations into a bitmask or whatever
>>>>> works.
>>>>>
>>>>> We could also just keep retrying until the lower layers reported no
>>>>> more
>>>>> mirror were available.
>>>>>
>>>>> In btrfs at least, we don't set the data blocks up to date until the
>>>>> crc
>>>>> has passed, so replacing bogus blocks is easy.  Metadata is a little
>>>>> more complex, but that's not really related to this topic.
>>>>>
>>>>> mirror number 0 just means "no mirror preference/pick the fastest
>>>>> mirror" to the btrfs block mapping code.
>>>>>
>>>>> Btrfs has the concept of different raid levels for different logical
>>>>> block numbers, so you get_num_copies might return one answer for block
>>>>> A
>>>>> and a different answer for block B.
>>>>>
>>>>> Either way, we could easily make use of a bio field here if it were
>>>>> exported out.
>>>>
>>>> you don't want to just pass the value down as that will cause problems
>>>> with layering (especially if the lower layer supports more values than a
>>>> higher layer)
>>>>
>>>> I would suggest that each layer take the value it's give, do an integer
>>>> division by the number of values that layer supports, using the modulo
>>>> value for that layer and pass the rest of the result down to the next
>>>> layer.
>>>
>>> I, on the other hand, would suggest that each layer be given the freedom,
>>> and
>>> the responsibility, to do whatever it thinks is most appropriate.
>>>
>>> This would probably involved checking the lower levels to see how many
>>> strategies each has, and doing some appropriate arithmetic depending on
>>> how
>>> it combines those devices.
>>>
>>> One problem here is the assumption that the lower levels don't change,
>>> and we
>>> know that not to be the case.
>>> However this is already a problem.  It is entirely possible that the
>>> request
>>> size parameters of devices can change at any moment, even while a request
>>> is
>>> in-flight ... though we try to avoid that or work around it.
>>>
>>> The sort of dependencies that we see forming here would probably just
>>> make
>>> the problem worse.
>>>
>>> Not sure what to do about it though.... maybe just hope it isn't a
>>> problem.
>>
>> Agreed, if we want to go beyond best effort for a stacking config, we'll
>> need to put some state struct in the bio that each layer can play with.
>> That way each layer knows which mirrors have already been tried.
>>
>> But, maybe the whole btrfs model is backwards for a generic layer.
>> Instead of sending down ios and testing when they come back, we could
>> just set a verification function (or stack of them?).
>>
>> For metadata, btrfs compares the crc and a few other fields of the
>> metadata block, so we can easily add a compare function pointer and a
>> void * to pass in.
>>
>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
>> threading pools so saturate all the cpus on the box.  But there's no
>> reason we can't make that available lower down.
>>
>> If we pushed the verification down, the retries could bubble up the
>> stack instead of the other way around.
>>
>> -chris
>
> I do like the idea of having the ability to do the verification and retries
> down the stack where you actually have the most context to figure out what
> is possible...
>
> Why would you need to bubble back up anything other than an error when all
> retries have failed?

Right, passing a validator is natural.  We've done the same for
validating DM thinp's metadata (crc, blocknr, etc), btree nodes,
bitmaps, etc.  The dm-thinp code is very much under review but I
figured I'd share.

See:
https://github.com/jthornber/linux-2.6/blob/4c4089de2e5a4f343d9810f76531cb25aa13f91c/drivers/md/dm-thin-metadata.c

Specifically:
static struct dm_block_validator sb_validator_

So we pass a dm_block_validator around when doing block IO.  Each
validator has a .check and a .preparse_for_write method.

If the greater kernel offered a comparable mechanism we'd just switch
over to using it.

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 13:31                         ` Ric Wheeler
@ 2011-07-15 14:00                           ` Chris Mason
  2011-07-15 14:07                             ` Hugo Mills
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2011-07-15 14:00 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: NeilBrown, david, Nico Schottelius, LKML, linux-btrfs, Alasdair G Kergon

Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
> On 07/15/2011 02:20 PM, Chris Mason wrote:
> > Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
> >> On 07/15/2011 12:34 PM, Chris Mason wrote:
> > [ triggering IO retries on failed crc or other checks ]
> >
> >>> But, maybe the whole btrfs model is backwards for a generic layer.
> >>> Instead of sending down ios and testing when they come back, we could
> >>> just set a verification function (or stack of them?).
> >>>
> >>> For metadata, btrfs compares the crc and a few other fields of the
> >>> metadata block, so we can easily add a compare function pointer and a
> >>> void * to pass in.
> >>>
> >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> >>> threading pools so saturate all the cpus on the box.  But there's no
> >>> reason we can't make that available lower down.
> >>>
> >>> If we pushed the verification down, the retries could bubble up the
> >>> stack instead of the other way around.
> >>>
> >>> -chris
> >> I do like the idea of having the ability to do the verification and retries down
> >> the stack where you actually have the most context to figure out what is possible...
> >>
> >> Why would you need to bubble back up anything other than an error when all
> >> retries have failed?
> > By bubble up I mean that if you have multiple layers capable of doing
> > retries, the lowest levels would retry first.  Basically by the time we
> > get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> > do to help.
> >
> > -chris
> 
> Absolutely sounds like the most sane way to go to me, thanks!
> 

It really seemed like a good idea, but I just realized it doesn't work
well when parts of the stack transform the data.

Picture dm-crypt on top of raid1.  If raid1 is responsible for the
crc retries, there's no way to crc the data because it needs to be
decrypted first.

I think the raided dm-crypt config is much more common (and interesting)
than multiple layers that can retry for other reasons (raid1 on top of
raid10?)

In other words, do we really want to do a lot of design work for
multiple layers where each one maintains multiple copies of the data
blocks?  Are there configs where this really makes sense?

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 14:00                           ` Chris Mason
@ 2011-07-15 14:07                             ` Hugo Mills
  2011-07-15 14:24                               ` Chris Mason
  0 siblings, 1 reply; 39+ messages in thread
From: Hugo Mills @ 2011-07-15 14:07 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ric Wheeler, NeilBrown, david, Nico Schottelius, LKML,
	linux-btrfs, Alasdair G Kergon

[-- Attachment #1: Type: text/plain, Size: 3110 bytes --]

On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
> Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
> > On 07/15/2011 02:20 PM, Chris Mason wrote:
> > > Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
> > >> On 07/15/2011 12:34 PM, Chris Mason wrote:
> > > [ triggering IO retries on failed crc or other checks ]
> > >
> > >>> But, maybe the whole btrfs model is backwards for a generic layer.
> > >>> Instead of sending down ios and testing when they come back, we could
> > >>> just set a verification function (or stack of them?).
> > >>>
> > >>> For metadata, btrfs compares the crc and a few other fields of the
> > >>> metadata block, so we can easily add a compare function pointer and a
> > >>> void * to pass in.
> > >>>
> > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> > >>> threading pools so saturate all the cpus on the box.  But there's no
> > >>> reason we can't make that available lower down.
> > >>>
> > >>> If we pushed the verification down, the retries could bubble up the
> > >>> stack instead of the other way around.
> > >>>
> > >>> -chris
> > >> I do like the idea of having the ability to do the verification and retries down
> > >> the stack where you actually have the most context to figure out what is possible...
> > >>
> > >> Why would you need to bubble back up anything other than an error when all
> > >> retries have failed?
> > > By bubble up I mean that if you have multiple layers capable of doing
> > > retries, the lowest levels would retry first.  Basically by the time we
> > > get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> > > do to help.
> > >
> > > -chris
> > 
> > Absolutely sounds like the most sane way to go to me, thanks!
> > 
> 
> It really seemed like a good idea, but I just realized it doesn't work
> well when parts of the stack transform the data.
> 
> Picture dm-crypt on top of raid1.  If raid1 is responsible for the
> crc retries, there's no way to crc the data because it needs to be
> decrypted first.
> 
> I think the raided dm-crypt config is much more common (and interesting)
> than multiple layers that can retry for other reasons (raid1 on top of
> raid10?)

   Isn't this a case where the transformative mid-layer would replace
the validation function before passing it down the stack? So btrfs
hands dm-crypt a checksum function; dm-crypt then stores that function
for its own purposes and hands off a new function to the DM layer
below that which decrypts the data and calls the btrfs checksum
function it stored earlier.

> In other words, do we really want to do a lot of design work for
> multiple layers where each one maintains multiple copies of the data
> blocks?  Are there configs where this really makes sense?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- "What are we going to do tonight?" "The same thing we do ---     
            every night, Pinky.  Try to take over the world!"            

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 14:07                             ` Hugo Mills
@ 2011-07-15 14:24                               ` Chris Mason
  2011-07-15 14:47                                   ` Christian Aßfalg
  2011-07-15 14:54                                 ` Hugo Mills
  0 siblings, 2 replies; 39+ messages in thread
From: Chris Mason @ 2011-07-15 14:24 UTC (permalink / raw)
  To: Hugo Mills
  Cc: Ric Wheeler, NeilBrown, david, Nico Schottelius, LKML,
	linux-btrfs, Alasdair G Kergon

Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
> On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
> > Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
> > > On 07/15/2011 02:20 PM, Chris Mason wrote:
> > > > Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
> > > >> On 07/15/2011 12:34 PM, Chris Mason wrote:
> > > > [ triggering IO retries on failed crc or other checks ]
> > > >
> > > >>> But, maybe the whole btrfs model is backwards for a generic layer.
> > > >>> Instead of sending down ios and testing when they come back, we could
> > > >>> just set a verification function (or stack of them?).
> > > >>>
> > > >>> For metadata, btrfs compares the crc and a few other fields of the
> > > >>> metadata block, so we can easily add a compare function pointer and a
> > > >>> void * to pass in.
> > > >>>
> > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> > > >>> threading pools so saturate all the cpus on the box.  But there's no
> > > >>> reason we can't make that available lower down.
> > > >>>
> > > >>> If we pushed the verification down, the retries could bubble up the
> > > >>> stack instead of the other way around.
> > > >>>
> > > >>> -chris
> > > >> I do like the idea of having the ability to do the verification and retries down
> > > >> the stack where you actually have the most context to figure out what is possible...
> > > >>
> > > >> Why would you need to bubble back up anything other than an error when all
> > > >> retries have failed?
> > > > By bubble up I mean that if you have multiple layers capable of doing
> > > > retries, the lowest levels would retry first.  Basically by the time we
> > > > get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> > > > do to help.
> > > >
> > > > -chris
> > > 
> > > Absolutely sounds like the most sane way to go to me, thanks!
> > > 
> > 
> > It really seemed like a good idea, but I just realized it doesn't work
> > well when parts of the stack transform the data.
> > 
> > Picture dm-crypt on top of raid1.  If raid1 is responsible for the
> > crc retries, there's no way to crc the data because it needs to be
> > decrypted first.
> > 
> > I think the raided dm-crypt config is much more common (and interesting)
> > than multiple layers that can retry for other reasons (raid1 on top of
> > raid10?)
> 
>    Isn't this a case where the transformative mid-layer would replace
> the validation function before passing it down the stack? So btrfs
> hands dm-crypt a checksum function; dm-crypt then stores that function
> for its own purposes and hands off a new function to the DM layer
> below that which decrypts the data and calls the btrfs checksum
> function it stored earlier.

Then we're requiring each transformation layer to have their own crcs,
and if the higher layers have a stronger crc (or other checks), there's
no path to ask the lower layers for other copies.

Here's a concrete example.  In each metadata block, btrfs stores the
fsid and the transid of the transaction that created it.  In the case of
a missed write, we'll read a perfect block from the lower layers.  Any
crcs will be correct and it'll pass through dm-crypt with flying colors.

But, it won't be the right block.  Btrfs will notice this and EIO.  In
the current ask-for-another-mirror config we'll go down and grab the
other copy.

In the stacked validation function model, dm-crypt replaces our
verification functions with something that operates on the encrypted
data, and it won't be able to detect the error or kick down to the
underlying raid1 for another copy.

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 14:24                               ` Chris Mason
@ 2011-07-15 14:47                                   ` Christian Aßfalg
  2011-07-15 14:54                                 ` Hugo Mills
  1 sibling, 0 replies; 39+ messages in thread
From: Christian Aßfalg @ 2011-07-15 14:47 UTC (permalink / raw)
  To: Chris Mason
  Cc: Hugo Mills, Ric Wheeler, NeilBrown, david, Nico Schottelius,
	LKML, linux-btrfs, Alasdair G Kergon

Am Freitag, den 15.07.2011, 10:24 -0400 schrieb Chris Mason:
> Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
> > On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
> > > Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
> > > > On 07/15/2011 02:20 PM, Chris Mason wrote:
> > > > > Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0=
400:
> > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote:
> > > > > [ triggering IO retries on failed crc or other checks ]
> > > > >
> > > > >>> But, maybe the whole btrfs model is backwards for a generic=
 layer.
> > > > >>> Instead of sending down ios and testing when they come back=
, we could
> > > > >>> just set a verification function (or stack of them?).
> > > > >>>
> > > > >>> For metadata, btrfs compares the crc and a few other fields=
 of the
> > > > >>> metadata block, so we can easily add a compare function poi=
nter and a
> > > > >>> void * to pass in.
> > > > >>>
> > > > >>> The problem is the crc can take a lot of CPU, so btrfs kick=
s it off to
> > > > >>> threading pools so saturate all the cpus on the box.  But t=
here's no
> > > > >>> reason we can't make that available lower down.
> > > > >>>
> > > > >>> If we pushed the verification down, the retries could bubbl=
e up the
> > > > >>> stack instead of the other way around.
> > > > >>>
> > > > >>> -chris
> > > > >> I do like the idea of having the ability to do the verificat=
ion and retries down
> > > > >> the stack where you actually have the most context to figure=
 out what is possible...
> > > > >>
> > > > >> Why would you need to bubble back up anything other than an =
error when all
> > > > >> retries have failed?
> > > > > By bubble up I mean that if you have multiple layers capable =
of doing
> > > > > retries, the lowest levels would retry first.  Basically by t=
he time we
> > > > > get an -EIO_ALREADY_RETRIED we know there's nothing that lowe=
r level can
> > > > > do to help.
> > > > >
> > > > > -chris
> > > >=20
> > > > Absolutely sounds like the most sane way to go to me, thanks!
> > > >=20
> > >=20
> > > It really seemed like a good idea, but I just realized it doesn't=
 work
> > > well when parts of the stack transform the data.
> > >=20
> > > Picture dm-crypt on top of raid1.  If raid1 is responsible for th=
e
> > > crc retries, there's no way to crc the data because it needs to b=
e
> > > decrypted first.
> > >=20
> > > I think the raided dm-crypt config is much more common (and inter=
esting)
> > > than multiple layers that can retry for other reasons (raid1 on t=
op of
> > > raid10?)
> >=20
> >    Isn't this a case where the transformative mid-layer would repla=
ce
> > the validation function before passing it down the stack? So btrfs
> > hands dm-crypt a checksum function; dm-crypt then stores that funct=
ion
> > for its own purposes and hands off a new function to the DM layer
> > below that which decrypts the data and calls the btrfs checksum
> > function it stored earlier.
>=20
> Then we're requiring each transformation layer to have their own crcs=
,
> and if the higher layers have a stronger crc (or other checks), there=
's
> no path to ask the lower layers for other copies.
>=20
> Here's a concrete example.  In each metadata block, btrfs stores the
> fsid and the transid of the transaction that created it.  In the case=
 of
> a missed write, we'll read a perfect block from the lower layers.  An=
y
> crcs will be correct and it'll pass through dm-crypt with flying colo=
rs.
>=20
> But, it won't be the right block.  Btrfs will notice this and EIO.  I=
n
> the current ask-for-another-mirror config we'll go down and grab the
> other copy.
>=20
> In the stacked validation function model, dm-crypt replaces our
> verification functions with something that operates on the encrypted
> data, and it won't be able to detect the error or kick down to the
> underlying raid1 for another copy.
>=20
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs=
" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


I think the point is not to replace the crc function in the dm_crypt
case, but to wrap it with an decrypt function which then calls the crc
function. So even if a lower mirror uses the new dm-crypt crc function,
the btrfs crc function still gets called - at the end of the chain.

Regards,
Christian A=C3=9Ffalg

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
@ 2011-07-15 14:47                                   ` Christian Aßfalg
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Aßfalg @ 2011-07-15 14:47 UTC (permalink / raw)
  To: Chris Mason
  Cc: Hugo Mills, Ric Wheeler, NeilBrown, david, Nico Schottelius,
	LKML, linux-btrfs, Alasdair G Kergon

Am Freitag, den 15.07.2011, 10:24 -0400 schrieb Chris Mason:
> Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
> > On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
> > > Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
> > > > On 07/15/2011 02:20 PM, Chris Mason wrote:
> > > > > Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
> > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote:
> > > > > [ triggering IO retries on failed crc or other checks ]
> > > > >
> > > > >>> But, maybe the whole btrfs model is backwards for a generic layer.
> > > > >>> Instead of sending down ios and testing when they come back, we could
> > > > >>> just set a verification function (or stack of them?).
> > > > >>>
> > > > >>> For metadata, btrfs compares the crc and a few other fields of the
> > > > >>> metadata block, so we can easily add a compare function pointer and a
> > > > >>> void * to pass in.
> > > > >>>
> > > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> > > > >>> threading pools so saturate all the cpus on the box.  But there's no
> > > > >>> reason we can't make that available lower down.
> > > > >>>
> > > > >>> If we pushed the verification down, the retries could bubble up the
> > > > >>> stack instead of the other way around.
> > > > >>>
> > > > >>> -chris
> > > > >> I do like the idea of having the ability to do the verification and retries down
> > > > >> the stack where you actually have the most context to figure out what is possible...
> > > > >>
> > > > >> Why would you need to bubble back up anything other than an error when all
> > > > >> retries have failed?
> > > > > By bubble up I mean that if you have multiple layers capable of doing
> > > > > retries, the lowest levels would retry first.  Basically by the time we
> > > > > get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> > > > > do to help.
> > > > >
> > > > > -chris
> > > > 
> > > > Absolutely sounds like the most sane way to go to me, thanks!
> > > > 
> > > 
> > > It really seemed like a good idea, but I just realized it doesn't work
> > > well when parts of the stack transform the data.
> > > 
> > > Picture dm-crypt on top of raid1.  If raid1 is responsible for the
> > > crc retries, there's no way to crc the data because it needs to be
> > > decrypted first.
> > > 
> > > I think the raided dm-crypt config is much more common (and interesting)
> > > than multiple layers that can retry for other reasons (raid1 on top of
> > > raid10?)
> > 
> >    Isn't this a case where the transformative mid-layer would replace
> > the validation function before passing it down the stack? So btrfs
> > hands dm-crypt a checksum function; dm-crypt then stores that function
> > for its own purposes and hands off a new function to the DM layer
> > below that which decrypts the data and calls the btrfs checksum
> > function it stored earlier.
> 
> Then we're requiring each transformation layer to have their own crcs,
> and if the higher layers have a stronger crc (or other checks), there's
> no path to ask the lower layers for other copies.
> 
> Here's a concrete example.  In each metadata block, btrfs stores the
> fsid and the transid of the transaction that created it.  In the case of
> a missed write, we'll read a perfect block from the lower layers.  Any
> crcs will be correct and it'll pass through dm-crypt with flying colors.
> 
> But, it won't be the right block.  Btrfs will notice this and EIO.  In
> the current ask-for-another-mirror config we'll go down and grab the
> other copy.
> 
> In the stacked validation function model, dm-crypt replaces our
> verification functions with something that operates on the encrypted
> data, and it won't be able to detect the error or kick down to the
> underlying raid1 for another copy.
> 
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


I think the point is not to replace the crc function in the dm_crypt
case, but to wrap it with an decrypt function which then calls the crc
function. So even if a lower mirror uses the new dm-crypt crc function,
the btrfs crc function still gets called - at the end of the chain.

Regards,
Christian Aßfalg


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 14:24                               ` Chris Mason
  2011-07-15 14:47                                   ` Christian Aßfalg
@ 2011-07-15 14:54                                 ` Hugo Mills
  2011-07-15 15:12                                   ` Chris Mason
  1 sibling, 1 reply; 39+ messages in thread
From: Hugo Mills @ 2011-07-15 14:54 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ric Wheeler, NeilBrown, david, Nico Schottelius, LKML,
	linux-btrfs, Alasdair G Kergon

[-- Attachment #1: Type: text/plain, Size: 6309 bytes --]

On Fri, Jul 15, 2011 at 10:24:25AM -0400, Chris Mason wrote:
> Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
> > On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
> > > Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
> > > > On 07/15/2011 02:20 PM, Chris Mason wrote:
> > > > > Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
> > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote:
> > > > > [ triggering IO retries on failed crc or other checks ]
> > > > >
> > > > >>> But, maybe the whole btrfs model is backwards for a generic layer.
> > > > >>> Instead of sending down ios and testing when they come back, we could
> > > > >>> just set a verification function (or stack of them?).
> > > > >>>
> > > > >>> For metadata, btrfs compares the crc and a few other fields of the
> > > > >>> metadata block, so we can easily add a compare function pointer and a
> > > > >>> void * to pass in.
> > > > >>>
> > > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> > > > >>> threading pools so saturate all the cpus on the box.  But there's no
> > > > >>> reason we can't make that available lower down.
> > > > >>>
> > > > >>> If we pushed the verification down, the retries could bubble up the
> > > > >>> stack instead of the other way around.
> > > > >>>
> > > > >>> -chris
> > > > >> I do like the idea of having the ability to do the verification and retries down
> > > > >> the stack where you actually have the most context to figure out what is possible...
> > > > >>
> > > > >> Why would you need to bubble back up anything other than an error when all
> > > > >> retries have failed?
> > > > > By bubble up I mean that if you have multiple layers capable of doing
> > > > > retries, the lowest levels would retry first.  Basically by the time we
> > > > > get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> > > > > do to help.
> > > > >
> > > > > -chris
> > > > 
> > > > Absolutely sounds like the most sane way to go to me, thanks!
> > > > 
> > > 
> > > It really seemed like a good idea, but I just realized it doesn't work
> > > well when parts of the stack transform the data.
> > > 
> > > Picture dm-crypt on top of raid1.  If raid1 is responsible for the
> > > crc retries, there's no way to crc the data because it needs to be
> > > decrypted first.
> > > 
> > > I think the raided dm-crypt config is much more common (and interesting)
> > > than multiple layers that can retry for other reasons (raid1 on top of
> > > raid10?)
> > 
> >    Isn't this a case where the transformative mid-layer would replace
> > the validation function before passing it down the stack? So btrfs
> > hands dm-crypt a checksum function; dm-crypt then stores that function
> > for its own purposes and hands off a new function to the DM layer
> > below that which decrypts the data and calls the btrfs checksum
> > function it stored earlier.
> 
> Then we're requiring each transformation layer to have their own crcs,
> and if the higher layers have a stronger crc (or other checks), there's
> no path to ask the lower layers for other copies.
> 
> Here's a concrete example.  In each metadata block, btrfs stores the
> fsid and the transid of the transaction that created it.  In the case of
> a missed write, we'll read a perfect block from the lower layers.  Any
> crcs will be correct and it'll pass through dm-crypt with flying colors.
> 
> But, it won't be the right block.  Btrfs will notice this and EIO.  In
> the current ask-for-another-mirror config we'll go down and grab the
> other copy.
> 
> In the stacked validation function model, dm-crypt replaces our
> verification functions with something that operates on the encrypted
> data, and it won't be able to detect the error or kick down to the
> underlying raid1 for another copy.

   What I'm suggesting is easiest to describe with a functional
language, or mathematical notation, but I'll try without those
anyway...

   A generic validation function is a two-parameter function, taking a
block of data and some layer-dependent state, and returning a boolean.

So, at the btrfs layer, we have something like:

struct btrfs_validate_state {
	u32 cs;
};

bool btrfs_validate(void *state, char *data) {
	return crc32(data) == (btrfs_validate_state*)state->cs;
}

   When reading a specific block, we look up the checksum for that
block, put it in state->cs and pass both our state and the
btrfs_validate function to the lower layer.

   The crypto layer beneath will look something like:

struct crypto_validate_state {
	void *old_state;
	bool (*old_validator)(void *state, char* data);
};

bool crypto_validate(void *state, char *data) {
	char plaintext[4096];
	decrypt(plaintext, data);
	
	return state->old_validator(state->old_state, plaintext);
}

   Then, when a request is received from the upper layer, we can put
the (state, validator) pair into a crypto_validate_state structure,
call that our new state, and pass on the new (state, crypto_validate)
to the layer underneath.

   So, where a transformative layer exists in the stack, it replaces
the validation function with a helper function that does the necessary
transformation, and performs the check.

   The code I've described above would clearly need to be hooked into
the existing data-transformation paths so that we don't replicate the
effort of decrypting or uncompressing good data once we've decided
that it _is_ good.

   Oh, and if there's a layer with its own validation, we can put in a
"firebreak" implementation of the above code, which basically drops
the validator it's passed, and replaces it completely with its own.
(This would look exactly like the btrfs version at the top of my reply
here).

   The standard case for most block-layer implementations will simply
be to pass the (state, validator) pair unchanged to the lower layer,
because the block being checked won't be any different. It's only
where we have layers that change things, or definitely know better
than the top layer that we need to play tricks.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
           --- Sometimes, when I'm alone, I Google myself. ---           

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 14:54                                 ` Hugo Mills
@ 2011-07-15 15:12                                   ` Chris Mason
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Mason @ 2011-07-15 15:12 UTC (permalink / raw)
  To: Hugo Mills
  Cc: Ric Wheeler, NeilBrown, david, Nico Schottelius, LKML,
	linux-btrfs, Alasdair G Kergon

Excerpts from Hugo Mills's message of 2011-07-15 10:54:36 -0400:
> On Fri, Jul 15, 2011 at 10:24:25AM -0400, Chris Mason wrote:
> > Excerpts from Hugo Mills's message of 2011-07-15 10:07:24 -0400:
> > > On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:
> > > > Excerpts from Ric Wheeler's message of 2011-07-15 09:31:37 -0400:
> > > > > On 07/15/2011 02:20 PM, Chris Mason wrote:
> > > > > > Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
> > > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote:
> > > > > > [ triggering IO retries on failed crc or other checks ]
> > > > > >
> > > > > >>> But, maybe the whole btrfs model is backwards for a generic layer.
> > > > > >>> Instead of sending down ios and testing when they come back, we could
> > > > > >>> just set a verification function (or stack of them?).
> > > > > >>>
> > > > > >>> For metadata, btrfs compares the crc and a few other fields of the
> > > > > >>> metadata block, so we can easily add a compare function pointer and a
> > > > > >>> void * to pass in.
> > > > > >>>
> > > > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to
> > > > > >>> threading pools so saturate all the cpus on the box.  But there's no
> > > > > >>> reason we can't make that available lower down.
> > > > > >>>
> > > > > >>> If we pushed the verification down, the retries could bubble up the
> > > > > >>> stack instead of the other way around.
> > > > > >>>
> > > > > >>> -chris
> > > > > >> I do like the idea of having the ability to do the verification and retries down
> > > > > >> the stack where you actually have the most context to figure out what is possible...
> > > > > >>
> > > > > >> Why would you need to bubble back up anything other than an error when all
> > > > > >> retries have failed?
> > > > > > By bubble up I mean that if you have multiple layers capable of doing
> > > > > > retries, the lowest levels would retry first.  Basically by the time we
> > > > > > get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> > > > > > do to help.
> > > > > >
> > > > > > -chris
> > > > > 
> > > > > Absolutely sounds like the most sane way to go to me, thanks!
> > > > > 
> > > > 
> > > > It really seemed like a good idea, but I just realized it doesn't work
> > > > well when parts of the stack transform the data.
> > > > 
> > > > Picture dm-crypt on top of raid1.  If raid1 is responsible for the
> > > > crc retries, there's no way to crc the data because it needs to be
> > > > decrypted first.
> > > > 
> > > > I think the raided dm-crypt config is much more common (and interesting)
> > > > than multiple layers that can retry for other reasons (raid1 on top of
> > > > raid10?)
> > > 
> > >    Isn't this a case where the transformative mid-layer would replace
> > > the validation function before passing it down the stack? So btrfs
> > > hands dm-crypt a checksum function; dm-crypt then stores that function
> > > for its own purposes and hands off a new function to the DM layer
> > > below that which decrypts the data and calls the btrfs checksum
> > > function it stored earlier.
> > 
> > Then we're requiring each transformation layer to have their own crcs,
> > and if the higher layers have a stronger crc (or other checks), there's
> > no path to ask the lower layers for other copies.
> > 
> > Here's a concrete example.  In each metadata block, btrfs stores the
> > fsid and the transid of the transaction that created it.  In the case of
> > a missed write, we'll read a perfect block from the lower layers.  Any
> > crcs will be correct and it'll pass through dm-crypt with flying colors.
> > 
> > But, it won't be the right block.  Btrfs will notice this and EIO.  In
> > the current ask-for-another-mirror config we'll go down and grab the
> > other copy.
> > 
> > In the stacked validation function model, dm-crypt replaces our
> > verification functions with something that operates on the encrypted
> > data, and it won't be able to detect the error or kick down to the
> > underlying raid1 for another copy.
> 
>    What I'm suggesting is easiest to describe with a functional
> language, or mathematical notation, but I'll try without those
> anyway...
> 
>    A generic validation function is a two-parameter function, taking a
> block of data and some layer-dependent state, and returning a boolean.
> 
> So, at the btrfs layer, we have something like:
> 
> struct btrfs_validate_state {
>     u32 cs;
> };
> 
> bool btrfs_validate(void *state, char *data) {
>     return crc32(data) == (btrfs_validate_state*)state->cs;
> }
> 
>    When reading a specific block, we look up the checksum for that
> block, put it in state->cs and pass both our state and the
> btrfs_validate function to the lower layer.
> 
>    The crypto layer beneath will look something like:
> 
> struct crypto_validate_state {
>     void *old_state;
>     bool (*old_validator)(void *state, char* data);
> };
> 
> bool crypto_validate(void *state, char *data) {
>     char plaintext[4096];
>     decrypt(plaintext, data);
>     
>     return state->old_validator(state->old_state, plaintext);
> }
> 
>    Then, when a request is received from the upper layer, we can put
> the (state, validator) pair into a crypto_validate_state structure,
> call that our new state, and pass on the new (state, crypto_validate)
> to the layer underneath.
> 
>    So, where a transformative layer exists in the stack, it replaces
> the validation function with a helper function that does the necessary
> transformation, and performs the check.
> 
>    The code I've described above would clearly need to be hooked into
> the existing data-transformation paths so that we don't replicate the
> effort of decrypting or uncompressing good data once we've decided
> that it _is_ good.
> 
>    Oh, and if there's a layer with its own validation, we can put in a
> "firebreak" implementation of the above code, which basically drops
> the validator it's passed, and replaces it completely with its own.
> (This would look exactly like the btrfs version at the top of my reply
> here).
> 
>    The standard case for most block-layer implementations will simply
> be to pass the (state, validator) pair unchanged to the lower layer,
> because the block being checked won't be any different. It's only
> where we have layers that change things, or definitely know better
> than the top layer that we need to play tricks.

It's not that I disagree, but what you're describing is the entire
end_io chain in the current code.  Replace validator with end_io and we
have the current configuration, since everyone ends up stacking the
end_io handlers now.

The difference is that in the validator config, we'll be running the
validation stack at the lower layer in the IO stack.  This sounds like a
great idea because we can go ahead and retry at the lower layer or move
up the stack and retry up higher.

But, lets say we're doing striping over raid1.  Now the lower layer IO
completion (and stack of validations) may have to wait for multiple
stripes to end their IO.

I think the complexity gets out of hand really quickly on this.

-chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15  6:33                 ` NeilBrown
  2011-07-15 11:34                   ` Chris Mason
@ 2011-07-15 16:03                   ` david
  1 sibling, 0 replies; 39+ messages in thread
From: david @ 2011-07-15 16:03 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chris Mason, Ric Wheeler, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

On Fri, 15 Jul 2011, NeilBrown wrote:

> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote:
>
>> On Thu, 14 Jul 2011, Chris Mason wrote:
>>
>>> Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
>>>> On 07/14/2011 07:38 AM, NeilBrown wrote:
>>>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
>>>>>
>>
>> I would suggest that each layer take the value it's give, do an integer
>> division by the number of values that layer supports, using the modulo
>> value for that layer and pass the rest of the result down to the next
>> layer.
>
> I, on the other hand, would suggest that each layer be given the freedom, and
> the responsibility, to do whatever it thinks is most appropriate.
>
> This would probably involved checking the lower levels to see how many
> strategies each has, and doing some appropriate arithmetic depending on how
> it combines those devices.

nothing is wrong with doing something smarter than what I was proposing, I 
was just intending to propose a default rule that would work (just about) 
everywhere. I was specifically trying to counter the throught hat the 
method number would/should be contant as it's passed down the layers.

The proposal had been made to have each layer do the retries for the layer 
below it, that would avoid this stacking problem, but at the cost of 
potentially doing a LOT of retries without the upper level being able to 
limit it.

David Lang

> One problem here is the assumption that the lower levels don't change, and we
> know that not to be the case.
> However this is already a problem.  It is entirely possible that the request
> size parameters of devices can change at any moment, even while a request is
> in-flight ... though we try to avoid that or work around it.
>
> The sort of dependencies that we see forming here would probably just make
> the problem worse.
>
> Not sure what to do about it though.... maybe just hope it isn't a problem.
>
> NeilBrown
>
>
>>
>> this works for just trying values until you get the error that tells you
>> there are no more options.
>>
>>
>> things can get very 'intersesting' if the different options below you have
>> different number of options (say a raid1 across a raid5 and a single disk)
>> but I can't think of any good way to figure this out and assemble a
>> sane order without doing an exaustive search down to find the number of
>> options for each layer (and since this can be different for different
>> blocks, you would have to do this each time you invoked this option)
>>
>> David Lang
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 13:20                       ` Chris Mason
  2011-07-15 13:31                         ` Ric Wheeler
@ 2011-07-15 16:23                         ` david
  2011-07-15 16:51                           ` Ric Wheeler
  1 sibling, 1 reply; 39+ messages in thread
From: david @ 2011-07-15 16:23 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ric Wheeler, NeilBrown, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

On Fri, 15 Jul 2011, Chris Mason wrote:

> Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
>> On 07/15/2011 12:34 PM, Chris Mason wrote:
>
> By bubble up I mean that if you have multiple layers capable of doing
> retries, the lowest levels would retry first.  Basically by the time we
> get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
> do to help.

the problem with doing this is that it can end up stalling the box for 
significant amounts of time while all the retries happen.

we already see this happening today where a disk read failure is retried 
multiple times by the disk, multiple times by the raid controller, and 
then multiple times by Linux, resulting is multi-minute stalls when you 
hit a disk error in some cases.

having the lower layers do the retries automatically runs the risk of 
making this even worse.

This needs to be able to be throttled by some layer that can see the 
entire picture (either by cutting off the retries after a number, after 
some time, or by spacing out the retries to allow other queries to get in 
and let the box do useful work in the meantime)

David Lang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 16:23                         ` david
@ 2011-07-15 16:51                           ` Ric Wheeler
  2011-07-15 17:01                             ` david
  0 siblings, 1 reply; 39+ messages in thread
From: Ric Wheeler @ 2011-07-15 16:51 UTC (permalink / raw)
  To: david
  Cc: Chris Mason, NeilBrown, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

On 07/15/2011 05:23 PM, david@lang.hm wrote:
> On Fri, 15 Jul 2011, Chris Mason wrote:
>
>> Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
>>> On 07/15/2011 12:34 PM, Chris Mason wrote:
>>
>> By bubble up I mean that if you have multiple layers capable of doing
>> retries, the lowest levels would retry first.  Basically by the time we
>> get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
>> do to help.
>
> the problem with doing this is that it can end up stalling the box for 
> significant amounts of time while all the retries happen.
>
> we already see this happening today where a disk read failure is retried 
> multiple times by the disk, multiple times by the raid controller, and then 
> multiple times by Linux, resulting is multi-minute stalls when you hit a disk 
> error in some cases.
>
> having the lower layers do the retries automatically runs the risk of making 
> this even worse.
>
> This needs to be able to be throttled by some layer that can see the entire 
> picture (either by cutting off the retries after a number, after some time, or 
> by spacing out the retries to allow other queries to get in and let the box do 
> useful work in the meantime)
>
> David Lang
>

That should not be an issue - we have a "fast fail" path for IO that should 
avoid retrying just for those reasons (i.e., for multi-path or when recovering a 
flaky drive).

This is not a scheme for unbounded retries. If you have a 3 disk mirror in 
RAID1, you would read the data no more than 2 extra times and almost never more 
than once.  That should be *much* faster than the multiple-second long timeout 
you see when waiting for SCSI timeout to fire, etc.

Ric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 16:51                           ` Ric Wheeler
@ 2011-07-15 17:01                             ` david
  2011-07-15 17:23                               ` Ric Wheeler
  0 siblings, 1 reply; 39+ messages in thread
From: david @ 2011-07-15 17:01 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, NeilBrown, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

On Fri, 15 Jul 2011, Ric Wheeler wrote:

> On 07/15/2011 05:23 PM, david@lang.hm wrote:
>> On Fri, 15 Jul 2011, Chris Mason wrote:
>> 
>>> Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
>>>> On 07/15/2011 12:34 PM, Chris Mason wrote:
>>> 
>>> By bubble up I mean that if you have multiple layers capable of doing
>>> retries, the lowest levels would retry first.  Basically by the time we
>>> get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
>>> do to help.
>> 
>> the problem with doing this is that it can end up stalling the box for 
>> significant amounts of time while all the retries happen.
>> 
>> we already see this happening today where a disk read failure is retried 
>> multiple times by the disk, multiple times by the raid controller, and then 
>> multiple times by Linux, resulting is multi-minute stalls when you hit a 
>> disk error in some cases.
>> 
>> having the lower layers do the retries automatically runs the risk of 
>> making this even worse.
>> 
>> This needs to be able to be throttled by some layer that can see the entire 
>> picture (either by cutting off the retries after a number, after some time, 
>> or by spacing out the retries to allow other queries to get in and let the 
>> box do useful work in the meantime)
>> 
>> David Lang
>> 
>
> That should not be an issue - we have a "fast fail" path for IO that should 
> avoid retrying just for those reasons (i.e., for multi-path or when 
> recovering a flaky drive).
>
> This is not a scheme for unbounded retries. If you have a 3 disk mirror in 
> RAID1, you would read the data no more than 2 extra times and almost never 
> more than once.  That should be *much* faster than the multiple-second long 
> timeout you see when waiting for SCSI timeout to fire, etc.

this issue is when you stack things.

what if you have a 3 piece raid 1 on top of a 4 piece raid 6?

so you have 3 raid1 retries * N raid 6 retries. depending on the order 
that you do the retries in, and how long it takes that try to fail, this 
could start to take significant amounts of time.

if you do a retry on the lower level first, the raid 6 could try several 
different ways to combine the drives to get the valid data (disks 1,2  2,3 
3,4 1,3 1,4 2,4 1,2,3 1,2,4 1,3,4 2,3,4) add more disks and it gets worse 
fast.

add more layers and you multiple the number of possible retries.

my guess is that changing to a different method at the upper level is 
going to avoid the problem area faster then doing so at a lower level 
(because there is less hardware in common with the method that just gave 
the wrong answer)

David Lang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Mis-Design of Btrfs?
  2011-07-15 17:01                             ` david
@ 2011-07-15 17:23                               ` Ric Wheeler
  0 siblings, 0 replies; 39+ messages in thread
From: Ric Wheeler @ 2011-07-15 17:23 UTC (permalink / raw)
  To: david
  Cc: Chris Mason, NeilBrown, Nico Schottelius, LKML, linux-btrfs,
	Alasdair G Kergon

On 07/15/2011 06:01 PM, david@lang.hm wrote:
> On Fri, 15 Jul 2011, Ric Wheeler wrote:
>
>> On 07/15/2011 05:23 PM, david@lang.hm wrote:
>>> On Fri, 15 Jul 2011, Chris Mason wrote:
>>>
>>>> Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
>>>>> On 07/15/2011 12:34 PM, Chris Mason wrote:
>>>>
>>>> By bubble up I mean that if you have multiple layers capable of doing
>>>> retries, the lowest levels would retry first.  Basically by the time we
>>>> get an -EIO_ALREADY_RETRIED we know there's nothing that lower level can
>>>> do to help.
>>>
>>> the problem with doing this is that it can end up stalling the box for 
>>> significant amounts of time while all the retries happen.
>>>
>>> we already see this happening today where a disk read failure is retried 
>>> multiple times by the disk, multiple times by the raid controller, and then 
>>> multiple times by Linux, resulting is multi-minute stalls when you hit a 
>>> disk error in some cases.
>>>
>>> having the lower layers do the retries automatically runs the risk of making 
>>> this even worse.
>>>
>>> This needs to be able to be throttled by some layer that can see the entire 
>>> picture (either by cutting off the retries after a number, after some time, 
>>> or by spacing out the retries to allow other queries to get in and let the 
>>> box do useful work in the meantime)
>>>
>>> David Lang
>>>
>>
>> That should not be an issue - we have a "fast fail" path for IO that should 
>> avoid retrying just for those reasons (i.e., for multi-path or when 
>> recovering a flaky drive).
>>
>> This is not a scheme for unbounded retries. If you have a 3 disk mirror in 
>> RAID1, you would read the data no more than 2 extra times and almost never 
>> more than once.  That should be *much* faster than the multiple-second long 
>> timeout you see when waiting for SCSI timeout to fire, etc.
>
> this issue is when you stack things.
>
> what if you have a 3 piece raid 1 on top of a 4 piece raid 6?
>
> so you have 3 raid1 retries * N raid 6 retries. depending on the order that 
> you do the retries in, and how long it takes that try to fail, this could 
> start to take significant amounts of time.
>
> if you do a retry on the lower level first, the raid 6 could try several 
> different ways to combine the drives to get the valid data (disks 1,2  2,3 3,4 
> 1,3 1,4 2,4 1,2,3 1,2,4 1,3,4 2,3,4) add more disks and it gets worse fast.
>
> add more layers and you multiple the number of possible retries.
>
> my guess is that changing to a different method at the upper level is going to 
> avoid the problem area faster then doing so at a lower level (because there is 
> less hardware in common with the method that just gave the wrong answer)
>
> David Lang

At some point, the question is why would you do that?  Two parity drives for 
each 4 drive RAID-6 set and then mirror that 3 times (12 drives in total, only 2 
data drives)? Better off doing a 4 way mirror :)

I think that you are still missing the point.

If the lowest layer can repair the data, it would return the first validated 
answer. The major time sync would be the IO to read the sectors from each of the 
4 drives in your example, not computing the various combination of parity or 
validating (all done in memory).

If the RAID-6 layer failed, you would do the same for each of the mirrors which 
would read the IO from each of their RAID-6 low levels exactly once as well (and 
then verify in memory).

Ric


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2011-07-15 17:23 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-23 10:53 Mis-Design of Btrfs? Nico Schottelius
2011-06-27  6:46 ` NeilBrown
2011-06-29  9:29   ` Ric Wheeler
2011-06-29 10:47     ` A. James Lewis
2011-07-14 20:47       ` Erik Jensen
2011-07-14  5:56     ` NeilBrown
2011-07-14  6:02       ` Ric Wheeler
2011-07-14  6:38         ` NeilBrown
2011-07-14  6:57           ` Ric Wheeler
2011-07-15  2:32             ` Chris Mason
2011-07-15  4:58               ` david
2011-07-15  6:33                 ` NeilBrown
2011-07-15 11:34                   ` Chris Mason
2011-07-15 12:58                     ` Ric Wheeler
2011-07-15 13:20                       ` Chris Mason
2011-07-15 13:31                         ` Ric Wheeler
2011-07-15 14:00                           ` Chris Mason
2011-07-15 14:07                             ` Hugo Mills
2011-07-15 14:24                               ` Chris Mason
2011-07-15 14:47                                 ` Christian Aßfalg
2011-07-15 14:47                                   ` Christian Aßfalg
2011-07-15 14:54                                 ` Hugo Mills
2011-07-15 15:12                                   ` Chris Mason
2011-07-15 16:23                         ` david
2011-07-15 16:51                           ` Ric Wheeler
2011-07-15 17:01                             ` david
2011-07-15 17:23                               ` Ric Wheeler
2011-07-15 13:55                       ` Mike Snitzer
2011-07-15 13:55                         ` Mike Snitzer
2011-07-15 16:03                   ` david
2011-07-14  9:37           ` Jan Schmidt
2011-07-14  9:55             ` NeilBrown
2011-07-14 16:27           ` Goffredo Baroncelli
2011-07-14 16:55           ` Alasdair G Kergon
2011-07-14 19:50             ` John Stoffel
2011-07-14 20:48               ` david
2011-07-14 20:50               ` Erik Jensen
2011-07-14 16:55           ` Alasdair G Kergon
2011-07-14  6:59         ` Arne Jansen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.