linux-fsdevel.vger.kernel.org archive mirror
* Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
@ 2017-11-16  2:18 Qu Wenruo
  2017-11-16  6:54 ` Nikolay Borisov
  2017-11-17  1:26 ` Andreas Dilger
  0 siblings, 2 replies; 20+ messages in thread
From: Qu Wenruo @ 2017-11-16  2:18 UTC (permalink / raw)
  To: linux-block, dm-devel, linux-fsdevel; +Cc: linux-btrfs



Hi all,

[Background]
Recently I've been considering the possibility of using filesystem
checksums to enhance device-mapper RAID.

The idea behind it is quite simple: most modern filesystems checksum
their metadata, and some (btrfs) even checksum data.

For btrfs RAID1/10 (ignore RAID5/6 for now), at read time btrfs can use
the checksum to determine which copy is correct, so it can return the
correct data even if one copy gets corrupted.

[Objective]
The final objective is to allow device mapper to do the checksum
verification (and repair if possible).

For verification alone, this is not much different from the endio hook
method currently used by most filesystems.
However, if we could move the repair part out of the filesystem (only
btrfs supports repair so far), it would benefit all filesystems.
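
(For reference, the endio-hook pattern looks roughly like the sketch
below; only the bio fields are real kernel interfaces, the fs_* helper
names are hypothetical and field names vary by kernel version:)

static void fs_read_endio(struct bio *bio)
{
        /* Runs after the lower layers have completed the read.  The fs
         * can verify its checksum here, but by this point the RAID
         * layer has already picked one copy - there is no way to ask
         * it for another one. */
        if (!bio->bi_status && !fs_verify_csum(bio))
                bio->bi_status = BLK_STS_IOERR;

        fs_finish_read(bio);
}

static void fs_submit_read(struct bio *bio)
{
        bio->bi_end_io = fs_read_endio;   /* per-bio completion hook */
        submit_bio(bio);
}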

[What we have]
The nearest infrastructure I have found in the kernel is bio_integrity_payload.

However, it is bound to the device, as it's designed to support the
SCSI/SATA integrity protocols.
For this use case the binding is to the filesystem instead: the fs (or a
higher-layer dm device) is the source of the integrity data, and the
device (dm-raid) only does the verification and possible repair.
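
(For comparison, attaching integrity data today looks roughly like the
sketch below; error handling is simplified and details vary by kernel
version.  The point is that the format of the integrity buffer is
dictated by the device's integrity profile, e.g. T10 PI, rather than by
the fs:)

#include <linux/bio.h>

static void attach_integrity(struct bio *bio, struct page *csum_page,
                             unsigned int len)
{
        struct bio_integrity_payload *bip;

        bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
        if (IS_ERR(bip))
                return;         /* allocation failed */

        /* csum_page holds protection information in the format the
         * underlying device expects, one tuple per sector. */
        bio_integrity_add_page(bio, csum_page, len, 0);
}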

I'm not sure whether it is a good idea to reuse (or abuse)
bio_integrity_payload for this purpose.

Should we introduce new infrastructure, or enhance the existing
bio_integrity_payload?

(Is this a valid idea, or just another crazy dream?)

Thanks,
Qu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16  2:18 Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6? Qu Wenruo
@ 2017-11-16  6:54 ` Nikolay Borisov
  2017-11-16  7:38   ` Qu Wenruo
  2017-11-17  1:26 ` Andreas Dilger
  1 sibling, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2017-11-16  6:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-block, dm-devel, linux-fsdevel; +Cc: linux-btrfs



On 16.11.2017 04:18, Qu Wenruo wrote:
> Hi all,
> 
> [Background]
> Recently I'm considering the possibility to use checksum from filesystem
> to enhance device-mapper raid.
> 
> The idea behind it is quite simple, since most modern filesystems have
> checksum for their metadata, and even some (btrfs) have checksum for data.
> 
> And for btrfs RAID1/10 (just ignore the RAID5/6 for now), at read time
> it can use the checksum to determine which copy is correct so it can
> return the correct data even one copy get corrupted.
> 
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
> 
> If only for verification, it's not much different from current endio
> hook method used by most of the fs.
> However if we can move the repair part from filesystem (well, only btrfs
> supports it yet), it would benefit all fs.
> 
> [What we have]
> The nearest infrastructure I found in kernel is bio_integrity_payload.
> 
> However I found it's bounded to device, as it's designed to support
> SCSI/SATA integrity protocol.
> While for such use case, it's more bounded to filesystem, as fs (or
> higher layer dm device) is the source of integrity data, and device
> (dm-raid) only do the verification and possible repair.
> 
> I'm not sure if this is a good idea to reuse or abuse
> bio_integrity_payload for this purpose.
> 
> Should we use some new infrastructure or enhance existing
> bio_integrity_payload?
> 
> (Or is this a valid idea or just another crazy dream?)
> 

This sounds good in principle; however, I think there is one crucial
point which needs to be considered:

All filesystems with checksums store those checksums in some specific
way, and when they fetch data from disk they also know how to acquire
the respective checksum. What you suggest might be doable, but it will
require the lower layers (dm) to be aware of how to acquire the
specific checksum for some data. I don't think such infrastructure
exists at this point, and frankly I cannot even envision how it would
work elegantly. Sure, you can create a dm-checksum target (which I
believe dm-verity is very similar to) that stores checksums alongside
the data, but at that point the fs is really out of the picture.


> Thanks,
> Qu
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16  6:54 ` Nikolay Borisov
@ 2017-11-16  7:38   ` Qu Wenruo
  2017-11-16  7:42     ` Nikolay Borisov
  0 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2017-11-16  7:38 UTC (permalink / raw)
  To: Nikolay Borisov, linux-block, dm-devel, linux-fsdevel; +Cc: linux-btrfs





On 2017-11-16 14:54, Nikolay Borisov wrote:
> 
> 
> On 16.11.2017 04:18, Qu Wenruo wrote:
>> Hi all,
>>
>> [Background]
>> Recently I'm considering the possibility to use checksum from filesystem
>> to enhance device-mapper raid.
>>
>> The idea behind it is quite simple, since most modern filesystems have
>> checksum for their metadata, and even some (btrfs) have checksum for data.
>>
>> And for btrfs RAID1/10 (just ignore the RAID5/6 for now), at read time
>> it can use the checksum to determine which copy is correct so it can
>> return the correct data even one copy get corrupted.
>>
>> [Objective]
>> The final objective is to allow device mapper to do the checksum
>> verification (and repair if possible).
>>
>> If only for verification, it's not much different from current endio
>> hook method used by most of the fs.
>> However if we can move the repair part from filesystem (well, only btrfs
>> supports it yet), it would benefit all fs.
>>
>> [What we have]
>> The nearest infrastructure I found in kernel is bio_integrity_payload.
>>
>> However I found it's bounded to device, as it's designed to support
>> SCSI/SATA integrity protocol.
>> While for such use case, it's more bounded to filesystem, as fs (or
>> higher layer dm device) is the source of integrity data, and device
>> (dm-raid) only do the verification and possible repair.
>>
>> I'm not sure if this is a good idea to reuse or abuse
>> bio_integrity_payload for this purpose.
>>
>> Should we use some new infrastructure or enhance existing
>> bio_integrity_payload?
>>
>> (Or is this a valid idea or just another crazy dream?)
>>
> 
> This sounds good in principle, however I think there is one crucial
> point which needs to be considered:
> 
> All fs with checksums store those checksums in some specific way, then
> when they fetch data from disk they they also know how to acquire the
> respective checksum.

Just like the integrity payload, we generate the READ bio with a
checksum hook function and checksum data attached.

So for a data read, we read the checksum first, attach it to the data
READ bio, then submit it.

And for a metadata read, in most cases the checksum is integrated into
the metadata header, as we do in btrfs.

In that case we attach empty checksum data to the bio, but use a
metadata-specific hook function to handle it.
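
As a rough sketch, such a hook could look like this (the names are
purely hypothetical - this is not an existing kernel interface):

struct bio_verify_hook {
        /* Returns 0 if the just-read data matches, negative errno if
         * not.  Provided by the fs that built the bio. */
        int (*verify)(struct bio *bio, void *csum_data);

        /* For a data read: csums fetched beforehand (e.g. from the
         * btrfs csum tree).  For a metadata read: NULL, and ->verify()
         * parses the csum out of the metadata block header itself. */
        void *csum_data;
};

A dm target that supports the hook calls ->verify() after each read
attempt; targets that don't support it just ignore it or pass it down.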

> What you suggest might be doable but it will
> require lower layers (dm) be aware of how to acquire the specific
> checksum for some data.

In the above case, dm only needs to call the verification hook function.
If verification passes, good.
If not, try another copy if we have one.

In that case, I don't think the dm layer needs any extra interface to
communicate with the higher layer.

Thanks,
Qu

> I don't think at this point there is such infra
> and frankly I cannot even envision how it will work elegantly. Sure you
> can create a dm-checksum target (which I believe dm-verity is very
> similar to) that stores checksums alongside data but at this point the
> fs is really out of the picture.
> 
> 
>> Thanks,
>> Qu
>>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16  7:38   ` Qu Wenruo
@ 2017-11-16  7:42     ` Nikolay Borisov
  2017-11-16  8:08       ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2017-11-16  7:42 UTC (permalink / raw)
  To: Qu Wenruo, linux-block, dm-devel, linux-fsdevel; +Cc: linux-btrfs



On 16.11.2017 09:38, Qu Wenruo wrote:
> 
> 
> On 2017年11月16日 14:54, Nikolay Borisov wrote:
>>
>>
>> On 16.11.2017 04:18, Qu Wenruo wrote:
>>> Hi all,
>>>
>>> [Background]
>>> Recently I'm considering the possibility to use checksum from filesystem
>>> to enhance device-mapper raid.
>>>
>>> The idea behind it is quite simple, since most modern filesystems have
>>> checksum for their metadata, and even some (btrfs) have checksum for data.
>>>
>>> And for btrfs RAID1/10 (just ignore the RAID5/6 for now), at read time
>>> it can use the checksum to determine which copy is correct so it can
>>> return the correct data even one copy get corrupted.
>>>
>>> [Objective]
>>> The final objective is to allow device mapper to do the checksum
>>> verification (and repair if possible).
>>>
>>> If only for verification, it's not much different from current endio
>>> hook method used by most of the fs.
>>> However if we can move the repair part from filesystem (well, only btrfs
>>> supports it yet), it would benefit all fs.
>>>
>>> [What we have]
>>> The nearest infrastructure I found in kernel is bio_integrity_payload.
>>>
>>> However I found it's bounded to device, as it's designed to support
>>> SCSI/SATA integrity protocol.
>>> While for such use case, it's more bounded to filesystem, as fs (or
>>> higher layer dm device) is the source of integrity data, and device
>>> (dm-raid) only do the verification and possible repair.
>>>
>>> I'm not sure if this is a good idea to reuse or abuse
>>> bio_integrity_payload for this purpose.
>>>
>>> Should we use some new infrastructure or enhance existing
>>> bio_integrity_payload?
>>>
>>> (Or is this a valid idea or just another crazy dream?)
>>>
>>
>> This sounds good in principle, however I think there is one crucial
>> point which needs to be considered:
>>
>> All fs with checksums store those checksums in some specific way, then
>> when they fetch data from disk they they also know how to acquire the
>> respective checksum.
> 
> Just like integrity payload, we generate READ bio attached with checksum
> hook function and checksum data.

So how is this checksum data acquired in the first place?

> 
> So for data read, we read checksum first and attach it to data READ bio,
> then submit it.
> 
> And for metadata read, in most case the checksum is integrated into
> metadata header, like what we did in btrfs.
> 
> In that case we attach empty checksum data to bio, but use metadata
> specific function hook to handle it.
> 
>> What you suggest might be doable but it will
>> require lower layers (dm) be aware of how to acquire the specific
>> checksum for some data.
> 
> In above case, dm only needs to call the verification hook function.
> If verification passed, that's good.
> If not, try other copy if we have.
> 
> In this case, I don't think dm layer needs any extra interface to
> communicate with higher layer.


Well, that verification function is the interface I meant; you are
essentially communicating the checksum out of band (notwithstanding the
metadata case, since you said the checksum is in the actual metadata
header).

In the end, which problem are you trying to solve: allowing for a
generic checksumming layer which filesystems may use if they decide to?

> 
> Thanks,
> Qu
> 
>> I don't think at this point there is such infra
>> and frankly I cannot even envision how it will work elegantly. Sure you
>> can create a dm-checksum target (which I believe dm-verity is very
>> similar to) that stores checksums alongside data but at this point the
>> fs is really out of the picture.
>>
>>
>>> Thanks,
>>> Qu
>>>
>>
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16  7:42     ` Nikolay Borisov
@ 2017-11-16  8:08       ` Qu Wenruo
  2017-11-16  9:43         ` Zdenek Kabelac
  0 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2017-11-16  8:08 UTC (permalink / raw)
  To: Nikolay Borisov, linux-block, dm-devel, linux-fsdevel; +Cc: linux-btrfs





On 2017-11-16 15:42, Nikolay Borisov wrote:
> 
> 
> On 16.11.2017 09:38, Qu Wenruo wrote:
>>
>>
>> On 2017年11月16日 14:54, Nikolay Borisov wrote:
>>>
>>>
>>> On 16.11.2017 04:18, Qu Wenruo wrote:
>>>> Hi all,
>>>>
>>>> [Background]
>>>> Recently I'm considering the possibility to use checksum from filesystem
>>>> to enhance device-mapper raid.
>>>>
>>>> The idea behind it is quite simple, since most modern filesystems have
>>>> checksum for their metadata, and even some (btrfs) have checksum for data.
>>>>
>>>> And for btrfs RAID1/10 (just ignore the RAID5/6 for now), at read time
>>>> it can use the checksum to determine which copy is correct so it can
>>>> return the correct data even one copy get corrupted.
>>>>
>>>> [Objective]
>>>> The final objective is to allow device mapper to do the checksum
>>>> verification (and repair if possible).
>>>>
>>>> If only for verification, it's not much different from current endio
>>>> hook method used by most of the fs.
>>>> However if we can move the repair part from filesystem (well, only btrfs
>>>> supports it yet), it would benefit all fs.
>>>>
>>>> [What we have]
>>>> The nearest infrastructure I found in kernel is bio_integrity_payload.
>>>>
>>>> However I found it's bounded to device, as it's designed to support
>>>> SCSI/SATA integrity protocol.
>>>> While for such use case, it's more bounded to filesystem, as fs (or
>>>> higher layer dm device) is the source of integrity data, and device
>>>> (dm-raid) only do the verification and possible repair.
>>>>
>>>> I'm not sure if this is a good idea to reuse or abuse
>>>> bio_integrity_payload for this purpose.
>>>>
>>>> Should we use some new infrastructure or enhance existing
>>>> bio_integrity_payload?
>>>>
>>>> (Or is this a valid idea or just another crazy dream?)
>>>>
>>>
>>> This sounds good in principle, however I think there is one crucial
>>> point which needs to be considered:
>>>
>>> All fs with checksums store those checksums in some specific way, then
>>> when they fetch data from disk they they also know how to acquire the
>>> respective checksum.
>>
>> Just like integrity payload, we generate READ bio attached with checksum
>> hook function and checksum data.
> 
> So how is this checksum data acquired in the first place?

In the btrfs case, through a metadata read bio, since btrfs stores data
csums in its csum tree, as metadata.

That is, pass a READ bio with a metadata-specific verification function
and empty verification data.

> 
>>
>> So for data read, we read checksum first and attach it to data READ bio,
>> then submit it.
>>
>> And for metadata read, in most case the checksum is integrated into
>> metadata header, like what we did in btrfs.
>>
>> In that case we attach empty checksum data to bio, but use metadata
>> specific function hook to handle it.
>>
>>> What you suggest might be doable but it will
>>> require lower layers (dm) be aware of how to acquire the specific
>>> checksum for some data.
>>
>> In above case, dm only needs to call the verification hook function.
>> If verification passed, that's good.
>> If not, try other copy if we have.
>>
>> In this case, I don't think dm layer needs any extra interface to
>> communicate with higher layer.
> 
> 
> Well that verification function is the interface I meant, you are
> communicating the checksum out of band essentially (notwithstanding the
> metadata case, since you said checksum is in the actual metadata header)
> 
> In the end - which problem are you trying to solve, allow for a generic
> checksumming layer which filesystems may use if they decide to ?

To make it clear: the goal is to allow the device-mapper layer to make
use of filesystem checksums (if the fs has them) when there are
multiple copies.

One problem of current dm-raid1/10 (and possibly raid5/6) is that they
have no ability to know which copy is correct.
They can only handle a device disappearing.

Btrfs handles it by verifying data/metadata checksums.
Since xfs/ext4 also have checksums for their metadata, why not allow
device mapper to use those checksums to pick the correct copy?

The mechanism is *NOT* a generic checksum layer.
How the csum is stored is determined by the fs; the point is just to
let the device-mapper layer be aware of it and make a clever decision.

Moreover, this only affects READ bios; WRITE bios are not affected at all.
Csum calculation and storage are all handled by the filesystem, so the
device-mapper layer doesn't need to get involved there.

And of course, btrfs can reuse this facility to do something bigger, but
that's another story.

Thanks,
Qu

> 
>>
>> Thanks,
>> Qu
>>
>>> I don't think at this point there is such infra
>>> and frankly I cannot even envision how it will work elegantly. Sure you
>>> can create a dm-checksum target (which I believe dm-verity is very
>>> similar to) that stores checksums alongside data but at this point the
>>> fs is really out of the picture.
>>>
>>>
>>>> Thanks,
>>>> Qu
>>>>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16  8:08       ` Qu Wenruo
@ 2017-11-16  9:43         ` Zdenek Kabelac
  2017-11-16 10:04           ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: Zdenek Kabelac @ 2017-11-16  9:43 UTC (permalink / raw)
  To: Qu Wenruo, Nikolay Borisov, linux-block, dm-devel, linux-fsdevel
  Cc: linux-btrfs

On 2017-11-16 09:08, Qu Wenruo wrote:
> 
> 
>>>>>>
>>>>> [What we have]
>>>>> The nearest infrastructure I found in kernel is bio_integrity_payload.
>>>>>

Hi

We already have the dm-integrity target upstream.
What's missing from this target?

Regards

Zdenek

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16  9:43         ` Zdenek Kabelac
@ 2017-11-16 10:04           ` Qu Wenruo
  2017-11-16 12:33             ` Zdenek Kabelac
  2017-11-16 22:32             ` Chris Murphy
  0 siblings, 2 replies; 20+ messages in thread
From: Qu Wenruo @ 2017-11-16 10:04 UTC (permalink / raw)
  To: Zdenek Kabelac, Nikolay Borisov, linux-block, dm-devel, linux-fsdevel
  Cc: linux-btrfs





On 2017-11-16 17:43, Zdenek Kabelac wrote:
> Dne 16.11.2017 v 09:08 Qu Wenruo napsal(a):
>>
>>
>>>>>>>
>>>>>> [What we have]
>>>>>> The nearest infrastructure I found in kernel is
>>>>>> bio_integrity_payload.
>>>>>>
> 
> Hi
> 
> We already have  dm-integrity target upstream.
> What's missing in this target ?

If I didn't miss anything, dm-integrity is designed to calculate and
store csums in its own space to verify integrity.
The csum work happens when the bio reaches dm-integrity.

However, what I want is for the fs to generate the bio with a
verification hook attached, and pass it down to the lower layers to
verify.

For example, if we use the following device mapper layout:

        FS (can be any fs with metadata csum)
                |
             dm-integrity
                |
             dm-raid1
               / \
         disk1     disk2

If some data on disk1 gets corrupted (the disk itself is still good),
then when dm-raid1 reads that data it may return the corrupted copy,
which is then caught by dm-integrity and finally returned as -EIO to
the FS.

But the truth is, we could at least try to read the data from disk2 if
we know the csum for it, and use that checksum to verify whether it is
the correct data.


So my idea will be:
     FS (with metadata csum, or even data csum support)
                |  READ bio for metadata
                |  -With metadata verification hook
            dm-raid1
               / \
          disk1   disk2

dm-raid1 handles the bio, reading data from disk1.
If the result can't pass the verification hook, then retry with disk2.

If the result from disk2 passes the verification hook, good: return the
result from disk2 to the upper layer (fs).
We can even submit a WRITE bio to try to write the good result back to
disk1.

If the result from disk2 doesn't pass the verification hook either, we
return -EIO to the upper layer.
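
In pseudo-C, that read path would look roughly like the following
(illustration only - read_one_copy(), rewrite_copy() and the
bio_verify_hook sketched earlier in the thread are all hypothetical,
and the real dm-raid1 code is structured differently):

struct raid1_ctx {
        int nr_copies;
};

static void raid1_read_verified(struct raid1_ctx *rc, struct bio *bio,
                                struct bio_verify_hook *hook)
{
        int i;

        for (i = 0; i < rc->nr_copies; i++) {
                read_one_copy(rc, i, bio);      /* read mirror i */

                if (!hook || !hook->verify(bio, hook->csum_data)) {
                        /* This copy is good: optionally write it back
                         * to the copies that failed, then finish. */
                        if (i > 0)
                                rewrite_copy(rc, i, bio);
                        bio_endio(bio);
                        return;
                }
        }

        /* No copy passed verification: report the error upward. */
        bio->bi_status = BLK_STS_IOERR;
        bio_endio(bio);
}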

That's what btrfs already does for DUP/RAID1/10 (RAID5/6 will also try
to rebuild data, but that still has some problems).

I just want to make device-mapper RAID able to handle such cases too,
especially since most filesystems support checksums for their metadata.

Thanks,
Qu
> 
> Regards
> 
> Zdenek
> 
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 10:04           ` Qu Wenruo
@ 2017-11-16 12:33             ` Zdenek Kabelac
  2017-11-16 12:41               ` Austin S. Hemmelgarn
  2017-11-16 14:06               ` Qu Wenruo
  2017-11-16 22:32             ` Chris Murphy
  1 sibling, 2 replies; 20+ messages in thread
From: Zdenek Kabelac @ 2017-11-16 12:33 UTC (permalink / raw)
  To: Qu Wenruo, Zdenek Kabelac, Nikolay Borisov, linux-block,
	dm-devel, linux-fsdevel
  Cc: linux-btrfs

On 2017-11-16 11:04, Qu Wenruo wrote:
> 
> 
> On 2017年11月16日 17:43, Zdenek Kabelac wrote:
>> Dne 16.11.2017 v 09:08 Qu Wenruo napsal(a):
>>>
>>>
>>>>>>>>
>>>>>>> [What we have]
>>>>>>> The nearest infrastructure I found in kernel is
>>>>>>> bio_integrity_payload.
>>>>>>>
>>
>> Hi
>>
>> We already have  dm-integrity target upstream.
>> What's missing in this target ?
> 
> If I didn't miss anything, the dm-integrity is designed to calculate and
> restore csum into its space to verify the integrity.
> The csum happens when bio reaches dm-integrity.
> 
> However what I want is, fs generate bio with attached verification hook,
> and pass to lower layers to verify it.
> 
> For example, if we use the following device mapper layout:
> 
>          FS (can be any fs with metadata csum)
>                  |
>               dm-integrity
>                  |
>               dm-raid1
>                 / \
>           disk1     disk2
> 
> If some data in disk1 get corrupted (the disk itself is still good), and
> when dm-raid1 tries to read the corrupted data, it may return the
> corrupted one, and then caught by dm-integrity, finally return -EIO to FS.
> 
> But the truth is, we could at least try to read out data in disk2 if we
> know the csum for it.
> And use the checksum to verify if it's the correct data.
> 
> 
> So my idea will be:
>       FS (with metadata csum, or even data csum support)
>                  |  READ bio for metadata
>                  |  -With metadata verification hook
>              dm-raid1
>                 / \
>            disk1   disk2
> 
> dm-raid1 handles the bio, reading out data from disk1.
> But the result can't pass verification hook.
> Then retry with disk2.
> 
> If result from disk2 passes verification hook. That's good, returning
> the result from disk2 to upper layer (fs).
> And we can even submit WRITE bio to try to write the good result back to
> disk1.
> 
> If result from disk2 doesn't pass verification hook, then we return -EIO
> to upper layer.
> 
> That's what btrfs has already done for DUP/RAID1/10 (although RAID5/6
> will also try to rebuild data, but it still has some problem).
> 
> I just want to make device-mapper raid able to handle such case too.
> Especially when most fs supports checksum for their metadata.
> 

Hi

IMHO you are looking for too complicated a solution.

If your checksum is calculated and checked at the FS level, there is no
added value in spreading this logic to other layers.

dm-integrity adds basic 'check-summing' to any filesystem without the
need to modify the fs itself - the price paid is that if there is a bug
while passing data from the 'fs' to 'dm-integrity', it cannot be caught.

The advantage of having separate 'fs' and 'block' layers is in the
separation and simplicity at each level.

If you want an integrated solution - you are simply looking for btrfs,
where multiple layers are integrated together.

You are also possibly missing a feature of dm-integrity - it's not just
giving you a 'checksum' - it also assures you the device has the proper
content - you can't just 'replace a block', even with a proper checksum,
somewhere in the middle of your device... and when joined with crypto it
becomes way more secure...

Regards

Zdenek

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 12:33             ` Zdenek Kabelac
@ 2017-11-16 12:41               ` Austin S. Hemmelgarn
  2017-11-16 14:06               ` Qu Wenruo
  1 sibling, 0 replies; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-16 12:41 UTC (permalink / raw)
  To: Zdenek Kabelac, Qu Wenruo, Nikolay Borisov, linux-block,
	dm-devel, linux-fsdevel
  Cc: linux-btrfs

On 2017-11-16 07:33, Zdenek Kabelac wrote:
> Dne 16.11.2017 v 11:04 Qu Wenruo napsal(a):
>>
>>
>> On 2017年11月16日 17:43, Zdenek Kabelac wrote:
>>> Dne 16.11.2017 v 09:08 Qu Wenruo napsal(a):
>>>>
>>>>
>>>>>>>>>
>>>>>>>> [What we have]
>>>>>>>> The nearest infrastructure I found in kernel is
>>>>>>>> bio_integrity_payload.
>>>>>>>>
>>>
>>> Hi
>>>
>>> We already have  dm-integrity target upstream.
>>> What's missing in this target ?
>>
>> If I didn't miss anything, the dm-integrity is designed to calculate and
>> restore csum into its space to verify the integrity.
>> The csum happens when bio reaches dm-integrity.
>>
>> However what I want is, fs generate bio with attached verification hook,
>> and pass to lower layers to verify it.
>>
>> For example, if we use the following device mapper layout:
>>
>>          FS (can be any fs with metadata csum)
>>                  |
>>               dm-integrity
>>                  |
>>               dm-raid1
>>                 / \
>>           disk1     disk2
>>
>> If some data in disk1 get corrupted (the disk itself is still good), and
>> when dm-raid1 tries to read the corrupted data, it may return the
>> corrupted one, and then caught by dm-integrity, finally return -EIO to 
>> FS.
>>
>> But the truth is, we could at least try to read out data in disk2 if we
>> know the csum for it.
>> And use the checksum to verify if it's the correct data.
>>
>>
>> So my idea will be:
>>       FS (with metadata csum, or even data csum support)
>>                  |  READ bio for metadata
>>                  |  -With metadata verification hook
>>              dm-raid1
>>                 / \
>>            disk1   disk2
>>
>> dm-raid1 handles the bio, reading out data from disk1.
>> But the result can't pass verification hook.
>> Then retry with disk2.
>>
>> If result from disk2 passes verification hook. That's good, returning
>> the result from disk2 to upper layer (fs).
>> And we can even submit WRITE bio to try to write the good result back to
>> disk1.
>>
>> If result from disk2 doesn't pass verification hook, then we return -EIO
>> to upper layer.
>>
>> That's what btrfs has already done for DUP/RAID1/10 (although RAID5/6
>> will also try to rebuild data, but it still has some problem).
>>
>> I just want to make device-mapper raid able to handle such case too.
>> Especially when most fs supports checksum for their metadata.
>>
> 
> Hi
> 
> IMHO you are looking for too complicated solution.
> 
> If your checksum is calculated and checked at FS level there is no added 
> value when you spread this logic to other layers.
> 
> dm-integrity adds basic 'check-summing' to any filesystem without the 
> need to modify fs itself - the paid price is - if there is bug between 
> passing data from  'fs' to dm-integrity'  it cannot be captured.
But that is true of pretty much any layering, not just dm-integrity. 
There's just a slightly larger window for corruption with dm-integrity.
> 
> Advantage of having separated 'fs' and 'block' layer is in its 
> separation and simplicity at each level.
> 
> If you want integrated solution - you are simply looking for btrfs where 
> multiple layers are integrated together.
> 
> You are also possibly missing feature of dm-interity - it's not just 
> giving you 'checksum' - it also makes you sure - device has proper 
> content - you can't just 'replace block' even with proper checksum for a 
> block somewhere in the middle of you device... and when joined with 
> crypto - it makes it way more secure...
And to expand a bit further, the correct way to integrate dm-integrity 
into the stack when RAID is involved is to put it _below_ the RAID 
layer, so each underlying device is its own dm-integrity target. 
Assuming I understand the way dm-raid and md handle -EIO, that should 
get you a similar level of protection to BTRFS (worse in some ways, 
better in others).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 12:33             ` Zdenek Kabelac
  2017-11-16 12:41               ` Austin S. Hemmelgarn
@ 2017-11-16 14:06               ` Qu Wenruo
  2017-11-16 16:47                 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2017-11-16 14:06 UTC (permalink / raw)
  To: Zdenek Kabelac, Nikolay Borisov, linux-block, dm-devel, linux-fsdevel
  Cc: linux-btrfs





On 2017-11-16 20:33, Zdenek Kabelac wrote:
> Dne 16.11.2017 v 11:04 Qu Wenruo napsal(a):
>>
>>
>> On 2017年11月16日 17:43, Zdenek Kabelac wrote:
>>> Dne 16.11.2017 v 09:08 Qu Wenruo napsal(a):
>>>>
>>>>
>>>>>>>>>
>>>>>>>> [What we have]
>>>>>>>> The nearest infrastructure I found in kernel is
>>>>>>>> bio_integrity_payload.
>>>>>>>>
>>>
>>> Hi
>>>
>>> We already have  dm-integrity target upstream.
>>> What's missing in this target ?
>>
>> If I didn't miss anything, the dm-integrity is designed to calculate and
>> restore csum into its space to verify the integrity.
>> The csum happens when bio reaches dm-integrity.
>>
>> However what I want is, fs generate bio with attached verification hook,
>> and pass to lower layers to verify it.
>>
>> For example, if we use the following device mapper layout:
>>
>>          FS (can be any fs with metadata csum)
>>                  |
>>               dm-integrity
>>                  |
>>               dm-raid1
>>                 / \
>>           disk1     disk2
>>
>> If some data in disk1 get corrupted (the disk itself is still good), and
>> when dm-raid1 tries to read the corrupted data, it may return the
>> corrupted one, and then caught by dm-integrity, finally return -EIO to
>> FS.
>>
>> But the truth is, we could at least try to read out data in disk2 if we
>> know the csum for it.
>> And use the checksum to verify if it's the correct data.
>>
>>
>> So my idea will be:
>>       FS (with metadata csum, or even data csum support)
>>                  |  READ bio for metadata
>>                  |  -With metadata verification hook
>>              dm-raid1
>>                 / \
>>            disk1   disk2
>>
>> dm-raid1 handles the bio, reading out data from disk1.
>> But the result can't pass verification hook.
>> Then retry with disk2.
>>
>> If result from disk2 passes verification hook. That's good, returning
>> the result from disk2 to upper layer (fs).
>> And we can even submit WRITE bio to try to write the good result back to
>> disk1.
>>
>> If result from disk2 doesn't pass verification hook, then we return -EIO
>> to upper layer.
>>
>> That's what btrfs has already done for DUP/RAID1/10 (although RAID5/6
>> will also try to rebuild data, but it still has some problem).
>>
>> I just want to make device-mapper raid able to handle such case too.
>> Especially when most fs supports checksum for their metadata.
>>
> 
> Hi
> 
> IMHO you are looking for too complicated solution.

This is at least less complicated than dm-integrity.

It is just a new hook for READ bios, and it can start from the easy
part, e.g. starting with dm-raid1 and support in other filesystems.

> 
> If your checksum is calculated and checked at FS level there is no added
> value when you spread this logic to other layers.

That's why I'm moving the checking part to the lower level: to get more
value from the checksum.

> 
> dm-integrity adds basic 'check-summing' to any filesystem without the
> need to modify fs itself

Well, except for the fact that modern filesystems have already
implemented their own metadata csums.

>  - the paid price is - if there is bug between
> passing data from  'fs' to dm-integrity'  it cannot be captured.
> 
> Advantage of having separated 'fs' and 'block' layer is in its
> separation and simplicity at each level.

Totally agreed on this.

But the idea here shouldn't have that large an impact (compared to big
things like ZFS/Btrfs):

1) It only affects READ bios.
2) Every dm target can choose whether to support the hook or to pass it
   down; there is no point supporting it for RAID0, for example.
   And for complex RAID like RAID5/6, it doesn't need to be supported
   from the very beginning.
3) The main part of the functionality is already implemented.
   The core complexity has two parts:
   a) checksum calculation and checking
      Modern filesystems already do this, at least for metadata.
   b) recovery
      dm targets already have this implemented for the supported RAID
      profiles.
   All of this is already implemented; just moving it to a different
   point in time shouldn't require such big modifications, IIRC.
> 
> If you want integrated solution - you are simply looking for btrfs where
> multiple layers are integrated together.

With such a verification hook (along with something extra to handle
scrub), btrfs chunk mapping could even be re-implemented with
device-mapper.

In fact the btrfs logical space is just a dm-linear device, and each
chunk could be implemented by its corresponding dm-* module, like:

dm-linear:       | btrfs chunk 1 | btrfs chunk 2 | ... | btrfs chunk n |
and
btrfs chunk 1: metadata, using dm-raid1 on diskA and diskB
btrfs chunk 2: data, using dm-raid0 on disk A B C D
...
btrfs chunk n: system, using dm-raid1 on disk A B

At least btrfs could then take advantage of the simplicity of separate layers.

And other filesystems would get a somewhat higher chance of recovering
their metadata if built on dm-raid.

Thanks,
Qu

> 
> You are also possibly missing feature of dm-interity - it's not just
> giving you 'checksum' - it also makes you sure - device has proper
> content - you can't just 'replace block' even with proper checksum for a
> block somewhere in the middle of you device... and when joined with
> crypto - it makes it way more secure...
> 
> Regards
> 
> Zdenek



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 14:06               ` Qu Wenruo
@ 2017-11-16 16:47                 ` Austin S. Hemmelgarn
  2017-11-16 21:05                   ` Pasi Kärkkäinen
  2017-11-17  1:30                   ` Qu Wenruo
  0 siblings, 2 replies; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-16 16:47 UTC (permalink / raw)
  To: Qu Wenruo, Zdenek Kabelac, Nikolay Borisov, linux-block,
	dm-devel, linux-fsdevel
  Cc: linux-btrfs

On 2017-11-16 09:06, Qu Wenruo wrote:
> 
> 
> On 2017年11月16日 20:33, Zdenek Kabelac wrote:
>> Dne 16.11.2017 v 11:04 Qu Wenruo napsal(a):
>>>
>>>
>>> On 2017年11月16日 17:43, Zdenek Kabelac wrote:
>>>> Dne 16.11.2017 v 09:08 Qu Wenruo napsal(a):
>>>>>
>>>>>
>>>>>>>>>>
>>>>>>>>> [What we have]
>>>>>>>>> The nearest infrastructure I found in kernel is
>>>>>>>>> bio_integrity_payload.
>>>>>>>>>
>>>>
>>>> Hi
>>>>
>>>> We already have  dm-integrity target upstream.
>>>> What's missing in this target ?
>>>
>>> If I didn't miss anything, the dm-integrity is designed to calculate and
>>> restore csum into its space to verify the integrity.
>>> The csum happens when bio reaches dm-integrity.
>>>
>>> However what I want is, fs generate bio with attached verification hook,
>>> and pass to lower layers to verify it.
>>>
>>> For example, if we use the following device mapper layout:
>>>
>>>           FS (can be any fs with metadata csum)
>>>                   |
>>>                dm-integrity
>>>                   |
>>>                dm-raid1
>>>                  / \
>>>            disk1     disk2
>>>
>>> If some data in disk1 get corrupted (the disk itself is still good), and
>>> when dm-raid1 tries to read the corrupted data, it may return the
>>> corrupted one, and then caught by dm-integrity, finally return -EIO to
>>> FS.
>>>
>>> But the truth is, we could at least try to read out data in disk2 if we
>>> know the csum for it.
>>> And use the checksum to verify if it's the correct data.
>>>
>>>
>>> So my idea will be:
>>>        FS (with metadata csum, or even data csum support)
>>>                   |  READ bio for metadata
>>>                   |  -With metadata verification hook
>>>               dm-raid1
>>>                  / \
>>>             disk1   disk2
>>>
>>> dm-raid1 handles the bio, reading out data from disk1.
>>> But the result can't pass verification hook.
>>> Then retry with disk2.
>>>
>>> If result from disk2 passes verification hook. That's good, returning
>>> the result from disk2 to upper layer (fs).
>>> And we can even submit WRITE bio to try to write the good result back to
>>> disk1.
>>>
>>> If result from disk2 doesn't pass verification hook, then we return -EIO
>>> to upper layer.
>>>
>>> That's what btrfs has already done for DUP/RAID1/10 (although RAID5/6
>>> will also try to rebuild data, but it still has some problem).
>>>
>>> I just want to make device-mapper raid able to handle such case too.
>>> Especially when most fs supports checksum for their metadata.
>>>
>>
>> Hi
>>
>> IMHO you are looking for too complicated solution.
> 
> This is at least less complicated than dm-integrity.
> 
> Just a new hook for READ bio. And it can start from easy part.
> Like starting from dm-raid1 and other fs support.
It's less complicated for end users (in theory, but cryptsetup devs are 
working on that for dm-integrity), but significantly more complicated 
for developers.

It also brings up the question of what happens when you want some other 
layer between the filesystem and the MD/DM RAID layer (say, running 
bcache or dm-cache on top of the RAID array).  In the case of 
dm-integrity, that's not an issue because dm-integrity is entirely 
self-contained, it doesn't depend on other layers beyond the standard 
block interface.

As I mentioned in my other reply on this thread, running with 
dm-integrity _below_ the RAID layer instead of on top of it will provide 
the same net effect, and in fact provide a stronger guarantee than what 
you are proposing (because dm-integrity does real cryptographic 
integrity verification, as opposed to just checking for bit-rot).
> 
>>
>> If your checksum is calculated and checked at FS level there is no added
>> value when you spread this logic to other layers.
> 
> That's why I'm moving the checking part to lower level, to make more
> value from the checksum.
> 
>>
>> dm-integrity adds basic 'check-summing' to any filesystem without the
>> need to modify fs itself
> 
> Well, despite the fact that modern filesystem has already implemented
> their metadata csum.
> 
>   - the paid price is - if there is bug between
>> passing data from  'fs' to dm-integrity'  it cannot be captured.
>>
>> Advantage of having separated 'fs' and 'block' layer is in its
>> separation and simplicity at each level.
> 
> Totally agreed on this.
> 
> But the idea here should not bring that large impact (compared to big
> things like ZFS/Btrfs).
> 
> 1) It only affect READ bio
> 2) Every dm target can choose if to support or pass down the hook.
>     no mean to support it for RAID0 for example.
>     And for complex raid like RAID5/6 no need to support it from the very
>     beginning.
> 3) Main part of the functionality is already implemented
>     The core complexity contains 2 parts:
>     a) checksum calculation and checking
>        Modern fs is already doing this, at least for metadata.
>     b) recovery
>        dm targets already have this implemented for supported raid
>        profile.
>     All these are already implemented, just moving them to different
>     timing is not bringing such big modification IIRC.
>>
>> If you want integrated solution - you are simply looking for btrfs where
>> multiple layers are integrated together.
> 
> If with such verification hook (along with something extra to handle
> scrub), btrfs chunk mapping can be re-implemented with device-mapper:
> 
> In fact btrfs logical space is just a dm-linear device, and each chunk
> can be implemented by its corresponding dm-* module like:
> 
> dm-linear:       | btrfs chunk 1 | btrfs chunk 2 | ... | btrfs chunk n |
> and
> btrfs chunk 1: metadata, using dm-raid1 on diskA and diskB
> btrfs chunk 2: data, using dm-raid0 on disk A B C D
> ...
> btrfs chunk n: system, using dm-raid1 on disk A B
> 
> At least btrfs can take the advantage of the simplicity of separate layers.
> 
> And other filesystem can get a little higher chance to recover its
> metadata if built on dm-raid.
Again, just put dm-integrity below dm-raid.  The other filesystems 
primarily have metadata checksums to catch data corruption, not repair 
it, and I severely doubt that you will manage to convince developers to 
add support in their filesystem (especially XFS) because:
1. It's a layering violation (yes, I know BTRFS is too, but that's a bit 
less of an issue because it's a completely self-contained layering 
violation, while this isn't).
2. There's no precedent in hardware (I challenge you to find a block 
device that lets you respond to a read completing with 'Hey, this data 
is bogus, give me the real data!').
3. You can get the same net effect with a higher guarantee of security 
using dm-integrity.
> 
> Thanks,
> Qu
> 
>>
>> You are also possibly missing feature of dm-interity - it's not just
>> giving you 'checksum' - it also makes you sure - device has proper
>> content - you can't just 'replace block' even with proper checksum for a
>> block somewhere in the middle of you device... and when joined with
>> crypto - it makes it way more secure...
>>
>> Regards
>>
>> Zdenek
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 16:47                 ` Austin S. Hemmelgarn
@ 2017-11-16 21:05                   ` Pasi Kärkkäinen
  2017-11-17  1:30                   ` Qu Wenruo
  1 sibling, 0 replies; 20+ messages in thread
From: Pasi Kärkkäinen @ 2017-11-16 21:05 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Qu Wenruo, Zdenek Kabelac, Nikolay Borisov, linux-block,
	dm-devel, linux-fsdevel, linux-btrfs

On Thu, Nov 16, 2017 at 11:47:45AM -0500, Austin S. Hemmelgarn wrote:
> >
> >At least btrfs can take the advantage of the simplicity of separate layers.
> >
> >And other filesystem can get a little higher chance to recover its
> >metadata if built on dm-raid.
> Again, just put dm-integrity below dm-raid.  The other filesystems primarily
> have metadata checksums to catch data corruption, not repair it, and I
> severely doubt that you will manage to convince developers to add support in
> their filesystem (especially XFS) because:
> 1. It's a layering violation (yes, I know BTRFS is too, but that's a bit
> less of an issue because it's a completely self-contained layering
> violation, while this isn't).
> 2. There's no precedent in hardware (I challenge you to find a block device
> that lets you respond to a read completing with 'Hey, this data is bogus,
> give me the real data!').
>

Isn't this what T10 DIF/DIX (Data Integrity Field / Data Integrity Extensions) allows: using checksums all the way from userspace applications to the disks in the storage backend, with checksum verification at all points in between?

It does require compatible hardware/firmware/kernel/drivers/apps though, so it's not really a generic solution.


-- Pasi

> 3. You can get the same net effect with a higher guarantee of security using
> dm-integrity.
> >
> >Thanks,
> >Qu
> >
> >>
> >>You are also possibly missing feature of dm-interity - it's not just
> >>giving you 'checksum' - it also makes you sure - device has proper
> >>content - you can't just 'replace block' even with proper checksum for a
> >>block somewhere in the middle of you device... and when joined with
> >>crypto - it makes it way more secure...
> >>
> >>Regards
> >>
> >>Zdenek
> >
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 10:04           ` Qu Wenruo
  2017-11-16 12:33             ` Zdenek Kabelac
@ 2017-11-16 22:32             ` Chris Murphy
  2017-11-17  1:22               ` Qu Wenruo
  2017-11-21  2:53               ` [dm-devel] " Theodore Ts'o
  1 sibling, 2 replies; 20+ messages in thread
From: Chris Murphy @ 2017-11-16 22:32 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Zdenek Kabelac, Nikolay Borisov, linux-block, dm-devel,
	Linux FS Devel, linux-btrfs

On Thu, Nov 16, 2017 at 3:04 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> For example, if we use the following device mapper layout:
>
>         FS (can be any fs with metadata csum)
>                 |
>              dm-integrity
>                 |
>              dm-raid1
>                / \
>          disk1     disk2


You would instead do dm-integrity per physical device, then make the
two dm-integrity devices members of an md raid1 array. Now when an
integrity check fails, it's basically a UNC error to raid1, which then
gets the copy from the other device.

But what you're getting at, that dm-integrity is more complicated, is
true, in that it's at least partly COW based in order to get the
atomic write guarantee needed to ensure data blocks and csums are
always in sync and reliable. But this also applies to the entire file
system. The READ bio concept you're proposing leverages pretty much
already-existing code and has no write performance penalty or
complexity at all, but it does miss data for file systems that don't
csum data blocks. It's good that the file system can stay alive, but
data is the much bigger target in terms of percentage of space on the
physical media, and more likely to be corrupted or go missing due to a
media defect or whatever. Silent data corruption is still possible.




> I just want to make device-mapper raid able to handle such case too.
> Especially when most fs supports checksum for their metadata.

XFS does metadata csums by default. But ext4 still doesn't use them for
either metadata or the journal by default; they are still optional. So
for now this mainly benefits XFS.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 22:32             ` Chris Murphy
@ 2017-11-17  1:22               ` Qu Wenruo
  2017-11-17  1:54                 ` Chris Murphy
  2017-11-21  2:53               ` [dm-devel] " Theodore Ts'o
  1 sibling, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2017-11-17  1:22 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Zdenek Kabelac, Nikolay Borisov, linux-block, dm-devel,
	Linux FS Devel, linux-btrfs





On 2017-11-17 06:32, Chris Murphy wrote:
> On Thu, Nov 16, 2017 at 3:04 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
>> For example, if we use the following device mapper layout:
>>
>>         FS (can be any fs with metadata csum)
>>                 |
>>              dm-integrity
>>                 |
>>              dm-raid1
>>                / \
>>          disk1     disk2
> 
> 
> You would instead do dm-integrity per physical device, then make the
> two dm-integrity devices, members of md raid1 array. Now when
> integrity fails, basically it's UNC error to raid1 which then gets the
> copy from the other device.


Yep, dm-integrity under raid1 makes much more sense here.

Although that costs extra CPU for every device added.

> 
> But what you're getting at, that dm-integrity is more complicated, is
> true, in that it's at least partly COW based in order to get the
> atomic write guarantee needed to ensure data blocks and csums are
> always in sync, and reliable. But this also applies to the entire file
> system. The READ bio concept you're proposing leverages pretty much
> already existing code, has no write performance penalty or complexity
> at all, but does miss data for file systems that don't csum data
> blocks.

That's true, since currently only btrfs supports data csums.
And for a filesystem to support data csums it needs CoW support, while
only XFS and btrfs support CoW so far.

> It's good the file system can stay alive, but data is the much
> bigger target in terms of percent space on the physical media,

That's also true.
(Although working on btrfs sometimes makes me care more about keeping
metadata safe.)

Thanks,
Qu

> and
> more likely to be corrupt or go missing due to media defect or
> whatever. It's still possible for silent data corruption to happen.
> 
> 
> 
> 
>> I just want to make device-mapper raid able to handle such case too.
>> Especially when most fs supports checksum for their metadata.
> 
> XFS by default does metadata csums. But ext4 doesn't use it for either
> metadata or the journal by default still, it is still optional. So for
> now it mainly benefits XFS.
> 
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16  2:18 Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6? Qu Wenruo
  2017-11-16  6:54 ` Nikolay Borisov
@ 2017-11-17  1:26 ` Andreas Dilger
  1 sibling, 0 replies; 20+ messages in thread
From: Andreas Dilger @ 2017-11-17  1:26 UTC (permalink / raw)
  To: Qu Wenruo, Darrick J. Wong
  Cc: linux-block, dm-devel, linux-fsdevel, linux-btrfs


On Nov 15, 2017, at 7:18 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
> [Background]
> Recently I'm considering the possibility to use checksum from filesystem
> to enhance device-mapper raid.
> 
> The idea behind it is quite simple, since most modern filesystems have
> checksum for their metadata, and even some (btrfs) have checksum for data.
> 
> And for btrfs RAID1/10 (just ignore the RAID5/6 for now), at read time
> it can use the checksum to determine which copy is correct so it can
> return the correct data even one copy get corrupted.
> 
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
> 
> If only for verification, it's not much different from current endio
> hook method used by most of the fs.
> However if we can move the repair part from filesystem (well, only btrfs
> supports it yet), it would benefit all fs.

I recall Darrick was looking into a mechanism to do this.  Rather than
changing the whole block layer to take a callback to do a checksum, what
we looked at was to allow the upper-layer read to specify a "retry count"
to the lower-layer block device.  If the lower layer is able to retry the
read then it will read a different device (or combination of devices for
e.g. RAID-6) based on the retry count, until the upper layer gets a good
read (based on checksum, or whatever).  If there are no more devices (or
combinations) to try then a final error is returned.
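
(A minimal sketch of that idea from the filesystem side, assuming a
hypothetical per-bio "retry number" - the bi_retry_nr field is invented
here for illustration, and the bio helpers shown vary by kernel
version:)

static int read_block_verified(struct block_device *bdev, sector_t sector,
                               struct page *page,
                               bool (*csum_ok)(struct page *page))
{
        int retry;

        for (retry = 0; ; retry++) {
                struct bio *bio = bio_alloc(GFP_NOFS, 1);
                int ret;

                bio_set_dev(bio, bdev);
                bio->bi_iter.bi_sector = sector;
                bio->bi_opf = REQ_OP_READ;
                bio->bi_retry_nr = retry;   /* hypothetical: which copy
                                             * or combination to read */
                bio_add_page(bio, page, PAGE_SIZE, 0);

                ret = submit_bio_wait(bio);
                bio_put(bio);
                if (ret)
                        return ret;     /* no more copies to try */
                if (csum_ok(page))
                        return 0;       /* upper layer got a good read */
        }
}

This keeps the block-layer interface untouched except for the retry
number, and leaves all checksum knowledge in the filesystem.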

Darrick can probably point at the original thread/patch.

Cheers, Andreas







^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 16:47                 ` Austin S. Hemmelgarn
  2017-11-16 21:05                   ` Pasi Kärkkäinen
@ 2017-11-17  1:30                   ` Qu Wenruo
  2017-11-17 12:22                     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2017-11-17  1:30 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Zdenek Kabelac, Nikolay Borisov,
	linux-block, dm-devel, linux-fsdevel
  Cc: linux-btrfs





On 2017-11-17 00:47, Austin S. Hemmelgarn wrote:

>>
>> This is at least less complicated than dm-integrity.
>>
>> Just a new hook for READ bio. And it can start from easy part.
>> Like starting from dm-raid1 and other fs support.
> It's less complicated for end users (in theory, but cryptsetup devs are
> working on that for dm-integrity), but significantly more complicated
> for developers.
> 
> It also brings up the question of what happens when you want some other
> layer between the filesystem and the MD/DM RAID layer (say, running
> bcache or dm-cache on top of the RAID array).  In the case of
> dm-integrity, that's not an issue because dm-integrity is entirely
> self-contained, it doesn't depend on other layers beyond the standard
> block interface.

Each layer can choose to drop support for the extra verification.

If a layer is not modifying the data, it can pass the hook down to the
lower layer, just like the integrity payload.

> 
> As I mentioned in my other reply on this thread, running with
> dm-integrity _below_ the RAID layer instead of on top of it will provide
> the same net effect, and in fact provide a stronger guarantee than what
> you are proposing (because dm-integrity does real cryptographic
> integrity verification, as opposed to just checking for bit-rot).

Although with more CPU usage for each device, even though they contain
the same data.

>>
>>>
>>> If your checksum is calculated and checked at FS level there is no added
>>> value when you spread this logic to other layers.
>>
>> That's why I'm moving the checking part to lower level, to make more
>> value from the checksum.
>>
>>>
>>> dm-integrity adds basic 'check-summing' to any filesystem without the
>>> need to modify fs itself
>>
>> Well, despite the fact that modern filesystem has already implemented
>> their metadata csum.
>>
>>   - the paid price is - if there is bug between
>>> passing data from  'fs' to dm-integrity'  it cannot be captured.
>>>
>>> Advantage of having separated 'fs' and 'block' layer is in its
>>> separation and simplicity at each level.
>>
>> Totally agreed on this.
>>
>> But the idea here should not bring that large impact (compared to big
>> things like ZFS/Btrfs).
>>
>> 1) It only affects READ bios.
>> 2) Every dm target can choose whether to support the hook or just pass
>>     it down.
>>     There is no point supporting it for RAID0, for example, and for
>>     complex RAID like RAID5/6 there is no need to support it from the
>>     very beginning.
>> 3) The main part of the functionality is already implemented.
>>     The core complexity consists of 2 parts:
>>     a) checksum calculation and checking
>>        Modern filesystems already do this, at least for metadata.
>>     b) recovery
>>        dm targets already have this implemented for the supported RAID
>>        profiles.
>>     All of this already exists; invoking it at a different time should
>>     not be such a big modification IIRC.
>>>
>>> If you want an integrated solution - you are simply looking for btrfs,
>>> where multiple layers are integrated together.
>>
>> With such a verification hook (along with something extra to handle
>> scrub), btrfs chunk mapping could be re-implemented with device-mapper:
>>
>> In fact the btrfs logical address space is just a dm-linear device, and
>> each chunk can be implemented by its corresponding dm-* module, like:
>>
>> dm-linear:       | btrfs chunk 1 | btrfs chunk 2 | ... | btrfs chunk n |
>> and
>> btrfs chunk 1: metadata, using dm-raid1 on disks A and B
>> btrfs chunk 2: data, using dm-raid0 on disks A B C D
>> ...
>> btrfs chunk n: system, using dm-raid1 on disks A B
>>
>> At least btrfs could take advantage of the simplicity of separate
>> layers.
>>
>> And other filesystems would get a somewhat better chance of recovering
>> their metadata if built on dm-raid.
> Again, just put dm-integrity below dm-raid.  The other filesystems
> primarily have metadata checksums to catch data corruption, not repair
> it,

Because they have no extra copy.
If they had one, they would definitely use it to repair the damage.

> and I severely doubt that you will manage to convince developers to
> add support in their filesystem (especially XFS) because:
> 1. It's a layering violation (yes, I know BTRFS is too, but that's a bit
> less of an issue because it's a completely self-contained layering
> violation, while this isn't).

If passing something along with the bio is a layering violation, then the
integrity payload has already been doing it for a long time.
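
For reference, this is roughly how an upper layer already attaches its
verification data to a bio through the existing integrity code
(simplified sketch, error handling trimmed; attach_csum_to_bio() is
just an illustrative name):

#include <linux/bio.h>

static int attach_csum_to_bio(struct bio *bio, struct page *csum_page,
			      unsigned int len)
{
	struct bio_integrity_payload *bip;

	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
	if (IS_ERR(bip))
		return PTR_ERR(bip);

	/* the csum buffer now travels down the stack with the bio */
	if (bio_integrity_add_page(bio, csum_page, len, 0) != len)
		return -ENOMEM;

	return 0;
}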

> 2. There's no precedent in hardware (I challenge you to find a block
> device that lets you respond to a read completing with 'Hey, this data
> is bogus, give me the real data!').
> 3. You can get the same net effect with a higher guarantee of security
> using dm-integrity.

With more CPU and IO overhead (journal mode will write the data twice,
once for the journal and once for the real data).

Thanks,
Qu

>>
>> Thanks,
>> Qu
>>
>>>
>>> You are also possibly missing a feature of dm-integrity - it's not just
>>> giving you a 'checksum' - it also assures you the device has the proper
>>> content - you can't just 'replace a block', even with a proper checksum,
>>> for a block somewhere in the middle of your device... and when joined
>>> with crypto it becomes way more secure...
>>>
>>> Regards
>>>
>>> Zdenek
>>
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-17  1:22               ` Qu Wenruo
@ 2017-11-17  1:54                 ` Chris Murphy
  2017-11-17  1:55                   ` Chris Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2017-11-17  1:54 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Chris Murphy, Zdenek Kabelac, Nikolay Borisov, linux-block,
	dm-devel, Linux FS Devel, linux-btrfs

On Thu, Nov 16, 2017 at 6:22 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2017-11-17 06:32, Chris Murphy wrote:
>
>> It's good the file system can stay alive, but data is the much
>> bigger target in terms of percent space on the physical media,
>
> It's also true.
> (Although working on btrfs sometimes makes me care more about safe metadata)

It seems like a good idea if it's lightweight enough, because we get
Btrfs-like metadata error detection and recovery from a copy, for
free. The user doesn't have to set up dm-verity to get this.
Additionally, if the work happens in the md driver, then both mdadm
and LVM based arrays get the feature (strictly speaking I think
dm-raid is deprecated; everything I'm aware of these days uses the md
code, including Intel's IMSM firmware based RAID).

The gotcha of course is that anytime there's a file system format
change, now this layer has to become aware of it and support all
versions of that file system's metadata for the purpose of error
detection. That might be a bitter pill to swallow in the long term.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-17  1:54                 ` Chris Murphy
@ 2017-11-17  1:55                   ` Chris Murphy
  0 siblings, 0 replies; 20+ messages in thread
From: Chris Murphy @ 2017-11-17  1:55 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Qu Wenruo, Zdenek Kabelac, Nikolay Borisov, linux-block,
	dm-devel, Linux FS Devel, linux-btrfs

On Thu, Nov 16, 2017 at 6:54 PM, Chris Murphy <lists@colorremedies.com> wrote:

> The user doesn't have to set up dm-verity to get this.

Or dm-integrity, rather.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-17  1:30                   ` Qu Wenruo
@ 2017-11-17 12:22                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-17 12:22 UTC (permalink / raw)
  To: Qu Wenruo, Zdenek Kabelac, Nikolay Borisov, linux-block,
	dm-devel, linux-fsdevel
  Cc: linux-btrfs

On 2017-11-16 20:30, Qu Wenruo wrote:
> 
> 
> On 2017-11-17 00:47, Austin S. Hemmelgarn wrote:
> 
>>>
>>> This is at least less complicated than dm-integrity.
>>>
>>> Just a new hook for READ bios. And it can start from the easy parts,
>>> like dm-raid1 and fs-side support.
>> It's less complicated for end users (in theory, but cryptsetup devs are
>> working on that for dm-integrity), but significantly more complicated
>> for developers.
>>
>> It also brings up the question of what happens when you want some other
>> layer between the filesystem and the MD/DM RAID layer (say, running
>> bcache or dm-cache on top of the RAID array).  In the case of
>> dm-integrity, that's not an issue because dm-integrity is entirely
>> self-contained, it doesn't depend on other layers beyond the standard
>> block interface.
> 
> Each layer can choose to drop the support for extra verification.
> 
> If the layer is not modifying the data, it can pass the hook down to the
> lower layer, just as is done for the integrity payload.
Which then makes things a bit more complicated in every other layer as 
well, in turn making things more complicated for all developers.
> 
>>
>> As I mentioned in my other reply on this thread, running with
>> dm-integrity _below_ the RAID layer instead of on top of it will provide
>> the same net effect, and in fact provide a stronger guarantee than what
>> you are proposing (because dm-integrity does real cryptographic
>> integrity verification, as opposed to just checking for bit-rot).
> 
> Although with more CPU usage for each device, even when they contain the
> same data.
I never said it wasn't higher resource usage.
> 
>>>
>>>>
>>>> If your checksum is calculated and checked at FS level there is no added
>>>> value when you spread this logic to other layers.
>>>
>>> That's why I'm moving the checking part to a lower level, to get more
>>> value out of the checksum.
>>>
>>>>
>>>> dm-integrity adds basic 'check-summing' to any filesystem without the
>>>> need to modify fs itself
>>>
>>> Well, despite the fact that modern filesystems have already implemented
>>> their own metadata csums.
>>>
>>>> - the price paid is that if there is a bug in
>>>> passing data from 'fs' to 'dm-integrity', it cannot be caught.
>>>>
>>>> The advantage of having separate 'fs' and 'block' layers is in their
>>>> separation and simplicity at each level.
>>>
>>> Totally agreed on this.
>>>
>>> But the idea here should not have that large an impact (compared to big
>>> things like ZFS/Btrfs).
>>>
>>> 1) It only affects READ bios.
>>> 2) Every dm target can choose whether to support the hook or just pass
>>>      it down.
>>>      There is no point supporting it for RAID0, for example, and for
>>>      complex RAID like RAID5/6 there is no need to support it from the
>>>      very beginning.
>>> 3) The main part of the functionality is already implemented.
>>>      The core complexity consists of 2 parts:
>>>      a) checksum calculation and checking
>>>         Modern filesystems already do this, at least for metadata.
>>>      b) recovery
>>>         dm targets already have this implemented for the supported RAID
>>>         profiles.
>>>      All of this already exists; invoking it at a different time should
>>>      not be such a big modification IIRC.
>>>>
>>>> If you want an integrated solution - you are simply looking for btrfs,
>>>> where multiple layers are integrated together.
>>>
>>> With such a verification hook (along with something extra to handle
>>> scrub), btrfs chunk mapping could be re-implemented with device-mapper:
>>>
>>> In fact the btrfs logical address space is just a dm-linear device, and
>>> each chunk can be implemented by its corresponding dm-* module, like:
>>>
>>> dm-linear:       | btrfs chunk 1 | btrfs chunk 2 | ... | btrfs chunk n |
>>> and
>>> btrfs chunk 1: metadata, using dm-raid1 on disks A and B
>>> btrfs chunk 2: data, using dm-raid0 on disks A B C D
>>> ...
>>> btrfs chunk n: system, using dm-raid1 on disks A B
>>>
>>> At least btrfs could take advantage of the simplicity of separate
>>> layers.
>>>
>>> And other filesystems would get a somewhat better chance of recovering
>>> their metadata if built on dm-raid.
>> Again, just put dm-integrity below dm-raid.  The other filesystems
>> primarily have metadata checksums to catch data corruption, not repair
>> it,
> 
> Because they have no extra copy.
> If they had one, they would definitely use it to repair the damage.
But they don't have those extra copies now, so that really becomes 
irrelevant as an argument (especially since it's not likely they will 
add data or metadata replication in the filesystem any time in the near 
future).
> 
>> and I severely doubt that you will manage to convince developers to
>> add support in their filesystem (especially XFS) because:
>> 1. It's a layering violation (yes, I know BTRFS is too, but that's a bit
>> less of an issue because it's a completely self-contained layering
>> violation, while this isn't).
> 
> If passing something along with the bio is a layering violation, then the
> integrity payload has already been doing it for a long time.
The block integrity layer is also interfacing directly with hardware and 
_needs_ to pass that data down.  Unless I'm mistaken, it also doesn't do 
any verification except in the filesystem layer, and doesn't pass down 
any complaints about the integrity of the data (it may try to re-read 
it, but that's not the same as what you're talking about).
> 
>> 2. There's no precedent in hardware (I challenge you to find a block
>> device that lets you respond to a read completing with 'Hey, this data
>> is bogus, give me the real data!').
>> 3. You can get the same net effect with a higher guarantee of security
>> using dm-integrity.
> 
> With more CPU and IO overhead (journal mode will write the data twice,
> once for the journal and once for the real data).
If you're concerned about that, then the same argument could be made 
about having checksumming at all.  Yes, it's not cheap, but security and 
data safety almost never are.  CoW semantics in BTRFS are just as 
resource intensive (if not more so).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
  2017-11-16 22:32             ` Chris Murphy
  2017-11-17  1:22               ` Qu Wenruo
@ 2017-11-21  2:53               ` Theodore Ts'o
  1 sibling, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2017-11-21  2:53 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Qu Wenruo, Nikolay Borisov, linux-block, dm-devel,
	Zdenek Kabelac, Linux FS Devel, linux-btrfs

On Thu, Nov 16, 2017 at 03:32:05PM -0700, Chris Murphy wrote:
> 
> XFS by default does metadata csums. But ext4 doesn't use it for either
> metadata or the journal by default still, it is still optional. So for
> now it mainly benefits XFS.

Metadata checksums are enabled by default in the version of e2fsprogs
shipped by Debian.  Since there were no real problems reported by
Debian users, in the next release of e2fsprogs, coming soon, it will
be enabled by default for all new ext4 file systems.
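
(For anyone who wants it before that new default lands, the feature can
already be enabled explicitly at mkfs time with something like
"mkfs.ext4 -O metadata_csum /dev/sdX", and "dumpe2fs -h" should then
list metadata_csum among the filesystem features.)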

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-11-21  2:53 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-16  2:18 Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6? Qu Wenruo
2017-11-16  6:54 ` Nikolay Borisov
2017-11-16  7:38   ` Qu Wenruo
2017-11-16  7:42     ` Nikolay Borisov
2017-11-16  8:08       ` Qu Wenruo
2017-11-16  9:43         ` Zdenek Kabelac
2017-11-16 10:04           ` Qu Wenruo
2017-11-16 12:33             ` Zdenek Kabelac
2017-11-16 12:41               ` Austin S. Hemmelgarn
2017-11-16 14:06               ` Qu Wenruo
2017-11-16 16:47                 ` Austin S. Hemmelgarn
2017-11-16 21:05                   ` Pasi Kärkkäinen
2017-11-17  1:30                   ` Qu Wenruo
2017-11-17 12:22                     ` Austin S. Hemmelgarn
2017-11-16 22:32             ` Chris Murphy
2017-11-17  1:22               ` Qu Wenruo
2017-11-17  1:54                 ` Chris Murphy
2017-11-17  1:55                   ` Chris Murphy
2017-11-21  2:53               ` [dm-devel] " Theodore Ts'o
2017-11-17  1:26 ` Andreas Dilger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).