* Proposal draft for data checksumming for ext4
@ 2014-03-20 16:40 Lukáš Czerner
  2014-03-20 17:59 ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Lukáš Czerner @ 2014-03-20 16:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o

Hi all,

I've started thinking about implementing data checksumming for the ext4 file
system. This is not meant to be a formal proposal or a definitive design
description, since I am not that far yet, but rather a few ideas to start
the discussion and to help figure out what the best design for data
checksumming in ext4 might be.



			   Data checksumming for ext4
				  Version 0.1
				 March 20, 2014


Goal
====

The goal is to implement data checksumming for the ext4 file system in order
to improve data integrity and increase protection against silent data
corruption while maintaining reasonable performance and usability of the
file system.

While data checksums can certainly be used in other ways, for example for
data deduplication, this proposal is focused on data integrity.


Checksum function
=================

By default I plan to use the crc32c checksum, but I do not see a reason not
to support different checksum functions as well. Also, by default the
checksum size should be 32 bits, but the plan is to make the format
flexible enough to support different checksum sizes.
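
Purely as an illustration (the helper name and the seed handling below are
made up; only crc32c() itself is an existing kernel helper), checksumming a
single data block would boil down to something like:

#include <linux/crc32c.h>

/* Sketch only: 32-bit checksum of one data block.  The seed would come
 * from the superblock (for example derived from the fs UUID), similar to
 * what the metadata_csum feature already does for metadata. */
static u32 ext4_data_block_csum(const void *data, unsigned int blocksize,
				u32 seed)
{
	return crc32c(seed, data, blocksize);
}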


Checksumming and Validating
===========================

On write, checksums of the data blocks need to be computed right before their
bio is submitted and written out as metadata to their location (see below)
after the bio completes (similar to how we do unwritten extent conversion
today).

Similarly, on read, checksums need to be computed after the bio completes
and compared with the stored values to verify that the data is intact.

All of this should be done using workqueues (Concurrency Managed
Workqueues) so we do not block other operations and so we can spread the
checksum computation and comparison across CPUs: one wq for reads and one
for writes. The specific setup of the wqs, such as priority or concurrency
limits, should be decided later based on performance evaluation.

While we already have ext4 infrastructure to submit bios in
fs/ext4/page-io.c, where the entry point is ext4_bio_write_page(), we would
need the same for reads to be able to provide ext4-specific hooks for
io completion.
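
To make the read-side flow a bit more concrete, here is a minimal sketch of
what the completion path might look like.  All of the ext4_* identifiers
below are made up; only the workqueue and bio APIs are real, and the wq
itself would be set up once with alloc_workqueue():

#include <linux/bio.h>
#include <linux/workqueue.h>

struct ext4_read_io_end {
	struct work_struct	work;
	struct bio		*bio;
};

static struct workqueue_struct *ext4_rsum_wq;	/* the wq for reads */

static void ext4_verify_data_csum_work(struct work_struct *work)
{
	struct ext4_read_io_end *io =
		container_of(work, struct ext4_read_io_end, work);

	/* walk io->bio's pages, compute crc32c for each block and compare
	 * with the stored checksums; report corruption on mismatch */
}

/* called from the bio's ->bi_end_io (prototype simplified here) */
static void ext4_end_bio_read(struct bio *bio)
{
	struct ext4_read_io_end *io = bio->bi_private;

	INIT_WORK(&io->work, ext4_verify_data_csum_work);
	queue_work(ext4_rsum_wq, &io->work);	/* don't block completion */
}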


Where to store the checksums
============================

While the problems above are pretty straightforward when it comes to the
design, actually storing and retrieving the data checksums to/from the ext4
format requires much more thought to be efficient enough and to play nicely
with the overall ext4 design, while trying not to be too intrusive.

I came up with several ideas about where to store the data checksums and how
to access them. While some of the ideas might not be the most viable options,
it is still interesting to think about the advantages and disadvantages of
each particular solution.

a) Static layout
----------------

This scheme fits perfectly into the ext4 design. Checksum blocks
would be preallocated the same way as we do with inode tables, for example.
Each block group would have its own contiguous region of checksum blocks,
large enough to store checksums for all blocks in the block group it belongs
to. Each checksum block would contain a header, including a checksum of the
checksum block itself.

We still have 4 unused bytes in the ext4_group_desc structure, so storing
a block number for the checksum table should not be a problem.

Finding the checksum location for each block in the block group can be done
in O(1) time, which is very good. Another advantage is locality with the
data blocks in question, since both reside in the same block group.
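
For illustration, the lookup is just arithmetic on the group-relative block
index.  In the sketch below csum_table_block() and CSUM_HDR_SLOTS are
hypothetical (the new group descriptor field and the header overhead); the
other helpers already exist in ext4:

/* Sketch: map a filesystem block to the location of its checksum in the
 * block group's static checksum table. */
static void ext4_data_csum_loc(struct super_block *sb, ext4_fsblk_t block,
			       ext4_fsblk_t *csum_block, unsigned int *slot)
{
	ext4_group_t group = ext4_get_group_number(sb, block);
	ext4_grpblk_t idx = block - ext4_group_first_block_no(sb, group);
	/* 32-bit checksums per block, minus room for the block header */
	unsigned int per_blk = sb->s_blocksize / 4 - CSUM_HDR_SLOTS;

	*csum_block = csum_table_block(sb, group) + idx / per_blk;
	*slot = idx % per_blk;
}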

The big disadvantage is that this solution is not very flexible, which
follows from the fact that the "checksum table" is statically located at a
precise position in the file system, fixed at mkfs time.

There are also other problems we should be concerned with. The ext4 file
system already has support for metadata checksumming, so all the metadata has
its own checksum. While we can avoid unnecessarily checksumming inodes, group
descriptors and basically all statically positioned metadata, we still have
dynamically allocated metadata blocks such as extent blocks. These blocks
do not have to be checksummed, but we would still have space reserved for
them in the checksum table.

I think that we should be able to implement this feature without introducing
any incompatibility, but it would make more sense to make it RO compatible
only so we can preserve the checksums. But that's up to the implementation.

b) Special inode
----------------

This is a very "lazy" solution and should not be difficult to implement. The
idea is to have a special inode which would store the checksum blocks in
its own data blocks.

The big disadvantage is that we would have to walk the extent tree twice for
each read or write. There is not much more to say about this solution other
than that, again, we can implement this feature without introducing any
incompatibility, but it would probably make more sense to make it RO
compatible to preserve the checksums.

c) Per inode checksum b-tree
----------------------------

See d)

d) Per block group checksum b-tree
----------------------------------

These two schemes are very similar in that both would store checksums in a
b-tree keyed by block number (we could use the logical block number in the
per-inode tree). Obviously, finding a checksum would take logarithmic time,
while the size of the tree could be much bigger in the per-inode case. In
the per block group case we have a much smaller bound on the number of
checksum blocks stored.

This, and the fact that we would need at least one checksum block per inode
(which would be wasteful in the case of small files), makes the per block
group solution much more viable. However, the major disadvantage of the per
block group solution is that the checksum tree would become a source of
contention when reading/writing from/to different inodes in the same block
group. This might be mitigated by having a worker thread per range of block
groups, but it might still be a bottleneck.

Again, we still have 4 bytes in ext4_group_desc to store the pointer to the
root of the tree. The ext4_inode structure has the 4 bytes of i_obso_faddr,
but that's not enough, so we would have to figure out where to store it; we
could possibly abuse i_block to store it along with the extent nodes.
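
Purely to make the idea more concrete (nothing below is a settled format),
a leaf record in such a per block group tree could be as simple as:

/* illustrative only: a leaf entry of the per block group checksum tree,
 * keyed by the group-relative block number */
struct ext4_csum_tree_entry {
	__le32	ct_blk;		/* group-relative block number (key) */
	__le32	ct_csum;	/* crc32c of that data block */
};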

File system scrub
=================

While this is certainly a feature we want to have in both userspace
e2fsprogs and the kernel, I do not have any design notes at this stage.




I am sure that there are other possibilities and variants of these design
ideas, but I think this should be enough to get a discussion started.
As it stands now, I think the most viable option is d), that is, a per block
group checksum tree, which gives us enough flexibility while not being too
complex a solution.

I'll try to update this description as it takes on a more concrete
structure, and I hope that we will have a productive discussion about
this at LSF.

Thanks!
-Lukas


* Re: Proposal draft for data checksumming for ext4
  2014-03-20 16:40 Proposal draft for data checksumming for ext4 Lukáš Czerner
@ 2014-03-20 17:59 ` Darrick J. Wong
  2014-03-24  1:59   ` Lukáš Czerner
  2014-04-28 16:21   ` Dmitry Monakhov
  0 siblings, 2 replies; 8+ messages in thread
From: Darrick J. Wong @ 2014-03-20 17:59 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4, Theodore Ts'o

On Thu, Mar 20, 2014 at 05:40:06PM +0100, Lukáš Czerner wrote:
> Hi all,
> 
> I've started thinking about implementing data checksumming for ext4 file
> system. This is not meant to be a formal proposal or a definitive design
> description since I am not that far yet, but just a few ideas to start
> the discussion and trying to figure out what the best design for data
> checksumming in ext4 might be.
> 
> 
> 
> 			   Data checksumming for ext4
> 				  Version 0.1
> 				 March 20, 2014
> 
> 
> Goal
> ====
> 
> The goal is to implement data checksumming for ext4 file system in order
> to improve data integrity and increase protection against silent data
> corruption while maintaining reasonable performance and usability of the
> file system.
> 
> While data checksums can be certainly used in different ways, for example
> data deduplication this proposal is very much focused on data integrity.
> 
> 
> Checksum function
> =================
> 
> By default I plan to use crc32c checksum, but I do not see a reason why not
> not to be able to support different checksum function. Also by default the
> checksum size should be 32 bits, but the plan is to make the format
> flexible enough to be able to support different checksum sizes.

<nod> Were you thinking of allowing the use of different functions for data and
metadata checksums?

> Checksumming and Validating
> ===========================
> 
> On write checksums on the data blocks need to be computed right before its
> bio is submitted and written out as metadata to its position (see bellow)
> after the bio completes (similarly as we do unwritten extent conversion
> today).
> 
> Similarly on read checksums needs to be computed after the bio completes
> and compared with the stored values to verify that the data is intact.
> 
> All of this should be done using workqueues (Concurrency Managed
> Workqueues) so we do not block the other operations and to spread the
> checksum computation and comparison across CPUs. One wq for reads and one
> for writes. Specific setup of the wq such as priority, or concurrency limits
> should be decided later based on the performance evaluation.
> 
> While we already have ext4 infrastructure to submit bios in
> fs/ext4/page-io.c where the entry point is ext4_bio_write_page() we would
> need the same for reads to be able to provide ext4 specific hooks for
> io completion.
> 
> 
> Where to store the checksums
> ============================
> 
> While the problems above are pretty straightforward when it comes to the
> design, actually storing and retrieving the data checksums from to/from
> the ext4 format requires much more thought to be efficient enough and play
> nicely with the overall ext4 design while trying not to be too intrusive.
> 
> I came up with several ideas about where to store and how to access data
> checksums. While some of the ideas might not be the most viable options,
> it's still interesting to think about the advantages and disadvantages of
> each particular solution.
> 
> a) Static layout
> ----------------
> 
> This scheme fits perfectly into the ext4 design. Checksum blocks
> would be preallocated the same way as we do with inode tables for example.
> Each block group should have it's own contiguous region of checksum blocks
> to be able to store checksums for bocks from entire block group it belongs
> to. Each checksum block would contain header including checksum of the
> checksum block.
> 
> We still have unused 4 Bytes in the ext4_group_desc structure, so storing
> a block number for the checksum table should not be a problem.

What if you have a 64bit filesystem?  Do you have some strategy in mind to work
around that?  What about the snapshot exclusion bitmap field?  Afaict that
never went in, so perhaps that field could be reused?

> Finding a checksum location of each block in the block group should be done
> in O(1) time, which is very good. Other advantage is a locality with the
> data blocks in question since both resides in the same block group.
> 
> Big disadvantage is the fact that this solution is not very flexibile which
> comes from the fact that the location of "checksum table" is statically
> located at a precise position in the file system at mkfs time.

Having a big dumb block of checksums would be easier to prefetch from disk for
fsck and kernel driver, rather than having to dig through some tree structure.
(More on that below)

> There are also other problems we should be concerned with. Ext4 file system
> does have support for metadata checksumming so all the metadata does have
> its own checksum. While we can avoid unnecessarily checksuming inodes, group
> descriptors and basicall all statically positioned metadata, we still have
> dynamically allocated metadata blocks such as extent blocks. These block
> do not have to be checksummed but we would still have space reserved in the
> checksum table.

Don't forget directory blocks--they (should) have checksums too, so you can
skip those.

I wonder, could we use this table to store backrefs too?  It would make the
table considerably larger, but then we could (potentially) reconstruct broken
extent trees.
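
Just to illustrate what that would do to the table format (nothing here is
an agreed-on layout), each slot might grow from a bare crc into something
like:

/* illustrative only: a checksum-table slot carrying a backref */
struct ext4_csum_backref_entry {
	__le32	cb_csum;	/* crc32c of the data block */
	__le32	cb_ino;		/* inode that owns the block */
	__le32	cb_lblk;	/* logical block within that inode */
};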

> I think that we should be able to make this feature without introducing any
> incompatibility, but it would make more sense to make it RO compatible only
> so we can preserve the checksums. But that's up to the implementation.

I think you'd have to have it be rocompat, otherwise you could write data with
an old kernel and a new kernel would freak out.

> b) Special inode
> ----------------
> 
> This is very "lazy" solution and should not be difficult to implement. The
> idea is to have a special inode which would store the checksum blocks in
> it's own data blocks.
> 
> The big disadvantage is that we would have to walk the extent tree twice for
> each read, or write. There is not much to say about this solution other than
> again we can make this feature without introducing any incompatibility, but
> it would probably make more sense to make it RO compatible to preserve the
> checksums.
> 
> c) Per inode checksum b-tree
> ----------------------------
> 
> See d)
> 
> d) Per block group checksum b-tree
> ----------------------------------
> 
> Those two schemes are very similar in that both would store checksum in a
> b-tree with a block number (we could use logical block number in per inode
> tree) as a key. Obviously finding a checksum would be in logarithmic time,
> while the size of the tree would be possibly much bigger in the per-inode
> case. In per block group case we will have much smaller boundary of
> number of checksum blocks stored.
> 
> This and the fact that we would have to have at least one checksum block
> per inode (which would be wasteful in the case of small files) is making per
> block group solution much more viable. However the major disadvantage of
> per block group solution is that the checksum tree would create a source of
> contention when reading/writing from/to a different inodes in the same block
> group. This might be mitigated by having a worker thread per a range of block
> groups - but it might still be a bottleneck.
> 
> Again we still have 4 Bytes in ext4_group_desc to store the pointer to the
> root of the tree. While the ext4_inode structure have 4Bytes of
> i_obso_faddr but that's not enough. So we would have to figure out where to
> store it - we could possibly abuse i_block to store it along with the extent
> nodes.

I think(?) your purpose in using either a special inode or a btree to store the
checksums is to avoid wasting checksum blocks on things that are already
checksummed?  I'm not sure that we'd save enough space to justify the extra
processing.

--D

> File system scrub
> =================
> 
> While this is certainly a feature which we want to have in both userspace
> e2fsprogs and kernel I do not have any design notes at this stage.
> 
> 
> 
> 
> I am sure that there are other possibilities and variants of those design
> ideas, but I think that this should be enough to have a discussion started.
> As I is not I think that the most viable option is d) that is, per block
> group checksum tree, which gives us enough flexibility while not being too
> complex solution.
> 
> I'll try to update this description as it will be getting more concrete
> structure and I hope that we will have some productive discussion about
> this at LSF.
> 
> Thanks!
> -Lukas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Proposal draft for data checksumming for ext4
  2014-03-20 17:59 ` Darrick J. Wong
@ 2014-03-24  1:59   ` Lukáš Czerner
  2014-03-25  1:48     ` Andreas Dilger
  2014-04-15  0:26     ` mingming cao
  2014-04-28 16:21   ` Dmitry Monakhov
  1 sibling, 2 replies; 8+ messages in thread
From: Lukáš Czerner @ 2014-03-24  1:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4, Theodore Ts'o


On Thu, 20 Mar 2014, Darrick J. Wong wrote:

> Date: Thu, 20 Mar 2014 10:59:50 -0700
> From: Darrick J. Wong <darrick.wong@oracle.com>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
> Subject: Re: Proposal draft for data checksumming for ext4
> 
> On Thu, Mar 20, 2014 at 05:40:06PM +0100, Lukáš Czerner wrote:
> > Hi all,
> > 
> > I've started thinking about implementing data checksumming for ext4 file
> > system. This is not meant to be a formal proposal or a definitive design
> > description since I am not that far yet, but just a few ideas to start
> > the discussion and trying to figure out what the best design for data
> > checksumming in ext4 might be.
> > 
> > 
> > 
> > 			   Data checksumming for ext4
> > 				  Version 0.1
> > 				 March 20, 2014
> > 
> > 
> > Goal
> > ====
> > 
> > The goal is to implement data checksumming for ext4 file system in order
> > to improve data integrity and increase protection against silent data
> > corruption while maintaining reasonable performance and usability of the
> > file system.
> > 
> > While data checksums can be certainly used in different ways, for example
> > data deduplication this proposal is very much focused on data integrity.
> > 
> > 
> > Checksum function
> > =================
> > 
> > By default I plan to use crc32c checksum, but I do not see a reason why not
> > not to be able to support different checksum function. Also by default the
> > checksum size should be 32 bits, but the plan is to make the format
> > flexible enough to be able to support different checksum sizes.
> 
> <nod> Were you thinking of allowing the use of different functions for data and
> metadata checksums?

Hi Darrick,

I have not, but I think that this would be very easy to do if we can
agree that it's good to have.


> 
> > Checksumming and Validating
> > ===========================
> > 
> > On write checksums on the data blocks need to be computed right before its
> > bio is submitted and written out as metadata to its position (see bellow)
> > after the bio completes (similarly as we do unwritten extent conversion
> > today).
> > 
> > Similarly on read checksums needs to be computed after the bio completes
> > and compared with the stored values to verify that the data is intact.
> > 
> > All of this should be done using workqueues (Concurrency Managed
> > Workqueues) so we do not block the other operations and to spread the
> > checksum computation and comparison across CPUs. One wq for reads and one
> > for writes. Specific setup of the wq such as priority, or concurrency limits
> > should be decided later based on the performance evaluation.
> > 
> > While we already have ext4 infrastructure to submit bios in
> > fs/ext4/page-io.c where the entry point is ext4_bio_write_page() we would
> > need the same for reads to be able to provide ext4 specific hooks for
> > io completion.
> > 
> > 
> > Where to store the checksums
> > ============================
> > 
> > While the problems above are pretty straightforward when it comes to the
> > design, actually storing and retrieving the data checksums from to/from
> > the ext4 format requires much more thought to be efficient enough and play
> > nicely with the overall ext4 design while trying not to be too intrusive.
> > 
> > I came up with several ideas about where to store and how to access data
> > checksums. While some of the ideas might not be the most viable options,
> > it's still interesting to think about the advantages and disadvantages of
> > each particular solution.
> > 
> > a) Static layout
> > ----------------
> > 
> > This scheme fits perfectly into the ext4 design. Checksum blocks
> > would be preallocated the same way as we do with inode tables for example.
> > Each block group should have it's own contiguous region of checksum blocks
> > to be able to store checksums for bocks from entire block group it belongs
> > to. Each checksum block would contain header including checksum of the
> > checksum block.
> > 
> > We still have unused 4 Bytes in the ext4_group_desc structure, so storing
> > a block number for the checksum table should not be a problem.
> 
> What if you have a 64bit filesystem?  Do you have some strategy in mind to work
> around that?  What about the snapshot exclusion bitmap field?  Afaict that
> never went in, so perhaps that field could be reused?

Yes, we can use the exclusion bitmap field. I think that would not be
a problem. We could also use addressing relative to the start of the block
group and keep the checksum table within the block group.

> 
> > Finding a checksum location of each block in the block group should be done
> > in O(1) time, which is very good. Other advantage is a locality with the
> > data blocks in question since both resides in the same block group.
> > 
> > Big disadvantage is the fact that this solution is not very flexibile which
> > comes from the fact that the location of "checksum table" is statically
> > located at a precise position in the file system at mkfs time.
> 
> Having a big dumb block of checksums would be easier to prefetch from disk for
> fsck and kernel driver, rather than having to dig through some tree structure.
> (More on that below)

I agree, it is also a much more robust solution than having a tree.

> 
> > There are also other problems we should be concerned with. Ext4 file system
> > does have support for metadata checksumming so all the metadata does have
> > its own checksum. While we can avoid unnecessarily checksuming inodes, group
> > descriptors and basicall all statically positioned metadata, we still have
> > dynamically allocated metadata blocks such as extent blocks. These block
> > do not have to be checksummed but we would still have space reserved in the
> > checksum table.
> 
> Don't forget directory blocks--they (should) have checksums too, so you can
> skip those.
> 
> I wonder, could we use this table to store backrefs too?  It would make the
> table considerably larger, but then we could (potentially) reconstruct broken
> extent trees.

Definitely, that is one thing I did not discuss here, but I'd like
to have the checksum blocks be self-descriptive so we always know
where they belong and who the owner is. So yes, having backrefs is a
really good idea.

> 
> > I think that we should be able to make this feature without introducing any
> > incompatibility, but it would make more sense to make it RO compatible only
> > so we can preserve the checksums. But that's up to the implementation.
> 
> I think you'd have to have it be rocompat, otherwise you could write data with
> an old kernel and a new kernel would freak out.

Yes, I think that we could make it not freak out, but we would lose
the checksums, so I think that making this rocompat will probably
make more sense.

Thanks!
-Lukas

> 
> > b) Special inode
> > ----------------
> > 
> > This is very "lazy" solution and should not be difficult to implement. The
> > idea is to have a special inode which would store the checksum blocks in
> > it's own data blocks.
> > 
> > The big disadvantage is that we would have to walk the extent tree twice for
> > each read, or write. There is not much to say about this solution other than
> > again we can make this feature without introducing any incompatibility, but
> > it would probably make more sense to make it RO compatible to preserve the
> > checksums.
> > 
> > c) Per inode checksum b-tree
> > ----------------------------
> > 
> > See d)
> > 
> > d) Per block group checksum b-tree
> > ----------------------------------
> > 
> > Those two schemes are very similar in that both would store checksum in a
> > b-tree with a block number (we could use logical block number in per inode
> > tree) as a key. Obviously finding a checksum would be in logarithmic time,
> > while the size of the tree would be possibly much bigger in the per-inode
> > case. In per block group case we will have much smaller boundary of
> > number of checksum blocks stored.
> > 
> > This and the fact that we would have to have at least one checksum block
> > per inode (which would be wasteful in the case of small files) is making per
> > block group solution much more viable. However the major disadvantage of
> > per block group solution is that the checksum tree would create a source of
> > contention when reading/writing from/to a different inodes in the same block
> > group. This might be mitigated by having a worker thread per a range of block
> > groups - but it might still be a bottleneck.
> > 
> > Again we still have 4 Bytes in ext4_group_desc to store the pointer to the
> > root of the tree. While the ext4_inode structure have 4Bytes of
> > i_obso_faddr but that's not enough. So we would have to figure out where to
> > store it - we could possibly abuse i_block to store it along with the extent
> > nodes.
> 
> I think(?) your purpose in using either a special inode or a btree to store the
> checksums is to avoid wasting checksum blocks on things that are already
> checksummed?  I'm not sure that we'd save enough space to justify the extra
> processing.
> 
> --D
> 
> > File system scrub
> > =================
> > 
> > While this is certainly a feature which we want to have in both userspace
> > e2fsprogs and kernel I do not have any design notes at this stage.
> > 
> > 
> > 
> > 
> > I am sure that there are other possibilities and variants of those design
> > ideas, but I think that this should be enough to have a discussion started.
> > As I is not I think that the most viable option is d) that is, per block
> > group checksum tree, which gives us enough flexibility while not being too
> > complex solution.
> > 
> > I'll try to update this description as it will be getting more concrete
> > structure and I hope that we will have some productive discussion about
> > this at LSF.
> > 
> > Thanks!
> > -Lukas
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: Proposal draft for data checksumming for ext4
  2014-03-24  1:59   ` Lukáš Czerner
@ 2014-03-25  1:48     ` Andreas Dilger
  2014-04-15  0:26     ` mingming cao
  1 sibling, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2014-03-25  1:48 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Darrick J. Wong, linux-ext4, Theodore Ts'o

Some quick thoughts on this:
- checksum blocks should cover all non-static blocks in the group,
  don't need separate checksums for itable, bitmap, and descriptors
- if it is complex to skip static blocks with their own checksums, just
  leave those blocks empty (zero checksum).
- address in group descriptor should be relative to other group metadata,
  for example the block bitmap, so it works with/without flex_bg and it is
  clear that "0" is an invalid checksum block array address

Cheers, Andreas

> On Mar 23, 2014, at 19:59, Lukáš Czerner <lczerner@redhat.com> wrote:
> 
>> On Thu, 20 Mar 2014, Darrick J. Wong wrote:
>> 
>> Date: Thu, 20 Mar 2014 10:59:50 -0700
>> From: Darrick J. Wong <darrick.wong@oracle.com>
>> To: Lukáš Czerner <lczerner@redhat.com>
>> Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
>> Subject: Re: Proposal draft for data checksumming for ext4
>> 
>>> On Thu, Mar 20, 2014 at 05:40:06PM +0100, Lukáš Czerner wrote:
>>> Hi all,
>>> 
>>> I've started thinking about implementing data checksumming for ext4 file
>>> system. This is not meant to be a formal proposal or a definitive design
>>> description since I am not that far yet, but just a few ideas to start
>>> the discussion and trying to figure out what the best design for data
>>> checksumming in ext4 might be.
>>> 
>>> 
>>> 
>>>               Data checksumming for ext4
>>>                  Version 0.1
>>>                 March 20, 2014
>>> 
>>> 
>>> Goal
>>> ====
>>> 
>>> The goal is to implement data checksumming for ext4 file system in order
>>> to improve data integrity and increase protection against silent data
>>> corruption while maintaining reasonable performance and usability of the
>>> file system.
>>> 
>>> While data checksums can be certainly used in different ways, for example
>>> data deduplication this proposal is very much focused on data integrity.
>>> 
>>> 
>>> Checksum function
>>> =================
>>> 
>>> By default I plan to use crc32c checksum, but I do not see a reason why not
>>> not to be able to support different checksum function. Also by default the
>>> checksum size should be 32 bits, but the plan is to make the format
>>> flexible enough to be able to support different checksum sizes.
>> 
>> <nod> Were you thinking of allowing the use of different functions for data and
>> metadata checksums?
> 
> Hi Darrick,
> 
> I have not, but I think that this would be very easy to do if we can
> agree that it's good to have.
> 
> 
>> 
>>> Checksumming and Validating
>>> ===========================
>>> 
>>> On write checksums on the data blocks need to be computed right before its
>>> bio is submitted and written out as metadata to its position (see bellow)
>>> after the bio completes (similarly as we do unwritten extent conversion
>>> today).
>>> 
>>> Similarly on read checksums needs to be computed after the bio completes
>>> and compared with the stored values to verify that the data is intact.
>>> 
>>> All of this should be done using workqueues (Concurrency Managed
>>> Workqueues) so we do not block the other operations and to spread the
>>> checksum computation and comparison across CPUs. One wq for reads and one
>>> for writes. Specific setup of the wq such as priority, or concurrency limits
>>> should be decided later based on the performance evaluation.
>>> 
>>> While we already have ext4 infrastructure to submit bios in
>>> fs/ext4/page-io.c where the entry point is ext4_bio_write_page() we would
>>> need the same for reads to be able to provide ext4 specific hooks for
>>> io completion.
>>> 
>>> 
>>> Where to store the checksums
>>> ============================
>>> 
>>> While the problems above are pretty straightforward when it comes to the
>>> design, actually storing and retrieving the data checksums from to/from
>>> the ext4 format requires much more thought to be efficient enough and play
>>> nicely with the overall ext4 design while trying not to be too intrusive.
>>> 
>>> I came up with several ideas about where to store and how to access data
>>> checksums. While some of the ideas might not be the most viable options,
>>> it's still interesting to think about the advantages and disadvantages of
>>> each particular solution.
>>> 
>>> a) Static layout
>>> ----------------
>>> 
>>> This scheme fits perfectly into the ext4 design. Checksum blocks
>>> would be preallocated the same way as we do with inode tables for example.
>>> Each block group should have it's own contiguous region of checksum blocks
>>> to be able to store checksums for bocks from entire block group it belongs
>>> to. Each checksum block would contain header including checksum of the
>>> checksum block.
>>> 
>>> We still have unused 4 Bytes in the ext4_group_desc structure, so storing
>>> a block number for the checksum table should not be a problem.
>> 
>> What if you have a 64bit filesystem?  Do you have some strategy in mind to work
>> around that?  What about the snapshot exclusion bitmap field?  Afaict that
>> never went in, so perhaps that field could be reused?
> 
> Yes we can use the exclusion bitmap field. I think that would not be
> a problem. We could also use addressing from the start of the block
> group and keep the checksum table in the block group.
> 
>> 
>>> Finding a checksum location of each block in the block group should be done
>>> in O(1) time, which is very good. Other advantage is a locality with the
>>> data blocks in question since both resides in the same block group.
>>> 
>>> Big disadvantage is the fact that this solution is not very flexibile which
>>> comes from the fact that the location of "checksum table" is statically
>>> located at a precise position in the file system at mkfs time.
>> 
>> Having a big dumb block of checksums would be easier to prefetch from disk for
>> fsck and kernel driver, rather than having to dig through some tree structure.
>> (More on that below)
> 
> I agree, it is also much more robust solution than having a tree.
> 
>> 
>>> There are also other problems we should be concerned with. Ext4 file system
>>> does have support for metadata checksumming so all the metadata does have
>>> its own checksum. While we can avoid unnecessarily checksuming inodes, group
>>> descriptors and basicall all statically positioned metadata, we still have
>>> dynamically allocated metadata blocks such as extent blocks. These block
>>> do not have to be checksummed but we would still have space reserved in the
>>> checksum table.
>> 
>> Don't forget directory blocks--they (should) have checksums too, so you can
>> skip those.
>> 
>> I wonder, could we use this table to store backrefs too?  It would make the
>> table considerably larger, but then we could (potentially) reconstruct broken
>> extent trees.
> 
> Definitely, that is one thing I did not discussed here, but I'd like
> to have the checksum blocks self descriptive so we can alway know
> where it belongs and who is the owner. So yes, having a backrefs is
> really good idea.
> 
>> 
>>> I think that we should be able to make this feature without introducing any
>>> incompatibility, but it would make more sense to make it RO compatible only
>>> so we can preserve the checksums. But that's up to the implementation.
>> 
>> I think you'd have to have it be rocompat, otherwise you could write data with
>> an old kernel and a new kernel would freak out.
> 
> Yes, I think that we could make it not freak out, but we would loose
> the checksums, so for that I think that having this rocompat will
> probably make more sense.
> 
> Thanks!
> -Lukas
> 
>> 
>>> b) Special inode
>>> ----------------
>>> 
>>> This is very "lazy" solution and should not be difficult to implement. The
>>> idea is to have a special inode which would store the checksum blocks in
>>> it's own data blocks.
>>> 
>>> The big disadvantage is that we would have to walk the extent tree twice for
>>> each read, or write. There is not much to say about this solution other than
>>> again we can make this feature without introducing any incompatibility, but
>>> it would probably make more sense to make it RO compatible to preserve the
>>> checksums.
>>> 
>>> c) Per inode checksum b-tree
>>> ----------------------------
>>> 
>>> See d)
>>> 
>>> d) Per block group checksum b-tree
>>> ----------------------------------
>>> 
>>> Those two schemes are very similar in that both would store checksum in a
>>> b-tree with a block number (we could use logical block number in per inode
>>> tree) as a key. Obviously finding a checksum would be in logarithmic time,
>>> while the size of the tree would be possibly much bigger in the per-inode
>>> case. In per block group case we will have much smaller boundary of
>>> number of checksum blocks stored.
>>> 
>>> This and the fact that we would have to have at least one checksum block
>>> per inode (which would be wasteful in the case of small files) is making per
>>> block group solution much more viable. However the major disadvantage of
>>> per block group solution is that the checksum tree would create a source of
>>> contention when reading/writing from/to a different inodes in the same block
>>> group. This might be mitigated by having a worker thread per a range of block
>>> groups - but it might still be a bottleneck.
>>> 
>>> Again we still have 4 Bytes in ext4_group_desc to store the pointer to the
>>> root of the tree. While the ext4_inode structure have 4Bytes of
>>> i_obso_faddr but that's not enough. So we would have to figure out where to
>>> store it - we could possibly abuse i_block to store it along with the extent
>>> nodes.
>> 
>> I think(?) your purpose in using either a special inode or a btree to store the
>> checksums is to avoid wasting checksum blocks on things that are already
>> checksummed?  I'm not sure that we'd save enough space to justify the extra
>> processing.
>> 
>> --D
>> 
>>> File system scrub
>>> =================
>>> 
>>> While this is certainly a feature which we want to have in both userspace
>>> e2fsprogs and kernel I do not have any design notes at this stage.
>>> 
>>> 
>>> 
>>> 
>>> I am sure that there are other possibilities and variants of those design
>>> ideas, but I think that this should be enough to have a discussion started.
>>> As I is not I think that the most viable option is d) that is, per block
>>> group checksum tree, which gives us enough flexibility while not being too
>>> complex solution.
>>> 
>>> I'll try to update this description as it will be getting more concrete
>>> structure and I hope that we will have some productive discussion about
>>> this at LSF.
>>> 
>>> Thanks!
>>> -Lukas
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Proposal draft for data checksumming for ext4
  2014-03-24  1:59   ` Lukáš Czerner
  2014-03-25  1:48     ` Andreas Dilger
@ 2014-04-15  0:26     ` mingming cao
  1 sibling, 0 replies; 8+ messages in thread
From: mingming cao @ 2014-04-15  0:26 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Darrick J. Wong, linux-ext4, Theodore Ts'o

On 03/23/2014 06:59 PM, Lukáš Czerner wrote:
> On Thu, 20 Mar 2014, Darrick J. Wong wrote:
>
>> Date: Thu, 20 Mar 2014 10:59:50 -0700
>> From: Darrick J. Wong <darrick.wong@oracle.com>
>> To: Lukáš Czerner <lczerner@redhat.com>
>> Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
>> Subject: Re: Proposal draft for data checksumming for ext4
>>
>> On Thu, Mar 20, 2014 at 05:40:06PM +0100, Lukáš Czerner wrote:
>>> Hi all,
>>>
>>> I've started thinking about implementing data checksumming for ext4 file
>>> system. This is not meant to be a formal proposal or a definitive design
>>> description since I am not that far yet, but just a few ideas to start
>>> the discussion and trying to figure out what the best design for data
>>> checksumming in ext4 might be.
>>>
>>>
>>>
>>> 			   Data checksumming for ext4
>>> 				  Version 0.1
>>> 				 March 20, 2014
>>>
>>>
>>> Goal
>>> ====
>>>
>>> The goal is to implement data checksumming for ext4 file system in order
>>> to improve data integrity and increase protection against silent data
>>> corruption while maintaining reasonable performance and usability of the
>>> file system.
>>>
>>> While data checksums can be certainly used in different ways, for example
>>> data deduplication this proposal is very much focused on data integrity.
>>>
>>>
>>> Checksum function
>>> =================
>>>
>>> By default I plan to use crc32c checksum, but I do not see a reason why not
>>> not to be able to support different checksum function. Also by default the
>>> checksum size should be 32 bits, but the plan is to make the format
>>> flexible enough to be able to support different checksum sizes.
>>
>> <nod> Were you thinking of allowing the use of different functions for data and
>> metadata checksums?
>
> Hi Darrick,
>
> I have not, but I think that this would be very easy to do if we can
> agree that it's good to have.
>
>
>>
>>> Checksumming and Validating
>>> ===========================
>>>
>>> On write checksums on the data blocks need to be computed right before its
>>> bio is submitted and written out as metadata to its position (see bellow)
>>> after the bio completes (similarly as we do unwritten extent conversion
>>> today).
>>>
>>> Similarly on read checksums needs to be computed after the bio completes
>>> and compared with the stored values to verify that the data is intact.
>>>
>>> All of this should be done using workqueues (Concurrency Managed
>>> Workqueues) so we do not block the other operations and to spread the
>>> checksum computation and comparison across CPUs. One wq for reads and one
>>> for writes. Specific setup of the wq such as priority, or concurrency limits
>>> should be decided later based on the performance evaluation.
>>>
>>> While we already have ext4 infrastructure to submit bios in
>>> fs/ext4/page-io.c where the entry point is ext4_bio_write_page() we would
>>> need the same for reads to be able to provide ext4 specific hooks for
>>> io completion.
>>>
>>>
>>> Where to store the checksums
>>> ============================
>>>
>>> While the problems above are pretty straightforward when it comes to the
>>> design, actually storing and retrieving the data checksums from to/from
>>> the ext4 format requires much more thought to be efficient enough and play
>>> nicely with the overall ext4 design while trying not to be too intrusive.
>>>
>>> I came up with several ideas about where to store and how to access data
>>> checksums. While some of the ideas might not be the most viable options,
>>> it's still interesting to think about the advantages and disadvantages of
>>> each particular solution.
>>>
>>> a) Static layout
>>> ----------------
>>>
>>> This scheme fits perfectly into the ext4 design. Checksum blocks
>>> would be preallocated the same way as we do with inode tables for example.
>>> Each block group should have it's own contiguous region of checksum blocks
>>> to be able to store checksums for bocks from entire block group it belongs
>>> to. Each checksum block would contain header including checksum of the
>>> checksum block.
>>>
>>> We still have unused 4 Bytes in the ext4_group_desc structure, so storing
>>> a block number for the checksum table should not be a problem.
>>
>> What if you have a 64bit filesystem?  Do you have some strategy in mind to work
>> around that?  What about the snapshot exclusion bitmap field?  Afaict that
>> never went in, so perhaps that field could be reused?
>
> Yes we can use the exclusion bitmap field. I think that would not be
> a problem. We could also use addressing from the start of the block
> group and keep the checksum table in the block group.
>
>>
>>> Finding a checksum location of each block in the block group should be done
>>> in O(1) time, which is very good. Other advantage is a locality with the
>>> data blocks in question since both resides in the same block group.
>>>
>>> Big disadvantage is the fact that this solution is not very flexibile which
>>> comes from the fact that the location of "checksum table" is statically
>>> located at a precise position in the file system at mkfs time.
>>
>> Having a big dumb block of checksums would be easier to prefetch from disk for
>> fsck and kernel driver, rather than having to dig through some tree structure.
>> (More on that below)
>
> I agree, it is also much more robust solution than having a tree.
>
>>
>>> There are also other problems we should be concerned with. Ext4 file system
>>> does have support for metadata checksumming so all the metadata does have
>>> its own checksum. While we can avoid unnecessarily checksuming inodes, group
>>> descriptors and basicall all statically positioned metadata, we still have
>>> dynamically allocated metadata blocks such as extent blocks. These block
>>> do not have to be checksummed but we would still have space reserved in the
>>> checksum table.
>>
>> Don't forget directory blocks--they (should) have checksums too, so you can
>> skip those.
>>
>> I wonder, could we use this table to store backrefs too?  It would make the
>> table considerably larger, but then we could (potentially) reconstruct broken
>> extent trees.
>
> Definitely, that is one thing I did not discussed here, but I'd like
> to have the checksum blocks self descriptive so we can alway know
> where it belongs and who is the owner. So yes, having a backrefs is
> really good idea.
>

Hello Lukas, Darrick,


I just read this document, quite interesting. I wonder if this was
discussed at the ext4 workshop in Napa and whether any agreement was
reached as to how to store the data checksums?

Have you thought about storing checksums/backrefs per extent rather than
per data block?

If we are interested in adding backrefs, then I wonder if we could use
this table to store reflink counters as well? I understand that only data
blocks referenced by multiple files need this, but it is an O(1), faster
solution than a per-inode refcount tree, we are thinking of tracking data
checksums/backrefs anyway, and the refcounters could be checksummed at the
same time.

Mingming
>>
>>> I think that we should be able to make this feature without introducing any
>>> incompatibility, but it would make more sense to make it RO compatible only
>>> so we can preserve the checksums. But that's up to the implementation.
>>
>> I think you'd have to have it be rocompat, otherwise you could write data with
>> an old kernel and a new kernel would freak out.
>
> Yes, I think that we could make it not freak out, but we would loose
> the checksums, so for that I think that having this rocompat will
> probably make more sense.
>
> Thanks!
> -Lukas
>
>>
>>> b) Special inode
>>> ----------------
>>>
>>> This is very "lazy" solution and should not be difficult to implement. The
>>> idea is to have a special inode which would store the checksum blocks in
>>> it's own data blocks.
>>>
>>> The big disadvantage is that we would have to walk the extent tree twice for
>>> each read, or write. There is not much to say about this solution other than
>>> again we can make this feature without introducing any incompatibility, but
>>> it would probably make more sense to make it RO compatible to preserve the
>>> checksums.
>>>
>>> c) Per inode checksum b-tree
>>> ----------------------------
>>>
>>> See d)
>>>
>>> d) Per block group checksum b-tree
>>> ----------------------------------
>>>
>>> Those two schemes are very similar in that both would store checksum in a
>>> b-tree with a block number (we could use logical block number in per inode
>>> tree) as a key. Obviously finding a checksum would be in logarithmic time,
>>> while the size of the tree would be possibly much bigger in the per-inode
>>> case. In per block group case we will have much smaller boundary of
>>> number of checksum blocks stored.
>>>
>>> This and the fact that we would have to have at least one checksum block
>>> per inode (which would be wasteful in the case of small files) is making per
>>> block group solution much more viable. However the major disadvantage of
>>> per block group solution is that the checksum tree would create a source of
>>> contention when reading/writing from/to a different inodes in the same block
>>> group. This might be mitigated by having a worker thread per a range of block
>>> groups - but it might still be a bottleneck.
>>>
>>> Again we still have 4 Bytes in ext4_group_desc to store the pointer to the
>>> root of the tree. While the ext4_inode structure have 4Bytes of
>>> i_obso_faddr but that's not enough. So we would have to figure out where to
>>> store it - we could possibly abuse i_block to store it along with the extent
>>> nodes.
>>
>> I think(?) your purpose in using either a special inode or a btree to store the
>> checksums is to avoid wasting checksum blocks on things that are already
>> checksummed?  I'm not sure that we'd save enough space to justify the extra
>> processing.
>>
>> --D
>>
>>> File system scrub
>>> =================
>>>
>>> While this is certainly a feature which we want to have in both userspace
>>> e2fsprogs and kernel I do not have any design notes at this stage.
>>>
>>>
>>>
>>>
>>> I am sure that there are other possibilities and variants of those design
>>> ideas, but I think that this should be enough to have a discussion started.
>>> As I is not I think that the most viable option is d) that is, per block
>>> group checksum tree, which gives us enough flexibility while not being too
>>> complex solution.
>>>
>>> I'll try to update this description as it will be getting more concrete
>>> structure and I hope that we will have some productive discussion about
>>> this at LSF.
>>>
>>> Thanks!
>>> -Lukas
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Proposal draft for data checksumming for ext4
  2014-03-20 17:59 ` Darrick J. Wong
  2014-03-24  1:59   ` Lukáš Czerner
@ 2014-04-28 16:21   ` Dmitry Monakhov
  2014-04-28 19:36     ` Darrick J. Wong
  2014-04-28 20:16     ` Andreas Dilger
  1 sibling, 2 replies; 8+ messages in thread
From: Dmitry Monakhov @ 2014-04-28 16:21 UTC (permalink / raw)
  To: Darrick J. Wong, Lukáš Czerner; +Cc: linux-ext4, Theodore Ts'o

On Thu, 20 Mar 2014 10:59:50 -0700, "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
> On Thu, Mar 20, 2014 at 05:40:06PM +0100, Lukáš Czerner wrote:
> > Hi all,
> > 
> > I've started thinking about implementing data checksumming for ext4 file
> > system. This is not meant to be a formal proposal or a definitive design
> > description since I am not that far yet, but just a few ideas to start
> > the discussion and trying to figure out what the best design for data
> > checksumming in ext4 might be.
> > 
> > 
> > 
> > 			   Data checksumming for ext4
> > 				  Version 0.1
> > 				 March 20, 2014
> > 
> > 
> > Goal
> > ====
> > 
> > The goal is to implement data checksumming for ext4 file system in order
> > to improve data integrity and increase protection against silent data
> > corruption while maintaining reasonable performance and usability of the
> > file system.
> > 
> > While data checksums can be certainly used in different ways, for example
> > data deduplication this proposal is very much focused on data integrity.
> > 
> > 
> > Checksum function
> > =================
> > 
> > By default I plan to use crc32c checksum, but I do not see a reason why not
> > not to be able to support different checksum function. Also by default the
> > checksum size should be 32 bits, but the plan is to make the format
> > flexible enough to be able to support different checksum sizes.
> 
> <nod> Were you thinking of allowing the use of different functions for data and
> metadata checksums?
> 
> > Checksumming and Validating
> > ===========================
> > 
> > On write checksums on the data blocks need to be computed right before its
> > bio is submitted and written out as metadata to its position (see bellow)
> > after the bio completes (similarly as we do unwritten extent conversion
> > today).
> > 
> > Similarly on read checksums needs to be computed after the bio completes
> > and compared with the stored values to verify that the data is intact.
> > 
> > All of this should be done using workqueues (Concurrency Managed
> > Workqueues) so we do not block the other operations and to spread the
> > checksum computation and comparison across CPUs. One wq for reads and one
> > for writes. Specific setup of the wq such as priority, or concurrency limits
> > should be decided later based on the performance evaluation.
> > 
> > While we already have ext4 infrastructure to submit bios in
> > fs/ext4/page-io.c where the entry point is ext4_bio_write_page() we would
> > need the same for reads to be able to provide ext4 specific hooks for
> > io completion.
> > 
> > 
> > Where to store the checksums
> > ============================
> > 
> > While the problems above are pretty straightforward when it comes to the
> > design, actually storing and retrieving the data checksums from to/from
> > the ext4 format requires much more thought to be efficient enough and play
> > nicely with the overall ext4 design while trying not to be too intrusive.
> > 
> > I came up with several ideas about where to store and how to access data
> > checksums. While some of the ideas might not be the most viable options,
> > it's still interesting to think about the advantages and disadvantages of
> > each particular solution.
> > 
> > a) Static layout
> > ----------------
> > 
> > This scheme fits perfectly into the ext4 design. Checksum blocks
> > would be preallocated the same way as we do with inode tables for example.
> > Each block group should have it's own contiguous region of checksum blocks
> > to be able to store checksums for bocks from entire block group it belongs
> > to. Each checksum block would contain header including checksum of the
> > checksum block.
Oh. The thing that bothers me most about this feature is possible
performance degradation. The number of seeks increases dramatically because
the csum block is not contiguous with the data block. Of course the journal
should absorb that, and the real io will happen during journal checkpoint.
But I assume that a mail server which does a lot of
create()/write()/fsync() will complain about bad performance.

BTW: it looks like we do not try to optimize the io pattern inside
jbd2_log_do_checkpoint(). For example, __flush_batch() could submit
buffers in sorted order (according to block numbers).
 
 
> > 
> > We still have unused 4 Bytes in the ext4_group_desc structure, so storing
> > a block number for the checksum table should not be a problem.
> 
> What if you have a 64bit filesystem?  Do you have some strategy in mind to work
> around that?  What about the snapshot exclusion bitmap field?  Afaict that
> never went in, so perhaps that field could be reused?
> 
> > Finding a checksum location of each block in the block group should be done
> > in O(1) time, which is very good. Other advantage is a locality with the
> > data blocks in question since both resides in the same block group.
> > 
> > Big disadvantage is the fact that this solution is not very flexibile which
> > comes from the fact that the location of "checksum table" is statically
> > located at a precise position in the file system at mkfs time.
> 
> Having a big dumb block of checksums would be easier to prefetch from disk for
> fsck and kernel driver, rather than having to dig through some tree structure.
> (More on that below)
> 
> > There are also other problems we should be concerned with. Ext4 file system
> > does have support for metadata checksumming so all the metadata does have
> > its own checksum. While we can avoid unnecessarily checksuming inodes, group
> > descriptors and basicall all statically positioned metadata, we still have
> > dynamically allocated metadata blocks such as extent blocks. These block
> > do not have to be checksummed but we would still have space reserved in the
> > checksum table.
> 
> Don't forget directory blocks--they (should) have checksums too, so you can
> skip those.
Just a quick note: we can hide the checksum for a directory inside
ext4_dir_entry_2 for one of the special dirs '.' or '..' simply by
increasing ->rec_len, which makes this feature compatible with older
filesystems.
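
A rough sketch of the trick (hypothetical helper; the offset assumes the
usual 12-byte '.' entry, i.e. the 8-byte header plus "." padded to 4 bytes):

/* Growing the '.' entry's rec_len by 4 bytes leaves room for a checksum
 * right behind it; old kernels simply skip those bytes via rec_len. */
static __le32 *dot_entry_csum(struct ext4_dir_entry_2 *dot)
{
	return (__le32 *)((char *)dot + 12);
}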
> 
> I wonder, could we use this table to store backrefs too?  It would make the
> table considerably larger, but then we could (potentially) reconstruct broken
> extent trees.
> 
> > I think that we should be able to make this feature without introducing any
> > incompatibility, but it would make more sense to make it RO compatible only
> > so we can preserve the checksums. But that's up to the implementation.
> 
> I think you'd have to have it be rocompat, otherwise you could write data with
> an old kernel and a new kernel would freak out.
> 
> > b) Special inode
> > ----------------
> > 
> > This is very "lazy" solution and should not be difficult to implement. The
> > idea is to have a special inode which would store the checksum blocks in
> > it's own data blocks.
> > 
> > The big disadvantage is that we would have to walk the extent tree twice for
> > each read, or write. There is not much to say about this solution other than
> > again we can make this feature without introducing any incompatibility, but
> > it would probably make more sense to make it RO compatible to preserve the
> > checksums.
> > 
> > c) Per inode checksum b-tree
> > ----------------------------
> > 
> > See d)
> > 
> > d) Per block group checksum b-tree
> > ----------------------------------
> > 
> > Those two schemes are very similar in that both would store checksum in a
> > b-tree with a block number (we could use logical block number in per inode
> > tree) as a key. Obviously finding a checksum would be in logarithmic time,
> > while the size of the tree would be possibly much bigger in the per-inode
> > case. In per block group case we will have much smaller boundary of
> > number of checksum blocks stored.
> > 
> > This and the fact that we would have to have at least one checksum block
> > per inode (which would be wasteful in the case of small files) is making per
> > block group solution much more viable. However the major disadvantage of
> > per block group solution is that the checksum tree would create a source of
> > contention when reading/writing from/to a different inodes in the same block
> > group. This might be mitigated by having a worker thread per a range of block
> > groups - but it might still be a bottleneck.
> > 
> > Again we still have 4 Bytes in ext4_group_desc to store the pointer to the
> > root of the tree. While the ext4_inode structure have 4Bytes of
> > i_obso_faddr but that's not enough. So we would have to figure out where to
> > store it - we could possibly abuse i_block to store it along with the extent
> > nodes.
> 
> I think(?) your purpose in using either a special inode or a btree to store the
> checksums is to avoid wasting checksum blocks on things that are already
> checksummed?  I'm not sure that we'd save enough space to justify the extra
> processing.
> 
> --D
> 
> > File system scrub
> > =================
> > 
> > While this is certainly a feature which we want to have in both userspace
> > e2fsprogs and kernel I do not have any design notes at this stage.
> > 
> > 
> > 
> > 
> > I am sure that there are other possibilities and variants of those design
> > ideas, but I think that this should be enough to have a discussion started.
> > As I is not I think that the most viable option is d) that is, per block
> > group checksum tree, which gives us enough flexibility while not being too
> > complex solution.
> > 
> > I'll try to update this description as it will be getting more concrete
> > structure and I hope that we will have some productive discussion about
> > this at LSF.
> > 
> > Thanks!
> > -Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Proposal draft for data checksumming for ext4
  2014-04-28 16:21   ` Dmitry Monakhov
@ 2014-04-28 19:36     ` Darrick J. Wong
  2014-04-28 20:16     ` Andreas Dilger
  1 sibling, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2014-04-28 19:36 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Lukáš Czerner, linux-ext4, Theodore Ts'o

On Mon, Apr 28, 2014 at 08:21:51PM +0400, Dmitry Monakhov wrote:
> On Thu, 20 Mar 2014 10:59:50 -0700, "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
> > On Thu, Mar 20, 2014 at 05:40:06PM +0100, Lukáš Czerner wrote:
> > > Hi all,
> > > 
> > > I've started thinking about implementing data checksumming for ext4 file
> > > system. This is not meant to be a formal proposal or a definitive design
> > > description since I am not that far yet, but just a few ideas to start
> > > the discussion and trying to figure out what the best design for data
> > > checksumming in ext4 might be.
> > > 
> > > 
> > > 
> > > 			   Data checksumming for ext4
> > > 				  Version 0.1
> > > 				 March 20, 2014
> > > 
> > > 
> > > Goal
> > > ====
> > > 
> > > The goal is to implement data checksumming for ext4 file system in order
> > > to improve data integrity and increase protection against silent data
> > > corruption while maintaining reasonable performance and usability of the
> > > file system.
> > > 
> > > While data checksums can be certainly used in different ways, for example
> > > data deduplication this proposal is very much focused on data integrity.
> > > 
> > > 
> > > Checksum function
> > > =================
> > > 
> > > By default I plan to use crc32c checksum, but I do not see a reason why not
> > > not to be able to support different checksum function. Also by default the
> > > checksum size should be 32 bits, but the plan is to make the format
> > > flexible enough to be able to support different checksum sizes.
> > 
> > <nod> Were you thinking of allowing the use of different functions for data and
> > metadata checksums?
> > 
> > > Checksumming and Validating
> > > ===========================
> > > 
> > > On write checksums on the data blocks need to be computed right before its
> > > bio is submitted and written out as metadata to its position (see bellow)
> > > after the bio completes (similarly as we do unwritten extent conversion
> > > today).
> > > 
> > > Similarly on read checksums needs to be computed after the bio completes
> > > and compared with the stored values to verify that the data is intact.
> > > 
> > > All of this should be done using workqueues (Concurrency Managed
> > > Workqueues) so we do not block the other operations and to spread the
> > > checksum computation and comparison across CPUs. One wq for reads and one
> > > for writes. Specific setup of the wq such as priority, or concurrency limits
> > > should be decided later based on the performance evaluation.
> > > 
> > > While we already have ext4 infrastructure to submit bios in
> > > fs/ext4/page-io.c where the entry point is ext4_bio_write_page() we would
> > > need the same for reads to be able to provide ext4 specific hooks for
> > > io completion.
> > > 
> > > 
> > > Where to store the checksums
> > > ============================
> > > 
> > > While the problems above are pretty straightforward when it comes to the
> > > design, actually storing and retrieving the data checksums from to/from
> > > the ext4 format requires much more thought to be efficient enough and play
> > > nicely with the overall ext4 design while trying not to be too intrusive.
> > > 
> > > I came up with several ideas about where to store and how to access data
> > > checksums. While some of the ideas might not be the most viable options,
> > > it's still interesting to think about the advantages and disadvantages of
> > > each particular solution.
> > > 
> > > a) Static layout
> > > ----------------
> > > 
> > > This scheme fits perfectly into the ext4 design. Checksum blocks
> > > would be preallocated the same way as we do with inode tables for example.
> > > Each block group should have it's own contiguous region of checksum blocks
> > > to be able to store checksums for bocks from entire block group it belongs
> > > to. Each checksum block would contain header including checksum of the
> > > checksum block.
> Oh. The thing that bothers me most about this feature is possible
> performance degradation. The number of seeks increases dramatically because
> the checksum block is not contiguous with the data block. Of course the
> journal should absorb that, and the real IO will happen during the journal
> checkpoint. But I assume that a mail server which does a lot of
> create()/write()/fsync() will complain about bad performance.
> 
> BTW: it looks like we do not try to optimize the IO pattern inside
> jbd2_log_do_checkpoint(). For example, __flush_batch() could submit
> buffers in sorted order (according to block numbers).
>  
>  
> > > 
> > > We still have unused 4 Bytes in the ext4_group_desc structure, so storing
> > > a block number for the checksum table should not be a problem.
> > 
> > What if you have a 64bit filesystem?  Do you have some strategy in mind to work
> > around that?  What about the snapshot exclusion bitmap field?  Afaict that
> > never went in, so perhaps that field could be reused?
> > 
> > > Finding a checksum location of each block in the block group should be done
> > > in O(1) time, which is very good. Other advantage is a locality with the
> > > data blocks in question since both resides in the same block group.
> > > 
> > > Big disadvantage is the fact that this solution is not very flexibile which
> > > comes from the fact that the location of "checksum table" is statically
> > > located at a precise position in the file system at mkfs time.
> > 
> > Having a big dumb block of checksums would be easier to prefetch from disk for
> > fsck and kernel driver, rather than having to dig through some tree structure.
> > (More on that below)
> > 
> > > There are also other problems we should be concerned with. Ext4 file system
> > > does have support for metadata checksumming so all the metadata does have
> > > its own checksum. While we can avoid unnecessarily checksuming inodes, group
> > > descriptors and basicall all statically positioned metadata, we still have
> > > dynamically allocated metadata blocks such as extent blocks. These block
> > > do not have to be checksummed but we would still have space reserved in the
> > > checksum table.
> > 
> > Don't forget directory blocks--they (should) have checksums too, so you can
> > skip those.
> Just a quick note: we can hide a checksum for a directory inside
> ext4_dir_entry_2 for the special dirs '.' or '..' simply by increasing
> ->rec_len, which makes this feature compatible with older filesystems.

metadata_csum already does this.

--D
> > 
> > I wonder, could we use this table to store backrefs too?  It would make the
> > table considerably larger, but then we could (potentially) reconstruct broken
> > extent trees.
> > 
> > > I think that we should be able to make this feature without introducing any
> > > incompatibility, but it would make more sense to make it RO compatible only
> > > so we can preserve the checksums. But that's up to the implementation.
> > 
> > I think you'd have to have it be rocompat, otherwise you could write data with
> > an old kernel and a new kernel would freak out.
> > 
> > > b) Special inode
> > > ----------------
> > > 
> > > This is very "lazy" solution and should not be difficult to implement. The
> > > idea is to have a special inode which would store the checksum blocks in
> > > it's own data blocks.
> > > 
> > > The big disadvantage is that we would have to walk the extent tree twice for
> > > each read, or write. There is not much to say about this solution other than
> > > again we can make this feature without introducing any incompatibility, but
> > > it would probably make more sense to make it RO compatible to preserve the
> > > checksums.
> > > 
> > > c) Per inode checksum b-tree
> > > ----------------------------
> > > 
> > > See d)
> > > 
> > > d) Per block group checksum b-tree
> > > ----------------------------------
> > > 
> > > Those two schemes are very similar in that both would store checksum in a
> > > b-tree with a block number (we could use logical block number in per inode
> > > tree) as a key. Obviously finding a checksum would be in logarithmic time,
> > > while the size of the tree would be possibly much bigger in the per-inode
> > > case. In per block group case we will have much smaller boundary of
> > > number of checksum blocks stored.
> > > 
> > > This and the fact that we would have to have at least one checksum block
> > > per inode (which would be wasteful in the case of small files) is making per
> > > block group solution much more viable. However the major disadvantage of
> > > per block group solution is that the checksum tree would create a source of
> > > contention when reading/writing from/to a different inodes in the same block
> > > group. This might be mitigated by having a worker thread per a range of block
> > > groups - but it might still be a bottleneck.
> > > 
> > > Again we still have 4 Bytes in ext4_group_desc to store the pointer to the
> > > root of the tree. While the ext4_inode structure have 4Bytes of
> > > i_obso_faddr but that's not enough. So we would have to figure out where to
> > > store it - we could possibly abuse i_block to store it along with the extent
> > > nodes.
> > 
> > I think(?) your purpose in using either a special inode or a btree to store the
> > checksums is to avoid wasting checksum blocks on things that are already
> > checksummed?  I'm not sure that we'd save enough space to justify the extra
> > processing.
> > 
> > --D
> > 
> > > File system scrub
> > > =================
> > > 
> > > While this is certainly a feature which we want to have in both userspace
> > > e2fsprogs and kernel I do not have any design notes at this stage.
> > > 
> > > 
> > > 
> > > 
> > > I am sure that there are other possibilities and variants of those design
> > > ideas, but I think that this should be enough to have a discussion started.
> > > As I is not I think that the most viable option is d) that is, per block
> > > group checksum tree, which gives us enough flexibility while not being too
> > > complex solution.
> > > 
> > > I'll try to update this description as it will be getting more concrete
> > > structure and I hope that we will have some productive discussion about
> > > this at LSF.
> > > 
> > > Thanks!
> > > -Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Proposal draft for data checksumming for ext4
  2014-04-28 16:21   ` Dmitry Monakhov
  2014-04-28 19:36     ` Darrick J. Wong
@ 2014-04-28 20:16     ` Andreas Dilger
  1 sibling, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2014-04-28 20:16 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Darrick J. Wong, Lukáš Czerner, linux-ext4, Theodore Ts'o


[-- Attachment #1.1: Type: text/plain, Size: 2407 bytes --]


On Apr 28, 2014, at 10:21 AM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
>> On Thu, Mar 20, 2014 at 05:40:06PM +0100, Lukáš Czerner wrote:
>>> There are also other problems we should be concerned with. Ext4 file system
>>> does have support for metadata checksumming so all the metadata does have
>>> its own checksum. While we can avoid unnecessarily checksuming inodes, group
>>> descriptors and basicall all statically positioned metadata, we still have
>>> dynamically allocated metadata blocks such as extent blocks. These block
>>> do not have to be checksummed but we would still have space reserved in the
>>> checksum table.
>> 
>> Don't forget directory blocks--they (should) have checksums too, so you can
>> skip those.
> 
> Just a quick note: we can hide a checksum for a directory inside
> ext4_dir_entry_2 for the special dirs '.' or '..' simply by increasing
> ->rec_len, which makes this feature compatible with older filesystems.

First note - the htree index information for each directory is already
stored after the ".." entry in block 0 of the directory.

Also note that there is a feature we developed for Lustre named
"dirdata" (EXT4_FEATURE_INCOMPAT_DIRDATA is already reserved) that
allows storing more information with each directory entry[*].  We
use this to store a 128-bit identifier with each object so that it is
unique across the cluster and available efficiently for readdir.

We needed to modify the handling of the htree index data so that
it properly skips the extended directory entry (patch attached), and
this also cleans up the code a bit to conform to modern coding style.
Without this patch, the htree code assumes that the dx_info struct
immediately follows the hard-coded "." and ".." entries and does not
actually check the de->rec_len of these entries to determine the right
amount of data to skip.

The patch is based on an older kernel and may not apply directly to
mainline, but is required for any similar changes in this area.

That said, I think that Darrick's metadata checksum patches already
store a per-directory-block checksum at the end of each directory block
in a dummy entry (a toy sketch of that layout is below), so I'm not sure
what else is needed here?
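
For reference, a toy sketch of that layout (reconstructed from memory --
if I remember right, the real structure in ext4.h is ext4_dir_entry_tail,
and the field names and values below are simplified): the last 12 bytes of
each directory block are formatted as a dirent with inode 0 and name_len 0,
so an old kernel sees unused space while a checksumming kernel reads the
crc out of it.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* simplified stand-in for the tail entry used by metadata_csum */
struct toy_dirent_tail {
	uint32_t inode;		/* 0: looks like a deleted/unused entry */
	uint16_t rec_len;	/* 12: exactly the size of this tail */
	uint8_t  name_len;	/* 0: no name */
	uint8_t  file_type;	/* fake type marking this as the tail */
	uint32_t checksum;	/* checksum of the rest of the dir block */
};

int main(void)
{
	unsigned char dirblock[4096] = {0};
	struct toy_dirent_tail tail = {0};

	tail.inode = 0;
	tail.rec_len = sizeof(tail);		/* 12 bytes */
	tail.name_len = 0;
	tail.file_type = 0xde;			/* arbitrary marker value */
	tail.checksum = 0x12345678u;		/* would be crc32c in practice */

	/* the tail occupies the last 12 bytes of every directory block */
	memcpy(dirblock + sizeof(dirblock) - sizeof(tail), &tail, sizeof(tail));

	printf("tail at offset %zu of a %zu-byte block, csum 0x%08x\n",
	       sizeof(dirblock) - sizeof(tail), sizeof(dirblock),
	       (unsigned)tail.checksum);
	return 0;
}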

Cheers, Andreas

[*] for reference the dirdata patch is at http://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob;f=ldiskfs/kernel_patches/patches/sles11sp2/ext4-data-in-dirent.patch


[-- Attachment #1.2: ext4-kill-dx_root.patch --]
[-- Type: application/octet-stream, Size: 7043 bytes --]

diff -r -u linux-stage.orig/fs/ext4/namei.c linux-stage/fs/ext4/namei.c
--- linux-stage.orig/fs/ext4/namei.c	2012-12-31 15:03:28.000000000 -0500
+++ linux-stage/fs/ext4/namei.c	2012-12-31 15:06:16.000000000 -0500
@@ -115,22 +115,13 @@
  * hash version mod 4 should never be 0.  Sincerely, the paranoia department.
  */
 
-struct dx_root
+struct dx_root_info
 {
-	struct fake_dirent dot;
-	char dot_name[4];
-	struct fake_dirent dotdot;
-	char dotdot_name[4];
-	struct dx_root_info
-	{
-		__le32 reserved_zero;
-		u8 hash_version;
-		u8 info_length; /* 8 */
-		u8 indirect_levels;
-		u8 unused_flags;
-	}
-	info;
-	struct dx_entry	entries[0];
+	__le32 reserved_zero;
+	u8 hash_version;
+	u8 info_length; /* 8 */
+	u8 indirect_levels;
+	u8 unused_flags;
 };
 
 struct dx_node
@@ -220,6 +211,17 @@
 	ext3_rec_len_from_disk(p->rec_len));
 }
  
+struct dx_root_info *dx_get_dx_info(char *buf)
+{
+	struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)buf;
+
+	/* get dotdot first, then dx_info is right after dotdot entry */
+	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(1));
+	de = (struct ext4_dir_entry_2 *)((char *)de + EXT4_DIR_REC_LEN(2));
+
+	return (struct dx_root_info *)de;
+}
+
 /*
  * Future: use high four bits of block for coalesce-on-delete flags
  * Mask them off for now.
@@ -374,7 +375,7 @@
 {
 	unsigned count, indirect;
 	struct dx_entry *at, *entries, *p, *q, *m;
-	struct dx_root *root;
+	struct dx_root_info *dx_info;
 	struct buffer_head *bh;
 	struct dx_frame *frame = frame_in;
 	u32 hash;
@@ -382,17 +383,18 @@
 	frame->bh = NULL;
 	if (!(bh = ext4_bread (NULL,dir, 0, 0, err)))
 		goto fail;
-	root = (struct dx_root *) bh->b_data;
-	if (root->info.hash_version != DX_HASH_TEA &&
-	    root->info.hash_version != DX_HASH_HALF_MD4 &&
-	    root->info.hash_version != DX_HASH_LEGACY) {
-		ext4_warning(dir->i_sb, "Unrecognised inode hash code %d for directory "
-                             "#%lu", root->info.hash_version, dir->i_ino);
+
+	dx_info = dx_get_dx_info(bh->b_data);
+	if (dx_info->hash_version != DX_HASH_TEA &&
+	    dx_info->hash_version != DX_HASH_HALF_MD4 &&
+	    dx_info->hash_version != DX_HASH_LEGACY) {
+		ext4_warning(dir->i_sb, "Unknown inode hash code %u for "
+			    "directory %lu", dx_info->hash_version, dir->i_ino);
 		brelse(bh);
 		*err = ERR_BAD_DX_DIR;
 		goto fail;
 	}
-	hinfo->hash_version = root->info.hash_version;
+	hinfo->hash_version = dx_info->hash_version;
 	if (hinfo->hash_version <= DX_HASH_TEA)
 		hinfo->hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
 	hinfo->seed = EXT4_SB(dir->i_sb)->s_hash_seed;
@@ -400,27 +402,25 @@
 		ext4fs_dirhash(d_name->name, d_name->len, hinfo);
 	hash = hinfo->hash;
 
-	if (root->info.unused_flags & 1) {
+	if (dx_info->unused_flags & 1) {
 		ext4_warning(dir->i_sb, "Unimplemented inode hash flags: %#06x",
-			     root->info.unused_flags);
+			     dx_info->unused_flags);
 		brelse(bh);
 		*err = ERR_BAD_DX_DIR;
 		goto fail;
 	}
 
-	if ((indirect = root->info.indirect_levels) > 1) {
+	indirect = dx_info->indirect_levels;
+	if (indirect > 1) {
 		ext4_warning(dir->i_sb, "Unimplemented inode hash depth: %#06x",
-			     root->info.indirect_levels);
+			     dx_info->indirect_levels);
 		brelse(bh);
 		*err = ERR_BAD_DX_DIR;
 		goto fail;
 	}
 
-	entries = (struct dx_entry *) (((char *)&root->info) +
-				       root->info.info_length);
-
-	if (dx_get_limit(entries) != dx_root_limit(dir,
-						   root->info.info_length)) {
+	entries = (struct dx_entry *)((char *)dx_info + dx_info->info_length);
+	if (dx_get_limit(entries) != dx_root_limit(dir, dx_info->info_length)) {
 		ext4_warning(dir->i_sb, "dx entry: limit != root limit");
 		brelse(bh);
 		*err = ERR_BAD_DX_DIR;
@@ -504,7 +505,7 @@
 	if (frames[0].bh == NULL)
 		return;
 
-	if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels)
+	if (dx_get_dx_info(frames[0].bh->b_data)->indirect_levels)
 		brelse(frames[1].bh);
 	brelse(frames[0].bh);
 }
@@ -1400,17 +1403,16 @@
 	const char	*name = dentry->d_name.name;
 	int		namelen = dentry->d_name.len;
 	struct buffer_head *bh2;
-	struct dx_root	*root;
+	struct dx_root_info *dx_info;
 	struct dx_frame	frames[2], *frame;
 	struct dx_entry *entries;
-	struct ext4_dir_entry_2	*de, *de2;
+	struct ext4_dir_entry_2 *de, *de2, *dot_de, *dotdot_de;
 	char		*data1, *top;
 	unsigned	len;
 	int		retval;
 	unsigned	blocksize;
 	struct dx_hash_info hinfo;
 	ext4_lblk_t  block;
-	struct fake_dirent *fde;
 
 	blocksize =  dir->i_sb->s_blocksize;
 	dxtrace(printk(KERN_DEBUG "Creating index: inode %lu\n", dir->i_ino));
@@ -1420,18 +1422,18 @@
 		brelse(bh);
 		return retval;
 	}
-	root = (struct dx_root *) bh->b_data;
+
+	dot_de = (struct ext4_dir_entry_2 *)bh->b_data;
+	dotdot_de = ext4_next_entry(dot_de, blocksize);
 
 	/* The 0th block becomes the root, move the dirents out */
-	fde = &root->dotdot;
-	de = (struct ext4_dir_entry_2 *)((char *)fde +
-		ext4_rec_len_from_disk(fde->rec_len, blocksize));
-	if ((char *) de >= (((char *) root) + blocksize)) {
+	de = ext4_next_entry(dotdot_de, blocksize);
+	if ((char *)de >= bh->b_data + blocksize) {
 		EXT4_ERROR_INODE(dir, "invalid rec_len for '..'");
 		brelse(bh);
 		return -EIO;
 	}
-	len = ((char *) root) + blocksize - (char *) de;
+	len = (char *)dot_de + blocksize - (char *)de;
 
 	/* Allocate new block for the 0th block's dirents */
 	bh2 = ext4_append(handle, dir, &block, &retval);
@@ -1450,19 +1453,21 @@
 	de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
 					   blocksize);
 	/* Initialize the root; the dot dirents already exist */
-	de = (struct ext4_dir_entry_2 *) (&root->dotdot);
-	de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2),
-					   blocksize);
-	memset (&root->info, 0, sizeof(root->info));
-	root->info.info_length = sizeof(root->info);
-	root->info.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
-	entries = root->entries;
+	dotdot_de->rec_len = ext4_rec_len_to_disk(blocksize -
+			le16_to_cpu(dot_de->rec_len), blocksize);
+
+	/* initialize hashing dx_info */
+	dx_info = dx_get_dx_info(bh->b_data);
+	memset(dx_info, 0, sizeof(*dx_info));
+	dx_info->info_length = sizeof(*dx_info);
+	dx_info->hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
+	entries = (void *)dx_info + sizeof(*dx_info);
 	dx_set_block(entries, 1);
 	dx_set_count(entries, 1);
-	dx_set_limit(entries, dx_root_limit(dir, sizeof(root->info)));
+	dx_set_limit(entries, dx_root_limit(dir, sizeof(*dx_info)));
 
 	/* Initialize as for dx_probe */
-	hinfo.hash_version = root->info.hash_version;
+	hinfo.hash_version = dx_info->hash_version;
 	if (hinfo.hash_version <= DX_HASH_TEA)
 		hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
 	hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed;
@@ -1732,7 +1740,7 @@
 			/* Set up root */
 			dx_set_count(entries, 1);
 			dx_set_block(entries + 0, newblock);
-			((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels = 1;
+			dx_get_dx_info(frames->bh->b_data)->indirect_levels = 1;
 
 			/* Add new access path frame */
 			frame = frames + 1;

[-- Attachment #1.3: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-04-28 20:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-20 16:40 Proposal draft for data checksumming for ext4 Lukáš Czerner
2014-03-20 17:59 ` Darrick J. Wong
2014-03-24  1:59   ` Lukáš Czerner
2014-03-25  1:48     ` Andreas Dilger
2014-04-15  0:26     ` mingming cao
2014-04-28 16:21   ` Dmitry Monakhov
2014-04-28 19:36     ` Darrick J. Wong
2014-04-28 20:16     ` Andreas Dilger
