* [Lustre-devel] Integrity and corruption - can file systems be scalable?
@ 2010-07-02 18:53 Peter Braam
  2010-07-02 20:52 ` Dmitry Zogin
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Braam @ 2010-07-02 18:53 UTC (permalink / raw)
  To: lustre-devel

I wrote a blog post that pertains to Lustre scalability and data integrity.
 You can find it here:

http://braamstorage.blogspot.com

Regards,

Peter

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 18:53 [Lustre-devel] Integrity and corruption - can file systems be scalable? Peter Braam
@ 2010-07-02 20:52 ` Dmitry Zogin
  2010-07-02 20:59   ` Peter Braam
  0 siblings, 1 reply; 14+ messages in thread
From: Dmitry Zogin @ 2010-07-02 20:52 UTC (permalink / raw)
  To: lustre-devel

Hello Peter,

These are really good questions, but I don't think they are Lustre
specific. These issues are common to any file system. Some mature file
systems, like Veritas, have already solved this by

1. Integrating volume management and the file system, so that the file
system can be spread across many volumes.
2. Dividing the file system into a group of filesets (data, metadata,
checkpoints), and allowing policies to keep different filesets on
different volumes.
3. Creating checkpoints (they are like volume snapshots, but created
inside the file system itself). The checkpoints are simply
copy-on-write filesets created instantly inside the fs. Using
copy-on-write saves physical space and makes fileset creation
instantaneous. They also allow reverting to a certain point
instantaneously, since the modified blocks are kept aside and the only
thing that has to be done is to point back to the old blocks.
4. Parallel fsck - if the file system consists of allocation units
(a sort of sub-file system, or cylinder group), then fsck can be
started in parallel on those units (a rough sketch follows at the end
of this message).

Well, ZFS solves many of these issues too, but in a different way.
So my point is that this probably has to be solved on the backend side
of Lustre, rather than inside Lustre.
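
To make point 4 concrete, here is a minimal sketch in Python (with an
invented check_allocation_unit() stub - this is not VxFS or e2fsck
code) of consistency checks running independently per allocation unit:

# Hypothetical sketch: independent checks over per-allocation-unit
# regions, as described in point 4 above.
from concurrent.futures import ProcessPoolExecutor

def check_allocation_unit(unit_id):
    """Placeholder for a consistency check confined to one allocation
    unit (cylinder group / sub-filesystem).  Returns (unit, problems)."""
    problems = []
    # ... walk this unit's inode table and block bitmaps here ...
    return unit_id, problems

def parallel_fsck(unit_ids, workers=8):
    # Units are self-contained, so they can be checked independently;
    # cross-unit references would still need a final serial pass.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check_allocation_unit, unit_ids))

if __name__ == "__main__":
    report = parallel_fsck(range(64))
    print("units with problems:",
          {u: p for u, p in report.items() if p} or "none")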

Best regards,

Dmitry

Peter Braam wrote:
> I wrote a blog post that pertains to Lustre scalability and data 
> integrity.  You can find it here:
>
> http://braamstorage.blogspot.com
>
> Regards,
>
> Peter
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>   


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 20:52 ` Dmitry Zogin
@ 2010-07-02 20:59   ` Peter Braam
  2010-07-02 21:09     ` Nicolas Williams
  2010-07-02 21:18     ` Dmitry Zogin
  0 siblings, 2 replies; 14+ messages in thread
From: Peter Braam @ 2010-07-02 20:59 UTC (permalink / raw)
  To: lustre-devel

Dmitry,

The point of the note is the opposite of what you write, namely that backend
systems in fact do not solve this, unless they are guaranteed to be bug
free.

Peter

On Fri, Jul 2, 2010 at 2:52 PM, Dmitry Zogin <dmitry.zoguine@oracle.com> wrote:

>  Hello Peter,
>
> These are really good questions posted there, but I don't think they are
> Lustre specific. These issues are sort of common to any file systems. Some
> of the mature file systems, like Veritas already solved this by
>
> 1. Integrating the Volume management and File system. The file system can
> be spread across many volumes.
> 2. Dividing the file system into a group of file sets(like data, metadata,
> checkpoints) , and allowing the policies to keep different filesets on
> different volumes.
> 3. Creating the checkpoints (they are sort of like volume snapshots, but
> they are created inside the file system itself). The checkpoints are simply
> the copy-on-write filesets created instantly inside the fs itself. Using
> copy-on-write techniques allows to save the physical space and make the
> process of the file sets creation instantaneous. They do allow to revert
> back to a certain point instantaneously, as the modified blocks are kept
> aside, and the only thing that has to be done is to point back to the old
> blocks of information.
> 4. Parallel fsck - if the filesystem consists of the allocation units - a
> sort of the sub- file systems, or cylinder groups,  then the fsck can be
> started in parallel on those units.
>
> Well, the ZFS does solve many of these issues, but in a different way, too.
> So, my point is that this probably has to be solved on the backend side of
> the Lustre, rather than inside the Lustre.
>
> Best regards,
>
> Dmitry
>
> Peter Braam wrote:
>
> I wrote a blog post that pertains to Lustre scalability and data integrity.
>  You can find it here:
>
>  http://braamstorage.blogspot.com
>
>  Regards,
>
>  Peter
>
> ------------------------------
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 20:59   ` Peter Braam
@ 2010-07-02 21:09     ` Nicolas Williams
  2010-07-02 21:18     ` Dmitry Zogin
  1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Williams @ 2010-07-02 21:09 UTC (permalink / raw)
  To: lustre-devel

On Fri, Jul 02, 2010 at 02:59:00PM -0600, Peter Braam wrote:
> The point of the note is the opposite of what you write, namely that backend
> systems in fact do not solve this, unless they are guaranteed to be bug
> free.

Fsck tools can also be buggy.  Consider them redundant code run
asynchronously.  Is it possible to fsck petabytes in reasonable time?
Not if storage capacity grows faster than storage bandwidth.

The obvious alternatives are: test, test, test, and/or run redundant
fsck-like code synchronously.  The latter could be done by reading
just-written transactions to check that the filesystem is consistent.
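
As a toy illustration of that last idea (the log format, CRC scheme and
invariant below are all invented for the example, not any real
filesystem's): the writer commits a transaction and an independent
checker immediately re-reads and re-validates the on-media bytes.

# Illustrative only: after committing a transaction, redundant code
# re-reads it from the log and re-checks it, catching write-side bugs
# while the transaction is still fresh.
import io, json, zlib

def write_txn(log, records):
    payload = json.dumps(records, sort_keys=True).encode()
    offset = log.tell()
    log.write(len(payload).to_bytes(4, "big"))
    log.write(zlib.crc32(payload).to_bytes(4, "big"))
    log.write(payload)
    return offset

def verify_txn(log, offset):
    # Redundant, synchronous check: trust only what is actually stored.
    log.seek(offset)
    size = int.from_bytes(log.read(4), "big")
    crc = int.from_bytes(log.read(4), "big")
    payload = log.read(size)
    assert zlib.crc32(payload) == crc, "stored bytes do not match checksum"
    for rec in json.loads(payload):
        # Example invariant: a record may not allocate and free the same block.
        assert not set(rec.get("alloc", [])) & set(rec.get("free", []))

log = io.BytesIO()                       # stands in for the on-disk journal
off = write_txn(log, [{"alloc": [10, 11], "free": [3]}])
verify_txn(log, off)
print("transaction verified")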

Nico
-- 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 20:59   ` Peter Braam
  2010-07-02 21:09     ` Nicolas Williams
@ 2010-07-02 21:18     ` Dmitry Zogin
  2010-07-02 21:39       ` Peter Braam
  1 sibling, 1 reply; 14+ messages in thread
From: Dmitry Zogin @ 2010-07-02 21:18 UTC (permalink / raw)
  To: lustre-devel

Peter,

That is right - some of them do not. My point was that the Veritas fs
already has many of these things implemented, like parallel fsck and
copy-on-write checkpoints. If it were used as a backend for Lustre,
that would be a perfect match. ZFS has some of its features, but not
all.

That said, adding things like that into Lustre itself will make it
even more complex, and it is very complex already. Certainly, things
like checkpoints can be added at the MDT level - consider an MDT inode
pointing to another MDT inode instead of to OST objects; that would be
a clone. If the file is then modified, the MDT inode points to an OST
object which keeps only the changed file blocks. That would be a sort
of checkpoint allowing the file to be reverted (a toy sketch of this
appears below). This is known to help restore data after human error
or an application bug, but it won't protect against HW-induced errors.
The parallel fsck issue stands alone, though - if we want fsck to be
faster, we had better make it parallel at every OST level - which is
why I think this has to be done on the backend side.
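
As a toy model of that clone/checkpoint structure (plain Python
objects, not real MDT inodes or OST objects):

class Inode:
    def __init__(self, blocks=None, base=None):
        self.blocks = blocks if blocks is not None else {}  # block no -> data
        self.base = base            # inode this clone reads through to

    def clone(self):
        # Checkpoint: the new inode stores no data, just a back pointer.
        return Inode(base=self)

    def write(self, blkno, data):
        # Only the changed block is kept in this inode (the "delta" object).
        self.blocks[blkno] = data

    def read(self, blkno):
        if blkno in self.blocks:
            return self.blocks[blkno]
        return self.base.read(blkno) if self.base else None

checkpoint = Inode(blocks={0: b"hello", 1: b"world"})  # frozen old inode
live = checkpoint.clone()        # instantaneous; nothing is copied
live.write(1, b"WORLD")          # delta holds only the changed block
assert live.read(1) == b"WORLD" and checkpoint.read(1) == b"world"
# Reverting the file is just pointing back at the checkpoint inode.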

Dmitry


Peter Braam wrote:
> Dmitry, 
>
> The point of the note is the opposite of what you write, namely that 
> backend systems in fact do not solve this, unless they are guaranteed 
> to be bug free.
>
> Peter
>
> On Fri, Jul 2, 2010 at 2:52 PM, Dmitry Zogin 
> <dmitry.zoguine at oracle.com <mailto:dmitry.zoguine@oracle.com>> wrote:
>
>     Hello Peter,
>
>     These are really good questions posted there, but I don't think
>     they are Lustre specific. These issues are sort of common to any
>     file systems. Some of the mature file systems, like Veritas
>     already solved this by
>
>     1. Integrating the Volume management and File system. The file
>     system can be spread across many volumes.
>     2. Dividing the file system into a group of file sets(like data,
>     metadata, checkpoints) , and allowing the policies to keep
>     different filesets on different volumes.
>     3. Creating the checkpoints (they are sort of like volume
>     snapshots, but they are created inside the file system itself).
>     The checkpoints are simply the copy-on-write filesets created
>     instantly inside the fs itself. Using copy-on-write techniques
>     allows to save the physical space and make the process of the file
>     sets creation instantaneous. They do allow to revert back to a
>     certain point instantaneously, as the modified blocks are kept
>     aside, and the only thing that has to be done is to point back to
>     the old blocks of information.
>     4. Parallel fsck - if the filesystem consists of the allocation
>     units - a sort of the sub- file systems, or cylinder groups,  then
>     the fsck can be started in parallel on those units.
>
>     Well, the ZFS does solve many of these issues, but in a different
>     way, too.
>     So, my point is that this probably has to be solved on the backend
>     side of the Lustre, rather than inside the Lustre.
>
>     Best regards,
>
>     Dmitry
>
>     Peter Braam wrote:
>>     I wrote a blog post that pertains to Lustre scalability and data
>>     integrity.  You can find it here:
>>
>>     http://braamstorage.blogspot.com
>>
>>     Regards,
>>
>>     Peter
>>     ------------------------------------------------------------------------
>>
>>     _______________________________________________
>>     Lustre-devel mailing list
>>     Lustre-devel at lists.lustre.org <mailto:Lustre-devel@lists.lustre.org>
>>     http://lists.lustre.org/mailman/listinfo/lustre-devel
>>       
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>   


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 21:18     ` Dmitry Zogin
@ 2010-07-02 21:39       ` Peter Braam
  2010-07-02 22:21         ` Nicolas Williams
  2010-07-07  6:57         ` [Lustre-devel] [Lustre-discuss] " Andreas Dilger
  0 siblings, 2 replies; 14+ messages in thread
From: Peter Braam @ 2010-07-02 21:39 UTC (permalink / raw)
  To: lustre-devel

On Fri, Jul 2, 2010 at 3:18 PM, Dmitry Zogin <dmitry.zoguine@oracle.com> wrote:

>  Peter,
>
> That is right - some of them do not. My point was that Veritas fs already
> has many things implemented, like parallel fsck, copy-on-write
> checkpoints,etc. If it was used as a backend for the Lustre, that would be
> the perfect match. ZFS has some of its features, but not all.
>
>
Parallel fsck doesn't help once you are down to one disk (as pointed out in
the post).

The post also mentions copy-on-write checkpoints, and their usefulness has
not been proven.  There has been no study of this, and in many cases they
are implemented in such a way that bugs in the software can corrupt them.
For example, most volume-level copy-on-write schemes actually copy the old
data instead of leaving it in place, which is a vulnerability.  Shadow
copies are vulnerable to software bugs; things would get better if there
were something similar to page protection for disk blocks.
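
Schematically, the difference between the two snapshot styles is the
following (invented structures, not any particular volume manager's
code):

def copy_out_write(volume, snapshot_store, blkno, new_data):
    # Copy-out style: the OLD data must be copied aside first, then the
    # live block is overwritten in place.  A bug in the copy step can
    # destroy the only good copy of the old data.
    snapshot_store[blkno] = volume[blkno]
    volume[blkno] = new_data

def redirect_on_write(volume, free_blocks, block_map, logical_blkno, new_data):
    # Redirect style: the old block is never touched; new data goes to a
    # fresh block and only the block map is repointed, so the previous
    # version survives a bug in this code path.
    new_blkno = free_blocks.pop()
    volume[new_blkno] = new_data
    block_map[logical_blkno] = new_blkno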

But, let's say, adding things like that into the Lustre itself will make it
> even more complex, and now it is very complex already . Certainly, things
> like checkpoints can be added at MDT level - consider an inode on MDT
> pointing to another MDT inode, instead of the OST objects - that would be a
> clone. If the file is modified, then, the MDT inode becomes pointing to an
> OST object which keeps changed file blocks only. This will be sort of the
> checkpoint allowing to revert the file back. Well, this is is known to help
> restoring the data in case of the human error, or an application bug, it
> won't help to protect from HW induced errors.
>

Again, pointing to other objects is subject to possible software bugs.

I wrote this post because I'm unconvinced by the barrage of by-now
endlessly repeated ideas like checkpoints, checksums, etc., and by the
false claim that advanced file systems address these issues - they only
address some, and leave critical vulnerabilities.

Nicolas's post is more along the lines of what I think will lead to a
solution.

Peter




> But, the parallel fsck issue is sort of standing alone - if we want fsck to
> be faster, we better make it parallel at every OST level - that's why I
> think this has to be done on the backend side.
>
> Dmitry
>
>
>
> Peter Braam wrote:
>
> Dmitry,
>
>  The point of the note is the opposite of what you write, namely that
> backend systems in fact do not solve this, unless they are guaranteed to be
> bug free.
>
>  Peter
>
> On Fri, Jul 2, 2010 at 2:52 PM, Dmitry Zogin <dmitry.zoguine@oracle.com>wrote:
>
>> Hello Peter,
>>
>> These are really good questions posted there, but I don't think they are
>> Lustre specific. These issues are sort of common to any file systems. Some
>> of the mature file systems, like Veritas already solved this by
>>
>> 1. Integrating the Volume management and File system. The file system can
>> be spread across many volumes.
>> 2. Dividing the file system into a group of file sets(like data, metadata,
>> checkpoints) , and allowing the policies to keep different filesets on
>> different volumes.
>> 3. Creating the checkpoints (they are sort of like volume snapshots, but
>> they are created inside the file system itself). The checkpoints are simply
>> the copy-on-write filesets created instantly inside the fs itself. Using
>> copy-on-write techniques allows to save the physical space and make the
>> process of the file sets creation instantaneous. They do allow to revert
>> back to a certain point instantaneously, as the modified blocks are kept
>> aside, and the only thing that has to be done is to point back to the old
>> blocks of information.
>> 4. Parallel fsck - if the filesystem consists of the allocation units - a
>> sort of the sub- file systems, or cylinder groups,  then the fsck can be
>> started in parallel on those units.
>>
>> Well, the ZFS does solve many of these issues, but in a different way,
>> too.
>> So, my point is that this probably has to be solved on the backend side of
>> the Lustre, rather than inside the Lustre.
>>
>> Best regards,
>>
>> Dmitry
>>
>> Peter Braam wrote:
>>
>>  I wrote a blog post that pertains to Lustre scalability and data
>> integrity.  You can find it here:
>>
>>  http://braamstorage.blogspot.com
>>
>>  Regards,
>>
>>  Peter
>>
>> ------------------------------
>>
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>
>>
>>
>  ------------------------------
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 21:39       ` Peter Braam
@ 2010-07-02 22:21         ` Nicolas Williams
  2010-07-02 22:35           ` Nicolas Williams
  2010-07-03  3:37           ` Dmitry Zogin
  2010-07-07  6:57         ` [Lustre-devel] [Lustre-discuss] " Andreas Dilger
  1 sibling, 2 replies; 14+ messages in thread
From: Nicolas Williams @ 2010-07-02 22:21 UTC (permalink / raw)
  To: lustre-devel

On Fri, Jul 02, 2010 at 03:39:42PM -0600, Peter Braam wrote:
> On Fri, Jul 2, 2010 at 3:18 PM, Dmitry Zogin <dmitry.zoguine@oracle.com>wrote:
> The post also mentions copy on write checkpoints, and their usefulness has
> not been proven.  There has been no study about this, and certainly in many
> cases they are implemented in such a way that bugs in the software can
> corrupt them.  For example, most volume level copy on write schemes actually
> copy the old data instead of leaving it in place, which is a vulnerability.
>  Shadow copies are vulnerable to software bugs, things would get better if
> there was something similar to page protection for disk blocks.

Well-delineated transactions are certainly useful.  The reason: you can
fsck each transaction discretely and incrementally.  That means that you
know exactly how much work must be done to fsck a priori.  Sure, you
still have to be confident that N correct transactions == correct
filesystem, but that's much easier to be confident of than software
correctness.  (It'd be interesting to apply theorem provers to theorems
related to on-disk data formats!) 
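
A small sketch of what "fsck each transaction discretely" buys you: the
work per check is bounded by the size of the transaction, not of the
filesystem (the transaction format here is invented for illustration).

def check_transaction(txn, live):
    # Each check costs O(|transaction|), known a priori.
    assert not txn["alloc"] & live, "allocating a block that is already live"
    assert txn["free"] <= live, "freeing a block that was never allocated"
    return (live | txn["alloc"]) - txn["free"]

live = set()
for txn in [{"alloc": {1, 2}, "free": set()},
            {"alloc": {3}, "free": {2}}]:
    live = check_transaction(txn, live)
print("live blocks:", sorted(live))       # -> [1, 3]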

Another problem, incidentally, is software correctness on the read side.
It's nice to know that no bugs on the write side will corrupt your
filesystem, but read-side bugs that cause your data to be unavailable
are not good either.  The distinction between bugs in the write vs. read
sides is subtle: recovery from the latter is just a patch away, while
recovery from the former might require long fscks, or even more manual
intervention (e.g., writing a better fsck).

> I wrote this post because I'm unconvinced with the barrage of by now
> endlessly repeated ideas like checkpoints, checksums etc, and the falsehood
> of the claim that advanced file systems address these issues - they only
> address some, and leave critical vulnerability.

I do believe COW transactions + Merkle hash trees are _the_ key aspect
of the solution.  Because only by making fscks incremental and discrete
can we get a handle on the amount of time that must be spent waiting for
fscks to complete.  Without incremental fscks there'd be no hope as
storage capacity outstrips storage and compute bandwidth.

If you believe that COW, transactional, Merkle trees are an
anti-solution, or if you believe that they are only a tiny part of the
solution, please argue that view.  Otherwise I think your use of
"barrage" here is a bit over the top (nay, a lot over the top).  It's
one thing to be missing a part of the solution, and it's another to be
on the wrong track, or missing the largest part of the solution.
Extraordinary claims and all that...

(And no, manually partitioning storage into discrete "filesystems",
"filesets", "datasets", whatever, is not a solution; at most it's a
bandaid.)

Nico
-- 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 22:21         ` Nicolas Williams
@ 2010-07-02 22:35           ` Nicolas Williams
  2010-07-03  3:37           ` Dmitry Zogin
  1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Williams @ 2010-07-02 22:35 UTC (permalink / raw)
  To: lustre-devel

I explained why well-delineated transactions help, but didn't really
explain why COW and Merkle hash trees help.  COW helps ensure that
correct transactions cannot result in incorrect filesystems -- fsck need
only ensure that a transaction hasn't overwritten live blocks to
guarantee that one can at least rollback to that transaction.  Merkle
hash trees help detect (and recover from) bit rot and hardware errors,
which in turn helps ensure that those incremental fscks are dealing with
correct meta-data (correct fsck code + bad meta-data == bad fsck).
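
A toy version of both checks (the hash choice, tree layout and
transaction format are invented for illustration, not ZFS's actual
on-disk format):

import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def merkle_root(blocks):
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                  # pad odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def txn_preserves_old_state(written_blocks, live_blocks):
    # COW invariant: a correct transaction writes only to free blocks, so
    # the previous consistent state is intact and can be rolled back to.
    return not (set(written_blocks) & set(live_blocks))

blocks = [b"block-%d" % i for i in range(8)]
root = merkle_root(blocks)              # kept somewhere trusted
blocks[3] = b"bit rot"                  # silent corruption
assert merkle_root(blocks) != root      # detected without trusting the device
assert txn_preserves_old_state(written_blocks={9, 10}, live_blocks={1, 2, 3})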

It's much harder to ensure that there are no errors in parts of the
system that are exposed due to lack of special protection features (such
as ECC memory), in system buses and CPUs, that might be difficult or
impossible to protect against in software.  One option is to run the
fscks on different hosts than the ones doing the writing (this means
multi-pathing though, which complicates the overall system, but at least
we currently depend on multipathing anyways).  But even that won't
protect against such unprotectable errors in _data_ (originating in
faraway clients, say).

Nico
-- 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-02 22:21         ` Nicolas Williams
  2010-07-02 22:35           ` Nicolas Williams
@ 2010-07-03  3:37           ` Dmitry Zogin
  2010-07-04 23:56             ` Nicolas Williams
  1 sibling, 1 reply; 14+ messages in thread
From: Dmitry Zogin @ 2010-07-03  3:37 UTC (permalink / raw)
  To: lustre-devel

Nicolas Williams wrote:
> On Fri, Jul 02, 2010 at 03:39:42PM -0600, Peter Braam wrote:
>   
>> On Fri, Jul 2, 2010 at 3:18 PM, Dmitry Zogin <dmitry.zoguine@oracle.com>wrote:
>> The post also mentions copy on write checkpoints, and their usefulness has
>> not been proven.  There has been no study about this, and certainly in many
>> cases they are implemented in such a way that bugs in the software can
>> corrupt them.  For example, most volume level copy on write schemes actually
>> copy the old data instead of leaving it in place, which is a vulnerability.
>>  Shadow copies are vulnerable to software bugs, things would get better if
>> there was something similar to page protection for disk blocks.
>>     
>
> Well-delineated transactions are certainly useful.  The reason: you can
> fsck each transaction discretely and incrementally.  That means that you
> know exactly how much work must be done to fsck a priori.  Sure, you
> still have to be confident that N correct transactions == correct
> filesystem, but that's much easier to be confident of than software
> correctness.  (It'd be interesting to apply theorem provers to theorems
> related to on-disk data formats!) 
>
> Another problem, incidentally, is software correctness on the read side.
> It's nice to know that no bugs on the write side will corrupt your
> filesystem, but read-side bugs that cause your data to be unavailable
> are not good either.  The distinction between bugs in the write vs. read
> sides is subtle: recovery from the latter is just a patch away, while
> recovery from the former might require long fscks, or even more manual
> intervention (e.g., writing a better fsck).
>
>   
>> I wrote this post because I'm unconvinced with the barrage of by now
>> endlessly repeated ideas like checkpoints, checksums etc, and the falsehood
>> of the claim that advanced file systems address these issues - they only
>> address some, and leave critical vulnerability.
>>     
>
> I do believe COW transactions + Merkel hash trees are _the_ key aspect
> of the solution.  Because only by making fscks incremental and discrete
> can we get a handle on the amount of time that must be spent waiting for
> fscks to complete.  Without incremental fscks there'd be no hope as
> storage capacity outstrips storage and compute bandwidth.
>
> If you believe that COW, transactional, Merkle trees are an
> anti-solution, or if you believe that they are only a tiny part of the
> solution, please argue that view.  Otherwise I think your use of
> "barrage" here is a bit over the top (nay, a lot over the top).  It's
> one thing to be missing a part of the solution, and it's another to be
> on the wrong track, or missing the largest part of the solution.
> Extraordinary claims and all that...
>   
Well, the hash trees certainly help to achieve data integrity, but at
a performance cost.
Eventually, the file system becomes fragmented, and moving the data
around implies more random seeks with Merkle hash trees.
> (And no, manually partitioning storage into discrete "filesystems",
> "filesets", "datasets", whatever, is not a solution; at most it's a
> bandaid.)
>
> Nico
>   


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-03  3:37           ` Dmitry Zogin
@ 2010-07-04 23:56             ` Nicolas Williams
  2010-07-05  3:53               ` Dmitry Zogin
  0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Williams @ 2010-07-04 23:56 UTC (permalink / raw)
  To: lustre-devel

On Fri, Jul 02, 2010 at 11:37:52PM -0400, Dmitry Zogin wrote:
> Well, the hash trees certainly help to achieve data integrity, but
> at the performance cost.

Merkle hash trees cost more CPU cycles, not more I/O.  Indeed, they
result in _less_ I/O in the case of RAID-Zn because there's no need to
read the parity unless the checksum doesn't match.  Also, how much CPU
depends on the hash function.  And HW could help if this became enough
of a problem for us.
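
A sketch of the "no parity read unless the checksum mismatches" path,
using a simplified single-parity stripe purely for illustration (not
RAID-Z's real layout):

import hashlib

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def read_block(data_chunks, parity, index, expected_checksum):
    chunk = data_chunks[index]
    if hashlib.sha256(chunk).digest() == expected_checksum:
        return chunk                          # common case: no parity I/O
    # Checksum failed: reconstruct from parity and the remaining chunks.
    rebuilt = parity
    for i, other in enumerate(data_chunks):
        if i != index:
            rebuilt = xor(rebuilt, other)
    assert hashlib.sha256(rebuilt).digest() == expected_checksum
    return rebuilt

chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor(xor(chunks[0], chunks[1]), chunks[2])
csum = hashlib.sha256(chunks[1]).digest()
chunks[1] = b"B?BB"                           # corrupt one chunk
print(read_block(chunks, parity, 1, csum))    # -> b'BBBB', recovered via parity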

> Eventually, the file system becomes fragmented, and moving the data
> around implies more random seeks with Merkle hash trees.

Yes, fragmentation is a problem for COW, but that has nothing to do with
Merkle trees.  But practically every modern filesystem coalesces writes
into contiguous writes on disk to reach streaming write performance,
and that, like COW, results in filesystem fragmentation.

(Of course, you needn't get fragmentation if you never delete or
overwrite files.  You'll get some fragmentation of meta-data, but that's
much easier to garbage collect since meta-data will amount to much less
on disk than data.)

Everything we do involves trade-offs.

Nico
-- 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-04 23:56             ` Nicolas Williams
@ 2010-07-05  3:53               ` Dmitry Zogin
  2010-07-05  7:11                 ` Mitchell Erblich
  2010-07-05 17:58                 ` Nicolas Williams
  0 siblings, 2 replies; 14+ messages in thread
From: Dmitry Zogin @ 2010-07-05  3:53 UTC (permalink / raw)
  To: lustre-devel

Nicolas Williams wrote:
> On Fri, Jul 02, 2010 at 11:37:52PM -0400, Dmitry Zogin wrote:
>   
>> Well, the hash trees certainly help to achieve data integrity, but
>> at the performance cost.
>>     
>
> Merkle hash trees cost more CPU cycles, not more I/O.  Indeed, they
> result in _less_ I/O in the case of RAID-Zn because there's no need to
> read the parity unless the checksum doesn't match.  Also, how much CPU
> depends on the hash function.  And HW could help if this became enough
> of a problem for us.
>
>   
>> Eventually, the file system becomes fragmented, and moving the data
>> around implies more random seeks with Merkle hash trees.
>>     
>
> Yes, fragmentation is a problem for COW, but that has nothing to do with
> Merkle trees.  But practically every modern filesystem coalesces writes
> into contiguous writes on disk to reach streaming write perfmormance,
> and that, like COW, results in filesystem fragmentation.
>
>   
What I really mean is the defragmentation issue and not the
fragmentation itself. All file systems become fragmented; that is
unavoidable. But defragmenting a file system that uses hash trees
really becomes a problem.
> (Of course, you needn't get fragmentation if you never delete or over
> write files.  You'll get some fragmentation of meta-data, but that's
> much easier to garbage collect since meta-data will amount to much less
> on disk than data.)
>   
Well, that really never happens unless the file system is read-only.
Files are deleted and created all the time.
> Everything we do involves trade-offs.
>
>
>   
Yes, but if the performance drop becomes unacceptable, any gain in
integrity is of little value.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-05  3:53               ` Dmitry Zogin
@ 2010-07-05  7:11                 ` Mitchell Erblich
  2010-07-05 17:58                 ` Nicolas Williams
  1 sibling, 0 replies; 14+ messages in thread
From: Mitchell Erblich @ 2010-07-05  7:11 UTC (permalink / raw)
  To: lustre-devel


On Jul 4, 2010, at 8:53 PM, Dmitry Zogin wrote:

> Nicolas Williams wrote:
>> 
>> On Fri, Jul 02, 2010 at 11:37:52PM -0400, Dmitry Zogin wrote:
>>   
>>> Well, the hash trees certainly help to achieve data integrity, but
>>> at the performance cost.
>>>     
>> 
>> Merkle hash trees cost more CPU cycles, not more I/O.  Indeed, they
>> result in _less_ I/O in the case of RAID-Zn because there's no need to
>> read the parity unless the checksum doesn't match.  Also, how much CPU
>> depends on the hash function.  And HW could help if this became enough
>> of a problem for us.
>> 
>>   
>>> Eventually, the file system becomes fragmented, and moving the data
>>> around implies more random seeks with Merkle hash trees.
>>>     
>> 
>> Yes, fragmentation is a problem for COW, but that has nothing to do with
>> Merkle trees.  But practically every modern filesystem coalesces writes
>> into contiguous writes on disk to reach streaming write perfmormance,
>> and that, like COW, results in filesystem fragmentation.
>> 
>>   
> What I really mean is the defragmentation issue and not the fragmentation itself. All file systems becomes fragmented, as it is unavoidable. But the defragmentation of the file system using hash trees really becomes a problem.

Stupid me. I thought the FS fragmentation issue had a solution over a decade ago.

When the write doesn't change the offset, do nothing. If it is a concatenating write,
locate the best-fit block for the new size/offset, update the metadata/inode, then free the
old block. Since writes are mostly asynchronous, who cares how long it takes, as long as there
are no commits waiting.
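
A toy version of that policy (the best-fit free list and extent format
are invented; real allocators are extent-based and far more careful):

import bisect

def best_fit(free_extents, need):
    """free_extents: sorted list of (length, start).  Returns a start block."""
    i = bisect.bisect_left(free_extents, (need, -1))
    if i == len(free_extents):
        raise RuntimeError("no extent large enough")
    length, start = free_extents.pop(i)
    if length > need:                          # return the tail to the free list
        bisect.insort(free_extents, (length - need, start + need))
    return start

def write(inode, free_extents, offset, length):
    old_start, old_len = inode["extent"]       # extent stored as (start, length)
    if offset + length <= old_len:
        return                                 # in-place overwrite: do nothing
    new_len = offset + length                  # concatenating write: relocate
    inode["extent"] = (best_fit(free_extents, new_len), new_len)
    bisect.insort(free_extents, (old_len, old_start))   # free the old extent

free = [(4, 100), (16, 200), (64, 300)]
inode = {"extent": (100, 4)}
write(inode, free, 2, 8)                       # file grows from 4 to 10 blocks
print(inode, free)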

Mitchell Erblich

>> (Of course, you needn't get fragmentation if you never delete or over
>> write files.  You'll get some fragmentation of meta-data, but that's
>> much easier to garbage collect since meta-data will amount to much less
>> on disk than data.)
>>   
> Well, that is really never happens, unless the file system is read-only. The files are deleted and created all the time.
>> Everything we do involves trade-offs.
>> 
>> 
>>   
> Yes, but if the performance drop becomes unacceptable, any gain in the integrity is miserable.
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] Integrity and corruption - can file systems be scalable?
  2010-07-05  3:53               ` Dmitry Zogin
  2010-07-05  7:11                 ` Mitchell Erblich
@ 2010-07-05 17:58                 ` Nicolas Williams
  1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Williams @ 2010-07-05 17:58 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jul 04, 2010 at 11:53:29PM -0400, Dmitry Zogin wrote:
> What I really mean is the defragmentation issue and not the
> fragmentation itself. All file systems becomes fragmented, as it is
> unavoidable. But the defragmentation of the file system using hash
> trees really becomes a problem.

That is emphatically not true.

To defragment a ZFS-like filesystem all you need to do is traverse the
metadata looking for live blocks from old transaction groups, then
relocate those by writing them out again almost as if an application had
written to them (except with no mtime updates).

In ZFS we call this block pointer rewrite, or bp rewrite.
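
Roughly, in sketch form (data structures invented for illustration, not
ZFS's actual block pointers):

def bp_rewrite(block_pointers, read_block, allocate, write_block, cutoff_txg):
    # Walk block pointers and relocate any live block born before the
    # cutoff transaction group.
    for bp in block_pointers:
        if bp["birth_txg"] >= cutoff_txg:
            continue                       # recent enough; leave it alone
        data = read_block(bp["addr"])      # re-read the old copy
        new_addr = allocate(len(data))     # a fresh (more contiguous) location
        write_block(new_addr, data)        # rewritten as if by an application,
        bp["addr"] = new_addr              # but with no mtime update
        bp["birth_txg"] = cutoff_txg

storage = {0: b"old data"}
free = iter(range(100, 200))
bps = [{"addr": 0, "birth_txg": 5}]
bp_rewrite(bps, storage.get, lambda n: next(free), storage.__setitem__, 10)
print(bps)                                 # block relocated to address 100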

> >Everything we do involves trade-offs.
> >
> Yes, but if the performance drop becomes unacceptable, any gain in
> the integrity is miserable.

I believe ZFS has shown that unacceptable performance losses are not
required in order to get the additional integrity protection.

Nico
-- 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Lustre-devel] [Lustre-discuss] Integrity and corruption - can file systems be scalable?
  2010-07-02 21:39       ` Peter Braam
  2010-07-02 22:21         ` Nicolas Williams
@ 2010-07-07  6:57         ` Andreas Dilger
  1 sibling, 0 replies; 14+ messages in thread
From: Andreas Dilger @ 2010-07-07  6:57 UTC (permalink / raw)
  To: lustre-devel

On 2010-07-02, at 15:39, Peter Braam wrote:
> I wrote a blog post that pertains to Lustre scalability and data integrity.
> 
> http://braamstorage.blogspot.com

In your blog you write:

> Unfortunately once file system check and repair is required, the scalability of all file systems becomes questionable.  The repair tool needs to iterate over all objects stored in the file system, and this can take unacceptably long on the advanced file systems like ZFS and btrfs just as much as on the more traditional ones like ext4.  
> 
> This shows the shortcoming of the Lustre-ZFS proposal to address scalability.  It merely addresses data integrity.

I agree that ZFS checksums will help detect and recover from data corruption, and we are leveraging this to provide data integrity (as described in "End to End Data Integrity Design" on the Lustre wiki).  However, contrary to your statement, we are not depending on the checksums for checking and fixing the distributed filesystem consistency.

The Integrity design you referenced describes the process for doing the (largely) single-pass parallel consistency checking of the ZFS backing filesystems at the same time as doing the distributed Lustre filesystem consistency check, while the filesystem is active.

In the years since you have been working on Lustre, we have already implemented similar ideas as ChunkFS/TileFS to use back-references for avoiding the need to keep the full filesystem state in memory when doing checks and recovering from corruption.  The OST filesystem inodes contain their own object IDs (for recreating the OST namespace in case of directory corruption, as anyone who's used ll_recover_lost_found_objs can attest), and a back-pointer to the MDT inode FID to be used for fast orphan and layout inconsistency detection.  With 2.0 the MDT inodes will also contain the FID number for reconstructing the object index, should it be corrupted, and also the list of hard links to the inode for doing O(1) path construction and nlink verification.  With CMD the remotely referenced  MDT inodes will have back-pointers to the originating MDT to allow local consistency checking, similar to the shadow inodes proposed for ChunkFS.
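
Schematically, such back-references let a checker validate an OST object against its parent with a single lookup (the field names and FID string below are invented for illustration, not Lustre's actual EA layout):

def check_ost_object(ost_obj, mdt_lookup):
    """ost_obj: {'object_id': ..., 'parent_fid': ...}
    mdt_lookup: function FID -> MDT inode (whose layout lists OST objects)."""
    problems = []
    mdt_inode = mdt_lookup(ost_obj["parent_fid"])
    if mdt_inode is None:
        problems.append("orphan object: parent FID does not exist")
    elif ost_obj["object_id"] not in mdt_inode["layout"]:
        problems.append("layout inconsistency: parent does not reference object")
    return problems

mdt = {"0x200000400:0x1": {"layout": ["obj-17"]}}
obj = {"object_id": "obj-17", "parent_fid": "0x200000400:0x1"}
print(check_ost_object(obj, mdt.get) or "consistent")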

As you pointed out, scaling fsck to be able to check a filesystem with 10^12 files within 100h is difficult.  It turns out that the metadata requirements for doing a full check within this time period exceed the metadata requirements specified for normal operation.  It of course isn't possible to do a consistency check of a filesystem without actually checking each of the items in that filesystem, so each one has to be visited at least (and preferably at most) once.  That said, the requirements are not beyond the capabilities of the hardware that will be needed to host a filesystem this large in the first place, assuming the local and distributed consistency checking can run in parallel and utilize the full bandwidth of the filesystem.
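
For scale, the raw rate implied by those numbers (the 1000-target split is just an assumed example):

files = 10**12
window = 100 * 3600                        # 100 hours, in seconds
rate = files / window
print(f"{rate:,.0f} objects/s aggregate")  # ~2,777,778 objects/s
print(f"{rate / 1000:,.0f} objects/s per target across 1000 OSTs/MDTs")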

What is also important to note is that both ZFS and the new lfsck are designed to be able to validate the filesystem continuously as it is being used, so there is no need to take a 100h outage before putting the filesystem back into use.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-07-07  6:57 UTC | newest]

Thread overview: 14+ messages
2010-07-02 18:53 [Lustre-devel] Integrity and corruption - can file systems be scalable? Peter Braam
2010-07-02 20:52 ` Dmitry Zogin
2010-07-02 20:59   ` Peter Braam
2010-07-02 21:09     ` Nicolas Williams
2010-07-02 21:18     ` Dmitry Zogin
2010-07-02 21:39       ` Peter Braam
2010-07-02 22:21         ` Nicolas Williams
2010-07-02 22:35           ` Nicolas Williams
2010-07-03  3:37           ` Dmitry Zogin
2010-07-04 23:56             ` Nicolas Williams
2010-07-05  3:53               ` Dmitry Zogin
2010-07-05  7:11                 ` Mitchell Erblich
2010-07-05 17:58                 ` Nicolas Williams
2010-07-07  6:57         ` [Lustre-devel] [Lustre-discuss] " Andreas Dilger
