linux-kernel.vger.kernel.org archive mirror
* Tux3 Report: Initial fsck has landed
@ 2013-01-28  5:55 Daniel Phillips
  2013-01-28  6:02 ` David Lang
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Phillips @ 2013-01-28  5:55 UTC (permalink / raw)
  To: linux-kernel, tux3; +Cc: linux-fsdevel

Initial Tux3 fsck has landed

Things are moving right along in Tux3 land. Encouraged by our great initial
benchmarks for in-cache workloads, we are now busy working through our to-do
list to develop Tux3 the rest of the way into a functional filesystem that a
sufficiently brave person could actually mount.

At the top of the to-do list is "fsck". Because really, fsck has to rank as
one of the top features of any filesystem you would actually want to use.
Ext4 rules the world largely on the strength of e2fsck. Not on fsck alone, of
course, but e2fsck is certainly a large part of it. Accordingly, we have set our sights on
creating an e2fsck-quality fsck in due course.

Today, I am happy to be able to say that a first draft of a functional Tux3
fsck has already landed:

    https://github.com/OGAWAHirofumi/tux3/blob/master/user/tux3_fsck.c

Note how short it is. That is because Tux3 fsck uses a "walker" framework
shared by a number of other features. It will soon also use our suite of
metadata format checking methods that were developed years ago (and still
continue to be improved).

The Tux3 walker framework (another great hack by Hirofumi, as is the initial
fsck) is interesting in that it evolved from tux3graph, Hirofumi's
graphical filesystem structure dumper. And before that, it came from our btree
traversing framework, which came from ddsnap, which came from HTree, which
came from Tux2. Whew. Nearly a 15 year history for that code when you trace
it all out.

Anyway, the walker is really sweet. You give it a few specialized methods and
poof, you have an fsck. So far, we just check physical referential integrity:
each block is either free or is referenced by exactly one pointer in the
filesystem tree, possibly as part of a data extent. This check is done with
the help of a "shadow bitmap". As we walk the tree, we mark off all referenced
blocks in the shadow bitmap, complaining if already marked. At the end of
that, the shadow file should be identical to the allocation bitmap inode. And
more often than not, it is.
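
A minimal standalone sketch of that check (just the idea, not the actual
tux3_fsck.c code) looks roughly like this, with the block references
hardwired instead of coming from the walker:

#include <stdio.h>
#include <string.h>

#define VOLUME_BLOCKS 64                            /* toy volume size */

static unsigned char shadow[VOLUME_BLOCKS / 8];     /* built during the walk */
static unsigned char allocated[VOLUME_BLOCKS / 8];  /* the on-disk bitmap */

/* Called once per block reference the walker finds (a data extent expands
 * to one call per block).  Complain if the block was already referenced. */
static int mark_referenced(unsigned block)
{
    unsigned char bit = 1 << (block & 7);

    if (shadow[block >> 3] & bit) {
        fprintf(stderr, "block %u referenced more than once\n", block);
        return -1;
    }
    shadow[block >> 3] |= bit;
    return 0;
}

int main(void)
{
    /* Pretend the walker reported these referenced blocks... */
    unsigned refs[] = { 0, 1, 2, 10, 11, 12, 13 };
    /* ...and that this is what the allocation bitmap inode says. */
    unsigned char ondisk[] = { 0x07, 0x3c, 0, 0, 0, 0, 0, 0 };

    for (unsigned i = 0; i < sizeof refs / sizeof *refs; i++)
        if (mark_referenced(refs[i]))
            return 1;
    memcpy(allocated, ondisk, sizeof allocated);

    /* Every block must be free or referenced exactly once, so the shadow
     * bitmap and the allocation bitmap must end up identical. */
    if (memcmp(shadow, allocated, sizeof shadow)) {
        fprintf(stderr, "shadow and allocation bitmaps differ\n");
        return 1;
    }
    puts("referential integrity check passed");
    return 0;
}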

Cases where we actually get differences are now mostly during hacking, though
of course we do need to be checking a lot more volumes under different loads
to have a lot of confidence about that. As a development tool, even this very
simple fsck is a wonderful thing.

Tux3 fsck is certainly not going to stay simple. Here is roughly where we are
going with it next:

    http://phunq.net/pipermail/tux3/2013-January/001976.html
    "Fsck Revisited"

To recap, next on the list is checking referential integrity of the directory
namespace, a somewhat more involved problem than physical structure, but not
really hard. After that, the main difference between this and a real fsck
will be repair, which is a big topic but is already underway: first simple
repairs, then tricky repairs.

Compared to Ext2/3/4, Tux3 has a big disadvantage in terms of fsck: it does
not confine inode table blocks to fixed regions of the volume. Tux3 may store
any metadata block anywhere, and tends to stir things around to new locations
during normal operation. To overcome this disadvantage, we have the concept of
uptags:

    http://phunq.net/pipermail/tux3/2013-January/001973.html
    "What are uptags?"

With uptags we should be able to fall back to a full scan of a damaged volume
and get a pretty good idea of which blocks are actually lost metadata blocks,
and to which filesystem objects they might belong.
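
For illustration, an uptag might carry something like the following (a
hypothetical layout; the real field sizes and encoding are described in the
design notes above), which is enough for a scavenging scan to match a stray
block against a known hole:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical uptag layout, for illustration only.  The point is that each
 * metadata block carries enough self-describing bits that a full scan of a
 * damaged volume can guess which inode and offset a stray block belonged to. */
struct uptag {
    uint32_t magic;      /* marks the block as Tux3 metadata */
    uint32_t inode_low;  /* low order bits of the owning inode number */
    uint32_t offset;     /* logical position within that object */
};

/* A scavenging pass might test a candidate block against a known hole. */
static int uptag_matches(const struct uptag *tag, uint64_t inum, uint32_t offset)
{
    return tag->inode_low == (uint32_t)inum && tag->offset == offset;
}

int main(void)
{
    struct uptag found = { 0x54555833, 42, 7 };  /* scraped from a stray block */

    printf("candidate for inode 42, offset 7: %s\n",
           uptag_matches(&found, 42, 7) ? "plausible" : "no");
    return 0;
}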

Free form metadata has another disadvantage: we can't just slurp it up from
disk in huge, efficient reads. Instead we tend to mix inode table blocks,
directory entry blocks, data blocks and index blocks all together in one big
soup so that related blocks live close together. This is supposed to be great
for read performance on spinning media, and should also help control write
multiplication on solid state devices, but it is most probably going to suck
for fsck performance on spinning disk, due to seeking.

So what are we going to do about that? Well, first we want to verify that
there is actually an issue, as proved by slow fsck. We already suspect that
there is, but some of the layout optimization work we have underway might go
some distance to fixing it. After optimizing layout, we will probably still
have some work to do to get at least close to e2fsck performance. Maybe we can
come up with some smart cache preload strategy or something like that.

The real problem is, Moore's Law just does not work for spinning disks. Nobody
really wants their disk spinning faster than 7,200 RPM, or they don't want to
pay for it. Meanwhile, capacity tracks areal density, which grows as the square
of the linear bit density, while media transfer rate at a fixed RPM grows only
with the linear density. So transfer rate goes up linearly while disk capacity
goes up quadratically. Today,
it takes a couple of hours to read each terabyte of disk. Fsck is normally
faster than that, because it only reads a portion of the disk, but over time,
it breaks in the same way. The bottom line is, full fsck just isn't a viable
thing to do on your system as a standard, periodic procedure. There is really
not a lot of choice but to move on to incremental and online fsck.
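
As a back-of-envelope check of that "couple of hours per terabyte" figure,
assume a sustained media rate of around 150 MB/s (an assumed number for a
current 7,200 RPM disk; real drives vary):

#include <stdio.h>

int main(void)
{
    double mb_per_sec = 150.0;  /* assumed sustained transfer rate */
    double terabytes  = 3.0;    /* hypothetical volume size */

    double hours = terabytes * 1e6 / mb_per_sec / 3600;  /* 1 TB = 1e6 MB */
    printf("%.0f TB at %.0f MB/s: about %.1f hours for a full read\n",
           terabytes, mb_per_sec, hours);
    return 0;
}

That works out to a bit under two hours per terabyte, and the gap only gets
worse as capacity grows faster than transfer rate.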

It is quite possible that Tux3 will get to incremental and online fsck before
Ext4 does. (There you go, Ted, that is a challenge.) There is no question that
this is something that every viable, modern filesystem must do, and no,
scrubbing does not cut the mustard. We need to be able to detect errors on the
filesystem, perhaps due to blocks going bad, or heaven forbid, bugs, then
report them to the user and *fix* them on command without taking the volume
offline. If that seems hard, it is. But it simply has to be done.

So that is the Tux3 Report for today. As usual, the welcome mat is out for
developers at oftc.net #tux3. Or hop on over and join our mailing list:

    http://phunq.net/cgi-bin/mailman/listinfo/tux3

We are open to donations of various kinds, particularly of your own awesome
developer power. We have an increasing need for testers. Expect to see a
nice simple recipe for KVM testing soon. Developing kernel code in userspace
is a normal thing in the Tux3 world. It's great. If you haven't tried it yet,
you should.

Thank you for reading, and see you on #tux3.

Regards,

Daniel


* Re: Tux3 Report: Initial fsck has landed
  2013-01-28  5:55 Tux3 Report: Initial fsck has landed Daniel Phillips
@ 2013-01-28  6:02 ` David Lang
  2013-01-28  6:13   ` Daniel Phillips
  0 siblings, 1 reply; 16+ messages in thread
From: David Lang @ 2013-01-28  6:02 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, tux3, linux-fsdevel

On Sun, 27 Jan 2013, Daniel Phillips wrote:

> Compared to Ext2/3/4, Tux3 has a big disadvantage in terms of fsck: it does
> not confine inode table blocks to fixed regions of the volume. Tux3 may store
> any metadata block anywhere, and tends to stir things around to new locations
> during normal operation. To overcome this disadvantage, we have the concept of
> uptags:
>
>    http://phunq.net/pipermail/tux3/2013-January/001973.html
>    "What are uptags?"
>
> With uptags we should be able to fall back to a full scan of a damaged volume
> and get a pretty good idea of which blocks are actually lost metadata blocks,
> and to which filesystem objects they might belong.

The thing that jumps out at me with this is the question of how you will avoid
the 'filesystem image in a file' disaster that reiserfs had (where its fsck
could mix up metadata chunks from the main filesystem with metadata chunks from
any filesystem images that it happened to stumble across when scanning the disk).

Many people make images with dd if=/dev/sda2 of=filesystem.image, and if you
are doing virtualization, you may be running your systems out of these
filesystem images. With virtualization, it's very likely that you will have
many copies of a single image that are all identical.

have you thought of how to deal with this problem?

David Lang


* Re: Tux3 Report: Initial fsck has landed
  2013-01-28  6:02 ` David Lang
@ 2013-01-28  6:13   ` Daniel Phillips
  2013-01-28 14:12     ` Theodore Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Phillips @ 2013-01-28  6:13 UTC (permalink / raw)
  To: David Lang; +Cc: linux-kernel, tux3, linux-fsdevel

On Sun, Jan 27, 2013 at 10:02 PM, David Lang <david@lang.hm> wrote:
> On Sun, 27 Jan 2013, Daniel Phillips wrote:
> The thing that jumps out at me with this is the question of how you will
> avoid the 'filesystem image in a file' disaster that reiserfs had (where
> it's fsck could mix up metadata chunks from the main filesystem with
> metadata chunks from any filesystem images that it happened to stumble
> across when scanning the disk)
>
> many people with dd if=/dev/sda2 of=filesystem.image, and if you are doing
> virtualization, you may be running out of one of these filesystem images.
> With virtualization, it's very likely that you will have many copies of a
> single image that are all identical.
>
> have you thought of how to deal with this problem?
>
> David Lang

Only superficially. Deep thoughts are in order. First, there needs to be a
hole in the filesystem structure, before we would even consider trying to
plug something in there. Once we know there is a hole, we want to
narrow down the list of candidates to fill it. If a candidate already lies
within a perfectly viable file, obviously we would not want to interpret
that as lost metadata. Unless the filesystem is really messed up...

That is about as far as I have got with the analysis. Clearly, much more
is required. Suggestions welcome.

Regards,

Daniel


* Re: Tux3 Report: Initial fsck has landed
  2013-01-28  6:13   ` Daniel Phillips
@ 2013-01-28 14:12     ` Theodore Ts'o
  2013-01-28 23:27       ` David Lang
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2013-01-28 14:12 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: David Lang, linux-kernel, tux3, linux-fsdevel

On Sun, Jan 27, 2013 at 10:13:37PM -0800, Daniel Phillips wrote:
> > The thing that jumps out at me with this is the question of how you will
> > avoid the 'filesystem image in a file' disaster that reiserfs had (where
> > it's fsck could mix up metadata chunks from the main filesystem with
> > metadata chunks from any filesystem images that it happened to stumble
> > across when scanning the disk)
> >
> Only superficially. Deep thoughts are in order. First, there needs to be a
> hole in the filesystem structure, before we would even consider trying to
> plug something in there. Once we know there is a hole, we want to
> narrow down the list of candidates to fill it. If a candidate already lies
> within a perfectly viable file, obviously we would not want to interpret
> that as lost metadata. Unless the filesystem is really mess up...
> 
> That is about as far as I have got with the analysis. Clearly, much more
> is required. Suggestions welcome.

The obvious answer is what reiserfs4 ultimately ended up using.  Drop
a file system UUID in the superblock; mix the UUID into a checksum
which protects each of your metadata blocks.  We're mixing in the
inode number as well as the fs UUID in ext4's new metadata checksum
feature to protect against an inode table block getting written to the
wrong location on disk.  It will also mean that e2fsck won't mistake
an inode table from an earlier mkfs for the current file system.
This will allow us to avoid needing to zero the inode table for newly
initialized file systems.
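
Roughly, the seeding idea looks like this (an illustrative sketch, not the
actual ext4 metadata_csum code; the crc32c here is a slow bitwise version so
the example stands alone, and the helper names are made up):

#include <stdint.h>
#include <stdio.h>

/* Bitwise CRC-32C (Castagnoli polynomial), dependency-free for illustration. */
static uint32_t crc32c(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82f63b78 & -(crc & 1));
    }
    return ~crc;
}

/* Fold the fs UUID and the inode number into the seed, so identical bytes on
 * a different filesystem, or at a different inode, checksum differently. */
static uint32_t metadata_csum(const uint8_t uuid[16], uint64_t ino,
                              const void *block, size_t size)
{
    uint32_t seed = crc32c(0, uuid, 16);

    seed = crc32c(seed, &ino, sizeof ino);
    return crc32c(seed, block, size);
}

int main(void)
{
    uint8_t uuid_a[16] = { 1 }, uuid_b[16] = { 2 };
    char block[4096] = "the same metadata bytes";

    printf("fs A: %08x\n", (unsigned)metadata_csum(uuid_a, 12, block, sizeof block));
    printf("fs B: %08x\n", (unsigned)metadata_csum(uuid_b, 12, block, sizeof block));
    return 0;
}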

Regards,

						- Ted


* Re: Tux3 Report: Initial fsck has landed
  2013-01-28 14:12     ` Theodore Ts'o
@ 2013-01-28 23:27       ` David Lang
  2013-01-29  0:20         ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: David Lang @ 2013-01-28 23:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Daniel Phillips, linux-kernel, tux3, linux-fsdevel

On Mon, 28 Jan 2013, Theodore Ts'o wrote:

> On Sun, Jan 27, 2013 at 10:13:37PM -0800, Daniel Phillips wrote:
>>> The thing that jumps out at me with this is the question of how you will
>>> avoid the 'filesystem image in a file' disaster that reiserfs had (where
>>> it's fsck could mix up metadata chunks from the main filesystem with
>>> metadata chunks from any filesystem images that it happened to stumble
>>> across when scanning the disk)
>>>
>> Only superficially. Deep thoughts are in order. First, there needs to be a
>> hole in the filesystem structure, before we would even consider trying to
>> plug something in there. Once we know there is a hole, we want to
>> narrow down the list of candidates to fill it. If a candidate already lies
>> within a perfectly viable file, obviously we would not want to interpret
>> that as lost metadata. Unless the filesystem is really mess up...
>>
>> That is about as far as I have got with the analysis. Clearly, much more
>> is required. Suggestions welcome.
>
> The obvious answer is what resierfs4 ultimately ended up using.  Drop
> a file system UUID in the superblock; mix the UUID into a checksum
> which protects each of the your metadata blocks.  We're mixing in the
> inode number as well as the fs uuid in in ext4's new metadata checksum
> feature to protect against an inode table block getting written to the
> wrong location on disk.  It will also mean that e2fsck won't mistake
> an inode table from an earlier mkfs with the current file system.
> This will allow us to avoid needing to zero the inode table for newly
> initialized file systems.

The situation I'm thinking of is when dealing with VMs, you make a filesystem 
image once and clone it multiple times. Won't that end up with the same UUID in 
the superblock?

David Lang


* Re: Tux3 Report: Initial fsck has landed
  2013-01-28 23:27       ` David Lang
@ 2013-01-29  0:20         ` Darrick J. Wong
  2013-01-29  1:40           ` Theodore Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2013-01-29  0:20 UTC (permalink / raw)
  To: David Lang
  Cc: Theodore Ts'o, Daniel Phillips, linux-kernel, tux3, linux-fsdevel

On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> On Mon, 28 Jan 2013, Theodore Ts'o wrote:
> 
> >On Sun, Jan 27, 2013 at 10:13:37PM -0800, Daniel Phillips wrote:
> >>>The thing that jumps out at me with this is the question of how you will
> >>>avoid the 'filesystem image in a file' disaster that reiserfs had (where
> >>>it's fsck could mix up metadata chunks from the main filesystem with
> >>>metadata chunks from any filesystem images that it happened to stumble
> >>>across when scanning the disk)

Did that ever get fixed in reiserfs?

> >>>
> >>Only superficially. Deep thoughts are in order. First, there needs to be a
> >>hole in the filesystem structure, before we would even consider trying to
> >>plug something in there. Once we know there is a hole, we want to
> >>narrow down the list of candidates to fill it. If a candidate already lies
> >>within a perfectly viable file, obviously we would not want to interpret
> >>that as lost metadata. Unless the filesystem is really mess up...
> >>
> >>That is about as far as I have got with the analysis. Clearly, much more
> >>is required. Suggestions welcome.
> >
> >The obvious answer is what resierfs4 ultimately ended up using.  Drop
> >a file system UUID in the superblock; mix the UUID into a checksum
> >which protects each of the your metadata blocks.  We're mixing in the
> >inode number as well as the fs uuid in in ext4's new metadata checksum
> >feature to protect against an inode table block getting written to the
> >wrong location on disk.  It will also mean that e2fsck won't mistake
> >an inode table from an earlier mkfs with the current file system.
> >This will allow us to avoid needing to zero the inode table for newly
> >initialized file systems.
> 
> The situation I'm thinking of is when dealing with VMs, you make a
> filesystem image once and clone it multiple times. Won't that end up
> with the same UUID in the superblock?

Yes, but one ought to be able to change the UUID a la tune2fs -U.  Even
still... so long as the VM images have a different UUID than the fs that they
live on, it ought to be fine.

--D
> 
> David Lang


* Re: Tux3 Report: Initial fsck has landed
  2013-01-29  0:20         ` Darrick J. Wong
@ 2013-01-29  1:40           ` Theodore Ts'o
  2013-01-29  4:34             ` Daniel Phillips
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2013-01-29  1:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: David Lang, Daniel Phillips, linux-kernel, tux3, linux-fsdevel

On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> > On Mon, 28 Jan 2013, Theodore Ts'o wrote:
> > 
> > >On Sun, Jan 27, 2013 at 10:13:37PM -0800, Daniel Phillips wrote:
> > >>>The thing that jumps out at me with this is the question of how you will
> > >>>avoid the 'filesystem image in a file' disaster that reiserfs had (where
> > >>>it's fsck could mix up metadata chunks from the main filesystem with
> > >>>metadata chunks from any filesystem images that it happened to stumble
> > >>>across when scanning the disk)
> 
> Did that ever get fixed in reiserfs?

Not in reiserfs, but this was something that Hans did change for
reiserfs4.

> > The situation I'm thinking of is when dealing with VMs, you make a
> > filesystem image once and clone it multiple times. Won't that end up
> > with the same UUID in the superblock?
> 
> Yes, but one ought to be able to change the UUID a la tune2fs -U.  Even
> still... so long as the VM images have a different UUID than the fs that they
> live on, it ought to be fine.

... and this is something most system administrators should be
familiar with.  For example, it's one of those things that Norton
Ghost does when it makes file system image copies (the equivalent of
"tune2fs -U random /dev/XXX").

					- Ted


* Re: Tux3 Report: Initial fsck has landed
  2013-01-29  1:40           ` Theodore Ts'o
@ 2013-01-29  4:34             ` Daniel Phillips
  2013-03-19 23:00               ` Martin Steigerwald
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Phillips @ 2013-01-29  4:34 UTC (permalink / raw)
  To: Theodore Ts'o, Darrick J. Wong, David Lang, Daniel Phillips,
	linux-kernel, tux3, linux-fsdevel

On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
>> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
>> > The situation I'm thinking of is when dealing with VMs, you make a
>> > filesystem image once and clone it multiple times. Won't that end up
>> > with the same UUID in the superblock?
>>
>> Yes, but one ought to be able to change the UUID a la tune2fs -U.  Even
>> still... so long as the VM images have a different UUID than the fs that they
>> live on, it ought to be fine.
>
> ... and this is something most system administrators should be
> familiar with.  For example, it's one of those things that Norton
> Ghost when makes file system image copes (the equivalent of "tune2fs
> -U random /dev/XXX")

Hmm, maybe I missed something, but it does not seem like a good idea
to use the volume UUID itself to generate unique-per-volume metadata
hashes if users expect to be able to change it. All the metadata hashes
would need to be changed.

Anyway, our primary line of attack on this problem is not unique hashes,
but actually knowing which blocks are in files and which are not. Before
(a hypothetical) Tux3 fsck repair would be so bold as to reattach some lost
metadata to the place it thinks it belongs, all of the following would need
to be satisfied (a rough sketch of the combined check follows the lists below):

   * The lost metadata subtree is completely detached from the filesystem
     tree. In other words, it cannot possibly be the contents of some valid
     file already belonging to the filesystem. I believe this addresses the
     concern of David Lang at the head of this thread.

   * The filesystem tree is incomplete. Somewhere in it Tux3 fsck has
     discovered a hole that needs to be filled.

   * The lost metadata subtree is complete and consistent, except for not
     being attached to the filesystem tree.

   * The lost metadata subtree that was found matches a hole where
     metadata is missing, according to its "uptags", which specify at
     least the low order bits of the inode the metadata belongs to and
     the offset at which it belongs.

   * Tux3 fsck asked the user if this lost metadata (describing it in some
     reasonable way) should be attached to some particular filesystem
     object that appears to be incomplete. Alternatively, the lost subtree
     may be attached to the traditional "lost+found" directory, though we
     are able to be somewhat more specific about where the subtree
     might originally have belonged, and can name the lost+found object
     accordingly.

Additionally, Tux3 fsck might consider the following:

  * If the allocation bitmaps appear to be undamaged, but some or all
    of a lost filesystem tree is marked as free space, then the subtree is
    most likely free space and no attempt should be made to attach it to
    anything.
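
Pulling those conditions together, the combined check might read something
like this rough sketch (every name here is hypothetical; no such code exists
in the Tux3 tree yet):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical summary of what fsck has learned about one lost subtree and
 * one hole in the filesystem tree; purely illustrative. */
struct reattach_case {
    bool tree_has_hole;       /* the filesystem tree is missing something */
    bool subtree_detached;    /* candidate is not reachable from any live file */
    bool subtree_consistent;  /* candidate is internally complete and valid */
    bool subtree_marked_free; /* intact-looking bitmaps say this space is free */
    bool uptags_match_hole;   /* uptag inode bits and offset agree with the hole */
    bool user_approved;       /* operator said yes (otherwise, lost+found) */
};

/* Only offer to reattach when every condition in the lists above holds. */
static bool may_reattach(const struct reattach_case *c)
{
    return c->tree_has_hole
        && c->subtree_detached
        && c->subtree_consistent
        && !c->subtree_marked_free
        && c->uptags_match_hole
        && c->user_approved;
}

int main(void)
{
    struct reattach_case c = {
        .tree_has_hole = true,      .subtree_detached = true,
        .subtree_consistent = true, .subtree_marked_free = false,
        .uptags_match_hole = true,  .user_approved = true,
    };

    printf("reattach? %s\n", may_reattach(&c) ? "yes" : "no");
    return 0;
}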

Thanks for your comments. I look forward to further review as things progress.

One thing to consider: this all gets much more interesting when versioning
arrives. For shared tree snapshotting filesystem designs, this must get very
interesting indeed, to the point where even contemplating the corner cases
makes me shudder. But even with versioning, Tux3 still upholds the single-reference
rule, therefore our fsck problem will continue to look a lot more like Ext4 than
like Btrfs or ZFS. Which suggests some great opportunities for unabashed
imitation.

Regards,

Daniel


* Re: Tux3 Report: Initial fsck has landed
  2013-01-29  4:34             ` Daniel Phillips
@ 2013-03-19 23:00               ` Martin Steigerwald
  2013-03-20  4:04                 ` David Lang
  2013-03-20  6:54                 ` Rob Landley
  0 siblings, 2 replies; 16+ messages in thread
From: Martin Steigerwald @ 2013-03-19 23:00 UTC (permalink / raw)
  To: tux3
  Cc: Daniel Phillips, Theodore Ts'o, Darrick J. Wong, David Lang,
	linux-kernel, tux3, linux-fsdevel

Am Dienstag, 29. Januar 2013 schrieb Daniel Phillips:
> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
> >> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> >> > The situation I'm thinking of is when dealing with VMs, you make a
> >> > filesystem image once and clone it multiple times. Won't that end up
> >> > with the same UUID in the superblock?
> >> 
> >> Yes, but one ought to be able to change the UUID a la tune2fs
> >> -U.  Even still... so long as the VM images have a different UUID
> >> than the fs that they live on, it ought to be fine.
> > 
> > ... and this is something most system administrators should be
> > familiar with.  For example, it's one of those things that Norton
> > Ghost when makes file system image copes (the equivalent of "tune2fs
> > -U random /dev/XXX")
> 
> Hmm, maybe I missed something but it does not seem like a good idea
> to use the volume UID itself to generate unique-per-volume metadata
> hashes, if users expect to be able to change it. All the metadata hashes
> would need to be changed.

I believe that is what BTRFS is doing.

And yes, AFAIK there is no easy way to change the UUID of a BTRFS filesystem
after it has been created.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Tux3 Report: Initial fsck has landed
  2013-03-19 23:00               ` Martin Steigerwald
@ 2013-03-20  4:04                 ` David Lang
  2013-03-20  4:08                   ` Daniel Phillips
  2013-03-20 10:29                   ` Martin Steigerwald
  2013-03-20  6:54                 ` Rob Landley
  1 sibling, 2 replies; 16+ messages in thread
From: David Lang @ 2013-03-20  4:04 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: tux3, Daniel Phillips, Theodore Ts'o, Darrick J. Wong,
	linux-kernel, tux3, linux-fsdevel

On Wed, 20 Mar 2013, Martin Steigerwald wrote:

> Am Dienstag, 29. Januar 2013 schrieb Daniel Phillips:
>> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>>> On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
>>>> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
>>>>> The situation I'm thinking of is when dealing with VMs, you make a
>>>>> filesystem image once and clone it multiple times. Won't that end up
>>>>> with the same UUID in the superblock?
>>>>
>>>> Yes, but one ought to be able to change the UUID a la tune2fs
>>>> -U.  Even still... so long as the VM images have a different UUID
>>>> than the fs that they live on, it ought to be fine.
>>>
>>> ... and this is something most system administrators should be
>>> familiar with.  For example, it's one of those things that Norton
>>> Ghost when makes file system image copes (the equivalent of "tune2fs
>>> -U random /dev/XXX")
>>
>> Hmm, maybe I missed something but it does not seem like a good idea
>> to use the volume UID itself to generate unique-per-volume metadata
>> hashes, if users expect to be able to change it. All the metadata hashes
>> would need to be changed.
>
> I believe that is what BTRFS is doing.
>
> And yes, AFAIK there is no easy way to change the UUID of a BTRFS filesystems
> after it was created.

In a world where systems are cloned, and many VMs are started from one master 
copy of a filesystem, a UUID is about as far from unique as anything you can 
generate.

BTRFS may have this problem, but why should Tux3 copy the problem?

David Lang


* Re: Tux3 Report: Initial fsck has landed
  2013-03-20  4:04                 ` David Lang
@ 2013-03-20  4:08                   ` Daniel Phillips
  2013-03-20 10:29                   ` Martin Steigerwald
  1 sibling, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2013-03-20  4:08 UTC (permalink / raw)
  To: David Lang
  Cc: Martin Steigerwald, tux3, Theodore Ts'o, Darrick J. Wong,
	linux-kernel, tux3, linux-fsdevel

On Tue, Mar 19, 2013 at 9:04 PM, David Lang <david@lang.hm> wrote:
> On Wed, 20 Mar 2013, Martin Steigerwald wrote:
>
>> Am Dienstag, 29. Januar 2013 schrieb Daniel Phillips:
>>>
>>> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>>>>
>>>> On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
>>>>>
>>>>> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
>>>>>>
>>>>>> The situation I'm thinking of is when dealing with VMs, you make a
>>>>>> filesystem image once and clone it multiple times. Won't that end up
>>>>>> with the same UUID in the superblock?
>>>>>
>>>>> Yes, but one ought to be able to change the UUID a la tune2fs
>>>>> -U.  Even still... so long as the VM images have a different UUID
>>>>> than the fs that they live on, it ought to be fine.
>>>>
>>>> ... and this is something most system administrators should be
>>>> familiar with.  For example, it's one of those things that Norton
>>>> Ghost when makes file system image copes (the equivalent of "tune2fs
>>>> -U random /dev/XXX")
>>>
>>> Hmm, maybe I missed something but it does not seem like a good idea
>>> to use the volume UID itself to generate unique-per-volume metadata
>>> hashes, if users expect to be able to change it. All the metadata hashes
>>> would need to be changed.
>>
>> I believe that is what BTRFS is doing.
>>
>> And yes, AFAIK there is no easy way to change the UUID of a BTRFS
>> filesystems
>> after it was created.
>
> In a world where systems are cloned, and many VMs are started from one
> master copy of a filesystem, a UUID is about as far from unique as anything
> you can generate.
>
> BTRFS may have this problem, but why should Tux3 copy the problem?

Tux3 won't copy that problem. We have enough real problems to deal with
as it is, without manufacturing new ones.

Regards,

Daniel


* Re: Tux3 Report: Initial fsck has landed
  2013-03-19 23:00               ` Martin Steigerwald
  2013-03-20  4:04                 ` David Lang
@ 2013-03-20  6:54                 ` Rob Landley
  2013-03-21  1:49                   ` Daniel Phillips
  1 sibling, 1 reply; 16+ messages in thread
From: Rob Landley @ 2013-03-20  6:54 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: tux3, Daniel Phillips, Theodore Ts'o, Darrick J. Wong,
	David Lang, linux-kernel, tux3, linux-fsdevel

On 03/19/2013 06:00:32 PM, Martin Steigerwald wrote:
> Am Dienstag, 29. Januar 2013 schrieb Daniel Phillips:
> > On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <tytso@mit.edu>  
> wrote:
> > > On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
> > >> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> > >> > The situation I'm thinking of is when dealing with VMs, you  
> make a
> > >> > filesystem image once and clone it multiple times. Won't that  
> end up
> > >> > with the same UUID in the superblock?
> > >>
> > >> Yes, but one ought to be able to change the UUID a la tune2fs
> > >> -U.  Even still... so long as the VM images have a different UUID
> > >> than the fs that they live on, it ought to be fine.
> > >
> > > ... and this is something most system administrators should be
> > > familiar with.  For example, it's one of those things that Norton
> > > Ghost when makes file system image copes (the equivalent of  
> "tune2fs
> > > -U random /dev/XXX")
> >
> > Hmm, maybe I missed something but it does not seem like a good idea
> > to use the volume UID itself to generate unique-per-volume metadata
> > hashes, if users expect to be able to change it. All the metadata  
> hashes
> > would need to be changed.
> 
> I believe that is what BTRFS is doing.
> 
> And yes, AFAIK there is no easy way to change the UUID of a BTRFS  
> filesystems
> after it was created.

I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago,  
then nothing. Is this project dead or not?

Rob


* Re: Tux3 Report: Initial fsck has landed
  2013-03-20  4:04                 ` David Lang
  2013-03-20  4:08                   ` Daniel Phillips
@ 2013-03-20 10:29                   ` Martin Steigerwald
  1 sibling, 0 replies; 16+ messages in thread
From: Martin Steigerwald @ 2013-03-20 10:29 UTC (permalink / raw)
  To: David Lang
  Cc: tux3, Daniel Phillips, Theodore Ts'o, Darrick J. Wong,
	linux-kernel, tux3, linux-fsdevel

Am Mittwoch, 20. März 2013 schrieb David Lang:
> On Wed, 20 Mar 2013, Martin Steigerwald wrote:
> > Am Dienstag, 29. Januar 2013 schrieb Daniel Phillips:
> >> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> >>> On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
> >>>> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> >>>>> The situation I'm thinking of is when dealing with VMs, you make a
> >>>>> filesystem image once and clone it multiple times. Won't that end
> >>>>> up with the same UUID in the superblock?
> >>>> 
> >>>> Yes, but one ought to be able to change the UUID a la tune2fs
> >>>> -U.  Even still... so long as the VM images have a different UUID
> >>>> than the fs that they live on, it ought to be fine.
> >>> 
> >>> ... and this is something most system administrators should be
> >>> familiar with.  For example, it's one of those things that Norton
> >>> Ghost when makes file system image copes (the equivalent of "tune2fs
> >>> -U random /dev/XXX")
> >> 
> >> Hmm, maybe I missed something but it does not seem like a good idea
> >> to use the volume UID itself to generate unique-per-volume metadata
> >> hashes, if users expect to be able to change it. All the metadata
> >> hashes would need to be changed.
> > 
> > I believe that is what BTRFS is doing.
> > 
> > And yes, AFAIK there is no easy way to change the UUID of a BTRFS
> > filesystems after it was created.
> 
> In a world where systems are cloned, and many VMs are started from one
> master copy of a filesystem, a UUID is about as far from unique as
> anything you can generate.
> 
> BTRFS may have this problem, but why should Tux3 copy the problem?

I didn't ask for copying that behavior. I just mentioned it :)

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Tux3 Report: Initial fsck has landed
  2013-03-20  6:54                 ` Rob Landley
@ 2013-03-21  1:49                   ` Daniel Phillips
  2013-03-22  1:57                     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Phillips @ 2013-03-21  1:49 UTC (permalink / raw)
  To: Rob Landley
  Cc: Martin Steigerwald, tux3, Theodore Ts'o, Darrick J. Wong,
	David Lang, linux-kernel, tux3, linux-fsdevel

On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley <rob@landley.net> wrote:
> I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then
> nothing. Is this project dead or not?

Not. We haven't done much about updating tux3.org lately; however, you
will find plenty of activity here:

     https://github.com/OGAWAHirofumi/tux3/tree/master/user

You will also find fairly comprehensive updates on where we are and
where this is going, here:

     http://phunq.net/pipermail/tux3/

At the moment we're being pretty quiet because of being in the middle
of developing the next-gen directory index. Not such a small task, as
you might imagine.

Regards,

Daniel


* Re: Tux3 Report: Initial fsck has landed
  2013-03-21  1:49                   ` Daniel Phillips
@ 2013-03-22  1:57                     ` Dave Chinner
  2013-03-22  5:41                       ` Daniel Phillips
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2013-03-22  1:57 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rob Landley, Martin Steigerwald, tux3, Theodore Ts'o,
	Darrick J. Wong, David Lang, linux-kernel, tux3, linux-fsdevel

On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
> On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley <rob@landley.net> wrote:
> > I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then
> > nothing. Is this project dead or not?
> 
> Not. We haven't done much about updating tux3.org lately, however you
> will find plenty of activity here:
> 
>      https://github.com/OGAWAHirofumi/tux3/tree/master/user
> 
> You will also find fairly comprehensive updates on where we are and
> where this is going, here:
> 
>      http://phunq.net/pipermail/tux3/
> 
> At the moment we're being pretty quiet because of being in the middle
> of developing the next-gen directory index. Not such a small task, as
> you might imagine.

Hi Daniel,

The "next-gen directory index" comment made me curious. I wanted to
know if there's anything I could learn from what you are doing and
whether anything of your new algorithms could be applied to, say,
the XFS directory structure to improve it.

I went looking for design docs and found this:

http://phunq.net/pipermail/tux3/2013-January/001938.html

In a word: Disappointment.

Compared to the XFS directory structure, the most striking
architectural similarity that I see is this:

	"the file bteee[sic] effectively is a second directory index
	that imposes a stable ordering on directory blocks".

That was the key architectural innovation in the XFS directory
structure that allowed it to provide the correct seekdir/telldir/NFS
readdir semantics and still scale. i.e. virtually mapped directory
entries. I explained this layout recently here:

http://marc.info/?l=linux-ext4&m=136081996316453&w=2
http://marc.info/?l=linux-ext4&m=136082221117399&w=2
http://marc.info/?l=linux-ext4&m=136089526928538&w=2

We could swap the relevant portions of your PHTree design doc with
my comments (and vice versa) and both sets of references would still
make perfect sense. :P

Further, the PHTree description of tag based freespace tracking is
rather close to how XFS uses tags to track free space regions,
including the fact that XFS can be lazy at updating global free
space indexes.  The global freespace tree indexing is slightly
different to the XFS method - it's closer to the original V1 dir
code in XFS (that didn't scale at all well) than the current code.
However, that's really a fine detail compared to all the major
structural and algorithmic similarities.

Hence it appears to me that at a fundamental level PHTree is just a
re-implementation of the XFS directory architecture. It's definitely
a *major* step forward from HTree, but it can hardly be considered
revolutionary or "next-gen". It's not even state of the art. Hence:
disappointment.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Tux3 Report: Initial fsck has landed
  2013-03-22  1:57                     ` Dave Chinner
@ 2013-03-22  5:41                       ` Daniel Phillips
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2013-03-22  5:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rob Landley, Martin Steigerwald, tux3, Theodore Ts'o,
	Darrick J. Wong, David Lang, linux-kernel, tux3, linux-fsdevel

Hi Dave,

Thank you for your insightful post. The answer to the riddle is that
the PHTree scheme as described in the link you cited has already
become "last gen" and that, after roughly ten years of searching, I am
cautiously optimistic that I have discovered a satisfactory next gen
indexing scheme with the properties I was seeking. This is what
Hirofumi and I have been busy prototyping and testing for the last few
weeks. More below...

On Thu, Mar 21, 2013 at 6:57 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
>> At the moment we're being pretty quiet because of being in the middle
>> of developing the next-gen directory index. Not such a small task, as
>> you might imagine.
>
> The "next-gen directory index" comment made me curious. I wanted to
> know if there's anything I could learn from what you are doing and
> whether anything of your new algorithms could be applied to, say,
> the XFS directory structure to improve it.
>
> I went looking for design docs and found this:
>
> http://phunq.net/pipermail/tux3/2013-January/001938.html
>
> In a word: Disappointment.

Me too. While I convinced myself that the PHTree scheme would scale
significantly better than HTree while being only modestly slower than
HTree in the smaller range (millions of files), even the new scheme
began hitting significant difficulties in the form of write
multiplication in the larger range (billions of files). Most probably,
you discovered the same thing. The problem is not so much about
randomly thrashing the index, because these days even a cheap desktop
can cache the entire index, but rather getting the index onto disk
with proper atomic update at reasonable intervals. We can't accept a
situation where crashing on the 999,999,999th file create requires the
entire index to be rebuilt, or even a significant portion of it. That
means we need ACID commit at normal intervals all the way through the
heavy create load, and unfortunately, that's where the write
multiplication issue rears its ugly head. It turned out that most of
the PHTree index blocks would end up being written to disk hundreds of
times each, effectively stretching out what should be a 10 minute test
to hours.
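
To put a toy number on that effect (made-up parameters, nothing to do with
the real PHTree code): random-looking, hashed inserts dirty leaf blocks all
over the index, and every periodic atomic commit rewrites each leaf dirtied
since the last commit.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    enum { LEAVES = 10000 };           /* index leaf blocks */
    long creates = 10L * 1000 * 1000;  /* file creates in the run */
    long commit_every = 10000;         /* creates per atomic commit */

    static unsigned char dirty[LEAVES];
    long long writes = 0;

    srand(1);
    for (long i = 1; i <= creates; i++) {
        dirty[rand() % LEAVES] = 1;      /* hashed name lands in a random leaf */
        if (i % commit_every == 0) {     /* commit: flush every dirty leaf */
            for (int b = 0; b < LEAVES; b++)
                writes += dirty[b];
            memset(dirty, 0, sizeof dirty);
        }
    }
    printf("average rewrites per leaf block: %.0f\n", (double)writes / LEAVES);
    return 0;
}

With those parameters each leaf gets rewritten hundreds of times, which is
exactly the kind of multiplier that turns a ten minute run into hours.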

To solve this, I eventually came up with a secondary indexing scheme
that would kick in under heavy file create load, to take care of
committing enough state to disk at regular intervals that remounting
after a crash would only lose a few seconds of work. With this, PHTree
would satisfy all the performance goals we set out for it, which can
be summarized as: scale smoothly all the way from one file per
directory to one billion files per directory.

The only really distasteful element remaining was the little matter of
having two different directory indexes, the PHTree and the temporary
secondary index. That seems like one index too many. Then the big aha
landed out of the blue: we can actually throw away the primary BTree
and the secondary index will work fine all on its own. So the
secondary index was suddenly promoted to a "next gen" primary index.
This new index (which does not yet have a name) is based on hashing
and sharding and has nothing whatsoever to do with BTrees. It
currently exists in prototype with enough implementation in place to
get some early benchmarks, but we are not quite ready to provide those
details yet. You are completely welcome to call me a tease or be
sceptical, which is just what I would do coming from the other side,
but just now we're in the thick of the heavy work, and the key as we
see it is to keep on concentrating until the time is right. After all,
this amounts to the end of a ten year search that began around the
time HTree went into service in Ext3. Another couple of weeks hardly
seems worth worrying about.

> Compared to the XFS directory structure, the most striking
> architectural similarity that I see is this:
>
>         "the file bteee[sic] effectively is a second directory index
>         that imposes a stable ordering on directory blocks".
>
> That was the key architectural innovation in the XFS directory
> structure that allowed it to provide the correct seekdir/telldir/NFS
> readdir semantics and still scale. i.e. virtually mapped directory
> entries. I explained this layout recently here:
>
> http://marc.info/?l=linux-ext4&m=136081996316453&w=2
> http://marc.info/?l=linux-ext4&m=136082221117399&w=2
> http://marc.info/?l=linux-ext4&m=136089526928538&w=2
>
> We could swap the relevant portions of your PHTree design doc with
> my comments (and vice versa) and both sets of references would still
> make perfect sense. :P
>
> Further, the PHTree description of tag based freespace tracking is
> rather close to how XFS uses tags to track free space regions,
> including the fact that XFS can be lazy at updating global free
> space indexes.  The global freespace tree indexing is slightly
> different to the XFS method - it's closer to the original V1 dir
> code in XFS (that didn't scale at all well) than the current code.
> However, that's really a fine detail compared to all the major
> structural and algorithmic similarities.
>
> Hence it appears to me that at a fundamental level PHTree is just a
> re-implementation of the XFS directory architecture. It's definitely
> a *major* step forward from HTree, but it can hardly be considered
> revolutionary or "next-gen". It's not even state of the art. Hence:
> disappointment.

Insightful indeed, and right on the money. I had no idea we were
reinventing XFS to that extent and would love to spend some time later
dwelling on the details. But at this point, I am willing to cautiously
characterize all that as history, based on the performance numbers we
are seeing from the next gen prototype. We plan to publish details
fairly soon. I will apologize again for the lag. I can only plead that
this kind of work just seems to take more time than it reasonably
should.

Regards,

Daniel



Thread overview: 16+ messages
2013-01-28  5:55 Tux3 Report: Initial fsck has landed Daniel Phillips
2013-01-28  6:02 ` David Lang
2013-01-28  6:13   ` Daniel Phillips
2013-01-28 14:12     ` Theodore Ts'o
2013-01-28 23:27       ` David Lang
2013-01-29  0:20         ` Darrick J. Wong
2013-01-29  1:40           ` Theodore Ts'o
2013-01-29  4:34             ` Daniel Phillips
2013-03-19 23:00               ` Martin Steigerwald
2013-03-20  4:04                 ` David Lang
2013-03-20  4:08                   ` Daniel Phillips
2013-03-20 10:29                   ` Martin Steigerwald
2013-03-20  6:54                 ` Rob Landley
2013-03-21  1:49                   ` Daniel Phillips
2013-03-22  1:57                     ` Dave Chinner
2013-03-22  5:41                       ` Daniel Phillips
