linux-kernel.vger.kernel.org archive mirror
* soft update vs journaling?
@ 2006-01-22  6:42 John Richard Moser
  2006-01-22  8:51 ` Jan Engelhardt
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: John Richard Moser @ 2006-01-22  6:42 UTC (permalink / raw)
  To: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So I've been researching, because I thought this "Soft Update" thing
that BSD uses was some weird freak-ass way to totally corrupt a file
system if the power drops.  Seems I was wrong; it's actually just the
opposite, an alternate solution to journaling.  So let's compare notes.

I'm not quite clear on the benefits versus costs of Soft Update versus
journaling, so I'll run down what I've got, anyone who wants to give
input can run down what they've got, and we can compare.  Maybe someone
will write a Soft Update system into Linux one day, far, far into the
future; but I doubt it.  It might, however, be interesting to compare
ext2 + SU to ext3; and getting the chance to solve problems such as
delayed deletes (i.e. the file system fills up while soft update has not
yet executed a delete; react by finding a pending delete and forcing it
to execute immediately) might also be cool.


Soft Update appears to buffer and order meta-data writes in a dependency
scheme that makes certain that inconsistencies can't happen.  Apparently
this means writing out inodes before the directory entries that reference
them, or something to that effect.  I can't see how this would help in
the middle of a buffer flush (half a dentry written?  Partially deleted
inode?  Inode "deleted" but not freed on disk?), so maybe someone can
fill me in.
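
To make the ordering idea concrete, here's a toy userspace sketch of
what I think "ordered by dependency" means (entirely made up; the real
BSD code tracks much finer-grained dependencies, per field rather than
per buffer): a dirty metadata buffer is only written after the buffers
it depends on have been written, so e.g. a directory block can never
reach disk before the inode block it points into.

#include <stdio.h>
#include <stddef.h>

struct metabuf {
    const char     *name;
    int             dirty;
    struct metabuf *depends_on;    /* must reach disk before us */
};

static void flush(struct metabuf *b)
{
    if (!b->dirty)
        return;
    if (b->depends_on)
        flush(b->depends_on);      /* honor the ordering constraint */
    printf("writing %s\n", b->name);
    b->dirty = 0;
}

int main(void)
{
    struct metabuf inode = { "new inode block", 1, NULL };
    struct metabuf dir   = { "directory block", 1, &inode };

    flush(&dir);   /* writes the inode block first, then the dir block */
    return 0;
}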

Journaling apparently means writing out meta-data to a log before
transferring it to the file system.  No matter what happens, a proper
journal (for fun I've designed a transaction log format for low-level
filesystems; it's entirely possible to make an interruption at any bit
recoverable) can always be checked over and either rolled back or rolled
forward.  This is easy to design.
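
To illustrate what I mean by rolled back or rolled forward, a minimal
userspace sketch (no relation to any real journal format, including the
one I designed): updates are appended to a log, and only transactions
whose commit record made it to the log get replayed after a crash.

#include <stdio.h>

struct logrec {
    int  committed;           /* 1 once the commit record hit the log */
    char update[32];          /* the metadata change, opaque here     */
};

static void recover(const struct logrec *log, int n)
{
    for (int i = 0; i < n; i++) {
        if (log[i].committed)
            printf("replay: %s\n", log[i].update);    /* roll forward */
        else
            printf("discard: %s\n", log[i].update);   /* roll back    */
    }
}

int main(void)
{
    struct logrec log[2] = {
        { 1, "update inode 12" },      /* committed before the crash */
        { 0, "update bitmap blk 7" },  /* crash hit mid-transaction  */
    };

    recover(log, 2);
    return 0;
}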

Soft Update appears to have the advantage of not needing multiple
writes.  There's no need for journal flushing and then disk flushing;
you just flush the meta-data.  Also, soft update systems mount
instantly, because there's no journal to play back, and the file system
is always consistent.  It may be technically feasible to implement soft
update on any old file system; I'm unclear as to how exactly to make any
soft-update scheme work, so I can't say if this is absolutely possible
(think of vfat, consistent at all times and still Win32 compatible;
great for flash drives).

Unfortunately, soft update can leave awkward situations where areas of
disk are still allocated after a system failure during an inode delete.
This won't cause inconsistencies in the on-disk structure, however; you
can freely use the disk without causing even more damage.  The system
just has to sanity-check things while running and clean up such damage
as it sees it.

Journaling appears to have the advantage that the data gets to disk
faster.  It also seems an easier concept to grasp (i.e. I understand it
fully).  It's old, tried, trusted, and durable.  You also don't have to
worry about odd meta-data writes that leave deleted files around in
certain circumstances, eating up space.

Unfortunately, journaling uses a chunk of space.  Imagine a journal on a
USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
Sure it could be done in 8 or 4 or so; or (in one of my file system
designs) a static 16KiB block could reference dynamically allocated
journal space, allowing the system to sacrifice performance and shrink
the journal when more space is needed.  Either way, slow media like
floppies will suffer, HARD; and flash devices will see a lot of
write/erase all over the journal area, causing wear on that spot.

So, that's my understanding.  Any comments?  Enlighten me.
- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD0yldhDd4aOud5P8RAhzBAJwOvWpAYb+m3Zg8ugnvuY10K74jZgCeL69s
y0172JATNX+q8jzrYGAJ/xc=
=7Dcn
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  6:42 soft update vs journaling? John Richard Moser
@ 2006-01-22  8:51 ` Jan Engelhardt
  2006-01-22 18:40   ` John Richard Moser
  2006-01-22 19:05   ` Adrian Bunk
  2006-01-22  9:31 ` Theodore Ts'o
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 32+ messages in thread
From: Jan Engelhardt @ 2006-01-22  8:51 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

>Unfortunately, journaling uses a chunk of space.  Imagine a journal on a
>USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>Sure it could be done in 8 or 4 or so; or (in one of my file system
>designs) a static 16KiB block could reference dynamically allocated
>journal space, allowing the system to sacrifice performance and shrink
>the journal when more space is needed.  Either way, slow media like
>floppies will suffer, HARD; and flash devices will see a lot of
>write/erase all over the journal area, causing wear on that spot.

 - Smallest reiserfs3 journal size is 513 blocks - some 2 megabytes,
   which would be ok with me for a 128meg drive.
   Most of the time you need vfat anyway for your flash stick to make
   real use of it on Windows.

 - reiser4's journal is even smaller than reiser3's with a new fresh
   filesystem - same goes for jfs and xfs (below 1 megabyte IIRC)

 - I would not use a journalling filesystem at all on media that degrades
   faster than hard disks do (flash drives, CD-RWs/DVD-RWs/RAMs).
   There are specially-crafted filesystems for that, mostly jffs and udf.

 - You really need a hell of a power fluctuation to get a disk crippled.
   Just powering off (and potentially on after a few milliseconds) did
   (in my cases) just stop a disk write wherever it happened to be,
   and that seemed easily correctable.


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  6:42 soft update vs journaling? John Richard Moser
  2006-01-22  8:51 ` Jan Engelhardt
@ 2006-01-22  9:31 ` Theodore Ts'o
  2006-01-22 18:54   ` John Richard Moser
  2006-01-22 19:50   ` Diego Calleja
  2006-01-22 19:26 ` James Courtier-Dutton
  2006-01-23  5:32 ` Michael Loftis
  3 siblings, 2 replies; 32+ messages in thread
From: Theodore Ts'o @ 2006-01-22  9:31 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

On Sun, Jan 22, 2006 at 01:42:38AM -0500, John Richard Moser wrote:
> Soft Update appears to have the advantage of not needing multiple
> writes.  There's no need for journal flushing and then disk flushing;
> you just flush the meta-data.  

Not quite true; there are cases where Soft Update will have to do
multiple writes, when a particular block containing meta-data has
multiple changes in it that have to be committed to the filesystem at
different times in order to maintain consistency; this is particularly
true when a block is part of the inode table, for example.  When this
happens, the soft update machinery has to allocate memory for a block
and then undo changes to that block which come from transactions that
are not yet ready to be written to disk.
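
Roughly, as a toy illustration (this is not the actual soft updates
code, just the shape of the undo step): one disk block holds several
inodes, and before writing it out you work on a copy in which any inode
whose ordering dependencies aren't satisfied yet has been reverted to
its last on-disk contents.

#include <stdio.h>
#include <string.h>

#define INODES_PER_BLOCK 4

struct inode_slot {
    int new_value;       /* dirty, in-memory contents     */
    int old_value;       /* last contents known on disk   */
    int safe_to_write;   /* dependencies already on disk? */
};

static void write_block(const struct inode_slot *blk)
{
    struct inode_slot shadow[INODES_PER_BLOCK];

    memcpy(shadow, blk, sizeof(shadow));
    for (int i = 0; i < INODES_PER_BLOCK; i++)
        if (!shadow[i].safe_to_write)
            shadow[i].new_value = shadow[i].old_value;  /* undo */

    for (int i = 0; i < INODES_PER_BLOCK; i++)
        printf("slot %d goes to disk as %d\n", i, shadow[i].new_value);
    /* the undone slots stay dirty and force another write later */
}

int main(void)
{
    struct inode_slot blk[INODES_PER_BLOCK] = {
        { 10, 1, 1 }, { 20, 2, 0 }, { 30, 3, 1 }, { 40, 4, 0 },
    };

    write_block(blk);
    return 0;
}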

In general, though, it is true that Soft Updates can result in fewer
disk writes compared to filesystems that utilize traditional
journaling approaches, and this might even be noticeable if your
workload is heavily skewed towards metadata updates.  (This is mainly
true in benchmarks that are horrendously disconnected from the real
world, such as dbench.)

One major downside with Soft Updates that you haven't mentioned in
your note, is that the amount of complexity it adds to the filesystem
is tremendous; the filesystem has to keep track of a very complex
state machinery, with knowledge of the ordering constraints of
each change to the filesystem and how to "back out" parts of the
change when that becomes necessary.

Whenever you want to extend a filesystem to add some new feature, such
as online resizing, for example, it's not enough to just add that
feature; you also have to modify the black magic which is the Soft
Updates machinery.  This significantly increases the difficulty of adding
new features to a filesystem, and can act as a roadblock to people
wanting to add them.  I can't conclusively blame the lack of online
resizing in BSD UFS on Soft Updates, but it is clear that adding this
and other features is much more difficult when you are dealing with
soft update code.

> Also, soft update systems mount instantly, because there's no
> journal to play back, and the file system is always consistent.  

This is only true if you don't care about recovering lost data blocks.
Fixing this requires that you run the equivalent of fsck on the
filesystem.  If you do, then there is a major difference in performance.
Even if you can do the fsck scan on-line, it will greatly slow down
normal operations while recovering from a system crash, and the
slowdown associated with doing a journal replay is far smaller in
comparison.

> Unfortunately, journaling uses a chunk of space.  Imagine a journal on a
> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
> Sure it could be done in 8 or 4 or so; or (in one of my file system
> designs) a static 16KiB block could reference dynamicly allocated
> journal space, allowing the system to sacrifice performance and shrink
> the journal when more space is needed.  Either way, slow media like
> floppies will suffer, HARD; and flash devices will see a lot of
> write/erase all over the journal area, causing wear on that spot.

If you are using flash, use a filesystem which is optimized for flash,
such as JFFS2.  Otherwise, note that in most cases disk space is
nearly free, so allocating even 128 megs for the journal is chump
change when you're talking about a 200GB or larger hard drive.

Also note that if you have to use slow media, one of the things which
you can do is use a separate (fast) device for your journal; there is
no rule which says the journal has to be on the slow device.  

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  8:51 ` Jan Engelhardt
@ 2006-01-22 18:40   ` John Richard Moser
  2006-01-22 19:05   ` Adrian Bunk
  1 sibling, 0 replies; 32+ messages in thread
From: John Richard Moser @ 2006-01-22 18:40 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Jan Engelhardt wrote:
>>Unfortunately, journaling uses a chunk of space.  Imagine a journal on a
>>USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>>Sure it could be done in 8 or 4 or so; or (in one of my file system
>>designs) a static 16KiB block could reference dynamically allocated
>>journal space, allowing the system to sacrifice performance and shrink
>>the journal when more space is needed.  Either way, slow media like
>>floppies will suffer, HARD; and flash devices will see a lot of
>>write/erase all over the journal area, causing wear on that spot.
> 
> 
>  - Smallest reiserfs3 journal size is 513 blocks - some 2 megabytes,
>    which would be ok with me for a 128meg drive.
>    Most of the time you need vfat anyway for your flash stick to make
>    real use of it on Windows.
> 
>  - reiser4's journal is even smaller than reiser3's with a new fresh
>    filesystem - same goes for jfs and xfs (below 1 megabyte IIRC)
> 

Nice, but that doesn't solve...

>  - I would not use a journalling filesystem at all on media that degrades
>    faster than hard disks do (flash drives, CD-RWs/DVD-RWs/RAMs).
>    There are specially-crafted filesystems for that, mostly jffs and udf.
> 

Yes.  They'll degrade very, very fast.  This is where Soft Update would
have an advantage.  Another issue here is that we can't just slap a
journal onto vfat for all those flash devices we want to share with Windows.

>  - You really need a hell of a power fluctuation to get a disk crippled.
>    Just powering off (and potentially on after a few milliseconds) did
>    (in my cases) just stop a disk write wherever it happened to be,
>    and that seemed easily correctable.

Yeah, I never said you could cripple a disk with power problems.  You
COULD destroy the NAND in a flash device by nuking the same area with
10000000000000 writes.

> 
> 
> Jan Engelhardt

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD09GOhDd4aOud5P8RAr1lAJ9fGMSJOd4QALc4nCbx+jDLgTlijwCbBM94
r60oZO/x2Q0xEWeF9sp9Vz8=
=63vo
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  9:31 ` Theodore Ts'o
@ 2006-01-22 18:54   ` John Richard Moser
  2006-01-22 21:02     ` Theodore Ts'o
  2006-01-22 19:50   ` Diego Calleja
  1 sibling, 1 reply; 32+ messages in thread
From: John Richard Moser @ 2006-01-22 18:54 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Theodore Ts'o wrote:
> On Sun, Jan 22, 2006 at 01:42:38AM -0500, John Richard Moser wrote:
> 
>>Soft Update appears to have the advantage of not needing multiple
>>writes.  There's no need for journal flushing and then disk flushing;
>>you just flush the meta-data.  
> 
> 
> Not quite true; there are cases where Soft Update will have to do
> multiple writes, when a particular block containing meta-data has
> multiple changes in it that have to be committed to the filesystem at
> different times in order to maintain consistency; this is particularly

Yes, that makes sense.

> true when a block is part of the inode table, for example.  When this
> happens, the soft update machinery has to allocate memory for a block
> and then undo changes to that block which come from transactions that
> are not yet ready to be written to disk.
> 
> In general, though, it is true that Soft Updates can result in fewer
> disk writes compared to filesystems that utilize traditional
> journaling approaches, and this might even be noticeable if your
> workload is heavily skewed towards metadata updates.  (This is mainly
> true in benchmarks that are horrendously disconnected from the real
> world, such as dbench.)

Yeah, microbenchmarks are all "THIS WILL NEVER HAPPEN MORE THAN ONCE
EVERY BILLION ZILLION YEARS, BUT LOOK, WE'RE FASTER BY LIKE 1
MICROSECOND" stuff.

> 
> One major downside with Soft Updates that you haven't mentioned in
> your note, is that the amount of complexity it adds to the filesystem
> is tremendous; the filesystem has to keep track of a very complex
> state machinery, with knowledge of the ordering constraints of
> each change to the filesystem and how to "back out" parts of the
> change when that becomes necessary.

Yes, I had figured soft update would be a lot more complex than
journaling.  Though, could the bulk of it be implemented
filesystem-independently?  I could see a "Soft Update API" that lets
file systems sketch out the dependencies each meta-data operation has
and describe the ordering; it would, of course, be a total pain in the
ass to do.
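
Something like this purely hypothetical sketch (none of these names
exist anywhere; they're invented just to make the idea concrete):

/* fs-independent core */
struct su_buf;                               /* opaque dirty-buffer handle */

struct su_buf *su_get(void *fs_private, unsigned long blocknr);
void su_mark_dirty(struct su_buf *buf);

/* "buf depends on dep": dep must be on disk before buf may be written */
int su_add_dependency(struct su_buf *buf, struct su_buf *dep);

/* called from generic writeback; walks the dependency graph and undoes
 * whatever isn't safe to let out yet */
int su_flush(struct su_buf *buf);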

> 
> Whenever you want to extend a filesystem to add some new feature, such
> as online resizing, for example, it's not enough to just add that

Online resizing is ever safe?  I mean, with on-disk filesystem layout
support I could somewhat believe it for growing; for shrinking you'd
need a way to move files around without damaging them (possible).  I
guess it would be.

So how does this work?  Move files -> alter file system superblocks?

> feature; you also have to modify the black magic which is the Soft
> Updates machinery.  This significantly increases the difficulty of adding
> new features to a filesystem, and can act as a roadblock to people
> wanting to add them.  I can't conclusively blame the lack of online
> resizing in BSD UFS on Soft Updates, but it is clear that adding this
> and other features is much more difficult when you are dealing with
> soft update code.
> 

Nod.

> 
>>Also, soft update systems mount instantly, because there's no
>>journal to play back, and the file system is always consistent.  
> 
> 
> This is only true if you don't care about recovering lost data blocks.
> Fixing this requires that you run the equivalent of fsck on the
> filesystem.  If you do, then there is a major difference in performance.
> Even if you can do the fsck scan on-line, it will greatly slow down
> normal operations while recovering from a system crash, and the
> slowdown associated with doing a journal replay is far smaller in
> comparison.

A passive-active approach could passively generate a list of inodes from
dentries as they're accessed; and actively walk the directory tree when
the disk is idle.  Then a quick allocation check between inodes and
whatever allocation lists or trees there are could be done.

This has the disadvantage that if the system is under heavy load, the
recovery won't get done.  There's also a period where the disk may be
rather full, causing fragmentation or out-of-space errors along the way.
The only way to counter this would be to force a mandatory minimum
amount of recovery activity per time interval, which brings back the
original slowdown problem.
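
As a toy sketch of the bookkeeping (illustration only): mark every
block whose owning inode has been seen, either passively at lookup time
or from the idle-time walk, and once the walk has covered everything,
reclaim whatever is marked allocated but was never claimed.

#include <stdio.h>

#define NBLOCKS 8

static int block_allocated[NBLOCKS]  = { 1, 1, 0, 1, 0, 1, 0, 0 };
static int block_owner_seen[NBLOCKS];        /* set as inodes are visited */

static void saw_inode_owning(int blk)        /* from lookup or idle walk  */
{
    block_owner_seen[blk] = 1;
}

int main(void)
{
    saw_inode_owning(0);
    saw_inode_owning(3);
    saw_inode_owning(5);

    /* walk finished: block 1 is allocated but unclaimed, i.e. leaked */
    for (int i = 0; i < NBLOCKS; i++)
        if (block_allocated[i] && !block_owner_seen[i])
            printf("reclaiming leaked block %d\n", i);
    return 0;
}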

> 
> 
>>Unfortunately, journaling uses a chunk of space.  Imagine a journal on a
>>USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>>Sure it could be done in 8 or 4 or so; or (in one of my file system
>>designs) a static 16KiB block could reference dynamically allocated
>>journal space, allowing the system to sacrifice performance and shrink
>>the journal when more space is needed.  Either way, slow media like
>>floppies will suffer, HARD; and flash devices will see a lot of
>>write/erase all over the journal area, causing wear on that spot.
> 
> 
> If you are using flash, use a filesystem which is optimized for flash,
> such as JFFS2.  Otherwise, note that in most cases disk space is

What about a NAND flash chip on a USB drive like a SanDisk Cruizer Mini?
 Or hell, a compact flash card for use in a digital camera.

> nearly free, so allocating even 128 megs for the journal is chump
> change when you're talking about a 200GB or larger hard drive.
> 
> Also note that if you have to use slow media, one of the things which
> you can do is use a separate (fast) device for your journal; there is
> no rule which says the journal has to be on the slow device.  

Unless it's portable and you don't want to reconfigure every system.

> 
> 						- Ted
> 

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD09TdhDd4aOud5P8RAv1HAJ9SUeY0c42RognwsR6ve1w4XvFalwCdFc8N
feGuco4l9lz4yQB4U3tDcW8=
=4QFG
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  8:51 ` Jan Engelhardt
  2006-01-22 18:40   ` John Richard Moser
@ 2006-01-22 19:05   ` Adrian Bunk
  2006-01-22 19:08     ` Arjan van de Ven
  1 sibling, 1 reply; 32+ messages in thread
From: Adrian Bunk @ 2006-01-22 19:05 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: John Richard Moser, linux-kernel

On Sun, Jan 22, 2006 at 09:51:10AM +0100, Jan Engelhardt wrote:
>...
>  - I would not use a journalling filesystem at all on media that degrades
>    faster than hard disks do (flash drives, CD-RWs/DVD-RWs/RAMs).
>    There are specially-crafted filesystems for that, mostly jffs and udf.
>...

[ ] you know what the "j" in "jffs" stands for

> Jan Engelhardt

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 19:05   ` Adrian Bunk
@ 2006-01-22 19:08     ` Arjan van de Ven
  2006-01-22 19:25       ` Adrian Bunk
  2006-01-24  2:33       ` Jörn Engel
  0 siblings, 2 replies; 32+ messages in thread
From: Arjan van de Ven @ 2006-01-22 19:08 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: Jan Engelhardt, John Richard Moser, linux-kernel

On Sun, 2006-01-22 at 20:05 +0100, Adrian Bunk wrote:
> On Sun, Jan 22, 2006 at 09:51:10AM +0100, Jan Engelhardt wrote:
> >...
> >  - I would not use a journalling filesystem at all on media that degrades
> >    faster than hard disks do (flash drives, CD-RWs/DVD-RWs/RAMs).
> >    There are specially-crafted filesystems for that, mostly jffs and udf.
> >...
> 
> [ ] you know what the "j" in "jffs" stands for

it stands for "logging" since jffs2 at least is NOT a journalling
filesystem.... but a logging one. I assume jffs is too.
 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 19:08     ` Arjan van de Ven
@ 2006-01-22 19:25       ` Adrian Bunk
  2006-01-24  2:33       ` Jörn Engel
  1 sibling, 0 replies; 32+ messages in thread
From: Adrian Bunk @ 2006-01-22 19:25 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Jan Engelhardt, John Richard Moser, linux-kernel

On Sun, Jan 22, 2006 at 08:08:17PM +0100, Arjan van de Ven wrote:
> On Sun, 2006-01-22 at 20:05 +0100, Adrian Bunk wrote:
> > On Sun, Jan 22, 2006 at 09:51:10AM +0100, Jan Engelhardt wrote:
> > >...
> > >  - I would not use a journalling filesystem at all on media that degrades
> > >    faster than hard disks do (flash drives, CD-RWs/DVD-RWs/RAMs).
> > >    There are specially-crafted filesystems for that, mostly jffs and udf.
> > >...
> > 
> > [ ] you know what the "j" in "jffs" stands for
> 
> it stands for "logging" since jffs2 at least is NOT a journalling
> filesystem.... but a logging one. I assume jffs is too.

Ah, sorry.

It seems I confused this with Reiser4 and its wandering logs.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  6:42 soft update vs journaling? John Richard Moser
  2006-01-22  8:51 ` Jan Engelhardt
  2006-01-22  9:31 ` Theodore Ts'o
@ 2006-01-22 19:26 ` James Courtier-Dutton
  2006-01-23  0:06   ` John Richard Moser
  2006-01-23  5:32 ` Michael Loftis
  3 siblings, 1 reply; 32+ messages in thread
From: James Courtier-Dutton @ 2006-01-22 19:26 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

John Richard Moser wrote:
> 
> Unfortunately, journaling uses a chunk of space.  Imagine a journal on a
> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
> Sure it could be done in 8 or 4 or so; or (in one of my file system
> designs) a static 16KiB block could reference dynamically allocated
> journal space, allowing the system to sacrifice performance and shrink
> the journal when more space is needed.  Either way, slow media like
> floppies will suffer, HARD; and flash devices will see a lot of
> write/erase all over the journal area, causing wear on that spot.
> 

My understanding is that if one designed a power supply with enough 
headroom, one could remove the power and still have time to write dirty 
sectors to the USB flash stick. Would this not remove the need for a 
journaling fs on a flash stick?  This would remove the "wear on that
spot" problem. Actually USB flash sticks are a bit clever, in that they 
add an extra layer of translation to the write. I.e. If you write to the 
same sector again and again, the USB flash stick will actually write it 
to a different area of the memory each time. This is specifically done 
to avoid the "wear on that spot" problem.
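
As a toy model of that remapping (real flash translation layers are far
more involved, with erase blocks, garbage collection and so on):
rewriting the same logical sector lands on a fresh physical page each
time, which spreads the wear.

#include <stdio.h>

#define NPHYS 16

static int map[4] = { -1, -1, -1, -1 };   /* logical sector -> phys page */
static int next_free;

static void ftl_write(int logical)
{
    map[logical] = next_free++ % NPHYS;   /* old page gets erased later */
    printf("logical %d now lives in physical page %d\n",
           logical, map[logical]);
}

int main(void)
{
    for (int i = 0; i < 4; i++)
        ftl_write(0);    /* same sector, four different pages */
    return 0;
}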

This "flush on power fail" approach is not so easy with a HD because it 
uses more power and takes longer to flush.

James


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  9:31 ` Theodore Ts'o
  2006-01-22 18:54   ` John Richard Moser
@ 2006-01-22 19:50   ` Diego Calleja
  2006-01-22 20:39     ` Suleiman Souhlal
  2006-01-23  1:00     ` John Richard Moser
  1 sibling, 2 replies; 32+ messages in thread
From: Diego Calleja @ 2006-01-22 19:50 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: nigelenki, linux-kernel

El Sun, 22 Jan 2006 04:31:44 -0500,
Theodore Ts'o <tytso@mit.edu> escribió:


> One major downside with Soft Updates that you haven't mentioned in
> your note, is that the amount of complexity it adds to the filesystem
> is tremendous; the filesystem has to keep track of a very complex
> state machinery, with knowledge of the ordering constraints of
> each change to the filesystem and how to "back out" parts of the
> change when that becomes necessary.


And FreeBSD is implementing journaling for UFS and getting rid of
softupdates [1]. While this doesn't prove that softupdates is "a bad idea",
I think it shows that the added softupdates complexity doesn't seem
to pay off in the real world.

[1]: http://lists.freebsd.org/pipermail/freebsd-hackers/2004-December/009261.html

"4.  Journaled filesystem.  While we can debate the merits of speed and
data integrety of journalling vs. softupdates, the simple fact remains
that softupdates still requires a fsck run on recovery, and the
multi-terabyte filesystems that are possible these days make fsck a very
long and unpleasant experience, even with bg-fsck.  There was work at
some point at RPI to add journaling to UFS, but there hasn't been much
status on that in a long time.  There have also been proposals and
works-in-progress to port JFS, ReiserFS, and XFS.  Some of these efforts
are still alive, but they need to be seen through to completion"

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 19:50   ` Diego Calleja
@ 2006-01-22 20:39     ` Suleiman Souhlal
  2006-01-22 20:50       ` Diego Calleja
  2006-01-23  1:00     ` John Richard Moser
  1 sibling, 1 reply; 32+ messages in thread
From: Suleiman Souhlal @ 2006-01-22 20:39 UTC (permalink / raw)
  To: Diego Calleja; +Cc: Theodore Ts'o, nigelenki, linux-kernel

Diego Calleja wrote:
> And FreeBSD is implementing journaling for UFS and getting rid of 
> softupdates [1]. While this doesn't prove that softupdates is "a bad idea",
> I think it shows that the added softupdates complexity doesn't seem
> to pay off in the real world. 

You read the message wrong: We're not getting rid of softupdates.
-- Suleiman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 20:39     ` Suleiman Souhlal
@ 2006-01-22 20:50       ` Diego Calleja
  0 siblings, 0 replies; 32+ messages in thread
From: Diego Calleja @ 2006-01-22 20:50 UTC (permalink / raw)
  To: Suleiman Souhlal; +Cc: tytso, nigelenki, linux-kernel

El Sun, 22 Jan 2006 12:39:38 -0800,
Suleiman Souhlal <ssouhlal@FreeBSD.org> escribió:

> Diego Calleja wrote:
> > And FreeBSD is implementing journaling for UFS and getting rid of 
> > softupdates [1]. While this doesn't prove that softupdates is "a bad idea",
> > I think it shows that the added softupdates complexity doesn't seem
> > to pay off in the real world. 
> 
> You read the message wrong: We're not getting rid of softupdates.
> -- Suleiman


Oh, both systems will be available at the same time?  That will
certainly be a good place to compare both approaches.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 18:54   ` John Richard Moser
@ 2006-01-22 21:02     ` Theodore Ts'o
  2006-01-22 22:44       ` Kyle Moffett
  2006-01-23  1:02       ` John Richard Moser
  0 siblings, 2 replies; 32+ messages in thread
From: Theodore Ts'o @ 2006-01-22 21:02 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

On Sun, Jan 22, 2006 at 01:54:23PM -0500, John Richard Moser wrote:
> > Whenever you want to extend a filesystem to add some new feature, such
> > as online resizing, for example, it's not enough to just add that
> 
> Online resizing is ever safe?  I mean, with on-disk filesystem layout
> support I could somewhat believe it for growing; for shrinking you'd
> need a way to move files around without damaging them (possible).  I
> guess it would be.
> 
> So how does this work?  Move files -> alter file system superblocks?

The online resizing support in ext3 only grows the filesystem; it
doesn't shrink it.  What is currently supported in 2.6 requires you to
reserve space in advance.  There is also a slight modification to the
ext2/3 filesystem format, only supported by Linux 2.6, which
allows you to grow the filesystem without needing to move filesystem
data structures around; the kernel patches for actually doing this
new style of online resizing aren't in mainline yet, although they
have been posted to ext2-devel for evaluation.

> A passive-active approach could passively generate a list of inodes from
> dentries as they're accessed; and actively walk the directory tree when
> the disk is idle.  Then a quick allocation check between inodes and
> whatever allocation lists or trees there are could be done.

That doesn't really help, because in order to release the unused disk
blocks, you have to walk every single inode and keep track of the
block allocation bitmaps for the entire filesystem.  If you have a
really big filesystem, it may require hundreds of megabytes of
non-swappable kernel memory.  And if you try to do this in userspace,
it becomes an unholy mess trying to keep the userspace and in-kernel
mounted filesystem data structures in sync.
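
Back-of-the-envelope, counting just the block bitmap (the real fsck
bookkeeping also tracks inode bitmaps, link counts, directory data and
so on, so the actual footprint is larger):

#include <stdio.h>

int main(void)
{
    unsigned long long fs_bytes   = 4ULL << 40;    /* 4 TB filesystem */
    unsigned long long block_size = 4096;          /* 4 KB blocks     */
    unsigned long long nblocks    = fs_bytes / block_size;
    unsigned long long bitmap     = nblocks / 8;   /* one bit each    */

    printf("%llu blocks -> %llu MB just for the block bitmap\n",
           nblocks, bitmap >> 20);                 /* 128 MB here     */
    return 0;
}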

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 21:02     ` Theodore Ts'o
@ 2006-01-22 22:44       ` Kyle Moffett
  2006-01-23  7:24         ` Theodore Ts'o
  2006-01-23  1:02       ` John Richard Moser
  1 sibling, 1 reply; 32+ messages in thread
From: Kyle Moffett @ 2006-01-22 22:44 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: John Richard Moser, linux-kernel

On Jan 22, 2006, at 16:02, Theodore Ts'o wrote:
>> Online resizing is ever safe?  I mean, with on-disk filesystem  
>> layout support I could somewhat believe it for growing; for  
>> shrinking you'd need a way to move files around without damaging  
>> them (possible).  I guess it would be.
>>
>> So how does this work?  Move files -> alter file system superblocks?
>
> The online resizing support in ext3 only grows the filesystems; it  
> doesn't shrink it.  What is currently supported in 2.6 requires you  
> to reserve space in advance.  There is also a slight modification  
> to the ext2/3 filesystem format which is only supported by Linux  
> 2.6 which allows you to grow the filesystem without needing to move  
> filesystem data structures around; the kernel patches for  
> actually doing this new style of online resizing aren't in
> mainline yet, although they have been posted to ext2-devel for
> evaluation.

 From my understanding of HFS+/HFSX, this is actually one of the  
nicer bits of that filesystem architecture.  It stores the data
structures on-disk using extents in such a way that you probably
could hot-resize the disk without significant RAM overhead (both grow  
and shrink) as long as there's enough free space.  Essentially, every  
block on the disk is represented by an allocation block, and all data  
structures refer to allocation block offsets.  The allocation file  
bitmap itself is comprised of allocation blocks and mapped by a set  
of extent descriptors.  The result is that it is possible to fragment  
the allocation file, catalog file, and any other on-disk structures  
(with the sole exception of the 1K boot block and the 512-byte volume  
headers at the very start and end of the volume).
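
The relevant on-disk structures look roughly like this (paraphrased
from memory of Apple's TN1150, so double-check the tech note; fields
are big-endian on disk): every special file, including the allocation
bitmap file itself, is a fork whose first eight extents live in the
volume header, with the rest spilling into the extents overflow B-tree.

#include <stdint.h>

struct HFSPlusExtentDescriptor {
    uint32_t startBlock;     /* first allocation block of the extent */
    uint32_t blockCount;     /* length in allocation blocks          */
};

struct HFSPlusForkData {
    uint64_t logicalSize;
    uint32_t clumpSize;
    uint32_t totalBlocks;
    struct HFSPlusExtentDescriptor extents[8];  /* overflow goes to the
                                                   extents B-tree */
};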

At the moment I'm educating myself on the operation of MFS/HFS/HFS+/ 
HFSX and the linux kernel VFS by writing a completely new combined  
hfsx driver, which I eventually plan to add online-resizing support  
and a variety of other features to.

One question though: Does anyone have any good recent references to  
"How to write a blockdev-based Linux Filesystem" docs?  I've searched  
the various crufty corners of the web, Documentation/, etc, and found  
enough to get started, but (for example), I had a hard time  
determining from the various sources what a struct file_system_type  
was supposed to have in it, and what the available default  
address_space/superblock ops are.
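
For the registration side, the boilerplate circa 2.6.15 looks roughly
like this, pieced together from memory of fs/ext2/super.c (the VFS
interfaces shift from release to release, so treat this as a sketch and
check your kernel's include/linux/fs.h rather than trusting it):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/errno.h>

static int hfsx_fill_super(struct super_block *sb, void *data, int silent)
{
    /* real code would read the volume header, set sb->s_op and
     * sb->s_root here, then return 0 on success (elided) */
    return -ENOSYS;
}

static struct super_block *hfsx_get_sb(struct file_system_type *fs_type,
        int flags, const char *dev_name, void *data)
{
    return get_sb_bdev(fs_type, flags, dev_name, data, hfsx_fill_super);
}

static struct file_system_type hfsx_fs_type = {
    .owner    = THIS_MODULE,
    .name     = "hfsx",
    .get_sb   = hfsx_get_sb,
    .kill_sb  = kill_block_super,
    .fs_flags = FS_REQUIRES_DEV,
};

static int __init hfsx_init(void)
{
    return register_filesystem(&hfsx_fs_type);
}

static void __exit hfsx_exit(void)
{
    unregister_filesystem(&hfsx_fs_type);
}

module_init(hfsx_init);
module_exit(hfsx_exit);
MODULE_LICENSE("GPL");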

Cheers,
Kyle Moffett

--
They _will_ find opposing experts to say it isn't, if you push hard  
enough the wrong way.  Idiots with a PhD aren't hard to buy.
   -- Rob Landley




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 19:26 ` James Courtier-Dutton
@ 2006-01-23  0:06   ` John Richard Moser
  0 siblings, 0 replies; 32+ messages in thread
From: John Richard Moser @ 2006-01-23  0:06 UTC (permalink / raw)
  To: James Courtier-Dutton; +Cc: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



James Courtier-Dutton wrote:
> John Richard Moser wrote:
> 
>>
>> Unfortunately, journaling uses a chunk of space.  Imagine a journal on a
>> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>> Sure it could be done in 8 or 4 or so; or (in one of my file system
>> designs) a static 16KiB block could reference dynamically allocated
>> journal space, allowing the system to sacrifice performance and shrink
>> the journal when more space is needed.  Either way, slow media like
>> floppies will suffer, HARD; and flash devices will see a lot of
>> write/erase all over the journal area, causing wear on that spot.
>>
> 
> My understanding is that if one designed a power supply with enough
> headroom, one could remove the power and still have time to write dirty
> sectors to the USB flash stick. Would this not remove the need for a
> journaling fs on a flash stick?

Depends on how much meta-data you have to write out.  What if you just
altered 6000 files?  Now you have a ton of dentries to destroy and
inodes to invalidate, some FAT entries to free up, etc.  What if the
user just pulled the drive out of the USB port?  Or the USB port is
faulty and lost connection (I've seen it!).

> This would remove the "wear on that
> spot" problem.

Wha?  You mean remove the trigger, not the underlying problem.

> Actually USB flash sticks are a bit clever, in that they
> add an extra layer of translation to the write. I.e. If you write to the
> same sector again and again, the USB flash stick will actually write it
> to a different area of the memory each time. This is specifically done
> to avoid the "wear on that spot" problem.

Yeah, built-in write balancing is nice.

> 
> This "flush on power fail" approach is not so easy with a HD because it
> uses more power and takes longer to flush.

The "flush on power fail" is retarded because it takes extra hardware
and doesn't work if the USB port itself loses connection or if the user
is just dumb enough to pull/knock the drive out.  It won't work with
mini hard disks either, as you say.

"Flush on power fail" is pretty much getting a 10 minute UPS and issuing
'shutdown -h now' when the UPS signals init, which there's already
contingencies for (can also suspend to disk).  It won't help if the PSU
burns out, if the system crashes, if the power cord is pulled, or if the
dog walks around your chair and you turn and bump your foot into the
"power" button on the UPS itself.

> 
> James
> 
> 

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD1B37hDd4aOud5P8RAsCtAJ0TZM4I9T9gE6PMbfUhMux8zrxE9wCff67G
kdlY0fvfJQXmDljz6KekSxc=
=BV+l
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 19:50   ` Diego Calleja
  2006-01-22 20:39     ` Suleiman Souhlal
@ 2006-01-23  1:00     ` John Richard Moser
  2006-01-23  1:09       ` Suleiman Souhlal
  1 sibling, 1 reply; 32+ messages in thread
From: John Richard Moser @ 2006-01-23  1:00 UTC (permalink / raw)
  To: Diego Calleja; +Cc: Theodore Ts'o, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Diego Calleja wrote:
> El Sun, 22 Jan 2006 04:31:44 -0500,
> Theodore Ts'o <tytso@mit.edu> escribió:
> 
> 
> 
>>One major downside with Soft Updates that you haven't mentioned in
>>your note, is that the amount of complexity it adds to the filesystem
>>is tremendous; the filesystem has to keep track of a very complex
>>state machinery, with knowledge of the ordering constraints of
>>each change to the filesystem and how to "back out" parts of the
>>change when that becomes necessary.
> 
> 
> 
> And FreeBSD is implementing journaling for UFS and getting rid of 
> softupdates [1]. While this doesn't prove that softupdates is "a bad idea",
> I think it shows that the added softupdates complexity doesn't seem
> to pay off in the real world. 
> 

Yeah, the huge TB fsck thing became a problem.  I wonder still if it'd
be useful for small vfat file systems (floppies, usb drives); nobody has
led me to believe it's definitely feasible to not corrupt meta-data in
this way.

I guess journaling is looking a lot better. :)

> [1]: http://lists.freebsd.org/pipermail/freebsd-hackers/2004-December/009261.html
> 
> "4.  Journaled filesystem.  While we can debate the merits of speed and
> data integrety of journalling vs. softupdates, the simple fact remains
> that softupdates still requires a fsck run on recovery, and the
> multi-terabyte filesystems that are possible these days make fsck a very
> long and unpleasant experience, even with bg-fsck.  There was work at
> some point at RPI to add journaling to UFS, but there hasn't been much
> status on that in a long time.  There have also been proposals and
> works-in-progress to port JFS, ReiserFS, and XFS.  Some of these efforts
> are still alive, but they need to be seen through to completion"
> 

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD1CqhhDd4aOud5P8RAjvDAJ0W9pcNQ31v0RWSSIGVitnSpfvReQCdHBah
usgY72whnDcCwgshpVFW02o=
=Px/i
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 21:02     ` Theodore Ts'o
  2006-01-22 22:44       ` Kyle Moffett
@ 2006-01-23  1:02       ` John Richard Moser
  1 sibling, 0 replies; 32+ messages in thread
From: John Richard Moser @ 2006-01-23  1:02 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Theodore Ts'o wrote:
> On Sun, Jan 22, 2006 at 01:54:23PM -0500, John Richard Moser wrote:
> 
>>>Whenever you want to extend a filesystem to add some new feature, such
>>>as online resizing, for example, it's not enough to just add that
>>
>>Online resizing is ever safe?  I mean, with on-disk filesystem layout
>>support I could somewhat believe it for growing; for shrinking you'd
>>need a way to move files around without damaging them (possible).  I
>>guess it would be.
>>
>>So how does this work?  Move files -> alter file system superblocks?
> 
> 
> The online resizing support in ext3 only grows the filesystems; it
> doesn't shrink it.  What is currently supported in 2.6 requires you to
> reserve space in advance.  There is also a slight modification to the
> ext2/3 filesystem format which is only supported by Linux 2.6 which
> allows you to grow the filesystem without needing to move filesystem
> data structures around; the kernel patches for actually doing this
> new style of online resizing aren't in mainline yet, although they
> have been posted to ext2-devel for evaluation.
> 
> 
>>A passive-active approach could passively generate a list of inodes from
>>dentries as they're accessed; and actively walk the directory tree when
>>the disk is idle.  Then a quick allocation check between inodes and
>>whatever allocation lists or trees there are could be done.
> 
> 
> That doesn't really help, because in order to release the unused disk
> blocks, you have to walk every single inode and keep track of the
> block allocation bitmaps for the entire filesystem.  If you have a
> really big filesystem, it may require hundreds of megabytes of
> non-swappable kernel memory.  And if you try to do this in userspace,
> it becomes an unholy mess trying to keep the userspace and in-kernel
> mounted filesystem data structures in sync.
> 

Yeah I figured that you couldn't take action until everything was seen;
I can see how you could have problems with all that kernel memory ;)
FUSE driver, anyone?  :>

(I've actually looked into FUSE for the rootfs, via loading a FUSE
driver from an init.d script and then replacing bash with init on the rootfs;
haven't found an ext2 or xfs fuse driver to test with)
> 						- Ted
> 

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD1CskhDd4aOud5P8RArmJAJ9mgLjkxUcg5GW1o4q88Cb6ESmdCACZAS00
M1R+7biZpmOCCCBkEXVQL7w=
=060w
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23  1:00     ` John Richard Moser
@ 2006-01-23  1:09       ` Suleiman Souhlal
  2006-01-23  2:09         ` John Richard Moser
  0 siblings, 1 reply; 32+ messages in thread
From: Suleiman Souhlal @ 2006-01-23  1:09 UTC (permalink / raw)
  To: John Richard Moser; +Cc: Diego Calleja, Theodore Ts'o, linux-kernel

John Richard Moser wrote:
> Yeah, the huge TB fsck thing became a problem.  I wonder still if it'd
> be useful for small vfat file systems (floppies, usb drives); nobody has
> led me to believe it's definitely feasible to not corrupt meta-data in
> this way.

Please note that you don't *HAVE* to run fsck at every reboot. All 
background fsck does is reclaim unused blocks.

-- Suleiman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23  1:09       ` Suleiman Souhlal
@ 2006-01-23  2:09         ` John Richard Moser
  0 siblings, 0 replies; 32+ messages in thread
From: John Richard Moser @ 2006-01-23  2:09 UTC (permalink / raw)
  To: Suleiman Souhlal; +Cc: Diego Calleja, Theodore Ts'o, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Suleiman Souhlal wrote:
> John Richard Moser wrote:
> 
>> Yeah, the huge TB fsck thing became a problem.  I wonder still if it'd
>> be useful for small vfat file systems (floppies, usb drives); nobody has
>> led me to believe it's definitely feasible to not corrupt meta-data in
>> this way.
> 
> 
> Please note that you don't *HAVE* to run fsck at every reboot. All
> background fsck does is reclaim unused blocks.
> 

Duly noted, now can you answer my question?

> -- Suleiman
> 

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD1DryhDd4aOud5P8RAjiwAJ9xH5V/W2i5U/oVzT6AjdmBVk5+iwCfWD2j
JzBRinqiqDd/rIQFkS9QIsQ=
=SlOI
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22  6:42 soft update vs journaling? John Richard Moser
                   ` (2 preceding siblings ...)
  2006-01-22 19:26 ` James Courtier-Dutton
@ 2006-01-23  5:32 ` Michael Loftis
  2006-01-23 18:52   ` John Richard Moser
  3 siblings, 1 reply; 32+ messages in thread
From: Michael Loftis @ 2006-01-23  5:32 UTC (permalink / raw)
  To: John Richard Moser, linux-kernel



--On January 22, 2006 1:42:38 AM -0500 John Richard Moser 
<nigelenki@comcast.net> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> So I've been researching, because I thought this "Soft Update" thing
> that BSD uses was some weird freak-ass way to totally corrupt a file
> system if the power drops.  Seems I was wrong; it's actually just the
> opposite, an alternate solution to journaling.  So let's compare notes.

I hate to say it... but in my experience, this has been exactly the case
with soft updates on FreeBSD 4 up through the 4.11 pre-releases.

Whenever something untoward would happen, the filesystem almost always lost 
files and/or data, usually just files though.  In practice it's never 
really worked too well for me.  It also still requires a full fsck on boot, 
which means long boot times for recovery on large filesystems.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 22:44       ` Kyle Moffett
@ 2006-01-23  7:24         ` Theodore Ts'o
  2006-01-23 13:31           ` Mitchell Blank Jr
                             ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Theodore Ts'o @ 2006-01-23  7:24 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: John Richard Moser, linux-kernel

On Sun, Jan 22, 2006 at 05:44:08PM -0500, Kyle Moffett wrote:
> From my understanding of HFS+/HFSX, this is actually one of the  
> nicer bits of that filesystem architecture.  It stores the data- 
> structures on-disk using extents in such a way that you probably  
> could hot-resize the disk without significant RAM overhead (both grow  
> and shrink) as long as there's enough free space.  

Hot-shrinking a filesystem is certainly possible for any filesystem,
but the problem is how many filesystem data structures you have to
walk in order to find the owners of all of the blocks that you have
to relocate.  That generally isn't a RAM overhead problem, but the
fact that in general, most filesystems don't have an efficient way to
answer the question, "who owns this arbitrary disk block?"  Having
extents means you have a slightly more efficient encoding system, but
it still is the case that you have to check potentially every file in
the filesystem to see if it is the owner of one of the disk blocks
that needs to be moved when you are shrinking the filesystem.

You could of course design a filesystem which maintained a reverse map
data structure, but it would slow the filesystem down since it would
be a separate data structure that would have to be updated each time
you allocated or freed a disk block.  And the only use for such a data
structure would be to make shrinking a filesystem more efficient.
Given that this is generally not a common operation, it seems unlikely
that a filesystem designer would choose to make this particular
tradeoff.

							- Ted


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23  7:24         ` Theodore Ts'o
@ 2006-01-23 13:31           ` Mitchell Blank Jr
  2006-01-23 13:33           ` Kyle Moffett
  2006-01-23 20:48           ` soft update vs journaling? Folkert van Heusden
  2 siblings, 0 replies; 32+ messages in thread
From: Mitchell Blank Jr @ 2006-01-23 13:31 UTC (permalink / raw)
  To: Theodore Ts'o, Kyle Moffett, John Richard Moser, linux-kernel

Theodore Ts'o wrote:
> in general, most filesystems don't have an efficient way to
> answer the question, "who owns this arbitrary disk block?"
[...]
> Given that this is generally not a common operation, it seems unlikely
> that a filesystem designer would choose to make this particular
> tradeoff.

True -- a much more rational approach would be to provide a translation
table for "old block #" to "new block #" -- then when the filesystem sees
a reference to an invalid block number (>= the filesystem size) it can just
translate it to its new location.

You have to be careful if the filesystem is regrown since some of those
block numbers may now be valid again.  It can easily be handled by just
moving the data back to its original block # and removing the mapping.

This doesn't completely remove the extra cost on the block allocator
fastpath: if a block is freed it must make sure to remove any entry pointing
to it from the translation table or you can't handle regrowth properly
(the block could have been reused by a file pointing to the real block # --
you won't know whether to move it back or not).  However, this is probably
a lot cheaper than maintaining a full reverse-map, plus you only have to
take the hit after a shrink has actually happened.
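
As a toy sketch of that lookup path (invented names, illustration only):

#include <stdio.h>

#define NEW_FS_SIZE 1000UL               /* blocks in the shrunk fs */

struct remap { unsigned long old_blk, new_blk; };

static const struct remap table[] = { { 1500, 200 }, { 1501, 201 } };

static unsigned long resolve(unsigned long blk)
{
    if (blk < NEW_FS_SIZE)
        return blk;                       /* still a valid block  */
    for (unsigned long i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].old_blk == blk)
            return table[i].new_blk;      /* translated reference */
    return (unsigned long)-1;             /* corrupt reference    */
}

int main(void)
{
    printf("%lu -> %lu\n", 1500UL, resolve(1500));   /* translated */
    printf("%lu -> %lu\n", 42UL, resolve(42));       /* unchanged  */
    return 0;
}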

-Mitch

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23  7:24         ` Theodore Ts'o
  2006-01-23 13:31           ` Mitchell Blank Jr
@ 2006-01-23 13:33           ` Kyle Moffett
  2006-01-23 13:52             ` Antonio Vargas
  2006-01-23 20:48           ` soft update vs journaling? Folkert van Heusden
  2 siblings, 1 reply; 32+ messages in thread
From: Kyle Moffett @ 2006-01-23 13:33 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: John Richard Moser, linux-kernel

On Jan 23, 2006, at 02:24, Theodore Ts'o wrote:
> Hot-shrinking a filesystem is certainly possible for any  
> filesystem, but the problem is how many filesystem data structures  
> you have to walk in order to find all the owner of all of the  
> blocks that you have to relocate.  That generally isn't a RAM  
> overhead problem, but the fact that in general, most filesystems  
> don't have an efficient way to answer the question, "who owns this  
> arbitrary disk block?"  Having extents means you have a slightly  
> more efficient encoding system, but it still is the case that you  
> have to check potentially every file in the filesystem to see if it  
> is the owner of one of the disk blocks that needs to be moved when  
> you are shrinking the filesystem.

The way that I'm considering implementing this is by intentionally  
fragmenting the allocation bitmap, catalog file, etc, such that each  
1/8 or so of the disk contains its own allocation bitmap describing  
its contents, its own set of files or directories, etc.  The  
allocator would largely try to keep individual btree fragments  
cohesive, such that one of the 1/8th divisions of the disk would only  
have pertinent data for itself.  The idea would be that when trying  
to look up an allocation block, in the common case you need only  
parse a much smaller subsection of the disk structures.

> And the only use for such a [reverse-mapping] data structure would  
> be to make shrinking a filesystem more efficient.

Not entirely true.  I _believe_ you could use such data structures to  
make the allocation algorithm much more robust against fragmentation  
if you record the right kind of information.

Cheers,
Kyle Moffett

--
If you don't believe that a case based on [nothing] could potentially  
drag on in court for _years_, then you have no business playing with  
the legal system at all.
   -- Rob Landley




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23 13:33           ` Kyle Moffett
@ 2006-01-23 13:52             ` Antonio Vargas
  2006-01-23 16:48               ` Linux VFS architecture questions Kyle Moffett
  0 siblings, 1 reply; 32+ messages in thread
From: Antonio Vargas @ 2006-01-23 13:52 UTC (permalink / raw)
  To: Kyle Moffett, Theodore Ts'o, John Richard Moser, linux-kernel

On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> On Jan 23, 2006, at 02:24, Theodore Ts'o wrote:
> > Hot-shrinking a filesystem is certainly possible for any
> > filesystem, but the problem is how many filesystem data structures
> > you have to walk in order to find the owners of all of the
> > blocks that you have to relocate.  That generally isn't a RAM
> > overhead problem, but the fact that in general, most filesystems
> > don't have an efficient way to answer the question, "who owns this
> > arbitrary disk block?"  Having extents means you have a slightly
> > more efficient encoding system, but it still is the case that you
> > have to check potentially every file in the filesystem to see if it
> > is the owner of one of the disk blocks that needs to be moved when
> > you are shrinking the filesystem.
>
> The way that I'm considering implementing this is by intentionally
> fragmenting the allocation bitmap, catalog file, etc, such that each
> 1/8 or so of the disk contains its own allocation bitmap describing
> its contents, its own set of files or directories, etc.  The
> allocator would largely try to keep individual btree fragments
> cohesive, such that one of the 1/8th divisions of the disk would only
> have pertinent data for itself.  The idea would be that when trying
> to look up an allocation block, in the common case you need only
> parse a much smaller subsection of the disk structures.

this sounds exactly the same as ext2/ext3 allocation groups :)
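
for reference, each ext2 block group has a descriptor that carries
roughly this (quoting from memory of include/linux/ext2_fs.h, shown
with userspace types; on disk the fields are little-endian):

#include <stdint.h>

struct ext2_group_desc {
    uint32_t bg_block_bitmap;        /* block # of the block bitmap  */
    uint32_t bg_inode_bitmap;        /* block # of the inode bitmap  */
    uint32_t bg_inode_table;         /* first block of inode table   */
    uint16_t bg_free_blocks_count;
    uint16_t bg_free_inodes_count;
    uint16_t bg_used_dirs_count;
    uint16_t bg_pad;
    uint32_t bg_reserved[3];
};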

> > And the only use for such a [reverse-mapping] data structure would
> > be to make shrinking a filesystem more efficient.
>
> Not entirely true.  I _believe_ you could use such data structures to
> make the allocation algorithm much more robust against fragmentation
> if you record the right kind of information.
>
> Cheers,
> Kyle Moffett
>
> --
> If you don't believe that a case based on [nothing] could potentially
> drag on in court for _years_, then you have no business playing with
> the legal system at all.
>    -- Rob Landley
>

--
Greetz, Antonio Vargas aka winden of network

http://wind.codepixel.com/
windNOenSPAMntw@gmail.com
thesameasabove@amigascne.org

Every day, every year
you have to work
you have to study
you have to scene.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Linux VFS architecture questions
  2006-01-23 13:52             ` Antonio Vargas
@ 2006-01-23 16:48               ` Kyle Moffett
  2006-01-23 17:00                 ` Pekka Enberg
  0 siblings, 1 reply; 32+ messages in thread
From: Kyle Moffett @ 2006-01-23 16:48 UTC (permalink / raw)
  To: Antonio Vargas; +Cc: Theodore Ts'o, John Richard Moser, linux-kernel

On Jan 23, 2006, at 08:52:51, Antonio Vargas wrote:
> On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote:
>> The way that I'm considering implementing this is by intentionally  
>> fragmenting the allocation bitmap, catalog file, etc, such that  
>> each 1/8 or so of the disk contains its own allocation bitmap  
>> describing its contents, its own set of files or directories,  
>> etc.  The allocator would largely try to keep individual btree  
>> fragments cohesive, such that one of the 1/8th divisions of the  
>> disk would only have pertinent data for itself.  The idea would be  
>> that when trying to look up an allocation block, in the common  
>> case you need only parse a much smaller subsection of the disk  
>> structures.
>
> this sounds exactly the same as ext2/ext3 allocation groups :)

Great!  I'm trying to learn about filesystem design and  
implementation, which is why I started writing my own hfsplus  
filesystem (otherwise I would have just used the in-kernel one).  Do  
you have any recommended reading (either online or otherwise) for  
someone trying to understand the kernel's VFS and blockdev  
interfaces?  I _think_ I understand the basics of buffer_head,  
super_block, and have some idea of how to use aops, but it's tough  
going trying to find out what functions to call to manage cached disk  
blocks, or under what conditions the various VFS functions are  
called.  I'm trying to write up a "Linux Disk-Based Filesystem  
Developers Guide" based on what I learn, but it's remarkably sparse  
so far.

One big question I have:  HFS/HFS+ have an "extents overflow" btree  
that contains extents beyond the first 3 (for HFS) or 8 (for HFS+).   
I would like to speculatively cache parts of that btree when the  
files are accessed, but not if memory is short, and I would like to  
allow the filesystem to free up parts of the btree under the same  
circumstances.  I have a preliminary understanding of how to trigger  
the filesystem to read various blocks of metadata (using  
buffer_heads) or file data for programs (by returning a block number  
from the appropriate aops function), but how do I allocate data  
structures as "easily reclaimable" and indicate to the kernel that it  
can ask me to reclaim that memory?
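
My current (quite possibly wrong) reading of include/linux/mm.h is that
set_shrinker() plus a kmem_cache created with SLAB_RECLAIM_ACCOUNT is
the mechanism for this; a minimal sketch of what I have in mind, with
made-up hfsplus_ext_* names:

#include <linux/mm.h>
#include <linux/module.h>

static struct shrinker *hfsplus_ext_shrinker;
static int cached_extent_nodes;   /* btree nodes currently held in RAM */

/*
 * Called by the VM under memory pressure.  By convention nr_to_scan == 0
 * means "just report how many objects you hold"; otherwise free up to
 * nr_to_scan of them and return how many remain.
 */
static int hfsplus_ext_shrink(int nr_to_scan, gfp_t gfp_mask)
{
        if (nr_to_scan) {
                /* drop up to nr_to_scan least-recently-used cached
                 * extent-tree nodes here (details omitted) */
        }
        return cached_extent_nodes;
}

static int __init hfsplus_ext_cache_init(void)
{
        hfsplus_ext_shrinker = set_shrinker(DEFAULT_SEEKS,
                                            hfsplus_ext_shrink);
        return hfsplus_ext_shrinker ? 0 : -ENOMEM;
}

static void __exit hfsplus_ext_cache_exit(void)
{
        remove_shrinker(hfsplus_ext_shrinker);
}

module_init(hfsplus_ext_cache_init);
module_exit(hfsplus_ext_cache_exit);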

Thanks for the help!

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux VFS architecture questions
  2006-01-23 16:48               ` Linux VFS architecture questions Kyle Moffett
@ 2006-01-23 17:00                 ` Pekka Enberg
  2006-01-23 17:50                   ` Kyle Moffett
  0 siblings, 1 reply; 32+ messages in thread
From: Pekka Enberg @ 2006-01-23 17:00 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Antonio Vargas, Theodore Ts'o, John Richard Moser, linux-kernel

Hi Kyle,

On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> Great!  I'm trying to learn about filesystem design and
> implementation, which is why I started writing my own hfsplus
> filesystem (otherwise I would have just used the in-kernel one).  Do
> you have any recommended reading (either online or otherwise) for
> someone trying to understand the kernel's VFS and blockdev
> interfaces?  I _think_ I understand the basics of buffer_head,
> super_block, and have some idea of how to use aops, but it's tough
> going trying to find out what functions to call to manage cached disk
> blocks, or under what conditions the various VFS functions are
> called.  I'm trying to write up a "Linux Disk-Based Filesystem
> Developers Guide" based on what I learn, but it's remarkably sparse
> so far.

Did you read Documentation/filesystems/vfs.txt? Also, books Linux
Kernel Development and Understanding the Linux Kernel have fairly good
information on VFS (and related) stuff.

                      Pekka

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux VFS architecture questions
  2006-01-23 17:00                 ` Pekka Enberg
@ 2006-01-23 17:50                   ` Kyle Moffett
  2006-01-23 17:54                     ` Randy.Dunlap
  0 siblings, 1 reply; 32+ messages in thread
From: Kyle Moffett @ 2006-01-23 17:50 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Antonio Vargas, Theodore Ts'o, John Richard Moser, linux-kernel

On Jan 23, 2006, at 12:00, Pekka Enberg wrote:
> Hi Kyle,
>
> On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote:
>> Great!  I'm trying to learn about filesystem design and  
>> implementation, which is why I started writing my own hfsplus  
>> filesystem (otherwise I would have just used the in-kernel one).   
>> Do you have any recommended reading (either online or otherwise)  
>> for someone trying to understand the kernel's VFS and blockdev  
>> interfaces?  I _think_ I understand the basics of buffer_head,  
>> super_block, and have some idea of how to use aops, but it's tough  
>> going trying to find out what functions to call to manage cached  
>> disk blocks, or under what conditions the various VFS  functions  
>> are called.  I'm trying to write up a "Linux Disk-Based Filesystem  
>> Developers Guide" based on what I learn, but it's remarkably  
>> sparse so far.
>
> Did you read Documentation/filesystems/vfs.txt?

Yeah, that was the first thing I looked at.  Once I've got things  
figured out, I'll probably submit a fairly hefty patch to that file  
to add additional documentation.

> Also, books Linux Kernel Development and Understanding the Linux  
> Kernel have fairly good information on VFS (and related) stuff.

Ah, thanks again!  It looks like both of those are available through  
my university's Safari/ProQuest subscription  
(http://safari.oreilly.com/), so I'll take a look right away!

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw  
knives at people who weren't supposed to be in your machine room.
   -- Anthony de Boer



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux VFS architecture questions
  2006-01-23 17:50                   ` Kyle Moffett
@ 2006-01-23 17:54                     ` Randy.Dunlap
  0 siblings, 0 replies; 32+ messages in thread
From: Randy.Dunlap @ 2006-01-23 17:54 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Pekka Enberg, Antonio Vargas, Theodore Ts'o,
	John Richard Moser, linux-kernel

On Mon, 23 Jan 2006, Kyle Moffett wrote:

> On Jan 23, 2006, at 12:00, Pekka Enberg wrote:
> > Hi Kyle,
> >
> > On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> >> Great!  I'm trying to learn about filesystem design and
> >> implementation, which is why I started writing my own hfsplus
> >> filesystem (otherwise I would have just used the in-kernel one).
> >> Do you have any recommended reading (either online or otherwise)
> >> for someone trying to understand the kernel's VFS and blockdev
> >> interfaces?  I _think_ I understand the basics of buffer_head,
> >> super_block, and have some idea of how to use aops, but it's tough
> >> going trying to find out what functions to call to manage cached
> >> disk blocks, or under what conditions the various VFS  functions
> >> are called.  I'm trying to write up a "Linux Disk-Based Filesystem
> >> Developers Guide" based on what I learn, but it's remarkably
> >> sparse so far.
> >
> > Did you read Documentation/filesystems/vfs.txt?
>
> Yeah, that was the first thing I looked at.  Once I've got things
> figured out, I'll probably submit a fairly hefty patch to that file
> to add additional documentation.
>
> > Also, books Linux Kernel Development and Understanding the Linux
> > Kernel have fairly good information on VFS (and related) stuff.
>
> Ah, thanks again!  It looks like both of those are available through
> my university's Safari/ProQuest subscription (http://
> safari.oreilly.com/), so I'll take a look right away!

This web page is terribly out of date, but you might find
a few helpful links on it (near the bottom):
  http://www.xenotime.net/linux/linux-fs.html

-- 
~Randy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23  5:32 ` Michael Loftis
@ 2006-01-23 18:52   ` John Richard Moser
  2006-01-23 19:32     ` Matthias Andree
  0 siblings, 1 reply; 32+ messages in thread
From: John Richard Moser @ 2006-01-23 18:52 UTC (permalink / raw)
  To: Michael Loftis; +Cc: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Michael Loftis wrote:
> 
> 
> --On January 22, 2006 1:42:38 AM -0500 John Richard Moser
> <nigelenki@comcast.net> wrote:
> 
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> So I've been researching, because I thought this "Soft Update" thing
>> that BSD uses was some weird freak-ass way to totally corrupt a file
>> system if the power drops.  Seems I was wrong; it's actually just the
>> opposite, an alternate solution to journaling.  So let's compare notes.
> 
> 
> I hate to say it...but in my experience, this has been exactly the case
> with soft updates and FreeBSD 4 up to 4.11 pre releases.
> 
> Whenever something untoward would happen, the filesystem almost always
> lost files and/or data, usually just files though.  In practice it's
> never really worked too well for me.  It also still requires a full fsck
> on boot, which means long boot times for recovery on large filesystems.

You lost files in use, or random files?

Soft Update was designed to assure file system consistency.  In typical
usage, when you drop power on something like FAT, you create a 'hole' in
the filesystem.  This hole could be files pointing to allocated blocks
that belong to other files, crossed dentries, etc.  As you keep using
the file system, it simply accepts the information it finds, because
nothing looks wrong until you examine EVERYTHING.  The effect is that
every subsequent operation grinds away at the file system through this
newly created hole; you just cause more and more damage until the
system gives out.  The system makes allocations and decisions based on
faulty data and really, really screws things up.

The idea of Soft Update was to make sure that, while you may lose
something, when you come back up the FS is in a safely usable state.
The fsck only colors in a view of the FS and frees up blocks that
aren't referenced by any file, an annoying but mostly harmless side
effect of losing power in this scheme.
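
Conceptually the cleanup looks something like this (a toy userspace
sketch, not actual FFS fsck code):

#include <stdint.h>

#define NR_BLOCKS   4096
#define NR_INODES   512
#define MAX_EXTENTS 8

struct inode_rec {
        int in_use;
        uint32_t block[MAX_EXTENTS];    /* 0 = unused slot */
};

static unsigned char alloc_map[NR_BLOCKS];   /* on-disk allocation bitmap */
static struct inode_rec itable[NR_INODES];

static unsigned reclaim_leaked_blocks(void)
{
        unsigned char reachable[NR_BLOCKS] = { 0 };
        unsigned freed = 0;

        /* pass 1: colour in every block a live inode still points at */
        for (unsigned i = 0; i < NR_INODES; i++) {
                if (!itable[i].in_use)
                        continue;
                for (unsigned e = 0; e < MAX_EXTENTS; e++)
                        if (itable[i].block[e])
                                reachable[itable[i].block[e]] = 1;
        }

        /* pass 2: anything allocated but unreachable is a harmless
         * leak, so just hand it back to the allocator */
        for (unsigned b = 0; b < NR_BLOCKS; b++) {
                if (alloc_map[b] && !reachable[b]) {
                        alloc_map[b] = 0;
                        freed++;
                }
        }
        return freed;
}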

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD1SXXhDd4aOud5P8RAj9PAJ9G5CF6gfPx470/Ak+OlaKogZhMSwCeKORg
Q7AZegZunZ3S2hTSNVnXFlc=
=7Rme
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23 18:52   ` John Richard Moser
@ 2006-01-23 19:32     ` Matthias Andree
  0 siblings, 0 replies; 32+ messages in thread
From: Matthias Andree @ 2006-01-23 19:32 UTC (permalink / raw)
  To: John Richard Moser; +Cc: Michael Loftis, linux-kernel

On Mon, 23 Jan 2006, John Richard Moser wrote:

> The idea of Soft Update was to make sure that while you may lose
> something, when you come back up the FS is in a safely usable state.

Soft Updates are *extremely* sensitive to reordered writes, and their
scattered metadata writes are more likely to be reordered than a stream
to a linear journal is.  Don't even THINK of using softupdates without
enforcing write order.  ext3fs, particularly with data=ordered or
data=journal, is much more forgiving in my experience.  Not that I'd
endorse dangerous use of a file system, but the average user just
doesn't know.
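
To illustrate the ordering point in userspace terms (illustration only,
not kernel or softupdates code, and note that a lying write cache with
WCE=1 can still defeat fsync() on some setups):

#include <unistd.h>

static int write_then_commit(int data_fd, int log_fd,
                             const char *buf, size_t len)
{
        if (write(data_fd, buf, len) != (ssize_t)len)
                return -1;
        if (fsync(data_fd) < 0)         /* barrier: data must be stable... */
                return -1;

        const char commit[] = "COMMIT\n";  /* ...before the record saying so */
        if (write(log_fd, commit, sizeof(commit) - 1)
            != (ssize_t)(sizeof(commit) - 1))
                return -1;
        return fsync(log_fd);
}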

FreeBSD (stable@ Cc:d) appears to have no notion of write barriers yet.
Wedging the SCSI bus in the middle of a write sequence caused major
devastation with WCE=1 and took me two runs of fsck to repair
(unfortunately I needed the (test) machine back up at once, so there was
no time to snapshot the b0rked partition for later scrutiny), and I
found myself with two hundred files relocated to the lost+found
office^Wdirectory.

Of course, it's in the "Doctor, doctor, it always hurts my right eye
when I drink coffee" -- "well, take the spoon out of your mug before
drinking" (don't do that) category of "bug", but it has practical
relevance...

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-23  7:24         ` Theodore Ts'o
  2006-01-23 13:31           ` Mitchell Blank Jr
  2006-01-23 13:33           ` Kyle Moffett
@ 2006-01-23 20:48           ` Folkert van Heusden
  2 siblings, 0 replies; 32+ messages in thread
From: Folkert van Heusden @ 2006-01-23 20:48 UTC (permalink / raw)
  To: Theodore Ts'o, Kyle Moffett, John Richard Moser, linux-kernel

> You could of course design a filesystem which maintained a reverse map
> data structure, but it would slow the filesystem down since it would
> be a separate data structure that would have to be updated each time
> you allocated or freed a disk block.  And the only use for such a data
> structure would be to make shrinking a filesystem more efficient.
> Given that this is generally not a common operation, it seems unlikely
> that a filesystem designer would choose to make this particular
> tradeoff.

Or you could ship it switched off by default.  E.g. reserve the space
for it and activate it as soon as some magic switch is set in the
kernel.  Then some background process would update it while also
keeping track of current changes.  Then, when everything is finished,
update some flag to let the resizer know it can do its job.
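
Roughly like this (a toy userspace sketch; all names invented):

#include <stdint.h>

#define NR_BLOCKS 4096

static uint64_t rmap_owner[NR_BLOCKS];  /* reserved space, 0 = unknown/free */
static int rmap_valid;                  /* the flag the resizer checks */

/* hook called from the normal allocate/free paths so the map cannot
 * go stale while the background pass below is still running */
void rmap_note_change(uint32_t block, uint64_t owner_inode)
{
        rmap_owner[block] = owner_inode;
}

/* background pass: fill in whatever is still unknown; lookup_owner()
 * stands in for the usual walk of the forward (file -> blocks) maps */
void rmap_background_build(uint64_t (*lookup_owner)(uint32_t block))
{
        for (uint32_t b = 0; b < NR_BLOCKS; b++)
                if (!rmap_owner[b])
                        rmap_owner[b] = lookup_owner(b);
        rmap_valid = 1;   /* shrinking may now use it instead of a full scan */
}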


Folkert van Heusden

-- 
www.vanheusden.com/recoverdm/ - got an unreadable cd with scratches?
                            recoverdm might help you recovering data
--------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: soft update vs journaling?
  2006-01-22 19:08     ` Arjan van de Ven
  2006-01-22 19:25       ` Adrian Bunk
@ 2006-01-24  2:33       ` Jörn Engel
  1 sibling, 0 replies; 32+ messages in thread
From: Jörn Engel @ 2006-01-24  2:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Adrian Bunk, Jan Engelhardt, John Richard Moser, linux-kernel

On Sun, 22 January 2006 20:08:17 +0100, Arjan van de Ven wrote:
> 
> it stands for "logging" since jffs2 at least is NOT a journalling
> filesystem.... but a logging one. I assume jffs is too.

s/logging/log-structured/

People could (and did) argue that jffs[|2] is a journalling
filesystem consisting of a journal and _no_ regular storage.  Which is
quite sane.  Having a live-fast, die-young journal confined to a small
portion of the device would kill it quickly, no doubt.
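
Back-of-the-envelope, with entirely made-up numbers, just to show the
scale of the difference:

#include <stdio.h>

int main(void)
{
        const double device_mb        = 1024.0;   /* hypothetical flash device */
        const double journal_mb       = 32.0;     /* confined journal region   */
        const double erase_cycles     = 100000;   /* rated cycles per block    */
        const double writes_mb_per_day = 512.0;

        /* crude model: total writable bytes before wear-out is
         * region size times rated cycles, ignoring write amplification */
        double days_confined = erase_cycles * journal_mb / writes_mb_per_day;
        double days_spread   = erase_cycles * device_mb  / writes_mb_per_day;

        printf("confined journal wears out in ~%.0f days\n", days_confined);
        printf("device-wide log wears out in ~%.0f days\n", days_spread);
        return 0;
}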

Jörn

-- 
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2006-01-24  2:37 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-22  6:42 soft update vs journaling? John Richard Moser
2006-01-22  8:51 ` Jan Engelhardt
2006-01-22 18:40   ` John Richard Moser
2006-01-22 19:05   ` Adrian Bunk
2006-01-22 19:08     ` Arjan van de Ven
2006-01-22 19:25       ` Adrian Bunk
2006-01-24  2:33       ` Jörn Engel
2006-01-22  9:31 ` Theodore Ts'o
2006-01-22 18:54   ` John Richard Moser
2006-01-22 21:02     ` Theodore Ts'o
2006-01-22 22:44       ` Kyle Moffett
2006-01-23  7:24         ` Theodore Ts'o
2006-01-23 13:31           ` Mitchell Blank Jr
2006-01-23 13:33           ` Kyle Moffett
2006-01-23 13:52             ` Antonio Vargas
2006-01-23 16:48               ` Linux VFS architecture questions Kyle Moffett
2006-01-23 17:00                 ` Pekka Enberg
2006-01-23 17:50                   ` Kyle Moffett
2006-01-23 17:54                     ` Randy.Dunlap
2006-01-23 20:48           ` soft update vs journaling? Folkert van Heusden
2006-01-23  1:02       ` John Richard Moser
2006-01-22 19:50   ` Diego Calleja
2006-01-22 20:39     ` Suleiman Souhlal
2006-01-22 20:50       ` Diego Calleja
2006-01-23  1:00     ` John Richard Moser
2006-01-23  1:09       ` Suleiman Souhlal
2006-01-23  2:09         ` John Richard Moser
2006-01-22 19:26 ` James Courtier-Dutton
2006-01-23  0:06   ` John Richard Moser
2006-01-23  5:32 ` Michael Loftis
2006-01-23 18:52   ` John Richard Moser
2006-01-23 19:32     ` Matthias Andree

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).