* soft update vs journaling? @ 2006-01-22  6:42 UTC (32+ messages in thread)
From: John Richard Moser
To: linux-kernel

So I've been researching, because I thought this "Soft Update" thing that BSD uses was some weird freak-ass way to totally corrupt a file system if the power drops. Seems I was wrong; it's actually just the opposite: an alternate solution to journaling.

So let's compare notes. I'm not quite clear on the benefits versus costs of Soft Update versus journaling, so I'll run down what I've got, anyone who wants to give input can run down what they've got, and we can compare. Maybe someone will write a Soft Update system into Linux one day, far, far into the future; but I doubt it. It might, however, be interesting to compare ext2 + SU to ext3; and getting the chance to solve problems such as delayed delete (i.e. the file system fills up while Soft Update has not yet executed a delete; try reacting by looking for a pending delete to execute immediately) might also be cool.

Soft Update appears to buffer and order meta-data writes in a dependency scheme that makes certain inconsistencies can't happen. Apparently this means writing out directory entries before inodes, or something to that effect. I can't see how this would help in the middle of a buffer flush (half a dentry written? Partially deleted inode? Inode "deleted" but not freed on disk?), so maybe someone can fill me in.

Journaling means writing meta-data out to a log before transferring it to the file system proper. No matter what happens, a proper journal (for fun I've designed a transaction log format for low-level filesystems; it's entirely possible to make an interruption at any bit recoverable) can always be checked over and either rolled back or rolled forward. This is easy to design.
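The roll-forward/roll-back behavior described above can be sketched as a toy model. This is a hedged illustration only: the class and record layout are invented for this sketch and are not any real filesystem's on-disk format. The key property it demonstrates is that a commit marker, written after the records it covers, lets recovery replay complete transactions and discard an interrupted tail.

```python
# Toy write-ahead metadata journal (invented structure, for illustration).
# Changes are appended to a log, followed by a commit marker; only after
# the marker is on "disk" may the changes be applied to the real blocks.
# On recovery, committed transactions are rolled forward and any
# uncommitted tail is simply discarded.

class Journal:
    def __init__(self):
        self.log = []    # sequence of ("write", block, value) / ("commit",)
        self.disk = {}   # block number -> metadata contents

    def begin_txn(self, writes, commit=True):
        """Log a transaction; commit=False simulates a crash mid-commit."""
        for block, value in writes:
            self.log.append(("write", block, value))
        if commit:
            self.log.append(("commit",))   # marker is logged last

    def replay(self):
        """Recovery: roll committed transactions forward."""
        pending = []
        for entry in self.log:
            if entry[0] == "write":
                pending.append(entry)
            else:  # commit marker seen: pending writes are safe to apply
                for _, block, value in pending:
                    self.disk[block] = value
                pending = []
        # whatever is left in `pending` never committed -> dropped
        return self.disk
```

Replaying after an interrupted transaction leaves the "disk" with only the fully committed changes, which is why a journaled filesystem mounts with a short replay instead of a full check.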
Soft Update appears to have the advantage of not needing multiple writes. There's no need for journal flushing and then disk flushing; you just flush the meta-data. Also, soft update systems mount instantly, because there's no journal to play back and the file system is always consistent. It may be technically feasible to implement soft update on any old file system; I'm unclear on exactly how to make soft update work in general, so I can't say whether this is absolutely possible (think of vfat, consistent at all times and still Win32-compatible; great for flash drives). Unfortunately, soft update can leave awkward situations where areas of disk remain allocated after a system failure during an inode delete. This won't cause inconsistencies in the on-disk structure, however; you can freely use the disk without causing further damage. The system just has to sanity-check things while running and clean up such damage as it sees it.

Journaling appears to have the advantage that the data gets to disk faster. It also seems an easier concept to grasp (i.e. I understand it fully). It's old, tried, trusted, and durable. You also don't have to worry about odd meta-data writes leaving deleted files around in certain circumstances, eating up space.

Unfortunately, journaling uses a chunk of space. Imagine a journal on a 128M USB flash stick; a typical ReiserFS journal is 32 megabytes! Sure, it could be done in 8 or 4 or so; or (in one of my file system designs) a static 16KiB block could reference dynamically allocated journal space, allowing the system to sacrifice performance and shrink the journal when more space is needed. Either way, slow media like floppies will suffer, HARD; and flash devices will see a lot of write/erase activity over the journal area, causing wear on that spot.

So, that's my understanding. Any comments? Enlighten me.

--
All content of all messages exchanged herein are left in the Public Domain, unless otherwise explicitly stated.
Creative brains are a valuable, limited resource. They shouldn't be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there. -- Eric Steven Raymond
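The dependency scheme the opening message describes — meta-data writes held back until the writes they depend on have reached disk — can be sketched as an ordering problem. This is a loose, invented illustration, not FFS's actual soft-update machinery; the point is only that "write the inode before the dentry that references it" is a topological-sort constraint on the flush order.

```python
# Loose sketch of soft-update write ordering (invented names).  Each
# pending metadata write lists the writes that must reach disk first;
# the flusher issues a write only once its dependencies are on disk.

def flush_order(pending):
    """pending: {write_name: set of write_names that must hit disk first}.
    Returns an issue order respecting every dependency (topological sort)."""
    order, done = [], set()
    while len(done) < len(pending):
        # a write is ready once everything it depends on is already done
        ready = [n for n, deps in pending.items()
                 if n not in done and deps <= done]
        if not ready:
            raise ValueError("dependency cycle; real soft updates roll back")
        for n in sorted(ready):   # sorted() just makes output deterministic
            order.append(n)
            done.add(n)
    return order
```

For example, a new file's dentry depends on its inode being initialized on disk, so the inode write is issued first; after a crash, the worst case is an initialized inode with no dentry (leaked space), never a dentry pointing at garbage.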
* Re: soft update vs journaling? @ 2006-01-22  8:51 UTC
From: Jan Engelhardt
To: John Richard Moser; +Cc: linux-kernel

>Unfortunately, journaling uses a chunk of space. Imagine a journal on a
>128M USB flash stick; a typical ReiserFS journal is 32 megabytes!
>Sure, it could be done in 8 or 4 or so; or (in one of my file system
>designs) a static 16KiB block could reference dynamically allocated
>journal space, allowing the system to sacrifice performance and shrink
>the journal when more space is needed. Either way, slow media like
>floppies will suffer, HARD; and flash devices will see a lot of
>write/erase activity over the journal area, causing wear on that spot.

- Smallest reiserfs3 journal size is 513 blocks -- some 2 megabytes, which would be OK with me for a 128 MB drive. Most of the time you need vfat anyway for your flash stick to make useful use of it on Windows.

- reiser4's journal is even smaller than reiser3's on a fresh filesystem; same goes for jfs and xfs (below 1 megabyte, IIRC).

- I would not use a journalling filesystem at all on media that degrades faster than hard disks (flash drives, CD-RWs/DVD-RWs/DVD-RAMs). There are specially-crafted filesystems for that, mostly jffs and udf.

- You really need a hell of a power fluctuation to get a disk crippled. Just powering off (and potentially on again after a few milliseconds) did, in my cases, just stop a disk write wherever it happened to be, and that seemed easily correctable.

Jan Engelhardt
* Re: soft update vs journaling? @ 2006-01-22 18:40 UTC
From: John Richard Moser
To: Jan Engelhardt; +Cc: linux-kernel

Jan Engelhardt wrote:
>>Unfortunately, journaling uses a chunk of space. Imagine a journal on a
>>128M USB flash stick; a typical ReiserFS journal is 32 megabytes!
>>...
>
> - Smallest reiserfs3 journal size is 513 blocks -- some 2 megabytes,
>   which would be OK with me for a 128 MB drive.
>   Most of the time you need vfat anyway for your flash stick to make
>   useful use of it on Windows.
>
> - reiser4's journal is even smaller than reiser3's on a fresh
>   filesystem; same goes for jfs and xfs (below 1 megabyte, IIRC).

Nice, but that does not solve the problem. . .

> - I would not use a journalling filesystem at all on media that degrades
>   faster than hard disks (flash drives, CD-RWs/DVD-RWs/DVD-RAMs).
>   There are specially-crafted filesystems for that, mostly jffs and udf.

Yes. They'll degrade very, very fast. This is where Soft Update would have an advantage. Another issue here is that we can't just slap a journal onto vfat, for all those flash devices that we want to share with Windows.

> - You really need a hell of a power fluctuation to get a disk crippled.
>   Just powering off (and potentially on again after a few milliseconds) did,
>   in my cases, just stop a disk write wherever it happened to be,
>   and that seemed easily correctable.
Yeah, I never said you could cripple a disk with power problems. You COULD destroy the NAND in a flash device by nuking the same area with billions of writes, though.
* Re: soft update vs journaling? @ 2006-01-22 19:05 UTC
From: Adrian Bunk
To: Jan Engelhardt; +Cc: John Richard Moser, linux-kernel

On Sun, Jan 22, 2006 at 09:51:10AM +0100, Jan Engelhardt wrote:
>...
> - I would not use a journalling filesystem at all on media that degrades
>   faster than hard disks (flash drives, CD-RWs/DVD-RWs/DVD-RAMs).
>   There are specially-crafted filesystems for that, mostly jffs and udf.
>...

[ ] you know what the "j" in "jffs" stands for

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
* Re: soft update vs journaling? @ 2006-01-22 19:08 UTC
From: Arjan van de Ven
To: Adrian Bunk; +Cc: Jan Engelhardt, John Richard Moser, linux-kernel

On Sun, 2006-01-22 at 20:05 +0100, Adrian Bunk wrote:
>...
> [ ] you know what the "j" in "jffs" stands for

It stands for "logging", since jffs2 at least is NOT a journalling filesystem... but a logging one. I assume jffs is too.
* Re: soft update vs journaling? @ 2006-01-22 19:25 UTC
From: Adrian Bunk
To: Arjan van de Ven; +Cc: Jan Engelhardt, John Richard Moser, linux-kernel

On Sun, Jan 22, 2006 at 08:08:17PM +0100, Arjan van de Ven wrote:
>...
> It stands for "logging", since jffs2 at least is NOT a journalling
> filesystem... but a logging one. I assume jffs is too.

Ah, sorry. It seems I confused this with Reiser4 and its wandering logs.

cu
Adrian
* Re: soft update vs journaling? @ 2006-01-24  2:33 UTC
From: Jörn Engel
To: Arjan van de Ven; Cc: Adrian Bunk, Jan Engelhardt, John Richard Moser, linux-kernel

On Sun, 22 January 2006 20:08:17 +0100, Arjan van de Ven wrote:
>
> It stands for "logging", since jffs2 at least is NOT a journalling
> filesystem... but a logging one. I assume jffs is too.

s/logging/log-structured/

People could (and did) argue that jffs[|2] is a journalling filesystem consisting of a journal and _no_ regular storage. Which is quite sane. Having a live-fast, die-young journal confined to a small portion of the device would kill it quickly, no doubt.

Jörn

--
The exciting thing about writing is creating an order where none existed before. -- Doris Lessing
* Re: soft update vs journaling? @ 2006-01-22  9:31 UTC
From: Theodore Ts'o
To: John Richard Moser; +Cc: linux-kernel

On Sun, Jan 22, 2006 at 01:42:38AM -0500, John Richard Moser wrote:
> Soft Update appears to have the advantage of not needing multiple
> writes. There's no need for journal flushing and then disk flushing;
> you just flush the meta-data.

Not quite true; there are cases where Soft Updates will have to do multiple writes, when a particular block containing meta-data has multiple changes in it that have to be committed to the filesystem at different times in order to maintain consistency; this is particularly true when a block is part of the inode table, for example. When this happens, the soft update machinery has to allocate memory for a copy of the block and then undo those changes to it which come from transactions that are not yet ready to be written to disk.

In general, though, it is true that Soft Updates can result in fewer disk writes compared to filesystems that utilize traditional journaling approaches, and this might even be noticeable if your workload is heavily skewed towards metadata updates. (This is mainly true in benchmarks that are horrendously disconnected from the real world, such as dbench.)

One major downside of Soft Updates that you haven't mentioned in your note is the tremendous amount of complexity it adds to the filesystem: the filesystem has to keep track of a very complex state machine, with knowledge of the ordering constraints of each change to the filesystem and of how to "back out" parts of a change when that becomes necessary.
Whenever you want to extend a filesystem with some new feature, such as online resizing, it's not enough to just add that feature; you also have to modify the black magic which is the Soft Updates machinery. This significantly increases the difficulty of adding new features to a filesystem, and can act as a roadblock for people wanting to add them. While I can't conclusively blame the lack of online resizing in BSD UFS on Soft Updates, it is clear that adding this and other features is much more difficult when you are dealing with soft update code.

> Also, soft update systems mount instantly, because there's no
> journal to play back, and the file system is always consistent.

This is only true if you don't care about recovering lost data blocks. Fixing this requires running the equivalent of fsck on the filesystem, and if you do, there is a major difference in performance. Even if you can do the fsck scan on-line, it will greatly slow down normal operations while recovering from a system crash; the slowdown associated with replaying a journal is far smaller in comparison.

> Unfortunately, journaling uses a chunk of space. Imagine a journal on a
> 128M USB flash stick; a typical ReiserFS journal is 32 megabytes!
> Sure, it could be done in 8 or 4 or so; or (in one of my file system
> designs) a static 16KiB block could reference dynamically allocated
> journal space, allowing the system to sacrifice performance and shrink
> the journal when more space is needed. Either way, slow media like
> floppies will suffer, HARD; and flash devices will see a lot of
> write/erase activity over the journal area, causing wear on that spot.

If you are using flash, use a filesystem which is optimized for flash, such as JFFS2.
Otherwise, note that in most cases disk space is nearly free, so allocating even 128 megs for the journal is chump change when you're talking about a 200GB or larger hard drive.

Also note that if you have to use slow media, one of the things you can do is use a separate (fast) device for your journal; there is no rule which says the journal has to be on the slow device.

- Ted
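The "undo changes that aren't ready yet" behavior Ted describes above, where one disk block (say, part of the inode table) carries changes from several transactions, can be sketched with a toy function. The structure here is invented for illustration; real soft-update code tracks this with per-dependency state, but the visible effect is the same: the block image actually written may be a partially rolled-back copy, so the same block can be written several times as its dependencies resolve.

```python
# Toy model of soft-update roll-back on a shared metadata block
# (invented layout).  One block holds several inode slots; only the
# slots whose dependencies are satisfied may show their new contents
# in the image that goes to disk.

def writable_image(block, safe_slots):
    """block: {slot: (old_value, new_value)} for one on-disk block.
    safe_slots: slots whose new value may legally reach disk now.
    Returns the (possibly partially rolled-back) image to write."""
    image = {}
    for slot, (old, new) in block.items():
        # unsafe changes are undone in the write-out copy, not in memory
        image[slot] = new if slot in safe_slots else old
    return image
```

Once the remaining dependencies complete, the block is written again with all slots current, which is exactly the "multiple writes" cost Ted points out.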
* Re: soft update vs journaling? @ 2006-01-22 18:54 UTC
From: John Richard Moser
To: Theodore Ts'o; +Cc: linux-kernel

Theodore Ts'o wrote:
> Not quite true; there are cases where Soft Updates will have to do
> multiple writes, when a particular block containing meta-data has
> multiple changes in it that have to be committed to the filesystem at
> different times in order to maintain consistency; this is particularly
> true when a block is part of the inode table, for example.

Yes, that makes sense.

> In general, though, it is true that Soft Updates can result in fewer
> disk writes compared to filesystems that utilize traditional
> journaling approaches, and this might even be noticeable if your
> workload is heavily skewed towards metadata updates. (This is mainly
> true in benchmarks that are horrendously disconnected from the real
> world, such as dbench.)

Yeah -- microbenchmarks tend to shout "this will never happen more than once every billion years, but look, we're faster by a microsecond!"
> One major downside of Soft Updates that you haven't mentioned in
> your note is the tremendous amount of complexity it adds to the
> filesystem: the filesystem has to keep track of a very complex
> state machine, with knowledge of the ordering constraints of
> each change to the filesystem and of how to "back out" parts of a
> change when that becomes necessary.

Yes, I had figured soft update would be a lot more complex than journaling. Though, could the bulk of it be implemented filesystem-independently? I could see a "Soft Update API" that lets file systems sketch out the dependencies each meta-data operation has and describe their order; it would, of course, be a total pain in the ass to do.

> Whenever you want to extend a filesystem with some new feature, such
> as online resizing, it's not enough to just add that

Is online resizing ever safe? I mean, with on-disk filesystem layout support I could somewhat believe it for growing; for shrinking you'd need a way to move files around without damaging them (possible). I guess it would be. So how does this work? Move files -> alter file system superblocks?

> feature; you also have to modify the black magic which is the Soft
> Updates machinery. This significantly increases the difficulty of
> adding new features to a filesystem, and can act as a roadblock for
> people wanting to add them.

Nod.

>> Also, soft update systems mount instantly, because there's no
>> journal to play back, and the file system is always consistent.
>
> This is only true if you don't care about recovering lost data blocks.
> Fixing this requires running the equivalent of fsck on the
> filesystem.
> If you do, there is a major difference in performance.
> Even if you can do the fsck scan on-line, it will greatly slow down
> normal operations while recovering from a system crash; the
> slowdown associated with replaying a journal is far smaller in
> comparison.

A passive-active approach could passively generate a list of inodes from dentries as they're accessed, and actively walk the directory tree when the disk is idle. Then a quick allocation check between the inodes and whatever allocation lists or trees there are could be done. This has the disadvantage that if the system is under heavy load, the recovery won't get done. There's also a period where the disk may be rather full, causing fragmentation or out-of-space errors along the way. The only way to counter this would be to force a mandatory minimum amount of recovery activity per time interval, which again causes your original problem.

> If you are using flash, use a filesystem which is optimized for flash,
> such as JFFS2.

What about a NAND flash chip on a USB drive like a SanDisk Cruzer Mini? Or, hell, a CompactFlash card for use in a digital camera?

> Otherwise, note that in most cases disk space is
> nearly free, so allocating even 128 megs for the journal is chump
> change when you're talking about a 200GB or larger hard drive.
> Also note that if you have to use slow media, one of the things
> you can do is use a separate (fast) device for your journal; there is
> no rule which says the journal has to be on the slow device.

Unless it's portable and you don't want to reconfigure every system.

> - Ted
* Re: soft update vs journaling? @ 2006-01-22 21:02 UTC
From: Theodore Ts'o
To: John Richard Moser; +Cc: linux-kernel

On Sun, Jan 22, 2006 at 01:54:23PM -0500, John Richard Moser wrote:
> Is online resizing ever safe? I mean, with on-disk filesystem layout
> support I could somewhat believe it for growing; for shrinking you'd
> need a way to move files around without damaging them (possible). I
> guess it would be.
>
> So how does this work? Move files -> alter file system superblocks?

The online resizing support in ext3 only grows filesystems; it doesn't shrink them. What is currently supported in 2.6 requires you to reserve space in advance. There is also a slight modification to the ext2/3 filesystem format, only supported by Linux 2.6, which allows you to grow the filesystem without needing to move filesystem data structures around; the kernel patches for actually doing this new style of online resizing aren't in mainline yet, although they have been posted to ext2-devel for evaluation.

> A passive-active approach could passively generate a list of inodes from
> dentries as they're accessed, and actively walk the directory tree when
> the disk is idle. Then a quick allocation check between the inodes and
> whatever allocation lists or trees there are could be done.

That doesn't really help, because in order to release the unused disk blocks, you have to walk every single inode and keep track of the block allocation bitmaps for the entire filesystem. If you have a really big filesystem, it may require hundreds of megabytes of non-swappable kernel memory.
And if you try to do this in userspace, it becomes an unholy mess trying to keep the userspace and in-kernel mounted filesystem data structures in sync.

- Ted
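Ted's memory figure can be sanity-checked with back-of-envelope arithmetic. The numbers below are assumptions for illustration (4 KiB blocks, one bit per block); note the bitmap alone is only the floor of the cost — the expensive part he describes is walking every inode to populate it, and the result must stay pinned until the scan completes.

```python
# Back-of-envelope: memory needed to track one bit per filesystem block
# while rebuilding allocation state.  Sizes are illustrative assumptions.

def bitmap_bytes(fs_bytes, block_size=4096):
    """Bytes of bitmap needed for a filesystem of fs_bytes total size."""
    blocks = fs_bytes // block_size   # number of allocation blocks
    return blocks // 8                # one bit per block

# e.g. a 4 TiB filesystem with 4 KiB blocks needs a 128 MiB bitmap:
# 4 * 2**40 / 4096 = 2**30 blocks, / 8 = 2**27 bytes = 128 MiB
```

So for era-appropriate 200 GB disks the bitmap itself is only a few megabytes, but the technique scales linearly with volume size, and the real cost is that all of it has to be built from a full inode walk before any lost block can be freed.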
* Re: soft update vs journaling? @ 2006-01-22 22:44 UTC
From: Kyle Moffett
To: Theodore Ts'o; +Cc: John Richard Moser, linux-kernel

On Jan 22, 2006, at 16:02, Theodore Ts'o wrote:
> The online resizing support in ext3 only grows filesystems; it
> doesn't shrink them. What is currently supported in 2.6 requires you
> to reserve space in advance. There is also a slight modification
> to the ext2/3 filesystem format, only supported by Linux 2.6, which
> allows you to grow the filesystem without needing to move filesystem
> data structures around; the kernel patches for actually doing this
> new style of online resizing aren't in mainline yet, although they
> have been posted to ext2-devel for evaluation.

From my understanding of HFS+/HFSX, this is actually one of the nicer bits of that filesystem architecture. It stores the data structures on-disk using extents in such a way that you probably could hot-resize the disk without significant RAM overhead (both grow and shrink) as long as there's enough free space. Essentially, every block on the disk is represented by an allocation block, and all data structures refer to allocation-block offsets. The allocation-file bitmap itself is comprised of allocation blocks and mapped by a set of extent descriptors.
The result is that it is possible to fragment the allocation file, catalog file, and any other on-disk structures (with the sole exception of the 1K boot block and the 512-byte volume headers at the very start and end of the volume).

At the moment I'm educating myself on the operation of MFS/HFS/HFS+/HFSX and the Linux kernel VFS by writing a completely new combined hfsx driver, to which I eventually plan to add online-resizing support and a variety of other features. One question, though: does anyone have any good recent references for "How to write a blockdev-based Linux filesystem"? I've searched the various crufty corners of the web, Documentation/, etc., and found enough to get started, but (for example) I had a hard time determining from the various sources what a struct file_system_type is supposed to have in it, and what the available default address_space/superblock ops are.

Cheers,
Kyle Moffett

--
They _will_ find opposing experts to say it isn't, if you push hard enough the wrong way. Idiots with a PhD aren't hard to buy. -- Rob Landley
* Re: soft update vs journaling? @ 2006-01-23  7:24 UTC
From: Theodore Ts'o
To: Kyle Moffett; +Cc: John Richard Moser, linux-kernel

On Sun, Jan 22, 2006 at 05:44:08PM -0500, Kyle Moffett wrote:
> From my understanding of HFS+/HFSX, this is actually one of the
> nicer bits of that filesystem architecture. It stores the data
> structures on-disk using extents in such a way that you probably
> could hot-resize the disk without significant RAM overhead (both grow
> and shrink) as long as there's enough free space.

Hot-shrinking a filesystem is certainly possible for any filesystem, but the problem is how many filesystem data structures you have to walk in order to find the owners of all of the blocks that you have to relocate. That generally isn't a RAM overhead problem; rather, in general, most filesystems don't have an efficient way to answer the question, "who owns this arbitrary disk block?" Having extents gives you a slightly more efficient encoding, but it is still the case that you potentially have to check every file in the filesystem to see whether it owns one of the disk blocks that needs to be moved when you are shrinking the filesystem.

You could of course design a filesystem which maintained a reverse-map data structure, but it would slow the filesystem down, since it would be a separate data structure that would have to be updated each time you allocated or freed a disk block. And the only use for such a data structure would be to make shrinking a filesystem more efficient. Given that this is generally not a common operation, it seems unlikely that a filesystem designer would choose to make this particular tradeoff.

- Ted
* Re: soft update vs journaling? @ 2006-01-23 13:31 UTC
From: Mitchell Blank Jr
To: Theodore Ts'o, Kyle Moffett, John Richard Moser, linux-kernel

Theodore Ts'o wrote:
> in general, most filesystems don't have an efficient way to
> answer the question, "who owns this arbitrary disk block?"
[...]
> Given that this is generally not a common operation, it seems unlikely
> that a filesystem designer would choose to make this particular
> tradeoff.

True -- a much more rational approach would be to provide a translation table from "old block #" to "new block #"; then when the filesystem sees a reference to an invalid block number (>= the filesystem size), it can just translate it to its new location. You have to be careful if the filesystem is regrown, since some of those block numbers may become valid again; that can easily be handled by moving the data back to its original block # and removing the mapping.

This doesn't completely remove the extra cost on the block allocator fastpath: when a block is freed, any entry pointing to it must be removed from the translation table, or you can't handle regrowth properly (the block could have been reused by a file pointing to the real block # -- you won't know whether to move the old data back or not). However, this is probably a lot cheaper than maintaining a full reverse map, plus you only pay the cost after a shrink has actually happened.

-Mitch
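Mitchell's scheme can be sketched end to end. The class and method names below are invented for illustration (his message proposes the idea, not an API): stale block numbers past the shrunken size are remapped on lookup, and the free path drops any mapping whose target block is reused, which is exactly the regrow hazard he flags.

```python
# Sketch of an old-block -> new-block translation table for filesystem
# shrinking (invented API, illustrating the proposal above).

class ShrinkMap:
    def __init__(self, fs_blocks):
        self.fs_blocks = fs_blocks   # current filesystem size in blocks
        self.remap = {}              # old block # -> new block #

    def shrink(self, new_size, moves):
        """Record the relocations the shrinker performed."""
        self.fs_blocks = new_size
        self.remap.update(moves)

    def lookup(self, block):
        if block >= self.fs_blocks:  # stale reference past end of fs
            return self.remap[block]
        return block                 # valid block numbers pass through

    def free(self, block):
        # allocator fastpath cost: forget any mapping targeting this
        # block, so a later regrow can't wrongly "restore" old data
        self.remap = {o: n for o, n in self.remap.items() if n != block}
```

The table stays empty (and costs nothing on lookup or free) until a shrink has actually happened, which is the asymmetry that makes this cheaper than a full reverse map.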
* Re: soft update vs journaling? 2006-01-23 7:24 ` Theodore Ts'o 2006-01-23 13:31 ` Mitchell Blank Jr @ 2006-01-23 13:33 ` Kyle Moffett 2006-01-23 13:52 ` Antonio Vargas 2006-01-23 20:48 ` soft update vs journaling? Folkert van Heusden 2 siblings, 1 reply; 32+ messages in thread From: Kyle Moffett @ 2006-01-23 13:33 UTC (permalink / raw) To: Theodore Ts'o; +Cc: John Richard Moser, linux-kernel On Jan 23, 2006, at 02:24, Theodore Ts'o wrote: > Hot-shrinking a filesystem is certainly possible for any > filesystem, but the problem is how many filesystem data structures > you have to walk in order to find all the owner of all of the > blocks that you have to relocate. That generally isn't a RAM > overhead problem, but the fact that in general, most filesystems > don't have an efficient way to answer the question, "who owns this > arbitrary disk block?" Having extents means you have a slightly > more efficient encoding system, but it still is the case that you > have to check potentially every file in the filesystem to see if it > is the owner of one of the disk blocks that needs to be moved when > you are shrinking the filesystem. The way that I'm considering implementing this is by intentionally fragmenting the allocation bitmap, catalog file, etc, such that each 1/8 or so of the disk contains its own allocation bitmap describing its contents, its own set of files or directories, etc. The allocator would largely try to keep individual btree fragments cohesive, such that one of the 1/8th divisions of the disk would only have pertinent data for itself. The idea would be that when trying to look up an allocation block, in the common case you need only parse a much smaller subsection of the disk structures. > And the only use for such a [reverse-mapping] data structure would > be to make shrinking a filesystem more efficient. Not entirely true. 
I _believe_ you could use such data structures to make the allocation algorithm much more robust against fragmentation if you record the right kind of information. Cheers, Kyle Moffett -- If you don't believe that a case based on [nothing] could potentially drag on in court for _years_, then you have no business playing with the legal system at all. -- Rob Landley ^ permalink raw reply [flat|nested] 32+ messages in thread
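[Kyle's per-region layout can be sketched like this: split the disk into groups, each carrying its own allocation bitmap, so that a block lookup only touches one small structure. A hedged toy sketch with made-up sizes, not Kyle's actual HFS+ work:]

```python
# Per-group allocation bitmaps: answering "is block N in use?" reads only
# the one group that contains N, rather than a filesystem-wide bitmap.

BLOCKS_PER_GROUP = 1024

def group_of(block):
    return block // BLOCKS_PER_GROUP

def block_in_use(bitmaps, block):
    # bitmaps: list of per-group bytearrays, one bit per block
    g = group_of(block)                 # only this group's bitmap is read
    off = block % BLOCKS_PER_GROUP
    return bool(bitmaps[g][off // 8] & (1 << (off % 8)))

bitmaps = [bytearray(BLOCKS_PER_GROUP // 8) for _ in range(8)]
bitmaps[2][0] |= 1                      # mark block 2048 allocated
print(block_in_use(bitmaps, 2048))      # -> True
print(block_in_use(bitmaps, 2049))      # -> False
```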
* Re: soft update vs journaling? 2006-01-23 13:33 ` Kyle Moffett @ 2006-01-23 13:52 ` Antonio Vargas 2006-01-23 16:48 ` Linux VFS architecture questions Kyle Moffett 0 siblings, 1 reply; 32+ messages in thread From: Antonio Vargas @ 2006-01-23 13:52 UTC (permalink / raw) To: Kyle Moffett, Theodore Ts'o, John Richard Moser, linux-kernel On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote: > On Jan 23, 2006, at 02:24, Theodore Ts'o wrote: > > Hot-shrinking a filesystem is certainly possible for any > > filesystem, but the problem is how many filesystem data structures > > you have to walk in order to find all the owner of all of the > > blocks that you have to relocate. That generally isn't a RAM > > overhead problem, but the fact that in general, most filesystems > > don't have an efficient way to answer the question, "who owns this > > arbitrary disk block?" Having extents means you have a slightly > > more efficient encoding system, but it still is the case that you > > have to check potentially every file in the filesystem to see if it > > is the owner of one of the disk blocks that needs to be moved when > > you are shrinking the filesystem. > > The way that I'm considering implementing this is by intentionally > fragmenting the allocation bitmap, catalog file, etc, such that each > 1/8 or so of the disk contains its own allocation bitmap describing > its contents, its own set of files or directories, etc. The > allocator would largely try to keep individual btree fragments > cohesive, such that one of the 1/8th divisions of the disk would only > have pertinent data for itself. The idea would be that when trying > to look up an allocation block, in the common case you need only > parse a much smaller subsection of the disk structures. this sounds exactly the same as ext2/ext3 allocation groups :) > > And the only use for such a [reverse-mapping] data structure would > > be to make shrinking a filesystem more efficient. > > Not entirely true. 
I _believe_ you could use such data structures to > make the allocation algorithm much more robust against fragmentation > if you record the right kind of information. > > Cheers, > Kyle Moffett > > -- > If you don't believe that a case based on [nothing] could potentially > drag on in court for _years_, then you have no business playing with > the legal system at all. > -- Rob Landley > -- Greetz, Antonio Vargas aka winden of network http://wind.codepixel.com/ windNOenSPAMntw@gmail.com thesameasabove@amigascne.org Every day, every year you have to work you have to study you have to scene. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Linux VFS architecture questions 2006-01-23 13:52 ` Antonio Vargas @ 2006-01-23 16:48 ` Kyle Moffett 2006-01-23 17:00 ` Pekka Enberg 0 siblings, 1 reply; 32+ messages in thread From: Kyle Moffett @ 2006-01-23 16:48 UTC (permalink / raw) To: Antonio Vargas; +Cc: Theodore Ts'o, John Richard Moser, linux-kernel On Jan 23, 2006, at 08:52:51, Antonio Vargas wrote: > On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote: >> The way that I'm considering implementing this is by intentionally >> fragmenting the allocation bitmap, catalog file, etc, such that >> each 1/8 or so of the disk contains its own allocation bitmap >> describing its contents, its own set of files or directories, >> etc. The allocator would largely try to keep individual btree >> fragments cohesive, such that one of the 1/8th divisions of the >> disk would only have pertinent data for itself. The idea would be >> that when trying to look up an allocation block, in the common >> case you need only parse a much smaller subsection of the disk >> structures. > > this sounds exactly the same as ext2/ext3 allocation groups :) Great! I'm trying to learn about filesystem design and implementation, which is why I started writing my own hfsplus filesystem (otherwise I would have just used the in-kernel one). Do you have any recommended reading (either online or otherwise) for someone trying to understand the kernel's VFS and blockdev interfaces? I _think_ I understand the basics of buffer_head, super_block, and have some idea of how to use aops, but it's tough going trying to find out what functions to call to manage cached disk blocks, or under what conditions the various VFS functions are called. I'm trying to write up a "Linux Disk-Based Filesystem Developers Guide" based on what I learn, but it's remarkably sparse so far. One big question I have: HFS/HFS+ have an "extents overflow" btree that contains extents beyond the first 3 (for HFS) or 8 (for HFS+). 
I would like to speculatively cache parts of that btree when the files are accessed, but not if memory is short, and I would like to allow the filesystem to free up parts of the btree under the same circumstances. I have a preliminary understanding of how to trigger the filesystem to read various blocks of metadata (using buffer_heads) or file data for programs (by returning a block number from the appropriate aops function), but how do I allocate data structures as "easily reclaimable" and indicate to the kernel that it can ask me to reclaim that memory? Thanks for the help! Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 32+ messages in thread
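[The two-tier extent lookup Kyle describes (inline extents in the catalog record, the rest in the extents overflow B-tree) can be sketched conceptually. The structures below are simplified stand-ins, not the HFS+ on-disk format, and the overflow "tree" is just a dict keyed by first file block:]

```python
# Map a logical file block to an allocation block: scan the (up to 8)
# inline extents first, then fall back to the overflow structure for
# anything beyond them. Each extent is (start_alloc_block, block_count).

def lookup_block(inline_extents, overflow, file_block):
    pos = 0
    for start, count in inline_extents:        # inline catalog extents
        if pos <= file_block < pos + count:
            return start + (file_block - pos)
        pos += count
    # Past the inline extents: consult the overflow map, keyed by the
    # first logical file block each overflow extent covers.
    for first, (start, count) in sorted(overflow.items()):
        if first <= file_block < first + count:
            return start + (file_block - first)
    raise KeyError("block not mapped")

inline = [(100, 4), (300, 4)]          # covers file blocks 0-7
overflow = {8: (900, 16)}              # covers file blocks 8-23
print(lookup_block(inline, overflow, 10))   # -> 902
```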
* Re: Linux VFS architecture questions 2006-01-23 16:48 ` Linux VFS architecture questions Kyle Moffett @ 2006-01-23 17:00 ` Pekka Enberg 2006-01-23 17:50 ` Kyle Moffett 0 siblings, 1 reply; 32+ messages in thread From: Pekka Enberg @ 2006-01-23 17:00 UTC (permalink / raw) To: Kyle Moffett Cc: Antonio Vargas, Theodore Ts'o, John Richard Moser, linux-kernel Hi Kyle, On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote: > Great! I'm trying to learn about filesystem design and > implementation, which is why I started writing my own hfsplus > filesystem (otherwise I would have just used the in-kernel one). Do > you have any recommended reading (either online or otherwise) for > someone trying to understand the kernel's VFS and blockdev > interfaces? I _think_ I understand the basics of buffer_head, > super_block, and have some idea of how to use aops, but it's tough > going trying to find out what functions to call to manage cached disk > blocks, or under what conditions the various VFS functions are > called. I'm trying to write up a "Linux Disk-Based Filesystem > Developers Guide" based on what I learn, but it's remarkably sparse > so far. Did you read Documentation/filesystems/vfs.txt? Also, books Linux Kernel Development and Understanding the Linux Kernel have fairly good information on VFS (and related) stuff. Pekka ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Linux VFS architecture questions 2006-01-23 17:00 ` Pekka Enberg @ 2006-01-23 17:50 ` Kyle Moffett 2006-01-23 17:54 ` Randy.Dunlap 0 siblings, 1 reply; 32+ messages in thread From: Kyle Moffett @ 2006-01-23 17:50 UTC (permalink / raw) To: Pekka Enberg Cc: Antonio Vargas, Theodore Ts'o, John Richard Moser, linux-kernel On Jan 23, 2006, at 12:00, Pekka Enberg wrote: > Hi Kyle, > > On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote: >> Great! I'm trying to learn about filesystem design and >> implementation, which is why I started writing my own hfsplus >> filesystem (otherwise I would have just used the in-kernel one). >> Do you have any recommended reading (either online or otherwise) >> for someone trying to understand the kernel's VFS and blockdev >> interfaces? I _think_ I understand the basics of buffer_head, >> super_block, and have some idea of how to use aops, but it's tough >> going trying to find out what functions to call to manage cached >> disk blocks, or under what conditions the various VFS functions >> are called. I'm trying to write up a "Linux Disk-Based Filesystem >> Developers Guide" based on what I learn, but it's remarkably >> sparse so far. > > Did you read Documentation/filesystems/vfs.txt? Yeah, that was the first thing I looked at. Once I've got things figured out, I'll probably submit a fairly hefty patch to that file to add additional documentation. > Also, books Linux Kernel Development and Understanding the Linux > Kernel have fairly good information on VFS (and related) stuff. Ah, thanks again! It looks like both of those are available through my university's Safari/ProQuest subscription (http:// safari.oreilly.com/), so I'll take a look right away! Cheers, Kyle Moffett -- I lost interest in "blade servers" when I found they didn't throw knives at people who weren't supposed to be in your machine room. -- Anthony de Boer ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Linux VFS architecture questions 2006-01-23 17:50 ` Kyle Moffett @ 2006-01-23 17:54 ` Randy.Dunlap 0 siblings, 0 replies; 32+ messages in thread From: Randy.Dunlap @ 2006-01-23 17:54 UTC (permalink / raw) To: Kyle Moffett Cc: Pekka Enberg, Antonio Vargas, Theodore Ts'o, John Richard Moser, linux-kernel On Mon, 23 Jan 2006, Kyle Moffett wrote: > On Jan 23, 2006, at 12:00, Pekka Enberg wrote: > > Hi Kyle, > > > > On 1/23/06, Kyle Moffett <mrmacman_g4@mac.com> wrote: > >> Great! I'm trying to learn about filesystem design and > >> implementation, which is why I started writing my own hfsplus > >> filesystem (otherwise I would have just used the in-kernel one). > >> Do you have any recommended reading (either online or otherwise) > >> for someone trying to understand the kernel's VFS and blockdev > >> interfaces? I _think_ I understand the basics of buffer_head, > >> super_block, and have some idea of how to use aops, but it's tough > >> going trying to find out what functions to call to manage cached > >> disk blocks, or under what conditions the various VFS functions > >> are called. I'm trying to write up a "Linux Disk-Based Filesystem > >> Developers Guide" based on what I learn, but it's remarkably > >> sparse so far. > > > > Did you read Documentation/filesystems/vfs.txt? > > Yeah, that was the first thing I looked at. Once I've got things > figured out, I'll probably submit a fairly hefty patch to that file > to add additional documentation. > > > Also, books Linux Kernel Development and Understanding the Linux > > Kernel have fairly good information on VFS (and related) stuff. > > Ah, thanks again! It looks like both of those are available through > my university's Safari/ProQuest subscription (http:// > safari.oreilly.com/), so I'll take a look right away! 
This web page is terribly out of date, but you might find a few helpful link on it (near the bottom): http://www.xenotime.net/linux/linux-fs.html -- ~Randy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-23 7:24 ` Theodore Ts'o 2006-01-23 13:31 ` Mitchell Blank Jr 2006-01-23 13:33 ` Kyle Moffett @ 2006-01-23 20:48 ` Folkert van Heusden 2 siblings, 0 replies; 32+ messages in thread
From: Folkert van Heusden @ 2006-01-23 20:48 UTC (permalink / raw)
To: Theodore Ts'o, Kyle Moffett, John Richard Moser, linux-kernel

> You could of course design a filesystem which maintained a reverse map
> data structure, but it would slow the filesystem down since it would
> be a separate data structure that would have to be updated each time
> you allocated or freed a disk block. And the only use for such a data
> structure would be to make shrinking a filesystem more efficient.
> Given that this is generally not a common operation, it seems unlikely
> that a filesystem designer would choose to make this particular
> tradeoff.

Or you could ship it switched off by default: reserve the space for it and activate it as soon as some magic switch is set in the kernel. Then some background process would build it up while also keeping track of current changes. When everything is finished, update some flag to let the resizer know it can do its job.

Folkert van Heusden

-- www.vanheusden.com/recoverdm/ - got an unreadable cd with scratches? recoverdm might help you recovering data
--------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-22 21:02 ` Theodore Ts'o 2006-01-22 22:44 ` Kyle Moffett @ 2006-01-23 1:02 ` John Richard Moser 1 sibling, 0 replies; 32+ messages in thread From: John Richard Moser @ 2006-01-23 1:02 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-kernel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Theodore Ts'o wrote: > On Sun, Jan 22, 2006 at 01:54:23PM -0500, John Richard Moser wrote: > >>>Whenever you want to extend a filesystem to add some new feature, such >>>as online resizing, for example, it's not enough to just add that >> >>Online resizing is ever safe? I mean, with on-disk filesystem layout >>support I could somewhat believe it for growing; for shrinking you'd >>need a way to move files around without damaging them (possible). I >>guess it would be. >> >>So how does this work? Move files -> alter file system superblocks? > > > The online resizing support in ext3 only grows the filesystems; it > doesn't shrink it. What is currently supported in 2.6 requires you to > reserve space in advance. There is also a slight modification to the > ext2/3 filesystem format which is only supported by Linux 2.6 which > allows you to grow the filesystem without needing to move filesystem > data structures around; the kernel patches for actualling doing this > new style of online resizing aren't yet in mainline yet, although they > have been posted to ext2-devel for evaluation. > > >>A passive-active approach could passively generate a list of inodes from >>dentries as they're accessed; and actively walk the directory tree when >>the disk is idle. Then a quick allocation check between inodes and >>whatever allocation lists or trees there are could be done. > > > That doesn't really help, because in order to release the unused disk > blocks, you have to walk every single inode and keep track of the > block allocation bitmaps for the entire filesystem. 
If you have a > really big filesystem, it may require hundreds of megabytes of > non-swappable kernel memory. And if you try to do this in userspace, > it becomes an unholy mess trying to keep the userspace and in-kernel > mounted filesystem data structures in sync. > Yeah I figured that you couldn't take action until everything was seen; I can see how you could have problems with all that kernel memory ;) FUSE driver, anyone? :> (I've actually looked into FUSE for the rootfs, via loading a fuser driver from an init.d and then replacing bash with init on the rootfs; haven't found an ext2 or xfs fuse driver to test with) > - Ted > - -- All content of all messages exchanged herein are left in the Public Domain, unless otherwise explicitly stated. Creative brains are a valuable, limited resource. They shouldn't be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there. -- Eric Steven Raymond -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFD1CskhDd4aOud5P8RArmJAJ9mgLjkxUcg5GW1o4q88Cb6ESmdCACZAS00 M1R+7biZpmOCCCBkEXVQL7w= =060w -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 32+ messages in thread
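[Ted's "hundreds of megabytes of non-swappable kernel memory" figure is easy to reproduce with back-of-the-envelope arithmetic for a whole-filesystem allocation bitmap (block size assumed 4 KiB here for illustration):]

```python
# Memory needed to track one bit per block across an entire filesystem
# while walking every inode, as in the shrink scenario Ted describes.

def bitmap_bytes(fs_bytes, block_size=4096):
    nblocks = fs_bytes // block_size
    return nblocks // 8                      # one bit per block

TiB = 1 << 40
MiB = 1 << 20
print(bitmap_bytes(16 * TiB) // MiB)   # -> 512 (MiB for a 16 TiB fs)
print(bitmap_bytes(1 * TiB) // MiB)    # -> 32  (MiB for a 1 TiB fs)
```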
* Re: soft update vs journaling? 2006-01-22 9:31 ` Theodore Ts'o 2006-01-22 18:54 ` John Richard Moser @ 2006-01-22 19:50 ` Diego Calleja 2006-01-22 20:39 ` Suleiman Souhlal 2006-01-23 1:00 ` John Richard Moser 1 sibling, 2 replies; 32+ messages in thread
From: Diego Calleja @ 2006-01-22 19:50 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: nigelenki, linux-kernel

El Sun, 22 Jan 2006 04:31:44 -0500, Theodore Ts'o <tytso@mit.edu> escribió:

> One major downside with Soft Updates that you haven't mentioned in
> your note, is that the amount of complexity it adds to the filesystem
> is tremendous; the filesystem has to keep track of a very complex
> state machinery, with knowledge of about the ordering constraints of
> each change to the filesystem and how to "back out" parts of the
> change when that becomes necessary.

And FreeBSD is implementing journaling for UFS and getting rid of softupdates [1]. While this doesn't prove that softupdates are "a bad idea", I think it shows that the added softupdates complexity doesn't pay off in the real world.

[1]: http://lists.freebsd.org/pipermail/freebsd-hackers/2004-December/009261.html

"4. Journaled filesystem. While we can debate the merits of speed and data integrety of journalling vs. softupdates, the simple fact remains that softupdates still requires a fsck run on recovery, and the multi-terabyte filesystems that are possible these days make fsck a very long and unpleasant experience, even with bg-fsck. There was work at some point at RPI to add journaling to UFS, but there hasn't been much status on that in a long time. There have also been proposals and works-in-progress to port JFS, ReiserFS, and XFS. Some of these efforts are still alive, but they need to be seen through to completion"

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-22 19:50 ` Diego Calleja @ 2006-01-22 20:39 ` Suleiman Souhlal 2006-01-22 20:50 ` Diego Calleja 2006-01-23 1:00 ` John Richard Moser 1 sibling, 1 reply; 32+ messages in thread From: Suleiman Souhlal @ 2006-01-22 20:39 UTC (permalink / raw) To: Diego Calleja; +Cc: Theodore Ts'o, nigelenki, linux-kernel Diego Calleja wrote: > And FreeBSD is implementing journaling for UFS and getting rid of > softupdates [1]. While this not proves that softupdates is "a bad idea", > i think this proves why the added sofupdates complexity doesn't seem > to pay off in the real world. You read the message wrong: We're not getting rid of softupdates. -- Suleiman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-22 20:39 ` Suleiman Souhlal @ 2006-01-22 20:50 ` Diego Calleja 0 siblings, 0 replies; 32+ messages in thread
From: Diego Calleja @ 2006-01-22 20:50 UTC (permalink / raw)
To: Suleiman Souhlal; +Cc: tytso, nigelenki, linux-kernel

El Sun, 22 Jan 2006 12:39:38 -0800, Suleiman Souhlal <ssouhlal@FreeBSD.org> escribió:

> Diego Calleja wrote:
> > And FreeBSD is implementing journaling for UFS and getting rid of
> > softupdates [1]. While this not proves that softupdates is "a bad idea",
> > i think this proves why the added sofupdates complexity doesn't seem
> > to pay off in the real world.
>
> You read the message wrong: We're not getting rid of softupdates.
> -- Suleiman

Oh, both systems will be available at the same time? That will certainly be a good place to compare both approaches.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-22 19:50 ` Diego Calleja 2006-01-22 20:39 ` Suleiman Souhlal @ 2006-01-23 1:00 ` John Richard Moser 2006-01-23 1:09 ` Suleiman Souhlal 1 sibling, 1 reply; 32+ messages in thread From: John Richard Moser @ 2006-01-23 1:00 UTC (permalink / raw) To: Diego Calleja; +Cc: Theodore Ts'o, linux-kernel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Diego Calleja wrote: > El Sun, 22 Jan 2006 04:31:44 -0500, > Theodore Ts'o <tytso@mit.edu> escribió: > > > >>One major downside with Soft Updates that you haven't mentioned in >>your note, is that the amount of complexity it adds to the filesystem >>is tremendous; the filesystem has to keep track of a very complex >>state machinery, with knowledge of about the ordering constraints of >>each change to the filesystem and how to "back out" parts of the >>change when that becomes necessary. > > > > And FreeBSD is implementing journaling for UFS and getting rid of > softupdates [1]. While this not proves that softupdates is "a bad idea", > i think this proves why the added sofupdates complexity doesn't seem > to pay off in the real world. > Yeah, the huge TB fsck thing became a problem. I wonder still if it'd be useful for small vfat file systems (floppies, usb drives); nobody has led me to believe it's definitely feasible to not corrupt meta-data in this way. I guess journaling is looking a lot better. :) > [1]: http://lists.freebsd.org/pipermail/freebsd-hackers/2004-December/009261.html > > "4. Journaled filesystem. While we can debate the merits of speed and > data integrety of journalling vs. softupdates, the simple fact remains > that softupdates still requires a fsck run on recovery, and the > multi-terabyte filesystems that are possible these days make fsck a very > long and unpleasant experience, even with bg-fsck. There was work at > some point at RPI to add journaling to UFS, but there hasn't been much > status on that in a long time. 
There have also been proposals and > works-in-progress to port JFS, ReiserFS, and XFS. Some of these efforts > are still alive, but they need to be seen through to completion" > - -- All content of all messages exchanged herein are left in the Public Domain, unless otherwise explicitly stated. Creative brains are a valuable, limited resource. They shouldn't be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there. -- Eric Steven Raymond -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFD1CqhhDd4aOud5P8RAjvDAJ0W9pcNQ31v0RWSSIGVitnSpfvReQCdHBah usgY72whnDcCwgshpVFW02o= =Px/i -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-23 1:00 ` John Richard Moser @ 2006-01-23 1:09 ` Suleiman Souhlal 2006-01-23 2:09 ` John Richard Moser 0 siblings, 1 reply; 32+ messages in thread From: Suleiman Souhlal @ 2006-01-23 1:09 UTC (permalink / raw) To: John Richard Moser; +Cc: Diego Calleja, Theodore Ts'o, linux-kernel John Richard Moser wrote: > Yeah, the huge TB fsck thing became a problem. I wonder still if it'd > be useful for small vfat file systems (floppies, usb drives); nobody has > led me to believe it's definitely feasible to not corrupt meta-data in > this way. Please note that you don't *HAVE* to run fsck at every reboot. All background fsck does is reclaim unused blocks. -- Suleiman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-23 1:09 ` Suleiman Souhlal @ 2006-01-23 2:09 ` John Richard Moser 0 siblings, 0 replies; 32+ messages in thread From: John Richard Moser @ 2006-01-23 2:09 UTC (permalink / raw) To: Suleiman Souhlal; +Cc: Diego Calleja, Theodore Ts'o, linux-kernel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Suleiman Souhlal wrote: > John Richard Moser wrote: > >> Yeah, the huge TB fsck thing became a problem. I wonder still if it'd >> be useful for small vfat file systems (floppies, usb drives); nobody has >> led me to believe it's definitely feasible to not corrupt meta-data in >> this way. > > > Please note that you don't *HAVE* to run fsck at every reboot. All > background fsck does is reclaim unused blocks. > Duly noted, now can you answer my question? > -- Suleiman > - -- All content of all messages exchanged herein are left in the Public Domain, unless otherwise explicitly stated. Creative brains are a valuable, limited resource. They shouldn't be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there. -- Eric Steven Raymond -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFD1DryhDd4aOud5P8RAjiwAJ9xH5V/W2i5U/oVzT6AjdmBVk5+iwCfWD2j JzBRinqiqDd/rIQFkS9QIsQ= =SlOI -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-22 6:42 soft update vs journaling? John Richard Moser 2006-01-22 8:51 ` Jan Engelhardt 2006-01-22 9:31 ` Theodore Ts'o @ 2006-01-22 19:26 ` James Courtier-Dutton 2006-01-23 0:06 ` John Richard Moser 2006-01-23 5:32 ` Michael Loftis 3 siblings, 1 reply; 32+ messages in thread
From: James Courtier-Dutton @ 2006-01-22 19:26 UTC (permalink / raw)
To: John Richard Moser; +Cc: linux-kernel

John Richard Moser wrote:
>
> Unfortunately, journaling uses a chunk of space. Imagine a journal on a
> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
> Sure it could be done in 8 or 4 or so; or (in one of my file system
> designs) a static 16KiB block could reference dynamicly allocated
> journal space, allowing the system to sacrifice performance and shrink
> the journal when more space is needed. Either way, slow media like
> floppies will suffer, HARD; and flash devices will see a lot of
> write/erase all over the journal area, causing wear on that spot.
>

My understanding is that if one designed a power supply with enough headroom, one could remove the power and still have time to write dirty sectors to the USB flash stick. Would this not remove the need for a journaling fs on a flash stick? This would remove the "wear on that spot" problem.

Actually USB flash sticks are a bit clever, in that they add an extra layer of translation to the write. I.e. if you write to the same sector again and again, the USB flash stick will actually write it to a different area of the memory each time. This is specifically done to avoid the "wear on that spot" problem.

This "flush on power fail" approach is not so easy with a HD because it uses more power and takes longer to flush.

James

^ permalink raw reply [flat|nested] 32+ messages in thread
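[The wear-levelling indirection James describes can be sketched as a tiny flash translation layer. A toy model with invented names, not how any particular stick's firmware works: a logical-to-physical map directs each rewrite of the same logical sector to a fresh physical page:]

```python
# Toy flash translation layer: rewriting logical sector S never hits the
# same physical page twice in a row, spreading wear across the device.

class FlashStick:
    def __init__(self, npages):
        self.free = list(range(npages))   # erased physical pages
        self.map = {}                     # logical sector -> physical page
        self.data = {}

    def write(self, sector, payload):
        old = self.map.get(sector)
        page = self.free.pop(0)           # always take a fresh page
        self.data[page] = payload
        self.map[sector] = page
        if old is not None:
            self.free.append(old)         # old page queued for erase/reuse

    def read(self, sector):
        return self.data[self.map[sector]]

s = FlashStick(4)
s.write(0, b"a")
s.write(0, b"b")                          # same logical sector, rewritten
print(s.read(0))                          # -> b'b'
print(s.map[0])                           # -> 1 (a different physical page)
```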
* Re: soft update vs journaling? 2006-01-22 19:26 ` James Courtier-Dutton @ 2006-01-23 0:06 ` John Richard Moser 0 siblings, 0 replies; 32+ messages in thread From: John Richard Moser @ 2006-01-23 0:06 UTC (permalink / raw) To: James Courtier-Dutton; +Cc: linux-kernel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 James Courtier-Dutton wrote: > John Richard Moser wrote: > >> >> Unfortunately, journaling uses a chunk of space. Imagine a journal on a >> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes! >> Sure it could be done in 8 or 4 or so; or (in one of my file system >> designs) a static 16KiB block could reference dynamicly allocated >> journal space, allowing the system to sacrifice performance and shrink >> the journal when more space is needed. Either way, slow media like >> floppies will suffer, HARD; and flash devices will see a lot of >> write/erase all over the journal area, causing wear on that spot. >> > > My understanding is that if one designed a power supply with enough > headroom, one could remove the power and still have time to write dirty > sectors to the USB flash stick. Would this not remove the need for a > journaling fs on a flash stick. Depends on how much meta-data you have to write out. What if you just altered 6000 files? Now you have a ton of dentries to destroy and inodes to invalidate, some FAT entries to free up, etc. What if the user just pulled the drive out of the USB port? Or the USB port is faulty and lost connection (I've seen it!). > This would remove the "wear on that > spot" problem. Wha? You mean remove the trigger, not the underlying problem. > Actually USB flash sticks are a bit clever, in that they > add an extra layer of translation to the write. I.e. If you write to the > same sector again and again, the USB flash stick will actually write it > to a different area of the memory each time. This is specifically done > to save the "wear on that spot" problem. Yeah, built-in write balancing is nice. 
> > This "flush on power fail" approach is not so easy with a HD because it > uses more power and takes longer to flush. The "flush on power fail" is retarded because it takes extra hardware and doesn't work if the USB port itself loses connection or if the user is just dumb enough to pull/knock the drive out. It won't work with mini hard disks either, as you say. "Flush on power fail" is pretty much getting a 10 minute UPS and issuing 'shutdown -h now' when the UPS signals init, which there's already contingencies for (can also suspend to disk). It won't help if the PSU burns out, if the system crashes, if the power cord is pulled, or if the dog walks around your chair and you turn and bump your foot into the "power" button on the UPS itself. > > James > > - -- All content of all messages exchanged herein are left in the Public Domain, unless otherwise explicitly stated. Creative brains are a valuable, limited resource. They shouldn't be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there. -- Eric Steven Raymond -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFD1B37hDd4aOud5P8RAsCtAJ0TZM4I9T9gE6PMbfUhMux8zrxE9wCff67G kdlY0fvfJQXmDljz6KekSxc= =BV+l -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling? 2006-01-22 6:42 soft update vs journaling? John Richard Moser ` (2 preceding siblings ...) 2006-01-22 19:26 ` James Courtier-Dutton @ 2006-01-23 5:32 ` Michael Loftis 2006-01-23 18:52 ` John Richard Moser 3 siblings, 1 reply; 32+ messages in thread From: Michael Loftis @ 2006-01-23 5:32 UTC (permalink / raw) To: John Richard Moser, linux-kernel --On January 22, 2006 1:42:38 AM -0500 John Richard Moser <nigelenki@comcast.net> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > So I've been researching, because I thought this "Soft Update" thing > that BSD uses was some weird freak-ass way to totally corrupt a file > system if the power drops. Seems I was wrong; it's actually just the > opposite, an alternate solution to journaling. So let's compare notes. I hate to say it...but in my experience, this has been exactly the case with soft updates and FreeBSD 4 up to 4.11 pre releases. Whenever something untoward would happen, the filesystem almost always lost files and/or data, usually just files though. In practice it's never really worked too well for me. It also still requires a full fsck on boot, which means long boot times for recovery on large filesystems. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: soft update vs journaling?
  2006-01-23 18:52 ` John Richard Moser

From: John Richard Moser @ 2006-01-23 18:52 UTC (permalink / raw)
To: Michael Loftis; +Cc: linux-kernel

Michael Loftis wrote:
> I hate to say it, but in my experience this has been exactly the case
> with soft updates on FreeBSD 4 up to the 4.11 pre-releases.
>
> Whenever something untoward happened, the filesystem almost always lost
> files and/or data; usually just files, though. In practice it has never
> really worked well for me. It also still requires a full fsck on boot,
> which means long boot times for recovery on large filesystems.

You lost files in use, or random files? Soft Update was designed to
assure file system consistency.

In typical usage, when you drop power on something like FAT, you create
a "hole" in the file system: files pointing to allocated blocks that
belong to other files, crossed dentries, and so on. As you keep using
the file system, it simply accepts the information it finds, because
nothing looks wrong until you examine EVERYTHING together. The system
makes allocations and decisions based on faulty data, compounding the
damage through that newly created hole until the file system gives out.
The idea of Soft Update was to make sure that, while you may lose
something, the FS is in a safely usable state when you come back up.
The fsck only colors in a view of the FS and frees up blocks that don't
appear to be allocated to any particular file; leaked blocks are an
annoying but mostly harmless side effect of losing power in this scheme.
* Re: soft update vs journaling?
  2006-01-23 19:32 ` Matthias Andree

From: Matthias Andree @ 2006-01-23 19:32 UTC (permalink / raw)
To: John Richard Moser; +Cc: Michael Loftis, linux-kernel

On Mon, 23 Jan 2006, John Richard Moser wrote:

> The idea of Soft Update was to make sure that while you may lose
> something, when you come back up the FS is in a safely usable state.

Soft Updates are *extremely* sensitive to reordered writes, and writes
are more likely to be reordered than when streaming to a linear journal.
Don't even THINK of using soft updates without enforcing write order.
ext3, particularly with data=ordered or data=journal, is much more
forgiving in my experience. Not that I'd endorse dangerous use of file
systems, but the average user just doesn't know.

FreeBSD (stable@ Cc:d) appears to have no notion of write barriers as of
yet. Wedging the SCSI bus in the middle of a write sequence with WCE=1
caused major devastation and took me two runs of fsck to repair
(unfortunately I needed the (test) machine back up at once, so there was
no time to snapshot the b0rked partition for later scrutiny), and I found
myself with two hundred files relocated to the lost+found
office^Wdirectory.

Of course, this is in the "Doctor, doctor, it always hurts my right eye
when I drink coffee" ("well, remove the spoon from your mug before
drinking, then") category of "bug", but it has practical relevance...

-- 
Matthias Andree
end of thread, other threads: [~2006-01-24 2:37 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-22  6:42 soft update vs journaling? John Richard Moser
2006-01-22  8:51 ` Jan Engelhardt
2006-01-22 18:40   ` John Richard Moser
2006-01-22 19:05     ` Adrian Bunk
2006-01-22 19:08       ` Arjan van de Ven
2006-01-22 19:25         ` Adrian Bunk
2006-01-24  2:33     ` Jörn Engel
2006-01-22  9:31 ` Theodore Ts'o
2006-01-22 18:54   ` John Richard Moser
2006-01-22 21:02     ` Theodore Ts'o
2006-01-22 22:44       ` Kyle Moffett
2006-01-23  7:24         ` Theodore Ts'o
2006-01-23 13:31           ` Mitchell Blank Jr
2006-01-23 13:33           ` Kyle Moffett
2006-01-23 13:52             ` Antonio Vargas
2006-01-23 16:48           ` Linux VFS architecture questions Kyle Moffett
2006-01-23 17:00             ` Pekka Enberg
2006-01-23 17:50               ` Kyle Moffett
2006-01-23 17:54                 ` Randy.Dunlap
2006-01-23 20:48           ` soft update vs journaling? Folkert van Heusden
2006-01-23  1:02       ` John Richard Moser
2006-01-22 19:50 ` Diego Calleja
2006-01-22 20:39   ` Suleiman Souhlal
2006-01-22 20:50     ` Diego Calleja
2006-01-23  1:00       ` John Richard Moser
2006-01-23  1:09         ` Suleiman Souhlal
2006-01-23  2:09           ` John Richard Moser
2006-01-22 19:26 ` James Courtier-Dutton
2006-01-23  0:06   ` John Richard Moser
2006-01-23  5:32 ` Michael Loftis
2006-01-23 18:52   ` John Richard Moser
2006-01-23 19:32     ` Matthias Andree