* Designing Another File System
@ 2004-11-30  4:32 John Richard Moser
  2004-11-30  7:16 ` Bernd Eckenfels
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: John Richard Moser @ 2004-11-30  4:32 UTC (permalink / raw)
  To: linux-kernel


Greetings.

I've been interested in file system design for a while, and I think I
may be able to design one.  Being poor, I would like to patent/sell it
when done; however, whatever the result, I'd assuredly give a free
license to implement the resulting file system in GPL2 licensed code (if
you're not making money off it, I'm not making money off it).  This is
similar in principle to how Hans Reiser sells his file system I think.

I've examined briefly the overall concepts which go into several file
systems and pooled them together to create my basic requirements.  These
include things such as:


- a new journaling method that scales up and down, from small devices to
high loads

- localization of Inodes and related meta-data to prevent disk thrashing

- a scheme which allows Inodes to be dynamically allocated and
deallocated out of order

- 64 bit indices indicating the exact physical location on disk of
Inodes, giving an O(1) seek to the Inode itself

- A design based around preventing information leaking by guaranteed
secure erasure of data (insofar as the journal will even finish wiping
out data when replaying a transaction)


I have 6 basic questions.


1)  Can Unix utilities in general deal with 64 bit Inodes?  (Most
programs I assume won't care; ls -i and df -i might have trouble)

2)  Are there any other security concerns aside from information leaking
(deleted data being left over on disk, which may pop up in incompletely
written files)?

3)  What basic information do I absolutely *need* in my super block?

4)  What basic information do I absolutely *need* in my Inodes? (I'm
thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,
links})

5)  What basic information do I absolutely *need* in my directory
entries? (I'm thinking {name,inode})

6)  How do I lay out a time field for atime/dtime/ctime/mtime?


A few more specifics: as Block size grows, the file system does not lose
efficiency.  Fragment Blocks should skillfully subdivide huge Blocks
with a third-level index, adding one extra seek (O(+1)) for the information
stored in the Fragment Block:


-Inode
|-Meta-data
||-Data Block
|||-Data (2 seeks)
||-Fragment Block
|||-Fragment
||||-Data (3 seeks)


Fragment Blocks can store anything but Inodes, so it would be advisable
to avoid huge Blocks; if a Block is 25% of the disk, for example, one
Block must be dedicated to serving up Inode information.  Also, with
64KiB blocks, a 256TiB file system can be created.  Larger Blocks will
allow larger file systems; a larger file system will predictably house
more files (unless it houses one gargantuan relational data base-- the
only plausible situation where my design would require more, smaller
blocks to be more efficient).
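
To check my own arithmetic on the 256TiB figure (this assumes 32 bit
block numbers inside the file system, which is how I get there):

/* 2^32 blocks x 64KiB = 2^48 bytes = 256TiB */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t block_size = 64 * 1024;              /* 64KiB blocks          */
    uint64_t max_blocks = UINT64_C(1) << 32;      /* 32 bit block numbers  */
    uint64_t max_bytes  = block_size * max_blocks;
    printf("%llu TiB\n", (unsigned long long)(max_bytes >> 40));   /* 256  */
    return 0;
}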

Directories will have to be some sort of on disk BsomethingTree. . . B*,
B+, something.  I have no idea how to implement this :D  I'll assume I
treat the directory data as a logically contiguous file (yes I'm gonna
store directory data just like file data).  I could use a string hash of
the Directory Entry name with disambiguation at the end, except my
options here seem limited:


- Use a 32 bit string hash, 4 levels, 8 bits per level.  256 entries of
32 bit {offset} for the next level per level, 1024B per level node on unique
paths.  Worst case 1 path per file, 65536 files in the worst case
measure 1 root (1KiB) + 256 L2 (256KiB) + 65536 L3 (64M) + 65536 L4
(64M), total 128MiB; collision rate is probably low.  Ouch.

- Use a 16 bit string hash, 2 levels, 8 bits per level.  256 entries of
32 bit {offset} for the next level per level, 1024B per level node on unique
paths.  Worst case 1 path per file, 65536 files in the worst case measure
1 root (1KiB) + 256 L2 (256KiB), no disambiguation, though the collision
rate is probably higher than that.  Beyond 65536 entries, collisions appear.
Hitting a disambiguation situation is going to be fairly common.


I guess the second would be better?  I can't locate any directories on
my drive with >2000 entries *shrug*.  The end key is just the entry
{name,inode} pair.
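
To make the second scheme concrete, here is a rough sketch of the lookup
path; the exact layout, the hash and the structure names are placeholders
I'm assuming, not a settled format:

/* [256 x u32 root table][256 x u32 per second-level table][entry lists],
 * where each list holds {name,inode} entries sharing one 16 bit hash. */
#include <stdint.h>
#include <string.h>

struct dirent_key {
    uint64_t inode;            /* 64 bit physical inode location; 0 ends a list */
    char     name[256];
};

static uint16_t name_hash16(const char *name)
{
    uint16_t h = 0;
    while (*name)
        h = (uint16_t)(h * 31 + (unsigned char)*name++);
    return h;
}

/* dir points at the directory data, treated as one logically contiguous
 * file; returns the inode location, or 0 if the name is not present. */
static uint64_t dir_lookup(const uint8_t *dir, const char *name)
{
    uint16_t h = name_hash16(name);
    const uint32_t *root = (const uint32_t *)dir;
    uint32_t l2_off = root[h >> 8];             /* high 8 bits pick the root slot  */
    if (l2_off == 0)
        return 0;
    const uint32_t *l2 = (const uint32_t *)(dir + l2_off);
    uint32_t list_off = l2[h & 0xff];           /* low 8 bits pick the second slot */
    if (list_off == 0)
        return 0;
    const struct dirent_key *e = (const struct dirent_key *)(dir + list_off);
    for (; e->inode != 0; e++)                  /* disambiguate hash collisions    */
        if (strcmp(e->name, name) == 0)
            return e->inode;
    return 0;
}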

Any ideas?

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.



* Re: Designing Another File System
  2004-11-30  4:32 Designing Another File System John Richard Moser
@ 2004-11-30  7:16 ` Bernd Eckenfels
  2004-11-30 13:07 ` Helge Hafting
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Bernd Eckenfels @ 2004-11-30  7:16 UTC (permalink / raw)
  To: linux-kernel

In article <41ABF7C5.5070609@comcast.net> you wrote:
> when done; however, whatever the result, I'd assuredly give a free
> license to implement the resulting file system in GPL2 licensed code (if
> you're not making money off it, I'm not making money off it).

You can't restrict the GPL.

> I've examined briefly the overall concepts which go into several file
> systems and pooled them together to create my basic requirements.  In
> these include things such as:

It is not a good idea to publish your ideas if you want to patent them.

> - - localization of Inodes and related meta-data to prevent disk thrashing

Why do you assume inodes?

> - - 64 bit indices indicating the exact physical location on disk of
> Inodes, giving a O(1) seek to the Inode itself

How is that different from Logical Block Numbers with a fixed cluster
factor? (besides it wastes more bits)

> - - A design based around preventing information leaking by guaranteed
> secure erasure of data (insofar that the journal even will finish wiping
> out data when replaying a transaction)

Hopefully  optional.

> 2)  Are there any other security concerns aside from information leaking
> (deleted data being left over on disk, which may pop up in incompletely
> written files)?

IT security is also about availability and reliability.

> 3)  What basic information do I absolutely *need* in my super block?

How should WE know? (For mounting you might want to have versioning and UUIDs.)

> 5)  What basic information do I absolutely *need* in my directory
> entries? (I'm thinking {name,inode})

Think about optimizations like NTFS, where you can sort directories by
attributes and store often-used attributes in the directory entry, so you
don't need an inode dereference (an MFT lookup in the case of NTFS).
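
Something like the following hypothetical directory entry, which caches a
few hot attributes next to the name so stat()-heavy listings need not
dereference the inode at all.  The field choice is only an illustration of
the idea, not NTFS's (or anyone's) real on-disk layout:

#include <stdint.h>

struct dir_entry {
    uint64_t inode;          /* 64 bit physical inode location          */
    uint32_t mode;           /* cached type + permission bits           */
    uint64_t size;           /* cached file size                        */
    uint32_t mtime;          /* cached modification time (seconds)      */
    uint16_t name_len;
    char     name[];         /* entry name, name_len bytes              */
};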

> Directories will have to be some sort of on disk BsomethingTree. . . B*,
> B+, something.  I have no idea how to implement this :D  I'll assume I
> treat the directory data as a logically contiguous file (yes I'm gonna
> store directory data just like file data).  I could use a string hash of
> the Directory Entry name with disambiguation at the end, except my
> options here seem limited:

Don't forget to optimize directories for all 4 operations: insertion,
deletion, lookup and traversal.

> I guess the second would be better?  I can't locate any directories on
> my drive with >2000 entries *shrug*.  The end key is just the entry
> {name,inode} pair.

Do you want to run old-style NNTP servers?

> Any ideas?

In fact I rather miss a new idea here: what makes your FS better, and why
should anyone use a non-open-source implementation if it does not provide
anything new?

Greetings
Bernd


* Re: Designing Another File System
  2004-11-30  4:32 Designing Another File System John Richard Moser
  2004-11-30  7:16 ` Bernd Eckenfels
@ 2004-11-30 13:07 ` Helge Hafting
  2004-12-01  1:16   ` John Richard Moser
  2004-11-30 16:31 ` Alan Cox
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Helge Hafting @ 2004-11-30 13:07 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

John Richard Moser wrote:

>
> Greetings.
>
> I've been interested in file system design for a while, and I think I
> may be able to design one.  Being poor, I would like to patent/sell it
> when done;

Well, if you want to make money - you get to do the work.  Including
the initial research into filesystem design. 

> however, whatever the result, I'd assuredly give a free
> license to implement the resulting file system in GPL2 licensed code (if
> you're not making money off it, I'm not making money off it).  This is
> similar in principle to how Hans Reiser sells his file system I think.

GPL means no restrictions, you can't prevent, say, redhat from
selling distro CDs with your GPL'ed fs.  As for _patenting_ it, well,
people here generally dislike software patents for good reasons.
You don't need a patent for this either - a copyright is probably
what you're after.

>
> 2)  Are there any other security concerns aside from information leaking
> (deleted data being left over on disk, which may pop up in incompletely
> written files)?

Have you considered what kind of data loss you'll get if power fails
before stuff is written to disk, or while it is being written?  That sort
of thing is important to users, because it really happens a lot.  Can you
design so a good fsck for your fs is possible?

> 6)  How do I lay out a time field for atime/dtime/ctime/mtime?
>
Any way you like, as long as you actually store the information
in a retrievable way.
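For illustration only, one possible layout (an assumption of mine, not
anything the VFS prescribes): 64-bit seconds since the epoch plus a 32-bit
nanosecond part per timestamp.

#include <stdint.h>

struct fs_timestamp {
    int64_t  sec;            /* seconds since 1970-01-01, signed        */
    uint32_t nsec;           /* 0..999999999                            */
} __attribute__((packed));   /* 12 bytes per timestamp on disk          */

Four of these (atime/ctime/mtime/dtime) would cost 48 bytes per inode.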

>
> A few more specifics, as Block size grows, the file system does not lose
> efficiency.  Fragment Blocks should skillfully subdivide huge Blocks
> with a third level index, increasing seek time O(+1) for the information
> stored in the Fragment Block:
>
>
> - -Inode
> |-Meta-data
> ||-Data Block
> |||-Data (2 seeks)
> ||-Fragment Block
> |||-Fragment
> ||||-Data (3 seeks)
>
>
> Fragment Blocks can store anything but Inodes, so it would be advisable
> to avoid huge Blocks; if a Block is 25% of the disk, for example, one
> Block must be dedicated to serving up Inode information.  Also, with
> 64KiB blocks, a 256TiB file system can be created.  Larger Blocks will
> allow larger file systems; a larger file system will predictably house
> more files (unless it houses one gargantuan relational data base-- the
> only plausible situation where my design would require more, smaller
> blocks to be more efficient).
>
> Directories will have to be some sort of on disk BsomethingTree. . . B*,
> B+, something.  I have no idea how to implement this :D  I'll assume I

Then you have a lot to learn about fs design - before you can
design anything _sellable_.  Take a look at the details of
existing linux fs'es - and be happy that you can do that at all because
they aren't patented or otherwise restricted.

> treat the directory data as a logically contiguous file (yes I'm gonna
> store directory data just like file data).  I could use a string hash of
> the Directory Entry name with disambiguation at the end, except my
> options here seem limited:
>
>
> - - Use a 32 bit string hash, 4 levels, 8 bits per level.  256 entries of
> 32 bit {offset} for the next level per level, 1024B per level on unique
> paths.  Worst case 1 path per file, 65536 files in the worst case
> measure 1 root (1KiB) + 256 L2 (256KiB) + 65536 L3 (64M) + 65536 L4
> (64M), total 128MiB, collision rate is probably low.  Ouch.
>
> - - Use a 16 bit string hash, 2 levels, 8 bits per level.  256 entries of
> 32 bit {offset) for the next level per level, 1024B per level on unique
> path.  Worst case 1 path per file, 65536 files in the worst case measure
> 1 root (1KiB) + 256 L2 (256KiB), no disambiguation, though collision
> rate is probably higher than that.  Beyond 65536 creates collisions.
> Hitting a disambiguation situation is going to be fairly common.
>
>
> I guess the second would be better?  I can't locate any directories on
> my drive with >2000 entries *shrug*.  The end key is just the entry
> {name,inode} pair.

If you want this fs to be generically used, expect some people to show
up with millions of files in a directory or tens of millions. 
They will not be amused if the lookup isn't O(1). . .

There are still fs'es around that don't look up names in O(1)
time, but some O(1) fs'es exist.  So, to _sell_ you need to be among the
better ones.

> Any ideas?
>
Write a serious free fs, and get help here.  Or go commercial - sell
your fs and pay your developers and helpers.

Helge Hafting


* Re: Designing Another File System
  2004-11-30  4:32 Designing Another File System John Richard Moser
  2004-11-30  7:16 ` Bernd Eckenfels
  2004-11-30 13:07 ` Helge Hafting
@ 2004-11-30 16:31 ` Alan Cox
  2004-11-30 18:28 ` Valdis.Kletnieks
  2004-11-30 19:22 ` Phil Lougher
  4 siblings, 0 replies; 21+ messages in thread
From: Alan Cox @ 2004-11-30 16:31 UTC (permalink / raw)
  To: John Richard Moser; +Cc: Linux Kernel Mailing List

On Maw, 2004-11-30 at 04:32, John Richard Moser wrote:
> I've been interested in file system design for a while, and I think I
> may be able to design one.  Being poor, I would like to patent/sell it
> when done; however, whatever the result, I'd assuredly give a free
> license to implement the resulting file system in GPL2 licensed code 

Several other vendors have followed this kind of dual model - MySQL,
Troll Tech and Sleepycat being three obvious examples.

> - - 64 bit indices indicating the exact physical location on disk of
> Inodes, giving a O(1) seek to the Inode itself

Until you get a bad block 8) or want to find it in memory (which is the
usual case)

> 1)  Can Unix utilities in general deal with 64 bit Inodes?  (Most
> programs I assume won't care; ls -i and df -i might have trouble)

You would have to ask a unix source code licensee. For Linux, inodes can
be 64-bit on 64-bit platforms, although you would undoubtedly find some
oddments that were not resolved.

> 4)  What basic information do I absolutely *need* in my Inodes? (I'm
> thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
> links}

See POSIX 1003.1 and the Single Unix Specification. That defines the
behaviour.

> 5)  What basic information do I absolutely *need* in my directory
> entries? (I'm thinking {name,inode})

Ditto 

> 6)  How do I lay out a time field for atime/dtime/ctime/mtime?

Internal issue to the file system.




* Re: Designing Another File System
  2004-11-30 18:28 ` Valdis.Kletnieks
@ 2004-11-30 17:46   ` Alan Cox
  2004-11-30 19:00     ` Jan Engelhardt
                       ` (2 more replies)
  2004-12-01  1:35   ` John Richard Moser
  1 sibling, 3 replies; 21+ messages in thread
From: Alan Cox @ 2004-11-30 17:46 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: John Richard Moser, Linux Kernel Mailing List

On Maw, 2004-11-30 at 18:28, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:
> they punt on the issue of over-writing a sector that's been re-allocated by
> the hardware (apparently the chances of critical secret data being left in
> a reallocated block but still actually readable are "low enough" not to worry).

I guess they never consider CF cards, which internally are log structured
and for which such erase operations are very close to pointless.



* Re: Designing Another File System
  2004-11-30  4:32 Designing Another File System John Richard Moser
                   ` (2 preceding siblings ...)
  2004-11-30 16:31 ` Alan Cox
@ 2004-11-30 18:28 ` Valdis.Kletnieks
  2004-11-30 17:46   ` Alan Cox
  2004-12-01  1:35   ` John Richard Moser
  2004-11-30 19:22 ` Phil Lougher
  4 siblings, 2 replies; 21+ messages in thread
From: Valdis.Kletnieks @ 2004-11-30 18:28 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel


On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:

(Somebody else can address the performance issues - I'll stick to the
parts I understand.. ;)

> - - A design based around preventing information leaking by guaranteed
> secure erasure of data (insofar that the journal even will finish wiping
> out data when replaying a transaction)

Consider carefully what threat model you have here - your choices will be
constrained by it.  If you don't know your threat model, it's time to look
for other features to design around...

(your comment indicates a level of confusion regarding what paranoia is desired)

Note well that you need to define "Guaranteed secure erasure" - at least for
US DOD contractors, DOD 5220-22.M requires 3 over-writes (a character, its
complement, and random) plus verification.  That's good for SECRET and below - but
TOP SECRET and higher still require physical destruction of the media.  Also,
they punt on the issue of over-writing a sector that's been re-allocated by
the hardware (apparently the chances of critical secret data being left in
a reallocated block but still actually readable are "low enough" not to worry).
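
For illustration, a minimal user-space sketch of that overwrite pattern
(character, complement, pseudo-random, then verify).  It only demonstrates
the sequence - it is not a certified sanitization tool, and it does nothing
about sectors the drive has silently remapped:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static int wipe_pass(int fd, off_t len, unsigned seed, int use_rand, uint8_t pat)
{
    uint8_t buf[4096];
    if (use_rand)
        srand(seed);
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    for (off_t done = 0; done < len; ) {
        size_t n = (len - done) < (off_t)sizeof(buf) ? (size_t)(len - done) : sizeof(buf);
        for (size_t i = 0; i < n; i++)
            buf[i] = use_rand ? (uint8_t)rand() : pat;
        if (write(fd, buf, n) != (ssize_t)n)
            return -1;
        done += n;
    }
    return fsync(fd);
}

static int verify_rand_pass(int fd, off_t len, unsigned seed)
{
    uint8_t buf[4096];
    srand(seed);                              /* regenerate the same stream */
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    for (off_t done = 0; done < len; ) {
        size_t n = (len - done) < (off_t)sizeof(buf) ? (size_t)(len - done) : sizeof(buf);
        if (read(fd, buf, n) != (ssize_t)n)
            return -1;
        for (size_t i = 0; i < n; i++)
            if (buf[i] != (uint8_t)rand())
                return -1;                    /* verification failed */
        done += n;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }
    off_t len = lseek(fd, 0, SEEK_END);
    unsigned seed = 0x5eed;                   /* fixed seed so the last pass can be verified */
    int rc = wipe_pass(fd, len, 0, 0, 0x55)   /* pass 1: a character     */
          || wipe_pass(fd, len, 0, 0, 0xAA)   /* pass 2: its complement  */
          || wipe_pass(fd, len, seed, 1, 0)   /* pass 3: pseudo-random   */
          || verify_rand_pass(fd, len, seed); /* read back and compare   */
    close(fd);
    return rc ? 1 : 0;
}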

Also, 5220-22.M is more concerned with the exposure of "You surplus the machine
and forgot to wipe the drives" - there's a whole *different* set of rules
regarding how you keep an attacker who steals the system (physical access
issues) or gets root access (this is a *tough* one) from scavenging the
drives....

> 2)  Are there any other security concerns aside from information leaking
> (deleted data being left over on disk, which may pop up in incompletely
> written files)?

Which of the following are you worried about:

1) On a filesystem that does metadata journalling but not data journalling,
it's possible for a file to be extended by a write(), the metadata is
journalled, but the system fails before the actual data makes it to disk.  As a
result, after the system comes up, stale data is present in the file, causing a
small data exposure and large reliability issues (opening a file with
OpenOffice will almost certainly cause Very Bad Errors if it's a file that was
in the process of being saved when the system went down, so you're actually
reading blocks of somebody else's old Fortran source code...)  Note that this
exposure does *NOT* need you to clear data out of the journal - merely to
journal data (or find other ways to guarantee you never allocate a stale block
of data).  This is why I suggest that you're unclear regarding your threat
model.

2) If you're worried about a well-funded attacker managing to scavenge secure
data out of the journal, what do you do about the attacker scavenging *the rest
of the disk, including existing files*?  (Hint - cryptoloop is the correct
answer here, unless you think Jaari Russo is right, in which case *his*
encrypted filesystem stuff is the right answer).

Alternatively, you may wish to consider a filesystem that does crypto on a
per-file basis - be prepared to deal with some very hairy key-management and
information leakage problems if you pursue this route...

In any case, both cryptoloop and Jaari's loop-aes address the "disk captured by
the attacker" issues, but don't do much for "attacker gets root" - that's a
whole DIFFERENT set of problems...




* Re: Designing Another File System
  2004-11-30 17:46   ` Alan Cox
@ 2004-11-30 19:00     ` Jan Engelhardt
  2004-11-30 19:14     ` Valdis.Kletnieks
  2004-12-02 23:32     ` David Woodhouse
  2 siblings, 0 replies; 21+ messages in thread
From: Jan Engelhardt @ 2004-11-30 19:00 UTC (permalink / raw)
  Cc: Linux Kernel Mailing List

>I guess they never consider CF cards which internally are log structured
>and for whom such erase operations are very close to pointless.

Don't forget WORM (in "multi session", of course).



Jan Engelhardt
-- 
ENOSPC


* Re: Designing Another File System
  2004-11-30 17:46   ` Alan Cox
  2004-11-30 19:00     ` Jan Engelhardt
@ 2004-11-30 19:14     ` Valdis.Kletnieks
  2004-11-30 20:22       ` Valdis.Kletnieks
  2004-12-02 23:32     ` David Woodhouse
  2 siblings, 1 reply; 21+ messages in thread
From: Valdis.Kletnieks @ 2004-11-30 19:14 UTC (permalink / raw)
  To: Alan Cox; +Cc: John Richard Moser, Linux Kernel Mailing List


On Tue, 30 Nov 2004 17:46:10 GMT, Alan Cox said:
> On Maw, 2004-11-30 at 18:28, Valdis.Kletnieks@vt.edu wrote:
> > On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:
> > they punt on the issue of over-writing a sector that's been re-allocated by
> > the hardware (apparently the chances of critical secret data being left in
> > a reallocated block but still actually readable are "low enough" not to worry).
> 
> I guess they never consider CF cards which internally are log structured
> and for whom such erase operations are very close to pointless.

The 3-overwrite requirement is for "rigid disks" only, and for "sanitize" operations
required when the media is being released from continuous protection (basically,
when you're disposing of the drive).  Clearing (for when the drive is being
kept, but re-used) only requires 1 pass.  A CF card would be handled
under different rules - I'm pretty sure it would be treated as the appropriate
class of "memory" (though I admit not knowing which technology that would be).

See "Clearing and sanitization matrix" from the bottom of:
http://www.dss.mil/isec/chapter8.htm




* Re: Designing Another File System
  2004-11-30  4:32 Designing Another File System John Richard Moser
                   ` (3 preceding siblings ...)
  2004-11-30 18:28 ` Valdis.Kletnieks
@ 2004-11-30 19:22 ` Phil Lougher
  2004-12-01  1:42   ` John Richard Moser
  2004-12-02 20:18   ` Jan Knutar
  4 siblings, 2 replies; 21+ messages in thread
From: Phil Lougher @ 2004-11-30 19:22 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

On Mon, 29 Nov 2004 23:32:05 -0500, John Richard Moser
<nigelenki@comcast.net> wrote:

> - - localization of Inodes and related meta-data to prevent disk thrashing

All filesystems place their filesystem metadata inside the inodes.  If
you mean file metadata then please be more precise.  This isn't
terribly new; recent posts have discussed how moving EAs/ACLs inside
the inode for ext3 has sped up performance.

> 
> - - a scheme which allows Inodes to be dynamically allocated and
> deallocated out of order
>

Um, all filesystems do that; I think you're missing words to the
effect of "without any performance loss or block fragmentation"!

> - - 64 bit indices indicating the exact physical location on disk of
> Inodes, giving a O(1) seek to the Inode itself

> 1)  Can Unix utilities in general deal with 64 bit Inodes?  (Most
> programs I assume won't care; ls -i and df -i might have trouble)
> 

There seems to be some confusion here.  The filesystem can use 64 bit
inode numbers internally but hide these 64 bits and instead present
munged 32 bit numbers to Linux.

The 64 bit resolution is only necessary within the filesystem dentry
lookup function to go from a directory name entry to the physical
inode location on disk.  The inode number can then be reduced to 32
bits for 'presentation' to the VFS.  AFAIK, as all file access is
through the dentry cache, this is sufficient.  The only problem is
that VFS iget() needs to be replaced with a filesystem-specific iget.
A number of filesystems do this.  Squashfs internally uses 37 bit
inode numbers and presents them as 32 bit inode numbers in this way.
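
A rough illustration of the munging idea (not Squashfs's actual code): the
filesystem keeps the 64 bit on-disk inode location for its own dentry
lookup / iget path, and hands the VFS a folded 32 bit number purely for
presentation:

#include <stdint.h>

struct my_inode_ref {
    uint64_t disk_location;     /* exact physical location of the inode */
};

static uint32_t present_ino(const struct my_inode_ref *ref)
{
    uint64_t x = ref->disk_location;
    uint32_t ino = (uint32_t)(x ^ (x >> 32));   /* fold 64 bits into 32 */
    return ino ? ino : 1;                       /* never present ino 0  */
}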

> 3)  What basic information do I absolutely *need* in my super block?
> 4)  What basic information do I absolutely *need* in my Inodes? (I'm
> thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
> links}

Very much depends on your filesystem.  Cramfs is a good example of the
minimum you need to store to satisfy the Linux VFS.  If you don't care
what they are almost anything can be invented (uid, gid, mode, atime,
dtime etc) and set to a useful default.  The *absolute* minimum is
probably type, file/dir size, and file/dir data location on disk.
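
As a purely hypothetical example, combining that minimum with the fields
from the original post (names and widths are my guesses, not a proposed
on-disk format):

#include <stdint.h>

struct ondisk_inode {
    uint16_t type;           /* regular, directory, symlink, device, ...   */
    uint16_t posix_dac;      /* permission bits                            */
    uint32_t links;          /* hard link count                            */
    uint32_t uid, gid;
    uint64_t size;           /* file/dir size in bytes                     */
    uint64_t meta_offset;    /* where the block/extent metadata lives      */
    uint64_t atime, ctime, mtime, dtime;  /* however time ends up encoded  */
} __attribute__((packed));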

> I guess the second would be better?  I can't locate any directories on
> my drive with >2000 entries *shrug*.  The end key is just the entry
> {name,inode} pair.

I've had people trying to store 500,000 + files in a Squashfs
directory.  Needless to say with the original directory implementation
this didn't work terribly well...

Phillip Lougher


* Re: Designing Another File System
  2004-11-30 19:14     ` Valdis.Kletnieks
@ 2004-11-30 20:22       ` Valdis.Kletnieks
  0 siblings, 0 replies; 21+ messages in thread
From: Valdis.Kletnieks @ 2004-11-30 20:22 UTC (permalink / raw)
  To: Alan Cox, John Richard Moser, Linux Kernel Mailing List


On Tue, 30 Nov 2004 14:14:10 EST, Valdis.Kletnieks@vt.edu said:

> See "Clearing and sanitization matrix" from the bottom of:
> http://www.dss.mil/isec/chapter8.htm

"I'm sorry, that's an old, worn out magic word" ;)

For what it's worth, that table is in the original 1995 version.
"Change 2" in Feb 01 completely redid chapter 8, and removed said
table.  It's unclear what, if any, replacement table should be used.
I'll follow up if I find out anything specific - the current text just says:

"Instructions on clearing, sanitization and release of IS media shall be issued
by the accrediting CSA."





* Re: Designing Another File System
  2004-11-30 13:07 ` Helge Hafting
@ 2004-12-01  1:16   ` John Richard Moser
  0 siblings, 0 replies; 21+ messages in thread
From: John Richard Moser @ 2004-12-01  1:16 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel




Helge Hafting wrote:
| John Richard Moser wrote:
|
|>
|> Greetings.
|>
|> I've been interested in file system design for a while, and I think I
|> may be able to design one.  Being poor, I would like to patent/sell it
|> when done;
|
|
| Well, if you want to make money - you get to do the work.  Including
| the initial research into filesystem design.
|

Eh, you're probably right; but I thought all idealistic open source
advocates considered the fruits of the work to be proper return to the
community >:P

Anyway, I dunno how I'd go about any of that.  The thoughts have
crossed my mind though, as they should yours when you're poor and living
in your parents' house.

I'll definitely code it myself if I decide to sell licenses to use the
code in closed-source products; I can't sell other people's work, it
doesn't work that way.

|> however, whatever the result, I'd assuredly give a free
|> license to implement the resulting file system in GPL2 licensed code (if
|> you're not making money off it, I'm not making money off it).  This is
|> similar in principle to how Hans Reiser sells his file system I think.
|
|
| GPL means no restrictions, you can't prevent, say, redhat from
| selling distro CDs with your GPL'ed fs.  As for _patenting_ it, well,
| people here generally dislike software patents for good reasons.
| You don't need a patent for this either - a copyright is probably
| what you're after.
|

MySQL, Dan's Guardian.  They cost for business use.

As for "no restrictions," remember that GPL means you can't just
incorporate the code into your closed source product.  A few GPL
projects have been known to sell separate licenses (reiserfs I thought
did this) to allow such an incorporation of their work.  If, say, Apple
wants *me* to write them a driver for the FS, fine, they can pay me
royalties too.

|>
|> 2)  Are there any other security concerns aside from information leaking
|> (deleted data being left over on disk, which may pop up in incompletely
|> written files)?
|
|
| Have you considered what kind of data loss you'll get if power fails
| before stuff is written to disk, or while it is being written?  That sort
| of thing is important to users, because it really happens a lot.  Can you
| design so a good fsck for your fs is possible?
|

Data loss can't really be helped.  Either the data makes it to disk or
it doesn't; data goes after allocation, but even a full journal won't
save me data loss.  A rollback journal "might," if the entire change to
the file from fopen() to fclose() is exactly one transaction; else, no.

As for a good fsck, call Sally.  :)

Actually an fsck shouldn't be too hard; inode blocks and fragment blocks
will be self-identifying and self-indexing, so even if the bitmap is lost,
the whole thing can be rebuilt *if* you know the exact offset of the
data area, the FS version, and the block size.

|> 6)  How do I lay out a time field for atime/dtime/ctime/mtime?
|>
| Any way you like, as long as you actually store the information
| in a retrieveable way.
|

What information is that?

|>
|> A few more specifics, as Block size grows, the file system does not lose
|> efficiency.  Fragment Blocks should skillfully subdivide huge Blocks
|> with a third level index, increasing seek time O(+1) for the information
|> stored in the Fragment Block:
|>
|>
|> - -Inode
|> |-Meta-data
|> ||-Data Block
|> |||-Data (2 seeks)
|> ||-Fragment Block
|> |||-Fragment
|> ||||-Data (3 seeks)
|>
|>
|> Fragment Blocks can store anything but Inodes, so it would be advisable
|> to avoid huge Blocks; if a Block is 25% of the disk, for example, one
|> Block must be dedicated to serving up Inode information.  Also, with
|> 64KiB blocks, a 256TiB file system can be created.  Larger Blocks will
|> allow larger file systems; a larger file system will predictably house
|> more files (unless it houses one gargantuan relational data base-- the
|> only plausible situation where my design would require more, smaller
|> blocks to be more efficient).
|>
|> Directories will have to be some sort of on disk BsomethingTree. . . B*,
|> B+, something.  I have no idea how to implement this :D  I'll assume I
|
|
| Then you have a lot to learn about fs design - before you can
| design anything _sellable_.  Take a look at the details of
| existing linux fs'es - and be happy that you can do that at all because
| they aren't patented or otherwise restricted.

I still think patents are a good thing, if the patent owners have half a
brain.  For example, Judy Arrays ( http://judy.sf.net/ ) are patented by
HP; their design is LGPL though.  "Patented" doesn't equate to "out
of reach" by nature, just by controller.

|
|> treat the directory data as a logically contiguous file (yes I'm gonna
|> store directory data just like file data).  I could use a string hash of
|> the Directory Entry name with disambiguation at the end, except my
|> options here seem limited:
|>
|>
|> - - Use a 32 bit string hash, 4 levels, 8 bits per level.  256 entries of
|> 32 bit {offset} for the next level per level, 1024B per level on unique
|> paths.  Worst case 1 path per file, 65536 files in the worst case
|> measure 1 root (1KiB) + 256 L2 (256KiB) + 65536 L3 (64M) + 65536 L4
|> (64M), total 128MiB, collision rate is probably low.  Ouch.
|>
|> - - Use a 16 bit string hash, 2 levels, 8 bits per level.  256 entries of
|> 32 bit {offset) for the next level per level, 1024B per level on unique
|> path.  Worst case 1 path per file, 65536 files in the worst case measure
|> 1 root (1KiB) + 256 L2 (256KiB), no disambiguation, though collision
|> rate is probably higher than that.  Beyond 65536 creates collisions.
|> Hitting a disambiguation situation is going to be fairly common.
|>
|>
|> I guess the second would be better?  I can't locate any directories on
|> my drive with >2000 entries *shrug*.  The end key is just the entry
|> {name,inode} pair.
|
|
| If you want this fs to be generically used, expect some people to show
| up with millions of files in a directory or tens of millions. They will
| not be amused if the lookup isn't O(1). . .
|
| There are still fs'es around that don't look up names in O(1)
| time, but some O(1) fs'es exist.  So, to _sell_ you need to be among the
| better ones.
|
|> Any ideas?
|>
| Write a serious free fs, and get help here.  Or go commercial - sell
| your fs and pay your developers and helpers.
|

Well, I'm not Hans Reiser, so I guess DARPA isn't going to give me $13M;
selling this thing (even if I can outrank everyone in the world by leaps
and bounds) isn't going to make me instantly rich.  Even if it would,
nobody's gonna help me with it like that; and I don't have the skill to
do this myself.

Ah why not.

| Helge Hafting
|

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.


* Re: Designing Another File System
  2004-11-30 18:28 ` Valdis.Kletnieks
  2004-11-30 17:46   ` Alan Cox
@ 2004-12-01  1:35   ` John Richard Moser
  1 sibling, 0 replies; 21+ messages in thread
From: John Richard Moser @ 2004-12-01  1:35 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel




Valdis.Kletnieks@vt.edu wrote:
| On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:
|
| (Somebody else can address the performance issues - I'll stick to the
| parts I understand.. ;)
|
|
|>- - A design based around preventing information leaking by guaranteed
|>secure erasure of data (insofar that the journal even will finish wiping
|>out data when replaying a transaction)
|
|
| Consider carefully what threat model you have here - your choices will be
| constrained by it.  If you don't know your threat model, it's time to look
| for other features to design around...
|

:)

| (your comment indicates a level of confusion regarding what paranoia is desired)
|
| Note well that you need to define "Guaranteed secure erasure" - at least for
| US DOD contractors, DOD 5220-22.M requires 3 over-writes (a character, its
| complement, and random) plus verification.  That's good for SECRET and below - but

I've read DOD 5220-22.M chapte. . . oh fuck  it *cut* *paste*

Strips of up to 512 contiguous bytes may be written at a time during the
erasure, and may possibly consist of overwriting all affected locations
with a random pattern, binary zeros, binary ones, then a random
character, and verifying, as derived by combining the methods given in
DOD 5220.22-M[6], Chapter 8, section 8-306, for sanitizing removable and
non-removable rigid disks using method (d) and EEPROMs (i.e. memory
cards for digital cameras and music players) (h).


^^^  From my original notes in the document I'm writing where I first
alluded to a new "secure" file system (which quickly became a much
higher goal).

| TOP SECRET and higher still require physical destruction of the media.  Also,
| they punt on the issue of over-writing a sector that's been re-allocated by
| the hardware (apparently the chances of critical secret data being left in
| a reallocated block but still actually readable are "low enough" not to worry).
|

I can't make the FS burn the disk, sorry.  You need to use kerosene.

| Also, 5220-22.M is more concerned with the exposure of "You surplus the machine
| and forgot to wipe the drives" - there's a whole *different* set of rules
| regarding how you keep an attacker who steals the system (physical access
| issues) or gets root access (this is a *tough* one) from scavenging the
| drives....
|

Hey, I'm working toward a general setting.  Nobody said this was going to be
"We've taken care of everything, just rm -rf / and you're good to go."
There are certain guarantees that I can't make from the FILE SYSTEM;
they require outside forces.  What I can guarantee is that you won't
crack open shopping_list.txt after a crash and see a part of previously
erased secret_military_assult_on_iranian_nuclear_breeder_reactors.txt

|
|>2)  Are there any other security concerns aside from information leaking
|>(deleted data being left over on disk, which may pop up in incompletely
|>written files)?
|
|
| Which of the following are you worried about:
|
| 1) On a filesystem that does metadata journalling but not data journalling,
| it's possible for a file to be extended by a write(), the metadata is
| journalled, but the system fails before the actual data makes it to disk.  As a
| result, after the system comes up, stale data is present in the file, causing a
| small data exposure

Clipped that off already, that's what I was talking about in the
original message ("information leaking").

| and large reliability issues (opening a file with
| OpenOffice will almost certainly cause Very Bad Errors if it's a file that was
| in the process of being saved when the system went down, so you're actually
| reading blocks of somebody else's old Fortran source code...)

I can't help data loss; it's impossible.  The data either gets to disk
or it doesn't.  Unless one big write() was made between fopen() and
fclose(), you have a window of fuxxx0rzd data if the machine goes down--
even with roll forward and roll back full data journaling.

| Note that this
| exposure does *NOT* need you to clear data out of the journal - merely to
| journal data (or find other ways to guarantee you never allocate a stale block
| of data).  This is why I suggest that you're unclear regarding your threat
| model.
|

If I make numerous write() calls from an app, the fs will journal them
separately, or periodically, or run out of journal space (actually my
journal design would expand the journal until it ran out of FS space)
trying to heuristically determine what is one write and prevent data
destruction.

When you talk about data loss, remember that even full journaling relies
on data making it to disk.  Stop trying to make it safe to flip the big
switch; you're not gonna do it.

| 2) If you're worried about a well-funded attacker managing to scavenge secure
| data out of the journal, what do you do about the attacker scavenging *the rest
| of the disk, including existing files*?  (Hint - cryptoloop is the correct
| answer here, unless you think Jaari Russo is right, in which case *his*
| encrypted filesystem stuff is the right answer).

cryptoloop or crypto extensions are a nice answer, though the journal
needs to be unencrypted.  Cryptoloop + journal == bad, unless the journal is
on another device.  :)

as for well-funded attackers, eh.  The built-in secure erase is mainly
paranoia, although it makes `rm -rf /` THE utility to perform the above
erasure method on all file data.  Encryption extensions are the only way
to ensure instant destruction (secure delete takes time) without a
degausser :)  Of course, I could rm -f ~/.fs/my_cypher_key and a cypher
extension would be left keyless.  Cypher *extension* mind you; I'm not
designing encryption into the core of the FS, just leaving it a
possibility in the future (xattrs-- this is the same intention ext2 had
originally iirc regarding compression).

|
| Alternatively, you may wish to consider a filesystem that does crypto on a
| per-file basis - be prepared to deal with some very hairy key-management and
| information leakage problems if you pursue this route...
|

yeah, could embed the user's keys into the FS and use a utility to erase
them in emergency; and I could also use a special key loaded from a
floppy or usb stick and kept (frobnicated) in memory to encrypt "more
sensitive" files.  Need to destroy the keys?  blow_my_key or
blow_all_keys, and set fire to your usb stick.

| In any case, both cryptoloop and Jaari's loop-aes address the "disk captured by
| the attacker" issues, but don't do much for "attacker gets root" - that's a
| whole DIFFERENT set of problems...
|

"attacker gets root" is doable, but only with a multilayered
defense-in-depth matrix with host based intrusion prevention systems.

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.



* Re: Designing Another File System
  2004-11-30 19:22 ` Phil Lougher
@ 2004-12-01  1:42   ` John Richard Moser
  2004-12-01  2:46     ` Phil Lougher
  2004-12-02 20:18   ` Jan Knutar
  1 sibling, 1 reply; 21+ messages in thread
From: John Richard Moser @ 2004-12-01  1:42 UTC (permalink / raw)
  To: Phil Lougher; +Cc: linux-kernel




Phil Lougher wrote:
| On Mon, 29 Nov 2004 23:32:05 -0500, John Richard Moser
| <nigelenki@comcast.net> wrote:
|
|
|>- - localization of Inodes and related meta-data to prevent disk thrashing
|
|
| All filesystems place their filesystem metadata inside the inodes.  If
| you mean file metadata then please be more precise.  This isn't
| terribly new, recent posts have discussed how moving eas/acls inside
| the inode for ext3 has sped up performance.
|

I got the idea from FFS and its derivatives (UFS, UFS2, EXT2).  I don't
want to store xattrs inside inodes though; I want them in the same block
with the inode.  A few ms for the seek, but eh, the data's right there,
not on the other side of the disk.

|
|>- - a scheme which allows Inodes to be dynamically allocated and
|>deallocated out of order
|>
|
|
| Um,  all filesystems do that, I think you're missing words to the
| effect "without any performance loss or block fragmentation" !

All filesystems allow you to create the FS with 1 inode total?


Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/hda1            7823616  272670 7550946    4% /

No, it looks like this one allocates many inodes and uses them as it
goes.  Reiser has 0 inodes . . .

|
|
|>- - 64 bit indices indicating the exact physical location on disk of
|>Inodes, giving a O(1) seek to the Inode itself
|
|
|>1)  Can Unix utilities in general deal with 64 bit Inodes?  (Most
|>programs I assume won't care; ls -i and df -i might have trouble)
|>
|
|
| There seems to be some confusion here.  The filesystem can use 64 bit
| inode numbers internally but hide these 64 bits and instead present
| munged 32 bit numbers to Linux.
|
| The 64 bit resolution is only necessary within the filesystem dentry
| lookup function to go from a directory name entry to the physical
| inode location on disk.  The inode number can then be reduced to 32
| bits for 'presentation' to the VFS.  AFAIK as all file access is
| through the dentry cache this is sufficient.  The only problems are
| that VFS iget() needs to be replaced with a filesystem specific iget.
| A number of filesystems do this.  Squashfs internally uses 37 bit
| inode numbers and presents them as 32 bit inode numbers in this way.
|

Ugly, but OK.  What happens when I actually have >4G inodes though?

|
|>3)  What basic information do I absolutely *need* in my super block?
|>4)  What basic information do I absolutely *need* in my Inodes? (I'm
|>thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
|>links}
|
|
| Very much depends on your filesystem.  Cramfs is a good example of the
| minimum you need to store to satisfy the Linux VFS.  If you don't care
| what they are almost anything can be invented (uid, gid, mode, atime,
| dtime etc) and set to a useful default.  The *absolute* minimum is
| probably type, file/dir size, and file/dir data location on disk.

I meant basic, not for me.  Basic things a real Unix filesystem needs.
What *I* need comes from my head.  :)

|
|
|>I guess the second would be better?  I can't locate any directories on
|>my drive with >2000 entries *shrug*.  The end key is just the entry
|>{name,inode} pair.
|
|
| I've had people trying to store 500,000 + files in a Squashfs
| directory.  Needless to say with the original directory implementation
| this didn't work terribly well...
|

Ouch.  Someone told me the directory had to be O(1) lookup . . . .

| Phillip Lougher
|

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.



* Re: Designing Another File System
  2004-12-01  1:42   ` John Richard Moser
@ 2004-12-01  2:46     ` Phil Lougher
  2004-12-01  4:32       ` Bernd Eckenfels
  2004-12-01  5:11       ` John Richard Moser
  0 siblings, 2 replies; 21+ messages in thread
From: Phil Lougher @ 2004-12-01  2:46 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

On Tue, 30 Nov 2004 20:42:38 -0500, John Richard Moser
> Phil Lougher wrote:
> | Um,  all filesystems do that, I think you're missing words to the
> | effect "without any performance loss or block fragmentation" !
>
> All filesystems allow you to create the FS with 1 inode total?
>
> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> /dev/hda1            7823616  272670 7550946    4% /
>
> No, it looks like this one allocates many inodes and uses them as it
> goes.  Reiser has 0 inodes . . .

Yes you're right.  What I said was total rubbish.  I read your
statement as meaning to dynamically allocate/deallocate inodes from a
set configured at filesystem creation...

> |
> | The 64 bit resolution is only necessary within the filesystem dentry
> | lookup function to go from a directory name entry to the physical
> | inode location on disk.  The inode number can then be reduced to 32
> | bits for 'presentation' to the VFS.  AFAIK as all file access is
> | through the dentry cache this is sufficient.  The only problems are
> | that VFS iget() needs to be replaced with a filesystem specific iget.
> | A number of filesystems do this.  Squashfs internally uses 37 bit
> | inode numbers and presents them as 32 bit inode numbers in this way.
> |
>
> Ugly, but ok.  What happens when i actually have >4G inodes though?

Well this is an issue that affects all filesystems on 32-bit systems
(as Alan said inode numbers are 64 bits on 64 bit systems).  To be
honest I've never let this worry me...

A 32-bit system can never cache 4G inodes in memory without falling
over, so a simple solution would be to allocate the 32-bit inode
number dynamically (e.g. starting at one and going up, keeping track
of inode numbers still in use for when/if wraparound occurred).  This
would guarantee inode number uniqueness for the subset of file inodes
in memory at any one time, with the drawback that the inode numbers
allocated to files will change across filesystem mounts.  Alternatively,
from reading fs/inode.c it appears that inode numbers only need to be
unique if the fast hashed inode cache lookup functions are used; there
are other inode cache lookup functions that can be used if inode
numbers are not unique.
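
A standalone sketch of that dynamic allocation idea (my illustration, not
fs/inode.c or Squashfs code): hand out 32 bit numbers starting at one,
remember which numbers are still held by in-memory inodes, and skip those
after wraparound.  Numbers are unique only among inodes currently in
memory and change across mounts, as noted above:

#include <stdint.h>

#define MAX_CACHED 1024          /* inodes in memory at once (demo size only) */

static uint32_t in_use[MAX_CACHED];
static unsigned n_in_use;
static uint32_t next_ino = 1;

static int ino_in_use(uint32_t ino)
{
    for (unsigned i = 0; i < n_in_use; i++)
        if (in_use[i] == ino)
            return 1;
    return 0;
}

static uint32_t alloc_ino(void)
{
    uint32_t ino;
    do {
        ino = next_ino++;
        if (ino == 0)            /* wrapped around: inode 0 is reserved */
            ino = next_ino++;
    } while (ino_in_use(ino));
    if (n_in_use < MAX_CACHED)
        in_use[n_in_use++] = ino;
    return ino;
}

static void release_ino(uint32_t ino)   /* when the in-memory inode goes away */
{
    for (unsigned i = 0; i < n_in_use; i++)
        if (in_use[i] == ino) {
            in_use[i] = in_use[--n_in_use];
            return;
        }
}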

> | I've had people trying to store 500,000 + files in a Squashfs
> | directory.  Needless to say with the original directory implementation
> | this didn't work terribly well...
> |
> 
> Ouch.  Someone told me the directory had to be O(1) lookup . . . .

Ideally yes, but ultimately with your filesystem you make the rules
:-)   The Squashfs directory design was fast for the original expected
directory size (ideally <= 64K, maximum 512K) seen on embedded
systems.  The next release of Squashfs has considerably improved
indexed directories which are O(1) for large directories.  To be
released sometime soon, if anyone's interested...

Phillip Lougher


* Re: Designing Another File System
  2004-12-01  2:46     ` Phil Lougher
@ 2004-12-01  4:32       ` Bernd Eckenfels
  2004-12-01  7:51         ` Phil Lougher
  2004-12-01  5:11       ` John Richard Moser
  1 sibling, 1 reply; 21+ messages in thread
From: Bernd Eckenfels @ 2004-12-01  4:32 UTC (permalink / raw)
  To: linux-kernel

In article <cce9e37e04113018465091010f@mail.gmail.com> you wrote:
> systems.  The next release of Squashfs has considerably improved
> indexed directories which are O(1) for large directories.

Are you talking about time complexity of a name lookup over the
number of files in a directory? (You might also want to look at the
complexity relative to the name component length.) What data structure
that is not wasting memory allows you those lookups? Even if you consider
hashing the name a constant O(1) operation (which it isn't), you can still
get collisions.

Greetings
Bernd


* Re: Designing Another File System
  2004-12-01  2:46     ` Phil Lougher
  2004-12-01  4:32       ` Bernd Eckenfels
@ 2004-12-01  5:11       ` John Richard Moser
  1 sibling, 0 replies; 21+ messages in thread
From: John Richard Moser @ 2004-12-01  5:11 UTC (permalink / raw)
  To: Phil Lougher; +Cc: linux-kernel




Phil Lougher wrote:
|
| Yes you're right.  What I said was total rubbish.  I read your
| statement as meaning to dynamically allocate/deallocate inodes from a
| set configured at filesystem creation...
|

heh :P np

|
|>
|>Ugly, but ok.  What happens when i actually have >4G inodes though?
|
|
| Well this is an issue that affects all filesystems on 32-bit systems
| (as Alan said inode numbers are 64 bits on 64 bit systems).  To be
| honest I've never let this worry me...
|
| A 32-bit system can never cache 4G inodes in memory without falling
| over, and so a simple solution would be to allocate the 32-bit inode
| number dynamically (e.g. starting at one and going up, keeping track
| of inode numbers still used for use when/if wraparound occured), this
| would guarantee inode number uniqueness for the subset of file inodes
| in memory at any one time, with the drawback that inode numbers
| allocated to files will change over filesystem mounts.  Alternatively
| from reading fs/inode.c it appears that inode numbers only need to be
| unique if the fast hashed inode cache lookup functions are used, there
| are other inode cache lookup functions that can be used if inode
| numbers are not unique.
|

*blink*

slower next time.

|
|>| I've had people trying to store 500,000 + files in a Squashfs
|>| directory.  Needless to say with the original directory implementation
|>| this didn't work terribly well...
|>|
|>
|>Ouch.  Someone told me the directory had to be O(1) lookup . . . .
|
|
| Ideally yes, but ultimately with your filesystem you make the rules
| :-)   The Squashfs directory design was fast for the original expected
| directory size (ideally <= 64K, maximum 512K) seen on embedded
| systems.  The next release of Squashfs has considerably improved
| indexed directories which are O(1) for large directories.  To be
| released sometime soon, if anyone's interested...
|

hm.  If I ever get a basic design off the ground (within the next month,
if ever), would you be willing to work with me on putting on polish and
possibly coding it?  Preferably I'd like to find some time to learn how
to code an FS myself-- though my goal is to make "the perfect file
system," so I can't imagine what use that'd have ;) (well, I could fail
the second time around-- I sure fucked up my first attempt last year).
Maybe you could give me guidance when I get to that point, if you don't
have time/will to do (what will inevitably become the bulk of) the work?

Of course, you don't HAVE to, and don't have to commit to this now; but
if you want, I'll poke you later about it.

| Phillip Lougher
- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.



* Re: Designing Another File System
  2004-12-01  4:32       ` Bernd Eckenfels
@ 2004-12-01  7:51         ` Phil Lougher
  0 siblings, 0 replies; 21+ messages in thread
From: Phil Lougher @ 2004-12-01  7:51 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: linux-kernel, phillip

On Wed, 01 Dec 2004 05:32:29 +0100, Bernd Eckenfels
<ecki-news2004-05@lina.inka.de> wrote:
> In article <cce9e37e04113018465091010f@mail.gmail.com> you wrote:
> > systems.  The next release of Squashfs has considerably improved
> > indexed directories which are O(1) for large directories.
> 
> Are you talking about time complexity based on a named lookup over the
> number of files in a directory? (you might also want to look at the
> complexity relative to the naming component length). What data structure
> which is not wasting memory allows you those lookups? Even if you consider
> hashing the name as a constant O(1) operation (which it isnt), you still can
> get collisions.
> 

Ouch.  I was being very loose with the O(1) operation definition.
Squashfs dentry operations have particular performance problems
because directories are compressed in blocks of 8K (BTW, AFAIK
Squashfs is the only Linux fs that compresses metadata).  With the
original Squashfs directories, lookup was purely O(n): directory
scanning for a file started at the directory start and finished when
the file was found.  The major performance hit here was that the bigger the
directory, the more blocks needed to be decompressed to scan
the directory.  If the directory fitted inside the internal Squashfs
decompressed metadata cache (eight 8K blocks), then once the directory
was scanned, directory lookups occurring soon afterwards didn't need to
decompress the data (being satisfied from the internal metadata
cache); this would be the case, say, for ls --color=always (ls to
stdout) or find operations.  If the directory was larger than 64K (too
big to fit inside the internal cache), then the performance was
terrible (each lookup re-decompressed the directory data!).

The major goal to improve performance wasn't to get O(1) performance
directly hashing to the required filename, but to get O(1) performance
in terms of going directly to the compressed block containing the
required filename, i.e. irrespective of the length of the directory,
only one directory block needs to be decompressed for a dentry lookup.
 It's important to appreciate that for Squashfs directly hashing to
the filename is useless because that filename is hidden inside a
compressed block and that block always has to be decompressed (unless
it is in the internal cache).

I looked into (various) tree structures and filename hashing but they
didn't give me what I wanted, requiring either large extra data
overhead on disk and/or requiring the directory to be entirely
decompressed on first access and memory overhead for the computed
structure for the lifetime of the directory inode in memory.  For a
filesystem that is used in embedded systems (as well as the archiving
uses that were causing this problem) both were unacceptable (and
destroyed a lot of the savings gained by compressing the directory)
and were unnecessary for my uses anyway.

I opted for a simple index with one entry per 8K compressed block.
Each entry stores a pointer to the first filename in its compressed
block (the location of the start of the compressed block, and the
offset into that block of the first filename - directory entries can
overlap compressed blocks), plus the first filename in that block.
Directories are sorted in alphabetical order, and the index is scanned
linearly looking for the first filename alphabetically larger than the
filename being looked up; at that point the index of the compressed
block the filename should be in has been found.  This isn't O(1) as
far as filenames are concerned, but it is O(1) in terms of only
needing to decompress one directory block.  The index is stored within
the directory inode (which is also compressed).  Of course you may
argue that a linear scan of the index is slow; however, because it is
also compressed you *have* to do a linear scan (unless you build an
index of that index).  This scheme has the advantage that it solves my
problem, doesn't require extra memory overhead and doesn't require
much extra storage on disk.
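
To make the scheme concrete, here is a minimal sketch in C of the kind
of index and lookup described above.  This is *not* the actual Squashfs
on-disk layout; the struct, field names and sizes are invented purely
for illustration (and it assumes the index has at least one entry):

/*
 * Hypothetical directory index: one entry per 8K compressed block.
 * Not the real Squashfs format -- an illustration only.
 */
#include <stdint.h>
#include <string.h>

struct dir_index_entry {
        uint64_t block_start;     /* start of compressed block on disk */
        uint16_t offset;          /* offset of first filename in block */
        char     first_name[256]; /* first filename in that block */
};

/*
 * The directory is sorted alphabetically, so pick the last index entry
 * whose first_name is <= the name being looked up; only that one
 * compressed block then needs to be decompressed and scanned.
 */
static const struct dir_index_entry *
find_block(const struct dir_index_entry *index, int count, const char *name)
{
        const struct dir_index_entry *best = &index[0];

        for (int i = 1; i < count; i++) {
                if (strcmp(index[i].first_name, name) > 0)
                        break;  /* first name larger than target */
                best = &index[i];
        }
        return best;
}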

Statistics speak louder than words.  The following figures are from
running "time find /mnt -type f | cat > /dev/null" on an Ubuntu livecd
filesystem stored as Cloop, zisofs, Squashfs 2.0, Squashfs 2.1 (with
the improvements) and ext3 (uncompressed).  The Ubuntu filesystem is
about 1.4 GB uncompressed or 470 MB compressed.  Of course the find
indicates the speed of dentry lookup operations only and deliberately
performs no file I/O.  All filesystems were read from hard disk
(rather than CD-ROM).

Cloop:           Real 12.692 secs   User 0.216 secs   Sys 12.036 secs
zisofs:          Real 12.151 secs   User 0.176 secs   Sys 10.499 secs
Squashfs 2.0:    Real 10.697 secs   User 0.132 secs   Sys  9.996 secs
Squashfs 2.1:    Real  4.083 secs   User 0.126 secs   Sys  3.931 secs
ext3 (uncomp.):  Real 24.791 secs   User 0.174 secs   Sys  0.694 secs

FYI, the compressed filesystem sizes were Cloop 472 MB, Squashfs (2.0
& 2.1) 450 MB, and zisofs 591 MB.

This is a somewhat longer description than I intended, but, you did ask!

Phillip Lougher

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Designing Another File System
  2004-11-30 19:22 ` Phil Lougher
  2004-12-01  1:42   ` John Richard Moser
@ 2004-12-02 20:18   ` Jan Knutar
  2004-12-07  2:01     ` Phil Lougher
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Knutar @ 2004-12-02 20:18 UTC (permalink / raw)
  To: Phil Lougher; +Cc: John Richard Moser, linux-kernel

On Tuesday 30 November 2004 21:22, Phil Lougher wrote:

> I've had people trying to store 500,000 + files in a Squashfs
> directory.  Needless to say with the original directory implementation
> this didn't work terribly well...

My old-style newsspool longs for a fast filesystem.  Once I did
something stupid: I made a backup with mkisofs and tried an ls -lR on
the resulting CD.  After a few hours, I decided that I needed the box
to do some work instead of sitting at 100% uninterruptible sys CPU
usage, and rebooted it.  I wonder how long that operation would've
eventually taken on isofs :)  Mind, the CD drive wasn't really active
during that time, so it wasn't an issue of seek-limited I/O blocking
the IDE bus.

Out of morbid curiosity I later tried storing it in a zisofs on more
recent kernels, and while it didn't effectively hang the machine
anymore, it didn't work terribly well either...

Currently the thing lives on a Reiser3 partition, previously ext3.
They seem to be about the same speed for this in practice.  A "du -sh"
takes about 15 minutes to execute, and tarring it into a tar.bz2 seems
to take about as long as that, too.  Using Theodore Ts'o's
spd_readdir.c, the tar backup case once ran in 5 minutes.
spd_readdir throws segfaults at me now with 2.6 kernels though :-(
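
For reference, the trick spd_readdir.c plays (as I understand it) is
to read the whole directory, sort the entries by inode number, and
return them in that order, so the subsequent stats walk the inode
tables roughly sequentially instead of in random order.  A rough
standalone sketch of the same idea (not the actual LD_PRELOAD library;
error handling mostly omitted) would be:

/*
 * Sketch of the spd_readdir idea: stat files in inode-number order to
 * reduce seeks.  Not the real spd_readdir.c.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

struct entry {
        ino_t ino;
        char  name[512];
};

static int by_ino(const void *a, const void *b)
{
        const struct entry *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(path);
        struct dirent *de;
        struct entry *e = NULL;
        size_t n = 0;

        if (!d)
                return 1;
        while ((de = readdir(d)) != NULL) {
                e = realloc(e, (n + 1) * sizeof(*e));
                e[n].ino = de->d_ino;
                snprintf(e[n].name, sizeof(e[n].name), "%s/%s",
                         path, de->d_name);
                n++;
        }
        closedir(d);

        qsort(e, n, sizeof(*e), by_ino);        /* inode order */
        for (size_t i = 0; i < n; i++) {
                struct stat st;
                lstat(e[i].name, &st);          /* far fewer seeks */
        }
        free(e);
        return 0;
}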

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Designing Another File System
  2004-11-30 17:46   ` Alan Cox
  2004-11-30 19:00     ` Jan Engelhardt
  2004-11-30 19:14     ` Valdis.Kletnieks
@ 2004-12-02 23:32     ` David Woodhouse
  2 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2004-12-02 23:32 UTC (permalink / raw)
  To: Alan Cox; +Cc: Valdis.Kletnieks, John Richard Moser, Linux Kernel Mailing List

On Tue, 2004-11-30 at 17:46 +0000, Alan Cox wrote:
> On Maw, 2004-11-30 at 18:28, Valdis.Kletnieks@vt.edu wrote:
> > On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:
> > they punt on the issue of over-writing a sector that's been re-allocated by
> > the hardware (apparently the chances of critical secret data being left in
> > a reallocated block but still actually readable are "low enough" not to worry).
> 
> I guess they never consider CF cards which internally are log structured
> and for whom such erase operations are very close to pointless.

Actually, for the cases where we do that translation layer in software
ourselves to emulate a block device, it would be really nice to know
that a sector is unused and hence that we no longer have to keep its
old stale contents when we garbage-collect.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Designing Another File System
  2004-12-02 20:18   ` Jan Knutar
@ 2004-12-07  2:01     ` Phil Lougher
  2004-12-07  2:31       ` John Richard Moser
  0 siblings, 1 reply; 21+ messages in thread
From: Phil Lougher @ 2004-12-07  2:01 UTC (permalink / raw)
  To: Jan Knutar; +Cc: John Richard Moser, linux-kernel

On Thu, 02 Dec 2004 22:18:53 +0200, Jan Knutar <jk-lkml@sci.fi> wrote:
> My old-style newsspool longs for a fast filesystem. Once I did something
> stupid, did a backup with mkisofs, and tried a ls -lR on the resulting CD.
> After a few hours, I decided that I needed the box to do some work instead
> of the 100% uninterruptable sys cpu usage, and rebooted it. I wonder how
> long that operation would've taken on isofs :) Mind, the cd drive wasnt really
> active during that time, so it wasn't an issue of seek-limited IO blocking
> the ide bus.

Sorry for the delay in replying; I've been readying Squashfs 2.1 for release...

Anyway, your experience with isofs/zisofs parallels my experience with
Squashfs 2.0 and isofs/zisofs.  Someone reported an 'ls' time of 15+
hours with Squashfs 2.0 (I can't remember the directory size, but it
was 100,000+ files)!  Very embarrassing...

As you mentioned, the problem isn't due to seek overhead but to the
way applications like ls and find interact with the VFS layer.  ls (to
console) by default in most distributions colours filenames according
to type.  Unfortunately, this results in a separate file lookup for
each file within a directory.  Isofs/Squashfs 2.0 (and others) need to
scan the directory from the start for each lookup to find the
filename.  For a directory which is, for example, 5 MB in size, this
means on average 2.5 MB is scanned for each and every filename.  A
5 MB directory may have 100,000 files, which means 100,000 * 2.5 MB
(roughly 250 GB) must be scanned for the 'ls'.  The directory is
quickly pulled into the cache, so the operation isn't seek-time bound,
but the constant rescanning results in high CPU overhead.
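
To illustrate where those per-file lookups come from: a coloured ls
boils down to a readdir() loop with a stat per entry, and each stat
triggers a VFS lookup, which on the old directory format meant
rescanning the directory from the start.  This is only a simplified
illustration, not the real GNU ls:

/*
 * Simplified model of what "ls --color" effectively does: readdir()
 * plus one lstat() per entry.  Each lstat() forces a directory lookup
 * in the filesystem -- O(n) per file on the old Squashfs/isofs
 * directory formats, hence O(n^2) for the whole listing.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(path);
        struct dirent *de;
        char name[4096];

        if (!d)
                return 1;
        while ((de = readdir(d)) != NULL) {
                struct stat st;

                snprintf(name, sizeof(name), "%s/%s", path, de->d_name);
                if (lstat(name, &st) == 0)      /* the per-file lookup */
                        printf("%s%s\n", de->d_name,
                               S_ISDIR(st.st_mode) ? "/" : "");
        }
        closedir(d);
        return 0;
}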

> 
> Currently the thing lives on a Reiser3 partition, previously ext3. They seem
> to be about the same speed for this in practice. A "du -sh" takes about 15
> minutes to execute, tarring it into a tar.bz2 seems to take about as long as
> that, too. Using Theodore Tso's spd_readdir.c, the tar backup case ran in
> 5 minutes once upon a time. spr_readdir throws segfaults at me now with
> 2.6 kernels though :-(
> 

You don't mention the size of each directory or how many files per
directory.  I ran a test with zisofs and Squashfs 2.1 to check the
efficiency of the directory improvements in Squashfs 2.1.

The test directory had 72,784 files (2.6 MB directory size).  Each
file was 10 bytes in size (the test was of directory lookup, so the
file size wasn't an issue).  The ext3 filesystem came to 288 MB
(presumably because each file occupies a full block), the zisofs
filesystem to 154 MB, and the Squashfs 2.1 filesystem to 616 KB...
The massively smaller Squashfs 2.1 size is because it packs data on
byte alignments and packs multiple small files into shared blocks
which it compresses in one go.  The compression achieved isn't the
issue here, but it does nicely show some other aspects of Squashfs'
compression ability...

Zisofs takes 34 minutes 21 seconds to do an ls on the directory;
Squashfs 2.1 takes 19 seconds!  The improvements in Squashfs 2.1 seem
to do the trick.  Once Squashfs 2.1 is released, and if you try it,
I'll be interested in any performance results you obtain with
real-world data (rather than artificial filesystems).

Regards

Phil Lougher

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Designing Another File System
  2004-12-07  2:01     ` Phil Lougher
@ 2004-12-07  2:31       ` John Richard Moser
  0 siblings, 0 replies; 21+ messages in thread
From: John Richard Moser @ 2004-12-07  2:31 UTC (permalink / raw)
  To: Phil Lougher; +Cc: Jan Knutar, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

What compression algorithm does Squashfs use?

I would like to see LZMA used. . . . :)

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBtRX2hDd4aOud5P8RArFcAJ4w6VuO3jtTuZFNqFlLhnnBdwW0OQCfSOwB
S2CKvKYBN3EB6QSgkPwtmZo=
=kCIz
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2004-12-07  2:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-11-30  4:32 Designing Another File System John Richard Moser
2004-11-30  7:16 ` Bernd Eckenfels
2004-11-30 13:07 ` Helge Hafting
2004-12-01  1:16   ` John Richard Moser
2004-11-30 16:31 ` Alan Cox
2004-11-30 18:28 ` Valdis.Kletnieks
2004-11-30 17:46   ` Alan Cox
2004-11-30 19:00     ` Jan Engelhardt
2004-11-30 19:14     ` Valdis.Kletnieks
2004-11-30 20:22       ` Valdis.Kletnieks
2004-12-02 23:32     ` David Woodhouse
2004-12-01  1:35   ` John Richard Moser
2004-11-30 19:22 ` Phil Lougher
2004-12-01  1:42   ` John Richard Moser
2004-12-01  2:46     ` Phil Lougher
2004-12-01  4:32       ` Bernd Eckenfels
2004-12-01  7:51         ` Phil Lougher
2004-12-01  5:11       ` John Richard Moser
2004-12-02 20:18   ` Jan Knutar
2004-12-07  2:01     ` Phil Lougher
2004-12-07  2:31       ` John Richard Moser
