* Allocation strategy - dynamic zone for small files
@ 2006-11-13 10:37 Ihar `Philips` Filipau
  2006-11-13 13:56 ` avishay
  0 siblings, 1 reply; 25+ messages in thread
From: Ihar `Philips` Filipau @ 2006-11-13 10:37 UTC (permalink / raw)
  To: linux-fsdevel

Hi!

I'm not up to date on the state of modern file systems, but the idea I
read in the paper

http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/download/INTERNALS

seems quite interesting. One of the interesting features of ReiserFS
is tail packing, which saves the disk space wasted by small files and
file tails, but it always seemed a bit too complicated and had lots of
performance drawbacks.

SpadFS instead tries to solve only half of the problem: the case of
small files (symlinks fall into that category too). Small files are
allocated in a special zone and treated specially, so small-file
accesses can be optimized both performance-wise and space-wise.

Just fyi.

-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 10:37 Allocation strategy - dynamic zone for small files Ihar `Philips` Filipau
@ 2006-11-13 13:56 ` avishay
  2006-11-13 17:46   ` Bryan Henderson
  0 siblings, 1 reply; 25+ messages in thread
From: avishay @ 2006-11-13 13:56 UTC (permalink / raw)
  To: Ihar `Philips` Filipau; +Cc: linux-fsdevel

On Mon, 2006-11-13 at 11:37 +0100, Ihar `Philips` Filipau wrote:
> SpadFS instead tries to solve only half of the problem: the case of
> small files (symlinks fall into that category too). Small files are
> allocated in a special zone and treated specially, so small-file
> accesses can be optimized both performance-wise and space-wise.

Does anyone have any estimates of how much space is wasted by these
files without making them a special case?  It seems to me that most
people have huge disks and don't really care about losing a few KB here
and there (especially if special-casing them makes the more common
cases slower).  Any ideas?

Avishay Traeger
http://www.fsl.cs.sunysb.edu/~avishay/


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 13:56 ` avishay
@ 2006-11-13 17:46   ` Bryan Henderson
  2006-11-13 19:38     ` Josef Sipek
  2006-11-14  1:02     ` Theodore Tso
  0 siblings, 2 replies; 25+ messages in thread
From: Bryan Henderson @ 2006-11-13 17:46 UTC (permalink / raw)
  To: avishay; +Cc: linux-fsdevel, Ihar `Philips` Filipau

>Does anyone have any estimates of how much space is wasted by these
>files without making them a special case?  It seems to me that most
>people have huge disks and don't really care about losing a few KB here
>and there (especially if special-casing them makes the more common
>cases slower).

Two thoughts:

1) It's not just disk capacity.  Using a 4K disk block for 16 bytes of 
data also wastes the time it takes to drag that 4K from disk to memory, 
and it wastes cache space.

2) Making storage and access of _existing_ sets of files more efficient 
isn't usually the justification for this technology.  It's enabling new 
kinds of file sets.  Imagine all the 16 byte files that never got created 
because the designer didn't want to waste 4K on each.  A file with a 
million 16 byte pieces might work better as a million separate files, 
but was made a single file because 4 GB of storage for 16 MB of data was 
not practical.  Similarly, there are files that would work better with 1 
MB blocks, but have 4K blocks anyway, because the designer couldn't afford 
1 MB for every 16 byte file.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 17:46   ` Bryan Henderson
@ 2006-11-13 19:38     ` Josef Sipek
  2006-11-13 21:12       ` Bryan Henderson
  2006-11-14  1:02     ` Theodore Tso
  1 sibling, 1 reply; 25+ messages in thread
From: Josef Sipek @ 2006-11-13 19:38 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: avishay, linux-fsdevel, Ihar `Philips` Filipau

On Mon, Nov 13, 2006 at 09:46:01AM -0800, Bryan Henderson wrote:
> >Does anyone have any estimates of how much space is wasted by these
> >files without making them a special case?  It seems to me that most
> >people have huge disks and don't really care about losing a few KB here
> >and there (especially if special-casing them makes the more common
> >cases slower).
> 
> Two thoughts:
> 
> 1) It's not just disk capacity.  Using a 4K disk block for 16 bytes of 
> data also wastes the time it takes to drag that 4K from disk to memory, 
> and it wastes cache space.

Good point. But wouldn't the page cache suffer regardless? (You can't split
up pages between files, AFAIK.)

> 2) Making storage and access of _existing_ sets of files more efficient 
> isn't usually the justification for this technology.  It's enabling new 
> kinds of file sets.  Imagine all the 16 byte files that never got created 
> because the designer didn't want to waste 4K on each.  A file with a 
> million 16 byte pieces might work better as a million separate files, 
> but was made a single file because 4 GB of storage for 16 MB of data was 
> not practical.  Similarly, there are files that would work better with 1 
> MB blocks, but have 4K blocks anyway, because the designer couldn't afford 
> 1 MB for every 16 byte file.

I haven't really looked at it, but from what I hear, (Free?)BSD has a nifty
feature where it divides a block into halves or quarters during allocation
to save some space.  As far as I know, it is a fs (probably UFS) feature.

Just my 2 cents.

Josef "Jeff" Sipek.

-- 
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like
that.
		- Linus Torvalds

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 19:38     ` Josef Sipek
@ 2006-11-13 21:12       ` Bryan Henderson
  2006-11-13 23:32         ` Ihar `Philips` Filipau
  0 siblings, 1 reply; 25+ messages in thread
From: Bryan Henderson @ 2006-11-13 21:12 UTC (permalink / raw)
  To: Josef Sipek; +Cc: avishay, linux-fsdevel, Ihar `Philips` Filipau

>> 1) It's not just disk capacity.  Using a 4K disk block for 16 bytes of
>> data also wastes the time it takes to drag that 4K from disk to memory,
>> and it wastes cache space.
>
>Good point. But wouldn't the page cache suffer regardless? (You can't split
>up pages between files, AFAIK.)

Yeah, you're right, if we're talking about granularity finer than the page 
size.  But furthermore, as long as we're just talking about techniques to 
reduce internal fragmentation in the disk allocations, there's no reason 
either the cache usage or the data transfer traffic has to be affected 
(the fact that a whole block is allocated doesn't mean you have to read or 
cache the whole block).

But head movement and rotational latency are worth considering.  If you 
cram 100 files into a track, some access patterns are going to be faster 
than if you have to spread them out across 10 tracks with a lot of empty 
space in between.  That's another reason that I sometimes see people pile 
a bunch of data into a large file and essentially make a filesystem within 
that file.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 21:12       ` Bryan Henderson
@ 2006-11-13 23:32         ` Ihar `Philips` Filipau
  2006-11-13 23:57           ` Andreas Dilger
                             ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Ihar `Philips` Filipau @ 2006-11-13 23:32 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: Josef Sipek, avishay, linux-fsdevel

On 11/13/06, Bryan Henderson <hbryan@us.ibm.com> wrote:
> >
> >Good point. But wouldn't the page cache suffer regardless? (You can't split
> >up pages between files, AFAIK.)
>
> Yeah, you're right, if we're talking about granularity finer than the page
> size.  But furthermore, as long as we're just talking about techniques to
> reduce internal fragmentation in the disk allocations, there's no reason
> either the cache usage or the data transfer traffic has to be affected
> (the fact that a whole block is allocated doesn't mean you have to read or
> cache the whole block).
>
> But head movement and rotational latency are worth considering. [...]

As the person who threw in the idea, I feel a bit responsible. So here
are the results from my primitive script (bear with my bashisms), run
on my plain Debian/unstable with 123k files on a 10GB ext3 partition,
default 8K block.

Script to count small files:
-+-
#!/bin/bash
find / -xdev 2>/dev/null | wc -l
find / -xdev \( $(seq -f '-size %gc -o' 1 63) -false \) 2>/dev/null | wc -l
find / -xdev \( $(seq -f '-size %gc -o' 64 128) -false \) 2>/dev/null | wc -l
-+-
The first find counts all files on the root fs, the second all files of
1-63 bytes, the third all files of 64-128 bytes. (The '-xdev' param tells
find to stay on the same fs, to exclude /proc, /sys, /tmp and so on.)

And on my system the counts are:
-+-
107313
8302
2618
-+-

So 10.1% of all files are small files of 128 bytes or less (7.7% are 63
bytes or less).

[ Results for /etc: 1712, 666, 143 (plus 221 files in the range of
129-512 bytes) - small files are the better half of /etc. ]

[ In fact, this optimization for small blocks is widely used in network
equipment: many intelligent devices can use several packet queues -
sorted by size - to move ingress packets into RAM. One device I wrote a
driver for allowed four queues, with recommended buffer sizes of 32,
128, 512 and 2048 bytes - sizes that let the device pull lots of
small/medium packets into RAM (normally used for control - ICMP, TCP
ACK/SYN, etc.) without depleting the large buffers (normally used for
data traffic). I posted the link here because I was a bit surprised to
see somebody apply a similar idea to file systems. ]
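
[ For illustration, a minimal C sketch of that size-sorted queue
selection - the queue sizes are the recommended ones above, everything
else is hypothetical:
-+-
#include <stddef.h>

/* Recommended buffer sizes for the four RX queues, in bytes. */
static const size_t queue_size[4] = { 32, 128, 512, 2048 };

/* Return the index of the smallest queue whose buffers can hold the
 * frame, so tiny control packets never consume the big data buffers.
 * Returns -1 if the frame needs a buffer outside these four queues. */
int pick_rx_queue(size_t frame_len)
{
	int i;

	for (i = 0; i < 4; i++)
		if (frame_len <= queue_size[i])
			return i;
	return -1;
}
-+-
]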

The most important outcome of the optimization might be that future
filesystems wouldn't be afraid to set the cluster size higher than is
accepted now: e.g. 4/8/16K is standard today, but with small-file
(+ tail) optimization it could be ramped up to 32/64/128K.

-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 23:32         ` Ihar `Philips` Filipau
@ 2006-11-13 23:57           ` Andreas Dilger
  2006-11-14  2:19             ` Dave Kleikamp
  2006-11-14 15:19             ` phillip
  2006-11-14  0:15           ` Josef Sipek
  2006-11-14  0:59           ` Bryan Henderson
  2 siblings, 2 replies; 25+ messages in thread
From: Andreas Dilger @ 2006-11-13 23:57 UTC (permalink / raw)
  To: Ihar `Philips` Filipau
  Cc: Bryan Henderson, Josef Sipek, avishay, linux-fsdevel

On Nov 14, 2006  00:32 +0100, Ihar `Philips` Filipau wrote:
> As the person who threw in the idea, I feel a bit responsible. So here
> are the results from my primitive script (bear with my bashisms), run
> on my plain Debian/unstable with 123k files on a 10GB ext3 partition,
> default 8K block.
> 
> Script to count small files:
> -+-
> #!/bin/bash
> find / -xdev 2>/dev/null | wc -l
> find / -xdev \( $(seq -f '-size %gc -o' 1 63) -false \) 2>/dev/null | wc -l
> find / -xdev \( $(seq -f '-size %gc -o' 64 128) -false \) 2>/dev/null | wc -l
> -+-
> The first find counts all files on the root fs, the second all files of
> 1-63 bytes, the third all files of 64-128 bytes. (The '-xdev' param tells
> find to stay on the same fs, to exclude /proc, /sys, /tmp and so on.)
> 
> And on my system the counts are:
> -+-
> 107313
> 8302
> 2618
> -+-
> 
> So 10.1% of all files are small files of 128 bytes or less (7.7% are 63
> bytes or less).
> 
> [ Results for /etc: 1712, 666, 143 (plus 221 files in the range of
> 129-512 bytes) - small files are the better half of /etc. ]

Note that using the root filesystem gives a skewed result (esp. on GTK
systems, where lots of single-value files are used by gconf).  Many root
filesystems using ext3 are formatted with 1kB blocks for this reason.
It would be worth gathering stats for other filesystems too.

At the filesystem summit we DID find a surprising number of small files
even when the whole system was examined.  We discussed storing small
files directly in the inode along with other EAs (this would require
larger inodes).  This improves data locality and performance (i.e. stat
of the file loads the small file data into cache), though the assumption
is that there will be an increasing number of EAs on files in the future.
It also avoids the issues w.r.t. packing file data from different files
into the same block when they have different lifespans, etc.
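
To make that concrete, a rough sketch of such a large-inode layout -
the struct and field names here are purely illustrative, not ext3's
actual on-disk format:

#include <stdint.h>

/* Hypothetical large inode: fixed fields followed by a shared "extra"
 * area holding EAs (ACLs, selinux labels, etc.); a small file body is
 * stored there as just another EA. */
struct big_inode {
	uint16_t mode;
	uint16_t links;
	uint64_t size;		/* file size in bytes */
	/* ... remaining fixed fields ... */
	uint8_t  extra[432];	/* EA area, shared with inline file data */
};

/* If the whole body fits in the EA area (ignoring the space taken by
 * other EAs, for simplicity), the stat() that read the inode block has
 * already brought the file data into cache - no extra seek needed. */
static const uint8_t *inline_body(const struct big_inode *inode)
{
	return inode->size <= sizeof(inode->extra) ? inode->extra : NULL;
}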

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 23:32         ` Ihar `Philips` Filipau
  2006-11-13 23:57           ` Andreas Dilger
@ 2006-11-14  0:15           ` Josef Sipek
  2006-11-14  0:59           ` Bryan Henderson
  2 siblings, 0 replies; 25+ messages in thread
From: Josef Sipek @ 2006-11-14  0:15 UTC (permalink / raw)
  To: Ihar `Philips` Filipau; +Cc: Bryan Henderson, avishay, linux-fsdevel

On Tue, Nov 14, 2006 at 12:32:07AM +0100, Ihar `Philips` Filipau wrote:
...
> As the person who threw in the idea, I feel a bit responsible. So here
> are the results from my primitive script (bear with my bashisms), run
> on my plain Debian/unstable with 123k files on a 10GB ext3 partition,
> default 8K block.
> 
> Script to count small files:
> -+-
> #!/bin/bash
> find / -xdev 2>/dev/null | wc -l
> find / -xdev \( $(seq -f '-size %gc -o' 1 63) -false \) 2>/dev/null | wc -l
> find / -xdev \( $(seq -f '-size %gc -o' 64 128) -false \) 2>/dev/null | wc -l
> -+-
> The first find counts all files on the root fs, the second all files of
> 1-63 bytes, the third all files of 64-128 bytes. (The '-xdev' param tells
> find to stay on the same fs, to exclude /proc, /sys, /tmp and so on.)
> 
> And on my system the counts are:
> -+-
> 107313
> 8302
> 2618
> -+-
 
On my system (Debian Etch, the / contains everything except /home):
 
581564
11280
10994

> So 10.1% of all files are small files of 128 bytes or less (7.7% are 63
> bytes or less).
 
Here it's 3.8% of all files at 128 bytes or less (1.9% at 63 bytes or less).
 
Josef "Jeff" Sipek.

-- 
A CRAY is the only computer that runs an endless loop in just 4 hours...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 23:32         ` Ihar `Philips` Filipau
  2006-11-13 23:57           ` Andreas Dilger
  2006-11-14  0:15           ` Josef Sipek
@ 2006-11-14  0:59           ` Bryan Henderson
  2 siblings, 0 replies; 25+ messages in thread
From: Bryan Henderson @ 2006-11-14  0:59 UTC (permalink / raw)
  To: Ihar `Philips` Filipau; +Cc: avishay, Josef Sipek, linux-fsdevel

Your numbers show that you are wasting about 85 MB in internal 
fragmentation of files < 128 bytes.  To paraphrase an earlier query in 
this thread: what's 85 MB of disk space worth to you?  Probably not much.

There are of course lots of other filesystem applications, so there could 
be some where the wasted disk space is worth a second thought.

But I think the main interest in this sort of thing (as stated previously) 
is based on how many tiny files you _would_ have, and how big your basic 
allocation unit for other files _would_ be, if you didn't have to pay 
the price.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 17:46   ` Bryan Henderson
  2006-11-13 19:38     ` Josef Sipek
@ 2006-11-14  1:02     ` Theodore Tso
  2006-11-14 11:21       ` Al Boldi
  2006-11-14 14:30       ` phillip
  1 sibling, 2 replies; 25+ messages in thread
From: Theodore Tso @ 2006-11-14  1:02 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: avishay, linux-fsdevel, Ihar `Philips` Filipau

On Mon, Nov 13, 2006 at 09:46:01AM -0800, Bryan Henderson wrote:
> >Does anyone have any estimates of how much space is wasted by these
> >files without making them a special case?  It seems to me that most
> >people have huge disks and don't really care about losing a few KB here
> >and there (especially if special-casing them makes the more common
> >cases slower).
> 
> Two thoughts:
> 
> 1) It's not just disk capacity.  Using a 4K disk block for 16 bytes of 
> data also wastes the time it takes to drag that 4K from disk to memory, 
> and it wastes cache space.
> 
> 2) Making storage and access of _existing_ sets of files more efficient 
> isn't usually the justification for this technology.  It's enabling new 
> kinds of file sets.  Imagine all the 16 byte files that never got created 
> because the designer didn't want to waste 4K on each.  A file with a 
> million 16 byte pieces might work better as a million separate files, 
> but was made a single file because 4 GB of storage for 16 MB of data was 
> not practical.  Similarly, there are files that would work better with 1 
> MB blocks, but have 4K blocks anyway, because the designer couldn't afford 
> 1 MB for every 16 byte file.

More thoughts:

1) It's not just about storage efficiency, but also about transfer
efficiency.  Disk drives generally like to transfer data in hunks of
16k to 64k at a time.  So if related small hunks of data get read at
the same time, we can win big on performance.  BUT, it's extremely
hard to do this at the filesystem level, since the application is much
more likely than the filesystem to know which 16-byte micro-file is
needed at the same time as some other 16-byte micro-file.

2) If you have millions of separate files, each 16 bytes long, and you
need to read a huge number of them, you can end up getting killed on
system call overhead.  

I remember having this argument with Hans Reiser at one point.  His
argument was that parsing was evil and should never have to be done.
(And if anyone has ever seen the vast quantities of garbage generated
when you implement an XML parser in Java, and the resulting GC
overhead, I can't blame them for thinking this...)  So his solution
was that instead of parsing a file like /etc/inetd.conf, there should
be an /etc/inetd.conf.d directory, and in that directory there might
be a directory called telnet, and another one called ssh, and yet
another called smtp, and then you might have files such as:

FILENAME					CONTENTS
===============================================================

/etc/inetd.conf.d/telnet/port			23
/etc/inetd.conf.d/telnet/protocol		tcp
/etc/inetd.conf.d/telnet/flags			nowait
/etc/inetd.conf.d/telnet/user			root
/etc/inetd.conf.d/telnet/daemon			/sbin/telnetd

/etc/inetd.conf.d/ssh/port			22
/etc/inetd.conf.d/ssh/protocol			tcp
/etc/inetd.conf.d/ssh/flags			nowait
/etc/inetd.conf.d/ssh/user			root
/etc/inetd.conf.d/ssh/daemon			/sbin/sshd

etc.  When I pointed out the system call overhead that would result,
since instead of an open, read, close to read /etc/inetd.conf you
would now need perhaps a hundred or more system calls to do the
opendir/readdir loop and then individually open, read, and close each
file, Hans had a solution ---- a new system call where you could
download a byte-coded command program into the kernel, so the kernel
could execute the sequence of commands and return to userspace a
single buffer containing the contents of all of the files, which could
then be parsed by the userspace program.....
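
For scale, here's a rough sketch (plain C, hypothetical paths) of what
the per-file variant costs: one opendir, then roughly three system
calls (open, read, close) per entry, versus a single open/read/close
for a flat /etc/inetd.conf:

#include <stdio.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const char *dir = "/etc/inetd.conf.d/telnet";
	char path[512], buf[64];
	struct dirent *e;
	DIR *d = opendir(dir);

	if (!d)
		return 1;
	while ((e = readdir(d)) != NULL) {
		if (e->d_name[0] == '.')
			continue;	/* skip ".", ".." and dotfiles */
		snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
		int fd = open(path, O_RDONLY);	/* syscall 1 per entry */
		if (fd < 0)
			continue;
		ssize_t n = read(fd, buf, sizeof(buf) - 1);  /* syscall 2 */
		close(fd);			/* syscall 3 per entry */
		if (n > 0) {
			buf[n] = '\0';
			printf("%s = %s\n", e->d_name, buf);
		}
	}
	closedir(d);
	return 0;
}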

But wait a second, I thought the whole point of this complicated
scheme, including implementing a byte code interpreter in the kernel
with all of the attendant potential security issues, was to avoid
needing to do parsing.  Oops, oh well, so much for that idea.

So color me skeptical that 16 byte files are really such a great
design...

						- Ted



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 23:57           ` Andreas Dilger
@ 2006-11-14  2:19             ` Dave Kleikamp
  2006-11-14 13:15               ` Jörn Engel
  2006-11-14 15:19             ` phillip
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Kleikamp @ 2006-11-14  2:19 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ihar `Philips` Filipau, Bryan Henderson, Josef Sipek, avishay,
	linux-fsdevel

On Mon, 2006-11-13 at 16:57 -0700, Andreas Dilger wrote:
> On Nov 14, 2006  00:32 +0100, Ihar `Philips` Filipau wrote:
> > As the person who threw in the idea, I feel a bit responsible. So here
> > are the results from my primitive script (bear with my bashisms), run
> > on my plain Debian/unstable with 123k files on a 10GB ext3 partition,
> > default 8K block.
> > 
> > Script to count small files:
> > -+-
> > #!/bin/bash
> > find / -xdev 2>/dev/null | wc -l
> > find / -xdev \( $(seq -f '-size %gc -o' 1 63) -false \) 2>/dev/null | wc -l
> > find / -xdev \( $(seq -f '-size %gc -o' 64 128) -false \) 2>/dev/null | wc -l
> > -+-
> > The first find counts all files on the root fs, the second all files of
> > 1-63 bytes, the third all files of 64-128 bytes. (The '-xdev' param tells
> > find to stay on the same fs, to exclude /proc, /sys, /tmp and so on.)
> > 
> > And on my system the counts are:
> > -+-
> > 107313
> > 8302
> > 2618
> > -+-
> > 
> > So 10.1% of all files are small files of 128 bytes or less (7.7% are 63
> > bytes or less).
> > 
> > [ Results for /etc: 1712, 666, 143 (plus 221 files in the range of
> > 129-512 bytes) - small files are the better half of /etc. ]
> 
> Note that using the root filesystem gives a skewed result (esp. on GTK
> systems, where lots of single-value files are used by gconf).  Many root
> filesystems using ext3 are formatted with 1kB blocks for this reason.
> It would be worth gathering stats for other filesystems too.
> 
> At the filesystem summit we DID find a surprising number of small files
> even when the whole system was examined.  We discussed storing small
> files directly in the inode along with other EAs (this would require
> larger inodes).  This improves data locality and performance (i.e. stat
> of the file loads the small file data into cache), though the assumption
> is that there will be an increasing number of EAs on files in the future.
> It also avoids the issues w.r.t. packing file data from different files
> into the same block when they have different lifespans, etc.

I would agree that if the focus is on files that are 128 bytes or
smaller, storing the data in the inode makes the most sense.  I don't
think it's worth the complexity of doing any kind of tail merging unless
you expect that a large number of small files would be too big to fit
practically in the inode, but small enough that it is worth doing
something to store them efficiently.  Symbolic links have been stored
this way for a long time.
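
For reference, a simplified sketch of that classic fast-symlink trick,
modeled loosely on ext2 (the names are illustrative): a target short
enough to fit in the block-pointer area is stored right in the inode,
so reading the link needs no data block at all.

#include <stdint.h>

#define INLINE_MAX 60	/* the room of 15 x 4-byte block pointers */

struct toy_inode {
	uint32_t size;			 /* symlink target length, bytes */
	union {
		uint32_t block[15];	 /* regular file: block pointers */
		char target[INLINE_MAX]; /* fast symlink: the path itself */
	} u;
};

/* A short symlink is "fast": its target lives in the inode itself. */
static int is_fast_symlink(const struct toy_inode *inode)
{
	return inode->size < INLINE_MAX;
}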

-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14  1:02     ` Theodore Tso
@ 2006-11-14 11:21       ` Al Boldi
  2006-11-14 14:25         ` Theodore Tso
  2006-11-14 14:30       ` phillip
  1 sibling, 1 reply; 25+ messages in thread
From: Al Boldi @ 2006-11-14 11:21 UTC (permalink / raw)
  To: linux-fsdevel

Theodore Tso wrote:
>
> More thoughts:
>
> 1) It's not just about storage efficiency, but also about transfer
> efficiency.  Disk drives generally like to transfer data in hunks of
> 16k to 64k at a time.  So if related small hunks of data get read at
> the same time, we can win big on performance.  BUT, it's extremely
> hard to do this at the filesystem level, since the application is much
> more likely than the filesystem to know which 16-byte micro-file is
> needed at the same time as some other 16-byte micro-file.
>
> 2) If you have millions of separate files, each 16 bytes long, and you
> need to read a huge number of them, you can end up getting killed on
> system call overhead.
>
> I remember having this argument with Hans Reiser at one point.  His
> argument was that parsing was evil and should never have to be done.
> (And if anyone has ever seen the vast quantities of garbage generated
> when you implement an XML parser in Java, and the resulting GC
> overhead, I can't blame them for thinking this...)  So his solution
> was that instead of parsing a file like /etc/inetd.conf, there should
> be an /etc/inetd.conf.d directory, and in that directory there might
> be a directory called telnet, and another one called ssh, and yet
> another called smtp, and then you might have files such as:
>
> FILENAME					CONTENTS
> ===============================================================
>
> /etc/inetd.conf.d/telnet/port			23
> /etc/inetd.conf.d/telnet/protocol		tcp
> /etc/inetd.conf.d/telnet/flags			nowait
> /etc/inetd.conf.d/telnet/user			root
> /etc/inetd.conf.d/telnet/daemon			/sbin/telnetd
>
> /etc/inetd.conf.d/ssh/port			22
> /etc/inetd.conf.d/ssh/protocol			tcp
> /etc/inetd.conf.d/ssh/flags			nowait
> /etc/inetd.conf.d/ssh/user			root
> /etc/inetd.conf.d/ssh/daemon			/sbin/sshd
>
> etc.  When I pointed out the system call overhead that would result,
> since instead of an open, read, close to read /etc/inetd.conf you
> would now need perhaps a hundred or more system calls to do the
> opendir/readdir loop and then individually open, read, and close each
> file, Hans had a solution ----

I have a different solution.

Plugins into the VFS that handle special situations.

I could do something like this manually in userland via loop/FUSE, but a more 
integrated solution could prove more useful.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14  2:19             ` Dave Kleikamp
@ 2006-11-14 13:15               ` Jörn Engel
       [not found]                 ` <efa6f5910611140541m302201e6t4e84551b75e79611@mail.gmail.com>
  0 siblings, 1 reply; 25+ messages in thread
From: Jörn Engel @ 2006-11-14 13:15 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Andreas Dilger, Ihar `Philips` Filipau, Bryan Henderson,
	Josef Sipek, avishay, linux-fsdevel

On Mon, 13 November 2006 20:19:43 -0600, Dave Kleikamp wrote:
> 
> I would agree that if the focus is on files that are 128 bytes or
> smaller, storing the data in the inode makes the most sense.  I don't
> think it's worth the complexity of doing any kind of tail merging unless
> you expect that a large number of small files would be too big to fit
> practically in the inode, but small enough that it is worth doing
> something to store them efficiently.  Symbolic links have been stored
> this way for a long time.

Logfs did this from the beginning; it works like a charm.  The only
problem I see with this approach is that it is an incompatible change
for existing filesystems, so using an old Knoppix CD to rescue such a
filesystem just won't work.

Jörn

-- 
Joern's library part 14:
http://www.sandpile.org/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
       [not found]                 ` <efa6f5910611140541m302201e6t4e84551b75e79611@mail.gmail.com>
@ 2006-11-14 13:56                   ` Jörn Engel
  2006-11-14 18:23                   ` Andreas Dilger
  1 sibling, 0 replies; 25+ messages in thread
From: Jörn Engel @ 2006-11-14 13:56 UTC (permalink / raw)
  To: Ihar `Philips` Filipau
  Cc: Dave Kleikamp, Andreas Dilger, Bryan Henderson, Josef Sipek,
	avishay, linux-fsdevel

On Tue, 14 November 2006 14:41:56 +0100, Ihar `Philips` Filipau wrote:
> 
> P.S. Can anybody suggest good (practical) reading on the 2.6.x Linux fs layer?

"Understanding the Linux kernel" is the best one I know, but most
likely not what you want.  The best strategy really is to get one's
hands dirty and write small patches.  Fast files for ext[234] may
actually be a good start.

You can look for "embedded" in logfs for an example implementation.
http://lwn.net/Articles/196896/

Jörn

-- 
A defeated army first battles and then seeks victory.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 11:21       ` Al Boldi
@ 2006-11-14 14:25         ` Theodore Tso
  2006-11-14 15:43           ` Al Boldi
  0 siblings, 1 reply; 25+ messages in thread
From: Theodore Tso @ 2006-11-14 14:25 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-fsdevel

On Tue, Nov 14, 2006 at 02:21:09PM +0300, Al Boldi wrote:
> > etc.  When I pointed out the system call overhead that would result,
> > since instead of an open, read, close to read /etc/inetd.conf you
> > would now need perhaps a hundred or more system calls to do the
> > opendir/readdir loop and then individually open, read, and close each
> > file, Hans had a solution ----
> 
> I have a different solution.
> 
> Plugins into the VFS that handle special situations.
> 
> I could do something like this manually in userland via loop/FUSE, but a more 
> integrated solution could prove more useful.

So now an application that needs to read/write 16 byte files
efficiently needs to write a kernel module that gets plugged into the
VFS?!?  And we have to trust our system stability to an application
writer not to introduce any bugs into the VFS plugin that might cause
a system panic?

What's the advantage of using smallish ~16 byte files in this case?
If it's ease of application programming, that just got flushed down
the drain; writing kernel modules is harder, because (a) the
compile/edit/debug cycle, if you screw up, crashes your system, and
(b) it requires root privs to install the application.

What exactly is the problem that people are trying to solve by getting
the kernel involved in storing tiny individual datums, as opposed to
simply asking the application to store all of these objects in a
userland database, where the object layout can be optimized for the
application's needs?

This really feels like a fragile technological hack looking for a
problem to solve....

						- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14  1:02     ` Theodore Tso
  2006-11-14 11:21       ` Al Boldi
@ 2006-11-14 14:30       ` phillip
  1 sibling, 0 replies; 25+ messages in thread
From: phillip @ 2006-11-14 14:30 UTC (permalink / raw)
  To: Theodore Tso, hbryan
  Cc: atraeger, linux-fsdevel, Ihar `Philips` Filipau, phillip.lougher

tytso@mit.edu wrote:
 
> 1) It's not just about storage efficiency, but also about transfer
> efficiency.  Disk drives generally like to transfer data in hunks of
> 16k to 64k at a time.  So if related small hunks of data get read at
> the same time, we can win big on performance.  BUT, it's extremely
> hard to do this at the filesystem level, since the application is much
> more likely than the filesystem to know which 16-byte micro-file is
> needed at the same time as some other 16-byte micro-file.

Most filesystems (as you'll know) use locality of reference to cluster files.
From the studies I've seen it works quite well.

When I added tail-end packing to SquashFS, I looked into various strategies
to determine which tail-ends (fragments) to pack together.  As SquashFS is a
read-only filesystem, this can be done using off-line analysis.  After
evaluating various strategies (best fit, first fit, same-size etc.) I found
the best compression of these packed tail-ends was achieved by packing small
files from the same directory together in alphabetical order.  Such packing
also achieved the highest performance improvements reading from CDROM
(SquashFS is used for LiveCDs, so changes in file placement can have a
dramatic effect on seeking).  This result was interesting from my POV
because it confirmed conventional locality-of-reference wisdom.
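
As a sketch, that winning ordering boils down to a sort comparator like
this one (the struct is illustrative, not SquashFS's actual code):
group tail-ends by directory, then alphabetically by name, so
neighbouring small files land in the same packed fragment block.

#include <stdlib.h>
#include <string.h>

struct tail_end {
	const char *dir;	/* directory the small file lives in */
	const char *name;	/* file name within that directory */
	/* ... data pointer, length ... */
};

/* Sort key: directory first, then alphabetical by file name. */
static int tail_cmp(const void *a, const void *b)
{
	const struct tail_end *x = a, *y = b;
	int c = strcmp(x->dir, y->dir);

	return c ? c : strcmp(x->name, y->name);
}

/* usage: qsort(tails, n, sizeof(*tails), tail_cmp); */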

Phillip Lougher



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-13 23:57           ` Andreas Dilger
  2006-11-14  2:19             ` Dave Kleikamp
@ 2006-11-14 15:19             ` phillip
  2006-11-14 18:19               ` Andreas Dilger
  1 sibling, 1 reply; 25+ messages in thread
From: phillip @ 2006-11-14 15:19 UTC (permalink / raw)
  To: Andreas Dilger, thephilips
  Cc: hbryan, jsipek, atraeger, linux-fsdevel, phillip.lougher

adilger@clusterfs.com wrote:

> At the filesystem summit we DID find a surprising number of small files
> even when the whole system was examined.  We discussed storing small
> files directly in the inode along with other EAs (this would require
> larger inodes).  This improves data locality and performance (i.e. stat
> of the file loads the small file data into cache), though the assumption
> is that there will be an increasing number of EAs on files in the future.

So it won't be feasible to store EAs + small file in the inode, or in the
future it won't be feasible to store just EAs in the inode for most files?

Are there any stats showing the current amount/size of EAs per file, and
what it is expected to be in the future?

Phillip



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 14:25         ` Theodore Tso
@ 2006-11-14 15:43           ` Al Boldi
  2006-11-14 15:46             ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Al Boldi @ 2006-11-14 15:43 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-fsdevel

Theodore Tso wrote:
> On Tue, Nov 14, 2006 at 02:21:09PM +0300, Al Boldi wrote:
> > > etc.  When I pointed out the system call overhead that would result,
> > > since instead of an open, read, close to read /etc/inetd.conf you
> > > would now need perhaps a hundred or more system calls to do the
> > > opendir/readdir loop and then individually open, read, and close each
> > > file, Hans had a solution ----
> >
> > I have a different solution.
> >
> > Plugins into the VFS that handle special situations.
> >
> > I could do something like this manually in userland via loop/FUSE, but a
> > more integrated solution could prove more useful.
>
> So now an application that needs to read/write 16 byte files
> efficiently needs to write a kernel module that gets logged into the
> VFS?!?  And we have to trust our system stability to an application
> writer to not introduce any bugs into the VFS plugin that might cause
> a system panic?

An API would probably be in order.

> What's the advantage of using smallish ~16 byte files in this case?
> If it's ease of application programming, that just got flushed down
> the drain; writing kernel modules is harder, because (a) the
> compile/edit/debug cycle, if you screw up, crashes your system, and
> (b) it requires root privs to install the application.
>
> What exactly is the problem that people are trying to solve with
> trying to get the kernel involved with storing tiny individual datums,
> as opposed to simply asking the application to store all of these
> objects in a userland database, where the object layout can be
> optimized for the application's needs?

Performance due to non-redundancy.

> This really feels like a fragile technological hack looking for a
> problem to solve....

Or a desire to let people choose what they want.

Think freedom...


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 15:43           ` Al Boldi
@ 2006-11-14 15:46             ` Matthew Wilcox
  2006-11-14 16:59               ` Al Boldi
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2006-11-14 15:46 UTC (permalink / raw)
  To: Al Boldi; +Cc: Theodore Tso, linux-fsdevel

On Tue, Nov 14, 2006 at 06:43:39PM +0300, Al Boldi wrote:
> An API would probably be in order.
> 
> Performance due to non-redundancy.
> 
> Or a desire to let people choose what they want.
> 
> Think freedom...

How about fewer "visionary statements" and more code?  Or at least an
architecture description.  Or a rough prototype that shows how we can
make some improvements.  You're looking remarkably content-free right now.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 15:46             ` Matthew Wilcox
@ 2006-11-14 16:59               ` Al Boldi
  2006-11-14 17:27                 ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Al Boldi @ 2006-11-14 16:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Theodore Tso, linux-fsdevel

Matthew Wilcox wrote:
> On Tue, Nov 14, 2006 at 06:43:39PM +0300, Al Boldi wrote:
> > An API would probably be in order.
> >
> > Performance due to non-redundancy.
> >
> > Or a desire to let people choose what they want.
> >
> > Think freedom...
>
> How about fewer "visionary statements" and more code?

How about discussing the feasibility before sending any code?


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 16:59               ` Al Boldi
@ 2006-11-14 17:27                 ` Matthew Wilcox
  2006-11-14 17:55                   ` Theodore Tso
  2006-11-14 18:23                   ` Al Boldi
  0 siblings, 2 replies; 25+ messages in thread
From: Matthew Wilcox @ 2006-11-14 17:27 UTC (permalink / raw)
  To: Al Boldi; +Cc: Theodore Tso, linux-fsdevel

On Tue, Nov 14, 2006 at 07:59:43PM +0300, Al Boldi wrote:
> Matthew Wilcox wrote:
> > On Tue, Nov 14, 2006 at 06:43:39PM +0300, Al Boldi wrote:
> > > An API would probably be in order.
> > >
> > > Performance due to non-redundancy.
> > >
> > > Or a desire to let people choose what they want.
> > >
> > > Think freedom...
> >
> > How about fewer "visionary statements" and more code?
> 
> How about discussing the feasibility before sending any code?

Great idea.  Discuss the feasibility, rather than responding with
platitudes to people telling you it's infeasible.  Show examples, show
how it would help.  Take the inetd.conf example.  Look at the current
user-space parser.  Show how a new implementation might work.  Use
pseudocode where necessary; this doesn't have to compile, it has to
convince people that there's something worthwhile in doing this.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 17:27                 ` Matthew Wilcox
@ 2006-11-14 17:55                   ` Theodore Tso
  2006-11-14 18:23                   ` Al Boldi
  1 sibling, 0 replies; 25+ messages in thread
From: Theodore Tso @ 2006-11-14 17:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Al Boldi, linux-fsdevel

On Tue, Nov 14, 2006 at 10:27:00AM -0700, Matthew Wilcox wrote:
> Great idea.  Discuss the feasibility, rather than responding with
> platitudes to people telling you it's infeasible.  Show examples, show
> how it would help.  Take the inetd.conf example.  Look at the current
> user-space parser.  Show how a new implementation might work.  Use
> pseudocode where necessary; this doesn't have to compile, it has to
> convince people that there's something worthwhile in doing this.

+----------+
|  PLEASE  |
|  DO NOT  |
| FEED THE |
|  TROLLS  |
+----------+
    |  |    
    |  |    
  .\|.||/.. 


I've given up on Al Boldi as a troll; until he actually gives a
concrete, detailed plan to discuss, instead of content-free
platitudes, I've written him off as someone who likes spouting
nonsense in the hope of drawing a reaction out of folks.

						- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 15:19             ` phillip
@ 2006-11-14 18:19               ` Andreas Dilger
  0 siblings, 0 replies; 25+ messages in thread
From: Andreas Dilger @ 2006-11-14 18:19 UTC (permalink / raw)
  To: phillip
  Cc: thephilips, hbryan, jsipek, atraeger, linux-fsdevel, phillip.lougher

On Nov 14, 2006  15:19 +0000, phillip@lougher.demon.co.uk wrote:
> adilger@clusterfs.com wrote:
> > At the filesystem summit we DID find a surprising number of small files
> > even when the whole system was examined.  We discussed storing small
> > files directly in the inode along with other EAs (this would require
> > larger inodes).  This improves data locality and performance (i.e. stat
> > of the file loads the small file data into cache), though the assumption
> > is that there will be an increasing number of EAs on files in the future.
> 
> So it won't be feasible to store EAs + small file in the inode, or in the
> future it won't be feasible to store just EAs in the inode for most files?

Sorry to be unclear.  What I meant was that part of the justification for
always increasing the inode size (which causes internal fragmentation
itself) is the assumption that the larger inode size can be efficiently
used for a variety of EAs (small file bodies, ACLs, selinux, etc).

Additionally, it was proposed to use this EA space to store the file
name(s) belonging to the inode, and in fact turn the inode table into the
directory itself.  That means information needed for dirent->d_type, etc
is readily available, as is readdir+ (readdir + stat) information.  For
small files it means that readdir + stat + read operations like "grep -r"
or "find ... | xargs grep" or any number of others can be done without any
seeks, which is a HUGE performance improvement.

> Are there any stats showing the current amount/size of EAs per file, and
> what it is expected to be in the future?

No, just empirical evidence.  Using the EA space for filenames to create a
directory itself adds a significant amount of EAs if files are hard linked.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
  2006-11-14 17:27                 ` Matthew Wilcox
  2006-11-14 17:55                   ` Theodore Tso
@ 2006-11-14 18:23                   ` Al Boldi
  1 sibling, 0 replies; 25+ messages in thread
From: Al Boldi @ 2006-11-14 18:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Theodore Tso, linux-fsdevel

Matthew Wilcox wrote:
> On Tue, Nov 14, 2006 at 07:59:43PM +0300, Al Boldi wrote:
> > Matthew Wilcox wrote:
> > > On Tue, Nov 14, 2006 at 06:43:39PM +0300, Al Boldi wrote:
> > > > An API would probably be in order.
> > > >
> > > > Performance due to non-redundancy.
> > > >
> > > > Or a desire to let people choose what they want.
> > > >
> > > > Think freedom...
> > >
> > > How about fewer "visionary statements" and more code?
> >
> > How about discussing the feasibility before sending any code?
>
> Great idea.  Discuss the feasibility, rather than responding with
> platitudes to people telling you it's infeasible.

I don't think an API is a platitude.

google API and you may know what I mean.

> Show examples, show how it would help.

This thread is an example.

> Take the inetd.conf example.  Look at the current
> user-space parser.  Show how a new implementation might work.

Plugins are nothing new, and they are known to work.

> Use pseudocode where necessary; this doesn't have to compile, it has to
> convince people that there's something worthwhile in doing this.

The important part is to convince those who ACK.

IMHO, anyway.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Allocation strategy - dynamic zone for small files
       [not found]                 ` <efa6f5910611140541m302201e6t4e84551b75e79611@mail.gmail.com>
  2006-11-14 13:56                   ` Jörn Engel
@ 2006-11-14 18:23                   ` Andreas Dilger
  1 sibling, 0 replies; 25+ messages in thread
From: Andreas Dilger @ 2006-11-14 18:23 UTC (permalink / raw)
  To: Ihar `Philips` Filipau
  Cc: Jörn Engel, Dave Kleikamp, Bryan Henderson, Josef Sipek,
	avishay, linux-fsdevel

On Nov 14, 2006  14:41 +0100, Ihar `Philips` Filipau wrote:
> The more I think about this, the more I'm convinced that some sort of
> compromise is required, e.g. a file system with 2 or more cluster sizes:
> for example 4k for small/medium files, and a 64+k cluster for large files.
> Files of 100+MB are not that rare anymore (home video/audio processing
> is now affordable as never before).  But on the other hand, tiny files
> like those found in /etc or ~/.kde/* are not going to disappear
> anytime soon.

Well, the current plan is that the new allocator (mballoc + delalloc) from
Alex Tomas will do efficient in-memory allocation of many contiguous
blocks, and the extents format will allow efficient on-disk storage of
many contiguous blocks, so the benefit of a larger cluster size in the
disk format is minimal.

Essentially, delaying the disk allocation until a file is large/complete
(delalloc) and then using a buddy allocator in memory to get contiguous
chunks of disk is better than a hard 64K+ cluster, because it avoids
internal fragmentation and allows much more optimal/efficient placement
than just a factor-of-16 reduction in the block count.
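
As a toy illustration (nothing like the real mballoc code): once
delalloc has let the file reach its final size, the allocator can cover
the whole request with one power-of-two buddy chunk instead of
thousands of individually allocated 4K blocks.

#include <stdio.h>

/* Smallest order such that 2^order blocks cover the request. */
static unsigned int buddy_order(unsigned int blocks)
{
	unsigned int order = 0;

	while ((1u << order) < blocks)
		order++;
	return order;
}

int main(void)
{
	unsigned int blocks = 25600;	/* a 100MB file in 4K blocks */
	unsigned int order = buddy_order(blocks);

	printf("request %u blocks -> one 2^%u = %u block chunk\n",
	       blocks, order, 1u << order);
	return 0;
}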

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread

Thread overview: 25+ messages
-- links below jump to the message on this page --
2006-11-13 10:37 Allocation strategy - dynamic zone for small files Ihar `Philips` Filipau
2006-11-13 13:56 ` avishay
2006-11-13 17:46   ` Bryan Henderson
2006-11-13 19:38     ` Josef Sipek
2006-11-13 21:12       ` Bryan Henderson
2006-11-13 23:32         ` Ihar `Philips` Filipau
2006-11-13 23:57           ` Andreas Dilger
2006-11-14  2:19             ` Dave Kleikamp
2006-11-14 13:15               ` Jörn Engel
     [not found]                 ` <efa6f5910611140541m302201e6t4e84551b75e79611@mail.gmail.com>
2006-11-14 13:56                   ` Jörn Engel
2006-11-14 18:23                   ` Andreas Dilger
2006-11-14 15:19             ` phillip
2006-11-14 18:19               ` Andreas Dilger
2006-11-14  0:15           ` Josef Sipek
2006-11-14  0:59           ` Bryan Henderson
2006-11-14  1:02     ` Theodore Tso
2006-11-14 11:21       ` Al Boldi
2006-11-14 14:25         ` Theodore Tso
2006-11-14 15:43           ` Al Boldi
2006-11-14 15:46             ` Matthew Wilcox
2006-11-14 16:59               ` Al Boldi
2006-11-14 17:27                 ` Matthew Wilcox
2006-11-14 17:55                   ` Theodore Tso
2006-11-14 18:23                   ` Al Boldi
2006-11-14 14:30       ` phillip
