Shrinking ext3 directories

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Shrinking ext3 directories
@ 2002-06-18 16:08 DervishD
  2002-06-18 16:10 ` Austin Gonyou
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: DervishD @ 2002-06-18 16:08 UTC (permalink / raw)
  To: Linux-kernel

    Hi all :))

    All of you know that if you create a lot of files or directories
within a directory on ext2/3 and after that you remove them, the
blocks aren't freed (this is the reason behind the lost+found block
preallocation). If you want to 'shrink' the directory now that it
doesn't contain a lot of leafs, the only solution I know is creating
a new directory, move the remaining leafs to it, remove the
'big-unshrinken' directory and after that renaming the new directory:

    $ mkdir new-dir
    $ mv bigone/* new-dir/
    $ rmdir bigone
    $ mv new-dir bigone
    (Well, sort of)

    Any other way of doing the same without the mess?

    Thanks a lot :)
    Raúl

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:08 Shrinking ext3 directories DervishD
@ 2002-06-18 16:10 ` Austin Gonyou
  2002-06-18 16:39   ` Andreas Dilger
  2002-06-18 19:34   ` DervishD
  2002-06-18 16:21 ` Padraig Brady
  2002-06-18 21:50 ` Stephen C. Tweedie
  2 siblings, 2 replies; 42+ messages in thread
From: Austin Gonyou @ 2002-06-18 16:10 UTC (permalink / raw)
  To: DervishD; +Cc: Linux-kernel

Use a volume manager?(LVM or EVMS maybe.) You can grow and shrink their
volumes dynamically. EXT3 mus support ioctls for this, but if it does,
cause I've seen it doesn with EXT2, then you're good.

On Tue, 2002-06-18 at 11:08, DervishD wrote:
>     Hi all :))
> 
>     All of you know that if you create a lot of files or directories
> within a directory on ext2/3 and after that you remove them, the
> blocks aren't freed (this is the reason behind the lost+found block
> preallocation). If you want to 'shrink' the directory now that it
> doesn't contain a lot of leafs, the only solution I know is creating
> a new directory, move the remaining leafs to it, remove the
> 'big-unshrinken' directory and after that renaming the new directory:
> 
>     $ mkdir new-dir
>     $ mv bigone/* new-dir/
>     $ rmdir bigone
>     $ mv new-dir bigone
>     (Well, sort of)
> 
>     Any other way of doing the same without the mess?
> 
>     Thanks a lot :)
>     Raúl
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Austin Gonyou <austin@digitalroadkill.net>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:08 Shrinking ext3 directories DervishD
  2002-06-18 16:10 ` Austin Gonyou
@ 2002-06-18 16:21 ` Padraig Brady
  2002-06-18 16:54   ` David Lang
  2002-06-18 19:35   ` DervishD
  2002-06-18 21:50 ` Stephen C. Tweedie
  2 siblings, 2 replies; 42+ messages in thread
From: Padraig Brady @ 2002-06-18 16:21 UTC (permalink / raw)
  To: DervishD; +Cc: Linux-kernel

DervishD wrote:
>     Hi all :))
> 
>     All of you know that if you create a lot of files or directories
> within a directory on ext2/3 and after that you remove them, the
> blocks aren't freed (this is the reason behind the lost+found block
> preallocation). If you want to 'shrink' the directory now that it
> doesn't contain a lot of leafs, the only solution I know is creating
> a new directory, move the remaining leafs to it, remove the
> 'big-unshrinken' directory and after that renaming the new directory:
> 
>     $ mkdir new-dir
>     $ mv bigone/* new-dir/
>     $ rmdir bigone
>     $ mv new-dir bigone
>     (Well, sort of)

The zipdir component of fslint does this (while maintaining permissions
etc.).

>     Any other way of doing the same without the mess?

Not at present I think. Perhaps we'll get it for free with
the new htree directory indexing?

Padraig.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:10 ` Austin Gonyou
@ 2002-06-18 16:39   ` Andreas Dilger
  2002-06-18 19:39     ` DervishD
  2002-06-18 19:34   ` DervishD
  1 sibling, 1 reply; 42+ messages in thread
From: Andreas Dilger @ 2002-06-18 16:39 UTC (permalink / raw)
  To: Austin Gonyou; +Cc: DervishD, Linux-kernel

On Jun 18, 2002  11:10 -0500, Austin Gonyou wrote:
> Use a volume manager?(LVM or EVMS maybe.) You can grow and shrink their
> volumes dynamically. EXT3 mus support ioctls for this, but if it does,
> cause I've seen it doesn with EXT2, then you're good.

Totally irrelevant.

> On Tue, 2002-06-18 at 11:08, DervishD wrote:
> >     Hi all :))
> > 
> >     All of you know that if you create a lot of files or directories
> > within a directory on ext2/3 and after that you remove them, the
> > blocks aren't freed (this is the reason behind the lost+found block
> > preallocation). If you want to 'shrink' the directory now that it
> > doesn't contain a lot of leafs, the only solution I know is creating
> > a new directory, move the remaining leafs to it, remove the
> > 'big-unshrinken' directory and after that renaming the new directory:
> > 
> >     $ mkdir new-dir
> >     $ mv bigone/* new-dir/
> >     $ rmdir bigone
> >     $ mv new-dir bigone
> >     (Well, sort of)
> > 
> >     Any other way of doing the same without the mess?

Not right now.  Truncating directories on a mounted filesystem is
probably going to be a big source of strange problems.  In the end
it isn't really a problem for most people, because if your directory
has grown big once it is likely to grow big again.

With htree directories we will have to make this work at some point,
because you will be able to create huge directories and not being able
to free directory blocks would be a big pain.  It isn't in the current
htree directory code yet, but it has been discussed a bit already.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:21 ` Padraig Brady
@ 2002-06-18 16:54   ` David Lang
  2002-06-18 19:35   ` DervishD
  1 sibling, 0 replies; 42+ messages in thread
From: David Lang @ 2002-06-18 16:54 UTC (permalink / raw)
  To: Padraig Brady; +Cc: DervishD, Linux-kernel

this won't be part of the htree indexing, however the biggest reason to be
concerned about shrinking the directory is the horrible performance that
ext2/3 have when dealing with large directories and the htree stuff will
hopefully eliminate that performance problem so the only remaining reason
would be to free up a few K of disk space (which is a MUCH less critical
issue)

David Lang


 On Tue, 18 Jun 2002, Padraig Brady
wrote:

> Date: Tue, 18 Jun 2002 17:21:18 +0100
> From: Padraig Brady <padraig@antefacto.com>
> To: DervishD <raul@pleyades.net>
> Cc: Linux-kernel <linux-kernel@vger.kernel.org>
> Subject: Re: Shrinking ext3 directories
>
> DervishD wrote:
> >     Hi all :))
> >
> >     All of you know that if you create a lot of files or directories
> > within a directory on ext2/3 and after that you remove them, the
> > blocks aren't freed (this is the reason behind the lost+found block
> > preallocation). If you want to 'shrink' the directory now that it
> > doesn't contain a lot of leafs, the only solution I know is creating
> > a new directory, move the remaining leafs to it, remove the
> > 'big-unshrinken' directory and after that renaming the new directory:
> >
> >     $ mkdir new-dir
> >     $ mv bigone/* new-dir/
> >     $ rmdir bigone
> >     $ mv new-dir bigone
> >     (Well, sort of)
>
> The zipdir component of fslint does this (while maintaining permissions
> etc.).
>
> >     Any other way of doing the same without the mess?
>
> Not at present I think. Perhaps we'll get it for free with
> the new htree directory indexing?
>
> Padraig.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:10 ` Austin Gonyou
  2002-06-18 16:39   ` Andreas Dilger
@ 2002-06-18 19:34   ` DervishD
  1 sibling, 0 replies; 42+ messages in thread
From: DervishD @ 2002-06-18 19:34 UTC (permalink / raw)
  To: austin, raul; +Cc: linux-kernel

    Hi Austin :)

>Use a volume manager?(LVM or EVMS maybe.) You can grow and shrink
>their volumes dynamically. EXT3 mus support ioctls for this, but if
>it does, cause I've seen it doesn with EXT2, then you're good.

    I'm afraid I explained myself bad O:) I didn't refer to directory
contents, but to the metadata, or the blocks of the directory itself.

    Thanks anyway :))
    Raúl

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:21 ` Padraig Brady
  2002-06-18 16:54   ` David Lang
@ 2002-06-18 19:35   ` DervishD
  1 sibling, 0 replies; 42+ messages in thread
From: DervishD @ 2002-06-18 19:35 UTC (permalink / raw)
  To: padraig, raul; +Cc: linux-kernel

    Hi Padraig :)

>>     Any other way of doing the same without the mess?
>Not at present I think. Perhaps we'll get it for free with
>the new htree directory indexing?

    What's that? BTW, thanks a lot, I didn't know about that feature
of fslint.

    Raúl

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:39   ` Andreas Dilger
@ 2002-06-18 19:39     ` DervishD
  0 siblings, 0 replies; 42+ messages in thread
From: DervishD @ 2002-06-18 19:39 UTC (permalink / raw)
  To: adilger, austin; +Cc: raul, linux-kernel

    Hi Andreas :)

>> >     Any other way of doing the same without the mess?
>Not right now.  Truncating directories on a mounted filesystem is
>probably going to be a big source of strange problems.

    So I'd better go with the move-and-rename alternative...

>In the end it isn't really a problem for most people, because if
>your directory has grown big once it is likely to grow big again.

    Well, my problem arose just a couple of times, when by mistake I
extract a tarball in a directory, or by a wrong 'touch' command
(related to some bad use of 'seq'), etc... Not a problem at all, but
curiosity about this.

>With htree directories we will have to make this work at some point,

    Ok, I'll take a look then :)) Thanks for answering :)
    Raúl

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 16:08 Shrinking ext3 directories DervishD
  2002-06-18 16:10 ` Austin Gonyou
  2002-06-18 16:21 ` Padraig Brady
@ 2002-06-18 21:50 ` Stephen C. Tweedie
  2002-06-18 22:18   ` Alexander Viro
  2 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-18 21:50 UTC (permalink / raw)
  To: DervishD; +Cc: Linux-kernel, ext2-devel, Stephen Tweedie

Hi,

On Tue, Jun 18, 2002 at 06:08:28PM +0200, DervishD wrote:
 
>     All of you know that if you create a lot of files or directories
> within a directory on ext2/3 and after that you remove them, the
> blocks aren't freed (this is the reason behind the lost+found block
> preallocation). If you want to 'shrink' the directory now that it
> doesn't contain a lot of leafs, the only solution I know is creating
> a new directory, move the remaining leafs to it, remove the
> 'big-unshrinken' directory and after that renaming the new directory

Right.  Shrinking directories is not implemented for ext2 or ext3 at
the moment.  However, I know that Daniel Phillips has been thinking
about adding that for his HTree extensions which add fast directory
indexing to ext2/3.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 21:50 ` Stephen C. Tweedie
@ 2002-06-18 22:18   ` Alexander Viro
  2002-06-19  9:38     ` DervishD
  2002-06-19 10:37     ` Stephen C. Tweedie
  0 siblings, 2 replies; 42+ messages in thread
From: Alexander Viro @ 2002-06-18 22:18 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: DervishD, Linux-kernel, ext2-devel

On Tue, 18 Jun 2002, Stephen C. Tweedie wrote:

> Hi,
> 
> On Tue, Jun 18, 2002 at 06:08:28PM +0200, DervishD wrote:
>  
> >     All of you know that if you create a lot of files or directories
> > within a directory on ext2/3 and after that you remove them, the
> > blocks aren't freed (this is the reason behind the lost+found block
> > preallocation). If you want to 'shrink' the directory now that it
> > doesn't contain a lot of leafs, the only solution I know is creating
> > a new directory, move the remaining leafs to it, remove the
> > 'big-unshrinken' directory and after that renaming the new directory
> 
> Right.  Shrinking directories is not implemented for ext2 or ext3 at
> the moment.  However, I know that Daniel Phillips has been thinking
> about adding that for his HTree extensions which add fast directory
> indexing to ext2/3.

<shrug> for ext2 a limited form of "shrinking" is easy to implement.
ext2_delete_entry() can easily notice that it's about to create an
empty entry spanning entire last block.  In that case it should
just walk back and check beginnings of previous blocks, as long
as they are empty (inode = 0, len = block size).  Then it's vmtruncate()
time - all IO on directories is protected by i_sem, so we are safe.

IOW, making sure that empty blocks in the end of directory get freed
is a matter of 10-20 lines.  If you want such patch - just tell, it's
half an hour of work...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 22:18   ` Alexander Viro
@ 2002-06-19  9:38     ` DervishD
  2002-06-19 10:37     ` Stephen C. Tweedie
  1 sibling, 0 replies; 42+ messages in thread
From: DervishD @ 2002-06-19  9:38 UTC (permalink / raw)
  To: viro, sct; +Cc: raul, linux-kernel, ext2-devel

    Hi Alexander :))

>IOW, making sure that empty blocks in the end of directory get freed
>is a matter of 10-20 lines.  If you want such patch - just tell, it's
>half an hour of work...

    IMHO it would be a great feature to add, and if the cost is as
low as you say... BTW, thanks a lot for your answer and your offer :)

    Raúl

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking ext3 directories
  2002-06-18 22:18   ` Alexander Viro
  2002-06-19  9:38     ` DervishD
@ 2002-06-19 10:37     ` Stephen C. Tweedie
  2002-06-19 17:03       ` [Ext2-devel] " Christopher Li
  1 sibling, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-19 10:37 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Stephen C. Tweedie, DervishD, Linux-kernel, ext2-devel

Hi,

On Tue, Jun 18, 2002 at 06:18:49PM -0400, Alexander Viro wrote:

> IOW, making sure that empty blocks in the end of directory get freed
> is a matter of 10-20 lines.  If you want such patch - just tell, it's
> half an hour of work...

It's certainly easier at the tail, but with htree we may have
genuinely enormous directories and being able to hole-punch arbitrary
coalesced blocks could be a huge win.  Also, doing the coalescing
block by block is likely to be far easier for ext3 than truncating
the directory arbitrarily back in one go.  

Chopping a large directory at once brings back the truncate()
nightmare of having to make an unbounded disk operation seem atomic,
even if it has to get split over multiple transactions.  Incremental
coalescing should allow us to know in advance how many disk blocks we
might end up touching for the operation, so we can guarantee to do it
in one transaction.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 10:37     ` Stephen C. Tweedie
@ 2002-06-19 17:03       ` Christopher Li
  2002-06-19 20:10         ` Stephen C. Tweedie
                           ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Christopher Li @ 2002-06-19 17:03 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Alexander Viro, DervishD, Linux-kernel, ext2-devel



On Wed, 19 Jun 2002, Stephen C. Tweedie wrote:

> Hi,
> 
> On Tue, Jun 18, 2002 at 06:18:49PM -0400, Alexander Viro wrote:
>  
> > IOW, making sure that empty blocks in the end of directory get freed
> > is a matter of 10-20 lines.  If you want such patch - just tell, it's
> > half an hour of work...
> 
> It's certainly easier at the tail, but with htree we may have
> genuinely enormous directories and being able to hole-punch arbitrary
> coalesced blocks could be a huge win.  Also, doing the coalescing
I would can contribute on that. I am thinking about it anyway.
Daniel might already has some code there.

I have a silly question, where is that ext3 CVS? Under sourcefourge
ext2/ext3 or gkernel?

Chris



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 17:03       ` [Ext2-devel] " Christopher Li
@ 2002-06-19 20:10         ` Stephen C. Tweedie
  2002-06-19 20:34           ` Stephen C. Tweedie
  2002-06-19 20:13         ` Andrew Morton
  2002-06-19 22:49         ` Daniel Phillips
  2 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-19 20:10 UTC (permalink / raw)
  To: Christopher Li
  Cc: Stephen C. Tweedie, Alexander Viro, DervishD, Linux-kernel, ext2-devel

Hi,

On Wed, Jun 19, 2002 at 01:03:38PM -0400, Christopher Li wrote:
 
> On Wed, 19 Jun 2002, Stephen C. Tweedie wrote:
> 
> > Hi,
> > 
> > On Tue, Jun 18, 2002 at 06:18:49PM -0400, Alexander Viro wrote:
> >  
> > > IOW, making sure that empty blocks in the end of directory get freed
> > > is a matter of 10-20 lines.  If you want such patch - just tell, it's
> > > half an hour of work...
> > 
> > It's certainly easier at the tail, but with htree we may have
> > genuinely enormous directories and being able to hole-punch arbitrary
> > coalesced blocks could be a huge win.  Also, doing the coalescing
> I would can contribute on that. I am thinking about it anyway.
> Daniel might already has some code there.
> 
> I have a silly question, where is that ext3 CVS? Under sourcefourge
> ext2/ext3 or gkernel?

cvs -d :ext:FOO@cvs.gkernel.sourceforge.net:/cvsroot/gkernel co ext3

The branches being used are

	cvs up -r ext3-1_0-branch	# HEAD of ext3 development
	cvs up -r features-branch	# For htree, ACLs etc

and there are a couple of other branches I use for tracking merges into
Linus's and the -ac trees.  The htree stuff is all that's new in the
features-branch right now.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 17:03       ` [Ext2-devel] " Christopher Li
  2002-06-19 20:10         ` Stephen C. Tweedie
@ 2002-06-19 20:13         ` Andrew Morton
  2002-06-19 22:43           ` Stephen C. Tweedie
  2002-06-19 22:49         ` Daniel Phillips
  2 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2002-06-19 20:13 UTC (permalink / raw)
  To: Christopher Li
  Cc: Stephen C. Tweedie, Alexander Viro, DervishD, Linux-kernel, ext2-devel

Christopher Li wrote:
> 
> ...
> 
> I have a silly question, where is that ext3 CVS? Under sourcefourge
> ext2/ext3 or gkernel?

See http://www.zip.com.au/~akpm/linux/ext3/ - about halfway
down the page.

btw, I merged all the ext3 htree stuff into 2.5.23 yesterday.  Haven't
tested it much at all yet.

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.23/ext3-truncate-fix.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.23/ext3-htree.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.23/htree-fixes.patch

-

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 20:10         ` Stephen C. Tweedie
@ 2002-06-19 20:34           ` Stephen C. Tweedie
  0 siblings, 0 replies; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-19 20:34 UTC (permalink / raw)
  To: Christopher Li
  Cc: Stephen C. Tweedie, Alexander Viro, DervishD, Linux-kernel, ext2-devel

Hi,

On Wed, Jun 19, 2002 at 09:10:51PM +0100, Stephen C. Tweedie wrote:

> cvs -d :ext:FOO@cvs.gkernel.sourceforge.net:/cvsroot/gkernel co ext3
> The branches being used are
> 
> 	cvs up -r ext3-1_0-branch	# HEAD of ext3 development
> 	cvs up -r features-branch	# For htree, ACLs etc

And one other thing: the subdirectories tools/, testing/ and scripts/
on the cvs trunk contain various tools for testing and stressing the
filesystem and VM.

--Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 20:13         ` Andrew Morton
@ 2002-06-19 22:43           ` Stephen C. Tweedie
  2002-06-19 23:54             ` Stephen C. Tweedie
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-19 22:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christopher Li, Stephen C. Tweedie, Alexander Viro, DervishD,
	Linux-kernel, ext2-devel

Hi,

On Wed, Jun 19, 2002 at 01:13:50PM -0700, Andrew Morton wrote:

> > I have a silly question, where is that ext3 CVS? Under sourcefourge
> > ext2/ext3 or gkernel?
> 
> See http://www.zip.com.au/~akpm/linux/ext3/ - about halfway
> down the page.
> 
> btw, I merged all the ext3 htree stuff into 2.5.23 yesterday.  Haven't
> tested it much at all yet.

Well, it has some interesting properties, such as the hash function
being a constant:

+	return 80; /* FIXME: for test only */

which I assume was an artifact of some testing Christopher was doing.
:)

I'm checking out a proper hash function at the moment.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 17:03       ` [Ext2-devel] " Christopher Li
  2002-06-19 20:10         ` Stephen C. Tweedie
  2002-06-19 20:13         ` Andrew Morton
@ 2002-06-19 22:49         ` Daniel Phillips
  2002-06-20  0:24           ` Andreas Dilger
  2002-06-20  9:34           ` Stephen C. Tweedie
  2 siblings, 2 replies; 42+ messages in thread
From: Daniel Phillips @ 2002-06-19 22:49 UTC (permalink / raw)
  To: Christopher Li, Stephen C. Tweedie
  Cc: Alexander Viro, DervishD, Linux-kernel, ext2-devel

On Wednesday 19 June 2002 19:03, Christopher Li wrote:
> On Wed, 19 Jun 2002, Stephen C. Tweedie wrote:
> > On Tue, Jun 18, 2002 at 06:18:49PM -0400, Alexander Viro wrote:
> >  
> > > IOW, making sure that empty blocks in the end of directory get freed
> > > is a matter of 10-20 lines.  If you want such patch - just tell, it's
> > > half an hour of work...
> > 
> > It's certainly easier at the tail, but with htree we may have
> > genuinely enormous directories and being able to hole-punch arbitrary
> > coalesced blocks could be a huge win.  Also, doing the coalescing
>
> I would can contribute on that. I am thinking about it anyway.
> Daniel might already has some code there.

I don't have code, but let me remind you of this post:

   http://marc.theaimsgroup.com/?l=ext2-devel&m=102132142032096&w=2

A sketch of the coalescing design is at the end.  I'll formalize that.
One issue Stephen touched on that I hadn't settled at the time, is how
to handle deleted blocks.  My inclination is to copy the last block of
the directory into the vacated block as opposed to leaving a hole in
the file.  The slight extra cost doesn't seem to be worth worrying
about, and it's guaranteed to leave the directory in a compact state
when emptied.

The two competing approaches are the hole-punch idea - which I didn't
consider before - and keeping a list of free blocks somehow.  I think
it's best to err on the side of simplicity this time: the copy-down-last
strategy eliminates the need to search for a free block when the
directory needs to be expanded again, 

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 22:43           ` Stephen C. Tweedie
@ 2002-06-19 23:54             ` Stephen C. Tweedie
  2002-06-21  3:28               ` Daniel Phillips
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-19 23:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christopher Li, Stephen C. Tweedie, Alexander Viro, DervishD,
	Linux-kernel, ext2-devel

Hi,

On Wed, Jun 19, 2002 at 11:43:40PM +0100, Stephen C. Tweedie wrote:

> Well, it has some interesting properties, such as the hash function
> being a constant:
> 
> +	return 80; /* FIXME: for test only */
> 
> which I assume was an artifact of some testing Christopher was doing.
> :)
> 
> I'm checking out a proper hash function at the moment.

Done, checked into ext3 cvs (features-branch again.)

Deleting and recreating 100,000 files with this kernel:

[root@spock test0]# time xargs rm -f < /root/flist.100000 

real    0m14.305s
user    0m0.750s
sys     0m5.430s
[root@spock test0]# time xargs touch < /root/flist.100000 

real    0m16.244s
user    0m0.530s
sys     0m6.660s

that's an average of 160usec per create, 140usec per delete elapsed
time, and 66/54usec respectively system time.  

I assume the elapsed time is greater only because we're starting to
wrap the journal due to the large amount of metadata being touched
(we're touching a lot of inodes doing the above, which I could avoid
by making hard links instead of new files.)  Certainly, limiting the
test to 10,000 files lets it run at 100% cpu.

--Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 22:49         ` Daniel Phillips
@ 2002-06-20  0:24           ` Andreas Dilger
  2002-06-20  9:34           ` Stephen C. Tweedie
  1 sibling, 0 replies; 42+ messages in thread
From: Andreas Dilger @ 2002-06-20  0:24 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Christopher Li, Stephen C. Tweedie, Alexander Viro, DervishD,
	Linux-kernel, ext2-devel

On Jun 20, 2002  00:49 +0200, Daniel Phillips wrote:
> My inclination is to copy the last block of the directory into the
> vacated block as opposed to leaving a hole in the file.  The slight
> extra cost doesn't seem to be worth worrying about, and it's guaranteed
> to leave the directory in a compact state when emptied.

This also has the benefit of avoiding huge truncates when we are
deleting lots of files.  At most it will add a single block into
each transaction.

> I think it's best to err on the side of simplicity this time: the
> copy-down-last strategy eliminates the need to search for a free
> block when the directory needs to be expanded again, 

It also keeps compatibility with older code, whereas having holes in
directories can cause problems on older kernels.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 22:49         ` Daniel Phillips
  2002-06-20  0:24           ` Andreas Dilger
@ 2002-06-20  9:34           ` Stephen C. Tweedie
  2002-06-20 10:18             ` Andreas Dilger
  1 sibling, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-20  9:34 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Christopher Li, Stephen C. Tweedie, Alexander Viro, DervishD,
	Linux-kernel, ext2-devel

Hi,

On Thu, Jun 20, 2002 at 12:49:57AM +0200, Daniel Phillips wrote:

> I don't have code, but let me remind you of this post:
> 
>    http://marc.theaimsgroup.com/?l=ext2-devel&m=102132142032096&w=2
> 
> A sketch of the coalescing design is at the end.  I'll formalize that.

One question --- just how stable will this be if we boot into a kernel
that doesn't have the coalescing enabled, and start modifying
directories?  We _could_ just teach the current code to clear those
top 8 bits in the parent any time we touch a leaf node, but that's
unnecessarily expensive, so we'd really need to have some way of
either recreating the hint fields from scratch every so often, or of
spotting when they have become badly out-of-date.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-20  9:34           ` Stephen C. Tweedie
@ 2002-06-20 10:18             ` Andreas Dilger
  2002-06-20 13:45               ` Daniel Phillips
  2002-06-20 16:26               ` Bill Davidsen
  0 siblings, 2 replies; 42+ messages in thread
From: Andreas Dilger @ 2002-06-20 10:18 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Daniel Phillips, Christopher Li, Alexander Viro, DervishD,
	Linux-kernel, ext2-devel

On Jun 20, 2002  10:34 +0100, Stephen C. Tweedie wrote:
> On Thu, Jun 20, 2002 at 12:49:57AM +0200, Daniel Phillips wrote:
> > I don't have code, but let me remind you of this post:
> > 
> >    http://marc.theaimsgroup.com/?l=ext2-devel&m=102132142032096&w=2
> > 
> > A sketch of the coalescing design is at the end.  I'll formalize that.
> 
> One question --- just how stable will this be if we boot into a kernel
> that doesn't have the coalescing enabled, and start modifying
> directories?  We _could_ just teach the current code to clear those
> top 8 bits in the parent any time we touch a leaf node, but that's
> unnecessarily expensive, so we'd really need to have some way of
> either recreating the hint fields from scratch every so often, or of
> spotting when they have become badly out-of-date.

Three notes:
1) Coalescing isn't necessarily the same as just discarding empty
   blocks.  We can do the latter much more easily, and without the
   hint bits at all, but it won't work unless a block is totally
   empty, so you could still approach 0% fullness with huge directories.

2) The hint bits are meant to be intentionally vague (i.e. only a hint)
   so there is no need to keep them 100% up-to-date.  If it turns out
   that you modify a directory with a kernel that does not understand
   coalescing it is fairly benign.  The worst that would happen is that
   you get an empty leaf block (assuming you don't even have the simple
   support for dropping empty blocks), or you try to coalesce with such
   a block and find it too full to do the merge, so you update the hint
   again with the correct value.  Over the normal course of operations
   you would be updating the hints for each block anyways, so invalid
   hints would be cleaned out from the index.

To avoid extra overhead from writing out the parent each time you delete
an entry from the leaf, you could update the values all of the time
(you had to have read the parent to find the correct leaf block), but
not mark the block dirty, so the updated hints are only written to disk
if there is another reason to write out the block (e.g. split/coalesce
of a leaf block).

Having a large number of bits of hint info would not necessarily be
useful.  In Daniel's "1-bit hint" example, the actual worst-case fullness
could _approach_ 25%, but you would always drop 100% empty blocks
immediately, so it would never quite get there.

With 2 bits of hint, you would probably only merge if the sum of the two
neighbours was <= 3 (i.e. 75% fullness for a single block), because you
don't necessarily want to be merging blocks to be almost 100% full and
then splitting them again.  This would give a worst-case fullness between
37.5% and 75% at any time, which isn't really so bad given the performance
implications of repeated merge+truncate+allocate+split operations.

Remember also that each leaf block merge will incur a copy from the tail
block (which may need to be read from disk) and then a truncate to drop
that block.  We _could_ leave some number of empty dir blocks at the end
of the directory file if we had some sort of dir prealloc scheme happening.
There would be some amount of hysteresis there to avoid the repeated
alloc/free overhead (i.e. keep no more than 8 free blocks, but allocate
8 blocks at a time if you need more).

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-20 10:18             ` Andreas Dilger
@ 2002-06-20 13:45               ` Daniel Phillips
  2002-06-21 14:54                 ` Ville Herva
  2002-06-20 16:26               ` Bill Davidsen
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Phillips @ 2002-06-20 13:45 UTC (permalink / raw)
  To: Andreas Dilger, Stephen C. Tweedie
  Cc: Christopher Li, Alexander Viro, DervishD, Linux-kernel, ext2-devel

On Thursday 20 June 2002 12:18, Andreas Dilger wrote:
> On Jun 20, 2002  10:34 +0100, Stephen C. Tweedie wrote:
> > One question --- just how stable will this be if we boot into a kernel
> > that doesn't have the coalescing enabled, and start modifying
> > directories?  We _could_ just teach the current code to clear those
> > top 8 bits in the parent any time we touch a leaf node, but that's
> > unnecessarily expensive, so we'd really need to have some way of
> > either recreating the hint fields from scratch every so often, or of
> > spotting when they have become badly out-of-date.
> 
> Three notes:
> 1) Coalescing isn't necessarily the same as just discarding empty
>    blocks.  We can do the latter much more easily, and without the
>    hint bits at all, but it won't work unless a block is totally
>    empty, so you could still approach 0% fullness with huge directories.

And in the accidental-untar case that started this thread, Raul would
have the same complaint: a directory bloats up and never unbloats
until completely emptied.

> 2) The hint bits are meant to be intentionally vague (i.e. only a hint)
>    so there is no need to keep them 100% up-to-date.  If it turns out
>    that you modify a directory with a kernel that does not understand
>    coalescing it is fairly benign.  The worst that would happen is that
>    you get an empty leaf block (assuming you don't even have the simple
>    support for dropping empty blocks), or you try to coalesce with such
>    a block and find it too full to do the merge, so you update the hint
>    again with the correct value.  Over the normal course of operations
>    you would be updating the hints for each block anyways, so invalid
>    hints would be cleaned out from the index.

To state it another way: when the fullness hint is wrong, it's an
underestimate.  The self-correcting mechanism you described is exactly
what I had in mind.

> To avoid extra overhead from writing out the parent each time you delete
> an entry from the leaf, you could update the values all of the time
> (you had to have read the parent to find the correct leaf block), but
> not mark the block dirty, so the updated hints are only written to disk
> if there is another reason to write out the block (e.g. split/coalesce
> of a leaf block).

Hmm, so if the VM discards the unmarked dirty index block then on
reread the fullness estimate becomes an overestimate and we could, in
rare circumstances, end up with mergable blocks that never get merged.
Since we're touching the index block (set PG_Referenced), this might
be a very rare occurance indeed.

Another way to achieve a similar effect is to set the fullness to an
underestimate on delete, by rounding down to some smaller number of
hint bits than we actually have available, and set it to an accurate
estimate only on a failed merge (to prevent repeated unsuccessful
merge attempts).  This way, several deletes can take place before we
need to update the index block.  Say we round down to 5 bits on delete,
then on a 4K block we can delete 128 bytes, or 6-7 entries, before
having to update the estimate.  This gives us 85% of the benefit in
terms of reducing index block dirties while only slightly increasing
the chance of a failed merge.

So let me see, the purpose of recording the fullness bits in the
index block in the first place is to save CPU (probing the index
and scanning dirent blocks) and the occasional read.  The cost is
extra complexity in the algorithm, and some extra index block writes.
Did we win?  I think we did, and significantly, but the analysis
isn't good enough yet to quantify that.

> Having a large number of bits of hint info would not necessarily be
> useful.  In Daniel's "1-bit hint" example, the actual worst-case fullness
> could _approach_ 25%, but you would always drop 100% empty blocks
> immediately, so it would never quite get there.
>
> With 2 bits of hint, you would probably only merge if the sum of the two
> neighbours was <= 3 (i.e. 75% fullness for a single block), because you
> don't necessarily want to be merging blocks to be almost 100% full and
> then splitting them again.  This would give a worst-case fullness between
> 37.5% and 75% at any time, which isn't really so bad given the performance
> implications of repeated merge+truncate+allocate+split operations.

Yes, provably containing worst case fullness at 50% would be entirely
satisfactory.  The current worst case, however rare it may be, is 0%.
I'm hoping for an average steady state fullness in the 70% range.

> Remember also that each leaf block merge will incur a copy from the tail
> block (which may need to be read from disk) and then a truncate to drop
> that block.  We _could_ leave some number of empty dir blocks at the end
> of the directory file if we had some sort of dir prealloc scheme happening.
> There would be some amount of hysteresis there to avoid the repeated
> alloc/free overhead (i.e. keep no more than 8 free blocks, but allocate
> 8 blocks at a time if you need more).

We could hope that the block allocation policy eventually improves to
the point where the preallocation is taken care of without the directory
subsystem needing to know about it.  On the other hand, if we do want
explicit preallocation at some point then we probably have to anticipate
it now for forward compatibility reasons.  I think we will be ok just by
building in enough split-merge hysteresis.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-20 10:18             ` Andreas Dilger
  2002-06-20 13:45               ` Daniel Phillips
@ 2002-06-20 16:26               ` Bill Davidsen
  1 sibling, 0 replies; 42+ messages in thread
From: Bill Davidsen @ 2002-06-20 16:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Linux-kernel, ext2-devel

On Thu, 20 Jun 2002, Andreas Dilger wrote:

> Remember also that each leaf block merge will incur a copy from the tail
> block (which may need to be read from disk) and then a truncate to drop
> that block.  We _could_ leave some number of empty dir blocks at the end
> of the directory file if we had some sort of dir prealloc scheme happening.
> There would be some amount of hysteresis there to avoid the repeated
> alloc/free overhead (i.e. keep no more than 8 free blocks, but allocate
> 8 blocks at a time if you need more).

Wouldn't the hysteresis be the frag or block size? Is there benefit to
truncating if it doesn't free any disk space? Actually, there might be
benefit to leaving a few empty blocks at the end of the dir when doing
trunc as a means of reducing alloc.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-19 23:54             ` Stephen C. Tweedie
@ 2002-06-21  3:28               ` Daniel Phillips
  2002-06-21  7:03                 ` Helge Hafting
                                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Daniel Phillips @ 2002-06-21  3:28 UTC (permalink / raw)
  To: Stephen C. Tweedie, Andrew Morton
  Cc: Christopher Li, Linux-kernel, ext2-devel

On Thursday 20 June 2002 01:54, Stephen C. Tweedie wrote:
> > I'm checking out a proper hash function at the moment.
> 
> Done, checked into ext3 cvs (features-branch again.)
> 
> Deleting and recreating 100,000 files with this kernel:
> 
> [root@spock test0]# time xargs rm -f < /root/flist.100000 
> 
> real    0m14.305s
> user    0m0.750s
> sys     0m5.430s
> [root@spock test0]# time xargs touch < /root/flist.100000 
> 
> real    0m16.244s
> user    0m0.530s
> sys     0m6.660s
> 
> that's an average of 160usec per create, 140usec per delete elapsed
> time, and 66/54usec respectively system time.  
> 
> I assume the elapsed time is greater only because we're starting to
> wrap the journal due to the large amount of metadata being touched
> (we're touching a lot of inodes doing the above, which I could avoid
> by making hard links instead of new files.)  Certainly, limiting the
> test to 10,000 files lets it run at 100% cpu.

I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2.  As 
predicted, half-md4 does produce very even bucket distributions.  For 200,000 
creates:

   half-md4:        2872 avg bytes filled per 4k block (70%)
   dx_hack_hash:    2853 avg bytes filled per 4k block (69%)

but guess which was faster overall?

   half-md4:        user 0.43 system 6.88 real 0:07.33 CPU 99%
   dx_hack_hash:    user 0.43 system 6.40 real 0:06.82 CPU 100%

This is quite reproducible: dx_hack_hash is always faster by about 6%.  This 
must be due entirely to the difference in hashing cost, since half-md4 
produces measurably better distributions.  Now what do we do?

By the way, I'm running about 37 usec per create here, on a 1GHz/1GB PIII, 
with Ext2.  I think most of the difference vs your timings is that your test 
code is eating a lot of cpu.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21  3:28               ` Daniel Phillips
@ 2002-06-21  7:03                 ` Helge Hafting
  2002-06-21 14:02                   ` Daniel Phillips
  2002-06-21 16:23                   ` Daniel Phillips
  2002-06-21 15:06                 ` Stephen C. Tweedie
  2002-06-22  5:53                 ` Andreas Dilger
  2 siblings, 2 replies; 42+ messages in thread
From: Helge Hafting @ 2002-06-21  7:03 UTC (permalink / raw)
  To: Daniel Phillips, ext2-devel, linux-kernel

Daniel Phillips wrote:

> I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2.  As
> predicted, half-md4 does produce very even bucket distributions.  For 200,000
> creates:
> 
>    half-md4:        2872 avg bytes filled per 4k block (70%)
>    dx_hack_hash:    2853 avg bytes filled per 4k block (69%)
> 
> but guess which was faster overall?
> 
>    half-md4:        user 0.43 system 6.88 real 0:07.33 CPU 99%
>    dx_hack_hash:    user 0.43 system 6.40 real 0:06.82 CPU 100%
> 
> This is quite reproducible: dx_hack_hash is always faster by about 6%.  This
> must be due entirely to the difference in hashing cost, since half-md4
> produces measurably better distributions.  Now what do we do?

No surprise that the worse distribution is faster - you get less
io when fewer blocks are used.  Which means a bad distribution beats 
a good one _until_ blocks start to really fill up and collide. 2.8K per
4K block is only 70% full.  I guess the better hash wins
if you force a higher fill rate?

Helge Hafting

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21  7:03                 ` Helge Hafting
@ 2002-06-21 14:02                   ` Daniel Phillips
  2002-06-24  7:12                     ` Helge Hafting
  2002-06-21 16:23                   ` Daniel Phillips
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Phillips @ 2002-06-21 14:02 UTC (permalink / raw)
  To: Helge Hafting, ext2-devel, linux-kernel

On Friday 21 June 2002 09:03, Helge Hafting wrote:
> Daniel Phillips wrote:
> 
> > I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2.  As
> > predicted, half-md4 does produce very even bucket distributions.  For 200,000
> > creates:
> > 
> >    half-md4:        2872 avg bytes filled per 4k block (70%)
> >    dx_hack_hash:    2853 avg bytes filled per 4k block (69%)
> > 
> > but guess which was faster overall?
> > 
> >    half-md4:        user 0.43 system 6.88 real 0:07.33 CPU 99%
> >    dx_hack_hash:    user 0.43 system 6.40 real 0:06.82 CPU 100%
> > 
> > This is quite reproducible: dx_hack_hash is always faster by about 6%.  This
> > must be due entirely to the difference in hashing cost, since half-md4
> > produces measurably better distributions.  Now what do we do?
> 
> No surprise that the worse distribution is faster - you get less
> io when fewer blocks are used.  Which means a bad distribution beats 
> a good one _until_ blocks start to really fill up and collide. 2.8K per
> 4K block is only 70% full.  I guess the better hash wins
> if you force a higher fill rate?

Hashing in htree doesn't work like that - what you're thinking about
is a traditional fixed-size hash table.  HTree is a btree that uses
hashes of names as keys.  Each block has a variable amount of the key
space assigned to it, initially just one block for the entire key
space.  When that block fills up, its entries and its key space are
split into two, then those blocks start to fill up, get split, and
so on.

So more even key distribution means the key space gets split more
evenly, and blocks are more likely to fill up evenly, meaning less
splitting, fewer blocks in total, and less IO.

A hash function that distributes keys better should give somewhat
better performance, and that has indeed been my experience.  But
in the case of half-MD4, the improvement we get from better
distribution is wiped out by the higher cost of computing the hash
function.[1]  Which is not to say that the work is without value.
The beautiful distribution given by the half-MD4 hash gives us
something to aim at, we just have to be more efficient about it.

I should note that HTree isn't hugely sensitive to bad hash
functions, at least not at the outset when a directory is growing.
The worst that happens is every leaf block ends up half-full with
a performance hit of just a few percent.  However, over time with
many insertions and deletions the hash space can get cut up into 
smaller and smaller pieces, so leaf blocks become less and less
full.  A more uniform hash function will slow this process down a
great deal, but it will not stop it completely.  The proper way
to deal with long term key space fragmentation is to implement
coalesce-on-delete, which is in progress.

[1] CPU cost in filesystem operations *is* important - a lot more
important than commonly thought.  Here we have yet another example
where CPU cost in filesystem operations dominates IO time, and
indeed, since directory operations are performed almost entirely
in cache, the quadratic cost of linear directory lookup is almost
entirely CPU cost.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-20 13:45               ` Daniel Phillips
@ 2002-06-21 14:54                 ` Ville Herva
  2002-06-21 15:08                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 42+ messages in thread
From: Ville Herva @ 2002-06-21 14:54 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Linux-kernel, ext2-devel

On Thu, Jun 20, 2002 at 03:45:51PM +0200, you [Daniel Phillips] wrote:
> 
> And in the accidental-untar case that started this thread, Raul would
> have the same complaint: a directory bloats up and never unbloats
> until completely emptied.

Not only accidental untar, but buggy progs as well. Recently, I found out
that named had created tens of thousands of (luckily zero-length) files in a
single dir on ext2. While it only took couple of hours to delete them with
"find . -name '...'| xargs -n 5000 rm" commands, I can imagine remote DOS
attacks through daemons that create local temp files. Accessing such
directory quickly becomes slow as molasses on ext2.

Daniels patch seems great. I also recall someone (Ted T'so? Stephen Tweedie?)
had another dir access speed-up patch for ext3... Is that applicable to ext2
or was it already merged?

-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21  3:28               ` Daniel Phillips
  2002-06-21  7:03                 ` Helge Hafting
@ 2002-06-21 15:06                 ` Stephen C. Tweedie
  2002-07-04  4:48                   ` Daniel Phillips
  2002-06-22  5:53                 ` Andreas Dilger
  2 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-21 15:06 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Andrew Morton, Christopher Li, Linux-kernel,
	ext2-devel

Hi,

On Fri, Jun 21, 2002 at 05:28:28AM +0200, Daniel Phillips wrote:

> I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2.  As 
> predicted, half-md4 does produce very even bucket distributions.  For 200,000 
> creates:
> 
>    half-md4:        2872 avg bytes filled per 4k block (70%)
>    dx_hack_hash:    2853 avg bytes filled per 4k block (69%)
> 
> but guess which was faster overall?
> 
>    half-md4:        user 0.43 system 6.88 real 0:07.33 CPU 99%
>    dx_hack_hash:    user 0.43 system 6.40 real 0:06.82 CPU 100%
> 
> This is quite reproducible: dx_hack_hash is always faster by about 6%.  This 
> must be due entirely to the difference in hashing cost, since half-md4 
> produces measurably better distributions.  Now what do we do?

I want to get this thing tested!  

There are far too many factors for this to be resolved very quickly.
In reality, there will be a lot of disk cost under load which you
don't see in benchmarks, too.  We also know for a fact that the early
hashes used in Reiserfs were quick but were vulnerable to terribly bad
behaviour under certain application workloads.  With the half-md4, at
least we can expect decent worst-case behaviour unless we're under
active attack (ie. only maliscious apps get hurt).

I think the md4 is a safer bet until we know more, so I'd vote that we
stick with the ext3 cvs code which uses hash version #1 for that, and
defer anything else until we've seen more --- the hash versioning lets
us do that safely.

> By the way, I'm running about 37 usec per create here, on a 1GHz/1GB PIII, 
> with Ext2.  I think most of the difference vs your timings is that your test 
> code is eating a lot of cpu.

I was getting nearer to 50usec system time, but on an athlon k7-700,
so those timings are pretty comparable.  Mine was ext3, too, which
accounts for a bit.  The difference between that and wall-clock time
was all just idle time, which I think was due to using "touch"/"rm"
--- ie. there was a lot of inode table write activity due to the files
being created/deleted, and that was forcing a journal wrap before the
end of the test.  That effect is not visible on ext2, of course.

--Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21 14:54                 ` Ville Herva
@ 2002-06-21 15:08                   ` Stephen C. Tweedie
  2002-06-21 15:38                     ` Ville Herva
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-21 15:08 UTC (permalink / raw)
  To: Ville Herva, Daniel Phillips, Andreas Dilger, Linux-kernel, ext2-devel

Hi,

On Fri, Jun 21, 2002 at 05:54:51PM +0300, Ville Herva wrote:
 
> Daniels patch seems great. I also recall someone (Ted T'so? Stephen Tweedie?)
> had another dir access speed-up patch for ext3... Is that applicable to ext2
> or was it already merged?

That was Ted's microoptimisation to start directory lookups at the
point where we last looked in the directory.  It's in ext3 already
these days, and it would definitely help for the mass-delete case.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21 15:08                   ` Stephen C. Tweedie
@ 2002-06-21 15:38                     ` Ville Herva
  2002-06-21 16:15                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 42+ messages in thread
From: Ville Herva @ 2002-06-21 15:38 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Daniel Phillips, Andreas Dilger, Linux-kernel, ext2-devel

On Fri, Jun 21, 2002 at 04:08:33PM +0100, you [Stephen C. Tweedie] wrote:
> 
> That was Ted's microoptimisation to start directory lookups at the
> point where we last looked in the directory. 

That exactly. (Micro maybe in size, but I gather the sped up lookups quite a
bit in some cases?)

> It's in ext3 already these days,

So I thought, but not in ext2?

> and it would definitely help for the mass-delete case.

Yep. Anyway, nice to see the large dir case is being addressed.


-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21 15:38                     ` Ville Herva
@ 2002-06-21 16:15                       ` Stephen C. Tweedie
  2002-06-21 18:44                         ` Ville Herva
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen C. Tweedie @ 2002-06-21 16:15 UTC (permalink / raw)
  To: Ville Herva, Stephen C. Tweedie, Daniel Phillips, Andreas Dilger,
	Linux-kernel, ext2-devel

Hi,

On Fri, Jun 21, 2002 at 06:38:01PM +0300, Ville Herva wrote:
 
> So I thought, but not in ext2?

It's in ext2 --- it's the i_dir_start_lookup field that remembers
where we were last.

--Stephen

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21  7:03                 ` Helge Hafting
  2002-06-21 14:02                   ` Daniel Phillips
@ 2002-06-21 16:23                   ` Daniel Phillips
  1 sibling, 0 replies; 42+ messages in thread
From: Daniel Phillips @ 2002-06-21 16:23 UTC (permalink / raw)
  To: Helge Hafting, ext2-devel, linux-kernel

On Friday 21 June 2002 09:03, Helge Hafting wrote:
> Daniel Phillips wrote:
> 
> > I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2.  As
> > predicted, half-md4 does produce very even bucket distributions.  For 200,000
> > creates:
> > 
> >    half-md4:        2872 avg bytes filled per 4k block (70%)
> >    dx_hack_hash:    2853 avg bytes filled per 4k block (69%)
> > 
> > but guess which was faster overall?
> > 
> >    half-md4:        user 0.43 system 6.88 real 0:07.33 CPU 99%
> >    dx_hack_hash:    user 0.43 system 6.40 real 0:06.82 CPU 100%
> > 
> > This is quite reproducible: dx_hack_hash is always faster by about 6%.  This
> > must be due entirely to the difference in hashing cost, since half-md4
> > produces measurably better distributions.  Now what do we do?
> 
> No surprise that the worse distribution is faster - you get less
> io when fewer blocks are used.  Which means a bad distribution beats 
> a good one _until_ blocks start to really fill up and collide. 2.8K per
> 4K block is only 70% full.  I guess the better hash wins
> if you force a higher fill rate?

Hashing in htree doesn't work like that - what you're thinking about
is a traditional fixed-size hash table.  HTree is a btree that uses
hashes of names as keys.  Each block has a variable amount of the key
space assigned to it, initially just one block for the entire key
space.  When that block fills up, its entries and its key space are
split into two, then those blocks start to fill up, get split, and
so on.

So more even key distribution means the key space gets split more
evenly, and blocks are more likely to fill up evenly, meaning less
splitting, fewer blocks in total, and less IO.

A hash function that distributes keys better should give somewhat
better performance, and that has indeed been my experience.  But
in the case of half-MD4, the improvement we get from better
distribution is wiped out by the higher cost of computing the hash
function.[1]  Which is not to say that the work is without value.
The beautiful distribution given by the half-MD4 hash gives us
something to aim at, we just have to be more efficient about it.

I should note that HTree isn't hugely sensitive to bad hash
functions, at least not at the outset when a directory is growing.
The worst that happens is every leaf block ends up half-full with
a performance hit of just a few percent.  However, over time with
many insertions and deletions the hash space can get cut up into 
smaller and smaller pieces, so leaf blocks become less and less
full.  A more uniform hash function will slow this process down a
great deal, but it will not stop it completely.  The proper way
to deal with long term key space fragmentation is to implement
coalesce-on-delete, which is in progress.

[1] CPU cost in filesystem operations *is* important - a lot more
important than commonly thought.  Here we have yet another example
where CPU cost in filesystem operations dominates IO time, and
indeed, since directory operations are performed almost entirely
in cache, the quadratic cost of linear directory lookup is almost
entirely CPU cost.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21 16:15                       ` Stephen C. Tweedie
@ 2002-06-21 18:44                         ` Ville Herva
  0 siblings, 0 replies; 42+ messages in thread
From: Ville Herva @ 2002-06-21 18:44 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linux-kernel, ext2-devel

On Fri, Jun 21, 2002 at 05:15:50PM +0100, you [Stephen C. Tweedie] wrote:
> Hi,
> 
> On Fri, Jun 21, 2002 at 06:38:01PM +0300, Ville Herva wrote:
>  
> > So I thought, but not in ext2?
> 
> It's in ext2 --- it's the i_dir_start_lookup field that remembers
> where we were last.

Great. (Unfortunately the box that experienced the named incident was a
2.2.20 + patches, so obviously the patch didn't help there :/ ).


-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21  3:28               ` Daniel Phillips
  2002-06-21  7:03                 ` Helge Hafting
  2002-06-21 15:06                 ` Stephen C. Tweedie
@ 2002-06-22  5:53                 ` Andreas Dilger
  2002-06-22 20:59                   ` Daniel Phillips
  2 siblings, 1 reply; 42+ messages in thread
From: Andreas Dilger @ 2002-06-22  5:53 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Andrew Morton, Christopher Li, Linux-kernel,
	ext2-devel

On Jun 21, 2002  05:28 +0200, Daniel Phillips wrote:
> I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2.  As 
> predicted, half-md4 does produce very even bucket distributions.  For 200,000 
> creates:
> 
>    half-md4:        2872 avg bytes filled per 4k block (70%)
>    dx_hack_hash:    2853 avg bytes filled per 4k block (69%)
> 
> but guess which was faster overall?
> 
>    half-md4:        user 0.43 system 6.88 real 0:07.33 CPU 99%
>    dx_hack_hash:    user 0.43 system 6.40 real 0:06.82 CPU 100%
> 
> This is quite reproducible: dx_hack_hash is always faster by about 6%.  This 
> must be due entirely to the difference in hashing cost, since half-md4 
> produces measurably better distributions.  Now what do we do?

While I normally advocate the "cheapest" way of implementing a given
solution (and dx_hack_hash is definitely the lowest-cost hash function
we could reasonably have), I would still be inclined to go with half-MD4
for this.  A few reasons for that:
1) CPUs are getting faster all the time
2) it is a well-understood algorithm that has very good behaviour
3) it is much harder to spoof MD4 than dx_hack_hash
4) it is probably better to have the most uniform hash function we can
   find than to do lots more block split/coalesce operations, so the
   extra cost of half-MD4 may be a benefit overall

It would be interesting to re-run this test to create a few million
entries, but with periodic deletes.

Hmm, now that I think about it, split/coalesce operations are only
important on create and delete, while the hash cost is paid for each
lookup as well.  It would be interesting to see the comparison with
a test something like this (sorry, don't have a system which has both
hash functions working right now):

#!/bin/sh
DEV=/dev/hda7
TESTDIR=/mnt/tmp
date
for d in `seq -f "directory_name_%05g" 1 10000` ; do
	mkdir $TESTDIR/$d
	for f in `seq -f "file_name_%05g" 1 10000` ; do
		touch $TESTDIR/$d/$d_$f
	done
done
date
umount $TESTDIR
mount -o noatime $DEV $TESTDIR
date
for d in `seq -f "directory_name_%05g" 10000 -1 1` ; do
	for f in `seq -f "file_name_%05g" 10000 -1 1` ; do
		stat $TESTDIR/$d/$d_$f > /dev/null
	done
done
date

Having the longer filenames will put more load on the hash function,
so we will see if we are really paying a big price for the overhead,
and the stat test will remove all of the creation time and disk dirtying
and just leave us with the pure lookup costs hopefully.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-22  5:53                 ` Andreas Dilger
@ 2002-06-22 20:59                   ` Daniel Phillips
  2002-06-23  0:01                     ` Daniel Phillips
  2002-06-23  7:57                     ` Daniel Phillips
  0 siblings, 2 replies; 42+ messages in thread
From: Daniel Phillips @ 2002-06-22 20:59 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Stephen C. Tweedie, Andrew Morton, Christopher Li, Linux-kernel,
	ext2-devel

On Saturday 22 June 2002 07:53, Andreas Dilger wrote:
> 4) it is probably better to have the most uniform hash function we can
>    find than to do lots more block split/coalesce operations, so the
>    extra cost of half-MD4 may be a benefit overall

That's the thing: we already know the extra cost of half-MD4 is dominant 
here.  It only saves about 1.3% of the splitting, and splitting itself is a 
small part of overall costs.

Argument (1), "cpus are getting faster all the time" a little weak - 
directories are getting bigger all the time too, and... you know, it just 
isn't stylish to waste cycles in common operations.

However, I buy the stability and predictability arguments, and we should go 
with half-md4 for the alpha-test period.  But also, we should see what we can 
do to trim some of the fat off half-MD4 without degrading its nice properties 
too much.  During the same period, I think I see if I can find a better 
multiplier for dx_hack_hash, to even out its distribution a bit.  Armed with 
these results, the release version should use the fastest hash that exhibits 
acceptable statistical behaviour.

I do agree there's a lot to be said for going with a hash that has been 
analyzed rigorously.

> It would be interesting to re-run this test to create a few million
> entries, but with periodic deletes.

Changing the topic slightly, I'm running the following stupid program to test 
for directory growth during steady state operation similar to a mailer:

/* steady.c */

#include <stdlib.h>

int main (int argc, char *argv[])
{
	int target = (argc > 2)? strtol(argv[2], 0, 10): 0;
	int range = (argc > 3)? strtol(argv[3], 0, 10): 0;
	int size = 50, show = argc > 3 && !strncmp(argv[4], "y", 1);
	int balance = 0, deletes = 0;
	char name[size];

	if (!range) return -1;

	while (1)
	{
		snprintf(name, size, "%s%i", argv[1], rand() % range);
		if (balance < target)
		{
			int fd = open(name, 0100);
			if (fd >= 0)
			{
				if (show) printf("create %s\n", name);
				close(fd);
				balance++;
			}
		} else
			if (!remove(name))
			{
				if (show) printf("remove %s (%i), ", name, ++deletes);
				balance--;
			}
	}
	return 0;
}

when run with:

   steady <basename> <count> <range> [y=print output]

it will create up to <count> files with names <basename><rand> with <rand> in 
the range 0 to range-1, then it will start alternately deleting and creating 
random files.  Its algorithm is braindead - it just keeps trying to delete a 
random file until it hits one that exists, and on create, it creates a random 
file and tries again if it hits one already there.  I'm running it now with:

   steady foo 10000 1000000000 y

meaning it's working with a 10,000 file, creating/deleting files with names 
in the rand foo0 to foo999999999, i.e., a billion possible names.  It's a 
horribly inefficient thing to do, since it has about a hundred thousand 
failures for every success, but it was quick to write.  I'm watching for 
directory expansion.  The directory started at 385024 bytes (38.5 bytes/entry 
and grew a block after about 300 steady state remove/create cycles.   Another 
block was added around 500 cycles.  It's up to 1100 cycles now, and hasn't 
added another block.  It sort of looks like the expansion is getting slower.  
This is with dx_hack_hash.  I'll let it run overnight to get a feeling of how 
urgent the steady-state problem is.

What I expect to find is that directory expansion is slower with half-md4, 
perhaps significantly slower.  However, even if it is slower, we can't really 
tolerate any steady-state expansion at all, so it's no excuse for not 
implementing the coalesce.

> Hmm, now that I think about it, split/coalesce operations are only
> important on create and delete, while the hash cost is paid for each
> lookup as well.  It would be interesting to see the comparison with
> a test something like this (sorry, don't have a system which has both
> hash functions working right now):
>
> [code]

Yes, good test: performance against name length.  Well, I don't plan to try 
it just now, since I've run out of time and must get ready for the pilgrimage 
to Ottawa.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-22 20:59                   ` Daniel Phillips
@ 2002-06-23  0:01                     ` Daniel Phillips
  2002-06-23  7:57                     ` Daniel Phillips
  1 sibling, 0 replies; 42+ messages in thread
From: Daniel Phillips @ 2002-06-23  0:01 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Stephen C. Tweedie, Andrew Morton, Christopher Li, Linux-kernel,
	ext2-devel

On Saturday 22 June 2002 22:59, Daniel Phillips wrote:
> ... The directory started at 385024 bytes (38.5 bytes/entry 
> and grew a block after about 300 steady state remove/create cycles.   Another 
> block was added around 500 cycles.  It's up to 1100 cycles now, and hasn't 
> added another block.  It sort of looks like the expansion is getting slower.  
> This is with dx_hack_hash.  I'll let it run overnight to get a feeling of how 
> urgent the steady-state problem is.

And now it's up to 5500 cycles, and still hasn't added another block.
Empirically, we're seeing a very steep falloff in the expansion rate,
though I'll readily admit the testing is far from sufficient.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-22 20:59                   ` Daniel Phillips
  2002-06-23  0:01                     ` Daniel Phillips
@ 2002-06-23  7:57                     ` Daniel Phillips
  1 sibling, 0 replies; 42+ messages in thread
From: Daniel Phillips @ 2002-06-23  7:57 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Stephen C. Tweedie, Andrew Morton, Christopher Li, Linux-kernel,
	ext2-devel

On Sunday 23 June 2002 02:01, Daniel Phillips wrote:
> On Saturday 22 June 2002 22:59, Daniel Phillips wrote:
> > ... The directory started at 385024 bytes (38.5 bytes/entry 
> > and grew a block after about 300 steady state remove/create cycles.   Another 
> > block was added around 500 cycles.  It's up to 1100 cycles now, and hasn't 
> > added another block.  It sort of looks like the expansion is getting slower.  
> > This is with dx_hack_hash.  I'll let it run overnight to get a feeling of how 
> > urgent the steady-state problem is.
> 
> And now it's up to 5500 cycles, and still hasn't added another block.
> Empirically, we're seeing a very steep falloff in the expansion rate,
> though I'll readily admit the testing is far from sufficient.

At the end of the overnight run with about 16,600 cycles another block has
been added, bringing the directory up to 397312 bytes.  The four points I
have now make a nice logarithmic curve, i.e., the slope seems to fall off so
rapidly that there is not a lot to worry about here.  All the same, I should
code a more efficient test and run a few billion cycles.  Or perhaps that
same effort would be better spent working on the coalesce, which will turn
this into a moot point.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21 14:02                   ` Daniel Phillips
@ 2002-06-24  7:12                     ` Helge Hafting
  0 siblings, 0 replies; 42+ messages in thread
From: Helge Hafting @ 2002-06-24  7:12 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: ext2-devel, linux-kernel

Daniel Phillips wrote:

> 
> Hashing in htree doesn't work like that - what you're thinking about
> is a traditional fixed-size hash table.  HTree is a btree that uses
> hashes of names as keys.  Each block has a variable amount of the key
> space assigned to it, initially just one block for the entire key
> space.  When that block fills up, its entries and its key space are
> split into two, then those blocks start to fill up, get split, and
> so on.
> 
Very interesting, thanks!

> A hash function that distributes keys better should give somewhat
> better performance, and that has indeed been my experience.  But
> in the case of half-MD4, the improvement we get from better
> distribution is wiped out by the higher cost of computing the hash
> function.[1]  Which is not to say that the work is without value.
> The beautiful distribution given by the half-MD4 hash gives us
> something to aim at, we just have to be more efficient about it.
> 
There is a way to measure the effect of the better but slower
hash.  Add "time-wasting" code to the faster hash so
it gets exactly as slow as MD4.  Then run benchmarks and
see the effects of the better distribution undisturbed.

Helge Hafting

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-06-21 15:06                 ` Stephen C. Tweedie
@ 2002-07-04  4:48                   ` Daniel Phillips
  2002-07-04 14:15                     ` jlnance
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Phillips @ 2002-07-04  4:48 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Stephen C. Tweedie, Andrew Morton, Christopher Li, Linux-kernel,
	ext2-devel

(I wrote this about 10 days ago and somehow didn't get around to posting it)

(please note my new, permanent email address)

On Friday 21 June 2002 17:06, Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, Jun 21, 2002 at 05:28:28AM +0200, Daniel Phillips wrote:
> 
> > I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2.  As 
> > predicted, half-md4 does produce very even bucket distributions.  For 200,000 
> > creates:
> > 
> >    half-md4:        2872 avg bytes filled per 4k block (70%)
> >    dx_hack_hash:    2853 avg bytes filled per 4k block (69%)
> > 
> > but guess which was faster overall?
> > 
> >    half-md4:        user 0.43 system 6.88 real 0:07.33 CPU 99%
> >    dx_hack_hash:    user 0.43 system 6.40 real 0:06.82 CPU 100%
> > 
> > This is quite reproducible: dx_hack_hash is always faster by about 6%.  This 
> > must be due entirely to the difference in hashing cost, since half-md4 
> > produces measurably better distributions.  Now what do we do?
> 
> I want to get this thing tested!  

Yes :-)

> There are far too many factors for this to be resolved very quickly.
> In reality, there will be a lot of disk cost under load which you
> don't see in benchmarks, too.  We also know for a fact that the early
> hashes used in Reiserfs were quick but were vulnerable to terribly bad
> behaviour under certain application workloads.  With the half-md4, at
> least we can expect decent worst-case behaviour unless we're under
> active attack (ie. only maliscious apps get hurt).

OK, anti-hash-attack is on the list of things to do, and it's fairly
clear how to go about it:

  - When we create a volume, generate and store some number of bits of
    random seed in the superblock, or if we are creating the first-ever
    index on an existing volume, generate the seed at that time.

  - Every dx_hash starts by seeding the hash function from the value in
    the superblock.

The width of the superblock seed doesn't have to match the value needed
by the directory hash exactly since we can easily widen the value by
seeding a pseudo-random generator with the smaller seed.  If the
superblock seed is wider than we need, we can fold it down via xor, or
better, just take as much as we need (all bits of our seed are supposed
to be eually random, so xoring parts of the seed together does't make
the result any more random).

We would do whatever seed-adjusting we need to at mount time, so the
process of seeding the per-dirop hash is just a memory->memory or
memory->register move.  This part of the hashing process, at least,
should impose no measurable overhead.

Originally, I'd put aside the 'reserve_zero' field in the per-directory
dx_info for this purpose, but I later realized that it doesn't make
sense to do this anti-hash-attack protection at any finer granuality
than a volume, plus we save valuable real estate in the directory root
and gain a little efficiency.  Putting it in the superblock is a clear
win.

By the way, the dx_info area is not hardwired to 8 bytes.  The current
code is written so that dx_info can be expanded without breaking
forward compatibility.  At least it's supposed to be written that way.
We should put that on the list of things to test.

> I think the md4 is a safer bet until we know more, so I'd vote that we
> stick with the ext3 cvs code which uses hash version #1 for that, and
> defer anything else until we've seen more --- the hash versioning lets
> us do that safely.

Yes, I agree we should err on the side of safety.  The 6% overhead I see
in my stripped down test will certainly be diluted quite a bit under real
loads.  Hash version #1 does not have to be the release version.  We can
reasonably require a fsck at the end of the alpha-testing period, which
would rebuild all the hash indexes with the final, release hash.  This
would have the added advantage of enlisting our alpha-testers to test
Ted's new e2fsck code as well.

I think we're making progress here.  An 'aha' I had while working with
this new hash code is that there's a one statistical property we're
looking for more than any other in a good hash: the hash value should
yield as little information as possible about the input string.  Then
any statistical change in input strings over time won't cause a
statistical change in output hash values over time - precisely the
property htree wants, in order to keep leaf blocks filled uniformly
and slow the hash range fragmentation rate.  This is why cryptographic
hash analysis is the right approach to this problem.

-- 
Daniel

P.S., in retrospect, pretty much all of the above survived scrutiny
at Ottawa.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-07-04  4:48                   ` Daniel Phillips
@ 2002-07-04 14:15                     ` jlnance
  2002-07-05  2:11                       ` Daniel Phillips
  0 siblings, 1 reply; 42+ messages in thread
From: jlnance @ 2002-07-04 14:15 UTC (permalink / raw)
  To: linux-kernel

On Thu, Jul 04, 2002 at 06:48:45AM +0200, Daniel Phillips wrote:
> > behaviour under certain application workloads.  With the half-md4, at
> > least we can expect decent worst-case behaviour unless we're under
> > active attack (ie. only maliscious apps get hurt).
> 
> OK, anti-hash-attack is on the list of things to do, and it's fairly
> clear how to go about it:

Is it really worth the trouble and complexity to do anti-hash-attack?
What is the worst that could happen if someone managed to create a bunch
of files that hashed to the same value?

Jim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ext2-devel] Re: Shrinking ext3 directories
  2002-07-04 14:15                     ` jlnance
@ 2002-07-05  2:11                       ` Daniel Phillips
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Phillips @ 2002-07-05  2:11 UTC (permalink / raw)
  To: jlnance, linux-kernel

On Thursday 04 July 2002 16:15, jlnance@intrex.net wrote:
> On Thu, Jul 04, 2002 at 06:48:45AM +0200, Daniel Phillips wrote:
> > > behaviour under certain application workloads.  With the half-md4, at
> > > least we can expect decent worst-case behaviour unless we're under
> > > active attack (ie. only maliscious apps get hurt).
> > 
> > OK, anti-hash-attack is on the list of things to do, and it's fairly
> > clear how to go about it:
> 
> Is it really worth the trouble and complexity to do anti-hash-attack?
> What is the worst that could happen if someone managed to create a bunch
> of files that hashed to the same value?

Just a slowdown, but in some cases it could be a quadratic slowdown
that could conceivably be turned into a denial of service.  As risks
go, it's a minor one, but there's a straightforward solution with
insignificant cost in either efficiency or code size, so why not do
it.  The overhead is just a data move from the superblock per name 
hash.

-- 
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2002-07-05  2:07 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-06-18 16:08 Shrinking ext3 directories DervishD
2002-06-18 16:10 ` Austin Gonyou
2002-06-18 16:39   ` Andreas Dilger
2002-06-18 19:39     ` DervishD
2002-06-18 19:34   ` DervishD
2002-06-18 16:21 ` Padraig Brady
2002-06-18 16:54   ` David Lang
2002-06-18 19:35   ` DervishD
2002-06-18 21:50 ` Stephen C. Tweedie
2002-06-18 22:18   ` Alexander Viro
2002-06-19  9:38     ` DervishD
2002-06-19 10:37     ` Stephen C. Tweedie
2002-06-19 17:03       ` [Ext2-devel] " Christopher Li
2002-06-19 20:10         ` Stephen C. Tweedie
2002-06-19 20:34           ` Stephen C. Tweedie
2002-06-19 20:13         ` Andrew Morton
2002-06-19 22:43           ` Stephen C. Tweedie
2002-06-19 23:54             ` Stephen C. Tweedie
2002-06-21  3:28               ` Daniel Phillips
2002-06-21  7:03                 ` Helge Hafting
2002-06-21 14:02                   ` Daniel Phillips
2002-06-24  7:12                     ` Helge Hafting
2002-06-21 16:23                   ` Daniel Phillips
2002-06-21 15:06                 ` Stephen C. Tweedie
2002-07-04  4:48                   ` Daniel Phillips
2002-07-04 14:15                     ` jlnance
2002-07-05  2:11                       ` Daniel Phillips
2002-06-22  5:53                 ` Andreas Dilger
2002-06-22 20:59                   ` Daniel Phillips
2002-06-23  0:01                     ` Daniel Phillips
2002-06-23  7:57                     ` Daniel Phillips
2002-06-19 22:49         ` Daniel Phillips
2002-06-20  0:24           ` Andreas Dilger
2002-06-20  9:34           ` Stephen C. Tweedie
2002-06-20 10:18             ` Andreas Dilger
2002-06-20 13:45               ` Daniel Phillips
2002-06-21 14:54                 ` Ville Herva
2002-06-21 15:08                   ` Stephen C. Tweedie
2002-06-21 15:38                     ` Ville Herva
2002-06-21 16:15                       ` Stephen C. Tweedie
2002-06-21 18:44                         ` Ville Herva
2002-06-20 16:26               ` Bill Davidsen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).