All of lore.kernel.org
 help / color / mirror / Atom feed
* How does NILFS2 handle directory management
@ 2009-09-10 19:16 Prasenjit Giri
       [not found] ` <2324ff2b0909101216q31dad1a8y43ea0f229923a0c3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Prasenjit Giri @ 2009-09-10 19:16 UTC (permalink / raw)
  To: NILFS2


[-- Attachment #1.1: Type: text/plain, Size: 361 bytes --]

 Hello,
Even after looking through various sites, forums, mailing lists I could not
uncover the present directoy management of NILFS2.

As I see a B-tree directory management in your long term to do list, it
would be really nice if someone would ought to share the information
regarding the present directory management of NILFS2

Thank you.
-- 
Prasenjit Giri

[-- Attachment #1.2: Type: text/html, Size: 478 bytes --]

[-- Attachment #2: Type: text/plain, Size: 158 bytes --]

_______________________________________________
users mailing list
users-JrjvKiOkagjYtjvyW6yDsg@public.gmane.org
https://www.nilfs.org/mailman/listinfo/users

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found] ` <2324ff2b0909101216q31dad1a8y43ea0f229923a0c3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-09-10 19:26   ` Reinoud Zandijk
       [not found]     ` <20090910192619.GA1263-bVHBekiX4bNgoMqBc1r0ESegHCQxtGRMHZ5vskTnxNA@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Reinoud Zandijk @ 2009-09-10 19:26 UTC (permalink / raw)
  To: NILFS Users mailing list

Hi Prasenjit Giri,

On Fri, Sep 11, 2009 at 12:46:04AM +0530, Prasenjit Giri wrote:
> Even after looking through various sites, forums, mailing lists I could not
> uncover the present directoy management of NILFS2.
> 
> As I see a B-tree directory management in your long term to do list, it
> would be really nice if someone would ought to share the information
> regarding the present directory management of NILFS2

Currently directories are recorded just as a sequential file but not filled
with user data but with dirent entries in their blocks like ext2fs does.
Nothing special about that :)

I dont know if b+tree directory management would be preferable though. With
some smart caching all directory operations can be made O(1) anyway.

With regards,
Reinoud Zandijk

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]     ` <20090910192619.GA1263-bVHBekiX4bNgoMqBc1r0ESegHCQxtGRMHZ5vskTnxNA@public.gmane.org>
@ 2009-09-11  1:21       ` Ryusuke Konishi
       [not found]         ` <20090911.102118.69189169.ryusuke-sG5X7nlA6pw@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Ryusuke Konishi @ 2009-09-11  1:21 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg, reinoud-S783fYmB3Ccdnm+yROfE0A,
	Prasenjit Giri

Hi Prasenjit Giri, and Reinoud,
On Thu, 10 Sep 2009 21:26:19 +0200, Reinoud Zandijk <wrote:
> Hi Prasenjit Giri,
> 
> On Fri, Sep 11, 2009 at 12:46:04AM +0530, Prasenjit Giri wrote:
> > Even after looking through various sites, forums, mailing lists I could not
> > uncover the present directoy management of NILFS2.
> > 
> > As I see a B-tree directory management in your long term to do list, it
> > would be really nice if someone would ought to share the information
> > regarding the present directory management of NILFS2
> 
> Currently directories are recorded just as a sequential file but not filled
> with user data but with dirent entries in their blocks like ext2fs does.
> Nothing special about that :)
>
> I dont know if b+tree directory management would be preferable though. With
> some smart caching all directory operations can be made O(1) anyway.
> 
> With regards,
> Reinoud Zandijk

Yes, directory is a file and it's already managed with b-tree like
other files.  However, the current directory design is not O(log n),
and is actually slow especially for file creation.

So, it leaves room for considering the true "b-tree directory
management".

OTOH, the replacement has the following points to notice:

 1) It may complicate the log writer.

    The current log writer is designed on the basis that every data
    and meta data is a file.  In NILFS, inodes, segment usage state,
    and checkpoints, are managed with correponding meta-data files.

    The true "b-tree directory management" may break this uniformity.

 2) A disk format change is required.
    (We can deem this a trade-off at this stage)

This may be a -v3 material.  But anyway, patch proposals are welcome.

Thank you,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]         ` <20090911.102118.69189169.ryusuke-sG5X7nlA6pw@public.gmane.org>
@ 2009-09-11  3:48           ` Jiro SEKIBA
       [not found]             ` <873a6u16qy.wl%jir-27yqGEOhnJbQT0dZR+AlfA@public.gmane.org>
  2009-09-11  3:58           ` Prasenjit Giri
  2009-09-11  5:22           ` Reinoud Zandijk
  2 siblings, 1 reply; 18+ messages in thread
From: Jiro SEKIBA @ 2009-09-11  3:48 UTC (permalink / raw)
  To: NILFS Users mailing list

Hi

As far as I see, nilfs2 directory management is based on ext2.
ext3 has same format as ext2, but file creation is faster than ext2.
http://plaza18.mbn.or.jp/~moriban/linux/img/FileSystemBenchmarkResults-01-03.png
# There may be more reliable benchmark result, but I cound't find it.

So I think it would be worth to try porting ext3 code to nilfs2
to improve file creation.  Isn't it?

thanks,

regards,
-- 
Jiro SEKIBA <jir-hfpbi5WX9J54Eiagz67IpQ@public.gmane.org>

At Fri, 11 Sep 2009 10:21:18 +0900 (JST),
Ryusuke Konishi wrote:
> 
> Hi Prasenjit Giri, and Reinoud,
> On Thu, 10 Sep 2009 21:26:19 +0200, Reinoud Zandijk <wrote:
> > Hi Prasenjit Giri,
> > 
> > On Fri, Sep 11, 2009 at 12:46:04AM +0530, Prasenjit Giri wrote:
> > > Even after looking through various sites, forums, mailing lists I could not
> > > uncover the present directoy management of NILFS2.
> > > 
> > > As I see a B-tree directory management in your long term to do list, it
> > > would be really nice if someone would ought to share the information
> > > regarding the present directory management of NILFS2
> > 
> > Currently directories are recorded just as a sequential file but not filled
> > with user data but with dirent entries in their blocks like ext2fs does.
> > Nothing special about that :)
> >
> > I dont know if b+tree directory management would be preferable though. With
> > some smart caching all directory operations can be made O(1) anyway.
> > 
> > With regards,
> > Reinoud Zandijk
> 
> Yes, directory is a file and it's already managed with b-tree like
> other files.  However, the current directory design is not O(log n),
> and is actually slow especially for file creation.
> 
> So, it leaves room for considering the true "b-tree directory
> management".
> 
> OTOH, the replacement has the following points to notice:
> 
>  1) It may complicate the log writer.
> 
>     The current log writer is designed on the basis that every data
>     and meta data is a file.  In NILFS, inodes, segment usage state,
>     and checkpoints, are managed with correponding meta-data files.
> 
>     The true "b-tree directory management" may break this uniformity.
> 
>  2) A disk format change is required.
>     (We can deem this a trade-off at this stage)
> 
> This may be a -v3 material.  But anyway, patch proposals are welcome.
> 
> Thank you,
> Ryusuke Konishi
> _______________________________________________
> users mailing list
> users-JrjvKiOkagjYtjvyW6yDsg@public.gmane.org
> https://www.nilfs.org/mailman/listinfo/users
> 
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]         ` <20090911.102118.69189169.ryusuke-sG5X7nlA6pw@public.gmane.org>
  2009-09-11  3:48           ` Jiro SEKIBA
@ 2009-09-11  3:58           ` Prasenjit Giri
       [not found]             ` <2324ff2b0909102058g1ee407fco3f2482b4732dacca-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-09-11  5:22           ` Reinoud Zandijk
  2 siblings, 1 reply; 18+ messages in thread
From: Prasenjit Giri @ 2009-09-11  3:58 UTC (permalink / raw)
  To: NILFS2


[-- Attachment #1.1: Type: text/plain, Size: 2523 bytes --]

On Fri, Sep 11, 2009 at 6:51 AM, Ryusuke Konishi <ryusuke-sG5X7nlA6pw@public.gmane.org> wrote:

> Hi Prasenjit Giri, and Reinoud,
> On Thu, 10 Sep 2009 21:26:19 +0200, Reinoud Zandijk <wrote:
> > Hi Prasenjit Giri,
> >
> > On Fri, Sep 11, 2009 at 12:46:04AM +0530, Prasenjit Giri wrote:
> > > Even after looking through various sites, forums, mailing lists I could
> not
> > > uncover the present directoy management of NILFS2.
> > >
> > > As I see a B-tree directory management in your long term to do list, it
> > > would be really nice if someone would ought to share the information
> > > regarding the present directory management of NILFS2
> >
> > Currently directories are recorded just as a sequential file but not
> filled
> > with user data but with dirent entries in their blocks like ext2fs does.
> > Nothing special about that :)
> >
> > I dont know if b+tree directory management would be preferable though.
> With
> > some smart caching all directory operations can be made O(1) anyway.
> >
> > With regards,
> > Reinoud Zandijk
>
> Yes, directory is a file and it's already managed with b-tree like
> other files.  However, the current directory design is not O(log n),
> and is actually slow especially for file creation.
>
> So, it leaves room for considering the true "b-tree directory
> management".
>
> OTOH, the replacement has the following points to notice:
>
>  1) It may complicate the log writer.
>
>    The current log writer is designed on the basis that every data
>    and meta data is a file.  In NILFS, inodes, segment usage state,
>    and checkpoints, are managed with correponding meta-data files.
>
>    The true "b-tree directory management" may break this uniformity.
>
>  2) A disk format change is required.
>    (We can deem this a trade-off at this stage)
>
> This may be a -v3 material.  But anyway, patch proposals are welcome.
>
> Thank you,
> Ryusuke Konishi
>



Hello Reinoud Zandijk, Ryusuke Konishi

This has been more than helpful as I was at complete lost. But with this new
side of it could you explain the quoted, " --- true 'b-tree directory
management' --- ".

I've on this mail list, looking for ideas that would be feasible for an
undergrad project -- though this adds a very unpleasant twist.

Is writable snapshot feasible ? If so, is there an abstract for this
purpose? Is it doable in a years time?

Or could you all please suggest an alternative on such grounds.
With regards,
Prasenjit Giri
Rajesh D. Yadav
Ashwin Goel
Sunil V. Chitkote


-- 
Prasenjit Giri

[-- Attachment #1.2: Type: text/html, Size: 3178 bytes --]

[-- Attachment #2: Type: text/plain, Size: 158 bytes --]

_______________________________________________
users mailing list
users-JrjvKiOkagjYtjvyW6yDsg@public.gmane.org
https://www.nilfs.org/mailman/listinfo/users

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]             ` <873a6u16qy.wl%jir-27yqGEOhnJbQT0dZR+AlfA@public.gmane.org>
@ 2009-09-11  4:13               ` Ryusuke Konishi
  0 siblings, 0 replies; 18+ messages in thread
From: Ryusuke Konishi @ 2009-09-11  4:13 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg, jir-hfpbi5WX9J54Eiagz67IpQ

On Fri, 11 Sep 2009 12:48:37 +0900, Jiro SEKIBA wrote:
> Hi
> 
> As far as I see, nilfs2 directory management is based on ext2.
> ext3 has same format as ext2, but file creation is faster than ext2.
> http://plaza18.mbn.or.jp/~moriban/linux/img/FileSystemBenchmarkResults-01-03.png
> # There may be more reliable benchmark result, but I cound't find it.
> 
> So I think it would be worth to try porting ext3 code to nilfs2
> to improve file creation.  Isn't it?

Yeah, it sounds nice.  The approach seems to be able to avoid both
problems I mentioned.

Thanks,
Ryusuke Konishi

> 
> thanks,
> 
> regards,
> -- 
> Jiro SEKIBA <jir-hfpbi5WX9J54Eiagz67IpQ@public.gmane.org>
> 
> At Fri, 11 Sep 2009 10:21:18 +0900 (JST),
> Ryusuke Konishi wrote:
> > 
> > Hi Prasenjit Giri, and Reinoud,
> > On Thu, 10 Sep 2009 21:26:19 +0200, Reinoud Zandijk <wrote:
> > > Hi Prasenjit Giri,
> > > 
> > > On Fri, Sep 11, 2009 at 12:46:04AM +0530, Prasenjit Giri wrote:
> > > > Even after looking through various sites, forums, mailing lists I could not
> > > > uncover the present directoy management of NILFS2.
> > > > 
> > > > As I see a B-tree directory management in your long term to do list, it
> > > > would be really nice if someone would ought to share the information
> > > > regarding the present directory management of NILFS2
> > > 
> > > Currently directories are recorded just as a sequential file but not filled
> > > with user data but with dirent entries in their blocks like ext2fs does.
> > > Nothing special about that :)
> > >
> > > I dont know if b+tree directory management would be preferable though. With
> > > some smart caching all directory operations can be made O(1) anyway.
> > > 
> > > With regards,
> > > Reinoud Zandijk
> > 
> > Yes, directory is a file and it's already managed with b-tree like
> > other files.  However, the current directory design is not O(log n),
> > and is actually slow especially for file creation.
> > 
> > So, it leaves room for considering the true "b-tree directory
> > management".
> > 
> > OTOH, the replacement has the following points to notice:
> > 
> >  1) It may complicate the log writer.
> > 
> >     The current log writer is designed on the basis that every data
> >     and meta data is a file.  In NILFS, inodes, segment usage state,
> >     and checkpoints, are managed with correponding meta-data files.
> > 
> >     The true "b-tree directory management" may break this uniformity.
> > 
> >  2) A disk format change is required.
> >     (We can deem this a trade-off at this stage)
> > 
> > This may be a -v3 material.  But anyway, patch proposals are welcome.
> > 
> > Thank you,
> > Ryusuke Konishi
> > _______________________________________________
> > users mailing list
> > users-JrjvKiOkagjYtjvyW6yDsg@public.gmane.org
> > https://www.nilfs.org/mailman/listinfo/users
> > 
> > 
> > 
> 
> 
> _______________________________________________
> users mailing list
> users-JrjvKiOkagjYtjvyW6yDsg@public.gmane.org
> https://www.nilfs.org/mailman/listinfo/users

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]         ` <20090911.102118.69189169.ryusuke-sG5X7nlA6pw@public.gmane.org>
  2009-09-11  3:48           ` Jiro SEKIBA
  2009-09-11  3:58           ` Prasenjit Giri
@ 2009-09-11  5:22           ` Reinoud Zandijk
       [not found]             ` <20090911052213.GA24899-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
  2 siblings, 1 reply; 18+ messages in thread
From: Reinoud Zandijk @ 2009-09-11  5:22 UTC (permalink / raw)
  To: Ryusuke Konishi
  Cc: Prasenjit Giri, reinoud-S783fYmB3Ccdnm+yROfE0A,
	users-JrjvKiOkagjYtjvyW6yDsg


[-- Attachment #1.1: Type: text/plain, Size: 1669 bytes --]

Hi folks,

On Fri, Sep 11, 2009 at 10:21:18AM +0900, Ryusuke Konishi wrote:
> > I dont know if b+tree directory management would be preferable though. With
> > some smart caching all directory operations can be made O(1) anyway.
> 
> Yes, directory is a file and it's already managed with b-tree like
> other files.  However, the current directory design is not O(log n),
> and is actually slow especially for file creation.

Thats what i was refering to. UDF has the same issue and the same design on
directory structure, only even more complicated since directory entries are
not of a fixed length and may actually span logical sectors/blocks.

I now have code in UDF that i also use in my NiLFS implementation that manages
the first file creation/reference in O(n) but all other files
creation/lookup/delete in O(1)! speeding up large directory copies
modifications/copies enormously, upto thousands of times.

> So, it leaves room for considering the true "b-tree directory
> management".
> 
> OTOH, the replacement has the following points to notice:
> 
>  1) It may complicate the log writer.
> 
>     The current log writer is designed on the basis that every data
>     and meta data is a file.  In NILFS, inodes, segment usage state,
>     and checkpoints, are managed with correponding meta-data files.
> 
>     The true "b-tree directory management" may break this uniformity.

This indeed complicates it a lot. It could still be implemented using `perfect
hashing' techniques in one file but i have no experience with this and updates
may still be expensive.

With regards,
Reinoud

P.S. Ryusuke, did you recieve my mail on the length of partial segments?


[-- Attachment #1.2: Type: application/pgp-signature, Size: 486 bytes --]

[-- Attachment #2: Type: text/plain, Size: 158 bytes --]

_______________________________________________
users mailing list
users-JrjvKiOkagjYtjvyW6yDsg@public.gmane.org
https://www.nilfs.org/mailman/listinfo/users

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]             ` <2324ff2b0909102058g1ee407fco3f2482b4732dacca-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-09-11  6:23               ` Ryusuke Konishi
  0 siblings, 0 replies; 18+ messages in thread
From: Ryusuke Konishi @ 2009-09-11  6:23 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg, tangyorangesour-Re5JQEeQqe8AvxtiuMwx3w

Hi Prasenjit Giri,
On Fri, 11 Sep 2009 09:28:55 +0530, Prasenjit Giri <tangyorangesour-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, Sep 11, 2009 at 6:51 AM, Ryusuke Konishi <ryusuke-sG5X7nlA6pw@public.gmane.org> wrote:
> 
> > Hi Prasenjit Giri, and Reinoud,
> > On Thu, 10 Sep 2009 21:26:19 +0200, Reinoud Zandijk <wrote:
> > > Hi Prasenjit Giri,
> > >
> > > On Fri, Sep 11, 2009 at 12:46:04AM +0530, Prasenjit Giri wrote:
> > > > Even after looking through various sites, forums, mailing lists I could
> > not
> > > > uncover the present directoy management of NILFS2.
> > > >
> > > > As I see a B-tree directory management in your long term to do list, it
> > > > would be really nice if someone would ought to share the information
> > > > regarding the present directory management of NILFS2
> > >
> > > Currently directories are recorded just as a sequential file but not
> > filled
> > > with user data but with dirent entries in their blocks like ext2fs does.
> > > Nothing special about that :)
> > >
> > > I dont know if b+tree directory management would be preferable though.
> > With
> > > some smart caching all directory operations can be made O(1) anyway.
> > >
> > > With regards,
> > > Reinoud Zandijk
> >
> > Yes, directory is a file and it's already managed with b-tree like
> > other files.  However, the current directory design is not O(log n),
> > and is actually slow especially for file creation.
> >
> > So, it leaves room for considering the true "b-tree directory
> > management".
> >
> > OTOH, the replacement has the following points to notice:
> >
> >  1) It may complicate the log writer.
> >
> >    The current log writer is designed on the basis that every data
> >    and meta data is a file.  In NILFS, inodes, segment usage state,
> >    and checkpoints, are managed with correponding meta-data files.
> >
> >    The true "b-tree directory management" may break this uniformity.
> >
> >  2) A disk format change is required.
> >    (We can deem this a trade-off at this stage)
> >
> > This may be a -v3 material.  But anyway, patch proposals are welcome.
> >
> > Thank you,
> > Ryusuke Konishi
> >
> 
> 
> 
> Hello Reinoud Zandijk, Ryusuke Konishi
> 
> This has been more than helpful as I was at complete lost. But with this new
> side of it could you explain the quoted, " --- true 'b-tree directory
> management' --- ".
> 
> I've on this mail list, looking for ideas that would be feasible for an
> undergrad project -- though this adds a very unpleasant twist.
> 
> Is writable snapshot feasible ? If so, is there an abstract for this
> purpose? Is it doable in a years time?
> 
> Or could you all please suggest an alternative on such grounds.
> With regards,
> Prasenjit Giri
> Rajesh D. Yadav
> Ashwin Goel
> Sunil V. Chitkote
> 
> 
> -- 
> Prasenjit Giri

Sorry for my late reply.  The Reinoud's mail reminded me your
question.

> Is writable snapshot feasible ? If so, is there an abstract for this
> purpose? Is it doable in a years time?

Well, it is a desired feature.  But, I think the writable snapshot on
nilfs is a bit challenging.

It needs extending checkpoint file to allow branching.  The current
nilfs sequentially numbers checkpoints, and there is no branch on the
series.

One possible approach is to allocate checkpoints of branches in high
order zone of checkpoint numbers:

  --> checkpoint number

 |--------------------+            ...       |---|   ...  |---|  
       past cno   current cno              series of      series of
                                           checkpoints    branch-b
                                           for writable
                                           snapshot-A

In addition, it needs to enhance DAT file.

The current DAT file has the following entries.  The GC of nilfs uses
the [start, end) values to judge each block is live or dead for given
checkpoint period [cno1, cno2].

struct nilfs_dat_entry {
        __le64 de_blocknr;  /* disk block address */
        __le64 de_start;    /* start checkpoint number */
        __le64 de_end;      /* end checkpoint number this block died */
        __le64 de_rsv;
};

Live blocks are copied in GC, whreas dead blocks are discarded.

If the writable snapshot makes a branch on checkpoints, this entry
needs to be enhanced to have multiple [start, end) ranges.

One possible approach is to use the reserved field and chain multiple
entries to have multiple ranges.

How do you think these ?

> Or could you all please suggest an alternative on such grounds.

How about developing a new garbage collector.

The current GC has several problems:

 * it doesn't adjust GC speed. it reclaims in constant speed
   even while the I/O load is very high.
   it doesn't icrease speed even if the partition is in a near disk
   full condition.

 * it select the oldest segment for GC, and doesn't take account of
   the number of live or dead blocks in the segment.

 * it doesn't have defrag work.

The garbage collector is implemented as a userland daemon, so we can
easily replace it with a new one.

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]             ` <20090911052213.GA24899-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
@ 2009-09-11  6:44               ` Ryusuke Konishi
       [not found]                 ` <20090911.154452.43227079.ryusuke-sG5X7nlA6pw@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Ryusuke Konishi @ 2009-09-11  6:44 UTC (permalink / raw)
  To: reinoud-S783fYmB3Ccdnm+yROfE0A
  Cc: tangyorangesour-Re5JQEeQqe8AvxtiuMwx3w, users-JrjvKiOkagjYtjvyW6yDsg

Hi,
On Fri, 11 Sep 2009 07:22:13 +0200, Reinoud Zandijk wrote:
> Hi folks,
> 
> On Fri, Sep 11, 2009 at 10:21:18AM +0900, Ryusuke Konishi wrote:
> > > I dont know if b+tree directory management would be preferable though. With
> > > some smart caching all directory operations can be made O(1) anyway.
> > 
> > Yes, directory is a file and it's already managed with b-tree like
> > other files.  However, the current directory design is not O(log n),
> > and is actually slow especially for file creation.
> 
> Thats what i was refering to. UDF has the same issue and the same design on
> directory structure, only even more complicated since directory entries are
> not of a fixed length and may actually span logical sectors/blocks.
> 
> I now have code in UDF that i also use in my NiLFS implementation that manages
> the first file creation/reference in O(n) but all other files
> creation/lookup/delete in O(1)! speeding up large directory copies
> modifications/copies enormously, upto thousands of times.

You mean hashing (on-memory) by file names?
Well, I agree.  Such optimization would improve directory operations of
the current code without disk format change.

I think it should be considered along with ext3 approach.

> > So, it leaves room for considering the true "b-tree directory
> > management".
> > 
> > OTOH, the replacement has the following points to notice:
> > 
> >  1) It may complicate the log writer.
> > 
> >     The current log writer is designed on the basis that every data
> >     and meta data is a file.  In NILFS, inodes, segment usage state,
> >     and checkpoints, are managed with correponding meta-data files.
> > 
> >     The true "b-tree directory management" may break this uniformity.
> 
> This indeed complicates it a lot. It could still be implemented using `perfect
> hashing' techniques in one file but i have no experience with this and updates
> may still be expensive.
> 
> With regards,
> Reinoud
> 
> P.S. Ryusuke, did you recieve my mail on the length of partial segments?

Oh, I just noticed.

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                 ` <20090911.154452.43227079.ryusuke-sG5X7nlA6pw@public.gmane.org>
@ 2009-09-14 10:27                   ` Reinoud Zandijk
       [not found]                     ` <20090914102731.GA154-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Reinoud Zandijk @ 2009-09-14 10:27 UTC (permalink / raw)
  To: Ryusuke Konishi
  Cc: reinoud-S783fYmB3Ccdnm+yROfE0A, users-JrjvKiOkagjYtjvyW6yDsg

On Fri, Sep 11, 2009 at 03:44:52PM +0900, Ryusuke Konishi wrote:
> You mean hashing (on-memory) by file names?
> Well, I agree.  Such optimization would improve directory operations of
> the current code without disk format change.

Well, the strength is that you don't remember the directory contents, only the
hashes, diroffset and name length; this means nearly no memory use. Its also
authorative: not found in the hash means its not in the dir.

> I think it should be considered along with ext3 approach.

What is is the ext3 approach? What is different compared to ext2? And is it
really such a big improvement? Will it complicate directory reading/writing
more?

With regards,
Reinoud Zandijk

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                     ` <20090914102731.GA154-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
@ 2009-09-24  8:49                       ` Jiro SEKIBA
       [not found]                         ` <87hbusbud6.wl%jir-27yqGEOhnJbQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jiro SEKIBA @ 2009-09-24  8:49 UTC (permalink / raw)
  To: NILFS Users mailing list

Hi,

At Mon, 14 Sep 2009 12:27:31 +0200,
Reinoud Zandijk wrote:

> > I think it should be considered along with ext3 approach.
> 
> What is is the ext3 approach? What is different compared to ext2? And is it
> really such a big improvement? Will it complicate directory reading/writing
> more?

ext3 approach is a hashed index for file names, stored on disk.
So it always costs constant time to lookup the block filename stored.

The point is that this hashed index extension preserve backward compatibility,
which means that code which does not know about hashed index still be
able to read the directory entry as usual.

The improvement of this was significant compared with ext2.
I re-put up the following benchmark results.

http://plaza18.mbn.or.jp/~moriban/linux/img/FileSystemBenchmarkResults-01-03.png
# There may be more reliable benchmark result, but I cound't find it.

There are some documents for further information if you are interested in.

- A Directory Index for Ext2
 http://ols.fedoraproject.org/OLS/Reprints-2002/phillips-reprint.pdf
- [RFC] Ext2 Directory Index - File Structure
 http://lwn.net/2001/0412/a/index-format.php3
- [rfc] Near-constant time directory index for Ext2
 http://search.luky.org/linux-kernel.2001/msg00117.html

thanks,

regards,
-- 
Jiro SEKIBA <jir-hfpbi5WX9J54Eiagz67IpQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                         ` <87hbusbud6.wl%jir-27yqGEOhnJbQT0dZR+AlfA@public.gmane.org>
@ 2009-09-25 12:21                           ` Reinoud Zandijk
       [not found]                             ` <20090925122109.GD6624-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Reinoud Zandijk @ 2009-09-25 12:21 UTC (permalink / raw)
  To: NILFS Users mailing list

Hi folks,

On Thu, Sep 24, 2009 at 05:49:09PM +0900, Jiro SEKIBA wrote:
> Reinoud Zandijk wrote:
> > > I think it should be considered along with ext3 approach.
> > 
> > What is is the ext3 approach? What is different compared to ext2? And is it
> > really such a big improvement? Will it complicate directory reading/writing
> > more?
> 
> ext3 approach is a hashed index for file names, stored on disk.
> So it always costs constant time to lookup the block filename stored.

So its basicly the same aproach as mine only with the disadvantage that the
hash is stored on-disc and hacked into the directory entries. That explains
the deletion penelty to be paid since the has to be updated. One could cheat
by not updating the hash on disc on deletion since its result will backfire on
lookup but thats silly.

Hmm...  with these things in mind, i'd still opt for the application of my
hash scheme in NiLFS2. It does have the penalty that the directory has to be
sequentially read once but no further penalties and no disc layout change.
Note that it *only* needs the directory contents to be read, the inodes don't
have to be visited/created and is thus a cheap alternative.

The only time it will be slower than ext3 is when you randomly delete files
from large directories visiting each directory once... an event thats not
remotely realistic.

Looking into the papers you sent me, they refer to my scheme as a `cached
index' if i read it correctly. The disadvantages they bring forward are not
that realistic in common use. Especially not if you keep the cache entries
small like the 24/28 bytes it takes for mine and if it also keeps track of the
free entries.

Thoughts?

With regards,
Reinoud Zandijk

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                             ` <20090925122109.GD6624-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
@ 2009-09-25 15:47                               ` Ryusuke Konishi
       [not found]                                 ` <20090926.004730.113223425.ryusuke-sG5X7nlA6pw@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Ryusuke Konishi @ 2009-09-25 15:47 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg, reinoud-S783fYmB3Ccdnm+yROfE0A,
	Jiro SEKIBA

Hi,
On Fri, 25 Sep 2009 14:21:09 +0200, Reinoud Zandijk wrote:
> Hi folks,
> 
> On Thu, Sep 24, 2009 at 05:49:09PM +0900, Jiro SEKIBA wrote:
> > Reinoud Zandijk wrote:
> > > > I think it should be considered along with ext3 approach.
> > > 
> > > What is is the ext3 approach? What is different compared to ext2? And is it
> > > really such a big improvement? Will it complicate directory reading/writing
> > > more?
> > 
> > ext3 approach is a hashed index for file names, stored on disk.
> > So it always costs constant time to lookup the block filename stored.
>
> So its basicly the same aproach as mine only with the disadvantage
> that the hash is stored on-disc and hacked into the directory
> entries.  That explains the deletion penelty to be paid since the
> has to be updated. One could cheat by not updating the hash on disc
> on deletion since its result will backfire on lookup but thats
> silly.

You have a point.  Ext3 inserts intermediate structures on directory
files in somewhat tricky way.  So, it would suffer penalty for
deletion.  It's actually one of my concern for this approach.

Sekiba-san, thank you for gathering useful links.  As shown in the
figure you pointed to, I recognize improving the directory
implemenetation is one of priorities especially for file creation.

> Hmm...  with these things in mind, i'd still opt for the application of my
> hash scheme in NiLFS2. It does have the penalty that the directory has to be
> sequentially read once but no further penalties and no disc layout change.
> Note that it *only* needs the directory contents to be read, the inodes don't
> have to be visited/created and is thus a cheap alternative.

Well, I think your on-memory hash scheme is a hopeful candidate.
What kind of on-memory structures are used in your implementation?

I'd like to know how to rebuild your approach on Linux basis.  If
possible, I'd like to compare patches including the ext3 approach with
benchmarks.

> The only time it will be slower than ext3 is when you randomly delete files
> from large directories visiting each directory once... an event thats not
> remotely realistic.
>
> Looking into the papers you sent me, they refer to my scheme as a `cached
> index' if i read it correctly. The disadvantages they bring forward are not
> that realistic in common use. Especially not if you keep the cache entries
> small like the 24/28 bytes it takes for mine and if it also keeps track of the
> free entries.
> 
> Thoughts?

So, it uses 24/28 bytes on-memory entry per valid on-disk dirent.  How
about lifetime of the cache entries?  Are the entries designed to be
left on memory until they get shrunk by memory pressure?
Or flushed out in LRU ?

With regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                                 ` <20090926.004730.113223425.ryusuke-sG5X7nlA6pw@public.gmane.org>
@ 2009-09-25 16:23                                   ` Reinoud Zandijk
       [not found]                                     ` <20090925162334.GA14945-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Reinoud Zandijk @ 2009-09-25 16:23 UTC (permalink / raw)
  To: Ryusuke Konishi
  Cc: Jiro SEKIBA, reinoud-S783fYmB3Ccdnm+yROfE0A,
	users-JrjvKiOkagjYtjvyW6yDsg

Hi Ryusuke,

On Sat, Sep 26, 2009 at 12:47:30AM +0900, Ryusuke Konishi wrote:
> > So its basicly the same aproach as mine only with the disadvantage
> > that the hash is stored on-disc and hacked into the directory
> > entries.  That explains the deletion penelty to be paid since the
> > has to be updated. One could cheat by not updating the hash on disc
> > on deletion since its result will backfire on lookup but thats
> > silly.
> 
> You have a point.  Ext3 inserts intermediate structures on directory
> files in somewhat tricky way.  So, it would suffer penalty for
> deletion.  It's actually one of my concern for this approach.

The Ext3 approach also is hacky and shoe-horned. How can it even see if the
hash table is still OK? an Ext2 system could have radically reformed and
reformatted the dirents.

> Well, I think your on-memory hash scheme is a hopeful candidate.
> What kind of on-memory structures are used in your implementation?
> 
> I'd like to know how to rebuild your approach on Linux basis.  If
> possible, I'd like to compare patches including the ext3 approach with
> benchmarks.
> 
...
> So, it uses 24/28 bytes on-memory entry per valid on-disk dirent.  How
> about lifetime of the cache entries?  Are the entries designed to be
> left on memory until they get shrunk by memory pressure?
> Or flushed out in LRU ?

The maximum memory used for entries is run-time configurable but normally say
1Mb, but 2Mb could easily be overdone. All cache entries are associated with a
directory and are created/destroyed on a directory basis(!) to keep the
lookup/creation/deletion authorative. The directories are LRU replaced and
least used directories caches are freed to satisfy the maximum memory usage.
One could also go for the largest one from the least used but thats a choice.

I'll mail the source-snippets to you privately; but thats'll have to wait till
tomorrow. Please do remind me if you haven't recieved them!

With regards,
Reinoud Zandijk

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                                     ` <20090925162334.GA14945-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
@ 2009-09-26  1:16                                       ` Ryusuke Konishi
       [not found]                                         ` <20090926.101634.31873031.ryusuke-sG5X7nlA6pw@public.gmane.org>
  2009-09-27 11:34                                       ` Jiro SEKIBA
  1 sibling, 1 reply; 18+ messages in thread
From: Ryusuke Konishi @ 2009-09-26  1:16 UTC (permalink / raw)
  To: reinoud-S783fYmB3Ccdnm+yROfE0A
  Cc: jir-hfpbi5WX9J54Eiagz67IpQ, users-JrjvKiOkagjYtjvyW6yDsg

Hi Reinoud,
On Fri, 25 Sep 2009 18:23:34 +0200, Reinoud Zandijk wrote:
> Hi Ryusuke,
> 
> On Sat, Sep 26, 2009 at 12:47:30AM +0900, Ryusuke Konishi wrote:
> > > So its basicly the same aproach as mine only with the disadvantage
> > > that the hash is stored on-disc and hacked into the directory
> > > entries.  That explains the deletion penelty to be paid since the
> > > has to be updated. One could cheat by not updating the hash on disc
> > > on deletion since its result will backfire on lookup but thats
> > > silly.
> > 
> > You have a point.  Ext3 inserts intermediate structures on directory
> > files in somewhat tricky way.  So, it would suffer penalty for
> > deletion.  It's actually one of my concern for this approach.
> 
> The Ext3 approach also is hacky and shoe-horned. How can it even see if the
> hash table is still OK? an Ext2 system could have radically reformed and
> reformatted the dirents.

One of the reason I see it as a candidate is that it keeps both
backward and forward compatibility.  And, we can see the applicable
code in front of our eyes.

Honestly, I don't like introducing the hack that even needs the
``radical reform''.  Also, we cannot turn back once I send it
upstream.

So, I won't automatically adopt it without conclusive evaluation
results and consensus.  At least, I'd like to compare it with the
patch written on your approach.

> > Well, I think your on-memory hash scheme is a hopeful candidate.
> > What kind of on-memory structures are used in your implementation?
> > 
> > I'd like to know how to rebuild your approach on Linux basis.  If
> > possible, I'd like to compare patches including the ext3 approach with
> > benchmarks.
> > 
> ...
> > So, it uses 24/28 bytes on-memory entry per valid on-disk dirent.  How
> > about lifetime of the cache entries?  Are the entries designed to be
> > left on memory until they get shrunk by memory pressure?
> > Or flushed out in LRU ?
> 
> The maximum memory used for entries is run-time configurable but normally say
> 1Mb, but 2Mb could easily be overdone. All cache entries are associated with a
> directory and are created/destroyed on a directory basis(!) to keep the
> lookup/creation/deletion authorative. The directories are LRU replaced and
> least used directories caches are freed to satisfy the maximum memory usage.
> One could also go for the largest one from the least used but thats a choice.
> 
> I'll mail the source-snippets to you privately; but thats'll have to wait till
> tomorrow. Please do remind me if you haven't recieved them!

Well, could you share the reference to the snippet on this list?  I
think Sekiba-san also has interest in the code.

Thanks,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                                     ` <20090925162334.GA14945-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
  2009-09-26  1:16                                       ` Ryusuke Konishi
@ 2009-09-27 11:34                                       ` Jiro SEKIBA
  1 sibling, 0 replies; 18+ messages in thread
From: Jiro SEKIBA @ 2009-09-27 11:34 UTC (permalink / raw)
  To: users-JrjvKiOkagjYtjvyW6yDsg; +Cc: Reinoud Zandijk

Hi,

At Fri, 25 Sep 2009 18:23:34 +0200,
Reinoud Zandijk wrote:

...
> The Ext3 approach also is hacky and shoe-horned. How can it even see if the
> hash table is still OK? an Ext2 system could have radically reformed and
> reformatted the dirents.

The hash table consumes entire block.  First 64 bytes of the block is
fake dirent, whose inode number is 0 (means unused).  So when ext2
(or older code which does not know hash index) will use the hash index block
for normal dirents, the first fake dirent is replaced as normal entry.

So the algorithm still be able to detect the index hash is OK or not,
by checking the fake entry still stay FAKE or not, after older code mounts
the partition.

..

Thanks,

regards,
-- 
Jiro SEKIBA <jir-hfpbi5WX9J54Eiagz67IpQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                                         ` <20090926.101634.31873031.ryusuke-sG5X7nlA6pw@public.gmane.org>
@ 2009-09-27 22:10                                           ` Reinoud Zandijk
       [not found]                                             ` <20090927221049.GA14618-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Reinoud Zandijk @ 2009-09-27 22:10 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: users-JrjvKiOkagjYtjvyW6yDsg

Hi Ryusuke,

On Sat, Sep 26, 2009 at 10:16:34AM +0900, Ryusuke Konishi wrote:
> > The Ext3 approach also is hacky and shoe-horned. How can it even see if the
> > hash table is still OK? an Ext2 system could have radically reformed and
> > reformatted the dirents.
> 
> One of the reason I see it as a candidate is that it keeps both
> backward and forward compatibility.  And, we can see the applicable
> code in front of our eyes.

Ext3 hash correctness detection thus works ONLY if the first entry is indeed
recycled on a directory entry creation. If not, its pointing to garbage.....

> > I'll mail the source-snippets to you privately; but thats'll have to wait till
> > tomorrow. Please do remind me if you haven't recieved them!
> 
> Well, could you share the reference to the snippet on this list?  I
> think Sekiba-san also has interest in the code.

Sure! I'll first try to fixup the code a bit more since i noticed that the
free-space detection in Ext2 type direntries is euhm... hackish, just like in
FFS btw ;)

I've added a `compactable directory' bit now in the directory readin. If no
free space is found on insertion that can be reused directly (i.e. fits inside
a free space), it now knows it can issue a directory compaction. This
compaction can either compact the entire directory in one go or only compact
the broken up blocks; followed up by a rehash.

So i'll be experimenting a bit more and let you know.

With regards,
Reinoud

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: How does NILFS2 handle directory management
       [not found]                                             ` <20090927221049.GA14618-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
@ 2009-09-27 23:03                                               ` Jiro SEKIBA
  0 siblings, 0 replies; 18+ messages in thread
From: Jiro SEKIBA @ 2009-09-27 23:03 UTC (permalink / raw)
  To: NILFS Users mailing list; +Cc: Reinoud Zandijk

Hi,

At Mon, 28 Sep 2009 00:10:49 +0200,
Reinoud Zandijk wrote:
> 
> Hi Ryusuke,
> 
> On Sat, Sep 26, 2009 at 10:16:34AM +0900, Ryusuke Konishi wrote:
> > > The Ext3 approach also is hacky and shoe-horned. How can it even see if the
> > > hash table is still OK? an Ext2 system could have radically reformed and
> > > reformatted the dirents.
> > 
> > One of the reason I see it as a candidate is that it keeps both
> > backward and forward compatibility.  And, we can see the applicable
> > code in front of our eyes.
> 
> Ext3 hash correctness detection thus works ONLY if the first entry is indeed
> recycled on a directory entry creation. If not, its pointing to garbage.....

When older code see that block, entire block has one virtual entry,
which is the first fake entry.  Rest of the block is seemed to be body of
the first entry.  So actually, there is only one entry on the block.
Therefore, that is the only case to be recycled.

thanks,

regards,
-- 
Jiro SEKIBA <jir-hfpbi5WX9J54Eiagz67IpQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2009-09-27 23:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-10 19:16 How does NILFS2 handle directory management Prasenjit Giri
     [not found] ` <2324ff2b0909101216q31dad1a8y43ea0f229923a0c3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-09-10 19:26   ` Reinoud Zandijk
     [not found]     ` <20090910192619.GA1263-bVHBekiX4bNgoMqBc1r0ESegHCQxtGRMHZ5vskTnxNA@public.gmane.org>
2009-09-11  1:21       ` Ryusuke Konishi
     [not found]         ` <20090911.102118.69189169.ryusuke-sG5X7nlA6pw@public.gmane.org>
2009-09-11  3:48           ` Jiro SEKIBA
     [not found]             ` <873a6u16qy.wl%jir-27yqGEOhnJbQT0dZR+AlfA@public.gmane.org>
2009-09-11  4:13               ` Ryusuke Konishi
2009-09-11  3:58           ` Prasenjit Giri
     [not found]             ` <2324ff2b0909102058g1ee407fco3f2482b4732dacca-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-09-11  6:23               ` Ryusuke Konishi
2009-09-11  5:22           ` Reinoud Zandijk
     [not found]             ` <20090911052213.GA24899-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
2009-09-11  6:44               ` Ryusuke Konishi
     [not found]                 ` <20090911.154452.43227079.ryusuke-sG5X7nlA6pw@public.gmane.org>
2009-09-14 10:27                   ` Reinoud Zandijk
     [not found]                     ` <20090914102731.GA154-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
2009-09-24  8:49                       ` Jiro SEKIBA
     [not found]                         ` <87hbusbud6.wl%jir-27yqGEOhnJbQT0dZR+AlfA@public.gmane.org>
2009-09-25 12:21                           ` Reinoud Zandijk
     [not found]                             ` <20090925122109.GD6624-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
2009-09-25 15:47                               ` Ryusuke Konishi
     [not found]                                 ` <20090926.004730.113223425.ryusuke-sG5X7nlA6pw@public.gmane.org>
2009-09-25 16:23                                   ` Reinoud Zandijk
     [not found]                                     ` <20090925162334.GA14945-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
2009-09-26  1:16                                       ` Ryusuke Konishi
     [not found]                                         ` <20090926.101634.31873031.ryusuke-sG5X7nlA6pw@public.gmane.org>
2009-09-27 22:10                                           ` Reinoud Zandijk
     [not found]                                             ` <20090927221049.GA14618-5cYspOl2ggRz6xQTk39kMVfVdRo2wo/d@public.gmane.org>
2009-09-27 23:03                                               ` Jiro SEKIBA
2009-09-27 11:34                                       ` Jiro SEKIBA

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.