All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andreas Dilger <adilger@sun.com>
To: Zhang Huan <zhhuan@gmail.com>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Question on readdir implementation
Date: Tue, 15 Sep 2009 08:41:41 -0600	[thread overview]
Message-ID: <20090915144141.GS2537@webber.adilger.int> (raw)
In-Reply-To: <20090915095724.GA8440@zhanghuan.nrchpc.ac.cn>

On Sep 15, 2009  17:57 +0800, Zhang Huan wrote:
> I'm reading EXT4 codes and has some questions about readdir
> implementation.
> 
> Why traverse the directory in hash order? This brings lots of code to
> build and traverse a red-black tree. Why not just plainly traverse the
> directory's blocks?

Because htree does not maintain constant ordering of directory entries.
Consider a readdir running at the same time as files are being added:

readdir proc		create proc
read block 0
read block 1
			create new file
			    hash of filename puts entry into block 0
			    block 0 is full, split it
				add new block 3
				copy 1/2 of block 0 entries to block 3
				add new filename to block 0
read block 2
read block 3

When the readdir returns block 3, 1/2 of the entries in that block
are the same as were returned with the original block 0 read.  This
is a violation of POSIX readdir() semantics that require each existing
entry only be returned a single time (it doesn't matter if the new
filename is returned or not).
			
> Since the red-black tree is built every time a NFS readdir request comes
> in, in case of hash collision, the nfs client may receive duplicate dir
> entries if the buffer is not large enough to return all entries with the
> same hash value in once.

This is because NFSv2 only has a 32-bit cookie value, and if there is a
whole buffer full of values with the same 32-bit hash there isn't anything
that can be done about it except drop those extra duplicates (a very rare
case).  It would have a much more serious problem if there was a directory
larger than 2^32 bytes in size, which would result in all entries beyond
2GB being lost.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


  reply	other threads:[~2009-09-15 14:41 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-15  9:57 Question on readdir implementation Zhang Huan
2009-09-15 14:41 ` Andreas Dilger [this message]
2009-09-15 14:53 ` Theodore Tso
2009-09-15 17:56   ` Florian Weimer
2009-09-15 18:38     ` Theodore Tso
2009-09-16  5:47   ` Zhang Huan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090915144141.GS2537@webber.adilger.int \
    --to=adilger@sun.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=zhhuan@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.