All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zhang Huan <zhhuan@gmail.com>
To: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Question on readdir implementation
Date: Wed, 16 Sep 2009 13:47:46 +0800	[thread overview]
Message-ID: <20090916054746.GA2393@zhanghuan.nrchpc.ac.cn> (raw)
In-Reply-To: <20090915145337.GB23118@mit.edu>

On Tue, Sep 15, 2009 at 10:53:37AM -0400, Theodore Tso wrote:
> On Tue, Sep 15, 2009 at 05:57:24PM +0800, Zhang Huan wrote:
> > Hi all,
> > 
> > I'm reading EXT4 codes and has some questions about readdir
> > implementation.
> > 
> > Why traverse the directory in hash order? This brings lots of code to
> > build and traverse a red-black tree. Why not just plainly traverse the
> > directory's blocks?
> > 
> > Since the red-black tree is built every time a NFS readdir request comes
> > in, in case of hash collision, the nfs client may receive duplicate dir
> > entries if the buffer is not large enough to return all entries with the
> > same hash value in once.
> 
> The reason why we have to do this is because of the telldir() and
> seekdir() interfaces.  NFS is implicated in this as well, because it
> traverses the directory using a 32-bit offset on the protocol wire.
> 
> The fundamental problem is they presume that directory is stored as a
> linear array.  For file systems which use some kind of B-tree or other
> non-linear data structure to speed up directory lookups, the problem
> is what to do when we need to split a B-tree leaf.  When this happens,
> half of the directory entries in the that B-tree leaf must be moved to
> a new directory block.  However, readdir() must still return every
> single entry in the directory once and exactly once; and this must be
> true even if even if telldir()/seekdir() is used, and even if NFS
> decides to wait for several minutes or hours between reading the first
> set of directory entries, and then reading the next set of directory
> entries, sending to the NFS server nothing but the 32-bit offset
> token.
> 
> It is for this reason that we must traverse the directory in hash
> order; so that directory entries are returned once and only once even
> in the face of node splits.  We try very hard to avoid a hash
> collisions, including using a keyed hash which is kept secret to
> prevent users from deliberately trying to trigger the failure mode.
> 
> Looking more closely at what we have done, we should be able to do
> better on architectures where off_t is 64 bits.  Internally, we use a
> 64-bit hash, but we only put the high 32 bits of the hash in f_offset
> because we can't tell whether the telldir() or telldir64() call might
> be used by an application, and we can't tell whether the NFS server is
> speaking NFSv2 (where the readdir cookie is only 32 bits) or NFSv3
> (where the readdir cookie is cookie is 64 bits).  On platforms where
> off_t is 64-bytes, we could store the full 64-bit hash value in
> f_offset, but we would swap the low and high 32-bit values, so that 32
> LSB are in the high 32 bits of f_offset and vice versa.  
> 
> The NFS v2 server would still get a 64-bit f_offset, but it currently
> doesn't do any range checking before dropping it into the 32-bit
> cookie, so the high 32 bits would get truncated.  This would result in
> the same behaviour we have today for those poor benighted souls which
> are still using NFSv2, but allow for better behavior for NFSv3 at
> least on 64-bit platforms.
> 
Thanks for replying to explain in such details.

So there isn't much we can do before we improve nfsd's offset range
checking.

> So this is something we could do in the future.  In practice, no one
> has complained about this as far as NFS is concerned, so it's not high
> priority for me to pursue.  Were you worried about this as a practical
> matter, or as a theoretical one?
> 
Oh, I was asking because some guy I work with told me he has seen this
once before. He was testing a proprietary Windows NFS v3 client then,
using EXT3 as backend filesystem.
I do not know it is hash collision that cause the problem until I spent
some time reading EXT3 codes.

> Regards,
> 
> 						- Ted

-- 
Zhang Huan

      parent reply	other threads:[~2009-09-16  5:47 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-15  9:57 Question on readdir implementation Zhang Huan
2009-09-15 14:41 ` Andreas Dilger
2009-09-15 14:53 ` Theodore Tso
2009-09-15 17:56   ` Florian Weimer
2009-09-15 18:38     ` Theodore Tso
2009-09-16  5:47   ` Zhang Huan [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090916054746.GA2393@zhanghuan.nrchpc.ac.cn \
    --to=zhhuan@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.