From: Bernd Schubert <bernd.schubert-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
To: Andreas Dilger <adilger-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org>
Cc: "Ted Ts'o" <tytso-3s7WtUTddSA@public.gmane.org>,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	"linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List"
	<linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Fan Yong <yong.fan-KloliPT79xf2eFz/2MeuCQ@public.gmane.org>
Subject: Re: infinite getdents64 loop
Date: Tue, 31 May 2011 19:43:58 +0200
Message-ID: <4DE528DE.5020908@itwm.fraunhofer.de>
In-Reply-To: <D598829B-FB36-4DA8-978E-8C689940D0FA-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org>

On 05/31/2011 07:26 PM, Andreas Dilger wrote:
> On 2011-05-31, at 6:35 AM, Ted Ts'o wrote:
>> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>>
>>> Out of interest, did anyone ever benchmark if dirindex provides any
>>> advantages to readdir?  And did those benchmarks include the
>>> disadvantages of the present implementation (non-linear inode
>>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>>> 'rm -fr $dir')?
>>
>> The problem is that seekdir/telldir is terminally broken (and so is
>> NFSv2 for using such a tiny cookie) in that it fundamentally assumes
>> a linear data structure.  If you're going to use any kind of
>> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
>> doesn't cut it.  We actually play games where we memoize the low
>> 32-bits of the hash and keep track of which cookies we hand out via
>> seekdir/telldir so that things mostly work --- except for NFSv2, where
>> with the 32-bit cookie, you're just hosed.
>>
>> The reason why we have to iterate over the directory in hash tree
>> order is that when a leaf node splits, half of its directory entries
>> get copied to another directory block.  Given the promises made by
>> seekdir() and telldir() about directory entries appearing exactly
>> once during a readdir() stream, even if you hold the fd open for days
>> or weeks, you really have to iterate over things in hash order.
>>
>> I'd have to look, since it's been too many years, but as I recall the
>> problem was that there is a common path for NFSv2 and NFSv3/v4, so we
>> don't know whether we can hand back a 32-bit cookie or a 64-bit
>> cookie, so we're always handing the NFS server a 32-bit "offset", even
>> though we could do better.  Actually, if we had an interface where we
>> could give you a 128-bit "offset" into the directory, we could
>> probably eliminate the duplicate cookie problem entirely.  We would
>> just send 64 bits' worth of hash, plus the first two bytes of the
>> file name.
>
> If it's of interest, we've implemented a 64-bit hash mode for ext4 to
> solve just this problem for Lustre.  The llseek() code will return a
> 64-bit hash value on 64-bit systems, unless it is running for some
> process that needs a 32-bit hash value (only NFSv2, AFAIK).
>
> The attached patch can at least form the basis for being able to return
> 64-bit hash values for userspace/NFSv3/v4 when usable.  The patch
> is NOT usable as it stands now, since I've had to modify it from the
> version that we are currently using for Lustre (this version hasn't
> actually been compiled), but it at least shows the outline of what needs
> to be done to get this working.  None of the NFS side is implemented.

Thanks Andreas! I haven't tested it yet, but the general idea looks
good. I guess the lower part of the patch (the netfilter stuff) got
in accidentally?
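
For readers of the archive, the width selection described in the quoted
text (a 64-bit value by default, a 32-bit value for callers that cannot
take more) might be sketched as follows. This is not taken from the
patch: the function names, the need_32bit flag, and the choice to pack
a major hash in the high half and a minor hash in the low half are
assumptions for illustration.

```python
# Illustrative sketch only. Names and bit layout are assumptions and
# are not taken from the patch discussed in this thread.

MASK32 = 0xFFFFFFFF


def hash_to_offset(major: int, minor: int, need_32bit: bool) -> int:
    """Build a directory offset from two 32-bit hash halves.

    Callers limited to 32-bit cookies get only the major hash;
    everyone else gets both halves packed into 64 bits.
    """
    if need_32bit:
        return major & MASK32
    return ((major & MASK32) << 32) | (minor & MASK32)


def offset_to_major(offset: int, need_32bit: bool) -> int:
    """Recover the major hash from an offset built by hash_to_offset."""
    if need_32bit:
        return offset & MASK32
    return (offset >> 32) & MASK32


def offset_to_minor(offset: int, need_32bit: bool) -> int:
    """Recover the minor hash; it is lost (zero) in 32-bit mode."""
    if need_32bit:
        return 0
    return offset & MASK32
```

In this layout a 32-bit caller can only resume at major-hash
granularity, so entries whose major hash is equal cannot be told apart
by the offset alone, while a 64-bit caller keeps both halves.
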


Cheers,
Bernd
