From: Boaz Harrosh <bharrosh-C4P08NqkoRlBDgjK7y7TUQ@public.gmane.org>
To: Bernd Schubert <bernd.schubert-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
Cc: Ted Ts'o <tytso-3s7WtUTddSA@public.gmane.org>, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: infinite getdents64 loop
Date: Wed, 01 Jun 2011 16:10:26 +0300
Message-ID: <4DE63A42.4090102@panasas.com>
In-Reply-To: <4DE525AE.9030806-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>

On 05/31/2011 08:30 PM, Bernd Schubert wrote:
> On 05/31/2011 07:13 PM, Boaz Harrosh wrote:
>> On 05/31/2011 03:35 PM, Ted Ts'o wrote:
>>> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>>>
>>>> Out of interest, did anyone ever benchmark whether dirindex provides
>>>> any advantages to readdir? And did those benchmarks include the
>>>> disadvantages of the present implementation (non-linear inode
>>>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l')
>>>> or 'rm -fr $dir')?
>>>
>>> The problem is that seekdir/telldir is terminally broken (and so is
>>> NFSv2, for using such a tiny cookie) in that it fundamentally assumes
>>> a linear data structure. If you're going to use any kind of
>>> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
>>> doesn't cut it. We actually play games where we memoize the low
>>> 32 bits of the hash and keep track of which cookies we hand out via
>>> seekdir/telldir so that things mostly work --- except for NFSv2,
>>> where, with the 32-bit cookie, you're just hosed.
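[Editor's note: the "game" Ted describes, handing out only the low 32 bits of a 64-bit hash position as the telldir cookie, can be sketched as below. This is a hedged illustration, not ext4's actual code: the real htree hash is half-MD4/TEA, while `name_hash` here is a stand-in (FNV-1a), and all names are invented.]

```c
#include <stdint.h>

/* Toy name hash standing in for ext4's htree hash. Assumption: this is
 * FNV-1a for illustration only, not the real half-MD4/TEA hash. */
static uint64_t name_hash(const char *name)
{
    uint64_t h = 0xcbf29ce484222325ULL;          /* FNV-1a offset basis */
    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 0x100000001b3ULL;                   /* FNV-1a prime */
    }
    return h;
}

/* telldir() can only hand back a 32-bit value here, so we memoize the
 * low half of the 64-bit hash position. Distinct entries whose hashes
 * collide in the low 32 bits become indistinguishable -- which is why
 * the server must also track which cookies it has handed out. */
static uint32_t telldir_cookie(uint64_t hash_pos)
{
    return (uint32_t)hash_pos;
}
```

The upper 32 bits of the position are simply lost in the cookie, which is the root of the NFSv2 problem Ted points at.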
>>>
>>> The reason why we have to iterate over the directory in hash-tree
>>> order is that when a leaf node splits, half the directory entries
>>> get copied to another directory block. Given the promises made by
>>> seekdir() and telldir() about directory entries appearing exactly
>>> once during a readdir() stream, even if you hold the fd open for
>>> weeks or days, that means you really have to iterate over things in
>>> hash order.
>>
>> An open fd means that it does not survive a server reboot. Why don't
>> you keep an array per open fd, and hand out the array index? In the
>> array you can keep a pointer to any info you want to keep. (That's
>> the meaning of a cookie.)
>
> An array can take lots of memory for a large directory, of course. Do
> we really want to do that in kernel space? Although I wouldn't have a
> problem with reserving a certain amount of memory for that. But what
> do we do if that gets exhausted (for example, the directory is too
> large, or there are several open file descriptors)?

You misunderstood me. Ted was complaining that the cookie was only 32
bits and he hoped it was bigger, perhaps 128 minimum. What I said is
that for each open fd, a cookie is returned that denotes a temporary
space allocated for just that caller. When a second call with the same
fd and same cookie comes in, the allocated object is inspected to
retrieve all the information needed to continue the walk from the same
place. So the allocated space is only per active caller, up to the time
the fd is closed. (I never meant per directory entry.)

> And how does that help with NFS and other cluster filesystems where
> the client passes over the cookie? Do we ignore POSIX compliance then?

I was not referring to that. I understand that this is a hard problem,
but it is solvable. The space per cookie is solved above.

> Thanks,
> Bernd

But this is all talk. I don't know enough about, or use, ext4 to be
able to solve it myself. So I'm just babbling out here.
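[Editor's note: the per-open-fd scheme Boaz describes can be sketched as below. This is a hypothetical illustration with invented names, not code from any filesystem: the cookie handed to the caller is just an index into a small server-side table, and the slot holds whatever cursor state is needed to resume the walk. This also answers Bernd's memory concern: the table is sized by active open fds, not by directory entries.]

```c
#include <stdint.h>
#include <string.h>

/* One slot per active caller; holds the full resume state that a
 * 32-bit wire cookie cannot carry. All names here are hypothetical. */
struct dir_cursor {
    uint64_t hash_pos;   /* full 64-bit htree position */
    int      in_use;
};

#define MAX_OPEN_DIRS 64
static struct dir_cursor cursors[MAX_OPEN_DIRS];

/* Allocate a slot at open time; the returned index is the cookie. */
static int cursor_alloc(uint64_t start_pos)
{
    for (int i = 0; i < MAX_OPEN_DIRS; i++) {
        if (!cursors[i].in_use) {
            cursors[i].in_use = 1;
            cursors[i].hash_pos = start_pos;
            return i;
        }
    }
    return -1;           /* table exhausted */
}

/* A later call with the same cookie recovers the full cursor state. */
static struct dir_cursor *cursor_lookup(int cookie)
{
    if (cookie < 0 || cookie >= MAX_OPEN_DIRS || !cursors[cookie].in_use)
        return NULL;
    return &cursors[cookie];
}

/* Free the slot when the fd is closed. */
static void cursor_release(int cookie)
{
    struct dir_cursor *c = cursor_lookup(cookie);
    if (c)
        memset(c, 0, sizeof(*c));
}
```

As the thread notes, this does not survive a server reboot and does not help a protocol where the client round-trips the cookie itself; it only solves the "cookie too small to describe a tree position" half of the problem.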
Just that on the server side we've done this before: keep things in an
internal array and return the index as a magic cookie, when more
information needed to be kept internally.

Boaz