* [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 17:37 ` Abhijith Das
  From: Abhijith Das
  To: linux-kernel, linux-fsdevel, cluster-devel
  26+ messages in thread

Hi all,

The topic of a readdirplus-like syscall came up for discussion at last
year's LSF/MM collab summit. I wrote a couple of syscalls, with GFS2
implementations, to retrieve a directory's entries as well as stat()
info on the individual inodes. I'm presenting these patches and some
early test results on a single-node GFS2 filesystem.

1. dirreadahead() - This patchset is very simple compared to the
xgetdents() system call below and scales very well for large
directories in GFS2. dirreadahead() is designed to be called prior to
getdents+stat operations. In its current form, it only speeds up
stat() operations by caching the relevant inodes. Support could be
added in the future to cache extended attribute blocks as well.

This works by first collecting all the inode numbers of the
directory's entries (subject to a numeric or memory cap). This list is
sorted by inode disk block order and passed to workqueues, which
perform the inode lookups asynchronously to bring them into the cache.

2. xgetdents() - I posted a version of this patchset some time last
year and it is largely unchanged - I just ported it to the latest
upstream kernel. It allows the user to request a combination of
entries, stat and xattrs (keys/values) for a directory. The stat
portion is based on David Howells' xstat patchset, which he also
posted last year; I've included the relevant VFS bits in my patchset.

xgetdents() in GFS2 works in two phases. In the first phase, it
collects all the dirents by reading the directory in question. In
phase two, it reads in inode blocks and xattr blocks (if requested)
for each entry after sorting the disk accesses in block order. All of
the intermediate data is stored in a buffer backed by a vector of
pages and is eventually transferred to the user-supplied buffer.

Both syscalls perform significantly better than a simple getdents+stat
with a cold cache. The main advantage lies in being able to sort the
disk accesses for a batch of inodes in advance, instead of seeking all
over the disk for inodes one entry at a time.

This graph (https://www.dropbox.com/s/fwi1ovu7mzlrwuq/speed-graph.png)
shows the time taken to get directory entries and their respective
stat info by 3 different sets of syscalls:

1) getdents+stat ('ls -l', basically) - Solid blue line
2) xgetdents with various buffer size and num_entries limits - Dotted
   lines. Eg: v16384 d10000 means a limit of 16384 pages for the
   scratch buffer and a maximum of 10000 entries to collect at a time.
3) dirreadahead+getdents+stat with various num_entries limits -
   Dash-dot lines. Eg: d10000 implies that it would fire off a max of
   10000 inode lookups during each syscall.

numfiles:                    10000   50000   100000   500000
------------------------------------------------------------
getdents+stat                 1.4s    220s     514s    2441s
xgetdents                     1.2s     43s      75s    1710s
dirreadahead+getdents+stat    1.1s      5s      68s     391s

Here is a seekwatcher graph from a test run on a directory of 50000
files. (https://www.dropbox.com/s/fma8d4jzh7365lh/50000-combined.png)
The comparison is between getdents+stat and xgetdents. The first set
of plots is of getdents+stat, followed by xgetdents() with steadily
increasing buffer size (256 to 262144) and num_entries (100 to
1000000) limits. One can see the effect of ordering the disk reads in
the Disk IO portion of the graphs, and the corresponding effect on
seeks, throughput and overall time taken.

A second seekwatcher graph similarly shows the
dirreadahead()+getdents()+stat() syscall combo for a 500000-file
directory with increasing num_entries (100 to 1000000) limits versus
getdents+stat.
(https://www.dropbox.com/s/rrhvamu99th3eae/500000-ra_combined_new.png)
The corresponding getdents+stat baseline for this run is at the top of
the series of graphs.

I'm posting these two patchsets shortly for comments.

Cheers!
--Abhi
Red Hat Filesystems

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 17:52 ` Zach Brown
  From: Zach Brown
  To: Abhijith Das; +Cc: linux-kernel, linux-fsdevel, cluster-devel

On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> Hi all,
>
> The topic of a readdirplus-like syscall had come up for discussion at
> last year's LSF/MM collab summit. I wrote a couple of syscalls with
> their GFS2 implementations to get at a directory's entries as well as
> stat() info on the individual inodes. I'm presenting these patches
> and some early test results on a single-node GFS2 filesystem.
>
> 1. dirreadahead() - This patchset is very simple compared to the
> xgetdents() system call below and scales very well for large
> directories in GFS2. dirreadahead() is designed to be called prior to
> getdents+stat operations.

Hmm.  Have you tried plumbing these read-ahead calls in under the
normal getdents() syscalls?

We don't have a filereadahead() syscall and yet we somehow manage to
implement buffered file data read-ahead :).

- z

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 18:08 ` Steven Whitehouse
  From: Steven Whitehouse
  To: Zach Brown, Abhijith Das; +Cc: linux-fsdevel, cluster-devel, linux-kernel

Hi,

On 25/07/14 18:52, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> [snip]
>>
>> 1. dirreadahead() - This patchset is very simple compared to the
>> xgetdents() system call below and scales very well for large
>> directories in GFS2. dirreadahead() is designed to be called prior
>> to getdents+stat operations.
> Hmm.  Have you tried plumbing these read-ahead calls in under the
> normal getdents() syscalls?
>
> We don't have a filereadahead() syscall and yet we somehow manage to
> implement buffered file data read-ahead :).
>
> - z
>
Well, I'm not sure that's entirely true... we have readahead() and we
also have fadvise(FADV_WILLNEED) for that. It could be added to
getdents() no doubt, but how would we tell getdents64() when we were
going to read the inodes, rather than just the file names? We may only
want to readahead some subset of the directory entries rather than all
of them, so the thought was to allow that flexibility by making it its
own syscall,

Steve.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 18:28 ` Zach Brown
  From: Zach Brown
  To: Steven Whitehouse; +Cc: Abhijith Das, linux-fsdevel, cluster-devel, linux-kernel

On Fri, Jul 25, 2014 at 07:08:12PM +0100, Steven Whitehouse wrote:
> Hi,
>
> On 25/07/14 18:52, Zach Brown wrote:
> [snip]
> >We don't have a filereadahead() syscall and yet we somehow manage to
> >implement buffered file data read-ahead :).
> >
> Well I'm not sure thats entirely true... we have readahead() and we also
> have fadvise(FADV_WILLNEED) for that.

Sure, fair enough.  It would have been more precise to say that
buffered file data readers see read-ahead without *having* to use a
syscall.

> doubt, but how would we tell getdents64() when we were going to read the
> inodes, rather than just the file names?

How does transparent file read-ahead know how far to read-ahead, if at
all?

How do the file systems that implement directory read-ahead today deal
with this?

Just playing devil's advocate here:  It's not at all obvious that
adding more interfaces is necessary to get directory read-ahead
working, given our existing read-ahead implementations.

- z

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 20:02 ` Steven Whitehouse
  From: Steven Whitehouse
  To: Zach Brown; +Cc: Abhijith Das, linux-fsdevel, cluster-devel, linux-kernel

Hi,

On 25/07/14 19:28, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 07:08:12PM +0100, Steven Whitehouse wrote:
[snip]
>> doubt, but how would we tell getdents64() when we were going to read the
>> inodes, rather than just the file names?
> How does transparent file read-ahead know how far to read-ahead, if at
> all?

In the file readahead case it has some context, and that's stored in
the struct file. That's where the problem lies in this case: the
struct file relates to the directory, and when we then call open, or
stat, or whatever on some file within that directory, we don't pass
the directory's fd to that call, so we don't have a context to use. We
could possibly look through the open fds of the process that called
open, to see if the parent dir of the inode we are opening is among
them, in order to find a context and figure out whether to do
readahead or not, but... it's not very nice, to say the least.

I'm very much in agreement that doing this automatically is best, but
that only works when it's possible to get a very good estimate of
whether the readahead is needed or not. That is much easier for file
data than it is for the inodes in a directory. If someone can figure
out how to get around this problem, then that is certainly something
we'd like to look at. The problem gets even more tricky if the user
only wants, say, half of the inodes in the directory... how does the
kernel know which half?

The idea here is really to give some idea of the kind of performance
gains we might see with the readahead vs xgetdents approaches and, by
the sizes of the patches, the relative complexity of the
implementations. I think overall the readahead approach is the more
flexible... if I had a directory full of files I wanted to truncate,
for example, it would be possible to use the same readahead to pull in
the inodes quickly and then issue the truncates against the pre-cached
inodes. That is something that would not be possible using
xgetdents(). Whether that's useful for real-world applications remains
to be seen, but it does show that it can handle more potential use
cases than xgetdents(). Also, the ability to readahead only an
application-specific subset of the inodes is a useful feature.

There is certainly a discussion to be had about how to specify the
inodes that are wanted. Using the directory position is a relatively
easy way to do it, and works well when most of the inodes in a
directory are wanted. Specifying the file names would work better when
fewer inodes are wanted, but then, if very few are required, is
readahead likely to give much of a gain anyway? ...so that's why we
chose the approach that we did.

> How do the file systems that implement directory read-ahead today deal
> with this?

I don't know of one that does - or at least, readahead of the
directory info itself is one thing (which is relatively easy, and done
by many file systems); it's reading ahead the inodes within the
directory which is more complex, and that is what we are talking about
here.

> Just playing devil's advocate here:  It's not at all obvious that
> adding more interfaces is necessary to get directory read-ahead
> working, given our existing read-ahead implementations.
>
> - z

That's perfectly ok - we hoped to generate some discussion, and they
are good questions,

Steve.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 20:30 ` Trond Myklebust
  From: Trond Myklebust
  To: Steven Whitehouse; +Cc: Zach Brown, Abhijith Das, linux-fsdevel, cluster-devel, Linux Kernel mailing list

On Fri, Jul 25, 2014 at 4:02 PM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
> On 25/07/14 19:28, Zach Brown wrote:
>> How do the file systems that implement directory read-ahead today
>> deal with this?
>
> I don't know of one that does - or at least readahead of the directory
> info itself is one thing (which is relatively easy, and done by many
> file systems) its reading ahead the inodes within the directory which
> is more complex, and what we are talking about here.
>

NFS looks at whether or not there are lookup revalidations and/or
getattr calls in between the calls to readdir(). If there are, then we
assume an 'ls -l' workload, and continue to issue readdirplus calls to
the server. Note that we also actively zap the readdir cache if we see
getattr calls over the wire, since the single call to readdirplus is
usually very much more efficient.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-26  0:38 ` Dave Chinner
  From: Dave Chinner
  To: Zach Brown; +Cc: Abhijith Das, linux-kernel, linux-fsdevel, cluster-devel

On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> > [snip]
> >
> > 1. dirreadahead() - This patchset is very simple compared to the
> > xgetdents() system call below and scales very well for large
> > directories in GFS2. dirreadahead() is designed to be called prior
> > to getdents+stat operations.
>
> Hmm.  Have you tried plumbing these read-ahead calls in under the
> normal getdents() syscalls?

The issue is not directory block readahead (which some filesystems
like XFS already have), but issuing inode readahead during the
getdents() syscall.

It's the semi-random, interleaved inode IO that is being optimised
here (i.e. queued, ordered, issued, cached), not the directory blocks
themselves.

As such, why does this need to be done in the kernel? This can all be
done in userspace, and even hidden within the readdir() or ftw/nftw()
implementations themselves, so it's OS, kernel and filesystem
independent......

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls 2014-07-26 0:38 ` [Cluster-devel] " Dave Chinner @ 2014-07-28 12:22 ` Abhijith Das -1 siblings, 0 replies; 26+ messages in thread From: Abhijith Das @ 2014-07-28 12:22 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, cluster-devel ----- Original Message ----- > From: "Dave Chinner" <david@fromorbit.com> > To: "Zach Brown" <zab@redhat.com> > Cc: "Abhijith Das" <adas@redhat.com>, linux-kernel@vger.kernel.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, > "cluster-devel" <cluster-devel@redhat.com> > Sent: Friday, July 25, 2014 7:38:59 PM > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls > > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: > > > Hi all, > > > > > > The topic of a readdirplus-like syscall had come up for discussion at > > > last year's > > > LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 > > > implementations > > > to get at a directory's entries as well as stat() info on the individual > > > inodes. > > > I'm presenting these patches and some early test results on a single-node > > > GFS2 > > > filesystem. > > > > > > 1. dirreadahead() - This patchset is very simple compared to the > > > xgetdents() system > > > call below and scales very well for large directories in GFS2. > > > dirreadahead() is > > > designed to be called prior to getdents+stat operations. > > > > Hmm. Have you tried plumbing these read-ahead calls in under the normal > > getdents() syscalls? > > The issue is not directory block readahead (which some filesystems > like XFS already have), but issuing inode readahead during the > getdents() syscall. > > It's the semi-random, interleaved inode IO that is being optimised > here (i.e. queued, ordered, issued, cached), not the directory > blocks themselves. 
As such, why does this need to be done in the > kernel? This can all be done in userspace, and even hidden within > the readdir() or ftw/ntfw() implementations themselves so it's OS, > kernel and filesystem independent...... > I don't see how the sorting of the inode reads in disk block order can be accomplished in userland without knowing the fs-specific topology. From my observations, I've seen that the performance gain is the most when we can order the reads such that seek times are minimized on rotational media. I have not tested my patches against SSDs, but my guess would be that the performance impact would be minimal, if any. Cheers! --Abhi ^ permalink raw reply [flat|nested] 26+ messages in thread
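[Editor's note] Abhi's point is that the kernel can sort reads by actual on-disk block address, which userspace cannot see. The closest userspace approximation is to sort directory entries by inode number (`d_ino`) before issuing the stat() calls — an approximation only, since (as Andreas Dilger notes later in the thread) inode-number order need not match disk order. A minimal sketch; the function name `stat_in_ino_order` is illustrative, not part of any posted patchset:

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent {
    ino_t ino;
    char  name[256];
};

static int by_ino(const void *a, const void *b)
{
    ino_t ia = ((const struct ent *)a)->ino;
    ino_t ib = ((const struct ent *)b)->ino;
    return (ia > ib) - (ia < ib);
}

/* stat() every entry of 'dir' in ascending d_ino order.
 * Returns the number of entries processed, or -1 on error. */
long stat_in_ino_order(const char *dir)
{
    DIR *dp = opendir(dir);
    if (!dp)
        return -1;

    struct ent *ents = NULL;
    size_t n = 0, cap = 0;
    struct dirent *de;

    while ((de = readdir(dp)) != NULL) {
        if (n == cap) {
            cap = cap ? cap * 2 : 64;
            struct ent *tmp = realloc(ents, cap * sizeof(*ents));
            if (!tmp) {
                free(ents);
                closedir(dp);
                return -1;
            }
            ents = tmp;
        }
        ents[n].ino = de->d_ino;
        snprintf(ents[n].name, sizeof(ents[n].name), "%s", de->d_name);
        n++;
    }

    /* Approximate "disk order" by inode number before the stat() pass. */
    if (n)
        qsort(ents, n, sizeof(*ents), by_ino);

    struct stat st;
    for (size_t i = 0; i < n; i++)
        (void)fstatat(dirfd(dp), ents[i].name, &st, 0);

    closedir(dp);
    free(ents);
    return (long)n;
}
```

Whether this ordering helps depends entirely on how the filesystem allocates inodes; it captures the spirit of the kernel-side sort without the fs-specific topology Abhi refers to.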
* RE: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls @ 2014-07-28 14:30 ` Zuckerman, Boris 0 siblings, 0 replies; 26+ messages in thread From: Zuckerman, Boris @ 2014-07-28 14:30 UTC (permalink / raw) To: Abhijith Das, Dave Chinner; +Cc: linux-kernel, linux-fsdevel, cluster-devel 2 years ago I had that type of functionality implemented for Ibrix. It included readdir-ahead and lookup-ahead. We did not assume any new syscalls, simply detected readdir+ like interest on VFS level and pushed a wave of populating directory caches and plugging in dentry cache entries. It improved productivity of NFS readdir+ and SMB QueryDirectories more than 4x. Regards, Boris > -----Original Message----- > From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel- > owner@vger.kernel.org] On Behalf Of Abhijith Das > Sent: Monday, July 28, 2014 8:22 AM > To: Dave Chinner > Cc: linux-kernel@vger.kernel.org; linux-fsdevel; cluster-devel > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls > > > > ----- Original Message ----- > > From: "Dave Chinner" <david@fromorbit.com> > > To: "Zach Brown" <zab@redhat.com> > > Cc: "Abhijith Das" <adas@redhat.com>, linux-kernel@vger.kernel.org, > > "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, "cluster-devel" > > <cluster-devel@redhat.com> > > Sent: Friday, July 25, 2014 7:38:59 PM > > Subject: Re: [RFC] readdirplus implementations: xgetdents vs > > dirreadahead syscalls > > > > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: > > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: > > > > Hi all, > > > > > > > > The topic of a readdirplus-like syscall had come up for discussion > > > > at last year's LSF/MM collab summit. I wrote a couple of syscalls > > > > with their GFS2 implementations to get at a directory's entries as > > > > well as stat() info on the individual inodes. 
> > > > I'm presenting these patches and some early test results on a > > > > single-node > > > > GFS2 > > > > filesystem. > > > > > > > > 1. dirreadahead() - This patchset is very simple compared to the > > > > xgetdents() system > > > > call below and scales very well for large directories in GFS2. > > > > dirreadahead() is > > > > designed to be called prior to getdents+stat operations. > > > > > > Hmm. Have you tried plumbing these read-ahead calls in under the > > > normal > > > getdents() syscalls? > > > > The issue is not directory block readahead (which some filesystems > > like XFS already have), but issuing inode readahead during the > > getdents() syscall. > > > > It's the semi-random, interleaved inode IO that is being optimised > > here (i.e. queued, ordered, issued, cached), not the directory blocks > > themselves. As such, why does this need to be done in the kernel? > > This can all be done in userspace, and even hidden within the > > readdir() or ftw/ntfw() implementations themselves so it's OS, kernel > > and filesystem independent...... > > > > I don't see how the sorting of the inode reads in disk block order can be accomplished in > userland without knowing the fs-specific topology. From my observations, I've seen that > the performance gain is the most when we can order the reads such that seek times are > minimized on rotational media. > > I have not tested my patches against SSDs, but my guess would be that the > performance impact would be minimal, if any. > > Cheers! > --Abhi > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a > message to majordomo@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls 2014-07-28 12:22 ` [Cluster-devel] " Abhijith Das @ 2014-07-31 3:25 ` Dave Chinner -1 siblings, 0 replies; 26+ messages in thread From: Dave Chinner @ 2014-07-31 3:25 UTC (permalink / raw) To: Abhijith Das; +Cc: linux-kernel, linux-fsdevel, cluster-devel On Mon, Jul 28, 2014 at 08:22:22AM -0400, Abhijith Das wrote: > > > ----- Original Message ----- > > From: "Dave Chinner" <david@fromorbit.com> > > To: "Zach Brown" <zab@redhat.com> > > Cc: "Abhijith Das" <adas@redhat.com>, linux-kernel@vger.kernel.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, > > "cluster-devel" <cluster-devel@redhat.com> > > Sent: Friday, July 25, 2014 7:38:59 PM > > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls > > > > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: > > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: > > > > Hi all, > > > > > > > > The topic of a readdirplus-like syscall had come up for discussion at > > > > last year's > > > > LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 > > > > implementations > > > > to get at a directory's entries as well as stat() info on the individual > > > > inodes. > > > > I'm presenting these patches and some early test results on a single-node > > > > GFS2 > > > > filesystem. > > > > > > > > 1. dirreadahead() - This patchset is very simple compared to the > > > > xgetdents() system > > > > call below and scales very well for large directories in GFS2. > > > > dirreadahead() is > > > > designed to be called prior to getdents+stat operations. > > > > > > Hmm. Have you tried plumbing these read-ahead calls in under the normal > > > getdents() syscalls? > > > > The issue is not directory block readahead (which some filesystems > > like XFS already have), but issuing inode readahead during the > > getdents() syscall. 
> > > > It's the semi-random, interleaved inode IO that is being optimised > > here (i.e. queued, ordered, issued, cached), not the directory > > blocks themselves. As such, why does this need to be done in the > > kernel? This can all be done in userspace, and even hidden within > > the readdir() or ftw/ntfw() implementations themselves so it's OS, > > kernel and filesystem independent...... > > > > I don't see how the sorting of the inode reads in disk block order can be > accomplished in userland without knowing the fs-specific topology. I didn't say anything about doing "disk block ordering" in userspace. disk block ordering can be done by the IO scheduler and that's simple enough to do by multithreading and dispatch a few tens of stat() calls at once.... > From my > observations, I've seen that the performance gain is the most when we can > order the reads such that seek times are minimized on rotational media. Yup, which is done by ensuring that we drive deep IO queues rather than issuing a single IO at a time and waiting for completion before issuing the next one. This can easily be done from userspace. > I have not tested my patches against SSDs, but my guess would be that the > performance impact would be minimal, if any. Depends. if the overhead of executing readahead is higher than the time spent waiting for IO completion, then it will reduce performance. i.e. the faster the underlying storage, the less CPU time we want to spend on IO. Readahead generally increases CPU time per object that needs to be retrieved from disk, and so on high IOP devices there's a really good chance we don't want readahead like this at all. i.e. this is yet another reason directory traversal readahead should be driven from userspace so the policy can be easily controlled by the application and/or user.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls 2014-07-26 0:38 ` [Cluster-devel] " Dave Chinner @ 2014-07-28 21:21 ` Andreas Dilger -1 siblings, 0 replies; 26+ messages in thread From: Andreas Dilger @ 2014-07-28 21:21 UTC (permalink / raw) To: Dave Chinner Cc: Zach Brown, Abhijith Das, linux-kernel, linux-fsdevel, cluster-devel On Jul 25, 2014, at 6:38 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: >> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: >>> Hi all, >>> >>> The topic of a readdirplus-like syscall had come up for discussion at last year's >>> LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 implementations >>> to get at a directory's entries as well as stat() info on the individual inodes. >>> I'm presenting these patches and some early test results on a single-node GFS2 >>> filesystem. >>> >>> 1. dirreadahead() - This patchset is very simple compared to the xgetdents() system >>> call below and scales very well for large directories in GFS2. dirreadahead() is >>> designed to be called prior to getdents+stat operations. >> >> Hmm. Have you tried plumbing these read-ahead calls in under the normal >> getdents() syscalls? > > The issue is not directory block readahead (which some filesystems > like XFS already have), but issuing inode readahead during the > getdents() syscall. > > It's the semi-random, interleaved inode IO that is being optimised > here (i.e. queued, ordered, issued, cached), not the directory > blocks themselves. Sure. > As such, why does this need to be done in the > kernel? This can all be done in userspace, and even hidden within > the readdir() or ftw/ntfw() implementations themselves so it's OS, > kernel and filesystem independent...... That assumes sorting by inode number maps to sorting by disk order. That isn't always true.
Cheers, Andreas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
  2014-07-28 21:21 ` [Cluster-devel] " Andreas Dilger
@ 2014-07-31  3:16   ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2014-07-31  3:16 UTC (permalink / raw)
To: Andreas Dilger
Cc: Zach Brown, Abhijith Das, linux-kernel, linux-fsdevel, cluster-devel

On Mon, Jul 28, 2014 at 03:21:20PM -0600, Andreas Dilger wrote:
> On Jul 25, 2014, at 6:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> >> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> >>> Hi all,
> >>>
> >>> The topic of a readdirplus-like syscall had come up for discussion at
> >>> last year's LSF/MM collab summit. I wrote a couple of syscalls with
> >>> their GFS2 implementations to get at a directory's entries as well as
> >>> stat() info on the individual inodes. I'm presenting these patches and
> >>> some early test results on a single-node GFS2 filesystem.
> >>>
> >>> 1. dirreadahead() - This patchset is very simple compared to the
> >>> xgetdents() system call below and scales very well for large
> >>> directories in GFS2. dirreadahead() is designed to be called prior to
> >>> getdents+stat operations.
> >>
> >> Hmm. Have you tried plumbing these read-ahead calls in under the normal
> >> getdents() syscalls?
> >
> > The issue is not directory block readahead (which some filesystems
> > like XFS already have), but issuing inode readahead during the
> > getdents() syscall.
> >
> > It's the semi-random, interleaved inode IO that is being optimised
> > here (i.e. queued, ordered, issued, cached), not the directory
> > blocks themselves.
>
> Sure.
>
> > As such, why does this need to be done in the
> > kernel? This can all be done in userspace, and even hidden within
> > the readdir() or ftw/nftw() implementations themselves so it's OS,
> > kernel and filesystem independent......
>
> That assumes sorting by inode number maps to sorting by disk order.
> That isn't always true.

That's true, but it's a fair bet that roughly ascending inode number
ordering is going to be better than random ordering for most
filesystems.

Besides, ordering isn't the real problem - the real problem is the
latency caused by having to do the inode IO synchronously one stat() at
a time. Just multithread the damn thing in userspace so the stat()s can
be done asynchronously and hence be more optimally ordered by the IO
scheduler and completed before the application blocks on the IO.

It doesn't even need completion synchronisation - the stat() issued by
the application will block until the async stat() completes the process
of bringing the inode into the kernel cache...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2014-07-31  3:25 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1106785262.13440918.1406308542921.JavaMail.zimbra@redhat.com>
2014-07-25 17:37 ` [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls Abhijith Das
2014-07-25 17:52   ` Zach Brown
2014-07-25 18:08     ` Steven Whitehouse
2014-07-25 18:28       ` Zach Brown
2014-07-25 20:02         ` Steven Whitehouse
2014-07-25 20:30           ` Trond Myklebust
2014-07-26  0:38   ` Dave Chinner
2014-07-28 12:22     ` Abhijith Das
2014-07-28 14:30       ` Zuckerman, Boris
2014-07-31  3:25       ` Dave Chinner
2014-07-28 21:21   ` Andreas Dilger
2014-07-31  3:16     ` Dave Chinner