* [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 17:37 ` Abhijith Das
  From: Abhijith Das
  To: linux-kernel, linux-fsdevel, cluster-devel
  26+ messages in thread

Hi all,

The topic of a readdirplus-like syscall came up for discussion at last
year's LSF/MM collab summit. I wrote a couple of syscalls, with GFS2
implementations, to retrieve a directory's entries as well as stat()
info on the individual inodes. I'm presenting these patches and some
early test results on a single-node GFS2 filesystem.

1. dirreadahead() - This patchset is very simple compared to the
xgetdents() system call below and scales very well for large
directories in GFS2. dirreadahead() is designed to be called prior to
getdents+stat operations. In its current form, it only speeds up
stat() operations by caching the relevant inodes. Support could be
added in the future to cache extended attribute blocks as well.

This works by first collecting all the inode numbers of the
directory's entries (subject to a numeric or memory cap). This list is
sorted by inode disk block order and passed to workqueues, which
perform the inode lookups asynchronously to bring them into the cache.

2. xgetdents() - I posted a version of this patchset some time last
year and it is largely unchanged - I just ported it to the latest
upstream kernel. It allows the user to request a combination of
entries, stat and xattrs (keys/values) for a directory. The stat
portion is based on David Howells' xstat patchset, which he also
posted last year; I've included the relevant VFS bits in my patchset.

xgetdents() in GFS2 works in two phases. In the first phase, it
collects all the dirents by reading the directory in question. In
phase two, it reads in inode blocks and xattr blocks (if requested)
for each entry after sorting the disk accesses in block order. All of
the intermediate data is stored in a buffer backed by a vector of
pages and is eventually transferred to the user-supplied buffer.

Both syscalls perform significantly better than a simple getdents+stat
with a cold cache. The main advantage lies in being able to sort the
disk accesses for a batch of inodes in advance, instead of seeking all
over the disk for inodes one entry at a time.

This graph (https://www.dropbox.com/s/fwi1ovu7mzlrwuq/speed-graph.png)
shows the time taken to get directory entries and their respective
stat info by 3 different sets of syscalls:

1) getdents+stat ('ls -l', basically) - Solid blue line
2) xgetdents with various buffer size and num_entries limits - Dotted
   lines. Eg: v16384 d10000 means a limit of 16384 pages for the
   scratch buffer and a maximum of 10000 entries to collect at a time.
3) dirreadahead+getdents+stat with various num_entries limits -
   Dash-dot lines. Eg: d10000 implies that it would fire off a max of
   10000 inode lookups during each syscall.

numfiles:                    10000   50000   100000   500000
------------------------------------------------------------
getdents+stat                 1.4s    220s     514s    2441s
xgetdents                     1.2s     43s      75s    1710s
dirreadahead+getdents+stat    1.1s      5s      68s     391s

Here is a seekwatcher graph from a test run on a directory of 50000
files. (https://www.dropbox.com/s/fma8d4jzh7365lh/50000-combined.png)
The comparison is between getdents+stat and xgetdents. The first set
of plots is of getdents+stat, followed by xgetdents() with steadily
increasing buffer size (256 to 262144) and num_entries (100 to
1000000) limits. One can see the effect of ordering the disk reads in
the Disk IO portion of the graphs, and the corresponding effect on
seeks, throughput and overall time taken.

A second seekwatcher graph similarly shows the
dirreadahead()+getdents()+stat() syscall combo for a 500000-file
directory with increasing num_entries (100 to 1000000) limits versus
getdents+stat.
(https://www.dropbox.com/s/rrhvamu99th3eae/500000-ra_combined_new.png)
The corresponding getdents+stat baseline for this run is at the top of
the series of graphs.

I'm posting these two patchsets shortly for comments.

Cheers!
--Abhi
Red Hat Filesystems

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 17:52 ` Zach Brown
  From: Zach Brown
  To: Abhijith Das; +Cc: linux-kernel, linux-fsdevel, cluster-devel

On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> Hi all,
>
> The topic of a readdirplus-like syscall had come up for discussion at
> last year's LSF/MM collab summit. I wrote a couple of syscalls with
> their GFS2 implementations to get at a directory's entries as well as
> stat() info on the individual inodes. I'm presenting these patches
> and some early test results on a single-node GFS2 filesystem.
>
> 1. dirreadahead() - This patchset is very simple compared to the
> xgetdents() system call below and scales very well for large
> directories in GFS2. dirreadahead() is designed to be called prior to
> getdents+stat operations.

Hmm.  Have you tried plumbing these read-ahead calls in under the
normal getdents() syscalls?

We don't have a filereadahead() syscall and yet we somehow manage to
implement buffered file data read-ahead :).

- z

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 18:08 ` Steven Whitehouse
  From: Steven Whitehouse
  To: Zach Brown, Abhijith Das; +Cc: linux-fsdevel, cluster-devel, linux-kernel

Hi,

On 25/07/14 18:52, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> [snip]
>>
>> 1. dirreadahead() - This patchset is very simple compared to the
>> xgetdents() system call below and scales very well for large
>> directories in GFS2. dirreadahead() is designed to be called prior
>> to getdents+stat operations.
> Hmm.  Have you tried plumbing these read-ahead calls in under the
> normal getdents() syscalls?
>
> We don't have a filereadahead() syscall and yet we somehow manage to
> implement buffered file data read-ahead :).
>
> - z
>
Well, I'm not sure that's entirely true... we have readahead() and we
also have fadvise(FADV_WILLNEED) for that. It could be added to
getdents() no doubt, but how would we tell getdents64() when we were
going to read the inodes, rather than just the file names? We may only
want to readahead some subset of the directory entries rather than all
of them, so the thought was to allow that flexibility by making it its
own syscall,

Steve.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 18:28 ` Zach Brown
  From: Zach Brown
  To: Steven Whitehouse; +Cc: Abhijith Das, linux-fsdevel, cluster-devel, linux-kernel

On Fri, Jul 25, 2014 at 07:08:12PM +0100, Steven Whitehouse wrote:
> Hi,
>
> On 25/07/14 18:52, Zach Brown wrote:
> [snip]
> >We don't have a filereadahead() syscall and yet we somehow manage to
> >implement buffered file data read-ahead :).
> >
> Well I'm not sure thats entirely true... we have readahead() and we also
> have fadvise(FADV_WILLNEED) for that.

Sure, fair enough.  It would have been more precise to say that
buffered file data readers see read-ahead without *having* to use a
syscall.

> doubt, but how would we tell getdents64() when we were going to read the
> inodes, rather than just the file names?

How does transparent file read-ahead know how far to read-ahead, if at
all?

How do the file systems that implement directory read-ahead today deal
with this?

Just playing devil's advocate here:  It's not at all obvious that
adding more interfaces is necessary to get directory read-ahead
working, given our existing read-ahead implementations.

- z

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 20:02 ` Steven Whitehouse
  From: Steven Whitehouse
  To: Zach Brown; +Cc: Abhijith Das, linux-fsdevel, cluster-devel, linux-kernel

Hi,

On 25/07/14 19:28, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 07:08:12PM +0100, Steven Whitehouse wrote:
[snip]
>> doubt, but how would we tell getdents64() when we were going to read the
>> inodes, rather than just the file names?
> How does transparent file read-ahead know how far to read-ahead, if at
> all?

In the file readahead case it has some context, and that's stored in
the struct file. That's where the problem lies in this case: the
struct file relates to the directory, and when we then call open, or
stat, or whatever on some file within that directory, we don't pass
the directory's fd to that call, so we don't have a context to use. We
could possibly look through the open fds of the process that called
open, to see if the parent dir of the inode we are opening is among
them, in order to find a context and figure out whether to do
readahead or not, but... it's not very nice, to say the least.

I'm very much in agreement that doing this automatically is best, but
that only works when it's possible to get a very good estimate of
whether the readahead is needed or not. That is much easier for file
data than it is for the inodes in a directory. If someone can figure
out how to get around this problem, then that is certainly something
we'd like to look at. The problem gets even more tricky if the user
only wants, say, half of the inodes in the directory... how does the
kernel know which half?

The idea here is really to give some idea of the kind of performance
gains we might see with the readahead vs xgetdents approaches and, by
the sizes of the patches, the relative complexity of the
implementations. I think overall the readahead approach is the more
flexible... if I had a directory full of files I wanted to truncate,
for example, it would be possible to use the same readahead to pull in
the inodes quickly and then issue the truncates against the pre-cached
inodes. That is something that would not be possible using
xgetdents(). Whether that's useful for real-world applications remains
to be seen, but it does show that it can handle more potential use
cases than xgetdents(). Also, the ability to readahead only an
application-specific subset of the inodes is a useful feature.

There is certainly a discussion to be had about how to specify the
inodes that are wanted. Using the directory position is a relatively
easy way to do it, and works well when most of the inodes in a
directory are wanted. Specifying the file names would work better when
fewer inodes are wanted, but then, if very few are required, is
readahead likely to give much of a gain anyway? ...so that's why we
chose the approach that we did.

> How do the file systems that implement directory read-ahead today deal
> with this?

I don't know of one that does - or at least, readahead of the
directory info itself is one thing (which is relatively easy, and done
by many file systems); it's reading ahead the inodes within the
directory which is more complex, and that is what we are talking about
here.

> Just playing devil's advocate here:  It's not at all obvious that
> adding more interfaces is necessary to get directory read-ahead
> working, given our existing read-ahead implementations.
>
> - z

That's perfectly ok - we hoped to generate some discussion, and they
are good questions,

Steve.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-25 20:30 ` Trond Myklebust
  From: Trond Myklebust
  To: Steven Whitehouse; +Cc: Zach Brown, Abhijith Das, linux-fsdevel, cluster-devel, Linux Kernel mailing list

On Fri, Jul 25, 2014 at 4:02 PM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
> On 25/07/14 19:28, Zach Brown wrote:
>> How do the file systems that implement directory read-ahead today
>> deal with this?
>
> I don't know of one that does - or at least readahead of the directory
> info itself is one thing (which is relatively easy, and done by many
> file systems) its reading ahead the inodes within the directory which
> is more complex, and what we are talking about here.
>

NFS looks at whether or not there are lookup revalidations and/or
getattr calls in between the calls to readdir(). If there are, then we
assume an 'ls -l' workload, and continue to issue readdirplus calls to
the server. Note that we also actively zap the readdir cache if we see
getattr calls over the wire, since the single call to readdirplus is
usually very much more efficient.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
@ 2014-07-26  0:38 ` Dave Chinner
  From: Dave Chinner
  To: Zach Brown; +Cc: Abhijith Das, linux-kernel, linux-fsdevel, cluster-devel

On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> > [snip]
> >
> > 1. dirreadahead() - This patchset is very simple compared to the
> > xgetdents() system call below and scales very well for large
> > directories in GFS2. dirreadahead() is designed to be called prior
> > to getdents+stat operations.
>
> Hmm.  Have you tried plumbing these read-ahead calls in under the
> normal getdents() syscalls?

The issue is not directory block readahead (which some filesystems
like XFS already have), but issuing inode readahead during the
getdents() syscall.

It's the semi-random, interleaved inode IO that is being optimised
here (i.e. queued, ordered, issued, cached), not the directory blocks
themselves.

As such, why does this need to be done in the kernel? This can all be
done in userspace, and even hidden within the readdir() or ftw/nftw()
implementations themselves, so it's OS, kernel and filesystem
independent......

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls 2014-07-26 0:38 ` [Cluster-devel] " Dave Chinner @ 2014-07-28 12:22 ` Abhijith Das -1 siblings, 0 replies; 26+ messages in thread From: Abhijith Das @ 2014-07-28 12:22 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, cluster-devel ----- Original Message ----- > From: "Dave Chinner" <david@fromorbit.com> > To: "Zach Brown" <zab@redhat.com> > Cc: "Abhijith Das" <adas@redhat.com>, linux-kernel@vger.kernel.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, > "cluster-devel" <cluster-devel@redhat.com> > Sent: Friday, July 25, 2014 7:38:59 PM > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls > > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: > > > Hi all, > > > > > > The topic of a readdirplus-like syscall had come up for discussion at > > > last year's > > > LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 > > > implementations > > > to get at a directory's entries as well as stat() info on the individual > > > inodes. > > > I'm presenting these patches and some early test results on a single-node > > > GFS2 > > > filesystem. > > > > > > 1. dirreadahead() - This patchset is very simple compared to the > > > xgetdents() system > > > call below and scales very well for large directories in GFS2. > > > dirreadahead() is > > > designed to be called prior to getdents+stat operations. > > > > Hmm. Have you tried plumbing these read-ahead calls in under the normal > > getdents() syscalls? > > The issue is not directory block readahead (which some filesystems > like XFS already have), but issuing inode readahead during the > getdents() syscall. > > It's the semi-random, interleaved inode IO that is being optimised > here (i.e. queued, ordered, issued, cached), not the directory > blocks themselves. 
As such, why does this need to be done in the > kernel? This can all be done in userspace, and even hidden within > the readdir() or ftw/ntfw() implementations themselves so it's OS, > kernel and filesystem independent...... > I don't see how the sorting of the inode reads in disk block order can be accomplished in userland without knowing the fs-specific topology. From my observations, I've seen that the performance gain is the most when we can order the reads such that seek times are minimized on rotational media. I have not tested my patches against SSDs, but my guess would be that the performance impact would be minimal, if any. Cheers! --Abhi ^ permalink raw reply [flat|nested] 26+ messages in thread
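[Editor's note] Abhi's point is that the kernel can sort reads by actual on-disk block address, which userspace cannot see. The closest userspace approximation is to sort directory entries by inode number (`d_ino`) before issuing the stat() calls — an approximation only, since (as Andreas Dilger notes later in the thread) inode-number order need not match disk order. A minimal sketch; the function name `stat_in_ino_order` is illustrative, not part of any posted patchset:

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent {
    ino_t ino;
    char  name[256];
};

static int by_ino(const void *a, const void *b)
{
    ino_t ia = ((const struct ent *)a)->ino;
    ino_t ib = ((const struct ent *)b)->ino;
    return (ia > ib) - (ia < ib);
}

/* stat() every entry of 'dir' in ascending d_ino order.
 * Returns the number of entries processed, or -1 on error. */
long stat_in_ino_order(const char *dir)
{
    DIR *dp = opendir(dir);
    if (!dp)
        return -1;

    struct ent *ents = NULL;
    size_t n = 0, cap = 0;
    struct dirent *de;

    while ((de = readdir(dp)) != NULL) {
        if (n == cap) {
            cap = cap ? cap * 2 : 64;
            struct ent *tmp = realloc(ents, cap * sizeof(*ents));
            if (!tmp) {
                free(ents);
                closedir(dp);
                return -1;
            }
            ents = tmp;
        }
        ents[n].ino = de->d_ino;
        snprintf(ents[n].name, sizeof(ents[n].name), "%s", de->d_name);
        n++;
    }

    /* Approximate "disk order" by inode number before the stat() pass. */
    if (n)
        qsort(ents, n, sizeof(*ents), by_ino);

    struct stat st;
    for (size_t i = 0; i < n; i++)
        (void)fstatat(dirfd(dp), ents[i].name, &st, 0);

    closedir(dp);
    free(ents);
    return (long)n;
}
```

Whether this ordering helps depends entirely on how the filesystem allocates inodes; it captures the spirit of the kernel-side sort without the fs-specific topology Abhi refers to.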
* RE: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls @ 2014-07-28 14:30 ` Zuckerman, Boris 0 siblings, 0 replies; 26+ messages in thread From: Zuckerman, Boris @ 2014-07-28 14:30 UTC (permalink / raw) To: Abhijith Das, Dave Chinner; +Cc: linux-kernel, linux-fsdevel, cluster-devel 2 years ago I had that type of functionality implemented for Ibrix. It included readdir-ahead and lookup-ahead. We did not assume any new syscalls, simply detected readdir+ like interest on VFS level and pushed a wave of populating directory caches and plugging in dentry cache entries. It improved productivity of NFS readdir+ and SMB QueryDirectories more than 4x. Regards, Boris > -----Original Message----- > From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel- > owner@vger.kernel.org] On Behalf Of Abhijith Das > Sent: Monday, July 28, 2014 8:22 AM > To: Dave Chinner > Cc: linux-kernel@vger.kernel.org; linux-fsdevel; cluster-devel > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls > > > > ----- Original Message ----- > > From: "Dave Chinner" <david@fromorbit.com> > > To: "Zach Brown" <zab@redhat.com> > > Cc: "Abhijith Das" <adas@redhat.com>, linux-kernel@vger.kernel.org, > > "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, "cluster-devel" > > <cluster-devel@redhat.com> > > Sent: Friday, July 25, 2014 7:38:59 PM > > Subject: Re: [RFC] readdirplus implementations: xgetdents vs > > dirreadahead syscalls > > > > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: > > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: > > > > Hi all, > > > > > > > > The topic of a readdirplus-like syscall had come up for discussion > > > > at last year's LSF/MM collab summit. I wrote a couple of syscalls > > > > with their GFS2 implementations to get at a directory's entries as > > > > well as stat() info on the individual inodes. 
> > > > I'm presenting these patches and some early test results on a > > > > single-node > > > > GFS2 > > > > filesystem. > > > > > > > > 1. dirreadahead() - This patchset is very simple compared to the > > > > xgetdents() system > > > > call below and scales very well for large directories in GFS2. > > > > dirreadahead() is > > > > designed to be called prior to getdents+stat operations. > > > > > > Hmm. Have you tried plumbing these read-ahead calls in under the > > > normal > > > getdents() syscalls? > > > > The issue is not directory block readahead (which some filesystems > > like XFS already have), but issuing inode readahead during the > > getdents() syscall. > > > > It's the semi-random, interleaved inode IO that is being optimised > > here (i.e. queued, ordered, issued, cached), not the directory blocks > > themselves. As such, why does this need to be done in the kernel? > > This can all be done in userspace, and even hidden within the > > readdir() or ftw/ntfw() implementations themselves so it's OS, kernel > > and filesystem independent...... > > > > I don't see how the sorting of the inode reads in disk block order can be accomplished in > userland without knowing the fs-specific topology. From my observations, I've seen that > the performance gain is the most when we can order the reads such that seek times are > minimized on rotational media. > > I have not tested my patches against SSDs, but my guess would be that the > performance impact would be minimal, if any. > > Cheers! > --Abhi > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a > message to majordomo@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls 2014-07-28 12:22 ` [Cluster-devel] " Abhijith Das @ 2014-07-31 3:25 ` Dave Chinner -1 siblings, 0 replies; 26+ messages in thread From: Dave Chinner @ 2014-07-31 3:25 UTC (permalink / raw) To: Abhijith Das; +Cc: linux-kernel, linux-fsdevel, cluster-devel On Mon, Jul 28, 2014 at 08:22:22AM -0400, Abhijith Das wrote: > > > ----- Original Message ----- > > From: "Dave Chinner" <david@fromorbit.com> > > To: "Zach Brown" <zab@redhat.com> > > Cc: "Abhijith Das" <adas@redhat.com>, linux-kernel@vger.kernel.org, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>, > > "cluster-devel" <cluster-devel@redhat.com> > > Sent: Friday, July 25, 2014 7:38:59 PM > > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls > > > > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: > > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: > > > > Hi all, > > > > > > > > The topic of a readdirplus-like syscall had come up for discussion at > > > > last year's > > > > LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 > > > > implementations > > > > to get at a directory's entries as well as stat() info on the individual > > > > inodes. > > > > I'm presenting these patches and some early test results on a single-node > > > > GFS2 > > > > filesystem. > > > > > > > > 1. dirreadahead() - This patchset is very simple compared to the > > > > xgetdents() system > > > > call below and scales very well for large directories in GFS2. > > > > dirreadahead() is > > > > designed to be called prior to getdents+stat operations. > > > > > > Hmm. Have you tried plumbing these read-ahead calls in under the normal > > > getdents() syscalls? > > > > The issue is not directory block readahead (which some filesystems > > like XFS already have), but issuing inode readahead during the > > getdents() syscall. 
> > > > It's the semi-random, interleaved inode IO that is being optimised > > here (i.e. queued, ordered, issued, cached), not the directory > > blocks themselves. As such, why does this need to be done in the > > kernel? This can all be done in userspace, and even hidden within > > the readdir() or ftw/ntfw() implementations themselves so it's OS, > > kernel and filesystem independent...... > > > > I don't see how the sorting of the inode reads in disk block order can be > accomplished in userland without knowing the fs-specific topology. I didn't say anything about doing "disk block ordering" in userspace. disk block ordering can be done by the IO scheduler and that's simple enough to do by multithreading and dispatch a few tens of stat() calls at once.... > From my > observations, I've seen that the performance gain is the most when we can > order the reads such that seek times are minimized on rotational media. Yup, which is done by ensuring that we drive deep IO queues rather than issuing a single IO at a time and waiting for completion before issuing the next one. This can easily be done from userspace. > I have not tested my patches against SSDs, but my guess would be that the > performance impact would be minimal, if any. Depends. if the overhead of executing readahead is higher than the time spent waiting for IO completion, then it will reduce performance. i.e. the faster the underlying storage, the less CPU time we want to spend on IO. Readahead generally increases CPU time per object that needs to be retrieved from disk, and so on high IOP devices there's a really good chance we don't want readahead like this at all. i.e. this is yet another reason directory traversal readahead should be driven from userspace so the policy can be easily controlled by the application and/or user.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls 2014-07-26 0:38 ` [Cluster-devel] " Dave Chinner @ 2014-07-28 21:21 ` Andreas Dilger -1 siblings, 0 replies; 26+ messages in thread From: Andreas Dilger @ 2014-07-28 21:21 UTC (permalink / raw) To: Dave Chinner Cc: Zach Brown, Abhijith Das, linux-kernel, linux-fsdevel, cluster-devel On Jul 25, 2014, at 6:38 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote: >> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote: >>> Hi all, >>> >>> The topic of a readdirplus-like syscall had come up for discussion at last year's >>> LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 implementations >>> to get at a directory's entries as well as stat() info on the individual inodes. >>> I'm presenting these patches and some early test results on a single-node GFS2 >>> filesystem. >>> >>> 1. dirreadahead() - This patchset is very simple compared to the xgetdents() system >>> call below and scales very well for large directories in GFS2. dirreadahead() is >>> designed to be called prior to getdents+stat operations. >> >> Hmm. Have you tried plumbing these read-ahead calls in under the normal >> getdents() syscalls? > > The issue is not directory block readahead (which some filesystems > like XFS already have), but issuing inode readahead during the > getdents() syscall. > > It's the semi-random, interleaved inode IO that is being optimised > here (i.e. queued, ordered, issued, cached), not the directory > blocks themselves. Sure. > As such, why does this need to be done in the > kernel? This can all be done in userspace, and even hidden within > the readdir() or ftw/ntfw() implementations themselves so it's OS, > kernel and filesystem independent...... That assumes sorting by inode number maps to sorting by disk order. That isn't always true.
Cheers, Andreas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
  2014-07-28 21:21 ` [Cluster-devel] " Andreas Dilger
@ 2014-07-31  3:16   ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2014-07-31  3:16 UTC (permalink / raw)
To: Andreas Dilger
Cc: Zach Brown, Abhijith Das, linux-kernel, linux-fsdevel, cluster-devel

On Mon, Jul 28, 2014 at 03:21:20PM -0600, Andreas Dilger wrote:
> On Jul 25, 2014, at 6:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> >> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> >>> Hi all,
> >>>
> >>> The topic of a readdirplus-like syscall had come up for discussion at
> >>> last year's LSF/MM collab summit. I wrote a couple of syscalls with
> >>> their GFS2 implementations to get at a directory's entries as well as
> >>> stat() info on the individual inodes. I'm presenting these patches and
> >>> some early test results on a single-node GFS2 filesystem.
> >>>
> >>> 1. dirreadahead() - This patchset is very simple compared to the
> >>> xgetdents() system call below and scales very well for large
> >>> directories in GFS2. dirreadahead() is designed to be called prior to
> >>> getdents+stat operations.
> >>
> >> Hmm. Have you tried plumbing these read-ahead calls in under the normal
> >> getdents() syscalls?
> >
> > The issue is not directory block readahead (which some filesystems
> > like XFS already have), but issuing inode readahead during the
> > getdents() syscall.
> >
> > It's the semi-random, interleaved inode IO that is being optimised
> > here (i.e. queued, ordered, issued, cached), not the directory
> > blocks themselves.
>
> Sure.
>
> > As such, why does this need to be done in the
> > kernel? This can all be done in userspace, and even hidden within
> > the readdir() or ftw/nftw() implementations themselves so it's OS,
> > kernel and filesystem independent......
>
> That assumes sorting by inode number maps to sorting by disk order.
> That isn't always true.

That's true, but it's a fair bet that roughly ascending inode number
ordering is going to be better than random ordering for most
filesystems.

Besides, ordering isn't the real problem - the real problem is the
latency caused by having to do the inode IO synchronously one stat() at
a time. Just multithread the damn thing in userspace so the stat()s can
be done asynchronously and hence be more optimally ordered by the IO
scheduler and completed before the application blocks on the IO.

It doesn't even need completion synchronisation - the stat() issued by
the application will block until the async stat() completes the process
of bringing the inode into the kernel cache...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2014-07-31  3:25 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1106785262.13440918.1406308542921.JavaMail.zimbra@redhat.com>
2014-07-25 17:37 ` [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls Abhijith Das
2014-07-25 17:52   ` Zach Brown
2014-07-25 18:08     ` Steven Whitehouse
2014-07-25 18:28       ` Zach Brown
2014-07-25 20:02         ` Steven Whitehouse
2014-07-25 20:30           ` Trond Myklebust
2014-07-26  0:38   ` Dave Chinner
2014-07-28 12:22     ` Abhijith Das
2014-07-28 14:30       ` Zuckerman, Boris
2014-07-31  3:25       ` Dave Chinner
2014-07-28 21:21   ` Andreas Dilger
2014-07-31  3:16     ` Dave Chinner