linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] NFS: report more appropriate block size for directories.
@ 2015-05-08  3:10 NeilBrown
  2015-05-08 15:14 ` Scott Mayhew
  0 siblings, 1 reply; 3+ messages in thread
From: NeilBrown @ 2015-05-08  3:10 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1756 bytes --]


In glibc 2.21 (and several previous), a call to opendir() will
result in a 32K (BUFSIZ*4) buffer being allocated and passed to
getdents.

However a call to fdopendir() results in an 'fstat' request to
determine block size and a matching buffer allocated for subsequent
use with getdents.  This will typically be 1M.

The first getdents call on an NFS directory will always use
READDIR_PLUS (or NFSv4 equivalent) if available.  Subsequent getdents
calls only use this more expensive version if some 'stat' requests are
made between the getdents calls.

For this reason it is good to keep at least that first getdents call
relatively short.  When fdopendir() and readdir() is used on a large
directory, it takes approximately 32 times as long to complete as
using "opendir".  Current versions of 'find' use fdopendir() and
demonstrate this slowness.

'stat' on a directory currently returns the 'wsize'.  This number has
no meaning on directories.
Actual READDIR requests are limited to ->dtsize, which itself is
capped at 4 pages, coincidently the same as BUFSIZ*4.
So this is a meaningful number to use as the blocksize on directories,
and has the effect of making 'find' on large directories go a lot
faster.

Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 96f2d55781fb..f8aebf59383f 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -678,6 +678,8 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 	if (!err) {
 		generic_fillattr(inode, stat);
 		stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+		if (S_ISDIR(inode->i_mode))
+			stat->blksize = NFS_SERVER(inode)->dtsize;
 	}
 out:
 	trace_nfs_getattr_exit(inode, err);

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] NFS: report more appropriate block size for directories.
  2015-05-08  3:10 [PATCH] NFS: report more appropriate block size for directories NeilBrown
@ 2015-05-08 15:14 ` Scott Mayhew
  2015-05-13 18:55   ` Trond Myklebust
  0 siblings, 1 reply; 3+ messages in thread
From: Scott Mayhew @ 2015-05-08 15:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: Trond Myklebust, Anna Schumaker, linux-nfs, linux-kernel

On Fri, 08 May 2015, NeilBrown wrote:

> 
> In glibc 2.21 (and several previous), a call to opendir() will
> result in a 32K (BUFSIZ*4) buffer being allocated and passed to
> getdents.
> 
> However a call to fdopendir() results in an 'fstat' request to
> determine block size and a matching buffer allocated for subsequent
> use with getdents.  This will typically be 1M.
> 
> The first getdents call on an NFS directory will always use
> READDIR_PLUS (or NFSv4 equivalent) if available.  Subsequent getdents
> calls only use this more expensive version if some 'stat' requests are
> made between the getdents calls.
> 
> For this reason it is good to keep at least that first getdents call
> relatively short.  When fdopendir() and readdir() is used on a large
> directory, it takes approximately 32 times as long to complete as
> using "opendir".  Current versions of 'find' use fdopendir() and
> demonstrate this slowness.
> 
> 'stat' on a directory currently returns the 'wsize'.  This number has
> no meaning on directories.
> Actual READDIR requests are limited to ->dtsize, which itself is
> capped at 4 pages, coincidently the same as BUFSIZ*4.
> So this is a meaningful number to use as the blocksize on directories,
> and has the effect of making 'find' on large directories go a lot
> faster.

Would it make sense to do something similar for regular files too?
fopen() does a similar buffer allocation unless the application
overrides the buffer size via setbuffer()/setvbuf().  That can then
result in fseek() reading a lot of unnecessary data over the wire.

Prior to commit ba52de1 (inode-diet: Eliminate i_blksize from the inode
structure), a stat() over nfs would return the page size in st_blksize,
and for some workloads it does make a difference.  For instance, I have
a customer running gdb in an diskless environment.  On a stock kernel
where a stat() over nfs returns the wsize in st_blksize, their job takes
~19 minutes... on a test kernel where a stat() over nfs returns the page
size instead, that same job takes ~13 minutes.  I hadn't sent a patch
yet because I'm still trying to account for a few extra minutes of
run time elsewhere...

-Scott
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 96f2d55781fb..f8aebf59383f 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -678,6 +678,8 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
>  	if (!err) {
>  		generic_fillattr(inode, stat);
>  		stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
> +		if (S_ISDIR(inode->i_mode))
> +			stat->blksize = NFS_SERVER(inode)->dtsize;
>  	}
>  out:
>  	trace_nfs_getattr_exit(inode, err);

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] NFS: report more appropriate block size for directories.
  2015-05-08 15:14 ` Scott Mayhew
@ 2015-05-13 18:55   ` Trond Myklebust
  0 siblings, 0 replies; 3+ messages in thread
From: Trond Myklebust @ 2015-05-13 18:55 UTC (permalink / raw)
  To: Scott Mayhew; +Cc: NeilBrown, Anna Schumaker, linux-nfs, linux-kernel

On Fri, 2015-05-08 at 11:14 -0400, Scott Mayhew wrote:
> On Fri, 08 May 2015, NeilBrown wrote:
> 
> > 
> > In glibc 2.21 (and several previous), a call to opendir() will
> > result in a 32K (BUFSIZ*4) buffer being allocated and passed to
> > getdents.
> > 
> > However a call to fdopendir() results in an 'fstat' request to
> > determine block size and a matching buffer allocated for subsequent
> > use with getdents.  This will typically be 1M.
> > 
> > The first getdents call on an NFS directory will always use
> > READDIR_PLUS (or NFSv4 equivalent) if available.  Subsequent getdents
> > calls only use this more expensive version if some 'stat' requests are
> > made between the getdents calls.
> > 
> > For this reason it is good to keep at least that first getdents call
> > relatively short.  When fdopendir() and readdir() is used on a large
> > directory, it takes approximately 32 times as long to complete as
> > using "opendir".  Current versions of 'find' use fdopendir() and
> > demonstrate this slowness.
> > 
> > 'stat' on a directory currently returns the 'wsize'.  This number has
> > no meaning on directories.
> > Actual READDIR requests are limited to ->dtsize, which itself is
> > capped at 4 pages, coincidently the same as BUFSIZ*4.
> > So this is a meaningful number to use as the blocksize on directories,
> > and has the effect of making 'find' on large directories go a lot
> > faster.
> 
> Would it make sense to do something similar for regular files too?
> fopen() does a similar buffer allocation unless the application
> overrides the buffer size via setbuffer()/setvbuf().  That can then
> result in fseek() reading a lot of unnecessary data over the wire.
> 
> Prior to commit ba52de1 (inode-diet: Eliminate i_blksize from the inode
> structure), a stat() over nfs would return the page size in st_blksize,
> and for some workloads it does make a difference.  For instance, I have
> a customer running gdb in an diskless environment.  On a stock kernel
> where a stat() over nfs returns the wsize in st_blksize, their job takes
> ~19 minutes... on a test kernel where a stat() over nfs returns the page
> size instead, that same job takes ~13 minutes.  I hadn't sent a patch
> yet because I'm still trying to account for a few extra minutes of
> run time elsewhere...
> 

The client shouldn't be reporting anything different after commit
ba52de1. We should have
  inode->i_blkbits = sb->s_blocksize_bits;

with
  sb->s_blocksize_bits being set as log2(sb->s_blocksize)

Previously, inode->i_blksize was the same as sb_s_blocksize.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com



^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-05-13 18:55 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-08  3:10 [PATCH] NFS: report more appropriate block size for directories NeilBrown
2015-05-08 15:14 ` Scott Mayhew
2015-05-13 18:55   ` Trond Myklebust

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).