* /etc/mtab read ~900 times by rpc.mountd
@ 2017-07-06 22:43 Phil Kauffman
  2017-07-07  2:30 ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-06 22:43 UTC (permalink / raw)
  To: linux-nfs

In response to this thread that died out a while back: http://marc.info/?t=142727765600001&r=1&w=2

I have the same issue and am looking for a solution.

My current setup involves nested ZFS datasets. It looks like this:
  tank/homes/user{1..6000}
  tank/stage/somedirname{1..100}

tank/homes and tank/stage are datasets and the immediate children are also datasets.

There are huge benefits to using datasets that I don't want to get into here, so let's leave it at this: I would prefer to keep this setup.


Now to the issue. Because of the nested mounts I need to use the 'crossmnt' export option, which seems to invoke a loop over '/etc/mtab'. During various straces I have seen /etc/mtab get read over 900 times (on ssh login) while a user's home directory is being mounted.

Since /etc/mtab has around 6000 lines (and growing) it takes a long time to complete. Around 8 seconds actually from the time you start an ssh connection to when the password prompt appears (tested with 'time ssh -o BatchMode=yes root@client1'). It seems that most of this time is spent by rpc.mountd reading /etc/mtab about 900 times while the client waits for the server to finish.

# ls -l /etc/mtab
lrwxrwxrwx 1 root root 19 Jun 22 14:31 /etc/mtab -> ../proc/self/mounts

# cat /proc/self/mounts > mtab

# ls -lh mtab
-rw-rw---- 1 root root 442K Jul  6 16:51 mtab

# wc -l mtab
6051 mtab
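
For reference, the ~900 figure above comes from attaching strace to rpc.mountd during a single ssh login and counting opens of /etc/mtab, roughly like this (a sketch only; interrupt it once the login completes):

# m=$(pgrep rpc.mountd)
# strace -f -e trace=open -p "$m" 2>&1 | grep -c '/etc/mtab'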


Right now I am dealing with 'nfs-kernel-server' version '1:1.2.8-9ubuntu12.1' (on Xenial), which I already had to patch with http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commitdiff;h=35640883cf34a32f893e9fecefbb193782e9bc75 suggested here http://marc.info/?t=138684383100001&r=1&w=2 
(This works brilliantly BTW, it shaved a good 20 seconds off the mount time because all those devices were being skipped.)


I was hoping you folks might take a look, and maybe provide a patch. /crosses fingers/

steved suggested that this took place around: utils/mountd/cache.c:nfsd_fh():path = next_mnt(&mnt, exp->m_export.e_path);


Cheers,

Phil


P.S.: I can provide straces on demand if required, I didn't want to clutter up the message.

-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu



* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-06 22:43 /etc/mtab read ~900 times by rpc.mountd Phil Kauffman
@ 2017-07-07  2:30 ` NeilBrown
  2017-07-07  5:20   ` Phil Kauffman
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-07  2:30 UTC (permalink / raw)
  To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3332 bytes --]

On Thu, Jul 06 2017, Phil Kauffman wrote:

> In response to this thread that died out a while back: http://marc.info/?t=142727765600001&r=1&w=2
>
> I have the same issue and am looking for a solution.
>
> My current setup involves nested ZFS datasets. It looks like this:
>   tank/homes/user{1..6000}
>   tank/stage/somedirname{1..100}
>
> tank/homes and tank/stage are datasets and the immediate children are also datasets.
>
> There are huge benefits to using datasets that I don't want to get into here, so let's leave it at this: I would prefer to keep this setup.
>
>
> Now to the issue. Because of the nested mounts I need to use the 'crossmnt' export option, which seems to invoke a loop over '/etc/mtab'. During various straces I have seen /etc/mtab get read over 900 times (on ssh login) while a user's home directory is being mounted.
>
> Since /etc/mtab has around 6000 lines (and growing) it takes a long time to complete. Around 8 seconds actually from the time you start an ssh connection to when the password prompt appears (tested with 'time ssh -o BatchMode=yes root@client1'). It seems that most of this time is spent by rpc.mountd reading /etc/mtab about 900 times while the client waits for the server to finish.
>
> # ls -l /etc/mtab
> lrwxrwxrwx 1 root root 19 Jun 22 14:31 /etc/mtab -> ../proc/self/mounts
>
> # cat /proc/self/mounts > mtab
>
> # ls -lh mtab
> -rw-rw---- 1 root root 442K Jul  6 16:51 mtab
>
> # wc -l mtab
> 6051 mtab
>
>
> Right now I am dealing with 'nfs-kernel-server' version '1:1.2.8-9ubuntu12.1' (on Xenial), which I already had to patch with http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commitdiff;h=35640883cf34a32f893e9fecefbb193782e9bc75 suggested here http://marc.info/?t=138684383100001&r=1&w=2 
> (This works brilliantly BTW, it shaved a good 20 seconds off the mount time because all those devices were being skipped.)
>
>
> I was hoping you folks might take a look, and maybe provide a patch. /crosses fingers/
>
> steved suggested that this took place around: utils/mountd/cache.c:nfsd_fh():path = next_mnt(&mnt, exp->m_export.e_path);
>
>
> Cheers,
>
> Phil
>

I can imagine /etc/mtab being read once for every line in /etc/exports,
but unless your /etc/exports is very big, I can't easily see why it
would be read 900 times.
Maybe lots of different mount points are being accessed by something and
each one triggers a few reads...

Can you show your /etc/exports file?

>
> P.S.: I can provide straces on demand if required, I didn't want to clutter up the message.

I don't recommend this approach.  If you have data, just provide it.  If
it is really really big, provide a link.
You have my attention now.  You might not get it again when you send the
strace.  You might have wasted your opportunity.
Yes - please provide the strace showing lots of reads through
/etc/mtab.  Don't provide it privately to me.  Either post it to the
list, or post a link to it.

NeilBrown


>
> -- 
> Phil Kauffman
> Systems Admin
> Dept. of Computer Science
> University of Chicago
> kauffman@cs.uchicago.edu
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-07  2:30 ` NeilBrown
@ 2017-07-07  5:20   ` Phil Kauffman
  2017-07-10  7:17     ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-07  5:20 UTC (permalink / raw)
  To: linux-nfs

> I can imagine /etc/mtab being read once for every line in /etc/exports,
> but unless your /etc/exports is very big, I can't easily see why it
> would be read 900 times.
> Maybe lots of different mount points are being accessed by something and
> each one triggers a few reads...
>
> Can you show your /etc/exports file?

These get generated via puppet and dropped into /etc/exports.d/

Any given host will have 3 files associated with it:

root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_fsidroot.charliestorage1_fsidroot.exports
/tank client1.cs.uchicago.edu(fsid=0,rw,async,no_subtree_check,insecure,crossmnt)
root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_homes.charliestorage1_homes.exports
/tank/homes client1.cs.uchicago.edu(rw,async,no_subtree_check,insecure,crossmnt)
root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_stage.charliestorage1_stage.exports
/tank/stage client1.cs.uchicago.edu(rw,async,no_subtree_check,insecure,crossmnt)

Here we can see that there are 960 files in /etc/exports.d
root@storage1:~# find /etc/exports.d/ -type f | wc
960 960 88701

You probably don't want or need this, but here is a link to all of my
/etc/exports.d/ files combined into one file, in which I redacted the hostnames:
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/all_etc_exports.d_in_one_file.txt

> I don't recommend this approach.  If you have data, just provide it.  If
> it is really really big, provide a link.
> You have my attention now.  You might not get it again when you send the
> strace.  You might have wasted your opportunity.
> Yes - please provide the strace showing lots of reads through
> /etc/mtab.  Don't provide it privately to me.  Either post it to the
> list, or post a link to it.

Thanks for the advice.

Here is a link to a best effort redacted strace: 
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/redacted_strace.txt

I would prefer to keep user name leakage to a minimum.


Cheers,

Phil


-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu
773-702-3913


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-07  5:20   ` Phil Kauffman
@ 2017-07-10  7:17     ` NeilBrown
  2017-07-10 11:47       ` Karel Zak
  2017-07-10 20:59       ` Phil Kauffman
  0 siblings, 2 replies; 16+ messages in thread
From: NeilBrown @ 2017-07-10  7:17 UTC (permalink / raw)
  To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3074 bytes --]

On Fri, Jul 07 2017, Phil Kauffman wrote:

>> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> but unless your /etc/exports is very big, I can't easily see why it
>> would be read 900 times.
>> Maybe lots of different mount points are being accessed by something and
>> each one triggers a few reads...
>>
>> Can you show your /etc/exports file?
>
> These get generated via puppet and dropped into /etc/exports.d/
...
>
> Here we can see that there are 960 files in /etc/exports.d
> root@storage1:~# find /etc/exports.d/ -type f | wc
> 960 960 88701

Ahhh.. 960 exports.  That could do it.

$ grep -c crossmnt all_etc_exports.d_in_one_file.txt
957

I get 957 - much the same number.

$ grep open redacted_strace.txt | uniq -c
      2 open("/var/lib/nfs/etab", O_RDONLY)     = 6
    942 open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 6
      1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6

/etc/mtab was read 942 times for a single access.  I would have expected 957,
but maybe the system is dynamic and something changed between the two
samples.

This makes it fairly clear what is happening.  Now we just need to fix
it.
One option would be to cache some of the details extracted from
/etc/mtab, but that could get messy.

Another is to move the code around.  In your case there are really just
3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
the combined exports file is really different in different files).
So any one client only needs to consider 3 exports, not 300.

There is room to optimize this code further than the below, but let's
start simple.  Could you test with this patch applied and see what
difference it makes?

Thanks,
NeilBrown


diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ca6c84f4d93d..e712cc166157 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -727,6 +727,17 @@ static void nfsd_fh(int f)
 		for (exp = exportlist[i].p_head; exp; exp = next_exp) {
 			char *path;
 
+			if (!is_ipaddr_client(dom)
+			    && !namelist_client_matches(exp, dom)) {
+				next_exp = exp->m_next;
+				continue;
+			}
+			if (is_ipaddr_client(dom)
+			    && !ipaddr_client_matches(exp, ai)) {
+				next_exp = exp->m_next;
+				continue;
+			}
+
 			if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
 				static nfs_export *prev = NULL;
 				static void *mnt = NULL;
@@ -751,9 +762,6 @@ static void nfsd_fh(int f)
 				next_exp = exp->m_next;
 			}
 
-			if (!is_ipaddr_client(dom)
-					&& !namelist_client_matches(exp, dom))
-				continue;
 			if (exp->m_export.e_mountpoint &&
 			    !is_mountpoint(exp->m_export.e_mountpoint[0]?
 					   exp->m_export.e_mountpoint:
@@ -762,9 +770,6 @@ static void nfsd_fh(int f)
 
 			if (!match_fsid(&parsed, exp, path))
 				continue;
-			if (is_ipaddr_client(dom)
-					&& !ipaddr_client_matches(exp, ai))
-				continue;
 			if (!found || subexport(&exp->m_export, found)) {
 				found = &exp->m_export;
 				free(found_path);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-10  7:17     ` NeilBrown
@ 2017-07-10 11:47       ` Karel Zak
  2017-07-10 23:19         ` NeilBrown
  2017-07-10 20:59       ` Phil Kauffman
  1 sibling, 1 reply; 16+ messages in thread
From: Karel Zak @ 2017-07-10 11:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Phil Kauffman, linux-nfs

On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
> On Fri, Jul 07 2017, Phil Kauffman wrote:
> 
> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
> >> but unless your /etc/exports is very big, I can't easily see why it
> >> would be read 900 times.
> >> Maybe lots of different mount points are being accessed by something and
> >> each one triggers a few reads...
> >>
> >> Can you show your /etc/exports file?
> >
> > These get generated via puppet and dropped into /etc/exports.d/
> ...
> >
> > Here we can see that there are 960 files in /etc/exports.d
> > root@storage1:~# find /etc/exports.d/ -type f | wc
> > 960 960 88701
> 
> Ahhh.. 960 exports.  That could do it.
> 
> $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
> 957
> 
> I get 957 - much the same number.
> 
> $ grep open redacted_strace.txt | uniq -c
>       2 open("/var/lib/nfs/etab", O_RDONLY)     = 6
>     942 open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 6
>       1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
> 
> /etc/mtab was read 942 times for a single access.  I would have expected 957,
> but maybe the system is dynamic and something changed between the two
> samples.
> 
> This makes it fairly clear what is happening.  Now we just need to fix
> it.
> One option would be to cache some of the details extracted from
> /etc/mtab, but that could get messy.

Would it be possible to read /proc/self/mounts only once and use poll()
to detect changes?

The issue is not only the open()+read() in the process, but also that
the kernel has to compose the file from the mount tree. It's expensive.
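
A minimal illustration of that idea (not mountd code): the kernel flags a
long-lived fd on /proc/self/mounts with POLLERR|POLLPRI whenever the mount
table changes, so one fd plus poll() is enough to know when a cached copy
is stale.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[65536];
	ssize_t n;
	int fd = open("/proc/self/mounts", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	while (poll(&pfd, 1, -1) > 0) {
		if (pfd.revents & (POLLERR | POLLPRI)) {
			/* mount table changed: rewind and re-read it;
			 * re-parsing would go here instead of the printf */
			lseek(fd, 0, SEEK_SET);
			while ((n = read(fd, buf, sizeof(buf))) > 0)
				;
			printf("mount table changed\n");
		}
	}
	close(fd);
	return 0;
}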

    Karel


-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-10  7:17     ` NeilBrown
  2017-07-10 11:47       ` Karel Zak
@ 2017-07-10 20:59       ` Phil Kauffman
  2017-07-10 23:51         ` NeilBrown
  1 sibling, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-10 20:59 UTC (permalink / raw)
  To: linux-nfs

On 07/10/2017 02:17 AM, NeilBrown wrote:
> Another is to move the code around.  In your case there are really just
> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
> the combined exports file is really different in different files).
> So any one client only needs to consider 3 exports, not 300.
This assumption is correct. We have ~300 clients and growing (slowly).

> Could you test with this patch applied and see what
> difference it makes?
After being confused by conflicting test results and determining that the nfs-server service takes a bit longer to start than 'systemctl' would have you believe, I believe Neil's patch works.


The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
  It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2

Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt

Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.


I reboot everything: client and server

Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).
  http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_result_no-strace.txt

I then used a larger sample size of users (~/f) and got a similar result.


Now let's strace rpc.mountd while sshing with one user:

$ time ssh -o BatchMode=yes $CLIENTHOSTNAME -l user_not_in_list_f
Permission denied (publickey,password).

real	0m0.352s
user	0m0.020s
sys	0m0.012s

The resulting strace seems entirely reasonable: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_one_user_strace.txt


$ for user in $(cat ~/users | grep -E "^g*"); do time ssh -o BatchMode=yes $CLIENTHOSTNAME -l $user; done
  The first few dozen users whose usernames happen to start with 'g': 
    http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_multiple_user_strace.txt



-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu



* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-10 11:47       ` Karel Zak
@ 2017-07-10 23:19         ` NeilBrown
  0 siblings, 0 replies; 16+ messages in thread
From: NeilBrown @ 2017-07-10 23:19 UTC (permalink / raw)
  To: Karel Zak; +Cc: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2217 bytes --]

On Mon, Jul 10 2017, Karel Zak wrote:

> On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
>> On Fri, Jul 07 2017, Phil Kauffman wrote:
>> 
>> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> >> but unless your /etc/exports is very big, I can't easily see why it
>> >> would be read 900 times.
>> >> Maybe lots of different mount points are being accessed by something and
>> >> each one triggers a few reads...
>> >>
>> >> Can you show your /etc/exports file?
>> >
>> > These get generated via puppet and dropped into /etc/exports.d/
>> ...
>> >
>> > Here we can see that there are 960 files in /etc/exports.d
>> > root@storage1:~# find /etc/exports.d/ -type f | wc
>> > 960 960 88701
>> 
>> Ahhh.. 960 exports.  That could do it.
>> 
>> $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
>> 957
>> 
>> I get 957 - much the same number.
>> 
>> $ grep open redacted_strace.txt | uniq -c
>>       2 open("/var/lib/nfs/etab", O_RDONLY)     = 6
>>     942 open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 6
>>       1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>> 
>> /etc/mtab was read 942 times for a single access.  I would have expected 957,
>> but maybe the system is dynamic and something changed between the two
>> samples.
>> 
>> This makes it fairly clear what is happening.  Now we just need to fix
>> it.
>> One option would be to cache some of the details extracted from
>> /etc/mtab, but that could get messy.
>
> Would it be possible to read /proc/self/mounts only once and use poll()
> to detect changes?

Sure it is possible.  All we need is code....

>
> The issue is not only the open()+read() in the process, but also that
> the kernel has to compose the file from the mount tree. It's expensive.

That certainly contributes to the problem.  If we cache the content, we
can process it fast.  If we avoid wanting to look at the content, it
doesn't matter how fast we can process it (because we don't).

So yes, that might be part of the solution.  But not the whole solution.

Thanks,
NeilBrown


>
>     Karel
>
>
> -- 
>  Karel Zak  <kzak@redhat.com>
>  http://karelzak.blogspot.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-10 20:59       ` Phil Kauffman
@ 2017-07-10 23:51         ` NeilBrown
  2017-07-11 15:39           ` Phil Kauffman
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-10 23:51 UTC (permalink / raw)
  To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2996 bytes --]

On Mon, Jul 10 2017, Phil Kauffman wrote:

> On 07/10/2017 02:17 AM, NeilBrown wrote:
>> Another is to move the code around.  In your case there are really just
>> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
>> the combined exports file is really different in different files).
>> So any one client only needs to consider 3 exports, not 300.
> This assumption is correct. We have ~300 clients and growing (slowly).
>
>> Could you test with this patch applied and see what
>> difference it makes?
> After being confused by conflicting test results and determining that the nfs-server service takes a bit longer to start than 'systemctl' would have you believe, I believe Neil's patch works.
>
>
> The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
>   It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2
>
> Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt
>
> Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.
>
>
> I reboot everything: client and server
>
> Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).

This does look encouraging ... but I'm not sure we are comparing apples
with apples.

In the original strace we see:

 read(4, ....)
 check on etab
 statfs(some path)
 write(4,....

which is exactly like the new strace.
But then the old strace also has

 read(5, ...)
 check on etab
 read /etc/mtab hundreds of times.
 write(5, ...)

which is completely missing in the new strace.  So the new code, which
doesn't read /etc/mtab as much, isn't being run at all.

file desc 4 is /proc/1926/net/rpc/nfsd.export/channel
file desc 5 is /proc/1926/net/rpc/nfsd.fh/channel

If the nfs server kernel has nothing cached, then when a request arrives
you should first see a transaction of fd 3
(/proc/1926/net/rpc/auth.unix.ip/channel) to map the IP address to a
"domain".
Then a transaction on fd 5 (nfsd.fh) to map the domain + a filehandle
fragment to an export point.
Then a transaction on fd 4 (nfsd.export) to map the domain + export
point to some export options.

If the request is a "LOOKUP", you might see another "fd 5" lookup after
the "fd 4" to find out if a sub-mountpoint is exported.  That is what
the old strace showed.

If you don't see all of those, then something is cached.

Could you repeat your experiments after first running "exportfs -f" on
the nfs server?  That should cause worst-case load on mountd.
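
(For reference, "exportfs -f" just flushes the kernel's export caches by
writing the current time to their flush files, roughly along these lines,
shown only as an illustration:

  now=$(date +%s)
  for c in auth.unix.ip auth.unix.gid nfsd.fh nfsd.export; do
      echo "$now" > /proc/net/rpc/$c/flush
  done

The current contents of each cache can be inspected via
/proc/net/rpc/<cache>/content.)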

Until we see the "read(5).... write(5)" sequence completing quickly, I
cannot be sure that we have fixed anything.  Actually.. for future
traces, please add "-tt" so I can see where the delays are and how big
they are.

(and thanks for providing all the tracing details - I love getting
unambiguous data!!)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-10 23:51         ` NeilBrown
@ 2017-07-11 15:39           ` Phil Kauffman
  2017-07-12  0:46             ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-11 15:39 UTC (permalink / raw)
  To: linux-nfs

On 07/10/2017 06:51 PM, NeilBrown wrote:
> This does look encouraging ... but I'm not sure we are comparing apples
> with apples.
I had a feeling this was the case.


> Could you repeat your experiments after first running "exportfs -f" on
> the nfs server?  That should cause worst-case load on mountd.
The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt

nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt


# cut -d' ' -f2 2017-07-11_100929_strace.txt| cut -d'(' -f1 | sort | uniq -c | sort -n
      1 Process
     73 select
    143 write
    216 close
    216 fstat
    216 open
  48240 read
 435599 stat
 871162 statfs
 871163 lstat


> (and thanks for providing all the tracing details - I love getting
> unambiguous data!!)
No problem, I'm available to run any necessary tests. I also appreciate your help on this.



-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu
773-702-3913


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-11 15:39           ` Phil Kauffman
@ 2017-07-12  0:46             ` NeilBrown
  2017-07-12 17:55               ` Phil Kauffman
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-12  0:46 UTC (permalink / raw)
  To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2535 bytes --]

On Tue, Jul 11 2017, Phil Kauffman wrote:

> On 07/10/2017 06:51 PM, NeilBrown wrote:
>> This does look encouraging ... but I'm not sure we are comparing apples
>> with apples.
> I had a feeling this was the case.
>
>
>> Could you repeat your experiments after first running "exportfs -f" on
>> the nfs server?  That should cause worst-case load on mountd.
> The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
> SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt
>
> nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt
>
>
> # cut -d' ' -f2 2017-07-11_100929_strace.txt| cut -d'(' -f1 | sort | uniq -c | sort -n
>       1 Process
>      73 select
>     143 write
>     216 close
>     216 fstat
>     216 open
>   48240 read
>  435599 stat
>  871162 statfs
>  871163 lstat
>
>
>> (and thanks for providing all the tracing details - I love getting
>> unambiguous data!!)
> No problem, I'm available to run any necessary tests. I also appreciate your help on this.
>

So the new data shows about 7 seconds for a login, which is probably
a little longer than you would like, but might be acceptable?
Each login triggers 3 reads of /etc/mtab, and each read  (from 'open' to
'close') takes either 3-4 seconds or about 0.2 seconds.
Actually reading /etc/mtab seems to take about 0.1-0.2 seconds at most.
The rest of the time is calling stat, with a pattern like

10:09:39.547280 stat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547318 lstat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547350 lstat("/tank/homes/someuser/..", {st_mode=S_IFDIR|0755, st_size=5967, ...}) = 0
10:09:39.547385 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba
10:09:39.547424 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba

Of the three passes through /etc/mtab for each login, one generates
95 of these, one generates 6049, and one generates 5952.

The 'stat' is from match_fsid()
The 2 'lstat' are from is_mountpoint() (called by match_fsid())
The first statfs() is from uuid_by_path()
Maybe the second is somewhere in the blkid code.

Probably the next step is to cache the mapping from
  path, fsidtype -> fsid.

That is more than a 2-line patch.  I might have a go later this week.
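
To illustrate the shape such a cache might take (keyed on the export path so
the repeated stat/statfs calls are paid only once per path; a sketch only,
not the eventual patch):

#define _GNU_SOURCE
#include <search.h>
#include <stdlib.h>
#include <string.h>
#include <sys/vfs.h>

/* Illustration: remember one statfs() result per path so that repeatedly
 * matching fsids for the same path does not hit the filesystem each time. */
static struct hsearch_data cache;
static int cache_init;

static struct statfs *cached_statfs(const char *path)
{
	ENTRY e, *found;
	struct statfs *st;

	if (!cache_init) {
		hcreate_r(10000, &cache);
		cache_init = 1;
	}
	e.key = (char *)path;
	if (hsearch_r(e, FIND, &found, &cache))
		return found->data;

	st = malloc(sizeof(*st));
	if (statfs(path, st) != 0) {
		free(st);
		return NULL;
	}
	e.key = strdup(path);
	e.data = st;
	hsearch_r(e, ENTER, &found, &cache);
	return st;
}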

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-12  0:46             ` NeilBrown
@ 2017-07-12 17:55               ` Phil Kauffman
  2017-07-14  6:51                 ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-12 17:55 UTC (permalink / raw)
  To: linux-nfs

On 07/11/2017 07:46 PM, NeilBrown wrote:
> So the new data shows about 7 seconds for a login, which is probably
> a little longer than you would like, but might be acceptable?
Unfortunately, the delay is not acceptable.

The ideal would be to achieve performance parity with a setup where one is not
forced to use the 'crossmnt' option.

My current setup (home directory server) does not require 'crossmnt' and 
does not incur a delay. It is a standard nfs server using mdadm, lvm, 
and xfs.

While my current setup is probably "normal" and using nested datasets 
with the 'crossmnt' option might be "weird" now, nested mounts will 
probably only become more common as BTRFS, ZFS, and other filesystems 
with similar features gain traction on Linux.


> That is more than a 2-line patch.  I might have a go later this week.
That would be super! Thanks for taking a look.



-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu



* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-12 17:55               ` Phil Kauffman
@ 2017-07-14  6:51                 ` NeilBrown
  2017-07-25 21:12                   ` Phil Kauffman
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-14  6:51 UTC (permalink / raw)
  To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 7804 bytes --]

On Wed, Jul 12 2017, Phil Kauffman wrote:

> On 07/11/2017 07:46 PM, NeilBrown wrote:
>> So the new data shows about 7 seconds for a login, which is probably
>> a little longer than you would like, but might be acceptable?
> Unfortunately, the delay is not acceptable.
>
> The ideal would be to achieve performance parity with a setup where one is not 
> forced to use the 'crossmnt' option.
>
> My current setup (home directory server) does not require 'crossmnt' and 
> does not incur a delay. It is a standard nfs server using mdadm, lvm, 
> and xfs.
>
> While my current setup is probably "normal" and using nested datasets 
> with the 'crossmnt' option might be "weird" now, nested mounts will 
> probably only become more common as BTRFS, ZFS, and other filesystems 
> with similar features gain traction on Linux.
>
>
>> That is more than a 2-line patch.  I might have a go later this week.
> That would be super! Thanks for taking a look.

Please try this (against a clean nfs-utils.  i.e. remove the previous
patch).
It is a hack and would need to be totally re-written, but hopefully the
measurements you make and the strace that you report could be useful.
Also, for the strace, please use "-ttt" rather than "-tt" like I asked
before.  It is easier to find the difference between two times with
-ttt.
And add -T as well.

Thanks,
NeilBrown


From f15571d747bb6272ec539e46b6b3aee3ccc53b30 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Fri, 14 Jul 2017 16:43:37 +1000
Subject: [PATCH] testing

Signed-off-by: NeilBrown <neilb@suse.com>
---
 utils/mountd/cache.c | 105 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 81 insertions(+), 24 deletions(-)

diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ca6c84f4d93d..db89a4feb7ae 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -18,6 +18,7 @@
 #include <time.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>
+#include <sys/sysmacros.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <errno.h>
@@ -299,7 +300,7 @@ static const long int nonblkid_filesystems[] = {
     0        /* last */
 };
 
-static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
+static int uuid_by_path(char *path, struct statfs64 *stp, int type, size_t uuidlen, char *uuid)
 {
 	/* get a uuid for the filesystem found at 'path'.
 	 * There are several possible ways of generating the
@@ -334,12 +335,17 @@ static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
 	const char *val;
 	int rc;
 
-	rc = statfs64(path, &st);
+	if (stp)
+		rc = 0;
+	else {
+		stp = &st;
+		rc= statfs64(path, stp);
+	}
 
 	if (type == 0 && rc == 0) {
 		const long int *bad;
 		for (bad = nonblkid_filesystems; *bad; bad++) {
-			if (*bad == st.f_type)
+			if (*bad == stp->f_type)
 				break;
 		}
 		if (*bad == 0)
@@ -347,9 +353,9 @@ static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
 	}
 
 	if (rc == 0 &&
-	    (st.f_fsid.__val[0] || st.f_fsid.__val[1]))
+	    (stp->f_fsid.__val[0] || stp->f_fsid.__val[1]))
 		snprintf(fsid_val, 17, "%08x%08x",
-			 st.f_fsid.__val[0], st.f_fsid.__val[1]);
+			 stp->f_fsid.__val[0], stp->f_fsid.__val[1]);
 	else
 		fsid_val[0] = 0;
 
@@ -603,25 +609,64 @@ static int parse_fsid(int fsidtype, int fsidlen, char *fsid,
 	return 0;
 }
 
+#define entry FOO
+#include <search.h>
+struct pinfo {
+	unsigned int stb:1, mntp:1, stfs:1;
+	unsigned int mountpoint:1;
+	time_t valid;
+	struct stat statbuf;
+	struct statfs64 statfsbuf;
+};
+static struct hsearch_data hdata;
+static int hdata_init = 0;
+
+
 static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 {
-	struct stat stb;
+	struct stat *stb;
 	int type;
 	char u[16];
+	struct pinfo *pi;
+	ENTRY ent, *ret;
 
-	if (stat(path, &stb) != 0)
-		return false;
-	if (!S_ISDIR(stb.st_mode) && !S_ISREG(stb.st_mode))
+	if (!hdata_init) {
+		hcreate_r(1000, &hdata);
+		hdata_init = 1;
+	}
+	ent.key = path;
+	if (hsearch_r(ent, FIND, &ret, &hdata) == 0) {
+		ent.key = strdup(path);
+		pi = xmalloc(sizeof(*pi));
+		pi->stb = pi->mntp = pi->stfs = 0;
+		pi->valid = 0;
+		ent.data = pi;
+		hsearch_r(ent, ENTER, &ret, &hdata);
+	} else
+		pi = ret->data;
+
+	if (pi->valid < time(NULL)+120) {
+		pi->stb = pi->mntp = pi->stfs = 0;
+		pi->valid = time(NULL);
+	}
+
+	stb = &pi->statbuf;
+	if (!pi->stb) {
+		if (stat(path, stb) != 0)
+			return false;
+		pi->stb = 1;
+	}
+	if (!S_ISDIR(stb->st_mode) && !S_ISREG(stb->st_mode))
 		return false;
 
 	switch (parsed->fsidtype) {
 	case FSID_DEV:
 	case FSID_MAJOR_MINOR:
 	case FSID_ENCODE_DEV:
-		if (stb.st_ino != parsed->inode)
+		if (stb->st_ino != parsed->inode)
 			return false;
-		if (parsed->major != major(stb.st_dev) ||
-		    parsed->minor != minor(stb.st_dev))
+		if (parsed->major != major(stb->st_dev) ||
+		    parsed->minor != minor(stb->st_dev))
 			return false;
 		return true;
 	case FSID_NUM:
@@ -631,12 +676,16 @@ static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 		return true;
 	case FSID_UUID4_INUM:
 	case FSID_UUID16_INUM:
-		if (stb.st_ino != parsed->inode)
+		if (stb->st_ino != parsed->inode)
 			return false;
 		goto check_uuid;
 	case FSID_UUID8:
 	case FSID_UUID16:
-		if (!is_mountpoint(path))
+		if (!pi->mntp) {
+			pi->mountpoint = is_mountpoint(path);
+			pi->mntp = 1;
+		}
+		if (!pi->mountpoint)
 			return false;
 	check_uuid:
 		if (exp->m_export.e_uuid) {
@@ -644,12 +693,18 @@ static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 			if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
 				return true;
 		}
-		else
+		else {
+			if (!pi->stfs) {
+				if (statfs64(path, &pi->statfsbuf) != 0)
+					return false;
+				pi->stfs = 1;
+			}
 			for (type = 0;
-			     uuid_by_path(path, type, parsed->uuidlen, u);
+			     uuid_by_path(path, &pi->statfsbuf, type, parsed->uuidlen, u);
 			     type++)
 				if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
 					return true;
+		}
 	}
 	/* Well, unreachable, actually: */
 	return false;
@@ -727,10 +782,18 @@ static void nfsd_fh(int f)
 		for (exp = exportlist[i].p_head; exp; exp = next_exp) {
 			char *path;
 
+			if (!is_ipaddr_client(dom)) {
+				if (!namelist_client_matches(exp, dom))
+					continue;
+			} else {
+				if (!ipaddr_client_matches(exp, ai))
+					continue;
+			}
+
 			if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
 				static nfs_export *prev = NULL;
 				static void *mnt = NULL;
-				
+
 				if (prev == exp) {
 					/* try a submount */
 					path = next_mnt(&mnt, exp->m_export.e_path);
@@ -751,9 +814,6 @@ static void nfsd_fh(int f)
 				next_exp = exp->m_next;
 			}
 
-			if (!is_ipaddr_client(dom)
-					&& !namelist_client_matches(exp, dom))
-				continue;
 			if (exp->m_export.e_mountpoint &&
 			    !is_mountpoint(exp->m_export.e_mountpoint[0]?
 					   exp->m_export.e_mountpoint:
@@ -762,9 +822,6 @@ static void nfsd_fh(int f)
 
 			if (!match_fsid(&parsed, exp, path))
 				continue;
-			if (is_ipaddr_client(dom)
-					&& !ipaddr_client_matches(exp, ai))
-				continue;
 			if (!found || subexport(&exp->m_export, found)) {
 				found = &exp->m_export;
 				free(found_path);
@@ -906,7 +963,7 @@ static int dump_to_cache(int f, char *buf, int buflen, char *domain,
 		write_secinfo(&bp, &blen, exp, flag_mask);
 		if (exp->e_uuid == NULL || different_fs) {
 			char u[16];
-			if (uuid_by_path(path, 0, 16, u)) {
+			if (uuid_by_path(path, NULL, 0, 16, u)) {
 				qword_add(&bp, &blen, "uuid");
 				qword_addhex(&bp, &blen, u, 16);
 			}
-- 
2.12.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-14  6:51                 ` NeilBrown
@ 2017-07-25 21:12                   ` Phil Kauffman
  2017-07-27  4:37                     ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-25 21:12 UTC (permalink / raw)
  To: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3085 bytes --]

On 07/14/2017 01:51 AM, NeilBrown wrote:
> Please try this (against a clean nfs-utils.  i.e. remove the previous
> patch).
> It is a hack and would need to be totally re-written, but hopefully the
> measurements you make and the strace that you report could be useful.
> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
> before.  It is easier to find the difference between two times with
> -ttt. And add -T as well.

Sorry this took so long but I had to be sure of my results.

With your latest patch applied I am unable to mount my NFS shares and 
thus cannot continue with the tests from before.

Since the patch you provided cannot be applied cleanly to the source 
used to build '1:1.2.8-9ubuntu12.1' for Ubuntu Xenial, I chose to build a 
couple of VMs to continue this testing. They should be identical to the 
machines I was using before, except that they are underpowered and not using 
the nfs-common and nfs-kernel-server packages at version 1:1.2.8-9ubuntu12.1. 
Most importantly, the ZFS datasets and the files in /etc/exports.d are the same.

I found that utils/mountd/cache.c was last modified in 1.3.4, so I 
tested both 1.3.4 and 2.1.1 without your patch as a baseline. I was able 
to mount my shares in both versions.

I then reverted my VMs to a state before the baseline install and 
installed again with your patch applied.

I get the following with your patch applied on 1.3.4 and 2.1.1:

root@nfsclient:~# mount -avv
/                        : ignored
/boot                    : already mounted
none                     : ignored
mount.nfs4: timeout set for Tue Jul 25 15:43:13 2017
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting nfsserver.cs.uchicago.edu:/homes
mount.nfs4: timeout set for Tue Jul 25 15:43:14 2017
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting nfsserver.cs.uchicago.edu:/stage


Attached are my install notes and the strace during the failed mount above.




-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu

[-- Attachment #2: nfs_test_notes.txt --]
[-- Type: text/plain, Size: 1334 bytes --]

# all
umask 022
apt-get build-dep nfs-kernel-server
apt-get install rpcbind strace
systemctl enable rpcbind

git clone git://git.linux-nfs.org/projects/steved/nfs-utils.git
cd nfs-utils
# choose the version to test
#git checkout tags/nfs-utils-1-3-4 -b 1.3.4
#git checkout tags/nfs-utils-2-1-1 -b 2.1.1

# if patching
wget http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/nb2.patch
patch -p1 utils/mountd/cache.c < nb2.patch
rm nb2.patch


# Server
./autogen.sh && ./configure --with-systemd --disable-gss && make && make install
touch /etc/default/nfs-kernel-server /etc/default/nfs-common
mkdir -p /usr/libexec/nfs-utils
cp ../nfs-utils/debian/nfs-utils_env.sh /usr/libexec/nfs-utils/
touch /etc/exports && mkdir -p /etc/exports.d
systemctl enable nfs-server.service
systemctl enable nfs-utils.service
reboot

# client
./autogen.sh && ./configure --with-systemd --disable-gss && make && make install
touch /etc/default/nfs-common
chmod 644 /usr/lib/systemd/system/*
systemctl enable nfs-client.target
systemctl enable rpc-statd.service
reboot


root@nfsserver:~/nfs-utils# cat /etc/default/nfs-common
NEED_STATD=yes
STATDOPTS="--port 44634"
NEED_GSSD=

root@nfsserver:~/nfs-utils# cat /etc/default/nfs-kernel-server
RPCNFSDCOUNT=
RPCNFSDPRIORITY=0
RPCMOUNTDOPTS=
NEED_SVCGSSD=
RPCSVCGSSDOPTS=
RPCNFSDOPTS='--syslog'


[-- Attachment #3: strace_no_mount.txt --]
[-- Type: text/plain, Size: 8877 bytes --]

strace: Process 32552 attached
1501016020.534715 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [3]) <2.334481>
1501016022.869338 read(3, "nfsd 10.135.24.22\n", 32768) = 18 <0.000029>
1501016022.869443 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000058>
1501016022.869566 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000036>
1501016022.869673 close(20)             = 0 <0.000045>
1501016022.870015 write(3, "nfsd 10.135.24.22 1501017822 nfs"..., 56) = 56 <0.000042>
1501016022.870094 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [5]) <0.189956>
1501016023.060124 read(5, "nfsclient.cs.uchicago.edu 1 \\x00"..., 32768) = 39 <0.000046>
1501016023.060220 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000028>
1501016023.060292 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000021>
1501016023.060350 close(20)             = 0 <0.000022>
1501016023.060423 write(5, "nfsclient.cs.uchicago.edu 1 \\x00"..., 51) = 51 <0.000032>
1501016023.060488 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.050699>
1501016023.111265 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(51141), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"Yv\31\354\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000029>
1501016023.111435 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000030>
1501016023.111522 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000028>
1501016023.111610 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(51141), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"Yv\31\354\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000051>
1501016023.111727 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000232>
1501016023.112015 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{".\314P$\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000023>
1501016023.112093 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000054>
1501016023.112188 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000021>
1501016023.112254 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{".\314P$\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000039>
1501016023.112340 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000083>
1501016023.112466 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"/\314P$\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\1\0\0\0\1\0\0\0$"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 88 <0.000023>
1501016023.112537 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000022>
1501016023.112597 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000022>
1501016023.112669 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000027>
1501016023.112734 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000030>
1501016023.112856 close(20)             = 0 <0.000057>
1501016023.112977 lstat("/homes", 0x7ffc0917b900) = -1 ENOENT (No such file or directory) <0.000059>
1501016023.113593 sendto(6, "<28>Jul 25 15:53:43 rpc.mountd[3"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 <0.000054>
1501016023.114333 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"/\314P$\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r", 28}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 28 <0.000065>
1501016023.114480 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.196515>
1501016023.311130 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(49539), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"YsnP\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000051>
1501016023.311292 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000042>
1501016023.311406 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000038>
1501016023.311504 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(49539), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"YsnP\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000064>
1501016023.311653 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000175>
1501016023.311907 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\f[kA\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000047>
1501016023.312044 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000050>
1501016023.312136 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000023>
1501016023.312224 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\f[kA\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000041>
1501016023.312317 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000083>
1501016023.312440 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\r[kA\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\1\0\0\0\1\0\0\0$"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 88 <0.000021>
1501016023.312513 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000026>
1501016023.312576 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000019>
1501016023.312635 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000023>
1501016023.312691 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000019>
1501016023.312741 close(20)             = 0 <0.000019>
1501016023.312790 lstat("/stage", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000045>
1501016023.313043 sendto(6, "<28>Jul 25 15:53:43 rpc.mountd[3"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 <0.000027>
1501016023.313400 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\r[kA\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r", 28}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 28 <0.000036>
1501016023.313485 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-25 21:12                   ` Phil Kauffman
@ 2017-07-27  4:37                     ` NeilBrown
  2017-07-27 21:40                       ` Phil Kauffman
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-27  4:37 UTC (permalink / raw)
  To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1441 bytes --]

On Tue, Jul 25 2017, Phil Kauffman wrote:

> On 07/14/2017 01:51 AM, NeilBrown wrote:
>> Please try this (against a clean nfs-utils.  i.e. remove the previous
>> patch).
>> It is a hack and would need to be totally re-written, but hopely the
>> measurements you make and strace that you report could be useful.
>> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
>> before.  It is easier to find the difference between two times with
>> -ttt. And add -T as well.
>
> Sorry this took so long but I had to be sure of my results.

It is worth being thorough.

>
> With your latest patch applied I am unable to mount my NFS shares and 
> thus cannot continue with the tests from before.

Weird.

You are experiencing a bug that was very recently fixed, where if
mount.nfs4 gets the error ENOENT from the server, it falls back to
NFSv3.
That explains some of the noise, but doesn't explain why you get ENOENT
for the v4 mount.

The strace output you've provided doesn't even show any attempts to read
/etc/mtab, which my patch doesn't change at all.  So it seems like the
context is different in some way.

Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
... maybe that is done automatically somehow...

Can you try with unpatched 2.1.1?
Also maybe provide an strace starting before any attempt to mount
anything, and with an extra option "-s 1000"..

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-27  4:37                     ` NeilBrown
@ 2017-07-27 21:40                       ` Phil Kauffman
  2017-07-31  4:23                         ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-27 21:40 UTC (permalink / raw)
  To: linux-nfs

On 07/26/2017 11:37 PM, NeilBrown wrote:
> Weird.
I thought so as well.

> So it seems like the context is different in some way.
It is in the sense that I am now using 1.3.4 and 2.1.1 vanilla on Ubuntu Xenial (which provides 1.2.8) without any of Debian's patches or niceties. 

 
> Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
> ... maybe that is done automatically somehow...
I copied the files from the original NFS server into /etc/exports.d  as part of the VM snapshot I revert back to. My notes say 'mkdir -p /etc/exports.d', but this step is not necessary.

root@nfsserver:~# ls /etc/exports.d/ | wc -l
972

I only added 3 extra files so that 'nfsclient' would be able to mount from the testing server (nfsserver).

The ZFS datasets were also precreated:

root@nfsserver:~# zfs list | wc -l
6062


> Can you try with unpatched 2.1.1?
I assume you mean without any patches. vanilla 2.1.1

> Also maybe provide an strace starting before any attempt to mount
> anything, and with an extra option "-s 1000"..

This is all that happens *before* trying to mount from the client.
root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
strace: Process 2517 attached
1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL

Stracing rpc.nfsd:
root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1| tee strace_nfsd.txt; /usr/sbin/sm-notify
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt

When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason): http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt 




-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu


* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-27 21:40                       ` Phil Kauffman
@ 2017-07-31  4:23                         ` NeilBrown
  0 siblings, 0 replies; 16+ messages in thread
From: NeilBrown @ 2017-07-31  4:23 UTC (permalink / raw)
  To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3077 bytes --]

On Thu, Jul 27 2017, Phil Kauffman wrote:

> On 07/26/2017 11:37 PM, NeilBrown wrote:
>> Weird.
> I thought so as well.
>
>> So it seems like the context is different in some way.
> It is in the sense that I am now using 1.3.4 and 2.1.1 vanilla on Ubuntu Xenial (which provides 1.2.8) without any of Debian's patches or niceties. 
>
>  
>> Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
>> ... maybe that is done automatically somehow...
> I copied the files from the original NFS server into /etc/exports.d  as part of the VM snapshot I revert back to. My notes say 'mkdir -p /etc/exports.d', but this step is not necessary.
>
> root@nfsserver:~# ls /etc/exports.d/ | wc -l
> 972
>
> I only added 3 extra files so that 'nfsclient' would be able to mount from the testing server (nfsserver).
>
> The ZFS datasets were also precreated:
>
> root@nfsserver:~# zfs list | wc -l
> 6062
>
>
>> Can you try with unpatched 2.1.1?
> I assume you mean without any patches. vanilla 2.1.1
>
>> Also maybe provide an strace starting before any attempt to mount
>> anything, and with an extra option "-s 1000"..
>
> This is all that happens *before* trying to mount from the client.
> root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
> strace: Process 2517 attached
> 1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL
>
> Stracing rpc.nfsd:
> root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1| tee strace_nfsd.txt; /usr/sbin/sm-notify
> http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt
>
> When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason): http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt 
>


I'm sorry but I cannot get anything useful from these.  It isn't crystal
clear what was being tested in each case....
Could you try:

1/ Build the current 'master' branch from the nfs-utils git tree, with
   no extra patches.
2/ Install that and perform the same test which produced the 1.3G strace
   file above.  Provide the resulting strace of rpc.mountd.
3/ Tell me what happened when you mounted on the client.  Did it work as
   expected?
4/ Apply the patch I provided on top of the 'master' branch but change

		hcreate_r(1000, &hdata);

to

		hcreate_r(20000, &hdata);

5/ Install that and perform the same test as in step 2, producing
   a new strace of rpc.mountd.  Presumably it will be different.
6/ Tell me what happened when you mounted on the client.  Did it work
   as expected, or did it fail like I think it did last time.

If one worked and the other didn't, I can compare the strace and try to
find a difference.
If both worked, then we should have useful information about how much
speed-up the patch provides (if any).
If neither worked, then something weird is happening...

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]



Thread overview: 16+ messages
2017-07-06 22:43 /etc/mtab read ~900 times by rpc.mountd Phil Kauffman
2017-07-07  2:30 ` NeilBrown
2017-07-07  5:20   ` Phil Kauffman
2017-07-10  7:17     ` NeilBrown
2017-07-10 11:47       ` Karel Zak
2017-07-10 23:19         ` NeilBrown
2017-07-10 20:59       ` Phil Kauffman
2017-07-10 23:51         ` NeilBrown
2017-07-11 15:39           ` Phil Kauffman
2017-07-12  0:46             ` NeilBrown
2017-07-12 17:55               ` Phil Kauffman
2017-07-14  6:51                 ` NeilBrown
2017-07-25 21:12                   ` Phil Kauffman
2017-07-27  4:37                     ` NeilBrown
2017-07-27 21:40                       ` Phil Kauffman
2017-07-31  4:23                         ` NeilBrown
