* /etc/mtab read ~900 times by rpc.mountd @ 2017-07-06 22:43 Phil Kauffman 2017-07-07 2:30 ` NeilBrown 0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-06 22:43 UTC (permalink / raw)
To: linux-nfs

In response to this thread that died out a while back: http://marc.info/?t=142727765600001&r=1&w=2

I have the same issue and am looking for a solution.

My current setup involves nested ZFS datasets. It looks like this:

tank/homes/user{1..6000}
tank/stage/somedirname{1..100}

tank/homes and tank/stage are datasets and the immediate children are also datasets. There are huge benefits to using datasets that I don't want to get into here, so let's leave it at: I would prefer to keep this setup.

Now to the issue. Because of the nested mounts I need to use the 'crossmnt' export option, which seems to invoke a loop over '/etc/mtab'. During various straces I have seen /etc/mtab get read over 900 times (on ssh login) while a user's home directory is being mounted.

Since /etc/mtab has around 6000 lines (and growing) it takes a long time to complete. Around 8 seconds, actually, from the time you start an ssh connection to when the password prompt appears (tested with 'time ssh -o BatchMode=yes root@client1'). It seems that most of this time is spent by rpc.mountd reading /etc/mtab about 900 times while the client waits for the server to finish.

# ls -l /etc/mtab
lrwxrwxrwx 1 root root 19 Jun 22 14:31 /etc/mtab -> ../proc/self/mounts

# cat /proc/self/mounts > mtab

# ls -lh mtab
-rw-rw---- 1 root root 442K Jul 6 16:51 mtab

# wc -l mtab
6051 mtab

Right now I am dealing with 'nfs-kernel-server' version '1:1.2.8-9ubuntu12.1' (on Xenial), which I already had to patch with http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commitdiff;h=35640883cf34a32f893e9fecefbb193782e9bc75 as suggested here: http://marc.info/?t=138684383100001&r=1&w=2
(This works brilliantly BTW; it shaved a good 20 seconds off the mount time because all those devices were being skipped.)

I was hoping you folks might take a look, and maybe provide a patch. /crosses fingers/

steved suggested that this took place around: utils/mountd/cache.c:nfsd_fh():path = next_mnt(&mnt, exp->m_export.e_path);

Cheers,

Phil

P.S.: I can provide straces on demand if required, I didn't want to clutter up the message.

--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-06 22:43 /etc/mtab read ~900 times by rpc.mountd Phil Kauffman @ 2017-07-07 2:30 ` NeilBrown 2017-07-07 5:20 ` Phil Kauffman 0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-07 2:30 UTC (permalink / raw)
To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3332 bytes --]

On Thu, Jul 06 2017, Phil Kauffman wrote:
> In response to this thread that died out a while back: http://marc.info/?t=142727765600001&r=1&w=2
>
> I have the same issue and am looking for a solution.
>
> My current setup involves nested ZFS datasets. It looks like this:
> tank/homes/user{1..6000}
> tank/stage/somedirname{1..100}
>
> tank/homes and tank/stage are datasets and the immediate children are also datasets.
>
> There are huge benefits to using datasets that I don't want to get into here, so let's leave it at: I would prefer to keep this setup.
>
>
> Now to the issue. Because of the nested mounts I need to use the 'crossmnt' export option, which seems to invoke a loop over '/etc/mtab'. During various straces I have seen /etc/mtab get read over 900 times (on ssh login) while a user's home directory is being mounted.
>
> Since /etc/mtab has around 6000 lines (and growing) it takes a long time to complete. Around 8 seconds, actually, from the time you start an ssh connection to when the password prompt appears (tested with 'time ssh -o BatchMode=yes root@client1'). It seems that most of this time is spent by rpc.mountd reading /etc/mtab about 900 times while the client waits for the server to finish.
> > # ls -l /etc/mtab > lrwxrwxrwx 1 root root 19 Jun 22 14:31 /etc/mtab -> ../proc/self/mounts > > # cat /proc/self/mounts > mtab > > # ls -lh mtab > -rw-rw---- 1 root root 442K Jul 6 16:51 mtab > > # wc -l mtab > 6051 mtab > > > Right now I am dealing with 'nfs-kernel-server' version '1:1.2.8-9ubuntu12.1' (on Xenial), which I already had to patch with http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commitdiff;h=35640883cf34a32f893e9fecefbb193782e9bc75 suggested here http://marc.info/?t=138684383100001&r=1&w=2 > (This works brilliantly BTW, it shaved a good 20 seconds off the mount time because all those devices were being skipped.) > > > I was hoping you folks might take a look, and maybe provide a patch. /crosses fingers/ > > steved suggested that this took place around: utils/mountd/cache.c:nfsd_fh():path = next_mnt(&mnt, exp->m_export.e_path); > > > Cheers, > > Phil > I can imagine /etc/mtab being read once for every line in /etc/exports, but unless your /etc/exports is very big, I can't easily see why it would be read 900 times. Maybe lots of different mount points are being accessed by something and each one triggers a few reads... Can you show your /etc/exports file? > > P.S.: I can provide straces on demand if required, I didn't want to clutter up the message. I don't recommend this approach. If you have data, just provide it. If it is really really big, provide a link. You have my attention now. You might not get it again when you send the strace. You might have wasted your opportunity. Yes - please provide the strace showing lots of reads through /etc/mtab. Don't provide it privately to me. Either post it to the list, or post a link to it. NeilBrown > > -- > Phil Kauffman > Systems Admin > Dept. 
of Computer Science > University of Chicago > kauffman@cs.uchicago.edu > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-07 2:30 ` NeilBrown @ 2017-07-07 5:20 ` Phil Kauffman 2017-07-10 7:17 ` NeilBrown 0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-07 5:20 UTC (permalink / raw)
To: linux-nfs

> I can imagine /etc/mtab being read once for every line in /etc/exports,
> but unless your /etc/exports is very big, I can't easily see why it
> would be read 900 times.
> Maybe lots of different mount points are being accessed by something and
> each one triggers a few reads...
>
> Can you show your /etc/exports file?

These get generated via puppet and dropped into /etc/exports.d/

Any given host will have 3 files associated with it:

root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_fsidroot.charliestorage1_fsidroot.exports
/tank client1.cs.uchicago.edu(fsid=0,rw,async,no_subtree_check,insecure,crossmnt)

root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_homes.charliestorage1_homes.exports
/tank/homes client1.cs.uchicago.edu(rw,async,no_subtree_check,insecure,crossmnt)

root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_stage.charliestorage1_stage.exports
/tank/stage client1.cs.uchicago.edu(rw,async,no_subtree_check,insecure,crossmnt)

Here we can see that there are 960 files in /etc/exports.d:

root@storage1:~# find /etc/exports.d/ -type f | wc
    960     960   88701

You probably don't want or need this, but here is a link to all of my /etc/exports.d/ files combined into one, with the hostnames redacted: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/all_etc_exports.d_in_one_file.txt

> I don't recommend this approach. If you have data, just provide it. If
> it is really really big, provide a link.
> You have my attention now. You might not get it again when you send the
> strace. You might have wasted your opportunity.
> Yes - please provide the strace showing lots of reads through
> /etc/mtab.
> Don't provide it privately to me. Either post it to the
> list, or post a link to it.

Thanks for the advice. Here is a link to a best-effort redacted strace: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/redacted_strace.txt

I would prefer to keep user name leakage to a minimum.

Cheers,

Phil

--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu
773-702-3913

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-07 5:20 ` Phil Kauffman @ 2017-07-10 7:17 ` NeilBrown 2017-07-10 11:47 ` Karel Zak 2017-07-10 20:59 ` Phil Kauffman 0 siblings, 2 replies; 16+ messages in thread
From: NeilBrown @ 2017-07-10 7:17 UTC (permalink / raw)
To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3074 bytes --]

On Fri, Jul 07 2017, Phil Kauffman wrote:
>> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> but unless your /etc/exports is very big, I can't easily see why it
>> would be read 900 times.
>> Maybe lots of different mount points are being accessed by something and
>> each one triggers a few reads...
>>
>> Can you show your /etc/exports file?
>
> These get generated via puppet and dropped into /etc/exports.d/
...
>
> Here we can see that there are 960 files in /etc/exports.d
> root@storage1:~# find /etc/exports.d/ -type f | wc
>     960     960   88701

Ahhh.. 960 exports. That could do it.

$ grep -c crossmnt all_etc_exports.d_in_one_file.txt
957

I get 957 - much the same number.

$ grep open redacted_strace.txt | uniq -c
      2 open("/var/lib/nfs/etab", O_RDONLY) = 6
    942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
      1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6

/etc/mtab was read 942 times for a single access. I would have expected 957, but maybe the system is dynamic and something changed between the two samples.

This makes it fairly clear what is happening. Now we just need to fix it.

One option would be to cache some of the details extracted from /etc/mtab, but that could get messy.

Another is to move the code around. In your case there are really just 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in the combined exports file is really different in different files). So any one client only needs to consider 3 exports, not 300.

There is room to optimize this code further than the below, but let's start simple.
Could you test with this patch applied and see what difference it makes? Thanks, NeilBrown diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c index ca6c84f4d93d..e712cc166157 100644 --- a/utils/mountd/cache.c +++ b/utils/mountd/cache.c @@ -727,6 +727,17 @@ static void nfsd_fh(int f) for (exp = exportlist[i].p_head; exp; exp = next_exp) { char *path; + if (!is_ipaddr_client(dom) + && !namelist_client_matches(exp, dom)) { + next_exp = exp->m_next; + continue; + } + if (is_ipaddr_client(dom) + && !ipaddr_client_matches(exp, ai)) { + next_exp = exp->m_next; + continue; + } + if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) { static nfs_export *prev = NULL; static void *mnt = NULL; @@ -751,9 +762,6 @@ static void nfsd_fh(int f) next_exp = exp->m_next; } - if (!is_ipaddr_client(dom) - && !namelist_client_matches(exp, dom)) - continue; if (exp->m_export.e_mountpoint && !is_mountpoint(exp->m_export.e_mountpoint[0]? exp->m_export.e_mountpoint: @@ -762,9 +770,6 @@ static void nfsd_fh(int f) if (!match_fsid(&parsed, exp, path)) continue; - if (is_ipaddr_client(dom) - && !ipaddr_client_matches(exp, ai)) - continue; if (!found || subexport(&exp->m_export, found)) { found = &exp->m_export; free(found_path); [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-10 7:17 ` NeilBrown @ 2017-07-10 11:47 ` Karel Zak 2017-07-10 23:19 ` NeilBrown 2017-07-10 20:59 ` Phil Kauffman 1 sibling, 1 reply; 16+ messages in thread From: Karel Zak @ 2017-07-10 11:47 UTC (permalink / raw) To: NeilBrown; +Cc: Phil Kauffman, linux-nfs On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote: > On Fri, Jul 07 2017, Phil Kauffman wrote: > > >> I can imagine /etc/mtab being read once for every line in /etc/exports, > >> but unless your /etc/exports is very big, I can't easily see why it > >> would be read 900 times. > >> Maybe lots of different mount points are being accessed by something and > >> each one triggers a few reads... > >> > >> Can you show your /etc/exports file? > > > > These get generated via puppet and dropped into /etc/exports.d/ > ... > > > > Here we can see that there are 960 files in /etc/exports.d > > root@storage1:~# find /etc/exports.d/ -type f | wc > > 960 960 88701 > > Ahhh.. 960 exports. That could do it. > > $ grep -c crossmnt all_etc_exports.d_in_one_file.txt > 957 > > I get 957 - much the same number. > > $ grep open redacted_strace.txt | uniq -c > 2 open("/var/lib/nfs/etab", O_RDONLY) = 6 > 942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6 > 1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6 > > /etc/mtab was ready 942 for a single access. I would have expected 957, > but maybe the system is dynamic and something changed between the two > samples. > > This makes it fairly clear what is happening. Now we just need to fix > it. > One option would be to cache some of the details extracted from > /etc/mtab, but that could get messy. Would be possible to read /proc/self/mounts only once and use poll() to detect changes? The issue is not only the open()+read() in the process, but also kernel has to compose the file from the mount tree. It's expensive. 
Karel -- Karel Zak <kzak@redhat.com> http://karelzak.blogspot.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-10 11:47 ` Karel Zak @ 2017-07-10 23:19 ` NeilBrown 0 siblings, 0 replies; 16+ messages in thread
From: NeilBrown @ 2017-07-10 23:19 UTC (permalink / raw)
To: Karel Zak; +Cc: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2217 bytes --]

On Mon, Jul 10 2017, Karel Zak wrote:
> On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
>> On Fri, Jul 07 2017, Phil Kauffman wrote:
>>
>> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> >> but unless your /etc/exports is very big, I can't easily see why it
>> >> would be read 900 times.
>> >> Maybe lots of different mount points are being accessed by something and
>> >> each one triggers a few reads...
>> >>
>> >> Can you show your /etc/exports file?
>> >
>> > These get generated via puppet and dropped into /etc/exports.d/
>> ...
>> >
>> > Here we can see that there are 960 files in /etc/exports.d
>> > root@storage1:~# find /etc/exports.d/ -type f | wc
>> >     960     960   88701
>>
>> Ahhh.. 960 exports. That could do it.
>>
>> $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
>> 957
>>
>> I get 957 - much the same number.
>>
>> $ grep open redacted_strace.txt | uniq -c
>>       2 open("/var/lib/nfs/etab", O_RDONLY) = 6
>>     942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
>>       1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>>
>> /etc/mtab was read 942 times for a single access. I would have expected 957,
>> but maybe the system is dynamic and something changed between the two
>> samples.
>>
>> This makes it fairly clear what is happening. Now we just need to fix
>> it.
>> One option would be to cache some of the details extracted from
>> /etc/mtab, but that could get messy.
>
> Would be possible to read /proc/self/mounts only once and use poll()
> to detect changes?

Sure it is possible. All we need is code....
>
> The issue is not only the open()+read() in the process, but also
> kernel has to compose the file from the mount tree. It's expensive.

That certainly contributes to the problem. If we cache the content, we can process it fast. If we avoid wanting to look at the content at all, it doesn't matter how fast we can process it (because we don't).

So yes, that might be part of the solution. But not the whole solution.

Thanks,
NeilBrown

> Karel
>
> --
> Karel Zak <kzak@redhat.com>
> http://karelzak.blogspot.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-10 7:17 ` NeilBrown 2017-07-10 11:47 ` Karel Zak @ 2017-07-10 20:59 ` Phil Kauffman 2017-07-10 23:51 ` NeilBrown 1 sibling, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-10 20:59 UTC (permalink / raw)
To: linux-nfs

On 07/10/2017 02:17 AM, NeilBrown wrote:
> Another is to move the code around. In your case there are really just
> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
> the combined exports file is really different in different files).
> So any one client only needs to consider 3 exports, not 300.

This assumption is correct. We have ~300 clients and growing (slowly).

> Could you test with this patch applied and see what
> difference it makes?

After being confused by conflicting test results and determining that the service nfs-server takes a bit longer to start than 'systemctl' will let you believe, I believe Neil's patch works.

The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2

Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt

Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.

I reboot everything: client and server

Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success): http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_result_no-strace.txt

I then used a larger sample size of users (~/f) and got a similar result.

Now let's strace rpc.mountd while sshing with one user:

$ time ssh -o BatchMode=yes $CLIENTHOSTNAME -l user_not_in_list_f
Permission denied (publickey,password).
real    0m0.352s
user    0m0.020s
sys     0m0.012s

The resulting strace seems entirely reasonable: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_one_user_strace.txt

$ for user in $(cat ~/users | grep -E "^g*"); do time ssh -o BatchMode=yes $CLIENTHOSTNAME -l $user; done

The first few dozen users whose usernames happen to start with 'g': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_multiple_user_strace.txt

--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-10 20:59 ` Phil Kauffman @ 2017-07-10 23:51 ` NeilBrown 2017-07-11 15:39 ` Phil Kauffman 0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-10 23:51 UTC (permalink / raw)
To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2996 bytes --]

On Mon, Jul 10 2017, Phil Kauffman wrote:
> On 07/10/2017 02:17 AM, NeilBrown wrote:
>> Another is to move the code around. In your case there are really just
>> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
>> the combined exports file is really different in different files).
>> So any one client only needs to consider 3 exports, not 300.
> This assumption is correct. We have ~300 clients and growing (slowly).
>
>> Could you test with this patch applied and see what
>> difference it makes?
> After being confused by conflicting test results and determining that the service nfs-server takes a bit longer to start than 'systemctl' will let you believe, I believe Neil's patch works.
>
> The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
> It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2
>
> Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt
>
> Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.
>
> I reboot everything: client and server
>
> Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).

This does look encouraging ... but I'm not sure we are comparing apples with apples.

In the original strace we see:

  read(4, ....)
    check on etab
    statfs(some path)
  write(4, ....)

which is exactly like the new strace. But then the old strace also has

  read(5, ...)
    check on etab
    read /etc/mtab hundreds of times
  write(5, ...)
which is completely missing in the new strace. So the new code, which doesn't read /etc/mtab as much, isn't being run at all.

file desc 4 is /proc/1926/net/rpc/nfsd.export/channel
file desc 5 is /proc/1926/net/rpc/nfsd.fh/channel

If the nfs server kernel has nothing cached, then when a request arrives you should first see a transaction on fd 3 (/proc/1926/net/rpc/auth.unix.ip/channel) to map the IP address to a "domain". Then a transaction on fd 5 (nfsd.fh) to map the domain + a filehandle fragment to an export point. Then a transaction on fd 4 (nfsd.export) to map the domain + export point to some export options.

If the request is a "LOOKUP", you might see another "fd 5" lookup after the "fd 4" to find out if a sub-mountpoint is exported.

That is what the old strace showed. If you don't see all of those, then something is cached.

Could you repeat your experiments after first running "exportfs -f" on the nfs server? That should cause worst-case load on mountd. Until we see the "read(5).... write(5)" sequence completing quickly, I cannot be sure that we have fixed anything.

Actually... for future traces, please add "-tt" so I can see where the delays are, and how much they are.

(and thanks for providing all the tracing details - I love getting unambiguous data!!)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-10 23:51 ` NeilBrown @ 2017-07-11 15:39 ` Phil Kauffman 2017-07-12 0:46 ` NeilBrown 0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-11 15:39 UTC (permalink / raw)
To: linux-nfs

On 07/10/2017 06:51 PM, NeilBrown wrote:
> This does look encouraging ... but I'm not sure we are comparing apples
> with apples.

I had a feeling this was the case.

> Could you repeat your experiments after first running "exportfs -f" on
> the nfs server? That should cause worst-case load on mountd.

The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt

nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt

# cut -d' ' -f2 2017-07-11_100929_strace.txt | cut -d'(' -f1 | sort | uniq -c | sort -n
      1 Process
     73 select
    143 write
    216 close
    216 fstat
    216 open
  48240 read
 435599 stat
 871162 statfs
 871163 lstat

> (and thanks for providing all the tracing details - I love getting
> unambiguous data!!)

No problem, I'm available to run any necessary tests. I also appreciate your help on this.

--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu
773-702-3913

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-11 15:39 ` Phil Kauffman @ 2017-07-12 0:46 ` NeilBrown 2017-07-12 17:55 ` Phil Kauffman 0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-12 0:46 UTC (permalink / raw)
To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2535 bytes --]

On Tue, Jul 11 2017, Phil Kauffman wrote:
> On 07/10/2017 06:51 PM, NeilBrown wrote:
>> This does look encouraging ... but I'm not sure we are comparing apples
>> with apples.
> I had a feeling this was the case.
>
>> Could you repeat your experiments after first running "exportfs -f" on
>> the nfs server? That should cause worst-case load on mountd.
> The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
> SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt
>
> nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt
>
> # cut -d' ' -f2 2017-07-11_100929_strace.txt | cut -d'(' -f1 | sort | uniq -c | sort -n
>       1 Process
>      73 select
>     143 write
>     216 close
>     216 fstat
>     216 open
>   48240 read
>  435599 stat
>  871162 statfs
>  871163 lstat
>
>> (and thanks for providing all the tracing details - I love getting
>> unambiguous data!!)
> No problem, I'm available to run any necessary tests. I also appreciate your help on this.

So the new data shows about 7 seconds for a login, which is probably a little longer than you would like, but might be acceptable?

Each login triggers 3 reads of /etc/mtab, and each read (from 'open' to 'close') takes either 3-4 seconds or about 0.2 seconds. Actually reading /etc/mtab seems to take about 0.1-0.2 seconds at most.
The rest of the time is calling stat, with a pattern like:

10:09:39.547280 stat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547318 lstat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547350 lstat("/tank/homes/someuser"..", {st_mode=S_IFDIR|0755, st_size=5967, ...}) = 0
10:09:39.547385 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba
10:09:39.547424 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba

Of the three passes through /etc/mtab for each login, one generates 95 of these, one generates 6049, and one generates 5952.

The 'stat' is from match_fsid().
The 2 'lstat' are from is_mountpoint() (called by match_fsid()).
The first statfs() is from uuid_by_path().
Maybe the second is somewhere in the blkid code.

Probably the next step is to cache the mapping from path, fsidtype -> fsid. That is more than a 2-line patch. I might have a go later this week.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-12 0:46 ` NeilBrown @ 2017-07-12 17:55 ` Phil Kauffman 2017-07-14 6:51 ` NeilBrown 0 siblings, 1 reply; 16+ messages in thread From: Phil Kauffman @ 2017-07-12 17:55 UTC (permalink / raw) To: linux-nfs On 07/11/2017 07:46 PM, NeilBrown wrote: > So the new data shows about 7 seconds for a login, which is probably > a little longer than you would like, but might be acceptable? Unfortunately, the delay is not acceptable. The ideal would be to achieve performance parity with when one is not forced to use the 'crossmnt' option. My current setup (home directory server) does not require 'crossmnt' and does not incur a delay. It is a standard nfs server using mdadm, lvm, and xfs. While my current setup is probably "normal" and using nested datasets with the 'crossmnt' option might be "weird" now; nested mounts will probably only become more common as BTRFS, ZFS, and other filesystems with similar features gain traction on linux. > That is more than a 2-line patch. I might have a go later this week. That would be super! Thanks for taking a look. -- Phil Kauffman Systems Admin Dept. of Computer Science University of Chicago kauffman@cs.uchicago.edu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: /etc/mtab read ~900 times by rpc.mountd 2017-07-12 17:55 ` Phil Kauffman @ 2017-07-14 6:51 ` NeilBrown 2017-07-25 21:12 ` Phil Kauffman 0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-14 6:51 UTC (permalink / raw)
To: Phil Kauffman, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 7804 bytes --]

On Wed, Jul 12 2017, Phil Kauffman wrote:
> On 07/11/2017 07:46 PM, NeilBrown wrote:
>> So the new data shows about 7 seconds for a login, which is probably
>> a little longer than you would like, but might be acceptable?
> Unfortunately, the delay is not acceptable.
>
> The ideal would be to achieve performance parity with when one is not
> forced to use the 'crossmnt' option.
>
> My current setup (home directory server) does not require 'crossmnt' and
> does not incur a delay. It is a standard nfs server using mdadm, lvm,
> and xfs.
>
> While my current setup is probably "normal" and using nested datasets
> with the 'crossmnt' option might be "weird" now; nested mounts will
> probably only become more common as BTRFS, ZFS, and other filesystems
> with similar features gain traction on linux.
>
>
>> That is more than a 2-line patch. I might have a go later this week.
> That would be super! Thanks for taking a look.

Please try this (against a clean nfs-utils, i.e. remove the previous patch).

It is a hack and would need to be totally re-written, but hopefully the measurements you make and the strace that you report could be useful.

Also, for the strace, please use "-ttt" rather than "-tt" like I asked before. It is easier to find the difference between two times with -ttt. And add -T as well.
Thanks, NeilBrown From f15571d747bb6272ec539e46b6b3aee3ccc53b30 Mon Sep 17 00:00:00 2001 From: NeilBrown <neilb@suse.com> Date: Fri, 14 Jul 2017 16:43:37 +1000 Subject: [PATCH] testing Signed-off-by: NeilBrown <neilb@suse.com> --- utils/mountd/cache.c | 105 +++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 81 insertions(+), 24 deletions(-) diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c index ca6c84f4d93d..db89a4feb7ae 100644 --- a/utils/mountd/cache.c +++ b/utils/mountd/cache.c @@ -18,6 +18,7 @@ #include <time.h> #include <netinet/in.h> #include <arpa/inet.h> +#include <sys/sysmacros.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> @@ -299,7 +300,7 @@ static const long int nonblkid_filesystems[] = { 0 /* last */ }; -static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid) +static int uuid_by_path(char *path, struct statfs64 *stp, int type, size_t uuidlen, char *uuid) { /* get a uuid for the filesystem found at 'path'. * There are several possible ways of generating the @@ -334,12 +335,17 @@ static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid) const char *val; int rc; - rc = statfs64(path, &st); + if (stp) + rc = 0; + else { + stp = &st; + rc= statfs64(path, stp); + } if (type == 0 && rc == 0) { const long int *bad; for (bad = nonblkid_filesystems; *bad; bad++) { - if (*bad == st.f_type) + if (*bad == stp->f_type) break; } if (*bad == 0) @@ -347,9 +353,9 @@ static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid) } if (rc == 0 && - (st.f_fsid.__val[0] || st.f_fsid.__val[1])) + (stp->f_fsid.__val[0] || stp->f_fsid.__val[1])) snprintf(fsid_val, 17, "%08x%08x", - st.f_fsid.__val[0], st.f_fsid.__val[1]); + stp->f_fsid.__val[0], stp->f_fsid.__val[1]); else fsid_val[0] = 0; @@ -603,25 +609,64 @@ static int parse_fsid(int fsidtype, int fsidlen, char *fsid, return 0; } +#define entry FOO +#include <search.h> +struct pinfo { + unsigned int stb:1, mntp:1, stfs:1; + unsigned int 
mountpoint:1;
+	time_t valid;
+	struct stat statbuf;
+	struct statfs64 statfsbuf;
+};
+static struct hsearch_data hdata;
+static int hdata_init = 0;
+
+
 static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 {
-	struct stat stb;
+	struct stat *stb;
 	int type;
 	char u[16];
+	struct pinfo *pi;
+	ENTRY ent, *ret;
 
-	if (stat(path, &stb) != 0)
-		return false;
-	if (!S_ISDIR(stb.st_mode) && !S_ISREG(stb.st_mode))
+	if (!hdata_init) {
+		hcreate_r(1000, &hdata);
+		hdata_init = 1;
+	}
+	ent.key = path;
+	if (hsearch_r(ent, FIND, &ret, &hdata) == 0) {
+		ent.key = strdup(path);
+		pi = xmalloc(sizeof(*pi));
+		pi->stb = pi->mntp = pi->stfs = 0;
+		pi->valid = 0;
+		ent.data = pi;
+		hsearch_r(ent, ENTER, &ret, &hdata);
+	} else
+		pi = ret->data;
+
+	if (pi->valid < time(NULL)+120) {
+		pi->stb = pi->mntp = pi->stfs = 0;
+		pi->valid = time(NULL);
+	}
+
+	stb = &pi->statbuf;
+	if (!pi->stb) {
+		if (stat(path, stb) != 0)
+			return false;
+		pi->stb = 1;
+	}
+	if (!S_ISDIR(stb->st_mode) && !S_ISREG(stb->st_mode))
 		return false;
 
 	switch (parsed->fsidtype) {
 	case FSID_DEV:
 	case FSID_MAJOR_MINOR:
 	case FSID_ENCODE_DEV:
-		if (stb.st_ino != parsed->inode)
+		if (stb->st_ino != parsed->inode)
 			return false;
-		if (parsed->major != major(stb.st_dev) ||
-		    parsed->minor != minor(stb.st_dev))
+		if (parsed->major != major(stb->st_dev) ||
+		    parsed->minor != minor(stb->st_dev))
 			return false;
 		return true;
 	case FSID_NUM:
@@ -631,12 +676,16 @@ static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 		return true;
 	case FSID_UUID4_INUM:
 	case FSID_UUID16_INUM:
-		if (stb.st_ino != parsed->inode)
+		if (stb->st_ino != parsed->inode)
 			return false;
 		goto check_uuid;
 	case FSID_UUID8:
 	case FSID_UUID16:
-		if (!is_mountpoint(path))
+		if (!pi->mntp) {
+			pi->mountpoint = is_mountpoint(path);
+			pi->mntp = 1;
+		}
+		if (!pi->mountpoint)
 			return false;
 check_uuid:
 		if (exp->m_export.e_uuid) {
@@ -644,12 +693,18 @@ static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 			if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
 				return true;
 		}
-		else
+		else {
+			if (!pi->stfs) {
+				if (statfs64(path, &pi->statfsbuf) != 0)
+					return false;
+				pi->stfs = 1;
+			}
 			for (type = 0;
-			     uuid_by_path(path, type, parsed->uuidlen, u);
+			     uuid_by_path(path, &pi->statfsbuf, type, parsed->uuidlen, u);
 			     type++)
 				if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
 					return true;
+		}
 	}
 	/* Well, unreachable, actually: */
 	return false;
@@ -727,10 +782,18 @@ static void nfsd_fh(int f)
 	for (exp = exportlist[i].p_head; exp; exp = next_exp) {
 		char *path;
 
+		if (!is_ipaddr_client(dom)) {
+			if (!namelist_client_matches(exp, dom))
+				continue;
+		} else {
+			if (!ipaddr_client_matches(exp, ai))
+				continue;
+		}
+
 		if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
 			static nfs_export *prev = NULL;
 			static void *mnt = NULL;
-
+
 			if (prev == exp) {
 				/* try a submount */
 				path = next_mnt(&mnt, exp->m_export.e_path);
@@ -751,9 +814,6 @@ static void nfsd_fh(int f)
 			next_exp = exp->m_next;
 		}
 
-		if (!is_ipaddr_client(dom)
-		    && !namelist_client_matches(exp, dom))
-			continue;
 		if (exp->m_export.e_mountpoint &&
 		    !is_mountpoint(exp->m_export.e_mountpoint[0]?
 				   exp->m_export.e_mountpoint:
@@ -762,9 +822,6 @@ static void nfsd_fh(int f)
 		if (!match_fsid(&parsed, exp, path))
 			continue;
-		if (is_ipaddr_client(dom)
-		    && !ipaddr_client_matches(exp, ai))
-			continue;
 		if (!found || subexport(&exp->m_export, found)) {
 			found = &exp->m_export;
 			free(found_path);
@@ -906,7 +963,7 @@ static int dump_to_cache(int f, char *buf, int buflen, char *domain,
 	write_secinfo(&bp, &blen, exp, flag_mask);
 	if (exp->e_uuid == NULL || different_fs) {
 		char u[16];
-		if (uuid_by_path(path, 0, 16, u)) {
+		if (uuid_by_path(path, NULL, 0, 16, u)) {
 			qword_add(&bp, &blen, "uuid");
 			qword_addhex(&bp, &blen, u, 16);
 		}
-- 
2.12.2
* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-14  6:51 ` NeilBrown
@ 2017-07-25 21:12 ` Phil Kauffman
  2017-07-27  4:37 ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-25 21:12 UTC (permalink / raw)
To: linux-nfs

On 07/14/2017 01:51 AM, NeilBrown wrote:
> Please try this (against a clean nfs-utils, i.e. remove the previous
> patch). It is a hack and would need to be totally re-written, but
> hopefully the measurements you make and the strace you report will be
> useful.
> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
> before. It is easier to find the difference between two times with
> -ttt. And add -T as well.

Sorry this took so long, but I had to be sure of my results.

With your latest patch applied I am unable to mount my NFS shares, and
thus cannot continue with the tests from before.

Since the patch you provided cannot be applied cleanly to the source
used to build '1:1.2.8-9ubuntu12.1' for Ubuntu Xenial, I built a couple
of VMs to continue this testing. They should be identical to the
machines I was using before, except that they are underpowered and are
not using the nfs-common and nfs-kernel-server packages at version
1:1.2.8-9ubuntu12.1. Most importantly, the ZFS datasets and the files
in /etc/exports.d are the same.

I found that utils/mountd/cache.c was last modified in 1.3.4, so I
tested both 1.3.4 and 2.1.1 without your patch as a baseline. I was
able to mount my shares with both versions. I then reverted my VMs to a
state before the baseline install and installed again with your patch
applied.
I get the following with your patch applied on 1.3.4 and 2.1.1:

root@nfsclient:~# mount -avv
/                        : ignored
/boot                    : already mounted
none                     : ignored
mount.nfs4: timeout set for Tue Jul 25 15:43:13 2017
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting nfsserver.cs.uchicago.edu:/homes
mount.nfs4: timeout set for Tue Jul 25 15:43:14 2017
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options 'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting nfsserver.cs.uchicago.edu:/stage

Attached are my install notes and the strace during the failed mount above.

-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu

[-- Attachment #2: nfs_test_notes.txt --]

# all
umask 022
apt-get build-dep nfs-kernel-server
apt-get install rpcbind strace
systemctl enable rpcbind
git clone git://git.linux-nfs.org/projects/steved/nfs-utils.git
cd nfs-utils

# choose the version to test
#git checkout tags/nfs-utils-1-3-4 -b 1.3.4
#git checkout tags/nfs-utils-2-1-1 -b 2.1.1

# if patching
wget http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/nb2.patch
patch -p1 utils/mountd/cache.c < nb2.patch
rm nb2.patch

# Server
./autogen.sh && ./configure --with-systemd --disable-gss && make && make install
touch /etc/default/nfs-kernel-server /etc/default/nfs-common
mkdir -p /usr/libexec/nfs-utils
cp ../nfs-utils/debian/nfs-utils_env.sh /usr/libexec/nfs-utils/
touch /etc/exports && mkdir -p /etc/exports.d
systemctl enable nfs-server.service
systemctl enable nfs-utils.service
reboot

# client
./autogen.sh && ./configure --with-systemd --disable-gss && make && make install
touch /etc/default/nfs-common
chmod 644 /usr/lib/systemd/system/*
systemctl enable nfs-client.target
systemctl enable rpc-statd.service
reboot

root@nfsserver:~/nfs-utils# cat /etc/default/nfs-common
NEED_STATD=yes
STATDOPTS="--port 44634"
NEED_GSSD=

root@nfsserver:~/nfs-utils# cat /etc/default/nfs-kernel-server
RPCNFSDCOUNT=
RPCNFSDPRIORITY=0
RPCMOUNTDOPTS=
NEED_SVCGSSD=
RPCSVCGSSDOPTS=
RPCNFSDOPTS='--syslog'

[-- Attachment #3: strace_no_mount.txt --]

strace: Process 32552 attached
1501016020.534715 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [3]) <2.334481>
1501016022.869338 read(3, "nfsd 10.135.24.22\n", 32768) = 18 <0.000029>
1501016022.869443 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000058>
1501016022.869566 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000036>
1501016022.869673 close(20) = 0 <0.000045>
1501016022.870015 write(3, "nfsd 10.135.24.22 1501017822 nfs"..., 56) = 56 <0.000042>
1501016022.870094 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [5]) <0.189956>
1501016023.060124 read(5, "nfsclient.cs.uchicago.edu 1 \\x00"..., 32768) = 39 <0.000046>
1501016023.060220 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000028>
1501016023.060292 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000021>
1501016023.060350 close(20) = 0 <0.000022>
1501016023.060423 write(5, "nfsclient.cs.uchicago.edu 1 \\x00"..., 51) = 51 <0.000032>
1501016023.060488 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.050699>
1501016023.111265 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(51141), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"Yv\31\354\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000029>
1501016023.111435 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000030>
1501016023.111522 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000028>
1501016023.111610 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(51141), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"Yv\31\354\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000051>
1501016023.111727 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000232>
1501016023.112015 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{".\314P$\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000023>
1501016023.112093 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000054>
1501016023.112188 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000021>
1501016023.112254 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{".\314P$\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000039>
1501016023.112340 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000083>
1501016023.112466 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"/\314P$\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\1\0\0\0\1\0\0\0$"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 88 <0.000023>
1501016023.112537 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000022>
1501016023.112597 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000022>
1501016023.112669 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000027>
1501016023.112734 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000030>
1501016023.112856 close(20) = 0 <0.000057>
1501016023.112977 lstat("/homes", 0x7ffc0917b900) = -1 ENOENT (No such file or directory) <0.000059>
1501016023.113593 sendto(6, "<28>Jul 25 15:53:43 rpc.mountd[3"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 <0.000054>
1501016023.114333 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(989), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"/\314P$\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r", 28}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 28 <0.000065>
1501016023.114480 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.196515>
1501016023.311130 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(49539), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"YsnP\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000051>
1501016023.311292 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000042>
1501016023.311406 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000038>
1501016023.311504 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(49539), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"YsnP\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000064>
1501016023.311653 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000175>
1501016023.311907 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\f[kA\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 40 <0.000047>
1501016023.312044 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000050>
1501016023.312136 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000023>
1501016023.312224 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\f[kA\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 24 <0.000041>
1501016023.312317 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL) = 1 (in [15]) <0.000083>
1501016023.312440 recvmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\r[kA\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\1\0\0\0\1\0\0\0$"..., 8800}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=if_nametoindex("ens18"), ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 88 <0.000021>
1501016023.312513 stat("/etc/hosts.allow", {st_mode=S_IFREG|0644, st_size=411, ...}) = 0 <0.000026>
1501016023.312576 stat("/etc/hosts.deny", {st_mode=S_IFREG|0644, st_size=711, ...}) = 0 <0.000019>
1501016023.312635 open("/var/lib/nfs/etab", O_RDONLY) = 20 <0.000023>
1501016023.312691 fstat(20, {st_mode=S_IFREG|0644, st_size=234990, ...}) = 0 <0.000019>
1501016023.312741 close(20) = 0 <0.000019>
1501016023.312790 lstat("/stage", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000045>
1501016023.313043 sendto(6, "<28>Jul 25 15:53:43 rpc.mountd[3"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 <0.000027>
1501016023.313400 sendmsg(15, {msg_name(16)={sa_family=AF_INET, sin_port=htons(971), sin_addr=inet_addr("10.135.24.22")}, msg_iov(1)=[{"\r[kA\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r", 28}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, {ipi_ifindex=0, ipi_spec_dst=inet_addr("10.135.24.23"), ipi_addr=inet_addr("10.135.24.23")}}], msg_flags=0}, 0) = 28 <0.000036>
1501016023.313485 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL
* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-25 21:12 ` Phil Kauffman
@ 2017-07-27  4:37 ` NeilBrown
  2017-07-27 21:40 ` Phil Kauffman
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2017-07-27 4:37 UTC (permalink / raw)
To: Phil Kauffman, linux-nfs

On Tue, Jul 25 2017, Phil Kauffman wrote:
> On 07/14/2017 01:51 AM, NeilBrown wrote:
>> Please try this (against a clean nfs-utils, i.e. remove the previous
>> patch). It is a hack and would need to be totally re-written, but
>> hopefully the measurements you make and the strace you report will be
>> useful.
>> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
>> before. It is easier to find the difference between two times with
>> -ttt. And add -T as well.
>
> Sorry this took so long but I had to be sure of my results.

It is worth being thorough.

> With your latest patch applied I am unable to mount my NFS shares and
> thus cannot continue with the tests from before.

Weird.

You are hitting a bug that was very recently fixed: when mount.nfs4
gets the error ENOENT from the server, it falls back to NFSv3. That
explains some of the noise, but doesn't explain why you get ENOENT for
the v4 mount.

The strace output you've provided doesn't even show any attempts to
read /etc/mtab, which my patch doesn't change at all. So it seems like
the context is different in some way.

Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
... maybe that is done automatically somehow.

Can you try with unpatched 2.1.1?

Also, maybe provide an strace starting before any attempt to mount
anything, and with an extra option "-s 1000".

Thanks,
NeilBrown
* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-27  4:37 ` NeilBrown
@ 2017-07-27 21:40 ` Phil Kauffman
  2017-07-31  4:23 ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Kauffman @ 2017-07-27 21:40 UTC (permalink / raw)
To: linux-nfs

On 07/26/2017 11:37 PM, NeilBrown wrote:
> Weird.
I thought so as well.

> So it seems like the context is different in some way.
It is, in the sense that I am now using vanilla 1.3.4 and 2.1.1 on
Ubuntu Xenial (which provides 1.2.8) without any of Debian's patches or
niceties.

> Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
> ... maybe that is done automatically somehow.
I copied the files from the original NFS server into /etc/exports.d as
part of the VM snapshot I revert back to. My notes say
'mkdir -p /etc/exports.d', but this step is not necessary.

root@nfsserver:~# ls /etc/exports.d/ | wc -l
972

I only added 3 extra files so that 'nfsclient' would be able to mount
from the testing server (nfsserver).

The ZFS datasets were also precreated:

root@nfsserver:~# zfs list | wc -l
6062

> Can you try with unpatched 2.1.1?
I assume you mean without any patches: vanilla 2.1.1.

> Also, maybe provide an strace starting before any attempt to mount
> anything, and with an extra option "-s 1000".

This is all that happens *before* trying to mount from the client:

root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
strace: Process 2517 attached
1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL

Stracing rpc.nfsd:

root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1 | tee strace_nfsd.txt; /usr/sbin/sm-notify
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt

When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason):
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt

-- 
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
kauffman@cs.uchicago.edu
* Re: /etc/mtab read ~900 times by rpc.mountd
  2017-07-27 21:40 ` Phil Kauffman
@ 2017-07-31  4:23 ` NeilBrown
  0 siblings, 0 replies; 16+ messages in thread
From: NeilBrown @ 2017-07-31 4:23 UTC (permalink / raw)
To: Phil Kauffman, linux-nfs

On Thu, Jul 27 2017, Phil Kauffman wrote:
> On 07/26/2017 11:37 PM, NeilBrown wrote:
>> Weird.
> I thought so as well.
>
>> So it seems like the context is different in some way.
> It is, in the sense that I am now using vanilla 1.3.4 and 2.1.1 on
> Ubuntu Xenial (which provides 1.2.8) without any of Debian's patches
> or niceties.
>
>> Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
>> ... maybe that is done automatically somehow.
> I copied the files from the original NFS server into /etc/exports.d as
> part of the VM snapshot I revert back to. My notes say
> 'mkdir -p /etc/exports.d', but this step is not necessary.
>
> root@nfsserver:~# ls /etc/exports.d/ | wc -l
> 972
>
> I only added 3 extra files so that 'nfsclient' would be able to mount
> from the testing server (nfsserver).
>
> The ZFS datasets were also precreated:
>
> root@nfsserver:~# zfs list | wc -l
> 6062
>
>> Can you try with unpatched 2.1.1?
> I assume you mean without any patches: vanilla 2.1.1.
>
>> Also, maybe provide an strace starting before any attempt to mount
>> anything, and with an extra option "-s 1000".
>
> This is all that happens *before* trying to mount from the client:
>
> root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
> strace: Process 2517 attached
> 1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL
>
> Stracing rpc.nfsd:
>
> root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1 | tee strace_nfsd.txt; /usr/sbin/sm-notify
> http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt
>
> When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason):
> http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt

I'm sorry, but I cannot get anything useful from these. It isn't
crystal clear what was being tested in each case. Could you try:

1/ Build the current 'master' branch from the nfs-utils git tree, with
   no extra patches.

2/ Install that and perform the same test which produced the 1.3G
   strace file above. Provide the resulting strace of rpc.mountd.

3/ Tell me what happened when you mounted on the client. Did it work as
   expected?

4/ Apply the patch I provided on top of the 'master' branch, but change
      hcreate_r(1000, &hdata);
   to
      hcreate_r(20000, &hdata);

5/ Install that and perform the same test as in step 2, producing a new
   strace of rpc.mountd. Presumably it will be different.

6/ Tell me what happened when you mounted on the client. Did it work as
   expected, or did it fail like I think it did last time?

If one worked and the other didn't, I can compare the straces and try
to find a difference. If both worked, then we should have useful
information about how much speed-up the patch provides (if any). If
neither worked, then something weird is happening...

NeilBrown
end of thread, other threads: [~2017-07-31 4:23 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-06 22:43 /etc/mtab read ~900 times by rpc.mountd Phil Kauffman
2017-07-07  2:30 ` NeilBrown
2017-07-07  5:20 ` Phil Kauffman
2017-07-10  7:17 ` NeilBrown
2017-07-10 11:47 ` Karel Zak
2017-07-10 23:19 ` NeilBrown
2017-07-10 20:59 ` Phil Kauffman
2017-07-10 23:51 ` NeilBrown
2017-07-11 15:39 ` Phil Kauffman
2017-07-12  0:46 ` NeilBrown
2017-07-12 17:55 ` Phil Kauffman
2017-07-14  6:51 ` NeilBrown
2017-07-25 21:12 ` Phil Kauffman
2017-07-27  4:37 ` NeilBrown
2017-07-27 21:40 ` Phil Kauffman
2017-07-31  4:23 ` NeilBrown