Linux-NFS Archive on lore.kernel.org
From: Daire Byrne <daire@dneg.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: bfields <bfields@fieldses.org>,
	linux-cachefs <linux-cachefs@redhat.com>,
	linux-nfs <linux-nfs@vger.kernel.org>
Subject: Re: Adventures in NFS re-exporting
Date: Wed, 21 Oct 2020 10:33:52 +0100 (BST)
Message-ID: <635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com>
In-Reply-To: <279389889.68934777.1603124383614.JavaMail.zimbra@dneg.com>

----- On 19 Oct, 2020, at 17:19, Daire Byrne daire@dneg.com wrote:
> ----- On 16 Sep, 2020, at 17:01, Daire Byrne daire@dneg.com wrote:
> 
>> Trond/Bruce,
>> 
>> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:
>> 
>>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>>> > (under memory cache churn) when those filehandles are still being
>>>> > used by the knfsd's remote clients resulting in sporadic and random
>>>> > stale filehandles. This seems to be mostly for directories from
>>>> > what I've seen. Does the NFS client not know that knfsd is still
>>>> > using those files/dirs? The workaround is to never drop inode &
>>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>>> > also helps to ensure that we actually make the most of our
>>>> > actimeo=3600,nocto mount options for the full specified time.
>>>> 
>>>> I thought reexport worked by embedding the original server's
>>>> filehandles in the filehandles given out by the reexporting server.
>>>> 
>>>> So, even if nothing's cached, when the reexporting server gets a
>>>> filehandle, it should be able to extract the original filehandle from
>>>> it and use that.
>>>> 
>>>> I wonder why that's not working?
>>> 
>>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>>> callback for it.
>> 
>> So in terms of the ESTALE counter on the reexport server, we see it increase if
>> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
>> difference in the client experience in that with NFSv3 we quickly get
>> input/output errors but with NFSv4 we don't. But it does seem like the
>> performance drops significantly which makes me think that NFSv4 retries the
>> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
>> 
>> This is the simplest reproducer I could come up with but it may still be
>> specific to our workloads/applications and hard to replicate exactly.
>> 
>> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro reexport-server:/vol/software /mnt/software
>> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee /proc/sys/vm/drop_caches; done
>> 
>> reexport-server # sysctl -w vm.vfs_cache_pressure=100
>> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
>> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep 10; done
>> 
>> Where "application" is some large application with lots of paths to scan and
>> libs to memory map, and "/vol/software" is an NFS mount on the reexport-server
>> from another originating NFS server. I don't know why this application loading
>> workload shows this best, but perhaps the access patterns of memory-mapped
>> binaries and libs are particularly susceptible to ESTALE?
>> 
>> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
>> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
>> The ESTALE count increases and the client running the application reports
>> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
>> 
>> As soon as we switch to vfs_cache_pressure=0, the repeated drop_caches on the
>> reexport server no longer cull the dentry or nfs_inode_cache, the ESTALE counter
>> no longer increases and the client experiences no issues (NFSv3 & NFSv4).
> 
> I don't suppose anyone has any more thoughts on this one? This is likely the
> first problem that anyone trying to NFS re-export is going to encounter. If
> they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are
> dropped from cache (with the default vfs_cache_pressure=100) and if they
> re-export NFSv4, the lookup performance will drop significantly as an ESTALE
> triggers re-lookups.
> 
> For our particular use case, it is actually desirable to have
> vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to
> help with expensive metadata lookups, but it would still be nice to have the
> option of using a less drastic setting (such as vfs_cache_pressure=1) to help
> avoid OOM conditions.

Trond has posted patches (v3 of the series) to emulate lookupp for NFSv3 (a million thanks!), so I applied them to v5.9.1 and ran some more tests with that kernel on the re-export server. Again, I pathologically dropped the inode & dentry caches every second on the re-export server (vfs_cache_pressure=100) while a client looped through some application loading tests.
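
A minimal sketch of that cache-dropping loop, assuming root on the re-export server. The function name and its three arguments (target file, iteration count, interval) are illustrative and not part of the original test setup; writing 3 to drop_caches asks the kernel to free the page cache as well as dentries and inodes.

```shell
# Hypothetical helper: write "3" to a drop_caches control file at a fixed interval.
#   $1  file to write to (defaults to the real kernel knob)
#   $2  number of iterations; 0 means loop until interrupted
#   $3  seconds to sleep between writes
drop_caches_loop() {
    target=${1:-/proc/sys/vm/drop_caches}
    count=${2:-0}
    interval=${3:-1}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # 3 = free page cache plus reclaimable dentries and inodes
        echo 3 > "$target"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Calling drop_caches_loop with no arguments as root would drop caches once a second until interrupted; passing a scratch file as the first argument allows a dry run without touching the kernel.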

Now for every combination of re-export (including NFSv3 -> NFSv4.x and NFSv4.x -> NFSv3), I no longer see any stale file handles reported in /proc/net/rpc/nfsd when dropping inode & dentry caches (yay!).
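
One way to watch that counter, sketched under the assumption that the stale filehandle count is the first number on the "fh" line of /proc/net/rpc/nfsd (the same field the awk one-liner quoted earlier prints). The function names are hypothetical, and the match is on the first field being exactly "fh" rather than on any line containing "fh".

```shell
# Hypothetical helpers for reading the knfsd stale-filehandle counter.

# Print the first number on the "fh" line of an nfsd stats file.
nfsd_stale_count() {
    stats=${1:-/proc/net/rpc/nfsd}
    awk '$1 == "fh" { print $2; exit }' "$stats"
}

# Sample the counter twice, $2 seconds apart, and print how much it grew.
nfsd_stale_delta() {
    stats=${1:-/proc/net/rpc/nfsd}
    interval=${2:-10}
    before=$(nfsd_stale_count "$stats")
    sleep "$interval"
    after=$(nfsd_stale_count "$stats")
    echo $((after - before))
}
```

A delta of 0 over a test run with the cache-dropping loop active would correspond to the "no stale file handles" result described above.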

However, my assumption that some of the input/output errors I was seeing were related to the ESTALEs seems to have been wrong. After running these tests again without any ESTALEs, it now looks like a different issue that is unique to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or NetApp). The lookups are all fine (no ESTALE), but reading some files eventually gives an input/output error on multiple clients, and the errors persist until the re-export nfs-server is restarted. Again, this only occurs while dropping inode + dentry caches.
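
A rough client-side check for this kind of read failure, as a sketch only: it tries to read every regular file under a directory and counts the reads that fail for any reason, so it does not distinguish an input/output error from, say, a permission error. The function name and the output format are made up for illustration; the original tests used application loading rather than a script like this.

```shell
# Hypothetical helper: read every regular file under $1 and report how many
# reads failed. Prints "total=N failed=M".
count_read_errors() {
    dir=$1
    # One line per file: "ok" if cat could read it to the end, "fail" otherwise.
    results=$(find "$dir" -type f -exec sh -c \
        'if cat -- "$1" >/dev/null 2>&1; then echo ok; else echo fail; fi' sh {} \;)
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    total=$(printf '%s\n' "$results" | grep -c . || true)
    failed=$(printf '%s\n' "$results" | grep -c '^fail$' || true)
    echo "total=$total failed=$failed"
}
```

Run against the re-export mount point on a client (for example /mnt/software from the reproducer), a non-zero failed count that stays non-zero across repeated runs would match the persistent errors described above.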

So in summary, while continuously dropping inode/dentry caches on the re-export server:

originating server NFSv4.x     -> NFSv4.x re-export server = good (no ESTALE, no input/output errors)
originating server NFSv4.1/4.2 -> NFSv3 re-export server   = good
originating server NFSv4.0     -> NFSv3 re-export server   = no ESTALE but lots of input/output errors
originating server NFSv3       -> NFSv3 re-export server   = good (fixed by Trond's lookupp emulation patches)
originating server NFSv3       -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)

In our case, we are stuck with some old 7-mode NetApps, so we only have two mount choices: NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0 re-export behaviour). And as discussed previously, a re-export of an NFSv3 server requires my horrible hack in order to avoid excessive lookups and client cache invalidations.

But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export cases, so many thanks again for that, Trond. When re-exporting an NFSv3 client mount, we no longer need to set vfs_cache_pressure=0.

Daire


