From: Daire Byrne <daire@dneg.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs <linux-nfs@vger.kernel.org>,
linux-cachefs <linux-cachefs@redhat.com>
Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting
Date: Thu, 1 Oct 2020 01:09:29 +0100 (BST) [thread overview]
Message-ID: <1309604906.55950004.1601510969548.JavaMail.zimbra@dneg.com> (raw)
In-Reply-To: <97eff1ee2886c14bcd7972b17330f18ceacdef78.camel@kernel.org>
----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> Hi,
>>
>> I just thought I'd flesh out the other two issues I have found with re-exporting
>> that are ultimately responsible for the biggest performance bottlenecks. And
>> both of them revolve around the caching of metadata file lookups in the NFS
>> client.
>>
>> Especially for the case where we are re-exporting a server many milliseconds
>> away (i.e. on-premise -> cloud), we want to be able to control how much the
>> client caches metadata and file data so that its many LAN clients all benefit
>> from the re-export server only having to do the WAN lookups once (within a
>> specified coherency time).
>>
>> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> is fairly straightforward, but keeping the metadata cached is particularly
>> difficult. And without the cached metadata we introduce long delays before we
>> can serve the already present and locally cached file data to many waiting
>> clients.
>>
>> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
>> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > cut the network packets back to the origin server to zero for repeated lookups.
>> > However, if a client of the re-export server walks paths and memory maps those
>> > files (i.e. loading an application), the re-export server starts issuing
>> > unexpected calls back to the origin server again, ignoring/invalidating the
>> > re-export server's NFS client cache. We worked around this by patching an
>> > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > re-export server is used. I'm not sure about the correctness of this patch but
>> > it works for our corner case.
>>
>> If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> re-export server, we can successfully cache the loading of applications and
>> walking of paths directly on the re-export server such that after a couple of
>> runs, there are practically zero packets back to the originating NFS server
>> (great!). But, if we then do the same thing on a client which is mounting that
>> re-export server, the re-export server now starts issuing lots of calls back to
>> the originating server and invalidating its client cache (bad!).
>>
>> I'm not exactly sure why, but the iversion of the inode gets changed locally
>> (due to atime modification?), most likely via inode_inc_iversion_raw(). Each
>> time it is incremented, the next attribute validation detects a change and
>> the attributes are reloaded from the originating server.
>>
>
> I'd expect the change attribute to track what's in actual inode on the
> "home" server. The NFS client is supposed to (mostly) keep the raw
> change attribute in its i_version field.
>
> The only place we call inode_inc_iversion_raw is in
> nfs_inode_add_request, which I don't think you'd be hitting unless you
> were writing to the file while holding a write delegation.
>
> What sort of server is hosting the actual data in your setup?
We mostly use RHEL7.6 NFS servers with XFS-backed filesystems, plus a couple of (older) Netapps. The re-export server is running the latest mainline kernel(s).
As far as I can make out, both of these originating (home) server types have a similar (but not identical) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it using only a read-only mount at every hop, so I don't think writes are related.
Our RHEL7 NFS servers actually mount XFS with noatime too, so any atime updates that might be causing this client invalidation (which is what I initially thought) are ultimately wasted effort anyway.
>> This patch helps to avoid this when applied to the re-export server but there
>> may be other places where this happens too. I accept that this patch is
>> probably not the right/general way to do this, but it helps to highlight the
>> issue when re-exporting and it works well for our use case:
>>
>> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27 00:23:03.000000000 +0000
>> +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
>> @@ -1869,7 +1869,7 @@
>>
>> /* More cache consistency checks */
>> if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
>> - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>> /* Could it be a race with writeback? */
>> if (!(have_writers || have_delegation)) {
>> invalid |= NFS_INO_INVALID_DATA
>>
>> With this patch, the re-export server's NFS client attribute cache is maintained
>> and used by all the clients that then mount it. When many hundreds of clients
>> are all doing similar things at the same time, the re-export server's NFS
>> client cache is invaluable in accelerating the lookups (getattrs).
>>
>> Perhaps a more correct approach would be to detect when it is knfsd that is
>> accessing the client mount and change the cache consistency checks accordingly?
>
> Yeah, I don't think you can do this for the reasons Trond outlined.
Yeah, I kind of felt it wasn't quite right, but I didn't know enough about the intricacies to say exactly why. So thanks to everyone for clearing that up for me.
We just followed the code and found that the re-export server spent a lot of time in this code block, even though we expected to be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.
While the patch definitely helps to better utilise the re-export server's NFS client cache when exporting via knfsd, we do still see many repeat getattrs per minute for the same files on the re-export server when 100s of clients are all reading the same files. So this is probably not the only place where reading an NFS client mount through a knfsd export invalidates the re-export server's NFS client cache.
Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means that we can do expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours where we repeatedly read the same files from many clients, it can lead to big savings over the WAN.
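For concreteness, a relaxed-coherency setup of the kind described above might look like this on the re-export server (hostnames, paths, and export options here are illustrative placeholders, not our exact configuration):

```shell
# On the re-export server: mount the remote (WAN) origin read-only with a
# long attribute cache timeout and close-to-open consistency disabled.
mount -t nfs -o ro,vers=3,nocto,actimeo=3600 origin.example.com:/export /srv/reexport

# Then re-export it to the LAN clients via knfsd, e.g. in /etc/exports:
#   /srv/reexport  192.168.0.0/16(ro,no_subtree_check,fsid=1)
exportfs -ra
```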
But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.
Daire