From: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
To: Daire Byrne <daire@dneg.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>, linux-cachefs@redhat.com
Subject: Re: Adventures in NFS re-exporting
Date: Tue, 8 Sep 2020 11:40:47 +0200 (CEST) [thread overview]
Message-ID: <1642729052.4237865.1599558047149.JavaMail.zimbra@desy.de> (raw)
In-Reply-To: <943482310.31162206.1599499860595.JavaMail.zimbra@dneg.com>
Just out of curiosity:
did you ever try, instead of re-exporting the nfs mount directly,
re-exporting an overlayfs mount layered on top of the original nfs mount?
Such a setup should cover most of your issues.
Regards,
Tigran.
----- Original Message -----
> From: "Daire Byrne" <daire@dneg.com>
> To: "linux-nfs" <linux-nfs@vger.kernel.org>
> Cc: linux-cachefs@redhat.com
> Sent: Monday, September 7, 2020 7:31:00 PM
> Subject: Adventures in NFS re-exporting
> Hi,
>
> Apologies for this rather long email, but I thought there may be some interest
> out there in the community in how and why we've been doing something
> unsupported and barely documented - NFS re-exporting! And I'm not sure I can
> tell our story well in just a few short sentences so please bear with me (or
> stop now!).
>
> Full disclosure - I am also rather hoping that this story piques some interest
> amongst developers to help make our rather niche setup even better and perhaps
> a little better documented. I also totally understand if this is something
> people wouldn't want to touch with a very long barge pole....
>
> First a quick bit of history (I hope I have this right). Late in 2015, Jeff
> Layton proposed a patch series allowing knfsd to re-export an NFS client mount.
> The rationale then was to provide a "proxy" server that could mount an NFSv4
> only server and re-export it to older clients that only supported NFSv3. One of
> the main sticking points then (as now) was the 63-byte limit on NFSv3
> filehandles and the fact that it couldn't be guaranteed that all re-exported
> filehandles would fit within it (in my experience it mostly works with
> "no_subtree_check"). There are also the usual locking and coherence concerns
> with NFSv3 too, but I'll get to those in a bit.
>
> Then almost two years later, v4.13 was released including the parts of the
> patch series that actually allowed the re-export, and since then other relevant
> bits (such as the open file cache) have also been merged. I soon became
> interested in using this new functionality both to accelerate our on-premises
> NFS storage and to use it as a "WAN cache", giving cloud compute instances
> locally cached proxy access to our on-premises storage.
>
> Cut to a brief introduction to us and what we do... DNEG is an award-winning
> VFX company which uses large compute farms to generate complex final frame
> renders for movies and TV. This workload mostly consists of reads of common
> data shared between many render clients (e.g. textures, geometry) and a little
> unique data per frame. All file writes are to unique files per process (frames)
> and there is very little, if any, writing over existing files. Hence it's not
> very demanding on locking and coherence guarantees.
>
> When our on-premises NFS storage is being overloaded or the server's network is
> maxed out, we can place multiple re-export servers in between them and our farm
> to improve performance. When our on-premises render farm is not quite big
> enough to meet a deadline, we spin up compute instances with a (reasonably
> local) cloud provider. Some of these cloud instances are Linux NFS servers
> which mount our on-premises NFS storage servers (~10ms away) and re-export
> these to the other cloud (render) instances. Since we know that the data we are
> reading doesn't change often, we can increase the actimeo and even use nocto to
> reduce the network chatter back to the on-prem servers. These re-export servers
> also use fscache/cachefiles to cache data to disk so that we can retain TBs of
> previously read data locally in the cloud over long periods of time. We also
> use NFSv4 (less network chatter) all the way from our on-prem storage to the
> re-export server and then on to the clients.
>
> The re-export server(s) quickly builds up both a memory cache and disk backed
> fscache/cachefiles storage cache of our working data set so the data being
> pulled from on-prem lessens over time. Data is only ever read once over the WAN
> network from on-prem storage and then read multiple times by the many render
> client instances in the cloud. Recent NFS features such as "nconnect" help to
> speed up the initial reading of data from on-prem by using multiple connections
> to offset TCP latency. At the end of the render, we write the files back
> through the re-export server to our on-prem storage. Our average read bandwidth
> is many times higher than our write bandwidth.
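For concreteness, the setup described above amounts to something like the following sketch. The hostname, paths, network range and exact option values are illustrative placeholders (not our production configuration):

```shell
# On the cloud re-export server: mount the on-prem filer over the WAN.
# actimeo=3600 and nocto cut attribute revalidation chatter back to the
# origin, nconnect opens multiple TCP connections to offset WAN latency,
# and fsc enables the fscache/cachefiles disk cache.
mount -t nfs -o vers=4.2,actimeo=3600,nocto,nconnect=16,fsc \
    onprem-filer:/export /srv/reexport

# Re-export the NFS client mount to the cloud render instances. An
# explicit fsid is required when exporting something that is not a local
# block filesystem, and no_subtree_check helps keep the re-exported
# NFSv3 filehandles within the size limit.
cat > /etc/exports.d/reexport.exports <<'EOF'
/srv/reexport  10.0.0.0/16(rw,no_subtree_check,fsid=1,crossmnt)
EOF
exportfs -ra
```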
>
> Rather surprisingly, this mostly works for our particular workloads. We've
> completed movies using this setup and saved money on commercial caching systems
> (e.g. Avere, GPFS, etc.). But there are still some remaining issues with doing
> something that is very much not widely supported (or recommended). In most
> cases we have worked around them, but it would be great if we didn't have to so
> others could also benefit. I will list the main problems quickly now and
> provide more information and reproducers later if anyone is interested.
>
> 1) The kernel can drop entries out of the NFS client inode cache (under memory
> cache churn) when those filehandles are still being used by the knfsd's remote
> clients resulting in sporadic and random stale filehandles. This seems to be
> mostly for directories from what I've seen. Does the NFS client not know that
> knfsd is still using those files/dirs? The workaround is to never drop inode &
> dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps
> to ensure that we actually make the most of our actimeo=3600,nocto mount
> options for the full specified time.
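As a concrete sketch of that workaround (the file name is arbitrary; a value of 1 strongly biases the kernel toward keeping these caches rather than disabling reclaim outright):

```
# /etc/sysctl.d/90-nfs-reexport.conf on the re-export server.
# Default is 100; 1 makes the kernel very reluctant to reclaim the
# dentry and inode caches, so cached filehandles and attributes
# survive memory pressure for the full actimeo window.
vm.vfs_cache_pressure = 1
```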
>
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> cut the network packets back to the origin server to zero for repeated lookups.
> However, if a client of the re-export server walks paths and memory maps those
> files (i.e. loading an application), the re-export server starts issuing
> unexpected calls back to the origin server again, ignoring/invalidating the
> re-export server's NFS client cache. We worked around this by patching an
> inode/iversion validity check in inode.c so that the NFS client cache on the
> re-export server is used. I'm not sure about the correctness of this patch but
> it works for our corner case.
>
> 3) If we saturate an NFS client's network with reads from the server, all client
> metadata lookups become unbearably slow even if it's all cached in the NFS
> client's memory and no network RPCs should be required. This is the case for
> any NFS client regardless of re-exporting but it affects this case more because
> when we can't serve cached metadata we also can't serve the cached data. It
> feels like some sort of bottleneck in the client's ability to parallelise
> requests? We work around this by not maxing out our network.
>
> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per second)
> quickly eat up the CPU on the re-export server and perf top shows we are mostly
> in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache
> like that added to NFSv3? Our workaround is to either fix the thing doing lots
> of repeated open/closes or use NFSv3 instead.
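For reference, the symptom in (4) can be seen with a plain system-wide perf session on the re-export server; no special instrumentation is needed:

```shell
# Run on the re-export server while clients issue lots of opens/closes.
# With an NFSv4 re-export under this load, the top kernel symbol we see
# is native_queued_spin_lock_slowpath.
perf top -g --sort symbol
```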
>
> If you made it this far, I've probably taken up way too much of your valuable
> time already. If nobody is interested in this rather niche application of the
> Linux client & knfsd, then I totally understand and I will not mention it here
> again. If your interest is piqued however, I'm happy to go into more detail
> about any of this with the hope that this could become a better documented and
> understood type of setup that others with similar workloads could reference.
>
> Also, many thanks to all the Linux NFS developers for the amazing work you do
> which, in turn, helps us to make great movies. :)
>
> Daire (Head of Systems DNEG)
Thread overview: 129+ messages
2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
2020-09-08 9:40 ` Mkrtchyan, Tigran [this message]
2020-09-08 11:06 ` Daire Byrne
2020-09-15 17:21 ` J. Bruce Fields
2020-09-15 19:59 ` Trond Myklebust
2020-09-16 16:01 ` Daire Byrne
2020-10-19 16:19 ` Daire Byrne
2020-10-19 17:53 ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 17:53 ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 17:53 ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-19 20:05 ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 20:05 ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 20:05 ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37 ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-20 18:37 ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-20 18:37 ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37 ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
2020-10-21 9:33 ` Adventures in NFS re-exporting Daire Byrne
2020-11-09 16:02 ` bfields
2020-11-12 13:01 ` Daire Byrne
2020-11-12 13:57 ` bfields
2020-11-12 18:33 ` Daire Byrne
2020-11-12 20:55 ` bfields
2020-11-12 23:05 ` Daire Byrne
2020-11-13 14:50 ` bfields
2020-11-13 22:26 ` bfields
2020-11-14 12:57 ` Daire Byrne
2020-11-16 15:18 ` bfields
2020-11-16 15:53 ` bfields
2020-11-16 19:21 ` Daire Byrne
2020-11-16 15:29 ` Jeff Layton
2020-11-16 15:56 ` bfields
2020-11-16 16:03 ` Jeff Layton
2020-11-16 16:14 ` bfields
2020-11-16 16:38 ` Jeff Layton
2020-11-16 19:03 ` bfields
2020-11-16 20:03 ` Jeff Layton
2020-11-17 3:16 ` bfields
2020-11-17 3:18 ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
2020-11-17 3:18 ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-17 12:34 ` Jeff Layton
2020-11-17 15:26 ` J. Bruce Fields
2020-11-17 15:34 ` Jeff Layton
2020-11-20 22:38 ` J. Bruce Fields
2020-11-20 22:39 ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
2020-11-20 22:39 ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
2020-11-20 22:39 ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
2020-11-21 0:34 ` Jeff Layton
2020-11-20 22:39 ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
2020-11-20 22:39 ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
2020-11-20 22:39 ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
2020-11-21 0:58 ` Jeff Layton
2020-11-21 1:01 ` J. Bruce Fields
2020-11-21 13:00 ` Jeff Layton
2020-11-20 22:39 ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
2020-11-20 22:39 ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
2020-11-20 22:44 ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-21 1:03 ` Jeff Layton
2020-11-21 21:44 ` Daire Byrne
2020-11-22 0:02 ` bfields
2020-11-22 1:55 ` Daire Byrne
2020-11-22 3:03 ` bfields
2020-11-23 20:07 ` Daire Byrne
2020-11-17 15:25 ` J. Bruce Fields
2020-11-17 3:18 ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
2020-11-17 12:27 ` Jeff Layton
2020-11-17 14:14 ` J. Bruce Fields
2020-11-17 3:18 ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
2020-11-17 12:34 ` Jeff Layton
2020-11-24 20:35 ` Adventures in NFS re-exporting Daire Byrne
2020-11-24 21:15 ` bfields
2020-11-24 22:15 ` Frank Filz
2020-11-25 14:47 ` 'bfields'
2020-11-25 16:25 ` Frank Filz
2020-11-25 19:03 ` 'bfields'
2020-11-26 0:04 ` Frank Filz
2020-11-25 17:14 ` Daire Byrne
2020-11-25 19:31 ` bfields
2020-12-03 12:20 ` Daire Byrne
2020-12-03 18:51 ` bfields
2020-12-03 20:27 ` Trond Myklebust
2020-12-03 21:13 ` bfields
2020-12-03 21:32 ` Frank Filz
2020-12-03 21:34 ` Trond Myklebust
2020-12-03 21:45 ` Frank Filz
2020-12-03 21:57 ` Trond Myklebust
2020-12-03 22:04 ` bfields
2020-12-03 22:14 ` Trond Myklebust
2020-12-03 22:39 ` Frank Filz
2020-12-03 22:50 ` Trond Myklebust
2020-12-03 23:34 ` Frank Filz
2020-12-03 22:44 ` bfields
2020-12-03 21:54 ` bfields
2020-12-03 22:45 ` bfields
2020-12-03 22:53 ` Trond Myklebust
2020-12-03 23:16 ` bfields
2020-12-03 23:28 ` Frank Filz
2020-12-04 1:02 ` Trond Myklebust
2020-12-04 1:41 ` bfields
2020-12-04 2:27 ` Trond Myklebust
2020-09-17 16:01 ` Daire Byrne
2020-09-17 19:09 ` bfields
2020-09-17 20:23 ` Frank van der Linden
2020-09-17 21:57 ` bfields
2020-09-19 11:08 ` Daire Byrne
2020-09-22 16:43 ` Chuck Lever
2020-09-23 20:25 ` Daire Byrne
2020-09-23 21:01 ` Frank van der Linden
2020-09-26 9:00 ` Daire Byrne
2020-09-28 15:49 ` Frank van der Linden
2020-09-28 16:08 ` Chuck Lever
2020-09-28 17:42 ` Frank van der Linden
2020-09-22 12:31 ` Daire Byrne
2020-09-22 13:52 ` Trond Myklebust
2020-09-23 12:40 ` J. Bruce Fields
2020-09-23 13:09 ` Trond Myklebust
2020-09-23 17:07 ` bfields
2020-09-30 19:30 ` [Linux-cachefs] " Jeff Layton
2020-10-01 0:09 ` Daire Byrne
2020-10-01 10:36 ` Jeff Layton
2020-10-01 12:38 ` Trond Myklebust
2020-10-01 16:39 ` Jeff Layton
2020-10-05 12:54 ` Daire Byrne
2020-10-13 9:59 ` Daire Byrne
2020-10-01 18:41 ` J. Bruce Fields
2020-10-01 19:24 ` Trond Myklebust
2020-10-01 19:26 ` bfields
2020-10-01 19:29 ` Trond Myklebust
2020-10-01 19:51 ` bfields