linux-nfs.vger.kernel.org archive mirror
* Adventures in NFS re-exporting
@ 2020-09-07 17:31 Daire Byrne
  2020-09-08  9:40 ` Mkrtchyan, Tigran
                   ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-07 17:31 UTC (permalink / raw)
  To: linux-nfs; +Cc: linux-cachefs

Hi,

Apologies for this rather long email, but I thought there may be some interest out there in the community in how and why we've been doing something unsupported and barely documented - NFS re-exporting! And I'm not sure I can tell our story well in just a few short sentences so please bear with me (or stop now!).

Full disclosure - I am also rather hoping that this story piques some interest amongst developers to help make our rather niche setup even better and perhaps a little better documented. I also totally understand if this is something people wouldn't want to touch with a very long barge pole....

First a quick bit of history (I hope I have this right). Late in 2015, Jeff Layton proposed a patch series allowing knfsd to re-export an NFS client mount. The rationale then was to provide a "proxy" server that could mount an NFSv4-only server and re-export it to older clients that only supported NFSv3. One of the main sticking points then (as now) was the 64-byte filehandle limit of NFSv3 and the fact that it couldn't be guaranteed that all re-exported filehandles would fit within that (in my experience it mostly works with "no_subtree_check"). There are also the usual locking and coherence concerns with NFSv3 too, but I'll get to that in a bit.

Then almost two years later, v4.13 was released, including the parts of the patch series that actually allowed the re-export, and since then other relevant bits (such as the open file cache) have also been merged. I soon became interested in using this new functionality both to accelerate our on-premises NFS storage and to act as a "WAN cache", providing cloud compute instances locally cached proxy access to our on-premises storage.

Cut to a brief introduction to us and what we do... DNEG is an award-winning VFX company which uses large compute farms to generate complex final-frame renders for movies and TV. This workload mostly consists of reads of common data shared between many render clients (e.g. textures, geometry) and a little unique data per frame. All file writes are to unique files per process (frames), and there is very little, if any, writing over existing files. Hence it's not very demanding on locking and coherence guarantees.

When our on-premises NFS storage is being overloaded or the server's network is maxed out, we can place multiple re-export servers in between them and our farm to improve performance. When our on-premises render farm is not quite big enough to meet a deadline, we spin up compute instances with a (reasonably local) cloud provider. Some of these cloud instances are Linux NFS servers which mount our on-premises NFS storage servers (~10ms away) and re-export these to the other cloud (render) instances. Since we know that the data we are reading doesn't change often, we can increase the actimeo and even use nocto to reduce the network chatter back to the on-prem servers. These re-export servers also use fscache/cachefiles to cache data to disk so that we can retain TBs of previously read data locally in the cloud over long periods of time. We also use NFSv4 (less network chatter) all the way from our on-prem storage to the re-export server and then on to the clients.
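For illustration, a stripped-down sketch of one of these re-export servers might look something like this. The hostnames, paths and fsid here are placeholders rather than our real config, and the exact option values are just representative:

```shell
# On the re-export server: mount the on-prem storage with a long
# attribute cache timeout, no close-to-open consistency, and fscache
# enabled ("onprem" and the paths are hypothetical). nconnect opens
# multiple TCP connections to help offset WAN latency.
mount -t nfs -o vers=4.2,nconnect=8,actimeo=3600,nocto,fsc \
    onprem:/vol/projects /srv/projects

# cachefilesd provides the disk-backed fscache store on local disk.
systemctl start cachefilesd

# Re-export the NFS client mount via knfsd. An explicit fsid is
# required because an NFS mount has no stable local device number,
# and no_subtree_check helps keep re-exported filehandles small.
echo '/srv/projects *(rw,no_subtree_check,fsid=1000)' >> /etc/exports
exportfs -ra
```

The cloud render clients then just mount reexport-server:/srv/projects as they would any other NFS export.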

The re-export server(s) quickly build up both a memory cache and a disk-backed fscache/cachefiles cache of our working data set, so the data being pulled from on-prem lessens over time. Data is only ever read once over the WAN network from on-prem storage and then read multiple times by the many render client instances in the cloud. Recent NFS features such as "nconnect" help to speed up the initial reading of data from on-prem by using multiple connections to offset TCP latency. At the end of the render, we write the files back through the re-export server to our on-prem storage. Our average read bandwidth is many times higher than our write bandwidth.

Rather surprisingly, this mostly works for our particular workloads. We've completed movies using this setup and saved money on commercial caching systems (e.g. Avere, GPFS, etc.). But there are still some remaining issues with doing something that is very much not widely supported (or recommended). In most cases we have worked around them, but it would be great if we didn't have to, so others could also benefit. I will list the main problems quickly now and provide more information and reproducers later if anyone is interested.

1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) while those filehandles are still being used by the knfsd's remote clients, resulting in sporadic and random stale filehandles. This seems to mostly affect directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.
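To make that workaround concrete, a sketch (the sysctl.d filename is arbitrary, and the sample "fh" line below is made up for illustration):

```shell
# Discourage the VFS from reclaiming dentries/NFS inodes on the
# re-export server; reclaimed entries can turn into ESTALE for
# knfsd's remote clients. 0 would pin them completely but risks
# unbounded slab growth on a busy server.
echo 'vm.vfs_cache_pressure = 1' > /etc/sysctl.d/90-nfs-reexport.conf
sysctl -p /etc/sysctl.d/90-nfs-reexport.conf

# The first number on the "fh" line of /proc/net/rpc/nfsd is the
# stale filehandle count; watch it to confirm the workaround helps.
# Parsing a made-up sample line to show what the awk extracts:
echo 'fh 42 0 0 0 0' | awk '/^fh/ {print $2}'   # -> 42
```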

2) If we cache metadata on the re-export server using actimeo=3600,nocto we can cut the network packets back to the origin server to zero for repeated lookups. However, if a client of the re-export server walks paths and memory maps those files (i.e. loading an application), the re-export server starts issuing unexpected calls back to the origin server again, ignoring/invalidating the re-export server's NFS client cache. We worked around this by patching an inode/iversion validity check in inode.c so that the NFS client cache on the re-export server is used. I'm not sure about the correctness of this patch but it works for our corner case.

3) If we saturate an NFS client's network with reads from the server, all client metadata lookups become unbearably slow even if it's all cached in the NFS client's memory and no network RPCs should be required. This is the case for any NFS client regardless of re-exporting but it affects this case more because when we can't serve cached metadata we also can't serve the cached data. It feels like some sort of bottleneck in the client's ability to parallelise requests? We work around this by not maxing out our network.

4) With an NFSv4 re-export, lots of open/close requests (hundreds per second) quickly eat up the CPU on the re-export server and perf top shows we are mostly in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache like that added to NFSv3? Our workaround is to either fix the thing doing lots of repeated open/closes or use NFSv3 instead.

If you made it this far, I've probably taken up way too much of your valuable time already. If nobody is interested in this rather niche application of the Linux client & knfsd, then I totally understand and I will not mention it here again. If your interest is piqued however, I'm happy to go into more detail about any of this with the hope that this could become a better documented and understood type of setup that others with similar workloads could reference.

Also, many thanks to all the Linux NFS developers for the amazing work you do which, in turn, helps us to make great movies. :)

Daire (Head of Systems DNEG)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
@ 2020-09-08  9:40 ` Mkrtchyan, Tigran
  2020-09-08 11:06   ` Daire Byrne
  2020-09-15 17:21 ` J. Bruce Fields
  2020-09-22 12:31 ` Daire Byrne
  2 siblings, 1 reply; 129+ messages in thread
From: Mkrtchyan, Tigran @ 2020-09-08  9:40 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs


Just out of curiosity:

have you tried, instead of re-exporting the NFS mount directly,
re-exporting an overlayfs mount on top of the original NFS mount?
Such a setup should cover most of your issues.

Regards,
   Tigran.



* Re: Adventures in NFS re-exporting
  2020-09-08  9:40 ` Mkrtchyan, Tigran
@ 2020-09-08 11:06   ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-08 11:06 UTC (permalink / raw)
  To: Mkrtchyan, Tigran; +Cc: linux-nfs, linux-cachefs

Tigran,

I guess I never really considered overlayfs because we still want to seamlessly write through to the original servers from time to time, and post-processing the copies from the upper to the lower layer seems like it might be hard to make reliable or do with low latency. I would also worry about how overlayfs would cope with our lower filesystem being actively updated by processes outside of the overlay clients. And ultimately, the COW nature of overlayfs is a somewhat wasted feature for our workloads, whereby it's the caching of file reads (and metadata) we care most about.

I must confess to not having looked at overlayfs in a few years so there may be lots of new tricks and options that would help our case. I'm aware that it gained the ability to NFS (re-)export a couple of years back.

But I'm certainly now interested to know if that NFS re-export implementation fares any better with the issues I experience with a direct knfsd re-export of an NFS client. So I will do some testing with overlayfs and see how it stacks up (see what I did there?).

Thanks for the suggestion!

Daire

----- On 8 Sep, 2020, at 10:40, Mkrtchyan, Tigran tigran.mkrtchyan@desy.de wrote:

> Just out of curiosity:
> 
> have you tried, instead of re-exporting the NFS mount directly,
> re-exporting an overlayfs mount on top of the original NFS mount?
> Such a setup should cover most of your issues.
> 
> Regards,
>   Tigran.


* Re: Adventures in NFS re-exporting
  2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
  2020-09-08  9:40 ` Mkrtchyan, Tigran
@ 2020-09-15 17:21 ` J. Bruce Fields
  2020-09-15 19:59   ` Trond Myklebust
  2020-09-17 16:01   ` Daire Byrne
  2020-09-22 12:31 ` Daire Byrne
  2 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-09-15 17:21 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs

On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
> 1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) when those filehandles are still being used by the knfsd's remote clients resulting in sporadic and random stale filehandles. This seems to be mostly for directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.

I thought reexport worked by embedding the original server's filehandles
in the filehandles given out by the reexporting server.

So, even if nothing's cached, when the reexporting server gets a
filehandle, it should be able to extract the original filehandle from it
and use that.

I wonder why that's not working?

> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> second) quickly eat up the CPU on the re-export server and perf top
> shows we are mostly in native_queued_spin_lock_slowpath.

Any statistics on who's calling that function?

> Does NFSv4
> also need an open file cache like that added to NFSv3? Our workaround
> is to either fix the thing doing lots of repeated open/closes or use
> NFSv3 instead.

NFSv4 uses the same file cache.  It might be the file cache that's at
fault, in fact....

--b.


* Re: Adventures in NFS re-exporting
  2020-09-15 17:21 ` J. Bruce Fields
@ 2020-09-15 19:59   ` Trond Myklebust
  2020-09-16 16:01     ` Daire Byrne
  2020-09-17 16:01   ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-09-15 19:59 UTC (permalink / raw)
  To: bfields, daire; +Cc: linux-cachefs, linux-nfs

On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
> > 1) The kernel can drop entries out of the NFS client inode cache
> > (under memory cache churn) when those filehandles are still being
> > used by the knfsd's remote clients resulting in sporadic and random
> > stale filehandles. This seems to be mostly for directories from
> > what I've seen. Does the NFS client not know that knfsd is still
> > using those files/dirs? The workaround is to never drop inode &
> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
> > also helps to ensure that we actually make the most of our
> > actimeo=3600,nocto mount options for the full specified time.
> 
> I thought reexport worked by embedding the original server's
> filehandles
> in the filehandles given out by the reexporting server.
> 
> So, even if nothing's cached, when the reexporting server gets a
> filehandle, it should be able to extract the original filehandle from
> it
> and use that.
> 
> I wonder why that's not working?

NFSv3? If so, I suspect it is because we never wrote a lookupp()
callback for it.

> 
> > 4) With an NFSv4 re-export, lots of open/close requests (hundreds
> > per
> > second) quickly eat up the CPU on the re-export server and perf top
> > shows we are mostly in native_queued_spin_lock_slowpath.
> 
> Any statistics on who's calling that function?
> 
> > Does NFSv4
> > also need an open file cache like that added to NFSv3? Our
> > workaround
> > is to either fix the thing doing lots of repeated open/closes or
> > use
> > NFSv3 instead.
> 
> NFSv4 uses the same file cache.  It might be the file cache that's at
> fault, in fact....
> 
> --b.
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: Adventures in NFS re-exporting
  2020-09-15 19:59   ` Trond Myklebust
@ 2020-09-16 16:01     ` Daire Byrne
  2020-10-19 16:19       ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-16 16:01 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-cachefs, linux-nfs

Trond/Bruce,

----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:

> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>> > 1) The kernel can drop entries out of the NFS client inode cache
>> > (under memory cache churn) when those filehandles are still being
>> > used by the knfsd's remote clients resulting in sporadic and random
>> > stale filehandles. This seems to be mostly for directories from
>> > what I've seen. Does the NFS client not know that knfsd is still
>> > using those files/dirs? The workaround is to never drop inode &
>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>> > also helps to ensure that we actually make the most of our
>> > actimeo=3600,nocto mount options for the full specified time.
>> 
>> I thought reexport worked by embedding the original server's
>> filehandles
>> in the filehandles given out by the reexporting server.
>> 
>> So, even if nothing's cached, when the reexporting server gets a
>> filehandle, it should be able to extract the original filehandle from
>> it
>> and use that.
>> 
>> I wonder why that's not working?
> 
> NFSv3? If so, I suspect it is because we never wrote a lookupp()
> callback for it.

So in terms of the ESTALE counter on the re-export server, we see it increase if the end client mounts the re-export using either NFSv3 or NFSv4. But there is a difference in the client experience: with NFSv3 we quickly get input/output errors, but with NFSv4 we don't. Performance does seem to drop significantly though, which makes me think that NFSv4 retries the lookups (which succeed) when an ESTALE is reported, but NFSv3 does not?

This is the simplest reproducer I could come up with but it may still be specific to our workloads/applications and hard to replicate exactly.

nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro reexport-server:/vol/software /mnt/software
nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee /proc/sys/vm/drop_caches; done

reexport-server # sysctl -w vm.vfs_cache_pressure=100
reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep 10; done

Where "application" is some big application with lots of paths to scan and libs to memory map, and "/vol/software" is an NFS mount on the reexport-server from another originating NFS server. I don't know why this application loading workload shows this best, but perhaps the access patterns of memory-mapped binaries and libs are particularly susceptible to ESTALE?

With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches" repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache. The ESTALE count increases and the client running the application reports input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.

As soon as we switch to vfs_cache_pressure=0, the repeated drop_caches runs on the reexport server no longer cull the dentry or nfs_inode_cache, the ESTALE counter no longer increases and the client experiences no issues (NFSv3 & NFSv4).
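
As an aside, a tiny helper makes the counter movement easier to watch than re-reading the raw stats by hand. This is just a sketch: the "fh" line layout (stale count in field 2) is assumed from the awk one-liner above, and fh_stale_delta is a made-up name, not an existing tool:

```shell
# fh_stale_delta: print how much the knfsd stale-filehandle counter grew
# between two snapshots of /proc/net/rpc/nfsd-style stats.
# Assumes the stale count is field 2 of the "fh" line, matching the awk
# one-liner in the reproducer above.
fh_stale_delta() {
    before=$(awk '$1 == "fh" { print $2 }' "$1")
    after=$(awk '$1 == "fh" { print $2 }' "$2")
    echo $((after - before))
}
```

On the reexport server that could be driven with something like "cp /proc/net/rpc/nfsd /tmp/nfsd.before; sleep 10; fh_stale_delta /tmp/nfsd.before /proc/net/rpc/nfsd" to see whether a drop_caches run coincides with a jump in stale filehandles.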

>> > 4) With an NFSv4 re-export, lots of open/close requests (hundreds
>> > per
>> > second) quickly eat up the CPU on the re-export server and perf top
>> > shows we are mostly in native_queued_spin_lock_slowpath.
>> 
>> Any statistics on who's calling that function?

I have not managed to devise a good reproducer for this as I suspect it requires large numbers of clients. So, I will have to use some production load to replicate it and it will take me a day or two to get something back to you.

Would something from a perf report be of particular interest (e.g. the call graph) or even a /proc/X/stack of a high CPU nfsd thread?

I do recall that nfsd_file_lru_cb and __list_lru_walk_one were usually right below native_queued_spin_lock_slowpath as the next most busy functions in perf top (with NFSv4 exporting). Perhaps this is less of an NFS reexport phenomenon and would be the case for any NFSv4 export of a particularly "slow" underlying filesystem?
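
If the /proc/X/stack route would be useful, something along these lines could pick out the hottest nfsd thread to sample. The helper name and the ps field layout are my own assumptions, not anything already in use here:

```shell
# pick_busiest: given a file of "pid %cpu comm" lines (as produced by
# `ps -eo pid,pcpu,comm --no-headers`), print the pid of the
# highest-%CPU entry whose comm matches the second argument.
pick_busiest() {
    awk -v name="$2" '$3 == name { print $2, $1 }' "$1" |
        sort -rn | head -n 1 | awk '{ print $2 }'
}
```

Then, as root on the reexport server: ps -eo pid,pcpu,comm --no-headers > /tmp/ps.txt && cat /proc/$(pick_busiest /tmp/ps.txt nfsd)/stack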

>> > Does NFSv4
>> > also need an open file cache like that added to NFSv3? Our
>> > workaround
>> > is to either fix the thing doing lots of repeated open/closes or
>> > use
>> > NFSv3 instead.
>> 
>> NFSv4 uses the same file cache.  It might be the file cache that's at
>> fault, in fact....

Ah, my misunderstanding. I had assumed the open file descriptor cache was of more benefit to NFSv3 and that NFSv4 did not necessarily require it for performance.

I might also be able to do a test with a kernel version from before that feature landed to see if NFSv4 reexport performs any differently.

Cheers,

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-15 17:21 ` J. Bruce Fields
  2020-09-15 19:59   ` Trond Myklebust
@ 2020-09-17 16:01   ` Daire Byrne
  2020-09-17 19:09     ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-17 16:01 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-cachefs


----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:

>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> second) quickly eat up the CPU on the re-export server and perf top
>> shows we are mostly in native_queued_spin_lock_slowpath.
> 
> Any statistics on who's calling that function?

I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.

With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.

The perf top summary looks like this:

# Overhead  Command          Shared Object                 Symbol                                                 
# ........  ...............  ............................  .......................................................
#
    82.91%  nfsd             [kernel.kallsyms]             [k] native_queued_spin_lock_slowpath
     8.24%  swapper          [kernel.kallsyms]             [k] intel_idle
     4.66%  nfsd             [kernel.kallsyms]             [k] __list_lru_walk_one
     0.80%  nfsd             [kernel.kallsyms]             [k] nfsd_file_lru_cb

And the call graph (not sure how this will format):

- nfsd
   - 89.34% svc_process
      - 88.94% svc_process_common
         - 88.87% nfsd_dispatch
            - 88.82% nfsd4_proc_compound
               - 53.97% nfsd4_open
                  - 53.95% nfsd4_process_open2
                     - 53.87% nfs4_get_vfs_file
                        - 53.48% nfsd_file_acquire
                           - 33.31% nfsd_file_lru_walk_list
                              - 33.28% list_lru_walk_node                    
                                 - 33.28% list_lru_walk_one                  
                                    - 30.21% _raw_spin_lock
                                       - 30.21% queued_spin_lock_slowpath
                                            30.20% native_queued_spin_lock_slowpath
                                      2.46% __list_lru_walk_one
                           - 19.39% list_lru_add
                              - 19.39% _raw_spin_lock
                                 - 19.39% queued_spin_lock_slowpath
                                      19.38% native_queued_spin_lock_slowpath
               - 34.46% nfsd4_close
                  - 34.45% nfs4_put_stid
                     - 34.45% nfs4_free_ol_stateid
                        - 34.45% release_all_access
                           - 34.45% nfs4_file_put_access
                              - 34.45% __nfs4_file_put_access.part.81
                                 - 34.45% nfsd_file_put
                                    - 34.44% nfsd_file_lru_walk_list
                                       - 34.40% list_lru_walk_node
                                          - 34.40% list_lru_walk_one
                                             - 31.27% _raw_spin_lock
                                                - 31.27% queued_spin_lock_slowpath
                                                     31.26% native_queued_spin_lock_slowpath
                                               2.50% __list_lru_walk_one
                                               0.50% nfsd_file_lru_cb


The original NFS server is mounted by the reexport server using NFSv4.2. As soon as we switch the clients to mount the reexport server with NFSv3, the high CPU usage goes away and we start to see expected performance for this workload and server hardware.

I'm happy to share perf data or anything else that is useful and I can repeatedly run this production load as required.

Cheers,

Daire


* Re: Adventures in NFS re-exporting
  2020-09-17 16:01   ` Daire Byrne
@ 2020-09-17 19:09     ` bfields
  2020-09-17 20:23       ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-09-17 19:09 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs, Frank van der Linden

On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> 
> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> 
> >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> >> second) quickly eat up the CPU on the re-export server and perf top
> >> shows we are mostly in native_queued_spin_lock_slowpath.
> > 
> > Any statistics on who's calling that function?
> 
> I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> 
> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.

That sounds a lot like what Frank Van der Linden reported:

	https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/

It looks like a bug in the filehandle caching code.

--b.

> 
> The perf top summary looks like this:
> 
> # Overhead  Command          Shared Object                 Symbol                                                 
> # ........  ...............  ............................  .......................................................
> #
>     82.91%  nfsd             [kernel.kallsyms]             [k] native_queued_spin_lock_slowpath
>      8.24%  swapper          [kernel.kallsyms]             [k] intel_idle
>      4.66%  nfsd             [kernel.kallsyms]             [k] __list_lru_walk_one
>      0.80%  nfsd             [kernel.kallsyms]             [k] nfsd_file_lru_cb
> 
> And the call graph (not sure how this will format):
> 
> - nfsd
>    - 89.34% svc_process
>       - 88.94% svc_process_common
>          - 88.87% nfsd_dispatch
>             - 88.82% nfsd4_proc_compound
>                - 53.97% nfsd4_open
>                   - 53.95% nfsd4_process_open2
>                      - 53.87% nfs4_get_vfs_file
>                         - 53.48% nfsd_file_acquire
>                            - 33.31% nfsd_file_lru_walk_list
>                               - 33.28% list_lru_walk_node                    
>                                  - 33.28% list_lru_walk_one                  
>                                     - 30.21% _raw_spin_lock
>                                        - 30.21% queued_spin_lock_slowpath
>                                             30.20% native_queued_spin_lock_slowpath
>                                       2.46% __list_lru_walk_one
>                            - 19.39% list_lru_add
>                               - 19.39% _raw_spin_lock
>                                  - 19.39% queued_spin_lock_slowpath
>                                       19.38% native_queued_spin_lock_slowpath
>                - 34.46% nfsd4_close
>                   - 34.45% nfs4_put_stid
>                      - 34.45% nfs4_free_ol_stateid
>                         - 34.45% release_all_access
>                            - 34.45% nfs4_file_put_access
>                               - 34.45% __nfs4_file_put_access.part.81
>                                  - 34.45% nfsd_file_put
>                                     - 34.44% nfsd_file_lru_walk_list
>                                        - 34.40% list_lru_walk_node
>                                           - 34.40% list_lru_walk_one
>                                              - 31.27% _raw_spin_lock
>                                                 - 31.27% queued_spin_lock_slowpath
>                                                      31.26% native_queued_spin_lock_slowpath
>                                                2.50% __list_lru_walk_one
>                                                0.50% nfsd_file_lru_cb
> 
> 
> The original NFS server is mounted by the reexport server using NFSv4.2. As soon as we switch the clients to mount the reexport server with NFSv3, the high CPU usage goes away and we start to see expected performance for this workload and server hardware.
> 
> I'm happy to share perf data or anything else that is useful and I can repeatedly run this production load as required.
> 
> Cheers,
> 
> Daire


* Re: Adventures in NFS re-exporting
  2020-09-17 19:09     ` bfields
@ 2020-09-17 20:23       ` Frank van der Linden
  2020-09-17 21:57         ` bfields
  2020-09-22 16:43         ` Chuck Lever
  0 siblings, 2 replies; 129+ messages in thread
From: Frank van der Linden @ 2020-09-17 20:23 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, linux-nfs, linux-cachefs

[-- Attachment #1: Type: text/plain, Size: 2394 bytes --]

On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> 
> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> >
> > ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> >
> > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> > >> second) quickly eat up the CPU on the re-export server and perf top
> > >> shows we are mostly in native_queued_spin_lock_slowpath.
> > >
> > > Any statistics on who's calling that function?
> >
> > I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> >
> > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
> 
> That sounds a lot like what Frank Van der Linden reported:
> 
>         https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> 
> It looks like a bug in the filehandle caching code.
> 
> --b.

Yes, that does look like the same one.

I still think that not caching v4 files at all may be the best way to go
here, since the intent of the filecache code was to speed up v2/v3 I/O,
where you end up doing a lot of opens/closes, but it doesn't make as
much sense for v4.

However, short of that, I tested a local patch a few months back, that
I never posted here, so I'll do so now. It just makes v4 opens into
'long term' opens, which do not get put on the LRU, since that doesn't
make sense (they are in the hash table, so they are still cached).

Also, the file caching code seems to walk the LRU a little too often,
but that's another issue - and this change keeps the LRU short, so it's
not a big deal.

I don't particularly love this patch, but it does keep the LRU short, and
did significantly speed up my testcase (by about 50%). So, maybe you can
give it a try.

I'll also attach a second patch, that converts the hash table to an rhashtable,
which automatically grows and shrinks in size with usage. That patch also
helped, but not by nearly as much (I think it yielded another 10%).

- Frank

[-- Attachment #2: 0001-nfsd-don-t-put-nfsd_files-with-long-term-refs-on-the.patch --]
[-- Type: text/plain, Size: 6718 bytes --]

From 057a24e1b3744c716e4956eb34c2d15ed719db23 Mon Sep 17 00:00:00 2001
From: Frank van der Linden <fllinden@amazon.com>
Date: Fri, 26 Jun 2020 22:35:01 +0000
Subject: [PATCH 1/2] nfsd: don't put nfsd_files with long term refs on the LRU
 list

Files with long term references, as created by v4 OPENs, will
just clutter the LRU list without a chance of being reaped.
So, don't put them there at all.

When finding a file in the hash table for a long term ref, remove
it from the LRU list.

When dropping the last long term ref, add it back to the LRU list.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 fs/nfsd/filecache.c | 81 ++++++++++++++++++++++++++++++++++++++++-----
 fs/nfsd/filecache.h |  6 ++++
 fs/nfsd/nfs4state.c |  2 +-
 3 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 82198d747c4c..5ef6bb802f24 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -186,6 +186,7 @@ nfsd_file_alloc(struct inode *inode, unsigned int may, unsigned int hashval,
 		nf->nf_inode = inode;
 		nf->nf_hashval = hashval;
 		refcount_set(&nf->nf_ref, 1);
+		atomic_set(&nf->nf_lref, 0);
 		nf->nf_may = may & NFSD_FILE_MAY_MASK;
 		if (may & NFSD_MAY_NOT_BREAK_LEASE) {
 			if (may & NFSD_MAY_WRITE)
@@ -297,13 +298,26 @@ nfsd_file_put_noref(struct nfsd_file *nf)
 	}
 }
 
-void
-nfsd_file_put(struct nfsd_file *nf)
+static void
+__nfsd_file_put(struct nfsd_file *nf, unsigned int flags)
 {
 	bool is_hashed;
+	int refs;
+
+	refs = refcount_read(&nf->nf_ref);
+
+	if (flags & NFSD_ACQ_FILE_LONGTERM) {
+		/*
+		 * If we're dropping the last long term ref, and there
+		 * are other references, put the file on the LRU list,
+		 * as it now makes sense for it to be there.
+		 */
+		if (atomic_dec_return(&nf->nf_lref) == 0 && refs > 2)
+			list_lru_add(&nfsd_file_lru, &nf->nf_lru);
+	} else
+		set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
 
-	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
-	if (refcount_read(&nf->nf_ref) > 2 || !nf->nf_file) {
+	if (refs > 2 || !nf->nf_file) {
 		nfsd_file_put_noref(nf);
 		return;
 	}
@@ -317,6 +331,18 @@ nfsd_file_put(struct nfsd_file *nf)
 		nfsd_file_gc();
 }
 
+void
+nfsd_file_put(struct nfsd_file *nf)
+{
+	__nfsd_file_put(nf, 0);
+}
+
+void
+nfsd_file_put_longterm(struct nfsd_file *nf)
+{
+	__nfsd_file_put(nf, NFSD_ACQ_FILE_LONGTERM);
+}
+
 struct nfsd_file *
 nfsd_file_get(struct nfsd_file *nf)
 {
@@ -934,13 +960,14 @@ nfsd_file_is_cached(struct inode *inode)
 	return ret;
 }
 
-__be32
-nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		  unsigned int may_flags, struct nfsd_file **pnf)
+static __be32
+__nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **pnf,
+		  unsigned int flags)
 {
 	__be32	status;
 	struct net *net = SVC_NET(rqstp);
-	struct nfsd_file *nf, *new;
+	struct nfsd_file *nf, *new = NULL;
 	struct inode *inode;
 	unsigned int hashval;
 	bool retry = true;
@@ -1006,6 +1033,16 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		}
 	}
 out:
+	if (flags & NFSD_ACQ_FILE_LONGTERM) {
+		/*
+		 * A file with long term (v4) references will needlessly
+		 * clutter the LRU, so remove it when adding the first
+		 * long term ref.
+		 */
+		if (!new && atomic_inc_return(&nf->nf_lref) == 1)
+			list_lru_del(&nfsd_file_lru, &nf->nf_lru);
+	}
+
 	if (status == nfs_ok) {
 		*pnf = nf;
 	} else {
@@ -1021,7 +1058,18 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	refcount_inc(&nf->nf_ref);
 	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
 	__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
-	list_lru_add(&nfsd_file_lru, &nf->nf_lru);
+
+	/*
+	 * Don't add a new file to the LRU if it's a long term reference.
+	 * It is still added to the hash table, so it may be added to the
+	 * LRU later, when the number of long term references drops back
+	 * to zero, and there are other references.
+	 */
+	if (flags & NFSD_ACQ_FILE_LONGTERM)
+		atomic_inc(&nf->nf_lref);
+	else
+		list_lru_add(&nfsd_file_lru, &nf->nf_lru);
+
 	hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
 	++nfsd_file_hashtbl[hashval].nfb_count;
 	nfsd_file_hashtbl[hashval].nfb_maxcount = max(nfsd_file_hashtbl[hashval].nfb_maxcount,
@@ -1054,6 +1102,21 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	goto out;
 }
 
+__be32
+nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **pnf)
+{
+	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf, 0);
+}
+
+__be32
+nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **pnf)
+{
+	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf,
+				  NFSD_ACQ_FILE_LONGTERM);
+}
+
 /*
  * Note that fields may be added, removed or reordered in the future. Programs
  * scraping this file for info should test the labels to ensure they're
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 7872df5a0fe3..6e1db77d7148 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -44,21 +44,27 @@ struct nfsd_file {
 	struct inode		*nf_inode;
 	unsigned int		nf_hashval;
 	refcount_t		nf_ref;
+	atomic_t		nf_lref;
 	unsigned char		nf_may;
 	struct nfsd_file_mark	*nf_mark;
 	struct rw_semaphore	nf_rwsem;
 };
 
+#define NFSD_ACQ_FILE_LONGTERM	0x0001
+
 int nfsd_file_cache_init(void);
 void nfsd_file_cache_purge(struct net *);
 void nfsd_file_cache_shutdown(void);
 int nfsd_file_cache_start_net(struct net *net);
 void nfsd_file_cache_shutdown_net(struct net *net);
 void nfsd_file_put(struct nfsd_file *nf);
+void nfsd_file_put_longterm(struct nfsd_file *nf);
 struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
 void nfsd_file_close_inode_sync(struct inode *inode);
 bool nfsd_file_is_cached(struct inode *inode);
 __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  unsigned int may_flags, struct nfsd_file **nfp);
+__be32 nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **nfp);
 int	nfsd_file_cache_stats_open(struct inode *, struct file *);
 #endif /* _FS_NFSD_FILECACHE_H */
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index bb3d2c32664a..451a1071daf4 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4838,7 +4838,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
 
 	if (!fp->fi_fds[oflag]) {
 		spin_unlock(&fp->fi_lock);
-		status = nfsd_file_acquire(rqstp, cur_fh, access, &nf);
+		status = nfsd_file_acquire_longterm(rqstp, cur_fh, access, &nf);
 		if (status)
 			goto out_put_access;
 		spin_lock(&fp->fi_lock);
-- 
2.17.2


[-- Attachment #3: 0002-nfsd-change-file_hashtbl-to-an-rhashtable.patch --]
[-- Type: text/plain, Size: 8058 bytes --]

From 79e7ffd01482d90cd5f6e98b5a362bbf95ea9b2c Mon Sep 17 00:00:00 2001
From: Frank van der Linden <fllinden@amazon.com>
Date: Thu, 16 Jul 2020 21:35:29 +0000
Subject: [PATCH 2/2] nfsd: change file_hashtbl to an rhashtable

file_hashtbl can grow quite large, so use rhashtable, which has
automatic growing (and shrinking).

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 fs/nfsd/nfs4state.c | 112 +++++++++++++++++++++++++++++---------------
 fs/nfsd/nfsctl.c    |   7 ++-
 fs/nfsd/nfsd.h      |   4 ++
 fs/nfsd/state.h     |   3 +-
 4 files changed, 86 insertions(+), 40 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 451a1071daf4..ff81c0136224 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -417,13 +417,33 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 	kmem_cache_free(file_slab, fp);
 }
 
+/* hash table for nfs4_file */
+#define FILE_HASH_SIZE		256
+
+static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed);
+static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed);
+static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
+				  const void *obj);
+
+static const struct rhashtable_params file_rhashparams = {
+	.head_offset		= offsetof(struct nfs4_file, fi_hash),
+	.min_size		= FILE_HASH_SIZE,
+	.automatic_shrinking	= true,
+	.hashfn			= nfsd4_file_key_hash,
+	.obj_hashfn		= nfsd4_file_obj_hash,
+	.obj_cmpfn		= nfsd4_file_obj_compare,
+};
+
+struct rhashtable file_hashtbl;
+
 void
 put_nfs4_file(struct nfs4_file *fi)
 {
 	might_lock(&state_lock);
 
 	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
-		hlist_del_rcu(&fi->fi_hash);
+		rhashtable_remove_fast(&file_hashtbl, &fi->fi_hash,
+				       file_rhashparams);
 		spin_unlock(&state_lock);
 		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
 		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
@@ -527,21 +547,33 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
 	return ret & OWNER_HASH_MASK;
 }
 
-/* hash table for nfs4_file */
-#define FILE_HASH_BITS                   8
-#define FILE_HASH_SIZE                  (1 << FILE_HASH_BITS)
-
-static unsigned int nfsd_fh_hashval(struct knfsd_fh *fh)
+static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed)
 {
-	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), 0);
+	struct knfsd_fh *fh = (struct knfsd_fh *)data;
+
+	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
 }
 
-static unsigned int file_hashval(struct knfsd_fh *fh)
+static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed)
 {
-	return nfsd_fh_hashval(fh) & (FILE_HASH_SIZE - 1);
+	struct nfs4_file *fp = (struct nfs4_file *)data;
+	struct knfsd_fh *fh;
+
+	fh = &fp->fi_fhandle;
+
+	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
 }
 
-static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
+static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
+				  const void *obj)
+{
+	struct nfs4_file *fp = (struct nfs4_file *)obj;
+
+	if (fh_match(&fp->fi_fhandle, (struct knfsd_fh *)arg->key))
+		return 0;
+
+	return 1;
+}
 
 static void
 __nfs4_file_get_access(struct nfs4_file *fp, u32 access)
@@ -4042,8 +4074,7 @@ static struct nfs4_file *nfsd4_alloc_file(void)
 }
 
 /* OPEN Share state helper functions */
-static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
-				struct nfs4_file *fp)
+static void nfsd4_init_file(struct knfsd_fh *fh, struct nfs4_file *fp)
 {
 	lockdep_assert_held(&state_lock);
 
@@ -4062,7 +4093,6 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
 	INIT_LIST_HEAD(&fp->fi_lo_states);
 	atomic_set(&fp->fi_lo_recalls, 0);
 #endif
-	hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
 }
 
 void
@@ -4126,6 +4156,18 @@ nfsd4_init_slabs(void)
 	return -ENOMEM;
 }
 
+int
+nfsd4_init_hash(void)
+{
+	return rhashtable_init(&file_hashtbl, &file_rhashparams);
+}
+
+void
+nfsd4_free_hash(void)
+{
+	rhashtable_destroy(&file_hashtbl);
+}
+
 static void init_nfs4_replay(struct nfs4_replay *rp)
 {
 	rp->rp_status = nfserr_serverfault;
@@ -4395,30 +4437,19 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
 }
 
 /* search file_hashtbl[] for file */
-static struct nfs4_file *
-find_file_locked(struct knfsd_fh *fh, unsigned int hashval)
-{
-	struct nfs4_file *fp;
-
-	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
-				lockdep_is_held(&state_lock)) {
-		if (fh_match(&fp->fi_fhandle, fh)) {
-			if (refcount_inc_not_zero(&fp->fi_ref))
-				return fp;
-		}
-	}
-	return NULL;
-}
-
 struct nfs4_file *
 find_file(struct knfsd_fh *fh)
 {
 	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
 
 	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
+	fp = rhashtable_lookup(&file_hashtbl, fh, file_rhashparams);
+	if (fp) {
+		if (IS_ERR(fp) || !refcount_inc_not_zero(&fp->fi_ref))
+			fp = NULL;
+	}
 	rcu_read_unlock();
+
 	return fp;
 }
 
@@ -4426,22 +4457,27 @@ static struct nfs4_file *
 find_or_add_file(struct nfs4_file *new, struct knfsd_fh *fh)
 {
 	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
 
-	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
-	rcu_read_unlock();
+	fp = find_file(fh);
 	if (fp)
 		return fp;
 
+	nfsd4_init_file(fh, new);
+
 	spin_lock(&state_lock);
-	fp = find_file_locked(fh, hashval);
-	if (likely(fp == NULL)) {
-		nfsd4_init_file(fh, hashval, new);
+
+	fp = rhashtable_lookup_get_insert_key(&file_hashtbl, &new->fi_fhandle,
+	    &new->fi_hash, file_rhashparams);
+	if (likely(fp == NULL))
 		fp = new;
-	}
+	else if (IS_ERR(fp))
+		fp = NULL;
+	else
+		refcount_inc(&fp->fi_ref);
+
 	spin_unlock(&state_lock);
 
+
 	return fp;
 }
 
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index b68e96681522..bac5d8cff1d3 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1528,9 +1528,12 @@ static int __init init_nfsd(void)
 	retval = nfsd4_init_slabs();
 	if (retval)
 		goto out_unregister_notifier;
-	retval = nfsd4_init_pnfs();
+	retval = nfsd4_init_hash();
 	if (retval)
 		goto out_free_slabs;
+	retval = nfsd4_init_pnfs();
+	if (retval)
+		goto out_free_hash;
 	nfsd_fault_inject_init(); /* nfsd fault injection controls */
 	nfsd_stat_init();	/* Statistics */
 	retval = nfsd_drc_slab_create();
@@ -1554,6 +1557,8 @@ static int __init init_nfsd(void)
 	nfsd_stat_shutdown();
 	nfsd_fault_inject_cleanup();
 	nfsd4_exit_pnfs();
+out_free_hash:
+	nfsd4_free_hash();
 out_free_slabs:
 	nfsd4_free_slabs();
 out_unregister_notifier:
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 5343c771da18..fb0349d16158 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -141,6 +141,8 @@ nfsd_user_namespace(const struct svc_rqst *rqstp)
 extern unsigned long max_delegations;
 int nfsd4_init_slabs(void);
 void nfsd4_free_slabs(void);
+int nfsd4_init_hash(void);
+void nfsd4_free_hash(void);
 int nfs4_state_start(void);
 int nfs4_state_start_net(struct net *net);
 void nfs4_state_shutdown(void);
@@ -151,6 +153,8 @@ bool nfsd4_spo_must_allow(struct svc_rqst *rqstp);
 #else
 static inline int nfsd4_init_slabs(void) { return 0; }
 static inline void nfsd4_free_slabs(void) { }
+static inline int nfsd4_init_hash(void) { return 0; }
+static inline void nfsd4_free_hash(void) { }
 static inline int nfs4_state_start(void) { return 0; }
 static inline int nfs4_state_start_net(struct net *net) { return 0; }
 static inline void nfs4_state_shutdown(void) { }
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 3b408532a5dc..bf66244a7a2d 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -38,6 +38,7 @@
 #include <linux/idr.h>
 #include <linux/refcount.h>
 #include <linux/sunrpc/svc_xprt.h>
+#include <linux/rhashtable.h>
 #include "nfsfh.h"
 #include "nfsd.h"
 
@@ -513,7 +514,7 @@ struct nfs4_clnt_odstate {
 struct nfs4_file {
 	refcount_t		fi_ref;
 	spinlock_t		fi_lock;
-	struct hlist_node       fi_hash;	/* hash on fi_fhandle */
+	struct rhash_head	fi_hash;	/* hash on fi_fhandle */
 	struct list_head        fi_stateids;
 	union {
 		struct list_head	fi_delegations;
-- 
2.17.2



* Re: Adventures in NFS re-exporting
  2020-09-17 20:23       ` Frank van der Linden
@ 2020-09-17 21:57         ` bfields
  2020-09-19 11:08           ` Daire Byrne
  2020-09-22 16:43         ` Chuck Lever
  1 sibling, 1 reply; 129+ messages in thread
From: bfields @ 2020-09-17 21:57 UTC (permalink / raw)
  To: Frank van der Linden; +Cc: Daire Byrne, linux-nfs, linux-cachefs

On Thu, Sep 17, 2020 at 08:23:03PM +0000, Frank van der Linden wrote:
> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> > 
> > On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> > >
> > > ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> > >
> > > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> > > >> second) quickly eat up the CPU on the re-export server and perf top
> > > >> shows we are mostly in native_queued_spin_lock_slowpath.
> > > >
> > > > Any statistics on who's calling that function?
> > >
> > > I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> > >
> > > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
> > 
> > That sounds a lot like what Frank Van der Linden reported:
> > 
> >         https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> > 
> > It looks like a bug in the filehandle caching code.
> > 
> > --b.
> 
> Yes, that does look like the same one.
> 
> I still think that not caching v4 files at all may be the best way to go
> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> where you end up doing a lot of opens/closes, but it doesn't make as
> much sense for v4.
> 
> However, short of that, I tested a local patch a few months back, that
> I never posted here, so I'll do so now. It just makes v4 opens into
> 'long term' opens, which do not get put on the LRU, since that doesn't
> make sense (they are in the hash table, so they are still cached).

That makes sense to me.  But I'm also not opposed to turning it off for
v4 at this point.

--b.

> Also, the file caching code seems to walk the LRU a little too often,
> but that's another issue - and this change keeps the LRU short, so it's
> not a big deal.
> 
> I don't particularly love this patch, but it does keep the LRU short, and
> did significantly speed up my testcase (by about 50%). So, maybe you can
> give it a try.
> 
> I'll also attach a second patch, that converts the hash table to an rhashtable,
> which automatically grows and shrinks in size with usage. That patch also
> helped, but not by nearly as much (I think it yielded another 10%).
> 
> - Frank

> From 057a24e1b3744c716e4956eb34c2d15ed719db23 Mon Sep 17 00:00:00 2001
> From: Frank van der Linden <fllinden@amazon.com>
> Date: Fri, 26 Jun 2020 22:35:01 +0000
> Subject: [PATCH 1/2] nfsd: don't put nfsd_files with long term refs on the LRU
>  list
> 
> Files with long term references, as created by v4 OPENs, will
> just clutter the LRU list without a chance of being reaped.
> So, don't put them there at all.
> 
> When finding a file in the hash table for a long term ref, remove
> it from the LRU list.
> 
> When dropping the last long term ref, add it back to the LRU list.
> 
> Signed-off-by: Frank van der Linden <fllinden@amazon.com>
> ---
>  fs/nfsd/filecache.c | 81 ++++++++++++++++++++++++++++++++++++++++-----
>  fs/nfsd/filecache.h |  6 ++++
>  fs/nfsd/nfs4state.c |  2 +-
>  3 files changed, 79 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 82198d747c4c..5ef6bb802f24 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -186,6 +186,7 @@ nfsd_file_alloc(struct inode *inode, unsigned int may, unsigned int hashval,
>  		nf->nf_inode = inode;
>  		nf->nf_hashval = hashval;
>  		refcount_set(&nf->nf_ref, 1);
> +		atomic_set(&nf->nf_lref, 0);
>  		nf->nf_may = may & NFSD_FILE_MAY_MASK;
>  		if (may & NFSD_MAY_NOT_BREAK_LEASE) {
>  			if (may & NFSD_MAY_WRITE)
> @@ -297,13 +298,26 @@ nfsd_file_put_noref(struct nfsd_file *nf)
>  	}
>  }
>  
> -void
> -nfsd_file_put(struct nfsd_file *nf)
> +static void
> +__nfsd_file_put(struct nfsd_file *nf, unsigned int flags)
>  {
>  	bool is_hashed;
> +	int refs;
> +
> +	refs = refcount_read(&nf->nf_ref);
> +
> +	if (flags & NFSD_ACQ_FILE_LONGTERM) {
> +		/*
> +		 * If we're dropping the last long term ref, and there
> +		 * are other references, put the file on the LRU list,
> +		 * as it now makes sense for it to be there.
> +		 */
> +		if (atomic_dec_return(&nf->nf_lref) == 0 && refs > 2)
> +			list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +	} else
> +		set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>  
> -	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> -	if (refcount_read(&nf->nf_ref) > 2 || !nf->nf_file) {
> +	if (refs > 2 || !nf->nf_file) {
>  		nfsd_file_put_noref(nf);
>  		return;
>  	}
> @@ -317,6 +331,18 @@ nfsd_file_put(struct nfsd_file *nf)
>  		nfsd_file_gc();
>  }
>  
> +void
> +nfsd_file_put(struct nfsd_file *nf)
> +{
> +	__nfsd_file_put(nf, 0);
> +}
> +
> +void
> +nfsd_file_put_longterm(struct nfsd_file *nf)
> +{
> +	__nfsd_file_put(nf, NFSD_ACQ_FILE_LONGTERM);
> +}
> +
>  struct nfsd_file *
>  nfsd_file_get(struct nfsd_file *nf)
>  {
> @@ -934,13 +960,14 @@ nfsd_file_is_cached(struct inode *inode)
>  	return ret;
>  }
>  
> -__be32
> -nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -		  unsigned int may_flags, struct nfsd_file **pnf)
> +static __be32
> +__nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **pnf,
> +		  unsigned int flags)
>  {
>  	__be32	status;
>  	struct net *net = SVC_NET(rqstp);
> -	struct nfsd_file *nf, *new;
> +	struct nfsd_file *nf, *new = NULL;
>  	struct inode *inode;
>  	unsigned int hashval;
>  	bool retry = true;
> @@ -1006,6 +1033,16 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		}
>  	}
>  out:
> +	if (flags & NFSD_ACQ_FILE_LONGTERM) {
> +		/*
> +		 * A file with long term (v4) references will needlessly
> +		 * clutter the LRU, so remove it when adding the first
> +		 * long term ref.
> +		 */
> +		if (!new && atomic_inc_return(&nf->nf_lref) == 1)
> +			list_lru_del(&nfsd_file_lru, &nf->nf_lru);
> +	}
> +
>  	if (status == nfs_ok) {
>  		*pnf = nf;
>  	} else {
> @@ -1021,7 +1058,18 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	refcount_inc(&nf->nf_ref);
>  	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
>  	__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> -	list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +
> +	/*
> +	 * Don't add a new file to the LRU if it's a long term reference.
> +	 * It is still added to the hash table, so it may be added to the
> +	 * LRU later, when the number of long term references drops back
> +	 * to zero, and there are other references.
> +	 */
> +	if (flags & NFSD_ACQ_FILE_LONGTERM)
> +		atomic_inc(&nf->nf_lref);
> +	else
> +		list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +
>  	hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
>  	++nfsd_file_hashtbl[hashval].nfb_count;
>  	nfsd_file_hashtbl[hashval].nfb_maxcount = max(nfsd_file_hashtbl[hashval].nfb_maxcount,
> @@ -1054,6 +1102,21 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	goto out;
>  }
>  
> +__be32
> +nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **pnf)
> +{
> +	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf, 0);
> +}
> +
> +__be32
> +nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **pnf)
> +{
> +	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf,
> +				  NFSD_ACQ_FILE_LONGTERM);
> +}
> +
>  /*
>   * Note that fields may be added, removed or reordered in the future. Programs
>   * scraping this file for info should test the labels to ensure they're
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index 7872df5a0fe3..6e1db77d7148 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -44,21 +44,27 @@ struct nfsd_file {
>  	struct inode		*nf_inode;
>  	unsigned int		nf_hashval;
>  	refcount_t		nf_ref;
> +	atomic_t		nf_lref;
>  	unsigned char		nf_may;
>  	struct nfsd_file_mark	*nf_mark;
>  	struct rw_semaphore	nf_rwsem;
>  };
>  
> +#define NFSD_ACQ_FILE_LONGTERM	0x0001
> +
>  int nfsd_file_cache_init(void);
>  void nfsd_file_cache_purge(struct net *);
>  void nfsd_file_cache_shutdown(void);
>  int nfsd_file_cache_start_net(struct net *net);
>  void nfsd_file_cache_shutdown_net(struct net *net);
>  void nfsd_file_put(struct nfsd_file *nf);
> +void nfsd_file_put_longterm(struct nfsd_file *nf);
>  struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
>  void nfsd_file_close_inode_sync(struct inode *inode);
>  bool nfsd_file_is_cached(struct inode *inode);
>  __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		  unsigned int may_flags, struct nfsd_file **nfp);
> +__be32 nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **nfp);
>  int	nfsd_file_cache_stats_open(struct inode *, struct file *);
>  #endif /* _FS_NFSD_FILECACHE_H */
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index bb3d2c32664a..451a1071daf4 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4838,7 +4838,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>  
>  	if (!fp->fi_fds[oflag]) {
>  		spin_unlock(&fp->fi_lock);
> -		status = nfsd_file_acquire(rqstp, cur_fh, access, &nf);
> +		status = nfsd_file_acquire_longterm(rqstp, cur_fh, access, &nf);
>  		if (status)
>  			goto out_put_access;
>  		spin_lock(&fp->fi_lock);
> -- 
> 2.17.2
> 

> From 79e7ffd01482d90cd5f6e98b5a362bbf95ea9b2c Mon Sep 17 00:00:00 2001
> From: Frank van der Linden <fllinden@amazon.com>
> Date: Thu, 16 Jul 2020 21:35:29 +0000
> Subject: [PATCH 2/2] nfsd: change file_hashtbl to an rhashtable
> 
> file_hashtbl can grow quite large, so use rhashtable, which has
> automatic growing (and shrinking).
> 
> Signed-off-by: Frank van der Linden <fllinden@amazon.com>
> ---
>  fs/nfsd/nfs4state.c | 112 +++++++++++++++++++++++++++++---------------
>  fs/nfsd/nfsctl.c    |   7 ++-
>  fs/nfsd/nfsd.h      |   4 ++
>  fs/nfsd/state.h     |   3 +-
>  4 files changed, 86 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 451a1071daf4..ff81c0136224 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -417,13 +417,33 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>  	kmem_cache_free(file_slab, fp);
>  }
>  
> +/* hash table for nfs4_file */
> +#define FILE_HASH_SIZE		256
> +
> +static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed);
> +static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed);
> +static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
> +				  const void *obj);
> +
> +static const struct rhashtable_params file_rhashparams = {
> +	.head_offset		= offsetof(struct nfs4_file, fi_hash),
> +	.min_size		= FILE_HASH_SIZE,
> +	.automatic_shrinking	= true,
> +	.hashfn			= nfsd4_file_key_hash,
> +	.obj_hashfn		= nfsd4_file_obj_hash,
> +	.obj_cmpfn		= nfsd4_file_obj_compare,
> +};
> +
> +struct rhashtable file_hashtbl;
> +
>  void
>  put_nfs4_file(struct nfs4_file *fi)
>  {
>  	might_lock(&state_lock);
>  
>  	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
> -		hlist_del_rcu(&fi->fi_hash);
> +		rhashtable_remove_fast(&file_hashtbl, &fi->fi_hash,
> +				       file_rhashparams);
>  		spin_unlock(&state_lock);
>  		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
>  		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
> @@ -527,21 +547,33 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
>  	return ret & OWNER_HASH_MASK;
>  }
>  
> -/* hash table for nfs4_file */
> -#define FILE_HASH_BITS                   8
> -#define FILE_HASH_SIZE                  (1 << FILE_HASH_BITS)
> -
> -static unsigned int nfsd_fh_hashval(struct knfsd_fh *fh)
> +static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed)
>  {
> -	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), 0);
> +	struct knfsd_fh *fh = (struct knfsd_fh *)data;
> +
> +	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
>  }
>  
> -static unsigned int file_hashval(struct knfsd_fh *fh)
> +static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed)
>  {
> -	return nfsd_fh_hashval(fh) & (FILE_HASH_SIZE - 1);
> +	struct nfs4_file *fp = (struct nfs4_file *)data;
> +	struct knfsd_fh *fh;
> +
> +	fh = &fp->fi_fhandle;
> +
> +	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
>  }
>  
> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
> +static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
> +				  const void *obj)
> +{
> +	struct nfs4_file *fp = (struct nfs4_file *)obj;
> +
> +	if (fh_match(&fp->fi_fhandle, (struct knfsd_fh *)arg->key))
> +		return 0;
> +
> +	return 1;
> +}
>  
>  static void
>  __nfs4_file_get_access(struct nfs4_file *fp, u32 access)
> @@ -4042,8 +4074,7 @@ static struct nfs4_file *nfsd4_alloc_file(void)
>  }
>  
>  /* OPEN Share state helper functions */
> -static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
> -				struct nfs4_file *fp)
> +static void nfsd4_init_file(struct knfsd_fh *fh, struct nfs4_file *fp)
>  {
>  	lockdep_assert_held(&state_lock);
>  
> @@ -4062,7 +4093,6 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
>  	INIT_LIST_HEAD(&fp->fi_lo_states);
>  	atomic_set(&fp->fi_lo_recalls, 0);
>  #endif
> -	hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
>  }
>  
>  void
> @@ -4126,6 +4156,18 @@ nfsd4_init_slabs(void)
>  	return -ENOMEM;
>  }
>  
> +int
> +nfsd4_init_hash(void)
> +{
> +	return rhashtable_init(&file_hashtbl, &file_rhashparams);
> +}
> +
> +void
> +nfsd4_free_hash(void)
> +{
> +	rhashtable_destroy(&file_hashtbl);
> +}
> +
>  static void init_nfs4_replay(struct nfs4_replay *rp)
>  {
>  	rp->rp_status = nfserr_serverfault;
> @@ -4395,30 +4437,19 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
>  }
>  
>  /* search file_hashtbl[] for file */
> -static struct nfs4_file *
> -find_file_locked(struct knfsd_fh *fh, unsigned int hashval)
> -{
> -	struct nfs4_file *fp;
> -
> -	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
> -				lockdep_is_held(&state_lock)) {
> -		if (fh_match(&fp->fi_fhandle, fh)) {
> -			if (refcount_inc_not_zero(&fp->fi_ref))
> -				return fp;
> -		}
> -	}
> -	return NULL;
> -}
> -
>  struct nfs4_file *
>  find_file(struct knfsd_fh *fh)
>  {
>  	struct nfs4_file *fp;
> -	unsigned int hashval = file_hashval(fh);
>  
>  	rcu_read_lock();
> -	fp = find_file_locked(fh, hashval);
> +	fp = rhashtable_lookup(&file_hashtbl, fh, file_rhashparams);
> +	if (fp) {
> +		if (IS_ERR(fp) || refcount_inc_not_zero(&fp->fi_ref))
> +			fp = NULL;
> +	}
>  	rcu_read_unlock();
> +
>  	return fp;
>  }
>  
> @@ -4426,22 +4457,27 @@ static struct nfs4_file *
>  find_or_add_file(struct nfs4_file *new, struct knfsd_fh *fh)
>  {
>  	struct nfs4_file *fp;
> -	unsigned int hashval = file_hashval(fh);
>  
> -	rcu_read_lock();
> -	fp = find_file_locked(fh, hashval);
> -	rcu_read_unlock();
> +	fp = find_file(fh);
>  	if (fp)
>  		return fp;
>  
> +	nfsd4_init_file(fh, new);
> +
>  	spin_lock(&state_lock);
> -	fp = find_file_locked(fh, hashval);
> -	if (likely(fp == NULL)) {
> -		nfsd4_init_file(fh, hashval, new);
> +
> +	fp = rhashtable_lookup_get_insert_key(&file_hashtbl, &new->fi_fhandle,
> +	    &new->fi_hash, file_rhashparams);
> +	if (likely(fp == NULL))
>  		fp = new;
> -	}
> +	else if (IS_ERR(fp))
> +		fp = NULL;
> +	else
> +		refcount_inc(&fp->fi_ref);
> +
>  	spin_unlock(&state_lock);
>  
> +
>  	return fp;
>  }
>  
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index b68e96681522..bac5d8cff1d3 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -1528,9 +1528,12 @@ static int __init init_nfsd(void)
>  	retval = nfsd4_init_slabs();
>  	if (retval)
>  		goto out_unregister_notifier;
> -	retval = nfsd4_init_pnfs();
> +	retval = nfsd4_init_hash();
>  	if (retval)
>  		goto out_free_slabs;
> +	retval = nfsd4_init_pnfs();
> +	if (retval)
> +		goto out_free_hash;
>  	nfsd_fault_inject_init(); /* nfsd fault injection controls */
>  	nfsd_stat_init();	/* Statistics */
>  	retval = nfsd_drc_slab_create();
> @@ -1554,6 +1557,8 @@ static int __init init_nfsd(void)
>  	nfsd_stat_shutdown();
>  	nfsd_fault_inject_cleanup();
>  	nfsd4_exit_pnfs();
> +out_free_hash:
> +	nfsd4_free_hash();
>  out_free_slabs:
>  	nfsd4_free_slabs();
>  out_unregister_notifier:
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 5343c771da18..fb0349d16158 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -141,6 +141,8 @@ nfsd_user_namespace(const struct svc_rqst *rqstp)
>  extern unsigned long max_delegations;
>  int nfsd4_init_slabs(void);
>  void nfsd4_free_slabs(void);
> +int nfsd4_init_hash(void);
> +void nfsd4_free_hash(void);
>  int nfs4_state_start(void);
>  int nfs4_state_start_net(struct net *net);
>  void nfs4_state_shutdown(void);
> @@ -151,6 +153,8 @@ bool nfsd4_spo_must_allow(struct svc_rqst *rqstp);
>  #else
>  static inline int nfsd4_init_slabs(void) { return 0; }
>  static inline void nfsd4_free_slabs(void) { }
> +static inline int nfsd4_init_hash(void) { return 0; }
> +static inline void nfsd4_free_hash(void) { }
>  static inline int nfs4_state_start(void) { return 0; }
>  static inline int nfs4_state_start_net(struct net *net) { return 0; }
>  static inline void nfs4_state_shutdown(void) { }
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 3b408532a5dc..bf66244a7a2d 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -38,6 +38,7 @@
>  #include <linux/idr.h>
>  #include <linux/refcount.h>
>  #include <linux/sunrpc/svc_xprt.h>
> +#include <linux/rhashtable.h>
>  #include "nfsfh.h"
>  #include "nfsd.h"
>  
> @@ -513,7 +514,7 @@ struct nfs4_clnt_odstate {
>  struct nfs4_file {
>  	refcount_t		fi_ref;
>  	spinlock_t		fi_lock;
> -	struct hlist_node       fi_hash;	/* hash on fi_fhandle */
> +	struct rhash_head	fi_hash;	/* hash on fi_fhandle */
>  	struct list_head        fi_stateids;
>  	union {
>  		struct list_head	fi_delegations;
> -- 
> 2.17.2
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-17 21:57         ` bfields
@ 2020-09-19 11:08           ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-19 11:08 UTC (permalink / raw)
  To: bfields; +Cc: Frank van der Linden, linux-nfs, linux-cachefs


----- On 17 Sep, 2020, at 22:57, bfields bfields@fieldses.org wrote:

> On Thu, Sep 17, 2020 at 08:23:03PM +0000, Frank van der Linden wrote:
>> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> > 
>> > On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>> > >
>> > > ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>> > >
>> > > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> > > >> second) quickly eat up the CPU on the re-export server and perf top
>> > > >> shows we are mostly in native_queued_spin_lock_slowpath.
>> > > >
>> > > > Any statistics on who's calling that function?
>> > >
>> > > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>> > > the CPU of the nfsd threads increase rapidly and by the time we have 100
>> > > clients, we have maxed out the 32 cores of the server with most of that in
>> > > native_queued_spin_lock_slowpath.
>> > 
>> > That sounds a lot like what Frank Van der Linden reported:
>> > 
>> >         https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> > 
>> > It looks like a bug in the filehandle caching code.
>> > 
>> > --b.
>> 
>> Yes, that does look like the same one.
>> 
>> I still think that not caching v4 files at all may be the best way to go
>> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> where you end up doing a lot of opens/closes, but it doesn't make as
>> much sense for v4.
>> 
>> However, short of that, I tested a local patch a few months back, that
>> I never posted here, so I'll do so now. It just makes v4 opens in to
>> 'long term' opens, which do not get put on the LRU, since that doesn't
>> make sense (they are in the hash table, so they are still cached).
> 
> That makes sense to me.  But I'm also not opposed to turning it off for
> v4 at this point.
> 
> --b.

Thank you both, that's absolutely the issue with our (broken) production workload. I totally missed that thread while researching the archives.

I tried both of Frank's patches and the CPU returned to normal levels: native_queued_spin_lock_slowpath went from 88% to 2% usage, and the server performed pretty much the same as it does for an NFSv3 export.

So, ultimately this had nothing to do with NFS re-exporting; it's just that I was using a newer kernel with filecache to do it. All our other NFSv4 originating servers are running older kernels, which is why our (broken) workload never caused us any problems before. Thanks for clearing that up for me.

With regards to dropping the filecache feature completely for NFSv4, I do wonder if it does still save a few precious network round-trips (which is especially important for my re-export scenario)? We want to be able to choose the level of caching on the re-export server and minimise expensive lookups to originating servers that may be many milliseconds away (coherency be damned).

Seeing as there was some interest in issue #1 (drop caches = estale re-exports) and this #4 issue (NFSv4 filecache vs ridiculous open/close counts), I'll post some more detail & reproducers next week for #2 (invalidating the re-export server's NFS client cache) and #3 (cached client metadata lookups not returned quickly enough when the client is busy with reads).

That way anyone trying to follow in my (re-exporting) footsteps is fully aware of all the potential performance pitfalls I have discovered so far.

Many thanks,

Daire


* Re: Adventures in NFS re-exporting
  2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
  2020-09-08  9:40 ` Mkrtchyan, Tigran
  2020-09-15 17:21 ` J. Bruce Fields
@ 2020-09-22 12:31 ` Daire Byrne
  2020-09-22 13:52   ` Trond Myklebust
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
  2 siblings, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-22 12:31 UTC (permalink / raw)
  To: linux-nfs; +Cc: linux-cachefs

Hi, 

I just thought I'd flesh out the other two issues I have found with re-exporting that are ultimately responsible for the biggest performance bottlenecks. And both of them revolve around the caching of metadata file lookups in the NFS client.

Especially for the case where we are re-exporting a server many milliseconds away (i.e. on-premise -> cloud), we want to be able to control how much the client caches metadata and file data so that its many LAN clients all benefit from the re-export server only having to do the WAN lookups once (within a specified coherency time).

Keeping the file data in the vfs page cache or on disk using fscache/cachefiles is fairly straightforward, but keeping the metadata cached is particularly difficult. And without the cached metadata we introduce long delays before we can serve the already present and locally cached file data to many waiting clients.

----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> cut the network packets back to the origin server to zero for repeated lookups.
> However, if a client of the re-export server walks paths and memory maps those
> files (i.e. loading an application), the re-export server starts issuing
> unexpected calls back to the origin server again, ignoring/invalidating the
> re-export server's NFS client cache. We worked around this by patching an
> inode/iversion validity check in inode.c so that the NFS client cache on the
> re-export server is used. I'm not sure about the correctness of this patch but
> it works for our corner case.

If we use actimeo=3600,nocto (say) to mount a remote software volume on the re-export server, we can successfully cache the loading of applications and walking of paths directly on the re-export server such that after a couple of runs, there are practically zero packets back to the originating NFS server (great!). But, if we then do the same thing on a client which is mounting that re-export server, the re-export server now starts issuing lots of calls back to the originating server and invalidating its client cache (bad!).

I'm not exactly sure why, but the iversion of the inode gets changed locally (due to atime modification?), most likely via inode_inc_iversion_raw(). Each time it is incremented, the next attribute validation detects a change, causing the attributes to be reloaded from the originating server.

This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:

--- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000 +0000
+++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
@@ -1869,7 +1869,7 @@
 
        /* More cache consistency checks */
        if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
-               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
+               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
                        /* Could it be a race with writeback? */
                        if (!(have_writers || have_delegation)) {
                                invalid |= NFS_INO_INVALID_DATA

With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).

Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly?

> 3) If we saturate an NFS client's network with reads from the server, all client
> metadata lookups become unbearably slow even if it's all cached in the NFS
> client's memory and no network RPCs should be required. This is the case for
> any NFS client regardless of re-exporting but it affects this case more because
> when we can't serve cached metadata we also can't serve the cached data. It
> feels like some sort of bottleneck in the client's ability to parallelise
> requests? We work around this by not maxing out our network.

I spent a bit more time testing this issue and it's not quite as I've written it. Again the issue is that we have very little control over preserving complete metadata caches to avoid expensive contact with the originating NFS server. Even though we can use actimeo,nocto mount options, these provide no guarantees that we can keep all the required metadata in cache when the page cache is under constant churn (e.g. NFS reads).

This has very little to do with the re-export of an NFS client mount and is more a general observation of how the NFS client works. It is probably relevant to anyone who wants to cache metadata for long periods of time (e.g. read-only, non-changing, over the WAN).

Let's consider how we might try to keep as much metadata cached in memory....

nfsclient # echo 0 >/proc/sys/vm/vfs_cache_pressure
nfsclient # mount -o vers=3,actimeo=7200,nocto,ro,nolock nfsserver:/usr /mnt/nfsserver
nfsclient # for x in {1..3}; do /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null; sleep 5; done
53.23 <- first time so lots of network traffic
2.82 <- now cached for actimeo=7200 with almost no packets between nfsserver & nfsclient
2.85

This is ideal: as long as we don't touch the page cache, repeated walks of the remote server will all come from cache until the attribute cache times out.

We can even read from the remote server using either directio or fadvise, so that we don't upset the client's page cache and we keep the complete metadata cache intact, e.g.:

nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" iflag=direct of=/dev/null bs=1M &>/dev/null'
nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'nocache dd if="X" of=/dev/null bs=1M &>/dev/null'
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
2.82 <- still showing good complete cached metadata

But as soon as we switch to the more normal reading of file data which then populates the page cache, we lose portions of our cached metadata (readdir?) even when there is plenty of RAM available.

nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" of=/dev/null bs=1M &>/dev/null'
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
10.82 <- still mostly cached metadata but we had to do some fresh lookups

Now, once our NFS client starts doing lots of sustained reads that max out the network, we end up both dropping useful cached metadata (before actimeo expires) and finding it harder to get the new metadata lookups serviced in a timely fashion, because the reads are so much more dominant (and require fewer round trips to get more done).

So if we do the reads and try to do the filesystem walk at the same time, we get even slower performance:

nfsclient # (find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" of=/dev/null bs=1M &>/dev/null') &
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
30.12

As we increase the number of simultaneous threads for the reads (e.g. knfsd threads), the single thread of metadata lookups gets slower and slower.

So even when setting vfs_cache_pressure=0 (to keep nfs inodes in memory), setting actimeo=large and using nocto to avoid more lookups, we still can't keep a complete metadata cache in memory for any specified time when the server is doing lots of reads and churning through the page cache.

So, while I am not able to provide many answers or solutions to any of the issues I have highlighted in this email thread, hopefully I have described in enough detail all the main performance hurdles others will likely run into if they attempt this in production as we have.

And like I said from the outset, it's already stable enough for us to use in production and it's definitely better than nothing... ;)

Regards,

Daire


* Re: Adventures in NFS re-exporting
  2020-09-22 12:31 ` Daire Byrne
@ 2020-09-22 13:52   ` Trond Myklebust
  2020-09-23 12:40     ` J. Bruce Fields
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-09-22 13:52 UTC (permalink / raw)
  To: linux-nfs, daire; +Cc: linux-cachefs

On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> Hi, 
> 
> I just thought I'd flesh out the other two issues I have found with
> re-exporting that are ultimately responsible for the biggest
> performance bottlenecks. And both of them revolve around the caching
> of metadata file lookups in the NFS client.
> 
> Especially for the case where we are re-exporting a server many
> milliseconds away (i.e. on-premise -> cloud), we want to be able to
> control how much the client caches metadata and file data so that
> it's many LAN clients all benefit from the re-export server only
> having to do the WAN lookups once (within a specified coherency
> time).
> 
> Keeping the file data in the vfs page cache or on disk using
> fscache/cachefiles is fairly straightforward, but keeping the
> metadata cached is particularly difficult. And without the cached
> metadata we introduce long delays before we can serve the already
> present and locally cached file data to many waiting clients.
> 
> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > 2) If we cache metadata on the re-export server using
> > actimeo=3600,nocto we can
> > cut the network packets back to the origin server to zero for
> > repeated lookups.
> > However, if a client of the re-export server walks paths and memory
> > maps those
> > files (i.e. loading an application), the re-export server starts
> > issuing
> > unexpected calls back to the origin server again,
> > ignoring/invalidating the
> > re-export server's NFS client cache. We worked around this this by
> > patching an
> > inode/iversion validity check in inode.c so that the NFS client
> > cache on the
> > re-export server is used. I'm not sure about the correctness of
> > this patch but
> > it works for our corner case.
> 
> If we use actimeo=3600,nocto (say) to mount a remote software volume
> on the re-export server, we can successfully cache the loading of
> applications and walking of paths directly on the re-export server
> such that after a couple of runs, there are practically zero packets
> back to the originating NFS server (great!). But, if we then do the
> same thing on a client which is mounting that re-export server, the
> re-export server now starts issuing lots of calls back to the
> originating server and invalidating it's client cache (bad!).
> 
> I'm not exactly sure why, but the iversion of the inode gets changed
> locally (due to atime modification?) most likely via invocation of
> method inode_inc_iversion_raw. Each time it gets incremented the
> following call to validate attributes detects changes causing it to
> be reloaded from the originating server.
> 
> This patch helps to avoid this when applied to the re-export server
> but there may be other places where this happens too. I accept that
> this patch is probably not the right/general way to do this, but it
> helps to highlight the issue when re-exporting and it works well for
> our use case:
> 
> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> 00:23:03.000000000 +0000
> +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> @@ -1869,7 +1869,7 @@
>  
>         /* More cache consistency checks */
>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>                         /* Could it be a race with writeback? */
>                         if (!(have_writers || have_delegation)) {
>                                 invalid |= NFS_INO_INVALID_DATA


There is nothing in the base NFSv4, and NFSv4.1 specs that allow you to
make assumptions about how the change attribute behaves over time.

The only safe way to do something like the above is if the server
supports NFSv4.2 and also advertises support for the 'change_attr_type'
attribute. In that case, you can check at mount time for whether or not
the change attribute on this filesystem is one of the monotonic types
which would allow the above optimisation.
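Sketched concretely, a mount-time check along these lines could look like the following. The constants follow RFC 7862, section 12.2.3; the helper name and surrounding logic are invented for illustration and are not the actual Linux client code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* change_attr_type values as defined in RFC 7862, section 12.2.3. */
enum {
	NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
	NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
	NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4,
};

/*
 * At mount time the client could record whether a "newer change
 * attribute wins" comparison is safe on this filesystem. Only the
 * counter-style types give an ordering guarantee; TIME_METADATA and
 * UNDEFINED do not.
 */
static bool change_attr_is_ordered(uint32_t change_attr_type)
{
	switch (change_attr_type) {
	case NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:
	case NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:
	case NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS:
		return true;
	default:
		return false;
	}
}
```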


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-17 20:23       ` Frank van der Linden
  2020-09-17 21:57         ` bfields
@ 2020-09-22 16:43         ` Chuck Lever
  2020-09-23 20:25           ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Chuck Lever @ 2020-09-22 16:43 UTC (permalink / raw)
  To: Frank van der Linden, Bruce Fields
  Cc: Daire Byrne, Linux NFS Mailing List, linux-cachefs



> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
> 
> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> 
>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>>> 
>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>>> 
>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>>>>> second) quickly eat up the CPU on the re-export server and perf top
>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>>>> 
>>>> Any statistics on who's calling that function?
>>> 
>>> I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
>>> 
>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
>> 
>> That sounds a lot like what Frank Van der Linden reported:
>> 
>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> 
>> It looks like a bug in the filehandle caching code.
>> 
>> --b.
> 
> Yes, that does look like the same one.
> 
> I still think that not caching v4 files at all may be the best way to go
> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> where you end up doing a lot of opens/closes, but it doesn't make as
> much sense for v4.
> 
> However, short of that, I tested a local patch a few months back, that
> I never posted here, so I'll do so now. It just makes v4 opens in to
> 'long term' opens, which do not get put on the LRU, since that doesn't
> make sense (they are in the hash table, so they are still cached).
> 
> Also, the file caching code seems to walk the LRU a little too often,
> but that's another issue - and this change keeps the LRU short, so it's
> not a big deal.
> 
> I don't particularly love this patch, but it does keep the LRU short, and
> did significantly speed up my testcase (by about 50%). So, maybe you can
> give it a try.
> 
> I'll also attach a second patch, that converts the hash table to an rhashtable,
> which automatically grows and shrinks in size with usage. That patch also
> helped, but not by nearly as much (I think it yielded another 10%).

For what it's worth, I applied your two patches to my test server, along
with my patch that force-closes cached file descriptors during NFSv4
CLOSE processing. The patch combination improves performance (faster
elapsed time) for my workload as well.
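As a rough illustration of the "long term" open idea quoted above: entries remain findable via the hash table, while only short-term (v2/v3-style) entries are linked on the LRU for aging out. All names and structure here are invented for this toy model; this is not the actual nfsd filecache code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cached_file {
	bool long_term;               /* e.g. an NFSv4 open */
	struct cached_file *lru_next; /* LRU linkage; unused if long_term */
};

static struct cached_file *lru_head;

static void cache_insert(struct cached_file *f)
{
	/* hash-table insertion elided; every entry is hashed */
	if (!f->long_term) {
		/* only short-term entries age out via the LRU */
		f->lru_next = lru_head;
		lru_head = f;
	}
}

static size_t lru_length(void)
{
	size_t n = 0;
	for (struct cached_file *f = lru_head; f; f = f->lru_next)
		n++;
	return n;
}
```

Keeping long-term opens off the LRU is what keeps the list short, so the periodic LRU walks stay cheap even with many cached v4 opens.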


--
Chuck Lever




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-22 13:52   ` Trond Myklebust
@ 2020-09-23 12:40     ` J. Bruce Fields
  2020-09-23 13:09       ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-09-23 12:40 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, daire, linux-cachefs

On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > Hi, 
> > 
> > I just thought I'd flesh out the other two issues I have found with
> > re-exporting that are ultimately responsible for the biggest
> > performance bottlenecks. And both of them revolve around the caching
> > of metadata file lookups in the NFS client.
> > 
> > Especially for the case where we are re-exporting a server many
> > milliseconds away (i.e. on-premise -> cloud), we want to be able to
> > control how much the client caches metadata and file data so that
> > its many LAN clients all benefit from the re-export server only
> > having to do the WAN lookups once (within a specified coherency
> > time).
> > 
> > Keeping the file data in the vfs page cache or on disk using
> > fscache/cachefiles is fairly straightforward, but keeping the
> > metadata cached is particularly difficult. And without the cached
> > metadata we introduce long delays before we can serve the already
> > present and locally cached file data to many waiting clients.
> > 
> > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > 2) If we cache metadata on the re-export server using
> > > actimeo=3600,nocto we can
> > > cut the network packets back to the origin server to zero for
> > > repeated lookups.
> > > However, if a client of the re-export server walks paths and memory
> > > maps those
> > > files (i.e. loading an application), the re-export server starts
> > > issuing
> > > unexpected calls back to the origin server again,
> > > ignoring/invalidating the
> > > re-export server's NFS client cache. We worked around this by
> > > patching an
> > > inode/iversion validity check in inode.c so that the NFS client
> > > cache on the
> > > re-export server is used. I'm not sure about the correctness of
> > > this patch but
> > > it works for our corner case.
> > 
> > If we use actimeo=3600,nocto (say) to mount a remote software volume
> > on the re-export server, we can successfully cache the loading of
> > applications and walking of paths directly on the re-export server
> > such that after a couple of runs, there are practically zero packets
> > back to the originating NFS server (great!). But, if we then do the
> > same thing on a client which is mounting that re-export server, the
> > re-export server now starts issuing lots of calls back to the
> > originating server and invalidating its client cache (bad!).
> > 
> > I'm not exactly sure why, but the iversion of the inode gets changed
> > locally (due to atime modification?) most likely via invocation of
> > method inode_inc_iversion_raw. Each time it gets incremented the
> > following call to validate attributes detects changes causing it to
> > be reloaded from the originating server.
> > 
> > This patch helps to avoid this when applied to the re-export server
> > but there may be other places where this happens too. I accept that
> > this patch is probably not the right/general way to do this, but it
> > helps to highlight the issue when re-exporting and it works well for
> > our use case:
> > 
> > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > 00:23:03.000000000 +0000
> > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > @@ -1869,7 +1869,7 @@
> >  
> >         /* More cache consistency checks */
> >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> >                         /* Could it be a race with writeback? */
> >                         if (!(have_writers || have_delegation)) {
> >                                 invalid |= NFS_INO_INVALID_DATA
> 
> 
> There is nothing in the base NFSv4, and NFSv4.1 specs that allow you to
> make assumptions about how the change attribute behaves over time.
> 
> The only safe way to do something like the above is if the server
> supports NFSv4.2 and also advertises support for the 'change_attr_type'
> attribute. In that case, you can check at mount time for whether or not
> the change attribute on this filesystem is one of the monotonic types
> which would allow the above optimisation.

Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?

The Linux server's ctime is monotonic and will advertise that with
change_attr_type since 4.19.

So I think it would be easy to patch the client to check
change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
server->caps; the hard part would be figuring out which optimisations
are OK.
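Sketched concretely (NFS_CAP_MONOTONIC_CHANGE is the flag being proposed here, not something the client defines today, and the helper below is an illustrative stand-in for the real revalidation logic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit, set at mount time when the server
 * advertises an ordered change_attr_type. */
#define NFS_CAP_MONOTONIC_CHANGE  (1u << 0)

struct nfs_server_sketch { uint32_t caps; };

/*
 * Return true when the cached inode must be invalidated for this
 * change attribute. With an ordered change attribute, an apparent
 * regression can be ignored; otherwise any difference must
 * invalidate, which is the only spec-safe default.
 */
static bool change_attr_requires_invalidate(const struct nfs_server_sketch *s,
					    uint64_t cached, uint64_t received)
{
	if (s->caps & NFS_CAP_MONOTONIC_CHANGE)
		return cached < received;
	return cached != received;
}
```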

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 12:40     ` J. Bruce Fields
@ 2020-09-23 13:09       ` Trond Myklebust
  2020-09-23 17:07         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-09-23 13:09 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > Hi, 
> > > 
> > > I just thought I'd flesh out the other two issues I have found
> > > with
> > > re-exporting that are ultimately responsible for the biggest
> > > performance bottlenecks. And both of them revolve around the
> > > caching
> > > of metadata file lookups in the NFS client.
> > > 
> > > Especially for the case where we are re-exporting a server many
> > > milliseconds away (i.e. on-premise -> cloud), we want to be able
> > > to
> > > control how much the client caches metadata and file data so that
> > > its many LAN clients all benefit from the re-export server only
> > > having to do the WAN lookups once (within a specified coherency
> > > time).
> > > 
> > > Keeping the file data in the vfs page cache or on disk using
> > > fscache/cachefiles is fairly straightforward, but keeping the
> > > metadata cached is particularly difficult. And without the cached
> > > metadata we introduce long delays before we can serve the already
> > > present and locally cached file data to many waiting clients.
> > > 
> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > > 2) If we cache metadata on the re-export server using
> > > > actimeo=3600,nocto we can
> > > > cut the network packets back to the origin server to zero for
> > > > repeated lookups.
> > > > However, if a client of the re-export server walks paths and
> > > > memory
> > > > maps those
> > > > files (i.e. loading an application), the re-export server
> > > > starts
> > > > issuing
> > > > unexpected calls back to the origin server again,
> > > > ignoring/invalidating the
> > > > re-export server's NFS client cache. We worked around this
> > > > by
> > > > patching an
> > > > inode/iversion validity check in inode.c so that the NFS client
> > > > cache on the
> > > > re-export server is used. I'm not sure about the correctness of
> > > > this patch but
> > > > it works for our corner case.
> > > 
> > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > volume
> > > on the re-export server, we can successfully cache the loading of
> > > applications and walking of paths directly on the re-export
> > > server
> > > such that after a couple of runs, there are practically zero
> > > packets
> > > back to the originating NFS server (great!). But, if we then do
> > > the
> > > same thing on a client which is mounting that re-export server,
> > > the
> > > re-export server now starts issuing lots of calls back to the
> > > originating server and invalidating its client cache (bad!).
> > > 
> > > I'm not exactly sure why, but the iversion of the inode gets
> > > changed
> > > locally (due to atime modification?) most likely via invocation
> > > of
> > > method inode_inc_iversion_raw. Each time it gets incremented the
> > > following call to validate attributes detects changes causing it
> > > to
> > > be reloaded from the originating server.
> > > 
> > > This patch helps to avoid this when applied to the re-export
> > > server
> > > but there may be other places where this happens too. I accept
> > > that
> > > this patch is probably not the right/general way to do this, but
> > > it
> > > helps to highlight the issue when re-exporting and it works well
> > > for
> > > our use case:
> > > 
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > 00:23:03.000000000 +0000
> > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >  
> > >         /* More cache consistency checks */
> > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > >                         /* Could it be a race with writeback? */
> > >                         if (!(have_writers || have_delegation)) {
> > >                                 invalid |= NFS_INO_INVALID_DATA
> > 
> > There is nothing in the base NFSv4, and NFSv4.1 specs that allow
> > you to
> > make assumptions about how the change attribute behaves over time.
> > 
> > The only safe way to do something like the above is if the server
> > supports NFSv4.2 and also advertises support for the
> > 'change_attr_type'
> > attribute. In that case, you can check at mount time for whether or
> > not
> > the change attribute on this filesystem is one of the monotonic
> > types
> > which would allow the above optimisation.
> 
> Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
> think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?
> 
> The Linux server's ctime is monotonic and will advertise that with
> change_attr_type since 4.19.
> 
> So I think it would be easy to patch the client to check
> change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> server->caps; the hard part would be figuring out which optimisations
> are OK.
> 

The ctime is *not* monotonic. It can regress under server reboots and
it can regress if someone deliberately changes the time. We have code
that tries to handle all these issues (see fattr->gencount and
nfsi->attr_gencount) because we've hit those issues before...
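As a rough sketch of that generation-count idea (the names below are simplified stand-ins for fattr->gencount and nfsi->attr_gencount, not the real client logic): each attribute result is stamped when its RPC is launched, and a result older than what the inode has already absorbed is discarded, so a regressed ctime from a stale reply cannot wind the cache backwards:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t attr_generation;

struct fattr_sketch { uint64_t gencount; };
struct inode_sketch { uint64_t attr_gencount; };

static void fattr_init(struct fattr_sketch *f)
{
	f->gencount = ++attr_generation;   /* stamp at RPC launch */
}

/* Absorb an attribute result only if it is at least as new as the
 * newest result the inode has already seen. */
static bool inode_accept_fattr(struct inode_sketch *i,
			       const struct fattr_sketch *f)
{
	if (f->gencount < i->attr_gencount)
		return false;              /* stale reply, ignore it */
	i->attr_gencount = f->gencount;
	return true;
}
```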

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 13:09       ` Trond Myklebust
@ 2020-09-23 17:07         ` bfields
  0 siblings, 0 replies; 129+ messages in thread
From: bfields @ 2020-09-23 17:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Wed, Sep 23, 2020 at 01:09:01PM +0000, Trond Myklebust wrote:
> On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > Hi, 
> > > > 
> > > > I just thought I'd flesh out the other two issues I have found
> > > > with
> > > > re-exporting that are ultimately responsible for the biggest
> > > > performance bottlenecks. And both of them revolve around the
> > > > caching
> > > > of metadata file lookups in the NFS client.
> > > > 
> > > > Especially for the case where we are re-exporting a server many
> > > > milliseconds away (i.e. on-premise -> cloud), we want to be able
> > > > to
> > > > control how much the client caches metadata and file data so that
> > > > its many LAN clients all benefit from the re-export server only
> > > > having to do the WAN lookups once (within a specified coherency
> > > > time).
> > > > 
> > > > Keeping the file data in the vfs page cache or on disk using
> > > > fscache/cachefiles is fairly straightforward, but keeping the
> > > > metadata cached is particularly difficult. And without the cached
> > > > metadata we introduce long delays before we can serve the already
> > > > present and locally cached file data to many waiting clients.
> > > > 
> > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > > > 2) If we cache metadata on the re-export server using
> > > > > actimeo=3600,nocto we can
> > > > > cut the network packets back to the origin server to zero for
> > > > > repeated lookups.
> > > > > However, if a client of the re-export server walks paths and
> > > > > memory
> > > > > maps those
> > > > > files (i.e. loading an application), the re-export server
> > > > > starts
> > > > > issuing
> > > > > unexpected calls back to the origin server again,
> > > > > ignoring/invalidating the
> > > > > re-export server's NFS client cache. We worked around this
> > > > > by
> > > > > patching an
> > > > > inode/iversion validity check in inode.c so that the NFS client
> > > > > cache on the
> > > > > re-export server is used. I'm not sure about the correctness of
> > > > > this patch but
> > > > > it works for our corner case.
> > > > 
> > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > volume
> > > > on the re-export server, we can successfully cache the loading of
> > > > applications and walking of paths directly on the re-export
> > > > server
> > > > such that after a couple of runs, there are practically zero
> > > > packets
> > > > back to the originating NFS server (great!). But, if we then do
> > > > the
> > > > same thing on a client which is mounting that re-export server,
> > > > the
> > > > re-export server now starts issuing lots of calls back to the
> > > > originating server and invalidating its client cache (bad!).
> > > > 
> > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > changed
> > > > locally (due to atime modification?) most likely via invocation
> > > > of
> > > > method inode_inc_iversion_raw. Each time it gets incremented the
> > > > following call to validate attributes detects changes causing it
> > > > to
> > > > be reloaded from the originating server.
> > > > 
> > > > This patch helps to avoid this when applied to the re-export
> > > > server
> > > > but there may be other places where this happens too. I accept
> > > > that
> > > > this patch is probably not the right/general way to do this, but
> > > > it
> > > > helps to highlight the issue when re-exporting and it works well
> > > > for
> > > > our use case:
> > > > 
> > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > > 00:23:03.000000000 +0000
> > > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > > @@ -1869,7 +1869,7 @@
> > > >  
> > > >         /* More cache consistency checks */
> > > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > > >                         /* Could it be a race with writeback? */
> > > >                         if (!(have_writers || have_delegation)) {
> > > >                                 invalid |= NFS_INO_INVALID_DATA
> > > 
> > > There is nothing in the base NFSv4, and NFSv4.1 specs that allow
> > > you to
> > > make assumptions about how the change attribute behaves over time.
> > > 
> > > The only safe way to do something like the above is if the server
> > > supports NFSv4.2 and also advertises support for the
> > > 'change_attr_type'
> > > attribute. In that case, you can check at mount time for whether or
> > > not
> > > the change attribute on this filesystem is one of the monotonic
> > > types
> > > which would allow the above optimisation.
> > 
> > Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
> > think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?
> > 
> > The Linux server's ctime is monotonic and will advertise that with
> > change_attr_type since 4.19.
> > 
> > So I think it would be easy to patch the client to check
> > change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> > server->caps; the hard part would be figuring out which optimisations
> > are OK.
> > 
> 
> The ctime is *not* monotonic. It can regress under server reboots and
> it can regress if someone deliberately changes the time.

So, anything other than IS_UNDEFINED or IS_TIME_METADATA?

Though the Linux server is susceptible to some of that even when it
returns MONOTONIC_INCR.  If the admin replaces the filesystem by an older
snapshot, there's not much we can do.  I'm not sure what degree of
guarantee we need.

--b.

> We have code
> that tries to handle all these issues (see fattr->gencount and
> nfsi->attr_gencount) because we've hit those issues before...



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-22 16:43         ` Chuck Lever
@ 2020-09-23 20:25           ` Daire Byrne
  2020-09-23 21:01             ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-23 20:25 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Frank van der Linden, bfields, linux-nfs, linux-cachefs


----- On 22 Sep, 2020, at 17:43, Chuck Lever chuck.lever@oracle.com wrote:
>> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
>> 
>> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>>> 
>>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>>>> 
>>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>>>> 
>>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>>>>>> second) quickly eat up the CPU on the re-export server and perf top
>>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>>>>> 
>>>>> Any statistics on who's calling that function?
>>>> 
>>>> I've always struggled to reproduce this with a simple open/close simulation, so
>>>> I suspect some other operations need to be mixed in too. But I have one
>>>> production workload that I know has lots of opens & closes (buggy software)
>>>> included in amongst the usual reads, writes etc.
>>>> 
>>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
>>>> clients, we have maxed out the 32 cores of the server with most of that in
>>>> native_queued_spin_lock_slowpath.
>>> 
>>> That sounds a lot like what Frank Van der Linden reported:
>>> 
>>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>> 
>>> It looks like a bug in the filehandle caching code.
>>> 
>>> --b.
>> 
>> Yes, that does look like the same one.
>> 
>> I still think that not caching v4 files at all may be the best way to go
>> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> where you end up doing a lot of opens/closes, but it doesn't make as
>> much sense for v4.
>> 
>> However, short of that, I tested a local patch a few months back, that
>> I never posted here, so I'll do so now. It just makes v4 opens in to
>> 'long term' opens, which do not get put on the LRU, since that doesn't
>> make sense (they are in the hash table, so they are still cached).
>> 
>> Also, the file caching code seems to walk the LRU a little too often,
>> but that's another issue - and this change keeps the LRU short, so it's
>> not a big deal.
>> 
>> I don't particularly love this patch, but it does keep the LRU short, and
>> did significantly speed up my testcase (by about 50%). So, maybe you can
>> give it a try.
>> 
>> I'll also attach a second patch, that converts the hash table to an rhashtable,
>> which automatically grows and shrinks in size with usage. That patch also
>> helped, but not by nearly as much (I think it yielded another 10%).
> 
> For what it's worth, I applied your two patches to my test server, along
> with my patch that force-closes cached file descriptors during NFSv4
> CLOSE processing. The patch combination improves performance (faster
> elapsed time) for my workload as well.

I tested Frank's NFSv4 filecache patches with some production workloads and I've hit the below refcount issue a couple of times in the last 48 hours with v5.8.10. This server was re-exporting an NFS client mount at the time.

Apologies for the spam if I've just hit something unrelated to the patches that is present in v5.8.10.... In truth, I have not used this kernel version before with this workload and just patched it because I had it ready to go. I'll remove the 2 patches and verify.

Daire


[ 8930.027838] ------------[ cut here ]------------
[ 8930.032769] refcount_t: addition on 0; use-after-free.
[ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25 refcount_warn_saturate+0x6e/0xf0
[ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4 input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G        W         5.8.10-1.dneg.x86_64 #1
[ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
[ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
[ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
[ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
[ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
[ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
[ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
[ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
[ 8930.182976] FS:  0000000000000000(0000) GS:ffff888d0bc80000(0000) knlGS:0000000000000000
[ 8930.191231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
[ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8930.219027] Call Trace:
[ 8930.221665]  nfsd4_process_open2+0xa48/0xec0 [nfsd]
[ 8930.226724]  ? nfsd_permission+0x6b/0x100 [nfsd]
[ 8930.231524]  ? fh_verify+0x167/0x210 [nfsd]
[ 8930.235893]  nfsd4_open+0x407/0x820 [nfsd]
[ 8930.240248]  nfsd4_proc_compound+0x3c2/0x760 [nfsd]
[ 8930.245296]  ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
[ 8930.251734]  nfsd_dispatch+0xe2/0x220 [nfsd]
[ 8930.256213]  svc_process_common+0x47b/0x6f0 [sunrpc]
[ 8930.261355]  ? svc_sock_secure_port+0x16/0x30 [sunrpc]
[ 8930.266707]  ? nfsd_svc+0x330/0x330 [nfsd]
[ 8930.270981]  svc_process+0xc5/0x100 [sunrpc]
[ 8930.275423]  nfsd+0xe8/0x150 [nfsd]
[ 8930.280028]  kthread+0x114/0x150
[ 8930.283434]  ? nfsd_destroy+0x60/0x60 [nfsd]
[ 8930.287875]  ? kthread_park+0x90/0x90
[ 8930.291700]  ret_from_fork+0x22/0x30
[ 8930.295447] ---[ end trace c551536c3520545c ]---

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 20:25           ` Daire Byrne
@ 2020-09-23 21:01             ` Frank van der Linden
  2020-09-26  9:00               ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: Frank van der Linden @ 2020-09-23 21:01 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Chuck Lever, bfields, linux-nfs, linux-cachefs

On Wed, Sep 23, 2020 at 09:25:07PM +0100, Daire Byrne wrote:
> 
> ----- On 22 Sep, 2020, at 17:43, Chuck Lever chuck.lever@oracle.com wrote:
> >> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
> >>
> >> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> >>>
> >>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> >>>>
> >>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> >>>>
> >>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> >>>>>> second) quickly eat up the CPU on the re-export server and perf top
> >>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
> >>>>>
> >>>>> Any statistics on who's calling that function?
> >>>>
> >>>> I've always struggled to reproduce this with a simple open/close simulation, so
> >>>> I suspect some other operations need to be mixed in too. But I have one
> >>>> production workload that I know has lots of opens & closes (buggy software)
> >>>> included in amongst the usual reads, writes etc.
> >>>>
> >>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
> >>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
> >>>> clients, we have maxed out the 32 cores of the server with most of that in
> >>>> native_queued_spin_lock_slowpath.
> >>>
> >>> That sounds a lot like what Frank Van der Linden reported:
> >>>
> >>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> >>>
> >>> It looks like a bug in the filehandle caching code.
> >>>
> >>> --b.
> >>
> >> Yes, that does look like the same one.
> >>
> >> I still think that not caching v4 files at all may be the best way to go
> >> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> >> where you end up doing a lot of opens/closes, but it doesn't make as
> >> much sense for v4.
> >>
> >> However, short of that, I tested a local patch a few months back, that
> >> I never posted here, so I'll do so now. It just makes v4 opens in to
> >> 'long term' opens, which do not get put on the LRU, since that doesn't
> >> make sense (they are in the hash table, so they are still cached).
> >>
> >> Also, the file caching code seems to walk the LRU a little too often,
> >> but that's another issue - and this change keeps the LRU short, so it's
> >> not a big deal.
> >>
> >> I don't particularly love this patch, but it does keep the LRU short, and
> >> did significantly speed up my testcase (by about 50%). So, maybe you can
> >> give it a try.
> >>
> >> I'll also attach a second patch, that converts the hash table to an rhashtable,
> >> which automatically grows and shrinks in size with usage. That patch also
> >> helped, but not by nearly as much (I think it yielded another 10%).
> >
> > For what it's worth, I applied your two patches to my test server, along
> > with my patch that force-closes cached file descriptors during NFSv4
> > CLOSE processing. The patch combination improves performance (faster
> > elapsed time) for my workload as well.
> 
> I tested Frank's NFSv4 filecache patches with some production workloads and I've hit the below refcount issue a couple of times in the last 48 hours with v5.8.10. This server was re-exporting an NFS client mount at the time.
> 
> Apologies for the spam if I've just hit something unrelated to the patches that is present in v5.8.10.... In truth, I have not used this kernel version before with this workload and just patched it because I had it ready to go. I'll remove the 2 patches and verify.
> 
> Daire
> 
> 
> [ 8930.027838] ------------[ cut here ]------------
> [ 8930.032769] refcount_t: addition on 0; use-after-free.
> [ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25 refcount_warn_saturate+0x6e/0xf0
> [ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4 input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
> [ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G        W         5.8.10-1.dneg.x86_64 #1
> [ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
> [ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
> [ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
> [ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
> [ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
> [ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
> [ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
> [ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
> [ 8930.182976] FS:  0000000000000000(0000) GS:ffff888d0bc80000(0000) knlGS:0000000000000000
> [ 8930.191231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
> [ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8930.219027] Call Trace:
> [ 8930.221665]  nfsd4_process_open2+0xa48/0xec0 [nfsd]
> [ 8930.226724]  ? nfsd_permission+0x6b/0x100 [nfsd]
> [ 8930.231524]  ? fh_verify+0x167/0x210 [nfsd]
> [ 8930.235893]  nfsd4_open+0x407/0x820 [nfsd]
> [ 8930.240248]  nfsd4_proc_compound+0x3c2/0x760 [nfsd]
> [ 8930.245296]  ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
> [ 8930.251734]  nfsd_dispatch+0xe2/0x220 [nfsd]
> [ 8930.256213]  svc_process_common+0x47b/0x6f0 [sunrpc]
> [ 8930.261355]  ? svc_sock_secure_port+0x16/0x30 [sunrpc]
> [ 8930.266707]  ? nfsd_svc+0x330/0x330 [nfsd]
> [ 8930.270981]  svc_process+0xc5/0x100 [sunrpc]
> [ 8930.275423]  nfsd+0xe8/0x150 [nfsd]
> [ 8930.280028]  kthread+0x114/0x150
> [ 8930.283434]  ? nfsd_destroy+0x60/0x60 [nfsd]
> [ 8930.287875]  ? kthread_park+0x90/0x90
> [ 8930.291700]  ret_from_fork+0x22/0x30
> [ 8930.295447] ---[ end trace c551536c3520545c ]---

It's entirely possible that my patch introduces a refcounting error - it was
intended as a proof-of-concept on how to fix the LRU locking issue for v4
open file caching (while keeping it enabled) - which is why I didn't
"formally" send it in.

Having said that, I don't immediately see the problem.

Maybe try it without the rhashtable patch, that is much less of an
optimization.

The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
of nfs4_file. If it's the latter, it's probably the rhashtable change.

- Frank

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 21:01             ` Frank van der Linden
@ 2020-09-26  9:00               ` Daire Byrne
  2020-09-28 15:49                 ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-26  9:00 UTC (permalink / raw)
  To: Frank van der Linden; +Cc: Chuck Lever, bfields, linux-nfs, linux-cachefs

----- On 23 Sep, 2020, at 22:01, Frank van der Linden fllinden@amazon.com wrote:

> On Wed, Sep 23, 2020 at 09:25:07PM +0100, Daire Byrne wrote:
>> 
>> ----- On 22 Sep, 2020, at 17:43, Chuck Lever chuck.lever@oracle.com wrote:
>> >> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
>> >>
>> >> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> >>>
>> >>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>> >>>>
>> >>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>> >>>>
>> >>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> >>>>>> second) quickly eat up the CPU on the re-export server and perf top
>> >>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>> >>>>>
>> >>>>> Any statistics on who's calling that function?
>> >>>>
>> >>>> I've always struggled to reproduce this with a simple open/close simulation, so
>> >>>> I suspect some other operations need to be mixed in too. But I have one
>> >>>> production workload that I know has lots of opens & closes (buggy software)
>> >>>> included in amongst the usual reads, writes etc.
>> >>>>
>> >>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>> >>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
>> >>>> clients, we have maxed out the 32 cores of the server with most of that in
>> >>>> native_queued_spin_lock_slowpath.
>> >>>
>> >>> That sounds a lot like what Frank Van der Linden reported:
>> >>>
>> >>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> >>>
>> >>> It looks like a bug in the filehandle caching code.
>> >>>
>> >>> --b.
>> >>
>> >> Yes, that does look like the same one.
>> >>
>> >> I still think that not caching v4 files at all may be the best way to go
>> >> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> >> where you end up doing a lot of opens/closes, but it doesn't make as
>> >> much sense for v4.
>> >>
>> >> However, short of that, I tested a local patch a few months back, that
>> >> I never posted here, so I'll do so now. It just makes v4 opens in to
>> >> 'long term' opens, which do not get put on the LRU, since that doesn't
>> >> make sense (they are in the hash table, so they are still cached).
>> >>
>> >> Also, the file caching code seems to walk the LRU a little too often,
>> >> but that's another issue - and this change keeps the LRU short, so it's
>> >> not a big deal.
>> >>
>> >> I don't particularly love this patch, but it does keep the LRU short, and
>> >> did significantly speed up my testcase (by about 50%). So, maybe you can
>> >> give it a try.
>> >>
>> >> I'll also attach a second patch, that converts the hash table to an rhashtable,
>> >> which automatically grows and shrinks in size with usage. That patch also
>> >> helped, but not by nearly as much (I think it yielded another 10%).
>> >
>> > For what it's worth, I applied your two patches to my test server, along
>> > with my patch that force-closes cached file descriptors during NFSv4
>> > CLOSE processing. The patch combination improves performance (faster
>> > elapsed time) for my workload as well.
>> 
>> I tested Frank's NFSv4 filecache patches with some production workloads and I've
>> hit the below refcount issue a couple of times in the last 48 hours with
>> v5.8.10. This server was re-exporting an NFS client mount at the time.
>> 
>> Apologies for the spam if I've just hit something unrelated to the patches that
>> is present in v5.8.10.... In truth, I have not used this kernel version before
>> with this workload and just patched it because I had it ready to go. I'll
>> remove the 2 patches and verify.
>> 
>> Daire
>> 
>> 
>> [ 8930.027838] ------------[ cut here ]------------
>> [ 8930.032769] refcount_t: addition on 0; use-after-free.
>> [ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25
>> refcount_warn_saturate+0x6e/0xf0
>> [ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4
>> dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio
>> cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4
>> input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs
>> libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover
>> virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
>> aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci
>> virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
>> [ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G        W
>> 5.8.10-1.dneg.x86_64 #1
>> [ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute
>> Engine, BIOS Google 01/01/2011
>> [ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
>> [ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01
>> 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3
>> 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
>> [ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
>> [ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
>> [ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
>> [ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
>> [ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
>> [ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
>> [ 8930.182976] FS:  0000000000000000(0000) GS:ffff888d0bc80000(0000)
>> knlGS:0000000000000000
>> [ 8930.191231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
>> [ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 8930.219027] Call Trace:
>> [ 8930.221665]  nfsd4_process_open2+0xa48/0xec0 [nfsd]
>> [ 8930.226724]  ? nfsd_permission+0x6b/0x100 [nfsd]
>> [ 8930.231524]  ? fh_verify+0x167/0x210 [nfsd]
>> [ 8930.235893]  nfsd4_open+0x407/0x820 [nfsd]
>> [ 8930.240248]  nfsd4_proc_compound+0x3c2/0x760 [nfsd]
>> [ 8930.245296]  ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
>> [ 8930.251734]  nfsd_dispatch+0xe2/0x220 [nfsd]
>> [ 8930.256213]  svc_process_common+0x47b/0x6f0 [sunrpc]
>> [ 8930.261355]  ? svc_sock_secure_port+0x16/0x30 [sunrpc]
>> [ 8930.266707]  ? nfsd_svc+0x330/0x330 [nfsd]
>> [ 8930.270981]  svc_process+0xc5/0x100 [sunrpc]
>> [ 8930.275423]  nfsd+0xe8/0x150 [nfsd]
>> [ 8930.280028]  kthread+0x114/0x150
>> [ 8930.283434]  ? nfsd_destroy+0x60/0x60 [nfsd]
>> [ 8930.287875]  ? kthread_park+0x90/0x90
>> [ 8930.291700]  ret_from_fork+0x22/0x30
>> [ 8930.295447] ---[ end trace c551536c3520545c ]---
> 
> It's entirely possible that my patch introduces a refcounting error - it was
> intended as a proof-of-concept on how to fix the LRU locking issue for v4
> open file caching (while keeping it enabled) - which is why I didn't
> "formally" send it in.
> 
> Having said that, I don't immediately see the problem.
> 
> Maybe try it without the rhashtable patch, that is much less of an
> optimization.
> 
> The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
> of nfs4_file. If it's the latter, it's probably the rhashtable change.

Thanks Frank; I think you are right in that it seems to be a problem with the rhashtable patch. Another 48 hours using the same workload with just the main patch and I have not seen the same issue again so far.

Also, it still has the effect of reducing the CPU usage dramatically such that there are plenty of cores still left idle. This is actually helping us buy some more time while we fix our obviously broken software so that it doesn't open/close so crazily.

So, many thanks for that.

Daire


* Re: Adventures in NFS re-exporting
  2020-09-26  9:00               ` Daire Byrne
@ 2020-09-28 15:49                 ` Frank van der Linden
  2020-09-28 16:08                   ` Chuck Lever
  0 siblings, 1 reply; 129+ messages in thread
From: Frank van der Linden @ 2020-09-28 15:49 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Chuck Lever, bfields, linux-nfs, linux-cachefs

On Sat, Sep 26, 2020 at 10:00:22AM +0100, Daire Byrne wrote:
> 
> 
> ----- On 23 Sep, 2020, at 22:01, Frank van der Linden fllinden@amazon.com wrote:
> > It's entirely possible that my patch introduces a refcounting error - it was
> > intended as a proof-of-concept on how to fix the LRU locking issue for v4
> > open file caching (while keeping it enabled) - which is why I didn't
> > "formally" send it in.
> >
> > Having said that, I don't immediately see the problem.
> >
> > Maybe try it without the rhashtable patch, that is much less of an
> > optimization.
> >
> > The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
> > of nfs4_file. If it's the latter, it's probably the rhashtable change.
> 
> Thanks Frank; I think you are right in that it seems to be a problem with the rhashtable patch. Another 48 hours using the same workload with just the main patch and I have not seen the same issue again so far.
> 
> Also, it still has the effect of reducing the CPU usage dramatically such that there are plenty of cores still left idle. This is actually helping us buy some more time while we fix our obviously broken software so that it doesn't open/close so crazily.
> 
> So, many thanks for that.

Cool. I'm glad the "don't put v4 files on the LRU list" patch works as intended for
you. The rhashtable patch was more of an afterthought, and obviously has an
issue. It did provide some extra gains, so I'll see if I can find the problem
if I get some time.

Bruce - if you want me to 'formally' submit a version of the patch, let me
know. Just disabling the cache for v4, which comes down to reverting a few
commits, is probably simpler - I'd be able to test that too.

- Frank


* Re: Adventures in NFS re-exporting
  2020-09-28 15:49                 ` Frank van der Linden
@ 2020-09-28 16:08                   ` Chuck Lever
  2020-09-28 17:42                     ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: Chuck Lever @ 2020-09-28 16:08 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: Daire Byrne, Bruce Fields, Linux NFS Mailing List, linux-cachefs



> On Sep 28, 2020, at 11:49 AM, Frank van der Linden <fllinden@amazon.com> wrote:
> 
> Bruce - if you want me to 'formally' submit a version of the patch, let me
> know. Just disabling the cache for v4, which comes down to reverting a few
> commits, is probably simpler - I'd be able to test that too.

I'd be interested in seeing that. From what I saw, the mechanics of
unhooking the cache from NFSv4 simply involve reverting patches, but
there appear to be some recent changes that depend on the open
filecache that might be difficult to deal with, like

b66ae6dd0c30 ("nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()")


--
Chuck Lever





* Re: Adventures in NFS re-exporting
  2020-09-28 16:08                   ` Chuck Lever
@ 2020-09-28 17:42                     ` Frank van der Linden
  0 siblings, 0 replies; 129+ messages in thread
From: Frank van der Linden @ 2020-09-28 17:42 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Daire Byrne, Bruce Fields, Linux NFS Mailing List, linux-cachefs

On Mon, Sep 28, 2020 at 12:08:09PM -0400, Chuck Lever wrote:
> 
> 
> > On Sep 28, 2020, at 11:49 AM, Frank van der Linden <fllinden@amazon.com> wrote:
> >
> > Bruce - if you want me to 'formally' submit a version of the patch, let me
> > know. Just disabling the cache for v4, which comes down to reverting a few
> > commits, is probably simpler - I'd be able to test that too.
> 
> I'd be interested in seeing that. From what I saw, the mechanics of
> unhooking the cache from NFSv4 simply involve reverting patches, but
> there appear to be some recent changes that depend on the open
> filecache that might be difficult to deal with, like
> 
> b66ae6dd0c30 ("nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()")

Hm, yes, I missed nf_rwsem being added to the struct.

Probably easier to keep nfsd_file, and have v4 use just straight alloc/free
functions for it that don't touch the cache at all.
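
Roughly, that idea might look something like this (a userspace sketch only; the struct layout and function names are invented for illustration, not actual nfsd code):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for struct nfsd_file: under this scheme, v4 opens
 * would get one of these that is never entered into the filecache hash
 * table or LRU, so its lifetime is tied purely to the v4 open state. */
struct nfsd_file_sketch {
	int   nf_ref;
	void *nf_file;    /* stands in for the struct file pointer */
	bool  nf_cached;  /* on the filecache hash/LRU? always false here */
};

static struct nfsd_file_sketch *nfsd_file_alloc_uncached(void *filp)
{
	struct nfsd_file_sketch *nf = malloc(sizeof(*nf));

	if (!nf)
		return NULL;
	nf->nf_ref = 1;
	nf->nf_file = filp;
	nf->nf_cached = false; /* v4: skip the cache entirely */
	return nf;
}

static void nfsd_file_free_uncached(struct nfsd_file_sketch *nf)
{
	if (--nf->nf_ref == 0)
		free(nf); /* nothing to unhash, no LRU to walk */
}
```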

- Frank


* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-09-22 12:31 ` Daire Byrne
  2020-09-22 13:52   ` Trond Myklebust
@ 2020-09-30 19:30   ` Jeff Layton
  2020-10-01  0:09     ` Daire Byrne
  2020-10-01 18:41     ` J. Bruce Fields
  1 sibling, 2 replies; 129+ messages in thread
From: Jeff Layton @ 2020-09-30 19:30 UTC (permalink / raw)
  To: Daire Byrne, linux-nfs; +Cc: linux-cachefs

On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> Hi, 
> 
> I just thought I'd flesh out the other two issues I have found with re-exporting that are ultimately responsible for the biggest performance bottlenecks. And both of them revolve around the caching of metadata file lookups in the NFS client.
> 
> Especially for the case where we are re-exporting a server many milliseconds away (i.e. on-premise -> cloud), we want to be able to control how much the client caches metadata and file data so that its many LAN clients all benefit from the re-export server only having to do the WAN lookups once (within a specified coherency time).
> 
> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles is fairly straightforward, but keeping the metadata cached is particularly difficult. And without the cached metadata we introduce long delays before we can serve the already present and locally cached file data to many waiting clients.
> 
> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> > cut the network packets back to the origin server to zero for repeated lookups.
> > However, if a client of the re-export server walks paths and memory maps those
> > files (i.e. loading an application), the re-export server starts issuing
> > unexpected calls back to the origin server again, ignoring/invalidating the
> > re-export server's NFS client cache. We worked around this this by patching an
> > inode/iversion validity check in inode.c so that the NFS client cache on the
> > re-export server is used. I'm not sure about the correctness of this patch but
> > it works for our corner case.
> 
> If we use actimeo=3600,nocto (say) to mount a remote software volume on the re-export server, we can successfully cache the loading of applications and walking of paths directly on the re-export server such that after a couple of runs, there are practically zero packets back to the originating NFS server (great!). But, if we then do the same thing on a client which is mounting that re-export server, the re-export server now starts issuing lots of calls back to the originating server and invalidating its client cache (bad!).
> 
> I'm not exactly sure why, but the iversion of the inode gets changed locally (due to atime modification?) most likely via invocation of method inode_inc_iversion_raw. Each time it gets incremented the following call to validate attributes detects changes causing it to be reloaded from the originating server.
> 

I'd expect the change attribute to track what's in the actual inode on the
"home" server. The NFS client is supposed to (mostly) keep the raw
change attribute in its i_version field.

The only place we call inode_inc_iversion_raw is in
nfs_inode_add_request, which I don't think you'd be hitting unless you
were writing to the file while holding a write delegation.

What sort of server is hosting the actual data in your setup?


> This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:
> 
> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000 +0000
> +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> @@ -1869,7 +1869,7 @@
>  
>         /* More cache consistency checks */
>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>                         /* Could it be a race with writeback? */
>                         if (!(have_writers || have_delegation)) {
>                                 invalid |= NFS_INO_INVALID_DATA
> 
> With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).
> 
> Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly? 

Yeah, I don't think you can do this for the reasons Trond outlined.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
@ 2020-10-01  0:09     ` Daire Byrne
  2020-10-01 10:36       ` Jeff Layton
  2020-10-01 18:41     ` J. Bruce Fields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-10-01  0:09 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, linux-cachefs


----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:

> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> Hi,
>> 
>> I just thought I'd flesh out the other two issues I have found with re-exporting
>> that are ultimately responsible for the biggest performance bottlenecks. And
>> both of them revolve around the caching of metadata file lookups in the NFS
>> client.
>> 
>> Especially for the case where we are re-exporting a server many milliseconds
>> away (i.e. on-premise -> cloud), we want to be able to control how much the
>> client caches metadata and file data so that its many LAN clients all benefit
>> from the re-export server only having to do the WAN lookups once (within a
>> specified coherency time).
>> 
>> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> is fairly straightforward, but keeping the metadata cached is particularly
>> difficult. And without the cached metadata we introduce long delays before we
>> can serve the already present and locally cached file data to many waiting
>> clients.
>> 
>> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
>> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > cut the network packets back to the origin server to zero for repeated lookups.
>> > However, if a client of the re-export server walks paths and memory maps those
>> > files (i.e. loading an application), the re-export server starts issuing
>> > unexpected calls back to the origin server again, ignoring/invalidating the
>> > re-export server's NFS client cache. We worked around this by patching an
>> > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > re-export server is used. I'm not sure about the correctness of this patch but
>> > it works for our corner case.
>> 
>> If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> re-export server, we can successfully cache the loading of applications and
>> walking of paths directly on the re-export server such that after a couple of
>> runs, there are practically zero packets back to the originating NFS server
>> (great!). But, if we then do the same thing on a client which is mounting that
>> re-export server, the re-export server now starts issuing lots of calls back to
>> the originating server and invalidating its client cache (bad!).
>> 
>> I'm not exactly sure why, but the iversion of the inode gets changed locally
>> (due to atime modification?) most likely via invocation of method
>> inode_inc_iversion_raw. Each time it gets incremented the following call to
>> validate attributes detects changes causing it to be reloaded from the
>> originating server.
>> 
> 
> I'd expect the change attribute to track what's in the actual inode on the
> "home" server. The NFS client is supposed to (mostly) keep the raw
> change attribute in its i_version field.
> 
> The only place we call inode_inc_iversion_raw is in
> nfs_inode_add_request, which I don't think you'd be hitting unless you
> were writing to the file while holding a write delegation.
> 
> What sort of server is hosting the actual data in your setup?

We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of (older) Netapps too. The re-export server is running the latest mainline kernel(s).

As far as I can make out, both these originating (home) server types exhibit a similar (but not exactly the same) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it when only using a read-only mount at every hop so I don't think that writes are related.

Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates that might be causing this client invalidation (which is what I initially thought) are ultimately a wasted effort.


>> This patch helps to avoid this when applied to the re-export server but there
>> may be other places where this happens too. I accept that this patch is
>> probably not the right/general way to do this, but it helps to highlight the
>> issue when re-exporting and it works well for our use case:
>> 
>> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000
>> +0000
>> +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
>> @@ -1869,7 +1869,7 @@
>>  
>>         /* More cache consistency checks */
>>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
>> -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>>                         /* Could it be a race with writeback? */
>>                         if (!(have_writers || have_delegation)) {
>>                                 invalid |= NFS_INO_INVALID_DATA
>> 
>> With this patch, the re-export server's NFS client attribute cache is maintained
>> and used by all the clients that then mount it. When many hundreds of clients
>> are all doing similar things at the same time, the re-export server's NFS
>> client cache is invaluable in accelerating the lookups (getattrs).
>> 
>> Perhaps a more correct approach would be to detect when it is knfsd that is
>> accessing the client mount and change the cache consistency checks accordingly?
> 
> Yeah, I don't think you can do this for the reasons Trond outlined.

Yea, I kind of felt like it wasn't quite right, but I didn't know enough about the intricacies to say why exactly. So thanks to everyone for clearing that up for me.

We just followed the code and found that the re-export server spent a lot of time in this code block, even though we assumed we should be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more a way for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.
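
For what it's worth, the semantic difference between the stock check and our hack boils down to this (a simplified sketch of the two conditions, not the actual fs/nfs/inode.c code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mainline: any mismatch between the cached change attribute and the
 * one the server just returned invalidates cached data, i.e.
 * !inode_eq_iversion_raw(inode, fattr->change_attr). */
static bool needs_reval_mainline(uint64_t cached, uint64_t server)
{
	return cached != server;
}

/* Our hack: only a strictly newer server value invalidates, so a raw
 * local bump that runs ahead of the server's value no longer forces a
 * reload, i.e. inode_peek_iversion_raw(inode) < fattr->change_attr. */
static bool needs_reval_patched(uint64_t cached, uint64_t server)
{
	return cached < server;
}
```

So when the re-export server's local i_version has been bumped past the server's value (cached 11, server 10), mainline invalidates and the patched check does not; a genuine server-side change (cached 9, server 10) still invalidates under both checks.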

While the patch definitely helps to better utilise the re-export server's NFS client cache when exporting via knfsd, we still see many repeat getattrs per minute for the same files on the re-export server when hundreds of clients are all reading the same files. So this is probably not the only place where reading an NFS client mount through a knfsd export invalidates the re-export server's NFS client cache.

Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means that we can do expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours where we repeatedly read the same files from many clients, it can lead to big savings over the WAN.
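
Concretely, the sort of mount we use on the re-export server looks like this (hostnames and paths are made up; the option values are what we have been experimenting with, not a recommendation):

```shell
# On the re-export server: long attribute cache, no close-to-open
# consistency, and fscache-backed data caching, trading coherency for
# far fewer WAN round trips.
mount -t nfs -o vers=4.2,ro,actimeo=3600,nocto,fsc \
    origin.example.com:/vol/software /srv/reexport/software

# Then re-export it to the LAN clients via /etc/exports, e.g.:
#   /srv/reexport/software  *(ro,no_subtree_check,fsid=1000)
```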

But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.

Daire


* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01  0:09     ` Daire Byrne
@ 2020-10-01 10:36       ` Jeff Layton
  2020-10-01 12:38         ` Trond Myklebust
  2020-10-05 12:54         ` Daire Byrne
  0 siblings, 2 replies; 129+ messages in thread
From: Jeff Layton @ 2020-10-01 10:36 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs

On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:
> 
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > Hi,
> > > 
> > > I just thought I'd flesh out the other two issues I have found with re-exporting
> > > that are ultimately responsible for the biggest performance bottlenecks. And
> > > both of them revolve around the caching of metadata file lookups in the NFS
> > > client.
> > > 
> > > Especially for the case where we are re-exporting a server many milliseconds
> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
> > > client caches metadata and file data so that its many LAN clients all benefit
> > > from the re-export server only having to do the WAN lookups once (within a
> > > specified coherency time).
> > > 
> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
> > > is fairly straightforward, but keeping the metadata cached is particularly
> > > difficult. And without the cached metadata we introduce long delays before we
> > > can serve the already present and locally cached file data to many waiting
> > > clients.
> > > 
> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> > > > cut the network packets back to the origin server to zero for repeated lookups.
> > > > However, if a client of the re-export server walks paths and memory maps those
> > > > files (i.e. loading an application), the re-export server starts issuing
> > > > unexpected calls back to the origin server again, ignoring/invalidating the
> > > > re-export server's NFS client cache. We worked around this by patching an
> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
> > > > re-export server is used. I'm not sure about the correctness of this patch but
> > > > it works for our corner case.
> > > 
> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
> > > re-export server, we can successfully cache the loading of applications and
> > > walking of paths directly on the re-export server such that after a couple of
> > > runs, there are practically zero packets back to the originating NFS server
> > > (great!). But, if we then do the same thing on a client which is mounting that
> > > re-export server, the re-export server now starts issuing lots of calls back to
> > > the originating server and invalidating its client cache (bad!).
> > > 
> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
> > > (due to atime modification?) most likely via invocation of method
> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
> > > validate attributes detects changes causing it to be reloaded from the
> > > originating server.
> > > 
> > 
> > I'd expect the change attribute to track what's in actual inode on the
> > "home" server. The NFS client is supposed to (mostly) keep the raw
> > change attribute in its i_version field.
> > 
> > The only place we call inode_inc_iversion_raw is in
> > nfs_inode_add_request, which I don't think you'd be hitting unless you
> > were writing to the file while holding a write delegation.
> > 
> > What sort of server is hosting the actual data in your setup?
> 
> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of (older) Netapps too. The re-export server is running the latest mainline kernel(s).
> 
> As far as I can make out, both these originating (home) server types exhibit a similar (but not exactly the same) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it when only using a read-only mount at every hop so I don't think that writes are related.
> 
> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates that might be causing this client invalidation (which is what I initially thought) are ultimately a wasted effort.
> 

Ok. I suspect there is a bug here somewhere, but with such a complicated
setup it's not clear to me where that bug would be. You
might need to do some packet sniffing and look at what the servers are
sending for change attributes.

nfsd4_change_attribute does mix in the ctime, so your hunch about the
atime may be correct. atime updates imply a ctime update and that could
cause nfsd to continually send a new one, even on files that aren't
being changed.

It might be interesting to doctor nfsd4_change_attribute() to not mix in
the ctime and see whether that improves things. If it does, then we may
want to teach nfsd how to avoid doing that for certain types of
filesystems.

> 
> > > This patch helps to avoid this when applied to the re-export server but there
> > > may be other places where this happens too. I accept that this patch is
> > > probably not the right/general way to do this, but it helps to highlight the
> > > issue when re-exporting and it works well for our use case:
> > > 
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000
> > > +0000
> > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >  
> > >         /* More cache consistency checks */
> > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > >                         /* Could it be a race with writeback? */
> > >                         if (!(have_writers || have_delegation)) {
> > >                                 invalid |= NFS_INO_INVALID_DATA
> > > 
> > > With this patch, the re-export server's NFS client attribute cache is maintained
> > > and used by all the clients that then mount it. When many hundreds of clients
> > > are all doing similar things at the same time, the re-export server's NFS
> > > client cache is invaluable in accelerating the lookups (getattrs).
> > > 
> > > Perhaps a more correct approach would be to detect when it is knfsd that is
> > > accessing the client mount and change the cache consistency checks accordingly?
> > 
> > Yeah, I don't think you can do this for the reasons Trond outlined.
> 
> Yea, I kind of felt like it wasn't quite right, but I didn't know enough about the intricacies to say why exactly. So thanks to everyone for clearing that up for me.
> 
> We just followed the code and found that the re-export server spent a lot of time in this code block when we assumed that we should be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.
> 
> While the patch definitely helps to better utilise the re-export server's nfs client cache when exporting via knfsd, we do still see many repeat getattrs per minute for the same files on the re-export server when 100s of clients are all reading the same files. So this is probably not the only area where reading an nfs client mount via a knfsd export invalidates the re-export server's nfs client cache.
>
> Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means that we can do expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours where we repeatedly read the same files from many clients, it can lead to big savings over the WAN.
> 
> But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.
> 
> Daire

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 10:36       ` Jeff Layton
@ 2020-10-01 12:38         ` Trond Myklebust
  2020-10-01 16:39           ` Jeff Layton
  2020-10-05 12:54         ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-10-01 12:38 UTC (permalink / raw)
  To: jlayton, daire; +Cc: linux-cachefs, linux-nfs

On Thu, 2020-10-01 at 06:36 -0400, Jeff Layton wrote:
> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> > ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org
> > wrote:
> > 
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > Hi,
> > > > 
> > > > I just thought I'd flesh out the other two issues I have found
> > > > with re-exporting
> > > > that are ultimately responsible for the biggest performance
> > > > bottlenecks. And
> > > > both of them revolve around the caching of metadata file
> > > > lookups in the NFS
> > > > client.
> > > > 
> > > > Especially for the case where we are re-exporting a server many
> > > > milliseconds
> > > > away (i.e. on-premise -> cloud), we want to be able to control
> > > > how much the
> > > > client caches metadata and file data so that its many LAN
> > > > clients all benefit
> > > > from the re-export server only having to do the WAN lookups
> > > > once (within a
> > > > specified coherency time).
> > > > 
> > > > Keeping the file data in the vfs page cache or on disk using
> > > > fscache/cachefiles
> > > > is fairly straightforward, but keeping the metadata cached is
> > > > particularly
> > > > difficult. And without the cached metadata we introduce long
> > > > delays before we
> > > > can serve the already present and locally cached file data to
> > > > many waiting
> > > > clients.
> > > > 
> > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com
> > > > wrote:
> > > > > 2) If we cache metadata on the re-export server using
> > > > > actimeo=3600,nocto we can
> > > > > cut the network packets back to the origin server to zero for
> > > > > repeated lookups.
> > > > > However, if a client of the re-export server walks paths and
> > > > > memory maps those
> > > > > files (i.e. loading an application), the re-export server
> > > > > starts issuing
> > > > > unexpected calls back to the origin server again,
> > > > > ignoring/invalidating the
> > > > > re-export server's NFS client cache. We worked around this
> > > > > by patching an
> > > > > inode/iversion validity check in inode.c so that the NFS
> > > > > client cache on the
> > > > > re-export server is used. I'm not sure about the correctness
> > > > > of this patch but
> > > > > it works for our corner case.
> > > > 
> > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > volume on the
> > > > re-export server, we can successfully cache the loading of
> > > > applications and
> > > > walking of paths directly on the re-export server such that
> > > > after a couple of
> > > > runs, there are practically zero packets back to the
> > > > originating NFS server
> > > > (great!). But, if we then do the same thing on a client which
> > > > is mounting that
> > > > re-export server, the re-export server now starts issuing lots
> > > > of calls back to
> > > > the originating server and invalidating its client cache
> > > > (bad!).
> > > > 
> > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > changed locally
> > > > (due to atime modification?) most likely via invocation of
> > > > method
> > > > inode_inc_iversion_raw. Each time it gets incremented the
> > > > following call to
> > > > validate attributes detects changes causing it to be reloaded
> > > > from the
> > > > originating server.
> > > > 
> > > 
> > > I'd expect the change attribute to track what's in actual inode
> > > on the
> > > "home" server. The NFS client is supposed to (mostly) keep the
> > > raw
> > > change attribute in its i_version field.
> > > 
> > > The only place we call inode_inc_iversion_raw is in
> > > nfs_inode_add_request, which I don't think you'd be hitting
> > > unless you
> > > were writing to the file while holding a write delegation.
> > > 
> > > What sort of server is hosting the actual data in your setup?
> > 
> > We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a
> > couple of (older) Netapps too. The re-export server is running the
> > latest mainline kernel(s).
> > 
> > As far as I can make out, both these originating (home) server
> > types exhibit a similar (but not exactly the same) effect on the
> > Linux NFS client cache when it is being re-exported and accessed by
> > other clients. I can replicate it when only using a read-only mount
> > at every hop so I don't think that writes are related.
> > 
> > Our RHEL7 NFS servers actually mount XFS with noatime too so any
> > atime updates that might be causing this client invalidation (which
> > is what I initially thought) are ultimately a wasted effort.
> > 
> 
> Ok. I suspect there is a bug here somewhere, but with such a
> complicated
> setup it's not clear to me where that bug would be. You
> might need to do some packet sniffing and look at what the servers
> are
> sending for change attributes.
> 
> nfsd4_change_attribute does mix in the ctime, so your hunch about the
> atime may be correct. atime updates imply a ctime update and that
> could
> cause nfsd to continually send a new one, even on files that aren't
> being changed.

No. Ordinary atime updates due to read() do not trigger a ctime or
change attribute update. Only an explicit atime update through, e.g., a
call to utimensat(), will do that.

> 
> It might be interesting to doctor nfsd4_change_attribute() to not mix
> in
> the ctime and see whether that improves things. If it does, then we
> may
> want to teach nfsd how to avoid doing that for certain types of
> filesystems.

NACK. That would cause very incorrect behaviour for the change
attribute. It is supposed to change in all circumstances where you
ordinarily see a ctime change.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 12:38         ` Trond Myklebust
@ 2020-10-01 16:39           ` Jeff Layton
  0 siblings, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-10-01 16:39 UTC (permalink / raw)
  To: Trond Myklebust, daire; +Cc: linux-cachefs, linux-nfs

On Thu, 2020-10-01 at 12:38 +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 06:36 -0400, Jeff Layton wrote:
> > On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> > > ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org
> > > wrote:
> > > 
> > > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > > Hi,
> > > > > 
> > > > > I just thought I'd flesh out the other two issues I have found
> > > > > with re-exporting
> > > > > that are ultimately responsible for the biggest performance
> > > > > bottlenecks. And
> > > > > both of them revolve around the caching of metadata file
> > > > > lookups in the NFS
> > > > > client.
> > > > > 
> > > > > Especially for the case where we are re-exporting a server many
> > > > > milliseconds
> > > > > away (i.e. on-premise -> cloud), we want to be able to control
> > > > > how much the
> > > > > client caches metadata and file data so that its many LAN
> > > > > clients all benefit
> > > > > from the re-export server only having to do the WAN lookups
> > > > > once (within a
> > > > > specified coherency time).
> > > > > 
> > > > > Keeping the file data in the vfs page cache or on disk using
> > > > > fscache/cachefiles
> > > > > is fairly straightforward, but keeping the metadata cached is
> > > > > particularly
> > > > > difficult. And without the cached metadata we introduce long
> > > > > delays before we
> > > > > can serve the already present and locally cached file data to
> > > > > many waiting
> > > > > clients.
> > > > > 
> > > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com
> > > > > wrote:
> > > > > > 2) If we cache metadata on the re-export server using
> > > > > > actimeo=3600,nocto we can
> > > > > > cut the network packets back to the origin server to zero for
> > > > > > repeated lookups.
> > > > > > However, if a client of the re-export server walks paths and
> > > > > > memory maps those
> > > > > > files (i.e. loading an application), the re-export server
> > > > > > starts issuing
> > > > > > unexpected calls back to the origin server again,
> > > > > > ignoring/invalidating the
> > > > > > re-export server's NFS client cache. We worked around this
> > > > > > by patching an
> > > > > > inode/iversion validity check in inode.c so that the NFS
> > > > > > client cache on the
> > > > > > re-export server is used. I'm not sure about the correctness
> > > > > > of this patch but
> > > > > > it works for our corner case.
> > > > > 
> > > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > > volume on the
> > > > > re-export server, we can successfully cache the loading of
> > > > > applications and
> > > > > walking of paths directly on the re-export server such that
> > > > > after a couple of
> > > > > runs, there are practically zero packets back to the
> > > > > originating NFS server
> > > > > (great!). But, if we then do the same thing on a client which
> > > > > is mounting that
> > > > > re-export server, the re-export server now starts issuing lots
> > > > > of calls back to
> > > > > the originating server and invalidating its client cache
> > > > > (bad!).
> > > > > 
> > > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > > changed locally
> > > > > (due to atime modification?) most likely via invocation of
> > > > > method
> > > > > inode_inc_iversion_raw. Each time it gets incremented the
> > > > > following call to
> > > > > validate attributes detects changes causing it to be reloaded
> > > > > from the
> > > > > originating server.
> > > > > 
> > > > 
> > > > I'd expect the change attribute to track what's in actual inode
> > > > on the
> > > > "home" server. The NFS client is supposed to (mostly) keep the
> > > > raw
> > > > change attribute in its i_version field.
> > > > 
> > > > The only place we call inode_inc_iversion_raw is in
> > > > nfs_inode_add_request, which I don't think you'd be hitting
> > > > unless you
> > > > were writing to the file while holding a write delegation.
> > > > 
> > > > What sort of server is hosting the actual data in your setup?
> > > 
> > > We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a
> > > couple of (older) Netapps too. The re-export server is running the
> > > latest mainline kernel(s).
> > > 
> > > As far as I can make out, both these originating (home) server
> > > types exhibit a similar (but not exactly the same) effect on the
> > > Linux NFS client cache when it is being re-exported and accessed by
> > > other clients. I can replicate it when only using a read-only mount
> > > at every hop so I don't think that writes are related.
> > > 
> > > Our RHEL7 NFS servers actually mount XFS with noatime too so any
> > > atime updates that might be causing this client invalidation (which
> > > is what I initially thought) are ultimately a wasted effort.
> > > 
> > 
> > Ok. I suspect there is a bug here somewhere, but with such a
> > complicated
> > setup it's not clear to me where that bug would be. You
> > might need to do some packet sniffing and look at what the servers
> > are
> > sending for change attributes.
> > 
> > nfsd4_change_attribute does mix in the ctime, so your hunch about the
> > atime may be correct. atime updates imply a ctime update and that
> > could
> > cause nfsd to continually send a new one, even on files that aren't
> > being changed.
> 
> No. Ordinary atime updates due to read() do not trigger a ctime or
> change attribute update. Only an explicit atime update through, e.g. a
> call to utimensat() will do that.
> 

Oh, interesting. I didn't realize that.

> > It might be interesting to doctor nfsd4_change_attribute() to not mix
> > in
> > the ctime and see whether that improves things. If it does, then we
> > may
> > want to teach nfsd how to avoid doing that for certain types of
> > filesystems.
> 
> NACK. That would cause very incorrect behaviour for the change
> attribute. It is supposed to change in all circumstances where you
> ordinarily see a ctime change.


I wasn't suggesting this as a real fix, just as a way to see whether we
understand the problem correctly. I doubt the reexporting machine would
be bumping the change_attr on its own, and this may tell you whether
it's the "home" server changing it. There are other ways to determine it
too though (packet sniffer, for instance).

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
  2020-10-01  0:09     ` Daire Byrne
@ 2020-10-01 18:41     ` J. Bruce Fields
  2020-10-01 19:24       ` Trond Myklebust
  1 sibling, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-10-01 18:41 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, linux-nfs, linux-cachefs

On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:
> > 
> > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000 +0000
> > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > @@ -1869,7 +1869,7 @@
> >  
> >         /* More cache consistency checks */
> >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> >                         /* Could it be a race with writeback? */
> >                         if (!(have_writers || have_delegation)) {
> >                                 invalid |= NFS_INO_INVALID_DATA
> > 
> > With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).
> > 
> > Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly? 
> 
> Yeah, I don't think you can do this for the reasons Trond outlined.

I'm not clear whether Trond thought that knfsd's behavior in the case it
returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to allow
this or some other optimization.

--b.


* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 18:41     ` J. Bruce Fields
@ 2020-10-01 19:24       ` Trond Myklebust
  2020-10-01 19:26         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-10-01 19:24 UTC (permalink / raw)
  To: bfields, jlayton; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > This patch helps to avoid this when applied to the re-export
> > > server but there may be other places where this happens too. I
> > > accept that this patch is probably not the right/general way to
> > > do this, but it helps to highlight the issue when re-exporting
> > > and it works well for our use case:
> > > 
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > 00:23:03.000000000 +0000
> > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >  
> > >         /* More cache consistency checks */
> > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > >change_attr)) {
> > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > >change_attr) {
> > >                         /* Could it be a race with writeback? */
> > >                         if (!(have_writers || have_delegation)) {
> > >                                 invalid |= NFS_INO_INVALID_DATA
> > > 
> > > With this patch, the re-export server's NFS client attribute
> > > cache is maintained and used by all the clients that then mount
> > > it. When many hundreds of clients are all doing similar things at
> > > the same time, the re-export server's NFS client cache is
> > > invaluable in accelerating the lookups (getattrs).
> > > 
> > > Perhaps a more correct approach would be to detect when it is
> > > knfsd that is accessing the client mount and change the cache
> > > consistency checks accordingly? 
> > 
> > Yeah, I don't think you can do this for the reasons Trond outlined.
> 
> I'm not clear whether Trond thought that knfsd's behavior in the case
> it
> returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to
> allow
> this or some other optimization.
> 

NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough to
allow the above optimisation, yes. I'm less sure about whether or not
we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when in
fact we are adding the ctime and filesystem-specific change attribute,
but we could fix that too.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 19:24       ` Trond Myklebust
@ 2020-10-01 19:26         ` bfields
  2020-10-01 19:29           ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-10-01 19:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: jlayton, linux-cachefs, linux-nfs, daire

On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> > On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > This patch helps to avoid this when applied to the re-export
> > > > server but there may be other places where this happens too. I
> > > > accept that this patch is probably not the right/general way to
> > > > do this, but it helps to highlight the issue when re-exporting
> > > > and it works well for our use case:
> > > > 
> > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > > 00:23:03.000000000 +0000
> > > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > > @@ -1869,7 +1869,7 @@
> > > >  
> > > >         /* More cache consistency checks */
> > > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > > >change_attr)) {
> > > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > > >change_attr) {
> > > >                         /* Could it be a race with writeback? */
> > > >                         if (!(have_writers || have_delegation)) {
> > > >                                 invalid |= NFS_INO_INVALID_DATA
> > > > 
> > > > With this patch, the re-export server's NFS client attribute
> > > > cache is maintained and used by all the clients that then mount
> > > > it. When many hundreds of clients are all doing similar things at
> > > > the same time, the re-export server's NFS client cache is
> > > > invaluable in accelerating the lookups (getattrs).
> > > > 
> > > > Perhaps a more correct approach would be to detect when it is
> > > > knfsd that is accessing the client mount and change the cache
> > > > consistency checks accordingly? 
> > > 
> > > Yeah, I don't think you can do this for the reasons Trond outlined.
> > 
> > I'm not clear whether Trond thought that knfsd's behavior in the case
> > it
> > returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to
> > allow
> > this or some other optimization.
> > 
> 
> NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough to
> allow the above optimisation, yes. I'm less sure about whether or not
> we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when in
> fact we are adding the ctime and filesystem-specific change attribute,
> but we could fix that too.

Could you explain your concern?

--b.


* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 19:26         ` bfields
@ 2020-10-01 19:29           ` Trond Myklebust
  2020-10-01 19:51             ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-10-01 19:29 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, jlayton, daire

On Thu, 2020-10-01 at 15:26 -0400, bfields@fieldses.org wrote:
> On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> > > On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > > This patch helps to avoid this when applied to the re-export
> > > > > server but there may be other places where this happens too.
> > > > > I
> > > > > accept that this patch is probably not the right/general way
> > > > > to
> > > > > do this, but it helps to highlight the issue when re-
> > > > > exporting
> > > > > and it works well for our use case:
> > > > > 
> > > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > > > 00:23:03.000000000 +0000
> > > > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > > > @@ -1869,7 +1869,7 @@
> > > > >  
> > > > >         /* More cache consistency checks */
> > > > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > > > > change_attr)) {
> > > > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > > > > change_attr) {
> > > > >                         /* Could it be a race with writeback?
> > > > > */
> > > > >                         if (!(have_writers ||
> > > > > have_delegation)) {
> > > > >                                 invalid |=
> > > > > NFS_INO_INVALID_DATA
> > > > > 
> > > > > With this patch, the re-export server's NFS client attribute
> > > > > cache is maintained and used by all the clients that then
> > > > > mount
> > > > > it. When many hundreds of clients are all doing similar
> > > > > things at
> > > > > the same time, the re-export server's NFS client cache is
> > > > > invaluable in accelerating the lookups (getattrs).
> > > > > 
> > > > > Perhaps a more correct approach would be to detect when it is
> > > > > knfsd that is accessing the client mount and change the cache
> > > > > consistency checks accordingly? 
> > > > 
> > > > Yeah, I don't think you can do this for the reasons Trond
> > > > outlined.
> > > 
> > > I'm not clear whether Trond thought that knfsd's behavior in the
> > > case
> > > it
> > > returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough
> > > to
> > > allow
> > > this or some other optimization.
> > > 
> > 
> > NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough
> > to
> > allow the above optimisation, yes. I'm less sure about whether or
> > not
> > we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when
> > in
> > fact we are adding the ctime and filesystem-specific change
> > attribute,
> > but we could fix that too.
> 
> Could you explain your concern?
> 

Same as before: that the ctime could cause the value to regress if
someone messes with the system time on the server. Yes, we do add in
the change attribute, but the value of ctime.tv_sec dominates by a
factor of 2^30.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 19:29           ` Trond Myklebust
@ 2020-10-01 19:51             ` bfields
  0 siblings, 0 replies; 129+ messages in thread
From: bfields @ 2020-10-01 19:51 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, jlayton, daire

On Thu, Oct 01, 2020 at 07:29:51PM +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 15:26 -0400, bfields@fieldses.org wrote:
> > On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> > > NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough
> > > to
> > > allow the above optimisation, yes. I'm less sure about whether or
> > > not
> > > we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when
> > > in
> > > fact we are adding the ctime and filesystem-specific change
> > > attribute,
> > > but we could fix that too.
> > 
> > Could you explain your concern?
> > 
> 
> Same as before: that the ctime could cause the value to regress if
> someone messes with the system time on the server. Yes, we do add in
> the change attribute, but the value of ctime.tv_sec dominates by a
> factor of 2^30.

Got it.

I'd like to just tell people not to do that....

If we think it's too easy a mistake to make, I can think of other
approaches, though filesystem assistance might be required:

- Ideal would be just never to expose uncommitted change attributes to
  the client.  Absent persistent RAM, that could be terribly expensive.

- It would help just to have any number that's guaranteed to increase
  after a boot.  Of course, it would need to go forward at least as reliably
  as the system time.  We'd put it in the high bits of the on-disk
  i_version.  (We'd rather not just mix it into the returned change
  attribute as we do with ctime, because that would cause clients to
  discard all their caches unnecessarily after boot.)

--b.


* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 10:36       ` Jeff Layton
  2020-10-01 12:38         ` Trond Myklebust
@ 2020-10-05 12:54         ` Daire Byrne
  2020-10-13  9:59           ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-10-05 12:54 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, linux-cachefs

----- On 1 Oct, 2020, at 11:36, Jeff Layton jlayton@kernel.org wrote:

> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
>> ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:
>> 
>> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> > > Hi,
>> > > 
>> > > I just thought I'd flesh out the other two issues I have found with re-exporting
>> > > that are ultimately responsible for the biggest performance bottlenecks. And
>> > > both of them revolve around the caching of metadata file lookups in the NFS
>> > > client.
>> > > 
>> > > Especially for the case where we are re-exporting a server many milliseconds
>> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
>> > > client caches metadata and file data so that its many LAN clients all benefit
>> > > from the re-export server only having to do the WAN lookups once (within a
>> > > specified coherency time).
>> > > 
>> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> > > is fairly straightforward, but keeping the metadata cached is particularly
>> > > difficult. And without the cached metadata we introduce long delays before we
>> > > can serve the already present and locally cached file data to many waiting
>> > > clients.
>> > > 
>> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
>> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > > > cut the network packets back to the origin server to zero for repeated lookups.
>> > > > However, if a client of the re-export server walks paths and memory maps those
>> > > > files (i.e. loading an application), the re-export server starts issuing
>> > > > unexpected calls back to the origin server again, ignoring/invalidating the
>> > > > re-export server's NFS client cache. We worked around this by patching an
>> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > > > re-export server is used. I'm not sure about the correctness of this patch but
>> > > > it works for our corner case.
>> > > 
>> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> > > re-export server, we can successfully cache the loading of applications and
>> > > walking of paths directly on the re-export server such that after a couple of
>> > > runs, there are practically zero packets back to the originating NFS server
>> > > (great!). But, if we then do the same thing on a client which is mounting that
>> > > re-export server, the re-export server now starts issuing lots of calls back to
>> > > the originating server and invalidating its client cache (bad!).
>> > > 
>> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
>> > > (due to atime modification?) most likely via invocation of method
>> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
>> > > validate attributes detects changes causing it to be reloaded from the
>> > > originating server.
>> > > 
>> > 
> I'd expect the change attribute to track what's in the actual inode on the
>> > "home" server. The NFS client is supposed to (mostly) keep the raw
>> > change attribute in its i_version field.
>> > 
>> > The only place we call inode_inc_iversion_raw is in
>> > nfs_inode_add_request, which I don't think you'd be hitting unless you
>> > were writing to the file while holding a write delegation.
>> > 
>> > What sort of server is hosting the actual data in your setup?
>> 
>> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of
>> (older) Netapps too. The re-export server is running the latest mainline
>> kernel(s).
>> 
>> As far as I can make out, both these originating (home) server types exhibit a
>> similar (but not exactly the same) effect on the Linux NFS client cache when it
>> is being re-exported and accessed by other clients. I can replicate it when
>> only using a read-only mount at every hop so I don't think that writes are
>> related.
>> 
>> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates
>> that might be causing this client invalidation (which is what I initially
>> thought) are ultimately a wasted effort.
>> 
> 
> Ok. I suspect there is a bug here somewhere, but with such a complicated
> setup it's not clear to me where that bug would be. You
> might need to do some packet sniffing and look at what the servers are
> sending for change attributes.
> 
> nfsd4_change_attribute does mix in the ctime, so your hunch about the
> atime may be correct. atime updates imply a ctime update and that could
> cause nfsd to continually send a new one, even on files that aren't
> being changed.
> 
> It might be interesting to doctor nfsd4_change_attribute() to not mix in
> the ctime and see whether that improves things. If it does, then we may
> want to teach nfsd how to avoid doing that for certain types of
> filesystems.

Okay, I started to run back through all my tests again with various combinations of server, client mount options, NFS version etc. with the intention of packet capturing as Jeff has suggested.

But I quickly realised that I had mixed up some previous results before I reported them here. The summary is that using an NFS RHEL76 server, a client mounting with a recent mainline kernel and re-exporting using NFSv4.x all the way through does NOT invalidate the re-export server's NFS client cache (great!) like I had assumed before. It does when we mount the originating RHEL7 server using NFSv3 and re-export, but not with any version of NFSv4 on Linux.

But I think I know how I got confused - the Netapp NFSv4 case is different. When we mount our (old) 7-mode Netapp using NFSv4.0 and re-export that, the re-export server's client cache is invalidated often in the same way as for an NFSv3 server. On top of that, I think I mistook some of the NFSv4 client's natural dropping of metadata from page cache as client invalidations caused by the re-export and client access (without vfs_cache_pressure=0; see my #3 bullet point).

Both of these conspired to make me think that both NFSv3 AND NFSv4 re-exporting showed the same issue when in fact, it's just NFSv3 and the Netapp's v4.0 that require my "hack" to stop the client cache being invalidated. Sorry for any confusion (it is indeed a complicated setup!). Let me summarise then once and for all:

rhel76 server (xfs noatime) -> re-export server (vers=4.x,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = good client cache metadata performance, my hacky patch is not required.
rhel76 server (xfs noatime) -> re-export server (vers=3,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = bad performance (new lookups & getattrs), my hacky patch is required for better performance.
netapp (7-mode) -> re-export server (vers=4.0,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = bad performance, my hacky patch is required for better performance.

So for Jeff's original intention of proxying an NFSv3 server -> NFSv4 clients by re-exporting, the metadata lookup performance will degrade severely as more clients access the same files because the re-export server's client cache is not being used as effectively (re-exported) and lookups are happening for the same files many times within the re-export server's actimeo, even with vfs_cache_pressure=0.

For our particular use case, we could live without NFSv3 (and my horrible hack) except for the fact that the Netapp shows similar behaviour with NFSv4.0 (but Linux servers do not). I don't know if turning off atime updates on the Netapp volume will change anything - I might try it. Of course, re-exporting NFSv3 with good metadata cache performance is still a nice thing to have too.

I'll now see if I can decipher the network calls back to the Netapp (NFSv4.0) as suggested by Jeff to see why it is different.

Daire


* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-05 12:54         ` Daire Byrne
@ 2020-10-13  9:59           ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-10-13  9:59 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, linux-cachefs


----- On 5 Oct, 2020, at 13:54, Daire Byrne daire@dneg.com wrote:
> ----- On 1 Oct, 2020, at 11:36, Jeff Layton jlayton@kernel.org wrote:
> 
>> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
>>> ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:
>>> 
>>> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>>> > > Hi,
>>> > > 
>>> > > I just thought I'd flesh out the other two issues I have found with re-exporting
>>> > > that are ultimately responsible for the biggest performance bottlenecks. And
>>> > > both of them revolve around the caching of metadata file lookups in the NFS
>>> > > client.
>>> > > 
>>> > > Especially for the case where we are re-exporting a server many milliseconds
>>> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
>>> > > client caches metadata and file data so that its many LAN clients all benefit
>>> > > from the re-export server only having to do the WAN lookups once (within a
>>> > > specified coherency time).
>>> > > 
>>> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>>> > > is fairly straightforward, but keeping the metadata cached is particularly
>>> > > difficult. And without the cached metadata we introduce long delays before we
>>> > > can serve the already present and locally cached file data to many waiting
>>> > > clients.
>>> > > 
>>> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
>>> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>>> > > > cut the network packets back to the origin server to zero for repeated lookups.
>>> > > > However, if a client of the re-export server walks paths and memory maps those
>>> > > > files (i.e. loading an application), the re-export server starts issuing
>>> > > > unexpected calls back to the origin server again, ignoring/invalidating the
>>> > > > re-export server's NFS client cache. We worked around this by patching an
>>> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
>>> > > > re-export server is used. I'm not sure about the correctness of this patch but
>>> > > > it works for our corner case.
>>> > > 
>>> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>>> > > re-export server, we can successfully cache the loading of applications and
>>> > > walking of paths directly on the re-export server such that after a couple of
>>> > > runs, there are practically zero packets back to the originating NFS server
>>> > > (great!). But, if we then do the same thing on a client which is mounting that
>>> > > re-export server, the re-export server now starts issuing lots of calls back to
>>> > > the originating server and invalidating its client cache (bad!).
>>> > > 
>>> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
>>> > > (due to atime modification?) most likely via invocation of method
>>> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
>>> > > validate attributes detects changes causing it to be reloaded from the
>>> > > originating server.
>>> > > 
>>> > 
>>> > I'd expect the change attribute to track what's in the actual inode on the
>>> > "home" server. The NFS client is supposed to (mostly) keep the raw
>>> > change attribute in its i_version field.
>>> > 
>>> > The only place we call inode_inc_iversion_raw is in
>>> > nfs_inode_add_request, which I don't think you'd be hitting unless you
>>> > were writing to the file while holding a write delegation.
>>> > 
>>> > What sort of server is hosting the actual data in your setup?
>>> 
>>> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of
>>> (older) Netapps too. The re-export server is running the latest mainline
>>> kernel(s).
>>> 
>>> As far as I can make out, both these originating (home) server types exhibit a
>>> similar (but not exactly the same) effect on the Linux NFS client cache when it
>>> is being re-exported and accessed by other clients. I can replicate it when
>>> only using a read-only mount at every hop so I don't think that writes are
>>> related.
>>> 
>>> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates
>>> that might be causing this client invalidation (which is what I initially
>>> thought) are ultimately a wasted effort.
>>> 
>> 
>> Ok. I suspect there is a bug here somewhere, but with such a complicated
>> setup it's not clear to me where that bug would be. You
>> might need to do some packet sniffing and look at what the servers are
>> sending for change attributes.
>> 
>> nfsd4_change_attribute does mix in the ctime, so your hunch about the
>> atime may be correct. atime updates imply a ctime update and that could
>> cause nfsd to continually send a new one, even on files that aren't
>> being changed.
>> 
>> It might be interesting to doctor nfsd4_change_attribute() to not mix in
>> the ctime and see whether that improves things. If it does, then we may
>> want to teach nfsd how to avoid doing that for certain types of
>> filesystems.
> 
> Okay, I started to run back through all my tests again with various combinations
> of server, client mount options, NFS version etc. with the intention of packet
> capturing as Jeff has suggested.
> 
> But I quickly realised that I had mixed up some previous results before I
> reported them here. The summary is that using an NFS RHEL76 server, a client
> mounting with a recent mainline kernel and re-exporting using NFSv4.x all the
> way through does NOT invalidate the re-export server's NFS client cache
> (great!) like I had assumed before. It does when we mount the originating RHEL7
> server using NFSv3 and re-export, but not with any version of NFSv4 on Linux.
> 
> But I think I know how I got confused - the Netapp NFSv4 case is different. When
> we mount our (old) 7-mode Netapp using NFSv4.0 and re-export that, the
> re-export server's client cache is invalidated often in the same way as for an
> NFSv3 server. On top of that, I think I mistook some of the NFSv4
> client's natural dropping of metadata from page cache as client invalidations
> caused by the re-export and client access (without vfs_cache_pressure=0 and see
> my #3 bullet point).
> 
> Both of these conspired to make me think that both NFSv3 AND NFSv4 re-exporting
> showed the same issue when in fact, it's just NFSv3 and the Netapp's v4.0 that
> require my "hack" to stop the client cache being invalidated. Sorry for any
> confusion (it is indeed a complicated setup!). Let me summarise then once and
> for all:
> 
> rhel76 server (xfs noatime) -> re-export server (vers=4.x,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = good client cache metadata performance, my hacky patch
> is not required.
> rhel76 server (xfs noatime) -> re-export server (vers=3,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = bad performance (new lookups & getattrs), my hacky
> patch is required for better performance.
> netapp (7-mode) -> re-export server (vers=4.0,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = bad performance, my hacky patch is required for better
> performance.
> 
> So for Jeff's original intention of proxying an NFSv3 server -> NFSv4 clients by
> re-exporting, the metadata lookup performance will degrade severely as more
> clients access the same files because the re-export server's client cache is
> not being used as effectively (re-exported) and lookups are happening for the
> same files many times within the re-export server's actimeo even with
> vfs_cache_pressure=0.
> 
> For our particular use case, we could live without NFSv3 (and my horrible hack)
> except for the fact that the Netapp shows similar behaviour with NFSv4.0 (but
> Linux servers do not). I don't know if turning off atime updates on the Netapp
> volume will change anything - I might try it. Of course, re-exporting NFSv3
> with good metadata cache performance is still a nice thing to have too.
> 
> I'll now see if I can decipher the network calls back to the Netapp (NFSv4.0) as
> suggested by Jeff to see why it is different.

I did a little more digging and the big jump in client ops on the re-export server back to the originating Netapp using NFSv4.0 seems to be mostly because it is issuing lots of READDIR calls. The same workload to a Linux NFS server does not issue a single READDIR/READDIRPLUS call (once cached). As to why these are not cached in the client for repeated lookups (without my hack), I have no idea.

However, I was eventually able to devise a workload that could also cause the NFSv4.2 client cache on the re-export server to unexpectedly "lose" entries such that it needed to reissue calls back to an originating Linux server. A large proportion of these were NFS4ERR_NOENT (but not all), so I wonder whether it is something specific to the negative entry cache.

It is really hard following the packets from the re-export's client through the re-export server and on to the originating server, but as far as I can make out, it was mostly issuing access/lookup/getattr for directories (that should already be cached) when the re-export server's clients are issuing calls like readlink (for example resolving a library directory with symlinks).

I have also noticed another couple of new curiosities. If we run a typical small workload against a client mount such that it is all cached for repeat runs, and then re-export that same directory to a remote client and run the same workload, the reads that should already be cached are all fetched again from the originating server. Only then are they cached for repeat runs or for different clients. It's almost like the NFS client cache on the re-export server sees the locally accessed client mount as a different filesystem (and cache) to the knfsd re-exported one. A consequence of embedding the filehandles?

And while looking at the packet traces for this, I also noticed that when re-exported to a client, all the read calls back to the originating server are being chopped up into a maximum of 128k. It's as if I had mounted the originating server using rsize=131072 (it's definitely 1MB). So a client of the re-export server is receiving rsize=1MB reads, but the re-export server is pulling them from the originating server in 128k chunks. This was using NFSV4.2 all the way through.

Is this an expected side-effect of re-exporting? Is it some weird interaction with the nfs client's readahead? It has the effect of large reads requiring 8x more round-trips for re-export clients than if they had just gone direct to the originating server (and gotten 1MB reads).

Daire


* Re: Adventures in NFS re-exporting
  2020-09-16 16:01     ` Daire Byrne
@ 2020-10-19 16:19       ` Daire Byrne
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
                           ` (3 more replies)
  0 siblings, 4 replies; 129+ messages in thread
From: Daire Byrne @ 2020-10-19 16:19 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-cachefs, linux-nfs


----- On 16 Sep, 2020, at 17:01, Daire Byrne daire@dneg.com wrote:

> Trond/Bruce,
> 
> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:
> 
>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>> > (under memory cache churn) when those filehandles are still being
>>> > used by the knfsd's remote clients resulting in sporadic and random
>>> > stale filehandles. This seems to be mostly for directories from
>>> > what I've seen. Does the NFS client not know that knfsd is still
>>> > using those files/dirs? The workaround is to never drop inode &
>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>> > also helps to ensure that we actually make the most of our
>>> > actimeo=3600,nocto mount options for the full specified time.
>>> 
>>> I thought reexport worked by embedding the original server's
>>> filehandles
>>> in the filehandles given out by the reexporting server.
>>> 
>>> So, even if nothing's cached, when the reexporting server gets a
>>> filehandle, it should be able to extract the original filehandle from
>>> it
>>> and use that.
>>> 
>>> I wonder why that's not working?
>> 
>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>> callback for it.
> 
> So in terms of the ESTALE counter on the reexport server, we see it increase if
> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
> difference in the client experience in that with NFSv3 we quickly get
> input/output errors but with NFSv4 we don't. But it does seem like the
> performance drops significantly which makes me think that NFSv4 retries the
> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
> 
> This is the simplest reproducer I could come up with but it may still be
> specific to our workloads/applications and hard to replicate exactly.
> 
> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro
> reexport-server:/vol/software /mnt/software
> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee
> /proc/sys/vm/drop_caches; done
> 
> reexport-server # sysctl -w vm.vfs_cache_pressure=100
> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep
> 10; done
> 
> Where "application" is some big application with lots of paths to scan with libs
> to memory map and "/vol/software" is an NFS mount on the reexport-server from
> another originating NFS server. I don't know why this application loading
> workload shows this best, but perhaps the access patterns of memory mapped
> binaries and libs are particularly susceptible to estale?
> 
> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
> The ESTALE count increases and the client running the application reports
> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
> 
> As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the
> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
> no longer increases and the client experiences no issues (NFSv3 & NFSv4).

I don't suppose anyone has any more thoughts on this one? This is likely the first problem that anyone trying to NFS re-export is going to encounter. If they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are dropped from cache (with the default vfs_cache_pressure=100) and if they re-export NFSv4, the lookup performance will drop significantly as an ESTALE triggers re-lookups.

For our particular use case, it is actually desirable to have vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to help with expensive metadata lookups, but it would still be nice to have the option of using a less drastic setting (such as vfs_cache_pressure=1) to help avoid OOM conditions.

Daire


* [PATCH 0/2] Add NFSv3 emulation of the lookupp operation
  2020-10-19 16:19       ` Daire Byrne
@ 2020-10-19 17:53         ` trondmy
  2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
  2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 17:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").

Trond Myklebust (2):
  NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  NFSv3: Add emulation of the lookupp() operation

 fs/nfs/nfs3proc.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

-- 
2.26.2



* [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-19 17:53           ` trondmy
  2020-10-19 17:53             ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 17:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..a6a222435e9b 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,13 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 }
 
 static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
-		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
-		 struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= dentry->d_name.name,
-		.len		= dentry->d_name.len
+		.name		= name,
+		.len		= len
 	};
 	struct nfs3_diropres	res = {
 		.fh		= fhandle,
@@ -175,15 +174,10 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	int			status;
 	unsigned short task_flags = 0;
 
-	/* Is this is an attribute revalidation, subject to softreval? */
-	if (nfs_lookup_is_soft_revalidate(dentry))
-		task_flags |= RPC_TASK_TIMEOUT;
-
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
 		return -ENOMEM;
 
-	dprintk("NFS call  lookup %pd2\n", dentry);
 	nfs_fattr_init(fattr);
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
 	nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,20 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	return status;
 }
 
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		 struct nfs4_label *label)
+{
+	/* Is this is an attribute revalidation, subject to softreval? */
+	if (nfs_lookup_is_soft_revalidate(dentry))
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	dprintk("NFS call  lookup %pd2\n", dentry);
+	return __nfs3_proc_lookup(dir, dentry->d_name.name,
+				  dentry->d_name.len, fhandle, fattr);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
-- 
2.26.2



* [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation
  2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
@ 2020-10-19 17:53             ` trondmy
  0 siblings, 0 replies; 129+ messages in thread
From: trondmy @ 2020-10-19 17:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index a6a222435e9b..63d1979933f3 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -155,7 +155,8 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 
 static int
 __nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
-		   struct nfs_fh *fhandle, struct nfs_fattr *fattr)
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		   unsigned short task_flags)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
@@ -172,7 +173,6 @@ __nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
 		.rpc_resp	= &res,
 	};
 	int			status;
-	unsigned short task_flags = 0;
 
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
@@ -197,13 +197,25 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
 		 struct nfs4_label *label)
 {
+	unsigned short task_flags = 0;
+
 	/* Is this is an attribute revalidation, subject to softreval? */
 	if (nfs_lookup_is_soft_revalidate(dentry))
 		task_flags |= RPC_TASK_TIMEOUT;
 
 	dprintk("NFS call  lookup %pd2\n", dentry);
 	return __nfs3_proc_lookup(dir, dentry->d_name.name,
-				  dentry->d_name.len, fhandle, fattr);
+				  dentry->d_name.len, fhandle, fattr,
+				  task_flags);
+}
+
+static int nfs3_proc_lookupp(struct inode *inode, struct nfs_fh *fhandle,
+			     struct nfs_fattr *fattr, struct nfs4_label *label)
+{
+	const char dotdot[] = "..";
+	const size_t len = sizeof(dotdot) - 1;
+
+	return __nfs3_proc_lookup(inode, dotdot, len, fhandle, fattr, 0);
 }
 
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
@@ -1012,6 +1024,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
 	.getattr	= nfs3_proc_getattr,
 	.setattr	= nfs3_proc_setattr,
 	.lookup		= nfs3_proc_lookup,
+	.lookupp	= nfs3_proc_lookupp,
 	.access		= nfs3_proc_access,
 	.readlink	= nfs3_proc_readlink,
 	.create		= nfs3_proc_create,
-- 
2.26.2



* [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation
  2020-10-19 16:19       ` Daire Byrne
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-19 20:05         ` trondmy
  2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
  2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
  2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
  3 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 20:05 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").

v2:
 - Fix compilation issues for "NFSv3: Refactor nfs3_proc_lookup() to
   split out the dentry"

Trond Myklebust (2):
  NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  NFSv3: Add emulation of the lookupp() operation

 fs/nfs/nfs3proc.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-19 20:05           ` trondmy
  2020-10-19 20:05             ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 20:05 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..acbdf7496d31 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,14 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 }
 
 static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
-		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
-		 struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		   unsigned short task_flags)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= dentry->d_name.name,
-		.len		= dentry->d_name.len
+		.name		= name,
+		.len		= len
 	};
 	struct nfs3_diropres	res = {
 		.fh		= fhandle,
@@ -173,17 +173,11 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 		.rpc_resp	= &res,
 	};
 	int			status;
-	unsigned short task_flags = 0;
-
-	/* Is this is an attribute revalidation, subject to softreval? */
-	if (nfs_lookup_is_soft_revalidate(dentry))
-		task_flags |= RPC_TASK_TIMEOUT;
 
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
 		return -ENOMEM;
 
-	dprintk("NFS call  lookup %pd2\n", dentry);
 	nfs_fattr_init(fattr);
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
 	nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,23 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	return status;
 }
 
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		 struct nfs4_label *label)
+{
+	unsigned short task_flags = 0;
+
+	/* Is this is an attribute revalidation, subject to softreval? */
+	if (nfs_lookup_is_soft_revalidate(dentry))
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	dprintk("NFS call  lookup %pd2\n", dentry);
+	return __nfs3_proc_lookup(dir, dentry->d_name.name,
+				  dentry->d_name.len, fhandle, fattr,
+				  task_flags);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation
  2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
@ 2020-10-19 20:05             ` trondmy
  0 siblings, 0 replies; 129+ messages in thread
From: trondmy @ 2020-10-19 20:05 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index acbdf7496d31..63d1979933f3 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -209,6 +209,15 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 				  task_flags);
 }
 
+static int nfs3_proc_lookupp(struct inode *inode, struct nfs_fh *fhandle,
+			     struct nfs_fattr *fattr, struct nfs4_label *label)
+{
+	const char *dotdot = "..";
+	const size_t len = sizeof(dotdot) - 1;
+
+	return __nfs3_proc_lookup(inode, dotdot, len, fhandle, fattr, 0);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
@@ -1015,6 +1024,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
 	.getattr	= nfs3_proc_getattr,
 	.setattr	= nfs3_proc_setattr,
 	.lookup		= nfs3_proc_lookup,
+	.lookupp	= nfs3_proc_lookupp,
 	.access		= nfs3_proc_access,
 	.readlink	= nfs3_proc_readlink,
 	.create		= nfs3_proc_create,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation
  2020-10-19 16:19       ` Daire Byrne
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
  2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-20 18:37         ` trondmy
  2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
  2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
  3 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").

v2:
 - Fix compilation issues for "NFSv3: Refactor nfs3_proc_lookup() to
   split out the dentry"
v3:
 - Fix the string length calculation
 - Apply the NFS_MOUNT_SOFTREVAL flag in both the NFSv3 and NFSv4 lookupp

Trond Myklebust (3):
  NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  NFSv3: Add emulation of the lookupp() operation
  NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp

 fs/nfs/nfs3proc.c | 48 ++++++++++++++++++++++++++++++++++++-----------
 fs/nfs/nfs4proc.c |  6 +++++-
 2 files changed, 42 insertions(+), 12 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-20 18:37           ` trondmy
  2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..acbdf7496d31 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,14 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 }
 
 static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
-		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
-		 struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		   unsigned short task_flags)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= dentry->d_name.name,
-		.len		= dentry->d_name.len
+		.name		= name,
+		.len		= len
 	};
 	struct nfs3_diropres	res = {
 		.fh		= fhandle,
@@ -173,17 +173,11 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 		.rpc_resp	= &res,
 	};
 	int			status;
-	unsigned short task_flags = 0;
-
-	/* Is this is an attribute revalidation, subject to softreval? */
-	if (nfs_lookup_is_soft_revalidate(dentry))
-		task_flags |= RPC_TASK_TIMEOUT;
 
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
 		return -ENOMEM;
 
-	dprintk("NFS call  lookup %pd2\n", dentry);
 	nfs_fattr_init(fattr);
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
 	nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,23 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	return status;
 }
 
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		 struct nfs4_label *label)
+{
+	unsigned short task_flags = 0;
+
+	/* Is this is an attribute revalidation, subject to softreval? */
+	if (nfs_lookup_is_soft_revalidate(dentry))
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	dprintk("NFS call  lookup %pd2\n", dentry);
+	return __nfs3_proc_lookup(dir, dentry->d_name.name,
+				  dentry->d_name.len, fhandle, fattr,
+				  task_flags);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation
  2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
@ 2020-10-20 18:37             ` trondmy
  2020-10-20 18:37               ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index acbdf7496d31..6b66b73a50eb 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -209,6 +209,20 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 				  task_flags);
 }
 
+static int nfs3_proc_lookupp(struct inode *inode, struct nfs_fh *fhandle,
+			     struct nfs_fattr *fattr, struct nfs4_label *label)
+{
+	const char dotdot[] = "..";
+	const size_t len = strlen(dotdot);
+	unsigned short task_flags = 0;
+
+	if (NFS_SERVER(inode)->flags & NFS_MOUNT_SOFTREVAL)
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	return __nfs3_proc_lookup(inode, dotdot, len, fhandle, fattr,
+				  task_flags);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
@@ -1015,6 +1029,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
 	.getattr	= nfs3_proc_getattr,
 	.setattr	= nfs3_proc_setattr,
 	.lookup		= nfs3_proc_lookup,
+	.lookupp	= nfs3_proc_lookupp,
 	.access		= nfs3_proc_access,
 	.readlink	= nfs3_proc_readlink,
 	.create		= nfs3_proc_create,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp
  2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
@ 2020-10-20 18:37               ` trondmy
  0 siblings, 0 replies; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We need to respect the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp,
by timing out if the server is unavailable.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs4proc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index bdf33e18fc54..c306c97c1ed0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -4376,6 +4376,10 @@ static int _nfs4_proc_lookupp(struct inode *inode,
 		.rpc_argp = &args,
 		.rpc_resp = &res,
 	};
+	unsigned short task_flags = 0;
+
+	if (NFS_SERVER(inode)->flags & NFS_MOUNT_SOFTREVAL)
+		task_flags |= RPC_TASK_TIMEOUT;
 
 	args.bitmask = nfs4_bitmask(server, label);
 
@@ -4383,7 +4387,7 @@ static int _nfs4_proc_lookupp(struct inode *inode,
 
 	dprintk("NFS call  lookupp ino=0x%lx\n", inode->i_ino);
 	status = nfs4_call_sync(clnt, server, &msg, &args.seq_args,
-				&res.seq_res, 0);
+				&res.seq_res, task_flags);
 	dprintk("NFS reply lookupp: %d\n", status);
 	return status;
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-10-19 16:19       ` Daire Byrne
                           ` (2 preceding siblings ...)
  2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-21  9:33         ` Daire Byrne
  2020-11-09 16:02           ` bfields
  3 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-10-21  9:33 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-cachefs, linux-nfs

----- On 19 Oct, 2020, at 17:19, Daire Byrne daire@dneg.com wrote:
> ----- On 16 Sep, 2020, at 17:01, Daire Byrne daire@dneg.com wrote:
> 
>> Trond/Bruce,
>> 
>> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:
>> 
>>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>>> > (under memory cache churn) when those filehandles are still being
>>>> > used by the knfsd's remote clients resulting in sporadic and random
>>>> > stale filehandles. This seems to be mostly for directories from
>>>> > what I've seen. Does the NFS client not know that knfsd is still
>>>> > using those files/dirs? The workaround is to never drop inode &
>>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>>> > also helps to ensure that we actually make the most of our
>>>> > actimeo=3600,nocto mount options for the full specified time.
>>>> 
>>>> I thought reexport worked by embedding the original server's
>>>> filehandles
>>>> in the filehandles given out by the reexporting server.
>>>> 
>>>> So, even if nothing's cached, when the reexporting server gets a
>>>> filehandle, it should be able to extract the original filehandle from
>>>> it
>>>> and use that.
>>>> 
>>>> I wonder why that's not working?
>>> 
>>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>>> callback for it.
>> 
>> So in terms of the ESTALE counter on the reexport server, we see it increase if
>> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
>> difference in the client experience in that with NFSv3 we quickly get
>> input/output errors but with NFSv4 we don't. But it does seem like the
>> performance drops significantly which makes me think that NFSv4 retries the
>> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
>> 
>> This is the simplest reproducer I could come up with but it may still be
>> specific to our workloads/applications and hard to replicate exactly.
>> 
>> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro
>> reexport-server:/vol/software /mnt/software
>> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee
>> /proc/sys/vm/drop_caches; done
>> 
>> reexport-server # sysctl -w vm.vfs_cache_pressure=100
>> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
>> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep
>> 10; done
>> 
>> Where "application" is some big application with lots of paths to scan with libs
>> to memory map and "/vol/software" is an NFS mount on the reexport-server from
>> another originating NFS server. I don't know why this application loading
>> workload shows this best, but perhaps the access patterns of memory mapped
>> binaries and libs is particularly susceptible to estale?
>> 
>> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
>> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
>> The ESTALE count increases and the client running the application reports
>> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
>> 
>> As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the
>> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
>> no longer increases and the client experiences no issues (NFSv3 & NFSv4).
> 
> I don't suppose anyone has any more thoughts on this one? This is likely the
> first problem that anyone trying to NFS re-export is going to encounter. If
> they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are
> dropped from cache (with the default vfs_cache_pressure=100) and if they
> re-export NFSv4, the lookup performance will drop significantly as an ESTALE
> triggers re-lookups.
> 
> For our particular use case, it is actually desirable to have
> vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to
> help with expensive metadata lookups, but it would still be nice to have the
> option of using a less drastic setting (such as vfs_cache_pressure=1) to help
> avoid OOM conditions.

Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million thanks!) so I applied them to v5.9.1 and ran some more tests using that on the re-export server. Again, I just pathologically dropped inode & dentry caches every second on the re-export server (vfs_cache_pressure=100) while a client looped through some application loading tests.

Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode & dentry caches (yay!).

However, my assumption that some of the input/output errors I was seeing were related to the estales seems to have been misguided. After running these tests again without any estales, it now looks like a different issue that is unique to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or Netapp). The lookups are all fine (no estale) but reading some files eventually gives an input/output error on multiple clients, and those errors persist until the re-export nfs-server is restarted. Again, this only occurs while dropping inode + dentry caches.

So in summary, while continuously dropping inode/dentry caches on the re-export server:

originating server NFSv4.x -> NFSv4.x re-export server = good (no estale, no input/output errors)
originating server NFSv4.1/4.2 -> NFSv3 re-export server = good
originating server NFSv4.0 -> NFSv3 re-export server = no estale but lots of input/output errors
originating server NFSv3 -> NFSv3 re-export server = good (fixed by Trond's lookupp emulation patches)
originating server NFSv3 -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)

In our case, we are stuck with some old 7-mode Netapps so we only have two mount choices, NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0 re-export behaviour). And as discussed previously, a re-export of an NFSv3 server requires my horrible hack in order to avoid excessive lookups and client cache invalidations.

But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export cases, so many thanks again for that Trond. When re-exporting an NFSv3 client mount, we no longer need to change vfs_cache_pressure=0.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
@ 2020-11-09 16:02           ` bfields
  2020-11-12 13:01             ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-09 16:02 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million thanks!) so I applied them to v5.9.1 and ran some more tests using that on the re-export server. Again, I just pathologically dropped inode & dentry caches every second on the re-export server (vfs_cache_pressure=100) while a client looped through some application loading tests.
> 
> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode & dentry caches (yay!).
> 
> However, my assumption that some of the input/output errors I was seeing were related to the estales seems to have been misguided. After running these tests again without any estales, it now looks like a different issue that is unique to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or Netapp). The lookups are all fine (no estale) but reading some files eventually gives an input/output error on multiple clients which remain consistent until the re-export nfs-server is restarted. Again, this only occurs while dropping inode + dentry caches.
> 
> So in summary, while continuously dropping inode/dentry caches on the re-export server:

How continuously, exactly?

I recall that there are some situations where the best the client can do
to handle an ESTALE is just retry.  And that our code generally just
retries once and then gives up.

I wonder if it's possible that the client or re-export server can get
stuck in a situation where they can't guarantee forward progress in the
face of repeated ESTALEs.  I don't have a specific case in mind, though.

--b.

> 
> originating server NFSv4.x -> NFSv4.x re-export server = good (no estale, no input/output errors)
> originating server NFSv4.1/4.2 -> NFSv3 re-export server = good
> originating server NFSv4.0 -> NFSv3 re-export server = no estale but lots of input/output errors
> originating server NFSv3 -> NFSv3 re-export server = good (fixed by Trond's lookupp emulation patches)
> originating server NFSv3 -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)
> 
> In our case, we are stuck with some old 7-mode Netapps so we only have two mount choices, NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0 re-export behaviour). And as discussed previously, a re-export of an NFSv3 server requires my horrible hack in order to avoid excessive lookups and client cache invalidations.
> 
> But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export cases, so many thanks again for that Trond. When re-exporting an NFSv3 client mount, we no longer need to change vfs_cache_pressure=0.
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-09 16:02           ` bfields
@ 2020-11-12 13:01             ` Daire Byrne
  2020-11-12 13:57               ` bfields
  2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
  0 siblings, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-12 13:01 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 9 Nov, 2020, at 16:02, bfields bfields@fieldses.org wrote:
> On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
>> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million
>> thanks!) so I applied them to v5.9.1 and ran some more tests using that on the
>> re-export server. Again, I just pathologically dropped inode & dentry caches
>> every second on the re-export server (vfs_cache_pressure=100) while a client
>> looped through some application loading tests.
>> 
>> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I
>> no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode &
>> dentry caches (yay!).
>> 
>> However, my assumption that some of the input/output errors I was seeing were
>> related to the estales seems to have been misguided. After running these tests
>> again without any estales, it now looks like a different issue that is unique
>> to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or
>> Netapp). The lookups are all fine (no estale) but reading some files eventually
>> gives an input/output error on multiple clients which remain consistent until
>> the re-export nfs-server is restarted. Again, this only occurs while dropping
>> inode + dentry caches.
>> 
>> So in summary, while continuously dropping inode/dentry caches on the re-export
>> server:
> 
> How continuously, exactly?
> 
> I recall that there are some situations where the best the client can do
> to handle an ESTALE is just retry.  And that our code generally just
> retries once and then gives up.
> 
> I wonder if it's possible that the client or re-export server can get
> stuck in a situation where they can't guarantee forward progress in the
> face of repeated ESTALEs.  I don't have a specific case in mind, though.

I was dropping caches every second in a loop on the NFS re-export server. Meanwhile a large Python application that takes ~15 seconds to complete was also looping on a client of the re-export server. So we are clearing out the cache frequently enough that the same Python paths are being re-populated many times over.

Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's NFSv3 lookupp emulation patches, I can now revise my original list of issues that others will likely experience if they ever try to do this craziness:

1) Don't re-export NFSv4.0 unless you set vfs_cache_pressure=0 otherwise you will see random input/output errors on your clients when things are dropped out of the cache. In the end we gave up on using NFSv4.0 with our Netapps because the 7-mode implementation seemed a bit flakey with modern Linux clients (Linux NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with Trond's lookupp emulation patches instead.

2) In order to better utilise the re-export server's client cache when re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to use the horrible inode_peek_iversion_raw hack to maintain good metadata performance for large numbers of clients. Otherwise each re-export server's clients can cause invalidation of the re-export server's client cache. Once you have hundreds of clients they all combine to constantly invalidate the cache resulting in an order of magnitude slower metadata performance. If you are re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not required.

3) For some reason, when a 1MB read call arrives at the re-export server from a client, it gets chopped up into 128k read calls that are issued back to the originating server despite rsize/wsize=1MB on all mounts. This results in a noticeable increase in rpc chatter for large reads. Writes on the other hand retain their 1MB size from client to re-export server and back to the originating server. I am using nconnect but I doubt that is related.

4) After some random time, the cachefilesd userspace daemon stops culling old data from its fscache disk storage. I thought it was to do with setting vfs_cache_pressure=0 but even with it set to the default 100 it just randomly decides to stop culling and never comes back to life until restarted or rebooted. Perhaps the fscache/cachefilesd rewrite that David Howells & David Wysochanski have been working on will improve matters.

5) It's still really hard to cache nfs client metadata for any definitive time (actimeo,nocto) due to the pagecache churn that reads cause. If all required metadata (i.e. directory contents) could be cached either locally on disk or in the inode cache rather than in the pagecache, then maybe we would have more control over the actual cache times we are comfortable with for our workloads. This has little to do with re-exporting and is just a general NFS performance over the WAN thing. I'm very interested to see how Trond's recent patches to improve readdir performance might at least help re-populate the dropped cached metadata more efficiently over the WAN.
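For anyone wanting to persist the tuning from (1) across reboots on a re-export server, a sysctl.d fragment along these lines works (the filename is just an example, and as noted above vfs_cache_pressure=0 trades ESTALE avoidance for the risk of unbounded inode/dentry cache growth):

```shell
# Example: /etc/sysctl.d/90-nfs-reexport.conf
# Pin NFS client inodes & dentries in memory on the re-export server so
# filehandles handed out by knfsd don't go stale under memory cache churn.
# Use a small non-zero value (e.g. 1) instead if OOM is a concern.
vm.vfs_cache_pressure = 0
```

Apply it immediately with `sysctl -p /etc/sysctl.d/90-nfs-reexport.conf`, then watch the `fh` stale counter in /proc/net/rpc/nfsd to confirm it stops increasing.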

I just want to finish with one more crazy thing we have been doing - a re-export server of a re-export server! Again, a locking and consistency nightmare so only possible for very specific workloads (like ours). The advantage of this topology is that you can pull all your data over the WAN once (e.g. on-premise to cloud) and then fan-out that data to multiple other NFS re-export servers in the cloud to improve the aggregate performance to many clients. This avoids having multiple re-export servers all needing to pull the same data across the WAN.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 13:01             ` Daire Byrne
@ 2020-11-12 13:57               ` bfields
  2020-11-12 18:33                 ` Daire Byrne
  2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-12 13:57 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Nov 12, 2020 at 01:01:24PM +0000, Daire Byrne wrote:
> 
> ----- On 9 Nov, 2020, at 16:02, bfields bfields@fieldses.org wrote:
> > On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
> >> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million
> >> thanks!) so I applied them to v5.9.1 and ran some more tests using that on the
> >> re-export server. Again, I just pathologically dropped inode & dentry caches
> >> every second on the re-export server (vfs_cache_pressure=100) while a client
> >> looped through some application loading tests.
> >> 
> >> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I
> >> no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode &
> >> dentry caches (yay!).
> >> 
> >> However, my assumption that some of the input/output errors I was seeing were
> >> related to the estales seems to have been misguided. After running these tests
> >> again without any estales, it now looks like a different issue that is unique
> >> to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or
> >> Netapp). The lookups are all fine (no estale) but reading some files eventually
> >> gives an input/output error on multiple clients which remain consistent until
> >> the re-export nfs-server is restarted. Again, this only occurs while dropping
> >> inode + dentry caches.
> >> 
> >> So in summary, while continuously dropping inode/dentry caches on the re-export
> >> server:
> > 
> > How continuously, exactly?
> > 
> > I recall that there are some situations where the best the client can do
> > to handle an ESTALE is just retry.  And that our code generally just
> > retries once and then gives up.
> > 
> > I wonder if it's possible that the client or re-export server can get
> > stuck in a situation where they can't guarantee forward progress in the
> > face of repeated ESTALEs.  I don't have a specific case in mind, though.
> 
> I was dropping caches every second in a loop on the NFS re-export server. Meanwhile a large python application that takes ~15 seconds to complete was also looping on a client of the re-export server. So we are clearing out the cache many times such that the same python paths are being re-populated many times.
> 
> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's NFSv3 lookupp emulation patches, I can now revise my original list of issues that others will likely experience if they ever try to do this craziness:
> 
> 1) Don't re-export NFSv4.0 unless you set vfs_cache_pressure=0 otherwise you will see random input/output errors on your clients when things are dropped out of the cache. In the end we gave up on using NFSv4.0 with our Netapps because the 7-mode implementation seemed a bit flakey with modern Linux clients (Linux NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with Trond's lookupp emulation patches instead.

So,

		NFSv4.2			  NFSv4.2
	client --------> re-export server -------> original server

works as long as both servers are recent Linux, but when the original
server is Netapp, you need the protocol used in both places to be v3, is
that right?

> 2) In order to better utilise the re-export server's client cache when re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to use the horrible inode_peek_iversion_raw hack to maintain good metadata performance for large numbers of clients. Otherwise each re-export server's clients can cause invalidation of the re-export server client cache. Once you have hundreds of clients they all combine to constantly invalidate the cache resulting in an order of magnitude slower metadata performance. If you are re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not required.

Have we figured out why that's required, or found a longer-term
solution?  (Apologies, the memory of the earlier conversation is
fading....)

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 13:57               ` bfields
@ 2020-11-12 18:33                 ` Daire Byrne
  2020-11-12 20:55                   ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-12 18:33 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 12 Nov, 2020, at 13:57, bfields bfields@fieldses.org wrote:
> On Thu, Nov 12, 2020 at 01:01:24PM +0000, Daire Byrne wrote:
>> 
>> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's
>> NFSv3 lookupp emulation patches, I can now revise my original list of issues
>> that others will likely experience if they ever try to do this craziness:
>> 
>> 1) Don't re-export NFSv4.0 unless you set vfs_cache_pressure=0 otherwise you will
>> see random input/output errors on your clients when things are dropped out of
>> the cache. In the end we gave up on using NFSv4.0 with our Netapps because the
>> 7-mode implementation seemed a bit flakey with modern Linux clients (Linux
>> NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with
>> Trond's lookupp emulation patches instead.
> 
> So,
> 
>		NFSv4.2			  NFSv4.2
>	client --------> re-export server -------> original server
> 
> works as long as both servers are recent Linux, but when the original
> server is Netapp, you need the protocol used in both places to be v3, is
> that right?

Well, yes, NFSv4.2 all the way through works well for us, but it's re-exporting an NFSv4.0 server (Linux OR Netapp) that still shows the input/output errors when dropping caches. Every other possible combination now seems to work without ESTALE or input/output errors with the lookupp emulation patches.

So this is still not working when dropping caches on the re-export server:

		NFSv3/4.x			  NFSv4.0
	client --------> re-export server -------> original server

The bit specific to the Netapp is simply that our 7-mode filer only supports NFSv4.0, so I can't actually test NFSv4.1/4.2 without a more modern Netapp firmware release. So I have to use NFSv3 to mount the Netapp and can then happily re-export that using NFSv4.x or NFSv3 (if the filehandles fit in 63 bytes).

>> 2) In order to better utilise the re-export server's client cache when
>> re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to
>> use the horrible inode_peek_iversion_raw hack to maintain good metadata
>> performance for large numbers of clients. Otherwise each re-export server's
>> clients can cause invalidation of the re-export server client cache. Once you
>> have hundreds of clients they all combine to constantly invalidate the cache
>> resulting in an order of magnitude slower metadata performance. If you are
>> re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not
>> required.
> 
> Have we figured out why that's required, or found a longer-term
> solution?  (Apologies, the memory of the earlier conversation is
> fading....)

There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR allowing for the hack/optimisation but I guess that is only for the case when re-exporting NFSv4 to the eventual clients. It would not help if you were re-exporting an NFSv3 server with NFSv3 to the clients? I lack the deeper understanding to say anything more than that.

In our case we re-export everything to the clients using NFSv4.2 whether the originating server is NFSv3 (e.g. our Netapp) or NFSv4.2 (our RHEL7 storage servers).

With NFSv4.2 as the originating server, we found that either this hack/optimisation was not required, or the incidence of invalidating the re-export server's client cache was low enough not to cause significant performance problems when many clients requested the same metadata.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 18:33                 ` Daire Byrne
@ 2020-11-12 20:55                   ` bfields
  2020-11-12 23:05                     ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-12 20:55 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Nov 12, 2020 at 06:33:45PM +0000, Daire Byrne wrote:
> Well, yes NFSv4.2 all the way through works well for us but it's re-exporting a NFSv4.0 server (Linux OR Netapp) that seems to still show the input/output errors when dropping caches. Every other possible combination now seems to be working without ESTALE or input/errors with the lookupp emulation patches.
> 
> So this is still not working when dropping caches on the re-export server:
> 
> 		NFSv3/4.x			  NFSv4.0
> 	client --------> re-export server -------> original server
> 
> The bit specific to the Netapp is simply that our 7-mode only supports NFSv4.0 so I can't actually test NFSv4.1/4.2 on a more modern Netapp firmware release. So I have to use NFSv3 to mount the Netapp and can then happily re-export that using NFSv4.x or NFSv3 (if the filehandles fit in 63 bytes).

Oh, got it, thanks, so it's just the minor-version difference (probably
the open-by-filehandle stuff that went into 4.1).

> There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
> allowing for the hack/optimisation but I guess that is only for the
> case when re-exporting NFSv4 to the eventual clients. It would not
> help if you were re-exporting an NFSv3 server with NFSv3 to the
> clients? I lack the deeper understanding to say anything more than
> that.

Oh, right, thanks for the reminder.  The CHANGE_TYPE_IS_MONOTONIC_INCR
optimization still looks doable to me.

How does that help, anyway?  I guess it avoids false positives of some
kind when rpc's are processed out of order?

Looking back at

	https://lore.kernel.org/linux-nfs/1155061727.42788071.1600777874179.JavaMail.zimbra@dneg.com/

this bothers me: "I'm not exactly sure why, but the iversion of the
inode gets changed locally (due to atime modification?) most likely via
invocation of method inode_inc_iversion_raw. Each time it gets
incremented the following call to validate attributes detects changes
causing it to be reloaded from the originating server."

The only call to that function outside afs or ceph code is in
fs/nfs/write.c, in the write delegation case.  The Linux server doesn't
support write delegations; Netapp does, but this shouldn't be causing
cache invalidations.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 20:55                   ` bfields
@ 2020-11-12 23:05                     ` Daire Byrne
  2020-11-13 14:50                       ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-12 23:05 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 12 Nov, 2020, at 20:55, bfields bfields@fieldses.org wrote:
> On Thu, Nov 12, 2020 at 06:33:45PM +0000, Daire Byrne wrote:
>> There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
>> allowing for the hack/optimisation but I guess that is only for the
>> case when re-exporting NFSv4 to the eventual clients. It would not
>> help if you were re-exporting an NFSv3 server with NFSv3 to the
>> clients? I lack the deeper understanding to say anything more than
>> that.
> 
> Oh, right, thanks for the reminder.  The CHANGE_TYPE_IS_MONOTONIC_INCR
> optimization still looks doable to me.
> 
> How does that help, anyway?  I guess it avoids false positives of some
> kind when rpc's are processed out of order?
> 
> Looking back at
> 
>	https://lore.kernel.org/linux-nfs/1155061727.42788071.1600777874179.JavaMail.zimbra@dneg.com/
> 
> this bothers me: "I'm not exactly sure why, but the iversion of the
> inode gets changed locally (due to atime modification?) most likely via
> invocation of method inode_inc_iversion_raw. Each time it gets
> incremented the following call to validate attributes detects changes
> causing it to be reloaded from the originating server."
> 
> The only call to that function outside afs or ceph code is in
> fs/nfs/write.c, in the write delegation case.  The Linux server doesn't
> support write delegations, Netapp does but this shouldn't be causing
> cache invalidations.

So, I can't lay claim to identifying the exact optimisation/hack that improves the retention of the re-export server's client cache when re-exporting an NFSv3 server (which is then read by many clients). We were working with an engineer at the time who showed an interest in our use case and, after we supplied a reproducer, he suggested modifying nfs/inode.c:

-		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
+		if (inode_peek_iversion_raw(inode) < fattr->change_attr) {

His reasoning at the time was:

"Fixes inode invalidation caused by read access. The least
 important bit is ORed with 1 and causes the inode version to differ from the
 one seen on the NFS share. This in turn causes unnecessary re-download
 impacting the performance significantly. This fix makes it only re-fetch file
 content if inode version seen on the server is newer than the one on the
 client."

But I've always been puzzled by why this only seems to be the case when using knfsd to re-export the (NFSv3) client mount. Using multiple processes on a standard client mount never causes any similar re-validations. And this happens with a completely read-only share which is why I started to think it has something to do with atimes as that could perhaps still cause a "write" modification even when read-only?

In our case we saw this at its most extreme when we were re-exporting a read-only NFSv3 Netapp "software" share and loading large applications with many python search paths to trawl through. Multiple clients of the re-export server just kept causing the re-export server's client to re-validate and re-download from the Netapp even though no files or dirs had changed and actimeo was set large (with nocto for good measure).

The patch made it so that the re-export server's client cache behaved the same whether we ran 100 processes directly on the NFSv3 client mount (on the re-export server) or ran them on 100 clients of the re-export server - the data remained in the client cache for the duration. So the re-export server fetches the data from the originating server once and then serves those results many times over to all the clients from its cache - exactly what we want.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 23:05                     ` Daire Byrne
@ 2020-11-13 14:50                       ` bfields
  2020-11-13 22:26                         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-13 14:50 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> So, I can't lay claim to identifying the exact optimisation/hack that
> improves the retention of the re-export server's client cache when
> re-exporting an NFSv3 server (which is then read by many clients). We
> were working with an engineer at the time who showed an interest in
> our use case and after we supplied a reproducer he suggested modifying
> the nfs/inode.c
> 
> -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> {
> 
> His reasoning at the time was:
> 
> "Fixes inode invalidation caused by read access. The least important
> bit is ORed with 1 and causes the inode version to differ from the one
> seen on the NFS share. This in turn causes unnecessary re-download
> impacting the performance significantly. This fix makes it only
> re-fetch file content if inode version seen on the server is newer
> than the one on the client."
> 
> But I've always been puzzled by why this only seems to be the case
> when using knfsd to re-export the (NFSv3) client mount. Using multiple
> processes on a standard client mount never causes any similar
> re-validations. And this happens with a completely read-only share
> which is why I started to think it has something to do with atimes as
> that could perhaps still cause a "write" modification even when
> read-only?

Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
i_version.  That's a special thing that only nfsd would do.

I think that's totally fixable, we'll just have to think a little about
how....

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-13 14:50                       ` bfields
@ 2020-11-13 22:26                         ` bfields
  2020-11-14 12:57                           ` Daire Byrne
  2020-11-16 15:29                           ` Jeff Layton
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-11-13 22:26 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > So, I can't lay claim to identifying the exact optimisation/hack that
> > improves the retention of the re-export server's client cache when
> > re-exporting an NFSv3 server (which is then read by many clients). We
> > were working with an engineer at the time who showed an interest in
> > our use case and after we supplied a reproducer he suggested modifying
> > the nfs/inode.c
> > 
> > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > {
> > 
> > His reasoning at the time was:
> > 
> > "Fixes inode invalidation caused by read access. The least important
> > bit is ORed with 1 and causes the inode version to differ from the one
> > seen on the NFS share. This in turn causes unnecessary re-download
> > impacting the performance significantly. This fix makes it only
> > re-fetch file content if inode version seen on the server is newer
> > than the one on the client."
> > 
> > But I've always been puzzled by why this only seems to be the case
> > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > processes on a standard client mount never causes any similar
> > re-validations. And this happens with a completely read-only share
> > which is why I started to think it has something to do with atimes as
> > that could perhaps still cause a "write" modification even when
> > read-only?
> 
> Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> i_version.  That's a special thing that only nfsd would do.
> 
> I think that's totally fixable, we'll just have to think a little about
> how....

I wonder if something like this helps?--b.

commit 0add88a9ccc5
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Fri Nov 13 17:03:04 2020 -0500

    nfs: don't mangle i_version on NFS
    
    The i_version on NFS is pretty much opaque to the client, so we don't
    want to give the low bit any special interpretation.
    
    Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
    i_version on their own.
    
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
index 29ec8b09a52d..9b8dd5b713a7 100644
--- a/fs/nfs/fs_context.c
+++ b/fs/nfs/fs_context.c
@@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
 	.init_fs_context	= nfs_init_fs_context,
 	.parameters		= nfs_fs_parameters,
 	.kill_sb		= nfs_kill_super,
-	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+				  FS_PRIVATE_I_VERSION,
 };
 MODULE_ALIAS_FS("nfs");
 EXPORT_SYMBOL_GPL(nfs_fs_type);
@@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
 	.init_fs_context	= nfs_init_fs_context,
 	.parameters		= nfs_fs_parameters,
 	.kill_sb		= nfs_kill_super,
-	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+				  FS_PRIVATE_I_VERSION,
 };
 MODULE_ALIAS_FS("nfs4");
 MODULE_ALIAS("nfs4");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21cc971fd960..c5bb4268228b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2217,6 +2217,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
+#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
 #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..52c790a847de 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
 	u64 cur, old, new;
 
 	cur = inode_peek_iversion_raw(inode);
+	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
+		return cur;
 	for (;;) {
 		/* If flag is already set, then no need to swap */
 		if (cur & I_VERSION_QUERIED) {

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-13 22:26                         ` bfields
@ 2020-11-14 12:57                           ` Daire Byrne
  2020-11-16 15:18                             ` bfields
  2020-11-16 15:53                             ` bfields
  2020-11-16 15:29                           ` Jeff Layton
  1 sibling, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-14 12:57 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 13 Nov, 2020, at 22:26, bfields bfields@fieldses.org wrote:
> On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
>> On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
>> > So, I can't lay claim to identifying the exact optimisation/hack that
>> > improves the retention of the re-export server's client cache when
>> > re-exporting an NFSv3 server (which is then read by many clients). We
>> > were working with an engineer at the time who showed an interest in
>> > our use case and after we supplied a reproducer he suggested modifying
>> > the nfs/inode.c
>> > 
>> > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
>> > {
>> > 
>> > His reasoning at the time was:
>> > 
>> > "Fixes inode invalidation caused by read access. The least important
>> > bit is ORed with 1 and causes the inode version to differ from the one
>> > seen on the NFS share. This in turn causes unnecessary re-download
>> > impacting the performance significantly. This fix makes it only
>> > re-fetch file content if inode version seen on the server is newer
>> > than the one on the client."
>> > 
>> > But I've always been puzzled by why this only seems to be the case
>> > when using knfsd to re-export the (NFSv3) client mount. Using multiple
>> > processes on a standard client mount never causes any similar
>> > re-validations. And this happens with a completely read-only share
>> > which is why I started to think it has something to do with atimes as
>> > that could perhaps still cause a "write" modification even when
>> > read-only?
>> 
>> Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
>> i_version.  That's a special thing that only nfsd would do.
>> 
>> I think that's totally fixable, we'll just have to think a little about
>> how....
> 
> I wonder if something like this helps?--b.
> 
> commit 0add88a9ccc5
> Author: J. Bruce Fields <bfields@redhat.com>
> Date:   Fri Nov 13 17:03:04 2020 -0500
> 
>    nfs: don't mangle i_version on NFS
>    
>    The i_version on NFS has pretty much opaque to the client, so we don't
>    want to give the low bit any special interpretation.
>    
>    Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
>    i_version on their own.
>    
>    Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> index 29ec8b09a52d..9b8dd5b713a7 100644
> --- a/fs/nfs/fs_context.c
> +++ b/fs/nfs/fs_context.c
> @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> 	.init_fs_context	= nfs_init_fs_context,
> 	.parameters		= nfs_fs_parameters,
> 	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs");
> EXPORT_SYMBOL_GPL(nfs_fs_type);
> @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> 	.init_fs_context	= nfs_init_fs_context,
> 	.parameters		= nfs_fs_parameters,
> 	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs4");
> MODULE_ALIAS("nfs4");
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21cc971fd960..c5bb4268228b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,7 @@ struct file_system_type {
> #define FS_HAS_SUBTYPE		4
> #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename()
> internally. */
> 	int (*init_fs_context)(struct fs_context *);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..52c790a847de 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> 	u64 cur, old, new;
> 
> 	cur = inode_peek_iversion_raw(inode);
> +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> +		return cur;
> 	for (;;) {
> 		/* If flag is already set, then no need to swap */
>  		if (cur & I_VERSION_QUERIED) {

Yes, I can confirm that this absolutely helps! I replaced our (brute force) iversion patch with this (much nicer) patch and we got the same improvement; nfsd and its clients no longer cause the re-export server's client cache to constantly be re-validated. The re-export server can now serve the same results to many clients from its cache. Thanks so much for spending the time to track this down. If merged, future (crazy) NFS re-exporters will benefit from the metadata performance improvement/acceleration!

Now if anyone has any ideas why all the read calls to the originating server are limited to a maximum of 128k (with rsize=1M) when coming via the re-export server's nfsd threads, I see that as the next biggest performance issue. Reading directly on the re-export server with a userspace process issues 1MB reads as expected. It doesn't happen for writes (wsize=1MB all the way through), but perhaps that is because async write-back caching builds up the request size before commit?

I figure the other remaining items on my (wish) list are probably more in the "won't fix" or "can't fix" category (except maybe the NFSv4.0 input/output errors?).

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-14 12:57                           ` Daire Byrne
@ 2020-11-16 15:18                             ` bfields
  2020-11-16 15:53                             ` bfields
  1 sibling, 0 replies; 129+ messages in thread
From: bfields @ 2020-11-16 15:18 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs, jlayton

Jeff, does something like this look reasonable?

--b.

On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
> ----- On 13 Nov, 2020, at 22:26, bfields bfields@fieldses.org wrote:
> > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> >> Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> >> i_version.  That's a special thing that only nfsd would do.
> >> 
> >> I think that's totally fixable, we'll just have to think a little about
> >> how....
> > 
> > I wonder if something like this helps?--b.
> > 
> > commit 0add88a9ccc5
> > Author: J. Bruce Fields <bfields@redhat.com>
> > Date:   Fri Nov 13 17:03:04 2020 -0500
> > 
> >    nfs: don't mangle i_version on NFS
> >    
> >    The i_version on NFS has pretty much opaque to the client, so we don't
> >    want to give the low bit any special interpretation.
> >    
> >    Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> >    i_version on their own.
> >    
> >    Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > 
> > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > index 29ec8b09a52d..9b8dd5b713a7 100644
> > --- a/fs/nfs/fs_context.c
> > +++ b/fs/nfs/fs_context.c
> > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > 	.init_fs_context	= nfs_init_fs_context,
> > 	.parameters		= nfs_fs_parameters,
> > 	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs");
> > EXPORT_SYMBOL_GPL(nfs_fs_type);
> > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > 	.init_fs_context	= nfs_init_fs_context,
> > 	.parameters		= nfs_fs_parameters,
> > 	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs4");
> > MODULE_ALIAS("nfs4");
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 21cc971fd960..c5bb4268228b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > #define FS_HAS_SUBTYPE		4
> > #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename()
> > internally. */
> > 	int (*init_fs_context)(struct fs_context *);
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..52c790a847de 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > 	u64 cur, old, new;
> > 
> > 	cur = inode_peek_iversion_raw(inode);
> > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > +		return cur;
> > 	for (;;) {
> > 		/* If flag is already set, then no need to swap */
> >  		if (cur & I_VERSION_QUERIED) {
> 
> Yes, I can confirm that this absolutely helps! I replaced our (brute force) iversion patch with this (much nicer) patch and we got the same improvement; nfsd and it's clients no longer cause the re-export server's client cache to constantly be re-validated. The re-export server can now serve the same results to many clients from cache. Thanks so much for spending the time to track this down. If merged, future (crazy) NFS re-exporters will benefit from the metadata performance improvement/acceleration!
> 
> Now if anyone has any ideas why all the read calls to the originating server are limited to a maximum of 128k (with rsize=1M) when coming via the re-export server's nfsd threads, I see that as the next biggest performance issue. Reading directly on the re-export server with a userspace process issues 1MB reads as expected. It doesn't happen for writes (wsize=1MB all the way through) but I'm not sure if that has more to do with async and write back caching helping to build up the size before commit?
> 
> I figure the other remaining items on my (wish) list are probably more in the "won't fix" or "can't fix" category (except maybe the NFSv4.0 input/output errors?).
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-13 22:26                         ` bfields
  2020-11-14 12:57                           ` Daire Byrne
@ 2020-11-16 15:29                           ` Jeff Layton
  2020-11-16 15:56                             ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 15:29 UTC (permalink / raw)
  To: bfields, Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > improves the retention of the re-export server's client cache when
> > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > were working with an engineer at the time who showed an interest in
> > > our use case and after we supplied a reproducer he suggested modifying
> > > the nfs/inode.c
> > > 
> > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > {
> > > 
> > > His reasoning at the time was:
> > > 
> > > "Fixes inode invalidation caused by read access. The least important
> > > bit is ORed with 1 and causes the inode version to differ from the one
> > > seen on the NFS share. This in turn causes unnecessary re-download
> > > impacting the performance significantly. This fix makes it only
> > > re-fetch file content if inode version seen on the server is newer
> > > than the one on the client."
> > > 
> > > But I've always been puzzled by why this only seems to be the case
> > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > processes on a standard client mount never causes any similar
> > > re-validations. And this happens with a completely read-only share
> > > which is why I started to think it has something to do with atimes as
> > > that could perhaps still cause a "write" modification even when
> > > read-only?
> > 
> > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > i_version.  That's a special thing that only nfsd would do.
> > 
> > I think that's totally fixable, we'll just have to think a little about
> > how....
> 
> I wonder if something like this helps?--b.
> 
> commit 0add88a9ccc5
> Author: J. Bruce Fields <bfields@redhat.com>
> Date:   Fri Nov 13 17:03:04 2020 -0500
> 
>     nfs: don't mangle i_version on NFS
>     
> 
>     The i_version on NFS is pretty much opaque to the client, so we don't
>     want to give the low bit any special interpretation.
>     
> 
>     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
>     i_version on their own.
>     
> 
>     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> index 29ec8b09a52d..9b8dd5b713a7 100644
> --- a/fs/nfs/fs_context.c
> +++ b/fs/nfs/fs_context.c
> @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
>  	.init_fs_context	= nfs_init_fs_context,
>  	.parameters		= nfs_fs_parameters,
>  	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
>  };
>  MODULE_ALIAS_FS("nfs");
>  EXPORT_SYMBOL_GPL(nfs_fs_type);
> @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
>  	.init_fs_context	= nfs_init_fs_context,
>  	.parameters		= nfs_fs_parameters,
>  	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
>  };
>  MODULE_ALIAS_FS("nfs4");
>  MODULE_ALIAS("nfs4");
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21cc971fd960..c5bb4268228b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,7 @@ struct file_system_type {
>  #define FS_HAS_SUBTYPE		4
>  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
>  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
>  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
>  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
>  	int (*init_fs_context)(struct fs_context *);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..52c790a847de 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
>  	u64 cur, old, new;
>  
> 
>  	cur = inode_peek_iversion_raw(inode);
> +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> +		return cur;
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */
>  		if (cur & I_VERSION_QUERIED) {


It's probably more correct to just check the already-existing
SB_I_VERSION flag here (though in hindsight a fstype flag might have
made more sense).

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-14 12:57                           ` Daire Byrne
  2020-11-16 15:18                             ` bfields
@ 2020-11-16 15:53                             ` bfields
  2020-11-16 19:21                               ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 15:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
> Now if anyone has any ideas why all the read calls to the originating
> server are limited to a maximum of 128k (with rsize=1M) when coming
> via the re-export server's nfsd threads, I see that as the next
> biggest performance issue. Reading directly on the re-export server
> with a userspace process issues 1MB reads as expected. It doesn't
> happen for writes (wsize=1MB all the way through) but I'm not sure if
> that has more to do with async and write back caching helping to build
> up the size before commit?

I'm not sure where to start with this one....

Is this behavior independent of protocol version and backend server?

> I figure the other remaining items on my (wish) list are probably more
> in the "won't fix" or "can't fix" category (except maybe the NFSv4.0
> input/output errors?).

Well, sounds like you've found a case where this feature's actually
useful.  We should make sure that's documented.

And I think it's also worth some effort to document and triage the list
of remaining issues.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 15:29                           ` Jeff Layton
@ 2020-11-16 15:56                             ` bfields
  2020-11-16 16:03                               ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 15:56 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > improves the retention of the re-export server's client cache when
> > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > were working with an engineer at the time who showed an interest in
> > > > our use case and after we supplied a reproducer he suggested modifying
> > > > the nfs/inode.c
> > > > 
> > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > {
> > > > 
> > > > His reasoning at the time was:
> > > > 
> > > > "Fixes inode invalidation caused by read access. The least important
> > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > impacting the performance significantly. This fix makes it only
> > > > re-fetch file content if inode version seen on the server is newer
> > > > than the one on the client."
> > > > 
> > > > But I've always been puzzled by why this only seems to be the case
> > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > processes on a standard client mount never causes any similar
> > > > re-validations. And this happens with a completely read-only share
> > > > which is why I started to think it has something to do with atimes as
> > > > that could perhaps still cause a "write" modification even when
> > > > read-only?
> > > 
> > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > i_version.  That's a special thing that only nfsd would do.
> > > 
> > > I think that's totally fixable, we'll just have to think a little about
> > > how....
> > 
> > I wonder if something like this helps?--b.
> > 
> > commit 0add88a9ccc5
> > Author: J. Bruce Fields <bfields@redhat.com>
> > Date:   Fri Nov 13 17:03:04 2020 -0500
> > 
> >     nfs: don't mangle i_version on NFS
> >     
> > 
> >     The i_version on NFS is pretty much opaque to the client, so we don't
> >     want to give the low bit any special interpretation.
> >     
> > 
> >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> >     i_version on their own.
> >     
> > 
> >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > 
> > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > index 29ec8b09a52d..9b8dd5b713a7 100644
> > --- a/fs/nfs/fs_context.c
> > +++ b/fs/nfs/fs_context.c
> > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> >  	.init_fs_context	= nfs_init_fs_context,
> >  	.parameters		= nfs_fs_parameters,
> >  	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> >  };
> >  MODULE_ALIAS_FS("nfs");
> >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> >  	.init_fs_context	= nfs_init_fs_context,
> >  	.parameters		= nfs_fs_parameters,
> >  	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> >  };
> >  MODULE_ALIAS_FS("nfs4");
> >  MODULE_ALIAS("nfs4");
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 21cc971fd960..c5bb4268228b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2217,6 +2217,7 @@ struct file_system_type {
> >  #define FS_HAS_SUBTYPE		4
> >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> >  	int (*init_fs_context)(struct fs_context *);
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..52c790a847de 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> >  	u64 cur, old, new;
> >  
> > 
> >  	cur = inode_peek_iversion_raw(inode);
> > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > +		return cur;
> >  	for (;;) {
> >  		/* If flag is already set, then no need to swap */
> >  		if (cur & I_VERSION_QUERIED) {
> 
> 
> It's probably more correct to just check the already-existing
> SB_I_VERSION flag here

So the check would be

	if (!IS_I_VERSION(inode))
		return cur;

?

> (though in hindsight a fstype flag might have made more sense).

I_VERSION support can vary by superblock (for example, xfs supports it
or not depending on on-disk format version).

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 15:56                             ` bfields
@ 2020-11-16 16:03                               ` Jeff Layton
  2020-11-16 16:14                                 ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 16:03 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > improves the retention of the re-export server's client cache when
> > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > were working with an engineer at the time who showed an interest in
> > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > the nfs/inode.c
> > > > > 
> > > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > {
> > > > > 
> > > > > His reasoning at the time was:
> > > > > 
> > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > impacting the performance significantly. This fix makes it only
> > > > > re-fetch file content if inode version seen on the server is newer
> > > > > than the one on the client."
> > > > > 
> > > > > But I've always been puzzled by why this only seems to be the case
> > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > processes on a standard client mount never causes any similar
> > > > > re-validations. And this happens with a completely read-only share
> > > > > which is why I started to think it has something to do with atimes as
> > > > > that could perhaps still cause a "write" modification even when
> > > > > read-only?
> > > > 
> > > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > > i_version.  That's a special thing that only nfsd would do.
> > > > 
> > > > I think that's totally fixable, we'll just have to think a little about
> > > > how....
> > > 
> > > I wonder if something like this helps?--b.
> > > 
> > > commit 0add88a9ccc5
> > > Author: J. Bruce Fields <bfields@redhat.com>
> > > Date:   Fri Nov 13 17:03:04 2020 -0500
> > > 
> > >     nfs: don't mangle i_version on NFS
> > >     
> > > 
> > >     The i_version on NFS is pretty much opaque to the client, so we don't
> > >     want to give the low bit any special interpretation.
> > >     
> > > 
> > >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > >     i_version on their own.
> > >     
> > > 
> > >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > > 
> > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > --- a/fs/nfs/fs_context.c
> > > +++ b/fs/nfs/fs_context.c
> > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > >  	.init_fs_context	= nfs_init_fs_context,
> > >  	.parameters		= nfs_fs_parameters,
> > >  	.kill_sb		= nfs_kill_super,
> > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > +				  FS_PRIVATE_I_VERSION,
> > >  };
> > >  MODULE_ALIAS_FS("nfs");
> > >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > >  	.init_fs_context	= nfs_init_fs_context,
> > >  	.parameters		= nfs_fs_parameters,
> > >  	.kill_sb		= nfs_kill_super,
> > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > +				  FS_PRIVATE_I_VERSION,
> > >  };
> > >  MODULE_ALIAS_FS("nfs4");
> > >  MODULE_ALIAS("nfs4");
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 21cc971fd960..c5bb4268228b 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > >  #define FS_HAS_SUBTYPE		4
> > >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> > >  	int (*init_fs_context)(struct fs_context *);
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 2917ef990d43..52c790a847de 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > >  	u64 cur, old, new;
> > >  
> > > 
> > >  	cur = inode_peek_iversion_raw(inode);
> > > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > +		return cur;
> > >  	for (;;) {
> > >  		/* If flag is already set, then no need to swap */
> > >  		if (cur & I_VERSION_QUERIED) {
> > 
> > 
> > It's probably more correct to just check the already-existing
> > SB_I_VERSION flag here
> 
> So the check would be
> 
> 	if (!IS_I_VERSION(inode))
> 		return cur;
> 
> ?
> 

Yes, that looks about right.

> > (though in hindsight a fstype flag might have made more sense).
> 
> I_VERSION support can vary by superblock (for example, xfs supports it
> or not depending on on-disk format version).
> 

Good point!

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 16:03                               ` Jeff Layton
@ 2020-11-16 16:14                                 ` bfields
  2020-11-16 16:38                                   ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 16:14 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 11:03:00AM -0500, Jeff Layton wrote:
> On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> > On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > > improves the retention of the re-export server's client cache when
> > > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > > were working with an engineer at the time who showed an interest in
> > > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > > the nfs/inode.c
> > > > > > 
> > > > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > > {
> > > > > > 
> > > > > > His reasoning at the time was:
> > > > > > 
> > > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > > impacting the performance significantly. This fix makes it only
> > > > > > re-fetch file content if inode version seen on the server is newer
> > > > > > than the one on the client."
> > > > > > 
> > > > > > But I've always been puzzled by why this only seems to be the case
> > > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > > processes on a standard client mount never causes any similar
> > > > > > re-validations. And this happens with a completely read-only share
> > > > > > which is why I started to think it has something to do with atimes as
> > > > > > that could perhaps still cause a "write" modification even when
> > > > > > read-only?
> > > > > 
> > > > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > > > i_version.  That's a special thing that only nfsd would do.
> > > > > 
> > > > > I think that's totally fixable, we'll just have to think a little about
> > > > > how....
> > > > 
> > > > I wonder if something like this helps?--b.
> > > > 
> > > > commit 0add88a9ccc5
> > > > Author: J. Bruce Fields <bfields@redhat.com>
> > > > Date:   Fri Nov 13 17:03:04 2020 -0500
> > > > 
> > > >     nfs: don't mangle i_version on NFS
> > > >     
> > > > 
> > > >     The i_version on NFS is pretty much opaque to the client, so we don't
> > > >     want to give the low bit any special interpretation.
> > > >     
> > > > 
> > > >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > > >     i_version on their own.
> > > >     
> > > > 
> > > >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > > > 
> > > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > > --- a/fs/nfs/fs_context.c
> > > > +++ b/fs/nfs/fs_context.c
> > > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > > >  	.init_fs_context	= nfs_init_fs_context,
> > > >  	.parameters		= nfs_fs_parameters,
> > > >  	.kill_sb		= nfs_kill_super,
> > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > +				  FS_PRIVATE_I_VERSION,
> > > >  };
> > > >  MODULE_ALIAS_FS("nfs");
> > > >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > > >  	.init_fs_context	= nfs_init_fs_context,
> > > >  	.parameters		= nfs_fs_parameters,
> > > >  	.kill_sb		= nfs_kill_super,
> > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > +				  FS_PRIVATE_I_VERSION,
> > > >  };
> > > >  MODULE_ALIAS_FS("nfs4");
> > > >  MODULE_ALIAS("nfs4");
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 21cc971fd960..c5bb4268228b 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > > >  #define FS_HAS_SUBTYPE		4
> > > >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > > >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > > > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > > >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > > >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> > > >  	int (*init_fs_context)(struct fs_context *);
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 2917ef990d43..52c790a847de 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > > >  	u64 cur, old, new;
> > > >  
> > > > 
> > > >  	cur = inode_peek_iversion_raw(inode);
> > > > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > > +		return cur;
> > > >  	for (;;) {
> > > >  		/* If flag is already set, then no need to swap */
> > > >  		if (cur & I_VERSION_QUERIED) {
> > > 
> > > 
> > > It's probably more correct to just check the already-existing
> > > SB_I_VERSION flag here
> > 
> > So the check would be
> > 
> > 	if (!IS_I_VERSION(inode))
> > 		return cur;
> > 
> > ?
> > 
> 
> Yes, that looks about right.

That doesn't sound right to me.  NFS, for example, has a perfectly good
i_version that works as a change attribute, so it should set
SB_I_VERSION.  But it doesn't want the vfs playing games with the low
bit.

(In fact, I'm confused now: the improvement Daire was seeing should only
be possible if the re-export server was seeing SB_I_VERSION set on the
NFS filesystem it was exporting, but a quick grep doesn't actually show
me where NFS is setting SB_I_VERSION.  I'm missing something
obvious....)

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 16:14                                 ` bfields
@ 2020-11-16 16:38                                   ` Jeff Layton
  2020-11-16 19:03                                     ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 16:38 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 11:14 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 11:03:00AM -0500, Jeff Layton wrote:
> > On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> > > On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > > > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > > > improves the retention of the re-export server's client cache when
> > > > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > > > were working with an engineer at the time who showed an interest in
> > > > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > > > the nfs/inode.c
> > > > > > > 
> > > > > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > > > {
> > > > > > > 
> > > > > > > His reasoning at the time was:
> > > > > > > 
> > > > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > > > impacting the performance significantly. This fix makes it only
> > > > > > > re-fetch file content if inode version seen on the server is newer
> > > > > > > than the one on the client."
> > > > > > > 
> > > > > > > But I've always been puzzled by why this only seems to be the case
> > > > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > > > processes on a standard client mount never causes any similar
> > > > > > > re-validations. And this happens with a completely read-only share
> > > > > > > which is why I started to think it has something to do with atimes as
> > > > > > > that could perhaps still cause a "write" modification even when
> > > > > > > read-only?
> > > > > > 
> > > > > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > > > > i_version.  That's a special thing that only nfsd would do.
> > > > > > 
> > > > > > I think that's totally fixable, we'll just have to think a little about
> > > > > > how....
> > > > > 
> > > > > I wonder if something like this helps?--b.
> > > > > 
> > > > > commit 0add88a9ccc5
> > > > > Author: J. Bruce Fields <bfields@redhat.com>
> > > > > Date:   Fri Nov 13 17:03:04 2020 -0500
> > > > > 
> > > > >     nfs: don't mangle i_version on NFS
> > > > >     
> > > > > 
> > > > >     The i_version on NFS is pretty much opaque to the client, so we don't
> > > > >     want to give the low bit any special interpretation.
> > > > >     
> > > > > 
> > > > >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > > > >     i_version on their own.
> > > > >     
> > > > > 
> > > > >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > > > > 
> > > > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > > > --- a/fs/nfs/fs_context.c
> > > > > +++ b/fs/nfs/fs_context.c
> > > > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > > > >  	.init_fs_context	= nfs_init_fs_context,
> > > > >  	.parameters		= nfs_fs_parameters,
> > > > >  	.kill_sb		= nfs_kill_super,
> > > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > > +				  FS_PRIVATE_I_VERSION,
> > > > >  };
> > > > >  MODULE_ALIAS_FS("nfs");
> > > > >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > > > >  	.init_fs_context	= nfs_init_fs_context,
> > > > >  	.parameters		= nfs_fs_parameters,
> > > > >  	.kill_sb		= nfs_kill_super,
> > > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > > +				  FS_PRIVATE_I_VERSION,
> > > > >  };
> > > > >  MODULE_ALIAS_FS("nfs4");
> > > > >  MODULE_ALIAS("nfs4");
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index 21cc971fd960..c5bb4268228b 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > > > >  #define FS_HAS_SUBTYPE		4
> > > > >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > > > >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > > > > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > > > >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > > > >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> > > > >  	int (*init_fs_context)(struct fs_context *);
> > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > > index 2917ef990d43..52c790a847de 100644
> > > > > --- a/include/linux/iversion.h
> > > > > +++ b/include/linux/iversion.h
> > > > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > > > >  	u64 cur, old, new;
> > > > >  
> > > > > 
> > > > >  	cur = inode_peek_iversion_raw(inode);
> > > > > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > > > +		return cur;
> > > > >  	for (;;) {
> > > > >  		/* If flag is already set, then no need to swap */
> > > > >  		if (cur & I_VERSION_QUERIED) {
> > > > 
> > > > 
> > > > It's probably more correct to just check the already-existing
> > > > SB_I_VERSION flag here
> > > 
> > > So the check would be
> > > 
> > > 	if (!IS_I_VERSION(inode))
> > > 		return cur;
> > > 
> > > ?
> > > 
> > 
> > Yes, that looks about right.
> 
> That doesn't sound right to me.  NFS, for example, has a perfectly good
> i_version that works as a change attribute, so it should set
> SB_I_VERSION.  But it doesn't want the vfs playing games with the low
> bit.
> 
> (In fact, I'm confused now: the improvement Daire was seeing should only
> be possible if the re-export server was seeing SB_I_VERSION set on the
> NFS filesystem it was exporting, but a quick grep doesn't actually show
> me where NFS is setting SB_I_VERSION.  I'm missing something
> obvious....)


Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
nfs3 code as well. The v4 caller (encode_change) only calls it when
IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
that.

I think the basic issue here is that we're trying to use SB_I_VERSION
for two different things. Its main purpose is to tell the kernel that
when it's updating the file times that it should also (possibly)
increment the i_version counter too. (Some of this is documented in
include/linux/iversion.h too, fwiw)

nfsd needs a way to tell whether the field should be consulted at all.
For that we probably do need a different flag of some sort. Doing it at
the fstype level seems a bit wrong though -- v2/3 don't have a real
change attribute and it probably shouldn't be trusted when exporting
them.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 16:38                                   ` Jeff Layton
@ 2020-11-16 19:03                                     ` bfields
  2020-11-16 20:03                                       ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 19:03 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 11:38:44AM -0500, Jeff Layton wrote:
> Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
> nfs3 code as well. The v4 caller (encode_change) only calls it when
> IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
> that.

Weird.  Looking back....  That goes back to the original patch adding
support for ext4's i_version, c654b8a9cba6 "nfsd: support ext4
i_version".

It's in nfs3xdr.c, but the fields it's filling in, fh_pre_change and
fh_post_change, are only used in nfs4xdr.c.  Maybe moving it someplace
else (vfs.c?) would save some confusion.

Anyway, yes, that should be checking SB_I_VERSION too.

> I think the basic issue here is that we're trying to use SB_I_VERSION
> for two different things. Its main purpose is to tell the kernel that
> when it's updating the file times that it should also (possibly)
> increment the i_version counter too. (Some of this is documented in
> include/linux/iversion.h too, fwiw)
> 
> nfsd needs a way to tell whether the field should be consulted at all.
> For that we probably do need a different flag of some sort. Doing it at
> the fstype level seems a bit wrong though -- v2/3 don't have a real
> change attribute and it probably shouldn't be trusted when exporting
> them.

Oops, good point.

I suppose simplest is just another SB_ flag.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 15:53                             ` bfields
@ 2020-11-16 19:21                               ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-16 19:21 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 16 Nov, 2020, at 15:53, bfields bfields@fieldses.org wrote:
> On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
>> Now if anyone has any ideas why all the read calls to the originating
>> server are limited to a maximum of 128k (with rsize=1M) when coming
>> via the re-export server's nfsd threads, I see that as the next
>> biggest performance issue. Reading directly on the re-export server
>> with a userspace process issues 1MB reads as expected. It doesn't
>> happen for writes (wsize=1MB all the way through) but I'm not sure if
>> that has more to do with async and write back caching helping to build
>> up the size before commit?
> 
> I'm not sure where to start with this one....
> 
> Is this behavior independent of protocol version and backend server?

It seems to be the case for all combinations of backend versions and re-export versions.

But it does look like it is related to readahead somehow. The default readahead for a client mount is 128k...

I just increased it to 1024 (kB) on the client mount of the originating server on the re-export server, and now the expected 1MB (rsize) read requests go back to the on-premises server from the clients all the way through. i.e.

echo 1024 > /sys/class/bdi/0:52/read_ahead_kb

So, there is a difference in readahead behaviour on the re-export server depending on whether the client mount is read by userspace processes or by the knfsd threads.
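
For anyone wanting to reproduce this, the bdi id ("0:52" above) can be derived from the mount table rather than guessed. A rough sketch (the mount point below is hypothetical):

```shell
# Hypothetical client mount of the originating server on the re-export server.
MNT=/srv/nfs-origin

# Field 3 of /proc/self/mountinfo is the mount's major:minor device id,
# which names the backing device under /sys/class/bdi/; field 5 is the
# mount point.
BDI=$(awk -v m="$MNT" '$5 == m {print $3}' /proc/self/mountinfo)

# Default NFS client readahead is 128k; raise it so knfsd's reads get
# issued at the full rsize (1024 kB here).
echo 1024 > /sys/class/bdi/$BDI/read_ahead_kb
cat /sys/class/bdi/$BDI/read_ahead_kb
```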

Daire



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 19:03                                     ` bfields
@ 2020-11-16 20:03                                       ` Jeff Layton
  2020-11-17  3:16                                         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 20:03 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 14:03 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 11:38:44AM -0500, Jeff Layton wrote:
> > Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
> > nfs3 code as well. The v4 caller (encode_change) only calls it when
> > IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
> > that.
> 
> Weird.  Looking back....  That goes back to the original patch adding
> support for ext4's i_version, c654b8a9cba6 "nfsd: support ext4
> i_version".
> 
> It's in nfs3xdr.c, but the fields it's filling in, fh_pre_change and
> fh_post_change, are only used in nfs4xdr.c.  Maybe moving it someplace
> else (vfs.c?) would save some confusion.
> 
> Anyway, yes, that should be checking SB_I_VERSION too.
> 
> > I think the basic issue here is that we're trying to use SB_I_VERSION
> > for two different things. Its main purpose is to tell the kernel that
> > when it's updating the file times that it should also (possibly)
> > increment the i_version counter too. (Some of this is documented in
> > include/linux/iversion.h too, fwiw)
> > 
> > nfsd needs a way to tell whether the field should be consulted at all.
> > For that we probably do need a different flag of some sort. Doing it at
> > the fstype level seems a bit wrong though -- v2/3 don't have a real
> > change attribute and it probably shouldn't be trusted when exporting
> > them.
> 
> Oops, good point.
> 
> I suppose simplest is just another SB_ flag.
> 

Another idea might be to add a new fetch_iversion export operation that
returns a u64. Roll two generic functions -- one to handle the
xfs/ext4/btrfs case and another for the NFS/AFS/Ceph case (where we just
fetch it raw). When the op is a NULL pointer, treat it like the
!IS_I_VERSION case today.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 20:03                                       ` Jeff Layton
@ 2020-11-17  3:16                                         ` bfields
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-17  3:16 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 03:03:15PM -0500, Jeff Layton wrote:
> Another idea might be to add a new fetch_iversion export operation that
> returns a u64. Roll two generic functions -- one to handle the
> xfs/ext4/btrfs case and another for the NFS/AFS/Ceph case (where we just
> fetch it raw). When the op is a NULL pointer, treat it like the
> !IS_I_VERSION case today.

OK, a rough attempt follows, mostly untested. --b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c
  2020-11-17  3:16                                         ` bfields
@ 2020-11-17  3:18                                           ` J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
                                                               ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

These functions are actually used by NFSv4 code as well, and having them
in nfs3xdr.c has caused some confusion.

This is just cleanup, no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs3xdr.c | 49 -----------------------------------------------
 fs/nfsd/nfsfh.c   | 49 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2277f83da250..14efb3aba6b2 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -252,55 +252,6 @@ encode_wcc_data(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
 	return encode_post_op_attr(rqstp, p, fhp);
 }
 
-/*
- * Fill in the pre_op attr for the wcc data
- */
-void fill_pre_wcc(struct svc_fh *fhp)
-{
-	struct inode    *inode;
-	struct kstat	stat;
-	__be32 err;
-
-	if (fhp->fh_pre_saved)
-		return;
-
-	inode = d_inode(fhp->fh_dentry);
-	err = fh_getattr(fhp, &stat);
-	if (err) {
-		/* Grab the times from inode anyway */
-		stat.mtime = inode->i_mtime;
-		stat.ctime = inode->i_ctime;
-		stat.size  = inode->i_size;
-	}
-
-	fhp->fh_pre_mtime = stat.mtime;
-	fhp->fh_pre_ctime = stat.ctime;
-	fhp->fh_pre_size  = stat.size;
-	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
-	fhp->fh_pre_saved = true;
-}
-
-/*
- * Fill in the post_op attr for the wcc data
- */
-void fill_post_wcc(struct svc_fh *fhp)
-{
-	__be32 err;
-
-	if (fhp->fh_post_saved)
-		printk("nfsd: inode locked twice during operation.\n");
-
-	err = fh_getattr(fhp, &fhp->fh_post_attr);
-	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
-						     d_inode(fhp->fh_dentry));
-	if (err) {
-		fhp->fh_post_saved = false;
-		/* Grab the ctime anyway - set_change_info might use it */
-		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
-	} else
-		fhp->fh_post_saved = true;
-}
-
 /*
  * XDR decode functions
  */
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c81dbbad8792..b3b4e8809aa9 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -711,3 +711,52 @@ enum fsid_source fsid_source(struct svc_fh *fhp)
 		return FSIDSOURCE_UUID;
 	return FSIDSOURCE_DEV;
 }
+
+/*
+ * Fill in the pre_op attr for the wcc data
+ */
+void fill_pre_wcc(struct svc_fh *fhp)
+{
+	struct inode    *inode;
+	struct kstat	stat;
+	__be32 err;
+
+	if (fhp->fh_pre_saved)
+		return;
+
+	inode = d_inode(fhp->fh_dentry);
+	err = fh_getattr(fhp, &stat);
+	if (err) {
+		/* Grab the times from inode anyway */
+		stat.mtime = inode->i_mtime;
+		stat.ctime = inode->i_ctime;
+		stat.size  = inode->i_size;
+	}
+
+	fhp->fh_pre_mtime = stat.mtime;
+	fhp->fh_pre_ctime = stat.ctime;
+	fhp->fh_pre_size  = stat.size;
+	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
+	fhp->fh_pre_saved = true;
+}
+
+/*
+ * Fill in the post_op attr for the wcc data
+ */
+void fill_post_wcc(struct svc_fh *fhp)
+{
+	__be32 err;
+
+	if (fhp->fh_post_saved)
+		printk("nfsd: inode locked twice during operation.\n");
+
+	err = fh_getattr(fhp, &fhp->fh_post_attr);
+	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
+						     d_inode(fhp->fh_dentry));
+	if (err) {
+		fhp->fh_post_saved = false;
+		/* Grab the ctime anyway - set_change_info might use it */
+		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
+	} else
+		fhp->fh_post_saved = true;
+}
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
@ 2020-11-17  3:18                                             ` J. Bruce Fields
  2020-11-17 12:34                                               ` Jeff Layton
  2020-11-17 15:25                                               ` J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
  2 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

fill_{pre,post}_wcc are unconditionally using i_version even when the
underlying filesystem doesn't have proper support for i_version.

Move the code that chooses which i_version to use to the common
nfsd4_change_attribute().

The NFSEXP_V4ROOT case probably doesn't matter (the pseudoroot
filesystem is usually read-only and unlikely to see operations with pre
and post change attributes), but let's put it in the same place anyway
for consistency.

Fixes: c654b8a9cba6 ("nfsd: support ext4 i_version")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c | 11 +----------
 fs/nfsd/nfsfh.c   | 11 +++++++----
 fs/nfsd/nfsfh.h   | 23 -----------------------
 fs/nfsd/vfs.c     | 32 ++++++++++++++++++++++++++++++++
 fs/nfsd/vfs.h     |  3 +++
 5 files changed, 43 insertions(+), 37 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 833a2c64dfe8..6806207b6d18 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2295,16 +2295,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
 			     struct svc_export *exp)
 {
-	if (exp->ex_flags & NFSEXP_V4ROOT) {
-		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
-		*p++ = 0;
-	} else if (IS_I_VERSION(inode)) {
-		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
-	} else {
-		*p++ = cpu_to_be32(stat->ctime.tv_sec);
-		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
-	}
-	return p;
+	return xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode, exp));
 }
 
 /*
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index b3b4e8809aa9..4fbe1413e767 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -719,6 +719,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
 {
 	struct inode    *inode;
 	struct kstat	stat;
+	struct svc_export *exp = fhp->fh_export;
 	__be32 err;
 
 	if (fhp->fh_pre_saved)
@@ -736,7 +737,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_mtime = stat.mtime;
 	fhp->fh_pre_ctime = stat.ctime;
 	fhp->fh_pre_size  = stat.size;
-	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
+	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode, exp);
 	fhp->fh_pre_saved = true;
 }
 
@@ -746,17 +747,19 @@ void fill_pre_wcc(struct svc_fh *fhp)
 void fill_post_wcc(struct svc_fh *fhp)
 {
 	__be32 err;
+	struct inode *inode = d_inode(fhp->fh_dentry);
+	struct svc_export *exp = fhp->fh_export;
 
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
 	err = fh_getattr(fhp, &fhp->fh_post_attr);
-	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
-						     d_inode(fhp->fh_dentry));
+	fhp->fh_post_change =
+			nfsd4_change_attribute(&fhp->fh_post_attr, inode, exp);
 	if (err) {
 		fhp->fh_post_saved = false;
 		/* Grab the ctime anyway - set_change_info might use it */
-		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
+		fhp->fh_post_attr.ctime = inode->i_ctime;
 	} else
 		fhp->fh_post_saved = true;
 }
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 56cfbc361561..547aef9b3265 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -245,29 +245,6 @@ fh_clear_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
-static inline u64 nfsd4_change_attribute(struct kstat *stat,
-					 struct inode *inode)
-{
-	u64 chattr;
-
-	chattr =  stat->ctime.tv_sec;
-	chattr <<= 30;
-	chattr += stat->ctime.tv_nsec;
-	chattr += inode_query_iversion(inode);
-	return chattr;
-}
-
 extern void fill_pre_wcc(struct svc_fh *fhp);
 extern void fill_post_wcc(struct svc_fh *fhp);
 #else
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1ecaceebee13..2c71b02dd1fe 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -2390,3 +2390,35 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
 
 	return err? nfserrno(err) : 0;
 }
+
+/*
+ * We could use i_version alone as the change attribute.  However,
+ * i_version can go backwards after a reboot.  On its own that doesn't
+ * necessarily cause a problem, but if i_version goes backwards and then
+ * is incremented again it could reuse a value that was previously used
+ * before boot, and a client who queried the two values might
+ * incorrectly assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as
+ * long as time doesn't go backwards we never reuse an old value.
+ */
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
+					 struct svc_export *exp)
+{
+	u64 chattr;
+
+	if (exp->ex_flags & NFSEXP_V4ROOT) {
+		chattr = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
+		chattr <<= 32;
+	} else if (IS_I_VERSION(inode)) {
+		chattr = stat->ctime.tv_sec;
+		chattr <<= 30;
+		chattr += stat->ctime.tv_nsec;
+		chattr += inode_query_iversion(inode);
+	} else {
+		chattr = stat->ctime.tv_sec;
+		chattr <<= 32;
+		chattr += stat->ctime.tv_nsec;
+	}
+	return chattr;
+}
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index a2442ebe5acf..26ed15256340 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -132,6 +132,9 @@ __be32		nfsd_statfs(struct svc_rqst *, struct svc_fh *,
 __be32		nfsd_permission(struct svc_rqst *, struct svc_export *,
 				struct dentry *, int);
 
+u64		nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
+				struct svc_export *exp);
+
 static inline int fh_want_write(struct svc_fh *fh)
 {
 	int ret;
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 3/4] nfs: don't mangle i_version on NFS
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
@ 2020-11-17  3:18                                             ` J. Bruce Fields
  2020-11-17 12:27                                               ` Jeff Layton
  2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
  2 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

The i_version on NFS is pretty much opaque to the client, so we don't
want to give the low bit any special interpretation.

Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
i_version on their own.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfs/export.c          | 1 +
 include/linux/exportfs.h | 1 +
 include/linux/iversion.h | 4 ++++
 3 files changed, 6 insertions(+)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 3430d6891e89..c2eb915a54ca 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
+	.fetch_iversion = inode_peek_iversion_raw,
 };
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 3ceb72b67a7a..6000121a201f 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,6 +213,7 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
+	u64 (*fetch_iversion)(const struct inode *);
 };
 
 extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..481b3debf6bb 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -3,6 +3,7 @@
 #define _LINUX_IVERSION_H
 
 #include <linux/fs.h>
+#include <linux/exportfs.h>
 
 /*
  * The inode->i_version field:
@@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
 {
 	u64 cur, old, new;
 
+	if (inode->i_sb->s_export_op->fetch_iversion)
+		return inode->i_sb->s_export_op->fetch_iversion(inode);
+
 	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is already set, then no need to swap */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 4/4] nfs: support i_version in the NFSv4 case
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
@ 2020-11-17  3:18                                             ` J. Bruce Fields
  2020-11-17 12:34                                               ` Jeff Layton
  2 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

Currently when knfsd re-exports an NFS filesystem, it uses the ctime as
the change attribute.  But obviously we have a real change
attribute--the one that was returned from the original server.  We
should just use that.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 4034102010f0..ca85f81d1b9e 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1045,6 +1045,7 @@ static void nfs_fill_super(struct super_block *sb, struct nfs_fs_context *ctx)
 	} else {
 		sb->s_time_min = S64_MIN;
 		sb->s_time_max = S64_MAX;
+		sb->s_flags |= SB_I_VERSION;
 	}
 
 	sb->s_magic = NFS_SUPER_MAGIC;
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 3/4] nfs: don't mangle i_version on NFS
  2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
@ 2020-11-17 12:27                                               ` Jeff Layton
  2020-11-17 14:14                                                 ` J. Bruce Fields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 12:27 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> The i_version on NFS is pretty much opaque to the client, so we don't
> want to give the low bit any special interpretation.
> 
> Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> i_version on their own.
> 

Description here doesn't quite match the patch...

> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfs/export.c          | 1 +
>  include/linux/exportfs.h | 1 +
>  include/linux/iversion.h | 4 ++++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 3430d6891e89..c2eb915a54ca 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
>  	.encode_fh = nfs_encode_fh,
>  	.fh_to_dentry = nfs_fh_to_dentry,
>  	.get_parent = nfs_get_parent,
> +	.fetch_iversion = inode_peek_iversion_raw,
>  };
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 3ceb72b67a7a..6000121a201f 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,6 +213,7 @@ struct export_operations {
>  			  bool write, u32 *device_generation);
>  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>  			     int nr_iomaps, struct iattr *iattr);
> +	u64 (*fetch_iversion)(const struct inode *);
>  };
>  
>  extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..481b3debf6bb 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -3,6 +3,7 @@
>  #define _LINUX_IVERSION_H
>  
>  #include <linux/fs.h>
> +#include <linux/exportfs.h>
>  
>  /*
>   * The inode->i_version field:
> @@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
>  {
>  	u64 cur, old, new;
>  
> +	if (inode->i_sb->s_export_op->fetch_iversion)
> +		return inode->i_sb->s_export_op->fetch_iversion(inode);
> +

This looks dangerous -- s_export_op could be a NULL pointer.

>  	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 4/4] nfs: support i_version in the NFSv4 case
  2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
@ 2020-11-17 12:34                                               ` Jeff Layton
  0 siblings, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 12:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> Currently when knfsd re-exports an NFS filesystem, it uses the ctime as
> the change attribute.  But obviously we have a real change
> attribute--the one that was returned from the original server.  We
> should just use that.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfs/super.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 4034102010f0..ca85f81d1b9e 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -1045,6 +1045,7 @@ static void nfs_fill_super(struct super_block *sb, struct nfs_fs_context *ctx)
>  	} else {
>  		sb->s_time_min = S64_MIN;
>  		sb->s_time_max = S64_MAX;
> +		sb->s_flags |= SB_I_VERSION;
>  	}
>  
>  	sb->s_magic = NFS_SUPER_MAGIC;

I don't think we want this change. This will make file_update_time
attempt to bump the i_version field itself using the routines in
iversion.h. This will almost certainly do the wrong thing.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
@ 2020-11-17 12:34                                               ` Jeff Layton
  2020-11-17 15:26                                                 ` J. Bruce Fields
  2020-11-17 15:25                                               ` J. Bruce Fields
  1 sibling, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 12:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> fill_{pre,post}_wcc are unconditionally using i_version even when the
> underlying filesystem doesn't have proper support for i_version.
> 
> Move the code that chooses which i_version to use to the common
> nfsd4_change_attribute().
> 
> The NFSEXP_V4ROOT case probably doesn't matter (the pseudoroot
> filesystem is usually read-only and unlikely to see operations with pre
> and post change attributes), but let's put it in the same place anyway
> for consistency.
> 
> Fixes: c654b8a9cba6 ("nfsd: support ext4 i_version")
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfsd/nfs4xdr.c | 11 +----------
>  fs/nfsd/nfsfh.c   | 11 +++++++----
>  fs/nfsd/nfsfh.h   | 23 -----------------------
>  fs/nfsd/vfs.c     | 32 ++++++++++++++++++++++++++++++++
>  fs/nfsd/vfs.h     |  3 +++
>  5 files changed, 43 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 833a2c64dfe8..6806207b6d18 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2295,16 +2295,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>  static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
>  			     struct svc_export *exp)
>  {
> -	if (exp->ex_flags & NFSEXP_V4ROOT) {
> -		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> -		*p++ = 0;
> -	} else if (IS_I_VERSION(inode)) {
> -		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
> -	} else {
> -		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> -		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
> -	}
> -	return p;
> +	return xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode, exp));
>  }
>  
>  /*
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index b3b4e8809aa9..4fbe1413e767 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -719,6 +719,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  {
>  	struct inode    *inode;
>  	struct kstat	stat;
> +	struct svc_export *exp = fhp->fh_export;
>  	__be32 err;
>  
>  	if (fhp->fh_pre_saved)
> @@ -736,7 +737,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_mtime = stat.mtime;
>  	fhp->fh_pre_ctime = stat.ctime;
>  	fhp->fh_pre_size  = stat.size;
> -	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> +	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode, exp);
>  	fhp->fh_pre_saved = true;
>  }
>  
> @@ -746,17 +747,19 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  void fill_post_wcc(struct svc_fh *fhp)
>  {
>  	__be32 err;
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +	struct svc_export *exp = fhp->fh_export;
>  
>  	if (fhp->fh_post_saved)
>  		printk("nfsd: inode locked twice during operation.\n");
>  
>  	err = fh_getattr(fhp, &fhp->fh_post_attr);
> -	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
> -						     d_inode(fhp->fh_dentry));
> +	fhp->fh_post_change =
> +			nfsd4_change_attribute(&fhp->fh_post_attr, inode, exp);
>  	if (err) {
>  		fhp->fh_post_saved = false;
>  		/* Grab the ctime anyway - set_change_info might use it */
> -		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
> +		fhp->fh_post_attr.ctime = inode->i_ctime;
>  	} else
>  		fhp->fh_post_saved = true;
>  }
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 56cfbc361561..547aef9b3265 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -245,29 +245,6 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
> -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> -					 struct inode *inode)
> -{
> -	u64 chattr;
> -
> -	chattr =  stat->ctime.tv_sec;
> -	chattr <<= 30;
> -	chattr += stat->ctime.tv_nsec;
> -	chattr += inode_query_iversion(inode);
> -	return chattr;
> -}
> -
>  extern void fill_pre_wcc(struct svc_fh *fhp);
>  extern void fill_post_wcc(struct svc_fh *fhp);
>  #else
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 1ecaceebee13..2c71b02dd1fe 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -2390,3 +2390,35 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
>  
>  	return err? nfserrno(err) : 0;
>  }
> +
> +/*
> + * We could use i_version alone as the change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + */
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +					 struct svc_export *exp)
> +{
> +	u64 chattr;
> +
> +	if (exp->ex_flags & NFSEXP_V4ROOT) {
> +		chattr = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> +		chattr <<= 32;
> +	} else if (IS_I_VERSION(inode)) {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 30;
> +		chattr += stat->ctime.tv_nsec;
> +		chattr += inode_query_iversion(inode);
> +	} else {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 32;
> +		chattr += stat->ctime.tv_nsec;
> +	}
> +	return chattr;
> +}


I don't think I described what I was thinking well. Let me try again...

There should be no need to change the code in iversion.h -- I think we
can do this in a way that's confined to just nfsd/export code.

What I would suggest is to have nfsd4_change_attribute call the
fetch_iversion op if it exists, instead of checking IS_I_VERSION and
doing the stuff in that block. If fetch_iversion is NULL, then just use
the ctime.

Then, you just need to make sure that the filesystems' export_ops have
an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
The rest of the filesystems can leave fetch_iversion as NULL (since we
don't want to use it on them).
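
The dispatch being described can be sketched in simplified form. The following is a userspace mock, not kernel code; the struct layouts and names are illustrative stand-ins for the real `export_operations` and `nfsd4_change_attribute()`:

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-ins for struct kstat, struct inode, and export ops. */
struct mock_timespec { int64_t tv_sec; long tv_nsec; };
struct mock_kstat { struct mock_timespec ctime; };
struct mock_inode { uint64_t i_version; };
struct mock_export_ops {
	uint64_t (*fetch_iversion)(const struct mock_inode *);
};

static uint64_t time_to_chattr(const struct mock_timespec *t)
{
	/* ctime-only fallback: seconds in the high word, nsecs below */
	return ((uint64_t)t->tv_sec << 32) + (uint64_t)t->tv_nsec;
}

static uint64_t change_attribute(const struct mock_kstat *stat,
				 const struct mock_inode *inode,
				 const struct mock_export_ops *ops)
{
	/* If the filesystem supplies fetch_iversion, trust it entirely;
	 * otherwise fake a change attribute up from the ctime. */
	if (ops && ops->fetch_iversion)
		return ops->fetch_iversion(inode);
	return time_to_chattr(&stat->ctime);
}

/* An NFS-like op: hand back the stored value without touching it,
 * analogous to inode_peek_iversion_raw(). */
static uint64_t peek_raw(const struct mock_inode *inode)
{
	return inode->i_version;
}
```

With a `peek_raw`-style op installed, re-exported NFS inodes return the server's change attribute untouched, and filesystems that leave the op NULL get the ctime-based value.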

> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index a2442ebe5acf..26ed15256340 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -132,6 +132,9 @@ __be32		nfsd_statfs(struct svc_rqst *, struct svc_fh *,
>  __be32		nfsd_permission(struct svc_rqst *, struct svc_export *,
>  				struct dentry *, int);
>  
> +u64		nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +				struct svc_export *exp);
> +
>  static inline int fh_want_write(struct svc_fh *fh)
>  {
>  	int ret;

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 3/4] nfs: don't mangle i_version on NFS
  2020-11-17 12:27                                               ` Jeff Layton
@ 2020-11-17 14:14                                                 ` J. Bruce Fields
  0 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17 14:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 17, 2020 at 07:27:03AM -0500, Jeff Layton wrote:
> On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> > From: "J. Bruce Fields" <bfields@redhat.com>
> > 
> > The i_version on NFS is pretty much opaque to the client, so we don't
> > want to give the low bit any special interpretation.
> > 
> > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > i_version on their own.
> > 
> 
> Description here doesn't quite match the patch...

Oops, thanks.--b.

> 
> > Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > ---
> >  fs/nfs/export.c          | 1 +
> >  include/linux/exportfs.h | 1 +
> >  include/linux/iversion.h | 4 ++++
> >  3 files changed, 6 insertions(+)
> > 
> > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > index 3430d6891e89..c2eb915a54ca 100644
> > --- a/fs/nfs/export.c
> > +++ b/fs/nfs/export.c
> > @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
> >  	.encode_fh = nfs_encode_fh,
> >  	.fh_to_dentry = nfs_fh_to_dentry,
> >  	.get_parent = nfs_get_parent,
> > +	.fetch_iversion = inode_peek_iversion_raw,
> >  };
> > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> > index 3ceb72b67a7a..6000121a201f 100644
> > --- a/include/linux/exportfs.h
> > +++ b/include/linux/exportfs.h
> > @@ -213,6 +213,7 @@ struct export_operations {
> >  			  bool write, u32 *device_generation);
> >  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> >  			     int nr_iomaps, struct iattr *iattr);
> > +	u64 (*fetch_iversion)(const struct inode *);
> >  };
> >  
> >  extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..481b3debf6bb 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -3,6 +3,7 @@
> >  #define _LINUX_IVERSION_H
> >  
> >  #include <linux/fs.h>
> > +#include <linux/exportfs.h>
> >  
> >  /*
> >   * The inode->i_version field:
> > @@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
> >  {
> >  	u64 cur, old, new;
> >  
> > +	if (inode->i_sb->s_export_op->fetch_iversion)
> > +		return inode->i_sb->s_export_op->fetch_iversion(inode);
> > +
> 
> This looks dangerous -- s_export_op could be a NULL pointer.
> 
> >  	cur = inode_peek_iversion_raw(inode);
> >  	for (;;) {
> >  		/* If flag is already set, then no need to swap */
> 
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  2020-11-17 12:34                                               ` Jeff Layton
@ 2020-11-17 15:25                                               ` J. Bruce Fields
  1 sibling, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17 15:25 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 10:18:04PM -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> fill_{pre/post}_attr are unconditionally using i_version even when the
> underlying filesystem doesn't have proper support for i_version.

Actually, I didn't have this quite right....

These values are queried, but they aren't used, thanks to the
"change_supported" field of nfsd4_change_info; in set_change_info():

	cinfo->change_supported = IS_I_VERSION(d_inode(fhp->fh_dentry));

and then later on encode_cinfo() chooses to use the stored change attribute
or ctime values depending on whether change_supported is set.

But as of the ctime changes, just querying the change attribute here has
side effects.
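
The side effect in question is the "queried" flag in the i_version scheme: the counter is stored shifted left by one, the low bit records that someone has observed the value, and a later write only bumps the counter if that bit is set. A minimal userspace mock of that behavior (simplified, no atomics; names are stand-ins for the helpers in include/linux/iversion.h):

```c
#include <stdint.h>
#include <stdbool.h>

#define QUERIED 1ULL

struct mock_inode { uint64_t i_version; };

static uint64_t query_iversion(struct mock_inode *inode)
{
	/* Side effect: mark the current value as observed. */
	inode->i_version |= QUERIED;
	return inode->i_version >> 1;
}

static bool maybe_inc_iversion(struct mock_inode *inode)
{
	/* A write bumps the counter only if the old value was queried. */
	if (!(inode->i_version & QUERIED))
		return false;
	inode->i_version = ((inode->i_version >> 1) + 1) << 1;
	return true;
}
```

So a query from fill_pre_wcc()/fill_post_wcc() forces the next write to increment i_version. That is harmless on a local filesystem, but on a re-exported NFS inode it perturbs a value the client is supposed to mirror verbatim from the origin server.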

So, that explains why Daire's team was seeing a performance regression,
while no one was complaining about our returned change info being
garbage.

Anyway.

--b.

> 
> Move the code that chooses which i_version to use to the common
> nfsd4_change_attribute().
> 
> The NFSEXP_V4ROOT case probably doesn't matter (the pseudoroot
> filesystem is usually read-only and unlikely to see operations with pre
> and post change attributes), but let's put it in the same place anyway
> for consistency.
> 
> Fixes: c654b8a9cba6 ("nfsd: support ext4 i_version")
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfsd/nfs4xdr.c | 11 +----------
>  fs/nfsd/nfsfh.c   | 11 +++++++----
>  fs/nfsd/nfsfh.h   | 23 -----------------------
>  fs/nfsd/vfs.c     | 32 ++++++++++++++++++++++++++++++++
>  fs/nfsd/vfs.h     |  3 +++
>  5 files changed, 43 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 833a2c64dfe8..6806207b6d18 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2295,16 +2295,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>  static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
>  			     struct svc_export *exp)
>  {
> -	if (exp->ex_flags & NFSEXP_V4ROOT) {
> -		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> -		*p++ = 0;
> -	} else if (IS_I_VERSION(inode)) {
> -		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
> -	} else {
> -		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> -		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
> -	}
> -	return p;
> +	return xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode, exp));
>  }
>  
>  /*
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index b3b4e8809aa9..4fbe1413e767 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -719,6 +719,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  {
>  	struct inode    *inode;
>  	struct kstat	stat;
> +	struct svc_export *exp = fhp->fh_export;
>  	__be32 err;
>  
>  	if (fhp->fh_pre_saved)
> @@ -736,7 +737,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_mtime = stat.mtime;
>  	fhp->fh_pre_ctime = stat.ctime;
>  	fhp->fh_pre_size  = stat.size;
> -	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> +	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode, exp);
>  	fhp->fh_pre_saved = true;
>  }
>  
> @@ -746,17 +747,19 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  void fill_post_wcc(struct svc_fh *fhp)
>  {
>  	__be32 err;
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +	struct svc_export *exp = fhp->fh_export;
>  
>  	if (fhp->fh_post_saved)
>  		printk("nfsd: inode locked twice during operation.\n");
>  
>  	err = fh_getattr(fhp, &fhp->fh_post_attr);
> -	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
> -						     d_inode(fhp->fh_dentry));
> +	fhp->fh_post_change =
> +			nfsd4_change_attribute(&fhp->fh_post_attr, inode, exp);
>  	if (err) {
>  		fhp->fh_post_saved = false;
>  		/* Grab the ctime anyway - set_change_info might use it */
> -		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
> +		fhp->fh_post_attr.ctime = inode->i_ctime;
>  	} else
>  		fhp->fh_post_saved = true;
>  }
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 56cfbc361561..547aef9b3265 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -245,29 +245,6 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
> -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> -					 struct inode *inode)
> -{
> -	u64 chattr;
> -
> -	chattr =  stat->ctime.tv_sec;
> -	chattr <<= 30;
> -	chattr += stat->ctime.tv_nsec;
> -	chattr += inode_query_iversion(inode);
> -	return chattr;
> -}
> -
>  extern void fill_pre_wcc(struct svc_fh *fhp);
>  extern void fill_post_wcc(struct svc_fh *fhp);
>  #else
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 1ecaceebee13..2c71b02dd1fe 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -2390,3 +2390,35 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
>  
>  	return err? nfserrno(err) : 0;
>  }
> +
> +/*
> + * We could use i_version alone as the change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + */
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +					 struct svc_export *exp)
> +{
> +	u64 chattr;
> +
> +	if (exp->ex_flags & NFSEXP_V4ROOT) {
> +		chattr = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> +		chattr <<= 32;
> +	} else if (IS_I_VERSION(inode)) {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 30;
> +		chattr += stat->ctime.tv_nsec;
> +		chattr += inode_query_iversion(inode);
> +	} else {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 32;
> +		chattr += stat->ctime.tv_nsec;
> +	}
> +	return chattr;
> +}
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index a2442ebe5acf..26ed15256340 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -132,6 +132,9 @@ __be32		nfsd_statfs(struct svc_rqst *, struct svc_fh *,
>  __be32		nfsd_permission(struct svc_rqst *, struct svc_export *,
>  				struct dentry *, int);
>  
> +u64		nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +				struct svc_export *exp);
> +
>  static inline int fh_want_write(struct svc_fh *fh)
>  {
>  	int ret;
> -- 
> 2.28.0

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17 12:34                                               ` Jeff Layton
@ 2020-11-17 15:26                                                 ` J. Bruce Fields
  2020-11-17 15:34                                                   ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17 15:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> I don't think I described what I was thinking well. Let me try again...
> 
> There should be no need to change the code in iversion.h -- I think we
> can do this in a way that's confined to just nfsd/export code.
> 
> What I would suggest is to have nfsd4_change_attribute call the
> fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> doing the stuff in that block. If fetch_iversion is NULL, then just use
> the ctime.
> 
> Then, you just need to make sure that the filesystems' export_ops have
> an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> The rest of the filesystems can leave fetch_iversion as NULL (since we
> don't want to use it on them).

Thanks for your patience, that makes sense, I'll try it.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17 15:26                                                 ` J. Bruce Fields
@ 2020-11-17 15:34                                                   ` Jeff Layton
  2020-11-20 22:38                                                     ` J. Bruce Fields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 15:34 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > I don't think I described what I was thinking well. Let me try again...
> > 
> > There should be no need to change the code in iversion.h -- I think we
> > can do this in a way that's confined to just nfsd/export code.
> > 
> > What I would suggest is to have nfsd4_change_attribute call the
> > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > the ctime.
> > 
> > Then, you just need to make sure that the filesystems' export_ops have
> > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > don't want to use it on them).
> 
> Thanks for your patience, that makes sense, I'll try it.
> 

There is one gotcha in here though... ext4 needs to also handle the case
where SB_I_VERSION is not set. The simple fix might be to just have
different export ops for ext4 based on whether it was mounted with -o
iversion or not, but maybe there is some better way to do it?

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17 15:34                                                   ` Jeff Layton
@ 2020-11-20 22:38                                                     ` J. Bruce Fields
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
  2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  0 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:38 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
> On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > > I don't think I described what I was thinking well. Let me try again...
> > > 
> > > There should be no need to change the code in iversion.h -- I think we
> > > can do this in a way that's confined to just nfsd/export code.
> > > 
> > > What I would suggest is to have nfsd4_change_attribute call the
> > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > > the ctime.
> > > 
> > > Then, you just need to make sure that the filesystems' export_ops have
> > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > > don't want to use it on them).
> > 
> > Thanks for your patience, that makes sense, I'll try it.
> > 
> 
> There is one gotcha in here though... ext4 needs to also handle the case
> where SB_I_VERSION is not set. The simple fix might be to just have
> different export ops for ext4 based on whether it was mounted with -o
> iversion or not, but maybe there is some better way to do it?

I was thinking ext4's export op could check for I_VERSION on its own and
vary behavior based on that.
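
Roughly, that single-op approach might look like the following. This is a userspace sketch with stand-in types, and the fallback semantics (returning a ctime-derived value when SB_I_VERSION is off) are one possible choice, not the actual ext4 code:

```c
#include <stdint.h>

#define SB_I_VERSION (1u << 0)	/* stand-in for the real mount flag */

struct mock_sb { unsigned int s_flags; };
struct mock_inode {
	struct mock_sb *i_sb;
	uint64_t i_version;	/* stands in for the queried counter */
	uint64_t ctime_chattr;	/* stands in for a ctime-derived value */
};

static uint64_t ext4_fetch_iversion(struct mock_inode *inode)
{
	/* One op for both mount modes: check the flag here instead of
	 * installing different export_operations tables at mount time. */
	if (inode->i_sb->s_flags & SB_I_VERSION)
		return inode->i_version;
	return inode->ctime_chattr;
}
```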

I'll follow up with new patches in a moment.

I think the first one's all that's needed to fix the problem Daire
identified.  I'm a little less sure of the rest.

Lightly tested, just by running them through my usual regression tests
(which don't re-export) and then running connectathon on a 4.2 re-export
of a 4.2 mount.

The latter triggered a crash preceded by a KASAN use-after free warning.
Looks like it might be a problem with blocking lock notifications,
probably not related to these patches.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case
  2020-11-20 22:38                                                     ` J. Bruce Fields
@ 2020-11-20 22:39                                                       ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
                                                                           ` (6 more replies)
  2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  1 sibling, 7 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

inode_query_iversion() can modify i_version.  Depending on the exported
filesystem, that may not be safe.  For example, if you're re-exporting
NFS, NFS stores the server's change attribute in i_version and does not
expect it to be modified locally.  This has been observed causing
unnecessary cache invalidations.

The way a filesystem indicates that it's OK to call
inode_query_iversion() is by setting SB_I_VERSION.

(This may look like a no-op--in the encode_change() case it's just
rearranging some code--but note nfsd4_change_attribute() is also called
from fill_pre_wcc() and fill_post_wcc().)

(Note we could also pull the NFSEXP_V4ROOT case into
nfsd4_change_attribute as well.  That would actually be a no-op, since
pre/post attrs are only used for metadata-modifying operations, and
V4ROOT exports are read-only.  But we might make the change in the
future just for simplicity.)

Reported-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c |  6 +-----
 fs/nfsd/nfsfh.h   | 14 ++++++++++----
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 833a2c64dfe8..56fd5f6d5c44 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2298,12 +2298,8 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
 	if (exp->ex_flags & NFSEXP_V4ROOT) {
 		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
 		*p++ = 0;
-	} else if (IS_I_VERSION(inode)) {
+	} else
 		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
-	} else {
-		*p++ = cpu_to_be32(stat->ctime.tv_sec);
-		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
-	}
 	return p;
 }
 
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 56cfbc361561..3faf5974fa4e 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -261,10 +261,16 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
 {
 	u64 chattr;
 
-	chattr =  stat->ctime.tv_sec;
-	chattr <<= 30;
-	chattr += stat->ctime.tv_nsec;
-	chattr += inode_query_iversion(inode);
+	if (IS_I_VERSION(inode)) {
+		chattr =  stat->ctime.tv_sec;
+		chattr <<= 30;
+		chattr += stat->ctime.tv_nsec;
+		chattr += inode_query_iversion(inode);
+	} else {
+		chattr = cpu_to_be32(stat->ctime.tv_sec);
+		chattr <<= 32;
+		chattr += cpu_to_be32(stat->ctime.tv_nsec);
+	}
 	return chattr;
 }
 
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread
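
The two encodings in the patch above can be illustrated in userspace (the cpu_to_be32 calls from the original hunk are dropped here for simplicity; nothing below is kernel code):

```c
#include <stdint.h>

struct ts { int64_t tv_sec; long tv_nsec; };

/* I_VERSION case: fold the counter into a ctime-based value.
 * tv_nsec < 10^9 < 2^30, so it fits below the seconds field. */
static uint64_t chattr_iversion(const struct ts *ctime, uint64_t iversion)
{
	uint64_t c = (uint64_t)ctime->tv_sec;

	c <<= 30;
	c += (uint64_t)ctime->tv_nsec;
	c += iversion;	/* the counter may spill upward; that's fine */
	return c;
}

/* Fallback: a pure ctime-derived attribute. */
static uint64_t chattr_ctime_only(const struct ts *ctime)
{
	return ((uint64_t)ctime->tv_sec << 32) + (uint64_t)ctime->tv_nsec;
}
```

Even if i_version restarts at a smaller value after a reboot, the ctime component keeps the combined value from repeating as long as the clock doesn't go backwards, which is the rationale in the comment this patch preserves.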

* [PATCH 2/8] nfsd: simplify nfsd4_change_info
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
                                                                           ` (5 subsequent siblings)
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

It doesn't make sense to carry all these extra fields around.  Just
make everything into change attribute from the start.

This is just cleanup, there should be no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c        | 11 ++---------
 fs/nfsd/xdr4.h           | 22 +++++++++-------------
 include/linux/iversion.h | 13 +++++++++++++
 3 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 56fd5f6d5c44..18c912930947 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2331,15 +2331,8 @@ static __be32 *encode_time_delta(__be32 *p, struct inode *inode)
 static __be32 *encode_cinfo(__be32 *p, struct nfsd4_change_info *c)
 {
 	*p++ = cpu_to_be32(c->atomic);
-	if (c->change_supported) {
-		p = xdr_encode_hyper(p, c->before_change);
-		p = xdr_encode_hyper(p, c->after_change);
-	} else {
-		*p++ = cpu_to_be32(c->before_ctime_sec);
-		*p++ = cpu_to_be32(c->before_ctime_nsec);
-		*p++ = cpu_to_be32(c->after_ctime_sec);
-		*p++ = cpu_to_be32(c->after_ctime_nsec);
-	}
+	p = xdr_encode_hyper(p, c->before_change);
+	p = xdr_encode_hyper(p, c->after_change);
 	return p;
 }
 
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 679d40af1bbb..9c2d942d055d 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -76,12 +76,7 @@ static inline bool nfsd4_has_session(struct nfsd4_compound_state *cs)
 
 struct nfsd4_change_info {
 	u32		atomic;
-	bool		change_supported;
-	u32		before_ctime_sec;
-	u32		before_ctime_nsec;
 	u64		before_change;
-	u32		after_ctime_sec;
-	u32		after_ctime_nsec;
 	u64		after_change;
 };
 
@@ -768,15 +763,16 @@ set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
 {
 	BUG_ON(!fhp->fh_pre_saved);
 	cinfo->atomic = (u32)fhp->fh_post_saved;
-	cinfo->change_supported = IS_I_VERSION(d_inode(fhp->fh_dentry));
-
-	cinfo->before_change = fhp->fh_pre_change;
-	cinfo->after_change = fhp->fh_post_change;
-	cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
-	cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
-	cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;
-	cinfo->after_ctime_nsec = fhp->fh_post_attr.ctime.tv_nsec;
 
+	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
+		cinfo->before_change = fhp->fh_pre_change;
+		cinfo->after_change = fhp->fh_post_change;
+	} else {
+		cinfo->before_change =
+			time_to_chattr(&fhp->fh_pre_ctime);
+		cinfo->after_change =
+			time_to_chattr(&fhp->fh_post_attr.ctime);
+	}
 }
 
 
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..3bfebde5a1a6 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -328,6 +328,19 @@ inode_query_iversion(struct inode *inode)
 	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
+/*
+ * For filesystems without any sort of change attribute, the best we can
+ * do is fake one up from the ctime:
+ */
+static inline u64 time_to_chattr(struct timespec64 *t)
+{
+	u64 chattr = t->tv_sec;
+
+	chattr <<= 32;
+	chattr += t->tv_nsec;
+	return chattr;
+}
+
 /**
  * inode_eq_iversion_raw - check whether the raw i_version counter has changed
  * @inode: inode to check
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-21  0:34                                                           ` Jeff Layton
  2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
                                                                           ` (4 subsequent siblings)
  6 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

Minor cleanup, no change in behavior

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfsfh.h | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 3faf5974fa4e..45bd776290d5 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -259,19 +259,16 @@ fh_clear_wcc(struct svc_fh *fhp)
 static inline u64 nfsd4_change_attribute(struct kstat *stat,
 					 struct inode *inode)
 {
-	u64 chattr;
-
 	if (IS_I_VERSION(inode)) {
+		u64 chattr;
+
 		chattr =  stat->ctime.tv_sec;
 		chattr <<= 30;
 		chattr += stat->ctime.tv_nsec;
 		chattr += inode_query_iversion(inode);
-	} else {
-		chattr = cpu_to_be32(stat->ctime.tv_sec);
-		chattr <<= 32;
-		chattr += cpu_to_be32(stat->ctime.tv_nsec);
-	}
-	return chattr;
+		return chattr;
+	} else
+		return time_to_chattr(&stat->ctime);
 }
 
 extern void fill_pre_wcc(struct svc_fh *fhp);
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
                                                                           ` (3 subsequent siblings)
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

inode_query_iversion() has side effects, and there's no point calling it
when we're not even going to use it.

We check whether we're currently processing a v4 request by checking
fh_maxsize, which is arguably a little hacky; we could add a flag to
svc_fh instead.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs3xdr.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2277f83da250..2732b04d3878 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -259,11 +259,11 @@ void fill_pre_wcc(struct svc_fh *fhp)
 {
 	struct inode    *inode;
 	struct kstat	stat;
+	bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
 	__be32 err;
 
 	if (fhp->fh_pre_saved)
 		return;
-
 	inode = d_inode(fhp->fh_dentry);
 	err = fh_getattr(fhp, &stat);
 	if (err) {
@@ -272,11 +272,12 @@ void fill_pre_wcc(struct svc_fh *fhp)
 		stat.ctime = inode->i_ctime;
 		stat.size  = inode->i_size;
 	}
+	if (v4)
+		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
 
 	fhp->fh_pre_mtime = stat.mtime;
 	fhp->fh_pre_ctime = stat.ctime;
 	fhp->fh_pre_size  = stat.size;
-	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
 	fhp->fh_pre_saved = true;
 }
 
@@ -285,18 +286,21 @@ void fill_pre_wcc(struct svc_fh *fhp)
  */
 void fill_post_wcc(struct svc_fh *fhp)
 {
+	bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
+	struct inode *inode = d_inode(fhp->fh_dentry);
 	__be32 err;
 
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
 	err = fh_getattr(fhp, &fhp->fh_post_attr);
-	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
-						     d_inode(fhp->fh_dentry));
+	if (v4)
+		fhp->fh_post_change =
+			nfsd4_change_attribute(&fhp->fh_post_attr, inode);
 	if (err) {
 		fhp->fh_post_saved = false;
 		/* Grab the ctime anyway - set_change_info might use it */
-		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
+		fhp->fh_post_attr.ctime = inode->i_ctime;
 	} else
 		fhp->fh_post_saved = true;
 }
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 5/8] nfs: use change attribute for NFS re-exports
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (2 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
                                                                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

When re-exporting NFS, we may as well use the real change attribute
returned by the original server instead of faking up a change attribute
from the ctime.

Note we can't do that by setting I_VERSION--that would also turn on the
logic in iversion.h which treats the lower bit specially, and that
doesn't make sense for NFS.

So instead we define a new export operation for filesystems like NFS
that want to manage the change attribute themselves.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfs/export.c          | 18 ++++++++++++++++++
 fs/nfsd/nfsfh.h          |  5 ++++-
 include/linux/exportfs.h |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 3430d6891e89..0b10c5946060 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -167,8 +167,26 @@ nfs_get_parent(struct dentry *dentry)
 	return parent;
 }
 
+static u64 nfs_fetch_iversion(struct inode *inode)
+{
+	struct nfs_server *server = NFS_SERVER(inode);
+
+	/* Is this the right call?: */
+	nfs_revalidate_inode(server, inode);
+	/*
+	 * Also, note we're ignoring any returned error.  That seems to be
+	 * the practice for cache consistency information elsewhere in
+	 * the server, but I'm not sure why.
+	 */
+	if (server->nfs_client->rpc_ops->version >= 4)
+		return inode_peek_iversion_raw(inode);
+	else
+		return time_to_chattr(&inode->i_ctime);
+}
+
 const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
+	.fetch_iversion = nfs_fetch_iversion,
 };
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 45bd776290d5..2656a3464c6c 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -12,6 +12,7 @@
 #include <linux/sunrpc/svc.h>
 #include <uapi/linux/nfsd/nfsfh.h>
 #include <linux/iversion.h>
+#include <linux/exportfs.h>
 
 static inline __u32 ino_t_to_u32(ino_t ino)
 {
@@ -259,7 +260,9 @@ fh_clear_wcc(struct svc_fh *fhp)
 static inline u64 nfsd4_change_attribute(struct kstat *stat,
 					 struct inode *inode)
 {
-	if (IS_I_VERSION(inode)) {
+	if (inode->i_sb->s_export_op->fetch_iversion)
+		return inode->i_sb->s_export_op->fetch_iversion(inode);
+	else if (IS_I_VERSION(inode)) {
 		u64 chattr;
 
 		chattr =  stat->ctime.tv_sec;
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 3ceb72b67a7a..da6f0a905b94 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,6 +213,7 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
+	u64 (*fetch_iversion)(struct inode *);
 };
 
 extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (3 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-21  0:58                                                           ` Jeff Layton
  2020-11-21 13:00                                                           ` Jeff Layton
  2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
  6 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

After this, only filesystems lacking change attribute support will leave
the fetch_iversion export op NULL.

This seems cleaner to me, and will allow some minor optimizations in the
nfsd code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/btrfs/export.c        |  2 ++
 fs/ext4/super.c          |  9 +++++++++
 fs/nfsd/nfs4xdr.c        |  2 +-
 fs/nfsd/nfsfh.h          | 25 +++----------------------
 fs/nfsd/xdr4.h           |  4 +++-
 fs/xfs/xfs_export.c      |  2 ++
 include/linux/iversion.h | 26 ++++++++++++++++++++++++++
 7 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 1a8d419d9e1f..ece32440999a 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -7,6 +7,7 @@
 #include "btrfs_inode.h"
 #include "print-tree.h"
 #include "export.h"
+#include <linux/iversion.h>
 
 #define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
 						 parent_objectid) / 4)
@@ -279,4 +280,5 @@ const struct export_operations btrfs_export_ops = {
 	.fh_to_parent	= btrfs_fh_to_parent,
 	.get_parent	= btrfs_get_parent,
 	.get_name	= btrfs_get_name,
+	.fetch_iversion	= generic_fetch_iversion,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ef4734b40e2a..a4f48273d435 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1685,11 +1685,20 @@ static const struct super_operations ext4_sops = {
 	.bdev_try_to_free_page = bdev_try_to_free_page,
 };
 
+static u64 ext4_fetch_iversion(struct inode *inode)
+{
+	if (IS_I_VERSION(inode))
+		return generic_fetch_iversion(inode);
+	else
+		return time_to_chattr(&inode->i_ctime);
+}
+
 static const struct export_operations ext4_export_ops = {
 	.fh_to_dentry = ext4_fh_to_dentry,
 	.fh_to_parent = ext4_fh_to_parent,
 	.get_parent = ext4_get_parent,
 	.commit_metadata = ext4_nfs_commit_metadata,
+	.fetch_iversion = ext4_fetch_iversion,
 };
 
 enum {
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 18c912930947..182190684792 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3187,7 +3187,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		p = xdr_reserve_space(xdr, 4);
 		if (!p)
 			goto out_resource;
-		if (IS_I_VERSION(d_inode(dentry)))
+		if (IS_I_VERSION(d_inode(dentry)))
 			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
 		else
 			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 2656a3464c6c..ac3e309d7339 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -46,8 +46,8 @@ typedef struct svc_fh {
 	struct timespec64	fh_pre_mtime;	/* mtime before oper */
 	struct timespec64	fh_pre_ctime;	/* ctime before oper */
 	/*
-	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
-	 *  to find out if it is valid.
+	 * pre-op nfsv4 change attr: note must check for fetch_iversion
+	 * op to find out if it is valid.
 	 */
 	u64			fh_pre_change;
 
@@ -246,31 +246,12 @@ fh_clear_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
 static inline u64 nfsd4_change_attribute(struct kstat *stat,
 					 struct inode *inode)
 {
 	if (inode->i_sb->s_export_op->fetch_iversion)
 		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
+	else
 		return time_to_chattr(&stat->ctime);
 }
 
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 9c2d942d055d..f0c8fbe704a2 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -761,10 +761,12 @@ void warn_on_nonidempotent_op(struct nfsd4_op *op);
 static inline void
 set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
 {
+	struct inode *inode = d_inode(fhp->fh_dentry);
+
 	BUG_ON(!fhp->fh_pre_saved);
 	cinfo->atomic = (u32)fhp->fh_post_saved;
 
-	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
+	if (inode->i_sb->s_export_op->fetch_iversion) {
 		cinfo->before_change = fhp->fh_pre_change;
 		cinfo->after_change = fhp->fh_post_change;
 	} else {
diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
index 465fd9e048d4..b950fac3d7df 100644
--- a/fs/xfs/xfs_export.c
+++ b/fs/xfs/xfs_export.c
@@ -16,6 +16,7 @@
 #include "xfs_inode_item.h"
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
+#include <linux/iversion.h>
 
 /*
  * Note that we only accept fileids which are long enough rather than allow
@@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
 	.map_blocks		= xfs_fs_map_blocks,
 	.commit_blocks		= xfs_fs_commit_blocks,
 #endif
+	.fetch_iversion		= generic_fetch_iversion,
 };
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 3bfebde5a1a6..ded74523c8a6 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
 	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
+/*
+ * We could use i_version alone as the NFSv4 change attribute.  However,
+ * i_version can go backwards after a reboot.  On its own that doesn't
+ * necessarily cause a problem, but if i_version goes backwards and then
+ * is incremented again it could reuse a value that was previously used
+ * before boot, and a client who queried the two values might
+ * incorrectly assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as
+ * long as time doesn't go backwards we never reuse an old value.
+ *
+ * A filesystem that has an on-disk boot counter or similar might prefer
+ * to use that to avoid the risk of the change attribute going backwards
+ * if system time is set backwards.
+ */
+static inline u64 generic_fetch_iversion(struct inode *inode)
+{
+	u64 chattr;
+
+	chattr =  inode->i_ctime.tv_sec;
+	chattr <<= 30;
+	chattr += inode->i_ctime.tv_nsec;
+	chattr += inode_query_iversion(inode);
+	return chattr;
+}
+
 /*
  * For filesystems without any sort of change attribute, the best we can
  * do is fake one up from the ctime:
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (4 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

In the typical case of v4 and an i_version-supporting filesystem, we can
skip a stat that is only required to fake up a change attribute from the
ctime.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs3xdr.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2732b04d3878..8502a493be6d 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -265,19 +265,21 @@ void fill_pre_wcc(struct svc_fh *fhp)
 	if (fhp->fh_pre_saved)
 		return;
 	inode = d_inode(fhp->fh_dentry);
-	err = fh_getattr(fhp, &stat);
-	if (err) {
-		/* Grab the times from inode anyway */
-		stat.mtime = inode->i_mtime;
-		stat.ctime = inode->i_ctime;
-		stat.size  = inode->i_size;
+	if (!v4 || !inode->i_sb->s_export_op->fetch_iversion) {
+		err = fh_getattr(fhp, &stat);
+		if (err) {
+			/* Grab the times from inode anyway */
+			stat.mtime = inode->i_mtime;
+			stat.ctime = inode->i_ctime;
+			stat.size  = inode->i_size;
+		}
+		fhp->fh_pre_mtime = stat.mtime;
+		fhp->fh_pre_ctime = stat.ctime;
+		fhp->fh_pre_size  = stat.size;
 	}
 	if (v4)
 		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
 
-	fhp->fh_pre_mtime = stat.mtime;
-	fhp->fh_pre_ctime = stat.ctime;
-	fhp->fh_pre_size  = stat.size;
 	fhp->fh_pre_saved = true;
 }
 
@@ -293,7 +295,9 @@ void fill_post_wcc(struct svc_fh *fhp)
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
-	err = fh_getattr(fhp, &fhp->fh_post_attr);
+
+	if (!v4 || !inode->i_sb->s_export_op->fetch_iversion)
+		err = fh_getattr(fhp, &fhp->fh_post_attr);
 	if (v4)
 		fhp->fh_post_change =
 			nfsd4_change_attribute(&fhp->fh_post_attr, inode);
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute"
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (5 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

This reverts commit a85857633b04d57f4524cca0a2bfaf87b2543f9f.

We're still factoring ctime into our change attribute even in the
IS_I_VERSION case.  If someone sets the system time backwards, a client
could see the change attribute go backwards.  Maybe we can just say
"well, don't do that", but there's some question whether that's good
enough, or whether we need a better guarantee.

Also, the client still isn't actually using the attribute.

While we're still figuring this out, let's just stop returning this
attribute.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c    | 10 ----------
 fs/nfsd/nfsd.h       |  1 -
 include/linux/nfs4.h |  8 --------
 3 files changed, 19 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 182190684792..c33838caf8c6 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3183,16 +3183,6 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			goto out;
 	}
 
-	if (bmval2 & FATTR4_WORD2_CHANGE_ATTR_TYPE) {
-		p = xdr_reserve_space(xdr, 4);
-		if (!p)
-			goto out_resource;
-		if (IS_I_VERSION(d_inode(dentry)))
-			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
-		else
-			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
-	}
-
 #ifdef CONFIG_NFSD_V4_SECURITY_LABEL
 	if (bmval2 & FATTR4_WORD2_SECURITY_LABEL) {
 		status = nfsd4_encode_security_label(xdr, rqstp, context,
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index cb742e17e04a..40cb40ac0a65 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -387,7 +387,6 @@ void		nfsd_lockd_shutdown(void);
 
 #define NFSD4_2_SUPPORTED_ATTRS_WORD2 \
 	(NFSD4_1_SUPPORTED_ATTRS_WORD2 | \
-	FATTR4_WORD2_CHANGE_ATTR_TYPE | \
 	FATTR4_WORD2_MODE_UMASK | \
 	NFSD4_2_SECURITY_ATTRS | \
 	FATTR4_WORD2_XATTR_SUPPORT)
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 9dc7eeac924f..5b4c67c91f56 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -385,13 +385,6 @@ enum lock_type4 {
 	NFS4_WRITEW_LT = 4
 };
 
-enum change_attr_type4 {
-	NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0,
-	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1,
-	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
-	NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3,
-	NFS4_CHANGE_TYPE_IS_UNDEFINED = 4
-};
 
 /* Mandatory Attributes */
 #define FATTR4_WORD0_SUPPORTED_ATTRS    (1UL << 0)
@@ -459,7 +452,6 @@ enum change_attr_type4 {
 #define FATTR4_WORD2_LAYOUT_BLKSIZE     (1UL << 1)
 #define FATTR4_WORD2_MDSTHRESHOLD       (1UL << 4)
 #define FATTR4_WORD2_CLONE_BLKSIZE	(1UL << 13)
-#define FATTR4_WORD2_CHANGE_ATTR_TYPE	(1UL << 15)
 #define FATTR4_WORD2_SECURITY_LABEL     (1UL << 16)
 #define FATTR4_WORD2_MODE_UMASK		(1UL << 17)
 #define FATTR4_WORD2_XATTR_SUPPORT	(1UL << 18)
-- 
2.28.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-20 22:38                                                     ` J. Bruce Fields
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
@ 2020-11-20 22:44                                                       ` J. Bruce Fields
  2020-11-21  1:03                                                         ` Jeff Layton
  1 sibling, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
> On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
> > On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> > > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > > > I don't think I described what I was thinking well. Let me try again...
> > > > 
> > > > There should be no need to change the code in iversion.h -- I think we
> > > > can do this in a way that's confined to just nfsd/export code.
> > > > 
> > > > What I would suggest is to have nfsd4_change_attribute call the
> > > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > > > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > > > the ctime.
> > > > 
> > > > Then, you just need to make sure that the filesystems' export_ops have
> > > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > > > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > > > don't want to use it on them).
> > > 
> > > Thanks for your patience, that makes sense, I'll try it.
> > > 
> > 
> > There is one gotcha in here though... ext4 needs to also handle the case
> > where SB_I_VERSION is not set. The simple fix might be to just have
> > different export ops for ext4 based on whether it was mounted with -o
> > iversion or not, but maybe there is some better way to do it?
> 
> I was thinking ext4's export op could check for I_VERSION on its own and
> vary behavior based on that.
> 
> I'll follow up with new patches in a moment.
> 
> I think the first one's all that's needed to fix the problem Daire
> identified.  I'm a little less sure of the rest.
> 
> Lightly tested, just by running them through my usual regression tests
> (which don't re-export) and then running connectathon on a 4.2 re-export
> of a 4.2 mount.
> 
> The latter triggered a crash preceded by a KASAN use-after-free warning.
> Looks like it might be a problem with blocking lock notifications,
> probably not related to these patches.

Another nit I ran across:

Some NFSv4 directory-modifying operations return pre- and post- change
attributes together with an "atomic" flag that's supposed to indicate
whether the change attributes were read atomically with the operation.
It looks like we're setting the atomic flag under the assumptions that
local vfs locks are sufficient to guarantee atomicity, which isn't right
when we're exporting a distributed filesystem.

In the case where we're re-exporting NFS, I guess the ideal would be to
use the pre- and post- attributes that the original server returned, and
also save having to do extra getattr calls.  Not sure how we'd do that,
though--more export operations?  Maybe for now we could just figure out
when to turn off the atomic bit.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup
  2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
@ 2020-11-21  0:34                                                           ` Jeff Layton
  0 siblings, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-11-21  0:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> Minor cleanup, no change in behavior
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfsd/nfsfh.h | 13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 3faf5974fa4e..45bd776290d5 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -259,19 +259,16 @@ fh_clear_wcc(struct svc_fh *fhp)
>  static inline u64 nfsd4_change_attribute(struct kstat *stat,
>  					 struct inode *inode)
>  {
> -	u64 chattr;
> -
>  	if (IS_I_VERSION(inode)) {
> +		u64 chattr;
> +
>  		chattr =  stat->ctime.tv_sec;
>  		chattr <<= 30;
>  		chattr += stat->ctime.tv_nsec;
>  		chattr += inode_query_iversion(inode);
> -	} else {
> -		chattr = cpu_to_be32(stat->ctime.tv_sec);
> -		chattr <<= 32;
> -		chattr += cpu_to_be32(stat->ctime.tv_nsec);
> -	}
> -	return chattr;
> +		return chattr;
> +	} else
> +		return time_to_chattr(&stat->ctime);
>  }
>  
> 
>  extern void fill_pre_wcc(struct svc_fh *fhp);

I'd just fold this one into 2/8.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
@ 2020-11-21  0:58                                                           ` Jeff Layton
  2020-11-21  1:01                                                             ` J. Bruce Fields
  2020-11-21 13:00                                                           ` Jeff Layton
  1 sibling, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-21  0:58 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> After this, only filesystems lacking change attribute support will leave
> the fetch_iversion export op NULL.
> 
> This seems cleaner to me, and will allow some minor optimizations in the
> nfsd code.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/btrfs/export.c        |  2 ++
>  fs/ext4/super.c          |  9 +++++++++
>  fs/nfsd/nfs4xdr.c        |  2 +-
>  fs/nfsd/nfsfh.h          | 25 +++----------------------
>  fs/nfsd/xdr4.h           |  4 +++-
>  fs/xfs/xfs_export.c      |  2 ++
>  include/linux/iversion.h | 26 ++++++++++++++++++++++++++
>  7 files changed, 46 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 1a8d419d9e1f..ece32440999a 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -7,6 +7,7 @@
>  #include "btrfs_inode.h"
>  #include "print-tree.h"
>  #include "export.h"
> +#include <linux/iversion.h>
>  
> 
> 
> 
>  #define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
>  						 parent_objectid) / 4)
> @@ -279,4 +280,5 @@ const struct export_operations btrfs_export_ops = {
>  	.fh_to_parent	= btrfs_fh_to_parent,
>  	.get_parent	= btrfs_get_parent,
>  	.get_name	= btrfs_get_name,
> +	.fetch_iversion	= generic_fetch_iversion,
>  };
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ef4734b40e2a..a4f48273d435 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1685,11 +1685,20 @@ static const struct super_operations ext4_sops = {
>  	.bdev_try_to_free_page = bdev_try_to_free_page,
>  };
>  
> 
> 
> 
> +static u64 ext4_fetch_iversion(struct inode *inode)
> +{
> +	if (IS_I_VERSION(inode))
> +		return generic_fetch_iversion(inode);
> +	else
> +		return time_to_chattr(&inode->i_ctime);
> +}
> +
>  static const struct export_operations ext4_export_ops = {
>  	.fh_to_dentry = ext4_fh_to_dentry,
>  	.fh_to_parent = ext4_fh_to_parent,
>  	.get_parent = ext4_get_parent,
>  	.commit_metadata = ext4_nfs_commit_metadata,
> +	.fetch_iversion = ext4_fetch_iversion,
>  };
>  
> 
> 
> 
>  enum {
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 18c912930947..182190684792 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3187,7 +3187,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		p = xdr_reserve_space(xdr, 4);
>  		if (!p)
>  			goto out_resource;
> -		if (IS_I_VERSION(d_inode(dentry)))
> +		if (IS_I_VERSION(d_inode(dentry)))
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
>  		else
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 2656a3464c6c..ac3e309d7339 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -46,8 +46,8 @@ typedef struct svc_fh {
>  	struct timespec64	fh_pre_mtime;	/* mtime before oper */
>  	struct timespec64	fh_pre_ctime;	/* ctime before oper */
>  	/*
> -	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
> -	 *  to find out if it is valid.
> +	 * pre-op nfsv4 change attr: note must check for fetch_iversion
> +	 * op to find out if it is valid.
>  	 */
>  	u64			fh_pre_change;
>  
> 
> 
> 
> @@ -246,31 +246,12 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> 
> 
> 
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
>  static inline u64 nfsd4_change_attribute(struct kstat *stat,
>  					 struct inode *inode)
>  {
>  	if (inode->i_sb->s_export_op->fetch_iversion)
>  		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> +	else
>  		return time_to_chattr(&stat->ctime);
>  }
>  
> 
> 
> 
> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
> index 9c2d942d055d..f0c8fbe704a2 100644
> --- a/fs/nfsd/xdr4.h
> +++ b/fs/nfsd/xdr4.h
> @@ -761,10 +761,12 @@ void warn_on_nonidempotent_op(struct nfsd4_op *op);
>  static inline void
>  set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
>  {
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +
>  	BUG_ON(!fhp->fh_pre_saved);
>  	cinfo->atomic = (u32)fhp->fh_post_saved;
>  
> 
> 
> 
> -	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
> +	if (inode->i_sb->s_export_op->fetch_iversion) {
>  		cinfo->before_change = fhp->fh_pre_change;
>  		cinfo->after_change = fhp->fh_post_change;
>  	} else {
> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> index 465fd9e048d4..b950fac3d7df 100644
> --- a/fs/xfs/xfs_export.c
> +++ b/fs/xfs/xfs_export.c
> @@ -16,6 +16,7 @@
>  #include "xfs_inode_item.h"
>  #include "xfs_icache.h"
>  #include "xfs_pnfs.h"
> +#include <linux/iversion.h>
>  
> 
> 
> 
>  /*
>   * Note that we only accept fileids which are long enough rather than allow
> @@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
>  	.map_blocks		= xfs_fs_map_blocks,
>  	.commit_blocks		= xfs_fs_commit_blocks,
>  #endif
> +	.fetch_iversion		= generic_fetch_iversion,
>  };

It seems a little weird to call a static inline here. I imagine that
means the compiler has to add a duplicate inline in every .o file that
does this? It may be cleaner to move generic_fetch_iversion into
fs/libfs.c so we only have one copy of it.

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..ded74523c8a6 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
>  	return cur >> I_VERSION_QUERIED_SHIFT;
>  }
>  
> 
> 
> 
> +/*
> + * We could use i_version alone as the NFSv4 change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + *
> + * A filesystem that has an on-disk boot counter or similar might prefer
> + * to use that to avoid the risk of the change attribute going backwards
> + * if system time is set backwards.
> + */
> +static inline u64 generic_fetch_iversion(struct inode *inode)
> +{
> +	u64 chattr;
> +
> +	chattr =  inode->i_ctime.tv_sec;
> +	chattr <<= 30;
> +	chattr += inode->i_ctime.tv_nsec;
> +	chattr += inode_query_iversion(inode);
> +	return chattr;
> +}
> +
>  /*
>   * For filesystems without any sort of change attribute, the best we can
>   * do is fake one up from the ctime:

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-21  0:58                                                           ` Jeff Layton
@ 2020-11-21  1:01                                                             ` J. Bruce Fields
  0 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-21  1:01 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, Nov 20, 2020 at 07:58:38PM -0500, Jeff Layton wrote:
> On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> > diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> > index 465fd9e048d4..b950fac3d7df 100644
> > --- a/fs/xfs/xfs_export.c
> > +++ b/fs/xfs/xfs_export.c
> > @@ -16,6 +16,7 @@
> >  #include "xfs_inode_item.h"
> >  #include "xfs_icache.h"
> >  #include "xfs_pnfs.h"
> > +#include <linux/iversion.h>
> >  
> >  /*
> >   * Note that we only accept fileids which are long enough rather than allow
> > @@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
> >  	.map_blocks		= xfs_fs_map_blocks,
> >  	.commit_blocks		= xfs_fs_commit_blocks,
> >  #endif
> > +	.fetch_iversion		= generic_fetch_iversion,
> >  };
> 
> It seems a little weird to call a static inline here. I imagine that
> means the compiler has to add a duplicate inline in every .o file that
> does this? It may be cleaner to move generic_fetch_iversion into
> fs/libfs.c so we only have one copy of it.

OK.

(To be honest, I was a little surprised this worked.)

--b.

> 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 3bfebde5a1a6..ded74523c8a6 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
> >  	return cur >> I_VERSION_QUERIED_SHIFT;
> >  }
> >  
> > +/*
> > + * We could use i_version alone as the NFSv4 change attribute.  However,
> > + * i_version can go backwards after a reboot.  On its own that doesn't
> > + * necessarily cause a problem, but if i_version goes backwards and then
> > + * is incremented again it could reuse a value that was previously used
> > + * before boot, and a client who queried the two values might
> > + * incorrectly assume nothing changed.
> > + *
> > + * By using both ctime and the i_version counter we guarantee that as
> > + * long as time doesn't go backwards we never reuse an old value.
> > + *
> > + * A filesystem that has an on-disk boot counter or similar might prefer
> > + * to use that to avoid the risk of the change attribute going backwards
> > + * if system time is set backwards.
> > + */
> > +static inline u64 generic_fetch_iversion(struct inode *inode)
> > +{
> > +	u64 chattr;
> > +
> > +	chattr =  inode->i_ctime.tv_sec;
> > +	chattr <<= 30;
> > +	chattr += inode->i_ctime.tv_nsec;
> > +	chattr += inode_query_iversion(inode);
> > +	return chattr;
> > +}
> > +
> >  /*
> >   * For filesystems without any sort of change attribute, the best we can
> >   * do is fake one up from the ctime:
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
@ 2020-11-21  1:03                                                         ` Jeff Layton
  2020-11-21 21:44                                                           ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-21  1:03 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:44 -0500, J. Bruce Fields wrote:
> On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
> > On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
> > > On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> > > > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > > > > I don't think I described what I was thinking well. Let me try again...
> > > > > 
> > > > > There should be no need to change the code in iversion.h -- I think we
> > > > > can do this in a way that's confined to just nfsd/export code.
> > > > > 
> > > > > What I would suggest is to have nfsd4_change_attribute call the
> > > > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > > > > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > > > > the ctime.
> > > > > 
> > > > > Then, you just need to make sure that the filesystems' export_ops have
> > > > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > > > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > > > > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > > > > don't want to use it on them).
> > > > 
> > > > Thanks for your patience, that makes sense, I'll try it.
> > > > 
> > > 
> > > There is one gotcha in here though... ext4 needs to also handle the case
> > > where SB_I_VERSION is not set. The simple fix might be to just have
> > > different export ops for ext4 based on whether it was mounted with -o
> > > iversion or not, but maybe there is some better way to do it?
> > 
> > I was thinking ext4's export op could check for I_VERSION on its own and
> > vary behavior based on that.
> > 
> > I'll follow up with new patches in a moment.
> > 
> > I think the first one's all that's needed to fix the problem Daire
> > identified.  I'm a little less sure of the rest.
> > 
> > Lightly tested, just by running them through my usual regression tests
> > (which don't re-export) and then running connectathon on a 4.2 re-export
> > of a 4.2 mount.
> > 
> > The latter triggered a crash preceded by a KASAN use-after free warning.
> > Looks like it might be a problem with blocking lock notifications,
> > probably not related to these patches.
> 

The set looks pretty reasonable at first glance. Nice work.

Once you put this in, I'll plan to add a suitable fetch_iversion op for
ceph too.

> Another nit I ran across:
> 
> Some NFSv4 directory-modifying operations return pre- and post- change
> attributes together with an "atomic" flag that's supposed to indicate
> whether the change attributes were read atomically with the operation.
> It looks like we're setting the atomic flag under the assumptions that
> local vfs locks are sufficient to guarantee atomicity, which isn't right
> when we're exporting a distributed filesystem.
> 
> In the case we're reexporting NFS I guess ideal would be to use the pre-
> and post- attributes that the original server returned and also save
> having to do extra getattr calls.  Not sure how we'd do that,
> though--more export operations?  Maybe for now we could just figure out
> when to turn off the atomic bit.

Oh yeah, good point.

I'm not even sure that local locks are really enough -- IIRC, there are
still some race windows between doing the metadata operations and the
getattrs called to fill pre/post op attrs. Still, those windows are a
lot larger on something like NFS, so setting the flag there is really
stretching things.

One hacky fix might be to add a flags field to export_operations, and
have one that indicates that the atomic flag shouldn't be set. Then we
could add that flag to all of the netfs' (nfs, ceph, cifs), and anywhere
else that we thought it appropriate?

That approach might be helpful later too since we're starting to see a
wider variety of exportable filesystems these days. We may need more
"quirk" flags like this.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
  2020-11-21  0:58                                                           ` Jeff Layton
@ 2020-11-21 13:00                                                           ` Jeff Layton
  1 sibling, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-11-21 13:00 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> After this, only filesystems lacking change attribute support will leave
> the fetch_iversion export op NULL.
> 
> This seems cleaner to me, and will allow some minor optimizations in the
> nfsd code.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/btrfs/export.c        |  2 ++
>  fs/ext4/super.c          |  9 +++++++++
>  fs/nfsd/nfs4xdr.c        |  2 +-
>  fs/nfsd/nfsfh.h          | 25 +++----------------------
>  fs/nfsd/xdr4.h           |  4 +++-
>  fs/xfs/xfs_export.c      |  2 ++
>  include/linux/iversion.h | 26 ++++++++++++++++++++++++++
>  7 files changed, 46 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 1a8d419d9e1f..ece32440999a 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -7,6 +7,7 @@
>  #include "btrfs_inode.h"
>  #include "print-tree.h"
>  #include "export.h"
> +#include <linux/iversion.h>
>  
>  #define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
>  						 parent_objectid) / 4)
> @@ -279,4 +280,5 @@ const struct export_operations btrfs_export_ops = {
>  	.fh_to_parent	= btrfs_fh_to_parent,
>  	.get_parent	= btrfs_get_parent,
>  	.get_name	= btrfs_get_name,
> +	.fetch_iversion	= generic_fetch_iversion,
>  };
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ef4734b40e2a..a4f48273d435 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1685,11 +1685,20 @@ static const struct super_operations ext4_sops = {
>  	.bdev_try_to_free_page = bdev_try_to_free_page,
>  };
>  
> +static u64 ext4_fetch_iversion(struct inode *inode)
> +{
> +	if (IS_I_VERSION(inode))
> +		return generic_fetch_iversion(inode);
> +	else
> +		return time_to_chattr(&inode->i_ctime);
> +}
> +
>  static const struct export_operations ext4_export_ops = {
>  	.fh_to_dentry = ext4_fh_to_dentry,
>  	.fh_to_parent = ext4_fh_to_parent,
>  	.get_parent = ext4_get_parent,
>  	.commit_metadata = ext4_nfs_commit_metadata,
> +	.fetch_iversion = ext4_fetch_iversion,
>  };
>  
>  enum {
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 18c912930947..182190684792 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3187,7 +3187,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		p = xdr_reserve_space(xdr, 4);
>  		if (!p)
>  			goto out_resource;
> -		if (IS_I_VERSION(d_inode(dentry)))
> +		if (d_inode(dentry)->i_sb->s_export_op->fetch_iversion)
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
>  		else
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 2656a3464c6c..ac3e309d7339 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -46,8 +46,8 @@ typedef struct svc_fh {
>  	struct timespec64	fh_pre_mtime;	/* mtime before oper */
>  	struct timespec64	fh_pre_ctime;	/* ctime before oper */
>  	/*
> -	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
> -	 *  to find out if it is valid.
> +	 * pre-op nfsv4 change attr: note must check for fetch_iversion
> +	 * op to find out if it is valid.
>  	 */
>  	u64			fh_pre_change;
>  
> @@ -246,31 +246,12 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
>  static inline u64 nfsd4_change_attribute(struct kstat *stat,
>  					 struct inode *inode)
>  {
>  	if (inode->i_sb->s_export_op->fetch_iversion)
>  		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> +	else
>  		return time_to_chattr(&stat->ctime);
>  }
>  
> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
> index 9c2d942d055d..f0c8fbe704a2 100644
> --- a/fs/nfsd/xdr4.h
> +++ b/fs/nfsd/xdr4.h
> @@ -761,10 +761,12 @@ void warn_on_nonidempotent_op(struct nfsd4_op *op);
>  static inline void
>  set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
>  {
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +
>  	BUG_ON(!fhp->fh_pre_saved);
>  	cinfo->atomic = (u32)fhp->fh_post_saved;
>  
> -	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
> +	if (inode->i_sb->s_export_op->fetch_iversion) {
>  		cinfo->before_change = fhp->fh_pre_change;
>  		cinfo->after_change = fhp->fh_post_change;
>  	} else {
> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> index 465fd9e048d4..b950fac3d7df 100644
> --- a/fs/xfs/xfs_export.c
> +++ b/fs/xfs/xfs_export.c
> @@ -16,6 +16,7 @@
>  #include "xfs_inode_item.h"
>  #include "xfs_icache.h"
>  #include "xfs_pnfs.h"
> +#include <linux/iversion.h>
>  
>  /*
>   * Note that we only accept fileids which are long enough rather than allow
> @@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
>  	.map_blocks		= xfs_fs_map_blocks,
>  	.commit_blocks		= xfs_fs_commit_blocks,
>  #endif
> +	.fetch_iversion		= generic_fetch_iversion,
>  };
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..ded74523c8a6 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
>  	return cur >> I_VERSION_QUERIED_SHIFT;
>  }
>  
> +/*
> + * We could use i_version alone as the NFSv4 change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + *
> + * A filesystem that has an on-disk boot counter or similar might prefer
> + * to use that to avoid the risk of the change attribute going backwards
> + * if system time is set backwards.
> + */
> +static inline u64 generic_fetch_iversion(struct inode *inode)
> +{
> +	u64 chattr;
> +
> +	chattr =  inode->i_ctime.tv_sec;
> +	chattr <<= 30;
> +	chattr += inode->i_ctime.tv_nsec;
> +	chattr += inode_query_iversion(inode);
> +	return chattr;
> +}
> +
>  /*
>   * For filesystems without any sort of change attribute, the best we can
>   * do is fake one up from the ctime:

One more nit: 

We probably don't want anyone using this on filesystems that don't set
SB_I_VERSION. It might be a good idea to add something like:

    WARN_ON_ONCE(!IS_I_VERSION(inode));

to this function, to catch anyone trying to do it.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-21  1:03                                                         ` Jeff Layton
@ 2020-11-21 21:44                                                           ` Daire Byrne
  2020-11-22  0:02                                                             ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-21 21:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: bfields, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

----- On 21 Nov, 2020, at 01:03, Jeff Layton jlayton@kernel.org wrote:
> On Fri, 2020-11-20 at 17:44 -0500, J. Bruce Fields wrote:
>> On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
>> > On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
>> > > On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
>> > > > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
>> > > > > I don't think I described what I was thinking well. Let me try again...
>> > > > > 
>> > > > > There should be no need to change the code in iversion.h -- I think we
>> > > > > can do this in a way that's confined to just nfsd/export code.
>> > > > > 
>> > > > > What I would suggest is to have nfsd4_change_attribute call the
>> > > > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
>> > > > > doing the stuff in that block. If fetch_iversion is NULL, then just use
>> > > > > the ctime.
>> > > > > 
>> > > > > Then, you just need to make sure that the filesystems' export_ops have
>> > > > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
>> > > > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
>> > > > > The rest of the filesystems can leave fetch_iversion as NULL (since we
>> > > > > don't want to use it on them).
>> > > > 
>> > > > Thanks for your patience, that makes sense, I'll try it.
>> > > > 
>> > > 
>> > > There is one gotcha in here though... ext4 needs to also handle the case
>> > > where SB_I_VERSION is not set. The simple fix might be to just have
>> > > different export ops for ext4 based on whether it was mounted with -o
>> > > iversion or not, but maybe there is some better way to do it?
>> > 
>> > I was thinking ext4's export op could check for I_VERSION on its own and
>> > vary behavior based on that.
>> > 
>> > I'll follow up with new patches in a moment.
>> > 
>> > I think the first one's all that's needed to fix the problem Daire
>> > identified.  I'm a little less sure of the rest.

I can confirm that patch 1/8 alone does indeed address the reported revalidation issue for us (as did the previous patch). The re-export server's client cache seems to remain intact and can serve the same cached results to multiple clients.

>> > Lightly tested, just by running them through my usual regression tests
>> > (which don't re-export) and then running connectathon on a 4.2 re-export
>> > of a 4.2 mount.
>> > 
>> > The latter triggered a crash preceded by a KASAN use-after free warning.
>> > Looks like it might be a problem with blocking lock notifications,
>> > probably not related to these patches.
>> >
> The set looks pretty reasonable at first glance. Nice work.
> 
> Once you put this in, I'll plan to add a suitable fetch_iversion op for
> ceph too.
> 
>> Another nit I ran across:
>> 
>> Some NFSv4 directory-modifying operations return pre- and post- change
>> attributes together with an "atomic" flag that's supposed to indicate
>> whether the change attributes were read atomically with the operation.
>> It looks like we're setting the atomic flag under the assumptions that
>> local vfs locks are sufficient to guarantee atomicity, which isn't right
>> when we're exporting a distributed filesystem.
>> 
>> In the case we're reexporting NFS I guess ideal would be to use the pre-
>> and post- attributes that the original server returned and also save
>> having to do extra getattr calls.  Not sure how we'd do that,
>> though--more export operations?  Maybe for now we could just figure out
>> when to turn off the atomic bit.
> 
> Oh yeah, good point.
> 
> I'm not even sure that local locks are really enough -- IIRC, there are
> still some race windows between doing the metadata operations and the
> getattrs called to fill pre/post op attrs. Still, those windows are a
> lot larger on something like NFS, so setting the flag there is really
> stretching things.
> 
> One hacky fix might be to add a flags field to export_operations, and
> have one that indicates that the atomic flag shouldn't be set. Then we
> could add that flag to all of the netfs' (nfs, ceph, cifs), and anywhere
> else that we thought it appropriate?
> 
> That approach might be helpful later too since we're starting to see a
> wider variety of exportable filesystems these days. We may need more
> "quirk" flags like this.
> --
> Jeff Layton <jlayton@kernel.org>

I should also mention that I still see a lot of unexpected repeat lookups even with the iversion optimisation patches with certain workloads. For example, looking at a network capture on the re-export server I might see 100s of getattr calls to the originating server for the same filehandle within 30 seconds which I would have expected the client cache to serve. But it could also be that the client cache is under memory pressure and not holding that data for very long.

But now I do wonder if these NFSv4 directory modifications and pre/post change attributes could be one potential contributor? I might run some production loads with a v3 re-export of a v3 server to see if that changes anything.

Many thanks again for the patches, I will take the entire set and run them through our production re-export workloads to see if anything shakes out.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-21 21:44                                                           ` Daire Byrne
@ 2020-11-22  0:02                                                             ` bfields
  2020-11-22  1:55                                                               ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-22  0:02 UTC (permalink / raw)
  To: Daire Byrne
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

On Sat, Nov 21, 2020 at 09:44:29PM +0000, Daire Byrne wrote:
> ----- On 21 Nov, 2020, at 01:03, Jeff Layton jlayton@kernel.org wrote:
> > On Fri, 2020-11-20 at 17:44 -0500, J. Bruce Fields wrote:
> >> On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
> >> > I think the first one's all that's needed to fix the problem Daire
> >> > identified.  I'm a little less sure of the rest.
> 
> I can confirm that patch 1/8 alone does indeed address the reported revalidation issue for us (as did the previous patch). The re-export server's client cache seems to remain intact and can serve the same cached results to multiple clients.

Thanks again for the testing.

> I should also mention that I still see a lot of unexpected repeat
> lookups even with the iversion optimisation patches with certain
> workloads. For example, looking at a network capture on the re-export
> server I might see 100s of getattr calls to the originating server for
> the same filehandle within 30 seconds which I would have expected the
> client cache to serve. But it could also be that the client cache is
> under memory pressure and not holding that data for very long.

That sounds weird.  Is the filehandle for a file or a directory?  Is the
file or directory actually changing at the time, and if so, is it the
client that's changing it?

Remind me what the setup is--a v3 re-export of a v4 mount?

--b.

> But now I do wonder if these NFSv4 directory modifications and
> pre/post change attributes could be one potential contributor? I might
> run some production loads with a v3 re-export of a v3 server to see if
> that changes anything.
> 
> Many thanks again for the patches, I will take the entire set and run
> them through our production re-export workloads to see if anything
> shakes out.
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-22  0:02                                                             ` bfields
@ 2020-11-22  1:55                                                               ` Daire Byrne
  2020-11-22  3:03                                                                 ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-22  1:55 UTC (permalink / raw)
  To: bfields
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs


----- On 22 Nov, 2020, at 00:02, bfields bfields@fieldses.org wrote:
>> I should also mention that I still see a lot of unexpected repeat
>> lookups even with the iversion optimisation patches with certain
>> workloads. For example, looking at a network capture on the re-export
>> server I might see 100s of getattr calls to the originating server for
>> the same filehandle within 30 seconds which I would have expected the
>> client cache to serve. But it could also be that the client cache is
>> under memory pressure and not holding that data for very long.
> 
> That sounds weird.  Is the filehandle for a file or a directory?  Is the
> file or directory actually changing at the time, and if so, is it the
> client that's changing it?
> 
> Remind me what the setup is--a v3 re-export of a v4 mount?

Maybe this discussion should go back into the "Adventures in re-exporting" thread? But to give a quick answer here anyway...

The workload I have been looking at recently is a NFSv3 re-export of a NFSv4.2 mount. I can also say that it is generally when new files are being written to a directory. So yes, the files and dir are changing at the time but I still didn't expect to see so many repeated getattr neatly bundled together in short bursts, e.g. (re-export server = 10.156.12.1, originating server 10.21.22.117).

54544  88.147927  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
54547  88.160469  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
54556  88.185592  10.156.12.1 → 10.21.22.117 NFS 330 V4 Call SETATTR FH: 0x4dbdfb01
54559  88.198350  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
54562  88.211670  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
54565  88.243251  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0x4dbdfb01/
54637  88.269587  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
55078  88.277138  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
57747  88.390197  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57748  88.390212  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57749  88.390215  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57750  88.390218  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57751  88.390220  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57752  88.390222  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57753  88.390231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57754  88.390261  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
57755  88.390292  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57852  88.415541  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
57853  88.415551  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
58965  88.442004  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60201  88.486231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60615  88.505453  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60616  88.505473  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60617  88.505477  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60618  88.505480  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60619  88.505482  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0

Often I only capture an open dh followed by a flurry of getattr:

 3068  24.603153  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0xb63a98ec/
 3089  24.641542  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3093  24.642172  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3140  24.719930  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3360  24.769423  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3376  24.771353  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3436  24.782817  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3569  24.798207  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3753  24.855233  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3777  24.856130  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3824  24.862919  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3873  24.873890  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4001  24.898289  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4070  24.925970  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4127  24.940616  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4174  24.985160  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4343  25.007565  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4344  25.008343  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4358  25.036177  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec

The common workload is that we will have multiple clients of the re-export server all writing different (frame) files into the same directory at the same time. But on the re-export server it is ultimately 16 threads of nfsd making those calls to the originating server.

The re-export server's client should be the only one making most of the changes, although there are other NFSv3 clients of the originating servers that could conceivably be updating files too.

Like I said, it might be interesting to see if we see the same behaviour with a NFSv3 re-export of an NFSv3 server.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-22  1:55                                                               ` Daire Byrne
@ 2020-11-22  3:03                                                                 ` bfields
  2020-11-23 20:07                                                                   ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-22  3:03 UTC (permalink / raw)
  To: Daire Byrne
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

On Sun, Nov 22, 2020 at 01:55:50AM +0000, Daire Byrne wrote:
> 
> ----- On 22 Nov, 2020, at 00:02, bfields bfields@fieldses.org wrote:
> >> I should also mention that I still see a lot of unexpected repeat
> >> lookups even with the iversion optimisation patches with certain
> >> workloads. For example, looking at a network capture on the re-export
> >> server I might see 100s of getattr calls to the originating server for
> >> the same filehandle within 30 seconds which I would have expected the
> >> client cache to serve. But it could also be that the client cache is
> >> under memory pressure and not holding that data for very long.
> > 
> > That sounds weird.  Is the filehandle for a file or a directory?  Is the
> > file or directory actually changing at the time, and if so, is it the
> > client that's changing it?
> > 
> > Remind me what the setup is--a v3 re-export of a v4 mount?
> 
> Maybe this discussion should go back into the "Adventures in re-exporting" thread? But to give a quick answer here anyway...
> 
> The workload I have been looking at recently is a NFSv3 re-export of a NFSv4.2 mount. I can also say that it is generally when new files are being written to a directory. So yes, the files and dir are changing at the time but I still didn't expect to see so many repeated getattr neatly bundled together in short bursts, e.g. (re-export server = 10.156.12.1, originating server 10.21.22.117).

Well, I guess the pre/post-op attributes could contribute to the
problem, in that they could unnecessarily turn a COMMIT into

	GETATTR
	COMMIT
	GETATTR

And ditto for anything that modifies file or directory contents.  But
I'd've thought some of those could have been cached.  Also it looks like
you've got more GETATTRs than that.  Hm.

--b.

> 
> 54544  88.147927  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
> 54547  88.160469  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
> 54556  88.185592  10.156.12.1 → 10.21.22.117 NFS 330 V4 Call SETATTR FH: 0x4dbdfb01
> 54559  88.198350  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
> 54562  88.211670  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
> 54565  88.243251  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0x4dbdfb01/
> 54637  88.269587  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 55078  88.277138  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 57747  88.390197  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57748  88.390212  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57749  88.390215  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57750  88.390218  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57751  88.390220  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57752  88.390222  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57753  88.390231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57754  88.390261  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 57755  88.390292  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57852  88.415541  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 57853  88.415551  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 58965  88.442004  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60201  88.486231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60615  88.505453  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60616  88.505473  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60617  88.505477  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60618  88.505480  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60619  88.505482  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 
> Often I only capture an open dh followed by a flurry of getattr:
> 
>  3068  24.603153  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0xb63a98ec/
>  3089  24.641542  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3093  24.642172  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3140  24.719930  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3360  24.769423  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3376  24.771353  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3436  24.782817  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3569  24.798207  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3753  24.855233  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3777  24.856130  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3824  24.862919  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3873  24.873890  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4001  24.898289  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4070  24.925970  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4127  24.940616  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4174  24.985160  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4343  25.007565  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4344  25.008343  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4358  25.036177  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
> 
> The common workload is that we will have multiple clients of the re-export server all writing different (frame) files into the same directory at the same time. But on the re-export server it is ultimately 16 threads of nfsd making those calls to the originating server.
> 
> The re-export server's client should be the only one making most of the changes, although there are other NFSv3 clients of the originating servers that could conceivably be updating files too.
> 
> Like I said, it might be interesting to see if we see the same behaviour with a NFSv3 re-export of an NFSv3 server.
> 
> Daire

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-22  3:03                                                                 ` bfields
@ 2020-11-23 20:07                                                                   ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-23 20:07 UTC (permalink / raw)
  To: bfields
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

----- On 22 Nov, 2020, at 03:03, bfields bfields@fieldses.org wrote:
>> The workload I have been looking at recently is a NFSv3 re-export of a NFSv4.2
>> mount. I can also say that it is generally when new files are being written to
>> a directory. So yes, the files and dir are changing at the time but I still
>> didn't expect to see so many repeated getattr neatly bundled together in short
>> bursts, e.g. (re-export server = 10.156.12.1, originating server 10.21.22.117).
> 
> Well, I guess the pre/post-op attributes could contribute to the
> problem, in that they could unnecessarily turn a COMMIT into
> 
>	GETATTR
>	COMMIT
>	GETATTR
> 
> And ditto for anything that modifies file or directory contents.  But
> I'd've thought some of those could have been cached.  Also it looks like
> you've got more GETATTRs than that.  Hm.

Yea, I definitely see those COMMITs surrounded by GETATTRs with NFSv4.2... But as you say, I get way more repeat GETATTRs for the same filehandles.

I switched to a NFSv4.2 re-export of a NFSv3 server and saw the same kind of thing - sometimes the wire would see 4-5 GETATTRs for the same FH in tight sequence with nothing in between. So then I started thinking.... how does nconnect work again? Because my re-export server is mounting the originating server with nconnect=16 and the flurries of repeat GETATTRs often contain a count in that ballpark.

I need to re-test without nconnect... Maybe that's how it's supposed to work and I'm just being over sensitive after this iversion issue.
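
For reference, the re-test amounts to an A/B comparison of the re-export server's client mount options (the server name and paths below are hypothetical):

```shell
# Current setup: re-export server mounts the originating server with nconnect
mount -t nfs -o vers=4.2,nconnect=16 origin:/export /srv/reexport

# Re-test: identical mount without nconnect, then repeat the same write
# workload and compare the GETATTR flurries in a packet capture
umount /srv/reexport
mount -t nfs -o vers=4.2 origin:/export /srv/reexport
```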

Daire

* Re: Adventures in NFS re-exporting
  2020-11-12 13:01             ` Daire Byrne
  2020-11-12 13:57               ` bfields
@ 2020-11-24 20:35               ` Daire Byrne
  2020-11-24 21:15                 ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-24 20:35 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

----- On 12 Nov, 2020, at 13:01, Daire Byrne daire@dneg.com wrote:
> 
> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's
> NFSv3 lookupp emulation patches, I can now revise my original list of issues
> that others will likely experience if they ever try to do this craziness:
> 
> 1) Don't re-export NFSv4.0 unless you set vfs_cache_pressure=0 otherwise you will
> see random input/output errors on your clients when things are dropped out of
> the cache. In the end we gave up on using NFSv4.0 with our Netapps because the
> 7-mode implementation seemed a bit flakey with modern Linux clients (Linux
> NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with
> Trond's lookupp emulation patches instead.
> 
> 2) In order to better utilise the re-export server's client cache when
> re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to
> use the horrible inode_peek_iversion_raw hack to maintain good metadata
> performance for large numbers of clients. Otherwise each re-export server's
> clients can cause invalidation of the re-export server client cache. Once you
> have hundreds of clients they all combine to constantly invalidate the cache
> resulting in an order of magnitude slower metadata performance. If you are
> re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not
> required.
> 
> 3) For some reason, when a 1MB read call arrives at the re-export server from a
> client, it gets chopped up into 128k read calls that are issued back to the
> originating server despite rsize/wsize=1MB on all mounts. This results in a
> noticeable increase in rpc chatter for large reads. Writes on the other hand
> retain their 1MB size from client to re-export server and back to the
> originating server. I am using nconnect but I doubt that is related.
> 
> 4) After some random time, the cachefilesd userspace daemon stops culling old
> data from an fscache disk storage. I thought it was to do with setting
> vfs_cache_pressure=0 but even with it set to the default 100 it just randomly
> decides to stop culling and never comes back to life until restarted or
> rebooted. Perhaps the fscache/cachefilesd rewrite that David Howells & David
> Wysochanski have been working on will improve matters.
> 
> 5) It's still really hard to cache nfs client metadata for any definitive time
> (actimeo,nocto) due to the pagecache churn that reads cause. If all required
> metadata (i.e. directory contents) could either be locally cached to disk or
> the inode cache rather than pagecache then maybe we would have more control
> over the actual cache times we are comfortable with for our workloads. This has
> little to do with re-exporting and is just a general NFS performance over the
> WAN thing. I'm very interested to see how Trond's recent patches to improve
> readdir performance might at least help re-populate the dropped cached metadata
> more efficiently over the WAN.
> 
> I just want to finish with one more crazy thing we have been doing - a re-export
> server of a re-export server! Again, a locking and consistency nightmare so
> only possible for very specific workloads (like ours). The advantage of this
> topology is that you can pull all your data over the WAN once (e.g. on-premise
> to cloud) and then fan-out that data to multiple other NFS re-export servers in
> the cloud to improve the aggregate performance to many clients. This avoids
> having multiple re-export servers all needing to pull the same data across the
> WAN.

I will officially add another point to the wishlist that I mentioned in Bruce's recent patches thread (for dealing with the iversion change on NFS re-export). I had held off mentioning this one because I wasn't really sure if it was just a normal production workload and expected behaviour for NFS, but the more I look into it, the more it seems like maybe it could be optimised for the re-export case. But then I might also just be overly sensitive about metadata ops over the WAN at this point....

6) I see many fast repeating COMMITs & GETATTRs from the NFS re-export server to the originating server for the same file while writing through it from a client. If I do a write from userspace on the re-export server directly to the client mountpoint (i.e. no re-exporting) I do not see the GETATTRs or COMMITs.

I see something similar with both a re-export of a NFSv3 originating server and a re-export of a NFSv4.2 originating server (using either NFSv3 or NFSv4). Bruce mentioned an extra GETATTR in the NFSv4.2 re-export case for a COMMIT (pre/post).

For simplicity let's look at the NFSv3 re-export of an NFSv3 originating server. But first let's write a file from userspace directly on the re-export server back to the originating server mount point (ie no re-export):

    3   0.772902  V3 GETATTR Call, FH: 0x6791bc70
    6   0.781239  V3 SETATTR Call, FH: 0x6791bc70
 3286   0.919601  V3 WRITE Call, FH: 0x6791bc70 Offset: 1048576 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
 3494   0.921351  V3 WRITE Call, FH: 0x6791bc70 Offset: 8388608 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
...
...
48178   1.462670  V3 WRITE Call, FH: 0x6791bc70 Offset: 102760448 Len: 1048576 UNSTABLE
48210   1.472400  V3 COMMIT Call, FH: 0x6791bc70

So lots of uninterrupted 1MB write calls back to the originating server as expected with a final COMMIT (good). We can also set nconnect=16 back to the originating server and get the same trace but with the write packets going down different ports (also good).
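
For reference, that direct-write test can be reproduced with something like the following (the interface, originating server address and mount point are illustrative):

```shell
# Capture the back-end NFS traffic while writing straight to the
# originating server's mount point on the re-export server
# (no re-export in the path for this baseline)
tcpdump -i eth0 -s0 -w direct-write.pcap host 10.21.22.117 &
CAP=$!
dd if=/dev/zero of=/mnt/origin/bigfile bs=1M count=100 conv=fdatasync
kill $CAP
# Summarise the capture; the baseline shows a run of 1MB WRITEs
# followed by a single COMMIT
tshark -r direct-write.pcap -Y nfs
```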

Now let's do the same write through the re-export server from a client (NFSv4.2 or NFSv3, it doesn't matter much):

    7   0.034411  V3 SETATTR Call, FH: 0x364ced2c
  286   0.148066  V3 WRITE Call, FH: 0x364ced2c Offset: 0 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
  343   0.152644  V3 WRITE Call, FH: 0x364ced2c Offset: 1048576 Len: 196608 UNSTABLE
                  V3 WRITE Call, FH: 0x364ced2c Offset: 1245184 Len: 8192 FILE_SYNC
  580   0.168159  V3 WRITE Call, FH: 0x364ced2c Offset: 1253376 Len: 843776 UNSTABLE
  671   0.174668  V3 COMMIT Call, FH: 0x364ced2c
 1105   0.193805  V3 COMMIT Call, FH: 0x364ced2c
 1123   0.201570  V3 WRITE Call, FH: 0x364ced2c Offset: 2097152 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
 1592   0.242259  V3 WRITE Call, FH: 0x364ced2c Offset: 3145728 Len: 1048576 UNSTABLE
...
...
54571   3.668028  V3 WRITE Call, FH: 0x364ced2c Offset: 102760448 Len: 1048576 FILE_SYNC [TCP segment of a reassembled PDU]
54940   3.713392  V3 WRITE Call, FH: 0x364ced2c Offset: 103809024 Len: 1048576 UNSTABLE
55706   3.733284  V3 COMMIT Call, FH: 0x364ced2c

So now we have lots of pairs of COMMIT calls in between the WRITE calls. We also see sporadic FILE_SYNC write calls, which we don't see when we write directly to the originating server from userspace (all UNSTABLE).

Finally, if we add nconnect=16 when mounting the originating server (useful for increasing WAN throughput) and again write through from the client, we start to see lots of GETATTRs mixed with the WRITEs & COMMITs:

   84   0.075830  V3 SETATTR Call, FH: 0x0e9698e8
  608   0.201944  V3 WRITE Call, FH: 0x0e9698e8 Offset: 0 Len: 1048576 UNSTABLE
  857   0.218760  V3 COMMIT Call, FH: 0x0e9698e8
  968   0.231706  V3 WRITE Call, FH: 0x0e9698e8 Offset: 1048576 Len: 1048576 UNSTABLE
 1042   0.246934  V3 COMMIT Call, FH: 0x0e9698e8
...
...
43754   3.033689  V3 WRITE Call, FH: 0x0e9698e8 Offset: 100663296 Len: 1048576 UNSTABLE
44085   3.044767  V3 COMMIT Call, FH: 0x0e9698e8
44086   3.044959  V3 GETATTR Call, FH: 0x0e9698e8
44087   3.044964  V3 GETATTR Call, FH: 0x0e9698e8
44088   3.044983  V3 COMMIT Call, FH: 0x0e9698e8
44615   3.079491  V3 WRITE Call, FH: 0x0e9698e8 Offset: 102760448 Len: 1048576 UNSTABLE
44700   3.082909  V3 WRITE Call, FH: 0x0e9698e8 Offset: 103809024 Len: 1048576 UNSTABLE
44978   3.092010  V3 COMMIT Call, FH: 0x0e9698e8
44982   3.092943  V3 COMMIT Call, FH: 0x0e9698e8

Sometimes I have seen clusters of 16 GETATTRs for the same file on the wire with nothing else in between. So if the re-export server is the only "client" writing these files to the originating server, why do we need to do so many repeat GETATTR calls when using nconnect>1? And why are the COMMIT calls required when the writes are coming via nfsd but not from userspace on the re-export server? Is that due to some sort of memory pressure or locking?
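
To quantify these flurries, the tshark one-line summaries can be tallied per filehandle with a quick script (the sample lines are copied from the capture above; trace.txt stands in for a real summary file):

```shell
# Count NFS v3 operations per (op, filehandle) pair in a tshark-style
# one-line summary; repeated GETATTR/COMMIT bursts then stand out at a glance
cat > trace.txt <<'EOF'
44085  3.044767  V3 COMMIT Call, FH: 0x0e9698e8
44086  3.044959  V3 GETATTR Call, FH: 0x0e9698e8
44087  3.044964  V3 GETATTR Call, FH: 0x0e9698e8
44088  3.044983  V3 COMMIT Call, FH: 0x0e9698e8
EOF
# Field 4 is the operation; the token after "FH:" is the filehandle
awk '/V3 .* Call/ { fh=""; for (i=1; i<=NF; i++) if ($i == "FH:") fh=$(i+1); print $4, fh }' trace.txt \
  | sort | uniq -c | sort -rn
```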

I picked the NFSv3 originating server case because my head starts to hurt tracking the equivalent packets, stateids and compound calls with NFSv4. But I think it's mostly the same for NFSv4. The writes through the re-export server lead to lots of COMMITs and (double) GETATTRs but using nconnect>1 at least doesn't seem to make it any worse like it does for NFSv3.

But maybe you actually want all the extra COMMITs to help better guarantee your writes when putting a re-export server in the way? Perhaps all of this is by design...

Daire

* Re: Adventures in NFS re-exporting
  2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
@ 2020-11-24 21:15                 ` bfields
  2020-11-24 22:15                   ` Frank Filz
  2020-11-25 17:14                   ` Daire Byrne
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-11-24 21:15 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
> Sometimes I have seen clusters of 16 GETATTRs for the same file on the
> wire with nothing else inbetween. So if the re-export server is the
> only "client" writing these files to the originating server, why do we
> need to do so many repeat GETATTR calls when using nconnect>1? And why
> are the COMMIT calls required when the writes are coming via nfsd but
> not from userspace on the re-export server? Is that due to some sort
> of memory pressure or locking?
> 
> I picked the NFSv3 originating server case because my head starts to
> hurt tracking the equivalent packets, stateids and compound calls with
> NFSv4. But I think it's mostly the same for NFSv4. The writes through
> the re-export server lead to lots of COMMITs and (double) GETATTRs but
> using nconnect>1 at least doesn't seem to make it any worse like it
> does for NFSv3.
> 
> But maybe you actually want all the extra COMMITs to help better
> guarantee your writes when putting a re-export server in the way?
> Perhaps all of this is by design...

Maybe that's close-to-open combined with the server's tendency to
open/close on every IO operation?  (Though the file cache should have
helped with that, I thought; as would using version >=4.0 on the final
client.)

Might be interesting to know whether the nocto mount option makes a
difference.  (So, add "nocto" to the mount options for the NFS mount
that you're re-exporting on the re-export server.)

By the way I made a start at a list of issues at

	http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export

but I was a little vague on which of your issues remained and didn't
take much time over it.

(If you want an account on that wiki BTW I seem to recall you just have
to ask Trond (for anti-spam reasons).)

--b.

* RE: Adventures in NFS re-exporting
  2020-11-24 21:15                 ` bfields
@ 2020-11-24 22:15                   ` Frank Filz
  2020-11-25 14:47                     ` 'bfields'
  2020-11-25 17:14                   ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-11-24 22:15 UTC (permalink / raw)
  To: 'bfields', 'Daire Byrne'
  Cc: 'Trond Myklebust', 'linux-cachefs', 'linux-nfs'

> On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
> > Sometimes I have seen clusters of 16 GETATTRs for the same file on the
> > wire with nothing else inbetween. So if the re-export server is the
> > only "client" writing these files to the originating server, why do we
> > need to do so many repeat GETATTR calls when using nconnect>1? And why
> > are the COMMIT calls required when the writes are coming via nfsd but
> > not from userspace on the re-export server? Is that due to some sort
> > of memory pressure or locking?
> >
> > I picked the NFSv3 originating server case because my head starts to
> > hurt tracking the equivalent packets, stateids and compound calls with
> > NFSv4. But I think it's mostly the same for NFSv4. The writes through
> > the re-export server lead to lots of COMMITs and (double) GETATTRs but
> > using nconnect>1 at least doesn't seem to make it any worse like it
> > does for NFSv3.
> >
> > But maybe you actually want all the extra COMMITs to help better
> > guarantee your writes when putting a re-export server in the way?
> > Perhaps all of this is by design...
> 
> Maybe that's close-to-open combined with the server's tendency to
> open/close on every IO operation?  (Though the file cache should have
> helped with that, I thought; as would using version >=4.0 on the final
> client.)
> 
> Might be interesting to know whether the nocto mount option makes a
> difference.  (So, add "nocto" to the mount options for the NFS mount
> that you're re-exporting on the re-export server.)
> 
> By the way I made a start at a list of issues at
> 
> 	http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
> 
> but I was a little vague on which of your issues remained and didn't
> take much time over it.
> 
> (If you want an account on that wiki BTW I seem to recall you just have
> to ask Trond (for anti-spam reasons).)

How much conversation about re-export has been had at the wider NFS
community level? I have an interest because Ganesha  supports re-export via
the PROXY_V3 and PROXY_V4 FSALs. We currently don't have a data cache though
there has been discussion of such, we do have attribute and dirent caches.

Looking over the wiki page, I have considered being able to specify a
re-export of a Ganesha export without encapsulating handles. Ganesha
encapsulates the export_fs handle in a way that could be coordinated between
the original server and the re-export so they would both effectively have
the same encapsulation layer.

I'd love to see some re-export best practices shared among server
implementations, and also what we can do to improve things when two server
implementations are interoperating via re-export.

Frank


* Re: Adventures in NFS re-exporting
  2020-11-24 22:15                   ` Frank Filz
@ 2020-11-25 14:47                     ` 'bfields'
  2020-11-25 16:25                       ` Frank Filz
  0 siblings, 1 reply; 129+ messages in thread
From: 'bfields' @ 2020-11-25 14:47 UTC (permalink / raw)
  To: Frank Filz
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'

On Tue, Nov 24, 2020 at 02:15:57PM -0800, Frank Filz wrote:
> How much conversation about re-export has been had at the wider NFS
> community level? I have an interest because Ganesha supports re-export via
> the PROXY_V3 and PROXY_V4 FSALs. We currently don't have a data cache though
> there has been discussion of such, we do have attribute and dirent caches.
> 
> Looking over the wiki page, I have considered being able to specify a
> re-export of a Ganesha export without encapsulating handles. Ganesha
> encapsulates the export_fs handle in a way that could be coordinated between
> the original server and the re-export so they would both effectively have
> the same encapsulation layer.

In the case the re-export server only serves a single export, I guess
you could do away with the encapsulation.  (The only risk I see is that
a client of the re-export server could also access any export of the
original server if it could guess filehandles, which might surprise
admins.)  Maybe that'd be useful.

Another advantage of not encapsulating filehandles is that clients could
more easily migrate between servers.

Cooperating servers could have an agreement on filehandles.  And I guess
we could standardize that somehow.  Are we ready for that?  I'm not sure
what other re-exporting problems there are that I haven't thought of.

--b.

> I'd love to see some re-export best practices shared among server
> implementations, and also what we can do to improve things when two server
> implementations are interoperating via re-export.

* RE: Adventures in NFS re-exporting
  2020-11-25 14:47                     ` 'bfields'
@ 2020-11-25 16:25                       ` Frank Filz
  2020-11-25 19:03                         ` 'bfields'
  0 siblings, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-11-25 16:25 UTC (permalink / raw)
  To: 'bfields'
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'

> On Tue, Nov 24, 2020 at 02:15:57PM -0800, Frank Filz wrote:
> > How much conversation about re-export has been had at the wider NFS
> > community level? I have an interest because Ganesha supports
> > re-export via the PROXY_V3 and PROXY_V4 FSALs. We currently don't have
> > a data cache though there has been discussion of such, we do have
> > attribute and dirent caches.
> >
> > Looking over the wiki page, I have considered being able to specify a
> > re-export of a Ganesha export without encapsulating handles. Ganesha
> > encapsulates the export_fs handle in a way that could be coordinated
> > between the original server and the re-export so they would both
> > effectively have the same encapsulation layer.
> 
> In the case the re-export server only serves a single export, I guess
> you could do away with the encapsulation.  (The only risk I see is that
> a client of the re-export server could also access any export of the
> original server if it could guess filehandles, which might surprise
> admins.)  Maybe that'd be useful.

Ganesha handles have a minor downside that actually helps here if Ganesha
is re-exporting another Ganesha server. There is a 16-bit export_id that comes
from the export configuration and is part of the handle. We could easily set
it up so that if the sysadmin configured it as such, each re-exported
Ganesha export would have the same export_id, and then a client handle for
export_id 1 would be mirrored to the original server as export_id 1 and the
two servers can have the same export permissions and everything.

There is some additional stuff we could easily implement in Ganesha to
restrict handle manipulation to sneak around export permissions.

> Another advantage of not encapsulating filehandles is that clients could
> more easily migrate between servers.

Yea, with the idea I've been mulling for Ganesha, migration between original
server and re-export server would be simple with the same handles. Of course
state migration is a whole different ball of wax, but a clustered setup
could work just as well as Ganesha's clustered filesystems. On the other
hand, re-export with state has a pitfall. If the re-export server crashes,
the state is lost on the original server unless we make a protocol change to
allow state re-export such that a re-export server crashing doesn't cause
state loss. For this reason, I haven't rushed to implement lock state
re-export in Ganesha, rather allowing the re-export server to just manage
lock state locally.

> Cooperating servers could have an agreement on filehandles.  And I guess
> we could standardize that somehow.  Are we ready for that?  I'm not sure
> what other re-exporting problems there are that I haven't thought of.

I'm not sure how far we want to go there, but potentially specific server
implementations could choose to be interoperable in a way that allows the
handle encapsulation to either be smaller or no extra overhead. For example,
if we implemented what I've thought about for Ganesha-Ganesha re-export,
Ganesha COULD also be "taught" which portion of the knfsd handle is the
filesystem/export identifier, and maintain a database of Ganesha
export/filesystem <-> knfsd export/filesystem and have Ganesha
re-encapsulate the exportfs/name_to_handle_at portion of the handle. Of
course in this case, trivial migration isn't possible since Ganesha will
have a different encapsulation than knfsd.

Incidentally, I also purposefully made Ganesha's encapsulation different so
it never collides with either version of knfsd handles (now if over the
course of the past 10 years another handle version has come along...).

Frank

> --b.
> 
> > I'd love to see some re-export best practices shared among server
> > implementations, and also what we can do to improve things when two
> > server implementations are interoperating via re-export.


* Re: Adventures in NFS re-exporting
  2020-11-24 21:15                 ` bfields
  2020-11-24 22:15                   ` Frank Filz
@ 2020-11-25 17:14                   ` Daire Byrne
  2020-11-25 19:31                     ` bfields
  2020-12-03 12:20                     ` Daire Byrne
  1 sibling, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-25 17:14 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 24 Nov, 2020, at 21:15, bfields bfields@fieldses.org wrote:
> On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
>> Sometimes I have seen clusters of 16 GETATTRs for the same file on the
>> wire with nothing else inbetween. So if the re-export server is the
>> only "client" writing these files to the originating server, why do we
>> need to do so many repeat GETATTR calls when using nconnect>1? And why
>> are the COMMIT calls required when the writes are coming via nfsd but
>> not from userspace on the re-export server? Is that due to some sort
>> of memory pressure or locking?
>> 
>> I picked the NFSv3 originating server case because my head starts to
>> hurt tracking the equivalent packets, stateids and compound calls with
>> NFSv4. But I think it's mostly the same for NFSv4. The writes through
>> the re-export server lead to lots of COMMITs and (double) GETATTRs but
>> using nconnect>1 at least doesn't seem to make it any worse like it
>> does for NFSv3.
>> 
>> But maybe you actually want all the extra COMMITs to help better
>> guarantee your writes when putting a re-export server in the way?
>> Perhaps all of this is by design...
> 
> Maybe that's close-to-open combined with the server's tendency to
> open/close on every IO operation?  (Though the file cache should have
> helped with that, I thought; as would using version >=4.0 on the final
> client.)
> 
> Might be interesting to know whether the nocto mount option makes a
> difference.  (So, add "nocto" to the mount options for the NFS mount
> that you're re-exporting on the re-export server.)

The nocto didn't really seem to help but the NFSv4.2 re-export of a NFSv3 server did. I also realised I had done some tests with nconnect on the re-export server's client and consequently mixed things up a bit in my head. So I did some more tests and tried to make the results clear and simple. In all cases I'm just writing a big file with "dd" and capturing the traffic between the originating server and re-export server.

First off, writing direct to the originating server mount on the re-export server from userspace shows the ideal behaviour for all combinations:

 originating server <- (vers=X,actimeo=1800,nconnect=X) <- reexport server writing = WRITE,WRITE .... repeating (good!)

Then re-exporting a NFSv4.2 server:

 originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = GETATTR,COMMIT,WRITE .... repeating
 originating server <- (vers=4.2) <- reexport server - (vers=4.2) <- client writing = GETATTR,WRITE .... repeating

And re-exporting a NFSv3 server:

 originating server <- (vers=3) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
 originating server <- (vers=3) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT .... repeating
  
So of all the combinations, a NFSv4.2 re-export of an NFSv3 server is the only one that matches the "ideal" case where we WRITE continuously without all the extra chatter.

And for completeness, taking that "good" case and making it bad with nconnect:

 originating server <- (vers=3,nconnect=16) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
 originating server <- (vers=3) <- reexport server <- (vers=4.2,nconnect=16) <- client writing = WRITE,COMMIT,GETATTR .... randomly repeating

So using nconnect on the re-export server's client causes lots more metadata ops. There are good throughput reasons for using it, but that gain could well be offset by the extra metadata roundtrips.

Similarly, we have mostly been using an NFSv4.2 re-export of an NFSv4.2 server over the WAN because of reduced metadata ops for reading, but it looks like we incur extra metadata ops for writing.

Side note: it's hard to decode nconnect-enabled packet captures because wireshark doesn't seem to like those extra port streams.
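The best I've managed is to split the capture into its individual TCP streams with tshark and look at each nconnect connection separately (a command sketch - stream numbers will vary per capture):

```shell
# list the TCP conversations in the capture - with nconnect=16 you
# should see 16 separate streams to port 2049
tshark -r write-test.pcap -q -z conv,tcp

# then inspect the NFS traffic one stream at a time
tshark -r write-test.pcap -Y 'tcp.stream == 0 && nfs'
```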

> By the way I made a start at a list of issues at
> 
>	http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
> 
> but I was a little vague on which of your issues remained and didn't
> take much time over it.

Cool. I'm glad there are some notes for others to reference - this thread is now too long for any human to read. The only things I'd consider adding are:

* re-export of NFSv4.0 filesystem can give input/output errors when the cache is dropped
* a weird interaction with nfs client readahead such that all reads are limited to the default 128k unless you manually increase it to match rsize.

The only other things I can offer are tips & tricks for doing this kind of thing over the WAN (vfs_cache_pressure, actimeo, nocto) and using fscache.
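For anyone wanting the short version of those tweaks, this is the gist of what we run on the re-export server - the values are just what worked for us (and the mount source, path and bdi id below are illustrative, not prescriptive; fsc also assumes cachefilesd is set up):

```shell
# keep dentries/inodes cached as long as possible on the re-export server
sysctl -w vm.vfs_cache_pressure=1

# mount the originating server with long attribute caching, no
# close-to-open consistency checks, and fscache enabled
mount -t nfs -o vers=3,nocto,actimeo=1800,fsc origserver:/export /srv/reexport

# bump NFS client readahead to match a 1MB rsize; the bdi id is
# per-mount - find it via the device number in /proc/self/mountinfo
echo 1024 > /sys/class/bdi/0:52/read_ahead_kb
```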

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-25 16:25                       ` Frank Filz
@ 2020-11-25 19:03                         ` 'bfields'
  2020-11-26  0:04                           ` Frank Filz
  0 siblings, 1 reply; 129+ messages in thread
From: 'bfields' @ 2020-11-25 19:03 UTC (permalink / raw)
  To: Frank Filz
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'

On Wed, Nov 25, 2020 at 08:25:19AM -0800, Frank Filz wrote:
> On the other
> hand, re-export with state has a pitfall. If the re-export server crashes,
> the state is lost on the original server unless we make a protocol change to
> allow state re-export such that a re-export server crashing doesn't cause
> state loss.

Oh, yes, reboot recovery's an interesting problem that I'd forgotten
about; added to that wiki page.

By "state re-export" you mean you'd take the stateids the original
server returned to you, and return them to your own clients?  So then
I guess you wouldn't need much state at all.

> For this reason, I haven't rushed to implement lock state
> re-export in Ganesha, rather allowing the re-export server to just manage
> lock state locally.
> 
> > Cooperating servers could have an agreement on filehandles.  And I guess
> we
> > could standardize that somehow.  Are we ready for that?  I'm not sure what
> > other re-exporting problems there are that I haven't thought of.
> 
> I'm not sure how far we want to go there, but potentially specific server
> implementations could choose to be interoperable in a way that allows the
> handle encapsulation to either be smaller or no extra overhead. For example,
> if we implemented what I've thought about for Ganesha-Ganesha re-export,
> Ganesha COULD also be "taught" which portion of the knfsd handle is the
> filesystem/export identifier, and maintain a database of Ganesha
> export/filesystem <-> knfsd export/filesystem and have Ganesha
> re-encapsulate the exportfs/name_to_handle_at portion of the handle. Of
> course in this case, trivial migration isn't possible since Ganesha will
> have a different encapsulation than knfsd.
> 
> Incidentally, I also purposefully made Ganesha's encapsulation different so
> it never collides with either version of knfsd handles (now if over the
> course of the past 10 years another handle version has come along...).

I don't think anything's changed there.

--b.


* Re: Adventures in NFS re-exporting
  2020-11-25 17:14                   ` Daire Byrne
@ 2020-11-25 19:31                     ` bfields
  2020-12-03 12:20                     ` Daire Byrne
  1 sibling, 0 replies; 129+ messages in thread
From: bfields @ 2020-11-25 19:31 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Wed, Nov 25, 2020 at 05:14:51PM +0000, Daire Byrne wrote:
> Cool. I'm glad there are some notes for others to reference - this
> thread is now too long for any human to read. The only things I'd
> consider adding are:

Thanks, done.

> * re-export of NFSv4.0 filesystem can give input/output errors when the cache is dropped

Looking back at that thread....  I suspect that's just unfixable, so all
you can do is either use v4.1+ on the original server or 4.0+ on the
edge clients.  Or I wonder if it would help if there was some way to
tell the 4.0 client just to try special stateids instead of attempting
an open?

> * a weird interaction with nfs client readahead such that all reads
> are limited to the default 128k unless you manually increase it to
> match rsize.
>
> The only other things I can offer are tips & tricks for doing this kind
> of thing over the WAN (vfs_cache_pressure, actimeo, nocto) and using
> fscache.

OK, I haven't tried to pick that out of the thread yet.

--b.


* RE: Adventures in NFS re-exporting
  2020-11-25 19:03                         ` 'bfields'
@ 2020-11-26  0:04                           ` Frank Filz
  0 siblings, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-11-26  0:04 UTC (permalink / raw)
  To: 'bfields'
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'



> -----Original Message-----
> From: 'bfields' [mailto:bfields@fieldses.org]
> Sent: Wednesday, November 25, 2020 11:03 AM
> To: Frank Filz <ffilzlnx@mindspring.com>
> Cc: 'Daire Byrne' <daire@dneg.com>; 'Trond Myklebust'
> <trondmy@hammerspace.com>; 'linux-cachefs' <linux-cachefs@redhat.com>;
> 'linux-nfs' <linux-nfs@vger.kernel.org>
> Subject: Re: Adventures in NFS re-exporting
> 
> On Wed, Nov 25, 2020 at 08:25:19AM -0800, Frank Filz wrote:
> > On the other
> > hand, re-export with state has a pitfall. If the re-export server
> > crashes, the state is lost on the original server unless we make a
> > protocol change to allow state re-export such that a re-export server
> > crashing doesn't cause state loss.
> 
> Oh, yes, reboot recovery's an interesting problem that I'd forgotten
> about; added to that wiki page.
>
> By "state re-export" you mean you'd take the stateids the original server
> returned to you, and return them to your own clients?  So then I guess you
> wouldn't need much state at all.

By state re-export I meant reflecting locks the end client takes on the
re-export server to the original server. Not necessarily by reflecting the
stateid (probably something to trip on there...) (Can we nail down a good
name for it? Proxy server or re-export server work well for the man in the
middle, but what about the back end server?)

Frank

> > For this reason, I haven't rushed to implement lock state re-export in
> > Ganesha, rather allowing the re-export server to just manage lock
> > state locally.
> >
> > > Cooperating servers could have an agreement on filehandles.  And I
> > > guess
> > we
> > > could standardize that somehow.  Are we ready for that?  I'm not
> > > sure what other re-exporting problems there are that I haven't
> > > thought of.
> >
> > I'm not sure how far we want to go there, but potentially specific
> > server implementations could choose to be interoperable in a way that
> > allows the handle encapsulation to either be smaller or no extra
> > overhead. For example, if we implemented what I've thought about for
> > Ganesha-Ganesha re-export, Ganesha COULD also be "taught" which
> > portion of the knfsd handle is the filesystem/export identifier, and
> > maintain a database of Ganesha export/filesystem <-> knfsd
> > export/filesystem and have Ganesha re-encapsulate the
> > exportfs/name_to_handle_at portion of the handle. Of course in this
> > case, trivial migration isn't possible since Ganesha will have a
> > different encapsulation than knfsd.
> >
> > Incidentally, I also purposefully made Ganesha's encapsulation
> > different so it never collides with either version of knfsd handles
> > (now if over the course of the past 10 years another handle version
> > has come along...).
> 
> I don't think anything's changed there.
> 
> --b.



* Re: Adventures in NFS re-exporting
  2020-11-25 17:14                   ` Daire Byrne
  2020-11-25 19:31                     ` bfields
@ 2020-12-03 12:20                     ` Daire Byrne
  2020-12-03 18:51                       ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-12-03 12:20 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 25 Nov, 2020, at 17:14, Daire Byrne daire@dneg.com wrote:
> First off, writing direct to the originating server mount on the re-export
> server from userspace shows the ideal behaviour for all combinations:
> 
> originating server <- (vers=X,actimeo=1800,nconnect=X) <- reexport server
> writing = WRITE,WRITE .... repeating (good!)
> 
> Then re-exporting a NFSv4.2 server:
> 
> originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing
> = GETATTR,COMMIT,WRITE .... repeating
> originating server <- (vers=4.2) <- reexport server - (vers=4.2) <- client
> writing = GETATTR,WRITE .... repeating
> 
> And re-exporting a NFSv3 server:
> 
> originating server <- (vers=3) <- reexport server - (vers=4.2) <- client writing
> = WRITE,WRITE .... repeating (good!)
> originating server <- (vers=3) <- reexport server - (vers=3) <- client writing =
> WRITE,COMMIT .... repeating
>  
> So of all the combinations, a NFSv4.2 re-export of an NFSv3 server is the only
> one that matches the "ideal" case where we WRITE continuously without all the
> extra chatter.
> 
> And for completeness, taking that "good" case and making it bad with nconnect:
> 
> originating server <- (vers=3,nconnect=16) <- reexport server - (vers=4.2) <-
> client writing = WRITE,WRITE .... repeating (good!)
> originating server <- (vers=3) <- reexport server <- (vers=4.2,nconnect=16) <-
> client writing = WRITE,COMMIT,GETATTR .... randomly repeating
> 
> So using nconnect on the re-export's client causes lots more metadata ops. There
> are reasons for doing that for increasing throughput but it could be that the
> gain is offset by the extra metadata roundtrips.
> 
> Similarly, we have mostly been using an NFSv4.2 re-export of an NFSv4.2 server
> over the WAN because of reduced metadata ops for reading, but it looks like we
> incur extra metadata ops for writing.

Just a small update based on the most recent patchsets from Trond & Bruce:

https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
https://patchwork.kernel.org/project/linux-nfs/list/?series=393561

For the write-through tests, the NFSv3 re-export of a NFSv4.2 server has trimmed an extra GETATTR:

Before:
originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT,GETATTR .... repeating
 
After:
originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT .... repeating

I'm assuming this is specifically due to the "EXPORT_OP_NOWCC" patch? All other combinations look the same as before (for write-through). An NFSv4.2 re-export of a NFSv3 server is still the best/ideal in terms of not incurring extra metadata roundtrips when writing.

It's great to see this re-export scenario becoming a better supported (and performing) topology; many thanks all.

Daire


* Re: Adventures in NFS re-exporting
  2020-12-03 12:20                     ` Daire Byrne
@ 2020-12-03 18:51                       ` bfields
  2020-12-03 20:27                         ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-03 18:51 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Dec 03, 2020 at 12:20:35PM +0000, Daire Byrne wrote:
> Just a small update based on the most recent patchsets from Trond &
> Bruce:
> 
> https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
> https://patchwork.kernel.org/project/linux-nfs/list/?series=393561
> 
> For the write-through tests, the NFSv3 re-export of a NFSv4.2 server
> has trimmed an extra GETATTR:
> 
> Before: originating server <- (vers=4.2) <- reexport server - (vers=3)
> <- client writing = WRITE,COMMIT,GETATTR .... repeating
>  
> After: originating server <- (vers=4.2) <- reexport server - (vers=3)
> <- client writing = WRITE,COMMIT .... repeating
> 
> I'm assuming this is specifically due to the "EXPORT_OP_NOWCC" patch?

Probably so, thanks for the update.

> All other combinations look the same as before (for write-through). An
> NFSv4.2 re-export of a NFSv3 server is still the best/ideal in terms
> of not incurring extra metadata roundtrips when writing.
> 
> It's great to see this re-export scenario becoming a better supported
> (and performing) topology; many thanks all.

I've been scratching my head over how to handle reboot of a re-exporting
server.  I think one way to fix it might be just to allow the re-export
server to pass along reclaims to the original server as it receives them
from its own clients.  It might require some protocol tweaks, I'm not
sure.  I'll try to get my thoughts in order and propose something.

--b.


* Re: Adventures in NFS re-exporting
  2020-12-03 18:51                       ` bfields
@ 2020-12-03 20:27                         ` Trond Myklebust
  2020-12-03 21:13                           ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 20:27 UTC (permalink / raw)
  To: bfields, daire; +Cc: linux-cachefs, linux-nfs

On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> On Thu, Dec 03, 2020 at 12:20:35PM +0000, Daire Byrne wrote:
> > Just a small update based on the most recent patchsets from Trond &
> > Bruce:
> > 
> > https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
> > https://patchwork.kernel.org/project/linux-nfs/list/?series=393561
> > 
> > For the write-through tests, the NFSv3 re-export of a NFSv4.2
> > server
> > has trimmed an extra GETATTR:
> > 
> > Before: originating server <- (vers=4.2) <- reexport server -
> > (vers=3)
> > <- client writing = WRITE,COMMIT,GETATTR .... repeating
> >  
> > After: originating server <- (vers=4.2) <- reexport server -
> > (vers=3)
> > <- client writing = WRITE,COMMIT .... repeating
> > 
> > I'm assuming this is specifically due to the "EXPORT_OP_NOWCC"
> > patch?
> 
> Probably so, thanks for the update.
> 
> > All other combinations look the same as before (for write-through).
> > An
> > NFSv4.2 re-export of a NFSv3 server is still the best/ideal in
> > terms
> > of not incurring extra metadata roundtrips when writing.
> > 
> > It's great to see this re-export scenario becoming a better
> > supported
> > (and performing) topology; many thanks all.
> 
> I've been scratching my head over how to handle reboot of a re-
> exporting
> server.  I think one way to fix it might be just to allow the re-
> export
> server to pass along reclaims to the original server as it receives
> them
> from its own clients.  It might require some protocol tweaks, I'm not
> sure.  I'll try to get my thoughts in order and propose something.
> 

It's more complicated than that. If the re-exporting server reboots,
but the original server does not, then unless that re-exporting server
persisted its lease and a full set of stateids somewhere, it will not
be able to atomically reclaim delegation and lock state on the server
on behalf of its clients.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: Adventures in NFS re-exporting
  2020-12-03 20:27                         ` Trond Myklebust
@ 2020-12-03 21:13                           ` bfields
  2020-12-03 21:32                             ` Frank Filz
  2020-12-03 21:34                             ` Trond Myklebust
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-12-03 21:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: daire, linux-cachefs, linux-nfs

On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > I've been scratching my head over how to handle reboot of a re-
> > exporting
> > server.  I think one way to fix it might be just to allow the re-
> > export
> > server to pass along reclaims to the original server as it receives
> > them
> > from its own clients.  It might require some protocol tweaks, I'm not
> > sure.  I'll try to get my thoughts in order and propose something.
> > 
> 
> It's more complicated than that. If the re-exporting server reboots,
> but the original server does not, then unless that re-exporting server
> persisted its lease and a full set of stateids somewhere, it will not
> be able to atomically reclaim delegation and lock state on the server
> on behalf of its clients.

By sending reclaims to the original server, I mean literally sending new
open and lock requests with the RECLAIM bit set, which would get brand
new stateids.

So, the original server would invalidate the existing client's previous
clientid and stateids--just as it normally would on reboot--but it would
optionally remember the underlying locks held by the client and allow
compatible lock reclaims.

Rough attempt:

	https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers

Think it would fly?

--b.


* RE: Adventures in NFS re-exporting
  2020-12-03 21:13                           ` bfields
@ 2020-12-03 21:32                             ` Frank Filz
  2020-12-03 21:34                             ` Trond Myklebust
  1 sibling, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-12-03 21:32 UTC (permalink / raw)
  To: bfields, 'Trond Myklebust'
  Cc: daire, linux-cachefs, linux-nfs, Jeff Layton, 'Solomon Boulos'

> On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > I've been scratching my head over how to handle reboot of a re-
> > > exporting server.  I think one way to fix it might be just to allow
> > > the re- export server to pass along reclaims to the original server
> > > as it receives them from its own clients.  It might require some
> > > protocol tweaks, I'm not sure.  I'll try to get my thoughts in order
> > > and propose something.
> > >
> >
> > It's more complicated than that. If the re-exporting server reboots,
> > but the original server does not, then unless that re-exporting server
> > persisted its lease and a full set of stateids somewhere, it will not
> > be able to atomically reclaim delegation and lock state on the server
> > on behalf of its clients.
> 
> By sending reclaims to the original server, I mean literally sending new
> open and lock requests with the RECLAIM bit set, which would get brand
> new stateids.
> 
> So, the original server would invalidate the existing client's previous
> clientid and stateids--just as it normally would on reboot--but it would
> optionally remember the underlying locks held by the client and allow
> compatible lock reclaims.
> 
> Rough attempt:
> 
> 	https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> 
> Think it would fly?

At a quick read through, that sounds good. I'm sure there are some bits and bobs we need to fix up.

I'm cc:ing Jeff Layton because what the original server needs to do looks a bit like what he implemented in CephFS to allow HA restarts of nfs-ganesha instances.

Maybe we should take this to the IETF mailing list? I'm certainly interested in discussion on what we could do in the protocol to facilitate this from nfs-ganesha perspective.

Frank





* Re: Adventures in NFS re-exporting
  2020-12-03 21:13                           ` bfields
  2020-12-03 21:32                             ` Frank Filz
@ 2020-12-03 21:34                             ` Trond Myklebust
  2020-12-03 21:45                               ` Frank Filz
                                                 ` (2 more replies)
  1 sibling, 3 replies; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 21:34 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > I've been scratching my head over how to handle reboot of a re-
> > > exporting
> > > server.  I think one way to fix it might be just to allow the re-
> > > export
> > > server to pass along reclaims to the original server as it
> > > receives
> > > them
> > > from its own clients.  It might require some protocol tweaks, I'm
> > > not
> > > sure.  I'll try to get my thoughts in order and propose
> > > something.
> > > 
> > 
> > It's more complicated than that. If the re-exporting server
> > reboots,
> > but the original server does not, then unless that re-exporting
> > server
> > persisted its lease and a full set of stateids somewhere, it will
> > not
> > be able to atomically reclaim delegation and lock state on the
> > server
> > on behalf of its clients.
> 
> By sending reclaims to the original server, I mean literally sending
> new
> open and lock requests with the RECLAIM bit set, which would get
> brand
> new stateids.
> 
> So, the original server would invalidate the existing client's
> previous
> clientid and stateids--just as it normally would on reboot--but it
> would
> optionally remember the underlying locks held by the client and allow
> compatible lock reclaims.
> 
> Rough attempt:
> 
>         https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> 
> Think it would fly?

So this would be a variant of courtesy locks that can be reclaimed by
the client using the reboot reclaim variant of OPEN/LOCK outside the
grace period? The purpose being to allow reclaim without forcing the
client to persist the original stateid?

Hmm... That's doable, but how about the following alternative: Add a
function that allows the client to request the full list of stateids
that the server holds on its behalf?

I've been wanting such a function for quite a while anyway in order to
allow the client to detect state leaks (either due to soft timeouts, or
due to reordered close/open operations).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* RE: Adventures in NFS re-exporting
  2020-12-03 21:34                             ` Trond Myklebust
@ 2020-12-03 21:45                               ` Frank Filz
  2020-12-03 21:57                                 ` Trond Myklebust
  2020-12-03 21:54                               ` bfields
  2020-12-03 22:45                               ` bfields
  2 siblings, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-12-03 21:45 UTC (permalink / raw)
  To: 'Trond Myklebust', bfields; +Cc: linux-cachefs, linux-nfs, daire

> On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > I've been scratching my head over how to handle reboot of a re-
> > > > exporting server.  I think one way to fix it might be just to
> > > > allow the re- export server to pass along reclaims to the original
> > > > server as it receives them from its own clients.  It might require
> > > > some protocol tweaks, I'm not sure.  I'll try to get my thoughts
> > > > in order and propose something.
> > > >
> > >
> > > It's more complicated than that. If the re-exporting server reboots,
> > > but the original server does not, then unless that re-exporting
> > > server persisted its lease and a full set of stateids somewhere, it
> > > will not be able to atomically reclaim delegation and lock state on
> > > the server on behalf of its clients.
> >
> > By sending reclaims to the original server, I mean literally sending
> > new open and lock requests with the RECLAIM bit set, which would get
> > brand new stateids.
> >
> > So, the original server would invalidate the existing client's
> > previous clientid and stateids--just as it normally would on
> > reboot--but it would optionally remember the underlying locks held by
> > the client and allow compatible lock reclaims.
> >
> > Rough attempt:
> >
> >
> > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> >
> > Think it would fly?
> 
> So this would be a variant of courtesy locks that can be reclaimed by the client
> using the reboot reclaim variant of OPEN/LOCK outside the grace period? The
> purpose being to allow reclaim without forcing the client to persist the original
> stateid?
> 
> Hmm... That's doable, but how about the following alternative: Add a function
> that allows the client to request the full list of stateids that the server holds on
> its behalf?
> 
> I've been wanting such a function for quite a while anyway in order to allow the
> client to detect state leaks (either due to soft timeouts, or due to reordered
> close/open operations).

Oh, that sounds interesting. So basically the re-export server would re-populate its state from the original server rather than relying on its clients doing reclaims? Hmm, but how does the re-export server rebuild its stateids? I guess it could make the clients repopulate them with the same "give me a dump of all my state" request, using the state details to match up with the old state and replacing stateids. Or did you have something different in mind?

Frank



* Re: Adventures in NFS re-exporting
  2020-12-03 21:34                             ` Trond Myklebust
  2020-12-03 21:45                               ` Frank Filz
@ 2020-12-03 21:54                               ` bfields
  2020-12-03 22:45                               ` bfields
  2 siblings, 0 replies; 129+ messages in thread
From: bfields @ 2020-12-03 21:54 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > I've been scratching my head over how to handle reboot of a re-
> > > > exporting
> > > > server.  I think one way to fix it might be just to allow the re-
> > > > export
> > > > server to pass along reclaims to the original server as it
> > > > receives
> > > > them
> > > > from its own clients.  It might require some protocol tweaks, I'm
> > > > not
> > > > sure.  I'll try to get my thoughts in order and propose
> > > > something.
> > > > 
> > > 
> > > It's more complicated than that. If the re-exporting server
> > > reboots,
> > > but the original server does not, then unless that re-exporting
> > > server
> > > persisted its lease and a full set of stateids somewhere, it will
> > > not
> > > be able to atomically reclaim delegation and lock state on the
> > > server
> > > on behalf of its clients.
> > 
> > By sending reclaims to the original server, I mean literally sending
> > new
> > open and lock requests with the RECLAIM bit set, which would get
> > brand
> > new stateids.
> > 
> > So, the original server would invalidate the existing client's
> > previous
> > clientid and stateids--just as it normally would on reboot--but it
> > would
> > optionally remember the underlying locks held by the client and allow
> > compatible lock reclaims.
> > 
> > Rough attempt:
> > 
> >         https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > 
> > Think it would fly?
> 
> So this would be a variant of courtesy locks that can be reclaimed by
> the client using the reboot reclaim variant of OPEN/LOCK outside the
> grace period? The purpose being to allow reclaim without forcing the
> client to persist the original stateid?

Right.

> Hmm... That's doable,

Keep mulling it over and let me know if you see something that doesn't
work.

> but how about the following alternative: Add a
> function that allows the client to request the full list of stateids
> that the server holds on its behalf?

So, on the re-export server:

The client comes back up knowing nothing, so it requests that list of
stateids.  A reclaim comes in from an end client.  The client looks
through its list for a stateid that matches that reclaim somehow.  So, I
guess the list of stateids also has to include filehandles and access
bits and lock ranges and such, so the client can pick an appropriate
stateid to use?
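Just to make that matching step concrete, a toy sketch of what the re-export server would have to do with such a list (all names and fields invented for illustration - not a protocol proposal):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServerState:
    """One entry in a hypothetical 'dump my state' reply."""
    stateid: bytes
    filehandle: bytes
    access: int           # bitmask: 1 = READ, 2 = WRITE
    lock_ranges: tuple    # ((offset, length), ...) held under this stateid

def find_reclaim_stateid(states, fh, access, lock_range=None):
    """Pick a server-held stateid compatible with an end client's reclaim:
    same filehandle, requested access bits covered, and (for a lock
    reclaim) a held range that covers the requested one."""
    for s in states:
        if s.filehandle != fh:
            continue
        if (s.access & access) != access:
            continue
        if lock_range is not None:
            off, length = lock_range
            if not any(o <= off and off + length <= o + l
                       for (o, l) in s.lock_ranges):
                continue
        return s.stateid
    return None  # no match: that lock state is irredeemably lost
```

i.e. the dump would need enough detail per stateid (filehandle, access, ranges, presumably lockowner) for the re-export server to decide "reclaimable or lost" for each incoming reclaim.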

> I've been wanting such a function for quite a while anyway in order to
> allow the client to detect state leaks (either due to soft timeouts, or
> due to reordered close/open operations).

Yipes, I hadn't realized that was possible.

--b.


* Re: Adventures in NFS re-exporting
  2020-12-03 21:45                               ` Frank Filz
@ 2020-12-03 21:57                                 ` Trond Myklebust
  2020-12-03 22:04                                   ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 21:57 UTC (permalink / raw)
  To: bfields, ffilzlnx; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > I've been scratching my head over how to handle reboot of a
> > > > > re-
> > > > > exporting server.  I think one way to fix it might be just to
> > > > > allow the re- export server to pass along reclaims to the
> > > > > original
> > > > > server as it receives them from its own clients.  It might
> > > > > require
> > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > thoughts
> > > > > in order and propose something.
> > > > > 
> > > > 
> > > > It's more complicated than that. If the re-exporting server
> > > > reboots,
> > > > but the original server does not, then unless that re-exporting
> > > > server persisted its lease and a full set of stateids
> > > > somewhere, it
> > > > will not be able to atomically reclaim delegation and lock
> > > > state on
> > > > the server on behalf of its clients.
> > > 
> > > By sending reclaims to the original server, I mean literally
> > > sending
> > > new open and lock requests with the RECLAIM bit set, which would
> > > get
> > > brand new stateids.
> > > 
> > > So, the original server would invalidate the existing client's
> > > previous clientid and stateids--just as it normally would on
> > > reboot--but it would optionally remember the underlying locks
> > > held by
> > > the client and allow compatible lock reclaims.
> > > 
> > > Rough attempt:
> > > 
> > > 
> > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > 
> > > Think it would fly?
> > 
> > So this would be a variant of courtesy locks that can be reclaimed
> > by the client
> > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > period? The
> > purpose being to allow reclaim without forcing the client to
> > persist the original
> > stateid?
> > 
> > Hmm... That's doable, but how about the following alternative: Add
> > a function
> > that allows the client to request the full list of stateids that
> > the server holds on
> > its behalf?
> > 
> > I've been wanting such a function for quite a while anyway in order
> > to allow the
> > client to detect state leaks (either due to soft timeouts, or due
> > to reordered
> > close/open operations).
> 
> Oh, that sounds interesting. So basically the re-export server would
> re-populate it's state from the original server rather than relying
> on it's clients doing reclaims? Hmm, but how does the re-export
> server rebuild its stateids? I guess it could make the clients
> repopulate them with the same "give me a dump of all my state", using
> the state details to match up with the old state and replacing
> stateids. Or did you have something different in mind?
> 

I was thinking that the re-export server could just use that list of
stateids to figure out which locks can be reclaimed atomically, and
which ones have been irredeemably lost. The assumption is that if you
have a lock stateid or a delegation, then that means the clients can
reclaim all the locks that were represented by that stateid.

I suppose the client would also need to know the lockowner for the
stateid, but presumably that information could also be returned by the
server?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 21:57                                 ` Trond Myklebust
@ 2020-12-03 22:04                                   ` bfields
  2020-12-03 22:14                                     ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-03 22:04 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: ffilzlnx, linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > I've been scratching my head over how to handle reboot of a
> > > > > > re-
> > > > > > exporting server.  I think one way to fix it might be just to
> > > > > > allow the re- export server to pass along reclaims to the
> > > > > > original
> > > > > > server as it receives them from its own clients.  It might
> > > > > > require
> > > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > > thoughts
> > > > > > in order and propose something.
> > > > > > 
> > > > > 
> > > > > It's more complicated than that. If the re-exporting server
> > > > > reboots,
> > > > > but the original server does not, then unless that re-exporting
> > > > > server persisted its lease and a full set of stateids
> > > > > somewhere, it
> > > > > will not be able to atomically reclaim delegation and lock
> > > > > state on
> > > > > the server on behalf of its clients.
> > > > 
> > > > By sending reclaims to the original server, I mean literally
> > > > sending
> > > > new open and lock requests with the RECLAIM bit set, which would
> > > > get
> > > > brand new stateids.
> > > > 
> > > > So, the original server would invalidate the existing client's
> > > > previous clientid and stateids--just as it normally would on
> > > > reboot--but it would optionally remember the underlying locks
> > > > held by
> > > > the client and allow compatible lock reclaims.
> > > > 
> > > > Rough attempt:
> > > > 
> > > > 
> > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > > > t_servers
> > > > 
> > > > Think it would fly?
> > > 
> > > So this would be a variant of courtesy locks that can be reclaimed
> > > by the client
> > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > period? The
> > > purpose being to allow reclaim without forcing the client to
> > > persist the original
> > > stateid?
> > > 
> > > Hmm... That's doable, but how about the following alternative: Add
> > > a function
> > > that allows the client to request the full list of stateids that
> > > the server holds on
> > > its behalf?
> > > 
> > > I've been wanting such a function for quite a while anyway in order
> > > to allow the
> > > client to detect state leaks (either due to soft timeouts, or due
> > > to reordered
> > > close/open operations).
> > 
> > Oh, that sounds interesting. So basically the re-export server would
> > re-populate it's state from the original server rather than relying
> > on it's clients doing reclaims? Hmm, but how does the re-export
> > server rebuild its stateids? I guess it could make the clients
> > repopulate them with the same "give me a dump of all my state", using
> > the state details to match up with the old state and replacing
> > stateids. Or did you have something different in mind?
> > 
> 
> I was thinking that the re-export server could just use that list of
> stateids to figure out which locks can be reclaimed atomically, and
> which ones have been irredeemably lost. The assumption is that if you
> have a lock stateid or a delegation, then that means the clients can
> reclaim all the locks that were represented by that stateid.

I'm confused about how the re-export server uses that list.  Are you
assuming it persisted its own list across its own crash/reboot?  I guess
that's what I was trying to avoid having to do.

--b.


* Re: Adventures in NFS re-exporting
  2020-12-03 22:04                                   ` bfields
@ 2020-12-03 22:14                                     ` Trond Myklebust
  2020-12-03 22:39                                       ` Frank Filz
  2020-12-03 22:44                                       ` bfields
  0 siblings, 2 replies; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 22:14 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, ffilzlnx, linux-nfs, daire

On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > a
> > > > > > > re-
> > > > > > > exporting server.  I think one way to fix it might be
> > > > > > > just to
> > > > > > > allow the re- export server to pass along reclaims to the
> > > > > > > original
> > > > > > > server as it receives them from its own clients.  It
> > > > > > > might
> > > > > > > require
> > > > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > > > thoughts
> > > > > > > in order and propose something.
> > > > > > > 
> > > > > > 
> > > > > > It's more complicated than that. If the re-exporting server
> > > > > > reboots,
> > > > > > but the original server does not, then unless that re-
> > > > > > exporting
> > > > > > server persisted its lease and a full set of stateids
> > > > > > somewhere, it
> > > > > > will not be able to atomically reclaim delegation and lock
> > > > > > state on
> > > > > > the server on behalf of its clients.
> > > > > 
> > > > > By sending reclaims to the original server, I mean literally
> > > > > sending
> > > > > new open and lock requests with the RECLAIM bit set, which
> > > > > would
> > > > > get
> > > > > brand new stateids.
> > > > > 
> > > > > So, the original server would invalidate the existing
> > > > > client's
> > > > > previous clientid and stateids--just as it normally would on
> > > > > reboot--but it would optionally remember the underlying locks
> > > > > held by
> > > > > the client and allow compatible lock reclaims.
> > > > > 
> > > > > Rough attempt:
> > > > > 
> > > > > 
> > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > > > > t_servers
> > > > > 
> > > > > Think it would fly?
> > > > 
> > > > So this would be a variant of courtesy locks that can be
> > > > reclaimed
> > > > by the client
> > > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > > period? The
> > > > purpose being to allow reclaim without forcing the client to
> > > > persist the original
> > > > stateid?
> > > > 
> > > > Hmm... That's doable, but how about the following alternative:
> > > > Add
> > > > a function
> > > > that allows the client to request the full list of stateids
> > > > that
> > > > the server holds on
> > > > its behalf?
> > > > 
> > > > I've been wanting such a function for quite a while anyway in
> > > > order
> > > > to allow the
> > > > client to detect state leaks (either due to soft timeouts, or
> > > > due
> > > > to reordered
> > > > close/open operations).
> > > 
> > > Oh, that sounds interesting. So basically the re-export server
> > > would
> > > re-populate it's state from the original server rather than
> > > relying
> > > on it's clients doing reclaims? Hmm, but how does the re-export
> > > server rebuild its stateids? I guess it could make the clients
> > > repopulate them with the same "give me a dump of all my state",
> > > using
> > > the state details to match up with the old state and replacing
> > > stateids. Or did you have something different in mind?
> > > 
> > 
> > I was thinking that the re-export server could just use that list
> > of
> > stateids to figure out which locks can be reclaimed atomically, and
> > which ones have been irredeemably lost. The assumption is that if
> > you
> > have a lock stateid or a delegation, then that means the clients
> > can
> > reclaim all the locks that were represented by that stateid.
> 
> I'm confused about how the re-export server uses that list.  Are you
> assuming it persisted its own list across its own crash/reboot?  I
> guess
> that's what I was trying to avoid having to do.
> 
No. The server just uses the stateids as part of a check for 'do I hold
state for this file on this server?'. If the answer is 'yes' and the
lock owners are sane, then we should be able to assume the full set of
locks that lock owner held on that file are still valid.

BTW: if the lock owner is also returned by the server, then since the
lock owner is an opaque value, it could, for instance, be used by the
client to cache info on the server about which uid/gid owns these
locks.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* RE: Adventures in NFS re-exporting
  2020-12-03 22:14                                     ` Trond Myklebust
@ 2020-12-03 22:39                                       ` Frank Filz
  2020-12-03 22:50                                         ` Trond Myklebust
  2020-12-03 22:44                                       ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-12-03 22:39 UTC (permalink / raw)
  To: 'Trond Myklebust', bfields; +Cc: linux-cachefs, linux-nfs, daire



> -----Original Message-----
> From: Trond Myklebust [mailto:trondmy@hammerspace.com]
> Sent: Thursday, December 3, 2020 2:14 PM
> To: bfields@fieldses.org
> Cc: linux-cachefs@redhat.com; ffilzlnx@mindspring.com; linux-
> nfs@vger.kernel.org; daire@dneg.com
> Subject: Re: Adventures in NFS re-exporting
> 
> On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > > a
> > > > > > > > re-
> > > > > > > > exporting server.  I think one way to fix it might be just
> > > > > > > > to allow the re- export server to pass along reclaims to
> > > > > > > > the original server as it receives them from its own
> > > > > > > > clients.  It might require some protocol tweaks, I'm not
> > > > > > > > sure.  I'll try to get my thoughts in order and propose
> > > > > > > > something.
> > > > > > > >
> > > > > > >
> > > > > > > It's more complicated than that. If the re-exporting server
> > > > > > > reboots, but the original server does not, then unless that
> > > > > > > re- exporting server persisted its lease and a full set of
> > > > > > > stateids somewhere, it will not be able to atomically
> > > > > > > reclaim delegation and lock state on the server on behalf of
> > > > > > > its clients.
> > > > > >
> > > > > > By sending reclaims to the original server, I mean literally
> > > > > > sending new open and lock requests with the RECLAIM bit set,
> > > > > > which would get brand new stateids.
> > > > > >
> > > > > > So, the original server would invalidate the existing client's
> > > > > > previous clientid and stateids--just as it normally would on
> > > > > > reboot--but it would optionally remember the underlying locks
> > > > > > held by the client and allow compatible lock reclaims.
> > > > > >
> > > > > > Rough attempt:
> > > > > >
> > > > > >
> > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_
> > > > > > re-expor
> > > > > > t_servers
> > > > > >
> > > > > > Think it would fly?
> > > > >
> > > > > So this would be a variant of courtesy locks that can be
> > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > OPEN/LOCK outside the grace period? The purpose being to allow
> > > > > reclaim without forcing the client to persist the original
> > > > > stateid?
> > > > >
> > > > > Hmm... That's doable, but how about the following alternative:
> > > > > Add
> > > > > a function
> > > > > that allows the client to request the full list of stateids that
> > > > > the server holds on its behalf?
> > > > >
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order to allow the client to detect state leaks (either due to
> > > > > soft timeouts, or due to reordered close/open operations).
> > > >
> > > > Oh, that sounds interesting. So basically the re-export server
> > > > would re-populate it's state from the original server rather than
> > > > relying on it's clients doing reclaims? Hmm, but how does the
> > > > re-export server rebuild its stateids? I guess it could make the
> > > > clients repopulate them with the same "give me a dump of all my
> > > > state", using the state details to match up with the old state and
> > > > replacing stateids. Or did you have something different in mind?
> > > >
> > >
> > > I was thinking that the re-export server could just use that list of
> > > stateids to figure out which locks can be reclaimed atomically, and
> > > which ones have been irredeemably lost. The assumption is that if
> > > you have a lock stateid or a delegation, then that means the clients
> > > can reclaim all the locks that were represented by that stateid.
> >
> > I'm confused about how the re-export server uses that list.  Are you
> > assuming it persisted its own list across its own crash/reboot?  I
> > guess that's what I was trying to avoid having to do.
> >
> No. The server just uses the stateids as part of a check for 'do I hold state for
> this file on this server?'. If the answer is 'yes' and the lock owners are sane, then
> we should be able to assume the full set of locks that lock owner held on that
> file are still valid.
> 
> BTW: if the lock owner is also returned by the server, then since the lock owner
> is an opaque value, it could, for instance, be used by the client to cache info on
> the server about which uid/gid owns these locks.

Let me see if I'm understanding your idea right...

Re-export server reboots within the extended lease period it's been given by the original server. I'm assuming it uses the same clientid? But would probably open new sessions. It requests the list of stateids. Hmm, how to make the owner information useful, nfs-ganesha doesn't pass on the actual client's owner but rather just passes the address of its record for that client owner. Maybe it will have to do something a bit different for this degree of re-export support...

Now the re-export server knows which original client lock owners are allowed to reclaim state. So it just acquires locks using the original stateid as the client reclaims (what happens if the client doesn't reclaim a lock? I suppose the re-export server could unlock all regions not explicitly locked once reclaim is complete). Since the re-export server is acquiring new locks using the original stateid it will just overlay the original lock with the new lock and write locks don't conflict since they are being acquired by the same lock owner. Actually the original server could even balk at a "reclaim" in this way that wasn't originally held... And the original server could "refresh" the locks, and discard any that aren't refreshed at the end of reclaim. That part assumes the original server is apprised that what is actually happening is a reclaim.

The re-export server can destroy any stateids that it doesn't receive reclaims for.
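The end-of-reclaim bookkeeping described above can be modelled as plain set arithmetic. This is only a toy sketch with hypothetical names (no real NFS calls; the actual LOCKU/FREE_STATEID traffic is elided): ranges the clients explicitly re-locked are kept, ranges never reclaimed under a still-live stateid are unlocked, and stateids with no reclaims at all are destroyed.

```python
def finish_reclaim(held, reclaimed):
    """End-of-grace bookkeeping for the re-export server's reclaim pass.

    held:      {stateid: set of (start, end) byte ranges} still pinned on
               the original server when the re-export server rebooted.
    reclaimed: same shape, accumulated from the clients' reclaim requests.

    Returns (keep, unlock, destroy):
      keep    - ranges explicitly reclaimed, left in place
      unlock  - ranges not re-locked under an otherwise-reclaimed stateid
      destroy - stateids for which no reclaim arrived at all
    """
    keep, unlock = {}, {}
    destroy = set(held) - set(reclaimed)
    for sid in held.keys() & reclaimed.keys():
        keep[sid] = held[sid] & reclaimed[sid]
        unlock[sid] = held[sid] - reclaimed[sid]
    return keep, unlock, destroy
```
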

Hmm, I think if the re-export server is implemented as an HA cluster, it should establish a clientid on the original server for each virtual IP (assuming that's the unit of HA)  that exists. Then when virtual IPs are moved, the re-export server just goes through the above reclaim process for that clientid.

Frank



* Re: Adventures in NFS re-exporting
  2020-12-03 22:14                                     ` Trond Myklebust
  2020-12-03 22:39                                       ` Frank Filz
@ 2020-12-03 22:44                                       ` bfields
  1 sibling, 0 replies; 129+ messages in thread
From: bfields @ 2020-12-03 22:44 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, ffilzlnx, linux-nfs, daire

On Thu, Dec 03, 2020 at 10:14:25PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > > a
> > > > > > > > re-
> > > > > > > > exporting server.  I think one way to fix it might be
> > > > > > > > just to
> > > > > > > > allow the re- export server to pass along reclaims to the
> > > > > > > > original
> > > > > > > > server as it receives them from its own clients.  It
> > > > > > > > might
> > > > > > > > require
> > > > > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > > > > thoughts
> > > > > > > > in order and propose something.
> > > > > > > > 
> > > > > > > 
> > > > > > > It's more complicated than that. If the re-exporting server
> > > > > > > reboots,
> > > > > > > but the original server does not, then unless that re-
> > > > > > > exporting
> > > > > > > server persisted its lease and a full set of stateids
> > > > > > > somewhere, it
> > > > > > > will not be able to atomically reclaim delegation and lock
> > > > > > > state on
> > > > > > > the server on behalf of its clients.
> > > > > > 
> > > > > > By sending reclaims to the original server, I mean literally
> > > > > > sending
> > > > > > new open and lock requests with the RECLAIM bit set, which
> > > > > > would
> > > > > > get
> > > > > > brand new stateids.
> > > > > > 
> > > > > > So, the original server would invalidate the existing
> > > > > > client's
> > > > > > previous clientid and stateids--just as it normally would on
> > > > > > reboot--but it would optionally remember the underlying locks
> > > > > > held by
> > > > > > the client and allow compatible lock reclaims.
> > > > > > 
> > > > > > Rough attempt:
> > > > > > 
> > > > > > 
> > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > > > > > t_servers
> > > > > > 
> > > > > > Think it would fly?
> > > > > 
> > > > > So this would be a variant of courtesy locks that can be
> > > > > reclaimed
> > > > > by the client
> > > > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > > > period? The
> > > > > purpose being to allow reclaim without forcing the client to
> > > > > persist the original
> > > > > stateid?
> > > > > 
> > > > > Hmm... That's doable, but how about the following alternative:
> > > > > Add
> > > > > a function
> > > > > that allows the client to request the full list of stateids
> > > > > that
> > > > > the server holds on
> > > > > its behalf?
> > > > > 
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order
> > > > > to allow the
> > > > > client to detect state leaks (either due to soft timeouts, or
> > > > > due
> > > > > to reordered
> > > > > close/open operations).
> > > > 
> > > > Oh, that sounds interesting. So basically the re-export server
> > > > would
> > > > re-populate it's state from the original server rather than
> > > > relying
> > > > on it's clients doing reclaims? Hmm, but how does the re-export
> > > > server rebuild its stateids? I guess it could make the clients
> > > > repopulate them with the same "give me a dump of all my state",
> > > > using
> > > > the state details to match up with the old state and replacing
> > > > stateids. Or did you have something different in mind?
> > > > 
> > > 
> > > I was thinking that the re-export server could just use that list
> > > of
> > > stateids to figure out which locks can be reclaimed atomically, and
> > > which ones have been irredeemably lost. The assumption is that if
> > > you
> > > have a lock stateid or a delegation, then that means the clients
> > > can
> > > reclaim all the locks that were represented by that stateid.
> > 
> > I'm confused about how the re-export server uses that list.  Are you
> > assuming it persisted its own list across its own crash/reboot?  I
> > guess
> > that's what I was trying to avoid having to do.
> > 
> No. The server just uses the stateids as part of a check for 'do I hold
> state for this file on this server?'. If the answer is 'yes' and the
> lock owners are sane, then we should be able to assume the full set of
> locks that lock owner held on that file are still valid.
> 
> BTW: if the lock owner is also returned by the server, then since the
> lock owner is an opaque value, it could, for instance, be used by the
> client to cache info on the server about which uid/gid owns these
> locks.

OK, so the list of stateids returned by the server has entries that look
like (type, filehandle, owner, stateid) (where type=open or lock?).

I guess I'd need to see this in more detail.
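As a toy model of the (type, filehandle, owner, stateid) tuple and the "do I hold state for this file under this owner?" check discussed in this thread (all names are assumptions for illustration, not part of any RFC or proposed XDR):

```python
from dataclasses import dataclass

# Illustrative entry kinds; a real protocol extension would define these.
OPEN, LOCK, DELEG = "open", "lock", "delegation"

@dataclass(frozen=True)
class StateidEntry:
    kind: str          # OPEN, LOCK or DELEG
    filehandle: bytes  # file the state applies to
    owner: bytes       # opaque open/lock owner, as originally sent
    stateid: bytes     # the server's current stateid value

def holds_reclaimable_state(entries, fh, owner):
    """'Do I hold state for this file under this owner?'

    Per the discussion, a lock stateid or delegation for the pair is
    taken to mean the owner's locks on that file are still intact and
    the full set can be re-established."""
    return any(e.filehandle == fh and e.owner == owner
               and e.kind in (LOCK, DELEG)
               for e in entries)
```
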

--b.


* Re: Adventures in NFS re-exporting
  2020-12-03 21:34                             ` Trond Myklebust
  2020-12-03 21:45                               ` Frank Filz
  2020-12-03 21:54                               ` bfields
@ 2020-12-03 22:45                               ` bfields
  2020-12-03 22:53                                 ` Trond Myklebust
  2 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-03 22:45 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> I've been wanting such a function for quite a while anyway in order to
> allow the client to detect state leaks (either due to soft timeouts, or
> due to reordered close/open operations).

One sure way to fix any state leaks is to reboot the server.  The server
throws everything away, the clients reclaim, all that's left is stuff
they still actually care about.

It's very disruptive.

But you could do a limited version of that: the server throws away the
state from one client (keeping the underlying locks on the exported
filesystem), lets the client go through its normal reclaim process, at
the end of that throws away anything that wasn't reclaimed.  The only
delay is to anyone trying to acquire new locks that conflict with that
set of locks, and only for as long as it takes for the one client to
reclaim.

?
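A minimal model of that per-client "mini reboot" (all names hypothetical; the real work of expiring protocol state and pinning locks on the exported filesystem is elided): the server forgets one client's state, pins its underlying locks, lets only that client reclaim, makes conflicting new lockers wait, and releases whatever goes unreclaimed.

```python
class PerClientGrace:
    """Server-side bookkeeping for expiring one client's protocol state
    while pinning its underlying locks until reclaim completes."""

    def __init__(self, client_locks):
        self.pinned = set(client_locks)  # locks kept on the filesystem
        self.reclaimed = set()
        self.in_grace = True

    def reclaim(self, lock):
        # Only the one client may reclaim, and only locks it already held.
        if self.in_grace and lock in self.pinned:
            self.reclaimed.add(lock)
            return True
        return False

    def blocks_new_lock(self, lock):
        # Other clients' conflicting requests wait out the mini-grace.
        return self.in_grace and lock in self.pinned

    def end_grace(self):
        # Anything not reclaimed by the end of the grace is released.
        self.in_grace = False
        released = self.pinned - self.reclaimed
        self.pinned = self.reclaimed
        return released
```
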

--b.


* Re: Adventures in NFS re-exporting
  2020-12-03 22:39                                       ` Frank Filz
@ 2020-12-03 22:50                                         ` Trond Myklebust
  2020-12-03 23:34                                           ` Frank Filz
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 22:50 UTC (permalink / raw)
  To: bfields, ffilzlnx; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 14:39 -0800, Frank Filz wrote:
> 
> 
> > -----Original Message-----
> > From: Trond Myklebust [mailto:trondmy@hammerspace.com]
> > Sent: Thursday, December 3, 2020 2:14 PM
> > To: bfields@fieldses.org
> > Cc: linux-cachefs@redhat.com; ffilzlnx@mindspring.com; linux-
> > nfs@vger.kernel.org; daire@dneg.com
> > Subject: Re: Adventures in NFS re-exporting
> > 
> > On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > > On Thu, 2020-12-03 at 16:13 -0500,
> > > > > > bfields@fieldses.org wrote:
> > > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > > wrote:
> > > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > > I've been scratching my head over how to handle
> > > > > > > > > reboot of
> > > > > > > > > a
> > > > > > > > > re-
> > > > > > > > > exporting server.  I think one way to fix it might be
> > > > > > > > > just
> > > > > > > > > to allow the re- export server to pass along reclaims
> > > > > > > > > to
> > > > > > > > > the original server as it receives them from its own
> > > > > > > > > clients.  It might require some protocol tweaks, I'm
> > > > > > > > > not
> > > > > > > > > sure.  I'll try to get my thoughts in order and
> > > > > > > > > propose
> > > > > > > > > something.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > It's more complicated than that. If the re-exporting
> > > > > > > > server
> > > > > > > > reboots, but the original server does not, then unless
> > > > > > > > that
> > > > > > > > re- exporting server persisted its lease and a full set
> > > > > > > > of
> > > > > > > > stateids somewhere, it will not be able to atomically
> > > > > > > > reclaim delegation and lock state on the server on
> > > > > > > > behalf of
> > > > > > > > its clients.
> > > > > > > 
> > > > > > > By sending reclaims to the original server, I mean
> > > > > > > literally
> > > > > > > sending new open and lock requests with the RECLAIM bit
> > > > > > > set,
> > > > > > > which would get brand new stateids.
> > > > > > > 
> > > > > > > So, the original server would invalidate the existing
> > > > > > > client's
> > > > > > > previous clientid and stateids--just as it normally would
> > > > > > > on
> > > > > > > reboot--but it would optionally remember the underlying
> > > > > > > locks
> > > > > > > held by the client and allow compatible lock reclaims.
> > > > > > > 
> > > > > > > Rough attempt:
> > > > > > > 
> > > > > > > 
> > > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_
> > > > > > > re-expor
> > > > > > > t_servers
> > > > > > > 
> > > > > > > Think it would fly?
> > > > > > 
> > > > > > So this would be a variant of courtesy locks that can be
> > > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > > OPEN/LOCK outside the grace period? The purpose being to
> > > > > > allow
> > > > > > reclaim without forcing the client to persist the original
> > > > > > stateid?
> > > > > > 
> > > > > > Hmm... That's doable, but how about the following
> > > > > > alternative:
> > > > > > Add
> > > > > > a function
> > > > > > that allows the client to request the full list of stateids
> > > > > > that
> > > > > > the server holds on its behalf?
> > > > > > 
> > > > > > I've been wanting such a function for quite a while anyway
> > > > > > in
> > > > > > order to allow the client to detect state leaks (either due
> > > > > > to
> > > > > > soft timeouts, or due to reordered close/open operations).
> > > > > 
> > > > > Oh, that sounds interesting. So basically the re-export
> > > > > server
> > > > > would re-populate it's state from the original server rather
> > > > > than
> > > > > relying on it's clients doing reclaims? Hmm, but how does the
> > > > > re-export server rebuild its stateids? I guess it could make
> > > > > the
> > > > > clients repopulate them with the same "give me a dump of all
> > > > > my
> > > > > state", using the state details to match up with the old
> > > > > state and
> > > > > replacing stateids. Or did you have something different in
> > > > > mind?
> > > > > 
> > > > 
> > > > I was thinking that the re-export server could just use that
> > > > list of
> > > > stateids to figure out which locks can be reclaimed atomically,
> > > > and
> > > > which ones have been irredeemably lost. The assumption is that
> > > > if
> > > > you have a lock stateid or a delegation, then that means the
> > > > clients
> > > > can reclaim all the locks that were represented by that
> > > > stateid.
> > > 
> > > I'm confused about how the re-export server uses that list.  Are
> > > you
> > > assuming it persisted its own list across its own crash/reboot? 
> > > I
> > > guess that's what I was trying to avoid having to do.
> > > 
> > No. The server just uses the stateids as part of a check for 'do I
> > hold state for
> > this file on this server?'. If the answer is 'yes' and the lock
> > owners are sane, then
> > we should be able to assume the full set of locks that lock owner
> > held on that
> > file are still valid.
> > 
> > BTW: if the lock owner is also returned by the server, then since
> > the lock owner
> > is an opaque value, it could, for instance, be used by the client
> > to cache info on
> > the server about which uid/gid owns these locks.
> 
> Let me see if I'm understanding your idea right...
> 
> Re-export server reboots within the extended lease period it's been
> given by the original server. I'm assuming it uses the same clientid?

Yes. It would have to use the same clientid.

> But would probably open new sessions. It requests the list of
> stateids. Hmm, how to make the owner information useful, nfs-ganesha
> doesn't pass on the actual client's owner but rather just passes the
> address of its record for that client owner. Maybe it will have to do
> something a bit different for this degree of re-export support...
> 
> Now the re-export server knows which original client lock owners are
> allowed to reclaim state. So it just acquires locks using the
> original stateid as the client reclaims (what happens if the client
> doesn't reclaim a lock? I suppose the re-export server could unlock
> all regions not explicitly locked once reclaim is complete). Since
> the re-export server is acquiring new locks using the original
> stateid it will just overlay the original lock with the new lock and
> write locks don't conflict since they are being acquired by the same
> lock owner. Actually the original server could even balk at a
> "reclaim" in this way that wasn't originally held... And the original
> server could "refresh" the locks, and discard any that aren't
> refreshed at the end of reclaim. That part assumes the original
> server is apprised that what is actually happening is a reclaim.
> 
> The re-export server can destroy any stateids that it doesn't receive
> reclaims for.

Right. That's in essence what I'm suggesting. There are corner cases to
be considered: e.g. "what happens if the re-export server crashes after
unlocking on the server, but before passing the LOCKU reply on to the
client", however I think it should be possible to figure out strategies
for those cases.
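As a toy illustration of the check being described ('do I hold state for this file on this server?'), here is a minimal Python sketch; `Stateid`, `held`, and `allow_reclaim` are invented names for illustration, not kernel or protocol code:

```python
# Toy model (invented names, not kernel code): filter client reclaims
# against the stateid list fetched back from the original server.
from dataclasses import dataclass

@dataclass(frozen=True)
class Stateid:
    filehandle: str    # which file the state covers
    lock_owner: bytes  # opaque lock-owner value the re-export server chose

def allow_reclaim(held_state, fh, owner):
    """'Do I hold state for this file on this server?' -- if yes, and the
    lock owner matches, assume that owner's locks on the file are valid."""
    return Stateid(fh, owner) in held_state

# State the original server reports it still holds for our clientid
held = {Stateid("fh1", b"client-A:pid-42"), Stateid("fh2", b"client-B:pid-7")}

assert allow_reclaim(held, "fh1", b"client-A:pid-42")      # reclaim allowed
assert not allow_reclaim(held, "fh1", b"client-B:pid-7")   # wrong lock owner
assert not allow_reclaim(held, "fh3", b"client-A:pid-42")  # no state held
```

Reclaims failing the check would be rejected, and any held stateids that receive no reclaim would be destroyed once the reclaim window closes, as described above.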

> 
> Hmm, I think if the re-export server is implemented as an HA cluster,
> it should establish a clientid on the original server for each
> virtual IP (assuming that's the unit of HA)  that exists. Then when
> virtual IPs are moved, the re-export server just goes through the
> above reclaim process for that clientid.
> 

Yes, we could do something like that.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 22:45                               ` bfields
@ 2020-12-03 22:53                                 ` Trond Myklebust
  2020-12-03 23:16                                   ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 22:53 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > I've been wanting such a function for quite a while anyway in order
> > to allow the client to detect state leaks (either due to soft
> > timeouts, or due to reordered close/open operations).
> 
> One sure way to fix any state leaks is to reboot the server.  The
> server throws everything away, the clients reclaim, all that's left
> is stuff they still actually care about.
> 
> It's very disruptive.
> 
> But you could do a limited version of that: the server throws away
> the state from one client (keeping the underlying locks on the
> exported filesystem), lets the client go through its normal reclaim
> process, at the end of that throws away anything that wasn't
> reclaimed.  The only delay is to anyone trying to acquire new locks
> that conflict with that set of locks, and only for as long as it
> takes for the one client to reclaim.

One could do that, but that requires the existence of a quiescent
period where the client holds no state at all on the server. There are
definitely cases where that is not an option.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: Adventures in NFS re-exporting
  2020-12-03 22:53                                 ` Trond Myklebust
@ 2020-12-03 23:16                                   ` bfields
  2020-12-03 23:28                                     ` Frank Filz
  2020-12-04  1:02                                     ` Trond Myklebust
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-12-03 23:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > I've been wanting such a function for quite a while anyway in
> > > order to allow the client to detect state leaks (either due to
> > > soft timeouts, or due to reordered close/open operations).
> > 
> > One sure way to fix any state leaks is to reboot the server.  The
> > server throws everything away, the clients reclaim, all that's left
> > is stuff they still actually care about.
> > 
> > It's very disruptive.
> > 
> > But you could do a limited version of that: the server throws away
> > the state from one client (keeping the underlying locks on the
> > exported filesystem), lets the client go through its normal reclaim
> > process, at the end of that throws away anything that wasn't
> > reclaimed.  The only delay is to anyone trying to acquire new locks
> > that conflict with that set of locks, and only for as long as it
> > takes for the one client to reclaim.
> 
> One could do that, but that requires the existence of a quiescent
> period where the client holds no state at all on the server.

No, as I said, the client performs reboot recovery for any state that it
holds when we do this.

--b.


* RE: Adventures in NFS re-exporting
  2020-12-03 23:16                                   ` bfields
@ 2020-12-03 23:28                                     ` Frank Filz
  2020-12-04  1:02                                     ` Trond Myklebust
  1 sibling, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-12-03 23:28 UTC (permalink / raw)
  To: bfields, 'Trond Myklebust'; +Cc: linux-cachefs, linux-nfs, daire

> On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > I've been wanting such a function for quite a while anyway in
> > > > order to allow the client to detect state leaks (either due to
> > > > soft timeouts, or due to reordered close/open operations).
> > >
> > > One sure way to fix any state leaks is to reboot the server.  The
> > > server throws everything away, the clients reclaim, all that's left
> > > is stuff they still actually care about.
> > >
> > > It's very disruptive.
> > >
> > > But you could do a limited version of that: the server throws away
> > > the state from one client (keeping the underlying locks on the
> > > exported filesystem), lets the client go through its normal reclaim
> > > process, at the end of that throws away anything that wasn't
> > > reclaimed.  The only delay is to anyone trying to acquire new locks
> > > that conflict with that set of locks, and only for as long as it
> > > takes for the one client to reclaim.
> >
> > One could do that, but that requires the existence of a quiescent
> > period where the client holds no state at all on the server.
> 
> No, as I said, the client performs reboot recovery for any state that it holds
> when we do this.

Yea, but the original server goes through a period where it has dropped all state and isn't in grace, and if it's coordinating with non-NFS users, they don't know anything about grace anyway.

Frank



* RE: Adventures in NFS re-exporting
  2020-12-03 22:50                                         ` Trond Myklebust
@ 2020-12-03 23:34                                           ` Frank Filz
  0 siblings, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-12-03 23:34 UTC (permalink / raw)
  To: 'Trond Myklebust', bfields; +Cc: linux-cachefs, linux-nfs, daire

> > > -----Original Message-----
> > > From: Trond Myklebust [mailto:trondmy@hammerspace.com]
> > > Sent: Thursday, December 3, 2020 2:14 PM
> > > To: bfields@fieldses.org
> > > Cc: linux-cachefs@redhat.com; ffilzlnx@mindspring.com; linux-
> > > nfs@vger.kernel.org; daire@dneg.com
> > > Subject: Re: Adventures in NFS re-exporting
> > >
> > > On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > > > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org
> > > > > > > wrote:
> > > > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > > > wrote:
> > > > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > > > I've been scratching my head over how to handle
> > > > > > > > > > reboot of a re-exporting server.  I think one way
> > > > > > > > > > to fix it might be just to allow the re-export
> > > > > > > > > > server to pass along reclaims to the original
> > > > > > > > > > server as it receives them from its own clients.
> > > > > > > > > > It might require some protocol tweaks, I'm not
> > > > > > > > > > sure.  I'll try to get my thoughts in order and
> > > > > > > > > > propose something.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > It's more complicated than that. If the re-exporting
> > > > > > > > > server reboots, but the original server does not,
> > > > > > > > > then unless that re-exporting server persisted its
> > > > > > > > > lease and a full set of stateids somewhere, it will
> > > > > > > > > not be able to atomically reclaim delegation and lock
> > > > > > > > > state on the server on behalf of its clients.
> > > > > > > >
> > > > > > > > By sending reclaims to the original server, I mean
> > > > > > > > literally sending new open and lock requests with the
> > > > > > > > RECLAIM bit set, which would get brand new stateids.
> > > > > > > >
> > > > > > > > So, the original server would invalidate the existing
> > > > > > > > client's previous clientid and stateids--just as it
> > > > > > > > normally would on reboot--but it would optionally remember
> > > > > > > > the underlying locks held by the client and allow
> > > > > > > > compatible lock reclaims.
> > > > > > > >
> > > > > > > > Rough attempt:
> > > > > > > >
> > > > > > > >
> > > > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > > > > > >
> > > > > > > > Think it would fly?
> > > > > > >
> > > > > > > So this would be a variant of courtesy locks that can be
> > > > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > > > OPEN/LOCK outside the grace period? The purpose being to
> > > > > > > allow reclaim without forcing the client to persist the
> > > > > > > original stateid?
> > > > > > >
> > > > > > > Hmm... That's doable, but how about the following
> > > > > > > alternative: Add a function that allows the client to
> > > > > > > request the full list of stateids that the server holds
> > > > > > > on its behalf?
> > > > > > >
> > > > > > > I've been wanting such a function for quite a while anyway
> > > > > > > in order to allow the client to detect state leaks (either
> > > > > > > due to soft timeouts, or due to reordered close/open
> > > > > > > operations).
> > > > > >
> > > > > > Oh, that sounds interesting. So basically the re-export server
> > > > > > would re-populate it's state from the original server rather
> > > > > > than relying on it's clients doing reclaims? Hmm, but how does
> > > > > > the re-export server rebuild its stateids? I guess it could
> > > > > > make the clients repopulate them with the same "give me a dump
> > > > > > of all my state", using the state details to match up with the
> > > > > > old state and replacing stateids. Or did you have something
> > > > > > different in mind?
> > > > > >
> > > > >
> > > > > I was thinking that the re-export server could just use that
> > > > > list of stateids to figure out which locks can be reclaimed
> > > > > atomically, and which ones have been irredeemably lost. The
> > > > > assumption is that if you have a lock stateid or a delegation,
> > > > > then that means the clients can reclaim all the locks that were
> > > > > represented by that stateid.
> > > >
> > > > I'm confused about how the re-export server uses that list.
> > > > Are you assuming it persisted its own list across its own
> > > > crash/reboot?  I guess that's what I was trying to avoid having
> > > > to do.
> > > >
> > > No. The server just uses the stateids as part of a check for 'do I
> > > hold state for this file on this server?'. If the answer is 'yes'
> > > and the lock owners are sane, then we should be able to assume the
> > > full set of locks that lock owner held on that file are still valid.
> > >
> > > BTW: if the lock owner is also returned by the server, then since
> > > the lock owner is an opaque value, it could, for instance, be used
> > > by the client to cache info on the server about which uid/gid owns
> > > these locks.
> >
> > Let me see if I'm understanding your idea right...
> >
> > Re-export server reboots within the extended lease period it's been
> > given by the original server. I'm assuming it uses the same clientid?
> 
> Yes. It would have to use the same clientid.
> 
> > But would probably open new sessions. It requests the list of
> > stateids. Hmm, how to make the owner information useful, nfs-ganesha
> > doesn't pass on the actual client's owner but rather just passes the
> > address of its record for that client owner. Maybe it will have to do
> > something a bit different for this degree of re-export support...
> >
> > Now the re-export server knows which original client lock owners are
> > allowed to reclaim state. So it just acquires locks using the original
> > stateid as the client reclaims (what happens if the client doesn't
> > reclaim a lock? I suppose the re-export server could unlock all
> > regions not explicitly locked once reclaim is complete). Since the
> > re-export server is acquiring new locks using the original stateid it
> > will just overlay the original lock with the new lock and write locks
> > don't conflict since they are being acquired by the same lock owner.
> > Actually the original server could even balk at a "reclaim" in this
> > way that wasn't originally held... And the original server could
> > "refresh" the locks, and discard any that aren't refreshed at the end
> > of reclaim. That part assumes the original server is apprised that
> > what is actually happening is a reclaim.
> >
> > The re-export server can destroy any stateids that it doesn't receive
> > reclaims for.
> 
> Right. That's in essence what I'm suggesting. There are corner cases to be
> considered: e.g. "what happens if the re-export server crashes after unlocking
> on the server, but before passing the LOCKU reply on to the client", however I
> think it should be possible to figure out strategies for those cases.

That's no different than when a regular NFS server crashes before responding to an unlock. The client likely doesn't reclaim locks it was attempting to drop at server crash time. So that is one place we would definitely have abandoned locks, IF the unlock never made it to the original server. But we're already talking strategies to clean up abandoned locks.

I won't be surprised if we find a more tricky corner case, but my gut feel is every corner case will have a relatively simple solution.

Another consideration is how to handle the size of the state list... Ideally we would have some way to break it up that is less clunky than readdir (at least the state list can be assumed to be static while it is being fetched; even a regular client just interested in the list could pause state activity until it is retrieved).
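The paging idea can be sketched as a cursor walk, loosely analogous to READDIR cookies; no such NFSv4 operation exists today, so `server_pages` and the cursor scheme below are purely hypothetical:

```python
# Hypothetical cursor-based fetch of a large state list (invented scheme).
def fetch_state_list(server_pages, page_size=2):
    """Accumulate stateids by walking pages until a short page is seen."""
    out, cursor = [], 0
    while True:
        page = server_pages(cursor, page_size)
        out.extend(page)
        if len(page) < page_size:  # short page => list exhausted
            break
        cursor += len(page)
    return out

# Toy 'server'; as noted above, the state list is assumed static while
# it is being fetched.
STATE = ["sid1", "sid2", "sid3", "sid4", "sid5"]

def server_pages(cursor, count):
    return STATE[cursor:cursor + count]

assert fetch_state_list(server_pages) == STATE
```

The static-list assumption is what makes a plain integer cursor safe here; a real protocol operation would need to pin or version the list for the duration of the walk.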

Frank



* Re: Adventures in NFS re-exporting
  2020-12-03 23:16                                   ` bfields
  2020-12-03 23:28                                     ` Frank Filz
@ 2020-12-04  1:02                                     ` Trond Myklebust
  2020-12-04  1:41                                       ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-04  1:02 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 18:16 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > I've been wanting such a function for quite a while anyway in
> > > > order to allow the client to detect state leaks (either due to
> > > > soft timeouts, or due to reordered close/open operations).
> > > 
> > > One sure way to fix any state leaks is to reboot the server.  The
> > > server throws everything away, the clients reclaim, all that's
> > > left is stuff they still actually care about.
> > > 
> > > It's very disruptive.
> > > 
> > > But you could do a limited version of that: the server throws
> > > away the state from one client (keeping the underlying locks on
> > > the exported filesystem), lets the client go through its normal
> > > reclaim process, at the end of that throws away anything that
> > > wasn't reclaimed.  The only delay is to anyone trying to acquire
> > > new locks that conflict with that set of locks, and only for as
> > > long as it takes for the one client to reclaim.
> > 
> > One could do that, but that requires the existence of a quiescent
> > period where the client holds no state at all on the server.
> 
> No, as I said, the client performs reboot recovery for any state that
> it holds when we do this.
> 

Hmm... So how do the client and server coordinate what can and cannot
be reclaimed? The issue is that races can work both ways, with the
client sometimes believing that it holds a layout or a delegation that
the server thinks it has returned. If the server allows a reclaim of
such a delegation, then that could be problematic (because it breaks
lock atomicity on the client and because it may cause conflicts).

By the way, the other thing that I'd like to add to my wishlist is a
callback that allows the server to ask the client if it still holds a
given open or lock stateid. A server can recall a delegation or a
layout, so it can fix up leaks of those, however it has no remedy if
the client loses an open or lock stateid other than to possibly
forcibly revoke state. That could cause application crashes if the
server makes a mistake and revokes a lock that is actually in use.
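A sketch of what such a liveness check could enable on the server side (the callback itself does not exist in NFSv4 today; `gc_leaked_state` and the set representation below are invented for illustration):

```python
# Hypothetical server-side GC, assuming a callback that lets the server
# ask the client whether it still holds a given open/lock stateid.
def gc_leaked_state(server_state, client_still_holds):
    """Partition server state into confirmed-held vs. leaked stateids."""
    kept = {sid for sid in server_state if client_still_holds(sid)}
    return kept, server_state - kept

server_state = {"open:1", "open:2", "lock:3"}
client_side = {"open:1", "lock:3"}  # the client lost track of open:2

kept, leaked = gc_leaked_state(server_state, client_side.__contains__)
assert kept == {"open:1", "lock:3"}
assert leaked == {"open:2"}  # safe to reap instead of forcibly revoking
```

The point of the wished-for callback is exactly this partition: leaked state can be reaped with certainty instead of the server guessing and possibly revoking a lock that is still in use.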

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: Adventures in NFS re-exporting
  2020-12-04  1:02                                     ` Trond Myklebust
@ 2020-12-04  1:41                                       ` bfields
  2020-12-04  2:27                                         ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-04  1:41 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Fri, Dec 04, 2020 at 01:02:20AM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 18:16 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order to allow the client to detect state leaks (either due to
> > > > > soft timeouts, or due to reordered close/open operations).
> > > > 
> > > > One sure way to fix any state leaks is to reboot the server.
> > > > The server throws everything away, the clients reclaim, all
> > > > that's left is stuff they still actually care about.
> > > > 
> > > > It's very disruptive.
> > > > 
> > > > But you could do a limited version of that: the server throws
> > > > away the state from one client (keeping the underlying locks on
> > > > the exported filesystem), lets the client go through its normal
> > > > reclaim process, at the end of that throws away anything that
> > > > wasn't reclaimed.  The only delay is to anyone trying to acquire
> > > > new locks that conflict with that set of locks, and only for as
> > > > long as it takes for the one client to reclaim.
> > > 
> > > One could do that, but that requires the existence of a quiescent
> > > period where the client holds no state at all on the server.
> > 
> > No, as I said, the client performs reboot recovery for any state
> > that it holds when we do this.
> > 
> 
> Hmm... So how do the client and server coordinate what can and cannot
> be reclaimed? The issue is that races can work both ways, with the
> client sometimes believing that it holds a layout or a delegation that
> the server thinks it has returned. If the server allows a reclaim of
> such a delegation, then that could be problematic (because it breaks
> lock atomicity on the client and because it may cause conflicts).

The server's not actually forgetting anything, it's just pretending to,
in order to trigger the client's reboot recovery.  It can turn down the
client's attempt to reclaim something it doesn't have.

Though isn't it already game over by the time the client thinks it holds
some lock/open/delegation that the server doesn't?  I guess I'd need to
see these cases written out in detail to understand.

--b.

> By the way, the other thing that I'd like to add to my wishlist is a
> callback that allows the server to ask the client if it still holds a
> given open or lock stateid. A server can recall a delegation or a
> layout, so it can fix up leaks of those, however it has no remedy if
> the client loses an open or lock stateid other than to possibly
> forcibly revoke state. That could cause application crashes if the
> server makes a mistake and revokes a lock that is actually in use.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 


* Re: Adventures in NFS re-exporting
  2020-12-04  1:41                                       ` bfields
@ 2020-12-04  2:27                                         ` Trond Myklebust
  0 siblings, 0 replies; 129+ messages in thread
From: Trond Myklebust @ 2020-12-04  2:27 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 20:41 -0500, bfields@fieldses.org wrote:
> On Fri, Dec 04, 2020 at 01:02:20AM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 18:16 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > I've been wanting such a function for quite a while anyway
> > > > > > in order to allow the client to detect state leaks (either
> > > > > > due to soft timeouts, or due to reordered close/open
> > > > > > operations).
> > > > > 
> > > > > One sure way to fix any state leaks is to reboot the server.
> > > > > The server throws everything away, the clients reclaim, all
> > > > > that's left is stuff they still actually care about.
> > > > > 
> > > > > It's very disruptive.
> > > > > 
> > > > > But you could do a limited version of that: the server throws
> > > > > away the state from one client (keeping the underlying locks
> > > > > on the exported filesystem), lets the client go through its
> > > > > normal reclaim process, at the end of that throws away
> > > > > anything that wasn't reclaimed.  The only delay is to anyone
> > > > > trying to acquire new locks that conflict with that set of
> > > > > locks, and only for as long as it takes for the one client to
> > > > > reclaim.
> > > > 
> > > > One could do that, but that requires the existence of a
> > > > quiescent period where the client holds no state at all on the
> > > > server.
> > > 
> > > No, as I said, the client performs reboot recovery for any state
> > > that it holds when we do this.
> > > 
> > 
> > Hmm... So how do the client and server coordinate what can and
> > cannot be reclaimed? The issue is that races can work both ways,
> > with the client sometimes believing that it holds a layout or a
> > delegation that the server thinks it has returned. If the server
> > allows a reclaim of such a delegation, then that could be
> > problematic (because it breaks lock atomicity on the client and
> > because it may cause conflicts).
> 
> The server's not actually forgetting anything, it's just pretending
> to, in order to trigger the client's reboot recovery.  It can turn
> down the client's attempt to reclaim something it doesn't have.
> 
> Though isn't it already game over by the time the client thinks it
> holds some lock/open/delegation that the server doesn't?  I guess I'd
> need to see these cases written out in detail to understand.
> 

Normally, the server will return NFS4ERR_BAD_STATEID or
NFS4ERR_OLD_STATEID if the client tries to use an invalid stateid. The
issue here is that you'd be discarding that machinery, because the
client is forgetting its stateids when it gets told that the server
rebooted.
That again puts the onus on the server to verify more strongly whether
or not the client is recovering state that it actually holds.


So to elaborate a little more on the cases where we have seen the
client and server state mess up here. Typically it happens when we
build COMPOUNDS where there is a stateful operation followed by a slow
operation. Something like

Thread 1
========
OPEN(foo) + LAYOUTGET
-> openstateid(01: blah)

				Thread 2
				========
				OPEN(foo)
				->openstateid(02: blah)
				CLOSE(openstateid(02:blah))

(gets reply from OPEN).

Typically the client forgets about the stateid after the CLOSE, so when
it gets a reply to the original OPEN, it thinks it just got a
completely fresh stateid "openstateid(01: blah)", which it might try to
reclaim if the server declares a reboot.
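That interleaving can be modelled with a toy simulation (all names invented; this is a minimal model of the bookkeeping, not client or server code):

```python
# Toy model: thread 2's OPEN/CLOSE completes while thread 1's slower
# OPEN(+LAYOUTGET) reply is still in flight, leaving the client with a
# stateid the server has already discarded.
server_open_state = {}  # fh -> current seqid on the server

def server_open(fh):
    # Both OPENs share one open stateid; the server just bumps the seqid.
    server_open_state[fh] = server_open_state.get(fh, 0) + 1
    return (fh, server_open_state[fh])

def server_close(stateid):
    fh, _seqid = stateid
    server_open_state.pop(fh, None)  # CLOSE discards the open state

reply_t1 = server_open("foo")   # openstateid(01: blah), reply delayed
reply_t2 = server_open("foo")   # openstateid(02: blah)
server_close(reply_t2)          # thread 2 closes; server forgets "foo"

# Thread 1's reply finally arrives: the client caches what looks like a
# completely fresh stateid...
client_cached = reply_t1
# ...which the server no longer holds -- exactly the state the client
# might wrongly try to reclaim if the server declares a reboot.
assert client_cached[0] not in server_open_state
```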

> --b.
> 
> > By the way, the other thing that I'd like to add to my wishlist is
> > a callback that allows the server to ask the client if it still
> > holds a given open or lock stateid. A server can recall a
> > delegation or a layout, so it can fix up leaks of those, however it
> > has no remedy if the client loses an open or lock stateid other
> > than to possibly forcibly revoke state. That could cause
> > application crashes if the server makes a mistake and revokes a
> > lock that is actually in use.
> > 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




Thread overview: 129+ messages
2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
2020-09-08  9:40 ` Mkrtchyan, Tigran
2020-09-08 11:06   ` Daire Byrne
2020-09-15 17:21 ` J. Bruce Fields
2020-09-15 19:59   ` Trond Myklebust
2020-09-16 16:01     ` Daire Byrne
2020-10-19 16:19       ` Daire Byrne
2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 17:53             ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 20:05             ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37               ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
2020-11-09 16:02           ` bfields
2020-11-12 13:01             ` Daire Byrne
2020-11-12 13:57               ` bfields
2020-11-12 18:33                 ` Daire Byrne
2020-11-12 20:55                   ` bfields
2020-11-12 23:05                     ` Daire Byrne
2020-11-13 14:50                       ` bfields
2020-11-13 22:26                         ` bfields
2020-11-14 12:57                           ` Daire Byrne
2020-11-16 15:18                             ` bfields
2020-11-16 15:53                             ` bfields
2020-11-16 19:21                               ` Daire Byrne
2020-11-16 15:29                           ` Jeff Layton
2020-11-16 15:56                             ` bfields
2020-11-16 16:03                               ` Jeff Layton
2020-11-16 16:14                                 ` bfields
2020-11-16 16:38                                   ` Jeff Layton
2020-11-16 19:03                                     ` bfields
2020-11-16 20:03                                       ` Jeff Layton
2020-11-17  3:16                                         ` bfields
2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-17 15:26                                                 ` J. Bruce Fields
2020-11-17 15:34                                                   ` Jeff Layton
2020-11-20 22:38                                                     ` J. Bruce Fields
2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
2020-11-21  0:34                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
2020-11-21  0:58                                                           ` Jeff Layton
2020-11-21  1:01                                                             ` J. Bruce Fields
2020-11-21 13:00                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-21  1:03                                                         ` Jeff Layton
2020-11-21 21:44                                                           ` Daire Byrne
2020-11-22  0:02                                                             ` bfields
2020-11-22  1:55                                                               ` Daire Byrne
2020-11-22  3:03                                                                 ` bfields
2020-11-23 20:07                                                                   ` Daire Byrne
2020-11-17 15:25                                               ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
2020-11-17 12:27                                               ` Jeff Layton
2020-11-17 14:14                                                 ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
2020-11-24 21:15                 ` bfields
2020-11-24 22:15                   ` Frank Filz
2020-11-25 14:47                     ` 'bfields'
2020-11-25 16:25                       ` Frank Filz
2020-11-25 19:03                         ` 'bfields'
2020-11-26  0:04                           ` Frank Filz
2020-11-25 17:14                   ` Daire Byrne
2020-11-25 19:31                     ` bfields
2020-12-03 12:20                     ` Daire Byrne
2020-12-03 18:51                       ` bfields
2020-12-03 20:27                         ` Trond Myklebust
2020-12-03 21:13                           ` bfields
2020-12-03 21:32                             ` Frank Filz
2020-12-03 21:34                             ` Trond Myklebust
2020-12-03 21:45                               ` Frank Filz
2020-12-03 21:57                                 ` Trond Myklebust
2020-12-03 22:04                                   ` bfields
2020-12-03 22:14                                     ` Trond Myklebust
2020-12-03 22:39                                       ` Frank Filz
2020-12-03 22:50                                         ` Trond Myklebust
2020-12-03 23:34                                           ` Frank Filz
2020-12-03 22:44                                       ` bfields
2020-12-03 21:54                               ` bfields
2020-12-03 22:45                               ` bfields
2020-12-03 22:53                                 ` Trond Myklebust
2020-12-03 23:16                                   ` bfields
2020-12-03 23:28                                     ` Frank Filz
2020-12-04  1:02                                     ` Trond Myklebust
2020-12-04  1:41                                       ` bfields
2020-12-04  2:27                                         ` Trond Myklebust
2020-09-17 16:01   ` Daire Byrne
2020-09-17 19:09     ` bfields
2020-09-17 20:23       ` Frank van der Linden
2020-09-17 21:57         ` bfields
2020-09-19 11:08           ` Daire Byrne
2020-09-22 16:43         ` Chuck Lever
2020-09-23 20:25           ` Daire Byrne
2020-09-23 21:01             ` Frank van der Linden
2020-09-26  9:00               ` Daire Byrne
2020-09-28 15:49                 ` Frank van der Linden
2020-09-28 16:08                   ` Chuck Lever
2020-09-28 17:42                     ` Frank van der Linden
2020-09-22 12:31 ` Daire Byrne
2020-09-22 13:52   ` Trond Myklebust
2020-09-23 12:40     ` J. Bruce Fields
2020-09-23 13:09       ` Trond Myklebust
2020-09-23 17:07         ` bfields
2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
2020-10-01  0:09     ` Daire Byrne
2020-10-01 10:36       ` Jeff Layton
2020-10-01 12:38         ` Trond Myklebust
2020-10-01 16:39           ` Jeff Layton
2020-10-05 12:54         ` Daire Byrne
2020-10-13  9:59           ` Daire Byrne
2020-10-01 18:41     ` J. Bruce Fields
2020-10-01 19:24       ` Trond Myklebust
2020-10-01 19:26         ` bfields
2020-10-01 19:29           ` Trond Myklebust
2020-10-01 19:51             ` bfields