linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Daire Byrne <daire@dneg.com>
To: bfields <bfields@fieldses.org>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
	linux-cachefs <linux-cachefs@redhat.com>,
	linux-nfs <linux-nfs@vger.kernel.org>
Subject: Re: Adventures in NFS re-exporting
Date: Tue, 24 Nov 2020 20:35:06 +0000 (GMT)	[thread overview]
Message-ID: <1055884313.92996091.1606250106656.JavaMail.zimbra@dneg.com> (raw)
In-Reply-To: <1744768451.86186596.1605186084252.JavaMail.zimbra@dneg.com>

----- On 12 Nov, 2020, at 13:01, Daire Byrne daire@dneg.com wrote:
> 
> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's
> NFSv3 lookupp emulation patches, I can now revise my original list of issues
> that others will likely experience if they ever try to do this craziness:
> 
> 1) Don't re-export NFSv4.0 unless you set vfs_cache_presure=0 otherwise you will
> see random input/output errors on your clients when things are dropped out of
> the cache. In the end we gave up on using NFSv4.0 with our Netapps because the
> 7-mode implementation seemed a bit flakey with modern Linux clients (Linux
> NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with
> Trond's lookupp emulation patches instead.
> 
> 2) In order to better utilise the re-export server's client cache when
> re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to
> use the horrible inode_peek_iversion_raw hack to maintain good metadata
> performance for large numbers of clients. Otherwise each re-export server's
> clients can cause invalidation of the re-export server client cache. Once you
> have hundreds of clients they all combine to constantly invalidate the cache
> resulting in an order of magnitude slower metadata performance. If you are
> re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not
> required.
> 
> 3) For some reason, when a 1MB read call arrives at the re-export server from a
> client, it gets chopped up into 128k read calls that are issued back to the
> originating server despite rsize/wsize=1MB on all mounts. This results in a
> noticeable increase in rpc chatter for large reads. Writes on the other hand
> retain their 1MB size from client to re-export server and back to the
> originating server. I am using nconnect but I doubt that is related.
> 
> 4) After some random time, the cachefilesd userspace daemon stops culling old
> data from an fscache disk storage. I thought it was to do with setting
> vfs_cache_pressure=0 but even with it set to the default 100 it just randomly
> decides to stop culling and never comes back to life until restarted or
> rebooted. Perhaps the fscache/cachefilesd rewrite that David Howells & David
> Wysochanski have been working on will improve matters.
> 
> 5) It's still really hard to cache nfs client metadata for any definitive time
> (actimeo,nocto) due to the pagecache churn that reads cause. If all required
> metadata (i.e. directory contents) could either be locally cached to disk or
> the inode cache rather than pagecache then maybe we would have more control
> over the actual cache times we are comfortable with for our workloads. This has
> little to do with re-exporting and is just a general NFS performance over the
> WAN thing. I'm very interested to see how Trond's recent patches to improve
> readdir performance might at least help re-populate the dropped cached metadata
> more efficiently over the WAN.
> 
> I just want to finish with one more crazy thing we have been doing - a re-export
> server of a re-export server! Again, a locking and consistency nightmare so
> only possible for very specific workloads (like ours). The advantage of this
> topology is that you can pull all your data over the WAN once (e.g. on-premise
> to cloud) and then fan-out that data to multiple other NFS re-export servers in
> the cloud to improve the aggregate performance to many clients. This avoids
> having multiple re-export servers all needing to pull the same data across the
> WAN.

I will officially add another point to the wishlist that I mentioned in Bruce's recent patches thread (for dealing with the iversion change on NFS re-export). I had held off mentioning this one because I wasn't really sure if it was just a normal production workload and expected behaviour for NFS, but the more I look into it, the more it seems like maybe it could be optimised for the re-export case. But then I also might be too overly sensitive about metadata ops over the WAN at this point....

6) I see many fast repeating COMMITs & GETATTRs from the NFS re-export server to the originating server for the same file while writing through it from a client. If I do a write from userspace on the re-export server directly to the client mountpoint (i.e. no re-exporting) I do not see the GETATTRs or COMMITs.

I see something similar with both a re-export of a NFSv3 originating server and a re-export of a NFSv4.2 originating server (using either NFSv3 or NFSv4). Bruce mentioned an extra GETATTR in the NFSv4.2 re-export case for a COMMIT (pre/post).

For simplicity let's look at the NFSv3 re-export of an NFSv3 originating server. But first let's write a file from userspace directly on the re-export server back to the originating server mount point (ie no re-export):

    3   0.772902  V3 GETATTR Call, FH: 0x6791bc70
    6   0.781239  V3 SETATTR Call, FH: 0x6791bc70
 3286   0.919601  V3 WRITE Call, FH: 0x6791bc70 Offset: 1048576 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
 3494   0.921351  V3 WRITE Call, FH: 0x6791bc70 Offset: 8388608 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
...
...
48178   1.462670  V3 WRITE Call, FH: 0x6791bc70 Offset: 102760448 Len: 1048576 UNSTABLE
48210   1.472400  V3 COMMIT Call, FH: 0x6791bc70

So lots of uninterrupted 1MB write calls back to the originating server as expected with a final COMMIT (good). We can also set nconnect=16 back to the originating server and get the same trace but with the write packets going down different ports (also good).

Now let's do the same write through the re-export server from a client (NFSv4.2 or NFSv3, it doesn't matter much):

    7   0.034411  V3 SETATTR Call, FH: 0x364ced2c
  286   0.148066  V3 WRITE Call, FH: 0x364ced2c Offset: 0 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
  343   0.152644  V3 WRITE Call, FH: 0x364ced2c Offset: 1048576 Len: 196608 UNSTABLEV3 WRITE Call, FH: 0x364ced2c Offset: 1245184 Len: 8192 FILE_SYNC
  580   0.168159  V3 WRITE Call, FH: 0x364ced2c Offset: 1253376 Len: 843776 UNSTABLE
  671   0.174668  V3 COMMIT Call, FH: 0x364ced2c
 1105   0.193805  V3 COMMIT Call, FH: 0x364ced2c
 1123   0.201570  V3 WRITE Call, FH: 0x364ced2c Offset: 2097152 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
 1592   0.242259  V3 WRITE Call, FH: 0x364ced2c Offset: 3145728 Len: 1048576 UNSTABLE
...
...
54571   3.668028  V3 WRITE Call, FH: 0x364ced2c Offset: 102760448 Len: 1048576 FILE_SYNC [TCP segment of a reassembled PDU]
54940   3.713392  V3 WRITE Call, FH: 0x364ced2c Offset: 103809024 Len: 1048576 UNSTABLE
55706   3.733284  V3 COMMIT Call, FH: 0x364ced2c

So now we have lots of pairs of COMMIT calls inbetween the WRITE calls. We also see sporadic FILE_SYNC write calls which we don't when we just write direct to the originating server from userspace (all UNSTABLE).

Finally, if we add nconnect=16 when mounting the originating server (useful for increasing WAN throughput) and again write through from the client, we start to see lots of GETATTRs mixed with the WRITEs & COMMITs:

   84   0.075830  V3 SETATTR Call, FH: 0x0e9698e8
  608   0.201944  V3 WRITE Call, FH: 0x0e9698e8 Offset: 0 Len: 1048576 UNSTABLE
  857   0.218760  V3 COMMIT Call, FH: 0x0e9698e8
  968   0.231706  V3 WRITE Call, FH: 0x0e9698e8 Offset: 1048576 Len: 1048576 UNSTABLE
 1042   0.246934  V3 COMMIT Call, FH: 0x0e9698e8
...
...
43754   3.033689  V3 WRITE Call, FH: 0x0e9698e8 Offset: 100663296 Len: 1048576 UNSTABLE
44085   3.044767  V3 COMMIT Call, FH: 0x0e9698e8
44086   3.044959  V3 GETATTR Call, FH: 0x0e9698e8
44087   3.044964  V3 GETATTR Call, FH: 0x0e9698e8
44088   3.044983  V3 COMMIT Call, FH: 0x0e9698e8
44615   3.079491  V3 WRITE Call, FH: 0x0e9698e8 Offset: 102760448 Len: 1048576 UNSTABLE
44700   3.082909  V3 WRITE Call, FH: 0x0e9698e8 Offset: 103809024 Len: 1048576 UNSTABLE
44978   3.092010  V3 COMMIT Call, FH: 0x0e9698e8
44982   3.092943  V3 COMMIT Call, FH: 0x0e9698e8

Sometimes I have seen clusters of 16 GETATTRs for the same file on the wire with nothing else inbetween. So if the re-export server is the only "client" writing these files to the originating server, why do we need to do so many repeat GETATTR calls when using nconnect>1? And why are the COMMIT calls required when the writes are coming via nfsd but not from userspace on the re-export server? Is that due to some sort of memory pressure or locking?

I picked the NFSv3 originating server case because my head starts to hurt tracking the equivalent packets, stateids and compound calls with NFSv4. But I think it's mostly the same for NFSv4. The writes through the re-export server lead to lots of COMMITs and (double) GETATTRs but using nconnect>1 at least doesn't seem to make it any worse like it does for NFSv3.

But maybe you actually want all the extra COMMITs to help better guarantee your writes when putting a re-export server in the way? Perhaps all of this is by design...

Daire

  parent reply	other threads:[~2020-11-24 20:35 UTC|newest]

Thread overview: 129+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
2020-09-08  9:40 ` Mkrtchyan, Tigran
2020-09-08 11:06   ` Daire Byrne
2020-09-15 17:21 ` J. Bruce Fields
2020-09-15 19:59   ` Trond Myklebust
2020-09-16 16:01     ` Daire Byrne
2020-10-19 16:19       ` Daire Byrne
2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 17:53             ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 20:05             ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37               ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
2020-11-09 16:02           ` bfields
2020-11-12 13:01             ` Daire Byrne
2020-11-12 13:57               ` bfields
2020-11-12 18:33                 ` Daire Byrne
2020-11-12 20:55                   ` bfields
2020-11-12 23:05                     ` Daire Byrne
2020-11-13 14:50                       ` bfields
2020-11-13 22:26                         ` bfields
2020-11-14 12:57                           ` Daire Byrne
2020-11-16 15:18                             ` bfields
2020-11-16 15:53                             ` bfields
2020-11-16 19:21                               ` Daire Byrne
2020-11-16 15:29                           ` Jeff Layton
2020-11-16 15:56                             ` bfields
2020-11-16 16:03                               ` Jeff Layton
2020-11-16 16:14                                 ` bfields
2020-11-16 16:38                                   ` Jeff Layton
2020-11-16 19:03                                     ` bfields
2020-11-16 20:03                                       ` Jeff Layton
2020-11-17  3:16                                         ` bfields
2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-17 15:26                                                 ` J. Bruce Fields
2020-11-17 15:34                                                   ` Jeff Layton
2020-11-20 22:38                                                     ` J. Bruce Fields
2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
2020-11-21  0:34                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
2020-11-21  0:58                                                           ` Jeff Layton
2020-11-21  1:01                                                             ` J. Bruce Fields
2020-11-21 13:00                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-21  1:03                                                         ` Jeff Layton
2020-11-21 21:44                                                           ` Daire Byrne
2020-11-22  0:02                                                             ` bfields
2020-11-22  1:55                                                               ` Daire Byrne
2020-11-22  3:03                                                                 ` bfields
2020-11-23 20:07                                                                   ` Daire Byrne
2020-11-17 15:25                                               ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
2020-11-17 12:27                                               ` Jeff Layton
2020-11-17 14:14                                                 ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-24 20:35               ` Daire Byrne [this message]
2020-11-24 21:15                 ` Adventures in NFS re-exporting bfields
2020-11-24 22:15                   ` Frank Filz
2020-11-25 14:47                     ` 'bfields'
2020-11-25 16:25                       ` Frank Filz
2020-11-25 19:03                         ` 'bfields'
2020-11-26  0:04                           ` Frank Filz
2020-11-25 17:14                   ` Daire Byrne
2020-11-25 19:31                     ` bfields
2020-12-03 12:20                     ` Daire Byrne
2020-12-03 18:51                       ` bfields
2020-12-03 20:27                         ` Trond Myklebust
2020-12-03 21:13                           ` bfields
2020-12-03 21:32                             ` Frank Filz
2020-12-03 21:34                             ` Trond Myklebust
2020-12-03 21:45                               ` Frank Filz
2020-12-03 21:57                                 ` Trond Myklebust
2020-12-03 22:04                                   ` bfields
2020-12-03 22:14                                     ` Trond Myklebust
2020-12-03 22:39                                       ` Frank Filz
2020-12-03 22:50                                         ` Trond Myklebust
2020-12-03 23:34                                           ` Frank Filz
2020-12-03 22:44                                       ` bfields
2020-12-03 21:54                               ` bfields
2020-12-03 22:45                               ` bfields
2020-12-03 22:53                                 ` Trond Myklebust
2020-12-03 23:16                                   ` bfields
2020-12-03 23:28                                     ` Frank Filz
2020-12-04  1:02                                     ` Trond Myklebust
2020-12-04  1:41                                       ` bfields
2020-12-04  2:27                                         ` Trond Myklebust
2020-09-17 16:01   ` Daire Byrne
2020-09-17 19:09     ` bfields
2020-09-17 20:23       ` Frank van der Linden
2020-09-17 21:57         ` bfields
2020-09-19 11:08           ` Daire Byrne
2020-09-22 16:43         ` Chuck Lever
2020-09-23 20:25           ` Daire Byrne
2020-09-23 21:01             ` Frank van der Linden
2020-09-26  9:00               ` Daire Byrne
2020-09-28 15:49                 ` Frank van der Linden
2020-09-28 16:08                   ` Chuck Lever
2020-09-28 17:42                     ` Frank van der Linden
2020-09-22 12:31 ` Daire Byrne
2020-09-22 13:52   ` Trond Myklebust
2020-09-23 12:40     ` J. Bruce Fields
2020-09-23 13:09       ` Trond Myklebust
2020-09-23 17:07         ` bfields
2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
2020-10-01  0:09     ` Daire Byrne
2020-10-01 10:36       ` Jeff Layton
2020-10-01 12:38         ` Trond Myklebust
2020-10-01 16:39           ` Jeff Layton
2020-10-05 12:54         ` Daire Byrne
2020-10-13  9:59           ` Daire Byrne
2020-10-01 18:41     ` J. Bruce Fields
2020-10-01 19:24       ` Trond Myklebust
2020-10-01 19:26         ` bfields
2020-10-01 19:29           ` Trond Myklebust
2020-10-01 19:51             ` bfields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1055884313.92996091.1606250106656.JavaMail.zimbra@dneg.com \
    --to=daire@dneg.com \
    --cc=bfields@fieldses.org \
    --cc=linux-cachefs@redhat.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).