From: Daire Byrne <daire@dneg.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: bfields <bfields@fieldses.org>, fsorenso <fsorenso@redhat.com>,
	linux-nfs <linux-nfs@vger.kernel.org>, aglo <aglo@umich.edu>,
	neilb <neilb@suse.de>, bcodding <bcodding@redhat.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	jshivers <jshivers@redhat.com>
Subject: Re: unsharing tcp connections from different NFS mounts
Date: Wed, 5 May 2021 13:53:46 +0100 (BST)
Message-ID: <1454713846.12225482.1620219226547.JavaMail.zimbra@dneg.com>
In-Reply-To: <5bd2516e41f7a6b35ea9772a56a7dfdec52b83a9.camel@hammerspace.com>

Trond,

----- On 4 May, 2021, at 22:48, Trond Myklebust trondmy@hammerspace.com wrote:
>> I'd really love to see any kind of improvement to this behaviour as
>> it's a real shame we can't serve cached data quickly when all the
>> cache re-validations (getattrs) are stuck behind bulk IO that just
>> seems to plow through everything else.
> 
> If you use statx() instead of the regular stat call, and you
> specifically don't request the ctime and mtime, then the current kernel
> should skip the writeback.
> 
> Otherwise, you're going to have to wait for the NFSv4.2 protocol
> changes that we're trying to push through the IETF to allow the client
> to be authoritative for the ctime/mtime when it holds a write
> delegation.
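
For anyone wanting to try that, this is roughly what such a statx() call looks like (a minimal sketch based on my reading of statx(2); needs glibc >= 2.28 and kernel >= 4.11, and the exact mask bits you want may differ):

#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>   /* statx(), STATX_* mask bits */

int main(int argc, char **argv)
{
        struct statx stx;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <path>\n", argv[0]);
                return 1;
        }

        /* Ask only for what we need; leaving STATX_CTIME and
         * STATX_MTIME out of the mask is what lets the client
         * skip the writeback. */
        if (statx(AT_FDCWD, argv[1], 0, STATX_SIZE | STATX_MODE, &stx)) {
                perror("statx");
                return 1;
        }

        printf("%s: size=%llu\n", argv[1],
               (unsigned long long)stx.stx_size);
        return 0;
}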

In my case, it's less about skipping avoidable getattrs when we have the files open and delegated for read/write, or are still within the attribute cache timeout. And it has nothing to do with the re-export-specific cache optimisations that went into v5.11 (which really helped us out!).

It's more the fact that we can read, say, a terabyte of data into the client's pagecache or (more likely) fscache/cachefiles, but obviously can't use it again days later until some validation getattrs are sent and replied to. If that mountpoint also happens to be very busy with reads or writes at the time, then all that locally cached data sits idle until we can squeeze the necessary lookups through. This is especially painful if you are also using NFS over the WAN.

When I did some basic benchmarking, metadata ops from one process could be 100x slower when the pipe is full of reads or writes from other processes on the same client. Another detail from my previous notes: the more parallel client processes you have reading data, the slower your metadata ops get answered.

So if you have 1 process filling the client's network pipe with reads and another walking the filesystem, the walk will be ~5x slower than if the pipe wasn't full of reads. If you have 20 processes reading simultaneously and again filling the client's network pipe, then the filesystem-walking process is 100x slower. In both cases the physical network is being maxed out, but the metadata-intensive filesystem walk gets less and less opportunity to have its requests answered.
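
For reference, the timing side of that test was roughly this shape (a sketch rather than the actual harness; the path is made up, and you would run it alongside enough parallel readers to saturate the link):

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
        /* Hypothetical path; mount with actimeo=0 (or stat a
         * different file each iteration) so every stat() really
         * issues a GETATTR rather than being served from the
         * attribute cache. */
        const char *path = "/mnt/nfs/some/file";
        struct timespec t0, t1;
        struct stat st;
        int i, iters = 1000;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
                stat(path, &st);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("avg stat() latency: %.3f ms\n",
               ((t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6) / iters);
        return 0;
}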

And this is exactly the scenario we see with our NFS re-export case, where lots of knfsd threads are doing reads from a mountpoint while others are just trying to have lookup requests answered so they can then serve the locally cached data (it helps that our remote files never get overwritten or updated).

So, similar to the original behaviour described in this thread, we also find that even when one client's NFSv4.2 mount is eating up all the network bandwidth and metadata ops are slowed to a crawl, another independent server (or a multi-homed server with the same filesystem) mounted on the same client still shows very good (barely degraded) metadata performance. This is presumably due to the independent slot table, which is good news if you are using a single server to re-export multiple servers.

I think for us, some kind of priority for these small metadata ops would be ideal (assuming you can get enough of them into the slot queue in the first place). I'm not sure a slot limit per client process would help that much. I also wonder if readahead (or async writes) could be gobbling up too many of the available slots, leaving few for the sequential metadata-intensive processes?
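
(A handy way to watch for that on a live client is the per-op stats in /proc/self/mountstats: the cumulative queue time vs RTT columns show how long GETATTRs sit waiting to be transmitted before they ever hit the wire. A quick sketch to dump them:)

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[512];
        FILE *f = fopen("/proc/self/mountstats", "r");

        if (!f) {
                perror("/proc/self/mountstats");
                return 1;
        }

        /* Print each NFS device line plus its GETATTR per-op stats:
         * ops, transmissions, timeouts, bytes sent/received, then
         * cumulative queue, RTT and execute times in milliseconds.
         * Queue time growing much faster than RTT suggests requests
         * are waiting for a slot rather than for the server. */
        while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "device", 6) == 0 ||
                    strstr(line, "GETATTR:"))
                        fputs(line, stdout);
        }

        fclose(f);
        return 0;
}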

Daire
