linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Adventures in NFS re-exporting
@ 2020-09-07 17:31 Daire Byrne
  2020-09-08  9:40 ` Mkrtchyan, Tigran
                   ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-07 17:31 UTC (permalink / raw)
  To: linux-nfs; +Cc: linux-cachefs

Hi,

Apologies for this rather long email, but I thought there may be some interest out there in the community in how and why we've been doing something unsupported and barely documented - NFS re-exporting! And I'm not sure I can tell our story well in just a few short sentences so please bear with me (or stop now!).

Full disclosure - I am also rather hoping that this story piques some interest amongst developers to help make our rather niche setup even better and perhaps a little better documented. I also totally understand if this is something people wouldn't want to touch with a very long barge pole....

First a quick bit of history (I hope I have this right). Late in 2015, Jeff Layton proposed a patch series allowing knfsd to re-export a NFS client mount. The rationale then was to provide a "proxy" server that could mount an NFSv4 only server and re-export it to older clients that only supported NFSv3. One of the main sticking points then (as now), was around the 63 byte limit of filehandles for NFSv3 and how it couldn't be guaranteed that all re-exported filehandles would fit within that (in my experience it mostly works with "no_subtree_check"). There are also the usual locking and coherence concerns with NFSv3 too but I'll get to that in a bit.

Then almost two years later, v4.13 was released including parts of the patch series that actually allowed the re-export and since then other relevant bits (such as the open file cache) have also been merged. I soon became interested in using this new functionality to both accelerate our on-premises NFS storage and use it as a "WAN cache" to provide cloud compute instances locally cached proxy access to our on-premises storage.

Cut to a brief introduction to us and what we do... DNEG is an award winning VFX company which uses large compute farms to generate complex final frame renders for movies and TV. This workload mostly consists of reads of common data shared between many render clients (e.g textures, geometry) and a little unique data per frame. All file writes are to unique files per process (frames) and there is very little if any writing over existing files. Hence it's not very demanding on locking and coherence guarantees.

When our on-premises NFS storage is being overloaded or the server's network is maxed out, we can place multiple re-export servers in between them and our farm to improve performance. When our on-premises render farm is not quite big enough to meet a deadline, we spin up compute instances with a (reasonably local) cloud provider. Some of these cloud instances are Linux NFS servers which mount our on-premises NFS storage servers (~10ms away) and re-export these to the other cloud (render) instances. Since we know that the data we are reading doesn't change often, we can increase the actimeo and even use nocto to reduce the network chatter back to the on-prem servers. These re-export servers also use fscache/cachefiles to cache data to disk so that we can retain TBs of previously read data locally in the cloud over long periods of time. We also use NFSv4 (less network chatter) all the way from our on-prem storage to the re-export server and then on to the clients.

The re-export server(s) quickly builds up both a memory cache and disk backed fscache/cachefiles storage cache of our working data set so the data being pulled from on-prem lessens over time. Data is only ever read once over the WAN network from on-prem storage and then read multiple times by the many render client instances in the cloud. Recent NFS features such as "nconnect" help to speed up the initial reading of data from on-prem by using multiple connections to offset TCP latency. At the end of the render, we write the files back through the re-export server to our on-prem storage. Our average read bandwidth is many times higher than our write bandwidth.

Rather surprisingly, this mostly works for our particular workloads. We've completed movies using this setup and saved money on commercial caching systems (e.g Avere, GPFS, etc). But there are still some remaining issues with doing something that is very much not widely supported (or recommended). In most cases we have worked around them, but it would be great if we didn't have to so others could also benefit. I will list the main problems quickly now and provide more information and reproducers later if anyone is interested.

1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) when those filehandles are still being used by the knfsd's remote clients resulting in sporadic and random stale filehandles. This seems to be mostly for directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.

2) If we cache metadata on the re-export server using actimeo=3600,nocto we can cut the network packets back to the origin server to zero for repeated lookups. However, if a client of the re-export server walks paths and memory maps those files (i.e. loading an application), the re-export server starts issuing unexpected calls back to the origin server again, ignoring/invalidating the re-export server's NFS client cache. We worked around this this by patching an inode/iversion validity check in inode.c so that the NFS client cache on the re-export server is used. I'm not sure about the correctness of this patch but it works for our corner case.

3) If we saturate an NFS client's network with reads from the server, all client metadata lookups become unbearably slow even if it's all cached in the NFS client's memory and no network RPCs should be required. This is the case for any NFS client regardless of re-exporting but it affects this case more because when we can't serve cached metadata we also can't serve the cached data. It feels like some sort of bottleneck in the client's ability to parallelise requests? We work around this by not maxing out our network.

4) With an NFSv4 re-export, lots of open/close requests (hundreds per second) quickly eat up the CPU on the re-export server and perf top shows we are mostly in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache like that added to NFSv3? Our workaround is to either fix the thing doing lots of repeated open/closes or use NFSv3 instead.

If you made it this far, I've probably taken up way too much of your valuable time already. If nobody is interested in this rather niche application of the Linux client & knfsd, then I totally understand and I will not mention it here again. If your interest is piqued however, I'm happy to go into more detail about any of this with the hope that this could become a better documented and understood type of setup that others with similar workloads could reference.

Also, many thanks to all the Linux NFS developers for the amazing work you do which, in turn, helps us to make great movies. :)

Daire (Head of Systems DNEG)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
@ 2020-09-08  9:40 ` Mkrtchyan, Tigran
  2020-09-08 11:06   ` Daire Byrne
  2020-09-15 17:21 ` J. Bruce Fields
  2020-09-22 12:31 ` Daire Byrne
  2 siblings, 1 reply; 129+ messages in thread
From: Mkrtchyan, Tigran @ 2020-09-08  9:40 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs


Just out of curiosity:

did you have tries instead of re-exporting nfs mount directly
re-export an overlayfs mount on top of the original nfs mount?
Such setup should cover most of your issues.

Regards,
   Tigran.

----- Original Message -----
> From: "Daire Byrne" <daire@dneg.com>
> To: "linux-nfs" <linux-nfs@vger.kernel.org>
> Cc: linux-cachefs@redhat.com
> Sent: Monday, September 7, 2020 7:31:00 PM
> Subject: Adventures in NFS re-exporting

> Hi,
> 
> Apologies for this rather long email, but I thought there may be some interest
> out there in the community in how and why we've been doing something
> unsupported and barely documented - NFS re-exporting! And I'm not sure I can
> tell our story well in just a few short sentences so please bear with me (or
> stop now!).
> 
> Full disclosure - I am also rather hoping that this story piques some interest
> amongst developers to help make our rather niche setup even better and perhaps
> a little better documented. I also totally understand if this is something
> people wouldn't want to touch with a very long barge pole....
> 
> First a quick bit of history (I hope I have this right). Late in 2015, Jeff
> Layton proposed a patch series allowing knfsd to re-export a NFS client mount.
> The rationale then was to provide a "proxy" server that could mount an NFSv4
> only server and re-export it to older clients that only supported NFSv3. One of
> the main sticking points then (as now), was around the 63 byte limit of
> filehandles for NFSv3 and how it couldn't be guaranteed that all re-exported
> filehandles would fit within that (in my experience it mostly works with
> "no_subtree_check"). There are also the usual locking and coherence concerns
> with NFSv3 too but I'll get to that in a bit.
> 
> Then almost two years later, v4.13 was released including parts of the patch
> series that actually allowed the re-export and since then other relevant bits
> (such as the open file cache) have also been merged. I soon became interested
> in using this new functionality to both accelerate our on-premises NFS storage
> and use it as a "WAN cache" to provide cloud compute instances locally cached
> proxy access to our on-premises storage.
> 
> Cut to a brief introduction to us and what we do... DNEG is an award winning VFX
> company which uses large compute farms to generate complex final frame renders
> for movies and TV. This workload mostly consists of reads of common data shared
> between many render clients (e.g textures, geometry) and a little unique data
> per frame. All file writes are to unique files per process (frames) and there
> is very little if any writing over existing files. Hence it's not very
> demanding on locking and coherence guarantees.
> 
> When our on-premises NFS storage is being overloaded or the server's network is
> maxed out, we can place multiple re-export servers in between them and our farm
> to improve performance. When our on-premises render farm is not quite big
> enough to meet a deadline, we spin up compute instances with a (reasonably
> local) cloud provider. Some of these cloud instances are Linux NFS servers
> which mount our on-premises NFS storage servers (~10ms away) and re-export
> these to the other cloud (render) instances. Since we know that the data we are
> reading doesn't change often, we can increase the actimeo and even use nocto to
> reduce the network chatter back to the on-prem servers. These re-export servers
> also use fscache/cachefiles to cache data to disk so that we can retain TBs of
> previously read data locally in the cloud over long periods of time. We also
> use NFSv4 (less network chatter) all the way from our on-prem storage to the
> re-export server and then on to the clients.
> 
> The re-export server(s) quickly builds up both a memory cache and disk backed
> fscache/cachefiles storage cache of our working data set so the data being
> pulled from on-prem lessens over time. Data is only ever read once over the WAN
> network from on-prem storage and then read multiple times by the many render
> client instances in the cloud. Recent NFS features such as "nconnect" help to
> speed up the initial reading of data from on-prem by using multiple connections
> to offset TCP latency. At the end of the render, we write the files back
> through the re-export server to our on-prem storage. Our average read bandwidth
> is many times higher than our write bandwidth.
> 
> Rather surprisingly, this mostly works for our particular workloads. We've
> completed movies using this setup and saved money on commercial caching systems
> (e.g Avere, GPFS, etc). But there are still some remaining issues with doing
> something that is very much not widely supported (or recommended). In most
> cases we have worked around them, but it would be great if we didn't have to so
> others could also benefit. I will list the main problems quickly now and
> provide more information and reproducers later if anyone is interested.
> 
> 1) The kernel can drop entries out of the NFS client inode cache (under memory
> cache churn) when those filehandles are still being used by the knfsd's remote
> clients resulting in sporadic and random stale filehandles. This seems to be
> mostly for directories from what I've seen. Does the NFS client not know that
> knfsd is still using those files/dirs? The workaround is to never drop inode &
> dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps
> to ensure that we actually make the most of our actimeo=3600,nocto mount
> options for the full specified time.
> 
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> cut the network packets back to the origin server to zero for repeated lookups.
> However, if a client of the re-export server walks paths and memory maps those
> files (i.e. loading an application), the re-export server starts issuing
> unexpected calls back to the origin server again, ignoring/invalidating the
> re-export server's NFS client cache. We worked around this this by patching an
> inode/iversion validity check in inode.c so that the NFS client cache on the
> re-export server is used. I'm not sure about the correctness of this patch but
> it works for our corner case.
> 
> 3) If we saturate an NFS client's network with reads from the server, all client
> metadata lookups become unbearably slow even if it's all cached in the NFS
> client's memory and no network RPCs should be required. This is the case for
> any NFS client regardless of re-exporting but it affects this case more because
> when we can't serve cached metadata we also can't serve the cached data. It
> feels like some sort of bottleneck in the client's ability to parallelise
> requests? We work around this by not maxing out our network.
> 
> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per second)
> quickly eat up the CPU on the re-export server and perf top shows we are mostly
> in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache
> like that added to NFSv3? Our workaround is to either fix the thing doing lots
> of repeated open/closes or use NFSv3 instead.
> 
> If you made it this far, I've probably taken up way too much of your valuable
> time already. If nobody is interested in this rather niche application of the
> Linux client & knfsd, then I totally understand and I will not mention it here
> again. If your interest is piqued however, I'm happy to go into more detail
> about any of this with the hope that this could become a better documented and
> understood type of setup that others with similar workloads could reference.
> 
> Also, many thanks to all the Linux NFS developers for the amazing work you do
> which, in turn, helps us to make great movies. :)
> 
> Daire (Head of Systems DNEG)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-08  9:40 ` Mkrtchyan, Tigran
@ 2020-09-08 11:06   ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-08 11:06 UTC (permalink / raw)
  To: Mkrtchyan, Tigran; +Cc: linux-nfs, linux-cachefs

Tigran,

I guess I never really considered overlayfs because we still want to seamlessly write through to the original servers from time to time and post processing the copies from upper to lower seems like it might be hard to make reliable or do with low latency? I would also worry that our lower filesystem is being actively updated by processes outside of the overlay clients and how overlayfs would deal with that. And ultimately, the COW nature of overlayfs is a somewhat wasted feature for our workloads whereby it's the caching of file reads (and metadata) we care most about.

I must confess to not having looked at overlayfs in a few years so there may be lots of new tricks and options that would help our case. I'm aware that it gained the ability to NFS (re-)export a couple of years back.

But I'm certainly now interested to know if that NFS re-export implementation fares any better with the issues I experience with a direct knfsd re-export of an NFS client. So I will do some testing with overlayfs and see how it stacks up (see what I did there?).

Thanks for the suggestion!

Daire

----- On 8 Sep, 2020, at 10:40, Mkrtchyan, Tigran tigran.mkrtchyan@desy.de wrote:

> Just out of curiosity:
> 
> did you have tries instead of re-exporting nfs mount directly
> re-export an overlayfs mount on top of the original nfs mount?
> Such setup should cover most of your issues.
> 
> Regards,
>   Tigran.
> 
> ----- Original Message -----
>> From: "Daire Byrne" <daire@dneg.com>
>> To: "linux-nfs" <linux-nfs@vger.kernel.org>
>> Cc: linux-cachefs@redhat.com
>> Sent: Monday, September 7, 2020 7:31:00 PM
>> Subject: Adventures in NFS re-exporting
> 
>> Hi,
>> 
>> Apologies for this rather long email, but I thought there may be some interest
>> out there in the community in how and why we've been doing something
>> unsupported and barely documented - NFS re-exporting! And I'm not sure I can
>> tell our story well in just a few short sentences so please bear with me (or
>> stop now!).
>> 
>> Full disclosure - I am also rather hoping that this story piques some interest
>> amongst developers to help make our rather niche setup even better and perhaps
>> a little better documented. I also totally understand if this is something
>> people wouldn't want to touch with a very long barge pole....
>> 
>> First a quick bit of history (I hope I have this right). Late in 2015, Jeff
>> Layton proposed a patch series allowing knfsd to re-export a NFS client mount.
>> The rationale then was to provide a "proxy" server that could mount an NFSv4
>> only server and re-export it to older clients that only supported NFSv3. One of
>> the main sticking points then (as now), was around the 63 byte limit of
>> filehandles for NFSv3 and how it couldn't be guaranteed that all re-exported
>> filehandles would fit within that (in my experience it mostly works with
>> "no_subtree_check"). There are also the usual locking and coherence concerns
>> with NFSv3 too but I'll get to that in a bit.
>> 
>> Then almost two years later, v4.13 was released including parts of the patch
>> series that actually allowed the re-export and since then other relevant bits
>> (such as the open file cache) have also been merged. I soon became interested
>> in using this new functionality to both accelerate our on-premises NFS storage
>> and use it as a "WAN cache" to provide cloud compute instances locally cached
>> proxy access to our on-premises storage.
>> 
>> Cut to a brief introduction to us and what we do... DNEG is an award winning VFX
>> company which uses large compute farms to generate complex final frame renders
>> for movies and TV. This workload mostly consists of reads of common data shared
>> between many render clients (e.g textures, geometry) and a little unique data
>> per frame. All file writes are to unique files per process (frames) and there
>> is very little if any writing over existing files. Hence it's not very
>> demanding on locking and coherence guarantees.
>> 
>> When our on-premises NFS storage is being overloaded or the server's network is
>> maxed out, we can place multiple re-export servers in between them and our farm
>> to improve performance. When our on-premises render farm is not quite big
>> enough to meet a deadline, we spin up compute instances with a (reasonably
>> local) cloud provider. Some of these cloud instances are Linux NFS servers
>> which mount our on-premises NFS storage servers (~10ms away) and re-export
>> these to the other cloud (render) instances. Since we know that the data we are
>> reading doesn't change often, we can increase the actimeo and even use nocto to
>> reduce the network chatter back to the on-prem servers. These re-export servers
>> also use fscache/cachefiles to cache data to disk so that we can retain TBs of
>> previously read data locally in the cloud over long periods of time. We also
>> use NFSv4 (less network chatter) all the way from our on-prem storage to the
>> re-export server and then on to the clients.
>> 
>> The re-export server(s) quickly builds up both a memory cache and disk backed
>> fscache/cachefiles storage cache of our working data set so the data being
>> pulled from on-prem lessens over time. Data is only ever read once over the WAN
>> network from on-prem storage and then read multiple times by the many render
>> client instances in the cloud. Recent NFS features such as "nconnect" help to
>> speed up the initial reading of data from on-prem by using multiple connections
>> to offset TCP latency. At the end of the render, we write the files back
>> through the re-export server to our on-prem storage. Our average read bandwidth
>> is many times higher than our write bandwidth.
>> 
>> Rather surprisingly, this mostly works for our particular workloads. We've
>> completed movies using this setup and saved money on commercial caching systems
>> (e.g Avere, GPFS, etc). But there are still some remaining issues with doing
>> something that is very much not widely supported (or recommended). In most
>> cases we have worked around them, but it would be great if we didn't have to so
>> others could also benefit. I will list the main problems quickly now and
>> provide more information and reproducers later if anyone is interested.
>> 
>> 1) The kernel can drop entries out of the NFS client inode cache (under memory
>> cache churn) when those filehandles are still being used by the knfsd's remote
>> clients resulting in sporadic and random stale filehandles. This seems to be
>> mostly for directories from what I've seen. Does the NFS client not know that
>> knfsd is still using those files/dirs? The workaround is to never drop inode &
>> dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps
>> to ensure that we actually make the most of our actimeo=3600,nocto mount
>> options for the full specified time.
>> 
>> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> cut the network packets back to the origin server to zero for repeated lookups.
>> However, if a client of the re-export server walks paths and memory maps those
>> files (i.e. loading an application), the re-export server starts issuing
>> unexpected calls back to the origin server again, ignoring/invalidating the
>> re-export server's NFS client cache. We worked around this this by patching an
>> inode/iversion validity check in inode.c so that the NFS client cache on the
>> re-export server is used. I'm not sure about the correctness of this patch but
>> it works for our corner case.
>> 
>> 3) If we saturate an NFS client's network with reads from the server, all client
>> metadata lookups become unbearably slow even if it's all cached in the NFS
>> client's memory and no network RPCs should be required. This is the case for
>> any NFS client regardless of re-exporting but it affects this case more because
>> when we can't serve cached metadata we also can't serve the cached data. It
>> feels like some sort of bottleneck in the client's ability to parallelise
>> requests? We work around this by not maxing out our network.
>> 
>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per second)
>> quickly eat up the CPU on the re-export server and perf top shows we are mostly
>> in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache
>> like that added to NFSv3? Our workaround is to either fix the thing doing lots
>> of repeated open/closes or use NFSv3 instead.
>> 
>> If you made it this far, I've probably taken up way too much of your valuable
>> time already. If nobody is interested in this rather niche application of the
>> Linux client & knfsd, then I totally understand and I will not mention it here
>> again. If your interest is piqued however, I'm happy to go into more detail
>> about any of this with the hope that this could become a better documented and
>> understood type of setup that others with similar workloads could reference.
>> 
>> Also, many thanks to all the Linux NFS developers for the amazing work you do
>> which, in turn, helps us to make great movies. :)
>> 
> > Daire (Head of Systems DNEG)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
  2020-09-08  9:40 ` Mkrtchyan, Tigran
@ 2020-09-15 17:21 ` J. Bruce Fields
  2020-09-15 19:59   ` Trond Myklebust
  2020-09-17 16:01   ` Daire Byrne
  2020-09-22 12:31 ` Daire Byrne
  2 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-09-15 17:21 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs

On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
> 1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) when those filehandles are still being used by the knfsd's remote clients resulting in sporadic and random stale filehandles. This seems to be mostly for directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.

I thought reexport worked by embedding the original server's filehandles
in the filehandles given out by the reexporting server.

So, even if nothing's cached, when the reexporting server gets a
filehandle, it should be able to extract the original filehandle from it
and use that.

I wonder why that's not working?

> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> second) quickly eat up the CPU on the re-export server and perf top
> shows we are mostly in native_queued_spin_lock_slowpath.

Any statistics on who's calling that function?

> Does NFSv4
> also need an open file cache like that added to NFSv3? Our workaround
> is to either fix the thing doing lots of repeated open/closes or use
> NFSv3 instead.

NFSv4 uses the same file cache.  It might be the file cache that's at
fault, in fact....

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-15 17:21 ` J. Bruce Fields
@ 2020-09-15 19:59   ` Trond Myklebust
  2020-09-16 16:01     ` Daire Byrne
  2020-09-17 16:01   ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-09-15 19:59 UTC (permalink / raw)
  To: bfields, daire; +Cc: linux-cachefs, linux-nfs

On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
> > 1) The kernel can drop entries out of the NFS client inode cache
> > (under memory cache churn) when those filehandles are still being
> > used by the knfsd's remote clients resulting in sporadic and random
> > stale filehandles. This seems to be mostly for directories from
> > what I've seen. Does the NFS client not know that knfsd is still
> > using those files/dirs? The workaround is to never drop inode &
> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
> > also helps to ensure that we actually make the most of our
> > actimeo=3600,nocto mount options for the full specified time.
> 
> I thought reexport worked by embedding the original server's
> filehandles
> in the filehandles given out by the reexporting server.
> 
> So, even if nothing's cached, when the reexporting server gets a
> filehandle, it should be able to extract the original filehandle from
> it
> and use that.
> 
> I wonder why that's not working?

NFSv3? If so, I suspect it is because we never wrote a lookupp()
callback for it.

> 
> > 4) With an NFSv4 re-export, lots of open/close requests (hundreds
> > per
> > second) quickly eat up the CPU on the re-export server and perf top
> > shows we are mostly in native_queued_spin_lock_slowpath.
> 
> Any statistics on who's calling that function?
> 
> > Does NFSv4
> > also need an open file cache like that added to NFSv3? Our
> > workaround
> > is to either fix the thing doing lots of repeated open/closes or
> > use
> > NFSv3 instead.
> 
> NFSv4 uses the same file cache.  It might be the file cache that's at
> fault, in fact....
> 
> --b.
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-15 19:59   ` Trond Myklebust
@ 2020-09-16 16:01     ` Daire Byrne
  2020-10-19 16:19       ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-16 16:01 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-cachefs, linux-nfs

Trond/Bruce,

----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:

> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>> > 1) The kernel can drop entries out of the NFS client inode cache
>> > (under memory cache churn) when those filehandles are still being
>> > used by the knfsd's remote clients resulting in sporadic and random
>> > stale filehandles. This seems to be mostly for directories from
>> > what I've seen. Does the NFS client not know that knfsd is still
>> > using those files/dirs? The workaround is to never drop inode &
>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>> > also helps to ensure that we actually make the most of our
>> > actimeo=3600,nocto mount options for the full specified time.
>> 
>> I thought reexport worked by embedding the original server's
>> filehandles
>> in the filehandles given out by the reexporting server.
>> 
>> So, even if nothing's cached, when the reexporting server gets a
>> filehandle, it should be able to extract the original filehandle from
>> it
>> and use that.
>> 
>> I wonder why that's not working?
> 
> NFSv3? If so, I suspect it is because we never wrote a lookupp()
> callback for it.

So in terms of the ESTALE counter on the reexport server, we see it increase if the end client mounts the reexport using either NFSv3 or NFSv4. But there is a difference in the client experience in that with NFSv3 we quickly get input/output errors but with NFSv4 we don't. But it does seem like the performance drops significantly which makes me think that NFSv4 retries the lookups (which succeed) when an ESTALE is reported but NFSv3 does not?

This is the simplest reproducer I could come up with but it may still be specific to our workloads/applications and hard to replicate exactly.

nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro reexport-server:/vol/software /mnt/software
nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee /proc/sys/vm/drop_caches; done

reexport-server # sysctl -w vm.vfs_cache_pressure=100
reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep 10; done

Where "application" is some big application with lots of paths to scan with libs to memory map and "/vol/software" is an NFS mount on the reexport-server from another originating NFS server. I don't know why this application loading workload shows this best, but perhaps the access patterns of memory mapped binaries and libs is particularly susceptible to estale?

With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches" repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache. The ESTALE count increases and the client running the application reports input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.

As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter no longer increases and the client experiences no issues (NFSv3 & NFSv4).

>> > 4) With an NFSv4 re-export, lots of open/close requests (hundreds
>> > per
>> > second) quickly eat up the CPU on the re-export server and perf top
>> > shows we are mostly in native_queued_spin_lock_slowpath.
>> 
>> Any statistics on who's calling that function?

I have not managed to devise a good reproducer for this as I suspect it requires large numbers of clients. So, I will have to use some production load to replicate it and it will take me a day or two to get something back to you.

Would something from a perf report be of particular interest (e.g. the call graph) or even a /proc/X/stack of a high CPU nfsd thread?

I do recall that nfsd_file_lru_cb and __list_lru_walk_one were usually right below native_queued_spin_lock_slowpath as the next most busy functions in perf top (with NFSv4 exporting). Perhaps this is less of an NFS reexport phenomenon and would be the case for any NFSv4 export of a particularly "slow" underlying filesystem?

>> > Does NFSv4
>> > also need an open file cache like that added to NFSv3? Our
>> > workaround
>> > is to either fix the thing doing lots of repeated open/closes or
>> > use
>> > NFSv3 instead.
>> 
>> NFSv4 uses the same file cache.  It might be the file cache that's at
>> fault, in fact....

Ah, my misunderstanding. I had assumed the open file descriptor cache was of more benefit to NFSv3 and that NFSv4 did not necessarily require it for performance.

I might also be able to do a test with a kernel version from before when that feature landed to see if NFSv4 reexport performs any different. 

Cheers,

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-15 17:21 ` J. Bruce Fields
  2020-09-15 19:59   ` Trond Myklebust
@ 2020-09-17 16:01   ` Daire Byrne
  2020-09-17 19:09     ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-17 16:01 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-cachefs


----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:

>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> second) quickly eat up the CPU on the re-export server and perf top
>> shows we are mostly in native_queued_spin_lock_slowpath.
> 
> Any statistics on who's calling that function?

I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.

With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.

The perf top summary looks like this:

# Overhead  Command          Shared Object                 Symbol                                                 
# ........  ...............  ............................  .......................................................
#
    82.91%  nfsd             [kernel.kallsyms]             [k] native_queued_spin_lock_slowpath
     8.24%  swapper          [kernel.kallsyms]             [k] intel_idle
     4.66%  nfsd             [kernel.kallsyms]             [k] __list_lru_walk_one
     0.80%  nfsd             [kernel.kallsyms]             [k] nfsd_file_lru_cb

And the call graph (not sure how this will format):

- nfsd
   - 89.34% svc_process
      - 88.94% svc_process_common
         - 88.87% nfsd_dispatch
            - 88.82% nfsd4_proc_compound
               - 53.97% nfsd4_open
                  - 53.95% nfsd4_process_open2
                     - 53.87% nfs4_get_vfs_file
                        - 53.48% nfsd_file_acquire
                           - 33.31% nfsd_file_lru_walk_list
                              - 33.28% list_lru_walk_node                    
                                 - 33.28% list_lru_walk_one                  
                                    - 30.21% _raw_spin_lock
                                       - 30.21% queued_spin_lock_slowpath
                                            30.20% native_queued_spin_lock_slowpath
                                      2.46% __list_lru_walk_one
                           - 19.39% list_lru_add
                              - 19.39% _raw_spin_lock
                                 - 19.39% queued_spin_lock_slowpath
                                      19.38% native_queued_spin_lock_slowpath
               - 34.46% nfsd4_close
                  - 34.45% nfs4_put_stid
                     - 34.45% nfs4_free_ol_stateid
                        - 34.45% release_all_access
                           - 34.45% nfs4_file_put_access
                              - 34.45% __nfs4_file_put_access.part.81
                                 - 34.45% nfsd_file_put
                                    - 34.44% nfsd_file_lru_walk_list
                                       - 34.40% list_lru_walk_node
                                          - 34.40% list_lru_walk_one
                                             - 31.27% _raw_spin_lock
                                                - 31.27% queued_spin_lock_slowpath
                                                     31.26% native_queued_spin_lock_slowpath
                                               2.50% __list_lru_walk_one
                                               0.50% nfsd_file_lru_cb


The original NFS server is mounted by the reexport server using NFSv4.2. As soon as we switch the clients to mount the reexport server with NFSv3, the high CPU usage goes away and we start to see expected performance for this workload and server hardware.

I'm happy to share perf data or anything else that is useful and I can repeatedly run this production load as required.

Cheers,

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-17 16:01   ` Daire Byrne
@ 2020-09-17 19:09     ` bfields
  2020-09-17 20:23       ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-09-17 19:09 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs, Frank van der Linden

On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> 
> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> 
> >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> >> second) quickly eat up the CPU on the re-export server and perf top
> >> shows we are mostly in native_queued_spin_lock_slowpath.
> > 
> > Any statistics on who's calling that function?
> 
> I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> 
> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.

That sounds a lot like what Frank Van der Linden reported:

	https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/

It looks like a bug in the filehandle caching code.

--b.

> 
> The perf top summary looks like this:
> 
> # Overhead  Command          Shared Object                 Symbol                                                 
> # ........  ...............  ............................  .......................................................
> #
>     82.91%  nfsd             [kernel.kallsyms]             [k] native_queued_spin_lock_slowpath
>      8.24%  swapper          [kernel.kallsyms]             [k] intel_idle
>      4.66%  nfsd             [kernel.kallsyms]             [k] __list_lru_walk_one
>      0.80%  nfsd             [kernel.kallsyms]             [k] nfsd_file_lru_cb
> 
> And the call graph (not sure how this will format):
> 
> - nfsd
>    - 89.34% svc_process
>       - 88.94% svc_process_common
>          - 88.87% nfsd_dispatch
>             - 88.82% nfsd4_proc_compound
>                - 53.97% nfsd4_open
>                   - 53.95% nfsd4_process_open2
>                      - 53.87% nfs4_get_vfs_file
>                         - 53.48% nfsd_file_acquire
>                            - 33.31% nfsd_file_lru_walk_list
>                               - 33.28% list_lru_walk_node                    
>                                  - 33.28% list_lru_walk_one                  
>                                     - 30.21% _raw_spin_lock
>                                        - 30.21% queued_spin_lock_slowpath
>                                             30.20% native_queued_spin_lock_slowpath
>                                       2.46% __list_lru_walk_one
>                            - 19.39% list_lru_add
>                               - 19.39% _raw_spin_lock
>                                  - 19.39% queued_spin_lock_slowpath
>                                       19.38% native_queued_spin_lock_slowpath
>                - 34.46% nfsd4_close
>                   - 34.45% nfs4_put_stid
>                      - 34.45% nfs4_free_ol_stateid
>                         - 34.45% release_all_access
>                            - 34.45% nfs4_file_put_access
>                               - 34.45% __nfs4_file_put_access.part.81
>                                  - 34.45% nfsd_file_put
>                                     - 34.44% nfsd_file_lru_walk_list
>                                        - 34.40% list_lru_walk_node
>                                           - 34.40% list_lru_walk_one
>                                              - 31.27% _raw_spin_lock
>                                                 - 31.27% queued_spin_lock_slowpath
>                                                      31.26% native_queued_spin_lock_slowpath
>                                                2.50% __list_lru_walk_one
>                                                0.50% nfsd_file_lru_cb
> 
> 
> The original NFS server is mounted by the reexport server using NFSv4.2. As soon as we switch the clients to mount the reexport server with NFSv3, the high CPU usage goes away and we start to see expected performance for this workload and server hardware.
> 
> I'm happy to share perf data or anything else that is useful and I can repeatedly run this production load as required.
> 
> Cheers,
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-17 19:09     ` bfields
@ 2020-09-17 20:23       ` Frank van der Linden
  2020-09-17 21:57         ` bfields
  2020-09-22 16:43         ` Chuck Lever
  0 siblings, 2 replies; 129+ messages in thread
From: Frank van der Linden @ 2020-09-17 20:23 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, linux-nfs, linux-cachefs

[-- Attachment #1: Type: text/plain, Size: 2394 bytes --]

On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> 
> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> >
> > ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> >
> > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> > >> second) quickly eat up the CPU on the re-export server and perf top
> > >> shows we are mostly in native_queued_spin_lock_slowpath.
> > >
> > > Any statistics on who's calling that function?
> >
> > I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> >
> > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
> 
> That sounds a lot like what Frank Van der Linden reported:
> 
>         https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> 
> It looks like a bug in the filehandle caching code.
> 
> --b.

Yes, that does look like the same one.

I still think that not caching v4 files at all may be the best way to go
here, since the intent of the filecache code was to speed up v2/v3 I/O,
where you end up doing a lot of opens/closes, but it doesn't make as
much sense for v4.

However, short of that, I tested a local patch a few months back, that
I never posted here, so I'll do so now. It just makes v4 opens in to
'long term' opens, which do not get put on the LRU, since that doesn't
make sense (they are in the hash table, so they are still cached).

Also, the file caching code seems to walk the LRU a little too often,
but that's another issue - and this change keeps the LRU short, so it's
not a big deal.

I don't particularly love this patch, but it does keep the LRU short, and
did significantly speed up my testcase (by about 50%). So, maybe you can
give it a try.

I'll also attach a second patch, that converts the hash table to an rhashtable,
which automatically grows and shrinks in size with usage. That patch also
helped, but not by nearly as much (I think it yielded another 10%).

- Frank

[-- Attachment #2: 0001-nfsd-don-t-put-nfsd_files-with-long-term-refs-on-the.patch --]
[-- Type: text/plain, Size: 6718 bytes --]

From 057a24e1b3744c716e4956eb34c2d15ed719db23 Mon Sep 17 00:00:00 2001
From: Frank van der Linden <fllinden@amazon.com>
Date: Fri, 26 Jun 2020 22:35:01 +0000
Subject: [PATCH 1/2] nfsd: don't put nfsd_files with long term refs on the LRU
 list

Files with long term references, as created by v4 OPENs, will
just clutter the LRU list without a chance of being reaped.
So, don't put them there at all.

When finding a file in the hash table for a long term ref, remove
it from the LRU list.

When dropping the last long term ref, add it back to the LRU list.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 fs/nfsd/filecache.c | 81 ++++++++++++++++++++++++++++++++++++++++-----
 fs/nfsd/filecache.h |  6 ++++
 fs/nfsd/nfs4state.c |  2 +-
 3 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 82198d747c4c..5ef6bb802f24 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -186,6 +186,7 @@ nfsd_file_alloc(struct inode *inode, unsigned int may, unsigned int hashval,
 		nf->nf_inode = inode;
 		nf->nf_hashval = hashval;
 		refcount_set(&nf->nf_ref, 1);
+		atomic_set(&nf->nf_lref, 0);
 		nf->nf_may = may & NFSD_FILE_MAY_MASK;
 		if (may & NFSD_MAY_NOT_BREAK_LEASE) {
 			if (may & NFSD_MAY_WRITE)
@@ -297,13 +298,26 @@ nfsd_file_put_noref(struct nfsd_file *nf)
 	}
 }
 
-void
-nfsd_file_put(struct nfsd_file *nf)
+static void
+__nfsd_file_put(struct nfsd_file *nf, unsigned int flags)
 {
 	bool is_hashed;
+	int refs;
+
+	refs = refcount_read(&nf->nf_ref);
+
+	if (flags & NFSD_ACQ_FILE_LONGTERM) {
+		/*
+		 * If we're dropping the last long term ref, and there
+		 * are other references, put the file on the LRU list,
+		 * as it now makes sense for it to be there.
+		 */
+		if (atomic_dec_return(&nf->nf_lref) == 0 && refs > 2)
+			list_lru_add(&nfsd_file_lru, &nf->nf_lru);
+	} else
+		set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
 
-	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
-	if (refcount_read(&nf->nf_ref) > 2 || !nf->nf_file) {
+	if (refs > 2 || !nf->nf_file) {
 		nfsd_file_put_noref(nf);
 		return;
 	}
@@ -317,6 +331,18 @@ nfsd_file_put(struct nfsd_file *nf)
 		nfsd_file_gc();
 }
 
+void
+nfsd_file_put(struct nfsd_file *nf)
+{
+	__nfsd_file_put(nf, 0);
+}
+
+void
+nfsd_file_put_longterm(struct nfsd_file *nf)
+{
+	__nfsd_file_put(nf, NFSD_ACQ_FILE_LONGTERM);
+}
+
 struct nfsd_file *
 nfsd_file_get(struct nfsd_file *nf)
 {
@@ -934,13 +960,14 @@ nfsd_file_is_cached(struct inode *inode)
 	return ret;
 }
 
-__be32
-nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		  unsigned int may_flags, struct nfsd_file **pnf)
+static __be32
+__nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **pnf,
+		  unsigned int flags)
 {
 	__be32	status;
 	struct net *net = SVC_NET(rqstp);
-	struct nfsd_file *nf, *new;
+	struct nfsd_file *nf, *new = NULL;
 	struct inode *inode;
 	unsigned int hashval;
 	bool retry = true;
@@ -1006,6 +1033,16 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		}
 	}
 out:
+	if (flags & NFSD_ACQ_FILE_LONGTERM) {
+		/*
+		 * A file with long term (v4) references will needlessly
+		 * clutter the LRU, so remove it when adding the first
+		 * long term ref.
+		 */
+		if (!new && atomic_inc_return(&nf->nf_lref) == 1)
+			list_lru_del(&nfsd_file_lru, &nf->nf_lru);
+	}
+
 	if (status == nfs_ok) {
 		*pnf = nf;
 	} else {
@@ -1021,7 +1058,18 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	refcount_inc(&nf->nf_ref);
 	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
 	__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
-	list_lru_add(&nfsd_file_lru, &nf->nf_lru);
+
+	/*
+	 * Don't add a new file to the LRU if it's a long term reference.
+	 * It is still added to the hash table, so it may be added to the
+	 * LRU later, when the number of long term references drops back
+	 * to zero, and there are other references.
+	 */
+	if (flags & NFSD_ACQ_FILE_LONGTERM)
+		atomic_inc(&nf->nf_lref);
+	else
+		list_lru_add(&nfsd_file_lru, &nf->nf_lru);
+
 	hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
 	++nfsd_file_hashtbl[hashval].nfb_count;
 	nfsd_file_hashtbl[hashval].nfb_maxcount = max(nfsd_file_hashtbl[hashval].nfb_maxcount,
@@ -1054,6 +1102,21 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	goto out;
 }
 
+__be32
+nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **pnf)
+{
+	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf, 0);
+}
+
+__be32
+nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **pnf)
+{
+	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf,
+				  NFSD_ACQ_FILE_LONGTERM);
+}
+
 /*
  * Note that fields may be added, removed or reordered in the future. Programs
  * scraping this file for info should test the labels to ensure they're
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 7872df5a0fe3..6e1db77d7148 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -44,21 +44,27 @@ struct nfsd_file {
 	struct inode		*nf_inode;
 	unsigned int		nf_hashval;
 	refcount_t		nf_ref;
+	atomic_t		nf_lref;
 	unsigned char		nf_may;
 	struct nfsd_file_mark	*nf_mark;
 	struct rw_semaphore	nf_rwsem;
 };
 
+#define NFSD_ACQ_FILE_LONGTERM	0x0001
+
 int nfsd_file_cache_init(void);
 void nfsd_file_cache_purge(struct net *);
 void nfsd_file_cache_shutdown(void);
 int nfsd_file_cache_start_net(struct net *net);
 void nfsd_file_cache_shutdown_net(struct net *net);
 void nfsd_file_put(struct nfsd_file *nf);
+void nfsd_file_put_longterm(struct nfsd_file *nf);
 struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
 void nfsd_file_close_inode_sync(struct inode *inode);
 bool nfsd_file_is_cached(struct inode *inode);
 __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  unsigned int may_flags, struct nfsd_file **nfp);
+__be32 nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  unsigned int may_flags, struct nfsd_file **nfp);
 int	nfsd_file_cache_stats_open(struct inode *, struct file *);
 #endif /* _FS_NFSD_FILECACHE_H */
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index bb3d2c32664a..451a1071daf4 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4838,7 +4838,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
 
 	if (!fp->fi_fds[oflag]) {
 		spin_unlock(&fp->fi_lock);
-		status = nfsd_file_acquire(rqstp, cur_fh, access, &nf);
+		status = nfsd_file_acquire_longterm(rqstp, cur_fh, access, &nf);
 		if (status)
 			goto out_put_access;
 		spin_lock(&fp->fi_lock);
-- 
2.17.2


[-- Attachment #3: 0002-nfsd-change-file_hashtbl-to-an-rhashtable.patch --]
[-- Type: text/plain, Size: 8058 bytes --]

From 79e7ffd01482d90cd5f6e98b5a362bbf95ea9b2c Mon Sep 17 00:00:00 2001
From: Frank van der Linden <fllinden@amazon.com>
Date: Thu, 16 Jul 2020 21:35:29 +0000
Subject: [PATCH 2/2] nfsd: change file_hashtbl to an rhashtable

file_hashtbl can grow quite large, so use rhashtable, which has
automatic growing (and shrinking).

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 fs/nfsd/nfs4state.c | 112 +++++++++++++++++++++++++++++---------------
 fs/nfsd/nfsctl.c    |   7 ++-
 fs/nfsd/nfsd.h      |   4 ++
 fs/nfsd/state.h     |   3 +-
 4 files changed, 86 insertions(+), 40 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 451a1071daf4..ff81c0136224 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -417,13 +417,33 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 	kmem_cache_free(file_slab, fp);
 }
 
+/* hash table for nfs4_file */
+#define FILE_HASH_SIZE		256
+
+static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed);
+static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed);
+static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
+				  const void *obj);
+
+static const struct rhashtable_params file_rhashparams = {
+	.head_offset		= offsetof(struct nfs4_file, fi_hash),
+	.min_size		= FILE_HASH_SIZE,
+	.automatic_shrinking	= true,
+	.hashfn			= nfsd4_file_key_hash,
+	.obj_hashfn		= nfsd4_file_obj_hash,
+	.obj_cmpfn		= nfsd4_file_obj_compare,
+};
+
+struct rhashtable file_hashtbl;
+
 void
 put_nfs4_file(struct nfs4_file *fi)
 {
 	might_lock(&state_lock);
 
 	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
-		hlist_del_rcu(&fi->fi_hash);
+		rhashtable_remove_fast(&file_hashtbl, &fi->fi_hash,
+				       file_rhashparams);
 		spin_unlock(&state_lock);
 		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
 		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
@@ -527,21 +547,33 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
 	return ret & OWNER_HASH_MASK;
 }
 
-/* hash table for nfs4_file */
-#define FILE_HASH_BITS                   8
-#define FILE_HASH_SIZE                  (1 << FILE_HASH_BITS)
-
-static unsigned int nfsd_fh_hashval(struct knfsd_fh *fh)
+static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed)
 {
-	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), 0);
+	struct knfsd_fh *fh = (struct knfsd_fh *)data;
+
+	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
 }
 
-static unsigned int file_hashval(struct knfsd_fh *fh)
+static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed)
 {
-	return nfsd_fh_hashval(fh) & (FILE_HASH_SIZE - 1);
+	struct nfs4_file *fp = (struct nfs4_file *)data;
+	struct knfsd_fh *fh;
+
+	fh = &fp->fi_fhandle;
+
+	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
 }
 
-static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
+static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
+				  const void *obj)
+{
+	struct nfs4_file *fp = (struct nfs4_file *)obj;
+
+	if (fh_match(&fp->fi_fhandle, (struct knfsd_fh *)arg->key))
+		return 0;
+
+	return 1;
+}
 
 static void
 __nfs4_file_get_access(struct nfs4_file *fp, u32 access)
@@ -4042,8 +4074,7 @@ static struct nfs4_file *nfsd4_alloc_file(void)
 }
 
 /* OPEN Share state helper functions */
-static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
-				struct nfs4_file *fp)
+static void nfsd4_init_file(struct knfsd_fh *fh, struct nfs4_file *fp)
 {
 	lockdep_assert_held(&state_lock);
 
@@ -4062,7 +4093,6 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
 	INIT_LIST_HEAD(&fp->fi_lo_states);
 	atomic_set(&fp->fi_lo_recalls, 0);
 #endif
-	hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
 }
 
 void
@@ -4126,6 +4156,18 @@ nfsd4_init_slabs(void)
 	return -ENOMEM;
 }
 
+int
+nfsd4_init_hash(void)
+{
+	return rhashtable_init(&file_hashtbl, &file_rhashparams);
+}
+
+void
+nfsd4_free_hash(void)
+{
+	rhashtable_destroy(&file_hashtbl);
+}
+
 static void init_nfs4_replay(struct nfs4_replay *rp)
 {
 	rp->rp_status = nfserr_serverfault;
@@ -4395,30 +4437,19 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
 }
 
 /* search file_hashtbl[] for file */
-static struct nfs4_file *
-find_file_locked(struct knfsd_fh *fh, unsigned int hashval)
-{
-	struct nfs4_file *fp;
-
-	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
-				lockdep_is_held(&state_lock)) {
-		if (fh_match(&fp->fi_fhandle, fh)) {
-			if (refcount_inc_not_zero(&fp->fi_ref))
-				return fp;
-		}
-	}
-	return NULL;
-}
-
 struct nfs4_file *
 find_file(struct knfsd_fh *fh)
 {
 	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
 
 	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
+	fp = rhashtable_lookup(&file_hashtbl, fh, file_rhashparams);
+	if (fp) {
+		if (IS_ERR(fp) || refcount_inc_not_zero(&fp->fi_ref))
+			fp = NULL;
+	}
 	rcu_read_unlock();
+
 	return fp;
 }
 
@@ -4426,22 +4457,27 @@ static struct nfs4_file *
 find_or_add_file(struct nfs4_file *new, struct knfsd_fh *fh)
 {
 	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
 
-	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
-	rcu_read_unlock();
+	fp = find_file(fh);
 	if (fp)
 		return fp;
 
+	nfsd4_init_file(fh, new);
+
 	spin_lock(&state_lock);
-	fp = find_file_locked(fh, hashval);
-	if (likely(fp == NULL)) {
-		nfsd4_init_file(fh, hashval, new);
+
+	fp = rhashtable_lookup_get_insert_key(&file_hashtbl, &new->fi_fhandle,
+	    &new->fi_hash, file_rhashparams);
+	if (likely(fp == NULL))
 		fp = new;
-	}
+	else if (IS_ERR(fp))
+		fp = NULL;
+	else
+		refcount_inc(&fp->fi_ref);
+
 	spin_unlock(&state_lock);
 
+
 	return fp;
 }
 
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index b68e96681522..bac5d8cff1d3 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1528,9 +1528,12 @@ static int __init init_nfsd(void)
 	retval = nfsd4_init_slabs();
 	if (retval)
 		goto out_unregister_notifier;
-	retval = nfsd4_init_pnfs();
+	retval = nfsd4_init_hash();
 	if (retval)
 		goto out_free_slabs;
+	retval = nfsd4_init_pnfs();
+	if (retval)
+		goto out_free_hash;
 	nfsd_fault_inject_init(); /* nfsd fault injection controls */
 	nfsd_stat_init();	/* Statistics */
 	retval = nfsd_drc_slab_create();
@@ -1554,6 +1557,8 @@ static int __init init_nfsd(void)
 	nfsd_stat_shutdown();
 	nfsd_fault_inject_cleanup();
 	nfsd4_exit_pnfs();
+out_free_hash:
+	nfsd4_free_hash();
 out_free_slabs:
 	nfsd4_free_slabs();
 out_unregister_notifier:
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 5343c771da18..fb0349d16158 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -141,6 +141,8 @@ nfsd_user_namespace(const struct svc_rqst *rqstp)
 extern unsigned long max_delegations;
 int nfsd4_init_slabs(void);
 void nfsd4_free_slabs(void);
+int nfsd4_init_hash(void);
+void nfsd4_free_hash(void);
 int nfs4_state_start(void);
 int nfs4_state_start_net(struct net *net);
 void nfs4_state_shutdown(void);
@@ -151,6 +153,8 @@ bool nfsd4_spo_must_allow(struct svc_rqst *rqstp);
 #else
 static inline int nfsd4_init_slabs(void) { return 0; }
 static inline void nfsd4_free_slabs(void) { }
+static inline int nfsd4_init_hash(void) { return 0; }
+static inline void nfsd4_free_hash(void) { }
 static inline int nfs4_state_start(void) { return 0; }
 static inline int nfs4_state_start_net(struct net *net) { return 0; }
 static inline void nfs4_state_shutdown(void) { }
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 3b408532a5dc..bf66244a7a2d 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -38,6 +38,7 @@
 #include <linux/idr.h>
 #include <linux/refcount.h>
 #include <linux/sunrpc/svc_xprt.h>
+#include <linux/rhashtable.h>
 #include "nfsfh.h"
 #include "nfsd.h"
 
@@ -513,7 +514,7 @@ struct nfs4_clnt_odstate {
 struct nfs4_file {
 	refcount_t		fi_ref;
 	spinlock_t		fi_lock;
-	struct hlist_node       fi_hash;	/* hash on fi_fhandle */
+	struct rhash_head	fi_hash;	/* hash on fi_fhandle */
 	struct list_head        fi_stateids;
 	union {
 		struct list_head	fi_delegations;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-17 20:23       ` Frank van der Linden
@ 2020-09-17 21:57         ` bfields
  2020-09-19 11:08           ` Daire Byrne
  2020-09-22 16:43         ` Chuck Lever
  1 sibling, 1 reply; 129+ messages in thread
From: bfields @ 2020-09-17 21:57 UTC (permalink / raw)
  To: Frank van der Linden; +Cc: Daire Byrne, linux-nfs, linux-cachefs

On Thu, Sep 17, 2020 at 08:23:03PM +0000, Frank van der Linden wrote:
> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> > 
> > On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> > >
> > > ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> > >
> > > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> > > >> second) quickly eat up the CPU on the re-export server and perf top
> > > >> shows we are mostly in native_queued_spin_lock_slowpath.
> > > >
> > > > Any statistics on who's calling that function?
> > >
> > > I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> > >
> > > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
> > 
> > That sounds a lot like what Frank Van der Linden reported:
> > 
> >         https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> > 
> > It looks like a bug in the filehandle caching code.
> > 
> > --b.
> 
> Yes, that does look like the same one.
> 
> I still think that not caching v4 files at all may be the best way to go
> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> where you end up doing a lot of opens/closes, but it doesn't make as
> much sense for v4.
> 
> However, short of that, I tested a local patch a few months back, that
> I never posted here, so I'll do so now. It just makes v4 opens in to
> 'long term' opens, which do not get put on the LRU, since that doesn't
> make sense (they are in the hash table, so they are still cached).

That makes sense to me.  But I'm also not opposed to turning it off for
v4 at this point.

--b.

> Also, the file caching code seems to walk the LRU a little too often,
> but that's another issue - and this change keeps the LRU short, so it's
> not a big deal.
> 
> I don't particularly love this patch, but it does keep the LRU short, and
> did significantly speed up my testcase (by about 50%). So, maybe you can
> give it a try.
> 
> I'll also attach a second patch, that converts the hash table to an rhashtable,
> which automatically grows and shrinks in size with usage. That patch also
> helped, but not by nearly as much (I think it yielded another 10%).
> 
> - Frank

> >From 057a24e1b3744c716e4956eb34c2d15ed719db23 Mon Sep 17 00:00:00 2001
> From: Frank van der Linden <fllinden@amazon.com>
> Date: Fri, 26 Jun 2020 22:35:01 +0000
> Subject: [PATCH 1/2] nfsd: don't put nfsd_files with long term refs on the LRU
>  list
> 
> Files with long term references, as created by v4 OPENs, will
> just clutter the LRU list without a chance of being reaped.
> So, don't put them there at all.
> 
> When finding a file in the hash table for a long term ref, remove
> it from the LRU list.
> 
> When dropping the last long term ref, add it back to the LRU list.
> 
> Signed-off-by: Frank van der Linden <fllinden@amazon.com>
> ---
>  fs/nfsd/filecache.c | 81 ++++++++++++++++++++++++++++++++++++++++-----
>  fs/nfsd/filecache.h |  6 ++++
>  fs/nfsd/nfs4state.c |  2 +-
>  3 files changed, 79 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 82198d747c4c..5ef6bb802f24 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -186,6 +186,7 @@ nfsd_file_alloc(struct inode *inode, unsigned int may, unsigned int hashval,
>  		nf->nf_inode = inode;
>  		nf->nf_hashval = hashval;
>  		refcount_set(&nf->nf_ref, 1);
> +		atomic_set(&nf->nf_lref, 0);
>  		nf->nf_may = may & NFSD_FILE_MAY_MASK;
>  		if (may & NFSD_MAY_NOT_BREAK_LEASE) {
>  			if (may & NFSD_MAY_WRITE)
> @@ -297,13 +298,26 @@ nfsd_file_put_noref(struct nfsd_file *nf)
>  	}
>  }
>  
> -void
> -nfsd_file_put(struct nfsd_file *nf)
> +static void
> +__nfsd_file_put(struct nfsd_file *nf, unsigned int flags)
>  {
>  	bool is_hashed;
> +	int refs;
> +
> +	refs = refcount_read(&nf->nf_ref);
> +
> +	if (flags & NFSD_ACQ_FILE_LONGTERM) {
> +		/*
> +		 * If we're dropping the last long term ref, and there
> +		 * are other references, put the file on the LRU list,
> +		 * as it now makes sense for it to be there.
> +		 */
> +		if (atomic_dec_return(&nf->nf_lref) == 0 && refs > 2)
> +			list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +	} else
> +		set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>  
> -	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> -	if (refcount_read(&nf->nf_ref) > 2 || !nf->nf_file) {
> +	if (refs > 2 || !nf->nf_file) {
>  		nfsd_file_put_noref(nf);
>  		return;
>  	}
> @@ -317,6 +331,18 @@ nfsd_file_put(struct nfsd_file *nf)
>  		nfsd_file_gc();
>  }
>  
> +void
> +nfsd_file_put(struct nfsd_file *nf)
> +{
> +	__nfsd_file_put(nf, 0);
> +}
> +
> +void
> +nfsd_file_put_longterm(struct nfsd_file *nf)
> +{
> +	__nfsd_file_put(nf, NFSD_ACQ_FILE_LONGTERM);
> +}
> +
>  struct nfsd_file *
>  nfsd_file_get(struct nfsd_file *nf)
>  {
> @@ -934,13 +960,14 @@ nfsd_file_is_cached(struct inode *inode)
>  	return ret;
>  }
>  
> -__be32
> -nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -		  unsigned int may_flags, struct nfsd_file **pnf)
> +static __be32
> +__nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **pnf,
> +		  unsigned int flags)
>  {
>  	__be32	status;
>  	struct net *net = SVC_NET(rqstp);
> -	struct nfsd_file *nf, *new;
> +	struct nfsd_file *nf, *new = NULL;
>  	struct inode *inode;
>  	unsigned int hashval;
>  	bool retry = true;
> @@ -1006,6 +1033,16 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		}
>  	}
>  out:
> +	if (flags & NFSD_ACQ_FILE_LONGTERM) {
> +		/*
> +		 * A file with long term (v4) references will needlessly
> +		 * clutter the LRU, so remove it when adding the first
> +		 * long term ref.
> +		 */
> +		if (!new && atomic_inc_return(&nf->nf_lref) == 1)
> +			list_lru_del(&nfsd_file_lru, &nf->nf_lru);
> +	}
> +
>  	if (status == nfs_ok) {
>  		*pnf = nf;
>  	} else {
> @@ -1021,7 +1058,18 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	refcount_inc(&nf->nf_ref);
>  	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
>  	__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> -	list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +
> +	/*
> +	 * Don't add a new file to the LRU if it's a long term reference.
> +	 * It is still added to the hash table, so it may be added to the
> +	 * LRU later, when the number of long term references drops back
> +	 * to zero, and there are other references.
> +	 */
> +	if (flags & NFSD_ACQ_FILE_LONGTERM)
> +		atomic_inc(&nf->nf_lref);
> +	else
> +		list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +
>  	hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
>  	++nfsd_file_hashtbl[hashval].nfb_count;
>  	nfsd_file_hashtbl[hashval].nfb_maxcount = max(nfsd_file_hashtbl[hashval].nfb_maxcount,
> @@ -1054,6 +1102,21 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	goto out;
>  }
>  
> +__be32
> +nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **pnf)
> +{
> +	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf, 0);
> +}
> +
> +__be32
> +nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **pnf)
> +{
> +	return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf,
> +				  NFSD_ACQ_FILE_LONGTERM);
> +}
> +
>  /*
>   * Note that fields may be added, removed or reordered in the future. Programs
>   * scraping this file for info should test the labels to ensure they're
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index 7872df5a0fe3..6e1db77d7148 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -44,21 +44,27 @@ struct nfsd_file {
>  	struct inode		*nf_inode;
>  	unsigned int		nf_hashval;
>  	refcount_t		nf_ref;
> +	atomic_t		nf_lref;
>  	unsigned char		nf_may;
>  	struct nfsd_file_mark	*nf_mark;
>  	struct rw_semaphore	nf_rwsem;
>  };
>  
> +#define NFSD_ACQ_FILE_LONGTERM	0x0001
> +
>  int nfsd_file_cache_init(void);
>  void nfsd_file_cache_purge(struct net *);
>  void nfsd_file_cache_shutdown(void);
>  int nfsd_file_cache_start_net(struct net *net);
>  void nfsd_file_cache_shutdown_net(struct net *net);
>  void nfsd_file_put(struct nfsd_file *nf);
> +void nfsd_file_put_longterm(struct nfsd_file *nf);
>  struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
>  void nfsd_file_close_inode_sync(struct inode *inode);
>  bool nfsd_file_is_cached(struct inode *inode);
>  __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		  unsigned int may_flags, struct nfsd_file **nfp);
> +__be32 nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  unsigned int may_flags, struct nfsd_file **nfp);
>  int	nfsd_file_cache_stats_open(struct inode *, struct file *);
>  #endif /* _FS_NFSD_FILECACHE_H */
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index bb3d2c32664a..451a1071daf4 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4838,7 +4838,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>  
>  	if (!fp->fi_fds[oflag]) {
>  		spin_unlock(&fp->fi_lock);
> -		status = nfsd_file_acquire(rqstp, cur_fh, access, &nf);
> +		status = nfsd_file_acquire_longterm(rqstp, cur_fh, access, &nf);
>  		if (status)
>  			goto out_put_access;
>  		spin_lock(&fp->fi_lock);
> -- 
> 2.17.2
> 

> >From 79e7ffd01482d90cd5f6e98b5a362bbf95ea9b2c Mon Sep 17 00:00:00 2001
> From: Frank van der Linden <fllinden@amazon.com>
> Date: Thu, 16 Jul 2020 21:35:29 +0000
> Subject: [PATCH 2/2] nfsd: change file_hashtbl to an rhashtable
> 
> file_hashtbl can grow quite large, so use rhashtable, which has
> automatic growing (and shrinking).
> 
> Signed-off-by: Frank van der Linden <fllinden@amazon.com>
> ---
>  fs/nfsd/nfs4state.c | 112 +++++++++++++++++++++++++++++---------------
>  fs/nfsd/nfsctl.c    |   7 ++-
>  fs/nfsd/nfsd.h      |   4 ++
>  fs/nfsd/state.h     |   3 +-
>  4 files changed, 86 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 451a1071daf4..ff81c0136224 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -417,13 +417,33 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>  	kmem_cache_free(file_slab, fp);
>  }
>  
> +/* hash table for nfs4_file */
> +#define FILE_HASH_SIZE		256
> +
> +static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed);
> +static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed);
> +static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
> +				  const void *obj);
> +
> +static const struct rhashtable_params file_rhashparams = {
> +	.head_offset		= offsetof(struct nfs4_file, fi_hash),
> +	.min_size		= FILE_HASH_SIZE,
> +	.automatic_shrinking	= true,
> +	.hashfn			= nfsd4_file_key_hash,
> +	.obj_hashfn		= nfsd4_file_obj_hash,
> +	.obj_cmpfn		= nfsd4_file_obj_compare,
> +};
> +
> +struct rhashtable file_hashtbl;
> +
>  void
>  put_nfs4_file(struct nfs4_file *fi)
>  {
>  	might_lock(&state_lock);
>  
>  	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
> -		hlist_del_rcu(&fi->fi_hash);
> +		rhashtable_remove_fast(&file_hashtbl, &fi->fi_hash,
> +				       file_rhashparams);
>  		spin_unlock(&state_lock);
>  		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
>  		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
> @@ -527,21 +547,33 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
>  	return ret & OWNER_HASH_MASK;
>  }
>  
> -/* hash table for nfs4_file */
> -#define FILE_HASH_BITS                   8
> -#define FILE_HASH_SIZE                  (1 << FILE_HASH_BITS)
> -
> -static unsigned int nfsd_fh_hashval(struct knfsd_fh *fh)
> +static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed)
>  {
> -	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), 0);
> +	struct knfsd_fh *fh = (struct knfsd_fh *)data;
> +
> +	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
>  }
>  
> -static unsigned int file_hashval(struct knfsd_fh *fh)
> +static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed)
>  {
> -	return nfsd_fh_hashval(fh) & (FILE_HASH_SIZE - 1);
> +	struct nfs4_file *fp = (struct nfs4_file *)data;
> +	struct knfsd_fh *fh;
> +
> +	fh = &fp->fi_fhandle;
> +
> +	return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
>  }
>  
> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
> +static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
> +				  const void *obj)
> +{
> +	struct nfs4_file *fp = (struct nfs4_file *)obj;
> +
> +	if (fh_match(&fp->fi_fhandle, (struct knfsd_fh *)arg->key))
> +		return 0;
> +
> +	return 1;
> +}
>  
>  static void
>  __nfs4_file_get_access(struct nfs4_file *fp, u32 access)
> @@ -4042,8 +4074,7 @@ static struct nfs4_file *nfsd4_alloc_file(void)
>  }
>  
>  /* OPEN Share state helper functions */
> -static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
> -				struct nfs4_file *fp)
> +static void nfsd4_init_file(struct knfsd_fh *fh, struct nfs4_file *fp)
>  {
>  	lockdep_assert_held(&state_lock);
>  
> @@ -4062,7 +4093,6 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
>  	INIT_LIST_HEAD(&fp->fi_lo_states);
>  	atomic_set(&fp->fi_lo_recalls, 0);
>  #endif
> -	hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
>  }
>  
>  void
> @@ -4126,6 +4156,18 @@ nfsd4_init_slabs(void)
>  	return -ENOMEM;
>  }
>  
> +int
> +nfsd4_init_hash(void)
> +{
> +	return rhashtable_init(&file_hashtbl, &file_rhashparams);
> +}
> +
> +void
> +nfsd4_free_hash(void)
> +{
> +	rhashtable_destroy(&file_hashtbl);
> +}
> +
>  static void init_nfs4_replay(struct nfs4_replay *rp)
>  {
>  	rp->rp_status = nfserr_serverfault;
> @@ -4395,30 +4437,19 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
>  }
>  
>  /* search file_hashtbl[] for file */
> -static struct nfs4_file *
> -find_file_locked(struct knfsd_fh *fh, unsigned int hashval)
> -{
> -	struct nfs4_file *fp;
> -
> -	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
> -				lockdep_is_held(&state_lock)) {
> -		if (fh_match(&fp->fi_fhandle, fh)) {
> -			if (refcount_inc_not_zero(&fp->fi_ref))
> -				return fp;
> -		}
> -	}
> -	return NULL;
> -}
> -
>  struct nfs4_file *
>  find_file(struct knfsd_fh *fh)
>  {
>  	struct nfs4_file *fp;
> -	unsigned int hashval = file_hashval(fh);
>  
>  	rcu_read_lock();
> -	fp = find_file_locked(fh, hashval);
> +	fp = rhashtable_lookup(&file_hashtbl, fh, file_rhashparams);
> +	if (fp) {
> +		if (IS_ERR(fp) || refcount_inc_not_zero(&fp->fi_ref))
> +			fp = NULL;
> +	}
>  	rcu_read_unlock();
> +
>  	return fp;
>  }
>  
> @@ -4426,22 +4457,27 @@ static struct nfs4_file *
>  find_or_add_file(struct nfs4_file *new, struct knfsd_fh *fh)
>  {
>  	struct nfs4_file *fp;
> -	unsigned int hashval = file_hashval(fh);
>  
> -	rcu_read_lock();
> -	fp = find_file_locked(fh, hashval);
> -	rcu_read_unlock();
> +	fp = find_file(fh);
>  	if (fp)
>  		return fp;
>  
> +	nfsd4_init_file(fh, new);
> +
>  	spin_lock(&state_lock);
> -	fp = find_file_locked(fh, hashval);
> -	if (likely(fp == NULL)) {
> -		nfsd4_init_file(fh, hashval, new);
> +
> +	fp = rhashtable_lookup_get_insert_key(&file_hashtbl, &new->fi_fhandle,
> +	    &new->fi_hash, file_rhashparams);
> +	if (likely(fp == NULL))
>  		fp = new;
> -	}
> +	else if (IS_ERR(fp))
> +		fp = NULL;
> +	else
> +		refcount_inc(&fp->fi_ref);
> +
>  	spin_unlock(&state_lock);
>  
> +
>  	return fp;
>  }
>  
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index b68e96681522..bac5d8cff1d3 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -1528,9 +1528,12 @@ static int __init init_nfsd(void)
>  	retval = nfsd4_init_slabs();
>  	if (retval)
>  		goto out_unregister_notifier;
> -	retval = nfsd4_init_pnfs();
> +	retval = nfsd4_init_hash();
>  	if (retval)
>  		goto out_free_slabs;
> +	retval = nfsd4_init_pnfs();
> +	if (retval)
> +		goto out_free_hash;
>  	nfsd_fault_inject_init(); /* nfsd fault injection controls */
>  	nfsd_stat_init();	/* Statistics */
>  	retval = nfsd_drc_slab_create();
> @@ -1554,6 +1557,8 @@ static int __init init_nfsd(void)
>  	nfsd_stat_shutdown();
>  	nfsd_fault_inject_cleanup();
>  	nfsd4_exit_pnfs();
> +out_free_hash:
> +	nfsd4_free_hash();
>  out_free_slabs:
>  	nfsd4_free_slabs();
>  out_unregister_notifier:
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 5343c771da18..fb0349d16158 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -141,6 +141,8 @@ nfsd_user_namespace(const struct svc_rqst *rqstp)
>  extern unsigned long max_delegations;
>  int nfsd4_init_slabs(void);
>  void nfsd4_free_slabs(void);
> +int nfsd4_init_hash(void);
> +void nfsd4_free_hash(void);
>  int nfs4_state_start(void);
>  int nfs4_state_start_net(struct net *net);
>  void nfs4_state_shutdown(void);
> @@ -151,6 +153,8 @@ bool nfsd4_spo_must_allow(struct svc_rqst *rqstp);
>  #else
>  static inline int nfsd4_init_slabs(void) { return 0; }
>  static inline void nfsd4_free_slabs(void) { }
> +static inline int nfsd4_init_hash(void) { return 0; }
> +static inline void nfsd4_free_hash(void) { }
>  static inline int nfs4_state_start(void) { return 0; }
>  static inline int nfs4_state_start_net(struct net *net) { return 0; }
>  static inline void nfs4_state_shutdown(void) { }
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 3b408532a5dc..bf66244a7a2d 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -38,6 +38,7 @@
>  #include <linux/idr.h>
>  #include <linux/refcount.h>
>  #include <linux/sunrpc/svc_xprt.h>
> +#include <linux/rhashtable.h>
>  #include "nfsfh.h"
>  #include "nfsd.h"
>  
> @@ -513,7 +514,7 @@ struct nfs4_clnt_odstate {
>  struct nfs4_file {
>  	refcount_t		fi_ref;
>  	spinlock_t		fi_lock;
> -	struct hlist_node       fi_hash;	/* hash on fi_fhandle */
> +	struct rhash_head	fi_hash;	/* hash on fi_fhandle */
>  	struct list_head        fi_stateids;
>  	union {
>  		struct list_head	fi_delegations;
> -- 
> 2.17.2
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-17 21:57         ` bfields
@ 2020-09-19 11:08           ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-19 11:08 UTC (permalink / raw)
  To: bfields; +Cc: Frank van der Linden, linux-nfs, linux-cachefs


----- On 17 Sep, 2020, at 22:57, bfields bfields@fieldses.org wrote:

> On Thu, Sep 17, 2020 at 08:23:03PM +0000, Frank van der Linden wrote:
>> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> > 
>> > On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>> > >
>> > > ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>> > >
>> > > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> > > >> second) quickly eat up the CPU on the re-export server and perf top
>> > > >> shows we are mostly in native_queued_spin_lock_slowpath.
>> > > >
>> > > > Any statistics on who's calling that function?
>> > >
>> > > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>> > > the CPU of the nfsd threads increase rapidly and by the time we have 100
>> > > clients, we have maxed out the 32 cores of the server with most of that in
>> > > native_queued_spin_lock_slowpath.
>> > 
>> > That sounds a lot like what Frank Van der Linden reported:
>> > 
>> >         https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> > 
>> > It looks like a bug in the filehandle caching code.
>> > 
>> > --b.
>> 
>> Yes, that does look like the same one.
>> 
>> I still think that not caching v4 files at all may be the best way to go
>> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> where you end up doing a lot of opens/closes, but it doesn't make as
>> much sense for v4.
>> 
>> However, short of that, I tested a local patch a few months back, that
>> I never posted here, so I'll do so now. It just makes v4 opens in to
>> 'long term' opens, which do not get put on the LRU, since that doesn't
>> make sense (they are in the hash table, so they are still cached).
> 
> That makes sense to me.  But I'm also not opposed to turning it off for
> v4 at this point.
> 
> --b.

Thank you both, that's absolutely the issue with our (broken) production workload. I totally missed that thread while researching the archives.

I tried both of Frank's patches and the CPU returned to normal levels, native_queued_spin_lock_slowpath went from 88% to 2% usage and the server performed pretty much the same as it does for an NFSv3 export.

So, ultimately this had nothing to do with NFS re-exporting; it's just that I was using a newer kernel with filecache to do it. All our other NFSv4 originating servers are running older kernels, hence why our (broken) workload never caused us any problems before. Thanks for clearing that up for me.

With regards to dropping the filecache feature completely for NFSv4, I do wonder if it does still save a few precious network round-trips (which is especially important for my re-export scenario)? We want to be able to choose the level of caching on the re-export server and minimise expensive lookups to originating servers that may be many milliseconds away (coherency be damned).

Seeing as there was some interest in issue #1 (drop caches = estale re-exports) and this #4 issue (NFSv4 filecache vs ridiculous open/close counts), I'll post some more detail & reproducers next week for #2 (invalidating the re-export server's NFS client cache) and #3 (cached client metadata lookups not returned quickly enough when the client is busy with reads).

That way anyone trying to follow in my (re-exporting) footsteps is fully aware of all the potential performance pitfalls I have discovered so far.

Many thanks,

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
  2020-09-08  9:40 ` Mkrtchyan, Tigran
  2020-09-15 17:21 ` J. Bruce Fields
@ 2020-09-22 12:31 ` Daire Byrne
  2020-09-22 13:52   ` Trond Myklebust
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
  2 siblings, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-09-22 12:31 UTC (permalink / raw)
  To: linux-nfs; +Cc: linux-cachefs

Hi, 

I just thought I'd flesh out the other two issues I have found with re-exporting that are ultimately responsible for the biggest performance bottlenecks. And both of them revolve around the caching of metadata file lookups in the NFS client.

Especially for the case where we are re-exporting a server many milliseconds away (i.e. on-premise -> cloud), we want to be able to control how much the client caches metadata and file data so that it's many LAN clients all benefit from the re-export server only having to do the WAN lookups once (within a specified coherency time).

Keeping the file data in the vfs page cache or on disk using fscache/cachefiles is fairly straightforward, but keeping the metadata cached is particularly difficult. And without the cached metadata we introduce long delays before we can serve the already present and locally cached file data to many waiting clients.

----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> cut the network packets back to the origin server to zero for repeated lookups.
> However, if a client of the re-export server walks paths and memory maps those
> files (i.e. loading an application), the re-export server starts issuing
> unexpected calls back to the origin server again, ignoring/invalidating the
> re-export server's NFS client cache. We worked around this this by patching an
> inode/iversion validity check in inode.c so that the NFS client cache on the
> re-export server is used. I'm not sure about the correctness of this patch but
> it works for our corner case.

If we use actimeo=3600,nocto (say) to mount a remote software volume on the re-export server, we can successfully cache the loading of applications and walking of paths directly on the re-export server such that after a couple of runs, there are practically zero packets back to the originating NFS server (great!). But, if we then do the same thing on a client which is mounting that re-export server, the re-export server now starts issuing lots of calls back to the originating server and invalidating it's client cache (bad!).

I'm not exactly sure why, but the iversion of the inode gets changed locally (due to atime modification?) most likely via invocation of method inode_inc_iversion_raw. Each time it gets incremented the following call to validate attributes detects changes causing it to be reloaded from the originating server.

This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:

--- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000 +0000
+++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
@@ -1869,7 +1869,7 @@
 
        /* More cache consistency checks */
        if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
-               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
+               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
                        /* Could it be a race with writeback? */
                        if (!(have_writers || have_delegation)) {
                                invalid |= NFS_INO_INVALID_DATA

With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).

Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly?

> 3) If we saturate an NFS client's network with reads from the server, all client
> metadata lookups become unbearably slow even if it's all cached in the NFS
> client's memory and no network RPCs should be required. This is the case for
> any NFS client regardless of re-exporting but it affects this case more because
> when we can't serve cached metadata we also can't serve the cached data. It
> feels like some sort of bottleneck in the client's ability to parallelise
> requests? We work around this by not maxing out our network.

I spent a bit more time testing this issue and it's not quite as I've written it. Again the issue is that we have very little control over preserving complete metadata caches to avoid expensive contact with the originating NFS server. Even though we can use actimeo,nocto mount options, these provide no guarantees that we can keep all the required metadata in cache when the page cache is under constant churn (e.g. NFS reads).

This has very little to do with the re-export of an NFS client mount and is more a general observation of how the NFS client works. It is probably relevant to anyone who wants to cache metadata for long periods of time (e.g. read-only, non-changing, over the WAN).

Let's consider how we might try to keep as much metadata cached in memory....

nfsclient # echo 0 >/proc/sys/vm/vfs_cache_pressure
nfsclient # mount -o vers=3,actimeo=7200,nocto,ro,nolock nfsserver:/usr /mnt/nfsserver
nfsclient # for x in {1..3}; do /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null; sleep 5; done
53.23 <- first time so lots of network traffic
2.82 <- now cached for actimeo=7200 with almost no packets between nfsserver & nfsclient
2.85

This is ideal and as long as we don't touch the page cache then repeated walks of the remote server will all come from cache until the attribute cache times out.

We can even read from the remote server using either directio or fadvise so that we don't upset the client's page cache and we will keep the complete metadata cache intact. e.g.

nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" iflag=direct of=/dev/null bs=1M &>/dev/null'
nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'nocache dd if="X" of=/dev/null bs=1M &>/dev/null'
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
2.82 <- still showing good complete cached metadata

But as soon as we switch to the more normal reading of file data which then populates the page cache, we lose portions of our cached metadata (readdir?) even when there is plenty of RAM available.

nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" of=/dev/null bs=1M &>/dev/null'
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
10.82 <- still mostly cached metadata but we had to do some fresh lookups

Now once our NFS client starts doing lots of sustained reads such that it maxes out the network, we end up in a situation where we are both dropping useful cached metadata (before actimeo) and we are making it harder to get the new metadata lookups back in a timely fashion because the reads are so much more dominant (and require less round trips to get more done).

So if we do the reads and try to do the filesystem walk at the same time, we get even slower performance:

nfsclient # (find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" of=/dev/null bs=1M &>/dev/null') &
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
30.12

As we increase the number of simultaneous threads for the reads (e.g knfsd threads), the single thread of metadata lookups gets slower and slower.

So even when setting vfs_cache_pressure=0 (to keep nfs inodes in memory), setting actimeo=large and using nocto to avoid more lookups, we still can't keep a complete metadata cache in memory for any specified time when the server is doing lots of reads and churning through the page cache.

So, while I am not able to provide many answers or solutions to any of the issues I have highlighted in this email thread, hopefully I have described in enough detail all the main performance hurdles others will likely run into if they attempt this in production as we have.

And like I said from the outset, it's already stable enough for us to use in production and it's definitely better than nothing... ;)

Regards,

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-22 12:31 ` Daire Byrne
@ 2020-09-22 13:52   ` Trond Myklebust
  2020-09-23 12:40     ` J. Bruce Fields
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-09-22 13:52 UTC (permalink / raw)
  To: linux-nfs, daire; +Cc: linux-cachefs

On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> Hi, 
> 
> I just thought I'd flesh out the other two issues I have found with
> re-exporting that are ultimately responsible for the biggest
> performance bottlenecks. And both of them revolve around the caching
> of metadata file lookups in the NFS client.
> 
> Especially for the case where we are re-exporting a server many
> milliseconds away (i.e. on-premise -> cloud), we want to be able to
> control how much the client caches metadata and file data so that
> it's many LAN clients all benefit from the re-export server only
> having to do the WAN lookups once (within a specified coherency
> time).
> 
> Keeping the file data in the vfs page cache or on disk using
> fscache/cachefiles is fairly straightforward, but keeping the
> metadata cached is particularly difficult. And without the cached
> metadata we introduce long delays before we can serve the already
> present and locally cached file data to many waiting clients.
> 
> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > 2) If we cache metadata on the re-export server using
> > actimeo=3600,nocto we can
> > cut the network packets back to the origin server to zero for
> > repeated lookups.
> > However, if a client of the re-export server walks paths and memory
> > maps those
> > files (i.e. loading an application), the re-export server starts
> > issuing
> > unexpected calls back to the origin server again,
> > ignoring/invalidating the
> > re-export server's NFS client cache. We worked around this this by
> > patching an
> > inode/iversion validity check in inode.c so that the NFS client
> > cache on the
> > re-export server is used. I'm not sure about the correctness of
> > this patch but
> > it works for our corner case.
> 
> If we use actimeo=3600,nocto (say) to mount a remote software volume
> on the re-export server, we can successfully cache the loading of
> applications and walking of paths directly on the re-export server
> such that after a couple of runs, there are practically zero packets
> back to the originating NFS server (great!). But, if we then do the
> same thing on a client which is mounting that re-export server, the
> re-export server now starts issuing lots of calls back to the
> originating server and invalidating it's client cache (bad!).
> 
> I'm not exactly sure why, but the iversion of the inode gets changed
> locally (due to atime modification?) most likely via invocation of
> method inode_inc_iversion_raw. Each time it gets incremented the
> following call to validate attributes detects changes causing it to
> be reloaded from the originating server.
> 
> This patch helps to avoid this when applied to the re-export server
> but there may be other places where this happens too. I accept that
> this patch is probably not the right/general way to do this, but it
> helps to highlight the issue when re-exporting and it works well for
> our use case:
> 
> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> 00:23:03.000000000 +0000
> +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> @@ -1869,7 +1869,7 @@
>  
>         /* More cache consistency checks */
>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> -               if (!inode_eq_iversion_raw(inode, fattr-
> >change_attr)) {
> +               if (inode_peek_iversion_raw(inode) < fattr-
> >change_attr) {
>                         /* Could it be a race with writeback? */
>                         if (!(have_writers || have_delegation)) {
>                                 invalid |= NFS_INO_INVALID_DATA


There is nothing in the base NFSv4, and NFSv4.1 specs that allow you to
make assumptions about how the change attribute behaves over time.

The only safe way to do something like the above is if the server
supports NFSv4.2 and also advertises support for the 'change_attr_type'
attribute. In that case, you can check at mount time for whether or not
the change attribute on this filesystem is one of the monotonic types
which would allow the above optimisation.


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-17 20:23       ` Frank van der Linden
  2020-09-17 21:57         ` bfields
@ 2020-09-22 16:43         ` Chuck Lever
  2020-09-23 20:25           ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Chuck Lever @ 2020-09-22 16:43 UTC (permalink / raw)
  To: Frank van der Linden, Bruce Fields
  Cc: Daire Byrne, Linux NFS Mailing List, linux-cachefs



> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
> 
> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> 
>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>>> 
>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>>> 
>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>>>>> second) quickly eat up the CPU on the re-export server and perf top
>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>>>> 
>>>> Any statistics on who's calling that function?
>>> 
>>> I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
>>> 
>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
>> 
>> That sounds a lot like what Frank Van der Linden reported:
>> 
>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> 
>> It looks like a bug in the filehandle caching code.
>> 
>> --b.
> 
> Yes, that does look like the same one.
> 
> I still think that not caching v4 files at all may be the best way to go
> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> where you end up doing a lot of opens/closes, but it doesn't make as
> much sense for v4.
> 
> However, short of that, I tested a local patch a few months back, that
> I never posted here, so I'll do so now. It just makes v4 opens in to
> 'long term' opens, which do not get put on the LRU, since that doesn't
> make sense (they are in the hash table, so they are still cached).
> 
> Also, the file caching code seems to walk the LRU a little too often,
> but that's another issue - and this change keeps the LRU short, so it's
> not a big deal.
> 
> I don't particularly love this patch, but it does keep the LRU short, and
> did significantly speed up my testcase (by about 50%). So, maybe you can
> give it a try.
> 
> I'll also attach a second patch, that converts the hash table to an rhashtable,
> which automatically grows and shrinks in size with usage. That patch also
> helped, but not by nearly as much (I think it yielded another 10%).

For what it's worth, I applied your two patches to my test server, along
with my patch that force-closes cached file descriptors during NFSv4
CLOSE processing. The patch combination improves performance (faster
elapsed time) for my workload as well.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-22 13:52   ` Trond Myklebust
@ 2020-09-23 12:40     ` J. Bruce Fields
  2020-09-23 13:09       ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-09-23 12:40 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, daire, linux-cachefs

On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > Hi, 
> > 
> > I just thought I'd flesh out the other two issues I have found with
> > re-exporting that are ultimately responsible for the biggest
> > performance bottlenecks. And both of them revolve around the caching
> > of metadata file lookups in the NFS client.
> > 
> > Especially for the case where we are re-exporting a server many
> > milliseconds away (i.e. on-premise -> cloud), we want to be able to
> > control how much the client caches metadata and file data so that
> > it's many LAN clients all benefit from the re-export server only
> > having to do the WAN lookups once (within a specified coherency
> > time).
> > 
> > Keeping the file data in the vfs page cache or on disk using
> > fscache/cachefiles is fairly straightforward, but keeping the
> > metadata cached is particularly difficult. And without the cached
> > metadata we introduce long delays before we can serve the already
> > present and locally cached file data to many waiting clients.
> > 
> > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > 2) If we cache metadata on the re-export server using
> > > actimeo=3600,nocto we can
> > > cut the network packets back to the origin server to zero for
> > > repeated lookups.
> > > However, if a client of the re-export server walks paths and memory
> > > maps those
> > > files (i.e. loading an application), the re-export server starts
> > > issuing
> > > unexpected calls back to the origin server again,
> > > ignoring/invalidating the
> > > re-export server's NFS client cache. We worked around this this by
> > > patching an
> > > inode/iversion validity check in inode.c so that the NFS client
> > > cache on the
> > > re-export server is used. I'm not sure about the correctness of
> > > this patch but
> > > it works for our corner case.
> > 
> > If we use actimeo=3600,nocto (say) to mount a remote software volume
> > on the re-export server, we can successfully cache the loading of
> > applications and walking of paths directly on the re-export server
> > such that after a couple of runs, there are practically zero packets
> > back to the originating NFS server (great!). But, if we then do the
> > same thing on a client which is mounting that re-export server, the
> > re-export server now starts issuing lots of calls back to the
> > originating server and invalidating it's client cache (bad!).
> > 
> > I'm not exactly sure why, but the iversion of the inode gets changed
> > locally (due to atime modification?) most likely via invocation of
> > method inode_inc_iversion_raw. Each time it gets incremented the
> > following call to validate attributes detects changes causing it to
> > be reloaded from the originating server.
> > 
> > This patch helps to avoid this when applied to the re-export server
> > but there may be other places where this happens too. I accept that
> > this patch is probably not the right/general way to do this, but it
> > helps to highlight the issue when re-exporting and it works well for
> > our use case:
> > 
> > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > 00:23:03.000000000 +0000
> > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > @@ -1869,7 +1869,7 @@
> >  
> >         /* More cache consistency checks */
> >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > -               if (!inode_eq_iversion_raw(inode, fattr-
> > >change_attr)) {
> > +               if (inode_peek_iversion_raw(inode) < fattr-
> > >change_attr) {
> >                         /* Could it be a race with writeback? */
> >                         if (!(have_writers || have_delegation)) {
> >                                 invalid |= NFS_INO_INVALID_DATA
> 
> 
> There is nothing in the base NFSv4, and NFSv4.1 specs that allow you to
> make assumptions about how the change attribute behaves over time.
> 
> The only safe way to do something like the above is if the server
> supports NFSv4.2 and also advertises support for the 'change_attr_type'
> attribute. In that case, you can check at mount time for whether or not
> the change attribute on this filesystem is one of the monotonic types
> which would allow the above optimisation.

Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?

The Linux server's ctime is monotonic and will advertise that with
change_attr_type since 4.19.

So I think it would be easy to patch the client to check
change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
server->caps, the hard part would be figuring out which optimisations
are OK.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 12:40     ` J. Bruce Fields
@ 2020-09-23 13:09       ` Trond Myklebust
  2020-09-23 17:07         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-09-23 13:09 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > Hi, 
> > > 
> > > I just thought I'd flesh out the other two issues I have found
> > > with
> > > re-exporting that are ultimately responsible for the biggest
> > > performance bottlenecks. And both of them revolve around the
> > > caching
> > > of metadata file lookups in the NFS client.
> > > 
> > > Especially for the case where we are re-exporting a server many
> > > milliseconds away (i.e. on-premise -> cloud), we want to be able
> > > to
> > > control how much the client caches metadata and file data so that
> > > it's many LAN clients all benefit from the re-export server only
> > > having to do the WAN lookups once (within a specified coherency
> > > time).
> > > 
> > > Keeping the file data in the vfs page cache or on disk using
> > > fscache/cachefiles is fairly straightforward, but keeping the
> > > metadata cached is particularly difficult. And without the cached
> > > metadata we introduce long delays before we can serve the already
> > > present and locally cached file data to many waiting clients.
> > > 
> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > > 2) If we cache metadata on the re-export server using
> > > > actimeo=3600,nocto we can
> > > > cut the network packets back to the origin server to zero for
> > > > repeated lookups.
> > > > However, if a client of the re-export server walks paths and
> > > > memory
> > > > maps those
> > > > files (i.e. loading an application), the re-export server
> > > > starts
> > > > issuing
> > > > unexpected calls back to the origin server again,
> > > > ignoring/invalidating the
> > > > re-export server's NFS client cache. We worked around this this
> > > > by
> > > > patching an
> > > > inode/iversion validity check in inode.c so that the NFS client
> > > > cache on the
> > > > re-export server is used. I'm not sure about the correctness of
> > > > this patch but
> > > > it works for our corner case.
> > > 
> > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > volume
> > > on the re-export server, we can successfully cache the loading of
> > > applications and walking of paths directly on the re-export
> > > server
> > > such that after a couple of runs, there are practically zero
> > > packets
> > > back to the originating NFS server (great!). But, if we then do
> > > the
> > > same thing on a client which is mounting that re-export server,
> > > the
> > > re-export server now starts issuing lots of calls back to the
> > > originating server and invalidating it's client cache (bad!).
> > > 
> > > I'm not exactly sure why, but the iversion of the inode gets
> > > changed
> > > locally (due to atime modification?) most likely via invocation
> > > of
> > > method inode_inc_iversion_raw. Each time it gets incremented the
> > > following call to validate attributes detects changes causing it
> > > to
> > > be reloaded from the originating server.
> > > 
> > > This patch helps to avoid this when applied to the re-export
> > > server
> > > but there may be other places where this happens too. I accept
> > > that
> > > this patch is probably not the right/general way to do this, but
> > > it
> > > helps to highlight the issue when re-exporting and it works well
> > > for
> > > our use case:
> > > 
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > 00:23:03.000000000 +0000
> > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >  
> > >         /* More cache consistency checks */
> > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > > change_attr)) {
> > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > > change_attr) {
> > >                         /* Could it be a race with writeback? */
> > >                         if (!(have_writers || have_delegation)) {
> > >                                 invalid |= NFS_INO_INVALID_DATA
> > 
> > There is nothing in the base NFSv4, and NFSv4.1 specs that allow
> > you to
> > make assumptions about how the change attribute behaves over time.
> > 
> > The only safe way to do something like the above is if the server
> > supports NFSv4.2 and also advertises support for the
> > 'change_attr_type'
> > attribute. In that case, you can check at mount time for whether or
> > not
> > the change attribute on this filesystem is one of the monotonic
> > types
> > which would allow the above optimisation.
> 
> Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
> think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?
> 
> The Linux server's ctime is monotonic and will advertise that with
> change_attr_type since 4.19.
> 
> So I think it would be easy to patch the client to check
> change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> server->caps, the hard part would be figuring out which optimisations
> are OK.
> 

The ctime is *not* monotonic. It can regress under server reboots and
it can regress if someone deliberately changes the time. We have code
that tries to handle all these issues (see fattr->gencount and nfsi-
>attr_gencount) because we've hit those issues before...

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 13:09       ` Trond Myklebust
@ 2020-09-23 17:07         ` bfields
  0 siblings, 0 replies; 129+ messages in thread
From: bfields @ 2020-09-23 17:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Wed, Sep 23, 2020 at 01:09:01PM +0000, Trond Myklebust wrote:
> On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > Hi, 
> > > > 
> > > > I just thought I'd flesh out the other two issues I have found
> > > > with
> > > > re-exporting that are ultimately responsible for the biggest
> > > > performance bottlenecks. And both of them revolve around the
> > > > caching
> > > > of metadata file lookups in the NFS client.
> > > > 
> > > > Especially for the case where we are re-exporting a server many
> > > > milliseconds away (i.e. on-premise -> cloud), we want to be able
> > > > to
> > > > control how much the client caches metadata and file data so that
> > > > it's many LAN clients all benefit from the re-export server only
> > > > having to do the WAN lookups once (within a specified coherency
> > > > time).
> > > > 
> > > > Keeping the file data in the vfs page cache or on disk using
> > > > fscache/cachefiles is fairly straightforward, but keeping the
> > > > metadata cached is particularly difficult. And without the cached
> > > > metadata we introduce long delays before we can serve the already
> > > > present and locally cached file data to many waiting clients.
> > > > 
> > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > > > 2) If we cache metadata on the re-export server using
> > > > > actimeo=3600,nocto we can
> > > > > cut the network packets back to the origin server to zero for
> > > > > repeated lookups.
> > > > > However, if a client of the re-export server walks paths and
> > > > > memory
> > > > > maps those
> > > > > files (i.e. loading an application), the re-export server
> > > > > starts
> > > > > issuing
> > > > > unexpected calls back to the origin server again,
> > > > > ignoring/invalidating the
> > > > > re-export server's NFS client cache. We worked around this this
> > > > > by
> > > > > patching an
> > > > > inode/iversion validity check in inode.c so that the NFS client
> > > > > cache on the
> > > > > re-export server is used. I'm not sure about the correctness of
> > > > > this patch but
> > > > > it works for our corner case.
> > > > 
> > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > volume
> > > > on the re-export server, we can successfully cache the loading of
> > > > applications and walking of paths directly on the re-export
> > > > server
> > > > such that after a couple of runs, there are practically zero
> > > > packets
> > > > back to the originating NFS server (great!). But, if we then do
> > > > the
> > > > same thing on a client which is mounting that re-export server,
> > > > the
> > > > re-export server now starts issuing lots of calls back to the
> > > > originating server and invalidating it's client cache (bad!).
> > > > 
> > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > changed
> > > > locally (due to atime modification?) most likely via invocation
> > > > of
> > > > method inode_inc_iversion_raw. Each time it gets incremented the
> > > > following call to validate attributes detects changes causing it
> > > > to
> > > > be reloaded from the originating server.
> > > > 
> > > > This patch helps to avoid this when applied to the re-export
> > > > server
> > > > but there may be other places where this happens too. I accept
> > > > that
> > > > this patch is probably not the right/general way to do this, but
> > > > it
> > > > helps to highlight the issue when re-exporting and it works well
> > > > for
> > > > our use case:
> > > > 
> > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > > 00:23:03.000000000 +0000
> > > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > > @@ -1869,7 +1869,7 @@
> > > >  
> > > >         /* More cache consistency checks */
> > > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > > > change_attr)) {
> > > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > > > change_attr) {
> > > >                         /* Could it be a race with writeback? */
> > > >                         if (!(have_writers || have_delegation)) {
> > > >                                 invalid |= NFS_INO_INVALID_DATA
> > > 
> > > There is nothing in the base NFSv4, and NFSv4.1 specs that allow
> > > you to
> > > make assumptions about how the change attribute behaves over time.
> > > 
> > > The only safe way to do something like the above is if the server
> > > supports NFSv4.2 and also advertises support for the
> > > 'change_attr_type'
> > > attribute. In that case, you can check at mount time for whether or
> > > not
> > > the change attribute on this filesystem is one of the monotonic
> > > types
> > > which would allow the above optimisation.
> > 
> > Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
> > think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?
> > 
> > The Linux server's ctime is monotonic and will advertise that with
> > change_attr_type since 4.19.
> > 
> > So I think it would be easy to patch the client to check
> > change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> > server->caps, the hard part would be figuring out which optimisations
> > are OK.
> > 
> 
> The ctime is *not* monotonic. It can regress under server reboots and
> it can regress if someone deliberately changes the time.

So, anything other than IS_UNDEFINED or IS_TIME_METADATA?

Though the linux server is susceptible to some of that even when it
returns MONTONIC_INCR.  If the admin replaces the filesystem by an older
snapshot, there's not much we can do.  I'm not sure what degree of
gaurantee we need.

--b.

> We have code
> that tries to handle all these issues (see fattr->gencount and nfsi-
> >attr_gencount) because we've hit those issues before...



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-22 16:43         ` Chuck Lever
@ 2020-09-23 20:25           ` Daire Byrne
  2020-09-23 21:01             ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-23 20:25 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Frank van der Linden, bfields, linux-nfs, linux-cachefs


----- On 22 Sep, 2020, at 17:43, Chuck Lever chuck.lever@oracle.com wrote:
>> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
>> 
>> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>>> 
>>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>>>> 
>>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>>>> 
>>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>>>>>> second) quickly eat up the CPU on the re-export server and perf top
>>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>>>>> 
>>>>> Any statistics on who's calling that function?
>>>> 
>>>> I've always struggled to reproduce this with a simple open/close simulation, so
>>>> I suspect some other operations need to be mixed in too. But I have one
>>>> production workload that I know has lots of opens & closes (buggy software)
>>>> included in amongst the usual reads, writes etc.
>>>> 
>>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
>>>> clients, we have maxed out the 32 cores of the server with most of that in
>>>> native_queued_spin_lock_slowpath.
>>> 
>>> That sounds a lot like what Frank Van der Linden reported:
>>> 
>>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>> 
>>> It looks like a bug in the filehandle caching code.
>>> 
>>> --b.
>> 
>> Yes, that does look like the same one.
>> 
>> I still think that not caching v4 files at all may be the best way to go
>> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> where you end up doing a lot of opens/closes, but it doesn't make as
>> much sense for v4.
>> 
>> However, short of that, I tested a local patch a few months back, that
>> I never posted here, so I'll do so now. It just makes v4 opens in to
>> 'long term' opens, which do not get put on the LRU, since that doesn't
>> make sense (they are in the hash table, so they are still cached).
>> 
>> Also, the file caching code seems to walk the LRU a little too often,
>> but that's another issue - and this change keeps the LRU short, so it's
>> not a big deal.
>> 
>> I don't particularly love this patch, but it does keep the LRU short, and
>> did significantly speed up my testcase (by about 50%). So, maybe you can
>> give it a try.
>> 
>> I'll also attach a second patch, that converts the hash table to an rhashtable,
>> which automatically grows and shrinks in size with usage. That patch also
>> helped, but not by nearly as much (I think it yielded another 10%).
> 
> For what it's worth, I applied your two patches to my test server, along
> with my patch that force-closes cached file descriptors during NFSv4
> CLOSE processing. The patch combination improves performance (faster
> elapsed time) for my workload as well.

I tested Frank's NFSv4 filecache patches with some production workloads and I've hit the below refcount issue a couple of times in the last 48 hours with v5.8.10. This server was re-exporting an NFS client mount at the time.

Apologies for the spam if I've just hit something unrelated to the patches that is present in v5.8.10.... In truth, I have not used this kernel version before with this workload and just patched it because I had it ready to go. I'll remove the 2 patches and verify.

Daire


[ 8930.027838] ------------[ cut here ]------------
[ 8930.032769] refcount_t: addition on 0; use-after-free.
[ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25 refcount_warn_saturate+0x6e/0xf0
[ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4 input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G        W         5.8.10-1.dneg.x86_64 #1
[ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
[ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
[ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
[ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
[ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
[ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
[ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
[ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
[ 8930.182976] FS:  0000000000000000(0000) GS:ffff888d0bc80000(0000) knlGS:0000000000000000
[ 8930.191231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
[ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8930.219027] Call Trace:
[ 8930.221665]  nfsd4_process_open2+0xa48/0xec0 [nfsd]
[ 8930.226724]  ? nfsd_permission+0x6b/0x100 [nfsd]
[ 8930.231524]  ? fh_verify+0x167/0x210 [nfsd]
[ 8930.235893]  nfsd4_open+0x407/0x820 [nfsd]
[ 8930.240248]  nfsd4_proc_compound+0x3c2/0x760 [nfsd]
[ 8930.245296]  ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
[ 8930.251734]  nfsd_dispatch+0xe2/0x220 [nfsd]
[ 8930.256213]  svc_process_common+0x47b/0x6f0 [sunrpc]
[ 8930.261355]  ? svc_sock_secure_port+0x16/0x30 [sunrpc]
[ 8930.266707]  ? nfsd_svc+0x330/0x330 [nfsd]
[ 8930.270981]  svc_process+0xc5/0x100 [sunrpc]
[ 8930.275423]  nfsd+0xe8/0x150 [nfsd]
[ 8930.280028]  kthread+0x114/0x150
[ 8930.283434]  ? nfsd_destroy+0x60/0x60 [nfsd]
[ 8930.287875]  ? kthread_park+0x90/0x90
[ 8930.291700]  ret_from_fork+0x22/0x30
[ 8930.295447] ---[ end trace c551536c3520545c ]---

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 20:25           ` Daire Byrne
@ 2020-09-23 21:01             ` Frank van der Linden
  2020-09-26  9:00               ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: Frank van der Linden @ 2020-09-23 21:01 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Chuck Lever, bfields, linux-nfs, linux-cachefs

On Wed, Sep 23, 2020 at 09:25:07PM +0100, Daire Byrne wrote:
> 
> ----- On 22 Sep, 2020, at 17:43, Chuck Lever chuck.lever@oracle.com wrote:
> >> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
> >>
> >> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> >>>
> >>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> >>>>
> >>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
> >>>>
> >>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> >>>>>> second) quickly eat up the CPU on the re-export server and perf top
> >>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
> >>>>>
> >>>>> Any statistics on who's calling that function?
> >>>>
> >>>> I've always struggled to reproduce this with a simple open/close simulation, so
> >>>> I suspect some other operations need to be mixed in too. But I have one
> >>>> production workload that I know has lots of opens & closes (buggy software)
> >>>> included in amongst the usual reads, writes etc.
> >>>>
> >>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
> >>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
> >>>> clients, we have maxed out the 32 cores of the server with most of that in
> >>>> native_queued_spin_lock_slowpath.
> >>>
> >>> That sounds a lot like what Frank Van der Linden reported:
> >>>
> >>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> >>>
> >>> It looks like a bug in the filehandle caching code.
> >>>
> >>> --b.
> >>
> >> Yes, that does look like the same one.
> >>
> >> I still think that not caching v4 files at all may be the best way to go
> >> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> >> where you end up doing a lot of opens/closes, but it doesn't make as
> >> much sense for v4.
> >>
> >> However, short of that, I tested a local patch a few months back, that
> >> I never posted here, so I'll do so now. It just makes v4 opens in to
> >> 'long term' opens, which do not get put on the LRU, since that doesn't
> >> make sense (they are in the hash table, so they are still cached).
> >>
> >> Also, the file caching code seems to walk the LRU a little too often,
> >> but that's another issue - and this change keeps the LRU short, so it's
> >> not a big deal.
> >>
> >> I don't particularly love this patch, but it does keep the LRU short, and
> >> did significantly speed up my testcase (by about 50%). So, maybe you can
> >> give it a try.
> >>
> >> I'll also attach a second patch, that converts the hash table to an rhashtable,
> >> which automatically grows and shrinks in size with usage. That patch also
> >> helped, but not by nearly as much (I think it yielded another 10%).
> >
> > For what it's worth, I applied your two patches to my test server, along
> > with my patch that force-closes cached file descriptors during NFSv4
> > CLOSE processing. The patch combination improves performance (faster
> > elapsed time) for my workload as well.
> 
> I tested Frank's NFSv4 filecache patches with some production workloads and I've hit the below refcount issue a couple of times in the last 48 hours with v5.8.10. This server was re-exporting an NFS client mount at the time.
> 
> Apologies for the spam if I've just hit something unrelated to the patches that is present in v5.8.10.... In truth, I have not used this kernel version before with this workload and just patched it because I had it ready to go. I'll remove the 2 patches and verify.
> 
> Daire
> 
> 
> [ 8930.027838] ------------[ cut here ]------------
> [ 8930.032769] refcount_t: addition on 0; use-after-free.
> [ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25 refcount_warn_saturate+0x6e/0xf0
> [ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4 input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
> [ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G        W         5.8.10-1.dneg.x86_64 #1
> [ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
> [ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
> [ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
> [ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
> [ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
> [ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
> [ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
> [ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
> [ 8930.182976] FS:  0000000000000000(0000) GS:ffff888d0bc80000(0000) knlGS:0000000000000000
> [ 8930.191231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
> [ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8930.219027] Call Trace:
> [ 8930.221665]  nfsd4_process_open2+0xa48/0xec0 [nfsd]
> [ 8930.226724]  ? nfsd_permission+0x6b/0x100 [nfsd]
> [ 8930.231524]  ? fh_verify+0x167/0x210 [nfsd]
> [ 8930.235893]  nfsd4_open+0x407/0x820 [nfsd]
> [ 8930.240248]  nfsd4_proc_compound+0x3c2/0x760 [nfsd]
> [ 8930.245296]  ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
> [ 8930.251734]  nfsd_dispatch+0xe2/0x220 [nfsd]
> [ 8930.256213]  svc_process_common+0x47b/0x6f0 [sunrpc]
> [ 8930.261355]  ? svc_sock_secure_port+0x16/0x30 [sunrpc]
> [ 8930.266707]  ? nfsd_svc+0x330/0x330 [nfsd]
> [ 8930.270981]  svc_process+0xc5/0x100 [sunrpc]
> [ 8930.275423]  nfsd+0xe8/0x150 [nfsd]
> [ 8930.280028]  kthread+0x114/0x150
> [ 8930.283434]  ? nfsd_destroy+0x60/0x60 [nfsd]
> [ 8930.287875]  ? kthread_park+0x90/0x90
> [ 8930.291700]  ret_from_fork+0x22/0x30
> [ 8930.295447] ---[ end trace c551536c3520545c ]---

It's entirely possible that my patch introduces a refcounting error - it was
intended as a proof-of-concept on how to fix the LRU locking issue for v4
open file caching (while keeping it enabled) - which is why I didn't
"formally" send it in.

Having said that, I don't immediately see the problem.

Maybe try it without the rhashtable patch, that is much less of an
optimization.

The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
of nfs4_file. If it's the latter, it's probably the rhashtable change.

- Frank

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-23 21:01             ` Frank van der Linden
@ 2020-09-26  9:00               ` Daire Byrne
  2020-09-28 15:49                 ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-09-26  9:00 UTC (permalink / raw)
  To: Frank van der Linden; +Cc: Chuck Lever, bfields, linux-nfs, linux-cachefs

----- On 23 Sep, 2020, at 22:01, Frank van der Linden fllinden@amazon.com wrote:

> On Wed, Sep 23, 2020 at 09:25:07PM +0100, Daire Byrne wrote:
>> 
>> ----- On 22 Sep, 2020, at 17:43, Chuck Lever chuck.lever@oracle.com wrote:
>> >> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <fllinden@amazon.com> wrote:
>> >>
>> >> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> >>>
>> >>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>> >>>>
>> >>>> ----- On 15 Sep, 2020, at 18:21, bfields bfields@fieldses.org wrote:
>> >>>>
>> >>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> >>>>>> second) quickly eat up the CPU on the re-export server and perf top
>> >>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>> >>>>>
>> >>>>> Any statistics on who's calling that function?
>> >>>>
>> >>>> I've always struggled to reproduce this with a simple open/close simulation, so
>> >>>> I suspect some other operations need to be mixed in too. But I have one
>> >>>> production workload that I know has lots of opens & closes (buggy software)
>> >>>> included in amongst the usual reads, writes etc.
>> >>>>
>> >>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>> >>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
>> >>>> clients, we have maxed out the 32 cores of the server with most of that in
>> >>>> native_queued_spin_lock_slowpath.
>> >>>
>> >>> That sounds a lot like what Frank Van der Linden reported:
>> >>>
>> >>>        https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> >>>
>> >>> It looks like a bug in the filehandle caching code.
>> >>>
>> >>> --b.
>> >>
>> >> Yes, that does look like the same one.
>> >>
>> >> I still think that not caching v4 files at all may be the best way to go
>> >> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> >> where you end up doing a lot of opens/closes, but it doesn't make as
>> >> much sense for v4.
>> >>
>> >> However, short of that, I tested a local patch a few months back, that
>> >> I never posted here, so I'll do so now. It just makes v4 opens in to
>> >> 'long term' opens, which do not get put on the LRU, since that doesn't
>> >> make sense (they are in the hash table, so they are still cached).
>> >>
>> >> Also, the file caching code seems to walk the LRU a little too often,
>> >> but that's another issue - and this change keeps the LRU short, so it's
>> >> not a big deal.
>> >>
>> >> I don't particularly love this patch, but it does keep the LRU short, and
>> >> did significantly speed up my testcase (by about 50%). So, maybe you can
>> >> give it a try.
>> >>
>> >> I'll also attach a second patch, that converts the hash table to an rhashtable,
>> >> which automatically grows and shrinks in size with usage. That patch also
>> >> helped, but not by nearly as much (I think it yielded another 10%).
>> >
>> > For what it's worth, I applied your two patches to my test server, along
>> > with my patch that force-closes cached file descriptors during NFSv4
>> > CLOSE processing. The patch combination improves performance (faster
>> > elapsed time) for my workload as well.
>> 
>> I tested Frank's NFSv4 filecache patches with some production workloads and I've
>> hit the below refcount issue a couple of times in the last 48 hours with
>> v5.8.10. This server was re-exporting an NFS client mount at the time.
>> 
>> Apologies for the spam if I've just hit something unrelated to the patches that
>> is present in v5.8.10.... In truth, I have not used this kernel version before
>> with this workload and just patched it because I had it ready to go. I'll
>> remove the 2 patches and verify.
>> 
>> Daire
>> 
>> 
>> [ 8930.027838] ------------[ cut here ]------------
>> [ 8930.032769] refcount_t: addition on 0; use-after-free.
>> [ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25
>> refcount_warn_saturate+0x6e/0xf0
>> [ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4
>> dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio
>> cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4
>> input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs
>> libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover
>> virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
>> aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci
>> virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
>> [ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G        W
>> 5.8.10-1.dneg.x86_64 #1
>> [ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute
>> Engine, BIOS Google 01/01/2011
>> [ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
>> [ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01
>> 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3
>> 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
>> [ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
>> [ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
>> [ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
>> [ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
>> [ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
>> [ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
>> [ 8930.182976] FS:  0000000000000000(0000) GS:ffff888d0bc80000(0000)
>> knlGS:0000000000000000
>> [ 8930.191231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
>> [ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 8930.219027] Call Trace:
>> [ 8930.221665]  nfsd4_process_open2+0xa48/0xec0 [nfsd]
>> [ 8930.226724]  ? nfsd_permission+0x6b/0x100 [nfsd]
>> [ 8930.231524]  ? fh_verify+0x167/0x210 [nfsd]
>> [ 8930.235893]  nfsd4_open+0x407/0x820 [nfsd]
>> [ 8930.240248]  nfsd4_proc_compound+0x3c2/0x760 [nfsd]
>> [ 8930.245296]  ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
>> [ 8930.251734]  nfsd_dispatch+0xe2/0x220 [nfsd]
>> [ 8930.256213]  svc_process_common+0x47b/0x6f0 [sunrpc]
>> [ 8930.261355]  ? svc_sock_secure_port+0x16/0x30 [sunrpc]
>> [ 8930.266707]  ? nfsd_svc+0x330/0x330 [nfsd]
>> [ 8930.270981]  svc_process+0xc5/0x100 [sunrpc]
>> [ 8930.275423]  nfsd+0xe8/0x150 [nfsd]
>> [ 8930.280028]  kthread+0x114/0x150
>> [ 8930.283434]  ? nfsd_destroy+0x60/0x60 [nfsd]
>> [ 8930.287875]  ? kthread_park+0x90/0x90
>> [ 8930.291700]  ret_from_fork+0x22/0x30
>> [ 8930.295447] ---[ end trace c551536c3520545c ]---
> 
> It's entirely possible that my patch introduces a refcounting error - it was
> intended as a proof-of-concept on how to fix the LRU locking issue for v4
> open file caching (while keeping it enabled) - which is why I didn't
> "formally" send it in.
> 
> Having said that, I don't immediately see the problem.
> 
> Maybe try it without the rhashtable patch, that is much less of an
> optimization.
> 
> The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
> of nfs4_file. If it's the latter, it's probably the rhashtable change.

Thanks Frank; I think you are right in that it seems to be a problem with the rhashtable patch. Another 48 hours using the same workload with just the main patch and I have not seen the same issue again so far.

Also, it still has the effect of reducing the CPU usage dramatically such that there are plenty of cores still left idle. This is actually helping us buy some more time while we fix our obviously broken software so that it doesn't open/close so crazily.

So, many thanks for that.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-26  9:00               ` Daire Byrne
@ 2020-09-28 15:49                 ` Frank van der Linden
  2020-09-28 16:08                   ` Chuck Lever
  0 siblings, 1 reply; 129+ messages in thread
From: Frank van der Linden @ 2020-09-28 15:49 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Chuck Lever, bfields, linux-nfs, linux-cachefs

On Sat, Sep 26, 2020 at 10:00:22AM +0100, Daire Byrne wrote:
> 
> 
> ----- On 23 Sep, 2020, at 22:01, Frank van der Linden fllinden@amazon.com wrote:
> > It's entirely possible that my patch introduces a refcounting error - it was
> > intended as a proof-of-concept on how to fix the LRU locking issue for v4
> > open file caching (while keeping it enabled) - which is why I didn't
> > "formally" send it in.
> >
> > Having said that, I don't immediately see the problem.
> >
> > Maybe try it without the rhashtable patch, that is much less of an
> > optimization.
> >
> > The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
> > of nfs4_file. If it's the latter, it's probably the rhashtable change.
> 
> Thanks Frank; I think you are right in that it seems to be a problem with the rhashtable patch. Another 48 hours using the same workload with just the main patch and I have not seen the same issue again so far.
> 
> Also, it still has the effect of reducing the CPU usage dramatically such that there are plenty of cores still left idle. This is actually helping us buy some more time while we fix our obviously broken software so that it doesn't open/close so crazily.
> 
> So, many thanks for that.

Cool. I'm glad the "don't put v4 files on the LRU list" works as intended for
you. The rhashtable patch was more of an afterthought, and obviously has an
issue. It did provide some extra gains, so I'll see if I can find the problem
if I get some time.

Bruce - if you want me to 'formally' submit a version of the patch, let me
know. Just disabling the cache for v4, which comes down to reverting a few
commits, is probably simpler - I'd be able to test that too.

- Frank

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-28 15:49                 ` Frank van der Linden
@ 2020-09-28 16:08                   ` Chuck Lever
  2020-09-28 17:42                     ` Frank van der Linden
  0 siblings, 1 reply; 129+ messages in thread
From: Chuck Lever @ 2020-09-28 16:08 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: Daire Byrne, Bruce Fields, Linux NFS Mailing List, linux-cachefs



> On Sep 28, 2020, at 11:49 AM, Frank van der Linden <fllinden@amazon.com> wrote:
> 
> Bruce - if you want me to 'formally' submit a version of the patch, let me
> know. Just disabling the cache for v4, which comes down to reverting a few
> commits, is probably simpler - I'd be able to test that too.

I'd be interested in seeing that. From what I saw, the mechanics of
unhooking the cache from NFSv4 simply involve reverting patches, but
there appear to be some recent changes that depend on the open
filecache that might be difficult to deal with, like

b66ae6dd0c30 ("nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()")


--
Chuck Lever




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-28 16:08                   ` Chuck Lever
@ 2020-09-28 17:42                     ` Frank van der Linden
  0 siblings, 0 replies; 129+ messages in thread
From: Frank van der Linden @ 2020-09-28 17:42 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Daire Byrne, Bruce Fields, Linux NFS Mailing List, linux-cachefs

On Mon, Sep 28, 2020 at 12:08:09PM -0400, Chuck Lever wrote:
> 
> 
> > On Sep 28, 2020, at 11:49 AM, Frank van der Linden <fllinden@amazon.com> wrote:
> >
> > Bruce - if you want me to 'formally' submit a version of the patch, let me
> > know. Just disabling the cache for v4, which comes down to reverting a few
> > commits, is probably simpler - I'd be able to test that too.
> 
> I'd be interested in seeing that. From what I saw, the mechanics of
> unhooking the cache from NFSv4 simply involve reverting patches, but
> there appear to be some recent changes that depend on the open
> filecache that might be difficult to deal with, like
> 
> b66ae6dd0c30 ("nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()")

Hm, yes, I missed nf_rwsem being added to the struct.

Probably easier to keep nfsd_file, and have v4 use just straight alloc/free
functions for it that don't touch the cache at all.

- Frank

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-09-22 12:31 ` Daire Byrne
  2020-09-22 13:52   ` Trond Myklebust
@ 2020-09-30 19:30   ` Jeff Layton
  2020-10-01  0:09     ` Daire Byrne
  2020-10-01 18:41     ` J. Bruce Fields
  1 sibling, 2 replies; 129+ messages in thread
From: Jeff Layton @ 2020-09-30 19:30 UTC (permalink / raw)
  To: Daire Byrne, linux-nfs; +Cc: linux-cachefs

On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> Hi, 
> 
> I just thought I'd flesh out the other two issues I have found with re-exporting that are ultimately responsible for the biggest performance bottlenecks. And both of them revolve around the caching of metadata file lookups in the NFS client.
> 
> Especially for the case where we are re-exporting a server many milliseconds away (i.e. on-premise -> cloud), we want to be able to control how much the client caches metadata and file data so that it's many LAN clients all benefit from the re-export server only having to do the WAN lookups once (within a specified coherency time).
> 
> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles is fairly straightforward, but keeping the metadata cached is particularly difficult. And without the cached metadata we introduce long delays before we can serve the already present and locally cached file data to many waiting clients.
> 
> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> > cut the network packets back to the origin server to zero for repeated lookups.
> > However, if a client of the re-export server walks paths and memory maps those
> > files (i.e. loading an application), the re-export server starts issuing
> > unexpected calls back to the origin server again, ignoring/invalidating the
> > re-export server's NFS client cache. We worked around this this by patching an
> > inode/iversion validity check in inode.c so that the NFS client cache on the
> > re-export server is used. I'm not sure about the correctness of this patch but
> > it works for our corner case.
> 
> If we use actimeo=3600,nocto (say) to mount a remote software volume on the re-export server, we can successfully cache the loading of applications and walking of paths directly on the re-export server such that after a couple of runs, there are practically zero packets back to the originating NFS server (great!). But, if we then do the same thing on a client which is mounting that re-export server, the re-export server now starts issuing lots of calls back to the originating server and invalidating it's client cache (bad!).
> 
> I'm not exactly sure why, but the iversion of the inode gets changed locally (due to atime modification?) most likely via invocation of method inode_inc_iversion_raw. Each time it gets incremented the following call to validate attributes detects changes causing it to be reloaded from the originating server.
> 

I'd expect the change attribute to track what's in actual inode on the
"home" server. The NFS client is supposed to (mostly) keep the raw
change attribute in its i_version field.

The only place we call inode_inc_iversion_raw is in
nfs_inode_add_request, which I don't think you'd be hitting unless you
were writing to the file while holding a write delegation.

What sort of server is hosting the actual data in your setup?


> This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:
> 
> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000 +0000
> +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> @@ -1869,7 +1869,7 @@
>  
>         /* More cache consistency checks */
>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>                         /* Could it be a race with writeback? */
>                         if (!(have_writers || have_delegation)) {
>                                 invalid |= NFS_INO_INVALID_DATA
> 
> With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).
> 
> Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly? 

Yeah, I don't think you can do this for the reasons Trond outlined.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
@ 2020-10-01  0:09     ` Daire Byrne
  2020-10-01 10:36       ` Jeff Layton
  2020-10-01 18:41     ` J. Bruce Fields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-10-01  0:09 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, linux-cachefs


----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:

> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> Hi,
>> 
>> I just thought I'd flesh out the other two issues I have found with re-exporting
>> that are ultimately responsible for the biggest performance bottlenecks. And
>> both of them revolve around the caching of metadata file lookups in the NFS
>> client.
>> 
>> Especially for the case where we are re-exporting a server many milliseconds
>> away (i.e. on-premise -> cloud), we want to be able to control how much the
>> client caches metadata and file data so that it's many LAN clients all benefit
>> from the re-export server only having to do the WAN lookups once (within a
>> specified coherency time).
>> 
>> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> is fairly straightforward, but keeping the metadata cached is particularly
>> difficult. And without the cached metadata we introduce long delays before we
>> can serve the already present and locally cached file data to many waiting
>> clients.
>> 
>> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
>> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > cut the network packets back to the origin server to zero for repeated lookups.
>> > However, if a client of the re-export server walks paths and memory maps those
>> > files (i.e. loading an application), the re-export server starts issuing
>> > unexpected calls back to the origin server again, ignoring/invalidating the
>> > re-export server's NFS client cache. We worked around this this by patching an
>> > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > re-export server is used. I'm not sure about the correctness of this patch but
>> > it works for our corner case.
>> 
>> If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> re-export server, we can successfully cache the loading of applications and
>> walking of paths directly on the re-export server such that after a couple of
>> runs, there are practically zero packets back to the originating NFS server
>> (great!). But, if we then do the same thing on a client which is mounting that
>> re-export server, the re-export server now starts issuing lots of calls back to
>> the originating server and invalidating it's client cache (bad!).
>> 
>> I'm not exactly sure why, but the iversion of the inode gets changed locally
>> (due to atime modification?) most likely via invocation of method
>> inode_inc_iversion_raw. Each time it gets incremented the following call to
>> validate attributes detects changes causing it to be reloaded from the
>> originating server.
>> 
> 
> I'd expect the change attribute to track what's in actual inode on the
> "home" server. The NFS client is supposed to (mostly) keep the raw
> change attribute in its i_version field.
> 
> The only place we call inode_inc_iversion_raw is in
> nfs_inode_add_request, which I don't think you'd be hitting unless you
> were writing to the file while holding a write delegation.
> 
> What sort of server is hosting the actual data in your setup?

We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of (older) Netapps too. The re-export server is running the latest mainline kernel(s).

As far as I can make out, both these originating (home) server types exhibit a similar (but not exactly the same) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it when only using a read-only mount at every hop so I don't think that writes are related.

Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates that might be causing this client invalidation (which is what I initially thought) are ultimately a wasted effort.


>> This patch helps to avoid this when applied to the re-export server but there
>> may be other places where this happens too. I accept that this patch is
>> probably not the right/general way to do this, but it helps to highlight the
>> issue when re-exporting and it works well for our use case:
>> 
>> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000
>> +0000
>> +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
>> @@ -1869,7 +1869,7 @@
>>  
>>         /* More cache consistency checks */
>>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
>> -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>>                         /* Could it be a race with writeback? */
>>                         if (!(have_writers || have_delegation)) {
>>                                 invalid |= NFS_INO_INVALID_DATA
>> 
>> With this patch, the re-export server's NFS client attribute cache is maintained
>> and used by all the clients that then mount it. When many hundreds of clients
>> are all doing similar things at the same time, the re-export server's NFS
>> client cache is invaluable in accelerating the lookups (getattrs).
>> 
>> Perhaps a more correct approach would be to detect when it is knfsd that is
>> accessing the client mount and change the cache consistency checks accordingly?
> 
> Yeah, I don't think you can do this for the reasons Trond outlined.

Yea, I kind of felt like it wasn't quite right, but I didn't know enough about the intricacies to say why exactly. So thanks to everyone for clearing that up for me.

We just followed the code and found that the re-export server spent a lot of time in this code block when we assumed that we should be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.

While the patch definitely helps to better utilise the re-export server's nfs client cache when exporting via knfsd, we do still see many repeat getattrs per minute for the same files on the re-export server when 100s of clients are all reading the same files. So this is probably not the only area where the reading via a knfsd export of an nfs client mount, invalidates the re-export server's nfs client cache. 

Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means that we can do expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours where we repeatedly read the same files from many clients, it can lead to big savings over the WAN.

But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01  0:09     ` Daire Byrne
@ 2020-10-01 10:36       ` Jeff Layton
  2020-10-01 12:38         ` Trond Myklebust
  2020-10-05 12:54         ` Daire Byrne
  0 siblings, 2 replies; 129+ messages in thread
From: Jeff Layton @ 2020-10-01 10:36 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs, linux-cachefs

On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:
> 
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > Hi,
> > > 
> > > I just thought I'd flesh out the other two issues I have found with re-exporting
> > > that are ultimately responsible for the biggest performance bottlenecks. And
> > > both of them revolve around the caching of metadata file lookups in the NFS
> > > client.
> > > 
> > > Especially for the case where we are re-exporting a server many milliseconds
> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
> > > client caches metadata and file data so that it's many LAN clients all benefit
> > > from the re-export server only having to do the WAN lookups once (within a
> > > specified coherency time).
> > > 
> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
> > > is fairly straightforward, but keeping the metadata cached is particularly
> > > difficult. And without the cached metadata we introduce long delays before we
> > > can serve the already present and locally cached file data to many waiting
> > > clients.
> > > 
> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> > > > cut the network packets back to the origin server to zero for repeated lookups.
> > > > However, if a client of the re-export server walks paths and memory maps those
> > > > files (i.e. loading an application), the re-export server starts issuing
> > > > unexpected calls back to the origin server again, ignoring/invalidating the
> > > > re-export server's NFS client cache. We worked around this this by patching an
> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
> > > > re-export server is used. I'm not sure about the correctness of this patch but
> > > > it works for our corner case.
> > > 
> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
> > > re-export server, we can successfully cache the loading of applications and
> > > walking of paths directly on the re-export server such that after a couple of
> > > runs, there are practically zero packets back to the originating NFS server
> > > (great!). But, if we then do the same thing on a client which is mounting that
> > > re-export server, the re-export server now starts issuing lots of calls back to
> > > the originating server and invalidating it's client cache (bad!).
> > > 
> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
> > > (due to atime modification?) most likely via invocation of method
> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
> > > validate attributes detects changes causing it to be reloaded from the
> > > originating server.
> > > 
> > 
> > I'd expect the change attribute to track what's in actual inode on the
> > "home" server. The NFS client is supposed to (mostly) keep the raw
> > change attribute in its i_version field.
> > 
> > The only place we call inode_inc_iversion_raw is in
> > nfs_inode_add_request, which I don't think you'd be hitting unless you
> > were writing to the file while holding a write delegation.
> > 
> > What sort of server is hosting the actual data in your setup?
> 
> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of (older) Netapps too. The re-export server is running the latest mainline kernel(s).
> 
> As far as I can make out, both these originating (home) server types exhibit a similar (but not exactly the same) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it when only using a read-only mount at every hop so I don't think that writes are related.
> 
> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates that might be causing this client invalidation (which is what I initially thought) are ultimately a wasted effort.
> 

Ok. I suspect there is a bug here somewhere, but with such a complicated
setup though it's not clear to me where that bug would be though. You
might need to do some packet sniffing and look at what the servers are
sending for change attributes.

nfsd4_change_attribute does mix in the ctime, so your hunch about the
atime may be correct. atime updates imply a ctime update and that could
cause nfsd to continually send a new one, even on files that aren't
being changed.

It might be interesting to doctor nfsd4_change_attribute() to not mix in
the ctime and see whether that improves things. If it does, then we may
want to teach nfsd how to avoid doing that for certain types of
filesystems.

> 
> > > This patch helps to avoid this when applied to the re-export server but there
> > > may be other places where this happens too. I accept that this patch is
> > > probably not the right/general way to do this, but it helps to highlight the
> > > issue when re-exporting and it works well for our use case:
> > > 
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000
> > > +0000
> > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >  
> > >         /* More cache consistency checks */
> > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > >                         /* Could it be a race with writeback? */
> > >                         if (!(have_writers || have_delegation)) {
> > >                                 invalid |= NFS_INO_INVALID_DATA
> > > 
> > > With this patch, the re-export server's NFS client attribute cache is maintained
> > > and used by all the clients that then mount it. When many hundreds of clients
> > > are all doing similar things at the same time, the re-export server's NFS
> > > client cache is invaluable in accelerating the lookups (getattrs).
> > > 
> > > Perhaps a more correct approach would be to detect when it is knfsd that is
> > > accessing the client mount and change the cache consistency checks accordingly?
> > 
> > Yeah, I don't think you can do this for the reasons Trond outlined.
> 
> Yea, I kind of felt like it wasn't quite right, but I didn't know enough about the intricacies to say why exactly. So thanks to everyone for clearing that up for me.
> 
> We just followed the code and found that the re-export server spent a lot of time in this code block when we assumed that we should be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.
> 
> While the patch definitely helps to better utilise the re-export server's nfs client cache when exporting via knfsd, we do still see many repeat getattrs per minute for the same files on the re-export server when 100s of clients are all reading the same files. So this is probably not the only area where the reading via a knfsd export of an nfs client mount, invalidates the re-export server's nfs client cache. 
>
> Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means that we can do expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours where we repeatedly read the same files from many clients, it can lead to big savings over the WAN.
> 
> But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.
> 
> Daire

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 10:36       ` Jeff Layton
@ 2020-10-01 12:38         ` Trond Myklebust
  2020-10-01 16:39           ` Jeff Layton
  2020-10-05 12:54         ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-10-01 12:38 UTC (permalink / raw)
  To: jlayton, daire; +Cc: linux-cachefs, linux-nfs

On Thu, 2020-10-01 at 06:36 -0400, Jeff Layton wrote:
> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> > ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org
> > wrote:
> > 
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > Hi,
> > > > 
> > > > I just thought I'd flesh out the other two issues I have found
> > > > with re-exporting
> > > > that are ultimately responsible for the biggest performance
> > > > bottlenecks. And
> > > > both of them revolve around the caching of metadata file
> > > > lookups in the NFS
> > > > client.
> > > > 
> > > > Especially for the case where we are re-exporting a server many
> > > > milliseconds
> > > > away (i.e. on-premise -> cloud), we want to be able to control
> > > > how much the
> > > > client caches metadata and file data so that it's many LAN
> > > > clients all benefit
> > > > from the re-export server only having to do the WAN lookups
> > > > once (within a
> > > > specified coherency time).
> > > > 
> > > > Keeping the file data in the vfs page cache or on disk using
> > > > fscache/cachefiles
> > > > is fairly straightforward, but keeping the metadata cached is
> > > > particularly
> > > > difficult. And without the cached metadata we introduce long
> > > > delays before we
> > > > can serve the already present and locally cached file data to
> > > > many waiting
> > > > clients.
> > > > 
> > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com
> > > > wrote:
> > > > > 2) If we cache metadata on the re-export server using
> > > > > actimeo=3600,nocto we can
> > > > > cut the network packets back to the origin server to zero for
> > > > > repeated lookups.
> > > > > However, if a client of the re-export server walks paths and
> > > > > memory maps those
> > > > > files (i.e. loading an application), the re-export server
> > > > > starts issuing
> > > > > unexpected calls back to the origin server again,
> > > > > ignoring/invalidating the
> > > > > re-export server's NFS client cache. We worked around this
> > > > > this by patching an
> > > > > inode/iversion validity check in inode.c so that the NFS
> > > > > client cache on the
> > > > > re-export server is used. I'm not sure about the correctness
> > > > > of this patch but
> > > > > it works for our corner case.
> > > > 
> > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > volume on the
> > > > re-export server, we can successfully cache the loading of
> > > > applications and
> > > > walking of paths directly on the re-export server such that
> > > > after a couple of
> > > > runs, there are practically zero packets back to the
> > > > originating NFS server
> > > > (great!). But, if we then do the same thing on a client which
> > > > is mounting that
> > > > re-export server, the re-export server now starts issuing lots
> > > > of calls back to
> > > > the originating server and invalidating it's client cache
> > > > (bad!).
> > > > 
> > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > changed locally
> > > > (due to atime modification?) most likely via invocation of
> > > > method
> > > > inode_inc_iversion_raw. Each time it gets incremented the
> > > > following call to
> > > > validate attributes detects changes causing it to be reloaded
> > > > from the
> > > > originating server.
> > > > 
> > > 
> > > I'd expect the change attribute to track what's in actual inode
> > > on the
> > > "home" server. The NFS client is supposed to (mostly) keep the
> > > raw
> > > change attribute in its i_version field.
> > > 
> > > The only place we call inode_inc_iversion_raw is in
> > > nfs_inode_add_request, which I don't think you'd be hitting
> > > unless you
> > > were writing to the file while holding a write delegation.
> > > 
> > > What sort of server is hosting the actual data in your setup?
> > 
> > We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a
> > couple of (older) Netapps too. The re-export server is running the
> > latest mainline kernel(s).
> > 
> > As far as I can make out, both these originating (home) server
> > types exhibit a similar (but not exactly the same) effect on the
> > Linux NFS client cache when it is being re-exported and accessed by
> > other clients. I can replicate it when only using a read-only mount
> > at every hop so I don't think that writes are related.
> > 
> > Our RHEL7 NFS servers actually mount XFS with noatime too so any
> > atime updates that might be causing this client invalidation (which
> > is what I initially thought) are ultimately a wasted effort.
> > 
> 
> Ok. I suspect there is a bug here somewhere, but with such a
> complicated
> setup though it's not clear to me where that bug would be though. You
> might need to do some packet sniffing and look at what the servers
> are
> sending for change attributes.
> 
> nfsd4_change_attribute does mix in the ctime, so your hunch about the
> atime may be correct. atime updates imply a ctime update and that
> could
> cause nfsd to continually send a new one, even on files that aren't
> being changed.

No. Ordinary atime updates due to read() do not trigger a ctime or
change attribute update. Only an explicit atime update through, e.g. a
call to utimensat() will do that.

> 
> It might be interesting to doctor nfsd4_change_attribute() to not mix
> in
> the ctime and see whether that improves things. If it does, then we
> may
> want to teach nfsd how to avoid doing that for certain types of
> filesystems.

NACK. That would cause very incorrect behaviour for the change
attribute. It is supposed to change in all circumstances where you
ordinarily see a ctime change.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 12:38         ` Trond Myklebust
@ 2020-10-01 16:39           ` Jeff Layton
  0 siblings, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-10-01 16:39 UTC (permalink / raw)
  To: Trond Myklebust, daire; +Cc: linux-cachefs, linux-nfs

On Thu, 2020-10-01 at 12:38 +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 06:36 -0400, Jeff Layton wrote:
> > On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> > > ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org
> > > wrote:
> > > 
> > > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > > Hi,
> > > > > 
> > > > > I just thought I'd flesh out the other two issues I have found
> > > > > with re-exporting
> > > > > that are ultimately responsible for the biggest performance
> > > > > bottlenecks. And
> > > > > both of them revolve around the caching of metadata file
> > > > > lookups in the NFS
> > > > > client.
> > > > > 
> > > > > Especially for the case where we are re-exporting a server many
> > > > > milliseconds
> > > > > away (i.e. on-premise -> cloud), we want to be able to control
> > > > > how much the
> > > > > client caches metadata and file data so that it's many LAN
> > > > > clients all benefit
> > > > > from the re-export server only having to do the WAN lookups
> > > > > once (within a
> > > > > specified coherency time).
> > > > > 
> > > > > Keeping the file data in the vfs page cache or on disk using
> > > > > fscache/cachefiles
> > > > > is fairly straightforward, but keeping the metadata cached is
> > > > > particularly
> > > > > difficult. And without the cached metadata we introduce long
> > > > > delays before we
> > > > > can serve the already present and locally cached file data to
> > > > > many waiting
> > > > > clients.
> > > > > 
> > > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com
> > > > > wrote:
> > > > > > 2) If we cache metadata on the re-export server using
> > > > > > actimeo=3600,nocto we can
> > > > > > cut the network packets back to the origin server to zero for
> > > > > > repeated lookups.
> > > > > > However, if a client of the re-export server walks paths and
> > > > > > memory maps those
> > > > > > files (i.e. loading an application), the re-export server
> > > > > > starts issuing
> > > > > > unexpected calls back to the origin server again,
> > > > > > ignoring/invalidating the
> > > > > > re-export server's NFS client cache. We worked around this
> > > > > > this by patching an
> > > > > > inode/iversion validity check in inode.c so that the NFS
> > > > > > client cache on the
> > > > > > re-export server is used. I'm not sure about the correctness
> > > > > > of this patch but
> > > > > > it works for our corner case.
> > > > > 
> > > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > > volume on the
> > > > > re-export server, we can successfully cache the loading of
> > > > > applications and
> > > > > walking of paths directly on the re-export server such that
> > > > > after a couple of
> > > > > runs, there are practically zero packets back to the
> > > > > originating NFS server
> > > > > (great!). But, if we then do the same thing on a client which
> > > > > is mounting that
> > > > > re-export server, the re-export server now starts issuing lots
> > > > > of calls back to
> > > > > the originating server and invalidating it's client cache
> > > > > (bad!).
> > > > > 
> > > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > > changed locally
> > > > > (due to atime modification?) most likely via invocation of
> > > > > method
> > > > > inode_inc_iversion_raw. Each time it gets incremented the
> > > > > following call to
> > > > > validate attributes detects changes causing it to be reloaded
> > > > > from the
> > > > > originating server.
> > > > > 
> > > > 
> > > > I'd expect the change attribute to track what's in actual inode
> > > > on the
> > > > "home" server. The NFS client is supposed to (mostly) keep the
> > > > raw
> > > > change attribute in its i_version field.
> > > > 
> > > > The only place we call inode_inc_iversion_raw is in
> > > > nfs_inode_add_request, which I don't think you'd be hitting
> > > > unless you
> > > > were writing to the file while holding a write delegation.
> > > > 
> > > > What sort of server is hosting the actual data in your setup?
> > > 
> > > We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a
> > > couple of (older) Netapps too. The re-export server is running the
> > > latest mainline kernel(s).
> > > 
> > > As far as I can make out, both these originating (home) server
> > > types exhibit a similar (but not exactly the same) effect on the
> > > Linux NFS client cache when it is being re-exported and accessed by
> > > other clients. I can replicate it when only using a read-only mount
> > > at every hop so I don't think that writes are related.
> > > 
> > > Our RHEL7 NFS servers actually mount XFS with noatime too so any
> > > atime updates that might be causing this client invalidation (which
> > > is what I initially thought) are ultimately a wasted effort.
> > > 
> > 
> > Ok. I suspect there is a bug here somewhere, but with such a
> > complicated
> > setup though it's not clear to me where that bug would be though. You
> > might need to do some packet sniffing and look at what the servers
> > are
> > sending for change attributes.
> > 
> > nfsd4_change_attribute does mix in the ctime, so your hunch about the
> > atime may be correct. atime updates imply a ctime update and that
> > could
> > cause nfsd to continually send a new one, even on files that aren't
> > being changed.
> 
> No. Ordinary atime updates due to read() do not trigger a ctime or
> change attribute update. Only an explicit atime update through, e.g. a
> call to utimensat() will do that.
> 

Oh, interesting. I didn't realize that.

> > It might be interesting to doctor nfsd4_change_attribute() to not mix
> > in
> > the ctime and see whether that improves things. If it does, then we
> > may
> > want to teach nfsd how to avoid doing that for certain types of
> > filesystems.
> 
> NACK. That would cause very incorrect behaviour for the change
> attribute. It is supposed to change in all circumstances where you
> ordinarily see a ctime change.


I wasn't suggesting this as a real fix, just as a way to see whether we
understand the problem correctly. I doubt the reexporting machine would
be bumping the change_attr on its own, and this may tell you whether
it's the "home" server changing it. There are other ways to determine it
too though (packet sniffer, for instance).

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
  2020-10-01  0:09     ` Daire Byrne
@ 2020-10-01 18:41     ` J. Bruce Fields
  2020-10-01 19:24       ` Trond Myklebust
  1 sibling, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-10-01 18:41 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, linux-nfs, linux-cachefs

On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:
> > 
> > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000 +0000
> > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > @@ -1869,7 +1869,7 @@
> >  
> >         /* More cache consistency checks */
> >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> >                         /* Could it be a race with writeback? */
> >                         if (!(have_writers || have_delegation)) {
> >                                 invalid |= NFS_INO_INVALID_DATA
> > 
> > With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).
> > 
> > Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly? 
> 
> Yeah, I don't think you can do this for the reasons Trond outlined.

I'm not clear whether Trond thought that knfsd's behavior in the case it
returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to allow
this or some other optimization.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 18:41     ` J. Bruce Fields
@ 2020-10-01 19:24       ` Trond Myklebust
  2020-10-01 19:26         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-10-01 19:24 UTC (permalink / raw)
  To: bfields, jlayton; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > This patch helps to avoid this when applied to the re-export
> > > server but there may be other places where this happens too. I
> > > accept that this patch is probably not the right/general way to
> > > do this, but it helps to highlight the issue when re-exporting
> > > and it works well for our use case:
> > > 
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > 00:23:03.000000000 +0000
> > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >  
> > >         /* More cache consistency checks */
> > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > >change_attr)) {
> > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > >change_attr) {
> > >                         /* Could it be a race with writeback? */
> > >                         if (!(have_writers || have_delegation)) {
> > >                                 invalid |= NFS_INO_INVALID_DATA
> > > 
> > > With this patch, the re-export server's NFS client attribute
> > > cache is maintained and used by all the clients that then mount
> > > it. When many hundreds of clients are all doing similar things at
> > > the same time, the re-export server's NFS client cache is
> > > invaluable in accelerating the lookups (getattrs).
> > > 
> > > Perhaps a more correct approach would be to detect when it is
> > > knfsd that is accessing the client mount and change the cache
> > > consistency checks accordingly? 
> > 
> > Yeah, I don't think you can do this for the reasons Trond outlined.
> 
> I'm not clear whether Trond thought that knfsd's behavior in the case
> it
> returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to
> allow
> this or some other optimization.
> 

NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough to
allow the above optimisation, yes. I'm less sure about whether or not
we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when in
fact we are adding the ctime and filesystem-specific change attribute,
but we could fix that too.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 19:24       ` Trond Myklebust
@ 2020-10-01 19:26         ` bfields
  2020-10-01 19:29           ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-10-01 19:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: jlayton, linux-cachefs, linux-nfs, daire

On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> > On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > This patch helps to avoid this when applied to the re-export
> > > > server but there may be other places where this happens too. I
> > > > accept that this patch is probably not the right/general way to
> > > > do this, but it helps to highlight the issue when re-exporting
> > > > and it works well for our use case:
> > > > 
> > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > > 00:23:03.000000000 +0000
> > > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > > @@ -1869,7 +1869,7 @@
> > > >  
> > > >         /* More cache consistency checks */
> > > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > > >change_attr)) {
> > > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > > >change_attr) {
> > > >                         /* Could it be a race with writeback? */
> > > >                         if (!(have_writers || have_delegation)) {
> > > >                                 invalid |= NFS_INO_INVALID_DATA
> > > > 
> > > > With this patch, the re-export server's NFS client attribute
> > > > cache is maintained and used by all the clients that then mount
> > > > it. When many hundreds of clients are all doing similar things at
> > > > the same time, the re-export server's NFS client cache is
> > > > invaluable in accelerating the lookups (getattrs).
> > > > 
> > > > Perhaps a more correct approach would be to detect when it is
> > > > knfsd that is accessing the client mount and change the cache
> > > > consistency checks accordingly? 
> > > 
> > > Yeah, I don't think you can do this for the reasons Trond outlined.
> > 
> > I'm not clear whether Trond thought that knfsd's behavior in the case
> > it
> > returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to
> > allow
> > this or some other optimization.
> > 
> 
> NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough to
> allow the above optimisation, yes. I'm less sure about whether or not
> we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when in
> fact we are adding the ctime and filesystem-specific change attribute,
> but we could fix that too.

Could you explain your concern?

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 19:26         ` bfields
@ 2020-10-01 19:29           ` Trond Myklebust
  2020-10-01 19:51             ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-10-01 19:29 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, jlayton, daire

On Thu, 2020-10-01 at 15:26 -0400, bfields@fieldses.org wrote:
> On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> > > On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > > This patch helps to avoid this when applied to the re-export
> > > > > server but there may be other places where this happens too.
> > > > > I
> > > > > accept that this patch is probably not the right/general way
> > > > > to
> > > > > do this, but it helps to highlight the issue when re-
> > > > > exporting
> > > > > and it works well for our use case:
> > > > > 
> > > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27
> > > > > 00:23:03.000000000 +0000
> > > > > +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
> > > > > @@ -1869,7 +1869,7 @@
> > > > >  
> > > > >         /* More cache consistency checks */
> > > > >         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > > -               if (!inode_eq_iversion_raw(inode, fattr-
> > > > > > change_attr)) {
> > > > > +               if (inode_peek_iversion_raw(inode) < fattr-
> > > > > > change_attr) {
> > > > >                         /* Could it be a race with writeback?
> > > > > */
> > > > >                         if (!(have_writers ||
> > > > > have_delegation)) {
> > > > >                                 invalid |=
> > > > > NFS_INO_INVALID_DATA
> > > > > 
> > > > > With this patch, the re-export server's NFS client attribute
> > > > > cache is maintained and used by all the clients that then
> > > > > mount
> > > > > it. When many hundreds of clients are all doing similar
> > > > > things at
> > > > > the same time, the re-export server's NFS client cache is
> > > > > invaluable in accelerating the lookups (getattrs).
> > > > > 
> > > > > Perhaps a more correct approach would be to detect when it is
> > > > > knfsd that is accessing the client mount and change the cache
> > > > > consistency checks accordingly? 
> > > > 
> > > > Yeah, I don't think you can do this for the reasons Trond
> > > > outlined.
> > > 
> > > I'm not clear whether Trond thought that knfsd's behavior in the
> > > case
> > > it
> > > returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough
> > > to
> > > allow
> > > this or some other optimization.
> > > 
> > 
> > NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough
> > to
> > allow the above optimisation, yes. I'm less sure about whether or
> > not
> > we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when
> > in
> > fact we are adding the ctime and filesystem-specific change
> > attribute,
> > but we could fix that too.
> 
> Could you explain your concern?
> 

Same as before: that the ctime could cause the value to regress if
someone messes with the system time on the server. Yes, we do add in
the change attribute, but the value of ctime.tv_sec dominates by a
factor 2^30.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 19:29           ` Trond Myklebust
@ 2020-10-01 19:51             ` bfields
  0 siblings, 0 replies; 129+ messages in thread
From: bfields @ 2020-10-01 19:51 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, jlayton, daire

On Thu, Oct 01, 2020 at 07:29:51PM +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 15:26 -0400, bfields@fieldses.org wrote:
> > On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> > > NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough
> > > to
> > > allow the above optimisation, yes. I'm less sure about whether or
> > > not
> > > we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when
> > > in
> > > fact we are adding the ctime and filesystem-specific change
> > > attribute,
> > > but we could fix that too.
> > 
> > Could you explain your concern?
> > 
> 
> Same as before: that the ctime could cause the value to regress if
> someone messes with the system time on the server. Yes, we do add in
> the change attribute, but the value of ctime.tv_sec dominates by a
> factor 2^30.

Got it.

I'd like to just tell people not to do that....

If we think it's too easy a mistake to make, I can think of other
approaches, though filesystem assistance might be required:

- Ideal would be just never to expose uncommitted change attributes to
  the client.  Absent persistant RAM that could be terribly expensive.

- It would help just to have any number that's guaranteed to increase
  after a boot.  Of course, if would to go forward at least as reliably
  as the system time.  We'd put it in the high bits of the on-disk
  i_version.  (We'd rather not just mix it into the returned change
  attribute as we do with ctime, because that would cause clients to
  discard all their caches unnecessarily after boot.)

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-01 10:36       ` Jeff Layton
  2020-10-01 12:38         ` Trond Myklebust
@ 2020-10-05 12:54         ` Daire Byrne
  2020-10-13  9:59           ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-10-05 12:54 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, linux-cachefs

----- On 1 Oct, 2020, at 11:36, Jeff Layton jlayton@kernel.org wrote:

> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
>> ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:
>> 
>> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> > > Hi,
>> > > 
>> > > I just thought I'd flesh out the other two issues I have found with re-exporting
>> > > that are ultimately responsible for the biggest performance bottlenecks. And
>> > > both of them revolve around the caching of metadata file lookups in the NFS
>> > > client.
>> > > 
>> > > Especially for the case where we are re-exporting a server many milliseconds
>> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
>> > > client caches metadata and file data so that it's many LAN clients all benefit
>> > > from the re-export server only having to do the WAN lookups once (within a
>> > > specified coherency time).
>> > > 
>> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> > > is fairly straightforward, but keeping the metadata cached is particularly
>> > > difficult. And without the cached metadata we introduce long delays before we
>> > > can serve the already present and locally cached file data to many waiting
>> > > clients.
>> > > 
>> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
>> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > > > cut the network packets back to the origin server to zero for repeated lookups.
>> > > > However, if a client of the re-export server walks paths and memory maps those
>> > > > files (i.e. loading an application), the re-export server starts issuing
>> > > > unexpected calls back to the origin server again, ignoring/invalidating the
>> > > > re-export server's NFS client cache. We worked around this this by patching an
>> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > > > re-export server is used. I'm not sure about the correctness of this patch but
>> > > > it works for our corner case.
>> > > 
>> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> > > re-export server, we can successfully cache the loading of applications and
>> > > walking of paths directly on the re-export server such that after a couple of
>> > > runs, there are practically zero packets back to the originating NFS server
>> > > (great!). But, if we then do the same thing on a client which is mounting that
>> > > re-export server, the re-export server now starts issuing lots of calls back to
>> > > the originating server and invalidating it's client cache (bad!).
>> > > 
>> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
>> > > (due to atime modification?) most likely via invocation of method
>> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
>> > > validate attributes detects changes causing it to be reloaded from the
>> > > originating server.
>> > > 
>> > 
>> > I'd expect the change attribute to track what's in actual inode on the
>> > "home" server. The NFS client is supposed to (mostly) keep the raw
>> > change attribute in its i_version field.
>> > 
>> > The only place we call inode_inc_iversion_raw is in
>> > nfs_inode_add_request, which I don't think you'd be hitting unless you
>> > were writing to the file while holding a write delegation.
>> > 
>> > What sort of server is hosting the actual data in your setup?
>> 
>> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of
>> (older) Netapps too. The re-export server is running the latest mainline
>> kernel(s).
>> 
>> As far as I can make out, both these originating (home) server types exhibit a
>> similar (but not exactly the same) effect on the Linux NFS client cache when it
>> is being re-exported and accessed by other clients. I can replicate it when
>> only using a read-only mount at every hop so I don't think that writes are
>> related.
>> 
>> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates
>> that might be causing this client invalidation (which is what I initially
>> thought) are ultimately a wasted effort.
>> 
> 
> Ok. I suspect there is a bug here somewhere, but with such a complicated
> setup though it's not clear to me where that bug would be though. You
> might need to do some packet sniffing and look at what the servers are
> sending for change attributes.
> 
> nfsd4_change_attribute does mix in the ctime, so your hunch about the
> atime may be correct. atime updates imply a ctime update and that could
> cause nfsd to continually send a new one, even on files that aren't
> being changed.
> 
> It might be interesting to doctor nfsd4_change_attribute() to not mix in
> the ctime and see whether that improves things. If it does, then we may
> want to teach nfsd how to avoid doing that for certain types of
> filesystems.

Okay, I started to run back through all my tests again with various combinations of server, client mount options, NFS version etc. with the intention of packet capturing as Jeff has suggested.

But I quickly realised that I had mixed up some previous results before I reported them here. The summary is that using an NFS RHEL76 server, a client mounting with a recent mainline kernel and re-exporting using NFSv4.x all the way through does NOT invalidate the re-export server's NFS client cache (great!) like I had assumed before. It does when we mount the originating RHEL7 server using NFSv3 and re-export, but not with any version of NFSv4 on Linux.

But I think I know how I got confused - the Netapp NFSv4 case is different. When we mount our (old) 7-mode Netapp using NFSv4.0 and re-export that, the re-export server's client cache is invalidated often in the same way as for an NFSv3 server. On top of that, I think I wrongly mistook some of the NFSv4 client's natural dropping of metadata from page cache as client invalidations caused by the re-export and client access (without vfs_cache_pressure=0 and see my #3 bullet point).

Both of these conspired to make me think that both NFSv3 AND NFSv4 re-exporting showed the same issue when in fact, it's just NFSv3 and the Netapp's v4.0 that require my "hack" to stop the client cache being invalidated. Sorry for any confusion (it is indeed a complicated setup!). Let me summarise then once and for all:

rhel76 server (xfs noatime) -> re-export server (vers=4.x,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = good client cache metadata performance, my hacky patch is not required.
rhel76 server (xfs noatime) -> re-export server (vers=3,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = bad performance (new lookups & getattrs), my hacky patch is required for better performance.
netapp (7-mode) -> re-export server (vers=4.0,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = bad performance, my hacky patch is required for better performance.

So for Jeff's original intention of proxying a NFSv3 server -> NFSv4 clients by re-exporting, the metadata lookup performance will degrade severely as more clients access the same files because the re-export server's client cache is not being used as effectively (re-exported) and lookups are happening for the same files many times within the re-export server's actimeo even with vfs_cache_pressure=0.

For our particular use case, we could live without NFSv3 (and my horrible hack) except for the fact that the Netapp shows similar behaviour with NFSv4.0 (but Linux servers do not). I don't know if turning off atime updates on the Netapp volume will change anything - I might try it. Of course, re-exporting NFSv3 with good meatadata cache performance is still a nice thing to have too.

I'll now see if I can decipher the network calls back to the Netapp (NFSv4.0) as suggested by Jeff to see why it is different.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Linux-cachefs] Adventures in NFS re-exporting
  2020-10-05 12:54         ` Daire Byrne
@ 2020-10-13  9:59           ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-10-13  9:59 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, linux-cachefs


----- On 5 Oct, 2020, at 13:54, Daire Byrne daire@dneg.com wrote:
> ----- On 1 Oct, 2020, at 11:36, Jeff Layton jlayton@kernel.org wrote:
> 
>> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
>>> ----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton@kernel.org wrote:
>>> 
>>> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>>> > > Hi,
>>> > > 
>>> > > I just thought I'd flesh out the other two issues I have found with re-exporting
>>> > > that are ultimately responsible for the biggest performance bottlenecks. And
>>> > > both of them revolve around the caching of metadata file lookups in the NFS
>>> > > client.
>>> > > 
>>> > > Especially for the case where we are re-exporting a server many milliseconds
>>> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
>>> > > client caches metadata and file data so that it's many LAN clients all benefit
>>> > > from the re-export server only having to do the WAN lookups once (within a
>>> > > specified coherency time).
>>> > > 
>>> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>>> > > is fairly straightforward, but keeping the metadata cached is particularly
>>> > > difficult. And without the cached metadata we introduce long delays before we
>>> > > can serve the already present and locally cached file data to many waiting
>>> > > clients.
>>> > > 
>>> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
>>> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>>> > > > cut the network packets back to the origin server to zero for repeated lookups.
>>> > > > However, if a client of the re-export server walks paths and memory maps those
>>> > > > files (i.e. loading an application), the re-export server starts issuing
>>> > > > unexpected calls back to the origin server again, ignoring/invalidating the
>>> > > > re-export server's NFS client cache. We worked around this this by patching an
>>> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
>>> > > > re-export server is used. I'm not sure about the correctness of this patch but
>>> > > > it works for our corner case.
>>> > > 
>>> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>>> > > re-export server, we can successfully cache the loading of applications and
>>> > > walking of paths directly on the re-export server such that after a couple of
>>> > > runs, there are practically zero packets back to the originating NFS server
>>> > > (great!). But, if we then do the same thing on a client which is mounting that
>>> > > re-export server, the re-export server now starts issuing lots of calls back to
>>> > > the originating server and invalidating it's client cache (bad!).
>>> > > 
>>> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
>>> > > (due to atime modification?) most likely via invocation of method
>>> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
>>> > > validate attributes detects changes causing it to be reloaded from the
>>> > > originating server.
>>> > > 
>>> > 
>>> > I'd expect the change attribute to track what's in actual inode on the
>>> > "home" server. The NFS client is supposed to (mostly) keep the raw
>>> > change attribute in its i_version field.
>>> > 
>>> > The only place we call inode_inc_iversion_raw is in
>>> > nfs_inode_add_request, which I don't think you'd be hitting unless you
>>> > were writing to the file while holding a write delegation.
>>> > 
>>> > What sort of server is hosting the actual data in your setup?
>>> 
>>> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of
>>> (older) Netapps too. The re-export server is running the latest mainline
>>> kernel(s).
>>> 
>>> As far as I can make out, both these originating (home) server types exhibit a
>>> similar (but not exactly the same) effect on the Linux NFS client cache when it
>>> is being re-exported and accessed by other clients. I can replicate it when
>>> only using a read-only mount at every hop so I don't think that writes are
>>> related.
>>> 
>>> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates
>>> that might be causing this client invalidation (which is what I initially
>>> thought) are ultimately a wasted effort.
>>> 
>> 
>> Ok. I suspect there is a bug here somewhere, but with such a complicated
>> setup though it's not clear to me where that bug would be though. You
>> might need to do some packet sniffing and look at what the servers are
>> sending for change attributes.
>> 
>> nfsd4_change_attribute does mix in the ctime, so your hunch about the
>> atime may be correct. atime updates imply a ctime update and that could
>> cause nfsd to continually send a new one, even on files that aren't
>> being changed.
>> 
>> It might be interesting to doctor nfsd4_change_attribute() to not mix in
>> the ctime and see whether that improves things. If it does, then we may
>> want to teach nfsd how to avoid doing that for certain types of
>> filesystems.
> 
> Okay, I started to run back through all my tests again with various combinations
> of server, client mount options, NFS version etc. with the intention of packet
> capturing as Jeff has suggested.
> 
> But I quickly realised that I had mixed up some previous results before I
> reported them here. The summary is that using an NFS RHEL76 server, a client
> mounting with a recent mainline kernel and re-exporting using NFSv4.x all the
> way through does NOT invalidate the re-export server's NFS client cache
> (great!) like I had assumed before. It does when we mount the originating RHEL7
> server using NFSv3 and re-export, but not with any version of NFSv4 on Linux.
> 
> But I think I know how I got confused - the Netapp NFSv4 case is different. When
> we mount our (old) 7-mode Netapp using NFSv4.0 and re-export that, the
> re-export server's client cache is invalidated often in the same way as for an
> NFSv3 server. On top of that, I think I wrongly mistook some of the NFSv4
> client's natural dropping of metadata from page cache as client invalidations
> caused by the re-export and client access (without vfs_cache_pressure=0 and see
> my #3 bullet point).
> 
> Both of these conspired to make me think that both NFSv3 AND NFSv4 re-exporting
> showed the same issue when in fact, it's just NFSv3 and the Netapp's v4.0 that
> require my "hack" to stop the client cache being invalidated. Sorry for any
> confusion (it is indeed a complicated setup!). Let me summarise then once and
> for all:
> 
> rhel76 server (xfs noatime) -> re-export server (vers=4.x,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = good client cache metadata performance, my hacky patch
> is not required.
> rhel76 server (xfs noatime) -> re-export server (vers=3,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = bad performance (new lookups & getattrs), my hacky
> patch is required for better performance.
> netapp (7-mode) -> re-export server (vers=4.0,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = bad performance, my hacky patch is required for better
> performance.
> 
> So for Jeff's original intention of proxying a NFSv3 server -> NFSv4 clients by
> re-exporting, the metadata lookup performance will degrade severely as more
> clients access the same files because the re-export server's client cache is
> not being used as effectively (re-exported) and lookups are happening for the
> same files many times within the re-export server's actimeo even with
> vfs_cache_pressure=0.
> 
> For our particular use case, we could live without NFSv3 (and my horrible hack)
> except for the fact that the Netapp shows similar behaviour with NFSv4.0 (but
> Linux servers do not). I don't know if turning off atime updates on the Netapp
> volume will change anything - I might try it. Of course, re-exporting NFSv3
> with good meatadata cache performance is still a nice thing to have too.
> 
> I'll now see if I can decipher the network calls back to the Netapp (NFSv4.0) as
> suggested by Jeff to see why it is different.

I did a little more digging and the big jump in client ops on the re-export server back to the originating Netapp using NFSv4.0 seems to be mostly because it is issuing lots of READDIR calls. The same workload to a Linux NFS server does not issue a single READDIR/READDIRPLUS call (once cached). As to why these are not cached in the client for repeated lookups (without my hack), I have no idea.

However, I was eventually able to devise a workload that could also cause the NFSv4.2 client cache on the re-export server to unexpectedly "lose" entries such that it needed to reissue calls back to an originating Linux server. A large proportion of these were NFS4ERR_NOENT (but not all) so I don't know if maybe it is something specific to the negative entry cache.

It is really hard following the packets from the re-export's client through the re-export server and on to the originating server, but as far as I can make out, it was mostly issuing access/lookup/getattr for directories (that should already be cached) when the re-export server's clients are issuing calls like readlink (for example resolving a library directory with symlinks).

I have also noticed another couple of new curiosities. If we run a typical small workload against a client mount such that it is all cached for repeat runs and then re-export that same directory to a remote client and run the same workload, the reads that should already be cached are all fetched again from the originating server. Only then are they are cached for repeat runs or for different clients. It's almost like the NFS client cache on the re-export server sees the locally accessed client mount as a different filesystem (and cache) to the knfsd re-exported one. A consequence of embedding the filehandles?

And while looking at the packet traces for this, I also noticed that when re-exported to a client, all the read calls back to the originating server are being chopped up into a maximum of 128k. It's as if I had mounted the originating server using rsize=131072 (it's definitely 1MB). So a client of the re-export server is receiving rsize=1MB reads, but the re-export server is pulling them from the originating server in 128k chunks. This was using NFSV4.2 all the way through.

Is this an expected side-effect of re-exporting? Is it some weird interaction with the nfs client's readahead? It has the effect of large reads requiring 8x more round-trips for re-export clients than if they had just gone direct to the originating server (and gotten 1MB reads).

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-09-16 16:01     ` Daire Byrne
@ 2020-10-19 16:19       ` Daire Byrne
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
                           ` (3 more replies)
  0 siblings, 4 replies; 129+ messages in thread
From: Daire Byrne @ 2020-10-19 16:19 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-cachefs, linux-nfs


----- On 16 Sep, 2020, at 17:01, Daire Byrne daire@dneg.com wrote:

> Trond/Bruce,
> 
> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:
> 
>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>> > (under memory cache churn) when those filehandles are still being
>>> > used by the knfsd's remote clients resulting in sporadic and random
>>> > stale filehandles. This seems to be mostly for directories from
>>> > what I've seen. Does the NFS client not know that knfsd is still
>>> > using those files/dirs? The workaround is to never drop inode &
>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>> > also helps to ensure that we actually make the most of our
>>> > actimeo=3600,nocto mount options for the full specified time.
>>> 
>>> I thought reexport worked by embedding the original server's
>>> filehandles
>>> in the filehandles given out by the reexporting server.
>>> 
>>> So, even if nothing's cached, when the reexporting server gets a
>>> filehandle, it should be able to extract the original filehandle from
>>> it
>>> and use that.
>>> 
>>> I wonder why that's not working?
>> 
>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>> callback for it.
> 
> So in terms of the ESTALE counter on the reexport server, we see it increase if
> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
> difference in the client experience in that with NFSv3 we quickly get
> input/output errors but with NFSv4 we don't. But it does seem like the
> performance drops significantly which makes me think that NFSv4 retries the
> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
> 
> This is the simplest reproducer I could come up with but it may still be
> specific to our workloads/applications and hard to replicate exactly.
> 
> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro
> reexport-server:/vol/software /mnt/software
> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee
> /proc/sys/vm/drop_caches; done
> 
> reexport-server # sysctl -w vm.vfs_cache_pressure=100
> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep
> 10; done
> 
> Where "application" is some big application with lots of paths to scan with libs
> to memory map and "/vol/software" is an NFS mount on the reexport-server from
> another originating NFS server. I don't know why this application loading
> workload shows this best, but perhaps the access patterns of memory mapped
> binaries and libs is particularly susceptible to estale?
> 
> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
> The ESTALE count increases and the client running the application reports
> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
> 
> As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the
> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
> no longer increases and the client experiences no issues (NFSv3 & NFSv4).

I don't suppose anyone has any more thoughts on this one? This is likely the first problem that anyone trying to NFS re-export is going to encounter. If they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are dropped from cache (with the default vfs_cache_pressure=100) and if they re-export NFSv4, the lookup performance will drop significantly as an ESTALE triggers re-lookups.

For our particular use case, it is actually desirable to have vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to help with expensive metadata lookups, but it would still be nice to have the option of using a less drastic setting (such as vfs_cache_pressure=1) to help avoid OOM conditions.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 0/2] Add NFSv3 emulation of the lookupp operation
  2020-10-19 16:19       ` Daire Byrne
@ 2020-10-19 17:53         ` trondmy
  2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
  2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 17:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").

Trond Myklebust (2):
  NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  NFSv3: Add emulation of the lookupp() operation

 fs/nfs/nfs3proc.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-19 17:53           ` trondmy
  2020-10-19 17:53             ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 17:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..a6a222435e9b 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,13 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 }
 
 static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
-		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
-		 struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= dentry->d_name.name,
-		.len		= dentry->d_name.len
+		.name		= name,
+		.len		= len
 	};
 	struct nfs3_diropres	res = {
 		.fh		= fhandle,
@@ -175,15 +174,10 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	int			status;
 	unsigned short task_flags = 0;
 
-	/* Is this is an attribute revalidation, subject to softreval? */
-	if (nfs_lookup_is_soft_revalidate(dentry))
-		task_flags |= RPC_TASK_TIMEOUT;
-
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
 		return -ENOMEM;
 
-	dprintk("NFS call  lookup %pd2\n", dentry);
 	nfs_fattr_init(fattr);
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
 	nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,20 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	return status;
 }
 
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		 struct nfs4_label *label)
+{
+	/* Is this is an attribute revalidation, subject to softreval? */
+	if (nfs_lookup_is_soft_revalidate(dentry))
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	dprintk("NFS call  lookup %pd2\n", dentry);
+	return __nfs3_proc_lookup(dir, dentry->d_name.name,
+				  dentry->d_name.len, fhandle, fattr);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation
  2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
@ 2020-10-19 17:53             ` trondmy
  0 siblings, 0 replies; 129+ messages in thread
From: trondmy @ 2020-10-19 17:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index a6a222435e9b..63d1979933f3 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -155,7 +155,8 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 
 static int
 __nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
-		   struct nfs_fh *fhandle, struct nfs_fattr *fattr)
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		   unsigned short task_flags)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
@@ -172,7 +173,6 @@ __nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
 		.rpc_resp	= &res,
 	};
 	int			status;
-	unsigned short task_flags = 0;
 
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
@@ -197,13 +197,25 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
 		 struct nfs4_label *label)
 {
+	unsigned short task_flags = 0;
+
 	/* Is this is an attribute revalidation, subject to softreval? */
 	if (nfs_lookup_is_soft_revalidate(dentry))
 		task_flags |= RPC_TASK_TIMEOUT;
 
 	dprintk("NFS call  lookup %pd2\n", dentry);
 	return __nfs3_proc_lookup(dir, dentry->d_name.name,
-				  dentry->d_name.len, fhandle, fattr);
+				  dentry->d_name.len, fhandle, fattr,
+				  task_flags);
+}
+
+static int nfs3_proc_lookupp(struct inode *inode, struct nfs_fh *fhandle,
+			     struct nfs_fattr *fattr, struct nfs4_label *label)
+{
+	const char *dotdot = "..";
+	const size_t len = sizeof(dotdot) - 1;
+
+	return __nfs3_proc_lookup(inode, dotdot, len, fhandle, fattr, 0);
 }
 
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
@@ -1012,6 +1024,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
 	.getattr	= nfs3_proc_getattr,
 	.setattr	= nfs3_proc_setattr,
 	.lookup		= nfs3_proc_lookup,
+	.lookupp	= nfs3_proc_lookupp,
 	.access		= nfs3_proc_access,
 	.readlink	= nfs3_proc_readlink,
 	.create		= nfs3_proc_create,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation
  2020-10-19 16:19       ` Daire Byrne
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-19 20:05         ` trondmy
  2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
  2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
  2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
  3 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 20:05 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").

v2:
 - Fix compilation issues for "NFSv3: Refactor nfs3_proc_lookup() to
   split out the dentry"

Trond Myklebust (2):
  NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  NFSv3: Add emulation of the lookupp() operation

 fs/nfs/nfs3proc.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-19 20:05           ` trondmy
  2020-10-19 20:05             ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-19 20:05 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..acbdf7496d31 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,14 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 }
 
 static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
-		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
-		 struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		   unsigned short task_flags)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= dentry->d_name.name,
-		.len		= dentry->d_name.len
+		.name		= name,
+		.len		= len
 	};
 	struct nfs3_diropres	res = {
 		.fh		= fhandle,
@@ -173,17 +173,11 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 		.rpc_resp	= &res,
 	};
 	int			status;
-	unsigned short task_flags = 0;
-
-	/* Is this is an attribute revalidation, subject to softreval? */
-	if (nfs_lookup_is_soft_revalidate(dentry))
-		task_flags |= RPC_TASK_TIMEOUT;
 
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
 		return -ENOMEM;
 
-	dprintk("NFS call  lookup %pd2\n", dentry);
 	nfs_fattr_init(fattr);
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
 	nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,23 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	return status;
 }
 
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		 struct nfs4_label *label)
+{
+	unsigned short task_flags = 0;
+
+	/* Is this is an attribute revalidation, subject to softreval? */
+	if (nfs_lookup_is_soft_revalidate(dentry))
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	dprintk("NFS call  lookup %pd2\n", dentry);
+	return __nfs3_proc_lookup(dir, dentry->d_name.name,
+				  dentry->d_name.len, fhandle, fattr,
+				  task_flags);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation
  2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
@ 2020-10-19 20:05             ` trondmy
  0 siblings, 0 replies; 129+ messages in thread
From: trondmy @ 2020-10-19 20:05 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index acbdf7496d31..63d1979933f3 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -209,6 +209,15 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 				  task_flags);
 }
 
+static int nfs3_proc_lookupp(struct inode *inode, struct nfs_fh *fhandle,
+			     struct nfs_fattr *fattr, struct nfs4_label *label)
+{
+	const char *dotdot = "..";
+	const size_t len = sizeof(dotdot) - 1;
+
+	return __nfs3_proc_lookup(inode, dotdot, len, fhandle, fattr, 0);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
@@ -1015,6 +1024,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
 	.getattr	= nfs3_proc_getattr,
 	.setattr	= nfs3_proc_setattr,
 	.lookup		= nfs3_proc_lookup,
+	.lookupp	= nfs3_proc_lookupp,
 	.access		= nfs3_proc_access,
 	.readlink	= nfs3_proc_readlink,
 	.create		= nfs3_proc_create,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation
  2020-10-19 16:19       ` Daire Byrne
  2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
  2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-20 18:37         ` trondmy
  2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
  2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
  3 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").

v2:
 - Fix compilation issues for "NFSv3: Refactor nfs3_proc_lookup() to
   split out the dentry"
v3:
 - Fix the string length calculation
 - Apply the NFS_MOUNT_SOFTREVAL flag in both the NFSv3 and NFSv4 lookupp

Trond Myklebust (3):
  NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  NFSv3: Add emulation of the lookupp() operation
  NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp

 fs/nfs/nfs3proc.c | 48 ++++++++++++++++++++++++++++++++++++-----------
 fs/nfs/nfs4proc.c |  6 +++++-
 2 files changed, 42 insertions(+), 12 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-20 18:37           ` trondmy
  2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..acbdf7496d31 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,14 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
 }
 
 static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
-		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
-		 struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+		   struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		   unsigned short task_flags)
 {
 	struct nfs3_diropargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= dentry->d_name.name,
-		.len		= dentry->d_name.len
+		.name		= name,
+		.len		= len
 	};
 	struct nfs3_diropres	res = {
 		.fh		= fhandle,
@@ -173,17 +173,11 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 		.rpc_resp	= &res,
 	};
 	int			status;
-	unsigned short task_flags = 0;
-
-	/* Is this is an attribute revalidation, subject to softreval? */
-	if (nfs_lookup_is_soft_revalidate(dentry))
-		task_flags |= RPC_TASK_TIMEOUT;
 
 	res.dir_attr = nfs_alloc_fattr();
 	if (res.dir_attr == NULL)
 		return -ENOMEM;
 
-	dprintk("NFS call  lookup %pd2\n", dentry);
 	nfs_fattr_init(fattr);
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
 	nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,23 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 	return status;
 }
 
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+		 struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+		 struct nfs4_label *label)
+{
+	unsigned short task_flags = 0;
+
+	/* Is this is an attribute revalidation, subject to softreval? */
+	if (nfs_lookup_is_soft_revalidate(dentry))
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	dprintk("NFS call  lookup %pd2\n", dentry);
+	return __nfs3_proc_lookup(dir, dentry->d_name.name,
+				  dentry->d_name.len, fhandle, fattr,
+				  task_flags);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation
  2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
@ 2020-10-20 18:37             ` trondmy
  2020-10-20 18:37               ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
  0 siblings, 1 reply; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3proc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index acbdf7496d31..6b66b73a50eb 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -209,6 +209,20 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
 				  task_flags);
 }
 
+static int nfs3_proc_lookupp(struct inode *inode, struct nfs_fh *fhandle,
+			     struct nfs_fattr *fattr, struct nfs4_label *label)
+{
+	const char dotdot[] = "..";
+	const size_t len = strlen(dotdot);
+	unsigned short task_flags = 0;
+
+	if (NFS_SERVER(inode)->flags & NFS_MOUNT_SOFTREVAL)
+		task_flags |= RPC_TASK_TIMEOUT;
+
+	return __nfs3_proc_lookup(inode, dotdot, len, fhandle, fattr,
+				  task_flags);
+}
+
 static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
 {
 	struct nfs3_accessargs	arg = {
@@ -1015,6 +1029,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
 	.getattr	= nfs3_proc_getattr,
 	.setattr	= nfs3_proc_setattr,
 	.lookup		= nfs3_proc_lookup,
+	.lookupp	= nfs3_proc_lookupp,
 	.access		= nfs3_proc_access,
 	.readlink	= nfs3_proc_readlink,
 	.create		= nfs3_proc_create,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp
  2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
@ 2020-10-20 18:37               ` trondmy
  0 siblings, 0 replies; 129+ messages in thread
From: trondmy @ 2020-10-20 18:37 UTC (permalink / raw)
  To: Daire Byrne; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

We need to respect the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp,
by timing out if the server is unavailable.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs4proc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index bdf33e18fc54..c306c97c1ed0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -4376,6 +4376,10 @@ static int _nfs4_proc_lookupp(struct inode *inode,
 		.rpc_argp = &args,
 		.rpc_resp = &res,
 	};
+	unsigned short task_flags = 0;
+
+	if (NFS_SERVER(inode)->flags & NFS_MOUNT_SOFTREVAL)
+		task_flags |= RPC_TASK_TIMEOUT;
 
 	args.bitmask = nfs4_bitmask(server, label);
 
@@ -4383,7 +4387,7 @@ static int _nfs4_proc_lookupp(struct inode *inode,
 
 	dprintk("NFS call  lookupp ino=0x%lx\n", inode->i_ino);
 	status = nfs4_call_sync(clnt, server, &msg, &args.seq_args,
-				&res.seq_res, 0);
+				&res.seq_res, task_flags);
 	dprintk("NFS reply lookupp: %d\n", status);
 	return status;
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-10-19 16:19       ` Daire Byrne
                           ` (2 preceding siblings ...)
  2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
@ 2020-10-21  9:33         ` Daire Byrne
  2020-11-09 16:02           ` bfields
  3 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-10-21  9:33 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-cachefs, linux-nfs

----- On 19 Oct, 2020, at 17:19, Daire Byrne daire@dneg.com wrote:
> ----- On 16 Sep, 2020, at 17:01, Daire Byrne daire@dneg.com wrote:
> 
>> Trond/Bruce,
>> 
>> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:
>> 
>>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>>> > (under memory cache churn) when those filehandles are still being
>>>> > used by the knfsd's remote clients resulting in sporadic and random
>>>> > stale filehandles. This seems to be mostly for directories from
>>>> > what I've seen. Does the NFS client not know that knfsd is still
>>>> > using those files/dirs? The workaround is to never drop inode &
>>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>>> > also helps to ensure that we actually make the most of our
>>>> > actimeo=3600,nocto mount options for the full specified time.
>>>> 
>>>> I thought reexport worked by embedding the original server's
>>>> filehandles
>>>> in the filehandles given out by the reexporting server.
>>>> 
>>>> So, even if nothing's cached, when the reexporting server gets a
>>>> filehandle, it should be able to extract the original filehandle from
>>>> it
>>>> and use that.
>>>> 
>>>> I wonder why that's not working?
>>> 
>>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>>> callback for it.
>> 
>> So in terms of the ESTALE counter on the reexport server, we see it increase if
>> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
>> difference in the client experience in that with NFSv3 we quickly get
>> input/output errors but with NFSv4 we don't. But it does seem like the
>> performance drops significantly which makes me think that NFSv4 retries the
>> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
>> 
>> This is the simplest reproducer I could come up with but it may still be
>> specific to our workloads/applications and hard to replicate exactly.
>> 
>> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro
>> reexport-server:/vol/software /mnt/software
>> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee
>> /proc/sys/vm/drop_caches; done
>> 
>> reexport-server # sysctl -w vm.vfs_cache_pressure=100
>> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
>> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep
>> 10; done
>> 
>> Where "application" is some big application with lots of paths to scan with libs
>> to memory map and "/vol/software" is an NFS mount on the reexport-server from
>> another originating NFS server. I don't know why this application loading
>> workload shows this best, but perhaps the access patterns of memory mapped
>> binaries and libs is particularly susceptible to estale?
>> 
>> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
>> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
>> The ESTALE count increases and the client running the application reports
>> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
>> 
>> As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the
>> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
>> no longer increases and the client experiences no issues (NFSv3 & NFSv4).
> 
> I don't suppose anyone has any more thoughts on this one? This is likely the
> first problem that anyone trying to NFS re-export is going to encounter. If
> they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are
> dropped from cache (with the default vfs_cache_pressure=100) and if they
> re-export NFSv4, the lookup performance will drop significantly as an ESTALE
> triggers re-lookups.
> 
> For our particular use case, it is actually desirable to have
> vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to
> help with expensive metadata lookups, but it would still be nice to have the
> option of using a less drastic setting (such as vfs_cache_pressure=1) to help
> avoid OOM conditions.

Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million thanks!) so I applied them to v5.9.1 and ran some more tests using that on the re-export server. Again, I just pathologically dropped inode & dentry caches every second on the re-export server (vfs_cache_pressure=100) while a client looped through some application loading tests.

Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode & dentry caches (yay!).

However, my assumption that some of the input/output errors I was seeing were related to the estales seems to have been misguided. After running these tests again without any estales, it now looks like a different issue that is unique to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or Netapp). The lookups are all fine (no estale) but reading some files eventually gives an input/output error on multiple clients which remain consistent until the re-export nfs-server is restarted. Again, this only occurs while dropping inode + dentry caches.

So in summary, while continuously dropping inode/dentry caches on the re-export server:

originating server NFSv4.x -> NFSv4.x re-export server = good (no estale, no input/output errors)
originating server NFSv4.1/4.2 -> NFSv3 re-export server = good
originating server NFSv4.0 -> NFSv3 re-export server = no estale but lots of input/output errors
originating server NFSv3 -> NFSv3 re-export server = good (fixed by Trond's lookupp emulation patches)
originating server NFSv3 -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)

In our case, we are stuck with some old 7-mode Netapps so we only have two mount choices, NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0 re-export behaviour). And as discussed previously, a re-export of an NFSv3 server requires my horrible hack in order to avoid excessive lookups and client cache invalidations.

But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export cases, so many thanks again for that Trond. When re-exporting an NFSv3 client mount, we no longer need to change vfs_cache_pressure=0.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
@ 2020-11-09 16:02           ` bfields
  2020-11-12 13:01             ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-09 16:02 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million thanks!) so I applied them to v5.9.1 and ran some more tests using that on the re-export server. Again, I just pathologically dropped inode & dentry caches every second on the re-export server (vfs_cache_pressure=100) while a client looped through some application loading tests.
> 
> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode & dentry caches (yay!).
> 
> However, my assumption that some of the input/output errors I was seeing were related to the estales seems to have been misguided. After running these tests again without any estales, it now looks like a different issue that is unique to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or Netapp). The lookups are all fine (no estale) but reading some files eventually gives an input/output error on multiple clients which remain consistent until the re-export nfs-server is restarted. Again, this only occurs while dropping inode + dentry caches.
> 
> So in summary, while continuously dropping inode/dentry caches on the re-export server:

How continuously, exactly?

I recall that there are some situations where the best the client can do
to handle an ESTALE is just retry.  And that our code generally just
retries once and then gives up.

I wonder if it's possible that the client or re-export server can get
stuck in a situation where they can't guarantee forward progress in the
face of repeated ESTALEs.  I don't have a specific case in mind, though.

--b.

> 
> originating server NFSv4.x -> NFSv4.x re-export server = good (no estale, no input/output errors)
> originating server NFSv4.1/4.2 -> NFSv3 re-export server = good
> originating server NFSv4.0 -> NFSv3 re-export server = no estale but lots of input/output errors
> originating server NFSv3 -> NFSv3 re-export server = good (fixed by Trond's lookupp emulation patches)
> originating server NFSv3 -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)
> 
> In our case, we are stuck with some old 7-mode Netapps so we only have two mount choices, NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0 re-export behaviour). And as discussed previously, a re-export of an NFSv3 server requires my horrible hack in order to avoid excessive lookups and client cache invalidations.
> 
> But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export cases, so many thanks again for that Trond. When re-exporting an NFSv3 client mount, we no longer need to change vfs_cache_pressure=0.
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-09 16:02           ` bfields
@ 2020-11-12 13:01             ` Daire Byrne
  2020-11-12 13:57               ` bfields
  2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
  0 siblings, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-12 13:01 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 9 Nov, 2020, at 16:02, bfields bfields@fieldses.org wrote:
> On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
>> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million
>> thanks!) so I applied them to v5.9.1 and ran some more tests using that on the
>> re-export server. Again, I just pathologically dropped inode & dentry caches
>> every second on the re-export server (vfs_cache_pressure=100) while a client
>> looped through some application loading tests.
>> 
>> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I
>> no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode &
>> dentry caches (yay!).
>> 
>> However, my assumption that some of the input/output errors I was seeing were
>> related to the estales seems to have been misguided. After running these tests
>> again without any estales, it now looks like a different issue that is unique
>> to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or
>> Netapp). The lookups are all fine (no estale) but reading some files eventually
>> gives an input/output error on multiple clients which remain consistent until
>> the re-export nfs-server is restarted. Again, this only occurs while dropping
>> inode + dentry caches.
>> 
>> So in summary, while continuously dropping inode/dentry caches on the re-export
>> server:
> 
> How continuously, exactly?
> 
> I recall that there are some situations where the best the client can do
> to handle an ESTALE is just retry.  And that our code generally just
> retries once and then gives up.
> 
> I wonder if it's possible that the client or re-export server can get
> stuck in a situation where they can't guarantee forward progress in the
> face of repeated ESTALEs.  I don't have a specific case in mind, though.

I was dropping caches every second in a loop on the NFS re-export server. Meanwhile a large python application that takes ~15 seconds to complete was also looping on a client of the re-export server. So we are clearing out the cache many times such that the same python paths are being re-populated many times.

Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's NFSv3 lookupp emulation patches, I can now revise my original list of issues that others will likely experience if they ever try to do this craziness:

1) Don't re-export NFSv4.0 unless you set vfs_cache_presure=0 otherwise you will see random input/output errors on your clients when things are dropped out of the cache. In the end we gave up on using NFSv4.0 with our Netapps because the 7-mode implementation seemed a bit flakey with modern Linux clients (Linux NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with Trond's lookupp emulation patches instead.

2) In order to better utilise the re-export server's client cache when re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to use the horrible inode_peek_iversion_raw hack to maintain good metadata performance for large numbers of clients. Otherwise each re-export server's clients can cause invalidation of the re-export server client cache. Once you have hundreds of clients they all combine to constantly invalidate the cache resulting in an order of magnitude slower metadata performance. If you are re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not required.

3) For some reason, when a 1MB read call arrives at the re-export server from a client, it gets chopped up into 128k read calls that are issued back to the originating server despite rsize/wsize=1MB on all mounts. This results in a noticeable increase in rpc chatter for large reads. Writes on the other hand retain their 1MB size from client to re-export server and back to the originating server. I am using nconnect but I doubt that is related.

4) After some random time, the cachefilesd userspace daemon stops culling old data from an fscache disk storage. I thought it was to do with setting vfs_cache_pressure=0 but even with it set to the default 100 it just randomly decides to stop culling and never comes back to life until restarted or rebooted. Perhaps the fscache/cachefilesd rewrite that David Howells & David Wysochanski have been working on will improve matters.

5) It's still really hard to cache nfs client metadata for any definitive time (actimeo,nocto) due to the pagecache churn that reads cause. If all required metadata (i.e. directory contents) could either be locally cached to disk or the inode cache rather than pagecache then maybe we would have more control over the actual cache times we are comfortable with for our workloads. This has little to do with re-exporting and is just a general NFS performance over the WAN thing. I'm very interested to see how Trond's recent patches to improve readdir performance might at least help re-populate the dropped cached metadata more efficiently over the WAN.

I just want to finish with one more crazy thing we have been doing - a re-export server of a re-export server! Again, a locking and consistency nightmare so only possible for very specific workloads (like ours). The advantage of this topology is that you can pull all your data over the WAN once (e.g. on-premise to cloud) and then fan-out that data to multiple other NFS re-export servers in the cloud to improve the aggregate performance to many clients. This avoids having multiple re-export servers all needing to pull the same data across the WAN.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 13:01             ` Daire Byrne
@ 2020-11-12 13:57               ` bfields
  2020-11-12 18:33                 ` Daire Byrne
  2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-12 13:57 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Nov 12, 2020 at 01:01:24PM +0000, Daire Byrne wrote:
> 
> ----- On 9 Nov, 2020, at 16:02, bfields bfields@fieldses.org wrote:
> > On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
> >> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million
> >> thanks!) so I applied them to v5.9.1 and ran some more tests using that on the
> >> re-export server. Again, I just pathologically dropped inode & dentry caches
> >> every second on the re-export server (vfs_cache_pressure=100) while a client
> >> looped through some application loading tests.
> >> 
> >> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I
> >> no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode &
> >> dentry caches (yay!).
> >> 
> >> However, my assumption that some of the input/output errors I was seeing were
> >> related to the estales seems to have been misguided. After running these tests
> >> again without any estales, it now looks like a different issue that is unique
> >> to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or
> >> Netapp). The lookups are all fine (no estale) but reading some files eventually
> >> gives an input/output error on multiple clients which remain consistent until
> >> the re-export nfs-server is restarted. Again, this only occurs while dropping
> >> inode + dentry caches.
> >> 
> >> So in summary, while continuously dropping inode/dentry caches on the re-export
> >> server:
> > 
> > How continuously, exactly?
> > 
> > I recall that there are some situations where the best the client can do
> > to handle an ESTALE is just retry.  And that our code generally just
> > retries once and then gives up.
> > 
> > I wonder if it's possible that the client or re-export server can get
> > stuck in a situation where they can't guarantee forward progress in the
> > face of repeated ESTALEs.  I don't have a specific case in mind, though.
> 
> I was dropping caches every second in a loop on the NFS re-export server. Meanwhile a large python application that takes ~15 seconds to complete was also looping on a client of the re-export server. So we are clearing out the cache many times such that the same python paths are being re-populated many times.
> 
> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's NFSv3 lookupp emulation patches, I can now revise my original list of issues that others will likely experience if they ever try to do this craziness:
> 
> 1) Don't re-export NFSv4.0 unless you set vfs_cache_presure=0 otherwise you will see random input/output errors on your clients when things are dropped out of the cache. In the end we gave up on using NFSv4.0 with our Netapps because the 7-mode implementation seemed a bit flakey with modern Linux clients (Linux NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with Trond's lookupp emulation patches instead.

So,

		NFSv4.2			  NFSv4.2
	client --------> re-export server -------> original server

works as long as both servers are recent Linux, but when the original
server is Netapp, you need the protocol used in both places to be v3, is
that right?

> 2) In order to better utilise the re-export server's client cache when re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to use the horrible inode_peek_iversion_raw hack to maintain good metadata performance for large numbers of clients. Otherwise each re-export server's clients can cause invalidation of the re-export server client cache. Once you have hundreds of clients they all combine to constantly invalidate the cache resulting in an order of magnitude slower metadata performance. If you are re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not required.

Have we figured out why that's required, or found a longer-term
solution?  (Apologies, the memory of the earlier conversation is
fading....)

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 13:57               ` bfields
@ 2020-11-12 18:33                 ` Daire Byrne
  2020-11-12 20:55                   ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-12 18:33 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 12 Nov, 2020, at 13:57, bfields bfields@fieldses.org wrote:
> On Thu, Nov 12, 2020 at 01:01:24PM +0000, Daire Byrne wrote:
>> 
>> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's
>> NFSv3 lookupp emulation patches, I can now revise my original list of issues
>> that others will likely experience if they ever try to do this craziness:
>> 
>> 1) Don't re-export NFSv4.0 unless you set vfs_cache_presure=0 otherwise you will
>> see random input/output errors on your clients when things are dropped out of
>> the cache. In the end we gave up on using NFSv4.0 with our Netapps because the
>> 7-mode implementation seemed a bit flakey with modern Linux clients (Linux
>> NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with
>> Trond's lookupp emulation patches instead.
> 
> So,
> 
>		NFSv4.2			  NFSv4.2
>	client --------> re-export server -------> original server
> 
> works as long as both servers are recent Linux, but when the original
> server is Netapp, you need the protocol used in both places to be v3, is
> that right?

Well, yes NFSv4.2 all the way through works well for us but it's re-exporting a NFSv4.0 server (Linux OR Netapp) that seems to still show the input/output errors when dropping caches. Every other possible combination now seems to be working without ESTALE or input/errors with the lookupp emulation patches.

So this is still not working when dropping caches on the re-export server:

		NFSv3/4.x			  NFSv4.0
	client --------> re-export server -------> original server

The bit specific to the Netapp is simply that our 7-mode only supports NFSv4.0 so I can't actually test NFSv4.1/4.2 on a more modern Netapp firmware release. So I have to use NFSv3 to mount the Netapp and can then happily re-export that using NFSv4.x or NFSv3 (if the filehandles fit in 63 bytes).

>> 2) In order to better utilise the re-export server's client cache when
>> re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to
>> use the horrible inode_peek_iversion_raw hack to maintain good metadata
>> performance for large numbers of clients. Otherwise each re-export server's
>> clients can cause invalidation of the re-export server client cache. Once you
>> have hundreds of clients they all combine to constantly invalidate the cache
>> resulting in an order of magnitude slower metadata performance. If you are
>> re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not
>> required.
> 
> Have we figured out why that's required, or found a longer-term
> solution?  (Apologies, the memory of the earlier conversation is
> fading....)

There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR allowing for the hack/optimisation but I guess that is only for the case when re-exporting NFSv4 to the eventual clients. It would not help if you were re-exporting an NFSv3 server with NFSv3 to the clients? I lack the deeper understanding to say anything more than that.

In our case we re-export everything to the clients using NFSv4.2 whether the originating server is NFSv3 (e.g our Netapp) or NFSv4.2 (our RHEL7 storage servers).

With NFSv4.2 as the originating server, we found that either this hack/optimsation was not required or the incidence rate of invalidating the re-export server's client cache was much less as to not cause significant performance problems when many clients requested the same metadata.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 18:33                 ` Daire Byrne
@ 2020-11-12 20:55                   ` bfields
  2020-11-12 23:05                     ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-12 20:55 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Nov 12, 2020 at 06:33:45PM +0000, Daire Byrne wrote:
> Well, yes NFSv4.2 all the way through works well for us but it's re-exporting a NFSv4.0 server (Linux OR Netapp) that seems to still show the input/output errors when dropping caches. Every other possible combination now seems to be working without ESTALE or input/errors with the lookupp emulation patches.
> 
> So this is still not working when dropping caches on the re-export server:
> 
> 		NFSv3/4.x			  NFSv4.0
> 	client --------> re-export server -------> original server
> 
> The bit specific to the Netapp is simply that our 7-mode only supports NFSv4.0 so I can't actually test NFSv4.1/4.2 on a more modern Netapp firmware release. So I have to use NFSv3 to mount the Netapp and can then happily re-export that using NFSv4.x or NFSv3 (if the filehandles fit in 63 bytes).

Oh, got it, thanks, so it's just the minor-version difference (probably
the open-by-filehandle stuff that went into 4.1).

> There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
> allowing for the hack/optimisation but I guess that is only for the
> case when re-exporting NFSv4 to the eventual clients. It would not
> help if you were re-exporting an NFSv3 server with NFSv3 to the
> clients? I lack the deeper understanding to say anything more than
> that.

Oh, right, thanks for the reminder.  The CHANGE_TYPE_IS_MONOTONIC_INCR
optimization still looks doable to me.

How does that help, anyway?  I guess it avoids false positives of some
kind when rpc's are processed out of order?

Looking back at

	https://lore.kernel.org/linux-nfs/1155061727.42788071.1600777874179.JavaMail.zimbra@dneg.com/

this bothers me: "I'm not exactly sure why, but the iversion of the
inode gets changed locally (due to atime modification?) most likely via
invocation of method inode_inc_iversion_raw. Each time it gets
incremented the following call to validate attributes detects changes
causing it to be reloaded from the originating server."

The only call to that function outside afs or ceph code is in
fs/nfs/write.c, in the write delegation case.  The Linux server doesn't
support write delegations, Netapp does but this shouldn't be causing
cache invalidations.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 20:55                   ` bfields
@ 2020-11-12 23:05                     ` Daire Byrne
  2020-11-13 14:50                       ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-12 23:05 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 12 Nov, 2020, at 20:55, bfields bfields@fieldses.org wrote:
> On Thu, Nov 12, 2020 at 06:33:45PM +0000, Daire Byrne wrote:
>> There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
>> allowing for the hack/optimisation but I guess that is only for the
>> case when re-exporting NFSv4 to the eventual clients. It would not
>> help if you were re-exporting an NFSv3 server with NFSv3 to the
>> clients? I lack the deeper understanding to say anything more than
>> that.
> 
> Oh, right, thanks for the reminder.  The CHANGE_TYPE_IS_MONOTONIC_INCR
> optimization still looks doable to me.
> 
> How does that help, anyway?  I guess it avoids false positives of some
> kind when rpc's are processed out of order?
> 
> Looking back at
> 
>	https://lore.kernel.org/linux-nfs/1155061727.42788071.1600777874179.JavaMail.zimbra@dneg.com/
> 
> this bothers me: "I'm not exactly sure why, but the iversion of the
> inode gets changed locally (due to atime modification?) most likely via
> invocation of method inode_inc_iversion_raw. Each time it gets
> incremented the following call to validate attributes detects changes
> causing it to be reloaded from the originating server."
> 
> The only call to that function outside afs or ceph code is in
> fs/nfs/write.c, in the write delegation case.  The Linux server doesn't
> support write delegations, Netapp does but this shouldn't be causing
> cache invalidations.

So, I can't lay claim to identifying the exact optimisation/hack that improves the retention of the re-export server's client cache when re-exporting an NFSv3 server (which is then read by many clients). We were working with an engineer at the time who showed an interest in our use case and after we supplied a reproducer he suggested modifying the nfs/inode.c

-		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
+		if (inode_peek_iversion_raw(inode) < fattr->change_attr) {

His reasoning at the time was:

"Fixes inode invalidation caused by read access. The least
 important bit is ORed with 1 and causes the inode version to differ from the
 one seen on the NFS share. This in turn causes unnecessary re-download
 impacting the performance significantly. This fix makes it only re-fetch file
 content if inode version seen on the server is newer than the one on the
 client."

But I've always been puzzled by why this only seems to be the case when using knfsd to re-export the (NFSv3) client mount. Using multiple processes on a standard client mount never causes any similar re-validations. And this happens with a completely read-only share which is why I started to think it has something to do with atimes as that could perhaps still cause a "write" modification even when read-only?

In our case we saw this at it's most extreme when we were re-exporting a read-only NFSv3 Netapp "software" share and loading large applications with many python search paths to trawl through. Multiple clients of the re-export server just kept causing the re-export server's client to re-validate and re-download from the Netapp even though no files or dirs had changed and the actimeo=large (with nocto for good measure).

The patch made it such that the re-export server's client cache acted the same way if we ran 100 processes directly on the NFSv3 client mount (on the re-export server) or ran it on 100 clients of the re-export server - the data remained in client cache for the duration. So the re-export server fetches the data from the originating server once and then serves all those results many times over to all the clients from it's cache - exactly what we want.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 23:05                     ` Daire Byrne
@ 2020-11-13 14:50                       ` bfields
  2020-11-13 22:26                         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-13 14:50 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> So, I can't lay claim to identifying the exact optimisation/hack that
> improves the retention of the re-export server's client cache when
> re-exporting an NFSv3 server (which is then read by many clients). We
> were working with an engineer at the time who showed an interest in
> our use case and after we supplied a reproducer he suggested modifying
> the nfs/inode.c
> 
> -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> {
> 
> His reasoning at the time was:
> 
> "Fixes inode invalidation caused by read access. The least important
> bit is ORed with 1 and causes the inode version to differ from the one
> seen on the NFS share. This in turn causes unnecessary re-download
> impacting the performance significantly. This fix makes it only
> re-fetch file content if inode version seen on the server is newer
> than the one on the client."
> 
> But I've always been puzzled by why this only seems to be the case
> when using knfsd to re-export the (NFSv3) client mount. Using multiple
> processes on a standard client mount never causes any similar
> re-validations. And this happens with a completely read-only share
> which is why I started to think it has something to do with atimes as
> that could perhaps still cause a "write" modification even when
> read-only?

Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
i_version.  That's a special thing that only nfsd would do.

I think that's totally fixable, we'll just have to think a little about
how....

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-13 14:50                       ` bfields
@ 2020-11-13 22:26                         ` bfields
  2020-11-14 12:57                           ` Daire Byrne
  2020-11-16 15:29                           ` Jeff Layton
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-11-13 22:26 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > So, I can't lay claim to identifying the exact optimisation/hack that
> > improves the retention of the re-export server's client cache when
> > re-exporting an NFSv3 server (which is then read by many clients). We
> > were working with an engineer at the time who showed an interest in
> > our use case and after we supplied a reproducer he suggested modifying
> > the nfs/inode.c
> > 
> > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > {
> > 
> > His reasoning at the time was:
> > 
> > "Fixes inode invalidation caused by read access. The least important
> > bit is ORed with 1 and causes the inode version to differ from the one
> > seen on the NFS share. This in turn causes unnecessary re-download
> > impacting the performance significantly. This fix makes it only
> > re-fetch file content if inode version seen on the server is newer
> > than the one on the client."
> > 
> > But I've always been puzzled by why this only seems to be the case
> > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > processes on a standard client mount never causes any similar
> > re-validations. And this happens with a completely read-only share
> > which is why I started to think it has something to do with atimes as
> > that could perhaps still cause a "write" modification even when
> > read-only?
> 
> Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> i_version.  That's a special thing that only nfsd would do.
> 
> I think that's totally fixable, we'll just have to think a little about
> how....

I wonder if something like this helps?--b.

commit 0add88a9ccc5
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Fri Nov 13 17:03:04 2020 -0500

    nfs: don't mangle i_version on NFS
    
    The i_version on NFS has pretty much opaque to the client, so we don't
    want to give the low bit any special interpretation.
    
    Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
    i_version on their own.
    
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
index 29ec8b09a52d..9b8dd5b713a7 100644
--- a/fs/nfs/fs_context.c
+++ b/fs/nfs/fs_context.c
@@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
 	.init_fs_context	= nfs_init_fs_context,
 	.parameters		= nfs_fs_parameters,
 	.kill_sb		= nfs_kill_super,
-	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+				  FS_PRIVATE_I_VERSION,
 };
 MODULE_ALIAS_FS("nfs");
 EXPORT_SYMBOL_GPL(nfs_fs_type);
@@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
 	.init_fs_context	= nfs_init_fs_context,
 	.parameters		= nfs_fs_parameters,
 	.kill_sb		= nfs_kill_super,
-	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+				  FS_PRIVATE_I_VERSION,
 };
 MODULE_ALIAS_FS("nfs4");
 MODULE_ALIAS("nfs4");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21cc971fd960..c5bb4268228b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2217,6 +2217,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
+#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
 #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..52c790a847de 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
 	u64 cur, old, new;
 
 	cur = inode_peek_iversion_raw(inode);
+	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
+		return cur;
 	for (;;) {
 		/* If flag is already set, then no need to swap */
 		if (cur & I_VERSION_QUERIED) {

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-13 22:26                         ` bfields
@ 2020-11-14 12:57                           ` Daire Byrne
  2020-11-16 15:18                             ` bfields
  2020-11-16 15:53                             ` bfields
  2020-11-16 15:29                           ` Jeff Layton
  1 sibling, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-14 12:57 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 13 Nov, 2020, at 22:26, bfields bfields@fieldses.org wrote:
> On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
>> On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
>> > So, I can't lay claim to identifying the exact optimisation/hack that
>> > improves the retention of the re-export server's client cache when
>> > re-exporting an NFSv3 server (which is then read by many clients). We
>> > were working with an engineer at the time who showed an interest in
>> > our use case and after we supplied a reproducer he suggested modifying
>> > the nfs/inode.c
>> > 
>> > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
>> > {
>> > 
>> > His reasoning at the time was:
>> > 
>> > "Fixes inode invalidation caused by read access. The least important
>> > bit is ORed with 1 and causes the inode version to differ from the one
>> > seen on the NFS share. This in turn causes unnecessary re-download
>> > impacting the performance significantly. This fix makes it only
>> > re-fetch file content if inode version seen on the server is newer
>> > than the one on the client."
>> > 
>> > But I've always been puzzled by why this only seems to be the case
>> > when using knfsd to re-export the (NFSv3) client mount. Using multiple
>> > processes on a standard client mount never causes any similar
>> > re-validations. And this happens with a completely read-only share
>> > which is why I started to think it has something to do with atimes as
>> > that could perhaps still cause a "write" modification even when
>> > read-only?
>> 
>> Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
>> i_version.  That's a special thing that only nfsd would do.
>> 
>> I think that's totally fixable, we'll just have to think a little about
>> how....
> 
> I wonder if something like this helps?--b.
> 
> commit 0add88a9ccc5
> Author: J. Bruce Fields <bfields@redhat.com>
> Date:   Fri Nov 13 17:03:04 2020 -0500
> 
>    nfs: don't mangle i_version on NFS
>    
>    The i_version on NFS has pretty much opaque to the client, so we don't
>    want to give the low bit any special interpretation.
>    
>    Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
>    i_version on their own.
>    
>    Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> index 29ec8b09a52d..9b8dd5b713a7 100644
> --- a/fs/nfs/fs_context.c
> +++ b/fs/nfs/fs_context.c
> @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> 	.init_fs_context	= nfs_init_fs_context,
> 	.parameters		= nfs_fs_parameters,
> 	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs");
> EXPORT_SYMBOL_GPL(nfs_fs_type);
> @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> 	.init_fs_context	= nfs_init_fs_context,
> 	.parameters		= nfs_fs_parameters,
> 	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs4");
> MODULE_ALIAS("nfs4");
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21cc971fd960..c5bb4268228b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,7 @@ struct file_system_type {
> #define FS_HAS_SUBTYPE		4
> #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename()
> internally. */
> 	int (*init_fs_context)(struct fs_context *);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..52c790a847de 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> 	u64 cur, old, new;
> 
> 	cur = inode_peek_iversion_raw(inode);
> +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> +		return cur;
> 	for (;;) {
> 		/* If flag is already set, then no need to swap */
>  		if (cur & I_VERSION_QUERIED) {

Yes, I can confirm that this absolutely helps! I replaced our (brute force) iversion patch with this (much nicer) patch and we got the same improvement; nfsd and it's clients no longer cause the re-export server's client cache to constantly be re-validated. The re-export server can now serve the same results to many clients from cache. Thanks so much for spending the time to track this down. If merged, future (crazy) NFS re-exporters will benefit from the metadata performance improvement/acceleration!

Now if anyone has any ideas why all the read calls to the originating server are limited to a maximum of 128k (with rsize=1M) when coming via the re-export server's nfsd threads, I see that as the next biggest performance issue. Reading directly on the re-export server with a userspace process issues 1MB reads as expected. It doesn't happen for writes (wsize=1MB all the way through) but I'm not sure if that has more to do with async and write back caching helping to build up the size before commit?

I figure the other remaining items on my (wish) list are probably more in the "won't fix" or "can't fix" category (except maybe the NFSv4.0 input/output errors?).

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-14 12:57                           ` Daire Byrne
@ 2020-11-16 15:18                             ` bfields
  2020-11-16 15:53                             ` bfields
  1 sibling, 0 replies; 129+ messages in thread
From: bfields @ 2020-11-16 15:18 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs, jlayton

Jeff, does something like this look reasonable?

--b.

On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
> ----- On 13 Nov, 2020, at 22:26, bfields bfields@fieldses.org wrote:
> > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> >> Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> >> i_version.  That's a special thing that only nfsd would do.
> >> 
> >> I think that's totally fixable, we'll just have to think a little about
> >> how....
> > 
> > I wonder if something like this helps?--b.
> > 
> > commit 0add88a9ccc5
> > Author: J. Bruce Fields <bfields@redhat.com>
> > Date:   Fri Nov 13 17:03:04 2020 -0500
> > 
> >    nfs: don't mangle i_version on NFS
> >    
> >    The i_version on NFS has pretty much opaque to the client, so we don't
> >    want to give the low bit any special interpretation.
> >    
> >    Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> >    i_version on their own.
> >    
> >    Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > 
> > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > index 29ec8b09a52d..9b8dd5b713a7 100644
> > --- a/fs/nfs/fs_context.c
> > +++ b/fs/nfs/fs_context.c
> > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > 	.init_fs_context	= nfs_init_fs_context,
> > 	.parameters		= nfs_fs_parameters,
> > 	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs");
> > EXPORT_SYMBOL_GPL(nfs_fs_type);
> > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > 	.init_fs_context	= nfs_init_fs_context,
> > 	.parameters		= nfs_fs_parameters,
> > 	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs4");
> > MODULE_ALIAS("nfs4");
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 21cc971fd960..c5bb4268228b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > #define FS_HAS_SUBTYPE		4
> > #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename()
> > internally. */
> > 	int (*init_fs_context)(struct fs_context *);
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..52c790a847de 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > 	u64 cur, old, new;
> > 
> > 	cur = inode_peek_iversion_raw(inode);
> > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > +		return cur;
> > 	for (;;) {
> > 		/* If flag is already set, then no need to swap */
> >  		if (cur & I_VERSION_QUERIED) {
> 
> Yes, I can confirm that this absolutely helps! I replaced our (brute force) iversion patch with this (much nicer) patch and we got the same improvement; nfsd and it's clients no longer cause the re-export server's client cache to constantly be re-validated. The re-export server can now serve the same results to many clients from cache. Thanks so much for spending the time to track this down. If merged, future (crazy) NFS re-exporters will benefit from the metadata performance improvement/acceleration!
> 
> Now if anyone has any ideas why all the read calls to the originating server are limited to a maximum of 128k (with rsize=1M) when coming via the re-export server's nfsd threads, I see that as the next biggest performance issue. Reading directly on the re-export server with a userspace process issues 1MB reads as expected. It doesn't happen for writes (wsize=1MB all the way through) but I'm not sure if that has more to do with async and write back caching helping to build up the size before commit?
> 
> I figure the other remaining items on my (wish) list are probably more in the "won't fix" or "can't fix" category (except maybe the NFSv4.0 input/output errors?).
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-13 22:26                         ` bfields
  2020-11-14 12:57                           ` Daire Byrne
@ 2020-11-16 15:29                           ` Jeff Layton
  2020-11-16 15:56                             ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 15:29 UTC (permalink / raw)
  To: bfields, Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > improves the retention of the re-export server's client cache when
> > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > were working with an engineer at the time who showed an interest in
> > > our use case and after we supplied a reproducer he suggested modifying
> > > the nfs/inode.c
> > > 
> > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > {
> > > 
> > > His reasoning at the time was:
> > > 
> > > "Fixes inode invalidation caused by read access. The least important
> > > bit is ORed with 1 and causes the inode version to differ from the one
> > > seen on the NFS share. This in turn causes unnecessary re-download
> > > impacting the performance significantly. This fix makes it only
> > > re-fetch file content if inode version seen on the server is newer
> > > than the one on the client."
> > > 
> > > But I've always been puzzled by why this only seems to be the case
> > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > processes on a standard client mount never causes any similar
> > > re-validations. And this happens with a completely read-only share
> > > which is why I started to think it has something to do with atimes as
> > > that could perhaps still cause a "write" modification even when
> > > read-only?
> > 
> > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > i_version.  That's a special thing that only nfsd would do.
> > 
> > I think that's totally fixable, we'll just have to think a little about
> > how....
> 
> I wonder if something like this helps?--b.
> 
> commit 0add88a9ccc5
> Author: J. Bruce Fields <bfields@redhat.com>
> Date:   Fri Nov 13 17:03:04 2020 -0500
> 
>     nfs: don't mangle i_version on NFS
>     
> 
>     The i_version on NFS has pretty much opaque to the client, so we don't
>     want to give the low bit any special interpretation.
>     
> 
>     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
>     i_version on their own.
>     
> 
>     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> index 29ec8b09a52d..9b8dd5b713a7 100644
> --- a/fs/nfs/fs_context.c
> +++ b/fs/nfs/fs_context.c
> @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
>  	.init_fs_context	= nfs_init_fs_context,
>  	.parameters		= nfs_fs_parameters,
>  	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
>  };
>  MODULE_ALIAS_FS("nfs");
>  EXPORT_SYMBOL_GPL(nfs_fs_type);
> @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
>  	.init_fs_context	= nfs_init_fs_context,
>  	.parameters		= nfs_fs_parameters,
>  	.kill_sb		= nfs_kill_super,
> -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +				  FS_PRIVATE_I_VERSION,
>  };
>  MODULE_ALIAS_FS("nfs4");
>  MODULE_ALIAS("nfs4");
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21cc971fd960..c5bb4268228b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,7 @@ struct file_system_type {
>  #define FS_HAS_SUBTYPE		4
>  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
>  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
>  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
>  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
>  	int (*init_fs_context)(struct fs_context *);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..52c790a847de 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
>  	u64 cur, old, new;
>  
> 
>  	cur = inode_peek_iversion_raw(inode);
> +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> +		return cur;
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */
>  		if (cur & I_VERSION_QUERIED) {


It's probably more correct to just check the already-existing
SB_I_VERSION flag here (though in hindsight a fstype flag might have
made more sense).

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-14 12:57                           ` Daire Byrne
  2020-11-16 15:18                             ` bfields
@ 2020-11-16 15:53                             ` bfields
  2020-11-16 19:21                               ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 15:53 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
> Now if anyone has any ideas why all the read calls to the originating
> server are limited to a maximum of 128k (with rsize=1M) when coming
> via the re-export server's nfsd threads, I see that as the next
> biggest performance issue. Reading directly on the re-export server
> with a userspace process issues 1MB reads as expected. It doesn't
> happen for writes (wsize=1MB all the way through) but I'm not sure if
> that has more to do with async and write back caching helping to build
> up the size before commit?

I'm not sure where to start with this one....

Is this behavior independent of protocol version and backend server?

> I figure the other remaining items on my (wish) list are probably more
> in the "won't fix" or "can't fix" category (except maybe the NFSv4.0
> input/output errors?).

Well, sounds like you've found a case where this feature's actually
useful.  We should make sure that's documented.

And I think it's also worth some effort to document and triage the list
of remaining issues.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 15:29                           ` Jeff Layton
@ 2020-11-16 15:56                             ` bfields
  2020-11-16 16:03                               ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 15:56 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > improves the retention of the re-export server's client cache when
> > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > were working with an engineer at the time who showed an interest in
> > > > our use case and after we supplied a reproducer he suggested modifying
> > > > the nfs/inode.c
> > > > 
> > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > {
> > > > 
> > > > His reasoning at the time was:
> > > > 
> > > > "Fixes inode invalidation caused by read access. The least important
> > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > impacting the performance significantly. This fix makes it only
> > > > re-fetch file content if inode version seen on the server is newer
> > > > than the one on the client."
> > > > 
> > > > But I've always been puzzled by why this only seems to be the case
> > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > processes on a standard client mount never causes any similar
> > > > re-validations. And this happens with a completely read-only share
> > > > which is why I started to think it has something to do with atimes as
> > > > that could perhaps still cause a "write" modification even when
> > > > read-only?
> > > 
> > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > i_version.  That's a special thing that only nfsd would do.
> > > 
> > > I think that's totally fixable, we'll just have to think a little about
> > > how....
> > 
> > I wonder if something like this helps?--b.
> > 
> > commit 0add88a9ccc5
> > Author: J. Bruce Fields <bfields@redhat.com>
> > Date:   Fri Nov 13 17:03:04 2020 -0500
> > 
> >     nfs: don't mangle i_version on NFS
> >     
> > 
> >     The i_version on NFS has pretty much opaque to the client, so we don't
> >     want to give the low bit any special interpretation.
> >     
> > 
> >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> >     i_version on their own.
> >     
> > 
> >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > 
> > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > index 29ec8b09a52d..9b8dd5b713a7 100644
> > --- a/fs/nfs/fs_context.c
> > +++ b/fs/nfs/fs_context.c
> > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> >  	.init_fs_context	= nfs_init_fs_context,
> >  	.parameters		= nfs_fs_parameters,
> >  	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> >  };
> >  MODULE_ALIAS_FS("nfs");
> >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> >  	.init_fs_context	= nfs_init_fs_context,
> >  	.parameters		= nfs_fs_parameters,
> >  	.kill_sb		= nfs_kill_super,
> > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > +				  FS_PRIVATE_I_VERSION,
> >  };
> >  MODULE_ALIAS_FS("nfs4");
> >  MODULE_ALIAS("nfs4");
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 21cc971fd960..c5bb4268228b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2217,6 +2217,7 @@ struct file_system_type {
> >  #define FS_HAS_SUBTYPE		4
> >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> >  	int (*init_fs_context)(struct fs_context *);
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..52c790a847de 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> >  	u64 cur, old, new;
> >  
> > 
> >  	cur = inode_peek_iversion_raw(inode);
> > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > +		return cur;
> >  	for (;;) {
> >  		/* If flag is already set, then no need to swap */
> >  		if (cur & I_VERSION_QUERIED) {
> 
> 
> It's probably more correct to just check the already-existing
> SB_I_VERSION flag here

So the check would be

	if (!IS_I_VERSION(inode))
		return cur;

?

> (though in hindsight a fstype flag might have made more sense).

I_VERSION support can vary by superblock (for example, xfs supports it
or not depending on on-disk format version).

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 15:56                             ` bfields
@ 2020-11-16 16:03                               ` Jeff Layton
  2020-11-16 16:14                                 ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 16:03 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > improves the retention of the re-export server's client cache when
> > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > were working with an engineer at the time who showed an interest in
> > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > the nfs/inode.c
> > > > > 
> > > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > {
> > > > > 
> > > > > His reasoning at the time was:
> > > > > 
> > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > impacting the performance significantly. This fix makes it only
> > > > > re-fetch file content if inode version seen on the server is newer
> > > > > than the one on the client."
> > > > > 
> > > > > But I've always been puzzled by why this only seems to be the case
> > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > processes on a standard client mount never causes any similar
> > > > > re-validations. And this happens with a completely read-only share
> > > > > which is why I started to think it has something to do with atimes as
> > > > > that could perhaps still cause a "write" modification even when
> > > > > read-only?
> > > > 
> > > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > > i_version.  That's a special thing that only nfsd would do.
> > > > 
> > > > I think that's totally fixable, we'll just have to think a little about
> > > > how....
> > > 
> > > I wonder if something like this helps?--b.
> > > 
> > > commit 0add88a9ccc5
> > > Author: J. Bruce Fields <bfields@redhat.com>
> > > Date:   Fri Nov 13 17:03:04 2020 -0500
> > > 
> > >     nfs: don't mangle i_version on NFS
> > >     
> > > 
> > >     The i_version on NFS has pretty much opaque to the client, so we don't
> > >     want to give the low bit any special interpretation.
> > >     
> > > 
> > >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > >     i_version on their own.
> > >     
> > > 
> > >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > > 
> > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > --- a/fs/nfs/fs_context.c
> > > +++ b/fs/nfs/fs_context.c
> > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > >  	.init_fs_context	= nfs_init_fs_context,
> > >  	.parameters		= nfs_fs_parameters,
> > >  	.kill_sb		= nfs_kill_super,
> > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > +				  FS_PRIVATE_I_VERSION,
> > >  };
> > >  MODULE_ALIAS_FS("nfs");
> > >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > >  	.init_fs_context	= nfs_init_fs_context,
> > >  	.parameters		= nfs_fs_parameters,
> > >  	.kill_sb		= nfs_kill_super,
> > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > +				  FS_PRIVATE_I_VERSION,
> > >  };
> > >  MODULE_ALIAS_FS("nfs4");
> > >  MODULE_ALIAS("nfs4");
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 21cc971fd960..c5bb4268228b 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > >  #define FS_HAS_SUBTYPE		4
> > >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> > >  	int (*init_fs_context)(struct fs_context *);
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 2917ef990d43..52c790a847de 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > >  	u64 cur, old, new;
> > >  
> > > 
> > >  	cur = inode_peek_iversion_raw(inode);
> > > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > +		return cur;
> > >  	for (;;) {
> > >  		/* If flag is already set, then no need to swap */
> > >  		if (cur & I_VERSION_QUERIED) {
> > 
> > 
> > It's probably more correct to just check the already-existing
> > SB_I_VERSION flag here
> 
> So the check would be
> 
> 	if (!IS_I_VERSION(inode))
> 		return cur;
> 
> ?
> 

Yes, that looks about right.

> > (though in hindsight a fstype flag might have made more sense).
> 
> I_VERSION support can vary by superblock (for example, xfs supports it
> or not depending on on-disk format version).
> 

Good point!

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 16:03                               ` Jeff Layton
@ 2020-11-16 16:14                                 ` bfields
  2020-11-16 16:38                                   ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 16:14 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 11:03:00AM -0500, Jeff Layton wrote:
> On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> > On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > > improves the retention of the re-export server's client cache when
> > > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > > were working with an engineer at the time who showed an interest in
> > > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > > the nfs/inode.c
> > > > > > 
> > > > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > > {
> > > > > > 
> > > > > > His reasoning at the time was:
> > > > > > 
> > > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > > impacting the performance significantly. This fix makes it only
> > > > > > re-fetch file content if inode version seen on the server is newer
> > > > > > than the one on the client."
> > > > > > 
> > > > > > But I've always been puzzled by why this only seems to be the case
> > > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > > processes on a standard client mount never causes any similar
> > > > > > re-validations. And this happens with a completely read-only share
> > > > > > which is why I started to think it has something to do with atimes as
> > > > > > that could perhaps still cause a "write" modification even when
> > > > > > read-only?
> > > > > 
> > > > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > > > i_version.  That's a special thing that only nfsd would do.
> > > > > 
> > > > > I think that's totally fixable, we'll just have to think a little about
> > > > > how....
> > > > 
> > > > I wonder if something like this helps?--b.
> > > > 
> > > > commit 0add88a9ccc5
> > > > Author: J. Bruce Fields <bfields@redhat.com>
> > > > Date:   Fri Nov 13 17:03:04 2020 -0500
> > > > 
> > > >     nfs: don't mangle i_version on NFS
> > > >     
> > > > 
> > > >     The i_version on NFS has pretty much opaque to the client, so we don't
> > > >     want to give the low bit any special interpretation.
> > > >     
> > > > 
> > > >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > > >     i_version on their own.
> > > >     
> > > > 
> > > >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > > > 
> > > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > > --- a/fs/nfs/fs_context.c
> > > > +++ b/fs/nfs/fs_context.c
> > > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > > >  	.init_fs_context	= nfs_init_fs_context,
> > > >  	.parameters		= nfs_fs_parameters,
> > > >  	.kill_sb		= nfs_kill_super,
> > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > +				  FS_PRIVATE_I_VERSION,
> > > >  };
> > > >  MODULE_ALIAS_FS("nfs");
> > > >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > > >  	.init_fs_context	= nfs_init_fs_context,
> > > >  	.parameters		= nfs_fs_parameters,
> > > >  	.kill_sb		= nfs_kill_super,
> > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > +				  FS_PRIVATE_I_VERSION,
> > > >  };
> > > >  MODULE_ALIAS_FS("nfs4");
> > > >  MODULE_ALIAS("nfs4");
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 21cc971fd960..c5bb4268228b 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > > >  #define FS_HAS_SUBTYPE		4
> > > >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > > >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > > > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > > >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > > >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> > > >  	int (*init_fs_context)(struct fs_context *);
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 2917ef990d43..52c790a847de 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > > >  	u64 cur, old, new;
> > > >  
> > > > 
> > > >  	cur = inode_peek_iversion_raw(inode);
> > > > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > > +		return cur;
> > > >  	for (;;) {
> > > >  		/* If flag is already set, then no need to swap */
> > > >  		if (cur & I_VERSION_QUERIED) {
> > > 
> > > 
> > > It's probably more correct to just check the already-existing
> > > SB_I_VERSION flag here
> > 
> > So the check would be
> > 
> > 	if (!IS_I_VERSION(inode))
> > 		return cur;
> > 
> > ?
> > 
> 
> Yes, that looks about right.

That doesn't sound right to me.  NFS, for example, has a perfectly good
i_version that works as a change attribute, so it should set
SB_I_VERSION.  But it doesn't want the vfs playing games with the low
bit.

(In fact, I'm confused now: the improvement Daire was seeing should only
be possible if the re-export server was seeing SB_I_VERSION set on the
NFS filesystem it was exporting, but a quick grep doesn't actually show
me where NFS is setting SB_I_VERSION.  I'm missing something
obvious....)

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 16:14                                 ` bfields
@ 2020-11-16 16:38                                   ` Jeff Layton
  2020-11-16 19:03                                     ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 16:38 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 11:14 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 11:03:00AM -0500, Jeff Layton wrote:
> > On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> > > On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > > > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > > > improves the retention of the re-export server's client cache when
> > > > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > > > were working with an engineer at the time who showed an interest in
> > > > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > > > the nfs/inode.c
> > > > > > > 
> > > > > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > > > {
> > > > > > > 
> > > > > > > His reasoning at the time was:
> > > > > > > 
> > > > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > > > impacting the performance significantly. This fix makes it only
> > > > > > > re-fetch file content if inode version seen on the server is newer
> > > > > > > than the one on the client."
> > > > > > > 
> > > > > > > But I've always been puzzled by why this only seems to be the case
> > > > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > > > processes on a standard client mount never causes any similar
> > > > > > > re-validations. And this happens with a completely read-only share
> > > > > > > which is why I started to think it has something to do with atimes as
> > > > > > > that could perhaps still cause a "write" modification even when
> > > > > > > read-only?
> > > > > > 
> > > > > > Ah-hah!  So, it's inode_query_iversion() that's modifying a nfs inode's
> > > > > > i_version.  That's a special thing that only nfsd would do.
> > > > > > 
> > > > > > I think that's totally fixable, we'll just have to think a little about
> > > > > > how....
> > > > > 
> > > > > I wonder if something like this helps?--b.
> > > > > 
> > > > > commit 0add88a9ccc5
> > > > > Author: J. Bruce Fields <bfields@redhat.com>
> > > > > Date:   Fri Nov 13 17:03:04 2020 -0500
> > > > > 
> > > > >     nfs: don't mangle i_version on NFS
> > > > >     
> > > > > 
> > > > >     The i_version on NFS has pretty much opaque to the client, so we don't
> > > > >     want to give the low bit any special interpretation.
> > > > >     
> > > > > 
> > > > >     Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > > > >     i_version on their own.
> > > > >     
> > > > > 
> > > > >     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > > > > 
> > > > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > > > --- a/fs/nfs/fs_context.c
> > > > > +++ b/fs/nfs/fs_context.c
> > > > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > > > >  	.init_fs_context	= nfs_init_fs_context,
> > > > >  	.parameters		= nfs_fs_parameters,
> > > > >  	.kill_sb		= nfs_kill_super,
> > > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > > +				  FS_PRIVATE_I_VERSION,
> > > > >  };
> > > > >  MODULE_ALIAS_FS("nfs");
> > > > >  EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > > > >  	.init_fs_context	= nfs_init_fs_context,
> > > > >  	.parameters		= nfs_fs_parameters,
> > > > >  	.kill_sb		= nfs_kill_super,
> > > > > -	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > > +	.fs_flags		= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > > +				  FS_PRIVATE_I_VERSION,
> > > > >  };
> > > > >  MODULE_ALIAS_FS("nfs4");
> > > > >  MODULE_ALIAS("nfs4");
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index 21cc971fd960..c5bb4268228b 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > > > >  #define FS_HAS_SUBTYPE		4
> > > > >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> > > > >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> > > > > +#define FS_PRIVATE_I_VERSION	32	/* i_version managed by filesystem */
> > > > >  #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
> > > > >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> > > > >  	int (*init_fs_context)(struct fs_context *);
> > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > > index 2917ef990d43..52c790a847de 100644
> > > > > --- a/include/linux/iversion.h
> > > > > +++ b/include/linux/iversion.h
> > > > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > > > >  	u64 cur, old, new;
> > > > >  
> > > > > 
> > > > >  	cur = inode_peek_iversion_raw(inode);
> > > > > +	if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > > > +		return cur;
> > > > >  	for (;;) {
> > > > >  		/* If flag is already set, then no need to swap */
> > > > >  		if (cur & I_VERSION_QUERIED) {
> > > > 
> > > > 
> > > > It's probably more correct to just check the already-existing
> > > > SB_I_VERSION flag here
> > > 
> > > So the check would be
> > > 
> > > 	if (!IS_I_VERSION(inode))
> > > 		return cur;
> > > 
> > > ?
> > > 
> > 
> > Yes, that looks about right.
> 
> That doesn't sound right to me.  NFS, for example, has a perfectly good
> i_version that works as a change attribute, so it should set
> SB_I_VERSION.  But it doesn't want the vfs playing games with the low
> bit.
> 
> (In fact, I'm confused now: the improvement Daire was seeing should only
> be possible if the re-export server was seeing SB_I_VERSION set on the
> NFS filesystem it was exporting, but a quick grep doesn't actually show
> me where NFS is setting SB_I_VERSION.  I'm missing something
> obvious....)


Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
nfs3 code as well. The v4 caller (encode_change) only calls it when
IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
that.

I think the basic issue here is that we're trying to use SB_I_VERSION
for two different things. Its main purpose is to tell the kernel that
when it's updating the file times that it should also (possibly)
increment the i_version counter too. (Some of this is documented in
include/linux/iversion.h too, fwiw)

nfsd needs a way to tell whether the field should be consulted at all.
For that we probably do need a different flag of some sort. Doing it at
the fstype level seems a bit wrong though -- v2/3 don't have a real
change attribute and it probably shouldn't be trusted when exporting
them.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 16:38                                   ` Jeff Layton
@ 2020-11-16 19:03                                     ` bfields
  2020-11-16 20:03                                       ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-16 19:03 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 11:38:44AM -0500, Jeff Layton wrote:
> Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
> nfs3 code as well. The v4 caller (encode_change) only calls it when
> IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
> that.

Weird.  Looking back....  That goes back to the original patch adding
support for ext4's i_version, c654b8a9cba6 "nfsd: support ext4
i_version".

It's in nfs3xdr.c, but the fields it's filling in, fh_pre_change and
fh_post_change, are only used in nfs4xdr.c.  Maybe moving it someplace
else (vfs.c?) would save some confusion.

Anyway, yes, that should be checking SB_I_VERSION too.

> I think the basic issue here is that we're trying to use SB_I_VERSION
> for two different things. Its main purpose is to tell the kernel that
> when it's updating the file times that it should also (possibly)
> increment the i_version counter too. (Some of this is documented in
> include/linux/iversion.h too, fwiw)
> 
> nfsd needs a way to tell whether the field should be consulted at all.
> For that we probably do need a different flag of some sort. Doing it at
> the fstype level seems a bit wrong though -- v2/3 don't have a real
> change attribute and it probably shouldn't be trusted when exporting
> them.

Oops, good point.

I suppose simplest is just another SB_ flag.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 15:53                             ` bfields
@ 2020-11-16 19:21                               ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-16 19:21 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 16 Nov, 2020, at 15:53, bfields bfields@fieldses.org wrote:
> On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
>> Now if anyone has any ideas why all the read calls to the originating
>> server are limited to a maximum of 128k (with rsize=1M) when coming
>> via the re-export server's nfsd threads, I see that as the next
>> biggest performance issue. Reading directly on the re-export server
>> with a userspace process issues 1MB reads as expected. It doesn't
>> happen for writes (wsize=1MB all the way through) but I'm not sure if
>> that has more to do with async and write back caching helping to build
>> up the size before commit?
> 
> I'm not sure where to start with this one....
> 
> Is this behavior independent of protocol version and backend server?

It seems to the case for all combinations of backend versions and re-export versions.

But it does look like it is related to readahead somehow. The default for a client mount is 128k ....

I just increased it to 1024 on the client mount of the originating server on the re-export server and now it's doing the expected 1MB (rsize) read requests back to onprem from the clients all the way through. i.e.

echo 1024 > /sys/class/bdi/0:52/read_ahead_kb

So, there is a difference in behaviour when reading from the client mount with user space processes or the knfsd threads on the re-export server.

Daire



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 19:03                                     ` bfields
@ 2020-11-16 20:03                                       ` Jeff Layton
  2020-11-17  3:16                                         ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-16 20:03 UTC (permalink / raw)
  To: bfields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 14:03 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 11:38:44AM -0500, Jeff Layton wrote:
> > Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
> > nfs3 code as well. The v4 caller (encode_change) only calls it when
> > IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
> > that.
> 
> Weird.  Looking back....  That goes back to the original patch adding
> support for ext4's i_version, c654b8a9cba6 "nfsd: support ext4
> i_version".
> 
> It's in nfs3xdr.c, but the fields it's filling in, fh_pre_change and
> fh_post_change, are only used in nfs4xdr.c.  Maybe moving it someplace
> else (vfs.c?) would save some confusion.
> 
> Anyway, yes, that should be checking SB_I_VERSION too.
> 
> > I think the basic issue here is that we're trying to use SB_I_VERSION
> > for two different things. Its main purpose is to tell the kernel that
> > when it's updating the file times that it should also (possibly)
> > increment the i_version counter too. (Some of this is documented in
> > include/linux/iversion.h too, fwiw)
> > 
> > nfsd needs a way to tell whether the field should be consulted at all.
> > For that we probably do need a different flag of some sort. Doing it at
> > the fstype level seems a bit wrong though -- v2/3 don't have a real
> > change attribute and it probably shouldn't be trusted when exporting
> > them.
> 
> Oops, good point.
> 
> I suppose simplest is just another SB_ flag.
> 

Another idea might be to add a new fetch_iversion export operation that
returns a u64. Roll two generic functions -- one to handle the
xfs/ext4/btrfs case and another for the NFS/AFS/Ceph case (where we just
fetch it raw). When the op is a NULL pointer, treat it like the
!IS_I_VERSION case today.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-16 20:03                                       ` Jeff Layton
@ 2020-11-17  3:16                                         ` bfields
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-17  3:16 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 03:03:15PM -0500, Jeff Layton wrote:
> Another idea might be to add a new fetch_iversion export operation that
> returns a u64. Roll two generic functions -- one to handle the
> xfs/ext4/btrfs case and another for the NFS/AFS/Ceph case (where we just
> fetch it raw). When the op is a NULL pointer, treat it like the
> !IS_I_VERSION case today.

OK, a rough attempt follows, mostly untested.--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c
  2020-11-17  3:16                                         ` bfields
@ 2020-11-17  3:18                                           ` J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
                                                               ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

These functions are actually used by NFSv4 code as well, and having them
in nfs3xdr.c has caused some confusion.

This is just cleanup, no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs3xdr.c | 49 -----------------------------------------------
 fs/nfsd/nfsfh.c   | 49 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2277f83da250..14efb3aba6b2 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -252,55 +252,6 @@ encode_wcc_data(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
 	return encode_post_op_attr(rqstp, p, fhp);
 }
 
-/*
- * Fill in the pre_op attr for the wcc data
- */
-void fill_pre_wcc(struct svc_fh *fhp)
-{
-	struct inode    *inode;
-	struct kstat	stat;
-	__be32 err;
-
-	if (fhp->fh_pre_saved)
-		return;
-
-	inode = d_inode(fhp->fh_dentry);
-	err = fh_getattr(fhp, &stat);
-	if (err) {
-		/* Grab the times from inode anyway */
-		stat.mtime = inode->i_mtime;
-		stat.ctime = inode->i_ctime;
-		stat.size  = inode->i_size;
-	}
-
-	fhp->fh_pre_mtime = stat.mtime;
-	fhp->fh_pre_ctime = stat.ctime;
-	fhp->fh_pre_size  = stat.size;
-	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
-	fhp->fh_pre_saved = true;
-}
-
-/*
- * Fill in the post_op attr for the wcc data
- */
-void fill_post_wcc(struct svc_fh *fhp)
-{
-	__be32 err;
-
-	if (fhp->fh_post_saved)
-		printk("nfsd: inode locked twice during operation.\n");
-
-	err = fh_getattr(fhp, &fhp->fh_post_attr);
-	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
-						     d_inode(fhp->fh_dentry));
-	if (err) {
-		fhp->fh_post_saved = false;
-		/* Grab the ctime anyway - set_change_info might use it */
-		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
-	} else
-		fhp->fh_post_saved = true;
-}
-
 /*
  * XDR decode functions
  */
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c81dbbad8792..b3b4e8809aa9 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -711,3 +711,52 @@ enum fsid_source fsid_source(struct svc_fh *fhp)
 		return FSIDSOURCE_UUID;
 	return FSIDSOURCE_DEV;
 }
+
+/*
+ * Fill in the pre_op attr for the wcc data
+ */
+void fill_pre_wcc(struct svc_fh *fhp)
+{
+	struct inode    *inode;
+	struct kstat	stat;
+	__be32 err;
+
+	if (fhp->fh_pre_saved)
+		return;
+
+	inode = d_inode(fhp->fh_dentry);
+	err = fh_getattr(fhp, &stat);
+	if (err) {
+		/* Grab the times from inode anyway */
+		stat.mtime = inode->i_mtime;
+		stat.ctime = inode->i_ctime;
+		stat.size  = inode->i_size;
+	}
+
+	fhp->fh_pre_mtime = stat.mtime;
+	fhp->fh_pre_ctime = stat.ctime;
+	fhp->fh_pre_size  = stat.size;
+	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
+	fhp->fh_pre_saved = true;
+}
+
+/*
+ * Fill in the post_op attr for the wcc data
+ */
+void fill_post_wcc(struct svc_fh *fhp)
+{
+	__be32 err;
+
+	if (fhp->fh_post_saved)
+		printk("nfsd: inode locked twice during operation.\n");
+
+	err = fh_getattr(fhp, &fhp->fh_post_attr);
+	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
+						     d_inode(fhp->fh_dentry));
+	if (err) {
+		fhp->fh_post_saved = false;
+		/* Grab the ctime anyway - set_change_info might use it */
+		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
+	} else
+		fhp->fh_post_saved = true;
+}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
@ 2020-11-17  3:18                                             ` J. Bruce Fields
  2020-11-17 12:34                                               ` Jeff Layton
  2020-11-17 15:25                                               ` J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
  2 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

fill_{pre/post}_attr are unconditionally using i_version even when the
underlying filesystem doesn't have proper support for i_version.

Move the code that chooses which i_version to use to the common
nfsd4_change_attribute().

The NFSEXP_V4ROOT case probably doesn't matter (the pseudoroot
filesystem is usually read-only and unlikely to see operations with pre
and post change attributes), but let's put it in the same place anyway
for consistency.

Fixes: c654b8a9cba6 ("nfsd: support ext4 i_version")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c | 11 +----------
 fs/nfsd/nfsfh.c   | 11 +++++++----
 fs/nfsd/nfsfh.h   | 23 -----------------------
 fs/nfsd/vfs.c     | 32 ++++++++++++++++++++++++++++++++
 fs/nfsd/vfs.h     |  3 +++
 5 files changed, 43 insertions(+), 37 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 833a2c64dfe8..6806207b6d18 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2295,16 +2295,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
 			     struct svc_export *exp)
 {
-	if (exp->ex_flags & NFSEXP_V4ROOT) {
-		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
-		*p++ = 0;
-	} else if (IS_I_VERSION(inode)) {
-		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
-	} else {
-		*p++ = cpu_to_be32(stat->ctime.tv_sec);
-		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
-	}
-	return p;
+	return xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode, exp));
 }
 
 /*
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index b3b4e8809aa9..4fbe1413e767 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -719,6 +719,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
 {
 	struct inode    *inode;
 	struct kstat	stat;
+	struct svc_export *exp = fhp->fh_export;
 	__be32 err;
 
 	if (fhp->fh_pre_saved)
@@ -736,7 +737,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_mtime = stat.mtime;
 	fhp->fh_pre_ctime = stat.ctime;
 	fhp->fh_pre_size  = stat.size;
-	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
+	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode, exp);
 	fhp->fh_pre_saved = true;
 }
 
@@ -746,17 +747,19 @@ void fill_pre_wcc(struct svc_fh *fhp)
 void fill_post_wcc(struct svc_fh *fhp)
 {
 	__be32 err;
+	struct inode *inode = d_inode(fhp->fh_dentry);
+	struct svc_export *exp = fhp->fh_export;
 
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
 	err = fh_getattr(fhp, &fhp->fh_post_attr);
-	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
-						     d_inode(fhp->fh_dentry));
+	fhp->fh_post_change =
+			nfsd4_change_attribute(&fhp->fh_post_attr, inode, exp);
 	if (err) {
 		fhp->fh_post_saved = false;
 		/* Grab the ctime anyway - set_change_info might use it */
-		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
+		fhp->fh_post_attr.ctime = inode->i_ctime;
 	} else
 		fhp->fh_post_saved = true;
 }
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 56cfbc361561..547aef9b3265 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -245,29 +245,6 @@ fh_clear_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
-static inline u64 nfsd4_change_attribute(struct kstat *stat,
-					 struct inode *inode)
-{
-	u64 chattr;
-
-	chattr =  stat->ctime.tv_sec;
-	chattr <<= 30;
-	chattr += stat->ctime.tv_nsec;
-	chattr += inode_query_iversion(inode);
-	return chattr;
-}
-
 extern void fill_pre_wcc(struct svc_fh *fhp);
 extern void fill_post_wcc(struct svc_fh *fhp);
 #else
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1ecaceebee13..2c71b02dd1fe 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -2390,3 +2390,35 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
 
 	return err? nfserrno(err) : 0;
 }
+
+/*
+ * We could use i_version alone as the change attribute.  However,
+ * i_version can go backwards after a reboot.  On its own that doesn't
+ * necessarily cause a problem, but if i_version goes backwards and then
+ * is incremented again it could reuse a value that was previously used
+ * before boot, and a client who queried the two values might
+ * incorrectly assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as
+ * long as time doesn't go backwards we never reuse an old value.
+ */
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
+					 struct svc_export *exp)
+{
+	u64 chattr;
+
+	if (exp->ex_flags & NFSEXP_V4ROOT) {
+		chattr = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
+		chattr <<= 32;
+	} else if (IS_I_VERSION(inode)) {
+		chattr = stat->ctime.tv_sec;
+		chattr <<= 30;
+		chattr += stat->ctime.tv_nsec;
+		chattr += inode_query_iversion(inode);
+	} else {
+		chattr = stat->ctime.tv_sec;
+		chattr <<= 32;
+		chattr += stat->ctime.tv_nsec;
+	}
+	return chattr;
+}
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index a2442ebe5acf..26ed15256340 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -132,6 +132,9 @@ __be32		nfsd_statfs(struct svc_rqst *, struct svc_fh *,
 __be32		nfsd_permission(struct svc_rqst *, struct svc_export *,
 				struct dentry *, int);
 
+u64		nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
+				struct svc_export *exp);
+
 static inline int fh_want_write(struct svc_fh *fh)
 {
 	int ret;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 3/4] nfs: don't mangle i_version on NFS
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
@ 2020-11-17  3:18                                             ` J. Bruce Fields
  2020-11-17 12:27                                               ` Jeff Layton
  2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
  2 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

The i_version on NFS has pretty much opaque to the client, so we don't
want to give the low bit any special interpretation.

Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
i_version on their own.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfs/export.c          | 1 +
 include/linux/exportfs.h | 1 +
 include/linux/iversion.h | 4 ++++
 3 files changed, 6 insertions(+)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 3430d6891e89..c2eb915a54ca 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
+	.fetch_iversion = inode_peek_iversion_raw,
 };
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 3ceb72b67a7a..6000121a201f 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,6 +213,7 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
+	u64 (*fetch_iversion)(const struct inode *);
 };
 
 extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..481b3debf6bb 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -3,6 +3,7 @@
 #define _LINUX_IVERSION_H
 
 #include <linux/fs.h>
+#include <linux/exportfs.h>
 
 /*
  * The inode->i_version field:
@@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
 {
 	u64 cur, old, new;
 
+	if (inode->i_sb->s_export_op->fetch_iversion)
+		return inode->i_sb->s_export_op->fetch_iversion(inode);
+
 	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is already set, then no need to swap */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 4/4] nfs: support i_version in the NFSv4 case
  2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
@ 2020-11-17  3:18                                             ` J. Bruce Fields
  2020-11-17 12:34                                               ` Jeff Layton
  2 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17  3:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

Currently when knfsd re-exports an NFS filesystem, it uses the ctime as
the change attribute.  But obviously we have a real change
attribute--the one that was returned from the original server.  We
should just use that.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 4034102010f0..ca85f81d1b9e 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1045,6 +1045,7 @@ static void nfs_fill_super(struct super_block *sb, struct nfs_fs_context *ctx)
 	} else {
 		sb->s_time_min = S64_MIN;
 		sb->s_time_max = S64_MAX;
+		sb->s_flags |= SB_I_VERSION;
 	}
 
 	sb->s_magic = NFS_SUPER_MAGIC;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH 3/4] nfs: don't mangle i_version on NFS
  2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
@ 2020-11-17 12:27                                               ` Jeff Layton
  2020-11-17 14:14                                                 ` J. Bruce Fields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 12:27 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> The i_version on NFS has pretty much opaque to the client, so we don't
> want to give the low bit any special interpretation.
> 
> Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> i_version on their own.
> 

Description here doesn't quite match the patch...

> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfs/export.c          | 1 +
>  include/linux/exportfs.h | 1 +
>  include/linux/iversion.h | 4 ++++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 3430d6891e89..c2eb915a54ca 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
>  	.encode_fh = nfs_encode_fh,
>  	.fh_to_dentry = nfs_fh_to_dentry,
>  	.get_parent = nfs_get_parent,
> +	.fetch_iversion = inode_peek_iversion_raw,
>  };
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 3ceb72b67a7a..6000121a201f 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,6 +213,7 @@ struct export_operations {
>  			  bool write, u32 *device_generation);
>  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>  			     int nr_iomaps, struct iattr *iattr);
> +	u64 (*fetch_iversion)(const struct inode *);
>  };
>  
> 
> 
> 
> 
> 
> 
> 
>  extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..481b3debf6bb 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -3,6 +3,7 @@
>  #define _LINUX_IVERSION_H
>  
> 
> 
> 
> 
> 
> 
> 
>  #include <linux/fs.h>
> +#include <linux/exportfs.h>
>  
> 
> 
> 
> 
> 
> 
> 
>  /*
>   * The inode->i_version field:
> @@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
>  {
>  	u64 cur, old, new;
>  
> 
> 
> 
> 
> 
> 
> 
> +	if (inode->i_sb->s_export_op->fetch_iversion)
> +		return inode->i_sb->s_export_op->fetch_iversion(inode);
> +

This looks dangerous -- s_export_op could be a NULL pointer.

>  	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 4/4] nfs: support i_version in the NFSv4 case
  2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
@ 2020-11-17 12:34                                               ` Jeff Layton
  0 siblings, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 12:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> Currently when knfsd re-exports an NFS filesystem, it uses the ctime as
> the change attribute.  But obviously we have a real change
> attribute--the one that was returned from the original server.  We
> should just use that.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfs/super.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 4034102010f0..ca85f81d1b9e 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -1045,6 +1045,7 @@ static void nfs_fill_super(struct super_block *sb, struct nfs_fs_context *ctx)
>  	} else {
>  		sb->s_time_min = S64_MIN;
>  		sb->s_time_max = S64_MAX;
> +		sb->s_flags |= SB_I_VERSION;
>  	}
>  
> 
>  	sb->s_magic = NFS_SUPER_MAGIC;

I don't think we want this change. This will make file_update_time
attempt to bump the i_version field itself using the routines in
iversion.h. This will almost certainly do the wrong thing.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
@ 2020-11-17 12:34                                               ` Jeff Layton
  2020-11-17 15:26                                                 ` J. Bruce Fields
  2020-11-17 15:25                                               ` J. Bruce Fields
  1 sibling, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 12:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> fill_{pre/post}_attr are unconditionally using i_version even when the
> underlying filesystem doesn't have proper support for i_version.
> 
> Move the code that chooses which i_version to use to the common
> nfsd4_change_attribute().
> 
> The NFSEXP_V4ROOT case probably doesn't matter (the pseudoroot
> filesystem is usually read-only and unlikely to see operations with pre
> and post change attributes), but let's put it in the same place anyway
> for consistency.
> 
> Fixes: c654b8a9cba6 ("nfsd: support ext4 i_version")
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfsd/nfs4xdr.c | 11 +----------
>  fs/nfsd/nfsfh.c   | 11 +++++++----
>  fs/nfsd/nfsfh.h   | 23 -----------------------
>  fs/nfsd/vfs.c     | 32 ++++++++++++++++++++++++++++++++
>  fs/nfsd/vfs.h     |  3 +++
>  5 files changed, 43 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 833a2c64dfe8..6806207b6d18 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2295,16 +2295,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>  static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
>  			     struct svc_export *exp)
>  {
> -	if (exp->ex_flags & NFSEXP_V4ROOT) {
> -		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> -		*p++ = 0;
> -	} else if (IS_I_VERSION(inode)) {
> -		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
> -	} else {
> -		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> -		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
> -	}
> -	return p;
> +	return xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode, exp));
>  }
>  
> 
> 
> 
>  /*
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index b3b4e8809aa9..4fbe1413e767 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -719,6 +719,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  {
>  	struct inode    *inode;
>  	struct kstat	stat;
> +	struct svc_export *exp = fhp->fh_export;
>  	__be32 err;
>  
> 
> 
> 
>  	if (fhp->fh_pre_saved)
> @@ -736,7 +737,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_mtime = stat.mtime;
>  	fhp->fh_pre_ctime = stat.ctime;
>  	fhp->fh_pre_size  = stat.size;
> -	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> +	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode, exp);
>  	fhp->fh_pre_saved = true;
>  }
>  
> 
> 
> 
> @@ -746,17 +747,19 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  void fill_post_wcc(struct svc_fh *fhp)
>  {
>  	__be32 err;
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +	struct svc_export *exp = fhp->fh_export;
>  
> 
> 
> 
>  	if (fhp->fh_post_saved)
>  		printk("nfsd: inode locked twice during operation.\n");
>  
> 
> 
> 
>  	err = fh_getattr(fhp, &fhp->fh_post_attr);
> -	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
> -						     d_inode(fhp->fh_dentry));
> +	fhp->fh_post_change =
> +			nfsd4_change_attribute(&fhp->fh_post_attr, inode, exp);
>  	if (err) {
>  		fhp->fh_post_saved = false;
>  		/* Grab the ctime anyway - set_change_info might use it */
> -		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
> +		fhp->fh_post_attr.ctime = inode->i_ctime;
>  	} else
>  		fhp->fh_post_saved = true;
>  }
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 56cfbc361561..547aef9b3265 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -245,29 +245,6 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> 
> 
> 
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
> -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> -					 struct inode *inode)
> -{
> -	u64 chattr;
> -
> -	chattr =  stat->ctime.tv_sec;
> -	chattr <<= 30;
> -	chattr += stat->ctime.tv_nsec;
> -	chattr += inode_query_iversion(inode);
> -	return chattr;
> -}
> -
>  extern void fill_pre_wcc(struct svc_fh *fhp);
>  extern void fill_post_wcc(struct svc_fh *fhp);
>  #else
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 1ecaceebee13..2c71b02dd1fe 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -2390,3 +2390,35 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
>  
> 
> 
> 
>  	return err? nfserrno(err) : 0;
>  }
> +
> +/*
> + * We could use i_version alone as the change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + */
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +					 struct svc_export *exp)
> +{
> +	u64 chattr;
> +
> +	if (exp->ex_flags & NFSEXP_V4ROOT) {
> +		chattr = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> +		chattr <<= 32;
> +	} else if (IS_I_VERSION(inode)) {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 30;
> +		chattr += stat->ctime.tv_nsec;
> +		chattr += inode_query_iversion(inode);
> +	} else {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 32;
> +		chattr += stat->ctime.tv_nsec;
> +	}
> +	return chattr;
> +}


I don't think I described what I was thinking well. Let me try again...

There should be no need to change the code in iversion.h -- I think we
can do this in a way that's confined to just nfsd/export code.

What I would suggest is to have nfsd4_change_attribute call the
fetch_iversion op if it exists, instead of checking IS_I_VERSION and
doing the stuff in that block. If fetch_iversion is NULL, then just use
the ctime.

Then, you just need to make sure that the filesystems' export_ops have
an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
The rest of the filesystems can leave fetch_iversion as NULL (since we
don't want to use it on them).

> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index a2442ebe5acf..26ed15256340 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -132,6 +132,9 @@ __be32		nfsd_statfs(struct svc_rqst *, struct svc_fh *,
>  __be32		nfsd_permission(struct svc_rqst *, struct svc_export *,
>  				struct dentry *, int);
>  
> 
> 
> 
> +u64		nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +				struct svc_export *exp);
> +
>  static inline int fh_want_write(struct svc_fh *fh)
>  {
>  	int ret;

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 3/4] nfs: don't mangle i_version on NFS
  2020-11-17 12:27                                               ` Jeff Layton
@ 2020-11-17 14:14                                                 ` J. Bruce Fields
  0 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17 14:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 17, 2020 at 07:27:03AM -0500, Jeff Layton wrote:
> On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> > From: "J. Bruce Fields" <bfields@redhat.com>
> > 
> > The i_version on NFS has pretty much opaque to the client, so we don't
> > want to give the low bit any special interpretation.
> > 
> > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > i_version on their own.
> > 
> 
> Description here doesn't quite match the patch...

Oops, thanks.--b.

> 
> > Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> > ---
> >  fs/nfs/export.c          | 1 +
> >  include/linux/exportfs.h | 1 +
> >  include/linux/iversion.h | 4 ++++
> >  3 files changed, 6 insertions(+)
> > 
> > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > index 3430d6891e89..c2eb915a54ca 100644
> > --- a/fs/nfs/export.c
> > +++ b/fs/nfs/export.c
> > @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
> >  	.encode_fh = nfs_encode_fh,
> >  	.fh_to_dentry = nfs_fh_to_dentry,
> >  	.get_parent = nfs_get_parent,
> > +	.fetch_iversion = inode_peek_iversion_raw,
> >  };
> > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> > index 3ceb72b67a7a..6000121a201f 100644
> > --- a/include/linux/exportfs.h
> > +++ b/include/linux/exportfs.h
> > @@ -213,6 +213,7 @@ struct export_operations {
> >  			  bool write, u32 *device_generation);
> >  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> >  			     int nr_iomaps, struct iattr *iattr);
> > +	u64 (*fetch_iversion)(const struct inode *);
> >  };
> >  
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >  extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..481b3debf6bb 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -3,6 +3,7 @@
> >  #define _LINUX_IVERSION_H
> >  
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >  #include <linux/fs.h>
> > +#include <linux/exportfs.h>
> >  
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >  /*
> >   * The inode->i_version field:
> > @@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
> >  {
> >  	u64 cur, old, new;
> >  
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > +	if (inode->i_sb->s_export_op->fetch_iversion)
> > +		return inode->i_sb->s_export_op->fetch_iversion(inode);
> > +
> 
> This looks dangerous -- s_export_op could be a NULL pointer.
> 
> >  	cur = inode_peek_iversion_raw(inode);
> >  	for (;;) {
> >  		/* If flag is already set, then no need to swap */
> 
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  2020-11-17 12:34                                               ` Jeff Layton
@ 2020-11-17 15:25                                               ` J. Bruce Fields
  1 sibling, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17 15:25 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Mon, Nov 16, 2020 at 10:18:04PM -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> fill_{pre/post}_attr are unconditionally using i_version even when the
> underlying filesystem doesn't have proper support for i_version.

Actually, I didn't have this quite right....

These values are queried, but they aren't used, thanks to the
"change_supported" field of nfsd4_change_info; in set_change_info():

	cinfo->change_supported = IS_I_VERSION(d_inode(fhp->fh_dentry));

and then later on encode_cinfo() chooses to use stored change attribute
or ctime values depending on how change_supported.

But as of the ctime changes, just querying the change attribute here has
side effects.

So, that explains why Daire's team was seeing a performance regression,
while no one was complaining about our returned change info being
garbage.

Anyway.

--b.

> 
> Move the code that chooses which i_version to use to the common
> nfsd4_change_attribute().
> 
> The NFSEXP_V4ROOT case probably doesn't matter (the pseudoroot
> filesystem is usually read-only and unlikely to see operations with pre
> and post change attributes), but let's put it in the same place anyway
> for consistency.
> 
> Fixes: c654b8a9cba6 ("nfsd: support ext4 i_version")
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfsd/nfs4xdr.c | 11 +----------
>  fs/nfsd/nfsfh.c   | 11 +++++++----
>  fs/nfsd/nfsfh.h   | 23 -----------------------
>  fs/nfsd/vfs.c     | 32 ++++++++++++++++++++++++++++++++
>  fs/nfsd/vfs.h     |  3 +++
>  5 files changed, 43 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 833a2c64dfe8..6806207b6d18 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2295,16 +2295,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>  static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
>  			     struct svc_export *exp)
>  {
> -	if (exp->ex_flags & NFSEXP_V4ROOT) {
> -		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> -		*p++ = 0;
> -	} else if (IS_I_VERSION(inode)) {
> -		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
> -	} else {
> -		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> -		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
> -	}
> -	return p;
> +	return xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode, exp));
>  }
>  
>  /*
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index b3b4e8809aa9..4fbe1413e767 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -719,6 +719,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  {
>  	struct inode    *inode;
>  	struct kstat	stat;
> +	struct svc_export *exp = fhp->fh_export;
>  	__be32 err;
>  
>  	if (fhp->fh_pre_saved)
> @@ -736,7 +737,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_mtime = stat.mtime;
>  	fhp->fh_pre_ctime = stat.ctime;
>  	fhp->fh_pre_size  = stat.size;
> -	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> +	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode, exp);
>  	fhp->fh_pre_saved = true;
>  }
>  
> @@ -746,17 +747,19 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  void fill_post_wcc(struct svc_fh *fhp)
>  {
>  	__be32 err;
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +	struct svc_export *exp = fhp->fh_export;
>  
>  	if (fhp->fh_post_saved)
>  		printk("nfsd: inode locked twice during operation.\n");
>  
>  	err = fh_getattr(fhp, &fhp->fh_post_attr);
> -	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
> -						     d_inode(fhp->fh_dentry));
> +	fhp->fh_post_change =
> +			nfsd4_change_attribute(&fhp->fh_post_attr, inode, exp);
>  	if (err) {
>  		fhp->fh_post_saved = false;
>  		/* Grab the ctime anyway - set_change_info might use it */
> -		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
> +		fhp->fh_post_attr.ctime = inode->i_ctime;
>  	} else
>  		fhp->fh_post_saved = true;
>  }
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 56cfbc361561..547aef9b3265 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -245,29 +245,6 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
> -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> -					 struct inode *inode)
> -{
> -	u64 chattr;
> -
> -	chattr =  stat->ctime.tv_sec;
> -	chattr <<= 30;
> -	chattr += stat->ctime.tv_nsec;
> -	chattr += inode_query_iversion(inode);
> -	return chattr;
> -}
> -
>  extern void fill_pre_wcc(struct svc_fh *fhp);
>  extern void fill_post_wcc(struct svc_fh *fhp);
>  #else
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 1ecaceebee13..2c71b02dd1fe 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -2390,3 +2390,35 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
>  
>  	return err? nfserrno(err) : 0;
>  }
> +
> +/*
> + * We could use i_version alone as the change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + */
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +					 struct svc_export *exp)
> +{
> +	u64 chattr;
> +
> +	if (exp->ex_flags & NFSEXP_V4ROOT) {
> +		chattr = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> +		chattr <<= 32;
> +	} else if (IS_I_VERSION(inode)) {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 30;
> +		chattr += stat->ctime.tv_nsec;
> +		chattr += inode_query_iversion(inode);
> +	} else {
> +		chattr = stat->ctime.tv_sec;
> +		chattr <<= 32;
> +		chattr += stat->ctime.tv_nsec;
> +	}
> +	return chattr;
> +}
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index a2442ebe5acf..26ed15256340 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -132,6 +132,9 @@ __be32		nfsd_statfs(struct svc_rqst *, struct svc_fh *,
>  __be32		nfsd_permission(struct svc_rqst *, struct svc_export *,
>  				struct dentry *, int);
>  
> +u64		nfsd4_change_attribute(struct kstat *stat, struct inode *inode,
> +				struct svc_export *exp);
> +
>  static inline int fh_want_write(struct svc_fh *fh)
>  {
>  	int ret;
> -- 
> 2.28.0

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17 12:34                                               ` Jeff Layton
@ 2020-11-17 15:26                                                 ` J. Bruce Fields
  2020-11-17 15:34                                                   ` Jeff Layton
  0 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-17 15:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> I don't think I described what I was thinking well. Let me try again...
> 
> There should be no need to change the code in iversion.h -- I think we
> can do this in a way that's confined to just nfsd/export code.
> 
> What I would suggest is to have nfsd4_change_attribute call the
> fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> doing the stuff in that block. If fetch_iversion is NULL, then just use
> the ctime.
> 
> Then, you just need to make sure that the filesystems' export_ops have
> an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> The rest of the filesystems can leave fetch_iversion as NULL (since we
> don't want to use it on them).

Thanks for your patience, that makes sense, I'll try it.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17 15:26                                                 ` J. Bruce Fields
@ 2020-11-17 15:34                                                   ` Jeff Layton
  2020-11-20 22:38                                                     ` J. Bruce Fields
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-17 15:34 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > I don't think I described what I was thinking well. Let me try again...
> > 
> > There should be no need to change the code in iversion.h -- I think we
> > can do this in a way that's confined to just nfsd/export code.
> > 
> > What I would suggest is to have nfsd4_change_attribute call the
> > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > the ctime.
> > 
> > Then, you just need to make sure that the filesystems' export_ops have
> > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > don't want to use it on them).
> 
> Thanks for your patience, that makes sense, I'll try it.
> 

There is one gotcha in here though... ext4 needs to also handle the case
where SB_I_VERSION is not set. The simple fix might be to just have
different export ops for ext4 based on whether it was mounted with -o
iversion or not, but maybe there is some better way to do it?

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-17 15:34                                                   ` Jeff Layton
@ 2020-11-20 22:38                                                     ` J. Bruce Fields
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
  2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  0 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:38 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
> On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > > I don't think I described what I was thinking well. Let me try again...
> > > 
> > > There should be no need to change the code in iversion.h -- I think we
> > > can do this in a way that's confined to just nfsd/export code.
> > > 
> > > What I would suggest is to have nfsd4_change_attribute call the
> > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > > the ctime.
> > > 
> > > Then, you just need to make sure that the filesystems' export_ops have
> > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > > don't want to use it on them).
> > 
> > Thanks for your patience, that makes sense, I'll try it.
> > 
> 
> There is one gotcha in here though... ext4 needs to also handle the case
> where SB_I_VERSION is not set. The simple fix might be to just have
> different export ops for ext4 based on whether it was mounted with -o
> iversion or not, but maybe there is some better way to do it?

I was thinking ext4's export op could check for I_VERSION on its own and
vary behavior based on that.

I'll follow up with new patches in a moment.

I think the first one's all that's needed to fix the problem Daire
identified.  I'm a little less sure of the rest.

Lightly tested, just by running them through my usual regression tests
(which don't re-export) and then running connectathon on a 4.2 re-export
of a 4.2 mount.

The latter triggered a crash preceded by a KASAN use-after free warning.
Looks like it might be a problem with blocking lock notifications,
probably not related to these patches.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case
  2020-11-20 22:38                                                     ` J. Bruce Fields
@ 2020-11-20 22:39                                                       ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
                                                                           ` (6 more replies)
  2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
  1 sibling, 7 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

inode_query_iversion() can modify i_version.  Depending on the exported
filesystem, that may not be safe.  For example, if you're re-exporting
NFS, NFS stores the server's change attribute in i_version and does not
expect it to be modified locally.  This has been observed causing
unnecssary cache invalidations.

The way a filesystem indicates that it's OK to call
inode_query_iverson() is by setting SB_I_VERSION.

(This may look like a no-op--in the encode_change() case it's just
rearranging some code--but note nfsd4_change_attribute() is also called
from fill_pre_wcc() and fill_post_wcc().)

(Note we could also pull the NFSEXP_V4ROOT case into
nfsd4_change_attribute as well.  That would actually be a no-op, since
pre/post attrs are only used for metadata-modifying operations, and
V4ROOT exports are read-only.  But we might make the change in the
future just for simplicity.)

Reported-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c |  6 +-----
 fs/nfsd/nfsfh.h   | 14 ++++++++++----
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 833a2c64dfe8..56fd5f6d5c44 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2298,12 +2298,8 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
 	if (exp->ex_flags & NFSEXP_V4ROOT) {
 		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
 		*p++ = 0;
-	} else if (IS_I_VERSION(inode)) {
+	} else
 		p = xdr_encode_hyper(p, nfsd4_change_attribute(stat, inode));
-	} else {
-		*p++ = cpu_to_be32(stat->ctime.tv_sec);
-		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
-	}
 	return p;
 }
 
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 56cfbc361561..3faf5974fa4e 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -261,10 +261,16 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
 {
 	u64 chattr;
 
-	chattr =  stat->ctime.tv_sec;
-	chattr <<= 30;
-	chattr += stat->ctime.tv_nsec;
-	chattr += inode_query_iversion(inode);
+	if (IS_I_VERSION(inode)) {
+		chattr =  stat->ctime.tv_sec;
+		chattr <<= 30;
+		chattr += stat->ctime.tv_nsec;
+		chattr += inode_query_iversion(inode);
+	} else {
+		chattr = cpu_to_be32(stat->ctime.tv_sec);
+		chattr <<= 32;
+		chattr += cpu_to_be32(stat->ctime.tv_nsec);
+	}
 	return chattr;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 2/8] nfsd: simplify nfsd4_change_info
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
                                                                           ` (5 subsequent siblings)
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

It doesn't make sense to carry all these extra fields around.  Just
make everything into change attribute from the start.

This is just cleanup, there should be no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c        | 11 ++---------
 fs/nfsd/xdr4.h           | 22 +++++++++-------------
 include/linux/iversion.h | 13 +++++++++++++
 3 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 56fd5f6d5c44..18c912930947 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2331,15 +2331,8 @@ static __be32 *encode_time_delta(__be32 *p, struct inode *inode)
 static __be32 *encode_cinfo(__be32 *p, struct nfsd4_change_info *c)
 {
 	*p++ = cpu_to_be32(c->atomic);
-	if (c->change_supported) {
-		p = xdr_encode_hyper(p, c->before_change);
-		p = xdr_encode_hyper(p, c->after_change);
-	} else {
-		*p++ = cpu_to_be32(c->before_ctime_sec);
-		*p++ = cpu_to_be32(c->before_ctime_nsec);
-		*p++ = cpu_to_be32(c->after_ctime_sec);
-		*p++ = cpu_to_be32(c->after_ctime_nsec);
-	}
+	p = xdr_encode_hyper(p, c->before_change);
+	p = xdr_encode_hyper(p, c->after_change);
 	return p;
 }
 
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 679d40af1bbb..9c2d942d055d 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -76,12 +76,7 @@ static inline bool nfsd4_has_session(struct nfsd4_compound_state *cs)
 
 struct nfsd4_change_info {
 	u32		atomic;
-	bool		change_supported;
-	u32		before_ctime_sec;
-	u32		before_ctime_nsec;
 	u64		before_change;
-	u32		after_ctime_sec;
-	u32		after_ctime_nsec;
 	u64		after_change;
 };
 
@@ -768,15 +763,16 @@ set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
 {
 	BUG_ON(!fhp->fh_pre_saved);
 	cinfo->atomic = (u32)fhp->fh_post_saved;
-	cinfo->change_supported = IS_I_VERSION(d_inode(fhp->fh_dentry));
-
-	cinfo->before_change = fhp->fh_pre_change;
-	cinfo->after_change = fhp->fh_post_change;
-	cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
-	cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
-	cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;
-	cinfo->after_ctime_nsec = fhp->fh_post_attr.ctime.tv_nsec;
 
+	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
+		cinfo->before_change = fhp->fh_pre_change;
+		cinfo->after_change = fhp->fh_post_change;
+	} else {
+		cinfo->before_change =
+			time_to_chattr(&fhp->fh_pre_ctime);
+		cinfo->after_change =
+			time_to_chattr(&fhp->fh_post_attr.ctime);
+	}
 }
 
 
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..3bfebde5a1a6 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -328,6 +328,19 @@ inode_query_iversion(struct inode *inode)
 	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
+/*
+ * For filesystems without any sort of change attribute, the best we can
+ * do is fake one up from the ctime:
+ */
+static inline u64 time_to_chattr(struct timespec64 *t)
+{
+	u64 chattr = t->tv_sec;
+
+	chattr <<= 32;
+	chattr += t->tv_nsec;
+	return chattr;
+}
+
 /**
  * inode_eq_iversion_raw - check whether the raw i_version counter has changed
  * @inode: inode to check
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-21  0:34                                                           ` Jeff Layton
  2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
                                                                           ` (4 subsequent siblings)
  6 siblings, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

Minor cleanup, no change in behavior

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfsfh.h | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 3faf5974fa4e..45bd776290d5 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -259,19 +259,16 @@ fh_clear_wcc(struct svc_fh *fhp)
 static inline u64 nfsd4_change_attribute(struct kstat *stat,
 					 struct inode *inode)
 {
-	u64 chattr;
-
 	if (IS_I_VERSION(inode)) {
+		u64 chattr;
+
 		chattr =  stat->ctime.tv_sec;
 		chattr <<= 30;
 		chattr += stat->ctime.tv_nsec;
 		chattr += inode_query_iversion(inode);
-	} else {
-		chattr = cpu_to_be32(stat->ctime.tv_sec);
-		chattr <<= 32;
-		chattr += cpu_to_be32(stat->ctime.tv_nsec);
-	}
-	return chattr;
+		return chattr;
+	} else
+		return time_to_chattr(&stat->ctime);
 }
 
 extern void fill_pre_wcc(struct svc_fh *fhp);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
                                                                           ` (3 subsequent siblings)
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

inode_query_iversion() has side effects, and there's no point calling it
when we're not even going to use it.

We check whether we're currently processing a v4 request by checking
fh_maxsize, which is arguably a little hacky; we could add a flag to
svc_fh instead.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs3xdr.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2277f83da250..2732b04d3878 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -259,11 +259,11 @@ void fill_pre_wcc(struct svc_fh *fhp)
 {
 	struct inode    *inode;
 	struct kstat	stat;
+	bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
 	__be32 err;
 
 	if (fhp->fh_pre_saved)
 		return;
-
 	inode = d_inode(fhp->fh_dentry);
 	err = fh_getattr(fhp, &stat);
 	if (err) {
@@ -272,11 +272,12 @@ void fill_pre_wcc(struct svc_fh *fhp)
 		stat.ctime = inode->i_ctime;
 		stat.size  = inode->i_size;
 	}
+	if (v4)
+		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
 
 	fhp->fh_pre_mtime = stat.mtime;
 	fhp->fh_pre_ctime = stat.ctime;
 	fhp->fh_pre_size  = stat.size;
-	fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
 	fhp->fh_pre_saved = true;
 }
 
@@ -285,18 +286,21 @@ void fill_pre_wcc(struct svc_fh *fhp)
  */
 void fill_post_wcc(struct svc_fh *fhp)
 {
+	bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
+	struct inode *inode = d_inode(fhp->fh_dentry);
 	__be32 err;
 
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
 	err = fh_getattr(fhp, &fhp->fh_post_attr);
-	fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
-						     d_inode(fhp->fh_dentry));
+	if (v4)
+		fhp->fh_post_change =
+			nfsd4_change_attribute(&fhp->fh_post_attr, inode);
 	if (err) {
 		fhp->fh_post_saved = false;
 		/* Grab the ctime anyway - set_change_info might use it */
-		fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
+		fhp->fh_post_attr.ctime = inode->i_ctime;
 	} else
 		fhp->fh_post_saved = true;
 }
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 5/8] nfs: use change attribute for NFS re-exports
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (2 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
                                                                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

When exporting NFS, we may as well use the real change attribute
returned by the original server instead of faking up a change attribute
from the ctime.

Note we can't do that by setting I_VERSION--that would also turn on the
logic in iversion.h which treats the lower bit specially, and that
doesn't make sense for NFS.

So instead we define a new export operation for filesystems like NFS
that want to manage the change attribute themselves.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfs/export.c          | 18 ++++++++++++++++++
 fs/nfsd/nfsfh.h          |  5 ++++-
 include/linux/exportfs.h |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 3430d6891e89..0b10c5946060 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -167,8 +167,26 @@ nfs_get_parent(struct dentry *dentry)
 	return parent;
 }
 
+static u64 nfs_fetch_iversion(struct inode *inode)
+{
+	struct nfs_server *server = NFS_SERVER(inode);
+
+	/* Is this the right call?: */
+	nfs_revalidate_inode(server, inode);
+	/*
+	 * Also, note we're ignoring any returned error.  That seems to be
+	 * the practice for cache consistency information elsewhere in
+	 * the server, but I'm not sure why.
+	 */
+	if (server->nfs_client->rpc_ops->version >= 4)
+		return inode_peek_iversion_raw(inode);
+	else
+		return time_to_chattr(&inode->i_ctime);
+}
+
 const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
+	.fetch_iversion = nfs_fetch_iversion,
 };
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 45bd776290d5..2656a3464c6c 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -12,6 +12,7 @@
 #include <linux/sunrpc/svc.h>
 #include <uapi/linux/nfsd/nfsfh.h>
 #include <linux/iversion.h>
+#include <linux/exportfs.h>
 
 static inline __u32 ino_t_to_u32(ino_t ino)
 {
@@ -259,7 +260,9 @@ fh_clear_wcc(struct svc_fh *fhp)
 static inline u64 nfsd4_change_attribute(struct kstat *stat,
 					 struct inode *inode)
 {
-	if (IS_I_VERSION(inode)) {
+	if (inode->i_sb->s_export_op->fetch_iversion)
+		return inode->i_sb->s_export_op->fetch_iversion(inode);
+	else if (IS_I_VERSION(inode)) {
 		u64 chattr;
 
 		chattr =  stat->ctime.tv_sec;
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 3ceb72b67a7a..da6f0a905b94 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,6 +213,7 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
+	u64 (*fetch_iversion)(struct inode *);
 };
 
 extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (3 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-21  0:58                                                           ` Jeff Layton
  2020-11-21 13:00                                                           ` Jeff Layton
  2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
  6 siblings, 2 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

After this, only filesystems lacking change attribute support will leave
the fetch_iversion export op NULL.

This seems cleaner to me, and will allow some minor optimizations in the
nfsd code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/btrfs/export.c        |  2 ++
 fs/ext4/super.c          |  9 +++++++++
 fs/nfsd/nfs4xdr.c        |  2 +-
 fs/nfsd/nfsfh.h          | 25 +++----------------------
 fs/nfsd/xdr4.h           |  4 +++-
 fs/xfs/xfs_export.c      |  2 ++
 include/linux/iversion.h | 26 ++++++++++++++++++++++++++
 7 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 1a8d419d9e1f..ece32440999a 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -7,6 +7,7 @@
 #include "btrfs_inode.h"
 #include "print-tree.h"
 #include "export.h"
+#include <linux/iversion.h>
 
 #define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
 						 parent_objectid) / 4)
@@ -279,4 +280,5 @@ const struct export_operations btrfs_export_ops = {
 	.fh_to_parent	= btrfs_fh_to_parent,
 	.get_parent	= btrfs_get_parent,
 	.get_name	= btrfs_get_name,
+	.fetch_iversion	= generic_fetch_iversion,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ef4734b40e2a..a4f48273d435 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1685,11 +1685,20 @@ static const struct super_operations ext4_sops = {
 	.bdev_try_to_free_page = bdev_try_to_free_page,
 };
 
+static u64 ext4_fetch_iversion(struct inode *inode)
+{
+	if (IS_I_VERSION(inode))
+		return generic_fetch_iversion(inode);
+	else
+		return time_to_chattr(&inode->i_ctime);
+}
+
 static const struct export_operations ext4_export_ops = {
 	.fh_to_dentry = ext4_fh_to_dentry,
 	.fh_to_parent = ext4_fh_to_parent,
 	.get_parent = ext4_get_parent,
 	.commit_metadata = ext4_nfs_commit_metadata,
+	.fetch_iversion = ext4_fetch_iversion,
 };
 
 enum {
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 18c912930947..182190684792 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3187,7 +3187,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		p = xdr_reserve_space(xdr, 4);
 		if (!p)
 			goto out_resource;
-		if (IS_I_VERSION(d_inode(dentry)))
+		if (IS_I_VERSION(d_inode(dentry))
 			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
 		else
 			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 2656a3464c6c..ac3e309d7339 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -46,8 +46,8 @@ typedef struct svc_fh {
 	struct timespec64	fh_pre_mtime;	/* mtime before oper */
 	struct timespec64	fh_pre_ctime;	/* ctime before oper */
 	/*
-	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
-	 *  to find out if it is valid.
+	 * pre-op nfsv4 change attr: note must check for fetch_iversion
+	 * op to find out if it is valid.
 	 */
 	u64			fh_pre_change;
 
@@ -246,31 +246,12 @@ fh_clear_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
 static inline u64 nfsd4_change_attribute(struct kstat *stat,
 					 struct inode *inode)
 {
 	if (inode->i_sb->s_export_op->fetch_iversion)
 		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
+	else
 		return time_to_chattr(&stat->ctime);
 }
 
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 9c2d942d055d..f0c8fbe704a2 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -761,10 +761,12 @@ void warn_on_nonidempotent_op(struct nfsd4_op *op);
 static inline void
 set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
 {
+	struct inode *inode = d_inode(fhp->fh_dentry);
+
 	BUG_ON(!fhp->fh_pre_saved);
 	cinfo->atomic = (u32)fhp->fh_post_saved;
 
-	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
+	if (inode->i_sb->s_export_op->fetch_iversion) {
 		cinfo->before_change = fhp->fh_pre_change;
 		cinfo->after_change = fhp->fh_post_change;
 	} else {
diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
index 465fd9e048d4..b950fac3d7df 100644
--- a/fs/xfs/xfs_export.c
+++ b/fs/xfs/xfs_export.c
@@ -16,6 +16,7 @@
 #include "xfs_inode_item.h"
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
+#include <linux/iversion.h>
 
 /*
  * Note that we only accept fileids which are long enough rather than allow
@@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
 	.map_blocks		= xfs_fs_map_blocks,
 	.commit_blocks		= xfs_fs_commit_blocks,
 #endif
+	.fetch_iversion		= generic_fetch_iversion,
 };
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 3bfebde5a1a6..ded74523c8a6 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
 	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
+/*
+ * We could use i_version alone as the NFSv4 change attribute.  However,
+ * i_version can go backwards after a reboot.  On its own that doesn't
+ * necessarily cause a problem, but if i_version goes backwards and then
+ * is incremented again it could reuse a value that was previously used
+ * before boot, and a client who queried the two values might
+ * incorrectly assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as
+ * long as time doesn't go backwards we never reuse an old value.
+ *
+ * A filesystem that has an on-disk boot counter or similar might prefer
+ * to use that to avoid the risk of the change attribute going backwards
+ * if system time is set backwards.
+ */
+static inline u64 generic_fetch_iversion(struct inode *inode)
+{
+	u64 chattr;
+
+	chattr =  inode->i_ctime.tv_sec;
+	chattr <<= 30;
+	chattr += inode->i_ctime.tv_nsec;
+	chattr += inode_query_iversion(inode);
+	return chattr;
+}
+
 /*
  * For filesystems without any sort of change attribute, the best we can
  * do is fake one up from the ctime:
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (4 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

In the typical case of v4 and a i_version-supporting filesystem, we can
skip a stat which is only required to fake up a change attribute from
ctime.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs3xdr.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2732b04d3878..8502a493be6d 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -265,19 +265,21 @@ void fill_pre_wcc(struct svc_fh *fhp)
 	if (fhp->fh_pre_saved)
 		return;
 	inode = d_inode(fhp->fh_dentry);
-	err = fh_getattr(fhp, &stat);
-	if (err) {
-		/* Grab the times from inode anyway */
-		stat.mtime = inode->i_mtime;
-		stat.ctime = inode->i_ctime;
-		stat.size  = inode->i_size;
+	if (!v4 || !inode->i_sb->s_export_op->fetch_iversion) {
+		err = fh_getattr(fhp, &stat);
+		if (err) {
+			/* Grab the times from inode anyway */
+			stat.mtime = inode->i_mtime;
+			stat.ctime = inode->i_ctime;
+			stat.size  = inode->i_size;
+		}
+		fhp->fh_pre_mtime = stat.mtime;
+		fhp->fh_pre_ctime = stat.ctime;
+		fhp->fh_pre_size  = stat.size;
 	}
 	if (v4)
 		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
 
-	fhp->fh_pre_mtime = stat.mtime;
-	fhp->fh_pre_ctime = stat.ctime;
-	fhp->fh_pre_size  = stat.size;
 	fhp->fh_pre_saved = true;
 }
 
@@ -293,7 +295,9 @@ void fill_post_wcc(struct svc_fh *fhp)
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
-	err = fh_getattr(fhp, &fhp->fh_post_attr);
+
+	if (!v4 || !inode->i_sb->s_export_op->fetch_iversion)
+		err = fh_getattr(fhp, &fhp->fh_post_attr);
 	if (v4)
 		fhp->fh_post_change =
 			nfsd4_change_attribute(&fhp->fh_post_attr, inode);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute"
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
                                                                           ` (5 preceding siblings ...)
  2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
@ 2020-11-20 22:39                                                         ` J. Bruce Fields
  6 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

This reverts commit a85857633b04d57f4524cca0a2bfaf87b2543f9f.

We're still factoring ctime into our change attribute even in the
IS_I_VERSION case.  If someone sets the system time backwards, a client
could see the change attribute go backwards.  Maybe we can just say
"well, don't do that", but there's some question whether that's good
enough, or whether we need a better guarantee.

Also, the client still isn't actually using the attribute.

While we're still figuring this out, let's just stop returning this
attribute.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4xdr.c    | 10 ----------
 fs/nfsd/nfsd.h       |  1 -
 include/linux/nfs4.h |  8 --------
 3 files changed, 19 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 182190684792..c33838caf8c6 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3183,16 +3183,6 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			goto out;
 	}
 
-	if (bmval2 & FATTR4_WORD2_CHANGE_ATTR_TYPE) {
-		p = xdr_reserve_space(xdr, 4);
-		if (!p)
-			goto out_resource;
-		if (IS_I_VERSION(d_inode(dentry))
-			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
-		else
-			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
-	}
-
 #ifdef CONFIG_NFSD_V4_SECURITY_LABEL
 	if (bmval2 & FATTR4_WORD2_SECURITY_LABEL) {
 		status = nfsd4_encode_security_label(xdr, rqstp, context,
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index cb742e17e04a..40cb40ac0a65 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -387,7 +387,6 @@ void		nfsd_lockd_shutdown(void);
 
 #define NFSD4_2_SUPPORTED_ATTRS_WORD2 \
 	(NFSD4_1_SUPPORTED_ATTRS_WORD2 | \
-	FATTR4_WORD2_CHANGE_ATTR_TYPE | \
 	FATTR4_WORD2_MODE_UMASK | \
 	NFSD4_2_SECURITY_ATTRS | \
 	FATTR4_WORD2_XATTR_SUPPORT)
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 9dc7eeac924f..5b4c67c91f56 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -385,13 +385,6 @@ enum lock_type4 {
 	NFS4_WRITEW_LT = 4
 };
 
-enum change_attr_type4 {
-	NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0,
-	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1,
-	NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
-	NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3,
-	NFS4_CHANGE_TYPE_IS_UNDEFINED = 4
-};
 
 /* Mandatory Attributes */
 #define FATTR4_WORD0_SUPPORTED_ATTRS    (1UL << 0)
@@ -459,7 +452,6 @@ enum change_attr_type4 {
 #define FATTR4_WORD2_LAYOUT_BLKSIZE     (1UL << 1)
 #define FATTR4_WORD2_MDSTHRESHOLD       (1UL << 4)
 #define FATTR4_WORD2_CLONE_BLKSIZE	(1UL << 13)
-#define FATTR4_WORD2_CHANGE_ATTR_TYPE	(1UL << 15)
 #define FATTR4_WORD2_SECURITY_LABEL     (1UL << 16)
 #define FATTR4_WORD2_MODE_UMASK		(1UL << 17)
 #define FATTR4_WORD2_XATTR_SUPPORT	(1UL << 18)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-20 22:38                                                     ` J. Bruce Fields
  2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
@ 2020-11-20 22:44                                                       ` J. Bruce Fields
  2020-11-21  1:03                                                         ` Jeff Layton
  1 sibling, 1 reply; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-20 22:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
> On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
> > On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> > > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > > > I don't think I described what I was thinking well. Let me try again...
> > > > 
> > > > There should be no need to change the code in iversion.h -- I think we
> > > > can do this in a way that's confined to just nfsd/export code.
> > > > 
> > > > What I would suggest is to have nfsd4_change_attribute call the
> > > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > > > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > > > the ctime.
> > > > 
> > > > Then, you just need to make sure that the filesystems' export_ops have
> > > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > > > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > > > don't want to use it on them).
> > > 
> > > Thanks for your patience, that makes sense, I'll try it.
> > > 
> > 
> > There is one gotcha in here though... ext4 needs to also handle the case
> > where SB_I_VERSION is not set. The simple fix might be to just have
> > different export ops for ext4 based on whether it was mounted with -o
> > iversion or not, but maybe there is some better way to do it?
> 
> I was thinking ext4's export op could check for I_VERSION on its own and
> vary behavior based on that.
> 
> I'll follow up with new patches in a moment.
> 
> I think the first one's all that's needed to fix the problem Daire
> identified.  I'm a little less sure of the rest.
> 
> Lightly tested, just by running them through my usual regression tests
> (which don't re-export) and then running connectathon on a 4.2 re-export
> of a 4.2 mount.
> 
> The latter triggered a crash preceded by a KASAN use-after free warning.
> Looks like it might be a problem with blocking lock notifications,
> probably not related to these patches.

Another nit I ran across:

Some NFSv4 directory-modifying operations return pre- and post- change
attributes together with an "atomic" flag that's supposed to indicate
whether the change attributes were read atomically with the operation.
It looks like we're setting the atomic flag under the assumptions that
local vfs locks are sufficient to guarantee atomicity, which isn't right
when we're exporting a distributed filesystem.

In the case we're reexporting NFS I guess ideal would be to use the pre-
and post- attributes that the original server returned and also save
having to do extra getattr calls.  Not sure how we'd do that,
though--more export operations?  Maybe for now we could just figure out
when to turn off the atomic bit.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup
  2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
@ 2020-11-21  0:34                                                           ` Jeff Layton
  0 siblings, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-11-21  0:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> Minor cleanup, no change in behavior
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/nfsd/nfsfh.h | 13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 3faf5974fa4e..45bd776290d5 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -259,19 +259,16 @@ fh_clear_wcc(struct svc_fh *fhp)
>  static inline u64 nfsd4_change_attribute(struct kstat *stat,
>  					 struct inode *inode)
>  {
> -	u64 chattr;
> -
>  	if (IS_I_VERSION(inode)) {
> +		u64 chattr;
> +
>  		chattr =  stat->ctime.tv_sec;
>  		chattr <<= 30;
>  		chattr += stat->ctime.tv_nsec;
>  		chattr += inode_query_iversion(inode);
> -	} else {
> -		chattr = cpu_to_be32(stat->ctime.tv_sec);
> -		chattr <<= 32;
> -		chattr += cpu_to_be32(stat->ctime.tv_nsec);
> -	}
> -	return chattr;
> +		return chattr;
> +	} else
> +		return time_to_chattr(&stat->ctime);
>  }
>  
> 
>  extern void fill_pre_wcc(struct svc_fh *fhp);

I'd just fold this one into 2/8.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
@ 2020-11-21  0:58                                                           ` Jeff Layton
  2020-11-21  1:01                                                             ` J. Bruce Fields
  2020-11-21 13:00                                                           ` Jeff Layton
  1 sibling, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-21  0:58 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> After this, only filesystems lacking change attribute support will leave
> the fetch_iversion export op NULL.
> 
> This seems cleaner to me, and will allow some minor optimizations in the
> nfsd code.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/btrfs/export.c        |  2 ++
>  fs/ext4/super.c          |  9 +++++++++
>  fs/nfsd/nfs4xdr.c        |  2 +-
>  fs/nfsd/nfsfh.h          | 25 +++----------------------
>  fs/nfsd/xdr4.h           |  4 +++-
>  fs/xfs/xfs_export.c      |  2 ++
>  include/linux/iversion.h | 26 ++++++++++++++++++++++++++
>  7 files changed, 46 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 1a8d419d9e1f..ece32440999a 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -7,6 +7,7 @@
>  #include "btrfs_inode.h"
>  #include "print-tree.h"
>  #include "export.h"
> +#include <linux/iversion.h>
>  
> 
> 
> 
>  #define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
>  						 parent_objectid) / 4)
> @@ -279,4 +280,5 @@ const struct export_operations btrfs_export_ops = {
>  	.fh_to_parent	= btrfs_fh_to_parent,
>  	.get_parent	= btrfs_get_parent,
>  	.get_name	= btrfs_get_name,
> +	.fetch_iversion	= generic_fetch_iversion,
>  };
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ef4734b40e2a..a4f48273d435 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1685,11 +1685,20 @@ static const struct super_operations ext4_sops = {
>  	.bdev_try_to_free_page = bdev_try_to_free_page,
>  };
>  
> 
> 
> 
> +static u64 ext4_fetch_iversion(struct inode *inode)
> +{
> +	if (IS_I_VERSION(inode))
> +		return generic_fetch_iversion(inode);
> +	else
> +		return time_to_chattr(&inode->i_ctime);
> +}
> +
>  static const struct export_operations ext4_export_ops = {
>  	.fh_to_dentry = ext4_fh_to_dentry,
>  	.fh_to_parent = ext4_fh_to_parent,
>  	.get_parent = ext4_get_parent,
>  	.commit_metadata = ext4_nfs_commit_metadata,
> +	.fetch_iversion = ext4_fetch_iversion,
>  };
>  
> 
> 
> 
>  enum {
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 18c912930947..182190684792 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3187,7 +3187,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		p = xdr_reserve_space(xdr, 4);
>  		if (!p)
>  			goto out_resource;
> -		if (IS_I_VERSION(d_inode(dentry)))
> +		if (IS_I_VERSION(d_inode(dentry))
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
>  		else
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 2656a3464c6c..ac3e309d7339 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -46,8 +46,8 @@ typedef struct svc_fh {
>  	struct timespec64	fh_pre_mtime;	/* mtime before oper */
>  	struct timespec64	fh_pre_ctime;	/* ctime before oper */
>  	/*
> -	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
> -	 *  to find out if it is valid.
> +	 * pre-op nfsv4 change attr: note must check for fetch_iversion
> +	 * op to find out if it is valid.
>  	 */
>  	u64			fh_pre_change;
>  
> 
> 
> 
> @@ -246,31 +246,12 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> 
> 
> 
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
>  static inline u64 nfsd4_change_attribute(struct kstat *stat,
>  					 struct inode *inode)
>  {
>  	if (inode->i_sb->s_export_op->fetch_iversion)
>  		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> +	else
>  		return time_to_chattr(&stat->ctime);
>  }
>  
> 
> 
> 
> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
> index 9c2d942d055d..f0c8fbe704a2 100644
> --- a/fs/nfsd/xdr4.h
> +++ b/fs/nfsd/xdr4.h
> @@ -761,10 +761,12 @@ void warn_on_nonidempotent_op(struct nfsd4_op *op);
>  static inline void
>  set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
>  {
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +
>  	BUG_ON(!fhp->fh_pre_saved);
>  	cinfo->atomic = (u32)fhp->fh_post_saved;
>  
> 
> 
> 
> -	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
> +	if (inode->i_sb->s_export_op->fetch_iversion) {
>  		cinfo->before_change = fhp->fh_pre_change;
>  		cinfo->after_change = fhp->fh_post_change;
>  	} else {
> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> index 465fd9e048d4..b950fac3d7df 100644
> --- a/fs/xfs/xfs_export.c
> +++ b/fs/xfs/xfs_export.c
> @@ -16,6 +16,7 @@
>  #include "xfs_inode_item.h"
>  #include "xfs_icache.h"
>  #include "xfs_pnfs.h"
> +#include <linux/iversion.h>
>  
> 
> 
> 
>  /*
>   * Note that we only accept fileids which are long enough rather than allow
> @@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
>  	.map_blocks		= xfs_fs_map_blocks,
>  	.commit_blocks		= xfs_fs_commit_blocks,
>  #endif
> +	.fetch_iversion		= generic_fetch_iversion,
>  };

It seems a little weird to call a static inline here. I imagine that
means the compiler has to add a duplicate inline in every .o file that
does this? It may be cleaner to move generic_fetch_iversion into
fs/libfs.c so we only have one copy of it.

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..ded74523c8a6 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
>  	return cur >> I_VERSION_QUERIED_SHIFT;
>  }
>  
> 
> 
> 
> +/*
> + * We could use i_version alone as the NFSv4 change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + *
> + * A filesystem that has an on-disk boot counter or similar might prefer
> + * to use that to avoid the risk of the change attribute going backwards
> + * if system time is set backwards.
> + */
> +static inline u64 generic_fetch_iversion(struct inode *inode)
> +{
> +	u64 chattr;
> +
> +	chattr =  inode->i_ctime.tv_sec;
> +	chattr <<= 30;
> +	chattr += inode->i_ctime.tv_nsec;
> +	chattr += inode_query_iversion(inode);
> +	return chattr;
> +}
> +
>  /*
>   * For filesystems without any sort of change attribute, the best we can
>   * do is fake one up from the ctime:

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-21  0:58                                                           ` Jeff Layton
@ 2020-11-21  1:01                                                             ` J. Bruce Fields
  0 siblings, 0 replies; 129+ messages in thread
From: J. Bruce Fields @ 2020-11-21  1:01 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, Nov 20, 2020 at 07:58:38PM -0500, Jeff Layton wrote:
> On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> > diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> > index 465fd9e048d4..b950fac3d7df 100644
> > --- a/fs/xfs/xfs_export.c
> > +++ b/fs/xfs/xfs_export.c
> > @@ -16,6 +16,7 @@
> >  #include "xfs_inode_item.h"
> >  #include "xfs_icache.h"
> >  #include "xfs_pnfs.h"
> > +#include <linux/iversion.h>
> >  
> > 
> > 
> > 
> >  /*
> >   * Note that we only accept fileids which are long enough rather than allow
> > @@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
> >  	.map_blocks		= xfs_fs_map_blocks,
> >  	.commit_blocks		= xfs_fs_commit_blocks,
> >  #endif
> > +	.fetch_iversion		= generic_fetch_iversion,
> >  };
> 
> It seems a little weird to call a static inline here. I imagine that
> means the compiler has to add a duplicate inline in every .o file that
> does this? It may be cleaner to move generic_fetch_iversion into
> fs/libfs.c so we only have one copy of it.

OK.

(To be honest, I was a little suprised this worked.)

--b.

> 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 3bfebde5a1a6..ded74523c8a6 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
> >  	return cur >> I_VERSION_QUERIED_SHIFT;
> >  }
> >  
> > 
> > 
> > 
> > +/*
> > + * We could use i_version alone as the NFSv4 change attribute.  However,
> > + * i_version can go backwards after a reboot.  On its own that doesn't
> > + * necessarily cause a problem, but if i_version goes backwards and then
> > + * is incremented again it could reuse a value that was previously used
> > + * before boot, and a client who queried the two values might
> > + * incorrectly assume nothing changed.
> > + *
> > + * By using both ctime and the i_version counter we guarantee that as
> > + * long as time doesn't go backwards we never reuse an old value.
> > + *
> > + * A filesystem that has an on-disk boot counter or similar might prefer
> > + * to use that to avoid the risk of the change attribute going backwards
> > + * if system time is set backwards.
> > + */
> > +static inline u64 generic_fetch_iversion(struct inode *inode)
> > +{
> > +	u64 chattr;
> > +
> > +	chattr =  inode->i_ctime.tv_sec;
> > +	chattr <<= 30;
> > +	chattr += inode->i_ctime.tv_nsec;
> > +	chattr += inode_query_iversion(inode);
> > +	return chattr;
> > +}
> > +
> >  /*
> >   * For filesystems without any sort of change attribute, the best we can
> >   * do is fake one up from the ctime:
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
@ 2020-11-21  1:03                                                         ` Jeff Layton
  2020-11-21 21:44                                                           ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Layton @ 2020-11-21  1:03 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: J. Bruce Fields, Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:44 -0500, J. Bruce Fields wrote:
> On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
> > On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
> > > On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
> > > > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
> > > > > I don't think I described what I was thinking well. Let me try again...
> > > > > 
> > > > > There should be no need to change the code in iversion.h -- I think we
> > > > > can do this in a way that's confined to just nfsd/export code.
> > > > > 
> > > > > What I would suggest is to have nfsd4_change_attribute call the
> > > > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
> > > > > doing the stuff in that block. If fetch_iversion is NULL, then just use
> > > > > the ctime.
> > > > > 
> > > > > Then, you just need to make sure that the filesystems' export_ops have
> > > > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
> > > > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
> > > > > The rest of the filesystems can leave fetch_iversion as NULL (since we
> > > > > don't want to use it on them).
> > > > 
> > > > Thanks for your patience, that makes sense, I'll try it.
> > > > 
> > > 
> > > There is one gotcha in here though... ext4 needs to also handle the case
> > > where SB_I_VERSION is not set. The simple fix might be to just have
> > > different export ops for ext4 based on whether it was mounted with -o
> > > iversion or not, but maybe there is some better way to do it?
> > 
> > I was thinking ext4's export op could check for I_VERSION on its own and
> > vary behavior based on that.
> > 
> > I'll follow up with new patches in a moment.
> > 
> > I think the first one's all that's needed to fix the problem Daire
> > identified.  I'm a little less sure of the rest.
> > 
> > Lightly tested, just by running them through my usual regression tests
> > (which don't re-export) and then running connectathon on a 4.2 re-export
> > of a 4.2 mount.
> > 
> > The latter triggered a crash preceded by a KASAN use-after free warning.
> > Looks like it might be a problem with blocking lock notifications,
> > probably not related to these patches.
> 

The set looks pretty reasonable at first glance. Nice work.

Once you put this in, I'll plan to add a suitable fetch_iversion op for
ceph too.

> Another nit I ran across:
> 
> Some NFSv4 directory-modifying operations return pre- and post- change
> attributes together with an "atomic" flag that's supposed to indicate
> whether the change attributes were read atomically with the operation.
> It looks like we're setting the atomic flag under the assumptions that
> local vfs locks are sufficient to guarantee atomicity, which isn't right
> when we're exporting a distributed filesystem.
> 
> In the case we're reexporting NFS I guess ideal would be to use the pre-
> and post- attributes that the original server returned and also save
> having to do extra getattr calls.  Not sure how we'd do that,
> though--more export operations?  Maybe for now we could just figure out
> when to turn off the atomic bit.

Oh yeah, good point.

I'm not even sure that local locks are really enough -- IIRC, there are
still some race windows between doing the metadata operations and the
getattrs called to fill pre/post op attrs. Still, those windows are a
lot larger on something like NFS, so setting the flag there is really
stretching things.

One hacky fix might be to add a flags field to export_operations, and
have one that indicates that the atomic flag shouldn't be set. Then we
could add that flag to all of the netfs' (nfs, ceph, cifs), and anywhere
else that we thought it appropriate?

That approach might be helpful later too since we're starting to see a
wider variety of exportable filesystems these days. We may need more
"quirk" flags like this.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 6/8] nfsd: move change attribute generation to filesystem
  2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
  2020-11-21  0:58                                                           ` Jeff Layton
@ 2020-11-21 13:00                                                           ` Jeff Layton
  1 sibling, 0 replies; 129+ messages in thread
From: Jeff Layton @ 2020-11-21 13:00 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Daire Byrne, Trond Myklebust, linux-cachefs, linux-nfs

On Fri, 2020-11-20 at 17:39 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> After this, only filesystems lacking change attribute support will leave
> the fetch_iversion export op NULL.
> 
> This seems cleaner to me, and will allow some minor optimizations in the
> nfsd code.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/btrfs/export.c        |  2 ++
>  fs/ext4/super.c          |  9 +++++++++
>  fs/nfsd/nfs4xdr.c        |  2 +-
>  fs/nfsd/nfsfh.h          | 25 +++----------------------
>  fs/nfsd/xdr4.h           |  4 +++-
>  fs/xfs/xfs_export.c      |  2 ++
>  include/linux/iversion.h | 26 ++++++++++++++++++++++++++
>  7 files changed, 46 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 1a8d419d9e1f..ece32440999a 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -7,6 +7,7 @@
>  #include "btrfs_inode.h"
>  #include "print-tree.h"
>  #include "export.h"
> +#include <linux/iversion.h>
>  
> 
>  #define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
>  						 parent_objectid) / 4)
> @@ -279,4 +280,5 @@ const struct export_operations btrfs_export_ops = {
>  	.fh_to_parent	= btrfs_fh_to_parent,
>  	.get_parent	= btrfs_get_parent,
>  	.get_name	= btrfs_get_name,
> +	.fetch_iversion	= generic_fetch_iversion,
>  };
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ef4734b40e2a..a4f48273d435 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1685,11 +1685,20 @@ static const struct super_operations ext4_sops = {
>  	.bdev_try_to_free_page = bdev_try_to_free_page,
>  };
>  
> 
> +static u64 ext4_fetch_iversion(struct inode *inode)
> +{
> +	if (IS_I_VERSION(inode))
> +		return generic_fetch_iversion(inode);
> +	else
> +		return time_to_chattr(&inode->i_ctime);
> +}
> +
>  static const struct export_operations ext4_export_ops = {
>  	.fh_to_dentry = ext4_fh_to_dentry,
>  	.fh_to_parent = ext4_fh_to_parent,
>  	.get_parent = ext4_get_parent,
>  	.commit_metadata = ext4_nfs_commit_metadata,
> +	.fetch_iversion = ext4_fetch_iversion,
>  };
>  
> 
>  enum {
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 18c912930947..182190684792 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3187,7 +3187,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		p = xdr_reserve_space(xdr, 4);
>  		if (!p)
>  			goto out_resource;
> -		if (IS_I_VERSION(d_inode(dentry)))
> +		if (IS_I_VERSION(d_inode(dentry))
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR);
>  		else
>  			*p++ = cpu_to_be32(NFS4_CHANGE_TYPE_IS_TIME_METADATA);
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 2656a3464c6c..ac3e309d7339 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -46,8 +46,8 @@ typedef struct svc_fh {
>  	struct timespec64	fh_pre_mtime;	/* mtime before oper */
>  	struct timespec64	fh_pre_ctime;	/* ctime before oper */
>  	/*
> -	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
> -	 *  to find out if it is valid.
> +	 * pre-op nfsv4 change attr: note must check for fetch_iversion
> +	 * op to find out if it is valid.
>  	 */
>  	u64			fh_pre_change;
>  
> 
> @@ -246,31 +246,12 @@ fh_clear_wcc(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> 
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
>  static inline u64 nfsd4_change_attribute(struct kstat *stat,
>  					 struct inode *inode)
>  {
>  	if (inode->i_sb->s_export_op->fetch_iversion)
>  		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> +	else
>  		return time_to_chattr(&stat->ctime);
>  }
>  
> 
> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
> index 9c2d942d055d..f0c8fbe704a2 100644
> --- a/fs/nfsd/xdr4.h
> +++ b/fs/nfsd/xdr4.h
> @@ -761,10 +761,12 @@ void warn_on_nonidempotent_op(struct nfsd4_op *op);
>  static inline void
>  set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
>  {
> +	struct inode *inode = d_inode(fhp->fh_dentry);
> +
>  	BUG_ON(!fhp->fh_pre_saved);
>  	cinfo->atomic = (u32)fhp->fh_post_saved;
>  
> 
> -	if (IS_I_VERSION(d_inode(fhp->fh_dentry))) {
> +	if (inode->i_sb->s_export_op->fetch_iversion) {
>  		cinfo->before_change = fhp->fh_pre_change;
>  		cinfo->after_change = fhp->fh_post_change;
>  	} else {
> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> index 465fd9e048d4..b950fac3d7df 100644
> --- a/fs/xfs/xfs_export.c
> +++ b/fs/xfs/xfs_export.c
> @@ -16,6 +16,7 @@
>  #include "xfs_inode_item.h"
>  #include "xfs_icache.h"
>  #include "xfs_pnfs.h"
> +#include <linux/iversion.h>
>  
> 
>  /*
>   * Note that we only accept fileids which are long enough rather than allow
> @@ -234,4 +235,5 @@ const struct export_operations xfs_export_operations = {
>  	.map_blocks		= xfs_fs_map_blocks,
>  	.commit_blocks		= xfs_fs_commit_blocks,
>  #endif
> +	.fetch_iversion		= generic_fetch_iversion,
>  };
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..ded74523c8a6 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -328,6 +328,32 @@ inode_query_iversion(struct inode *inode)
>  	return cur >> I_VERSION_QUERIED_SHIFT;
>  }
>  
> 
> +/*
> + * We could use i_version alone as the NFSv4 change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + *
> + * A filesystem that has an on-disk boot counter or similar might prefer
> + * to use that to avoid the risk of the change attribute going backwards
> + * if system time is set backwards.
> + */
> +static inline u64 generic_fetch_iversion(struct inode *inode)
> +{
> +	u64 chattr;
> +
> +	chattr =  inode->i_ctime.tv_sec;
> +	chattr <<= 30;
> +	chattr += inode->i_ctime.tv_nsec;
> +	chattr += inode_query_iversion(inode);
> +	return chattr;
> +}
> +
>  /*
>   * For filesystems without any sort of change attribute, the best we can
>   * do is fake one up from the ctime:

One more nit: 

We probably don't want anyone using this on filesystems that don't set
SB_I_VERSION. It might be a good idea to add something like:

    WARN_ON_ONCE(!IS_I_VERSION(inode));

To this function, to catch anyone trying to do it.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-21  1:03                                                         ` Jeff Layton
@ 2020-11-21 21:44                                                           ` Daire Byrne
  2020-11-22  0:02                                                             ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-21 21:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: bfields, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

----- On 21 Nov, 2020, at 01:03, Jeff Layton jlayton@kernel.org wrote:
> On Fri, 2020-11-20 at 17:44 -0500, J. Bruce Fields wrote:
>> On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
>> > On Tue, Nov 17, 2020 at 10:34:57AM -0500, Jeff Layton wrote:
>> > > On Tue, 2020-11-17 at 10:26 -0500, J. Bruce Fields wrote:
>> > > > On Tue, Nov 17, 2020 at 07:34:49AM -0500, Jeff Layton wrote:
>> > > > > I don't think I described what I was thinking well. Let me try again...
>> > > > > 
>> > > > > There should be no need to change the code in iversion.h -- I think we
>> > > > > can do this in a way that's confined to just nfsd/export code.
>> > > > > 
>> > > > > What I would suggest is to have nfsd4_change_attribute call the
>> > > > > fetch_iversion op if it exists, instead of checking IS_I_VERSION and
>> > > > > doing the stuff in that block. If fetch_iversion is NULL, then just use
>> > > > > the ctime.
>> > > > > 
>> > > > > Then, you just need to make sure that the filesystems' export_ops have
>> > > > > an appropriate fetch_iversion vector. xfs, ext4 and btrfs can just call
>> > > > > inode_query_iversion, and NFS and Ceph can call inode_peek_iversion_raw.
>> > > > > The rest of the filesystems can leave fetch_iversion as NULL (since we
>> > > > > don't want to use it on them).
>> > > > 
>> > > > Thanks for your patience, that makes sense, I'll try it.
>> > > > 
>> > > 
>> > > There is one gotcha in here though... ext4 needs to also handle the case
>> > > where SB_I_VERSION is not set. The simple fix might be to just have
>> > > different export ops for ext4 based on whether it was mounted with -o
>> > > iversion or not, but maybe there is some better way to do it?
>> > 
>> > I was thinking ext4's export op could check for I_VERSION on its own and
>> > vary behavior based on that.
>> > 
>> > I'll follow up with new patches in a moment.
>> > 
>> > I think the first one's all that's needed to fix the problem Daire
>> > identified.  I'm a little less sure of the rest.

I can confirm that patch 1/8 alone does indeed address the reported revalidation issue for us (as did the previous patch). The re-export server's client cache seems to remain intact and can serve the same cached results to multiple clients.

>> > Lightly tested, just by running them through my usual regression tests
>> > (which don't re-export) and then running connectathon on a 4.2 re-export
>> > of a 4.2 mount.
>> > 
>> > The latter triggered a crash preceded by a KASAN use-after free warning.
>> > Looks like it might be a problem with blocking lock notifications,
>> > probably not related to these patches.
>> >
> The set looks pretty reasonable at first glance. Nice work.
> 
> Once you put this in, I'll plan to add a suitable fetch_iversion op for
> ceph too.
> 
>> Another nit I ran across:
>> 
>> Some NFSv4 directory-modifying operations return pre- and post- change
>> attributes together with an "atomic" flag that's supposed to indicate
>> whether the change attributes were read atomically with the operation.
>> It looks like we're setting the atomic flag under the assumptions that
>> local vfs locks are sufficient to guarantee atomicity, which isn't right
>> when we're exporting a distributed filesystem.
>> 
>> In the case we're reexporting NFS I guess ideal would be to use the pre-
>> and post- attributes that the original server returned and also save
>> having to do extra getattr calls.  Not sure how we'd do that,
>> though--more export operations?  Maybe for now we could just figure out
>> when to turn off the atomic bit.
> 
> Oh yeah, good point.
> 
> I'm not even sure that local locks are really enough -- IIRC, there are
> still some race windows between doing the metadata operations and the
> getattrs called to fill pre/post op attrs. Still, those windows are a
> lot larger on something like NFS, so setting the flag there is really
> stretching things.
> 
> One hacky fix might be to add a flags field to export_operations, and
> have one that indicates that the atomic flag shouldn't be set. Then we
> could add that flag to all of the netfs' (nfs, ceph, cifs), and anywhere
> else that we thought it appropriate?
> 
> That approach might be helpful later too since we're starting to see a
> wider variety of exportable filesystems these days. We may need more
> "quirk" flags like this.
> --
> Jeff Layton <jlayton@kernel.org>

I should also mention that I still see a lot of unexpected repeat lookups even with the iversion optimisation patches with certain workloads. For example, looking at a network capture on the re-export server I might see 100s of getattr calls to the originating server for the same filehandle within 30 seconds which I would have expected the client cache to serve. But it could also be that the client cache is under memory pressure and not holding that data for very long.

But now I do wonder if these NFSv4 directory modifications and pre/post change attributes could be one potential contributor? I might run some production loads with a v3 re-export of a v3 server to see if that changes anything.

Many thanks again for the patches, I will take the entire set and run them through our production re-export workloads to see if anything shakes out.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-21 21:44                                                           ` Daire Byrne
@ 2020-11-22  0:02                                                             ` bfields
  2020-11-22  1:55                                                               ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-22  0:02 UTC (permalink / raw)
  To: Daire Byrne
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

On Sat, Nov 21, 2020 at 09:44:29PM +0000, Daire Byrne wrote:
> ----- On 21 Nov, 2020, at 01:03, Jeff Layton jlayton@kernel.org wrote:
> > On Fri, 2020-11-20 at 17:44 -0500, J. Bruce Fields wrote:
> >> On Fri, Nov 20, 2020 at 05:38:31PM -0500, J. Bruce Fields wrote:
> >> > I think the first one's all that's needed to fix the problem Daire
> >> > identified.  I'm a little less sure of the rest.
> 
> I can confirm that patch 1/8 alone does indeed address the reported revalidation issue for us (as did the previous patch). The re-export server's client cache seems to remain intact and can serve the same cached results to multiple clients.

Thanks again for the testing.

> I should also mention that I still see a lot of unexpected repeat
> lookups even with the iversion optimisation patches with certain
> workloads. For example, looking at a network capture on the re-export
> server I might see 100s of getattr calls to the originating server for
> the same filehandle within 30 seconds which I would have expected the
> client cache to serve. But it could also be that the client cache is
> under memory pressure and not holding that data for very long.

That sounds weird.  Is the filehandle for a file or a directory?  Is the
file or directory actually changing at the time, and if so, is it the
client that's changing it?

Remind me what the setup is--a v3 re-export of a v4 mount?

--b.

> But now I do wonder if these NFSv4 directory modifications and
> pre/post change attributes could be one potential contributor? I might
> run some production loads with a v3 re-export of a v3 server to see if
> that changes anything.
> 
> Many thanks again for the patches, I will take the entire set and run
> them through our production re-export workloads to see if anything
> shakes out.
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-22  0:02                                                             ` bfields
@ 2020-11-22  1:55                                                               ` Daire Byrne
  2020-11-22  3:03                                                                 ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-22  1:55 UTC (permalink / raw)
  To: bfields
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs


----- On 22 Nov, 2020, at 00:02, bfields bfields@fieldses.org wrote:
>> I should also mention that I still see a lot of unexpected repeat
>> lookups even with the iversion optimisation patches with certain
>> workloads. For example, looking at a network capture on the re-export
>> server I might see 100s of getattr calls to the originating server for
>> the same filehandle within 30 seconds which I would have expected the
>> client cache to serve. But it could also be that the client cache is
>> under memory pressure and not holding that data for very long.
> 
> That sounds weird.  Is the filehandle for a file or a directory?  Is the
> file or directory actually changing at the time, and if so, is it the
> client that's changing it?
> 
> Remind me what the setup is--a v3 re-export of a v4 mount?

Maybe this discussion should go back into the "Advenvetures in re-exporting" thread? But to give a quick answer here anyway...

The workload I have been looking at recently is a NFSv3 re-export of a NFSv4.2 mount. I can also say that it is generally when new files are being written to a directory. So yes, the files and dir are changing at the time but I still didn't expect to see so many repeated getattr neatly bundled together in short bursts, e.g. (re-export server = 10.156.12.1, originating server 10.21.22.117).

54544  88.147927  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
54547  88.160469  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
54556  88.185592  10.156.12.1 → 10.21.22.117 NFS 330 V4 Call SETATTR FH: 0x4dbdfb01
54559  88.198350  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
54562  88.211670  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
54565  88.243251  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0x4dbdfb01/
54637  88.269587  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
55078  88.277138  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
57747  88.390197  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57748  88.390212  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57749  88.390215  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57750  88.390218  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57751  88.390220  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57752  88.390222  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57753  88.390231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57754  88.390261  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
57755  88.390292  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
57852  88.415541  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
57853  88.415551  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
58965  88.442004  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60201  88.486231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60615  88.505453  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60616  88.505473  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60617  88.505477  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60618  88.505480  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
60619  88.505482  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0

Often I only capture an open dh followed by a flurry of getattr:

 3068  24.603153  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0xb63a98ec/
 3089  24.641542  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3093  24.642172  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3140  24.719930  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3360  24.769423  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3376  24.771353  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3436  24.782817  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3569  24.798207  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3753  24.855233  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3777  24.856130  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3824  24.862919  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 3873  24.873890  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4001  24.898289  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4070  24.925970  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4127  24.940616  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4174  24.985160  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4343  25.007565  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4344  25.008343  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
 4358  25.036177  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec

The common workload is that we will have multiple clients of the re-export server all writing different (frame) files into the same directory at the same time. But on the re-export server it is ultimately 16 threads of nfsd making those calls to the originating server.

The re-export server's client should be the only one making most of the changes, although there are other NFSv3 clients of the originating servers that could conceivably be updating files too.

Like I said, it might be interesting to see if we see the same behaviour with a NFSv3 re-export of an NFSv3 server.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-22  1:55                                                               ` Daire Byrne
@ 2020-11-22  3:03                                                                 ` bfields
  2020-11-23 20:07                                                                   ` Daire Byrne
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-11-22  3:03 UTC (permalink / raw)
  To: Daire Byrne
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

On Sun, Nov 22, 2020 at 01:55:50AM +0000, Daire Byrne wrote:
> 
> ----- On 22 Nov, 2020, at 00:02, bfields bfields@fieldses.org wrote:
> >> I should also mention that I still see a lot of unexpected repeat
> >> lookups even with the iversion optimisation patches with certain
> >> workloads. For example, looking at a network capture on the re-export
> >> server I might see 100s of getattr calls to the originating server for
> >> the same filehandle within 30 seconds which I would have expected the
> >> client cache to serve. But it could also be that the client cache is
> >> under memory pressure and not holding that data for very long.
> > 
> > That sounds weird.  Is the filehandle for a file or a directory?  Is the
> > file or directory actually changing at the time, and if so, is it the
> > client that's changing it?
> > 
> > Remind me what the setup is--a v3 re-export of a v4 mount?
> 
> Maybe this discussion should go back into the "Advenvetures in re-exporting" thread? But to give a quick answer here anyway...
> 
> The workload I have been looking at recently is a NFSv3 re-export of a NFSv4.2 mount. I can also say that it is generally when new files are being written to a directory. So yes, the files and dir are changing at the time but I still didn't expect to see so many repeated getattr neatly bundled together in short bursts, e.g. (re-export server = 10.156.12.1, originating server 10.21.22.117).

Well, I guess the pre/post-op attributes could contribute to the
problem, in that they could unnecessarily turn a COMMIT into

	GETATTR
	COMMIT
	GETATTR

And ditto for anything that modifies file or directory contents.  But
I'd've thought some of those could have been cached.  Also it looks like
you've got more GETATTRs than that.  Hm.

--b.

> 
> 54544  88.147927  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
> 54547  88.160469  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
> 54556  88.185592  10.156.12.1 → 10.21.22.117 NFS 330 V4 Call SETATTR FH: 0x4dbdfb01
> 54559  88.198350  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call SETATTR FH: 0x4dbdfb01
> 54562  88.211670  10.156.12.1 → 10.21.22.117 NFS 326 V4 Call SETATTR FH: 0x4dbdfb01
> 54565  88.243251  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0x4dbdfb01/
> 54637  88.269587  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 55078  88.277138  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 57747  88.390197  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57748  88.390212  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57749  88.390215  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57750  88.390218  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57751  88.390220  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57752  88.390222  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57753  88.390231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57754  88.390261  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 57755  88.390292  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 57852  88.415541  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 57853  88.415551  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 58965  88.442004  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60201  88.486231  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60615  88.505453  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60616  88.505473  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60617  88.505477  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60618  88.505480  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0x4dbdfb01
> 60619  88.505482  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call COMMIT FH: 0x4dbdfb01 Offset: 0 Len: 0
> 
> Often I only capture an open dh followed by a flurry of getattr:
> 
>  3068  24.603153  10.156.12.1 → 10.21.22.117 NFS 350 V4 Call OPEN DH: 0xb63a98ec/
>  3089  24.641542  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3093  24.642172  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3140  24.719930  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3360  24.769423  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3376  24.771353  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3436  24.782817  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3569  24.798207  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3753  24.855233  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3777  24.856130  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3824  24.862919  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  3873  24.873890  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4001  24.898289  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4070  24.925970  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4127  24.940616  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4174  24.985160  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4343  25.007565  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4344  25.008343  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
>  4358  25.036177  10.156.12.1 → 10.21.22.117 NFS 282 V4 Call GETATTR FH: 0xb63a98ec
> 
> The common workload is that we will have multiple clients of the re-export server all writing different (frame) files into the same directory at the same time. But on the re-export server it is ultimately 16 threads of nfsd making those calls to the originating server.
> 
> The re-export server's client should be the only one making most of the changes, although there are other NFSv3 clients of the originating servers that could conceivably be updating files too.
> 
> Like I said, it might be interesting to see if we see the same behaviour with a NFSv3 re-export of an NFSv3 server.
> 
> Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute
  2020-11-22  3:03                                                                 ` bfields
@ 2020-11-23 20:07                                                                   ` Daire Byrne
  0 siblings, 0 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-23 20:07 UTC (permalink / raw)
  To: bfields
  Cc: Jeff Layton, J. Bruce Fields, Trond Myklebust, linux-cachefs, linux-nfs

----- On 22 Nov, 2020, at 03:03, bfields bfields@fieldses.org wrote:
>> The workload I have been looking at recently is a NFSv3 re-export of a NFSv4.2
>> mount. I can also say that it is generally when new files are being written to
>> a directory. So yes, the files and dir are changing at the time but I still
>> didn't expect to see so many repeated getattr neatly bundled together in short
>> bursts, e.g. (re-export server = 10.156.12.1, originating server 10.21.22.117).
> 
> Well, I guess the pre/post-op attributes could contribute to the
> problem, in that they could unnecessarily turn a COMMIT into
> 
>	GETATTR
>	COMMIT
>	GETATTR
> 
> And ditto for anything that modifies file or directory contents.  But
> I'd've thought some of those could have been cached.  Also it looks like
> you've got more GETATTRs than that.  Hm.

Yea, I definitely see those COMMITs surrounded by GETTATTRs with NFSv4.2... But as you say, I get way more repeat GETATTRs for the same filehandles.

I switched to a NFSv4.2 re-export of a NFSv3 server and saw the kind of thing - sometimes the wire would see 4-5 GETTATRs for the same FH in tight sequence with nothing in between. So then I started thinking.... how does nconnect work again? Because my re-export server is mounting the originating server with nconnect=16 and the flurries of repeat GETATTRs often contain a count in that ballpark.

I need to re-test without nconnect... Maybe that's how it's supposed to work and I'm just being over sensitive after this iversion issue.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-12 13:01             ` Daire Byrne
  2020-11-12 13:57               ` bfields
@ 2020-11-24 20:35               ` Daire Byrne
  2020-11-24 21:15                 ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-11-24 20:35 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

----- On 12 Nov, 2020, at 13:01, Daire Byrne daire@dneg.com wrote:
> 
> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's
> NFSv3 lookupp emulation patches, I can now revise my original list of issues
> that others will likely experience if they ever try to do this craziness:
> 
> 1) Don't re-export NFSv4.0 unless you set vfs_cache_presure=0 otherwise you will
> see random input/output errors on your clients when things are dropped out of
> the cache. In the end we gave up on using NFSv4.0 with our Netapps because the
> 7-mode implementation seemed a bit flakey with modern Linux clients (Linux
> NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with
> Trond's lookupp emulation patches instead.
> 
> 2) In order to better utilise the re-export server's client cache when
> re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to
> use the horrible inode_peek_iversion_raw hack to maintain good metadata
> performance for large numbers of clients. Otherwise each re-export server's
> clients can cause invalidation of the re-export server client cache. Once you
> have hundreds of clients they all combine to constantly invalidate the cache
> resulting in an order of magnitude slower metadata performance. If you are
> re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not
> required.
> 
> 3) For some reason, when a 1MB read call arrives at the re-export server from a
> client, it gets chopped up into 128k read calls that are issued back to the
> originating server despite rsize/wsize=1MB on all mounts. This results in a
> noticeable increase in rpc chatter for large reads. Writes on the other hand
> retain their 1MB size from client to re-export server and back to the
> originating server. I am using nconnect but I doubt that is related.
> 
> 4) After some random time, the cachefilesd userspace daemon stops culling old
> data from an fscache disk storage. I thought it was to do with setting
> vfs_cache_pressure=0 but even with it set to the default 100 it just randomly
> decides to stop culling and never comes back to life until restarted or
> rebooted. Perhaps the fscache/cachefilesd rewrite that David Howells & David
> Wysochanski have been working on will improve matters.
> 
> 5) It's still really hard to cache nfs client metadata for any definitive time
> (actimeo,nocto) due to the pagecache churn that reads cause. If all required
> metadata (i.e. directory contents) could either be locally cached to disk or
> the inode cache rather than pagecache then maybe we would have more control
> over the actual cache times we are comfortable with for our workloads. This has
> little to do with re-exporting and is just a general NFS performance over the
> WAN thing. I'm very interested to see how Trond's recent patches to improve
> readdir performance might at least help re-populate the dropped cached metadata
> more efficiently over the WAN.
> 
> I just want to finish with one more crazy thing we have been doing - a re-export
> server of a re-export server! Again, a locking and consistency nightmare so
> only possible for very specific workloads (like ours). The advantage of this
> topology is that you can pull all your data over the WAN once (e.g. on-premise
> to cloud) and then fan-out that data to multiple other NFS re-export servers in
> the cloud to improve the aggregate performance to many clients. This avoids
> having multiple re-export servers all needing to pull the same data across the
> WAN.

I will officially add another point to the wishlist that I mentioned in Bruce's recent patches thread (for dealing with the iversion change on NFS re-export). I had held off mentioning this one because I wasn't really sure if it was just a normal production workload and expected behaviour for NFS, but the more I look into it, the more it seems like maybe it could be optimised for the re-export case. But then I also might be too overly sensitive about metadata ops over the WAN at this point....

6) I see many fast repeating COMMITs & GETATTRs from the NFS re-export server to the originating server for the same file while writing through it from a client. If I do a write from userspace on the re-export server directly to the client mountpoint (i.e. no re-exporting) I do not see the GETATTRs or COMMITs.

I see something similar with both a re-export of a NFSv3 originating server and a re-export of a NFSv4.2 originating server (using either NFSv3 or NFSv4). Bruce mentioned an extra GETATTR in the NFSv4.2 re-export case for a COMMIT (pre/post).

For simplicity let's look at the NFSv3 re-export of an NFSv3 originating server. But first let's write a file from userspace directly on the re-export server back to the originating server mount point (ie no re-export):

    3   0.772902  V3 GETATTR Call, FH: 0x6791bc70
    6   0.781239  V3 SETATTR Call, FH: 0x6791bc70
 3286   0.919601  V3 WRITE Call, FH: 0x6791bc70 Offset: 1048576 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
 3494   0.921351  V3 WRITE Call, FH: 0x6791bc70 Offset: 8388608 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
...
...
48178   1.462670  V3 WRITE Call, FH: 0x6791bc70 Offset: 102760448 Len: 1048576 UNSTABLE
48210   1.472400  V3 COMMIT Call, FH: 0x6791bc70

So lots of uninterrupted 1MB write calls back to the originating server as expected with a final COMMIT (good). We can also set nconnect=16 back to the originating server and get the same trace but with the write packets going down different ports (also good).

Now let's do the same write through the re-export server from a client (NFSv4.2 or NFSv3, it doesn't matter much):

    7   0.034411  V3 SETATTR Call, FH: 0x364ced2c
  286   0.148066  V3 WRITE Call, FH: 0x364ced2c Offset: 0 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
  343   0.152644  V3 WRITE Call, FH: 0x364ced2c Offset: 1048576 Len: 196608 UNSTABLEV3 WRITE Call, FH: 0x364ced2c Offset: 1245184 Len: 8192 FILE_SYNC
  580   0.168159  V3 WRITE Call, FH: 0x364ced2c Offset: 1253376 Len: 843776 UNSTABLE
  671   0.174668  V3 COMMIT Call, FH: 0x364ced2c
 1105   0.193805  V3 COMMIT Call, FH: 0x364ced2c
 1123   0.201570  V3 WRITE Call, FH: 0x364ced2c Offset: 2097152 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
 1592   0.242259  V3 WRITE Call, FH: 0x364ced2c Offset: 3145728 Len: 1048576 UNSTABLE
...
...
54571   3.668028  V3 WRITE Call, FH: 0x364ced2c Offset: 102760448 Len: 1048576 FILE_SYNC [TCP segment of a reassembled PDU]
54940   3.713392  V3 WRITE Call, FH: 0x364ced2c Offset: 103809024 Len: 1048576 UNSTABLE
55706   3.733284  V3 COMMIT Call, FH: 0x364ced2c

So now we have lots of pairs of COMMIT calls inbetween the WRITE calls. We also see sporadic FILE_SYNC write calls which we don't when we just write direct to the originating server from userspace (all UNSTABLE).

Finally, if we add nconnect=16 when mounting the originating server (useful for increasing WAN throughput) and again write through from the client, we start to see lots of GETATTRs mixed with the WRITEs & COMMITs:

   84   0.075830  V3 SETATTR Call, FH: 0x0e9698e8
  608   0.201944  V3 WRITE Call, FH: 0x0e9698e8 Offset: 0 Len: 1048576 UNSTABLE
  857   0.218760  V3 COMMIT Call, FH: 0x0e9698e8
  968   0.231706  V3 WRITE Call, FH: 0x0e9698e8 Offset: 1048576 Len: 1048576 UNSTABLE
 1042   0.246934  V3 COMMIT Call, FH: 0x0e9698e8
...
...
43754   3.033689  V3 WRITE Call, FH: 0x0e9698e8 Offset: 100663296 Len: 1048576 UNSTABLE
44085   3.044767  V3 COMMIT Call, FH: 0x0e9698e8
44086   3.044959  V3 GETATTR Call, FH: 0x0e9698e8
44087   3.044964  V3 GETATTR Call, FH: 0x0e9698e8
44088   3.044983  V3 COMMIT Call, FH: 0x0e9698e8
44615   3.079491  V3 WRITE Call, FH: 0x0e9698e8 Offset: 102760448 Len: 1048576 UNSTABLE
44700   3.082909  V3 WRITE Call, FH: 0x0e9698e8 Offset: 103809024 Len: 1048576 UNSTABLE
44978   3.092010  V3 COMMIT Call, FH: 0x0e9698e8
44982   3.092943  V3 COMMIT Call, FH: 0x0e9698e8

Sometimes I have seen clusters of 16 GETATTRs for the same file on the wire with nothing else inbetween. So if the re-export server is the only "client" writing these files to the originating server, why do we need to do so many repeat GETATTR calls when using nconnect>1? And why are the COMMIT calls required when the writes are coming via nfsd but not from userspace on the re-export server? Is that due to some sort of memory pressure or locking?

I picked the NFSv3 originating server case because my head starts to hurt tracking the equivalent packets, stateids and compound calls with NFSv4. But I think it's mostly the same for NFSv4. The writes through the re-export server lead to lots of COMMITs and (double) GETATTRs but using nconnect>1 at least doesn't seem to make it any worse like it does for NFSv3.

But maybe you actually want all the extra COMMITs to help better guarantee your writes when putting a re-export server in the way? Perhaps all of this is by design...

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
@ 2020-11-24 21:15                 ` bfields
  2020-11-24 22:15                   ` Frank Filz
  2020-11-25 17:14                   ` Daire Byrne
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-11-24 21:15 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
> Sometimes I have seen clusters of 16 GETATTRs for the same file on the
> wire with nothing else inbetween. So if the re-export server is the
> only "client" writing these files to the originating server, why do we
> need to do so many repeat GETATTR calls when using nconnect>1? And why
> are the COMMIT calls required when the writes are coming via nfsd but
> not from userspace on the re-export server? Is that due to some sort
> of memory pressure or locking?
> 
> I picked the NFSv3 originating server case because my head starts to
> hurt tracking the equivalent packets, stateids and compound calls with
> NFSv4. But I think it's mostly the same for NFSv4. The writes through
> the re-export server lead to lots of COMMITs and (double) GETATTRs but
> using nconnect>1 at least doesn't seem to make it any worse like it
> does for NFSv3.
> 
> But maybe you actually want all the extra COMMITs to help better
> guarantee your writes when putting a re-export server in the way?
> Perhaps all of this is by design...

Maybe that's close-to-open combined with the server's tendency to
open/close on every IO operation?  (Though the file cache should have
helped with that, I thought; as would using version >=4.0 on the final
client.)

Might be interesting to know whether the nocto mount option makes a
difference.  (So, add "nocto" to the mount options for the NFS mount
that you're re-exporting on the re-export server.)

By the way I made a start at a list of issues at

	http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export

but I was a little vague on which of your issues remained and didn't
take much time over it.

(If you want an account on that wiki BTW I seem to recall you just have
to ask Trond (for anti-spam reasons).)

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-11-24 21:15                 ` bfields
@ 2020-11-24 22:15                   ` Frank Filz
  2020-11-25 14:47                     ` 'bfields'
  2020-11-25 17:14                   ` Daire Byrne
  1 sibling, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-11-24 22:15 UTC (permalink / raw)
  To: 'bfields', 'Daire Byrne'
  Cc: 'Trond Myklebust', 'linux-cachefs', 'linux-nfs'

> On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
> > Sometimes I have seen clusters of 16 GETATTRs for the same file on the
> > wire with nothing else inbetween. So if the re-export server is the
> > only "client" writing these files to the originating server, why do we
> > need to do so many repeat GETATTR calls when using nconnect>1? And why
> > are the COMMIT calls required when the writes are coming via nfsd but
> > not from userspace on the re-export server? Is that due to some sort
> > of memory pressure or locking?
> >
> > I picked the NFSv3 originating server case because my head starts to
> > hurt tracking the equivalent packets, stateids and compound calls with
> > NFSv4. But I think it's mostly the same for NFSv4. The writes through
> > the re-export server lead to lots of COMMITs and (double) GETATTRs but
> > using nconnect>1 at least doesn't seem to make it any worse like it
> > does for NFSv3.
> >
> > But maybe you actually want all the extra COMMITs to help better
> > guarantee your writes when putting a re-export server in the way?
> > Perhaps all of this is by design...
> 
> Maybe that's close-to-open combined with the server's tendency to
open/close
> on every IO operation?  (Though the file cache should have helped with
that, I
> thought; as would using version >=4.0 on the final
> client.)
> 
> Might be interesting to know whether the nocto mount option makes a
> difference.  (So, add "nocto" to the mount options for the NFS mount that
> you're re-exporting on the re-export server.)
> 
> By the way I made a start at a list of issues at
> 
> 	http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
> 
> but I was a little vague on which of your issues remained and didn't take
much
> time over it.
> 
> (If you want an account on that wiki BTW I seem to recall you just have to
ask
> Trond (for anti-spam reasons).)

How much conversation about re-export has been had at the wider NFS
community level? I have an interest because Ganesha  supports re-export via
the PROXY_V3 and PROXY_V4 FSALs. We currently don't have a data cache though
there has been discussion of such, we do have attribute and dirent caches.

Looking over the wiki page, I have considered being able to specify a
re-export of a Ganesha export without encapsulating handles. Ganesha
encapsulates the export_fs handle in a way that could be coordinated between
the original server and the re-export so they would both effectively have
the same encapsulation layer.

I'd love to see some re-export best practices shared among server
implementations, and also what we can do to improve things when two server
implementations are interoperating via re-export.

Frank


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-24 22:15                   ` Frank Filz
@ 2020-11-25 14:47                     ` 'bfields'
  2020-11-25 16:25                       ` Frank Filz
  0 siblings, 1 reply; 129+ messages in thread
From: 'bfields' @ 2020-11-25 14:47 UTC (permalink / raw)
  To: Frank Filz
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'

On Tue, Nov 24, 2020 at 02:15:57PM -0800, Frank Filz wrote:
> How much conversation about re-export has been had at the wider NFS
> community level? I have an interest because Ganesha  supports re-export via
> the PROXY_V3 and PROXY_V4 FSALs. We currently don't have a data cache though
> there has been discussion of such, we do have attribute and dirent caches.
> 
> Looking over the wiki page, I have considered being able to specify a
> re-export of a Ganesha export without encapsulating handles. Ganesha
> encapsulates the export_fs handle in a way that could be coordinated between
> the original server and the re-export so they would both effectively have
> the same encapsulation layer.

In the case the re-export server only servers a single export, I guess
you could do away with the encapsulation.  (The only risk I see is that
a client of the re-export server could also access any export of the
original server if it could guess filehandles, which might surprise
admins.)  Maybe that'd be useful.

Another advantage of not encapsulating filehandles is that clients could
more easily migrate between servers.

Cooperating servers could have an agreement on filehandles.  And I guess
we could standardize that somehow.  Are we ready for that?  I'm not sure
what other re-exporting problems there are that I haven't thought of.

--b.

> I'd love to see some re-export best practices shared among server
> implementations, and also what we can do to improve things when two server
> implementations are interoperating via re-export.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-11-25 14:47                     ` 'bfields'
@ 2020-11-25 16:25                       ` Frank Filz
  2020-11-25 19:03                         ` 'bfields'
  0 siblings, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-11-25 16:25 UTC (permalink / raw)
  To: 'bfields'
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'

> On Tue, Nov 24, 2020 at 02:15:57PM -0800, Frank Filz wrote:
> > How much conversation about re-export has been had at the wider NFS
> > community level? I have an interest because Ganesha  supports
> > re-export via the PROXY_V3 and PROXY_V4 FSALs. We currently don't have
> > a data cache though there has been discussion of such, we do have
attribute
> and dirent caches.
> >
> > Looking over the wiki page, I have considered being able to specify a
> > re-export of a Ganesha export without encapsulating handles. Ganesha
> > encapsulates the export_fs handle in a way that could be coordinated
> > between the original server and the re-export so they would both
> > effectively have the same encapsulation layer.
> 
> In the case the re-export server only servers a single export, I guess you
could do
> away with the encapsulation.  (The only risk I see is that a client of the
re-export
> server could also access any export of the original server if it could
guess
> filehandles, which might surprise
> admins.)  Maybe that'd be useful.

Ganesha handles have a minor downside that is a help here if Ganesha was
re-exporting another Ganesha server. There is a 16 bit export_id that comes
from the export configuration and is part of the handle. We could easily set
it up so that if the sysadmin configured it as such, each re-exported
Ganesha export would have the same export_id, and then a client handle for
export_id 1 would be mirrored to the original server as export_id 1 and the
two servers can have the same export permissions and everything.

There is some additional stuff we could easily implement in Ganesha to
restrict handle manipulation to sneak around export permissions.

> Another advantage of not encapsulating filehandles is that clients could
more
> easily migrate between servers.

Yea, with the idea I've been mulling for Ganesha, migration between original
server and re-export server would be simple with the same handles. Of course
state migration is a whole different ball of wax, but a clustered setup
could work just as well as Ganesha's clustered filesystems. On the other
hand, re-export with state has a pitfall. If the re-export server crashes,
the state is lost on the original server unless we make a protocol change to
allow state re-export such that a re-export server crashing doesn't cause
state loss. For this reason, I haven't rushed to implement lock state
re-export in Ganesha, rather allowing the re-export server to just manage
lock state locally.

> Cooperating servers could have an agreement on filehandles.  And I guess
we
> could standardize that somehow.  Are we ready for that?  I'm not sure what
> other re-exporting problems there are that I haven't thought of.

I'm not sure how far we want to go there, but potentially specific server
implementations could choose to be interoperable in a way that allows the
handle encapsulation to either be smaller or no extra overhead. For example,
if we implemented what I've thought about for Ganesha-Ganesha re-export,
Ganesha COULD also be "taught" which portion of the knfsd handle is the
filesystem/export identifier, and maintain a database of Ganesha
export/filesystem <-> knfsd export/filesystem and have Ganesha
re-encapsulate the exportfs/name_to_handle_at portion of the handle. Of
course in this case, trivial migration isn't possible since Ganesha will
have a different encapsulation than knfsd.

Incidentally, I also purposefully made Ganesha's encapsulation different so
it never collides with either version of knfsd handles (now if over the
course of the past 10 years another handle version has come along...).

Frank

> --b.
> 
> > I'd love to see some re-export best practices shared among server
> > implementations, and also what we can do to improve things when two
> > server implementations are interoperating via re-export.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-24 21:15                 ` bfields
  2020-11-24 22:15                   ` Frank Filz
@ 2020-11-25 17:14                   ` Daire Byrne
  2020-11-25 19:31                     ` bfields
  2020-12-03 12:20                     ` Daire Byrne
  1 sibling, 2 replies; 129+ messages in thread
From: Daire Byrne @ 2020-11-25 17:14 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 24 Nov, 2020, at 21:15, bfields bfields@fieldses.org wrote:
> On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
>> Sometimes I have seen clusters of 16 GETATTRs for the same file on the
>> wire with nothing else inbetween. So if the re-export server is the
>> only "client" writing these files to the originating server, why do we
>> need to do so many repeat GETATTR calls when using nconnect>1? And why
>> are the COMMIT calls required when the writes are coming via nfsd but
>> not from userspace on the re-export server? Is that due to some sort
>> of memory pressure or locking?
>> 
>> I picked the NFSv3 originating server case because my head starts to
>> hurt tracking the equivalent packets, stateids and compound calls with
>> NFSv4. But I think it's mostly the same for NFSv4. The writes through
>> the re-export server lead to lots of COMMITs and (double) GETATTRs but
>> using nconnect>1 at least doesn't seem to make it any worse like it
>> does for NFSv3.
>> 
>> But maybe you actually want all the extra COMMITs to help better
>> guarantee your writes when putting a re-export server in the way?
>> Perhaps all of this is by design...
> 
> Maybe that's close-to-open combined with the server's tendency to
> open/close on every IO operation?  (Though the file cache should have
> helped with that, I thought; as would using version >=4.0 on the final
> client.)
> 
> Might be interesting to know whether the nocto mount option makes a
> difference.  (So, add "nocto" to the mount options for the NFS mount
> that you're re-exporting on the re-export server.)

The nocto didn't really seem to help but the NFSv4.2 re-export of a NFSv3 server did. I also realised I had done some tests with nconnect on the re-export server's client and consequently mixed things up a bit in my head. So I did some more tests and tried to make the results clear and simple. In all cases I'm just writing a big file with "dd" and capturing the traffic between the originating server and re-export server.

First off, writing direct to the originating server mount on the re-export server from userspace shows the ideal behaviour for all combinations:

 originating server <- (vers=X,actimeo=1800,nconnect=X) <- reexport server writing = WRITE,WRITE .... repeating (good!)

Then re-exporting a NFSv4.2 server:

 originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = GETATTR,COMMIT,WRITE .... repeating
 originating server <- (vers=4.2) <- reexport server - (vers=4.2) <- client writing = GETATTR,WRITE .... repeating

And re-exporting a NFSv3 server:

 originating server <- (vers=3) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
 originating server <- (vers=3) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT .... repeating
  
So of all the combinations, a NFSv4.2 re-export of an NFSv3 server is the only one that matches the "ideal" case where we WRITE continuously without all the extra chatter.

And for completeness, taking that "good" case and making it bad with nconnect:

 originating server <- (vers=3,nconnect=16) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
 originating server <- (vers=3) <- reexport server <- (vers=4.2,nconnect=16) <- client writing = WRITE,COMMIT,GETATTR .... randomly repeating

So using nconnect on the re-export's client causes lots more metadata ops. There are reasons for doing that for increasing throughput but it could be that the gain is offset by the extra metadata roundtrips. 

Similarly, we have mostly been using a NFSv4.2 re-export of a NFSV4.2 server over the WAN because of reduced metadata ops for reading, but it looks like we incur extra metadata ops for writing.

Side note: it's hard to decode nconnect enabled packet captures because wireshark doesn't seem to like those extra port streams.

> By the way I made a start at a list of issues at
> 
>	http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
> 
> but I was a little vague on which of your issues remained and didn't
> take much time over it.

Cool. I'm glad there are some notes for others to reference - this thread is now too long for any human to read. The only things I'd consider adding are:

* re-export of NFSv4.0 filesystem can give input/output errors when the cache is dropped
* a weird interaction with nfs client readahead such that all reads are limited to the default 128k unless you manually increase it to match rsize.

The only other thing I can offer are tips & tricks for doing this kind of thing over the WAN (vfs_cache_pressure, actimeo, nocto) and using fscache.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-25 16:25                       ` Frank Filz
@ 2020-11-25 19:03                         ` 'bfields'
  2020-11-26  0:04                           ` Frank Filz
  0 siblings, 1 reply; 129+ messages in thread
From: 'bfields' @ 2020-11-25 19:03 UTC (permalink / raw)
  To: Frank Filz
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'

On Wed, Nov 25, 2020 at 08:25:19AM -0800, Frank Filz wrote:
> On the other
> hand, re-export with state has a pitfall. If the re-export server crashes,
> the state is lost on the original server unless we make a protocol change to
> allow state re-export such that a re-export server crashing doesn't cause
> state loss.

Oh, yes, reboot recovery's an interesting problem that I'd forgotten
about; added to that wiki page.

By "state re-export" you mean you'd take the stateids the original
server returned to you, and return them to your own clients?  So then
I guess you wouldn't need much state at all.

> For this reason, I haven't rushed to implement lock state
> re-export in Ganesha, rather allowing the re-export server to just manage
> lock state locally.
> 
> > Cooperating servers could have an agreement on filehandles.  And I guess
> we
> > could standardize that somehow.  Are we ready for that?  I'm not sure what
> > other re-exporting problems there are that I haven't thought of.
> 
> I'm not sure how far we want to go there, but potentially specific server
> implementations could choose to be interoperable in a way that allows the
> handle encapsulation to either be smaller or no extra overhead. For example,
> if we implemented what I've thought about for Ganesha-Ganesha re-export,
> Ganesha COULD also be "taught" which portion of the knfsd handle is the
> filesystem/export identifier, and maintain a database of Ganesha
> export/filesystem <-> knfsd export/filesystem and have Ganesha
> re-encapsulate the exportfs/name_to_handle_at portion of the handle. Of
> course in this case, trivial migration isn't possible since Ganesha will
> have a different encapsulation than knfsd.
> 
> Incidentally, I also purposefully made Ganesha's encapsulation different so
> it never collides with either version of knfsd handles (now if over the
> course of the past 10 years another handle version has come along...).

I don't think anything's changed there.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-25 17:14                   ` Daire Byrne
@ 2020-11-25 19:31                     ` bfields
  2020-12-03 12:20                     ` Daire Byrne
  1 sibling, 0 replies; 129+ messages in thread
From: bfields @ 2020-11-25 19:31 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Wed, Nov 25, 2020 at 05:14:51PM +0000, Daire Byrne wrote:
> Cool. I'm glad there are some notes for others to reference - this
> thread is now too long for any human to read. The only things I'd
> consider adding are:

Thanks, done.

> * re-export of NFSv4.0 filesystem can give input/output errors when the cache is dropped

Looking back at that thread....  I suspect that's just unfixable, so all
you can do is either use v4.1+ on the original server or 4.0+ on the
edge clients.  Or I wonder if it would help if there was some way to
tell the 4.0 client just to try special stateids instead of attempting
an open?

> * a weird interaction with nfs client readahead such that all reads
> are limited to the default 128k unless you manually increase it to
> match rsize.
>
> The only other thing I can offer are tips & tricks for doing this kind
> of thing over the WAN (vfs_cache_pressure, actimeo, nocto) and using
> fscache.

OK, I haven't tried to pick that out of the thread yet.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-11-25 19:03                         ` 'bfields'
@ 2020-11-26  0:04                           ` Frank Filz
  0 siblings, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-11-26  0:04 UTC (permalink / raw)
  To: 'bfields'
  Cc: 'Daire Byrne', 'Trond Myklebust',
	'linux-cachefs', 'linux-nfs'



> -----Original Message-----
> From: 'bfields' [mailto:bfields@fieldses.org]
> Sent: Wednesday, November 25, 2020 11:03 AM
> To: Frank Filz <ffilzlnx@mindspring.com>
> Cc: 'Daire Byrne' <daire@dneg.com>; 'Trond Myklebust'
> <trondmy@hammerspace.com>; 'linux-cachefs' <linux-cachefs@redhat.com>;
> 'linux-nfs' <linux-nfs@vger.kernel.org>
> Subject: Re: Adventures in NFS re-exporting
> 
> On Wed, Nov 25, 2020 at 08:25:19AM -0800, Frank Filz wrote:
> > On the other
> > hand, re-export with state has a pitfall. If the re-export server
> > crashes, the state is lost on the original server unless we make a
> > protocol change to allow state re-export such that a re-export server
> > crashing doesn't cause state loss.
> 
> Oh, yes, reboot recovery's an interesting problem that I'd forgotten
about;
> added to that wiki page.
>
> By "state re-export" you mean you'd take the stateids the original server
> returned to you, and return them to your own clients?  So then I guess you
> wouldn't need much state at all.

By state re-export I meant reflecting locks the end client takes on the
re-export server to the original server. Not necessarily by reflecting the
stateid (probably something to trip on there...) (Can we nail down a good
name for it? Proxy server or re-export server work well for the man in the
middle, but what about the back end server?)

Frank

> > For this reason, I haven't rushed to implement lock state re-export in
> > Ganesha, rather allowing the re-export server to just manage lock
> > state locally.
> >
> > > Cooperating servers could have an agreement on filehandles.  And I
> > > guess
> > we
> > > could standardize that somehow.  Are we ready for that?  I'm not
> > > sure what other re-exporting problems there are that I haven't thought
of.
> >
> > I'm not sure how far we want to go there, but potentially specific
> > server implementations could choose to be interoperable in a way that
> > allows the handle encapsulation to either be smaller or no extra
> > overhead. For example, if we implemented what I've thought about for
> > Ganesha-Ganesha re-export, Ganesha COULD also be "taught" which
> > portion of the knfsd handle is the filesystem/export identifier, and
> > maintain a database of Ganesha export/filesystem <-> knfsd
> > export/filesystem and have Ganesha re-encapsulate the
> > exportfs/name_to_handle_at portion of the handle. Of course in this
> > case, trivial migration isn't possible since Ganesha will have a
different
> encapsulation than knfsd.
> >
> > Incidentally, I also purposefully made Ganesha's encapsulation
> > different so it never collides with either version of knfsd handles
> > (now if over the course of the past 10 years another handle version has
come
> along...).
> 
> I don't think anything's changed there.
> 
> --b.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-11-25 17:14                   ` Daire Byrne
  2020-11-25 19:31                     ` bfields
@ 2020-12-03 12:20                     ` Daire Byrne
  2020-12-03 18:51                       ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Daire Byrne @ 2020-12-03 12:20 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, linux-cachefs, linux-nfs


----- On 25 Nov, 2020, at 17:14, Daire Byrne daire@dneg.com wrote:
> First off, writing direct to the originating server mount on the re-export
> server from userspace shows the ideal behaviour for all combinations:
> 
> originating server <- (vers=X,actimeo=1800,nconnect=X) <- reexport server
> writing = WRITE,WRITE .... repeating (good!)
> 
> Then re-exporting a NFSv4.2 server:
> 
> originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing
> = GETATTR,COMMIT,WRITE .... repeating
> originating server <- (vers=4.2) <- reexport server - (vers=4.2) <- client
> writing = GETATTR,WRITE .... repeating
> 
> And re-exporting a NFSv3 server:
> 
> originating server <- (vers=3) <- reexport server - (vers=4.2) <- client writing
> = WRITE,WRITE .... repeating (good!)
> originating server <- (vers=3) <- reexport server - (vers=3) <- client writing =
> WRITE,COMMIT .... repeating
>  
> So of all the combinations, a NFSv4.2 re-export of an NFSv3 server is the only
> one that matches the "ideal" case where we WRITE continuously without all the
> extra chatter.
> 
> And for completeness, taking that "good" case and making it bad with nconnect:
> 
> originating server <- (vers=3,nconnect=16) <- reexport server - (vers=4.2) <-
> client writing = WRITE,WRITE .... repeating (good!)
> originating server <- (vers=3) <- reexport server <- (vers=4.2,nconnect=16) <-
> client writing = WRITE,COMMIT,GETATTR .... randomly repeating
> 
> So using nconnect on the re-export's client causes lots more metadata ops. There
> are reasons for doing that for increasing throughput but it could be that the
> gain is offset by the extra metadata roundtrips.
> 
> Similarly, we have mostly been using a NFSv4.2 re-export of a NFSV4.2 server
> over the WAN because of reduced metadata ops for reading, but it looks like we
> incur extra metadata ops for writing.

Just a small update based on the most recent patchsets from Trond & Bruce:

https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
https://patchwork.kernel.org/project/linux-nfs/list/?series=393561

For the write-through tests, the NFSv3 re-export of a NFSv4.2 server has trimmed an extra GETATTR:

Before:
originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT,GETATTR .... repeating
 
After:
originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT .... repeating

I'm assuming this is specifically due to the "EXPORT_OP_NOWCC" patch? All other combinations look the same as before (for write-through). An NFSv4.2 re-export of a NFSv3 server is still the best/ideal in terms of not incurring extra metadata roundtrips when writing.

It's great to see this re-export scenario becoming a better supported (and performing) topology; many thanks all.

Daire

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 12:20                     ` Daire Byrne
@ 2020-12-03 18:51                       ` bfields
  2020-12-03 20:27                         ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-03 18:51 UTC (permalink / raw)
  To: Daire Byrne; +Cc: Trond Myklebust, linux-cachefs, linux-nfs

On Thu, Dec 03, 2020 at 12:20:35PM +0000, Daire Byrne wrote:
> Just a small update based on the most recent patchsets from Trond &
> Bruce:
> 
> https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
> https://patchwork.kernel.org/project/linux-nfs/list/?series=393561
> 
> For the write-through tests, the NFSv3 re-export of a NFSv4.2 server
> has trimmed an extra GETATTR:
> 
> Before: originating server <- (vers=4.2) <- reexport server - (vers=3)
> <- client writing = WRITE,COMMIT,GETATTR .... repeating
>  
> After: originating server <- (vers=4.2) <- reexport server - (vers=3)
> <- client writing = WRITE,COMMIT .... repeating
> 
> I'm assuming this is specifically due to the "EXPORT_OP_NOWCC" patch?

Probably so, thanks for the update.

> All other combinations look the same as before (for write-through). An
> NFSv4.2 re-export of a NFSv3 server is still the best/ideal in terms
> of not incurring extra metadata roundtrips when writing.
> 
> It's great to see this re-export scenario becoming a better supported
> (and performing) topology; many thanks all.

I've been scratching my head over how to handle reboot of a re-exporting
server.  I think one way to fix it might be just to allow the re-export
server to pass along reclaims to the original server as it receives them
from its own clients.  It might require some protocol tweaks, I'm not
sure.  I'll try to get my thoughts in order and propose something.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 18:51                       ` bfields
@ 2020-12-03 20:27                         ` Trond Myklebust
  2020-12-03 21:13                           ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 20:27 UTC (permalink / raw)
  To: bfields, daire; +Cc: linux-cachefs, linux-nfs

On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> On Thu, Dec 03, 2020 at 12:20:35PM +0000, Daire Byrne wrote:
> > Just a small update based on the most recent patchsets from Trond &
> > Bruce:
> > 
> > https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
> > https://patchwork.kernel.org/project/linux-nfs/list/?series=393561
> > 
> > For the write-through tests, the NFSv3 re-export of a NFSv4.2
> > server
> > has trimmed an extra GETATTR:
> > 
> > Before: originating server <- (vers=4.2) <- reexport server -
> > (vers=3)
> > <- client writing = WRITE,COMMIT,GETATTR .... repeating
> >  
> > After: originating server <- (vers=4.2) <- reexport server -
> > (vers=3)
> > <- client writing = WRITE,COMMIT .... repeating
> > 
> > I'm assuming this is specifically due to the "EXPORT_OP_NOWCC"
> > patch?
> 
> Probably so, thanks for the update.
> 
> > All other combinations look the same as before (for write-through).
> > An
> > NFSv4.2 re-export of a NFSv3 server is still the best/ideal in
> > terms
> > of not incurring extra metadata roundtrips when writing.
> > 
> > It's great to see this re-export scenario becoming a better
> > supported
> > (and performing) topology; many thanks all.
> 
> I've been scratching my head over how to handle reboot of a re-
> exporting
> server.  I think one way to fix it might be just to allow the re-
> export
> server to pass along reclaims to the original server as it receives
> them
> from its own clients.  It might require some protocol tweaks, I'm not
> sure.  I'll try to get my thoughts in order and propose something.
> 

It's more complicated than that. If the re-exporting server reboots,
but the original server does not, then unless that re-exporting server
persisted its lease and a full set of stateids somewhere, it will not
be able to atomically reclaim delegation and lock state on the server
on behalf of its clients.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 20:27                         ` Trond Myklebust
@ 2020-12-03 21:13                           ` bfields
  2020-12-03 21:32                             ` Frank Filz
  2020-12-03 21:34                             ` Trond Myklebust
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-12-03 21:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: daire, linux-cachefs, linux-nfs

On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > I've been scratching my head over how to handle reboot of a re-
> > exporting
> > server.  I think one way to fix it might be just to allow the re-
> > export
> > server to pass along reclaims to the original server as it receives
> > them
> > from its own clients.  It might require some protocol tweaks, I'm not
> > sure.  I'll try to get my thoughts in order and propose something.
> > 
> 
> It's more complicated than that. If the re-exporting server reboots,
> but the original server does not, then unless that re-exporting server
> persisted its lease and a full set of stateids somewhere, it will not
> be able to atomically reclaim delegation and lock state on the server
> on behalf of its clients.

By sending reclaims to the original server, I mean literally sending new
open and lock requests with the RECLAIM bit set, which would get brand
new stateids.

So, the original server would invalidate the existing client's previous
clientid and stateids--just as it normally would on reboot--but it would
optionally remember the underlying locks held by the client and allow
compatible lock reclaims.

Rough attempt:

	https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers

Think it would fly?

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-12-03 21:13                           ` bfields
@ 2020-12-03 21:32                             ` Frank Filz
  2020-12-03 21:34                             ` Trond Myklebust
  1 sibling, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-12-03 21:32 UTC (permalink / raw)
  To: bfields, 'Trond Myklebust'
  Cc: daire, linux-cachefs, linux-nfs, Jeff Layton, 'Solomon Boulos'

> On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > I've been scratching my head over how to handle reboot of a re-
> > > exporting server.  I think one way to fix it might be just to allow
> > > the re- export server to pass along reclaims to the original server
> > > as it receives them from its own clients.  It might require some
> > > protocol tweaks, I'm not sure.  I'll try to get my thoughts in order
> > > and propose something.
> > >
> >
> > It's more complicated than that. If the re-exporting server reboots,
> > but the original server does not, then unless that re-exporting server
> > persisted its lease and a full set of stateids somewhere, it will not
> > be able to atomically reclaim delegation and lock state on the server
> > on behalf of its clients.
> 
> By sending reclaims to the original server, I mean literally sending new
> open and lock requests with the RECLAIM bit set, which would get brand
> new stateids.
> 
> So, the original server would invalidate the existing client's previous
> clientid and stateids--just as it normally would on reboot--but it would
> optionally remember the underlying locks held by the client and allow
> compatible lock reclaims.
> 
> Rough attempt:
> 
> 	https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-
> export_servers
> 
> Think it would fly?

At a quick read through, that sounds good. I'm sure there's some bits and bobs we need to fix up.

I'm cc:ing Jeff Layton because what the original server needs to do looks a bit like what he implemented in CephFS to allow HA restarts of nfs-ganesha instances.

Maybe we should take this to the IETF mailing list? I'm certainly interested in discussion on what we could do in the protocol to facilitate this from nfs-ganesha perspective.

Frank




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 21:13                           ` bfields
  2020-12-03 21:32                             ` Frank Filz
@ 2020-12-03 21:34                             ` Trond Myklebust
  2020-12-03 21:45                               ` Frank Filz
                                                 ` (2 more replies)
  1 sibling, 3 replies; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 21:34 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > I've been scratching my head over how to handle reboot of a re-
> > > exporting
> > > server.  I think one way to fix it might be just to allow the re-
> > > export
> > > server to pass along reclaims to the original server as it
> > > receives
> > > them
> > > from its own clients.  It might require some protocol tweaks, I'm
> > > not
> > > sure.  I'll try to get my thoughts in order and propose
> > > something.
> > > 
> > 
> > It's more complicated than that. If the re-exporting server
> > reboots,
> > but the original server does not, then unless that re-exporting
> > server
> > persisted its lease and a full set of stateids somewhere, it will
> > not
> > be able to atomically reclaim delegation and lock state on the
> > server
> > on behalf of its clients.
> 
> By sending reclaims to the original server, I mean literally sending
> new
> open and lock requests with the RECLAIM bit set, which would get
> brand
> new stateids.
> 
> So, the original server would invalidate the existing client's
> previous
> clientid and stateids--just as it normally would on reboot--but it
> would
> optionally remember the underlying locks held by the client and allow
> compatible lock reclaims.
> 
> Rough attempt:
> 
>         https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> 
> Think it would fly?

So this would be a variant of courtesy locks that can be reclaimed by
the client using the reboot reclaim variant of OPEN/LOCK outside the
grace period? The purpose being to allow reclaim without forcing the
client to persist the original stateid?

Hmm... That's doable, but how about the following alternative: Add a
function that allows the client to request the full list of stateids
that the server holds on its behalf?

I've been wanting such a function for quite a while anyway in order to
allow the client to detect state leaks (either due to soft timeouts, or
due to reordered close/open operations).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-12-03 21:34                             ` Trond Myklebust
@ 2020-12-03 21:45                               ` Frank Filz
  2020-12-03 21:57                                 ` Trond Myklebust
  2020-12-03 21:54                               ` bfields
  2020-12-03 22:45                               ` bfields
  2 siblings, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-12-03 21:45 UTC (permalink / raw)
  To: 'Trond Myklebust', bfields; +Cc: linux-cachefs, linux-nfs, daire

> On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > I've been scratching my head over how to handle reboot of a re-
> > > > exporting server.  I think one way to fix it might be just to
> > > > allow the re- export server to pass along reclaims to the original
> > > > server as it receives them from its own clients.  It might require
> > > > some protocol tweaks, I'm not sure.  I'll try to get my thoughts
> > > > in order and propose something.
> > > >
> > >
> > > It's more complicated than that. If the re-exporting server reboots,
> > > but the original server does not, then unless that re-exporting
> > > server persisted its lease and a full set of stateids somewhere, it
> > > will not be able to atomically reclaim delegation and lock state on
> > > the server on behalf of its clients.
> >
> > By sending reclaims to the original server, I mean literally sending
> > new open and lock requests with the RECLAIM bit set, which would get
> > brand new stateids.
> >
> > So, the original server would invalidate the existing client's
> > previous clientid and stateids--just as it normally would on
> > reboot--but it would optionally remember the underlying locks held by
> > the client and allow compatible lock reclaims.
> >
> > Rough attempt:
> >
> >
> > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > t_servers
> >
> > Think it would fly?
> 
> So this would be a variant of courtesy locks that can be reclaimed by the client
> using the reboot reclaim variant of OPEN/LOCK outside the grace period? The
> purpose being to allow reclaim without forcing the client to persist the original
> stateid?
> 
> Hmm... That's doable, but how about the following alternative: Add a function
> that allows the client to request the full list of stateids that the server holds on
> its behalf?
> 
> I've been wanting such a function for quite a while anyway in order to allow the
> client to detect state leaks (either due to soft timeouts, or due to reordered
> close/open operations).

Oh, that sounds interesting. So basically the re-export server would re-populate it's state from the original server rather than relying on it's clients doing reclaims? Hmm, but how does the re-export server rebuild its stateids? I guess it could make the clients repopulate them with the same "give me a dump of all my state", using the state details to match up with the old state and replacing stateids. Or did you have something different in mind?

Frank


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 21:34                             ` Trond Myklebust
  2020-12-03 21:45                               ` Frank Filz
@ 2020-12-03 21:54                               ` bfields
  2020-12-03 22:45                               ` bfields
  2 siblings, 0 replies; 129+ messages in thread
From: bfields @ 2020-12-03 21:54 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > I've been scratching my head over how to handle reboot of a re-
> > > > exporting
> > > > server.  I think one way to fix it might be just to allow the re-
> > > > export
> > > > server to pass along reclaims to the original server as it
> > > > receives
> > > > them
> > > > from its own clients.  It might require some protocol tweaks, I'm
> > > > not
> > > > sure.  I'll try to get my thoughts in order and propose
> > > > something.
> > > > 
> > > 
> > > It's more complicated than that. If the re-exporting server
> > > reboots,
> > > but the original server does not, then unless that re-exporting
> > > server
> > > persisted its lease and a full set of stateids somewhere, it will
> > > not
> > > be able to atomically reclaim delegation and lock state on the
> > > server
> > > on behalf of its clients.
> > 
> > By sending reclaims to the original server, I mean literally sending
> > new
> > open and lock requests with the RECLAIM bit set, which would get
> > brand
> > new stateids.
> > 
> > So, the original server would invalidate the existing client's
> > previous
> > clientid and stateids--just as it normally would on reboot--but it
> > would
> > optionally remember the underlying locks held by the client and allow
> > compatible lock reclaims.
> > 
> > Rough attempt:
> > 
> >         https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > 
> > Think it would fly?
> 
> So this would be a variant of courtesy locks that can be reclaimed by
> the client using the reboot reclaim variant of OPEN/LOCK outside the
> grace period? The purpose being to allow reclaim without forcing the
> client to persist the original stateid?

Right.

> Hmm... That's doable,

Keep mulling it over and let me know if you see something that doesn't
work.

> but how about the following alternative: Add a
> function that allows the client to request the full list of stateids
> that the server holds on its behalf?

So, on the re-export server:

The client comes back up knowing nothing, so it requests that list of
stateids.  A reclaim comes in from an end client.  The client looks
through its list for a stateid that matches that reclaim somehow.  So, I
guess the list of stateids also has to include filehandles and access
bits and lock ranges and such, so the client can pick an appropriate
stateid to use?

> I've been wanting such a function for quite a while anyway in order to
> allow the client to detect state leaks (either due to soft timeouts, or
> due to reordered close/open operations).

Yipes, I hadn't realized that was possible.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 21:45                               ` Frank Filz
@ 2020-12-03 21:57                                 ` Trond Myklebust
  2020-12-03 22:04                                   ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 21:57 UTC (permalink / raw)
  To: bfields, ffilzlnx; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > I've been scratching my head over how to handle reboot of a
> > > > > re-
> > > > > exporting server.  I think one way to fix it might be just to
> > > > > allow the re- export server to pass along reclaims to the
> > > > > original
> > > > > server as it receives them from its own clients.  It might
> > > > > require
> > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > thoughts
> > > > > in order and propose something.
> > > > > 
> > > > 
> > > > It's more complicated than that. If the re-exporting server
> > > > reboots,
> > > > but the original server does not, then unless that re-exporting
> > > > server persisted its lease and a full set of stateids
> > > > somewhere, it
> > > > will not be able to atomically reclaim delegation and lock
> > > > state on
> > > > the server on behalf of its clients.
> > > 
> > > By sending reclaims to the original server, I mean literally
> > > sending
> > > new open and lock requests with the RECLAIM bit set, which would
> > > get
> > > brand new stateids.
> > > 
> > > So, the original server would invalidate the existing client's
> > > previous clientid and stateids--just as it normally would on
> > > reboot--but it would optionally remember the underlying locks
> > > held by
> > > the client and allow compatible lock reclaims.
> > > 
> > > Rough attempt:
> > > 
> > > 
> > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > > t_servers
> > > 
> > > Think it would fly?
> > 
> > So this would be a variant of courtesy locks that can be reclaimed
> > by the client
> > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > period? The
> > purpose being to allow reclaim without forcing the client to
> > persist the original
> > stateid?
> > 
> > Hmm... That's doable, but how about the following alternative: Add
> > a function
> > that allows the client to request the full list of stateids that
> > the server holds on
> > its behalf?
> > 
> > I've been wanting such a function for quite a while anyway in order
> > to allow the
> > client to detect state leaks (either due to soft timeouts, or due
> > to reordered
> > close/open operations).
> 
> Oh, that sounds interesting. So basically the re-export server would
> re-populate it's state from the original server rather than relying
> on it's clients doing reclaims? Hmm, but how does the re-export
> server rebuild its stateids? I guess it could make the clients
> repopulate them with the same "give me a dump of all my state", using
> the state details to match up with the old state and replacing
> stateids. Or did you have something different in mind?
> 

I was thinking that the re-export server could just use that list of
stateids to figure out which locks can be reclaimed atomically, and
which ones have been irredeemably lost. The assumption is that if you
have a lock stateid or a delegation, then that means the clients can
reclaim all the locks that were represented by that stateid.

I suppose the client would also need to know the lockowner for the
stateid, but presumably that information could also be returned by the
server?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 21:57                                 ` Trond Myklebust
@ 2020-12-03 22:04                                   ` bfields
  2020-12-03 22:14                                     ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-03 22:04 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: ffilzlnx, linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > I've been scratching my head over how to handle reboot of a
> > > > > > re-
> > > > > > exporting server.  I think one way to fix it might be just to
> > > > > > allow the re- export server to pass along reclaims to the
> > > > > > original
> > > > > > server as it receives them from its own clients.  It might
> > > > > > require
> > > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > > thoughts
> > > > > > in order and propose something.
> > > > > > 
> > > > > 
> > > > > It's more complicated than that. If the re-exporting server
> > > > > reboots,
> > > > > but the original server does not, then unless that re-exporting
> > > > > server persisted its lease and a full set of stateids
> > > > > somewhere, it
> > > > > will not be able to atomically reclaim delegation and lock
> > > > > state on
> > > > > the server on behalf of its clients.
> > > > 
> > > > By sending reclaims to the original server, I mean literally
> > > > sending
> > > > new open and lock requests with the RECLAIM bit set, which would
> > > > get
> > > > brand new stateids.
> > > > 
> > > > So, the original server would invalidate the existing client's
> > > > previous clientid and stateids--just as it normally would on
> > > > reboot--but it would optionally remember the underlying locks
> > > > held by
> > > > the client and allow compatible lock reclaims.
> > > > 
> > > > Rough attempt:
> > > > 
> > > > 
> > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > > > t_servers
> > > > 
> > > > Think it would fly?
> > > 
> > > So this would be a variant of courtesy locks that can be reclaimed
> > > by the client
> > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > period? The
> > > purpose being to allow reclaim without forcing the client to
> > > persist the original
> > > stateid?
> > > 
> > > Hmm... That's doable, but how about the following alternative: Add
> > > a function
> > > that allows the client to request the full list of stateids that
> > > the server holds on
> > > its behalf?
> > > 
> > > I've been wanting such a function for quite a while anyway in order
> > > to allow the
> > > client to detect state leaks (either due to soft timeouts, or due
> > > to reordered
> > > close/open operations).
> > 
> > Oh, that sounds interesting. So basically the re-export server would
> > re-populate it's state from the original server rather than relying
> > on it's clients doing reclaims? Hmm, but how does the re-export
> > server rebuild its stateids? I guess it could make the clients
> > repopulate them with the same "give me a dump of all my state", using
> > the state details to match up with the old state and replacing
> > stateids. Or did you have something different in mind?
> > 
> 
> I was thinking that the re-export server could just use that list of
> stateids to figure out which locks can be reclaimed atomically, and
> which ones have been irredeemably lost. The assumption is that if you
> have a lock stateid or a delegation, then that means the clients can
> reclaim all the locks that were represented by that stateid.

I'm confused about how the re-export server uses that list.  Are you
assuming it persisted its own list across its own crash/reboot?  I guess
that's what I was trying to avoid having to do.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 22:04                                   ` bfields
@ 2020-12-03 22:14                                     ` Trond Myklebust
  2020-12-03 22:39                                       ` Frank Filz
  2020-12-03 22:44                                       ` bfields
  0 siblings, 2 replies; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 22:14 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, ffilzlnx, linux-nfs, daire

On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > a
> > > > > > > re-
> > > > > > > exporting server.  I think one way to fix it might be
> > > > > > > just to
> > > > > > > allow the re- export server to pass along reclaims to the
> > > > > > > original
> > > > > > > server as it receives them from its own clients.  It
> > > > > > > might
> > > > > > > require
> > > > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > > > thoughts
> > > > > > > in order and propose something.
> > > > > > > 
> > > > > > 
> > > > > > It's more complicated than that. If the re-exporting server
> > > > > > reboots,
> > > > > > but the original server does not, then unless that re-
> > > > > > exporting
> > > > > > server persisted its lease and a full set of stateids
> > > > > > somewhere, it
> > > > > > will not be able to atomically reclaim delegation and lock
> > > > > > state on
> > > > > > the server on behalf of its clients.
> > > > > 
> > > > > By sending reclaims to the original server, I mean literally
> > > > > sending
> > > > > new open and lock requests with the RECLAIM bit set, which
> > > > > would
> > > > > get
> > > > > brand new stateids.
> > > > > 
> > > > > So, the original server would invalidate the existing
> > > > > client's
> > > > > previous clientid and stateids--just as it normally would on
> > > > > reboot--but it would optionally remember the underlying locks
> > > > > held by
> > > > > the client and allow compatible lock reclaims.
> > > > > 
> > > > > Rough attempt:
> > > > > 
> > > > > 
> > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > > > > t_servers
> > > > > 
> > > > > Think it would fly?
> > > > 
> > > > So this would be a variant of courtesy locks that can be
> > > > reclaimed
> > > > by the client
> > > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > > period? The
> > > > purpose being to allow reclaim without forcing the client to
> > > > persist the original
> > > > stateid?
> > > > 
> > > > Hmm... That's doable, but how about the following alternative:
> > > > Add
> > > > a function
> > > > that allows the client to request the full list of stateids
> > > > that
> > > > the server holds on
> > > > its behalf?
> > > > 
> > > > I've been wanting such a function for quite a while anyway in
> > > > order
> > > > to allow the
> > > > client to detect state leaks (either due to soft timeouts, or
> > > > due
> > > > to reordered
> > > > close/open operations).
> > > 
> > > Oh, that sounds interesting. So basically the re-export server
> > > would
> > > re-populate it's state from the original server rather than
> > > relying
> > > on it's clients doing reclaims? Hmm, but how does the re-export
> > > server rebuild its stateids? I guess it could make the clients
> > > repopulate them with the same "give me a dump of all my state",
> > > using
> > > the state details to match up with the old state and replacing
> > > stateids. Or did you have something different in mind?
> > > 
> > 
> > I was thinking that the re-export server could just use that list
> > of
> > stateids to figure out which locks can be reclaimed atomically, and
> > which ones have been irredeemably lost. The assumption is that if
> > you
> > have a lock stateid or a delegation, then that means the clients
> > can
> > reclaim all the locks that were represented by that stateid.
> 
> I'm confused about how the re-export server uses that list.  Are you
> assuming it persisted its own list across its own crash/reboot?  I
> guess
> that's what I was trying to avoid having to do.
> 
No. The server just uses the stateids as part of a check for 'do I hold
state for this file on this server?'. If the answer is 'yes' and the
lock owners are sane, then we should be able to assume the full set of
locks that lock owner held on that file are still valid.

BTW: if the lock owner is also returned by the server, then since the
lock owner is an opaque value, it could, for instance, be used by the
client to cache info on the server about which uid/gid owns these
locks.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-12-03 22:14                                     ` Trond Myklebust
@ 2020-12-03 22:39                                       ` Frank Filz
  2020-12-03 22:50                                         ` Trond Myklebust
  2020-12-03 22:44                                       ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Frank Filz @ 2020-12-03 22:39 UTC (permalink / raw)
  To: 'Trond Myklebust', bfields; +Cc: linux-cachefs, linux-nfs, daire



> -----Original Message-----
> From: Trond Myklebust [mailto:trondmy@hammerspace.com]
> Sent: Thursday, December 3, 2020 2:14 PM
> To: bfields@fieldses.org
> Cc: linux-cachefs@redhat.com; ffilzlnx@mindspring.com; linux-
> nfs@vger.kernel.org; daire@dneg.com
> Subject: Re: Adventures in NFS re-exporting
> 
> On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > > a
> > > > > > > > re-
> > > > > > > > exporting server.  I think one way to fix it might be just
> > > > > > > > to allow the re- export server to pass along reclaims to
> > > > > > > > the original server as it receives them from its own
> > > > > > > > clients.  It might require some protocol tweaks, I'm not
> > > > > > > > sure.  I'll try to get my thoughts in order and propose
> > > > > > > > something.
> > > > > > > >
> > > > > > >
> > > > > > > It's more complicated than that. If the re-exporting server
> > > > > > > reboots, but the original server does not, then unless that
> > > > > > > re- exporting server persisted its lease and a full set of
> > > > > > > stateids somewhere, it will not be able to atomically
> > > > > > > reclaim delegation and lock state on the server on behalf of
> > > > > > > its clients.
> > > > > >
> > > > > > By sending reclaims to the original server, I mean literally
> > > > > > sending new open and lock requests with the RECLAIM bit set,
> > > > > > which would get brand new stateids.
> > > > > >
> > > > > > So, the original server would invalidate the existing client's
> > > > > > previous clientid and stateids--just as it normally would on
> > > > > > reboot--but it would optionally remember the underlying locks
> > > > > > held by the client and allow compatible lock reclaims.
> > > > > >
> > > > > > Rough attempt:
> > > > > >
> > > > > >
> > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_
> > > > > > re-expor
> > > > > > t_servers
> > > > > >
> > > > > > Think it would fly?
> > > > >
> > > > > So this would be a variant of courtesy locks that can be
> > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > OPEN/LOCK outside the grace period? The purpose being to allow
> > > > > reclaim without forcing the client to persist the original
> > > > > stateid?
> > > > >
> > > > > Hmm... That's doable, but how about the following alternative:
> > > > > Add
> > > > > a function
> > > > > that allows the client to request the full list of stateids that
> > > > > the server holds on its behalf?
> > > > >
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order to allow the client to detect state leaks (either due to
> > > > > soft timeouts, or due to reordered close/open operations).
> > > >
> > > > Oh, that sounds interesting. So basically the re-export server
> > > > would re-populate it's state from the original server rather than
> > > > relying on it's clients doing reclaims? Hmm, but how does the
> > > > re-export server rebuild its stateids? I guess it could make the
> > > > clients repopulate them with the same "give me a dump of all my
> > > > state", using the state details to match up with the old state and
> > > > replacing stateids. Or did you have something different in mind?
> > > >
> > >
> > > I was thinking that the re-export server could just use that list of
> > > stateids to figure out which locks can be reclaimed atomically, and
> > > which ones have been irredeemably lost. The assumption is that if
> > > you have a lock stateid or a delegation, then that means the clients
> > > can reclaim all the locks that were represented by that stateid.
> >
> > I'm confused about how the re-export server uses that list.  Are you
> > assuming it persisted its own list across its own crash/reboot?  I
> > guess that's what I was trying to avoid having to do.
> >
> No. The server just uses the stateids as part of a check for 'do I hold state for
> this file on this server?'. If the answer is 'yes' and the lock owners are sane, then
> we should be able to assume the full set of locks that lock owner held on that
> file are still valid.
> 
> BTW: if the lock owner is also returned by the server, then since the lock owner
> is an opaque value, it could, for instance, be used by the client to cache info on
> the server about which uid/gid owns these locks.

Let me see if I'm understanding your idea right...

Re-export server reboots within the extended lease period it's been given by the original server. I'm assuming it uses the same clientid? But would probably open new sessions. It requests the list of stateids. Hmm, how to make the owner information useful, nfs-ganesha doesn't pass on the actual client's owner but rather just passes the address of its record for that client owner. Maybe it will have to do something a bit different for this degree of re-export support...

Now the re-export server knows which original client lock owners are allowed to reclaim state. So it just acquires locks using the original stateid as the client reclaims (what happens if the client doesn't reclaim a lock? I suppose the re-export server could unlock all regions not explicitly locked once reclaim is complete). Since the re-export server is acquiring new locks using the original stateid it will just overlay the original lock with the new lock and write locks don't conflict since they are being acquired by the same lock owner. Actually the original server could even balk at a "reclaim" in this way that wasn't originally held... And the original server could "refresh" the locks, and discard any that aren't refreshed at the end of reclaim. That part assumes the original server is apprised that what is actually happening is a reclaim.

The re-export server can destroy any stateids that it doesn't receive reclaims for.

Hmm, I think if the re-export server is implemented as an HA cluster, it should establish a clientid on the original server for each virtual IP (assuming that's the unit of HA)  that exists. Then when virtual IPs are moved, the re-export server just goes through the above reclaim process for that clientid.

Frank


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 22:14                                     ` Trond Myklebust
  2020-12-03 22:39                                       ` Frank Filz
@ 2020-12-03 22:44                                       ` bfields
  1 sibling, 0 replies; 129+ messages in thread
From: bfields @ 2020-12-03 22:44 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, ffilzlnx, linux-nfs, daire

On Thu, Dec 03, 2020 at 10:14:25PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org wrote:
> > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > > a
> > > > > > > > re-
> > > > > > > > exporting server.  I think one way to fix it might be
> > > > > > > > just to
> > > > > > > > allow the re- export server to pass along reclaims to the
> > > > > > > > original
> > > > > > > > server as it receives them from its own clients.  It
> > > > > > > > might
> > > > > > > > require
> > > > > > > > some protocol tweaks, I'm not sure.  I'll try to get my
> > > > > > > > thoughts
> > > > > > > > in order and propose something.
> > > > > > > > 
> > > > > > > 
> > > > > > > It's more complicated than that. If the re-exporting server
> > > > > > > reboots,
> > > > > > > but the original server does not, then unless that re-
> > > > > > > exporting
> > > > > > > server persisted its lease and a full set of stateids
> > > > > > > somewhere, it
> > > > > > > will not be able to atomically reclaim delegation and lock
> > > > > > > state on
> > > > > > > the server on behalf of its clients.
> > > > > > 
> > > > > > By sending reclaims to the original server, I mean literally
> > > > > > sending
> > > > > > new open and lock requests with the RECLAIM bit set, which
> > > > > > would
> > > > > > get
> > > > > > brand new stateids.
> > > > > > 
> > > > > > So, the original server would invalidate the existing
> > > > > > client's
> > > > > > previous clientid and stateids--just as it normally would on
> > > > > > reboot--but it would optionally remember the underlying locks
> > > > > > held by
> > > > > > the client and allow compatible lock reclaims.
> > > > > > 
> > > > > > Rough attempt:
> > > > > > 
> > > > > > 
> > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-expor
> > > > > > t_servers
> > > > > > 
> > > > > > Think it would fly?
> > > > > 
> > > > > So this would be a variant of courtesy locks that can be
> > > > > reclaimed
> > > > > by the client
> > > > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > > > period? The
> > > > > purpose being to allow reclaim without forcing the client to
> > > > > persist the original
> > > > > stateid?
> > > > > 
> > > > > Hmm... That's doable, but how about the following alternative:
> > > > > Add
> > > > > a function
> > > > > that allows the client to request the full list of stateids
> > > > > that
> > > > > the server holds on
> > > > > its behalf?
> > > > > 
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order
> > > > > to allow the
> > > > > client to detect state leaks (either due to soft timeouts, or
> > > > > due
> > > > > to reordered
> > > > > close/open operations).
> > > > 
> > > > Oh, that sounds interesting. So basically the re-export server
> > > > would
> > > > re-populate it's state from the original server rather than
> > > > relying
> > > > on it's clients doing reclaims? Hmm, but how does the re-export
> > > > server rebuild its stateids? I guess it could make the clients
> > > > repopulate them with the same "give me a dump of all my state",
> > > > using
> > > > the state details to match up with the old state and replacing
> > > > stateids. Or did you have something different in mind?
> > > > 
> > > 
> > > I was thinking that the re-export server could just use that list
> > > of
> > > stateids to figure out which locks can be reclaimed atomically, and
> > > which ones have been irredeemably lost. The assumption is that if
> > > you
> > > have a lock stateid or a delegation, then that means the clients
> > > can
> > > reclaim all the locks that were represented by that stateid.
> > 
> > I'm confused about how the re-export server uses that list.  Are you
> > assuming it persisted its own list across its own crash/reboot?  I
> > guess
> > that's what I was trying to avoid having to do.
> > 
> No. The server just uses the stateids as part of a check for 'do I hold
> state for this file on this server?'. If the answer is 'yes' and the
> lock owners are sane, then we should be able to assume the full set of
> locks that lock owner held on that file are still valid.
> 
> BTW: if the lock owner is also returned by the server, then since the
> lock owner is an opaque value, it could, for instance, be used by the
> client to cache info on the server about which uid/gid owns these
> locks.

OK, so the list of stateids returned by the server has entries that look
like (type, filehandle, owner, stateid) (where type=open or lock?).

I guess I'd need to see this in more detail.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 21:34                             ` Trond Myklebust
  2020-12-03 21:45                               ` Frank Filz
  2020-12-03 21:54                               ` bfields
@ 2020-12-03 22:45                               ` bfields
  2020-12-03 22:53                                 ` Trond Myklebust
  2 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-03 22:45 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> I've been wanting such a function for quite a while anyway in order to
> allow the client to detect state leaks (either due to soft timeouts, or
> due to reordered close/open operations).

One sure way to fix any state leaks is to reboot the server.  The server
throws everything away, the clients reclaim, all that's left is stuff
they still actually care about.

It's very disruptive.

But you could do a limited version of that: the server throws away the
state from one client (keeping the underlying locks on the exported
filesystem), lets the client go through its normal reclaim process, at
the end of that throws away anything that wasn't reclaimed.  The only
delay is to anyone trying to acquire new locks that conflict with that
set of locks, and only for as long as it takes for the one client to
reclaim.

?

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 22:39                                       ` Frank Filz
@ 2020-12-03 22:50                                         ` Trond Myklebust
  2020-12-03 23:34                                           ` Frank Filz
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 22:50 UTC (permalink / raw)
  To: bfields, ffilzlnx; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 14:39 -0800, Frank Filz wrote:
> 
> 
> > -----Original Message-----
> > From: Trond Myklebust [mailto:trondmy@hammerspace.com]
> > Sent: Thursday, December 3, 2020 2:14 PM
> > To: bfields@fieldses.org
> > Cc: linux-cachefs@redhat.com; ffilzlnx@mindspring.com; linux-
> > nfs@vger.kernel.org; daire@dneg.com
> > Subject: Re: Adventures in NFS re-exporting
> > 
> > On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > > On Thu, 2020-12-03 at 16:13 -0500,
> > > > > > bfields@fieldses.org wrote:
> > > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > > wrote:
> > > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > > I've been scratching my head over how to handle
> > > > > > > > > reboot of
> > > > > > > > > a
> > > > > > > > > re-
> > > > > > > > > exporting server.  I think one way to fix it might be
> > > > > > > > > just
> > > > > > > > > to allow the re- export server to pass along reclaims
> > > > > > > > > to
> > > > > > > > > the original server as it receives them from its own
> > > > > > > > > clients.  It might require some protocol tweaks, I'm
> > > > > > > > > not
> > > > > > > > > sure.  I'll try to get my thoughts in order and
> > > > > > > > > propose
> > > > > > > > > something.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > It's more complicated than that. If the re-exporting
> > > > > > > > server
> > > > > > > > reboots, but the original server does not, then unless
> > > > > > > > that
> > > > > > > > re- exporting server persisted its lease and a full set
> > > > > > > > of
> > > > > > > > stateids somewhere, it will not be able to atomically
> > > > > > > > reclaim delegation and lock state on the server on
> > > > > > > > behalf of
> > > > > > > > its clients.
> > > > > > > 
> > > > > > > By sending reclaims to the original server, I mean
> > > > > > > literally
> > > > > > > sending new open and lock requests with the RECLAIM bit
> > > > > > > set,
> > > > > > > which would get brand new stateids.
> > > > > > > 
> > > > > > > So, the original server would invalidate the existing
> > > > > > > client's
> > > > > > > previous clientid and stateids--just as it normally would
> > > > > > > on
> > > > > > > reboot--but it would optionally remember the underlying
> > > > > > > locks
> > > > > > > held by the client and allow compatible lock reclaims.
> > > > > > > 
> > > > > > > Rough attempt:
> > > > > > > 
> > > > > > > 
> > > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_
> > > > > > > re-expor
> > > > > > > t_servers
> > > > > > > 
> > > > > > > Think it would fly?
> > > > > > 
> > > > > > So this would be a variant of courtesy locks that can be
> > > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > > OPEN/LOCK outside the grace period? The purpose being to
> > > > > > allow
> > > > > > reclaim without forcing the client to persist the original
> > > > > > stateid?
> > > > > > 
> > > > > > Hmm... That's doable, but how about the following
> > > > > > alternative:
> > > > > > Add
> > > > > > a function
> > > > > > that allows the client to request the full list of stateids
> > > > > > that
> > > > > > the server holds on its behalf?
> > > > > > 
> > > > > > I've been wanting such a function for quite a while anyway
> > > > > > in
> > > > > > order to allow the client to detect state leaks (either due
> > > > > > to
> > > > > > soft timeouts, or due to reordered close/open operations).
> > > > > 
> > > > > Oh, that sounds interesting. So basically the re-export
> > > > > server
> > > > > would re-populate it's state from the original server rather
> > > > > than
> > > > > relying on it's clients doing reclaims? Hmm, but how does the
> > > > > re-export server rebuild its stateids? I guess it could make
> > > > > the
> > > > > clients repopulate them with the same "give me a dump of all
> > > > > my
> > > > > state", using the state details to match up with the old
> > > > > state and
> > > > > replacing stateids. Or did you have something different in
> > > > > mind?
> > > > > 
> > > > 
> > > > I was thinking that the re-export server could just use that
> > > > list of
> > > > stateids to figure out which locks can be reclaimed atomically,
> > > > and
> > > > which ones have been irredeemably lost. The assumption is that
> > > > if
> > > > you have a lock stateid or a delegation, then that means the
> > > > clients
> > > > can reclaim all the locks that were represented by that
> > > > stateid.
> > > 
> > > I'm confused about how the re-export server uses that list.  Are
> > > you
> > > assuming it persisted its own list across its own crash/reboot? 
> > > I
> > > guess that's what I was trying to avoid having to do.
> > > 
> > No. The server just uses the stateids as part of a check for 'do I
> > hold state for
> > this file on this server?'. If the answer is 'yes' and the lock
> > owners are sane, then
> > we should be able to assume the full set of locks that lock owner
> > held on that
> > file are still valid.
> > 
> > BTW: if the lock owner is also returned by the server, then since
> > the lock owner
> > is an opaque value, it could, for instance, be used by the client
> > to cache info on
> > the server about which uid/gid owns these locks.
> 
> Let me see if I'm understanding your idea right...
> 
> Re-export server reboots within the extended lease period it's been
> given by the original server. I'm assuming it uses the same clientid?

Yes. It would have to use the same clientid.

> But would probably open new sessions. It requests the list of
> stateids. Hmm, how to make the owner information useful, nfs-ganesha
> doesn't pass on the actual client's owner but rather just passes the
> address of its record for that client owner. Maybe it will have to do
> something a bit different for this degree of re-export support...
> 
> Now the re-export server knows which original client lock owners are
> allowed to reclaim state. So it just acquires locks using the
> original stateid as the client reclaims (what happens if the client
> doesn't reclaim a lock? I suppose the re-export server could unlock
> all regions not explicitly locked once reclaim is complete). Since
> the re-export server is acquiring new locks using the original
> stateid it will just overlay the original lock with the new lock and
> write locks don't conflict since they are being acquired by the same
> lock owner. Actually the original server could even balk at a
> "reclaim" in this way that wasn't originally held... And the original
> server could "refresh" the locks, and discard any that aren't
> refreshed at the end of reclaim. That part assumes the original
> server is apprised that what is actually happening is a reclaim.
> 
> The re-export server can destroy any stateids that it doesn't receive
> reclaims for.

Right. That's in essence what I'm suggesting. There are corner cases to
be considered: e.g. "what happens if the re-export server crashes after
unlocking on the server, but before passing the LOCKU reply on the the
client", however I think it should be possible to figure out strategies
for those cases.

> 
> Hmm, I think if the re-export server is implemented as an HA cluster,
> it should establish a clientid on the original server for each
> virtual IP (assuming that's the unit of HA)  that exists. Then when
> virtual IPs are moved, the re-export server just goes through the
> above reclaim process for that clientid.
> 

Yes, we could do something like that.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 22:45                               ` bfields
@ 2020-12-03 22:53                                 ` Trond Myklebust
  2020-12-03 23:16                                   ` bfields
  0 siblings, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-03 22:53 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > I've been wanting such a function for quite a while anyway in order
> > to
> > allow the client to detect state leaks (either due to soft
> > timeouts, or
> > due to reordered close/open operations).
> 
> One sure way to fix any state leaks is to reboot the server.  The
> server
> throws everything away, the clients reclaim, all that's left is stuff
> they still actually care about.
> 
> It's very disruptive.
> 
> But you could do a limited version of that: the server throws away
> the
> state from one client (keeping the underlying locks on the exported
> filesystem), lets the client go through its normal reclaim process,
> at
> the end of that throws away anything that wasn't reclaimed.  The only
> delay is to anyone trying to acquire new locks that conflict with
> that
> set of locks, and only for as long as it takes for the one client to
> reclaim.

One could do that, but that requires the existence of a quiescent
period where the client holds no state at all on the server. There are
definitely cases where that is not an option.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 22:53                                 ` Trond Myklebust
@ 2020-12-03 23:16                                   ` bfields
  2020-12-03 23:28                                     ` Frank Filz
  2020-12-04  1:02                                     ` Trond Myklebust
  0 siblings, 2 replies; 129+ messages in thread
From: bfields @ 2020-12-03 23:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > I've been wanting such a function for quite a while anyway in
> > > order to allow the client to detect state leaks (either due to
> > > soft timeouts, or due to reordered close/open operations).
> > 
> > One sure way to fix any state leaks is to reboot the server.  The
> > server throws everything away, the clients reclaim, all that's left
> > is stuff they still actually care about.
> > 
> > It's very disruptive.
> > 
> > But you could do a limited version of that: the server throws away
> > the state from one client (keeping the underlying locks on the
> > exported filesystem), lets the client go through its normal reclaim
> > process, at the end of that throws away anything that wasn't
> > reclaimed.  The only delay is to anyone trying to acquire new locks
> > that conflict with that set of locks, and only for as long as it
> > takes for the one client to reclaim.
> 
> One could do that, but that requires the existence of a quiescent
> period where the client holds no state at all on the server.

No, as I said, the client performs reboot recovery for any state that it
holds when we do this.

--b.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-12-03 23:16                                   ` bfields
@ 2020-12-03 23:28                                     ` Frank Filz
  2020-12-04  1:02                                     ` Trond Myklebust
  1 sibling, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-12-03 23:28 UTC (permalink / raw)
  To: bfields, 'Trond Myklebust'; +Cc: linux-cachefs, linux-nfs, daire

> On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > I've been wanting such a function for quite a while anyway in
> > > > order to allow the client to detect state leaks (either due to
> > > > soft timeouts, or due to reordered close/open operations).
> > >
> > > One sure way to fix any state leaks is to reboot the server.  The
> > > server throws everything away, the clients reclaim, all that's left
> > > is stuff they still actually care about.
> > >
> > > It's very disruptive.
> > >
> > > But you could do a limited version of that: the server throws away
> > > the state from one client (keeping the underlying locks on the
> > > exported filesystem), lets the client go through its normal reclaim
> > > process, at the end of that throws away anything that wasn't
> > > reclaimed.  The only delay is to anyone trying to acquire new locks
> > > that conflict with that set of locks, and only for as long as it
> > > takes for the one client to reclaim.
> >
> > One could do that, but that requires the existence of a quiescent
> > period where the client holds no state at all on the server.
> 
> No, as I said, the client performs reboot recovery for any state that it holds
> when we do this.

Yea, but the original sever goes through a period where it has dropped all state and isn't in grace, and if it's coordinating with non-NFS users, they don't know anything about grace anyway.

Frank


^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: Adventures in NFS re-exporting
  2020-12-03 22:50                                         ` Trond Myklebust
@ 2020-12-03 23:34                                           ` Frank Filz
  0 siblings, 0 replies; 129+ messages in thread
From: Frank Filz @ 2020-12-03 23:34 UTC (permalink / raw)
  To: 'Trond Myklebust', bfields; +Cc: linux-cachefs, linux-nfs, daire

> > > -----Original Message-----
> > > From: Trond Myklebust [mailto:trondmy@hammerspace.com]
> > > Sent: Thursday, December 3, 2020 2:14 PM
> > > To: bfields@fieldses.org
> > > Cc: linux-cachefs@redhat.com; ffilzlnx@mindspring.com; linux-
> > > nfs@vger.kernel.org; daire@dneg.com
> > > Subject: Re: Adventures in NFS re-exporting
> > >
> > > On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > > > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > > > On Thu, 2020-12-03 at 16:13 -0500, bfields@fieldses.org
> > > > > > > wrote:
> > > > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > > > wrote:
> > > > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > > > I've been scratching my head over how to handle reboot
> > > > > > > > > > of a
> > > > > > > > > > re-
> > > > > > > > > > exporting server.  I think one way to fix it might be
> > > > > > > > > > just to allow the re- export server to pass along
> > > > > > > > > > reclaims to the original server as it receives them
> > > > > > > > > > from its own clients.  It might require some protocol
> > > > > > > > > > tweaks, I'm not sure.  I'll try to get my thoughts in
> > > > > > > > > > order and propose something.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > It's more complicated than that. If the re-exporting
> > > > > > > > > server reboots, but the original server does not, then
> > > > > > > > > unless that
> > > > > > > > > re- exporting server persisted its lease and a full set
> > > > > > > > > of stateids somewhere, it will not be able to atomically
> > > > > > > > > reclaim delegation and lock state on the server on
> > > > > > > > > behalf of its clients.
> > > > > > > >
> > > > > > > > By sending reclaims to the original server, I mean
> > > > > > > > literally sending new open and lock requests with the
> > > > > > > > RECLAIM bit set, which would get brand new stateids.
> > > > > > > >
> > > > > > > > So, the original server would invalidate the existing
> > > > > > > > client's previous clientid and stateids--just as it
> > > > > > > > normally would on reboot--but it would optionally remember
> > > > > > > > the underlying locks held by the client and allow
> > > > > > > > compatible lock reclaims.
> > > > > > > >
> > > > > > > > Rough attempt:
> > > > > > > >
> > > > > > > >
> > > > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_
> > > > > > > > for_
> > > > > > > > re-expor
> > > > > > > > t_servers
> > > > > > > >
> > > > > > > > Think it would fly?
> > > > > > >
> > > > > > > So this would be a variant of courtesy locks that can be
> > > > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > > > OPEN/LOCK outside the grace period? The purpose being to
> > > > > > > allow reclaim without forcing the client to persist the
> > > > > > > original stateid?
> > > > > > >
> > > > > > > Hmm... That's doable, but how about the following
> > > > > > > alternative:
> > > > > > > Add
> > > > > > > a function
> > > > > > > that allows the client to request the full list of stateids
> > > > > > > that the server holds on its behalf?
> > > > > > >
> > > > > > > I've been wanting such a function for quite a while anyway
> > > > > > > in order to allow the client to detect state leaks (either
> > > > > > > due to soft timeouts, or due to reordered close/open
> > > > > > > operations).
> > > > > >
> > > > > > Oh, that sounds interesting. So basically the re-export server
> > > > > > would re-populate it's state from the original server rather
> > > > > > than relying on it's clients doing reclaims? Hmm, but how does
> > > > > > the re-export server rebuild its stateids? I guess it could
> > > > > > make the clients repopulate them with the same "give me a dump
> > > > > > of all my state", using the state details to match up with the
> > > > > > old state and replacing stateids. Or did you have something
> > > > > > different in mind?
> > > > > >
> > > > >
> > > > > I was thinking that the re-export server could just use that
> > > > > list of stateids to figure out which locks can be reclaimed
> > > > > atomically, and which ones have been irredeemably lost. The
> > > > > assumption is that if you have a lock stateid or a delegation,
> > > > > then that means the clients can reclaim all the locks that were
> > > > > represented by that stateid.
> > > >
> > > > I'm confused about how the re-export server uses that list.  Are
> > > > you assuming it persisted its own list across its own
> > > > crash/reboot?
> > > > I
> > > > guess that's what I was trying to avoid having to do.
> > > >
> > > No. The server just uses the stateids as part of a check for 'do I
> > > hold state for this file on this server?'. If the answer is 'yes'
> > > and the lock owners are sane, then we should be able to assume the
> > > full set of locks that lock owner held on that file are still valid.
> > >
> > > BTW: if the lock owner is also returned by the server, then since
> > > the lock owner is an opaque value, it could, for instance, be used
> > > by the client to cache info on the server about which uid/gid owns
> > > these locks.
> >
> > Let me see if I'm understanding your idea right...
> >
> > Re-export server reboots within the extended lease period it's been
> > given by the original server. I'm assuming it uses the same clientid?
> 
> Yes. It would have to use the same clientid.
> 
> > But would probably open new sessions. It requests the list of
> > stateids. Hmm, how to make the owner information useful, nfs-ganesha
> > doesn't pass on the actual client's owner but rather just passes the
> > address of its record for that client owner. Maybe it will have to do
> > something a bit different for this degree of re-export support...
> >
> > Now the re-export server knows which original client lock owners are
> > allowed to reclaim state. So it just acquires locks using the original
> > stateid as the client reclaims (what happens if the client doesn't
> > reclaim a lock? I suppose the re-export server could unlock all
> > regions not explicitly locked once reclaim is complete). Since the
> > re-export server is acquiring new locks using the original stateid it
> > will just overlay the original lock with the new lock and write locks
> > don't conflict since they are being acquired by the same lock owner.
> > Actually the original server could even balk at a "reclaim" in this
> > way that wasn't originally held... And the original server could
> > "refresh" the locks, and discard any that aren't refreshed at the end
> > of reclaim. That part assumes the original server is apprised that
> > what is actually happening is a reclaim.
> >
> > The re-export server can destroy any stateids that it doesn't receive
> > reclaims for.
> 
> Right. That's in essence what I'm suggesting. There are corner cases to be
> considered: e.g. "what happens if the re-export server crashes after unlocking
> on the server, but before passing the LOCKU reply on the the client", however I
> think it should be possible to figure out strategies for those cases.

That's no different than a regular NFS server crashes before responding to an unlock. The client likely doesn't reclaim locks it was attempting to drop at server crash time. So then one place we would definitely have abandoned locks on the original server IF the unlock never made it to the original server. But we're already talking strategies to clean up abandoned locks.

I won't be surprised if we find a more tricky corner case, but my gut feel is every corner case will have a relatively simple solution.

Another consideration is how to handle the size of the state list... Ideally we would have some way to break it up that is less clunky than readdir (at least the state list can be assumed to be static during the course of the fetching of it, even for a regular client just interested in it, it could pause state activity until the list is retrieved).

Frank

Frank


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-03 23:16                                   ` bfields
  2020-12-03 23:28                                     ` Frank Filz
@ 2020-12-04  1:02                                     ` Trond Myklebust
  2020-12-04  1:41                                       ` bfields
  1 sibling, 1 reply; 129+ messages in thread
From: Trond Myklebust @ 2020-12-04  1:02 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 18:16 -0500, bfields@fieldses.org wrote:
> On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > I've been wanting such a function for quite a while anyway in
> > > > order to allow the client to detect state leaks (either due to
> > > > soft timeouts, or due to reordered close/open operations).
> > > 
> > > One sure way to fix any state leaks is to reboot the server.  The
> > > server throws everything away, the clients reclaim, all that's
> > > left
> > > is stuff they still actually care about.
> > > 
> > > It's very disruptive.
> > > 
> > > But you could do a limited version of that: the server throws
> > > away
> > > the state from one client (keeping the underlying locks on the
> > > exported filesystem), lets the client go through its normal
> > > reclaim
> > > process, at the end of that throws away anything that wasn't
> > > reclaimed.  The only delay is to anyone trying to acquire new
> > > locks
> > > that conflict with that set of locks, and only for as long as it
> > > takes for the one client to reclaim.
> > 
> > One could do that, but that requires the existence of a quiescent
> > period where the client holds no state at all on the server.
> 
> No, as I said, the client performs reboot recovery for any state that
> it
> holds when we do this.
> 

Hmm... So how do the client and server coordinate what can and cannot
be reclaimed? The issue is that races can work both ways, with the
client sometimes believing that it holds a layout or a delegation that
the server thinks it has returned. If the server allows a reclaim of
such a delegation, then that could be problematic (because it breaks
lock atomicity on the client and because it may cause conflicts).

By the way, the other thing that I'd like to add to my wishlist is a
callback that allows the server to ask the client if it still holds a
given open or lock stateid. A server can recall a delegation or a
layout, so it can fix up leaks of those, however it has no remedy if
the client loses an open or lock stateid other than to possibly
forcibly revoke state. That could cause application crashes if the
server makes a mistake and revokes a lock that is actually in use.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-04  1:02                                     ` Trond Myklebust
@ 2020-12-04  1:41                                       ` bfields
  2020-12-04  2:27                                         ` Trond Myklebust
  0 siblings, 1 reply; 129+ messages in thread
From: bfields @ 2020-12-04  1:41 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-cachefs, linux-nfs, daire

On Fri, Dec 04, 2020 at 01:02:20AM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 18:16 -0500, bfields@fieldses.org wrote:
> > On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order to allow the client to detect state leaks (either due to
> > > > > soft timeouts, or due to reordered close/open operations).
> > > > 
> > > > One sure way to fix any state leaks is to reboot the server.  The
> > > > server throws everything away, the clients reclaim, all that's
> > > > left
> > > > is stuff they still actually care about.
> > > > 
> > > > It's very disruptive.
> > > > 
> > > > But you could do a limited version of that: the server throws
> > > > away
> > > > the state from one client (keeping the underlying locks on the
> > > > exported filesystem), lets the client go through its normal
> > > > reclaim
> > > > process, at the end of that throws away anything that wasn't
> > > > reclaimed.  The only delay is to anyone trying to acquire new
> > > > locks
> > > > that conflict with that set of locks, and only for as long as it
> > > > takes for the one client to reclaim.
> > > 
> > > One could do that, but that requires the existence of a quiescent
> > > period where the client holds no state at all on the server.
> > 
> > No, as I said, the client performs reboot recovery for any state that
> > it
> > holds when we do this.
> > 
> 
> Hmm... So how do the client and server coordinate what can and cannot
> be reclaimed? The issue is that races can work both ways, with the
> client sometimes believing that it holds a layout or a delegation that
> the server thinks it has returned. If the server allows a reclaim of
> such a delegation, then that could be problematic (because it breaks
> lock atomicity on the client and because it may cause conflicts).

The server's not actually forgetting anything, it's just pretending to,
in order to trigger the client's reboot recovery.  It can turn down the
client's attempt to reclaim something it doesn't have.

Though isn't it already game over by the time the client thinks it holds
some lock/open/delegation that the server doesn't?  I guess I'd need to
see these cases written out in detail to understand.

--b.

> By the way, the other thing that I'd like to add to my wishlist is a
> callback that allows the server to ask the client if it still holds a
> given open or lock stateid. A server can recall a delegation or a
> layout, so it can fix up leaks of those, however it has no remedy if
> the client loses an open or lock stateid other than to possibly
> forcibly revoke state. That could cause application crashes if the
> server makes a mistake and revokes a lock that is actually in use.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Adventures in NFS re-exporting
  2020-12-04  1:41                                       ` bfields
@ 2020-12-04  2:27                                         ` Trond Myklebust
  0 siblings, 0 replies; 129+ messages in thread
From: Trond Myklebust @ 2020-12-04  2:27 UTC (permalink / raw)
  To: bfields; +Cc: linux-cachefs, linux-nfs, daire

On Thu, 2020-12-03 at 20:41 -0500, bfields@fieldses.org wrote:
> On Fri, Dec 04, 2020 at 01:02:20AM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 18:16 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 17:45 -0500, bfields@fieldses.org wrote:
> > > > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > I've been wanting such a function for quite a while anyway
> > > > > > in
> > > > > > order to allow the client to detect state leaks (either due
> > > > > > to
> > > > > > soft timeouts, or due to reordered close/open operations).
> > > > > 
> > > > > One sure way to fix any state leaks is to reboot the server. 
> > > > > The
> > > > > server throws everything away, the clients reclaim, all
> > > > > that's
> > > > > left
> > > > > is stuff they still actually care about.
> > > > > 
> > > > > It's very disruptive.
> > > > > 
> > > > > But you could do a limited version of that: the server throws
> > > > > away
> > > > > the state from one client (keeping the underlying locks on
> > > > > the
> > > > > exported filesystem), lets the client go through its normal
> > > > > reclaim
> > > > > process, at the end of that throws away anything that wasn't
> > > > > reclaimed.  The only delay is to anyone trying to acquire new
> > > > > locks
> > > > > that conflict with that set of locks, and only for as long as
> > > > > it
> > > > > takes for the one client to reclaim.
> > > > 
> > > > One could do that, but that requires the existence of a
> > > > quiescent
> > > > period where the client holds no state at all on the server.
> > > 
> > > No, as I said, the client performs reboot recovery for any state
> > > that
> > > it
> > > holds when we do this.
> > > 
> > 
> > Hmm... So how do the client and server coordinate what can and
> > cannot
> > be reclaimed? The issue is that races can work both ways, with the
> > client sometimes believing that it holds a layout or a delegation
> > that
> > the server thinks it has returned. If the server allows a reclaim
> > of
> > such a delegation, then that could be problematic (because it
> > breaks
> > lock atomicity on the client and because it may cause conflicts).
> 
> The server's not actually forgetting anything, it's just pretending
> to,
> in order to trigger the client's reboot recovery.  It can turn down
> the
> client's attempt to reclaim something it doesn't have.
> 
> Though isn't it already game over by the time the client thinks it
> holds
> some lock/open/delegation that the server doesn't?  I guess I'd need
> to
> see these cases written out in detail to understand.
> 

Normally, the server will return NFS4ERR_BAD_STATEID or
NFS4ERR_OLD_STATEID if the client tries to use an invalid stateid. The
issue here is that you'd be discarding that machinery, because the
client is forgetting its stateids when it gets told that the server
rebooted.
That again puts the onus on the server to verify more strongly whether
or not the client is recovering state that it actually holds.


So to elaborate a little more on the cases where we have seen the
client and server state mess up here. Typically it happens when we
build COMPOUNDS where there is a stateful operation followed by a slow
operation. Something like

Thread 1
========
OPEN(foo) + LAYOUTGET
-> openstateid(01: blah)

				Thread 2
				========
				OPEN(foo)
				->openstateid(02: blah)
				CLOSE(openstateid(02:blah))

(gets reply from OPEN).

Typically the client forgets about the stateid after the CLOSE, so when
it gets a reply to the original OPEN, it thinks it just got a
completely fresh stateid "openstateid(01: blah)", which it might try to
reclaim if the server declares a reboot.

> --b.
> 
> > By the way, the other thing that I'd like to add to my wishlist is
> > a
> > callback that allows the server to ask the client if it still holds
> > a
> > given open or lock stateid. A server can recall a delegation or a
> > layout, so it can fix up leaks of those, however it has no remedy
> > if
> > the client loses an open or lock stateid other than to possibly
> > forcibly revoke state. That could cause application crashes if the
> > server makes a mistake and revokes a lock that is actually in use.
> > 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread, other threads:[~2020-12-04  2:28 UTC | newest]

Thread overview: 129+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
2020-09-08  9:40 ` Mkrtchyan, Tigran
2020-09-08 11:06   ` Daire Byrne
2020-09-15 17:21 ` J. Bruce Fields
2020-09-15 19:59   ` Trond Myklebust
2020-09-16 16:01     ` Daire Byrne
2020-10-19 16:19       ` Daire Byrne
2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 17:53             ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 20:05             ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37               ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
2020-11-09 16:02           ` bfields
2020-11-12 13:01             ` Daire Byrne
2020-11-12 13:57               ` bfields
2020-11-12 18:33                 ` Daire Byrne
2020-11-12 20:55                   ` bfields
2020-11-12 23:05                     ` Daire Byrne
2020-11-13 14:50                       ` bfields
2020-11-13 22:26                         ` bfields
2020-11-14 12:57                           ` Daire Byrne
2020-11-16 15:18                             ` bfields
2020-11-16 15:53                             ` bfields
2020-11-16 19:21                               ` Daire Byrne
2020-11-16 15:29                           ` Jeff Layton
2020-11-16 15:56                             ` bfields
2020-11-16 16:03                               ` Jeff Layton
2020-11-16 16:14                                 ` bfields
2020-11-16 16:38                                   ` Jeff Layton
2020-11-16 19:03                                     ` bfields
2020-11-16 20:03                                       ` Jeff Layton
2020-11-17  3:16                                         ` bfields
2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-17 15:26                                                 ` J. Bruce Fields
2020-11-17 15:34                                                   ` Jeff Layton
2020-11-20 22:38                                                     ` J. Bruce Fields
2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
2020-11-21  0:34                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
2020-11-21  0:58                                                           ` Jeff Layton
2020-11-21  1:01                                                             ` J. Bruce Fields
2020-11-21 13:00                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-21  1:03                                                         ` Jeff Layton
2020-11-21 21:44                                                           ` Daire Byrne
2020-11-22  0:02                                                             ` bfields
2020-11-22  1:55                                                               ` Daire Byrne
2020-11-22  3:03                                                                 ` bfields
2020-11-23 20:07                                                                   ` Daire Byrne
2020-11-17 15:25                                               ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
2020-11-17 12:27                                               ` Jeff Layton
2020-11-17 14:14                                                 ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
2020-11-24 21:15                 ` bfields
2020-11-24 22:15                   ` Frank Filz
2020-11-25 14:47                     ` 'bfields'
2020-11-25 16:25                       ` Frank Filz
2020-11-25 19:03                         ` 'bfields'
2020-11-26  0:04                           ` Frank Filz
2020-11-25 17:14                   ` Daire Byrne
2020-11-25 19:31                     ` bfields
2020-12-03 12:20                     ` Daire Byrne
2020-12-03 18:51                       ` bfields
2020-12-03 20:27                         ` Trond Myklebust
2020-12-03 21:13                           ` bfields
2020-12-03 21:32                             ` Frank Filz
2020-12-03 21:34                             ` Trond Myklebust
2020-12-03 21:45                               ` Frank Filz
2020-12-03 21:57                                 ` Trond Myklebust
2020-12-03 22:04                                   ` bfields
2020-12-03 22:14                                     ` Trond Myklebust
2020-12-03 22:39                                       ` Frank Filz
2020-12-03 22:50                                         ` Trond Myklebust
2020-12-03 23:34                                           ` Frank Filz
2020-12-03 22:44                                       ` bfields
2020-12-03 21:54                               ` bfields
2020-12-03 22:45                               ` bfields
2020-12-03 22:53                                 ` Trond Myklebust
2020-12-03 23:16                                   ` bfields
2020-12-03 23:28                                     ` Frank Filz
2020-12-04  1:02                                     ` Trond Myklebust
2020-12-04  1:41                                       ` bfields
2020-12-04  2:27                                         ` Trond Myklebust
2020-09-17 16:01   ` Daire Byrne
2020-09-17 19:09     ` bfields
2020-09-17 20:23       ` Frank van der Linden
2020-09-17 21:57         ` bfields
2020-09-19 11:08           ` Daire Byrne
2020-09-22 16:43         ` Chuck Lever
2020-09-23 20:25           ` Daire Byrne
2020-09-23 21:01             ` Frank van der Linden
2020-09-26  9:00               ` Daire Byrne
2020-09-28 15:49                 ` Frank van der Linden
2020-09-28 16:08                   ` Chuck Lever
2020-09-28 17:42                     ` Frank van der Linden
2020-09-22 12:31 ` Daire Byrne
2020-09-22 13:52   ` Trond Myklebust
2020-09-23 12:40     ` J. Bruce Fields
2020-09-23 13:09       ` Trond Myklebust
2020-09-23 17:07         ` bfields
2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
2020-10-01  0:09     ` Daire Byrne
2020-10-01 10:36       ` Jeff Layton
2020-10-01 12:38         ` Trond Myklebust
2020-10-01 16:39           ` Jeff Layton
2020-10-05 12:54         ` Daire Byrne
2020-10-13  9:59           ` Daire Byrne
2020-10-01 18:41     ` J. Bruce Fields
2020-10-01 19:24       ` Trond Myklebust
2020-10-01 19:26         ` bfields
2020-10-01 19:29           ` Trond Myklebust
2020-10-01 19:51             ` bfields

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).