Re: safe versions of NFS

From: hedrick@rutgers.edu
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Benjamin Coddington <bcodding@redhat.com>,
	Patrick Goetz <pgoetz@math.utexas.edu>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: safe versions of NFS
Date: Tue, 13 Apr 2021 13:48:50 -0400	[thread overview]
Message-ID: <FDBA5185-F6BE-443B-81E4-12DD1501E4A6@rutgers.edu> (raw)
In-Reply-To: <5F282359-128D-4F72-B393-B87956DFE458@oracle.com>

The two oddities I’ve seen are
* the fairly common failure of mounts with “not exported” because of sssd problems
* a major failure when I inadvertently reinstalled sssd on the server. That caused lots of mounts and authentication to fail. That was on Apr 2, though, and most problems have been in the last week

We’ve been starting to move file systems from our netapp to the Linux-based server.I note that Netapp defaults to delegations off with NFS 4.1. They almost certainly wouldn’t see these problems. It’s also interesting to see that there’s been enough history of problems that gitlab recommends turning delegations off on Linux NFS servers, or using 4.0. I’ve seen another big package that makes a similar recommendation.

As soon as we can verify that our applications work, we’re going to upgrade the server that has shown the most problems with Linux 5.4, to see if that helps. So far our Ubuntu 20 systems (with 5.8) have been OK, though they get fewer users. We’ll be moving everything to 20 this summer. While Ubuntu 20 server uses 5.4, I’m inclined to install it with 5.8, since that’s the combination we’ve tested most.

> On Apr 13, 2021, at 1:24 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Apr 13, 2021, at 12:23 PM, Benjamin Coddington <bcodding@redhat.com> wrote:
>> 
>> (resending this as it bounced off the list - I accidentally embedded HTML)
>> 
>> Yes, if you're pretty sure your hostnames are all different, the client_ids
>> should be different.  For v4.0 you can turn on debugging (rpcdebug -m nfs -s
>> proc) and see the client_id in the kernel log in lines that look like: "NFS
>> call setclientid auth=%s, '%s'\n", which will happen at mount time, but it
>> doesn't look like we have any debugging for v4.1 and v4.2 for EXCHANGE_ID.
>> 
>> You can extract it via the crash utility, or via systemtap, or by doing a
>> wire capture, but nothing that's easily translated to running across a large
>> number of machines.  There's probably other ways, perhaps we should tack
>> that string into the tracepoints for exchange_id and setclientid.
>> 
>> If you're interested in troubleshooting, wire capture's usually the most
>> informative.  If the lockup events all happen at the same time, there
>> might be some network event that is triggering the issue.
>> 
>> You should expect NFSv4.1 to be rock-solid.  Its rare we have reports
>> that it isn't, and I'd love to know why you're having these problems.
> 
> I echo that: NFSv4.1 protocol and implementation are mature, so if
> there are operational problems, it should be root-caused.
> 
> NFSv4.1 uses a uniform client ID. That should be the "good" one,
> not the NFSv4.0 one that has a non-zero probability of collision.
> 
> Charles, please let us know if there are particular workloads that
> trigger the lock reclaim failure. A narrow reproducer would help
> get to the root issue quickly.
> 
> 
>> Ben
>> 
>> On 13 Apr 2021, at 11:38, hedrick@rutgers.edu wrote:
>> 
>>> The server is ubuntu 20, with a ZFS file system.
>>> 
>>> I don’t set the unique ID. Documentation claims that it is set from the hostname. They will surely be unique, or the whole world would blow up. How can I check the actual unique ID being used? The kernel reports a blank one, but I think that just means to use the hostname. We could obviously set a unique one if that would be useful.
>>> 
>>>> On Apr 13, 2021, at 11:35 AM, Benjamin Coddington <bcodding@redhat.com> wrote:
>>>> 
>>>> It would be interesting to know why your clients are failing to reclaim their locks.  Something is misconfigured.  What server are you using, and is there anything fancy on the server-side (like HA)?  Is it possible that you have clients with the same nfs4_unique_id?
>>>> 
>>>> Ben
>>>> 
>>>> On 13 Apr 2021, at 11:17, hedrick@rutgers.edu wrote:
>>>> 
>>>>> many, though not all, of the problems are “lock reclaim failed”.
>>>>> 
>>>>>> On Apr 13, 2021, at 10:52 AM, Patrick Goetz <pgoetz@math.utexas.edu> wrote:
>>>>>> 
>>>>>> I use NFS 4.2 with Ubuntu 18/20 workstations and Ubuntu 18/20 servers and haven't had any problems.
>>>>>> 
>>>>>> Check your configuration files; the last time I experienced something like this it's because I inadvertently used the same fsid on two different exports. Also recommend exporting top level directories only.  Bind mount everything you want to export into /srv/nfs and only export those directories. According to Bruce F. this doesn't buy you any security (I still don't understand why), but it makes for a cleaner system configuration.
>>>>>> 
>>>>>> On 4/13/21 9:33 AM, hedrick@rutgers.edu wrote:
>>>>>>> I am in charge of a large computer science dept computing infrastructure. We have a variety of student and develo9pment users. If there are problems we’ll see them.
>>>>>>> We use an Ubuntu 20 server, with NVMe storage.
>>>>>>> I’ve just had to move Centos 7 and Ubuntu 18 to use NFS 4.0. We had hangs with NFS 4.1 and 4.2. Files would appear to be locked, although eventually the lock would time out. It’s too soon to be sure that moving back to NFS 4.0 will fix it. Next is either NFS 3 or disabling delegations on the server.
>>>>>>> Are there known versions of NFS that are safe to use in production for various kernel versions? The one we’re most interested in is Ubuntu 20, which can be anything from 5.4 to 5.8.
>>>> 
>> 
>> 
>> 
> 
> --
> Chuck Lever
> 
> 
>