Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

From: Chuck Lever III <chuck.lever@oracle.com>
To: Bruce Fields <bfields@fieldses.org>
Cc: Jonathan Woithe <jwoithe@just42.net>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount
Date: Mon, 17 Jan 2022 18:22:15 +0000
Message-ID: <969927E5-96A4-4700-8AF0-2B383261A6FA@oracle.com>
In-Reply-To: <20220117155019.GD28708@fieldses.org>

> On Jan 17, 2022, at 10:50 AM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
>> 
>>> On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <jwoithe@just42.net> wrote:
>>> 
>>> Hi Chuck
>>> 
>>> Thanks for your response.
>>> 
>>> On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
>>>>> Recently we migrated an NFS server from a 32-bit environment running 
>>>>> kernel 4.14.128 to a 64-bit 5.15.x kernel.  The NFS configuration remained
>>>>> unchanged between the two systems.
>>>>> 
>>>>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
>>>>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
>>>>> machine is turned on for the day.  On both occasions the call trace was
>>>>> essentially identical.  The full oops sequence is at the end of this email. 
>>>>> The oops was not observed when running the 4.14.128 kernel.
>>>>> 
>>>>> Is there anything more I can provide to help track down the cause of the
>>>>> oops?
>>>> 
>>>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
>>>> nlm_file"), which was introduced in or around v5.15.  You could try a
>>>> simple test and back the server down to v5.14.y to see if the problem
>>>> persists.
>>> 
>>> I could do this, but only perhaps on Monday when I'm next on site.  It may
>>> take a while to get an answer though, since it seems we hit the fault only
>>> around once every 2 weeks.  Since it's a production server we are of course
>>> limited in the things I can do.
>>> 
>>> I *may* be able to set up another system as an NFS server and hit that with
>>> repeated mount requests.  That could help reduce the time we have to wait
>>> for an answer.
>> 
>> Given the callback information you provided, I believe that the problem
>> is due to a client reboot, not a mount request. The callback shows the
>> crash occurs while your server is processing an SM_NOTIFY request from
>> one of your clients.
>> 
>> 
>>> Is it worth considering a revert of 7f024fcd5c97?  I guess it depends on how
>>> many later patches depended on it.
>> 
>> You can try reverting 7f024fcd5c97, but as I recall there are some
>> subsequent changes that depend on that one.
> 
> NLM locking on reexports would stop working.  Which is a new (and
> imperfect) feature, so less important than avoiding this NULL
> dereference, if push came to shove.  But, let's see if we can just fix
> it.....

Agreed. I was suggesting the revert only as an experiment.

--
Chuck Lever