From: Chuck Lever III <chuck.lever@oracle.com>
To: Bruce Fields <bfields@fieldses.org>
Cc: Jonathan Woithe <jwoithe@just42.net>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount
Date: Mon, 17 Jan 2022 18:22:15 +0000
Message-ID: <969927E5-96A4-4700-8AF0-2B383261A6FA@oracle.com>
In-Reply-To: <20220117155019.GD28708@fieldses.org>



> On Jan 17, 2022, at 10:50 AM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
>> 
>>> On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <jwoithe@just42.net> wrote:
>>> 
>>> Hi Chuck
>>> 
>>> Thanks for your response.
>>> 
>>> On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
>>>>> Recently we migrated an NFS server from a 32-bit environment running 
>>>>> kernel 4.14.128 to a 64-bit 5.15.x kernel.  The NFS configuration remained
>>>>> unchanged between the two systems.
>>>>> 
>>>>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
>>>>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
>>>>> machine is turned on for the day.  On both occasions the call trace was
>>>>> essentially identical.  The full oops sequence is at the end of this email. 
>>>>> The oops was not observed when running the 4.14.128 kernel.
>>>>> 
>>>>> Is there anything more I can provide to help track down the cause of the
>>>>> oops?
>>>> 
>>>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
>>>> nlm_file"), which was introduced in or around v5.15.  You could try a
>>>> simple test and back the server down to v5.14.y to see if the problem
>>>> persists.
>>> 
>>> I could do this, but only perhaps on Monday when I'm next on site.  It may
>>> take a while to get an answer though, since it seems we hit the fault only
>>> around once every 2 weeks.  Since it's a production server we are of course
>>> limited in the things we can do.
>>> 
>>> I *may* be able to set up another system as an NFS server and hit that with
>>> repeated mount requests.  That could help reduce the time we have to wait
>>> for an answer.
>> 
>> Given the call trace you provided, I believe that the problem
>> is due to a client reboot, not a mount request. The call trace shows the
>> crash occurs while your server is processing an SM_NOTIFY request from
>> one of your clients.
>> 
>> 
>>> Is it worth considering a revert of 7f024fcd5c97?  I guess it depends on how
>>> many later patches depended on it.
>> 
>> You can try reverting 7f024fcd5c97, but as I recall there are some
>> subsequent changes that depend on that one.
> 
> NLM locking on reexports would stop working.  Which is a new (and
> imperfect) feature, so less important than avoiding this NULL
> dereference, if push came to shove.  But, let's see if we can just fix
> it.....

Agreed. I was suggesting the revert only as an experiment.

--
Chuck Lever




