* [Bug Report] Re: [PATCH v1] generic/476: requires 27GB scratch size
@ 2022-07-21 14:50 Theodore Ts'o
  2022-07-21 14:59 ` [Bug Report] " Chuck Lever III
  0 siblings, 1 reply; 2+ messages in thread
From: Theodore Ts'o @ 2022-07-21 14:50 UTC
  To: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 377 bytes --]

FYI, modern kernels (5.10 LTS and newer, up to and excluding
bleeding-edge mainline kernels) are looping forever in a livelock or
deadlock when running generic/476 on NFS, in both loopback and
external export configurations.  This *may* be an ENOSPC-related issue.

See the referenced discussion on fstests@vger.kernel.org for more
details.

	 			     	      - Ted


[-- Attachment #2: Type: message/rfc822, Size: 10544 bytes --]

From: "Theodore Ts'o" <tytso@mit.edu>
To: Boyang Xue <bxue@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>, fstests@vger.kernel.org
Subject: Re: [PATCH v1] generic/476: requires 27GB scratch size
Date: Thu, 21 Jul 2022 10:03:45 -0400
Message-ID: <YtlcwZ/66pJhpdiS@mit.edu>

Following up, using NFS loopback with a 5GB scratch device on a Google
Compute Engine VM, generic/476 passes using a 4.14 LTS, 4.19 LTS, and
5.4 LTS kernel.  So this looks like a regression in 5.10 LTS and
newer kernels, and so instead of patching it out of the test, I think
the right thing to do is to add it to a kernel version-specific
exclude file and then file a bug with the NFS folks.

KERNEL:    kernel 4.14.284-xfstests #8 SMP Tue Jul 5 08:21:37 EDT 2022 x86_64
CMDLINE:   -c nfs/default generic/476
CPUS:      2
MEM:       7680

nfs/loopback: 1 tests, 597 seconds
  generic/476  Pass     595s
Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 595s

---
KERNEL:    kernel 4.19.248-xfstests #4 SMP Sat Jun 25 10:43:45 EDT 2022 x86_64
CMDLINE:   -c nfs/default generic/476
CPUS:      2
MEM:       7680

nfs/loopback: 1 tests, 407 seconds
  generic/476  Pass     407s
Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 407s

----
KERNEL:    kernel 5.4.199-xfstests #21 SMP Sun Jul 3 12:15:15 EDT 2022 x86_64
CMDLINE:   -c nfs/default generic/476
CPUS:      2
MEM:       7680

nfs/loopback: 1 tests, 404 seconds
  generic/476  Pass     404s
Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 404s
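
The three run summaries above share a fixed layout, so they can be
tabulated mechanically.  What follows is a minimal Python sketch that
assumes only the line shapes visible in this message (a "KERNEL:" line
carrying the version, and a "<test>  <status>  <seconds>s" result
line).  The function name parse_runs is hypothetical and is not part
of xfstests-bld.

```python
import re

# Sample text copied from the run summaries in this message.
REPORT = """\
KERNEL:    kernel 4.14.284-xfstests #8 SMP Tue Jul 5 08:21:37 EDT 2022 x86_64
  generic/476  Pass     595s
---
KERNEL:    kernel 4.19.248-xfstests #4 SMP Sat Jun 25 10:43:45 EDT 2022 x86_64
  generic/476  Pass     407s
----
KERNEL:    kernel 5.4.199-xfstests #21 SMP Sun Jul 3 12:15:15 EDT 2022 x86_64
  generic/476  Pass     404s
"""

# "KERNEL:    kernel 5.4.199-xfstests ..." -> "5.4.199"
KERNEL_RE = re.compile(r"^KERNEL:\s+kernel\s+(\d+\.\d+\.\d+)")
# "  generic/476  Pass     404s" -> ("generic/476", "Pass", "404")
RESULT_RE = re.compile(r"^\s*(\S+/\d+)\s+(\w+)\s+(\d+)s\s*$")


def parse_runs(text):
    """Return (kernel, test, status, seconds) tuples, one per result line."""
    runs = []
    kernel = None
    for line in text.splitlines():
        m = KERNEL_RE.match(line)
        if m:
            kernel = m.group(1)  # remember which kernel the next results belong to
            continue
        m = RESULT_RE.match(line)
        if m and kernel is not None:
            runs.append((kernel, m.group(1), m.group(2), int(m.group(3))))
    return runs


RUNS = parse_runs(REPORT)
for kernel, test, status, secs in RUNS:
    print(f"{kernel:<10} {test:<12} {status:<5} {secs:>4}s")
```

A hung run never prints a result line, so it would simply be absent
from the parsed list rather than reported as a failure.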


See below for what I'm checking into xfstests-bld for
{kvm,gce}-xfstests.  I don't believe we should be changing xfstests's
generic/476, since it *does* pass with a smaller scratch device on
older kernels, and presumably, RHEL customers would be cranky if this
issue caused their production systems to lock up, and so it
should be considered a kernel bug as opposed to a test bug.

						- Ted


commit 4a33b6721d5db9c07f295a10a8ad65d2a0021406
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Jul 21 09:54:50 2022 -0400

    test-appliance: add an nfs test exclusion for kernels newer than 5.4
    
    This is apparently an NFS bug which is visible in 5.10 LTS and newer
    kernels, and likely appeared sometime after 5.4.  Since it causes the
    test VM to spin forever (or at least for days), let's exclude it for
    now.
    
    Link: https://lore.kernel.org/all/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@mail.gmail.com/
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/test-appliance/files/root/fs/nfs/exclude b/test-appliance/files/root/fs/nfs/exclude
index 184750fb..ef4b19bc 100644
--- a/test-appliance/files/root/fs/nfs/exclude
+++ b/test-appliance/files/root/fs/nfs/exclude
@@ -10,3 +10,14 @@ generic/477
 // failing in the expected output of the linux-nfs Wiki page.  So we'll
 // suppress this failure for now.
 generic/294
+
+#if LINUX_VERSION_CODE > KERNEL_VERSION(5,4,0)
+// There appears to be a regression that shows up sometime after 5.4.
+// LTS kernels for 4.14, 4.19, and 5.4 will terminate successfully,
+// but newer kernels will spin forever in some kind of deadlock or livelock
+// This apparently does not happen if the scratch device is > 27GB, so it
+// may be some kind of ENOSPC-related bug.
+// For more information see the e-mail thread starting at:
+// https://lore.kernel.org/r/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@mail.gmail.com/
+generic/476
+#endif
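
The #if guard in the patch above compares two integers.  As a worked
illustration, here is a small Python sketch of the conventional
encoding from the kernel's linux/version.h, where KERNEL_VERSION(a, b, c)
is (a << 16) + (b << 8) + c.  How the test appliance derives
LINUX_VERSION_CODE for its exclude files is not shown in this thread,
so the version_code helper and its include_sublevel switch are
assumptions made purely for illustration.

```python
def kernel_version(a, b, c):
    """Mirror of KERNEL_VERSION(a, b, c); recent kernels clamp c to 255."""
    return (a << 16) + (b << 8) + min(c, 255)


def version_code(release, include_sublevel=True):
    """Hypothetical helper: encode a 'major.minor.sublevel' release string."""
    major, minor, sublevel = (int(p) for p in release.split(".")[:3])
    return kernel_version(major, minor, sublevel if include_sublevel else 0)


def excluded(release, include_sublevel=True):
    """True if the guard '> KERNEL_VERSION(5,4,0)' would fire for this release."""
    return version_code(release, include_sublevel) > kernel_version(5, 4, 0)


# Columns: release, guard result with sublevel, guard result without sublevel.
for rel in ("4.14.284", "4.19.248", "5.4.199", "5.10.0"):
    print(rel, excluded(rel), excluded(rel, include_sublevel=False))
```

Under this encoding a 5.4.199 kernel compares greater than
KERNEL_VERSION(5,4,0) if the sublevel is part of LINUX_VERSION_CODE,
and equal to it if the sublevel is dropped, so the guard's effect on
5.4.y kernels depends on which form the appliance supplies.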


* Re: [Bug Report] [PATCH v1] generic/476: requires 27GB scratch size
  2022-07-21 14:50 [Bug Report] Re: [PATCH v1] generic/476: requires 27GB scratch size Theodore Ts'o
@ 2022-07-21 14:59 ` Chuck Lever III
  0 siblings, 0 replies; 2+ messages in thread
From: Chuck Lever III @ 2022-07-21 14:59 UTC
  To: Theodore Ts'o; +Cc: Linux NFS Mailing List

Hi Ted-

It's not clear from your report whether the kernel range applies
to the client's kernel or the server's kernel (in the non-loopback
case).

Since a scratch device is involved, I suspect the livelock might
be due to a problem with the NFSD filecache code introduced on or
about v5.10. There are patches pending in the NFSD for-next branch
that should address this issue. Is there a way that your tester
can try these out to confirm?


> On Jul 21, 2022, at 10:50 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> FYI, modern kernels (5.10 LTS and newer, up to and excluding
> bleeding-edge mainline kernels) are looping forever in a livelock or
> deadlock when running generic/476 on NFS, in both loopback and
> external export configurations.  This *may* be an ENOSPC-related issue.
> 
> See the referenced discussion on fstests@vger.kernel.org for more
> details.
> 
> 	 			     	      - Ted
> 
> 
> From: "Theodore Ts'o" <tytso@mit.edu>
> Subject: Re: [PATCH v1] generic/476: requires 27GB scratch size
> Date: July 21, 2022 at 10:03:45 AM EDT
> To: Boyang Xue <bxue@redhat.com>
> Cc: "Darrick J. Wong" <djwong@kernel.org>, fstests@vger.kernel.org
> 
> 
> Following up, using NFS loopback with a 5GB scratch device on a Google
> Compute Engine VM, generic/476 passes using a 4.14 LTS, 4.19 LTS, and
> 5.4 LTS kernel.  So this looks like a regression in 5.10 LTS and
> newer kernels, and so instead of patching it out of the test, I think
> the right thing to do is to add it to a kernel version-specific
> exclude file and then file a bug with the NFS folks.
> 
> KERNEL:    kernel 4.14.284-xfstests #8 SMP Tue Jul 5 08:21:37 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
> 
> nfs/loopback: 1 tests, 597 seconds
>  generic/476  Pass     595s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 595s
> 
> ---
> KERNEL:    kernel 4.19.248-xfstests #4 SMP Sat Jun 25 10:43:45 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
> 
> nfs/loopback: 1 tests, 407 seconds
>  generic/476  Pass     407s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 407s
> 
> ----
> KERNEL:    kernel 5.4.199-xfstests #21 SMP Sun Jul 3 12:15:15 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
> 
> nfs/loopback: 1 tests, 404 seconds
>  generic/476  Pass     404s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 404s
> 
> 
> See below for what I'm checking into xfstests-bld for
> {kvm,gce}-xfstests.  I don't believe we should be changing xfstests's
> generic/476, since it *does* pass with a smaller scratch device on
> older kernels, and presumably, RHEL customers would be cranky if this
> issue caused their production systems to lock up, and so it
> should be considered a kernel bug as opposed to a test bug.
> 
> 						- Ted
> 
> 
> commit 4a33b6721d5db9c07f295a10a8ad65d2a0021406
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Jul 21 09:54:50 2022 -0400
> 
>    test-appliance: add an nfs test exclusion for kernels newer than 5.4
> 
>    This is apparently an NFS bug which is visible in 5.10 LTS and newer
>    kernels, and likely appeared sometime after 5.4.  Since it causes the
>    test VM to spin forever (or at least for days), let's exclude it for
>    now.
> 
>    Link: https://lore.kernel.org/all/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@mail.gmail.com/
>    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> 
> diff --git a/test-appliance/files/root/fs/nfs/exclude b/test-appliance/files/root/fs/nfs/exclude
> index 184750fb..ef4b19bc 100644
> --- a/test-appliance/files/root/fs/nfs/exclude
> +++ b/test-appliance/files/root/fs/nfs/exclude
> @@ -10,3 +10,14 @@ generic/477
> // failing in the expected output of the linux-nfs Wiki page.  So we'll
> // suppress this failure for now.
> generic/294
> +
> +#if LINUX_VERSION_CODE > KERNEL_VERSION(5,4,0)
> +// There appears to be a regression that shows up sometime after 5.4.
> +// LTS kernels for 4.14, 4.19, and 5.4 will terminate successfully,
> +// but newer kernels will spin forever in some kind of deadlock or livelock
> +// This apparently does not happen if the scratch device is > 27GB, so it
> +// may be some kind of ENOSPC-related bug.
> +// For more information see the e-mail thread starting at:
> +// https://lore.kernel.org/r/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@mail.gmail.com/
> +generic/476
> +#endif
> 
> 

--
Chuck Lever



