From: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Zorro Lang <zlang@redhat.com>,
	"fstests@vger.kernel.org" <fstests@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Chuck Lever III <chuck.lever@oracle.com>,
	"djwong@vger.kernel.org" <djwong@vger.kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: generic/650 makes v6.0-rc client unusable
Date: Thu, 10 Nov 2022 08:46:29 +0000
Message-ID: <20221110084628.ztsdhukgtc56ih5e@shindev>
In-Reply-To: <CAL3q7H5eV9Sb1axmNgvcbG7UrgGTH3AovaibQuWMz44Jfo-8_w@mail.gmail.com>

On Nov 09, 2022 / 10:36, Filipe Manana wrote:
> On Wed, Nov 9, 2022 at 4:22 AM Shinichiro Kawasaki
> <shinichiro.kawasaki@wdc.com> wrote:
> >
> > On Sep 04, 2022 / 21:15, Zorro Lang wrote:
> > > On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> > > > While investigating some of the other issues that have been
> > > > reported lately, I've found that my v6.0-rc3 NFS/TCP client
> > > > goes off the rails often (but not always) during generic/650.
> > > >
> > > > This is the test that runs a workload while offlining and
> > > > onlining CPUs. My test client has 12 physical cores.
> > > >
> > > > The test appears to start normally, but then after a bit
> > > > the NFS server workload drops to zero and the NFS mount
> > > > disappears. I can't run programs (sudo, for example) on
> > > > the client. Can't log in, even on the console. The console
> > > > has a constant stream of "can't rotate log: Input/Output
> > > > error" type messages.
> >
> > I also observe this failure when I ran fstests using btrfs on my HDDs.
> > The failure is recreated almost always.
> 
> I'm wondering what do you get in dmesg, any traces?

The log I observed is at the end of this e-mail [1]. There is no BUG message.
The irqbalance WARNING "didn't collect load info for all cpus, balancing is
broken" is repeated, but I once observed the hang without this WARNING.

The last message logged was from XFS: "ctx ticket reservation ran out. Need to
up reservation". It refers to the system disk (nvme0n1p3 in the log), not to the
test target file system.

> I've excluded the test from my runs for over a year now, due to some
> crash that I reported
> to the mm and cpu hotplug people here:
> 
> https://lore.kernel.org/linux-mm/CAL3q7H4AyrZ5erimDyO7mOVeppd5BeMw3CS=wGbzrMZrp56ktA@mail.gmail.com/
> 
> Unfortunately I had no reply from anyone who works on or maintains those
> subsystems.
> 
> It didn't happen very often, and I haven't tested again with recent kernels.

Thanks for sharing your experience. Hmm, your failure symptom is different from
mine.


[1]

Nov 09 11:50:09 redsun40 root[3480]: run xfstest generic/650
Nov 09 11:50:09 redsun40 unknown: run fstests generic/650 at 2022-11-09 11:50:09
Nov 09 11:50:09 redsun40 systemd[1]: Started fstests-generic-650.scope - /usr/bin/bash -c test -w /proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec ./tests/generic/650.
Nov 09 11:50:11 redsun40 kernel: smpboot: CPU 10 is now offline
Nov 09 11:50:11 redsun40 kernel: MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
Nov 09 11:50:11 redsun40 kernel: smpboot: CPU 14 is now offline
Nov 09 11:50:14 redsun40 kernel: smpboot: CPU 25 is now offline
Nov 09 11:50:15 redsun40 kernel: smpboot: Booting Node 0 Processor 14 APIC 0x1c
Nov 09 11:50:15 redsun40 kernel: x86/cpu: SGX disabled by BIOS.
Nov 09 11:50:15 redsun40 kernel: x86/tme: not enabled by BIOS
Nov 09 11:50:15 redsun40 kernel: CPU0: Thermal monitoring enabled (TM1)
Nov 09 11:50:15 redsun40 kernel: x86/cpu: User Mode Instruction Prevention (UMIP) activated
Nov 09 11:50:15 redsun40 kernel: smpboot: CPU 30 is now offline
Nov 09 11:50:17 redsun40 kernel: smpboot: CPU 2 is now offline
Nov 09 11:50:19 redsun40 kernel: smpboot: CPU 20 is now offline
Nov 09 11:50:22 redsun40 kernel: smpboot: CPU 31 is now offline
Nov 09 11:50:23 redsun40 kernel: smpboot: CPU 23 is now offline
Nov 09 11:50:24 redsun40 kernel: smpboot: Booting Node 0 Processor 10 APIC 0x14
Nov 09 11:50:26 redsun40 kernel: smpboot: CPU 10 is now offline
Nov 09 11:50:28 redsun40 kernel: smpboot: Booting Node 0 Processor 20 APIC 0x9
Nov 09 11:50:29 redsun40 kernel: smpboot: CPU 21 is now offline
Nov 09 11:50:30 redsun40 kernel: smpboot: CPU 16 is now offline
Nov 09 11:50:31 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 11:50:31 redsun40 kernel: smpboot: Booting Node 0 Processor 30 APIC 0x1d
Nov 09 11:50:32 redsun40 kernel: smpboot: CPU 18 is now offline
Nov 09 11:50:33 redsun40 kernel: smpboot: Booting Node 0 Processor 2 APIC 0x4
Nov 09 11:50:34 redsun40 kernel: smpboot: CPU 4 is now offline
Nov 09 11:50:35 redsun40 kernel: smpboot: CPU 19 is now offline
Nov 09 11:50:36 redsun40 kernel: smpboot: Booting Node 0 Processor 31 APIC 0x1f
Nov 09 11:50:37 redsun40 kernel: smpboot: CPU 27 is now offline
Nov 09 11:50:38 redsun40 kernel: smpboot: CPU 26 is now offline
Nov 09 11:50:39 redsun40 kernel: smpboot: CPU 11 is now offline
Nov 09 11:50:41 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken

...

Nov 09 12:28:51 redsun40 kernel: smpboot: Booting Node 0 Processor 31 APIC 0x1f
Nov 09 12:28:52 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:28:52 redsun40 kernel: smpboot: Booting Node 0 Processor 14 APIC 0x1c
Nov 09 12:28:52 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:28:53 redsun40 kernel: smpboot: CPU 24 is now offline
Nov 09 12:28:55 redsun40 kernel: smpboot: Booting Node 0 Processor 26 APIC 0x15
Nov 09 12:28:57 redsun40 kernel: smpboot: CPU 29 is now offline
Nov 09 12:28:58 redsun40 kernel: smpboot: Booting Node 0 Processor 20 APIC 0x9
Nov 09 12:28:59 redsun40 kernel: smpboot: Booting Node 0 Processor 24 APIC 0x11
Nov 09 12:29:00 redsun40 kernel: x86: Booting SMP configuration:
Nov 09 12:29:00 redsun40 kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
Nov 09 12:29:01 redsun40 kernel: smpboot: CPU 19 is now offline
Nov 09 12:29:02 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:29:04 redsun40 kernel: smpboot: Booting Node 0 Processor 7 APIC 0xe
Nov 09 12:29:04 redsun40 kernel: smpboot: CPU 1 is now offline
Nov 09 12:29:04 redsun40 kernel: XFS (nvme0n1p3): ctx ticket reservation ran out. Need to up reservation
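
For anyone who wants to compare their own runs against the log above, here is a minimal sketch of how such a journal excerpt could be summarized. It is not part of fstests; the function name `summarize` and the `SAMPLE` list are hypothetical, and the patterns are simply taken from the message formats visible in the log lines quoted in this e-mail. It counts CPU offline and online events and irqbalance warnings, and remembers the last kernel line seen before the hang.

```python
import re
from collections import Counter

# Patterns copied from the message formats in the log excerpt above.
OFFLINE_RE = re.compile(r"smpboot: CPU (\d+) is now offline")
ONLINE_RE = re.compile(r"smpboot: Booting Node \d+ Processor (\d+) APIC")
IRQBALANCE_WARN = "didn't collect load info for all cpus"


def summarize(lines):
    """Count hotplug events and irqbalance warnings; keep the last kernel line."""
    counts = Counter()
    last_kernel_line = None
    for line in lines:
        line = line.rstrip("\n")
        if OFFLINE_RE.search(line):
            counts["offline"] += 1
        elif ONLINE_RE.search(line):
            counts["online"] += 1
        elif IRQBALANCE_WARN in line:
            counts["irqbalance_warn"] += 1
        # Any line emitted by the kernel is a candidate for "last message".
        if " kernel: " in line:
            last_kernel_line = line
    return counts, last_kernel_line


# Hypothetical sample: the final five lines of the log above.
SAMPLE = [
    "Nov 09 12:29:01 redsun40 kernel: smpboot: CPU 19 is now offline",
    "Nov 09 12:29:02 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken",
    "Nov 09 12:29:04 redsun40 kernel: smpboot: Booting Node 0 Processor 7 APIC 0xe",
    "Nov 09 12:29:04 redsun40 kernel: smpboot: CPU 1 is now offline",
    "Nov 09 12:29:04 redsun40 kernel: XFS (nvme0n1p3): ctx ticket reservation ran out. Need to up reservation",
]

counts, last = summarize(SAMPLE)
print(dict(counts))
print(last)
```

To run it on a full log, pass an open file instead of `SAMPLE`, for example `summarize(open(path))` where `path` points to a saved journal excerpt.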


-- 
Shin'ichiro Kawasaki

