linux-pci.vger.kernel.org archive mirror
* Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"
       [not found] ` <0d3206be-fae8-4bbd-4b6c-a5d1f038356d@posteo.de>
@ 2023-01-12 14:48   ` Linux kernel regression tracking (Thorsten Leemhuis)
  2023-01-12 16:42     ` Bjorn Helgaas
  2023-02-17 15:01     ` Linux regression tracking #update (Thorsten Leemhuis)
  2023-01-12 16:37   ` Keith Busch
  1 sibling, 2 replies; 5+ messages in thread
From: Linux kernel regression tracking (Thorsten Leemhuis) @ 2023-01-12 14:48 UTC
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
  Cc: Julian Groß, linux-nvme, linux-pci, regressions, LKML

[adding the nvme maintainers and the regressions mailing list to the
list of recipients]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 11.01.23 23:11, Julian Groß wrote:
> Dear Maintainer,
> 
> when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
> system seemingly randomly freezes due to the file system being set to
> read-only due to an issue with my NVMe controller.
> The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
> 
> Through network logging I am able to catch the issue:
> ```
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.259288] nvme nvme0:
> controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Does
> your device have a faulty power saving mode enabled?
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Try
> "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.331360] nvme 0000:01:00.0:
> enabling device (0000 -> 0002)
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.331458] nvme nvme0: Removing
> after probe failure status: -19
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371389] nvme0n1: detected
> capacity change from 1953525168 to 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371389] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371389] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371392] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371394] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371405] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371406] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371411] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371419] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371425] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 10, rd 0, flush 0, corrupt 0,
> gen 0
> Jan  8 14:50:16 x299-desktop kernel: [ 1461.371426] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 11, rd 0, flush 0, corrupt 0,
> gen 0
> ```
> 
> I have tried the suggestion in the log without luck.
> 
> Attached is a log that includes two system freezes, as well as a list of
> PCI(e) devices created by Debian reportbug.
> The first freeze happens at "Jan  8 04:26:28" and the second freeze
> happens at "Jan  8 14:50:16".
> 
> Currently, I am using git bisect to narrow down the window of possible
> commits, but since the issue appears seemingly random, it will take many
> months to identify the offending commit this way.
> 
> The original Debian bug report is here:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028309

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced v5.19..v6.0-rc7
#regzbot title nvme: system partially freezes with "nvme controller is down"
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already being
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


* Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"
       [not found] ` <0d3206be-fae8-4bbd-4b6c-a5d1f038356d@posteo.de>
  2023-01-12 14:48   ` Regression in Kernel 6.0: System partially freezes with "nvme controller is down" Linux kernel regression tracking (Thorsten Leemhuis)
@ 2023-01-12 16:37   ` Keith Busch
  1 sibling, 0 replies; 5+ messages in thread
From: Keith Busch @ 2023-01-12 16:37 UTC
  To: Julian Groß; +Cc: linux-nvme, linux-pci

On Wed, Jan 11, 2023 at 10:11:22PM +0000, Julian Groß wrote:
> 
> Currently, I am using git bisect to narrow down the window of possible
> commits, but since the issue appears seemingly random, it will take many
> months to identify the offending commit this way.

Unfortunately bisect may be our best option. Link loss like that usually
is a problem below the nvme driver level.
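
For readers following along, the register values in the quoted log can
be decoded mechanically. The sketch below is illustrative only and is
not part of the nvme driver; `decode` and its result keys are made-up
names. The PCI_STATUS bit values are the ones defined in
include/uapi/linux/pci_regs.h. An all-ones CSTS means the MMIO read came
back as 0xffffffff, which is what a read typically returns when the
device is not responding, and that fits the link-loss reading above.

```python
import re

# PCI Status register bits, as defined in include/uapi/linux/pci_regs.h.
PCI_STATUS_CAP_LIST = 0x0010           # capabilities list present
PCI_STATUS_PARITY = 0x0100             # master data parity error
PCI_STATUS_SIG_TARGET_ABORT = 0x0800
PCI_STATUS_REC_TARGET_ABORT = 0x1000
PCI_STATUS_REC_MASTER_ABORT = 0x2000
PCI_STATUS_SIG_SYSTEM_ERROR = 0x4000
PCI_STATUS_DETECTED_PARITY = 0x8000

PCI_STATUS_ERROR_MASK = (
    PCI_STATUS_PARITY
    | PCI_STATUS_SIG_TARGET_ABORT
    | PCI_STATUS_REC_TARGET_ABORT
    | PCI_STATUS_REC_MASTER_ABORT
    | PCI_STATUS_SIG_SYSTEM_ERROR
    | PCI_STATUS_DETECTED_PARITY
)

# Matches the register dump in the "controller is down; will reset" message.
REGS_RE = re.compile(r"CSTS=(0x[0-9a-fA-F]+), PCI_STATUS=(0x[0-9a-fA-F]+)")


def decode(line):
    """Return a dict describing the registers in `line`, or None if absent.

    `decode` is a hypothetical helper for this thread, not a kernel API.
    """
    match = REGS_RE.search(line)
    if match is None:
        return None
    csts = int(match.group(1), 16)
    pci_status = int(match.group(2), 16)
    return {
        "csts": csts,
        "pci_status": pci_status,
        # All ones: the MMIO read did not reach a responding device.
        "csts_all_ones": csts == 0xFFFFFFFF,
        "cap_list": bool(pci_status & PCI_STATUS_CAP_LIST),
        # Non-zero if any error bit is latched in PCI_STATUS.
        "error_bits": pci_status & PCI_STATUS_ERROR_MASK,
    }


# The message from the report, with the mail client's line wrap undone.
sample = ("nvme nvme0: controller is down; will reset: "
          "CSTS=0xffffffff, PCI_STATUS=0x10")
print(decode(sample))
```

For the reported line, CSTS is all ones while PCI_STATUS has only the
capabilities-list bit set and no error bits latched, so the status
register by itself does not say what went wrong.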


* Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"
  2023-01-12 14:48   ` Regression in Kernel 6.0: System partially freezes with "nvme controller is down" Linux kernel regression tracking (Thorsten Leemhuis)
@ 2023-01-12 16:42     ` Bjorn Helgaas
  2023-02-17 12:39       ` Linux regression tracking (Thorsten Leemhuis)
  2023-02-17 15:01     ` Linux regression tracking #update (Thorsten Leemhuis)
  1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2023-01-12 16:42 UTC
  To: Julian Groß
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, linux-pci, linux-kernel,
	Linux regressions mailing list

On Thu, Jan 12, 2023 at 03:48:46PM +0100, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> ...
> On 11.01.23 23:11, Julian Groß wrote:
> > Dear Maintainer,
> > 
> > when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
> > system seemingly randomly freezes due to the file system being set to
> > read-only due to an issue with my NVMe controller.
> > The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
> > 
> > Through network logging I am able to catch the issue:
> > ```
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.259288] nvme nvme0:
> > controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Does
> > your device have a faulty power saving mode enabled?
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Try
> > "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.331360] nvme 0000:01:00.0:
> > enabling device (0000 -> 0002)
> > ...
> > 
> > I have tried the suggestion in the log without luck.
> > 
> > Attached is a log that includes two system freezes, as well as a list of
> > PCI(e) devices created by Debian reportbug.
> > The first freeze happens at "Jan  8 04:26:28" and the second freeze
> > happens at "Jan  8 14:50:16".
> > 
> > Currently, I am using git bisect to narrow down the window of possible
> > commits, but since the issue appears seemingly random, it will take many
> > months to identify the offending commit this way.
> > 
> > The original Debian bug report is here:
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028309

For some reason the log [1] has very little of the kernel dmesg log.
It does seem like the freeze is partial (I see messages for hundreds
or thousands of seconds after the nvme reset), but requires a reboot
to recover.

The lspci information [2] shows the 00:1b.0 Root Port leading to the
01:00.0 NVMe device.

Is it possible to collect lspci output after the nvme freeze?  If so,
please save the output of:

  sudo lspci -vv -s00:1b.0
  sudo lspci -vv -s01:00.0

Make sure to run lspci as root so we can see the error logging
registers for these devices.

If you can collect more of the dmesg log after the freeze, e.g., via
the "dmesg" command, that might be helpful, too.
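
A minimal sketch of how the collection could be scripted, assuming the
root filesystem on the NVMe device is no longer writable after the
freeze, so the output goes to a tmpfs path instead. The output
directory /dev/shm/nvme-freeze and the helper names `build_commands`
and `collect` are assumptions for illustration; the lspci and dmesg
invocations are the ones requested above. Starting new programs may
itself fail once the device is gone, so treat this as a best-effort
helper.

```python
import subprocess
from pathlib import Path

# Devices named in this thread: the Root Port and the NVMe device behind it.
DEVICES = ["00:1b.0", "01:00.0"]


def build_commands(devices=DEVICES):
    """Map an output file name to the command whose output it will hold."""
    commands = {}
    for dev in devices:
        name = "lspci-" + dev.replace(":", "_") + ".txt"
        commands[name] = ["lspci", "-vv", "-s", dev]
    commands["dmesg.txt"] = ["dmesg"]
    return commands


def collect(out_dir="/dev/shm/nvme-freeze"):
    """Run each command and save stdout and stderr under out_dir.

    out_dir defaults to a tmpfs location (an assumption, see above) so
    nothing needs to be written to the failed device.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, cmd in build_commands().items():
        # check=False: keep going even if one command fails.
        result = subprocess.run(cmd, capture_output=True, text=True,
                                check=False)
        (out / name).write_text(result.stdout + result.stderr)
```

Call `collect()` from a Python process started as root (for example via
sudo), since lspci only shows the error logging registers to root, then
copy the resulting files off the machine before rebooting.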

Bjorn

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=1028309;filename=x299-desktop_crash.log.xz;msg=5
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=0;bug=1028309;msg=5


* Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"
  2023-01-12 16:42     ` Bjorn Helgaas
@ 2023-02-17 12:39       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 5+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-02-17 12:39 UTC
  To: Bjorn Helgaas, Julian Groß
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, linux-pci, linux-kernel,
	Linux regressions mailing list

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

I might be missing something, but it looks like this discussion stalled.
I wonder why.

Julian, did you ever share the data Bjorn asked for? Or try a
bisection, as suggested by Keith? Or did you stop caring for some
reason? Does everything maybe work fine these days?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 12.01.23 17:42, Bjorn Helgaas wrote:
> On Thu, Jan 12, 2023 at 03:48:46PM +0100, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
>> ...
>> On 11.01.23 23:11, Julian Groß wrote:
>>> Dear Maintainer,
>>>
>>> when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
>>> system seemingly randomly freezes due to the file system being set to
>>> read-only due to an issue with my NVMe controller.
>>> The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
>>>
>>> Through network logging I am able to catch the issue:
>>> ```
>>> Jan  8 14:50:16 x299-desktop kernel: [ 1461.259288] nvme nvme0:
>>> controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>>> Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Does
>>> your device have a faulty power saving mode enabled?
>>> Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Try
>>> "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>>> Jan  8 14:50:16 x299-desktop kernel: [ 1461.331360] nvme 0000:01:00.0:
>>> enabling device (0000 -> 0002)
>>> ...
>>>
>>> I have tried the suggestion in the log without luck.
>>>
>>> Attached is a log that includes two system freezes, as well as a list of
>>> PCI(e) devices created by Debian reportbug.
>>> The first freeze happens at "Jan  8 04:26:28" and the second freeze
>>> happens at "Jan  8 14:50:16".
>>>
>>> Currently, I am using git bisect to narrow down the window of possible
>>> commits, but since the issue appears seemingly random, it will take many
>>> months to identify the offending commit this way.
>>>
>>> The original Debian bug report is here:
>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028309
> 
> For some reason the log [1] has very little of the kernel dmesg log.
> It does seem like the freeze is partial (I see messages for hundreds
> or thousands of seconds after the nvme reset), but requires a reboot
> to recover.
> 
> The lspci information [2] shows the 00:1b.0 Root Port leading to the
> 01:00.0 NVMe device.
> 
> Is it possible to collect lspci output after the nvme freeze?  If so,
> please save the output of:
> 
>   sudo lspci -vv -s00:1b.0
>   sudo lspci -vv -s01:00.0
> 
> Make sure to run lspci as root so we can see the error logging
> registers for these devices.
> 
> If you can collect more of the dmesg log after the freeze, e.g., via
> the "dmesg" command, that might be helpful, too.
> 
> Bjorn
> 
> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=1028309;filename=x299-desktop_crash.log.xz;msg=5
> [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=0;bug=1028309;msg=5
> 
> 


* Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"
  2023-01-12 14:48   ` Regression in Kernel 6.0: System partially freezes with "nvme controller is down" Linux kernel regression tracking (Thorsten Leemhuis)
  2023-01-12 16:42     ` Bjorn Helgaas
@ 2023-02-17 15:01     ` Linux regression tracking #update (Thorsten Leemhuis)
  1 sibling, 0 replies; 5+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-02-17 15:01 UTC
  To: regressions; +Cc: linux-nvme, linux-pci, LKML

On 12.01.23 15:48, Linux kernel regression tracking (Thorsten Leemhuis)
wrote:
> On 11.01.23 23:11, Julian Groß wrote:
>>
>> when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
>> system seemingly randomly freezes due to the file system being set to
>> read-only due to an issue with my NVMe controller.
>> The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
>>
>> Through network logging I am able to catch the issue:
> 
> [...]
> 
> #regzbot ^introduced v5.19..v6.0-rc7
> #regzbot title nvme: system partially freezes with "nvme controller is down"
> #regzbot ignore-activity

Stop tracking this for now:

#regzbot inconclusive: stalled and might be a hw issue
#regzbot ignore-activity

For details see:

https://lore.kernel.org/all/81b5b28e-33fb-48ca-9e84-7574d5596bfb@posteo.de/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.




