linux-kernel.vger.kernel.org archive mirror
* Possible nvme regression in 6.4.11
@ 2023-08-16 20:39 Genes Lists
  2023-08-16 21:04 ` Keith Busch
  2023-08-17  3:00 ` Bagas Sanjaya
  0 siblings, 2 replies; 21+ messages in thread
From: Genes Lists @ 2023-08-16 20:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: kbusch, axboe, sagi, linux-nvme, hch


Also reported to bugzilla [1]

The failure happens on one laptop with a Samsung SSD.

Boot log manually transcribed:

kernel: nvme nvme0: controller is down; will reset: CSTS:0xffffffff, PCI_STATUS=0xffff
kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
kernel: nvme nvme0: try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme nvme0: Disabling device after reset failure: -19
mount[353]: mount /sysroot: can't read superblock on /dev/nvme0n1p5.
mount[353]:       dmesg(1) may have more information after failed mount system call.
kernel: nvme0n1: detected capacity change from 2000409264 to 0
kernel: EXT4-fs (nvme0n1p5): unable to read superblock
systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a
...
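
The register values in the first log line are all ones, which is what reads from a PCI device usually return when the device no longer responds on the bus. As a rough illustration, the check below scans log text for that signature. The function name and the regular expression are assumptions made for this sketch, not part of the nvme driver; it accepts both the "CSTS:" spelling transcribed above and "CSTS=".

```python
import re

# Matches the register dump in the nvme "controller is down" message.
# Accepts "CSTS:0x..." (as transcribed in this report) and "CSTS=0x...".
_DOWN_RE = re.compile(
    r"controller is down; will reset: "
    r"CSTS[:=](0x[0-9a-fA-F]+),\s*PCI_STATUS=(0x[0-9a-fA-F]+)"
)


def device_looks_unreachable(log_text):
    """Return True if a 'controller is down' line reports all-ones registers.

    Hypothetical helper for illustration only.
    """
    # Collapse whitespace so mail-wrapped log lines are rejoined.
    flat = " ".join(log_text.split())
    match = _DOWN_RE.search(flat)
    if match is None:
        return False
    csts = int(match.group(1), 16)
    pci_status = int(match.group(2), 16)
    # CSTS is a 32-bit register, PCI_STATUS a 16-bit one.
    return csts == 0xFFFFFFFF and pci_status == 0xFFFF
```

A match only says the device stopped answering; it does not say why.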

All kernels are upstream, untainted and compiled on Arch using:

  gcc version 13.2.1

Kernels Tested:
  - 6.4.10 - works fine
  - 6.4.11 - fails
  - 6.5-rc6 - fails
  - 6.4.11 + nvme_core.default_ps_max_latency_us=0 pcie_aspm=off - fails
  - 6.4.11 with 1 revert below - fails

     Revert "nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G"
     This reverts commit 061fbf64825fb47367bbb6e0a528611f08119473.

Hardware:
   model name      : Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
   stepping        : 9
   microcode       : 0xf4

nvme:
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe 
SSD Controller SM961/PM961/SM963
         Subsystem: Samsung Electronics Co Ltd SM963 2.5" NVMe PCIe SSD
         Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
         Memory at edb00000 (64-bit, non-prefetchable) [size=16K]
         Capabilities: [40] Power Management version 3
         Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
         Capabilities: [70] Express Endpoint, MSI 00
         Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
         Kernel driver in use: nvme




Gene

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217802

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-16 20:39 Possible nvme regression in 6.4.11 Genes Lists
@ 2023-08-16 21:04 ` Keith Busch
  2023-08-17  1:30   ` Genes Lists
  2023-08-17  3:00 ` Bagas Sanjaya
  1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2023-08-16 21:04 UTC (permalink / raw)
  To: Genes Lists; +Cc: linux-kernel, axboe, sagi, linux-nvme, hch

On Wed, Aug 16, 2023 at 04:39:34PM -0400, Genes Lists wrote:
> Also reported to bugzilla [1]
> 
> Failure happens on 1 laptop with samsung ssd.
> 
> Boot log manually transcribed:
> 
> kernel: nvme nvme0: controller is down; will reset: CSTS:0xffffffff,
> PCI_STATUS=0xffff
> kernel: nvme nvme0: Does your device have a faulty power saving mode
> enabled?
> kernel: nvme nvme0: try "nvme_core.default_ps_max_latency_us=0
> pcie_aspm=off" and report a bug
> kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0,
> device inaccessible
> kernel: nvme nvme0: Disabling device after reset failure: -19
> mount[353]: mount /sysroot: can't read superblock on /dev/nvme0n1p5.
> mount[353]:       dmesg(1) may have more information after failed mount
> system call.
> kernel: nvme0n1: detected capacity change from 2000409264 to 0
> kernel: EXT4-fs (nvme0n1p5): unable to read superblock
> systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a
> ...
> 
> All kernels are upstream, untainted and compiled on Arch using:
> 
>  gcc version 13.2.1
> 
> Kernels Tested:
>  - 6.4.10 - works fine
>  - 6.4.11 - fails
>  - 6.5-rc6 - fails
>  - 6.4.11 + nvme_core.default_ps_max_latency_us=0 pcie_aspm=off - fails
>  - 6.4.11 with 1 revert below - fails
> 
>     Revert "nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and
> 512G"
>     This reverts commit 061fbf64825fb47367bbb6e0a528611f08119473.

It sounds like you can recreate this. Since .10 worked and .11 doesn't,
could you bisect the git commits? It looks like it will take 7 steps
between those two versions.

I don't think there are any nvme specific patches that could contribute
to what you're seeing, it's more likely some lower level platform patch
if a kernel change really did cause the regression. None of the recent
commits really stood out to me, so bisect is what I'd recommend.
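
For readers wondering where a step count like this comes from: bisection halves the range of candidate commits with every build-and-boot test, so the worst case grows with the base-2 logarithm of the number of candidates. The sketch below is a simplified model with a hypothetical function name. It is not how git computes its own "roughly N steps" estimate, and the commit counts used with it are illustrative, not the actual number of commits between 6.4.10 and 6.4.11.

```python
def estimate_bisect_steps(n_candidates):
    """Worst-case number of tests needed to isolate one bad commit.

    Simplified binary-search model; n_candidates is the number of commits
    that could be the first bad one.
    """
    if n_candidates < 1:
        raise ValueError("need at least one candidate commit")
    # Each test halves the remaining range, so the answer is ceil(log2(n)).
    # (n - 1).bit_length() computes that exactly using integers only.
    return (n_candidates - 1).bit_length()
```

Under this model, a 7-step bisect corresponds to somewhere between 65 and 128 candidate commits.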

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-16 21:04 ` Keith Busch
@ 2023-08-17  1:30   ` Genes Lists
  2023-08-17  9:16     ` Genes Lists
  0 siblings, 1 reply; 21+ messages in thread
From: Genes Lists @ 2023-08-17  1:30 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-kernel, axboe, sagi, linux-nvme, hch

On 8/16/23 17:04, Keith Busch wrote:
...
> It sounds like you can recreate this. Since .10 worked and .11 doesn't,
> could you bisect the git commits? It looks like it will take 7 steps
> between those two versions.
> 
> I don't think there are any nvme specific patches that could contribute
> to what you're seeing, it's more likely some lower level platform patch
> if a kernel change really did cause the regression. None of the recent
> commits really stood out to me, so bisect is what I'd recommend.

Thank you

Bisect done. This is the result:

----------------------------------------------------------------
69304c8d285b77c9a56d68f5ddb2558f27abf406 is the first bad commit
commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
Author: Ricky WU <ricky_wu@realtek.com>
Date:   Tue Jul 25 09:10:54 2023 +0000

     misc: rtsx: judge ASPM Mode to set PETXCFG Reg

     commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.

     ASPM Mode is ASPM_MODE_CFG need to judge the value of clkreq_0
     to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
     always set to HIGH during the initialization.

     Cc: stable@vger.kernel.org
     Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
     Link: https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

  drivers/misc/cardreader/rts5227.c  |  2 +-
  drivers/misc/cardreader/rts5228.c  | 18 ------------------
  drivers/misc/cardreader/rts5249.c  |  3 +--
  drivers/misc/cardreader/rts5260.c  | 18 ------------------
  drivers/misc/cardreader/rts5261.c  | 18 ------------------
  drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
  6 files changed, 6 insertions(+), 58 deletions(-)

------------------------------------------------------

And the machine does have this hardware:

03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A 
PCI Express Card Reader (rev 01)
         Subsystem: Dell RTS525A PCI Express Card Reader
         Physical Slot: 1
         Flags: bus master, fast devsel, latency 0, IRQ 141
         Memory at ed100000 (32-bit, non-prefetchable) [size=4K]
         Capabilities: [80] Power Management version 3
         Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
         Capabilities: [b0] Express Endpoint, MSI 00
         Kernel driver in use: rtsx_pci
         Kernel modules: rtsx_pci
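
When comparing several devices, it can help to reduce `lspci -v` style output to a map from PCI slot to the driver bound to it. The parser below is a minimal sketch written for this thread. The function name is hypothetical, and it assumes the layout seen in the listings here: a slot address at the start of each record and an indented "Kernel driver in use:" line. Mail-wrapped continuation lines are ignored.

```python
import re

# A record starts with a slot address such as "03:00.0" in column 0.
_SLOT_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# The bound driver appears on an indented line inside the record.
_DRIVER_RE = re.compile(r"^\s+Kernel driver in use:\s*(\S+)")


def drivers_by_slot(lspci_text):
    """Map each PCI slot in lspci -v style text to its bound driver (or None)."""
    result = {}
    slot = None
    for line in lspci_text.splitlines():
        slot_match = _SLOT_RE.match(line)
        if slot_match:
            slot = slot_match.group(1)
            result[slot] = None  # no driver seen yet for this record
            continue
        driver_match = _DRIVER_RE.match(line)
        if driver_match and slot is not None:
            result[slot] = driver_match.group(1)
    return result
```

Applied to the two listings in this thread, it would report `rtsx_pci` for 03:00.0 and `nvme` for 04:00.0.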




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-16 20:39 Possible nvme regression in 6.4.11 Genes Lists
  2023-08-16 21:04 ` Keith Busch
@ 2023-08-17  3:00 ` Bagas Sanjaya
  2023-09-29 11:49   ` Linux regression tracking #update (Thorsten Leemhuis)
  1 sibling, 1 reply; 21+ messages in thread
From: Bagas Sanjaya @ 2023-08-17  3:00 UTC (permalink / raw)
  To: Genes Lists, linux-kernel, Ricky WU, Arnd Bergmann
  Cc: kbusch, axboe, sagi, linux-nvme, hch, Linux Regressions

On Wed, Aug 16, 2023 at 04:39:34PM -0400, Genes Lists wrote:
> 
> Also reported to bugzilla [1]
> 
> Failure happens on 1 laptop with samsung ssd.
> 
> Boot log manually transcribed:
> 
> kernel: nvme nvme0: controller is down; will reset: CSTS:0xffffffff,
> PCI_STATUS=0xffff
> kernel: nvme nvme0: Does your device have a faulty power saving mode
> enabled?
> kernel: nvme nvme0: try "nvme_core.default_ps_max_latency_us=0
> pcie_aspm=off" and report a bug
> kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0,
> device inaccessible
> kernel: nvme nvme0: Disabling device after reset failure: -19
> mount[353]: mount /sysroot: can't read superblock on /dev/nvme0n1p5.
> mount[353]:       dmesg(1) may have more information after failed mount
> system call.
> kernel: nvme0n1: detected capacity change from 2000409264 to 0
> kernel: EXT4-fs (nvme0n1p5): unable to read superblock
> systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a
> ...
> 
> All kernels are upstream, untainted and compiled on Arch using:
> 
>  gcc version 13.2.1
> 
> Kernels Tested:
>  - 6.4.10 - works fine
>  - 6.4.11 - fails
>  - 6.5-rc6 - fails
>  - 6.4.11 + nvme_core.default_ps_max_latency_us=0 pcie_aspm=off - fails
>  - 6.4.11 with 1 revert below - fails
> 
>     Revert "nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and
> 512G"
>     This reverts commit 061fbf64825fb47367bbb6e0a528611f08119473.
> 
> Hardware:
>   model name      : Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
>   stepping        : 9
>   microcode       : 0xf4
> 
> nvme:
> 04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD
> Controller SM961/PM961/SM963
>         Subsystem: Samsung Electronics Co Ltd SM963 2.5" NVMe PCIe SSD
>         Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
>         Memory at edb00000 (64-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Power Management version 3
>         Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
>         Capabilities: [70] Express Endpoint, MSI 00
>         Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
>         Kernel driver in use: nvme
> 

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 101bd907b4244a
#regzbot title: can't change Samsung SSD power state due to ASPM mode checking
#regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=217802

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-17  1:30   ` Genes Lists
@ 2023-08-17  9:16     ` Genes Lists
  2023-08-17 17:28       ` Keith Busch
  2023-08-23 17:41       ` Keith Busch
  0 siblings, 2 replies; 21+ messages in thread
From: Genes Lists @ 2023-08-17  9:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, ricky_wu, gregkh



On 8/16/23 21:30, Genes Lists wrote:

> On 8/16/23 17:04, Keith Busch wrote:
> ...
>> It sounds like you can recreate this. Since .10 worked and .11 doesn't,
>> could you bisect the git commits? It looks like it will take 7 steps
>> between those two versions.
>>
>> I don't think there are any nvme specific patches that could contribute
>> to what you're seeing, it's more likely some lower level platform patch
>> if a kernel change really did cause the regression. None of the recent
>> commits really stood out to me, so bisect is what I'd recommend.
> 
> Thank you
> 
> Bisect done - This is result:
> 
> ----------------------------------------------------------------
> 69304c8d285b77c9a56d68f5ddb2558f27abf406 is the first bad commit
> commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
> Author: Ricky WU <ricky_wu@realtek.com>
> Date:   Tue Jul 25 09:10:54 2023 +0000
> 
>      misc: rtsx: judge ASPM Mode to set PETXCFG Reg
> 
>      commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.
> 
>      ASPM Mode is ASPM_MODE_CFG need to judge the value of clkreq_0
>      to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
>      always set to HIGH during the initialization.
> 
>      Cc: stable@vger.kernel.org
>      Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
>      Link: 
> https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
>   drivers/misc/cardreader/rts5227.c  |  2 +-
>   drivers/misc/cardreader/rts5228.c  | 18 ------------------
>   drivers/misc/cardreader/rts5249.c  |  3 +--
>   drivers/misc/cardreader/rts5260.c  | 18 ------------------
>   drivers/misc/cardreader/rts5261.c  | 18 ------------------
>   drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
>   6 files changed, 6 insertions(+), 58 deletions(-)
> 
> ------------------------------------------------------
> 
> And the machine does have this hardware:
> 
> 03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A 
> PCI Express Card Reader (rev 01)
>          Subsystem: Dell RTS525A PCI Express Card Reader
>          Physical Slot: 1
>          Flags: bus master, fast devsel, latency 0, IRQ 141
>          Memory at ed100000 (32-bit, non-prefetchable) [size=4K]
>          Capabilities: [80] Power Management version 3
>          Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
>          Capabilities: [b0] Express Endpoint, MSI 00
>          Kernel driver in use: rtsx_pci
>          Kernel modules: rtsx_pci
> 
> 
> 


Adding to CC list since bisect landed on

    drivers/misc/cardreader/rtsx_pcr.c

Thread starts here: https://lkml.org/lkml/2023/8/16/1154

Thank you,

gene


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-17  9:16     ` Genes Lists
@ 2023-08-17 17:28       ` Keith Busch
  2023-08-17 17:43         ` Genes Lists
  2023-08-23 17:41       ` Keith Busch
  1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2023-08-17 17:28 UTC (permalink / raw)
  To: Genes Lists
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, ricky_wu, gregkh

On Thu, Aug 17, 2023 at 05:16:01AM -0400, Genes Lists wrote:
> On 8/16/23 21:30, Genes Lists wrote:
> > On 8/16/23 17:04, Keith Busch wrote:
> > ...
> > > It sounds like you can recreate this. Since .10 worked and .11 doesn't,
> > > could you bisect the git commits? It looks like it will take 7 steps
> > > between those two versions.
> > > 
> > > I don't think there are any nvme specific patches that could contribute
> > > to what you're seeing, it's more likely some lower level platform patch
> > > if a kernel change really did cause the regression. None of the recent
> > > commits really stood out to me, so bisect is what I'd recommend.
> > 
> > Thank you
> > 
> > Bisect done - This is result:

Sounds like the driver's ASPM suspicion was justified, however the
recommended work-around doesn't appear to apply to this hardware.
Thanks for running the bisect!

> > ----------------------------------------------------------------
> > 69304c8d285b77c9a56d68f5ddb2558f27abf406 is the first bad commit
> > commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
> > Author: Ricky WU <ricky_wu@realtek.com>
> > Date:   Tue Jul 25 09:10:54 2023 +0000
> > 
> >      misc: rtsx: judge ASPM Mode to set PETXCFG Reg
> > 
> >      commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.
> > 
> >      ASPM Mode is ASPM_MODE_CFG need to judge the value of clkreq_0
> >      to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
> >      always set to HIGH during the initialization.
> > 
> >      Cc: stable@vger.kernel.org
> >      Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
> >      Link:
> > https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
> >      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > 
> >   drivers/misc/cardreader/rts5227.c  |  2 +-
> >   drivers/misc/cardreader/rts5228.c  | 18 ------------------
> >   drivers/misc/cardreader/rts5249.c  |  3 +--
> >   drivers/misc/cardreader/rts5260.c  | 18 ------------------
> >   drivers/misc/cardreader/rts5261.c  | 18 ------------------
> >   drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
> >   6 files changed, 6 insertions(+), 58 deletions(-)
> > 
> > ------------------------------------------------------
> > 
> > And the machine does have this hardware:
> > 
> > 03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A
> > PCI Express Card Reader (rev 01)
> >          Subsystem: Dell RTS525A PCI Express Card Reader
> >          Physical Slot: 1
> >          Flags: bus master, fast devsel, latency 0, IRQ 141
> >          Memory at ed100000 (32-bit, non-prefetchable) [size=4K]
> >          Capabilities: [80] Power Management version 3
> >          Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >          Capabilities: [b0] Express Endpoint, MSI 00
> >          Kernel driver in use: rtsx_pci
> >          Kernel modules: rtsx_pci

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-17 17:28       ` Keith Busch
@ 2023-08-17 17:43         ` Genes Lists
  0 siblings, 0 replies; 21+ messages in thread
From: Genes Lists @ 2023-08-17 17:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, ricky_wu, gregkh

On 8/17/23 13:28, Keith Busch wrote:
...
>>>
>>> Bisect done - This is result:
> 
> Sounds like the driver's ASPM suspicion was justified, however the
> recommended work-around doesn't appear to apply to this hardware.
> Thanks for running the bisect!

Happy to help :)

gene


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-17  9:16     ` Genes Lists
  2023-08-17 17:28       ` Keith Busch
@ 2023-08-23 17:41       ` Keith Busch
  2023-08-23 20:25         ` Genes Lists
  2023-08-24 11:29         ` Genes Lists
  1 sibling, 2 replies; 21+ messages in thread
From: Keith Busch @ 2023-08-23 17:41 UTC (permalink / raw)
  To: Genes Lists
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, ricky_wu, gregkh

On Thu, Aug 17, 2023 at 05:16:01AM -0400, Genes Lists wrote:
> > ----------------------------------------------------------------
> > 69304c8d285b77c9a56d68f5ddb2558f27abf406 is the first bad commit
> > commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
> > Author: Ricky WU <ricky_wu@realtek.com>
> > Date:   Tue Jul 25 09:10:54 2023 +0000
> > 
> >      misc: rtsx: judge ASPM Mode to set PETXCFG Reg
> > 
> >      commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.
> > 
> >      ASPM Mode is ASPM_MODE_CFG need to judge the value of clkreq_0
> >      to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
> >      always set to HIGH during the initialization.
> > 
> >      Cc: stable@vger.kernel.org
> >      Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
> >      Link:
> > https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
> >      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > 
> >   drivers/misc/cardreader/rts5227.c  |  2 +-
> >   drivers/misc/cardreader/rts5228.c  | 18 ------------------
> >   drivers/misc/cardreader/rts5249.c  |  3 +--
> >   drivers/misc/cardreader/rts5260.c  | 18 ------------------
> >   drivers/misc/cardreader/rts5261.c  | 18 ------------------
> >   drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
> >   6 files changed, 6 insertions(+), 58 deletions(-)
> > 
> > ------------------------------------------------------
> > 
> > And the machine does have this hardware:
> > 
> > 03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A
> > PCI Express Card Reader (rev 01)
> >          Subsystem: Dell RTS525A PCI Express Card Reader
> >          Physical Slot: 1
> >          Flags: bus master, fast devsel, latency 0, IRQ 141
> >          Memory at ed100000 (32-bit, non-prefetchable) [size=4K]
> >          Capabilities: [80] Power Management version 3
> >          Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >          Capabilities: [b0] Express Endpoint, MSI 00
> >          Kernel driver in use: rtsx_pci
> >          Kernel modules: rtsx_pci
> > 
> > 
> > 
> 
> 
> Adding to CC list since bisect landed on
> 
>    drivers/misc/cardreader/rtsx_pcr.c
> 
> Thread starts here: https://lkml.org/lkml/2023/8/16/1154

I realize you can work around this by blacklisting the rtsx_pci module, but
that's not a pleasant solution. With only a few days left in 6.5, should
the commit just be reverted?
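
For anyone applying the workaround in the meantime: blacklisting is normally done with a `blacklist rtsx_pci` line in a file under /etc/modprobe.d/. The sketch below checks modprobe.d-style text for such a line. The function name is hypothetical and the parsing is deliberately minimal. Since the failure here happens while mounting /sysroot, the configuration inside the initramfs may matter as well; that is an assumption about the affected setup, not something confirmed in this thread.

```python
def is_blacklisted(conf_text, module):
    """Return True if modprobe.d-style text contains 'blacklist <module>'.

    Minimal sketch: handles comment lines and treats '-' and '_' in module
    names as equivalent, as modprobe does.
    """
    wanted = module.replace("-", "_")
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            if parts[1].replace("-", "_") == wanted:
                return True
    return False
```

A `blacklist` line only stops the module from being loaded automatically through its aliases; an explicit `modprobe rtsx_pci` would still load it.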

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-23 17:41       ` Keith Busch
@ 2023-08-23 20:25         ` Genes Lists
  2023-08-24  2:44           ` Ricky WU
  2023-08-24 11:29         ` Genes Lists
  1 sibling, 1 reply; 21+ messages in thread
From: Genes Lists @ 2023-08-23 20:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, ricky_wu, gregkh

On 8/23/23 13:41, Keith Busch wrote:
> On Thu, Aug 17, 2023 at 05:16:01AM -0400, Genes Lists wrote:
>>> ----------------------------------------------------------------
>>> 69304c8d285b77c9a56d68f5ddb2558f27abf406 is the first bad commit
>>> commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
>>> Author: Ricky WU <ricky_wu@realtek.com>
>>> Date:   Tue Jul 25 09:10:54 2023 +0000
>>>
>>>       misc: rtsx: judge ASPM Mode to set PETXCFG Reg
>>>
>>>       commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.
>>>
>>>       ASPM Mode is ASPM_MODE_CFG need to judge the value of clkreq_0
>>>       to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
>>>       always set to HIGH during the initialization.
>>>
>>>       Cc: stable@vger.kernel.org
>>>       Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
>>>       Link:
>>> https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
>>>       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>
>>>    drivers/misc/cardreader/rts5227.c  |  2 +-
>>>    drivers/misc/cardreader/rts5228.c  | 18 ------------------
>>>    drivers/misc/cardreader/rts5249.c  |  3 +--
>>>    drivers/misc/cardreader/rts5260.c  | 18 ------------------
>>>    drivers/misc/cardreader/rts5261.c  | 18 ------------------
>>>    drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
>>>    6 files changed, 6 insertions(+), 58 deletions(-)
>>>
>>> ------------------------------------------------------
>>>
>>> And the machine does have this hardware:
>>>
>>> 03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A
>>> PCI Express Card Reader (rev 01)
>>>           Subsystem: Dell RTS525A PCI Express Card Reader
>>>           Physical Slot: 1
>>>           Flags: bus master, fast devsel, latency 0, IRQ 141
>>>           Memory at ed100000 (32-bit, non-prefetchable) [size=4K]
>>>           Capabilities: [80] Power Management version 3
>>>           Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>>           Capabilities: [b0] Express Endpoint, MSI 00
>>>           Kernel driver in use: rtsx_pci
>>>           Kernel modules: rtsx_pci
>>>
>>>
>>>
>>
>>
>> Adding to CC list since bisect landed on
>>
>>     drivers/misc/cardreader/rtsx_pcr.c
>>
>> Thread starts here: https://lkml.org/lkml/2023/8/16/1154
> 
> I realize you can work around this by blacklisting the rtsx_pci, but
> that's not a pleasant solution. With only a few days left in 6.5, should
> the commit just be reverted?

Keith - thanks for reminder.

The card reader device itself is non-critical and very low priority.

What perhaps is a little more worrisome is the change in rtsx somehow 
prevented nvme from functioning normally and the machine then not 
booting (at least for some combination(s) of hardware).

If there is a simple fix to prevent nvme from being impacted by the rtsx 
driver that would be more than sufficient?

On the other hand 6.4.11 is out, and I'm guessing there isn't a lot of 
noise on this either. From what I've seen, 1 other user with same 
problem [1] and 1 with same card reader not having a problem [2].
And no 'me-too's in the kernel bugzilla [3] either.


Gene


[1] https://bbs.archlinux.org/viewtopic.php?id=288095
[2] https://bugs.archlinux.org/task/79439
[3] https://bugzilla.kernel.org/show_bug.cgi?id=217802




^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Possible nvme regression in 6.4.11
  2023-08-23 20:25         ` Genes Lists
@ 2023-08-24  2:44           ` Ricky WU
  2023-08-24  9:48             ` Genes Lists
  0 siblings, 1 reply; 21+ messages in thread
From: Ricky WU @ 2023-08-24  2:44 UTC (permalink / raw)
  To: Genes Lists, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh

Hi Gene,

I can't reproduce this issue on my side...

So if you revert only this patch (69304c8d285b77c9a56d68f5ddb2558f27abf406), does it work fine?
All this patch does is pull our clock request HIGH; if the HOST needs it, it can also be pulled LOW. This is done only on our device.
I don’t think this will affect other ports...

BR,
Ricky

> -----Original Message-----
> From: Genes Lists <lists@sapience.com>
> Sent: Thursday, August 24, 2023 4:25 AM
> To: Keith Busch <kbusch@kernel.org>
> Cc: linux-kernel@vger.kernel.org; axboe@kernel.dk; sagi@grimberg.me;
> linux-nvme@lists.infradead.org; hch@lst.de; arnd@arndb.de; Ricky WU
> <ricky_wu@realtek.com>; gregkh@linuxfoundation.org
> Subject: Re: Possible nvme regression in 6.4.11
> 
> 
> 
> On 8/23/23 13:41, Keith Busch wrote:
> > On Thu, Aug 17, 2023 at 05:16:01AM -0400, Genes Lists wrote:
> >>> ----------------------------------------------------------------
> >>> 69304c8d285b77c9a56d68f5ddb2558f27abf406 is the first bad commit
> >>> commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
> >>> Author: Ricky WU <ricky_wu@realtek.com>
> >>> Date:   Tue Jul 25 09:10:54 2023 +0000
> >>>
> >>>       misc: rtsx: judge ASPM Mode to set PETXCFG Reg
> >>>
> >>>       commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.
> >>>
> >>>       ASPM Mode is ASPM_MODE_CFG need to judge the value of
> clkreq_0
> >>>       to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
> >>>       always set to HIGH during the initialization.
> >>>
> >>>       Cc: stable@vger.kernel.org
> >>>       Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
> >>>       Link:
> >>>
> https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
> >>>       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>>
> >>>    drivers/misc/cardreader/rts5227.c  |  2 +-
> >>>    drivers/misc/cardreader/rts5228.c  | 18 ------------------
> >>>    drivers/misc/cardreader/rts5249.c  |  3 +--
> >>>    drivers/misc/cardreader/rts5260.c  | 18 ------------------
> >>>    drivers/misc/cardreader/rts5261.c  | 18 ------------------
> >>>    drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
> >>>    6 files changed, 6 insertions(+), 58 deletions(-)
> >>>
> >>> ------------------------------------------------------
> >>>
> >>> And the machine does have this hardware:
> >>>
> >>> 03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd.
> >>> RTS525A PCI Express Card Reader (rev 01)
> >>>           Subsystem: Dell RTS525A PCI Express Card Reader
> >>>           Physical Slot: 1
> >>>           Flags: bus master, fast devsel, latency 0, IRQ 141
> >>>           Memory at ed100000 (32-bit, non-prefetchable) [size=4K]
> >>>           Capabilities: [80] Power Management version 3
> >>>           Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >>>           Capabilities: [b0] Express Endpoint, MSI 00
> >>>           Kernel driver in use: rtsx_pci
> >>>           Kernel modules: rtsx_pci
> >>>
> >>>
> >>>
> >>
> >>
> >> Adding to CC list since bisect landed on
> >>
> >>     drivers/misc/cardreader/rtsx_pcr.c
> >>
> >> Thread starts here: https://lkml.org/lkml/2023/8/16/1154
> >
> > I realize you can work around this by blacklisting the rtsx_pci, but
> > that's not a pleasant solution. With only a few days left in 6.5,
> > should the commit just be reverted?
> 
> Keith - thanks for reminder.
> 
> The card reader device itself is non-critical and very low priority.
> 
> What perhaps is a little more worrisome is the change in rtsx somehow
> prevented nvme from functioning normally and the machine then not booting
> (at least for some combination(s) of hardware).
> 
> If there is a simple fix to prevent nvme from being impacted by the rtsx driver
> that would be more than sufficient?
> 
> On the other hand 6.4.11 is out, and I'm guessing there isn't a lot of noise on
> this either. From what I've seen, 1 other user with same problem [1] and 1 with
> same card reader not having a problem [2].
> And no 'me-too's in the kernel bugzilla [3] either.
> 
> 
> Gene
> 
> 
> [1] https://bbs.archlinux.org/viewtopic.php?id=288095
> [2] https://bugs.archlinux.org/task/79439
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=217802
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-24  2:44           ` Ricky WU
@ 2023-08-24  9:48             ` Genes Lists
  2023-08-24 10:22               ` Genes Lists
  0 siblings, 1 reply; 21+ messages in thread
From: Genes Lists @ 2023-08-24  9:48 UTC (permalink / raw)
  To: Ricky WU, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh

On 8/23/23 22:44, Ricky WU wrote:
> Hi Gene,
> 
> I can't reproduce this issue on my side...
> 
> So if you revert only this patch (69304c8d285b77c9a56d68f5ddb2558f27abf406), does it work fine?
> All this patch does is pull our clock request HIGH; if the HOST needs it, it can also be pulled LOW. This is done only on our device.
> I don’t think this will affect other ports...
> 
> BR,
> Ricky

Thanks Ricky - I will test reverting just that commit and report back.  I 
won't be able to get to it till later today (sometime after 2pm EDT) but 
I will do it today.

FYI, I see one more report of someone experiencing the same problem [1]

gene

  [1] https://bugs.archlinux.org/task/79439



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-24  9:48             ` Genes Lists
@ 2023-08-24 10:22               ` Genes Lists
  2023-08-25 12:51                 ` Ricky WU
  0 siblings, 1 reply; 21+ messages in thread
From: Genes Lists @ 2023-08-24 10:22 UTC (permalink / raw)
  To: Ricky WU, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh

On 8/24/23 05:48, Genes Lists wrote:
> On 8/23/23 22:44, Ricky WU wrote:
>> Hi Gene,
>>
>> I can't reproduce this issue on my side...
>>
>> So if you revert only this patch 
>> (69304c8d285b77c9a56d68f5ddb2558f27abf406), does it work fine?
>> All this patch does is pull our clock request HIGH; if the HOST needs 
>> it, it can also be pulled LOW. This is done only on our device.
>> I don’t think this will affect other ports...
>>
>> BR,
>> Ricky
> 
> Thanks Ricky - I will test reverting just that commit and report back.  I 
> won't be able to get to it till later today (sometime after 2pm EDT) but 
> I will do it today.
> 
> FYI, I see one more report of someone experiencing the same problem [1]
> 
> gene
> 
>   [1] https://bugs.archlinux.org/task/79439
> 
> 

That commit was the one reverted in the last step of the git bisect, 
and indeed reverting it makes the problem go away; the machine then 
boots fine.

thanks

gene



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-23 17:41       ` Keith Busch
  2023-08-23 20:25         ` Genes Lists
@ 2023-08-24 11:29         ` Genes Lists
  1 sibling, 0 replies; 21+ messages in thread
From: Genes Lists @ 2023-08-24 11:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, ricky_wu, gregkh

On 8/23/23 13:41, Keith Busch wrote:
> On Thu, Aug 17, 2023 at 05:16:01AM -0400, Genes Lists wrote:
>>> ----------------------------------------------------------------
>>> 69304c8d285b77c9a56d68f5ddb2558f27abf406 is the first bad commit
>>> commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
>>> Author: Ricky WU <ricky_wu@realtek.com>
>>> Date:   Tue Jul 25 09:10:54 2023 +0000
>>>
>>>       misc: rtsx: judge ASPM Mode to set PETXCFG Reg
>>>
>>>       commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.
>>>
>>>       ASPM Mode is ASPM_MODE_CFG need to judge the value of clkreq_0
>>>       to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
>>>       always set to HIGH during the initialization.
>>>
>>>       Cc: stable@vger.kernel.org
>>>       Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
>>>       Link:
>>> https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
>>>       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>
>>>    drivers/misc/cardreader/rts5227.c  |  2 +-
>>>    drivers/misc/cardreader/rts5228.c  | 18 ------------------
>>>    drivers/misc/cardreader/rts5249.c  |  3 +--
>>>    drivers/misc/cardreader/rts5260.c  | 18 ------------------
>>>    drivers/misc/cardreader/rts5261.c  | 18 ------------------
>>>    drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
>>>    6 files changed, 6 insertions(+), 58 deletions(-)
>>>
>>> ------------------------------------------------------
>>>
>>> And the machine does have this hardware:
>>>
>>> 03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A
>>> PCI Express Card Reader (rev 01)
>>>           Subsystem: Dell RTS525A PCI Express Card Reader
>>>           Physical Slot: 1
>>>           Flags: bus master, fast devsel, latency 0, IRQ 141
>>>           Memory at ed100000 (32-bit, non-prefetchable) [size=4K]
>>>           Capabilities: [80] Power Management version 3
>>>           Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>>           Capabilities: [b0] Express Endpoint, MSI 00
>>>           Kernel driver in use: rtsx_pci
>>>           Kernel modules: rtsx_pci
>>>
>>>
>>>
>>
>>
>> Adding to CC list since bisect landed on
>>
>>     drivers/misc/cardreader/rtsx_pcr.c
>>
>> Thread starts here: https://lkml.org/lkml/2023/8/16/1154
> 
> I realize you can work around this by blacklisting the rtsx_pci, but
> that's not a pleasant solution. With only a few days left in 6.5, should
> the commit just be reverted?


Looks like there are more people having the same problem than I was aware 
of earlier [1].

My recommendation now is to revert this.

thanks

gene

[1] https://bugs.archlinux.org/task/79439#comment221262



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Possible nvme regression in 6.4.11
  2023-08-24 10:22               ` Genes Lists
@ 2023-08-25 12:51                 ` Ricky WU
  2023-08-30 21:09                   ` Genes Lists
  0 siblings, 1 reply; 21+ messages in thread
From: Ricky WU @ 2023-08-25 12:51 UTC (permalink / raw)
  To: Genes Lists, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh





>On 8/24/23 05:48, Genes Lists wrote:
>> On 8/23/23 22:44, Ricky WU wrote:
>>> Hi Gene,
>>>
>>> I can't reproduce this issue on my side...
>>>
>>> So if you only revert this patch
>>> (69304c8d285b77c9a56d68f5ddb2558f27abf406) can work fine?
>>> This patch only do is pull our clock request to HIGH if HOST need also
>>> can pull to LOW, and this only do on our device
>>> I don’t think this will affect other ports...
>>>
>>> BR,
>>> Ricky
>>
>> Thanks Ricky - I will test reverting just that commit and report back.  I
>> won't be able to get to it till later today (sometime after 2pm EDT) but
>> I will do it today.
>>
>> FYI, I see one more report of someone experiencing the same problem [1]
>>
>> gene
>>
>>   [1] https://bugs.archlinux.org/task/79439
>>
>>
>
>That commit was what was reverted in the last step of the git bisect -
>and indeed reverting that commit makes the problem go away and machine
>then boots fine.
>
>thanks
>
>gene

I think this may be a system power-saving issue.
Previously, if the BIOS (config space) did not enable L1 substates, our driver kept driving CLKREQ# low when the host wanted to enter a power-saving state, which prevented the whole system from entering that state.
With this patch we release CLKREQ# to the host, so the whole system can enter the power-saving state when the host wants to. I don't know why your system cannot wake up successfully from the power-saving state on this platform.

Ricky
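
The behaviour change described above can be sketched as a small standalone
model, assuming the conditions visible in the revert patch posted later in
this thread (the rts5228.c and rts5249.c hunks). The enum values, the
`l1ss_enabled` parameter and the function names are hypothetical and only
illustrate the decision; this is not driver code.

```c
#include <stdbool.h>

/* ASPM mode names mirror the diff; the numeric values are illustrative. */
enum aspm_mode { ASPM_MODE_CFG, ASPM_MODE_REG };

/* Hypothetical model of the CLKREQ# level the driver requests. */
enum clkreq_level { CLKREQ_LOW, CLKREQ_HIGH };

/*
 * Old behaviour (rts5228/rts5260/rts5261 hunks): force CLKREQ# low only
 * when no L1 substate is enabled in config space.
 */
static bool derive_force_clkreq_0(bool l1ss_enabled)
{
	return !l1ss_enabled;
}

/* Old behaviour (rts5227/rts5249 hunks): the ASPM mode is not consulted. */
static enum clkreq_level clkreq_old(bool force_clkreq_0)
{
	return force_clkreq_0 ? CLKREQ_LOW : CLKREQ_HIGH;
}

/* With commit 101bd907b424: driven low only in ASPM_MODE_CFG. */
static enum clkreq_level clkreq_new(bool force_clkreq_0, enum aspm_mode mode)
{
	if (force_clkreq_0 && mode == ASPM_MODE_CFG)
		return CLKREQ_LOW;
	return CLKREQ_HIGH;
}
```

In this model the two versions differ only when force_clkreq_0 is true and
the mode is ASPM_MODE_REG: the old code drives CLKREQ# low, while the newer
code releases it. Which mode a given card reader uses is not stated in this
thread.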
    

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-25 12:51                 ` Ricky WU
@ 2023-08-30 21:09                   ` Genes Lists
  2023-09-11  8:02                     ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 21+ messages in thread
From: Genes Lists @ 2023-08-30 21:09 UTC (permalink / raw)
  To: Ricky WU, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh

...
> I think this may be a system power-saving issue.
> Previously, if the BIOS (config space) did not enable L1 substates, our driver kept driving CLKREQ# low when the host wanted to enter a power-saving state, which prevented the whole system from entering that state.
> With this patch we release CLKREQ# to the host, so the whole system can enter the power-saving state when the host wants to. I don't know why your system cannot wake up successfully from the power-saving state on this platform.
> 
> Ricky
>      

Hi

    Thanks for continuing to look into this. Can you share your thoughts 
on best way to proceed going forward - do you plan to revert or 
something else?

thanks

gene


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-30 21:09                   ` Genes Lists
@ 2023-09-11  8:02                     ` Linux regression tracking (Thorsten Leemhuis)
  2023-09-11 11:38                       ` Thorsten Leemhuis
  2023-09-11 15:41                       ` Possible nvme regression in 6.4.11 Augusto Zanellato
  0 siblings, 2 replies; 21+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-09-11  8:02 UTC (permalink / raw)
  To: Genes Lists, Ricky WU, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh,
	Linux kernel regressions list

Hi, Thorsten here, the Linux kernel's regression tracker.

On 30.08.23 23:09, Genes Lists wrote:
> ...
>> I think this may be a system power-saving issue.
>> Previously, if the BIOS (config space) did not enable L1 substates,
>> our driver kept driving CLKREQ# low when the host wanted to enter a
>> power-saving state, which prevented the whole system from entering
>> that state.
>> With this patch we release CLKREQ# to the host, so the whole system
>> can enter the power-saving state when the host wants to. I don't know
>> why your system cannot wake up successfully from the power-saving
>> state on this platform.
> 
>    Thanks for continuing to look into this. Can you share your thoughts
> on best way to proceed going forward - do you plan to revert or
> something else?

Hmmm. This looks like it fell through the cracks. Or am I missing something?

Anyway, 6.4.y will likely be EOL in a week or two. Which raises the
question: are 6.5.y and 6.6-rc1 working better for you? From the
bugzilla ticket (https://bugzilla.kernel.org/show_bug.cgi?id=217802) and
comments from others that are affected it sounds like that's not the
case. If that's how it is, I guess it's overdue that 101bd907b4244a
("misc: rtsx: judge ASPM Mode to set PETXCFG Reg") is reverted. Or am I
missing something?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-09-11  8:02                     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-09-11 11:38                       ` Thorsten Leemhuis
  2023-09-18 17:07                         ` [Revert] " Jade Lovelace
  2023-09-18 17:07                         ` [PATCH] Revert "misc: rtsx: judge ASPM Mode to set PETXCFG Reg" Jade Lovelace
  2023-09-11 15:41                       ` Possible nvme regression in 6.4.11 Augusto Zanellato
  1 sibling, 2 replies; 21+ messages in thread
From: Thorsten Leemhuis @ 2023-09-11 11:38 UTC (permalink / raw)
  To: Genes Lists, Ricky WU, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh,
	Linux kernel regressions list

On 11.09.23 10:02, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
> 
> On 30.08.23 23:09, Genes Lists wrote:
>> ...
>>> I think this may be a system power-saving issue.
>>> Previously, if the BIOS (config space) did not enable L1 substates,
>>> our driver kept driving CLKREQ# low when the host wanted to enter a
>>> power-saving state, which prevented the whole system from entering
>>> that state.
>>> With this patch we release CLKREQ# to the host, so the whole system
>>> can enter the power-saving state when the host wants to. I don't know
>>> why your system cannot wake up successfully from the power-saving
>>> state on this platform.
>>
>>    Thanks for continuing to look into this. Can you share your thoughts
>> on best way to proceed going forward - do you plan to revert or
>> something else?
> 
> Hmmm. This looks like it fell through the cracks. Or am I missing something?
> 
> Anyway, 6.4.y will likely be EOL in a week or two. Which raises the
> question: are 6.5.y and 6.6-rc1 working better for you? From the
> bugzilla ticket (https://bugzilla.kernel.org/show_bug.cgi?id=217802) and
> comments from others that are affected it sounds like that's not the
> case. If that's how it is, I guess it's overdue that 101bd907b4244a
> ("misc: rtsx: judge ASPM Mode to set PETXCFG Reg") is reverted. Or am I
> missing something?

According to feedback in bugzilla.kernel.org 6.5.y is affected as well.
And openSUSE apparently reverted the culprit about a week ago due to the
problems it causes:
https://bugzilla.suse.com/show_bug.cgi?id=1214428

Guess that means we should do the same for mainline with a CC:
stable@... tag.

Ricky WU, or do you have a better idea? Yes, from earlier in the thread
the root of the problem might not be in the patch you contributed, but
it exposes the problem, hence it should be reverted unless a better
solution can be found quickly. And that hasn't happened in the past two
weeks, hence it's afaics time for a revert.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-09-11  8:02                     ` Linux regression tracking (Thorsten Leemhuis)
  2023-09-11 11:38                       ` Thorsten Leemhuis
@ 2023-09-11 15:41                       ` Augusto Zanellato
  1 sibling, 0 replies; 21+ messages in thread
From: Augusto Zanellato @ 2023-09-11 15:41 UTC (permalink / raw)
  To: Linux regressions mailing list, Genes Lists, Ricky WU, Keith Busch
  Cc: linux-kernel, axboe, sagi, linux-nvme, hch, arnd, gregkh

Hi,

I'm also experiencing the issue described in this thread, just wanted to 
chime in with regards to

> Anyway, 6.4.y will likely be EOL in a week or two. Which raises the
> question: are 6.5.y and 6.6-rc1 working better for you?

I can confirm that the issue is still happening and preventing correct 
boot on both 6.5.2 (Arch Linux package version 6.5.2.arch1-1) and on 
6.6-rc1 (Arch Linux package linux-mainline built from AUR).

For reference: the affected machine is a Dell XPS 15 9560 with the 
latest firmware revision (1.31.0) as of time of writing this email, the 
NVMe drive is a Sabrent Rocket4 1TB drive, but the issue also happens 
with the OEM provided Toshiba XG4.

Thanks,

Augusto


PS: it's my first time here in LKML, feel free to tell me if I did 
anything wrong :)



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Revert] Re: Possible nvme regression in 6.4.11
  2023-09-11 11:38                       ` Thorsten Leemhuis
@ 2023-09-18 17:07                         ` Jade Lovelace
  2023-09-18 17:07                         ` [PATCH] Revert "misc: rtsx: judge ASPM Mode to set PETXCFG Reg" Jade Lovelace
  1 sibling, 0 replies; 21+ messages in thread
From: Jade Lovelace @ 2023-09-18 17:07 UTC (permalink / raw)
  To: Gene, Ricky WU, Keith Busch, Thorsten Leemhuis, linux-kernel
  Cc: regressions, Alyssa Ross, Michal Suchanek, axboe @ kernel . dk ,
	sagi @ grimberg . me , linux-nvme @ lists . infradead . org ,
	hch, arnd, gregkh, stable

This regression affects all copies of the Dell XPS 15 9560 and Dell Precision
5520 with any SSD including aftermarket ones.

Per the bugzilla discussion here:
https://bugzilla.kernel.org/show_bug.cgi?id=217802 this regression has
been confirmed to also affect 6.5.2 and 6.6-rc1, and affects several
distros.

Known affected branches: 6.1, 6.4, 6.5, 6.6.

It has already been reverted by OpenSUSE and is soon to be reverted in
NixOS. A patch follows, hopefully with the right metadata tags.

I have compiled a kernel 6.1 with the revert and confirmed it now boots
again.

p.s. this is my first time pointing git-send-email at this particular
list, so I'm sorry if I got anything wrong.

Jade



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] Revert "misc: rtsx: judge ASPM Mode to set PETXCFG Reg"
  2023-09-11 11:38                       ` Thorsten Leemhuis
  2023-09-18 17:07                         ` [Revert] " Jade Lovelace
@ 2023-09-18 17:07                         ` Jade Lovelace
  1 sibling, 0 replies; 21+ messages in thread
From: Jade Lovelace @ 2023-09-18 17:07 UTC (permalink / raw)
  To: Gene, Ricky WU, Keith Busch, Thorsten Leemhuis, linux-kernel
  Cc: regressions, Alyssa Ross, Michal Suchanek, axboe @ kernel . dk ,
	sagi @ grimberg . me , linux-nvme @ lists . infradead . org ,
	hch, arnd, gregkh, stable, Gene

This reverts commit 101bd907b4244a726980ee67f95ed9cafab6ff7a.

This commit causes the NVMe controller to not work on the Dell XPS 15
9560, and similar laptop models. It appears to happen with any SSD
model.

This commit is broken on 6.1, 6.4, 6.5, and 6.6-rc1.

OpenSUSE has already reverted, and I have submitted a revert to NixOS.
As far as I can tell, this regression has fallen through the cracks.

Symptom:

kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme nvme0: Disabling device after reset failure: -19
systemd-cryptsetup[169]: Device /dev/disk/by-uuid/b80aedf8-ddd4-46fa-8d09-5215d5f286b9 READ lock released.
systemd-cryptsetup[169]: IO error while decrypting keyslot.
systemd-cryptsetup[169]: Keyslot 0 (luks2) open failed with -5.
systemd-cryptsetup[169]: Keyslot open failed.
systemd-cryptsetup[169]: Failed to activate with specified passphrase: Input/output error

There are several downstream bugs, these are the ones I know of:
- https://bugzilla.suse.com/show_bug.cgi?id=1214428
- https://github.com/NixOS/nixpkgs/issues/253418
- https://bugs.archlinux.org/task/79439#comment221866

Upstream revert links:
- https://github.com/openSUSE/kernel-source/commit/1b02b1528a26f4e9b577e215c114d8c5e773ee10
- https://github.com/NixOS/nixpkgs/pull/255824

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217802
Reported-and-bisected-by: Gene <geneslists@sapience.com>
Link: https://lore.kernel.org/lkml/30b69186-5a6e-4f53-b24c-2221926fc3b4@sapience.com/
Signed-off-by: Jade Lovelace <lists@jade.fyi>
---
 drivers/misc/cardreader/rts5227.c  |  2 +-
 drivers/misc/cardreader/rts5228.c  | 18 ++++++++++++++++++
 drivers/misc/cardreader/rts5249.c  |  3 ++-
 drivers/misc/cardreader/rts5260.c  | 18 ++++++++++++++++++
 drivers/misc/cardreader/rts5261.c  | 18 ++++++++++++++++++
 drivers/misc/cardreader/rtsx_pcr.c |  5 +----
 6 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/drivers/misc/cardreader/rts5227.c b/drivers/misc/cardreader/rts5227.c
index 3dae5e3a1697..d676cf63a966 100644
--- a/drivers/misc/cardreader/rts5227.c
+++ b/drivers/misc/cardreader/rts5227.c
@@ -195,7 +195,7 @@ static int rts5227_extra_init_hw(struct rtsx_pcr *pcr)
 		}
 	}
 
-	if (option->force_clkreq_0 && pcr->aspm_mode == ASPM_MODE_CFG)
+	if (option->force_clkreq_0)
 		rtsx_pci_add_cmd(pcr, WRITE_REG_CMD, PETXCFG,
 				FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_LOW);
 	else
diff --git a/drivers/misc/cardreader/rts5228.c b/drivers/misc/cardreader/rts5228.c
index f4ab09439da7..cfebad51d1d8 100644
--- a/drivers/misc/cardreader/rts5228.c
+++ b/drivers/misc/cardreader/rts5228.c
@@ -435,10 +435,17 @@ static void rts5228_init_from_cfg(struct rtsx_pcr *pcr)
 			option->ltr_enabled = false;
 		}
 	}
+
+	if (rtsx_check_dev_flag(pcr, ASPM_L1_1_EN | ASPM_L1_2_EN
+				| PM_L1_1_EN | PM_L1_2_EN))
+		option->force_clkreq_0 = false;
+	else
+		option->force_clkreq_0 = true;
 }
 
 static int rts5228_extra_init_hw(struct rtsx_pcr *pcr)
 {
+	struct rtsx_cr_option *option = &pcr->option;
 
 	rtsx_pci_write_register(pcr, RTS5228_AUTOLOAD_CFG1,
 			CD_RESUME_EN_MASK, CD_RESUME_EN_MASK);
@@ -469,6 +476,17 @@ static int rts5228_extra_init_hw(struct rtsx_pcr *pcr)
 	else
 		rtsx_pci_write_register(pcr, PETXCFG, 0x30, 0x00);
 
+	/*
+	 * If u_force_clkreq_0 is enabled, CLKREQ# PIN will be forced
+	 * to drive low, and we forcibly request clock.
+	 */
+	if (option->force_clkreq_0)
+		rtsx_pci_write_register(pcr, PETXCFG,
+				 FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_LOW);
+	else
+		rtsx_pci_write_register(pcr, PETXCFG,
+				 FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_HIGH);
+
 	rtsx_pci_write_register(pcr, PWD_SUSPEND_EN, 0xFF, 0xFB);
 
 	if (pcr->rtd3_en) {
diff --git a/drivers/misc/cardreader/rts5249.c b/drivers/misc/cardreader/rts5249.c
index 47ab72a43256..91d240dd68fa 100644
--- a/drivers/misc/cardreader/rts5249.c
+++ b/drivers/misc/cardreader/rts5249.c
@@ -327,11 +327,12 @@ static int rts5249_extra_init_hw(struct rtsx_pcr *pcr)
 		}
 	}
 
+
 	/*
 	 * If u_force_clkreq_0 is enabled, CLKREQ# PIN will be forced
 	 * to drive low, and we forcibly request clock.
 	 */
-	if (option->force_clkreq_0 && pcr->aspm_mode == ASPM_MODE_CFG)
+	if (option->force_clkreq_0)
 		rtsx_pci_write_register(pcr, PETXCFG,
 			FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_LOW);
 	else
diff --git a/drivers/misc/cardreader/rts5260.c b/drivers/misc/cardreader/rts5260.c
index 79b18f6f73a8..9b42b20a3e5a 100644
--- a/drivers/misc/cardreader/rts5260.c
+++ b/drivers/misc/cardreader/rts5260.c
@@ -517,10 +517,17 @@ static void rts5260_init_from_cfg(struct rtsx_pcr *pcr)
 			option->ltr_enabled = false;
 		}
 	}
+
+	if (rtsx_check_dev_flag(pcr, ASPM_L1_1_EN | ASPM_L1_2_EN
+				| PM_L1_1_EN | PM_L1_2_EN))
+		option->force_clkreq_0 = false;
+	else
+		option->force_clkreq_0 = true;
 }
 
 static int rts5260_extra_init_hw(struct rtsx_pcr *pcr)
 {
+	struct rtsx_cr_option *option = &pcr->option;
 
 	/* Set mcu_cnt to 7 to ensure data can be sampled properly */
 	rtsx_pci_write_register(pcr, 0xFC03, 0x7F, 0x07);
@@ -539,6 +546,17 @@ static int rts5260_extra_init_hw(struct rtsx_pcr *pcr)
 
 	rts5260_init_hw(pcr);
 
+	/*
+	 * If u_force_clkreq_0 is enabled, CLKREQ# PIN will be forced
+	 * to drive low, and we forcibly request clock.
+	 */
+	if (option->force_clkreq_0)
+		rtsx_pci_write_register(pcr, PETXCFG,
+				 FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_LOW);
+	else
+		rtsx_pci_write_register(pcr, PETXCFG,
+				 FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_HIGH);
+
 	rtsx_pci_write_register(pcr, pcr->reg_pm_ctrl3, 0x10, 0x00);
 
 	return 0;
diff --git a/drivers/misc/cardreader/rts5261.c b/drivers/misc/cardreader/rts5261.c
index 94af6bf8a25a..b1e76030cafd 100644
--- a/drivers/misc/cardreader/rts5261.c
+++ b/drivers/misc/cardreader/rts5261.c
@@ -498,10 +498,17 @@ static void rts5261_init_from_cfg(struct rtsx_pcr *pcr)
 			option->ltr_enabled = false;
 		}
 	}
+
+	if (rtsx_check_dev_flag(pcr, ASPM_L1_1_EN | ASPM_L1_2_EN
+				| PM_L1_1_EN | PM_L1_2_EN))
+		option->force_clkreq_0 = false;
+	else
+		option->force_clkreq_0 = true;
 }
 
 static int rts5261_extra_init_hw(struct rtsx_pcr *pcr)
 {
+	struct rtsx_cr_option *option = &pcr->option;
 	u32 val;
 
 	rtsx_pci_write_register(pcr, RTS5261_AUTOLOAD_CFG1,
@@ -547,6 +554,17 @@ static int rts5261_extra_init_hw(struct rtsx_pcr *pcr)
 	else
 		rtsx_pci_write_register(pcr, PETXCFG, 0x30, 0x00);
 
+	/*
+	 * If u_force_clkreq_0 is enabled, CLKREQ# PIN will be forced
+	 * to drive low, and we forcibly request clock.
+	 */
+	if (option->force_clkreq_0)
+		rtsx_pci_write_register(pcr, PETXCFG,
+				 FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_LOW);
+	else
+		rtsx_pci_write_register(pcr, PETXCFG,
+				 FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_HIGH);
+
 	rtsx_pci_write_register(pcr, PWD_SUSPEND_EN, 0xFF, 0xFB);
 
 	if (pcr->rtd3_en) {
diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
index a3f4b52bb159..32b7783e9d4f 100644
--- a/drivers/misc/cardreader/rtsx_pcr.c
+++ b/drivers/misc/cardreader/rtsx_pcr.c
@@ -1326,11 +1326,8 @@ static int rtsx_pci_init_hw(struct rtsx_pcr *pcr)
 			return err;
 	}
 
-	if (pcr->aspm_mode == ASPM_MODE_REG) {
+	if (pcr->aspm_mode == ASPM_MODE_REG)
 		rtsx_pci_write_register(pcr, ASPM_FORCE_CTL, 0x30, 0x30);
-		rtsx_pci_write_register(pcr, PETXCFG,
-				FORCE_CLKREQ_DELINK_MASK, FORCE_CLKREQ_HIGH);
-	}
 
 	/* No CD interrupt if probing driver with card inserted.
 	 * So we need to initialize pcr->card_exist here.
-- 
2.42.0
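
The symptom log in the patch description above shows CSTS=0xffffffff and
PCI_STATUS=0xffff. Reads from a PCI device that no longer responds
conventionally complete as all ones, which is consistent with the
"device inaccessible" line in the same log. A minimal sketch of that check,
using hypothetical helper names (this is an illustration, not the nvme
driver's code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a 32-bit register read that came back as all ones. */
static bool reg32_is_all_ones(uint32_t value)
{
	return value == UINT32_MAX;
}

/* Hypothetical helper: a 16-bit register read that came back as all ones. */
static bool reg16_is_all_ones(uint16_t value)
{
	return value == UINT16_MAX;
}

/*
 * Both the controller status and the PCI status register reading as all
 * ones suggests the device did not respond, not that every status bit
 * is genuinely set.
 */
static bool device_looks_unreachable(uint32_t csts, uint16_t pci_status)
{
	return reg32_is_all_ones(csts) && reg16_is_all_ones(pci_status);
}
```

With the values from the log, device_looks_unreachable(0xffffffff, 0xffff)
returns true.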


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Possible nvme regression in 6.4.11
  2023-08-17  3:00 ` Bagas Sanjaya
@ 2023-09-29 11:49   ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 0 replies; 21+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-09-29 11:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-nvme, Linux Regressions

On 17.08.23 05:00, Bagas Sanjaya wrote:
> On Wed, Aug 16, 2023 at 04:39:34PM -0400, Genes Lists wrote:

> Thanks for the regression report. I'm adding it to regzbot:
> 
> #regzbot ^introduced: 101bd907b4244a
> #regzbot title: can't change Samsung SSD power state due to ASPM mode checking
> #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=217802

Fix is in Greg's tree and hopefully soon in mainline:

#regzbot fix: 0e4cac557531a4
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.




^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-09-29 11:49 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-16 20:39 Possible nvme regression in 6.4.11 Genes Lists
2023-08-16 21:04 ` Keith Busch
2023-08-17  1:30   ` Genes Lists
2023-08-17  9:16     ` Genes Lists
2023-08-17 17:28       ` Keith Busch
2023-08-17 17:43         ` Genes Lists
2023-08-23 17:41       ` Keith Busch
2023-08-23 20:25         ` Genes Lists
2023-08-24  2:44           ` Ricky WU
2023-08-24  9:48             ` Genes Lists
2023-08-24 10:22               ` Genes Lists
2023-08-25 12:51                 ` Ricky WU
2023-08-30 21:09                   ` Genes Lists
2023-09-11  8:02                     ` Linux regression tracking (Thorsten Leemhuis)
2023-09-11 11:38                       ` Thorsten Leemhuis
2023-09-18 17:07                         ` [Revert] " Jade Lovelace
2023-09-18 17:07                         ` [PATCH] Revert "misc: rtsx: judge ASPM Mode to set PETXCFG Reg" Jade Lovelace
2023-09-11 15:41                       ` Possible nvme regression in 6.4.11 Augusto Zanellato
2023-08-24 11:29         ` Genes Lists
2023-08-17  3:00 ` Bagas Sanjaya
2023-09-29 11:49   ` Linux regression tracking #update (Thorsten Leemhuis)
