* [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
@ 2021-09-22  1:56 ` Yi Zhang
  2021-09-22  2:32   ` Chaitanya Kulkarni
                     ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Yi Zhang @ 2021-09-22  1:56 UTC (permalink / raw)
  To: linux-nvme; +Cc: Keith Busch

Hello

I found this issue during an nvme removal test. I added some debug code
and found that it failed while reading the nvme "CSTS – Controller Status"
register. Could anyone help check whether this is a HW or SW issue?


# nvme list
Node          SN               Model                      Namespace  Usage                Format       FW Rev
------------  ---------------  -------------------------  ---------  -------------------  -----------  --------
/dev/nvme0n1  S48CNC0N400972B  Samsung SSD 983 DCT 960GB  1          4.10 kB / 960.20 GB  512 B + 0 B  EDA5302Q
# lspci -s 87:00.0 -v
87:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 (prog-if 02 [NVM Express])
DeviceName: PCIe SSD in Slot 23 Bay 1
Subsystem: Samsung Electronics Co Ltd Device a801
Physical Slot: 7
Flags: bus master, fast devsel, latency 0, IRQ 46, NUMA node 1
Memory at c8600000 (64-bit, non-prefetchable) [size=16K]
Expansion ROM at c8610000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158] Power Budgeting <?>
Capabilities: [168] Secondary PCI Express
Capabilities: [188] Latency Tolerance Reporting
Capabilities: [190] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme

# echo 1 >/sys/bus/pci/devices/0000\:87\:00.0/remove
# echo 1 >/sys/bus/pci/rescan
# dmesg
[  251.864254] pci 0000:87:00.0: [144d:a808] type 00 class 0x010802
[  251.864286] pci 0000:87:00.0: reg 0x10: [mem 0xc8600000-0xc8603fff 64bit]
[  251.864337] pci 0000:87:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
[  251.889196] pci 0000:87:00.0: BAR 6: assigned [mem 0xc8600000-0xc860ffff pref]
[  251.889206] pci 0000:87:00.0: BAR 0: assigned [mem 0xc8610000-0xc8613fff 64bit]
[  251.889777] nvme nvme0: pci function 0000:87:00.0
[  251.889888] nvme nvme0: readl(dev->bar + NVME_REG_CSTS) == -1, return - ENODEV
[  251.898057] nvme nvme0: nvme_pci_enable: -19
[  251.902821] nvme nvme0: Removing after probe failure status: -19


-- 
Best Regards,
  Yi Zhang


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-22  1:56 ` [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal Yi Zhang
@ 2021-09-22  2:32   ` Chaitanya Kulkarni
  2021-09-22 11:21     ` Yi Zhang
  2021-09-22  2:33   ` Chaitanya Kulkarni
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Chaitanya Kulkarni @ 2021-09-22  2:32 UTC (permalink / raw)
  To: Yi Zhang, linux-nvme; +Cc: Keith Busch

On 9/21/21 6:56 PM, Yi Zhang wrote:
> Hello
> 
> I found this issue during the nvme removal test, I did some debug code
> found it was failed during nvme "CSTS – Controller Status" read, could
> anyone help check if this is one HW or SW issue?
> 
> 

The easiest way to isolate this issue is to run the same steps against
the QEMU NVMe controller and report the results here.

* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-22  1:56 ` [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal Yi Zhang
  2021-09-22  2:32   ` Chaitanya Kulkarni
@ 2021-09-22  2:33   ` Chaitanya Kulkarni
  2021-09-23 16:54   ` Adam Manzanares
  2021-09-24  3:13   ` Keith Busch
  3 siblings, 0 replies; 10+ messages in thread
From: Chaitanya Kulkarni @ 2021-09-22  2:33 UTC (permalink / raw)
  To: Yi Zhang, linux-nvme; +Cc: Keith Busch

On 9/21/21 6:56 PM, Yi Zhang wrote:
> Hello
> 
> I found this issue during the nvme removal test, I did some debug code
> found it was failed during nvme "CSTS – Controller Status" read, could
> anyone help check if this is one HW or SW issue?
> 
> 

Also, if you can specify the git repo, branch, and HEAD commit, that
would be great.

* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-22  2:32   ` Chaitanya Kulkarni
@ 2021-09-22 11:21     ` Yi Zhang
  2021-09-23  6:52       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 10+ messages in thread
From: Yi Zhang @ 2021-09-22 11:21 UTC (permalink / raw)
  To: Chaitanya Kulkarni; +Cc: linux-nvme, Keith Busch

On Wed, Sep 22, 2021 at 10:32 AM Chaitanya Kulkarni
<chaitanyak@nvidia.com> wrote:
>
> On 9/21/21 6:56 PM, Yi Zhang wrote:
> > Hello
> >
> > I found this issue during the nvme removal test, I did some debug code
> > found it was failed during nvme "CSTS – Controller Status" read, could
> > anyone help check if this is one HW or SW issue?
> >
> >
>
> One easiest way to isolate this issue is to test the steps on the QEMU
> NVMe ctrl and update everyone here...

Hi Chaitanya
I tried another NVMe SSD [1] and it works well, so this seems to be
specific to the "Samsung SSD 983 DCT 960GB".
The kernel I used was upstream 5.15.0-rc2.

[1]
# nvme list
Node          SN            Model                            Namespace  Usage                  Format       FW Rev
------------  ------------  -------------------------------  ---------  ---------------------  -----------  --------
/dev/nvme2n1  3080A09HTAFR  Dell Express Flash CD5 960G SFF  1          469.18 MB / 960.20 GB  512 B + 0 B  1.2.0

-- 
Best Regards,
  Yi Zhang



* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-22 11:21     ` Yi Zhang
@ 2021-09-23  6:52       ` Chaitanya Kulkarni
  0 siblings, 0 replies; 10+ messages in thread
From: Chaitanya Kulkarni @ 2021-09-23  6:52 UTC (permalink / raw)
  To: Yi Zhang; +Cc: linux-nvme, Keith Busch

On 9/22/2021 4:21 AM, Yi Zhang wrote:
> On Wed, Sep 22, 2021 at 10:32 AM Chaitanya Kulkarni
> <chaitanyak@nvidia.com> wrote:
>>
>> On 9/21/21 6:56 PM, Yi Zhang wrote:
>>> Hello
>>>
>>> I found this issue during the nvme removal test, I did some debug code
>>> found it was failed during nvme "CSTS – Controller Status" read, could
>>> anyone help check if this is one HW or SW issue?
>>>
>>>
>>
>> One easiest way to isolate this issue is to test the steps on the QEMU
>> NVMe ctrl and update everyone here...
> 
> Hi Chaitanya
> I tried another NVMe SSD[1], and it works well, seems this is special
> for "Samsung SSD 983 DCT 960GB".
> And the kernel I used was upstream 5.15.0-rc2.
> 

Thanks for the update.

You can also set up an NVMe-oF controller with a block device backend
using the nvme-loop transport and see if the problem exists there.



* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-22  1:56 ` [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal Yi Zhang
  2021-09-22  2:32   ` Chaitanya Kulkarni
  2021-09-22  2:33   ` Chaitanya Kulkarni
@ 2021-09-23 16:54   ` Adam Manzanares
  2021-09-24  2:33     ` Yi Zhang
  2021-09-24  3:13   ` Keith Busch
  3 siblings, 1 reply; 10+ messages in thread
From: Adam Manzanares @ 2021-09-23 16:54 UTC (permalink / raw)
  To: Yi Zhang; +Cc: linux-nvme, Keith Busch

On Wed, Sep 22, 2021 at 09:56:47AM +0800, Yi Zhang wrote:
> Hello
> 
> I found this issue during the nvme removal test, I did some debug code
> found it was failed during nvme "CSTS – Controller Status" read, could
> anyone help check if this is one HW or SW issue?

Hello Yi,

What is the nvme removal test? When I get access to the test I will run it on 
the HW on my current machine to see what happens. I don't have the same SSD in 
my current test system, but I will make sure I am able to run the test on the 
same HW you are having issues with. 

Thanks,
Adam


* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-23 16:54   ` Adam Manzanares
@ 2021-09-24  2:33     ` Yi Zhang
  2021-09-24 19:16       ` Adam Manzanares
  0 siblings, 1 reply; 10+ messages in thread
From: Yi Zhang @ 2021-09-24  2:33 UTC (permalink / raw)
  To: Adam Manzanares; +Cc: linux-nvme, Keith Busch

On Fri, Sep 24, 2021 at 12:58 AM Adam Manzanares
<a.manzanares@samsung.com> wrote:
>
> On Wed, Sep 22, 2021 at 09:56:47AM +0800, Yi Zhang wrote:
> > Hello
> >
> > I found this issue during the nvme removal test, I did some debug code
> > found it was failed during nvme "CSTS – Controller Status" read, could
> > anyone help check if this is one HW or SW issue?
>
> Hello Yi,
>
> What is the nvme removal test? When I get access to the test I will run it on
> the HW on my current machine to see what happens. I don't have the same SSD in
> my current test system, but I will make sure I am able to run the test on the
> same HW you are having issues with.
>
Hi Adam
Here are the steps. The servers I used are a DELL R730xd and a DELL R640;
feel free to let me know if you need more info, thanks.

# lspci | grep -i nvme
87:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
# echo 1 >/sys/bus/pci/devices/0000\:87\:00.0/remove
# echo 1 >/sys/bus/pci/rescan

Thanks
Yi



* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-22  1:56 ` [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal Yi Zhang
                     ` (2 preceding siblings ...)
  2021-09-23 16:54   ` Adam Manzanares
@ 2021-09-24  3:13   ` Keith Busch
  2021-09-26 11:14     ` Yi Zhang
  3 siblings, 1 reply; 10+ messages in thread
From: Keith Busch @ 2021-09-24  3:13 UTC (permalink / raw)
  To: Yi Zhang; +Cc: linux-nvme

On Wed, Sep 22, 2021 at 09:56:47AM +0800, Yi Zhang wrote:
> # echo 1 >/sys/bus/pci/devices/0000\:87\:00.0/remove
> # echo 1 >/sys/bus/pci/rescan
> # dmesg
> [  251.864254] pci 0000:87:00.0: [144d:a808] type 00 class 0x010802
> [  251.864286] pci 0000:87:00.0: reg 0x10: [mem 0xc8600000-0xc8603fff 64bit]
> [  251.864337] pci 0000:87:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
> [  251.889196] pci 0000:87:00.0: BAR 6: assigned [mem 0xc8600000-0xc860ffff pref]
> [  251.889206] pci 0000:87:00.0: BAR 0: assigned [mem 0xc8610000-0xc8613fff 64bit]
> [  251.889777] nvme nvme0: pci function 0000:87:00.0
> [  251.889888] nvme nvme0: readl(dev->bar + NVME_REG_CSTS) == -1,
> return - ENODEV

An all 1's return almost certainly means the memory read request failed. With
the test you described, that usually means the target did not properly configure
the memory range it was assigned. Is this directly attached to a root port?
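
For illustration, the check behind the debug line quoted above can be sketched
in userspace C. This is only a sketch under stated assumptions:
mmio_read_failed() and check_csts() are hypothetical names, not kernel
functions, and the register value is passed in as a plain integer instead of
being read from the BAR with readl().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * A 32-bit MMIO read that the device never completes is typically
 * seen by software as all 1s (0xFFFFFFFF, i.e. -1 as a signed value).
 */
bool mmio_read_failed(uint32_t value)
{
    return value == UINT32_MAX;
}

/*
 * Hypothetical wrapper mirroring the shape of the check in the log:
 * an all-1s CSTS value is treated as "device not there" (-ENODEV),
 * anything else is accepted.
 */
int check_csts(uint32_t csts)
{
    return mmio_read_failed(csts) ? -ENODEV : 0;
}
```

On Linux, ENODEV is 19, which matches the -19 reported by nvme_pci_enable and
by the "Removing after probe failure" line in the dmesg output.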


* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-24  2:33     ` Yi Zhang
@ 2021-09-24 19:16       ` Adam Manzanares
  0 siblings, 0 replies; 10+ messages in thread
From: Adam Manzanares @ 2021-09-24 19:16 UTC (permalink / raw)
  To: Yi Zhang; +Cc: linux-nvme, Keith Busch

On Fri, Sep 24, 2021 at 10:33:14AM +0800, Yi Zhang wrote:
> On Fri, Sep 24, 2021 at 12:58 AM Adam Manzanares
> <a.manzanares@samsung.com> wrote:
> >
> > On Wed, Sep 22, 2021 at 09:56:47AM +0800, Yi Zhang wrote:
> > > Hello
> > >
> > > I found this issue during the nvme removal test, I did some debug code
> > > found it was failed during nvme "CSTS – Controller Status" read, could
> > > anyone help check if this is one HW or SW issue?
> >
> > Hello Yi,
> >
> > What is the nvme removal test? When I get access to the test I will run it on
> > the HW on my current machine to see what happens. I don't have the same SSD in
> > my current test system, but I will make sure I am able to run the test on the
> > same HW you are having issues with.
> >
> Hi Adam
> Here is the steps, and the server I used is DELL R730xd and DELL R640,
> feel free to let me know if you need more info, thanks.
> 
> # lspci | grep -i nvme
> 87:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
> NVMe SSD Controller SM981/PM981/PM983
> #echo 1 >/sys/bus/pci/devices/0000\:87\:00.0/remove
> #echo 1 >/sys/bus/pci/rescan

This is enough info for now. I can't reproduce the problem on the platform 
I am currently using. I will try to track down the same hw and fw that you 
are using, retest, and send an update.


> 
> Thanks
> Yi
> 

* Re: [bug report] nvme removing after probe failed with pci rescan after nvme sysfs removal
  2021-09-24  3:13   ` Keith Busch
@ 2021-09-26 11:14     ` Yi Zhang
  0 siblings, 0 replies; 10+ messages in thread
From: Yi Zhang @ 2021-09-26 11:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme

On Fri, Sep 24, 2021 at 11:13 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Sep 22, 2021 at 09:56:47AM +0800, Yi Zhang wrote:
> > # echo 1 >/sys/bus/pci/devices/0000\:87\:00.0/remove
> > # echo 1 >/sys/bus/pci/rescan
> > # dmesg
> > [  251.864254] pci 0000:87:00.0: [144d:a808] type 00 class 0x010802
> > [  251.864286] pci 0000:87:00.0: reg 0x10: [mem 0xc8600000-0xc8603fff 64bit]
> > [  251.864337] pci 0000:87:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
> > [  251.889196] pci 0000:87:00.0: BAR 6: assigned [mem 0xc8600000-0xc860ffff pref]
> > [  251.889206] pci 0000:87:00.0: BAR 0: assigned [mem 0xc8610000-0xc8613fff 64bit]
> > [  251.889777] nvme nvme0: pci function 0000:87:00.0
> > [  251.889888] nvme nvme0: readl(dev->bar + NVME_REG_CSTS) == -1,
> > return - ENODEV
>
> An all 1's return almost certainly means the memory read request failed. The
> test your described usually means the target did not properly configure the
> memory range it was assigned. Is this directly attached to a root port?
>
Hi Keith
It was connected to the PCIe slot through a PCIe extender card. I have
added the full dmesg here, in case it helps.

https://pastebin.com/QUP0Y4sT


--
Best Regards,
  Yi Zhang

