* Linux AER reporting
@ 2016-08-22 15:52 Nisha Miller
  2016-08-22 16:15 ` Keith Busch
  2016-08-22 18:10 ` Guilherme G. Piccoli
  0 siblings, 2 replies; 11+ messages in thread
From: Nisha Miller @ 2016-08-22 15:52 UTC


Hi all,

We have a PCIe SSD controller using NVMe. This controller works on
Windows and Linux. However, we are seeing a problem under Linux.

In the Linux nvme driver, the function nvme_kthread() reads the CSTS
register once a second to check for controller status failure. In our
case we see that this register occasionally reads as 0xFFFFFFFF.
Whenever this happens, the kernel just hangs. This seems to be a PCIe
read error and we are trying to gather further information. How does
one use Linux AER with the nvme driver?

We are using CentOS 7.2 with kernel 3.19.8. PCIe AER has been enabled
in the kernel and aerdriver.forceload=y is set on the command line.

TIA
Nisha Miller


* Linux AER reporting
  2016-08-22 15:52 Linux AER reporting Nisha Miller
@ 2016-08-22 16:15 ` Keith Busch
  2016-08-22 18:10 ` Guilherme G. Piccoli
  1 sibling, 0 replies; 11+ messages in thread
From: Keith Busch @ 2016-08-22 16:15 UTC


Hi Nisha,

The Linux NVMe driver didn't add AER support until commit:

| commit a0a3408ee614848c27b0d36c2fe490da3b387b8d
| Author: Keith Busch <keith.busch@intel.com>
| Date:   Mon Dec 7 15:30:31 2015 -0700
|
|   NVMe: Add pci error handlers


If you don't have that commit, AER events may cause problems for the
NVMe driver. I think 4.4 was the first kernel release to include it.


On Mon, Aug 22, 2016 at 08:52:10AM -0700, Nisha Miller wrote:
> Hi all,
> 
> We have a PCIE SSD controller using NVME. This controller works on
> Windows and Linux. However, we are seeing a problem under Linux.
> 
> In the nvme Linux driver in function nvme_kthread() the CSTS register
> is read once a second to check for controller status failure. In our
> case we see that occasionally this register is read as 0xFFFFFFFF.
> Whenever this happens, the kernel just hangs. This seems to be PCIe
> read error and we are trying to gather further information. How does
> one use Linux AER with the nvme driver?
> 
> We are using Centos 7.2 with Kernel 3.19.8. PCIe AER has been enabled
> in the kernel and aerdriver.forceload=y is set in the command line.
> 
> TIA
> Nisha Miller


* Linux AER reporting
  2016-08-22 15:52 Linux AER reporting Nisha Miller
  2016-08-22 16:15 ` Keith Busch
@ 2016-08-22 18:10 ` Guilherme G. Piccoli
  2016-08-23 23:56   ` Nisha Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Guilherme G. Piccoli @ 2016-08-22 18:10 UTC


On 08/22/2016 12:52 PM, Nisha Miller wrote:
> Hi all,
>
> We have a PCIE SSD controller using NVME. This controller works on
> Windows and Linux. However, we are seeing a problem under Linux.
>
> In the nvme Linux driver in function nvme_kthread() the CSTS register
> is read once a second to check for controller status failure. In our
> case we see that occasionally this register is read as 0xFFFFFFFF.
> Whenever this happens, the kernel just hangs. This seems to be PCIe
> read error and we are trying to gather further information. How does
> one use Linux AER with the nvme driver?

Nisha, we once saw 0xFFFF on the CSTS register after issuing a
reset_controller, for example. The reason was that device shutdown
was replaced by device disable when resetting the controller, following
the NVMe spec, but the device we were testing at the time didn't cope
well with this change.

For that, we implemented a quirk to wait a little before reading this
register on some occasions. The commit info is:


54adc01055 ("nvme/quirk: Add a delay before checking for adapter readiness")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=54adc01055b75ec8769c5a36574c7a0895c0c0b2


I'm really not sure if it's related, but I guess it's worth a try.
Cheers,


Guilherme


>
> We are using Centos 7.2 with Kernel 3.19.8. PCIe AER has been enabled
> in the kernel and aerdriver.forceload=y is set in the command line.
>
> TIA
> Nisha Miller
>


* Linux AER reporting
  2016-08-22 18:10 ` Guilherme G. Piccoli
@ 2016-08-23 23:56   ` Nisha Miller
  2016-08-24 14:02     ` Guilherme G. Piccoli
  0 siblings, 1 reply; 11+ messages in thread
From: Nisha Miller @ 2016-08-23 23:56 UTC


Hi Keith and Guilherme,

Thank you for your replies.

Kernel 4.4.19 does not seem to have an nvme driver with support for AER.
It is present in Kernel 4.7 but getting it to work on Centos 7.2 is
turning out to be quite a task. Arch Linux has kernel 4.7 so I will
give that a shot.

I should have mentioned that we get the CSTS = 0xFFFFFFFF only after
millions of writes. When using fio, it runs for over 30 minutes before
the problem crops up.

BTW, I subscribed to linux-nvme list but never got a confirmation
email. I don't get email from the list, but I'm able to post to it.

cheers
Nisha

On Mon, Aug 22, 2016 at 11:10 AM, Guilherme G. Piccoli
<gpiccoli@linux.vnet.ibm.com> wrote:
> On 08/22/2016 12:52 PM, Nisha Miller wrote:
>>
>> Hi all,
>>
>> We have a PCIE SSD controller using NVME. This controller works on
>> Windows and Linux. However, we are seeing a problem under Linux.
>>
>> In the nvme Linux driver in function nvme_kthread() the CSTS register
>> is read once a second to check for controller status failure. In our
>> case we see that occasionally this register is read as 0xFFFFFFFF.
>> Whenever this happens, the kernel just hangs. This seems to be PCIe
>> read error and we are trying to gather further information. How does
>> one use Linux AER with the nvme driver?
>
>
> Nisha, we once saw 0xFFFF on CSTS register after issuing a reset_controller,
> for example. The reason it was that device shutdown was replaced by device
> disable when resetting the controller, following the NVMe spec, but the
> device we were testing that time didn't cope well with this change.
>
> For that, we implemented a quirk to wait a little on reading this register
> in some occasions. The commit info is:
>
>
> 54adc01055 ("nvme/quirk: Add a delay before checking for adapter readiness")
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=54adc01055b75ec8769c5a36574c7a0895c0c0b2
>
>
> I'm really not sure if it's related, but I guess worth a try.
> Cheers,
>
>
> Guilherme
>
>
>>
>> We are using Centos 7.2 with Kernel 3.19.8. PCIe AER has been enabled
>> in the kernel and aerdriver.forceload=y is set in the command line.
>>
>> TIA
>> Nisha Miller
>>
>


* Linux AER reporting
  2016-08-23 23:56   ` Nisha Miller
@ 2016-08-24 14:02     ` Guilherme G. Piccoli
  2016-08-24 14:40       ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread
From: Guilherme G. Piccoli @ 2016-08-24 14:02 UTC


On 08/23/2016 08:56 PM, Nisha Miller wrote:
> Hi Keith and Guilherme,
>
> thank you for your replies.
>
> Kernel 4.4.19 does not seem to have nvme driver with support for AER.
> It is present in Kernel 4.7 but getting it to work on Centos 7.2 is
> turning out to be quite a task. Arch Linux has kernel 4.7 so I will
> give that a shot.
>
> I should have mentioned that we get the CSTS = 0xFFFFFFFF only after
> millions of writes. When using fio, it runs for over 30 minutes before
> the problem crops up.

Hi Nisha, unfortunately the idea of the quirk I mentioned seems useless 
here, since you're getting the error after multiple writes. Hope Keith 
can provide more ideas for you!

By the way, do you have some logs to share? It'd help to figure out the 
situation I guess.

Thanks,


Guilherme


>
> BTW, I subscribed to linux-nvme list but never got a confirmation
> email. I don't get email from the list, but I'm able to post to it.
>
> cheers
> Nisha
>
> On Mon, Aug 22, 2016 at 11:10 AM, Guilherme G. Piccoli
> <gpiccoli@linux.vnet.ibm.com> wrote:
>> On 08/22/2016 12:52 PM, Nisha Miller wrote:
>>>
>>> Hi all,
>>>
>>> We have a PCIE SSD controller using NVME. This controller works on
>>> Windows and Linux. However, we are seeing a problem under Linux.
>>>
>>> In the nvme Linux driver in function nvme_kthread() the CSTS register
>>> is read once a second to check for controller status failure. In our
>>> case we see that occasionally this register is read as 0xFFFFFFFF.
>>> Whenever this happens, the kernel just hangs. This seems to be PCIe
>>> read error and we are trying to gather further information. How does
>>> one use Linux AER with the nvme driver?
>>
>>
>> Nisha, we once saw 0xFFFF on CSTS register after issuing a reset_controller,
>> for example. The reason it was that device shutdown was replaced by device
>> disable when resetting the controller, following the NVMe spec, but the
>> device we were testing that time didn't cope well with this change.
>>
>> For that, we implemented a quirk to wait a little on reading this register
>> in some occasions. The commit info is:
>>
>>
>> 54adc01055 ("nvme/quirk: Add a delay before checking for adapter readiness")
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=54adc01055b75ec8769c5a36574c7a0895c0c0b2
>>
>>
>> I'm really not sure if it's related, but I guess worth a try.
>> Cheers,
>>
>>
>> Guilherme
>>
>>
>>>
>>> We are using Centos 7.2 with Kernel 3.19.8. PCIe AER has been enabled
>>> in the kernel and aerdriver.forceload=y is set in the command line.
>>>
>>> TIA
>>> Nisha Miller
>>>


* Linux AER reporting
  2016-08-24 14:02     ` Guilherme G. Piccoli
@ 2016-08-24 14:40       ` Keith Busch
  2016-08-24 17:13         ` Nisha Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2016-08-24 14:40 UTC


On Wed, Aug 24, 2016 at 11:02:46AM -0300, Guilherme G. Piccoli wrote:
> On 08/23/2016 08:56 PM, Nisha Miller wrote:
> >Hi Keith and Guilherme,
> >
> >thank you for your replies.
> >
> >Kernel 4.4.19 does not seem to have nvme driver with support for AER.
> >It is present in Kernel 4.7 but getting it to work on Centos 7.2 is
> >turning out to be quite a task. Arch Linux has kernel 4.7 so I will
> >give that a shot.
> >
> >I should have mentioned that we get the CSTS = 0xFFFFFFFF only after
> >millions of writes. When using fio, it runs for over 30 minutes before
> >the problem crops up.
> 
> Hi Nisha, unfortunately the idea of the quirk I mentioned seems useless
> here, since you're getting the error after multiple writes. Hope Keith can
> provide more ideas for you!

An all 1's completion indicates the link is down. There should never be
a case where a functioning drive actually returns that from a read to
the CSTS register.
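
To make the distinction concrete, here is a minimal sketch in C of how a status poll could separate "link gone" from a genuine controller fault. It is an illustration only, not the driver's code: the enum and the name classify_csts() are hypothetical, and only the RDY (bit 0) and CFS (bit 1) positions come from the NVMe specification's CSTS layout.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the NVMe specification's CSTS register layout. */
#define CSTS_RDY (1u << 0) /* controller ready */
#define CSTS_CFS (1u << 1) /* controller fatal status */

/* Hypothetical classification used only for this illustration. */
enum csts_state {
    CSTS_STATE_OK,
    CSTS_STATE_NOT_READY,
    CSTS_STATE_FATAL,
    CSTS_STATE_LINK_GONE
};

/*
 * Classify a raw 32-bit CSTS value.
 * The all-ones check must come first: 0xFFFFFFFF also has the CFS bit
 * set, so testing CFS first would misreport a dead link as a
 * controller fatal status.
 */
enum csts_state classify_csts(uint32_t csts)
{
    if (csts == UINT32_MAX)
        return CSTS_STATE_LINK_GONE;
    if (csts & CSTS_CFS)
        return CSTS_STATE_FATAL;
    if (!(csts & CSTS_RDY))
        return CSTS_STATE_NOT_READY;
    return CSTS_STATE_OK;
}

/* Human-readable name, e.g. for a log message. */
const char *csts_state_name(enum csts_state s)
{
    switch (s) {
    case CSTS_STATE_OK:        return "ok";
    case CSTS_STATE_NOT_READY: return "not ready";
    case CSTS_STATE_FATAL:     return "controller fatal status";
    case CSTS_STATE_LINK_GONE: return "all ones (link down or device gone)";
    }
    assert(!"unknown csts_state");
    return "unknown";
}
```

The ordering of the checks is the point of the sketch: an all-ones read says nothing about the controller itself, only that the read never reached it.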

I'm not sure if PCIe AER has anything to do with this, though. Are you
injecting these sorts of errors? If you're just doing normal IO testing,
AER may not apply here.


* Linux AER reporting
  2016-08-24 14:40       ` Keith Busch
@ 2016-08-24 17:13         ` Nisha Miller
  2016-08-25 15:06           ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread
From: Nisha Miller @ 2016-08-24 17:13 UTC


Hi Keith,

I'm not injecting any errors. I'm trying to see if AER reporting can
help me diagnose the problem. Is this something AER can help me do or
am I on the wrong track here?

thanks
Nisha Miller

On Wed, Aug 24, 2016 at 7:40 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Aug 24, 2016 at 11:02:46AM -0300, Guilherme G. Piccoli wrote:
>> On 08/23/2016 08:56 PM, Nisha Miller wrote:
>> >Hi Keith and Guilherme,
>> >
>> >thank you for your replies.
>> >
>> >Kernel 4.4.19 does not seem to have nvme driver with support for AER.
>> >It is present in Kernel 4.7 but getting it to work on Centos 7.2 is
>> >turning out to be quite a task. Arch Linux has kernel 4.7 so I will
>> >give that a shot.
>> >
>> >I should have mentioned that we get the CSTS = 0xFFFFFFFF only after
>> >millions of writes. When using fio, it runs for over 30 minutes before
>> >the problem crops up.
>>
>> Hi Nisha, unfortunately the idea of the quirk I mentioned seems useless
>> here, since you're getting the error after multiple writes. Hope Keith can
>> provide more ideas for you!
>
> An all 1's completion indicates the link is down. There should never be
> a case where a functioning drive actually returns that from a read to
> the CSTS register.
>
> I'm not sure if PCIe AER has anything to do with this, though. Are you
> injecting these sorts of errors? If you're just doing normal IO testing,
> AER may not apply here.


* Linux AER reporting
  2016-08-24 17:13         ` Nisha Miller
@ 2016-08-25 15:06           ` Keith Busch
  2016-08-25 17:37             ` Nisha Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2016-08-25 15:06 UTC


On Wed, Aug 24, 2016 at 10:13:43AM -0700, Nisha Miller wrote:
> Hi Keith,
> 
> I'm not injecting any errors. I'm trying to see if AER reporting can
> help me diagnose the problem. Is this something AER can help me do or
> am I on the wrong track here?

Is there anything else occurring in dmesg when you observe the ff's
CSTS register value? That status usually (only?) happens if the device
link is disconnected or broken.
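
One way to cross-check the link from user space when the all-ff's value shows up is to read the start of the device's PCI config space through sysfs and see whether the vendor ID also comes back as all ones. The following is a hedged sketch in C, not something from the driver: the function names are hypothetical, and any device address passed in (for example 0000:01:00.0) is a placeholder for the real one. It relies on the usual PCI convention that a vendor ID of 0xFFFF means no function answered the read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * The vendor ID occupies the first two bytes of PCI config space,
 * stored little-endian.
 */
uint16_t pci_vendor_id(const unsigned char *cfg, size_t len)
{
    assert(cfg != NULL && len >= 2);
    return (uint16_t)(cfg[0] | (cfg[1] << 8));
}

/*
 * Returns true if the function behind the given sysfs config file
 * still answers config reads, i.e. the file can be read and the
 * vendor ID is not all ones. The path is expected to look like
 * /sys/bus/pci/devices/<domain:bus:dev.fn>/config.
 */
bool pci_function_responds(const char *sysfs_config_path)
{
    unsigned char cfg[2];
    FILE *f = fopen(sysfs_config_path, "rb");
    if (f == NULL)
        return false;

    size_t n = fread(cfg, 1, sizeof cfg, f);
    fclose(f);
    if (n != sizeof cfg)
        return false;

    return pci_vendor_id(cfg, n) != 0xFFFF;
}
```

If the vendor ID reads as 0xFFFF at the same moment CSTS reads as all ones, that points at the link or the device dropping off the bus rather than at the NVMe controller state.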


* Linux AER reporting
  2016-08-25 15:06           ` Keith Busch
@ 2016-08-25 17:37             ` Nisha Miller
  2016-08-25 17:49               ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread
From: Nisha Miller @ 2016-08-25 17:37 UTC


On Thu, Aug 25, 2016 at 8:06 AM, Keith Busch <keith.busch@intel.com> wrote:

> Is there anything else in the dmesg occurring when you observe the ff's
> CSTS register value? That status usually (only?) happens if the device
> link is disconnected or broken.

No messages are displayed in dmesg when CSTS is read as all FFs.

thanks
Nisha


* Linux AER reporting
  2016-08-25 17:37             ` Nisha Miller
@ 2016-08-25 17:49               ` Keith Busch
  2016-08-25 18:08                 ` Nisha Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2016-08-25 17:49 UTC


On Thu, Aug 25, 2016 at 10:37:41AM -0700, Nisha Miller wrote:
> On Thu, Aug 25, 2016 at 8:06 AM, Keith Busch <keith.busch@intel.com> wrote:
> 
> > Is there anything else in the dmesg occurring when you observe the ff's
> > CSTS register value? That status usually (only?) happens if the device
> > link is disconnected or broken.
> 
> no messages are displayed in dmesg when CSTS is read as all FFs.

Have you got a protocol analyzer?


* Linux AER reporting
  2016-08-25 17:49               ` Keith Busch
@ 2016-08-25 18:08                 ` Nisha Miller
  0 siblings, 0 replies; 11+ messages in thread
From: Nisha Miller @ 2016-08-25 18:08 UTC


On Thu, Aug 25, 2016 at 10:49 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Thu, Aug 25, 2016 at 10:37:41AM -0700, Nisha Miller wrote:
>> On Thu, Aug 25, 2016 at 8:06 AM, Keith Busch <keith.busch@intel.com> wrote:
>>
>> > Is there anything else in the dmesg occurring when you observe the ff's
>> > CSTS register value? That status usually (only?) happens if the device
>> > link is disconnected or broken.
>>
>> no messages are displayed in dmesg when CSTS is read as all FFs.
>
> Have you got a protocol analyzer?

No, I don't. I was hoping that AER would help me better understand
what is going on. But if that is not possible, I can rent a protocol
analyzer.

thanks
Nisha
