* Linux AER reporting
From: Nisha Miller @ 2016-08-22 15:52 UTC

Hi all,

We have a PCIe SSD controller using NVMe. This controller works on
Windows and Linux. However, we are seeing a problem under Linux.

In the Linux nvme driver, the function nvme_kthread() reads the CSTS
register once a second to check for controller status failure. In our
case we see that this register is occasionally read as 0xFFFFFFFF.
Whenever this happens, the kernel just hangs. This seems to be a PCIe
read error, and we are trying to gather further information. How does
one use Linux AER with the nvme driver?

We are using CentOS 7.2 with kernel 3.19.8. PCIe AER has been enabled
in the kernel and aerdriver.forceload=y is set on the command line.

TIA
Nisha Miller
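One way to double-check the command-line setting described above is to parse the command line of the running kernel. The following is only a sketch: the helper names `cmdline_params` and `aer_forceload_requested` are made up for illustration, and a matching parameter only shows that the option was requested at boot, not that the AER driver actually bound to the port.

```python
def cmdline_params(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value} mapping.

    Flags without '=' map to an empty string. Quoted values that
    contain spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def aer_forceload_requested(cmdline: str) -> bool:
    """Return True if aerdriver.forceload=y appears on the command line."""
    return cmdline_params(cmdline).get("aerdriver.forceload") == "y"
```

On a live system the input would typically come from /proc/cmdline, for example `aer_forceload_requested(open("/proc/cmdline").read())`.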
* Linux AER reporting
From: Keith Busch @ 2016-08-22 16:15 UTC

Hi Nisha,

The Linux NVMe driver didn't add AER support until this commit:

 | commit a0a3408ee614848c27b0d36c2fe490da3b387b8d
 | Author: Keith Busch <keith.busch at intel.com>
 | Date:   Mon Dec 7 15:30:31 2015 -0700
 |
 |     NVMe: Add pci error handlers

If you don't have the commit, AERs may cause problems for NVMe. I think
4.4 was the first kernel release to include it.

On Mon, Aug 22, 2016 at 08:52:10AM -0700, Nisha Miller wrote:
> How does one use Linux AER with the nvme driver?
>
> We are using Centos 7.2 with Kernel 3.19.8. PCIe AER has been enabled
> in the kernel and aerdriver.forceload=y is set in the command line.
* Linux AER reporting
From: Guilherme G. Piccoli @ 2016-08-22 18:10 UTC

On 08/22/2016 12:52 PM, Nisha Miller wrote:
> In the nvme Linux driver in function nvme_kthread() the CSTS register
> is read once a second to check for controller status failure. In our
> case we see that occasionally this register is read as 0xFFFFFFFF.
> Whenever this happens, the kernel just hangs. This seems to be PCIe
> read error and we are trying to gather further information. How does
> one use Linux AER with the nvme driver?

Nisha, we once saw 0xFFFF on the CSTS register after issuing a
reset_controller, for example. The reason was that device shutdown was
replaced by device disable when resetting the controller, following the
NVMe spec, but the device we were testing at the time didn't cope well
with this change.

To handle that, we implemented a quirk that waits a little before
reading this register on some occasions. The commit is:

54adc01055 ("nvme/quirk: Add a delay before checking for adapter readiness")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=54adc01055b75ec8769c5a36574c7a0895c0c0b2

I'm really not sure if it's related, but I guess it's worth a try.

Cheers,

Guilherme
* Linux AER reporting
From: Nisha Miller @ 2016-08-23 23:56 UTC

Hi Keith and Guilherme,

Thank you for your replies.

Kernel 4.4.19 does not seem to have an nvme driver with support for AER.
It is present in kernel 4.7, but getting that to work on CentOS 7.2 is
turning out to be quite a task. Arch Linux has kernel 4.7, so I will
give that a shot.

I should have mentioned that we get CSTS = 0xFFFFFFFF only after
millions of writes. When using fio, it runs for over 30 minutes before
the problem crops up.

BTW, I subscribed to the linux-nvme list but never got a confirmation
email. I don't get email from the list, but I'm able to post to it.

cheers
Nisha
* Linux AER reporting
From: Guilherme G. Piccoli @ 2016-08-24 14:02 UTC

On 08/23/2016 08:56 PM, Nisha Miller wrote:
> I should have mentioned that we get the CSTS = 0xFFFFFFFF only after
> millions of writes. When using fio, it runs for over 30 minutes before
> the problem crops up.

Hi Nisha, unfortunately the quirk I mentioned seems useless here, since
you're getting the error after multiple writes. I hope Keith can provide
more ideas for you!

By the way, do you have some logs to share? They would help to figure
out the situation, I guess.

Thanks,

Guilherme
* Linux AER reporting
From: Keith Busch @ 2016-08-24 14:40 UTC

On Wed, Aug 24, 2016 at 11:02:46AM -0300, Guilherme G. Piccoli wrote:
> On 08/23/2016 08:56 PM, Nisha Miller wrote:
> > I should have mentioned that we get the CSTS = 0xFFFFFFFF only after
> > millions of writes. When using fio, it runs for over 30 minutes before
> > the problem crops up.
>
> Hi Nisha, unfortunately the idea of the quirk I mentioned seems useless
> here, since you're getting the error after multiple writes. Hope Keith can
> provide more ideas for you!

An all-1s completion indicates the link is down. There should never be
a case where a functioning drive actually returns that from a read of
the CSTS register.

I'm not sure if PCIe AER has anything to do with this, though. Are you
injecting these sorts of errors? If you're just doing normal IO testing,
AER may not apply here.
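The distinction drawn above, between an all-1s read and a genuine controller status, can be sketched as a small decoder. This is an illustration only, not the driver's code: the function name `decode_csts` and the returned dictionary layout are hypothetical, while the bit positions (RDY in bit 0, CFS in bit 1, SHST in bits 3:2) follow the NVMe specification's definition of the CSTS register.

```python
ALL_ONES = 0xFFFFFFFF  # value seen when the read never reached the device


def decode_csts(value: int) -> dict:
    """Decode a 32-bit CSTS value read from the controller registers.

    An all-1s value is reported as unreadable instead of being decoded,
    because no functioning controller should return it.
    """
    if value == ALL_ONES:
        return {"readable": False}
    return {
        "readable": True,
        "rdy": bool(value & 0x1),      # bit 0: controller ready
        "cfs": bool(value & 0x2),      # bit 1: controller fatal status
        "shst": (value >> 2) & 0x3,    # bits 3:2: shutdown status
    }
```

Checking for all 1s before testing individual bits matters because 0xFFFFFFFF also has the CFS bit set, so a naive bit test would misreport a dead link as a controller fatal status.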
* Linux AER reporting
From: Nisha Miller @ 2016-08-24 17:13 UTC

Hi Keith,

I'm not injecting any errors. I'm trying to see if AER reporting can
help me diagnose the problem. Is this something AER can help me do, or
am I on the wrong track here?

thanks
Nisha Miller

On Wed, Aug 24, 2016 at 7:40 AM, Keith Busch <keith.busch@intel.com> wrote:
> An all 1's completion indicates the link is down. There should never be
> a case where a functioning drive actually returns that from a read to
> the CSTS register.
>
> I'm not sure if PCIe AER has anything do with this, though. Are you
> injecting these sorts of errors? If you're just doing normal IO testing,
> AER may not apply here.
* Linux AER reporting
From: Keith Busch @ 2016-08-25 15:06 UTC

On Wed, Aug 24, 2016 at 10:13:43AM -0700, Nisha Miller wrote:
> Hi Keith,
>
> I'm not injecting any errors. I'm trying to see if AER reporting can
> help me diagnose the problem. Is this something AER can help me do or
> am I on the wrong track here?

Is there anything else in dmesg when you observe the all-ff CSTS
register value? That status usually (only?) happens if the device link
is disconnected or broken.
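A quick way to answer the dmesg question is to filter a saved kernel log for lines that mention AER, NVMe, or PCIe. The sketch below assumes the log has already been captured as text (for example with `dmesg > dmesg.txt`); the function name `pcie_related_lines` and the keyword list are illustrative choices, not an exhaustive set of relevant messages.

```python
import re

# Match "aer" only as a whole word; "nvme" and "pcie" may be prefixes
# of longer tokens such as "nvme0" or "pcieport".
PATTERN = re.compile(r"\baer\b|nvme|pcie", re.IGNORECASE)


def pcie_related_lines(log_text: str) -> list:
    """Return the log lines that mention AER, NVMe, or PCIe."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]
```

An empty result around the time of the failure would match the "no messages" case, which is itself a useful data point when deciding whether a protocol analyzer is needed.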
* Linux AER reporting
From: Nisha Miller @ 2016-08-25 17:37 UTC

On Thu, Aug 25, 2016 at 8:06 AM, Keith Busch <keith.busch@intel.com> wrote:
> Is there anything else in the dmesg occuring when you observe the ff's
> CSTS register value? That status usually (only?) happens if the device
> link is disconnected or broken.

No messages are displayed in dmesg when CSTS is read as all FFs.

thanks
Nisha
* Linux AER reporting
From: Keith Busch @ 2016-08-25 17:49 UTC

On Thu, Aug 25, 2016 at 10:37:41AM -0700, Nisha Miller wrote:
> no messages are displayed in dmesg when CSTS is read as all FFs.

Have you got a protocol analyzer?
* Linux AER reporting
From: Nisha Miller @ 2016-08-25 18:08 UTC

On Thu, Aug 25, 2016 at 10:49 AM, Keith Busch <keith.busch@intel.com> wrote:
> Have you got a protocol analyzer?

No, I don't. I was hoping that AER would help me better understand
what is going on. But if that is not possible, I can rent a protocol
analyzer.

thanks
Nisha
end of thread (newest message: 2016-08-25 18:08 UTC)

Thread overview: 11+ messages
2016-08-22 15:52 Linux AER reporting Nisha Miller
2016-08-22 16:15 ` Keith Busch
2016-08-22 18:10 ` Guilherme G. Piccoli
2016-08-23 23:56   ` Nisha Miller
2016-08-24 14:02     ` Guilherme G. Piccoli
2016-08-24 14:40       ` Keith Busch
2016-08-24 17:13         ` Nisha Miller
2016-08-25 15:06           ` Keith Busch
2016-08-25 17:37             ` Nisha Miller
2016-08-25 17:49               ` Keith Busch
2016-08-25 18:08                 ` Nisha Miller