From: "Alex G." <mr.nuke.me@gmail.com>
To: okaya@codeaurora.org
Cc: Alex_Gagniuc@dellteam.com, linux-pci@vger.kernel.org,
	shyam_iyer@dell.com, linux-nvme@lists.infradead.org,
	Keith Busch <keith.busch@intel.com>,
	austin_bolen@dell.com, linux-pci-owner@vger.kernel.org
Subject: Re: AER: Malformed TLP recovery deadlock with NVMe drives
Date: Mon, 7 May 2018 15:16:02 -0500
Message-ID: <5c97a7c2-cb53-4740-fda0-50ba92288c5c@gmail.com>
In-Reply-To: <7afd280ad80a73b39e6c9b9a9e29abcc@codeaurora.org>

On 05/07/2018 01:46 PM, okaya@codeaurora.org wrote:
> On 2018-05-07 19:36, Alex G. wrote:
>> Hi! Me again!
>>
>> I'm seeing what appears to be a deadlock in the AER recovery path. It
>> seems that the device_lock() call in report_slot_reset() never returns.
>> How we get there is interesting:
> 
> Can you give this patch a try?
> 
Oh! Patches so soon? Yay!

> https://patchwork.kernel.org/patch/10351515/

Thank you! I tried a few runs. There was one run where we didn't lock
up, but the other runs all went like before.

For comparison, the run that didn't deadlock looked like [2].

Alex

[2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log

>> Injection of the error happens by changing the maximum payload size to
>> 128 bytes from 256. This is on the switch upstream port.
>> When there's IO to the drive, switch sees a malformed TLP. Switch
>> reports error, AER handles it.
>> More IO goes, another error is triggered, and this time the root port
>> reports it. AER recovery hits the NVMe drive behind the affected
>> upstream port and deadlocks.
>>
>> I've added some printks in the AER handler to see which lock dies, and I
>> have a fairly succinct log[1], also pasted below. It appears somebody is
>> already holding the lock to the PCI device behind the sick upstream
>> port, and never releases it.
>>
>>
>> I'm not sure how to track down other users of the lock, and I'm hoping
>> somebody may have a brighter idea.
>>
>> Alex
>>
>>
>> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log
>>
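
For readers following the quoted report, here is a minimal sketch of why
lowering the maximum payload size on the switch upstream port from 256 to
128 bytes produces a malformed TLP. It is not taken from the thread. It
assumes the standard PCIe layout, where Max_Payload_Size is a 3-bit field
in bits 7:5 of the Device Control register and encodes 128 << n bytes, and
it assumes traffic toward the port still carries 256-byte payloads. The
helper names and the example register value are hypothetical, and how the
register was actually rewritten on the test system is not stated in the
thread.

```python
# Sketch only: models the Max_Payload_Size (MPS) field, not real config access.
# Assumption: MPS occupies bits 7:5 of the PCIe Device Control register.
DEVCTL_MPS_MASK = 0x00E0
DEVCTL_MPS_SHIFT = 5


def mps_to_field(mps_bytes):
    """Encode a payload size in bytes as the 3-bit MPS field (128 -> 0, 256 -> 1)."""
    is_power_of_two = mps_bytes > 0 and (mps_bytes & (mps_bytes - 1)) == 0
    if not is_power_of_two or not 128 <= mps_bytes <= 4096:
        raise ValueError("MPS must be a power of two between 128 and 4096")
    return mps_bytes.bit_length() - 8


def set_mps(devctl, mps_bytes):
    """Return a 16-bit Device Control value with only the MPS field replaced."""
    cleared = devctl & ~DEVCTL_MPS_MASK & 0xFFFF
    return cleared | (mps_to_field(mps_bytes) << DEVCTL_MPS_SHIFT)


def get_mps(devctl):
    """Decode the MPS field of a Device Control value back into bytes."""
    return 128 << ((devctl & DEVCTL_MPS_MASK) >> DEVCTL_MPS_SHIFT)


def is_malformed(payload_bytes, receiver_devctl):
    """A receiver treats a TLP whose payload exceeds its own MPS as malformed."""
    return payload_bytes > get_mps(receiver_devctl)


# Hypothetical register value: MPS = 256 with the low error-reporting bits set.
before = 0x002F
after = set_mps(before, 128)
```

With these values, `get_mps(before)` is 256 and `get_mps(after)` is 128, so
`is_malformed(256, after)` is true while `is_malformed(128, after)` is
false: once the port's limit drops, any 256-byte write still in flight
exceeds it, which matches the error the switch reports in the description
above.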