linux-pci.vger.kernel.org archive mirror
From: "Hoyer, David" <David.Hoyer@netapp.com>
To: "Hoyer, David" <David.Hoyer@netapp.com>, Lukas Wunner <lukas@wunner.de>
Cc: "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	Keith Busch <kbusch@kernel.org>
Subject: RE: Kernel hangs when powering up/down drive using sysfs
Date: Mon, 16 Mar 2020 21:35:17 +0000
Message-ID: <DM5PR06MB31328F1D3AFC3A1192CD1CE592F90@DM5PR06MB3132.namprd06.prod.outlook.com>
In-Reply-To: <DM5PR06MB31328A7B4E1A95A8C5E5E3E092F90@DM5PR06MB3132.namprd06.prod.outlook.com>

I ran the suggested experiment. The first interrupt reports a nonzero pending event (0x8 on power up, 0x10000 on power down). The second interrupt always reports zero. So it sounds like we are getting an interrupt indicating work complete.

Power up:
Mar 16 21:10:15 eos-a kernel: pending events x8
Mar 16 21:10:15 eos-a kernel: pciehp 0000:66:09.0:pcie204: Slot(9): Card present
Mar 16 21:10:15 eos-a kernel: pci 0000:6f:00.0: Max Payload Size set to 256 (was 128, max 256)
Mar 16 21:10:15 eos-a kernel: iommu: Adding device 0000:6f:00.0 to group 70
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: no space for [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: failed to assign [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: no space for [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: failed to assign [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pci 0000:6f:00.0: BAR 0: assigned [mem 0xe0b00000-0xe0b03fff 64bit]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: PCI bridge to [bus 6f]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0:   bridge window [mem 0xe0b00000-0xe0bfffff]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0:   bridge window [mem 0x3ac00c000000-0x3ac00dffffff 64bit pref]
Mar 16 21:10:15 eos-a kernel: pending events x0
Mar 16 21:10:16 eos-a kernel: vfio-pci 0000:6f:00.0: enabling device (0100 -> 0102)
Mar 16 21:10:16 eos-a kernel: pcieport 0000:64:00.0: can't derive routing for PCI INT A
Mar 16 21:10:16 eos-a kernel: vfio-pci 0000:6f:00.0: PCI INT A: not connected
Mar 16 21:10:16 eos-a kernel: vfio_ecap_init: 0000:6f:00.0 hiding ecap 0x19@0x178

Power down:
Mar 16 21:10:47 eos-a kernel: pending events x10000
Mar 16 21:10:47 eos-a kernel: iommu: Removing device 0000:6f:00.0 from group 70
Mar 16 21:10:48 eos-a kernel: pending events x0
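
For readers decoding the "pending events" values in the logs above, here is a minimal user-space sketch. The bit values are assumptions taken from my recollection of the Slot Status definitions in pci_regs.h and the software flags in pciehp.h around this kernel version; describe_pending_events() is a hypothetical helper, not a kernel function. Check the constants against your own tree before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Assumed values: Slot Status bits as in include/uapi/linux/pci_regs.h,
 * software-only flags as in drivers/pci/hotplug/pciehp.h. Verify locally.
 */
#define PCI_EXP_SLTSTA_ABP   0x0001u     /* Attention Button Pressed */
#define PCI_EXP_SLTSTA_PFD   0x0002u     /* Power Fault Detected */
#define PCI_EXP_SLTSTA_MRLSC 0x0004u     /* MRL Sensor Changed */
#define PCI_EXP_SLTSTA_PDC   0x0008u     /* Presence Detect Changed */
#define PCI_EXP_SLTSTA_CC    0x0010u     /* Command Completed */
#define PCI_EXP_SLTSTA_DLLSC 0x0100u     /* Data Link Layer State Changed */
#define DISABLE_SLOT         (1u << 16)  /* software request to disable slot */
#define RERUN_ISR            (1u << 17)  /* software request to rerun the ISR */

struct event_name {
	uint32_t bit;
	const char *name;
};

static const struct event_name event_names[] = {
	{ PCI_EXP_SLTSTA_ABP,   "ABP" },
	{ PCI_EXP_SLTSTA_PFD,   "PFD" },
	{ PCI_EXP_SLTSTA_MRLSC, "MRLSC" },
	{ PCI_EXP_SLTSTA_PDC,   "PDC" },
	{ PCI_EXP_SLTSTA_CC,    "CC" },
	{ PCI_EXP_SLTSTA_DLLSC, "DLLSC" },
	{ DISABLE_SLOT,         "DISABLE_SLOT" },
	{ RERUN_ISR,            "RERUN_ISR" },
};

/*
 * Hypothetical helper: write the names of the set bits into buf, joined
 * by '|'. Writes "none" if no bit from the table above is set.
 */
const char *describe_pending_events(uint32_t events, char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(event_names) / sizeof(event_names[0]); i++) {
		size_t used = strlen(buf);
		int n;

		if (!(events & event_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", event_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* error or buffer full: stop appending */
	}
	if (buf[0] == '\0')
		snprintf(buf, len, "none");
	return buf;
}
```

Under these assumptions, x8 from the power-up log decodes to PDC and x10000 from the power-down log decodes to DISABLE_SLOT, while the second interrupt in each case (x0) carries no event at all.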

-----Original Message-----
From: linux-pci-owner@vger.kernel.org <linux-pci-owner@vger.kernel.org> On Behalf Of Hoyer, David
Sent: Monday, March 16, 2020 1:26 PM
To: Lukas Wunner <lukas@wunner.de>
Cc: linux-pci@vger.kernel.org; Keith Busch <kbusch@kernel.org>
Subject: RE: Kernel hangs when powering up/down drive using sysfs

We were not sure about the return a few lines up, so we did not add the two lines there.
I will try what you suggested to better understand why we are getting the extra interrupt.

I am not as familiar with submitting a "proper patch" and ask that you do it if you would be so kind.

-----Original Message-----
From: Lukas Wunner <lukas@wunner.de>
Sent: Monday, March 16, 2020 1:20 PM
To: Hoyer, David <David.Hoyer@netapp.com>
Cc: linux-pci@vger.kernel.org; Keith Busch <kbusch@kernel.org>
Subject: Re: Kernel hangs when powering up/down drive using sysfs

On Sat, Mar 14, 2020 at 02:19:44PM +0000, Hoyer, David wrote:
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -637,6 +637,8 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
>         events = atomic_xchg(&ctrl->pending_events, 0);
>         if (!events) {
>                 pci_config_pm_runtime_put(pdev);
> +               ctrl->ist_running = false;
> +               wake_up(&ctrl->requester);
>                 return IRQ_NONE;
>        }

Thanks David for the report and sorry for the breakage.

The above LGTM, please submit it as a proper patch and feel free to add my Reviewed-by.  Please add the same two lines before the "return ret" a little further up in the function.

If it's too cumbersome for you to submit a proper patch I can do it for you.


> We've instrumented the code and we do see that pciehp_ist() runs 
> twice, once exiting with IRQ_HANDLED and then again with IRQ_NONE.
> We believe that is due to the timing differences.  Adding debug in 
> here changes the timings enough that the hang goes away, so we are 
> having troubles proving this 100% at the moment.  But just based on 
> code inspection, if pciehp_ist() exits with the IRQ_NONE case, then 
> nothing will ever set ist_running=false until a subsequent hotplug 
> event happens that causes the IRQ_HANDLED case to run.  (We were able 
> to prove that will cause things to "unhang" and progress at that point
> - if you're hung and you remove a drive, the slot status change will 
> then unstick things.)
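
The code-inspection argument quoted above can be illustrated with a small user-space model. This is only a sketch: the model_* names are hypothetical, the real handler does much more, and the requester's wait condition (no pending events and ist_running clear) is an assumption about how the sysfs path waits, not something shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_irqreturn { MODEL_IRQ_NONE, MODEL_IRQ_HANDLED };

/* Hypothetical stand-in for the two controller fields discussed here. */
struct model_ctrl {
	uint32_t pending_events;
	bool ist_running;
};

/*
 * Simplified model of the threaded handler. with_fix selects whether the
 * early-return path clears ist_running, as the proposed two lines do.
 */
enum model_irqreturn model_ist(struct model_ctrl *ctrl, bool with_fix)
{
	uint32_t events;

	assert(ctrl != NULL);
	ctrl->ist_running = true;

	/* Stands in for atomic_xchg(&ctrl->pending_events, 0). */
	events = ctrl->pending_events;
	ctrl->pending_events = 0;

	if (!events) {
		if (with_fix)
			ctrl->ist_running = false;
		return MODEL_IRQ_NONE;	/* without the fix, flag stays set */
	}

	/* Event handling would happen here. */
	ctrl->ist_running = false;
	return MODEL_IRQ_HANDLED;
}

/* Assumed wait condition of a sysfs requester. */
bool model_requester_may_proceed(const struct model_ctrl *ctrl)
{
	return ctrl->pending_events == 0 && !ctrl->ist_running;
}
```

In this model, a run with events pending followed by a run with none leaves ist_running set unless with_fix is true, so the requester's condition never becomes true until a later event takes the IRQ_HANDLED path. That matches the observation that removing a drive unsticks things.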

The question is why pciehp_ist() runs once more. Most likely it is because another event is signaled from the slot. Try adding a printk() at the top of pciehp_ist() which emits ctrl->pending_events to understand what's going on.

Thanks,

Lukas

Thread overview: 13+ messages
2020-03-14 14:19 Kernel hangs when powering up/down drive using sysfs Hoyer, David
2020-03-16 16:15 ` Keith Busch
2020-03-16 18:10   ` Lukas Wunner
2020-03-16 18:42     ` Keith Busch
2020-03-18 11:53       ` Lukas Wunner
2020-03-16 18:19 ` Lukas Wunner
2020-03-16 18:25   ` Hoyer, David
2020-03-16 21:35     ` Hoyer, David [this message]
2020-03-18 11:49     ` Lukas Wunner
2020-03-18 14:06       ` Hoyer, David
2020-03-18 11:33 ` [PATCH] PCI: pciehp: Fix indefinite wait on sysfs requests Lukas Wunner
2020-03-18 16:43   ` Keith Busch
2020-03-28 20:25   ` Bjorn Helgaas
