All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Sinan Kaya <okaya@codeaurora.org>
Cc: Oza Pawandeep <poza@codeaurora.org>,
	Gabriele Paoloni <gabriele.paoloni@huawei.com>,
	linux-pci@vger.kernel.org, timur@codeaurora.org,
	linux-arm-kernel@lists.infradead.org,
	Dongdong Liu <liudongdong3@huawei.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-arm-msm@vger.kernel.org
Subject: Re: [PATCH v2 4/4] PCI/AER: Dont do recovery when DPC is enabled
Date: Fri, 17 Nov 2017 18:02:43 -0600	[thread overview]
Message-ID: <20171118000243.GC16907@bhelgaas-glaptop.roam.corp.google.com> (raw)
In-Reply-To: <16b9ac67-042d-1010-0c8c-205b2fa40a7f@codeaurora.org>

On Thu, Nov 16, 2017 at 03:52:47PM -0500, Sinan Kaya wrote:
> >>
> >> Whether the AER driver reads ~0 or not really depends on timing. The
> >> link may come up from the DPC driver by the time AER driver reaches
> >> here as an example.
> >>
> >> Bad things do happen. We have seen this with e1000e driver.
> > 
> > I don't doubt that bad things happen.  I'm just trying to understand
> > exactly *what* bad things happen and how, so we can fix them cleanly.
> 
> This was random crashes in the e1000e drivers accompanied with stack
> traces coming from WARN and msi allocation routines.

I didn't look in detail, but I'm not sure there's sufficient locking
in the AER path to make it safe from concurrent device removal.  I
suspect AER could be improved both with respect to handling ~0 data
and this potential concurrency issue.

> > So the "stop" and "recover" commands you mention must be related to
> > AER.  
> 
> I was talking about pci_stop_and_remove_bus_device() vs. error_detected()
> as "stop" and "recover"

Thanks for clearing that up!

> > I suspect this all probably requires tighter integration between DPC
> > and AER, and I'm totally fine with that.  I think the current
> > separation as separate "drivers" is pretty artificial anyway.
> 
> Got it. We will try to plumb DPC error handling into AER driver's error
> handling mechanism.

Looking at the AER code today, I noticed it already uses "DPC" in
another sense.  I don't know what it stands for there (probably
"deferred" something), but I don't think it's "Downstream Port
Containment" :)

> What do you think about the rescan following link up? The only entity
> that does rescan today is hotplug after DPC recovery. There could be
> a platform with DPC support but no hotplug support. 
> 
> How should we handle it?

Good question.  If your system does support both DPC and hotplug, I
assume the link comes back up after you clear DPC Trigger Status.
Does pciehp notice that "link up" event and add the device back?

So I think your question is whether the DPC code should explicitly do
a rescan so that if we don't have pciehp, we'll still automatically
rediscover the device.  I dunno, maybe.  Seems like a plausible idea
anyway.

Bjorn

WARNING: multiple messages have this Message-ID (diff)
From: Bjorn Helgaas <helgaas@kernel.org>
To: Sinan Kaya <okaya@codeaurora.org>
Cc: Oza Pawandeep <poza@codeaurora.org>,
	Gabriele Paoloni <gabriele.paoloni@huawei.com>,
	linux-pci@vger.kernel.org, timur@codeaurora.org,
	linux-arm-kernel@lists.infradead.org,
	Dongdong Liu <liudongdong3@huawei.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-arm-msm@vger.kernel.org
Subject: Re: [PATCH v2 4/4] PCI/AER: Dont do recovery when DPC is enabled
Date: Fri, 17 Nov 2017 18:02:43 -0600	[thread overview]
Message-ID: <20171118000243.GC16907@bhelgaas-glaptop.roam.corp.google.com> (raw)
In-Reply-To: <16b9ac67-042d-1010-0c8c-205b2fa40a7f@codeaurora.org>

On Thu, Nov 16, 2017 at 03:52:47PM -0500, Sinan Kaya wrote:
> >>
> >> Whether the AER driver reads ~0 or not really depends on timing. The
> >> link may come up from the DPC driver by the time AER driver reaches
> >> here as an example.
> >>
> >> Bad things do happen. We have seen this with e1000e driver.
> > 
> > I don't doubt that bad things happen.  I'm just trying to understand
> > exactly *what* bad things happen and how, so we can fix them cleanly.
> 
> This was random crashes in the e1000e drivers accompanied with stack
> traces coming from WARN and msi allocation routines.

I didn't look in detail, but I'm not sure there's sufficient locking
in the AER path to make it safe from concurrent device removal.  I
suspect AER could be improved both with respect to handling ~0 data
and this potential concurrency issue.

> > So the "stop" and "recover" commands you mention must be related to
> > AER.  
> 
> I was talking about pci_stop_and_remove_bus_device() vs. error_detected()
> as "stop" and "recover"

Thanks for clearing that up!

> > I suspect this all probably requires tighter integration between DPC
> > and AER, and I'm totally fine with that.  I think the current
> > separation as separate "drivers" is pretty artificial anyway.
> 
> Got it. We will try to plumb DPC error handling into AER driver's error
> handling mechanism.

Looking at the AER code today, I noticed it already uses "DPC" in
another sense.  I don't know what it stands for there (probably
"deferred" something), but I don't think it's "Downstream Port
Containment" :)

> What do you think about the rescan following link up? The only entity
> that does rescan today is hotplug after DPC recovery. There could be
> a platform with DPC support but no hotplug support. 
> 
> How should we handle it?

Good question.  If your system does support both DPC and hotplug, I
assume the link comes back up after you clear DPC Trigger Status.
Does pciehp notice that "link up" event and add the device back?

So I think your question is whether the DPC code should explicitly do
a rescan so that if we don't have pciehp, we'll still automatically
rediscover the device.  I dunno, maybe.  Seems like a plausible idea
anyway.

Bjorn

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: helgaas@kernel.org (Bjorn Helgaas)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH v2 4/4] PCI/AER: Dont do recovery when DPC is enabled
Date: Fri, 17 Nov 2017 18:02:43 -0600	[thread overview]
Message-ID: <20171118000243.GC16907@bhelgaas-glaptop.roam.corp.google.com> (raw)
In-Reply-To: <16b9ac67-042d-1010-0c8c-205b2fa40a7f@codeaurora.org>

On Thu, Nov 16, 2017 at 03:52:47PM -0500, Sinan Kaya wrote:
> >>
> >> Whether the AER driver reads ~0 or not really depends on timing. The
> >> link may come up from the DPC driver by the time AER driver reaches
> >> here as an example.
> >>
> >> Bad things do happen. We have seen this with e1000e driver.
> > 
> > I don't doubt that bad things happen.  I'm just trying to understand
> > exactly *what* bad things happen and how, so we can fix them cleanly.
> 
> This was random crashes in the e1000e drivers accompanied with stack
> traces coming from WARN and msi allocation routines.

I didn't look in detail, but I'm not sure there's sufficient locking
in the AER path to make it safe from concurrent device removal.  I
suspect AER could be improved both with respect to handling ~0 data
and this potential concurrency issue.

> > So the "stop" and "recover" commands you mention must be related to
> > AER.  
> 
> I was talking about pci_stop_and_remove_bus_device() vs. error_detected()
> as "stop" and "recover"

Thanks for clearing that up!

> > I suspect this all probably requires tighter integration between DPC
> > and AER, and I'm totally fine with that.  I think the current
> > separation as separate "drivers" is pretty artificial anyway.
> 
> Got it. We will try to plumb DPC error handling into AER driver's error
> handling mechanism.

Looking at the AER code today, I noticed it already uses "DPC" in
another sense.  I don't know what it stands for there (probably
"deferred" something), but I don't think it's "Downstream Port
Containment" :)

> What do you think about the rescan following link up? The only entity
> that does rescan today is hotplug after DPC recovery. There could be
> a platform with DPC support but no hotplug support. 
> 
> How should we handle it?

Good question.  If your system does support both DPC and hotplug, I
assume the link comes back up after you clear DPC Trigger Status.
Does pciehp notice that "link up" event and add the device back?

So I think your question is whether the DPC code should explicitly do
a rescan so that if we don't have pciehp, we'll still automatically
rediscover the device.  I dunno, maybe.  Seems like a plausible idea
anyway.

Bjorn

  reply	other threads:[~2017-11-18  0:02 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-15  4:56 [PATCH v2 0/4] PCI: query active service list Oza Pawandeep
2017-11-15  4:56 ` Oza Pawandeep
2017-11-15  4:56 ` Oza Pawandeep
2017-11-15  4:56 ` [PATCH v2 1/4] PCI: Add port service list node for pci_dev Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15  4:56 ` [PATCH v2 2/4] PCI/portdrv: Add/Remove port services to the list Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15  4:56 ` [PATCH v2 3/4] PCI/portdrv: Implement interface to query the registered service Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15  4:56 ` [PATCH v2 4/4] PCI/AER: Dont do recovery when DPC is enabled Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15  4:56   ` Oza Pawandeep
2017-11-15 21:14   ` Bjorn Helgaas
2017-11-15 21:14     ` Bjorn Helgaas
2017-11-15 21:14     ` Bjorn Helgaas
2017-11-16 14:03     ` Sinan Kaya
2017-11-16 14:03       ` Sinan Kaya
2017-11-16 20:17       ` Bjorn Helgaas
2017-11-16 20:17         ` Bjorn Helgaas
2017-11-16 20:17         ` Bjorn Helgaas
2017-11-16 20:52         ` Sinan Kaya
2017-11-16 20:52           ` Sinan Kaya
2017-11-18  0:02           ` Bjorn Helgaas [this message]
2017-11-18  0:02             ` Bjorn Helgaas
2017-11-18  0:02             ` Bjorn Helgaas
2017-11-19 16:41             ` Sinan Kaya
2017-11-19 16:41               ` Sinan Kaya
2017-11-21 16:25       ` David Laight
2017-11-21 16:25         ` David Laight
2017-11-21 16:25         ` David Laight
2017-11-21 16:43         ` Sinan Kaya
2017-11-21 16:43           ` Sinan Kaya

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171118000243.GC16907@bhelgaas-glaptop.roam.corp.google.com \
    --to=helgaas@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=gabriele.paoloni@huawei.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-arm-msm@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=liudongdong3@huawei.com \
    --cc=okaya@codeaurora.org \
    --cc=poza@codeaurora.org \
    --cc=tglx@linutronix.de \
    --cc=timur@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.