LKML Archive on lore.kernel.org
From: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
To: "mika.westerberg@linux.intel.com" <mika.westerberg@linux.intel.com>
Cc: Bjorn Helgaas <helgaas@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: Linux v5.5 serious PCI bug
Date: Tue, 10 Dec 2019 12:00:23 +0000
Message-ID: <PSXP216MB04384F89D9D9DDA6999347CF805B0@PSXP216MB0438.KORP216.PROD.OUTLOOK.COM>
In-Reply-To: <20191210072800.GY2665@lahna.fi.intel.com>

On Tue, Dec 10, 2019 at 09:28:00AM +0200, mika.westerberg@linux.intel.com wrote:
> On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> > On Mon, Dec 09, 2019 at 03:12:39PM +0200, mika.westerberg@linux.intel.com wrote:
> > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > Hi,
> > > > 
> > > > I have compiled Linux v5.5-rc1 and thought all was good until I 
> > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the 
> > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the 
> > > > GPU driver.
> > > > 
> > > > We had:
> > > > - kernel NULL pointer dereference
> > > > - refcount_t: underflow; use-after-free.
> > > > 
> > > > Attaching dmesg for now; will bisect and come back with results.
> > > 
> > > Looks like something related to iommu. Does it work if you disable it?
> > > (intel_iommu=off in the command line).
> >
> > I thought it could be that, too.
> > 
> > The attachment "dmesg-4" from the original email is with iommu parameters.
> > The attachment "dmesg-5" from the original email is with no iommu parameters.
> > Attaching here "dmesg-6" with the iommu explicitly set off like you said.
> > 
> > No difference, still broken. Although, with iommu off, there are fewer stack traces.
> > 
> > Could it be sysfs-related?
> 
> Bisect would probably be the best option to find the culprit commit.
> There are a couple of commits done for pciehp so reverting them one by one
> may help as well:
> 
>   87d0f2a5536f PCI: pciehp: Prevent deadlock on disconnect
>   75fcc0ce72e5 PCI: pciehp: Do not disable interrupt twice on suspend
>   b94ec12dfaee PCI: pciehp: Refactor infinite loop in pcie_poll_cmd()
>   157c1062fcd8 PCI: pciehp: Avoid returning prematurely from sysfs requests

You are not going to believe this. The offending commit is in the SOUND 
subsystem. I thought I had messed up the bisect when only sound commits 
were showing near the end.

And yes, I double checked.

Reverted, compiled, tested that it started working.
Reapplied, compiled, tested that it stopped working.
Twice.
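
For readers unfamiliar with the procedure, the bisect and the revert/reapply check described above can be sketched as a plain binary search. This is only an illustration under simplifying assumptions: the history is treated as a linear list (real git history is a graph, which git bisect handles), the bug is assumed to stay present once introduced, and the names first_bad_commit, confirmed_by_revert and the commit labels are hypothetical.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumptions: commits are ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and the bug never disappears again.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # bug already present: culprit is at mid or earlier
        else:
            lo = mid  # still good: culprit is after mid
    return commits[hi]


def confirmed_by_revert(works_when_reverted: Callable[[], bool],
                        works_when_applied: Callable[[], bool],
                        rounds: int = 2) -> bool:
    """Repeat the revert/reapply test; True only if every round agrees."""
    return all(works_when_reverted() and not works_when_applied()
               for _ in range(rounds))


if __name__ == "__main__":
    history = [f"c{i}" for i in range(8)]        # hypothetical commit labels
    culprit = first_bad_commit(history, lambda c: history.index(c) >= 5)
    print("first bad commit:", culprit)
```

Each step halves the range of suspect commits, which is why the tail end of a bisect can land in an unexpected subsystem: the search follows only the test result, not any guess about where the bug lives.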

The following is the culprit responsible for the issues:

commit 586bc4aab878efcf672536f0cdec3d04b6990c94
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Fri Nov 22 16:43:50 2019 -0500

    ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD

That commit touches PCI devices, and it apparently does not account for 
hot-removal. My guess is that it matches the audio PCI function of the 
AMD card in the Thunderbolt eGPU enclosure.
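
The "refcount_t: underflow; use-after-free." warning from the original report is the kind of symptom an unbalanced reference release produces. The sketch below is a toy userspace model of that pattern, not the ALSA or PCI code, and this message does not establish that the commit fails in exactly this way. ToyDevice, probe_buggy, probe_fixed and hot_remove are hypothetical names invented for the illustration.

```python
class RefcountUnderflow(RuntimeError):
    """Raised when a reference is dropped on an object with no references left."""


class ToyDevice:
    """Toy stand-in for a reference-counted device object."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.refs = 1        # reference held by whoever created the device
        self.freed = False

    def get(self) -> "ToyDevice":
        if self.freed:
            raise RefcountUnderflow(f"{self.name}: use-after-free in get()")
        self.refs += 1
        return self

    def put(self) -> None:
        if self.refs == 0:
            raise RefcountUnderflow(f"{self.name}: refcount underflow in put()")
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # last reference gone: object is released


def probe_buggy(dev: ToyDevice) -> None:
    ref = dev.get()
    ref.put()
    ref.put()  # duplicate put: silently drops the creator's reference too


def probe_fixed(dev: ToyDevice) -> None:
    ref = dev.get()
    ref.put()  # balanced: exactly one put per get


def hot_remove(dev: ToyDevice) -> None:
    dev.put()  # the bus drops the creator's reference on removal


if __name__ == "__main__":
    dev = ToyDevice("audio-function")
    probe_buggy(dev)
    try:
        hot_remove(dev)
    except RefcountUnderflow as exc:
        print("detected:", exc)
```

In this model the extra put goes unnoticed while the device stays plugged in; the imbalance only surfaces when removal drops what should have been the last reference.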

I will collect information, file a bugzilla report and contact the AMD 
team. If anybody wants to be cc'd, let me know. Sorry for bothering you 
and Bjorn with something which actually has nothing directly to do with 
the PCI subsystem or Thunderbolt.

I strongly hope that the upcoming Intel Xe GPU driver allows for 
surprise removal in Linux without crashing the kernel or userspace. 
The amdgpu and nouveau drivers do not take kindly to surprise removal, 
even without the above sound bug applying to AMD.

Kind regards,
Nicholas

Thread overview: 23+ messages
2019-12-09 12:34 Nicholas Johnson
2019-12-09 12:37 ` Pavel Machek
2019-12-09 13:07   ` Nicholas Johnson
2019-12-09 13:12 ` mika.westerberg
2019-12-09 13:29   ` Nicholas Johnson
2019-12-09 13:33   ` Nicholas Johnson
2019-12-10  7:28     ` mika.westerberg
2019-12-10 12:00       ` Nicholas Johnson [this message]
2019-12-10 12:29         ` Lukas Wunner
2019-12-10 12:46           ` Takashi Iwai
2019-12-11  7:33             ` Jiasen Lin
2019-12-10 12:52           ` Nicholas Johnson
2019-12-10 12:34         ` mika.westerberg
2019-12-10 13:39 ` [PATCH] ALSA: hda/hdmi - Fix duplicate unref of pci_dev Lukas Wunner
2019-12-10 13:41   ` Takashi Iwai
2019-12-10 13:47   ` Nicholas Johnson
2019-12-10 13:50     ` Takashi Iwai
2019-12-10 15:34   ` Deucher, Alexander
2019-12-10 15:46     ` Lukas Wunner
2019-12-10 15:53       ` Deucher, Alexander
2019-12-10 16:10         ` Takashi Iwai
2019-12-10 16:51           ` Deucher, Alexander
2019-12-10 16:13         ` Lukas Wunner
