From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Karol Herbst <kherbst@redhat.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Lyude Paul <lyude@redhat.com>,
	"Rafael J . Wysocki" <rjw@rjwysocki.net>,
	Mika Westerberg <mika.westerberg@intel.com>,
	linux-pci@vger.kernel.org, linux-pm@vger.kernel.org,
	dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
	Ben Skeggs <bskeggs@redhat.com>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH AUTOSEL 5.4 55/84] drm/nouveau: workaround runpm fail by disabling PCI power management on certain intel bridges
Date: Wed, 15 Apr 2020 07:44:12 -0400
Message-ID: <20200415114442.14166-55-sashal@kernel.org>
In-Reply-To: <20200415114442.14166-1-sashal@kernel.org>

From: Karol Herbst <kherbst@redhat.com>

[ Upstream commit 434fdb51513bf3057ac144d152e6f2f2b509e857 ]

Fixes the infamous 'runtime PM' bug many users are facing on laptops with
Nvidia Pascal GPUs by skipping the PCI power state changes on the GPU.

Depending on the kernel in use, dmesg may contain messages like these:

"nouveau 0000:01:00.0: Refused to change power state, currently in D3"
"nouveau 0000:01:00.0: can't change power state from D3cold to D0 (config
space inaccessible)"
followed by backtraces of kernel crashes or timeouts within nouveau.

It's still unknown why this issue exists, but this is a reliable workaround
and solves a very annoying issue for users who have to choose between a
crashing kernel and higher power consumption on their laptops.

Signed-off-by: Karol Herbst <kherbst@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Mika Westerberg <mika.westerberg@intel.com>
Cc: linux-pci@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205623
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 63 +++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_drv.h |  2 +
 2 files changed, 65 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 2cd83849600f3..b1beed40e746a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -618,6 +618,64 @@ nouveau_drm_device_fini(struct drm_device *dev)
 	kfree(drm);
 }
 
+/*
+ * On some Intel PCIe bridge controllers doing a
+ * D0 -> D3hot -> D3cold -> D0 sequence causes Nvidia GPUs to not reappear.
+ * Skipping the intermediate D3hot step seems to make it work again. This is
+ * probably caused by not meeting the expectation the involved AML code has
+ * when the GPU is put into D3hot state before invoking it.
+ *
+ * This leads to various manifestations of this issue:
+ *  - AML code execution to power on the GPU hits an infinite loop (as the
+ *    code waits on device memory to change).
+ *  - kernel crashes, as all PCI reads return -1, which most code isn't able
+ *    to handle well enough.
+ *
+ * In all cases dmesg will contain at least one line like this:
+ * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
+ * followed by a lot of nouveau timeouts.
+ *
+ * Code deeper down in \_SB.PCI0.PEG0.PG00._OFF writes bit 0x80 to the
+ * undocumented PCI config space register 0x248 of the Intel PCIe bridge
+ * controller (0x1901) in order to change the state of the PCIe link between
+ * the PCIe port and the GPU. There are alternative code paths using other
+ * registers, which seem to work fine (executed pre Windows 8):
+ *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
+ *  - 0xb0 bit 0x10 (link disable)
+ * Changing the conditions inside the firmware by poking into the relevant
+ * addresses does resolve the issue, but it seemed to be ACPI private memory
+ * and not any device accessible memory at all, so there is no portable way of
+ * changing the conditions.
+ * On a XPS 9560 that means bits [0,3] on \CPEX need to be cleared.
+ *
+ * The only systems where this behavior can be seen are hybrid graphics laptops
+ * with a secondary Nvidia Maxwell, Pascal or Turing GPU. It's unclear whether
+ * this issue only occurs in combination with listed Intel PCIe bridge
+ * controllers and the mentioned GPUs or other devices as well.
+ *
+ * documentation on the PCIe bridge controller can be found in the
+ * "7th Generation Intel® Processor Families for H Platforms Datasheet Volume 2"
+ * Section "12 PCI Express* Controller (x16) Registers"
+ */
+
+static void quirk_broken_nv_runpm(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct nouveau_drm *drm = nouveau_drm(dev);
+	struct pci_dev *bridge = pci_upstream_bridge(pdev);
+
+	if (!bridge || bridge->vendor != PCI_VENDOR_ID_INTEL)
+		return;
+
+	switch (bridge->device) {
+	case 0x1901:
+		drm->old_pm_cap = pdev->pm_cap;
+		pdev->pm_cap = 0;
+		NV_INFO(drm, "Disabling PCI power management to avoid bug\n");
+		break;
+	}
+}
+
 static int nouveau_drm_probe(struct pci_dev *pdev,
 			     const struct pci_device_id *pent)
 {
@@ -699,6 +757,7 @@ static int nouveau_drm_probe(struct pci_dev *pdev,
 	if (ret)
 		goto fail_drm_dev_init;
 
+	quirk_broken_nv_runpm(pdev);
 	return 0;
 
 fail_drm_dev_init:
@@ -736,7 +795,11 @@ static void
 nouveau_drm_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct nouveau_drm *drm = nouveau_drm(dev);
 
+	/* revert our workaround */
+	if (drm->old_pm_cap)
+		pdev->pm_cap = drm->old_pm_cap;
 	nouveau_drm_device_remove(dev);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.h b/drivers/gpu/drm/nouveau/nouveau_drv.h
index 70f34cacc552c..8104e3806499d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.h
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.h
@@ -138,6 +138,8 @@ struct nouveau_drm {
 
 	struct list_head clients;
 
+	u8 old_pm_cap;
+
 	struct {
 		struct agp_bridge_data *bridge;
 		u32 base;
-- 
2.20.1
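
The save/clear/restore handling of pm_cap in the patch above can be tried
outside the kernel with a minimal user-space sketch. Everything prefixed
with fake_ or FAKE_ is a hypothetical stand-in invented for illustration and
is not kernel API; only the Intel vendor ID (0x8086), the bridge device ID
(0x1901) and the old_pm_cap logic are taken from the patch. It assumes, as
the patch's log message suggests, that a pm_cap of 0 means "no PCI power
management capability".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_VENDOR_ID_INTEL 0x8086u /* same value as PCI_VENDOR_ID_INTEL */
#define FAKE_BRIDGE_AFFECTED 0x1901u /* bridge device ID listed in the patch */

/* Hypothetical, heavily reduced stand-in for struct pci_dev. */
struct fake_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint8_t pm_cap;              /* PM capability offset, 0 = none */
	struct fake_pci_dev *bridge; /* upstream bridge, NULL if none */
};

/* Hypothetical stand-in for the one field added to struct nouveau_drm. */
struct fake_drm {
	uint8_t old_pm_cap;          /* 0 = workaround not applied */
};

/* Mirrors quirk_broken_nv_runpm(): hide the PM capability on affected bridges. */
static void fake_quirk_apply(struct fake_drm *drm, struct fake_pci_dev *pdev)
{
	const struct fake_pci_dev *bridge;

	assert(drm != NULL && pdev != NULL);
	bridge = pdev->bridge;

	if (!bridge || bridge->vendor != FAKE_VENDOR_ID_INTEL)
		return;

	if (bridge->device == FAKE_BRIDGE_AFFECTED) {
		drm->old_pm_cap = pdev->pm_cap; /* remember the real offset */
		pdev->pm_cap = 0;               /* pretend there is no PM cap */
	}
}

/* Mirrors the revert in nouveau_drm_remove(): restore only if we saved one. */
static void fake_quirk_revert(struct fake_drm *drm, struct fake_pci_dev *pdev)
{
	assert(drm != NULL && pdev != NULL);

	if (drm->old_pm_cap)
		pdev->pm_cap = drm->old_pm_cap;
}
```

Calling fake_quirk_apply() on a device behind a 0x8086:0x1901 bridge zeroes
its pm_cap and stores the previous value; fake_quirk_revert() puts it back.
A device behind any other bridge is left untouched, so the revert is a no-op
because old_pm_cap stays 0.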


