From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Karol Herbst <kherbst@redhat.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Lyude Paul <lyude@redhat.com>,
	"Rafael J . Wysocki" <rjw@rjwysocki.net>,
	Mika Westerberg <mika.westerberg@intel.com>,
	linux-pci@vger.kernel.org, linux-pm@vger.kernel.org,
	dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
	Ben Skeggs <bskeggs@redhat.com>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH AUTOSEL 5.4 30/78] drm/nouveau: workaround runpm fail by disabling PCI power management on certain intel bridges
Date: Sat, 18 Apr 2020 10:39:59 -0400
Message-ID: <20200418144047.9013-30-sashal@kernel.org>
In-Reply-To: <20200418144047.9013-1-sashal@kernel.org>

From: Karol Herbst <kherbst@redhat.com>

[ Upstream commit 434fdb51513bf3057ac144d152e6f2f2b509e857 ]

Fixes the infamous 'runtime PM' bug many users are facing on laptops with
Nvidia Pascal GPUs by skipping the PCI power state changes on the GPU.

Depending on the kernel in use, there might be messages like these in dmesg:

"nouveau 0000:01:00.0: Refused to change power state, currently in D3"
"nouveau 0000:01:00.0: can't change power state from D3cold to D0 (config
space inaccessible)"
followed by backtraces of kernel crashes or timeouts within nouveau.

It's still unknown why this issue exists, but this is a reliable workaround
and solves a very annoying issue for users who have to choose between a
crashing kernel and higher power consumption of their laptops.

Signed-off-by: Karol Herbst <kherbst@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Mika Westerberg <mika.westerberg@intel.com>
Cc: linux-pci@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205623
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 63 +++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_drv.h |  2 +
 2 files changed, 65 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 2cd83849600f3..b1beed40e746a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -618,6 +618,64 @@ nouveau_drm_device_fini(struct drm_device *dev)
 	kfree(drm);
 }
 
+/*
+ * On some Intel PCIe bridge controllers doing a
+ * D0 -> D3hot -> D3cold -> D0 sequence causes Nvidia GPUs to not reappear.
+ * Skipping the intermediate D3hot step seems to make it work again. This is
+ * probably caused by not meeting the expectation the involved AML code has
+ * when the GPU is put into D3hot state before invoking it.
+ *
+ * This leads to various manifestations of this issue:
+ *  - AML code execution to power on the GPU hits an infinite loop (as the
+ *    code waits on device memory to change).
+ *  - kernel crashes, as all PCI reads return -1, which most code isn't able
+ *    to handle well enough.
+ *
+ * In all cases dmesg will contain at least one line like this:
+ * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
+ * followed by a lot of nouveau timeouts.
+ *
+ * In the \_SB.PCI0.PEG0.PG00._OFF code deeper down writes bit 0x80 to the not
+ * documented PCI config space register 0x248 of the Intel PCIe bridge
+ * controller (0x1901) in order to change the state of the PCIe link between
+ * the PCIe port and the GPU. There are alternative code paths using other
+ * registers, which seem to work fine (executed pre Windows 8):
+ *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
+ *  - 0xb0 bit 0x10 (link disable)
+ * Changing the conditions inside the firmware by poking into the relevant
+ * addresses does resolve the issue, but it seemed to be ACPI private memory
+ * and not any device accessible memory at all, so there is no portable way of
+ * changing the conditions.
+ * On a XPS 9560 that means bits [0,3] on \CPEX need to be cleared.
+ *
+ * The only systems where this behavior can be seen are hybrid graphics laptops
+ * with a secondary Nvidia Maxwell, Pascal or Turing GPU. It's unclear whether
+ * this issue only occurs in combination with listed Intel PCIe bridge
+ * controllers and the mentioned GPUs or other devices as well.
+ *
+ * documentation on the PCIe bridge controller can be found in the
+ * "7th Generation Intel® Processor Families for H Platforms Datasheet Volume 2"
+ * Section "12 PCI Express* Controller (x16) Registers"
+ */
+
+static void quirk_broken_nv_runpm(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct nouveau_drm *drm = nouveau_drm(dev);
+	struct pci_dev *bridge = pci_upstream_bridge(pdev);
+
+	if (!bridge || bridge->vendor != PCI_VENDOR_ID_INTEL)
+		return;
+
+	switch (bridge->device) {
+	case 0x1901:
+		drm->old_pm_cap = pdev->pm_cap;
+		pdev->pm_cap = 0;
+		NV_INFO(drm, "Disabling PCI power management to avoid bug\n");
+		break;
+	}
+}
+
 static int nouveau_drm_probe(struct pci_dev *pdev,
 			     const struct pci_device_id *pent)
 {
@@ -699,6 +757,7 @@ static int nouveau_drm_probe(struct pci_dev *pdev,
 	if (ret)
 		goto fail_drm_dev_init;
 
+	quirk_broken_nv_runpm(pdev);
 	return 0;
 
 fail_drm_dev_init:
@@ -736,7 +795,11 @@ static void
 nouveau_drm_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct nouveau_drm *drm = nouveau_drm(dev);
 
+	/* revert our workaround */
+	if (drm->old_pm_cap)
+		pdev->pm_cap = drm->old_pm_cap;
 	nouveau_drm_device_remove(dev);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.h b/drivers/gpu/drm/nouveau/nouveau_drv.h
index 70f34cacc552c..8104e3806499d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.h
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.h
@@ -138,6 +138,8 @@ struct nouveau_drm {
 
 	struct list_head clients;
 
+	u8 old_pm_cap;
+
 	struct {
 		struct agp_bridge_data *bridge;
 		u32 base;
-- 
2.20.1
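
For readers who want to follow the save/disable/restore flow of the patch
without a kernel tree at hand, here is a minimal userspace sketch of the same
logic. It is only a model: the model_* structs and functions are hypothetical
stand-ins for struct pci_dev, struct nouveau_drm, quirk_broken_nv_runpm() and
the revert in nouveau_drm_remove(), and the capability offset 0x60 is an
arbitrary example value. It assumes, as the patch does, that a pm_cap of 0
makes the PCI core treat the device as having no PCI power management
capability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_VENDOR_INTEL 0x8086 /* same value as PCI_VENDOR_ID_INTEL */

/* Hypothetical stand-in for the few struct pci_dev fields the quirk uses. */
struct model_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint8_t pm_cap;                  /* PM capability offset, 0 = none */
	struct model_pci_dev *upstream;  /* models pci_upstream_bridge() */
};

/* Hypothetical stand-in for the field added to struct nouveau_drm. */
struct model_drm {
	uint8_t old_pm_cap;
};

/* Mirrors quirk_broken_nv_runpm(): hide the PM capability behind 0x1901. */
static void model_quirk_broken_nv_runpm(struct model_pci_dev *pdev,
					struct model_drm *drm)
{
	struct model_pci_dev *bridge = pdev->upstream;

	if (!bridge || bridge->vendor != MODEL_VENDOR_INTEL)
		return;

	switch (bridge->device) {
	case 0x1901:
		drm->old_pm_cap = pdev->pm_cap;
		pdev->pm_cap = 0;
		break;
	}
}

/* Mirrors the revert in nouveau_drm_remove(). */
static void model_remove(struct model_pci_dev *pdev, struct model_drm *drm)
{
	if (drm->old_pm_cap)
		pdev->pm_cap = drm->old_pm_cap;
}

/* Runs probe -> remove on an affected pair and returns the final pm_cap. */
static uint8_t model_demo(void)
{
	struct model_pci_dev bridge = { .vendor = MODEL_VENDOR_INTEL,
					.device = 0x1901 };
	struct model_pci_dev gpu = { .vendor = 0x10de, .pm_cap = 0x60,
				     .upstream = &bridge };
	struct model_drm drm = { 0 };

	model_quirk_broken_nv_runpm(&gpu, &drm);
	assert(gpu.pm_cap == 0);         /* PM capability hidden while bound */

	model_remove(&gpu, &drm);
	return gpu.pm_cap;               /* original offset restored */
}
```

In this model, a GPU behind any other bridge (or with no upstream bridge at
all) keeps its pm_cap untouched, and old_pm_cap stays 0, so the remove path
has nothing to restore.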


