linux-pci.vger.kernel.org archive mirror
* [PATCH V2 0/2] PCI: Add two Loongson's LS7A quirks
@ 2023-01-06  9:51 Huacai Chen
  2023-01-06  9:51 ` [PATCH V2 1/2] PCI: loongson: Improve the MRRS quirk for LS7A Huacai Chen
  2023-01-06  9:51 ` [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
  0 siblings, 2 replies; 16+ messages in thread
From: Huacai Chen @ 2023-01-06  9:51 UTC (permalink / raw)
  To: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring, Krzysztof Wilczyński
  Cc: linux-pci, Jianmin Lv, Xuefeng Li, Huacai Chen, Jiaxun Yang,
	Huacai Chen, Tiezhu Yang

This patchset adds two quirks to resolve problems with Loongson's LS7A
chipset: the first patch improves the MRRS quirk, and the second patch
adds a new quirk to avoid poweroff/reboot failures.

V1 -> V2:
1. Update commit messages and comments.

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> 
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---
 drivers/pci/controller/pci-loongson.c | 61 ++++++++++++++++++-----------------
 drivers/pci/pci.c                     |  6 ++++
 drivers/pci/pcie/portdrv.c            | 21 ++++++++++--
 include/linux/pci.h                   |  2 ++
 4 files changed, 59 insertions(+), 31 deletions(-)
--
2.27.0



* [PATCH V2 1/2] PCI: loongson: Improve the MRRS quirk for LS7A
  2023-01-06  9:51 [PATCH V2 0/2] PCI: Add two Loongson's LS7A quirks Huacai Chen
@ 2023-01-06  9:51 ` Huacai Chen
  2023-01-31  0:16   ` Bjorn Helgaas
  2023-01-06  9:51 ` [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
  1 sibling, 1 reply; 16+ messages in thread
From: Huacai Chen @ 2023-01-06  9:51 UTC (permalink / raw)
  To: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring, Krzysztof Wilczyński
  Cc: linux-pci, Jianmin Lv, Xuefeng Li, Huacai Chen, Jiaxun Yang, Huacai Chen

In new revisions of the LS7A, some PCIe ports support MRRS values larger
than 256, but their maximum supported MRRS values are not detectable.
Moreover, the current loongson_mrrs_quirk() cannot prevent devices from
increasing their MRRS after pci_enable_device(), and some devices (e.g.
Realtek 8169) do set a large value in their drivers. So the only possible
way is to configure the MRRS of all devices in the BIOS, and add a PCI
host bridge bit flag (i.e., no_inc_mrrs) to reject later attempts to
increase MRRS.

However, according to the PCIe Spec, it is legal for an OS to program any
value for MRRS, and it is also legal for an endpoint to generate a Read
Request with any size up to its MRRS. As the hardware engineers say, the
root cause here is that the LS7A doesn't break up large read requests. In
detail, an LS7A PCIe port reports CA (Completer Abort) if it receives a
Memory Read request with a size that's "too big" ("too big" means larger
than the PCIe port can handle, which means 256 bytes for some ports and
4096 bytes for the others; this is a problem in the LS7A's hardware
design).

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
 drivers/pci/controller/pci-loongson.c | 44 +++++++++------------------
 drivers/pci/pci.c                     |  6 ++++
 include/linux/pci.h                   |  1 +
 3 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
index 05c50408f13b..759ec211c17b 100644
--- a/drivers/pci/controller/pci-loongson.c
+++ b/drivers/pci/controller/pci-loongson.c
@@ -75,37 +75,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 			DEV_LS7A_LPC, system_bus_quirk);
 
-static void loongson_mrrs_quirk(struct pci_dev *dev)
+static void loongson_mrrs_quirk(struct pci_dev *pdev)
 {
-	struct pci_bus *bus = dev->bus;
-	struct pci_dev *bridge;
-	static const struct pci_device_id bridge_devids[] = {
-		{ PCI_VDEVICE(LOONGSON, DEV_PCIE_PORT_0) },
-		{ PCI_VDEVICE(LOONGSON, DEV_PCIE_PORT_1) },
-		{ PCI_VDEVICE(LOONGSON, DEV_PCIE_PORT_2) },
-		{ 0, },
-	};
-
-	/* look for the matching bridge */
-	while (!pci_is_root_bus(bus)) {
-		bridge = bus->self;
-		bus = bus->parent;
-		/*
-		 * Some Loongson PCIe ports have a h/w limitation of
-		 * 256 bytes maximum read request size. They can't handle
-		 * anything larger than this. So force this limit on
-		 * any devices attached under these ports.
-		 */
-		if (pci_match_id(bridge_devids, bridge)) {
-			if (pcie_get_readrq(dev) > 256) {
-				pci_info(dev, "limiting MRRS to 256\n");
-				pcie_set_readrq(dev, 256);
-			}
-			break;
-		}
-	}
+	/*
+	 * Some Loongson PCIe ports have h/w limitations of maximum read
+	 * request size. They can't handle anything larger than this. So
+	 * force this limit on any devices attached under these ports.
+	 */
+	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
+
+	bridge->no_inc_mrrs = 1;
 }
-DECLARE_PCI_FIXUP_ENABLE(PCI_ANY_ID, PCI_ANY_ID, loongson_mrrs_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_0, loongson_mrrs_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_1, loongson_mrrs_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_2, loongson_mrrs_quirk);
 
 static void loongson_pci_pin_quirk(struct pci_dev *pdev)
 {
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index fba95486caaf..ae88210a12c7 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6033,6 +6033,7 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 {
 	u16 v;
 	int ret;
+	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
 
 	if (rq < 128 || rq > 4096 || !is_power_of_2(rq))
 		return -EINVAL;
@@ -6051,6 +6052,11 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 
 	v = (ffs(rq) - 8) << 12;
 
+	if (bridge->no_inc_mrrs) {
+		if (rq > pcie_get_readrq(dev))
+			return -EINVAL;
+	}
+
 	ret = pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
 						  PCI_EXP_DEVCTL_READRQ, v);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index adffd65e84b4..3df2049ec4a8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -572,6 +572,7 @@ struct pci_host_bridge {
 	void		*release_data;
 	unsigned int	ignore_reset_delay:1;	/* For entire hierarchy */
 	unsigned int	no_ext_tags:1;		/* No Extended Tags */
+	unsigned int	no_inc_mrrs:1;		/* No Increase MRRS */
 	unsigned int	native_aer:1;		/* OS may use PCIe AER */
 	unsigned int	native_pcie_hotplug:1;	/* OS may use PCIe hotplug */
 	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
-- 
2.31.1
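
As a rough illustration of the check this patch adds to pcie_set_readrq(),
the user-space sketch below models the MRRS field encoding and the
"no increase" rule. Only the expression (ffs(rq) - 8) << 12 and the
comparison against the current MRRS come from the patch; struct fake_dev
and the mrrs_encode(), mrrs_decode() and fake_set_readrq() helpers are
hypothetical names used for illustration, and the return value -1 stands
in for -EINVAL.

```c
#include <strings.h>	/* ffs() */

#define PCI_EXP_DEVCTL_READRQ	0x7000	/* MRRS field, Device Control bits 14:12 */

/* Hypothetical stand-in for a device plus its host bridge flag. */
struct fake_dev {
	unsigned int devctl;		/* modelled Device Control register */
	unsigned int no_inc_mrrs:1;	/* mirrors the new host bridge flag */
};

/* Encode a byte count (power of two, 128..4096) into the MRRS field. */
static int mrrs_encode(int rq)
{
	if (rq < 128 || rq > 4096 || (rq & (rq - 1)))
		return -1;
	return (ffs(rq) - 8) << 12;
}

/* Decode the MRRS field back into a byte count. */
static int mrrs_decode(unsigned int devctl)
{
	return 128 << ((devctl & PCI_EXP_DEVCTL_READRQ) >> 12);
}

/* Refuse to raise MRRS above the current value when no_inc_mrrs is set. */
static int fake_set_readrq(struct fake_dev *dev, int rq)
{
	int v = mrrs_encode(rq);

	if (v < 0)
		return -1;
	if (dev->no_inc_mrrs && rq > mrrs_decode(dev->devctl))
		return -1;

	dev->devctl = (dev->devctl & ~PCI_EXP_DEVCTL_READRQ) | (unsigned int)v;
	return 0;
}
```

In this model, a device that firmware left at 256 bytes can still be
lowered to 128, but a driver request for 512 is rejected and the register
keeps its previous value.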



* [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-06  9:51 [PATCH V2 0/2] PCI: Add two Loongson's LS7A quirks Huacai Chen
  2023-01-06  9:51 ` [PATCH V2 1/2] PCI: loongson: Improve the MRRS quirk for LS7A Huacai Chen
@ 2023-01-06  9:51 ` Huacai Chen
  2023-01-06 15:38   ` Bjorn Helgaas
  1 sibling, 1 reply; 16+ messages in thread
From: Huacai Chen @ 2023-01-06  9:51 UTC (permalink / raw)
  To: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring, Krzysztof Wilczyński
  Cc: linux-pci, Jianmin Lv, Xuefeng Li, Huacai Chen, Jiaxun Yang, Huacai Chen

After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during
shutdown") we observe poweroff/reboot failures on systems with LS7A
chipset.

We found that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in
do_pci_disable_device(), poweroff/reboot works again. The hardware
engineer says that the root cause is that the CPU is still accessing PCIe
devices during poweroff/reboot, and if we clear the Bus Master bit at this
time, the PCIe controller doesn't forward requests to downstream devices,
and also does not send a TIMEOUT to the CPU, which causes the CPU to wait
forever (hardware deadlock).

To be clear, the sequence is like this:

  - CPU issues MMIO read to device below Root Port

  - LS7A Root Port fails to forward transaction to secondary bus
    because of LS7A Bus Master defect

  - CPU hangs waiting for response to MMIO read

Then how is userspace able to use a device after the device is removed?

To give more details, let's take the graphics driver (e.g. amdgpu) as
an example. Userspace programs call printf() to display "shutting
down xxx service" during shutdown/reboot, or the kernel calls printk()
to display something during shutdown/reboot. These can happen at any
time, even after we call pcie_port_device_remove() to disable the PCIe
port leading to the graphics card.

The call stack is: printk() --> call_console_drivers() --> con->write()
--> vt_console_print() --> fbcon_putcs()

This scenario happens because userspace programs (or the kernel itself)
don't know whether a device is 'usable'; they just use it at any time.

This hardware behavior is a PCIe protocol violation (Bus Master should
not be involved in CPU MMIO transactions), and it will be fixed in new
revisions of the hardware (by adding a timeout mechanism for CPU read
requests, whether or not the Bus Master bit is cleared).

On some x86 platforms, radeon/amdgpu devices can cause similar problems
[1][2]. I once wanted to make a single patch to solve "all of these
problems" together, but that seems unreasonable because they may not be
exactly the same problem. So, this patch adds a new function
pcie_portdrv_shutdown(), a slightly modified copy of pcie_portdrv_remove()
dedicated to the shutdown path, and then adds a quirk just for LS7A to
avoid clearing the Bus Master bit in pcie_portdrv_shutdown(). Other
platforms behave as before.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
[2] https://bugs.freedesktop.org/show_bug.cgi?id=98638

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
 drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
 drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
 include/linux/pci.h                   |  1 +
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
index 759ec211c17b..641308ba4126 100644
--- a/drivers/pci/controller/pci-loongson.c
+++ b/drivers/pci/controller/pci-loongson.c
@@ -93,6 +93,24 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 			DEV_PCIE_PORT_2, loongson_mrrs_quirk);
 
+static void loongson_bmaster_quirk(struct pci_dev *pdev)
+{
+	/*
+	 * Some Loongson PCIe ports will cause CPU deadlock if there is
+	 * MMIO access to a downstream device when the root port disables
+	 * the Bus Master bit during poweroff/reboot.
+	 */
+	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
+
+	bridge->no_dis_bmaster = 1;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_0, loongson_bmaster_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_1, loongson_bmaster_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_2, loongson_bmaster_quirk);
+
 static void loongson_pci_pin_quirk(struct pci_dev *pdev)
 {
 	pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 2cc2e60bcb39..96f45c444422 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
 {
 	device_for_each_child(&dev->dev, NULL, remove_iter);
 	pci_free_irq_vectors(dev);
-	pci_disable_device(dev);
 }
 
 /**
@@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
 	}
 
 	pcie_port_device_remove(dev);
+
+	pci_disable_device(dev);
+}
+
+static void pcie_portdrv_shutdown(struct pci_dev *dev)
+{
+	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
+
+	if (pci_bridge_d3_possible(dev)) {
+		pm_runtime_forbid(&dev->dev);
+		pm_runtime_get_noresume(&dev->dev);
+		pm_runtime_dont_use_autosuspend(&dev->dev);
+	}
+
+	pcie_port_device_remove(dev);
+
+	if (!bridge->no_dis_bmaster)
+		pci_disable_device(dev);
 }
 
 static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
@@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
 
 	.probe		= pcie_portdrv_probe,
 	.remove		= pcie_portdrv_remove,
-	.shutdown	= pcie_portdrv_remove,
+	.shutdown	= pcie_portdrv_shutdown,
 
 	.err_handler	= &pcie_portdrv_err_handler,
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3df2049ec4a8..a64dbcb89231 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -573,6 +573,7 @@ struct pci_host_bridge {
 	unsigned int	ignore_reset_delay:1;	/* For entire hierarchy */
 	unsigned int	no_ext_tags:1;		/* No Extended Tags */
 	unsigned int	no_inc_mrrs:1;		/* No Increase MRRS */
+	unsigned int	no_dis_bmaster:1;	/* No Disable Bus Master */
 	unsigned int	native_aer:1;		/* OS may use PCIe AER */
 	unsigned int	native_pcie_hotplug:1;	/* OS may use PCIe hotplug */
 	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
-- 
2.31.1
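
To make the difference between the remove and shutdown paths concrete,
here is a small user-space model of the behavior described above. It is
only a sketch: struct fake_bridge, struct fake_port and all fake_*()
functions are hypothetical and do not exist in the kernel, and
fake_mmio_read_completes() merely encodes the defect as the commit message
describes it (an LS7A port with Bus Master cleared neither forwards the
read nor reports a timeout). PCI_COMMAND_MASTER uses the standard value of
the Bus Master bit in the PCI Command register.

```c
#include <stdint.h>

#define PCI_COMMAND_MASTER	0x4	/* Bus Master bit in the Command register */

/* Hypothetical stand-ins for pci_host_bridge and a Root Port pci_dev. */
struct fake_bridge {
	unsigned int no_dis_bmaster:1;	/* mirrors the new quirk flag */
};

struct fake_port {
	uint16_t command;		/* modelled PCI Command register */
	struct fake_bridge *bridge;
};

/* Models the part of pci_disable_device() that clears Bus Master. */
static void fake_disable_device(struct fake_port *port)
{
	port->command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/* Remove path: the port is always disabled. */
static void fake_portdrv_remove(struct fake_port *port)
{
	fake_disable_device(port);
}

/* Shutdown path: skip the disable when the quirk flag is set. */
static void fake_portdrv_shutdown(struct fake_port *port)
{
	if (!port->bridge->no_dis_bmaster)
		fake_disable_device(port);
}

/*
 * Model of the reported LS7A defect: a CPU MMIO read to a device below
 * the port completes only while the port's Bus Master bit is set.
 */
static int fake_mmio_read_completes(const struct fake_port *port)
{
	return (port->command & PCI_COMMAND_MASTER) != 0;
}
```

With the flag set, a late console write after shutdown still reaches the
device in this model; with the flag clear, the port is disabled exactly as
in the remove path.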



* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-06  9:51 ` [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
@ 2023-01-06 15:38   ` Bjorn Helgaas
  2023-01-07  2:25     ` Huacai Chen
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2023-01-06 15:38 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Huacai Chen, Jiaxun Yang, Rafael J. Wysocki, linux-pm,
	linux-kernel

[+cc Rafael, linux-pm, linux-kernel in case you have comments on
whether devices should still be usable after .remove()/.shutdown()]

On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
> After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during
> shutdown") we observe poweroff/reboot failures on systems with LS7A
> chipset.
> 
> We found that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in
> do_pci_disable_device(), it can work well. The hardware engineer says
> that the root cause is that CPU is still accessing PCIe devices while
> poweroff/reboot, and if we disable the Bus Master Bit at this time, the
> PCIe controller doesn't forward requests to downstream devices, and also
> does not send TIMEOUT to CPU, which causes CPU wait forever (hardware
> deadlock).
> 
> To be clear, the sequence is like this:
> 
>   - CPU issues MMIO read to device below Root Port
> 
>   - LS7A Root Port fails to forward transaction to secondary bus
>     because of LS7A Bus Master defect
> 
>   - CPU hangs waiting for response to MMIO read
> 
> Then how is userspace able to use a device after the device is removed?
> 
> To give more details, let's take the graphics driver (e.g. amdgpu) as
> an example. The userspace programs call printf() to display "shutting
> down xxx service" during shutdown/reboot, or the kernel calls printk()
> to display something during shutdown/reboot. These can happen at any
> time, even after we call pcie_port_device_remove() to disable the pcie
> port on the graphic card.
> 
> The call stack is: printk() --> call_console_drivers() --> con->write()
> --> vt_console_print() --> fbcon_putcs()
> 
> This scenario happens because userspace programs (or the kernel itself)
> don't know whether a device is 'usable', they just use it, at any time.

Thanks for this background.  So basically we want to call .remove() on
a console device (or a bridge leading to it), but we expect it to keep
working as usual afterwards?

That seems a little weird.  Is that the design we want?  Maybe we
should have a way to mark devices so we don't remove them during
shutdown or reboot?

> This hardware behavior is a PCIe protocol violation (Bus Master should
> not be involved in CPU MMIO transactions), and it will be fixed in new
> revisions of hardware (add timeout mechanism for CPU read request,
> whether or not Bus Master bit is cleared).
> 
> On some x86 platforms, radeon/amdgpu devices can cause similar problems
> [1][2]. Once before I wanted to make a single patch to solve "all of
> these problems" together, but it seems unreasonable because maybe they
> are not exactly the same problem. So, this patch add a new function
> pcie_portdrv_shutdown(), a slight modified copy of pcie_portdrv_remove()
> dedicated for the shutdown path, and then add a quirk just for LS7A to
> avoid clearing Bus Master bit in pcie_portdrv_shutdown(). Leave other
> platforms behave as before.
> 
> [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> 
> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> ---
>  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
>  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
>  include/linux/pci.h                   |  1 +
>  3 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> index 759ec211c17b..641308ba4126 100644
> --- a/drivers/pci/controller/pci-loongson.c
> +++ b/drivers/pci/controller/pci-loongson.c
> @@ -93,6 +93,24 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>  			DEV_PCIE_PORT_2, loongson_mrrs_quirk);
>  
> +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> +{
> +	/*
> +	 * Some Loongson PCIe ports will cause CPU deadlock if there is
> +	 * MMIO access to a downstream device when the root port disable
> +	 * the Bus Master bit during poweroff/reboot.
> +	 */
> +	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> +
> +	bridge->no_dis_bmaster = 1;
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> +
>  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
>  {
>  	pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 2cc2e60bcb39..96f45c444422 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
>  {
>  	device_for_each_child(&dev->dev, NULL, remove_iter);
>  	pci_free_irq_vectors(dev);
> -	pci_disable_device(dev);
>  }
>  
>  /**
> @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
>  	}
>  
>  	pcie_port_device_remove(dev);
> +
> +	pci_disable_device(dev);
> +}
> +
> +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> +{
> +	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> +
> +	if (pci_bridge_d3_possible(dev)) {
> +		pm_runtime_forbid(&dev->dev);
> +		pm_runtime_get_noresume(&dev->dev);
> +		pm_runtime_dont_use_autosuspend(&dev->dev);
> +	}
> +
> +	pcie_port_device_remove(dev);
> +
> +	if (!bridge->no_dis_bmaster)
> +		pci_disable_device(dev);
>  }
>  
>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
>  
>  	.probe		= pcie_portdrv_probe,
>  	.remove		= pcie_portdrv_remove,
> -	.shutdown	= pcie_portdrv_remove,
> +	.shutdown	= pcie_portdrv_shutdown,
>  
>  	.err_handler	= &pcie_portdrv_err_handler,
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 3df2049ec4a8..a64dbcb89231 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -573,6 +573,7 @@ struct pci_host_bridge {
>  	unsigned int	ignore_reset_delay:1;	/* For entire hierarchy */
>  	unsigned int	no_ext_tags:1;		/* No Extended Tags */
>  	unsigned int	no_inc_mrrs:1;		/* No Increase MRRS */
> +	unsigned int	no_dis_bmaster:1;	/* No Disable Bus Master */
>  	unsigned int	native_aer:1;		/* OS may use PCIe AER */
>  	unsigned int	native_pcie_hotplug:1;	/* OS may use PCIe hotplug */
>  	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
> -- 
> 2.31.1
> 


* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-06 15:38   ` Bjorn Helgaas
@ 2023-01-07  2:25     ` Huacai Chen
  2023-01-19 12:25       ` Huacai Chen
  0 siblings, 1 reply; 16+ messages in thread
From: Huacai Chen @ 2023-01-07  2:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

On Fri, Jan 6, 2023 at 11:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Rafael, linux-pm, linux-kernel in case you have comments on
> whether devices should still be usable after .remove()/.shutdown()]
>
> On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
> > After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during
> > shutdown") we observe poweroff/reboot failures on systems with LS7A
> > chipset.
> >
> > We found that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in
> > do_pci_disable_device(), it can work well. The hardware engineer says
> > that the root cause is that CPU is still accessing PCIe devices while
> > poweroff/reboot, and if we disable the Bus Master Bit at this time, the
> > PCIe controller doesn't forward requests to downstream devices, and also
> > does not send TIMEOUT to CPU, which causes CPU wait forever (hardware
> > deadlock).
> >
> > To be clear, the sequence is like this:
> >
> >   - CPU issues MMIO read to device below Root Port
> >
> >   - LS7A Root Port fails to forward transaction to secondary bus
> >     because of LS7A Bus Master defect
> >
> >   - CPU hangs waiting for response to MMIO read
> >
> > Then how is userspace able to use a device after the device is removed?
> >
> > To give more details, let's take the graphics driver (e.g. amdgpu) as
> > an example. The userspace programs call printf() to display "shutting
> > down xxx service" during shutdown/reboot, or the kernel calls printk()
> > to display something during shutdown/reboot. These can happen at any
> > time, even after we call pcie_port_device_remove() to disable the pcie
> > port on the graphic card.
> >
> > The call stack is: printk() --> call_console_drivers() --> con->write()
> > --> vt_console_print() --> fbcon_putcs()
> >
> > This scenario happens because userspace programs (or the kernel itself)
> > don't know whether a device is 'usable', they just use it, at any time.
>
> Thanks for this background.  So basically we want to call .remove() on
> a console device (or a bridge leading to it), but we expect it to keep
> working as usual afterwards?
>
> That seems a little weird.  Is that the design we want?  Maybe we
> should have a way to mark devices so we don't remove them during
> shutdown or reboot?
Sounds reasonable, but there seems to be no existing way to mark devices
like this.

Huacai
>
> > This hardware behavior is a PCIe protocol violation (Bus Master should
> > not be involved in CPU MMIO transactions), and it will be fixed in new
> > revisions of hardware (add timeout mechanism for CPU read request,
> > whether or not Bus Master bit is cleared).
> >
> > On some x86 platforms, radeon/amdgpu devices can cause similar problems
> > [1][2]. Once before I wanted to make a single patch to solve "all of
> > these problems" together, but it seems unreasonable because maybe they
> > are not exactly the same problem. So, this patch add a new function
> > pcie_portdrv_shutdown(), a slight modified copy of pcie_portdrv_remove()
> > dedicated for the shutdown path, and then add a quirk just for LS7A to
> > avoid clearing Bus Master bit in pcie_portdrv_shutdown(). Leave other
> > platforms behave as before.
> >
> > [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> > [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> >
> > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > ---
> >  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
> >  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
> >  include/linux/pci.h                   |  1 +
> >  3 files changed, 37 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> > index 759ec211c17b..641308ba4126 100644
> > --- a/drivers/pci/controller/pci-loongson.c
> > +++ b/drivers/pci/controller/pci-loongson.c
> > @@ -93,6 +93,24 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> >                       DEV_PCIE_PORT_2, loongson_mrrs_quirk);
> >
> > +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> > +{
> > +     /*
> > +      * Some Loongson PCIe ports will cause CPU deadlock if there is
> > +      * MMIO access to a downstream device when the root port disable
> > +      * the Bus Master bit during poweroff/reboot.
> > +      */
> > +     struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> > +
> > +     bridge->no_dis_bmaster = 1;
> > +}
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> > +
> >  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
> >  {
> >       pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 2cc2e60bcb39..96f45c444422 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
> >  {
> >       device_for_each_child(&dev->dev, NULL, remove_iter);
> >       pci_free_irq_vectors(dev);
> > -     pci_disable_device(dev);
> >  }
> >
> >  /**
> > @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
> >       }
> >
> >       pcie_port_device_remove(dev);
> > +
> > +     pci_disable_device(dev);
> > +}
> > +
> > +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> > +{
> > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > +
> > +     if (pci_bridge_d3_possible(dev)) {
> > +             pm_runtime_forbid(&dev->dev);
> > +             pm_runtime_get_noresume(&dev->dev);
> > +             pm_runtime_dont_use_autosuspend(&dev->dev);
> > +     }
> > +
> > +     pcie_port_device_remove(dev);
> > +
> > +     if (!bridge->no_dis_bmaster)
> > +             pci_disable_device(dev);
> >  }
> >
> >  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> > @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
> >
> >       .probe          = pcie_portdrv_probe,
> >       .remove         = pcie_portdrv_remove,
> > -     .shutdown       = pcie_portdrv_remove,
> > +     .shutdown       = pcie_portdrv_shutdown,
> >
> >       .err_handler    = &pcie_portdrv_err_handler,
> >
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 3df2049ec4a8..a64dbcb89231 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -573,6 +573,7 @@ struct pci_host_bridge {
> >       unsigned int    ignore_reset_delay:1;   /* For entire hierarchy */
> >       unsigned int    no_ext_tags:1;          /* No Extended Tags */
> >       unsigned int    no_inc_mrrs:1;          /* No Increase MRRS */
> > +     unsigned int    no_dis_bmaster:1;       /* No Disable Bus Master */
> >       unsigned int    native_aer:1;           /* OS may use PCIe AER */
> >       unsigned int    native_pcie_hotplug:1;  /* OS may use PCIe hotplug */
> >       unsigned int    native_shpc_hotplug:1;  /* OS may use SHPC hotplug */
> > --
> > 2.31.1
> >


* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-07  2:25     ` Huacai Chen
@ 2023-01-19 12:25       ` Huacai Chen
  2023-01-19 12:50         ` Bjorn Helgaas
  0 siblings, 1 reply; 16+ messages in thread
From: Huacai Chen @ 2023-01-19 12:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

Ping?


On Sat, Jan 7, 2023 at 10:25 AM Huacai Chen <chenhuacai@gmail.com> wrote:
>
> On Fri, Jan 6, 2023 at 11:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc Rafael, linux-pm, linux-kernel in case you have comments on
> > whether devices should still be usable after .remove()/.shutdown()]
> >
> > On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
> > > After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during
> > > shutdown") we observe poweroff/reboot failures on systems with LS7A
> > > chipset.
> > >
> > > We found that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in
> > > do_pci_disable_device(), it can work well. The hardware engineer says
> > > that the root cause is that CPU is still accessing PCIe devices while
> > > poweroff/reboot, and if we disable the Bus Master Bit at this time, the
> > > PCIe controller doesn't forward requests to downstream devices, and also
> > > does not send TIMEOUT to CPU, which causes CPU wait forever (hardware
> > > deadlock).
> > >
> > > To be clear, the sequence is like this:
> > >
> > >   - CPU issues MMIO read to device below Root Port
> > >
> > >   - LS7A Root Port fails to forward transaction to secondary bus
> > >     because of LS7A Bus Master defect
> > >
> > >   - CPU hangs waiting for response to MMIO read
> > >
> > > Then how is userspace able to use a device after the device is removed?
> > >
> > > To give more details, let's take the graphics driver (e.g. amdgpu) as
> > > an example. The userspace programs call printf() to display "shutting
> > > down xxx service" during shutdown/reboot, or the kernel calls printk()
> > > to display something during shutdown/reboot. These can happen at any
> > > time, even after we call pcie_port_device_remove() to disable the pcie
> > > port on the graphic card.
> > >
> > > The call stack is: printk() --> call_console_drivers() --> con->write()
> > > --> vt_console_print() --> fbcon_putcs()
> > >
> > > This scenario happens because userspace programs (or the kernel itself)
> > > don't know whether a device is 'usable', they just use it, at any time.
> >
> > Thanks for this background.  So basically we want to call .remove() on
> > a console device (or a bridge leading to it), but we expect it to keep
> > working as usual afterwards?
> >
> > That seems a little weird.  Is that the design we want?  Maybe we
> > should have a way to mark devices so we don't remove them during
> > shutdown or reboot?
> Sounds reasonable, but it seems no existing way can mark this.
>
> Huacai
> >
> > > This hardware behavior is a PCIe protocol violation (Bus Master should
> > > not be involved in CPU MMIO transactions), and it will be fixed in new
> > > revisions of hardware (add timeout mechanism for CPU read request,
> > > whether or not Bus Master bit is cleared).
> > >
> > > On some x86 platforms, radeon/amdgpu devices can cause similar problems
> > > [1][2]. Once before I wanted to make a single patch to solve "all of
> > > these problems" together, but it seems unreasonable because maybe they
> > > are not exactly the same problem. So, this patch add a new function
> > > pcie_portdrv_shutdown(), a slight modified copy of pcie_portdrv_remove()
> > > dedicated for the shutdown path, and then add a quirk just for LS7A to
> > > avoid clearing Bus Master bit in pcie_portdrv_shutdown(). Leave other
> > > platforms behave as before.
> > >
> > > [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> > > [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> > >
> > > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > > ---
> > >  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
> > >  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
> > >  include/linux/pci.h                   |  1 +
> > >  3 files changed, 37 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> > > index 759ec211c17b..641308ba4126 100644
> > > --- a/drivers/pci/controller/pci-loongson.c
> > > +++ b/drivers/pci/controller/pci-loongson.c
> > > @@ -93,6 +93,24 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > >                       DEV_PCIE_PORT_2, loongson_mrrs_quirk);
> > >
> > > +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> > > +{
> > > +     /*
> > > +      * Some Loongson PCIe ports will cause CPU deadlock if there is
> > > +      * MMIO access to a downstream device when the root port disable
> > > +      * the Bus Master bit during poweroff/reboot.
> > > +      */
> > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> > > +
> > > +     bridge->no_dis_bmaster = 1;
> > > +}
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > +                     DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > +                     DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > +                     DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> > > +
> > >  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
> > >  {
> > >       pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
> > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > > index 2cc2e60bcb39..96f45c444422 100644
> > > --- a/drivers/pci/pcie/portdrv.c
> > > +++ b/drivers/pci/pcie/portdrv.c
> > > @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
> > >  {
> > >       device_for_each_child(&dev->dev, NULL, remove_iter);
> > >       pci_free_irq_vectors(dev);
> > > -     pci_disable_device(dev);
> > >  }
> > >
> > >  /**
> > > @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
> > >       }
> > >
> > >       pcie_port_device_remove(dev);
> > > +
> > > +     pci_disable_device(dev);
> > > +}
> > > +
> > > +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> > > +{
> > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > > +
> > > +     if (pci_bridge_d3_possible(dev)) {
> > > +             pm_runtime_forbid(&dev->dev);
> > > +             pm_runtime_get_noresume(&dev->dev);
> > > +             pm_runtime_dont_use_autosuspend(&dev->dev);
> > > +     }
> > > +
> > > +     pcie_port_device_remove(dev);
> > > +
> > > +     if (!bridge->no_dis_bmaster)
> > > +             pci_disable_device(dev);
> > >  }
> > >
> > >  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> > > @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
> > >
> > >       .probe          = pcie_portdrv_probe,
> > >       .remove         = pcie_portdrv_remove,
> > > -     .shutdown       = pcie_portdrv_remove,
> > > +     .shutdown       = pcie_portdrv_shutdown,
> > >
> > >       .err_handler    = &pcie_portdrv_err_handler,
> > >
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > index 3df2049ec4a8..a64dbcb89231 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -573,6 +573,7 @@ struct pci_host_bridge {
> > >       unsigned int    ignore_reset_delay:1;   /* For entire hierarchy */
> > >       unsigned int    no_ext_tags:1;          /* No Extended Tags */
> > >       unsigned int    no_inc_mrrs:1;          /* No Increase MRRS */
> > > +     unsigned int    no_dis_bmaster:1;       /* No Disable Bus Master */
> > >       unsigned int    native_aer:1;           /* OS may use PCIe AER */
> > >       unsigned int    native_pcie_hotplug:1;  /* OS may use PCIe hotplug */
> > >       unsigned int    native_shpc_hotplug:1;  /* OS may use SHPC hotplug */
> > > --
> > > 2.31.1
> > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-19 12:25       ` Huacai Chen
@ 2023-01-19 12:50         ` Bjorn Helgaas
  2023-01-20 13:31           ` Huacai Chen
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2023-01-19 12:50 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

On Thu, Jan 19, 2023 at 08:25:20PM +0800, Huacai Chen wrote:
> Ping?

I suggested another possible way to do this that wasn't so much of a
special case.  Did you explore that at all?

I know there's no *existing* way to mark devices that we need to use
all the way through shutdown or reboot, but if it makes sense, there's
no reason we couldn't add one.  That has the potential of being more
generic, e.g., we could do it for all console devices, as opposed to
quirking a Root Port that just happens to be in the path to the
console.
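
For illustration only (this sketch is not from the thread and is not
kernel code), the idea of marking devices that must stay usable through
shutdown can be modeled in userspace C.  Every name here (struct
model_dev, keep_on_shutdown, model_mark_keep(), model_shutdown()) is
hypothetical; no such flag exists in the PCI core.  The sketch assumes
that marking a device also has to mark each bridge on the path to it,
since the device is unreachable once an upstream port is disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a device tree; not a kernel structure. */
struct model_dev {
	struct model_dev *parent;	/* upstream bridge, NULL at the top */
	bool keep_on_shutdown;		/* hypothetical "still needed" mark */
	bool enabled;			/* cleared when shutdown disables it */
};

/* Mark a device and every bridge on the path leading to it. */
void model_mark_keep(struct model_dev *dev)
{
	for (; dev != NULL; dev = dev->parent)
		dev->keep_on_shutdown = true;
}

/* Model of a shutdown hook: disable the device unless it was marked. */
void model_shutdown(struct model_dev *dev)
{
	assert(dev != NULL);
	if (!dev->keep_on_shutdown)
		dev->enabled = false;
}
```

In this model, calling model_mark_keep() on a console GPU before
shutdown leaves both the GPU and its upstream port enabled, while an
unmarked port elsewhere in the hierarchy is still disabled.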


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-19 12:50         ` Bjorn Helgaas
@ 2023-01-20 13:31           ` Huacai Chen
  2023-01-20 15:36             ` Bjorn Helgaas
  0 siblings, 1 reply; 16+ messages in thread
From: Huacai Chen @ 2023-01-20 13:31 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

Hi, Bjorn,

On Thu, Jan 19, 2023 at 8:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Jan 19, 2023 at 08:25:20PM +0800, Huacai Chen wrote:
> > Ping?
>
> I suggested another possible way to do this that wasn't so much of a
> special case.  Did you explore that at all?
That is a little difficult for me, but what is worse is that the root
cause doesn't come from the GPU or console drivers, but from the root
port. That means that even if we can work around the GPU issue in another
way, there are still problems with other devices. Besides the graphics
card, the most frequent problematic device is the SATA controller
connected to the LS7A chipset: there are incomplete I/O accesses after
the root port is disabled, and these also cause reboot failure.

Huacai
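
For illustration only (not code from the patch or the kernel), the
failure described in this thread and the effect of the proposed quirk
can be modeled in userspace C.  struct model_port, model_mmio_read() and
model_port_shutdown() are hypothetical names; only no_dis_bmaster is
taken from the patch, where it lives in struct pci_host_bridge rather
than on the port itself.  The model assumes, as the commit message
states, that an affected LS7A port neither forwards nor times out a CPU
read once its Bus Master bit is cleared.

```c
#include <stdbool.h>

enum model_mmio_result {
	MODEL_MMIO_COMPLETED,	/* read forwarded downstream and answered */
	MODEL_MMIO_HUNG,	/* no forward, no timeout: CPU waits forever */
};

/* Hypothetical model of a Root Port; not a kernel structure. */
struct model_port {
	bool bus_master;	/* models PCI_COMMAND_MASTER on the port */
	bool has_defect;	/* true for an affected LS7A port */
	bool no_dis_bmaster;	/* flag name taken from the patch */
};

/* CPU read to a device below the port. */
enum model_mmio_result model_mmio_read(const struct model_port *port)
{
	if (port->has_defect && !port->bus_master)
		return MODEL_MMIO_HUNG;
	return MODEL_MMIO_COMPLETED;
}

/* Shutdown path: clear Bus Master unless the quirk flag is set. */
void model_port_shutdown(struct model_port *port)
{
	if (!port->no_dis_bmaster)
		port->bus_master = false;
}
```

In this model a port without the defect still completes reads after
shutdown, a defective port without the flag hangs, and a defective port
with the flag set keeps working because Bus Master is never cleared.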
>
> I know there's no *existing* way to mark devices that we need to use
> all the way through shutdown or reboot, but if it makes sense, there's
> no reason we couldn't add one.  That has the potential of being more
> generic, e.g., we could do it for all console devices, as opposed to
> quirking a Root Port that just happens to be in the path to the
> console.
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-20 13:31           ` Huacai Chen
@ 2023-01-20 15:36             ` Bjorn Helgaas
  2023-01-21 15:10               ` Huacai Chen
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2023-01-20 15:36 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

On Fri, Jan 20, 2023 at 09:31:43PM +0800, Huacai Chen wrote:
> On Thu, Jan 19, 2023 at 8:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Jan 19, 2023 at 08:25:20PM +0800, Huacai Chen wrote:
> > > Ping?
> >
> > I suggested another possible way to do this that wasn't so much of a
> > special case.  Did you explore that at all?
>
> That is a little difficult for me, but what is worse is that the root
> cause doesn't come from the GPU or console drivers, but from the root
> port. That means that even if we can work around the GPU issue in another
> way, there are still problems with other devices. Besides the graphics
> card, the most frequent problematic device is the SATA controller
> connected to the LS7A chipset: there are incomplete I/O accesses after
> the root port is disabled, and these also cause reboot failure.

Yes, SATA sounds like another case where we want to use the device
after we call the driver's remove/shutdown method.  That's not
*worse*, it's just another case where we might have to mark devices
for special handling.

If we remove/shutdown *any* Root Port, not just LS7A, I think the idea
of assuming downstream devices can continue to work as usual is a
little suspect.  They might continue to work by accident today, but it
doesn't seem like a robust design.

> > I know there's no *existing* way to mark devices that we need to use
> > all the way through shutdown or reboot, but if it makes sense, there's
> > no reason we couldn't add one.  That has the potential of being more
> > generic, e.g., we could do it for all console devices, as opposed to
> > quirking a Root Port that just happens to be in the path to the
> > console.
> >
> > > On Sat, Jan 7, 2023 at 10:25 AM Huacai Chen <chenhuacai@gmail.com> wrote:
> > > > On Fri, Jan 6, 2023 at 11:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
> > > > > > After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe
> > > > > > services during shutdown") we observe poweroff/reboot
> > > > > > failures on systems with LS7A chipset.
> > > > > >
> > > > > > > We found that removing "pci_command &=
> > > > > > > ~PCI_COMMAND_MASTER" from do_pci_disable_device() makes
> > > > > > > poweroff/reboot work. The hardware engineer says the root
> > > > > > > cause is that the CPU is still accessing PCIe devices
> > > > > > > during poweroff/reboot, and if we clear the Bus Master bit
> > > > > > > at this time, the PCIe controller doesn't forward requests
> > > > > > > to downstream devices and also does not send TIMEOUT to
> > > > > > > the CPU, which makes the CPU wait forever (hardware deadlock).
> > > > > >
> > > > > > To be clear, the sequence is like this:
> > > > > >
> > > > > >   - CPU issues MMIO read to device below Root Port
> > > > > >
> > > > > >   - LS7A Root Port fails to forward transaction to secondary bus
> > > > > >     because of LS7A Bus Master defect
> > > > > >
> > > > > >   - CPU hangs waiting for response to MMIO read
> > > > > >
> > > > > > Then how is userspace able to use a device after the
> > > > > > device is removed?
> > > > > >
> > > > > > To give more details, let's take the graphics driver (e.g.
> > > > > > amdgpu) as an example. The userspace programs call
> > > > > > printf() to display "shutting down xxx service" during
> > > > > > shutdown/reboot, or the kernel calls printk() to display
> > > > > > > something during shutdown/reboot. These can happen at any
> > > > > > > time, even after we call pcie_port_device_remove() to
> > > > > > > disable the PCIe port on the graphics card.
> > > > > >
> > > > > > The call stack is: printk() --> call_console_drivers() -->
> > > > > > con->write() --> vt_console_print() --> fbcon_putcs()
> > > > > >
> > > > > > This scenario happens because userspace programs (or the
> > > > > > kernel itself) don't know whether a device is 'usable',
> > > > > > they just use it, at any time.
> > > > >
> > > > > Thanks for this background.  So basically we want to call
> > > > > .remove() on a console device (or a bridge leading to it),
> > > > > but we expect it to keep working as usual afterwards?
> > > > >
> > > > > That seems a little weird.  Is that the design we want?
> > > > > Maybe we should have a way to mark devices so we don't
> > > > > remove them during shutdown or reboot?
> > > >
> > > > > Sounds reasonable, but there seems to be no existing way to mark this.
> > > >
> > > > Huacai
> > > > >
> > > > > > This hardware behavior is a PCIe protocol violation (Bus Master should
> > > > > > not be involved in CPU MMIO transactions), and it will be fixed in new
> > > > > > revisions of hardware (add timeout mechanism for CPU read request,
> > > > > > whether or not Bus Master bit is cleared).
> > > > > >
> > > > > > > On some x86 platforms, radeon/amdgpu devices can cause similar
> > > > > > > problems [1][2]. I once wanted to make a single patch to solve
> > > > > > > "all of these problems" together, but that seems unreasonable
> > > > > > > because they may not be exactly the same problem. So this patch
> > > > > > > adds a new function, pcie_portdrv_shutdown(), a slightly modified
> > > > > > > copy of pcie_portdrv_remove() dedicated to the shutdown path, and
> > > > > > > then adds a quirk just for LS7A to avoid clearing the Bus Master
> > > > > > > bit in pcie_portdrv_shutdown(). Other platforms behave as before.
> > > > > >
> > > > > > [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> > > > > > [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> > > > > >
> > > > > > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > > > > > ---
> > > > > >  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
> > > > > >  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
> > > > > >  include/linux/pci.h                   |  1 +
> > > > > >  3 files changed, 37 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> > > > > > index 759ec211c17b..641308ba4126 100644
> > > > > > --- a/drivers/pci/controller/pci-loongson.c
> > > > > > +++ b/drivers/pci/controller/pci-loongson.c
> > > > > > @@ -93,6 +93,24 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > >                       DEV_PCIE_PORT_2, loongson_mrrs_quirk);
> > > > > >
> > > > > > +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> > > > > > +{
> > > > > > +     /*
> > > > > > +      * Some Loongson PCIe ports will cause CPU deadlock if there is
> > > > > > +      * MMIO access to a downstream device when the root port disable
> > > > > > +      * the Bus Master bit during poweroff/reboot.
> > > > > > +      */
> > > > > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> > > > > > +
> > > > > > +     bridge->no_dis_bmaster = 1;
> > > > > > +}
> > > > > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > +                     DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> > > > > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > +                     DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> > > > > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > +                     DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> > > > > > +
> > > > > >  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
> > > > > >  {
> > > > > >       pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
> > > > > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > > > > > index 2cc2e60bcb39..96f45c444422 100644
> > > > > > --- a/drivers/pci/pcie/portdrv.c
> > > > > > +++ b/drivers/pci/pcie/portdrv.c
> > > > > > @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
> > > > > >  {
> > > > > >       device_for_each_child(&dev->dev, NULL, remove_iter);
> > > > > >       pci_free_irq_vectors(dev);
> > > > > > -     pci_disable_device(dev);
> > > > > >  }
> > > > > >
> > > > > >  /**
> > > > > > @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
> > > > > >       }
> > > > > >
> > > > > >       pcie_port_device_remove(dev);
> > > > > > +
> > > > > > +     pci_disable_device(dev);
> > > > > > +}
> > > > > > +
> > > > > > +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> > > > > > +{
> > > > > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > > > > > +
> > > > > > +     if (pci_bridge_d3_possible(dev)) {
> > > > > > +             pm_runtime_forbid(&dev->dev);
> > > > > > +             pm_runtime_get_noresume(&dev->dev);
> > > > > > +             pm_runtime_dont_use_autosuspend(&dev->dev);
> > > > > > +     }
> > > > > > +
> > > > > > +     pcie_port_device_remove(dev);
> > > > > > +
> > > > > > +     if (!bridge->no_dis_bmaster)
> > > > > > +             pci_disable_device(dev);
> > > > > >  }
> > > > > >
> > > > > >  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> > > > > > @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
> > > > > >
> > > > > >       .probe          = pcie_portdrv_probe,
> > > > > >       .remove         = pcie_portdrv_remove,
> > > > > > -     .shutdown       = pcie_portdrv_remove,
> > > > > > +     .shutdown       = pcie_portdrv_shutdown,
> > > > > >
> > > > > >       .err_handler    = &pcie_portdrv_err_handler,
> > > > > >
> > > > > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > > > > index 3df2049ec4a8..a64dbcb89231 100644
> > > > > > --- a/include/linux/pci.h
> > > > > > +++ b/include/linux/pci.h
> > > > > > @@ -573,6 +573,7 @@ struct pci_host_bridge {
> > > > > >       unsigned int    ignore_reset_delay:1;   /* For entire hierarchy */
> > > > > >       unsigned int    no_ext_tags:1;          /* No Extended Tags */
> > > > > >       unsigned int    no_inc_mrrs:1;          /* No Increase MRRS */
> > > > > > +     unsigned int    no_dis_bmaster:1;       /* No Disable Bus Master */
> > > > > >       unsigned int    native_aer:1;           /* OS may use PCIe AER */
> > > > > >       unsigned int    native_pcie_hotplug:1;  /* OS may use PCIe hotplug */
> > > > > >       unsigned int    native_shpc_hotplug:1;  /* OS may use SHPC hotplug */
> > > > > > --
> > > > > > 2.31.1
> > > > > >

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-20 15:36             ` Bjorn Helgaas
@ 2023-01-21 15:10               ` Huacai Chen
  2023-01-30 12:35                 ` Thorsten Leemhuis
  2023-01-31  0:01                 ` Bjorn Helgaas
  0 siblings, 2 replies; 16+ messages in thread
From: Huacai Chen @ 2023-01-21 15:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

On Fri, Jan 20, 2023 at 11:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Jan 20, 2023 at 09:31:43PM +0800, Huacai Chen wrote:
> > On Thu, Jan 19, 2023 at 8:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Jan 19, 2023 at 08:25:20PM +0800, Huacai Chen wrote:
> > > > Ping?
> > >
> > > I suggested another possible way to do this that wasn't so much of a
> > > special case.  Did you explore that at all?
> >
> > That is a little difficult for me, but what is worse is that the root
> > cause doesn't come from gpu or console drivers, but from the root
> > port. That means: even if we can workaround the gpu issue in another
> > way, there are still problems on other devices. Besides the graphics
> > card, the most frequent problematic device is the sata controller
> > connected on LS7A chipset, there are incomplete I/O accesses after the
> > root port disabled and also cause reboot failure.
>
> Yes, SATA sounds like another case where we want to use the device
> after we call the driver's remove/shutdown method.  That's not
> *worse*, it's just another case where we might have to mark devices
> for special handling.
That would take too much effort, because we would need to modify nearly
every PCI driver, and that is beyond my ability. :)

>
> If we remove/shutdown *any* Root Port, not just LS7A, I think the idea
> of assuming downstream devices can continue to work as usual is a
> little suspect.  They might continue to work by accident today, but it
> doesn't seem like a robust design.
The existing design has worked for many years, so it is mostly
reasonable. For the LS7A case, the root cause lies in the root port
itself, so a workaround on the root port seems reasonable.

Huacai
>
> > > I know there's no *existing* way to mark devices that we need to use
> > > all the way through shutdown or reboot, but if it makes sense, there's
> > > no reason we couldn't add one.  That has the potential of being more
> > > generic, e.g., we could do it for all console devices, as opposed to
> > > quirking a Root Port that just happens to be in the path to the
> > > console.
> > >
> > > > On Sat, Jan 7, 2023 at 10:25 AM Huacai Chen <chenhuacai@gmail.com> wrote:
> > > > > On Fri, Jan 6, 2023 at 11:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
> > > > > > > After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe
> > > > > > > services during shutdown") we observe poweroff/reboot
> > > > > > > failures on systems with LS7A chipset.
> > > > > > >
> > > > > > > We found that if we remove "pci_command &=
> > > > > > > ~PCI_COMMAND_MASTER" in do_pci_disable_device(), it can
> > > > > > > work well. The hardware engineer says that the root cause
> > > > > > > is that CPU is still accessing PCIe devices while
> > > > > > > poweroff/reboot, and if we disable the Bus Master Bit at
> > > > > > > this time, the PCIe controller doesn't forward requests to
> > > > > > > downstream devices, and also does not send TIMEOUT to CPU,
> > > > > > > which causes CPU wait forever (hardware deadlock).
> > > > > > >
> > > > > > > To be clear, the sequence is like this:
> > > > > > >
> > > > > > >   - CPU issues MMIO read to device below Root Port
> > > > > > >
> > > > > > >   - LS7A Root Port fails to forward transaction to secondary bus
> > > > > > >     because of LS7A Bus Master defect
> > > > > > >
> > > > > > >   - CPU hangs waiting for response to MMIO read
> > > > > > >
> > > > > > > Then how is userspace able to use a device after the
> > > > > > > device is removed?
> > > > > > >
> > > > > > > To give more details, let's take the graphics driver (e.g.
> > > > > > > amdgpu) as an example. The userspace programs call
> > > > > > > printf() to display "shutting down xxx service" during
> > > > > > > shutdown/reboot, or the kernel calls printk() to display
> > > > > > > something during shutdown/reboot. These can happen at any
> > > > > > > time, even after we call pcie_port_device_remove() to
> > > > > > > disable the pcie port on the graphic card.
> > > > > > >
> > > > > > > The call stack is: printk() --> call_console_drivers() -->
> > > > > > > con->write() --> vt_console_print() --> fbcon_putcs()
> > > > > > >
> > > > > > > This scenario happens because userspace programs (or the
> > > > > > > kernel itself) don't know whether a device is 'usable',
> > > > > > > they just use it, at any time.
> > > > > >
> > > > > > Thanks for this background.  So basically we want to call
> > > > > > .remove() on a console device (or a bridge leading to it),
> > > > > > but we expect it to keep working as usual afterwards?
> > > > > >
> > > > > > That seems a little weird.  Is that the design we want?
> > > > > > Maybe we should have a way to mark devices so we don't
> > > > > > remove them during shutdown or reboot?
> > > > >
> > > > > Sounds reasonable, but it seems no existing way can mark this.
> > > > >
> > > > > Huacai
> > > > > >
> > > > > > > This hardware behavior is a PCIe protocol violation (Bus Master should
> > > > > > > not be involved in CPU MMIO transactions), and it will be fixed in new
> > > > > > > revisions of hardware (add timeout mechanism for CPU read request,
> > > > > > > whether or not Bus Master bit is cleared).
> > > > > > >
> > > > > > > On some x86 platforms, radeon/amdgpu devices can cause similar problems
> > > > > > > [1][2]. Once before I wanted to make a single patch to solve "all of
> > > > > > > these problems" together, but it seems unreasonable because maybe they
> > > > > > > are not exactly the same problem. So, this patch add a new function
> > > > > > > pcie_portdrv_shutdown(), a slight modified copy of pcie_portdrv_remove()
> > > > > > > dedicated for the shutdown path, and then add a quirk just for LS7A to
> > > > > > > avoid clearing Bus Master bit in pcie_portdrv_shutdown(). Leave other
> > > > > > > platforms behave as before.
> > > > > > >
> > > > > > > [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> > > > > > > [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> > > > > > >
> > > > > > > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > > > > > > ---
> > > > > > >  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
> > > > > > >  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
> > > > > > >  include/linux/pci.h                   |  1 +
> > > > > > >  3 files changed, 37 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> > > > > > > index 759ec211c17b..641308ba4126 100644
> > > > > > > --- a/drivers/pci/controller/pci-loongson.c
> > > > > > > +++ b/drivers/pci/controller/pci-loongson.c
> > > > > > > @@ -93,6 +93,24 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > >                       DEV_PCIE_PORT_2, loongson_mrrs_quirk);
> > > > > > >
> > > > > > > +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> > > > > > > +{
> > > > > > > +     /*
> > > > > > > +      * Some Loongson PCIe ports will cause CPU deadlock if there is
> > > > > > > +      * MMIO access to a downstream device when the root port disable
> > > > > > > +      * the Bus Master bit during poweroff/reboot.
> > > > > > > +      */
> > > > > > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> > > > > > > +
> > > > > > > +     bridge->no_dis_bmaster = 1;
> > > > > > > +}
> > > > > > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > > +                     DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> > > > > > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > > +                     DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> > > > > > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > > > > > > +                     DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> > > > > > > +
> > > > > > >  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
> > > > > > >  {
> > > > > > >       pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
> > > > > > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > > > > > > index 2cc2e60bcb39..96f45c444422 100644
> > > > > > > --- a/drivers/pci/pcie/portdrv.c
> > > > > > > +++ b/drivers/pci/pcie/portdrv.c
> > > > > > > @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
> > > > > > >  {
> > > > > > >       device_for_each_child(&dev->dev, NULL, remove_iter);
> > > > > > >       pci_free_irq_vectors(dev);
> > > > > > > -     pci_disable_device(dev);
> > > > > > >  }
> > > > > > >
> > > > > > >  /**
> > > > > > > @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
> > > > > > >       }
> > > > > > >
> > > > > > >       pcie_port_device_remove(dev);
> > > > > > > +
> > > > > > > +     pci_disable_device(dev);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> > > > > > > +{
> > > > > > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > > > > > > +
> > > > > > > +     if (pci_bridge_d3_possible(dev)) {
> > > > > > > +             pm_runtime_forbid(&dev->dev);
> > > > > > > +             pm_runtime_get_noresume(&dev->dev);
> > > > > > > +             pm_runtime_dont_use_autosuspend(&dev->dev);
> > > > > > > +     }
> > > > > > > +
> > > > > > > +     pcie_port_device_remove(dev);
> > > > > > > +
> > > > > > > +     if (!bridge->no_dis_bmaster)
> > > > > > > +             pci_disable_device(dev);
> > > > > > >  }
> > > > > > >
> > > > > > >  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> > > > > > > @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
> > > > > > >
> > > > > > >       .probe          = pcie_portdrv_probe,
> > > > > > >       .remove         = pcie_portdrv_remove,
> > > > > > > -     .shutdown       = pcie_portdrv_remove,
> > > > > > > +     .shutdown       = pcie_portdrv_shutdown,
> > > > > > >
> > > > > > >       .err_handler    = &pcie_portdrv_err_handler,
> > > > > > >
> > > > > > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > > > > > index 3df2049ec4a8..a64dbcb89231 100644
> > > > > > > --- a/include/linux/pci.h
> > > > > > > +++ b/include/linux/pci.h
> > > > > > > @@ -573,6 +573,7 @@ struct pci_host_bridge {
> > > > > > >       unsigned int    ignore_reset_delay:1;   /* For entire hierarchy */
> > > > > > >       unsigned int    no_ext_tags:1;          /* No Extended Tags */
> > > > > > >       unsigned int    no_inc_mrrs:1;          /* No Increase MRRS */
> > > > > > > +     unsigned int    no_dis_bmaster:1;       /* No Disable Bus Master */
> > > > > > >       unsigned int    native_aer:1;           /* OS may use PCIe AER */
> > > > > > >       unsigned int    native_pcie_hotplug:1;  /* OS may use PCIe hotplug */
> > > > > > >       unsigned int    native_shpc_hotplug:1;  /* OS may use SHPC hotplug */
> > > > > > > --
> > > > > > > 2.31.1
> > > > > > >

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-21 15:10               ` Huacai Chen
@ 2023-01-30 12:35                 ` Thorsten Leemhuis
  2023-02-01 22:10                   ` Bjorn Helgaas
  2023-01-31  0:01                 ` Bjorn Helgaas
  1 sibling, 1 reply; 16+ messages in thread
From: Thorsten Leemhuis @ 2023-01-30 12:35 UTC (permalink / raw)
  To: Huacai Chen, Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel,
	Linux kernel regressions list

On 21.01.23 16:10, Huacai Chen wrote:
> On Fri, Jan 20, 2023 at 11:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>
>> On Fri, Jan 20, 2023 at 09:31:43PM +0800, Huacai Chen wrote:
>>> On Thu, Jan 19, 2023 at 8:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>> On Thu, Jan 19, 2023 at 08:25:20PM +0800, Huacai Chen wrote:
>>>>> Ping?
>>>>
>>>> I suggested another possible way to do this that wasn't so much of a
>>>> special case.  Did you explore that at all?
>>>
>>> That is a little difficult for me, but what is worse is that the root
>>> cause doesn't come from gpu or console drivers, but from the root
>>> port. That means: even if we can workaround the gpu issue in another
>>> way, there are still problems on other devices. Besides the graphics
>>> card, the most frequent problematic device is the sata controller
>>> connected on LS7A chipset, there are incomplete I/O accesses after the
>>> root port disabled and also cause reboot failure.
>>
>> Yes, SATA sounds like another case where we want to use the device
>> after we call the driver's remove/shutdown method.  That's not
>> *worse*, it's just another case where we might have to mark devices
>> for special handling.
> That needs too much effort because we need to modify nearly every pci
> driver, and it exceeds my ability. :)

Just wondering: what's the status here? This looks stalled.

I'm asking, as the patches in this thread are supposed to fix this
regression:
https://bugzilla.kernel.org/show_bug.cgi?id=216884

Or should we try to find a different fix/workaround because the proper
solution discussed in this thread needs more time?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

>> If we remove/shutdown *any* Root Port, not just LS7A, I think the idea
>> of assuming downstream devices can continue to work as usual is a
>> little suspect.  They might continue to work by accident today, but it
>> doesn't seem like a robust design.
> The existing design works for so many years, so it is mostly
> reasonable. For the LS7A case, the root cause comes from the root
> port, so a workaround on the root port seems somewhat reasonable.
> 
> Huacai
>>
>>>> I know there's no *existing* way to mark devices that we need to use
>>>> all the way through shutdown or reboot, but if it makes sense, there's
>>>> no reason we couldn't add one.  That has the potential of being more
>>>> generic, e.g., we could do it for all console devices, as opposed to
>>>> quirking a Root Port that just happens to be in the path to the
>>>> console.
>>>>
>>>>> On Sat, Jan 7, 2023 at 10:25 AM Huacai Chen <chenhuacai@gmail.com> wrote:
>>>>>> On Fri, Jan 6, 2023 at 11:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>>>> On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
>>>>>>>> After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe
>>>>>>>> services during shutdown") we observe poweroff/reboot
>>>>>>>> failures on systems with LS7A chipset.
>>>>>>>>
>>>>>>>> We found that if we remove "pci_command &=
>>>>>>>> ~PCI_COMMAND_MASTER" in do_pci_disable_device(), it can
>>>>>>>> work well. The hardware engineer says that the root cause
>>>>>>>> is that CPU is still accessing PCIe devices while
>>>>>>>> poweroff/reboot, and if we disable the Bus Master Bit at
>>>>>>>> this time, the PCIe controller doesn't forward requests to
>>>>>>>> downstream devices, and also does not send TIMEOUT to CPU,
>>>>>>>> which causes CPU wait forever (hardware deadlock).
>>>>>>>>
>>>>>>>> To be clear, the sequence is like this:
>>>>>>>>
>>>>>>>>   - CPU issues MMIO read to device below Root Port
>>>>>>>>
>>>>>>>>   - LS7A Root Port fails to forward transaction to secondary bus
>>>>>>>>     because of LS7A Bus Master defect
>>>>>>>>
>>>>>>>>   - CPU hangs waiting for response to MMIO read
>>>>>>>>
>>>>>>>> Then how is userspace able to use a device after the
>>>>>>>> device is removed?
>>>>>>>>
>>>>>>>> To give more details, let's take the graphics driver (e.g.
>>>>>>>> amdgpu) as an example. The userspace programs call
>>>>>>>> printf() to display "shutting down xxx service" during
>>>>>>>> shutdown/reboot, or the kernel calls printk() to display
>>>>>>>> something during shutdown/reboot. These can happen at any
>>>>>>>> time, even after we call pcie_port_device_remove() to
>>>>>>>> disable the pcie port on the graphic card.
>>>>>>>>
>>>>>>>> The call stack is: printk() --> call_console_drivers() -->
>>>>>>>> con->write() --> vt_console_print() --> fbcon_putcs()
>>>>>>>>
>>>>>>>> This scenario happens because userspace programs (or the
>>>>>>>> kernel itself) don't know whether a device is 'usable',
>>>>>>>> they just use it, at any time.
>>>>>>>
>>>>>>> Thanks for this background.  So basically we want to call
>>>>>>> .remove() on a console device (or a bridge leading to it),
>>>>>>> but we expect it to keep working as usual afterwards?
>>>>>>>
>>>>>>> That seems a little weird.  Is that the design we want?
>>>>>>> Maybe we should have a way to mark devices so we don't
>>>>>>> remove them during shutdown or reboot?
>>>>>>
>>>>>> Sounds reasonable, but it seems no existing way can mark this.
>>>>>>
>>>>>> Huacai
>>>>>>>
>>>>>>>> This hardware behavior is a PCIe protocol violation (Bus Master should
>>>>>>>> not be involved in CPU MMIO transactions), and it will be fixed in new
>>>>>>>> revisions of hardware (add timeout mechanism for CPU read request,
>>>>>>>> whether or not Bus Master bit is cleared).
>>>>>>>>
>>>>>>>> On some x86 platforms, radeon/amdgpu devices can cause similar problems
>>>>>>>> [1][2]. Once before I wanted to make a single patch to solve "all of
>>>>>>>> these problems" together, but it seems unreasonable because maybe they
>>>>>>>> are not exactly the same problem. So, this patch add a new function
>>>>>>>> pcie_portdrv_shutdown(), a slight modified copy of pcie_portdrv_remove()
>>>>>>>> dedicated for the shutdown path, and then add a quirk just for LS7A to
>>>>>>>> avoid clearing Bus Master bit in pcie_portdrv_shutdown(). Leave other
>>>>>>>> platforms behave as before.
>>>>>>>>
>>>>>>>> [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
>>>>>>>> [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
>>>>>>>>
>>>>>>>> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
>>>>>>>> ---
>>>>>>>>  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
>>>>>>>>  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
>>>>>>>>  include/linux/pci.h                   |  1 +
>>>>>>>>  3 files changed, 37 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
>>>>>>>> index 759ec211c17b..641308ba4126 100644
>>>>>>>> --- a/drivers/pci/controller/pci-loongson.c
>>>>>>>> +++ b/drivers/pci/controller/pci-loongson.c
>>>>>>>> @@ -93,6 +93,24 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>>>>>>>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>>>>>>>>                       DEV_PCIE_PORT_2, loongson_mrrs_quirk);
>>>>>>>>
>>>>>>>> +static void loongson_bmaster_quirk(struct pci_dev *pdev)
>>>>>>>> +{
>>>>>>>> +     /*
>>>>>>>> +      * Some Loongson PCIe ports will cause CPU deadlock if there is
>>>>>>>> +      * MMIO access to a downstream device when the root port disable
>>>>>>>> +      * the Bus Master bit during poweroff/reboot.
>>>>>>>> +      */
>>>>>>>> +     struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
>>>>>>>> +
>>>>>>>> +     bridge->no_dis_bmaster = 1;
>>>>>>>> +}
>>>>>>>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>>>>>>>> +                     DEV_PCIE_PORT_0, loongson_bmaster_quirk);
>>>>>>>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>>>>>>>> +                     DEV_PCIE_PORT_1, loongson_bmaster_quirk);
>>>>>>>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>>>>>>>> +                     DEV_PCIE_PORT_2, loongson_bmaster_quirk);
>>>>>>>> +
>>>>>>>>  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
>>>>>>>>  {
>>>>>>>>       pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
>>>>>>>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>>>>>>>> index 2cc2e60bcb39..96f45c444422 100644
>>>>>>>> --- a/drivers/pci/pcie/portdrv.c
>>>>>>>> +++ b/drivers/pci/pcie/portdrv.c
>>>>>>>> @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
>>>>>>>>  {
>>>>>>>>       device_for_each_child(&dev->dev, NULL, remove_iter);
>>>>>>>>       pci_free_irq_vectors(dev);
>>>>>>>> -     pci_disable_device(dev);
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  /**
>>>>>>>> @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       pcie_port_device_remove(dev);
>>>>>>>> +
>>>>>>>> +     pci_disable_device(dev);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void pcie_portdrv_shutdown(struct pci_dev *dev)
>>>>>>>> +{
>>>>>>>> +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
>>>>>>>> +
>>>>>>>> +     if (pci_bridge_d3_possible(dev)) {
>>>>>>>> +             pm_runtime_forbid(&dev->dev);
>>>>>>>> +             pm_runtime_get_noresume(&dev->dev);
>>>>>>>> +             pm_runtime_dont_use_autosuspend(&dev->dev);
>>>>>>>> +     }
>>>>>>>> +
>>>>>>>> +     pcie_port_device_remove(dev);
>>>>>>>> +
>>>>>>>> +     if (!bridge->no_dis_bmaster)
>>>>>>>> +             pci_disable_device(dev);
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>>>>>>>> @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
>>>>>>>>
>>>>>>>>       .probe          = pcie_portdrv_probe,
>>>>>>>>       .remove         = pcie_portdrv_remove,
>>>>>>>> -     .shutdown       = pcie_portdrv_remove,
>>>>>>>> +     .shutdown       = pcie_portdrv_shutdown,
>>>>>>>>
>>>>>>>>       .err_handler    = &pcie_portdrv_err_handler,
>>>>>>>>
>>>>>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>>>>>> index 3df2049ec4a8..a64dbcb89231 100644
>>>>>>>> --- a/include/linux/pci.h
>>>>>>>> +++ b/include/linux/pci.h
>>>>>>>> @@ -573,6 +573,7 @@ struct pci_host_bridge {
>>>>>>>>       unsigned int    ignore_reset_delay:1;   /* For entire hierarchy */
>>>>>>>>       unsigned int    no_ext_tags:1;          /* No Extended Tags */
>>>>>>>>       unsigned int    no_inc_mrrs:1;          /* No Increase MRRS */
>>>>>>>> +     unsigned int    no_dis_bmaster:1;       /* No Disable Bus Master */
>>>>>>>>       unsigned int    native_aer:1;           /* OS may use PCIe AER */
>>>>>>>>       unsigned int    native_pcie_hotplug:1;  /* OS may use PCIe hotplug */
>>>>>>>>       unsigned int    native_shpc_hotplug:1;  /* OS may use SHPC hotplug */
>>>>>>>> --
>>>>>>>> 2.31.1
>>>>>>>>

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-21 15:10               ` Huacai Chen
  2023-01-30 12:35                 ` Thorsten Leemhuis
@ 2023-01-31  0:01                 ` Bjorn Helgaas
  2023-01-31 12:02                   ` Huacai Chen
  1 sibling, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2023-01-31  0:01 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

On Sat, Jan 21, 2023 at 11:10:09PM +0800, Huacai Chen wrote:
> On Fri, Jan 20, 2023 at 11:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, Jan 20, 2023 at 09:31:43PM +0800, Huacai Chen wrote:
> > > On Thu, Jan 19, 2023 at 8:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Jan 19, 2023 at 08:25:20PM +0800, Huacai Chen wrote:
> > > > > Ping?
> > > >
> > > > I suggested another possible way to do this that wasn't so much of a
> > > > special case.  Did you explore that at all?
> > >
> > > That is a little difficult for me, but what is worse is that the root
> > > cause doesn't come from gpu or console drivers, but from the root
> > > port. That means: even if we can workaround the gpu issue in another
> > > way, there are still problems on other devices. Besides the graphics
> > > card, the most frequent problematic device is the sata controller
> > > connected on LS7A chipset, there are incomplete I/O accesses after the
> > > root port disabled and also cause reboot failure.
> >
> > Yes, SATA sounds like another case where we want to use the device
> > after we call the driver's remove/shutdown method.  That's not
> > *worse*, it's just another case where we might have to mark devices
> > for special handling.
>
> That needs too much effort because we need to modify nearly every pci
> driver, and it exceeds my ability. :)

We would only modify drivers that need this special handling, so it's
only console/graphics/disks/network/..., well, OK, I see your point,
it probably *would* be nearly every driver!

> > If we remove/shutdown *any* Root Port, not just LS7A, I think the idea
> > of assuming downstream devices can continue to work as usual is a
> > little suspect.  They might continue to work by accident today, but it
> > doesn't seem like a robust design.
>
> The existing design works for so many years, so it is mostly
> reasonable. For the LS7A case, the root cause comes from the root
> port, so a workaround on the root port seems somewhat reasonable.

Yeah, I think you're right.  A few more notes below.

> > > > > On Sat, Jan 7, 2023 at 10:25 AM Huacai Chen <chenhuacai@gmail.com> wrote:
> > > > > > On Fri, Jan 6, 2023 at 11:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
> > > > > > > > After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe
> > > > > > > > services during shutdown") we observe poweroff/reboot
> > > > > > > > failures on systems with LS7A chipset.
> > > > > > > >
> > > > > > > > We found that if we remove "pci_command &=
> > > > > > > > ~PCI_COMMAND_MASTER" in do_pci_disable_device(), it can
> > > > > > > > work well. The hardware engineer says that the root cause
> > > > > > > > is that CPU is still accessing PCIe devices while
> > > > > > > > poweroff/reboot, and if we disable the Bus Master Bit at
> > > > > > > > this time, the PCIe controller doesn't forward requests to
> > > > > > > > downstream devices, and also does not send TIMEOUT to CPU,
> > > > > > > > which causes CPU wait forever (hardware deadlock).
> > > > > > > >
> > > > > > > > To be clear, the sequence is like this:
> > > > > > > >
> > > > > > > >   - CPU issues MMIO read to device below Root Port
> > > > > > > >
> > > > > > > >   - LS7A Root Port fails to forward transaction to secondary bus
> > > > > > > >     because of LS7A Bus Master defect
> > > > > > > >
> > > > > > > >   - CPU hangs waiting for response to MMIO read
> ...

> > > > > > > > +
> > > > > > > > +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> > > > > > > > +{
> > > > > > > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > > > > > > > +
> > > > > > > > +     if (pci_bridge_d3_possible(dev)) {
> > > > > > > > +             pm_runtime_forbid(&dev->dev);
> > > > > > > > +             pm_runtime_get_noresume(&dev->dev);
> > > > > > > > +             pm_runtime_dont_use_autosuspend(&dev->dev);
> > > > > > > > +     }
> > > > > > > > +
> > > > > > > > +     pcie_port_device_remove(dev);
> > > > > > > > +
> > > > > > > > +     if (!bridge->no_dis_bmaster)
> > > > > > > > +             pci_disable_device(dev);

I think there's an argument that pcie_portdrv_shutdown() doesn't
actually need to clear bus mastering on *any* platform.

For reboot and poweroff, we only use .shutdown(), and .shutdown() only
needs to stop DMA and interrupts.  Clearing bus master enable stops
MSI/MSI-X, since those are DMA writes, but doesn't do anything to stop
INTx, which portdrv does use in some cases.

But the port service .remove() methods, which we reach through
pcie_port_device_remove(), *do* clear the interrupt enables for each
service (PCI_ERR_ROOT_COMMAND, PCI_EXP_DPC_CTL, PCI_EXP_SLTCTL, and
PCI_EXP_RTCTL), so all the interrupts should be disabled regardless of
whether they are MSI/MSI-X or INTx, even without disabling bus
mastering.

So I would argue that omitting the pci_disable_device() here might be
enough, and we wouldn't need the quirk at all.
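
To illustrate the idea, here is a rough userspace model (not kernel
code, untested against real hardware).  Only PCI_COMMAND_MASTER is a
real definition; struct model_port and the model_*() helpers are made
up for this sketch, and the four flags stand in for the AER, DPC,
hotplug and PME interrupt enables mentioned above:

```c
#include <assert.h>
#include <stdint.h>

/* Real value from include/uapi/linux/pci_regs.h: Bus Master Enable. */
#define PCI_COMMAND_MASTER 0x4

#define MODEL_NR_SERVICES 4	/* assumption: AER, DPC, hotplug, PME */

/* Hypothetical model of a port: Command register plus one interrupt
 * enable flag per port service. */
struct model_port {
	uint16_t command;
	int irq_enabled[MODEL_NR_SERVICES];
};

/* Stands in for pcie_port_device_remove(): every service's .remove()
 * clears its own interrupt enable. */
static void model_remove_services(struct model_port *p)
{
	for (int i = 0; i < MODEL_NR_SERVICES; i++)
		p->irq_enabled[i] = 0;
}

/* Shutdown without pci_disable_device(): the Command register,
 * including Bus Master Enable, is deliberately left untouched. */
static void model_shutdown(struct model_port *p)
{
	model_remove_services(p);
}

/* Returns nonzero if any service could still raise an interrupt. */
static int model_any_irq_enabled(const struct model_port *p)
{
	for (int i = 0; i < MODEL_NR_SERVICES; i++)
		if (p->irq_enabled[i])
			return 1;
	return 0;
}
```

In this model, shutdown leaves the port with no interrupt source
enabled while Bus Master Enable stays set, which is the state the
LS7A needs in order to keep forwarding MMIO to downstream devices.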

Bjorn

* Re: [PATCH V2 1/2] PCI: loongson: Improve the MRRS quirk for LS7A
  2023-01-06  9:51 ` [PATCH V2 1/2] PCI: loongson: Improve the MRRS quirk for LS7A Huacai Chen
@ 2023-01-31  0:16   ` Bjorn Helgaas
  2023-01-31 11:54     ` Huacai Chen
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2023-01-31  0:16 UTC (permalink / raw)
  To: Huacai Chen
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Huacai Chen, Jiaxun Yang, HougeLangley, Thorsten Leemhuis,
	Heiner Kallweit, WANG Xuerui

On Fri, Jan 06, 2023 at 05:51:42PM +0800, Huacai Chen wrote:
> In new revision of LS7A, some PCIe ports support larger value than 256,
> but their maximum supported MRRS values are not detectable. Moreover,
> the current loongson_mrrs_quirk() cannot avoid devices increasing its
> MRRS after pci_enable_device(), and some devices (e.g. Realtek 8169)
> will actually set a big value in its driver. So the only possible way
> is configure MRRS of all devices in BIOS, and add a pci host bridge bit
> flag (i.e., no_inc_mrrs) to stop the increasing MRRS operations.
> 
> However, according to PCIe Spec, it is legal for an OS to program any
> value for MRRS, and it is also legal for an endpoint to generate a Read
> Request with any size up to its MRRS. As the hardware engineers say, the
> root cause here is LS7A doesn't break up large read requests. In detail,
> LS7A PCIe port reports CA (Completer Abort) if it receives a Memory Read
> request with a size that's "too big" ("too big" means larger than the
> PCIe ports can handle, which means 256 for some ports and 4096 for the
> others, and of course this is a problem in the LS7A's hardware design).

Can you take a look at
https://bugzilla.kernel.org/show_bug.cgi?id=216884 ?

That claims to be a regression between v6.1 and v6.2-rc2, and WANG
Xuerui says this patch is the fix (though AFAICT the submitter has not
verified this yet).  If so, we should reference that bug here and try
to get this in v6.2.

See below.

> -		if (pci_match_id(bridge_devids, bridge)) {
> -			if (pcie_get_readrq(dev) > 256) {
> -				pci_info(dev, "limiting MRRS to 256\n");
> -				pcie_set_readrq(dev, 256);
> -			}
> -			break;
> -		}

> +	if (bridge->no_inc_mrrs) {
> +		if (rq > pcie_get_readrq(dev))
> +			return -EINVAL;

I think the message about limiting MRRS was useful and we should keep
it.

Bjorn

^ permalink raw reply	[flat|nested] 16+ messages in thread
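
One way to combine the no_inc_mrrs check with the diagnostic requested in
the review above is sketched here as a user-space model. The model_* types
and the wording of the message are assumptions for illustration, not the
patch as posted or as merged. The range and power-of-two validation mirrors
the check the kernel's pcie_set_readrq() performs on its argument.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for struct pci_host_bridge and struct pci_dev. */
struct model_bridge {
	bool no_inc_mrrs; /* host bridge forbids raising MRRS */
};

struct model_dev {
	struct model_bridge *bridge;
	int mrrs; /* current Max_Read_Request_Size in bytes */
};

/* MRRS must be a power of two between 128 and 4096 bytes. */
static bool model_mrrs_valid(int rq)
{
	return rq >= 128 && rq <= 4096 && (rq & (rq - 1)) == 0;
}

/*
 * Models pcie_set_readrq() with the no_inc_mrrs check: lowering MRRS is
 * always allowed, raising it is refused when the bridge sets the flag.
 * The refusal is logged so the limit stays visible to the user.
 */
static int model_set_readrq(struct model_dev *dev, int rq)
{
	if (!model_mrrs_valid(rq))
		return -EINVAL;

	if (dev->bridge->no_inc_mrrs && rq > dev->mrrs) {
		/* Message text is illustrative only. */
		fprintf(stderr, "can't set MRRS to %d; limited to %d\n",
			rq, dev->mrrs);
		return -EINVAL;
	}

	dev->mrrs = rq;
	return 0;
}
```

With the flag set and a firmware-programmed MRRS of 256, a driver request
for 4096 is refused and logged, while a request for 128 still succeeds.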

* Re: [PATCH V2 1/2] PCI: loongson: Improve the MRRS quirk for LS7A
  2023-01-31  0:16   ` Bjorn Helgaas
@ 2023-01-31 11:54     ` Huacai Chen
  0 siblings, 0 replies; 16+ messages in thread
From: Huacai Chen @ 2023-01-31 11:54 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, HougeLangley, Thorsten Leemhuis, Heiner Kallweit,
	WANG Xuerui

Hi, Bjorn,

On Tue, Jan 31, 2023 at 8:16 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Jan 06, 2023 at 05:51:42PM +0800, Huacai Chen wrote:
> > In new revision of LS7A, some PCIe ports support larger value than 256,
> > but their maximum supported MRRS values are not detectable. Moreover,
> > the current loongson_mrrs_quirk() cannot avoid devices increasing its
> > MRRS after pci_enable_device(), and some devices (e.g. Realtek 8169)
> > will actually set a big value in its driver. So the only possible way
> > is configure MRRS of all devices in BIOS, and add a pci host bridge bit
> > flag (i.e., no_inc_mrrs) to stop the increasing MRRS operations.
> >
> > However, according to PCIe Spec, it is legal for an OS to program any
> > value for MRRS, and it is also legal for an endpoint to generate a Read
> > Request with any size up to its MRRS. As the hardware engineers say, the
> > root cause here is LS7A doesn't break up large read requests. In detail,
> > LS7A PCIe port reports CA (Completer Abort) if it receives a Memory Read
> > request with a size that's "too big" ("too big" means larger than the
> > PCIe ports can handle, which means 256 for some ports and 4096 for the
> > others, and of course this is a problem in the LS7A's hardware design).
>
> Can you take a look at
> https://bugzilla.kernel.org/show_bug.cgi?id=216884 ?
>
> That claims to be a regression between v6.1 and v6.2-rc2, and WANG
> Xuerui says this patch is the fix (though AFAICT the submitter has not
> verified this yet).  If so, we should reference that bug here and try
> to get this in v6.2.
Yes, this patch can fix that issue. But I don't think this is a
regression: the vanilla 6.1 kernel also has this problem, so maybe the
reporter uses a patched 6.1 kernel.

Huacai
>
> See below.
>
> > -             if (pci_match_id(bridge_devids, bridge)) {
> > -                     if (pcie_get_readrq(dev) > 256) {
> > -                             pci_info(dev, "limiting MRRS to 256\n");
> > -                             pcie_set_readrq(dev, 256);
> > -                     }
> > -                     break;
> > -             }
>
> > +     if (bridge->no_inc_mrrs) {
> > +             if (rq > pcie_get_readrq(dev))
> > +                     return -EINVAL;
>
> I think the message about limiting MRRS was useful and we should keep
> it.
>
> Bjorn

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-31  0:01                 ` Bjorn Helgaas
@ 2023-01-31 12:02                   ` Huacai Chen
  0 siblings, 0 replies; 16+ messages in thread
From: Huacai Chen @ 2023-01-31 12:02 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang, Rafael J. Wysocki, linux-pm, linux-kernel

Hi, Bjorn,

On Tue, Jan 31, 2023 at 8:01 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sat, Jan 21, 2023 at 11:10:09PM +0800, Huacai Chen wrote:
> > On Fri, Jan 20, 2023 at 11:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Fri, Jan 20, 2023 at 09:31:43PM +0800, Huacai Chen wrote:
> > > > On Thu, Jan 19, 2023 at 8:50 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, Jan 19, 2023 at 08:25:20PM +0800, Huacai Chen wrote:
> > > > > > Ping?
> > > > >
> > > > > I suggested another possible way to do this that wasn't so much of a
> > > > > special case.  Did you explore that at all?
> > > >
> > > > That is a little difficult for me, but what is worse is that the root
> > > > cause doesn't come from gpu or console drivers, but from the root
> > > > port. That means: even if we can workaround the gpu issue in another
> > > > way, there are still problems on other devices. Besides the graphics
> > > > card, the most frequent problematic device is the sata controller
> > > > connected on LS7A chipset, there are incomplete I/O accesses after the
> > > > root port disabled and also cause reboot failure.
> > >
> > > Yes, SATA sounds like another case where we want to use the device
> > > after we call the driver's remove/shutdown method.  That's not
> > > *worse*, it's just another case where we might have to mark devices
> > > for special handling.
> >
> > That needs too much effort because we need to modify nearly every pci
> > driver, and it exceeds my ability. :)
>
> We would only modify drivers that need this special handling, so it's
> only console/graphics/disks/network/..., well, OK, I see your point,
> it probably *would* be nearly every driver!
>
> > > If we remove/shutdown *any* Root Port, not just LS7A, I think the idea
> > > of assuming downstream devices can continue to work as usual is a
> > > little suspect.  They might continue to work by accident today, but it
> > > doesn't seem like a robust design.
> >
> > The existing design works for so many years, so it is mostly
> > reasonable. For the LS7A case, the root cause comes from the root
> > port, so a workaround on the root port seems somewhat reasonable.
>
> Yeah, I think you're right.  A few more notes below.
>
> > > > > > On Sat, Jan 7, 2023 at 10:25 AM Huacai Chen <chenhuacai@gmail.com> wrote:
> > > > > > > On Fri, Jan 6, 2023 at 11:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > On Fri, Jan 06, 2023 at 05:51:43PM +0800, Huacai Chen wrote:
> > > > > > > > > After cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe
> > > > > > > > > services during shutdown") we observe poweroff/reboot
> > > > > > > > > failures on systems with LS7A chipset.
> > > > > > > > >
> > > > > > > > > We found that if we remove "pci_command &=
> > > > > > > > > ~PCI_COMMAND_MASTER" in do_pci_disable_device(), it can
> > > > > > > > > work well. The hardware engineer says that the root cause
> > > > > > > > > is that CPU is still accessing PCIe devices while
> > > > > > > > > poweroff/reboot, and if we disable the Bus Master Bit at
> > > > > > > > > this time, the PCIe controller doesn't forward requests to
> > > > > > > > > downstream devices, and also does not send TIMEOUT to CPU,
> > > > > > > > > which causes CPU wait forever (hardware deadlock).
> > > > > > > > >
> > > > > > > > > To be clear, the sequence is like this:
> > > > > > > > >
> > > > > > > > >   - CPU issues MMIO read to device below Root Port
> > > > > > > > >
> > > > > > > > >   - LS7A Root Port fails to forward transaction to secondary bus
> > > > > > > > >     because of LS7A Bus Master defect
> > > > > > > > >
> > > > > > > > >   - CPU hangs waiting for response to MMIO read
> > ...
>
> > > > > > > > > +
> > > > > > > > > +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> > > > > > > > > +{
> > > > > > > > > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > > > > > > > > +
> > > > > > > > > +     if (pci_bridge_d3_possible(dev)) {
> > > > > > > > > +             pm_runtime_forbid(&dev->dev);
> > > > > > > > > +             pm_runtime_get_noresume(&dev->dev);
> > > > > > > > > +             pm_runtime_dont_use_autosuspend(&dev->dev);
> > > > > > > > > +     }
> > > > > > > > > +
> > > > > > > > > +     pcie_port_device_remove(dev);
> > > > > > > > > +
> > > > > > > > > +     if (!bridge->no_dis_bmaster)
> > > > > > > > > +             pci_disable_device(dev);
>
> I think there's an argument that pcie_portdrv_shutdown() doesn't
> actually need to clear bus mastering on *any* platform.
>
> For reboot and poweroff, we only use .shutdown(), and .shutdown() only
> needs to stop DMA and interrupts.  Clearing bus master enable stops
> MSI/MSI-X since that's a DMA, but doesn't do anything to stop INTx,
> which portdrv does use in some cases.
>
> But those .remove() methods *do* clear the interrupt enables for each
> service (PCI_ERR_ROOT_COMMAND, PCI_EXP_DPC_CTL, PCI_EXP_SLTCTL, and
> PCI_EXP_RTCTL), so all the interrupts should be disabled regardless of
> whether they are MSI/MSI-X or INTx, even without disabling bus
> mastering.
>
> So I would argue that omitting the pci_disable_device() here might be
> enough, and we wouldn't need the quirk at all.
Emm, this seems much simpler and cleaner. I will send a new version
in the next few days, thank you.

Huacai
>
> Bjorn

^ permalink raw reply	[flat|nested] 16+ messages in thread
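
The hang sequence quoted in this thread can be modelled in a few lines of
user-space C. This is a sketch under stated assumptions: the model_* names
and the ls7a_defect flag are hypothetical, and the "no response" result
stands for the case where the CPU never gets a completion or a timeout. As
far as the spec goes, a bridge's Bus Master Enable bit governs forwarding
of requests in the upstream direction, so a CPU-initiated read would
normally still be forwarded downstream with the bit clear. The model
assumes the LS7A Root Port deviates from that, as described by the
hardware engineers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Real value from include/uapi/linux/pci_regs.h. */
#define PCI_COMMAND_MASTER 0x4

enum model_mmio_result {
	MODEL_MMIO_OK,          /* request forwarded, completion returned */
	MODEL_MMIO_NO_RESPONSE, /* no completion and no timeout: CPU waits */
};

struct model_root_port {
	uint16_t command; /* models the PCI_COMMAND register */
	bool ls7a_defect; /* hypothetical flag for the LS7A behavior */
};

/*
 * Models a CPU MMIO read to a device below the Root Port. A port
 * without the defect forwards the read regardless of Bus Master
 * Enable; the defective port drops it once the bit is cleared.
 */
static enum model_mmio_result
model_cpu_mmio_read(const struct model_root_port *rp)
{
	bool bme = (rp->command & PCI_COMMAND_MASTER) != 0;

	if (rp->ls7a_defect && !bme)
		return MODEL_MMIO_NO_RESPONSE;

	return MODEL_MMIO_OK;
}
```

The model shows why leaving PCI_COMMAND_MASTER set during shutdown avoids
the hang on the defective port without changing the outcome on others.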

* Re: [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-30 12:35                 ` Thorsten Leemhuis
@ 2023-02-01 22:10                   ` Bjorn Helgaas
  0 siblings, 0 replies; 16+ messages in thread
From: Bjorn Helgaas @ 2023-02-01 22:10 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Huacai Chen, Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczyński, linux-pci, Jianmin Lv,
	Xuefeng Li, Jiaxun Yang, Rafael J. Wysocki, linux-pm,
	linux-kernel, Linux kernel regressions list

On Mon, Jan 30, 2023 at 01:35:16PM +0100, Thorsten Leemhuis wrote:
> Just wondering: what's the status here? This looks stalled.
> 
> I'm asking, as the patches in this thread are supposed to fix this
> regression:
> https://bugzilla.kernel.org/show_bug.cgi?id=216884

#regzbot resolve: [patch 1/2] will fix bz216884, but it is not a regression

^ permalink raw reply	[flat|nested] 16+ messages in thread

Thread overview: 16+ messages
2023-01-06  9:51 [PATCH V2 0/2] PCI: Add two Loongson's LS7A quirks Huacai Chen
2023-01-06  9:51 ` [PATCH V2 1/2] PCI: loongson: Improve the MRRS quirk for LS7A Huacai Chen
2023-01-31  0:16   ` Bjorn Helgaas
2023-01-31 11:54     ` Huacai Chen
2023-01-06  9:51 ` [PATCH V2 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
2023-01-06 15:38   ` Bjorn Helgaas
2023-01-07  2:25     ` Huacai Chen
2023-01-19 12:25       ` Huacai Chen
2023-01-19 12:50         ` Bjorn Helgaas
2023-01-20 13:31           ` Huacai Chen
2023-01-20 15:36             ` Bjorn Helgaas
2023-01-21 15:10               ` Huacai Chen
2023-01-30 12:35                 ` Thorsten Leemhuis
2023-02-01 22:10                   ` Bjorn Helgaas
2023-01-31  0:01                 ` Bjorn Helgaas
2023-01-31 12:02                   ` Huacai Chen
