linux-pci.vger.kernel.org archive mirror
* [PATCH 0/2] PCI: Add two Loongson's LS7A quirks
@ 2023-01-03  7:33 Huacai Chen
  2023-01-03  7:34 ` [PATCH 1/2] PCI: loongson: Improve the MRRS quirk for LS7A Huacai Chen
  2023-01-03  7:34 ` [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
  0 siblings, 2 replies; 7+ messages in thread
From: Huacai Chen @ 2023-01-03  7:33 UTC
  To: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring, Krzysztof Wilczyński
  Cc: linux-pci, Jianmin Lv, Xuefeng Li, Huacai Chen, Jiaxun Yang,
	Huacai Chen, Tiezhu Yang

Hi, Bjorn,

This patchset adds two quirks to resolve problems with Loongson's LS7A
chipset: the first patch improves the MRRS quirk for the LS7A chipset; the
second patch adds a new quirk for the LS7A chipset to avoid poweroff/reboot
failure.

You said you would think about these two patches carefully, so after a
long time I have rebased them on top of the latest code and am resending
them to see if you have new considerations now. :)

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> 
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---
 drivers/acpi/pci_mcfg.c               |  13 ++
 drivers/pci/controller/Kconfig        |   2 +-
 drivers/pci/controller/pci-loongson.c | 233 ++++++++++++++++++++++++++--------
 drivers/pci/pci.c                     |   6 +
 drivers/pci/pcie/portdrv_core.c       |   1 -
 drivers/pci/pcie/portdrv_pci.c        |  20 ++-
 include/linux/pci-ecam.h              |   1 +
 include/linux/pci.h                   |   2 +
 8 files changed, 225 insertions(+), 53 deletions(-)
--
2.27.0



* [PATCH 1/2] PCI: loongson: Improve the MRRS quirk for LS7A
  2023-01-03  7:33 [PATCH 0/2] PCI: Add two Loongson's LS7A quirks Huacai Chen
@ 2023-01-03  7:34 ` Huacai Chen
  2023-01-03  7:34 ` [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
  1 sibling, 0 replies; 7+ messages in thread
From: Huacai Chen @ 2023-01-03  7:34 UTC
  To: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring, Krzysztof Wilczyński
  Cc: linux-pci, Jianmin Lv, Xuefeng Li, Huacai Chen, Jiaxun Yang, Huacai Chen

In new revisions of LS7A, some PCIe ports support MRRS values larger than
256, but their maximum supported MRRS values are not detectable. Moreover,
the current loongson_mrrs_quirk() cannot prevent a device from increasing
its MRRS after pci_enable_device(), and some devices (e.g. Realtek 8169)
actually set a large value in their drivers. So the only possible way is to
configure the MRRS of all devices in the BIOS and add a PCI host bridge bit
flag (i.e., no_inc_mrrs) to stop later MRRS increases.

However, according to the PCIe spec, it is legal for an OS to program any
value for MRRS, and it is also legal for an endpoint to generate a Read
Request with any size up to its MRRS. According to the hardware engineers,
the root cause here is that LS7A doesn't break up large read requests. In
detail, an LS7A PCIe port reports CA (Completer Abort) if it receives a
Memory Read request with a size that's "too big" ("too big" means larger
than the PCIe port can handle, which is 256 bytes for some ports and 4096
bytes for the others; this is a flaw in the LS7A's hardware design).
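
For illustration only, here is a minimal user-space sketch of the rule
described above. It is not the kernel code: struct model_dev and the
model_*() helpers are hypothetical names, and the flag is stored per device
here, whereas the patch keeps it in struct pci_host_bridge. The sketch
assumes the usual MRRS encoding (a 3-bit field in bits 14:12 of Device
Control, decoded as 128 << field), which matches the
"(ffs(rq) - 8) << 12" computation in pcie_set_readrq().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <strings.h>	/* ffs() */

#define DEVCTL_READRQ_MASK 0x7000u	/* MRRS field, bits 14:12 */

/* Hypothetical stand-in for one PCIe function's Device Control register. */
struct model_dev {
	uint16_t devctl;
	bool no_inc_mrrs;	/* mirrors the host bridge flag from the patch */
};

/* Decode the current MRRS in bytes: 128 << field. */
static int model_get_readrq(const struct model_dev *dev)
{
	return 128 << ((dev->devctl & DEVCTL_READRQ_MASK) >> 12);
}

/* Set MRRS; refuse any increase while no_inc_mrrs is set. */
static int model_set_readrq(struct model_dev *dev, int rq)
{
	/* Only powers of two from 128 to 4096 bytes are valid. */
	if (rq < 128 || rq > 4096 || (rq & (rq - 1)) != 0)
		return -EINVAL;

	/* The new check: lowering is allowed, raising is not. */
	if (dev->no_inc_mrrs && rq > model_get_readrq(dev))
		return -EINVAL;

	dev->devctl = (uint16_t)((dev->devctl & ~DEVCTL_READRQ_MASK) |
				 ((unsigned int)(ffs(rq) - 8) << 12));
	return 0;
}
```

In this model, a device whose firmware-set MRRS is 256 rejects a request
for 512 but accepts 128. Once lowered to 128, it cannot be raised back to
256 while the flag is set, which is a consequence worth keeping in mind
when relying on the BIOS-programmed value.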

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
 drivers/pci/controller/pci-loongson.c | 44 +++++++++------------------
 drivers/pci/pci.c                     |  6 ++++
 include/linux/pci.h                   |  1 +
 3 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
index 05c50408f13b..759ec211c17b 100644
--- a/drivers/pci/controller/pci-loongson.c
+++ b/drivers/pci/controller/pci-loongson.c
@@ -75,37 +75,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 			DEV_LS7A_LPC, system_bus_quirk);
 
-static void loongson_mrrs_quirk(struct pci_dev *dev)
+static void loongson_mrrs_quirk(struct pci_dev *pdev)
 {
-	struct pci_bus *bus = dev->bus;
-	struct pci_dev *bridge;
-	static const struct pci_device_id bridge_devids[] = {
-		{ PCI_VDEVICE(LOONGSON, DEV_PCIE_PORT_0) },
-		{ PCI_VDEVICE(LOONGSON, DEV_PCIE_PORT_1) },
-		{ PCI_VDEVICE(LOONGSON, DEV_PCIE_PORT_2) },
-		{ 0, },
-	};
-
-	/* look for the matching bridge */
-	while (!pci_is_root_bus(bus)) {
-		bridge = bus->self;
-		bus = bus->parent;
-		/*
-		 * Some Loongson PCIe ports have a h/w limitation of
-		 * 256 bytes maximum read request size. They can't handle
-		 * anything larger than this. So force this limit on
-		 * any devices attached under these ports.
-		 */
-		if (pci_match_id(bridge_devids, bridge)) {
-			if (pcie_get_readrq(dev) > 256) {
-				pci_info(dev, "limiting MRRS to 256\n");
-				pcie_set_readrq(dev, 256);
-			}
-			break;
-		}
-	}
+	/*
+	 * Some Loongson PCIe ports have h/w limitations of maximum read
+	 * request size. They can't handle anything larger than this. So
+	 * force this limit on any devices attached under these ports.
+	 */
+	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
+
+	bridge->no_inc_mrrs = 1;
 }
-DECLARE_PCI_FIXUP_ENABLE(PCI_ANY_ID, PCI_ANY_ID, loongson_mrrs_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_0, loongson_mrrs_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_1, loongson_mrrs_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_2, loongson_mrrs_quirk);
 
 static void loongson_pci_pin_quirk(struct pci_dev *pdev)
 {
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index fba95486caaf..ae88210a12c7 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6033,6 +6033,7 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 {
 	u16 v;
 	int ret;
+	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
 
 	if (rq < 128 || rq > 4096 || !is_power_of_2(rq))
 		return -EINVAL;
@@ -6051,6 +6052,11 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 
 	v = (ffs(rq) - 8) << 12;
 
+	if (bridge->no_inc_mrrs) {
+		if (rq > pcie_get_readrq(dev))
+			return -EINVAL;
+	}
+
 	ret = pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
 						  PCI_EXP_DEVCTL_READRQ, v);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index adffd65e84b4..3df2049ec4a8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -572,6 +572,7 @@ struct pci_host_bridge {
 	void		*release_data;
 	unsigned int	ignore_reset_delay:1;	/* For entire hierarchy */
 	unsigned int	no_ext_tags:1;		/* No Extended Tags */
+	unsigned int	no_inc_mrrs:1;		/* No Increase MRRS */
 	unsigned int	native_aer:1;		/* OS may use PCIe AER */
 	unsigned int	native_pcie_hotplug:1;	/* OS may use PCIe hotplug */
 	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
-- 
2.31.1



* [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-03  7:33 [PATCH 0/2] PCI: Add two Loongson's LS7A quirks Huacai Chen
  2023-01-03  7:34 ` [PATCH 1/2] PCI: loongson: Improve the MRRS quirk for LS7A Huacai Chen
@ 2023-01-03  7:34 ` Huacai Chen
  2023-01-04 18:37   ` Bjorn Helgaas
  1 sibling, 1 reply; 7+ messages in thread
From: Huacai Chen @ 2023-01-03  7:34 UTC
  To: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring, Krzysztof Wilczyński
  Cc: linux-pci, Jianmin Lv, Xuefeng Li, Huacai Chen, Jiaxun Yang, Huacai Chen

cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during shutdown")
causes poweroff/reboot failure on systems with LS7A chipset. We found
that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in do_pci_disable
_device(), it can work well. The hardware engineer says that the root
cause is that the CPU is still accessing PCIe devices during poweroff/reboot,
and if we disable the Bus Master bit at this time, the PCIe controller
doesn't forward requests to downstream devices and also does not send a
TIMEOUT to the CPU, which causes the CPU to wait forever (hardware
deadlock). This behavior is a PCIe protocol violation (Bus Master should not
be involved in CPU MMIO transactions), and it will be fixed in new hardware
revisions (by adding a timeout mechanism for CPU read requests, whether or
not the Bus Master bit is cleared).

On some x86 platforms, radeon/amdgpu devices can cause similar problems
[1][2]. I once wanted to write a single patch to solve "all of these
problems" together, but that seems unreasonable because they may not be
exactly the same problem. So, this patch adds a new function
pcie_portdrv_shutdown(), a slightly modified copy of pcie_portdrv_remove()
dedicated to the shutdown path, and then adds a quirk just for LS7A to
avoid clearing the Bus Master bit in pcie_portdrv_shutdown(). Other
platforms behave as before.
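
As a rough illustration of the split between the remove and shutdown paths,
the following user-space sketch models only the Bus Master handling. The
struct and function names (model_bridge, model_port, model_*()) are
hypothetical and exist only for this sketch; service teardown and runtime
PM handling are left out. CMD_MASTER stands for the Bus Master Enable bit
(bit 2) of the PCI Command register.

```c
#include <stdbool.h>
#include <stdint.h>

#define CMD_MASTER 0x4u	/* Bus Master Enable, bit 2 of PCI Command */

/* Hypothetical stand-ins for the host bridge and a PCIe port. */
struct model_bridge {
	bool no_dis_bmaster;	/* set by the LS7A quirk in the patch */
};

struct model_port {
	uint16_t command;	/* PCI Command register value */
	const struct model_bridge *bridge;
};

/* Models the one effect of pci_disable_device() that matters here. */
static void model_disable_device(struct model_port *port)
{
	port->command &= (uint16_t)~CMD_MASTER;
}

/* Remove path: always clears Bus Master (service teardown elided). */
static void model_portdrv_remove(struct model_port *port)
{
	model_disable_device(port);
}

/* Shutdown path: skips the clear when the bridge carries the quirk. */
static void model_portdrv_shutdown(struct model_port *port)
{
	if (!port->bridge->no_dis_bmaster)
		model_disable_device(port);
}
```

With the flag set, the shutdown path leaves the Command register untouched,
while the remove path behaves the same on every platform.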

[1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
[2] https://bugs.freedesktop.org/show_bug.cgi?id=98638

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
 drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
 drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
 include/linux/pci.h                   |  1 +
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
index 759ec211c17b..641308ba4126 100644
--- a/drivers/pci/controller/pci-loongson.c
+++ b/drivers/pci/controller/pci-loongson.c
@@ -93,6 +93,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
 			DEV_PCIE_PORT_2, loongson_mrrs_quirk);
 
+static void loongson_bmaster_quirk(struct pci_dev *pdev)
+{
+	/*
+	 * Some Loongson PCIe ports will cause CPU deadlock if disable
+	 * the Bus Master bit during poweroff/reboot.
+	 */
+	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
+
+	bridge->no_dis_bmaster = 1;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_0, loongson_bmaster_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_1, loongson_bmaster_quirk);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
+			DEV_PCIE_PORT_2, loongson_bmaster_quirk);
+
 static void loongson_pci_pin_quirk(struct pci_dev *pdev)
 {
 	pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 2cc2e60bcb39..96f45c444422 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
 {
 	device_for_each_child(&dev->dev, NULL, remove_iter);
 	pci_free_irq_vectors(dev);
-	pci_disable_device(dev);
 }
 
 /**
@@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
 	}
 
 	pcie_port_device_remove(dev);
+
+	pci_disable_device(dev);
+}
+
+static void pcie_portdrv_shutdown(struct pci_dev *dev)
+{
+	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
+
+	if (pci_bridge_d3_possible(dev)) {
+		pm_runtime_forbid(&dev->dev);
+		pm_runtime_get_noresume(&dev->dev);
+		pm_runtime_dont_use_autosuspend(&dev->dev);
+	}
+
+	pcie_port_device_remove(dev);
+
+	if (!bridge->no_dis_bmaster)
+		pci_disable_device(dev);
 }
 
 static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
@@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
 
 	.probe		= pcie_portdrv_probe,
 	.remove		= pcie_portdrv_remove,
-	.shutdown	= pcie_portdrv_remove,
+	.shutdown	= pcie_portdrv_shutdown,
 
 	.err_handler	= &pcie_portdrv_err_handler,
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3df2049ec4a8..a64dbcb89231 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -573,6 +573,7 @@ struct pci_host_bridge {
 	unsigned int	ignore_reset_delay:1;	/* For entire hierarchy */
 	unsigned int	no_ext_tags:1;		/* No Extended Tags */
 	unsigned int	no_inc_mrrs:1;		/* No Increase MRRS */
+	unsigned int	no_dis_bmaster:1;	/* No Disable Bus Master */
 	unsigned int	native_aer:1;		/* OS may use PCIe AER */
 	unsigned int	native_pcie_hotplug:1;	/* OS may use PCIe hotplug */
 	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
-- 
2.31.1



* Re: [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-03  7:34 ` [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
@ 2023-01-04 18:37   ` Bjorn Helgaas
  2023-01-05  2:49     ` Huacai Chen
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2023-01-04 18:37 UTC
  To: Huacai Chen
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Huacai Chen, Jiaxun Yang

On Tue, Jan 03, 2023 at 03:34:01PM +0800, Huacai Chen wrote:
> cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during shutdown")
> causes poweroff/reboot failure on systems with LS7A chipset. We found
> that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in do_pci_disable
> _device(), it can work well. The hardware engineer says that the root
> cause is that CPU is still accessing PCIe devices while poweroff/reboot,

Did you ever figure out what these CPU accesses are?  If we call the
Root Port .shutdown() method, and later access a downstream device,
that seems like a problem in itself.  At least, we should understand
exactly *why* we access that downstream device.

To be clear, cc27b735ad3a does not cause the failure.  IIUC, the cause
is:

  - CPU issues MMIO read to device below Root Port

  - LS7A Root Port fails to forward transaction to secondary bus
    because of LS7A Bus Master defect

  - CPU hangs waiting for response to MMIO read
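
The sequence above can be condensed into a small decision model. This is a
sketch based only on the behavior described in this thread, not on hardware
documentation; model_root_port, has_ls7a_defect, and model_mmio_read() are
hypothetical names introduced for illustration.

```c
#include <stdbool.h>

enum mmio_result {
	MMIO_COMPLETED,	/* read forwarded downstream, CPU gets a response */
	MMIO_HANG	/* read not forwarded and no timeout: CPU waits forever */
};

/* Hypothetical model of a Root Port as seen by a CPU MMIO read. */
struct model_root_port {
	bool bus_master_enabled;
	bool has_ls7a_defect;	/* assumption: marks the affected LS7A ports */
};

static enum mmio_result model_mmio_read(const struct model_root_port *rp)
{
	/*
	 * On a conforming port, Bus Master Enable does not gate CPU MMIO
	 * reads to downstream devices. On the defective port, clearing it
	 * blocks forwarding and no error is returned to the CPU.
	 */
	if (rp->has_ls7a_defect && !rp->bus_master_enabled)
		return MMIO_HANG;

	return MMIO_COMPLETED;
}
```

The model makes the point explicit: clearing Bus Master alone does nothing
harmful; the hang needs both the defect and a later MMIO read through the
port.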

> and if we disable the Bus Master Bit at this time, the PCIe controller
> doesn't forward requests to downstream devices, and also does not send
> TIMEOUT to CPU, which causes CPU wait forever (hardware deadlock). This
> behavior is a PCIe protocol violation (Bus Master should not be involved
> in CPU MMIO transactions), and it will be fixed in new revisions of
> hardware (add timeout mechanism for CPU read request, whether or not Bus
> Master bit is cleared).
> 
> On some x86 platforms, radeon/amdgpu devices can cause similar problems
> [1][2]. Once before I wanted to make a single patch to solve "all of
> these problems" together, but it seems unreasonable because maybe they
> are not exactly the same problem.

I don't know what any of these problems are.  Neither one of these bug
reports has a root cause analysis, and it's not obvious how they're
connected to this patch.

> So, this patch add a new function
> pcie_portdrv_shutdown(), a slight modified copy of pcie_portdrv_remove()
> dedicated for the shutdown path, and then add a quirk just for LS7A to
> avoid clearing Bus Master bit in pcie_portdrv_shutdown(). Leave other
> platforms behave as before.

Nit: don't break function names across lines ("do_pci_disable_device()").

> [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> 
> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> ---
>  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
>  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
>  include/linux/pci.h                   |  1 +
>  3 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> index 759ec211c17b..641308ba4126 100644
> --- a/drivers/pci/controller/pci-loongson.c
> +++ b/drivers/pci/controller/pci-loongson.c
> @@ -93,6 +93,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>  			DEV_PCIE_PORT_2, loongson_mrrs_quirk);
>  
> +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> +{
> +	/*
> +	 * Some Loongson PCIe ports will cause CPU deadlock if disable
> +	 * the Bus Master bit during poweroff/reboot.

This is not actually true, as far as I can see.

It's not turning off Bus Master that causes the problem; it's the MMIO
read to a downstream device when the Root Port has bus mastering
disabled that causes the problem.

> +	 */
> +	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> +
> +	bridge->no_dis_bmaster = 1;
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> +
>  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
>  {
>  	pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 2cc2e60bcb39..96f45c444422 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
>  {
>  	device_for_each_child(&dev->dev, NULL, remove_iter);
>  	pci_free_irq_vectors(dev);
> -	pci_disable_device(dev);
>  }
>  
>  /**
> @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
>  	}
>  
>  	pcie_port_device_remove(dev);
> +
> +	pci_disable_device(dev);
> +}
> +
> +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> +{
> +	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> +
> +	if (pci_bridge_d3_possible(dev)) {
> +		pm_runtime_forbid(&dev->dev);
> +		pm_runtime_get_noresume(&dev->dev);
> +		pm_runtime_dont_use_autosuspend(&dev->dev);
> +	}
> +
> +	pcie_port_device_remove(dev);
> +
> +	if (!bridge->no_dis_bmaster)
> +		pci_disable_device(dev);
>  }
>  
>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
>  
>  	.probe		= pcie_portdrv_probe,
>  	.remove		= pcie_portdrv_remove,
> -	.shutdown	= pcie_portdrv_remove,
> +	.shutdown	= pcie_portdrv_shutdown,
>  
>  	.err_handler	= &pcie_portdrv_err_handler,
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 3df2049ec4a8..a64dbcb89231 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -573,6 +573,7 @@ struct pci_host_bridge {
>  	unsigned int	ignore_reset_delay:1;	/* For entire hierarchy */
>  	unsigned int	no_ext_tags:1;		/* No Extended Tags */
>  	unsigned int	no_inc_mrrs:1;		/* No Increase MRRS */
> +	unsigned int	no_dis_bmaster:1;	/* No Disable Bus Master */
>  	unsigned int	native_aer:1;		/* OS may use PCIe AER */
>  	unsigned int	native_pcie_hotplug:1;	/* OS may use PCIe hotplug */
>  	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
> -- 
> 2.31.1
> 


* Re: [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-04 18:37   ` Bjorn Helgaas
@ 2023-01-05  2:49     ` Huacai Chen
  2023-01-05  4:01       ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Huacai Chen @ 2023-01-05  2:49 UTC
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang

Hi, Bjorn,

On Thu, Jan 5, 2023 at 2:37 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Jan 03, 2023 at 03:34:01PM +0800, Huacai Chen wrote:
> > cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during shutdown")
> > causes poweroff/reboot failure on systems with LS7A chipset. We found
> > that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in do_pci_disable
> > _device(), it can work well. The hardware engineer says that the root
> > cause is that CPU is still accessing PCIe devices while poweroff/reboot,
>
> Did you ever figure out what these CPU accesses are?  If we call the
> Root Port .shutdown() method, and later access a downstream device,
> that seems like a problem in itself.  At least, we should understand
> exactly *why* we access that downstream device.
Maybe I missed the key point, but from my point of view, the root cause
is clear from previous discussions:
https://lore.kernel.org/linux-pci/CAAhV-H5uT+wDRkVbW_o1hG2u0rtv6FFABTymL1VdjMMD_UEN+Q@mail.gmail.com/
https://lore.kernel.org/linux-pci/20220617113708.GA1177054@bhelgaas/
https://lore.kernel.org/linux-pci/CAAhV-H6raQnXZ4ZZRq19cugew26wXYONctcFO0392gZ00LC6bw@mail.gmail.com/

>
> To be clear, cc27b735ad3a does not cause the failure.  IIUC, the cause
> is:
cc27b735ad3a is not a bug; we refer to it only because we observed the
problems after it.

>
>   - CPU issues MMIO read to device below Root Port
>
>   - LS7A Root Port fails to forward transaction to secondary bus
>     because of LS7A Bus Master defect
>
>   - CPU hangs waiting for response to MMIO read
>
> > and if we disable the Bus Master Bit at this time, the PCIe controller
> > doesn't forward requests to downstream devices, and also does not send
> > TIMEOUT to CPU, which causes CPU wait forever (hardware deadlock). This
> > behavior is a PCIe protocol violation (Bus Master should not be involved
> > in CPU MMIO transactions), and it will be fixed in new revisions of
> > hardware (add timeout mechanism for CPU read request, whether or not Bus
> > Master bit is cleared).
> >
> > On some x86 platforms, radeon/amdgpu devices can cause similar problems
> > [1][2]. Once before I wanted to make a single patch to solve "all of
> > these problems" together, but it seems unreasonable because maybe they
> > are not exactly the same problem.
>
> I don't know what any of these problems are.  Neither one of these bug
> reports has a root cause analysis, and it's not obvious how they're
> connected to this patch.
radeon/amdgpu devices cause problems, and this patch can solve them.
But it is very hard to determine whether they are exactly the same
problem.

>
> > So, this patch add a new function
> > pcie_portdrv_shutdown(), a slight modified copy of pcie_portdrv_remove()
> > dedicated for the shutdown path, and then add a quirk just for LS7A to
> > avoid clearing Bus Master bit in pcie_portdrv_shutdown(). Leave other
> > platforms behave as before.
>
> Nit: don't break function names across lines ("do_pci_disable_device()").
OK, I will improve it.

>
> > [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> > [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> >
> > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > ---
> >  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
> >  drivers/pci/pcie/portdrv.c            | 21 +++++++++++++++++++--
> >  include/linux/pci.h                   |  1 +
> >  3 files changed, 37 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> > index 759ec211c17b..641308ba4126 100644
> > --- a/drivers/pci/controller/pci-loongson.c
> > +++ b/drivers/pci/controller/pci-loongson.c
> > @@ -93,6 +93,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> >                       DEV_PCIE_PORT_2, loongson_mrrs_quirk);
> >
> > +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> > +{
> > +     /*
> > +      * Some Loongson PCIe ports will cause CPU deadlock if disable
> > +      * the Bus Master bit during poweroff/reboot.
>
> This is not actually true, as far as I can see.
>
> It's not turning off Bus Master that causes the problem; it's the MMIO
> read to a downstream device when the Root Port has bus mastering
> disabled that causes the problem.
OK, I will improve the comments.

Huacai
>
> > +      */
> > +     struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> > +
> > +     bridge->no_dis_bmaster = 1;
> > +}
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> > +
> >  static void loongson_pci_pin_quirk(struct pci_dev *pdev)
> >  {
> >       pdev->pin = 1 + (PCI_FUNC(pdev->devfn) & 3);
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 2cc2e60bcb39..96f45c444422 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -501,7 +501,6 @@ static void pcie_port_device_remove(struct pci_dev *dev)
> >  {
> >       device_for_each_child(&dev->dev, NULL, remove_iter);
> >       pci_free_irq_vectors(dev);
> > -     pci_disable_device(dev);
> >  }
> >
> >  /**
> > @@ -727,6 +726,24 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
> >       }
> >
> >       pcie_port_device_remove(dev);
> > +
> > +     pci_disable_device(dev);
> > +}
> > +
> > +static void pcie_portdrv_shutdown(struct pci_dev *dev)
> > +{
> > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > +
> > +     if (pci_bridge_d3_possible(dev)) {
> > +             pm_runtime_forbid(&dev->dev);
> > +             pm_runtime_get_noresume(&dev->dev);
> > +             pm_runtime_dont_use_autosuspend(&dev->dev);
> > +     }
> > +
> > +     pcie_port_device_remove(dev);
> > +
> > +     if (!bridge->no_dis_bmaster)
> > +             pci_disable_device(dev);
> >  }
> >
> >  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> > @@ -777,7 +794,7 @@ static struct pci_driver pcie_portdriver = {
> >
> >       .probe          = pcie_portdrv_probe,
> >       .remove         = pcie_portdrv_remove,
> > -     .shutdown       = pcie_portdrv_remove,
> > +     .shutdown       = pcie_portdrv_shutdown,
> >
> >       .err_handler    = &pcie_portdrv_err_handler,
> >
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 3df2049ec4a8..a64dbcb89231 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -573,6 +573,7 @@ struct pci_host_bridge {
> >       unsigned int    ignore_reset_delay:1;   /* For entire hierarchy */
> >       unsigned int    no_ext_tags:1;          /* No Extended Tags */
> >       unsigned int    no_inc_mrrs:1;          /* No Increase MRRS */
> > +     unsigned int    no_dis_bmaster:1;       /* No Disable Bus Master */
> >       unsigned int    native_aer:1;           /* OS may use PCIe AER */
> >       unsigned int    native_pcie_hotplug:1;  /* OS may use PCIe hotplug */
> >       unsigned int    native_shpc_hotplug:1;  /* OS may use SHPC hotplug */
> > --
> > 2.31.1
> >


* Re: [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-05  2:49     ` Huacai Chen
@ 2023-01-05  4:01       ` Bjorn Helgaas
  2023-01-06  7:13         ` Huacai Chen
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2023-01-05  4:01 UTC
  To: Huacai Chen
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang

On Thu, Jan 05, 2023 at 10:49:53AM +0800, Huacai Chen wrote:
> On Thu, Jan 5, 2023 at 2:37 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, Jan 03, 2023 at 03:34:01PM +0800, Huacai Chen wrote:
> > > cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during shutdown")
> > > causes poweroff/reboot failure on systems with LS7A chipset. We found
> > > that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in do_pci_disable
> > > _device(), it can work well. The hardware engineer says that the root
> > > cause is that CPU is still accessing PCIe devices while poweroff/reboot,
> >
> > Did you ever figure out what these CPU accesses are?  If we call the
> > Root Port .shutdown() method, and later access a downstream device,
> > that seems like a problem in itself.  At least, we should understand
> > exactly *why* we access that downstream device.
>
> Maybe I failed to get the key point, but from my point of view, the
> root cause is clear in previous discussions:
> https://lore.kernel.org/linux-pci/CAAhV-H5uT+wDRkVbW_o1hG2u0rtv6FFABTymL1VdjMMD_UEN+Q@mail.gmail.com/
> https://lore.kernel.org/linux-pci/20220617113708.GA1177054@bhelgaas/
> https://lore.kernel.org/linux-pci/CAAhV-H6raQnXZ4ZZRq19cugew26wXYONctcFO0392gZ00LC6bw@mail.gmail.com/

That's great, but the root cause should be summarized here in the
commit log.

> > To be clear, cc27b735ad3a does not cause the failure.  IIUC, the cause
> > is:
>
> cc27b735ad3a is not a bug, we refer to it just because we observe
> problems after it.

Right.  But you said "cc27b735ad3a ... causes failure," which is not
quite true.  cc27b735ad3a may *expose* an LS7A hardware defect that
previously didn't cause a problem, but I don't want to blame
cc27b735ad3a for that hardware issue.

Bjorn


* Re: [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure
  2023-01-05  4:01       ` Bjorn Helgaas
@ 2023-01-06  7:13         ` Huacai Chen
  0 siblings, 0 replies; 7+ messages in thread
From: Huacai Chen @ 2023-01-06  7:13 UTC
  To: Bjorn Helgaas
  Cc: Huacai Chen, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, linux-pci, Jianmin Lv, Xuefeng Li,
	Jiaxun Yang

On Thu, Jan 5, 2023 at 12:01 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Jan 05, 2023 at 10:49:53AM +0800, Huacai Chen wrote:
> > On Thu, Jan 5, 2023 at 2:37 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Tue, Jan 03, 2023 at 03:34:01PM +0800, Huacai Chen wrote:
> > > > cc27b735ad3a7557 ("PCI/portdrv: Turn off PCIe services during shutdown")
> > > > causes poweroff/reboot failure on systems with LS7A chipset. We found
> > > > that if we remove "pci_command &= ~PCI_COMMAND_MASTER" in do_pci_disable
> > > > _device(), it can work well. The hardware engineer says that the root
> > > > cause is that CPU is still accessing PCIe devices while poweroff/reboot,
> > >
> > > Did you ever figure out what these CPU accesses are?  If we call the
> > > Root Port .shutdown() method, and later access a downstream device,
> > > that seems like a problem in itself.  At least, we should understand
> > > exactly *why* we access that downstream device.
> >
> > Maybe I failed to get the key point, but from my point of view, the
> > root cause is clear in previous discussions:
> > https://lore.kernel.org/linux-pci/CAAhV-H5uT+wDRkVbW_o1hG2u0rtv6FFABTymL1VdjMMD_UEN+Q@mail.gmail.com/
> > https://lore.kernel.org/linux-pci/20220617113708.GA1177054@bhelgaas/
> > https://lore.kernel.org/linux-pci/CAAhV-H6raQnXZ4ZZRq19cugew26wXYONctcFO0392gZ00LC6bw@mail.gmail.com/
>
> That's great, but the root cause should be summarized here in the
> commit log.
OK, I will update the commit log.

>
> > > To be clear, cc27b735ad3a does not cause the failure.  IIUC, the cause
> > > is:
> >
> > cc27b735ad3a is not a bug, we refer to it just because we observe
> > problems after it.
>
> Right.  But you said "cc27b735ad3a ... causes failure," which is not
> quite true.  cc27b735ad3a may *expose* an LS7A hardware defect that
> previously didn't cause a problem, but I don't want to blame
> cc27b735ad3a for that hardware issue.
OK, got it.

Huacai
>
> Bjorn


end of thread, other threads:[~2023-01-06  7:14 UTC | newest]

Thread overview: 7+ messages
2023-01-03  7:33 [PATCH 0/2] PCI: Add two Loongson's LS7A quirks Huacai Chen
2023-01-03  7:34 ` [PATCH 1/2] PCI: loongson: Improve the MRRS quirk for LS7A Huacai Chen
2023-01-03  7:34 ` [PATCH 2/2] PCI: Add quirk for LS7A to avoid reboot failure Huacai Chen
2023-01-04 18:37   ` Bjorn Helgaas
2023-01-05  2:49     ` Huacai Chen
2023-01-05  4:01       ` Bjorn Helgaas
2023-01-06  7:13         ` Huacai Chen
