[RFC] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel
diff mbox series

Message ID 20191225192118.283637-1-kasong@redhat.com
State New, archived
Headers show
Series
  • [RFC] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel
Related show

Commit Message

Kairui Song Dec. 25, 2019, 7:21 p.m. UTC
There are reports about kdump hang upon reboot on some HPE machines,
kernel hanged when trying to shutdown a PCIe port, an uncorrectable
error occurred and crashed the system.

On the machine I can reproduce this issue, part of the topology
looks like this:

[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
          +-01.0-[02]--
          +-01.1-[05]--
          +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
          |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
          +-02.1-[0f]--
          +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers

When shuting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
will hang, depend on which device is reinitialized in kdump kernel.

If force remove unused device then trigger kdump, the problem will never
happen:

    echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
    echo c > /proc/sysrq-trigger

    ... Kdump save vmcore through network, the NIC get reinitialized and
    hpsa is untouched. Then reboot with no problem. (If hpsa is used
    instead, shutdown the NIC in first kernel will help)

The cause is that some devices are enabled by the first kernel, but it
don't have the chance to shutdown the device, and kdump kernel is not
aware of it, unless it reinitialize the device.

Upon reboot, kdump kernel will skip downstream device shutdown and
clears its bridge's master bit directly. The downstream device could
error out as it can still send requests but upstream refuses it.

So for kdump, let kernel read the correct hardware power state on boot,
and always clear the bus master bit of PCI device upon shutdown if the
device is on. PCIe port driver will always shutdown all downstream
devices first, so this should ensure all downstream devices have bus
master bit off before clearing the bridge's bus master bit.

Signed-off-by: Kairui Song <kasong@redhat.com>
---
 drivers/pci/pci-driver.c | 11 ++++++++---
 drivers/pci/quirks.c     | 20 ++++++++++++++++++++
 2 files changed, 28 insertions(+), 3 deletions(-)

Comments

Kairui Song Jan. 3, 2020, 7:58 a.m. UTC | #1
On Thu, Dec 26, 2019 at 3:21 AM Kairui Song <kasong@redhat.com> wrote:
>
> There are reports about kdump hang upon reboot on some HPE machines,
> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> error occurred and crashed the system.
>
> On the machine I can reproduce this issue, part of the topology
> looks like this:
>
> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>           +-01.0-[02]--
>           +-01.1-[05]--
>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
>           +-02.1-[0f]--
>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
>
> When shuting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> will hang, depend on which device is reinitialized in kdump kernel.
>
> If force remove unused device then trigger kdump, the problem will never
> happen:
>
>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
>     echo c > /proc/sysrq-trigger
>
>     ... Kdump save vmcore through network, the NIC get reinitialized and
>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
>     instead, shutdown the NIC in first kernel will help)
>
> The cause is that some devices are enabled by the first kernel, but it
> don't have the chance to shutdown the device, and kdump kernel is not
> aware of it, unless it reinitialize the device.
>
> Upon reboot, kdump kernel will skip downstream device shutdown and
> clears its bridge's master bit directly. The downstream device could
> error out as it can still send requests but upstream refuses it.
>
> So for kdump, let kernel read the correct hardware power state on boot,
> and always clear the bus master bit of PCI device upon shutdown if the
> device is on. PCIe port driver will always shutdown all downstream
> devices first, so this should ensure all downstream devices have bus
> master bit off before clearing the bridge's bus master bit.
>
> Signed-off-by: Kairui Song <kasong@redhat.com>
> ---
>  drivers/pci/pci-driver.c | 11 ++++++++---
>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
>  2 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 0454ca0e4e3f..84a7fd643b4d 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -18,6 +18,7 @@
>  #include <linux/kexec.h>
>  #include <linux/of_device.h>
>  #include <linux/acpi.h>
> +#include <linux/crash_dump.h>
>  #include "pci.h"
>  #include "pcie/portdrv.h"
>
> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
>          * If this is a kexec reboot, turn off Bus Master bit on the
>          * device to tell it to not continue to do DMA. Don't touch
>          * devices in D3cold or unknown states.
> -        * If it is not a kexec reboot, firmware will hit the PCI
> -        * devices with big hammer and stop their DMA any way.
> +        * If this is kdump kernel, also turn off Bus Master, the device
> +        * could be activated by previous crashed kernel and may block
> +        * it's upstream from shutting down.
> +        * Else, firmware will hit the PCI devices with big hammer
> +        * and stop their DMA any way.
>          */
> -       if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> +       if ((kexec_in_progress || is_kdump_kernel()) &&
> +                       pci_dev->current_state <= PCI_D3hot)
>                 pci_clear_master(pci_dev);
>  }
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4937a088d7d8..c65d11ab3939 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -28,6 +28,7 @@
>  #include <linux/platform_data/x86/apple.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/switchtec.h>
> +#include <linux/crash_dump.h>
>  #include <asm/dma.h>   /* isa_dma_bridge_buggy */
>  #include "pci.h"
>
> @@ -192,6 +193,25 @@ static int __init pci_apply_final_quirks(void)
>  }
>  fs_initcall_sync(pci_apply_final_quirks);
>
> +/*
> + * Read the device state even if it's not enabled. The device could be
> + * activated by previous crashed kernel, this will read and correct the
> + * cached state.
> + */
> +static void quirk_read_pm_state_in_kdump(struct pci_dev *dev)
> +{
> +       u16 pmcsr;
> +
> +       if (!is_kdump_kernel())
> +               return;
> +
> +       if (dev->pm_cap) {
> +               pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> +               dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
> +       }
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_read_pm_state_in_kdump);
> +
>  /*
>   * Decoding should be disabled for a PCI device during BAR sizing to avoid
>   * conflict. But doing so may cause problems on host bridge and perhaps other
> --
> 2.24.1
>

Let me make a simplified version of the commit message:

On some HPE machines, If kernel clears the bus master bit of a PCI
bridge, but some downstream device of it is still alive, the
downstream device could error out and crash the system.

Usually this never happen, as PCI devices are always shutdown before
their upstream bridge is shutdown. But in kdump kernel, the real
hardware state is out of sync with the data in kernel, as previous
kernel could enable some hardware and kdump kernel is unaware of it,
so kdump kernel just skipped the downstream device shutdown.

This patch is trying to fix this issue.
Bjorn Helgaas Jan. 10, 2020, 9:42 p.m. UTC | #2
[+cc Deepa (also working in this area)]

On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> There are reports about kdump hang upon reboot on some HPE machines,
> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> error occurred and crashed the system.

Details?  Do you have URLs for bug reports, dmesg logs, etc?

> On the machine I can reproduce this issue, part of the topology
> looks like this:
> 
> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>           +-01.0-[02]--
>           +-01.1-[05]--
>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
>           +-02.1-[0f]--
>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
> 
> When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> will hang, depend on which device is reinitialized in kdump kernel.
> 
> If force remove unused device then trigger kdump, the problem will never
> happen:
> 
>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
>     echo c > /proc/sysrq-trigger
> 
>     ... Kdump save vmcore through network, the NIC get reinitialized and
>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
>     instead, shutdown the NIC in first kernel will help)
> 
> The cause is that some devices are enabled by the first kernel, but it
> don't have the chance to shutdown the device, and kdump kernel is not
> aware of it, unless it reinitialize the device.
> 
> Upon reboot, kdump kernel will skip downstream device shutdown and
> clears its bridge's master bit directly. The downstream device could
> error out as it can still send requests but upstream refuses it.

Can you help me understand the sequence of events?  If I understand
correctly, the desired sequence is:

  - user kernel boots
  - user kernel panics and kexecs to kdump kernel
  - kdump kernel writes vmcore to network or disk
  - kdump kernel reboots
  - user kernel boots

But the problem is that as part of the kdump kernel reboot,

  - kdump kernel disables bus mastering for a Root Port
  - device below the Root Port attempts DMA
  - Root Port receives DMA transaction, handles it as Unsupported
    Request, sends UR Completion to device
  - device signals uncorrectable error
  - uncorrectable error causes a crash (Or a hang?  You mention both
    and I'm not sure which it is)

Is that right so far?

> So for kdump, let kernel read the correct hardware power state on boot,
> and always clear the bus master bit of PCI device upon shutdown if the
> device is on. PCIe port driver will always shutdown all downstream
> devices first, so this should ensure all downstream devices have bus
> master bit off before clearing the bridge's bus master bit.
> 
> Signed-off-by: Kairui Song <kasong@redhat.com>
> ---
>  drivers/pci/pci-driver.c | 11 ++++++++---
>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
>  2 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 0454ca0e4e3f..84a7fd643b4d 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -18,6 +18,7 @@
>  #include <linux/kexec.h>
>  #include <linux/of_device.h>
>  #include <linux/acpi.h>
> +#include <linux/crash_dump.h>
>  #include "pci.h"
>  #include "pcie/portdrv.h"
>  
> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
>  	 * If this is a kexec reboot, turn off Bus Master bit on the
>  	 * device to tell it to not continue to do DMA. Don't touch
>  	 * devices in D3cold or unknown states.
> -	 * If it is not a kexec reboot, firmware will hit the PCI
> -	 * devices with big hammer and stop their DMA any way.
> +	 * If this is kdump kernel, also turn off Bus Master, the device
> +	 * could be activated by previous crashed kernel and may block
> +	 * it's upstream from shutting down.
> +	 * Else, firmware will hit the PCI devices with big hammer
> +	 * and stop their DMA any way.
>  	 */
> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> +	if ((kexec_in_progress || is_kdump_kernel()) &&
> +			pci_dev->current_state <= PCI_D3hot)
>  		pci_clear_master(pci_dev);

I'm clearly missing something because this will turn off bus mastering
in cases where we previously left it enabled.

I was assuming the crash was related to a device doing DMA when the
Root Port had bus mastering disabled.  But that must be wrong.

I'd like to understand the crash/hang better because the quirk
especially is hard to connect to anything.  If the crash is because of
an AER or other PCIe error, maybe another possibility is that we could
handle it better or disable signaling of it or something.

>  }
>  
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4937a088d7d8..c65d11ab3939 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -28,6 +28,7 @@
>  #include <linux/platform_data/x86/apple.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/switchtec.h>
> +#include <linux/crash_dump.h>
>  #include <asm/dma.h>	/* isa_dma_bridge_buggy */
>  #include "pci.h"
>  
> @@ -192,6 +193,25 @@ static int __init pci_apply_final_quirks(void)
>  }
>  fs_initcall_sync(pci_apply_final_quirks);
>  
> +/*
> + * Read the device state even if it's not enabled. The device could be
> + * activated by previous crashed kernel, this will read and correct the
> + * cached state.
> + */
> +static void quirk_read_pm_state_in_kdump(struct pci_dev *dev)
> +{
> +	u16 pmcsr;
> +
> +	if (!is_kdump_kernel())
> +		return;
> +
> +	if (dev->pm_cap) {
> +		pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> +		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
> +	}
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_read_pm_state_in_kdump);
> +
>  /*
>   * Decoding should be disabled for a PCI device during BAR sizing to avoid
>   * conflict. But doing so may cause problems on host bridge and perhaps other
> -- 
> 2.24.1
>
Khalid Aziz and Shuah Khan Jan. 10, 2020, 10:25 p.m. UTC | #3
On 1/10/20 2:42 PM, Bjorn Helgaas wrote:
> [+cc Deepa (also working in this area)]
> 
> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
>> There are reports about kdump hang upon reboot on some HPE machines,
>> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
>> error occurred and crashed the system.
> 
> Details?  Do you have URLs for bug reports, dmesg logs, etc?
> 
>> On the machine I can reproduce this issue, part of the topology
>> looks like this:
>>
>> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>>           +-01.0-[02]--
>>           +-01.1-[05]--
>>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
>>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
>>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
>>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
>>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
>>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
>>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
>>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
>>           +-02.1-[0f]--
>>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
>>
>> When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
>> will hang, depend on which device is reinitialized in kdump kernel.
>>
>> If force remove unused device then trigger kdump, the problem will never
>> happen:
>>
>>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
>>     echo c > /proc/sysrq-trigger
>>
>>     ... Kdump save vmcore through network, the NIC get reinitialized and
>>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
>>     instead, shutdown the NIC in first kernel will help)
>>
>> The cause is that some devices are enabled by the first kernel, but it
>> don't have the chance to shutdown the device, and kdump kernel is not
>> aware of it, unless it reinitialize the device.
>>
>> Upon reboot, kdump kernel will skip downstream device shutdown and
>> clears its bridge's master bit directly. The downstream device could
>> error out as it can still send requests but upstream refuses it.
> 
> Can you help me understand the sequence of events?  If I understand
> correctly, the desired sequence is:
> 
>   - user kernel boots
>   - user kernel panics and kexecs to kdump kernel
>   - kdump kernel writes vmcore to network or disk
>   - kdump kernel reboots
>   - user kernel boots
> 
> But the problem is that as part of the kdump kernel reboot,
> 
>   - kdump kernel disables bus mastering for a Root Port
>   - device below the Root Port attempts DMA
>   - Root Port receives DMA transaction, handles it as Unsupported
>     Request, sends UR Completion to device
>   - device signals uncorrectable error
>   - uncorrectable error causes a crash (Or a hang?  You mention both
>     and I'm not sure which it is)
> 
> Is that right so far?
> 
>> So for kdump, let kernel read the correct hardware power state on boot,
>> and always clear the bus master bit of PCI device upon shutdown if the
>> device is on. PCIe port driver will always shutdown all downstream
>> devices first, so this should ensure all downstream devices have bus
>> master bit off before clearing the bridge's bus master bit.
>>
>> Signed-off-by: Kairui Song <kasong@redhat.com>
>> ---
>>  drivers/pci/pci-driver.c | 11 ++++++++---
>>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
>>  2 files changed, 28 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index 0454ca0e4e3f..84a7fd643b4d 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -18,6 +18,7 @@
>>  #include <linux/kexec.h>
>>  #include <linux/of_device.h>
>>  #include <linux/acpi.h>
>> +#include <linux/crash_dump.h>
>>  #include "pci.h"
>>  #include "pcie/portdrv.h"
>>  
>> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
>>  	 * If this is a kexec reboot, turn off Bus Master bit on the
>>  	 * device to tell it to not continue to do DMA. Don't touch
>>  	 * devices in D3cold or unknown states.
>> -	 * If it is not a kexec reboot, firmware will hit the PCI
>> -	 * devices with big hammer and stop their DMA any way.
>> +	 * If this is kdump kernel, also turn off Bus Master, the device
>> +	 * could be activated by previous crashed kernel and may block
>> +	 * it's upstream from shutting down.
>> +	 * Else, firmware will hit the PCI devices with big hammer
>> +	 * and stop their DMA any way.
>>  	 */
>> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
>> +	if ((kexec_in_progress || is_kdump_kernel()) &&
>> +			pci_dev->current_state <= PCI_D3hot)
>>  		pci_clear_master(pci_dev);
> 
> I'm clearly missing something because this will turn off bus mastering
> in cases where we previously left it enabled.
> 
> I was assuming the crash was related to a device doing DMA when the
> Root Port had bus mastering disabled.  But that must be wrong.
> 
> I'd like to understand the crash/hang better because the quirk
> especially is hard to connect to anything.  If the crash is because of
> an AER or other PCIe error, maybe another possibility is that we could
> handle it better or disable signaling of it or something.
> 

I am not understanding this failure mode either. That code in
pci_device_shutdown() was added originally to address this very issue.
The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
shut down any errant DMAs from PCI devices as we kexec a new kernel. In
this new patch, this is the same code path that will be taken again when
kdump kernel is shutting down. If the errant DMA problem was not fixed
by clearing Bus Master bit in this path when kdump kernel was being
kexec'd, why does the same code path work the second time around when
kdump kernel is shutting down? Is there more going on that we don't
understand?

--
Khalid
Jerry Hoemann Jan. 10, 2020, 11 p.m. UTC | #4
On Fri, Jan 10, 2020 at 03:25:36PM -0700, Khalid Aziz and Shuah Khan wrote:
> On 1/10/20 2:42 PM, Bjorn Helgaas wrote:
> > [+cc Deepa (also working in this area)]
> > 
> > On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> >> There are reports about kdump hang upon reboot on some HPE machines,
> >> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> >> error occurred and crashed the system.
> > 
> > Details?  Do you have URLs for bug reports, dmesg logs, etc?
> > 
> >> On the machine I can reproduce this issue, part of the topology
> >> looks like this:
> >>
> >> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
> >>           +-01.0-[02]--
> >>           +-01.1-[05]--
> >>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
> >>           +-02.1-[0f]--
> >>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
> >>
> >> When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> >> will hang, depend on which device is reinitialized in kdump kernel.
> >>
> >> If force remove unused device then trigger kdump, the problem will never
> >> happen:
> >>
> >>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
> >>     echo c > /proc/sysrq-trigger
> >>
> >>     ... Kdump save vmcore through network, the NIC get reinitialized and
> >>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
> >>     instead, shutdown the NIC in first kernel will help)
> >>
> >> The cause is that some devices are enabled by the first kernel, but it
> >> don't have the chance to shutdown the device, and kdump kernel is not
> >> aware of it, unless it reinitialize the device.
> >>
> >> Upon reboot, kdump kernel will skip downstream device shutdown and
> >> clears its bridge's master bit directly. The downstream device could
> >> error out as it can still send requests but upstream refuses it.
> > 
> > Can you help me understand the sequence of events?  If I understand
> > correctly, the desired sequence is:
> > 
> >   - user kernel boots
> >   - user kernel panics and kexecs to kdump kernel
> >   - kdump kernel writes vmcore to network or disk
> >   - kdump kernel reboots
> >   - user kernel boots
> > 
> > But the problem is that as part of the kdump kernel reboot,
> > 
> >   - kdump kernel disables bus mastering for a Root Port
> >   - device below the Root Port attempts DMA
> >   - Root Port receives DMA transaction, handles it as Unsupported
> >     Request, sends UR Completion to device
> >   - device signals uncorrectable error
> >   - uncorrectable error causes a crash (Or a hang?  You mention both
> >     and I'm not sure which it is)
> > 
> > Is that right so far?
> > 
> >> So for kdump, let kernel read the correct hardware power state on boot,
> >> and always clear the bus master bit of PCI device upon shutdown if the
> >> device is on. PCIe port driver will always shutdown all downstream
> >> devices first, so this should ensure all downstream devices have bus
> >> master bit off before clearing the bridge's bus master bit.
> >>
> >> Signed-off-by: Kairui Song <kasong@redhat.com>
> >> ---
> >>  drivers/pci/pci-driver.c | 11 ++++++++---
> >>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
> >>  2 files changed, 28 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> >> index 0454ca0e4e3f..84a7fd643b4d 100644
> >> --- a/drivers/pci/pci-driver.c
> >> +++ b/drivers/pci/pci-driver.c
> >> @@ -18,6 +18,7 @@
> >>  #include <linux/kexec.h>
> >>  #include <linux/of_device.h>
> >>  #include <linux/acpi.h>
> >> +#include <linux/crash_dump.h>
> >>  #include "pci.h"
> >>  #include "pcie/portdrv.h"
> >>  
> >> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
> >>  	 * If this is a kexec reboot, turn off Bus Master bit on the
> >>  	 * device to tell it to not continue to do DMA. Don't touch
> >>  	 * devices in D3cold or unknown states.
> >> -	 * If it is not a kexec reboot, firmware will hit the PCI
> >> -	 * devices with big hammer and stop their DMA any way.
> >> +	 * If this is kdump kernel, also turn off Bus Master, the device
> >> +	 * could be activated by previous crashed kernel and may block
> >> +	 * it's upstream from shutting down.
> >> +	 * Else, firmware will hit the PCI devices with big hammer
> >> +	 * and stop their DMA any way.
> >>  	 */
> >> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> >> +	if ((kexec_in_progress || is_kdump_kernel()) &&
> >> +			pci_dev->current_state <= PCI_D3hot)
> >>  		pci_clear_master(pci_dev);
> > 
> > I'm clearly missing something because this will turn off bus mastering
> > in cases where we previously left it enabled.
> > 
> > I was assuming the crash was related to a device doing DMA when the
> > Root Port had bus mastering disabled.  But that must be wrong.
> > 
> > I'd like to understand the crash/hang better because the quirk
> > especially is hard to connect to anything.  If the crash is because of
> > an AER or other PCIe error, maybe another possibility is that we could
> > handle it better or disable signaling of it or something.
> > 
> 
> I am not understanding this failure mode either. That code in
> pci_device_shutdown() was added originally to address this very issue.
> The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> this new patch, this is the same code path that will be taken again when
> kdump kernel is shutting down. If the errant DMA problem was not fixed
> by clearing Bus Master bit in this path when kdump kernel was being
> kexec'd, why does the same code path work the second time around when
> kdump kernel is shutting down? Is there more going on that we don't
> understand?
> 

  Khalid,

  I don't believe we execute that code path in the crash case.

  The variable kexec_in_progress is set true in kernel_kexec() before calling
  machine_kexec().  This is the fast reboot case.

  I don't see kexec_in_progress set true elsewhere.


  The code path for crash is different.

  For instance, panic() will call
	-> __crash_kexec()  which calls
		-> machine_kexec().

 So the setting of kexec_in_progress is bypassed.

 Jerry
Jerry Hoemann Jan. 10, 2020, 11:36 p.m. UTC | #5
On Fri, Jan 10, 2020 at 03:42:17PM -0600, Bjorn Helgaas wrote:
> [+cc Deepa (also working in this area)]
> 
> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> > There are reports about kdump hang upon reboot on some HPE machines,
> > kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> > error occurred and crashed the system.
> 
> Details?  Do you have URLs for bug reports, dmesg logs, etc?

Hi, Bjorn,

Not sure if you have access to Red Hat Bugzilla, but I filed:

	https://bugzilla.redhat.com/show_bug.cgi?id=1774802

When I hit this issue.



> 
> > On the machine I can reproduce this issue, part of the topology
> > looks like this:
> > 
> > [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
> >           +-01.0-[02]--
> >           +-01.1-[05]--
> >           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
> >           +-02.1-[0f]--
> >           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
> > 
> > When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> > will hang, depend on which device is reinitialized in kdump kernel.
> > 
> > If force remove unused device then trigger kdump, the problem will never
> > happen:
> > 
> >     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
> >     echo c > /proc/sysrq-trigger
> > 
> >     ... Kdump save vmcore through network, the NIC get reinitialized and
> >     hpsa is untouched. Then reboot with no problem. (If hpsa is used
> >     instead, shutdown the NIC in first kernel will help)
> > 
> > The cause is that some devices are enabled by the first kernel, but it
> > don't have the chance to shutdown the device, and kdump kernel is not
> > aware of it, unless it reinitialize the device.
> > 
> > Upon reboot, kdump kernel will skip downstream device shutdown and
> > clears its bridge's master bit directly. The downstream device could
> > error out as it can still send requests but upstream refuses it.
> 
> Can you help me understand the sequence of events?  If I understand
> correctly, the desired sequence is:
> 
>   - user kernel boots
>   - user kernel panics and kexecs to kdump kernel
>   - kdump kernel writes vmcore to network or disk

Some context:

The problem for me hits itermittently during shutdown of the kdump kernel.
During this time, the SUT sometimes gets a PCI error that raises an NMI.

The reaction to the NMI that the kdump kernel takes is problematic.
Sometimes the system prints the tombstones and resets through firmware
without problems.  Other times it takes a second NMI and hangs.

I'll note that the kdump initrd doesn't contain the NIC drivers.  When
these are added, we don't see the issue.


Jerry
Khalid Aziz Jan. 11, 2020, 12:18 a.m. UTC | #6
On 1/10/20 4:00 PM, Jerry Hoemann wrote:
> On Fri, Jan 10, 2020 at 03:25:36PM -0700, Khalid Aziz and Shuah Khan wrote:
>> On 1/10/20 2:42 PM, Bjorn Helgaas wrote:
>>> [+cc Deepa (also working in this area)]
>>>
>>> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
>>>> There are reports about kdump hang upon reboot on some HPE machines,
>>>> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
>>>> error occurred and crashed the system.
>>>
>>> Details?  Do you have URLs for bug reports, dmesg logs, etc?
>>>
>>>> On the machine I can reproduce this issue, part of the topology
>>>> looks like this:
>>>>
>>>> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>>>>           +-01.0-[02]--
>>>>           +-01.1-[05]--
>>>>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>           +-02.1-[0f]--
>>>>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
>>>>
>>>> When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
>>>> will hang, depend on which device is reinitialized in kdump kernel.
>>>>
>>>> If force remove unused device then trigger kdump, the problem will never
>>>> happen:
>>>>
>>>>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
>>>>     echo c > /proc/sysrq-trigger
>>>>
>>>>     ... Kdump save vmcore through network, the NIC get reinitialized and
>>>>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
>>>>     instead, shutdown the NIC in first kernel will help)
>>>>
>>>> The cause is that some devices are enabled by the first kernel, but it
>>>> don't have the chance to shutdown the device, and kdump kernel is not
>>>> aware of it, unless it reinitialize the device.
>>>>
>>>> Upon reboot, kdump kernel will skip downstream device shutdown and
>>>> clears its bridge's master bit directly. The downstream device could
>>>> error out as it can still send requests but upstream refuses it.
>>>
>>> Can you help me understand the sequence of events?  If I understand
>>> correctly, the desired sequence is:
>>>
>>>   - user kernel boots
>>>   - user kernel panics and kexecs to kdump kernel
>>>   - kdump kernel writes vmcore to network or disk
>>>   - kdump kernel reboots
>>>   - user kernel boots
>>>
>>> But the problem is that as part of the kdump kernel reboot,
>>>
>>>   - kdump kernel disables bus mastering for a Root Port
>>>   - device below the Root Port attempts DMA
>>>   - Root Port receives DMA transaction, handles it as Unsupported
>>>     Request, sends UR Completion to device
>>>   - device signals uncorrectable error
>>>   - uncorrectable error causes a crash (Or a hang?  You mention both
>>>     and I'm not sure which it is)
>>>
>>> Is that right so far?
>>>
>>>> So for kdump, let kernel read the correct hardware power state on boot,
>>>> and always clear the bus master bit of PCI device upon shutdown if the
>>>> device is on. PCIe port driver will always shutdown all downstream
>>>> devices first, so this should ensure all downstream devices have bus
>>>> master bit off before clearing the bridge's bus master bit.
>>>>
>>>> Signed-off-by: Kairui Song <kasong@redhat.com>
>>>> ---
>>>>  drivers/pci/pci-driver.c | 11 ++++++++---
>>>>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
>>>>  2 files changed, 28 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>>>> index 0454ca0e4e3f..84a7fd643b4d 100644
>>>> --- a/drivers/pci/pci-driver.c
>>>> +++ b/drivers/pci/pci-driver.c
>>>> @@ -18,6 +18,7 @@
>>>>  #include <linux/kexec.h>
>>>>  #include <linux/of_device.h>
>>>>  #include <linux/acpi.h>
>>>> +#include <linux/crash_dump.h>
>>>>  #include "pci.h"
>>>>  #include "pcie/portdrv.h"
>>>>  
>>>> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
>>>>  	 * If this is a kexec reboot, turn off Bus Master bit on the
>>>>  	 * device to tell it to not continue to do DMA. Don't touch
>>>>  	 * devices in D3cold or unknown states.
>>>> -	 * If it is not a kexec reboot, firmware will hit the PCI
>>>> -	 * devices with big hammer and stop their DMA any way.
>>>> +	 * If this is kdump kernel, also turn off Bus Master, the device
>>>> +	 * could be activated by previous crashed kernel and may block
>>>> +	 * it's upstream from shutting down.
>>>> +	 * Else, firmware will hit the PCI devices with big hammer
>>>> +	 * and stop their DMA any way.
>>>>  	 */
>>>> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
>>>> +	if ((kexec_in_progress || is_kdump_kernel()) &&
>>>> +			pci_dev->current_state <= PCI_D3hot)
>>>>  		pci_clear_master(pci_dev);
>>>
>>> I'm clearly missing something because this will turn off bus mastering
>>> in cases where we previously left it enabled.
>>>
>>> I was assuming the crash was related to a device doing DMA when the
>>> Root Port had bus mastering disabled.  But that must be wrong.
>>>
>>> I'd like to understand the crash/hang better because the quirk
>>> especially is hard to connect to anything.  If the crash is because of
>>> an AER or other PCIe error, maybe another possibility is that we could
>>> handle it better or disable signaling of it or something.
>>>
>>
>> I am not understanding this failure mode either. That code in
>> pci_device_shutdown() was added originally to address this very issue.
>> The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
>> shut down any errant DMAs from PCI devices as we kexec a new kernel. In
>> this new patch, this is the same code path that will be taken again when
>> kdump kernel is shutting down. If the errant DMA problem was not fixed
>> by clearing Bus Master bit in this path when kdump kernel was being
>> kexec'd, why does the same code path work the second time around when
>> kdump kernel is shutting down? Is there more going on that we don't
>> understand?
>>
> 
>   Khalid,
> 
>   I don't believe we execute that code path in the crash case.
> 
>   The variable kexec_in_progress is set true in kernel_kexec() before calling
>   machine_kexec().  This is the fast reboot case.
> 
>   I don't see kexec_in_progress set true elsewhere.
> 
> 
>   The code path for crash is different.
> 
>   For instance, panic() will call
> 	-> __crash_kexec()  which calls
> 		-> machine_kexec().
> 
>  So the setting of kexec_in_progress is bypassed.
> 

True, but what that means is if it is an errant DMA causing the issue
you are seeing, that errant DMA can happen any time between when we
start booting kdump kernel and until kdump kernel starts shutting down.
Clearing Bus Master bit when kdump kernel is shutting down means kernel
stays vulnerable for significant amount of time. It might be just a
matter of time before errant DMA happens when crash dump is happening
and causes crash dump to be incomplete or hang. Does that make sense?

--
Khalid
Baoquan He Jan. 11, 2020, 12:45 a.m. UTC | #7
On 01/10/20 at 04:00pm, Jerry Hoemann wrote:
> > I am not understanding this failure mode either. That code in
> > pci_device_shutdown() was added originally to address this very issue.
> > The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> > shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> > this new patch, this is the same code path that will be taken again when
> > kdump kernel is shutting down. If the errant DMA problem was not fixed
> > by clearing Bus Master bit in this path when kdump kernel was being
> > kexec'd, why does the same code path work the second time around when
> > kdump kernel is shutting down? Is there more going on that we don't
> > understand?
> > 
> 
>   Khalid,
> 
>   I don't believe we execute that code path in the crash case.
> 
>   The variable kexec_in_progress is set true in kernel_kexec() before calling
>   machine_kexec().  This is the fast reboot case.
> 
>   I don't see kexec_in_progress set true elsewhere.
> 
> 
>   The code path for crash is different.
> 
>   For instance, panic() will call
> 	-> __crash_kexec()  which calls
> 		-> machine_kexec().
> 
>  So the setting of kexec_in_progress is bypassed.

Yeah, it's a differet behaviour than kexec case. I talked to Kairui, the
patch log may be not very clear. Below is summary I got from my
understanding about this issue:

~~~~~~~~~~~~~~~~~~~~~~~
Problem:

When crash is triggered, system jumps into kdump kernel to collect
vmcore and dump out. After dumping is finished, kdump kernel will try
ty reboot to normal kernel. This hang happened during kdump kernel
rebooting, when dumping is network dumping, e.g ssh/nfs, local storage
is HPSA.

Root cause:

When configuring network dumping, only network driver modules are added
into kdump initramfs. However, the storage HPSA pcie device is enabled
in 1st kernel, its status is PCI_D3hot. When crashed system jumps to kdump
kernel, we didn't shutdown any device for safety and efficiency. Then
during kdump kernel boot up, the pci scan will get hpsa device and only
initialize its status as pci_dev->current_state = PCI_UNKNOWN. This
pci_dev->current_state will be manipulated by the relevant device
driver. So HPSA device will never have chance to calibrate its status,
and can't be shut down by pci_device_shutdown() called by reboot
service. It's still PCI_D3hot, then crash happened when system try to
shutdown its upper bridge.

Fix:

Here, Kairui uses a quirk to get PM state and mask off value bigger than
PCI_D3cold. Means, all devices will get PM state 
pci_dev->current_state = PCI_D0 or PCI_D3hot. Finally, during kdump
reboot stage, this device can be shut down successfully by clearing its
master bit.

~~~~~~~~~~~~~~~

About this patch, I think the quirk getting active PM state for all devices
may be risky, it will impact normal kernel too which doesn't have this issue.

Wondering if there's any other way to fix or work around it.

Thanks
Baoquan
Baoquan He Jan. 11, 2020, 12:50 a.m. UTC | #8
On 01/10/20 at 05:18pm, Khalid Aziz wrote:
> On 1/10/20 4:00 PM, Jerry Hoemann wrote:
> > On Fri, Jan 10, 2020 at 03:25:36PM -0700, Khalid Aziz and Shuah Khan wrote:
> >> On 1/10/20 2:42 PM, Bjorn Helgaas wrote:
> >>> [+cc Deepa (also working in this area)]
> >>>
> >>> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> >>>> There are reports about kdump hang upon reboot on some HPE machines,
> >>>> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> >>>> error occurred and crashed the system.
> >>>
> >>> Details?  Do you have URLs for bug reports, dmesg logs, etc?
> >>>
> >>>> On the machine I can reproduce this issue, part of the topology
> >>>> looks like this:
> >>>>
> >>>> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
> >>>>           +-01.0-[02]--
> >>>>           +-01.1-[05]--
> >>>>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>           +-02.1-[0f]--
> >>>>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
> >>>>
> >>>> When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> >>>> will hang, depend on which device is reinitialized in kdump kernel.
> >>>>
> >>>> If force remove unused device then trigger kdump, the problem will never
> >>>> happen:
> >>>>
> >>>>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
> >>>>     echo c > /proc/sysrq-trigger
> >>>>
> >>>>     ... Kdump save vmcore through network, the NIC get reinitialized and
> >>>>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
> >>>>     instead, shutdown the NIC in first kernel will help)
> >>>>
> >>>> The cause is that some devices are enabled by the first kernel, but it
> >>>> don't have the chance to shutdown the device, and kdump kernel is not
> >>>> aware of it, unless it reinitialize the device.
> >>>>
> >>>> Upon reboot, kdump kernel will skip downstream device shutdown and
> >>>> clears its bridge's master bit directly. The downstream device could
> >>>> error out as it can still send requests but upstream refuses it.
> >>>
> >>> Can you help me understand the sequence of events?  If I understand
> >>> correctly, the desired sequence is:
> >>>
> >>>   - user kernel boots
> >>>   - user kernel panics and kexecs to kdump kernel
> >>>   - kdump kernel writes vmcore to network or disk
> >>>   - kdump kernel reboots
> >>>   - user kernel boots
> >>>
> >>> But the problem is that as part of the kdump kernel reboot,
> >>>
> >>>   - kdump kernel disables bus mastering for a Root Port
> >>>   - device below the Root Port attempts DMA
> >>>   - Root Port receives DMA transaction, handles it as Unsupported
> >>>     Request, sends UR Completion to device
> >>>   - device signals uncorrectable error
> >>>   - uncorrectable error causes a crash (Or a hang?  You mention both
> >>>     and I'm not sure which it is)
> >>>
> >>> Is that right so far?
> >>>
> >>>> So for kdump, let kernel read the correct hardware power state on boot,
> >>>> and always clear the bus master bit of PCI device upon shutdown if the
> >>>> device is on. PCIe port driver will always shutdown all downstream
> >>>> devices first, so this should ensure all downstream devices have bus
> >>>> master bit off before clearing the bridge's bus master bit.
> >>>>
> >>>> Signed-off-by: Kairui Song <kasong@redhat.com>
> >>>> ---
> >>>>  drivers/pci/pci-driver.c | 11 ++++++++---
> >>>>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
> >>>>  2 files changed, 28 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> >>>> index 0454ca0e4e3f..84a7fd643b4d 100644
> >>>> --- a/drivers/pci/pci-driver.c
> >>>> +++ b/drivers/pci/pci-driver.c
> >>>> @@ -18,6 +18,7 @@
> >>>>  #include <linux/kexec.h>
> >>>>  #include <linux/of_device.h>
> >>>>  #include <linux/acpi.h>
> >>>> +#include <linux/crash_dump.h>
> >>>>  #include "pci.h"
> >>>>  #include "pcie/portdrv.h"
> >>>>  
> >>>> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
> >>>>  	 * If this is a kexec reboot, turn off Bus Master bit on the
> >>>>  	 * device to tell it to not continue to do DMA. Don't touch
> >>>>  	 * devices in D3cold or unknown states.
> >>>> -	 * If it is not a kexec reboot, firmware will hit the PCI
> >>>> -	 * devices with big hammer and stop their DMA any way.
> >>>> +	 * If this is kdump kernel, also turn off Bus Master, the device
> >>>> +	 * could be activated by previous crashed kernel and may block
> >>>> +	 * it's upstream from shutting down.
> >>>> +	 * Else, firmware will hit the PCI devices with big hammer
> >>>> +	 * and stop their DMA any way.
> >>>>  	 */
> >>>> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> >>>> +	if ((kexec_in_progress || is_kdump_kernel()) &&
> >>>> +			pci_dev->current_state <= PCI_D3hot)
> >>>>  		pci_clear_master(pci_dev);
> >>>
> >>> I'm clearly missing something because this will turn off bus mastering
> >>> in cases where we previously left it enabled.
> >>>
> >>> I was assuming the crash was related to a device doing DMA when the
> >>> Root Port had bus mastering disabled.  But that must be wrong.
> >>>
> >>> I'd like to understand the crash/hang better because the quirk
> >>> especially is hard to connect to anything.  If the crash is because of
> >>> an AER or other PCIe error, maybe another possibility is that we could
> >>> handle it better or disable signaling of it or something.
> >>>
> >>
> >> I am not understanding this failure mode either. That code in
> >> pci_device_shutdown() was added originally to address this very issue.
> >> The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> >> shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> >> this new patch, this is the same code path that will be taken again when
> >> kdump kernel is shutting down. If the errant DMA problem was not fixed
> >> by clearing Bus Master bit in this path when kdump kernel was being
> >> kexec'd, why does the same code path work the second time around when
> >> kdump kernel is shutting down? Is there more going on that we don't
> >> understand?
> >>
> > 
> >   Khalid,
> > 
> >   I don't believe we execute that code path in the crash case.
> > 
> >   The variable kexec_in_progress is set true in kernel_kexec() before calling
> >   machine_kexec().  This is the fast reboot case.
> > 
> >   I don't see kexec_in_progress set true elsewhere.
> > 
> > 
> >   The code path for crash is different.
> > 
> >   For instance, panic() will call
> > 	-> __crash_kexec()  which calls
> > 		-> machine_kexec().
> > 
> >  So the setting of kexec_in_progress is bypassed.
> > 
> 
> True, but what that means is if it is an errant DMA causing the issue
> you are seeing, that errant DMA can happen any time between when we

Here, there could be misunderstanding. It's not an errant DMA, it's an
device which may be in DMA transporting state in normal kernel, but in
kdump kernel it's not manipulated by its driver because we don't use it
to dump, so exlucde its driver from kdump initramfs for saving space. 

> start booting kdump kernel and until kdump kernel starts shutting down.
> Clearing Bus Master bit when kdump kernel is shutting down means kernel
> stays vulnerable for significant amount of time. It might be just a
> matter of time before errant DMA happens when crash dump is happening
> and causes crash dump to be incomplete or hang. Does that make sense?
> 
> --
> Khalid
>
Baoquan He Jan. 11, 2020, 12:51 a.m. UTC | #9
On 01/11/20 at 08:45am, Baoquan He wrote:
> On 01/10/20 at 04:00pm, Jerry Hoemann wrote:
> > > I am not understanding this failure mode either. That code in
> > > pci_device_shutdown() was added originally to address this very issue.
> > > The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> > > shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> > > this new patch, this is the same code path that will be taken again when
> > > kdump kernel is shutting down. If the errant DMA problem was not fixed
> > > by clearing Bus Master bit in this path when kdump kernel was being
> > > kexec'd, why does the same code path work the second time around when
> > > kdump kernel is shutting down? Is there more going on that we don't
> > > understand?
> > > 
> > 
> >   Khalid,
> > 
> >   I don't believe we execute that code path in the crash case.
> > 
> >   The variable kexec_in_progress is set true in kernel_kexec() before calling
> >   machine_kexec().  This is the fast reboot case.
> > 
> >   I don't see kexec_in_progress set true elsewhere.
> > 
> > 
> >   The code path for crash is different.
> > 
> >   For instance, panic() will call
> > 	-> __crash_kexec()  which calls
> > 		-> machine_kexec().
> > 
> >  So the setting of kexec_in_progress is bypassed.
> 
> Yeah, it's a differet behaviour than kexec case. I talked to Kairui, the
> patch log may be not very clear. Below is summary I got from my
> understanding about this issue:
> 
> ~~~~~~~~~~~~~~~~~~~~~~~
> Problem:
> 
> When crash is triggered, system jumps into kdump kernel to collect
> vmcore and dump out. After dumping is finished, kdump kernel will try
> ty reboot to normal kernel. This hang happened during kdump kernel
> rebooting, when dumping is network dumping, e.g ssh/nfs, local storage
> is HPSA.
> 
> Root cause:
> 
> When configuring network dumping, only network driver modules are added
> into kdump initramfs. However, the storage HPSA pcie device is enabled
> in 1st kernel, its status is PCI_D3hot. When crashed system jumps to kdump
> kernel, we didn't shutdown any device for safety and efficiency. Then
> during kdump kernel boot up, the pci scan will get hpsa device and only
> initialize its status as pci_dev->current_state = PCI_UNKNOWN. This
> pci_dev->current_state will be manipulated by the relevant device
> driver. So HPSA device will never have chance to calibrate its status,
> and can't be shut down by pci_device_shutdown() called by reboot
> service. It's still PCI_D3hot, then crash happened when system try to
> shutdown its upper bridge.
> 
> Fix:
> 
> Here, Kairui uses a quirk to get PM state and mask off value bigger than
> PCI_D3cold. Means, all devices will get PM state 
  ~~~~~~~~~ s/PCI_D3cold/PCI_D3hot/, typo
> pci_dev->current_state = PCI_D0 or PCI_D3hot. Finally, during kdump
> reboot stage, this device can be shut down successfully by clearing its
> master bit.
> 
> ~~~~~~~~~~~~~~~
> 
> About this patch, I think the quirk getting active PM state for all devices
> may be risky, it will impact normal kernel too which doesn't have this issue.
> 
> Wondering if there's any other way to fix or work around it.
> 
> Thanks
> Baoquan
Baoquan He Jan. 11, 2020, 1:46 a.m. UTC | #10
On 01/11/20 at 08:45am, Baoquan He wrote:
> On 01/10/20 at 04:00pm, Jerry Hoemann wrote:
> > > I am not understanding this failure mode either. That code in
> > > pci_device_shutdown() was added originally to address this very issue.
> > > The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> > > shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> > > this new patch, this is the same code path that will be taken again when
> > > kdump kernel is shutting down. If the errant DMA problem was not fixed
> > > by clearing Bus Master bit in this path when kdump kernel was being
> > > kexec'd, why does the same code path work the second time around when
> > > kdump kernel is shutting down? Is there more going on that we don't
> > > understand?
> > > 
> > 
> >   Khalid,
> > 
> >   I don't believe we execute that code path in the crash case.
> > 
> >   The variable kexec_in_progress is set true in kernel_kexec() before calling
> >   machine_kexec().  This is the fast reboot case.
> > 
> >   I don't see kexec_in_progress set true elsewhere.
> > 
> > 
> >   The code path for crash is different.
> > 
> >   For instance, panic() will call
> > 	-> __crash_kexec()  which calls
> > 		-> machine_kexec().
> > 
> >  So the setting of kexec_in_progress is bypassed.
> 
> Yeah, it's a differet behaviour than kexec case. I talked to Kairui, the
> patch log may be not very clear. Below is summary I got from my
> understanding about this issue:
> 
> ~~~~~~~~~~~~~~~~~~~~~~~
> Problem:
> 
> When crash is triggered, system jumps into kdump kernel to collect
> vmcore and dump out. After dumping is finished, kdump kernel will try
> ty reboot to normal kernel. This hang happened during kdump kernel
> rebooting, when dumping is network dumping, e.g ssh/nfs, local storage
> is HPSA.
> 
> Root cause:
> 
> When configuring network dumping, only network driver modules are added
> into kdump initramfs. However, the storage HPSA pcie device is enabled
> in 1st kernel, its status is PCI_D3hot. When crashed system jumps to kdump
> kernel, we didn't shutdown any device for safety and efficiency. Then
> during kdump kernel boot up, the pci scan will get hpsa device and only
> initialize its status as pci_dev->current_state = PCI_UNKNOWN. This
> pci_dev->current_state will be manipulated by the relevant device
> driver. So HPSA device will never have chance to calibrate its status,
> and can't be shut down by pci_device_shutdown() called by reboot
> service. It's still PCI_D3hot, then crash happened when system try to
                                      ~~~~~ s/crash/hang/, sorry, typo again
> shutdown its upper bridge.
> 
> Fix:
> 
> Here, Kairui uses a quirk to get PM state and mask off value bigger than
> PCI_D3cold. Means, all devices will get PM state 
> pci_dev->current_state = PCI_D0 or PCI_D3hot. Finally, during kdump
> reboot stage, this device can be shut down successfully by clearing its
> master bit.
> 
> ~~~~~~~~~~~~~~~
> 
> About this patch, I think the quirk getting active PM state for all devices
> may be risky, it will impact normal kernel too which doesn't have this issue.
> 
> Wondering if there's any other way to fix or work around it.
> 
> Thanks
> Baoquan
Khalid Aziz Jan. 11, 2020, 3:45 a.m. UTC | #11
On 1/10/20 5:50 PM, Baoquan He wrote:
> On 01/10/20 at 05:18pm, Khalid Aziz wrote:
>> On 1/10/20 4:00 PM, Jerry Hoemann wrote:
>>> On Fri, Jan 10, 2020 at 03:25:36PM -0700, Khalid Aziz and Shuah Khan wrote:
>>>> On 1/10/20 2:42 PM, Bjorn Helgaas wrote:
>>>>> [+cc Deepa (also working in this area)]
>>>>>
>>>>> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
>>>>>> There are reports about kdump hang upon reboot on some HPE machines,
>>>>>> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
>>>>>> error occurred and crashed the system.
>>>>>
>>>>> Details?  Do you have URLs for bug reports, dmesg logs, etc?
>>>>>
>>>>>> On the machine I can reproduce this issue, part of the topology
>>>>>> looks like this:
>>>>>>
>>>>>> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>>>>>>           +-01.0-[02]--
>>>>>>           +-01.1-[05]--
>>>>>>           +-02.0-[06]--+-00.0  tEmulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
>>>>>>           +-02.1-[0f]--
>>>>>>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
>>>>>>
>>>>>> When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
>>>>>> will hang, depend on which device is reinitialized in kdump kernel.
>>>>>>
>>>>>> If force remove unused device then trigger kdump, the problem will never
>>>>>> happen:
>>>>>>
>>>>>>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
>>>>>>     echo c > /proc/sysrq-trigger
>>>>>>
>>>>>>     ... Kdump save vmcore through network, the NIC get reinitialized and
>>>>>>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
>>>>>>     instead, shutdown the NIC in first kernel will help)
>>>>>>
>>>>>> The cause is that some devices are enabled by the first kernel, but it
>>>>>> don't have the chance to shutdown the device, and kdump kernel is not
>>>>>> aware of it, unless it reinitialize the device.
>>>>>>
>>>>>> Upon reboot, kdump kernel will skip downstream device shutdown and
>>>>>> clears its bridge's master bit directly. The downstream device could
>>>>>> error out as it can still send requests but upstream refuses it.
>>>>>
>>>>> Can you help me understand the sequence of events?  If I understand
>>>>> correctly, the desired sequence is:
>>>>>
>>>>>   - user kernel boots
>>>>>   - user kernel panics and kexecs to kdump kernel
>>>>>   - kdump kernel writes vmcore to network or disk
>>>>>   - kdump kernel reboots
>>>>>   - user kernel boots
>>>>>
>>>>> But the problem is that as part of the kdump kernel reboot,
>>>>>
>>>>>   - kdump kernel disables bus mastering for a Root Port
>>>>>   - device below the Root Port attempts DMA
>>>>>   - Root Port receives DMA transaction, handles it as Unsupported
>>>>>     Request, sends UR Completion to device
>>>>>   - device signals uncorrectable error
>>>>>   - uncorrectable error causes a crash (Or a hang?  You mention both
>>>>>     and I'm not sure which it is)
>>>>>
>>>>> Is that right so far?
>>>>>
>>>>>> So for kdump, let kernel read the correct hardware power state on boot,
>>>>>> and always clear the bus master bit of PCI device upon shutdown if the
>>>>>> device is on. PCIe port driver will always shutdown all downstream
>>>>>> devices first, so this should ensure all downstream devices have bus
>>>>>> master bit off before clearing the bridge's bus master bit.
>>>>>>
>>>>>> Signed-off-by: Kairui Song <kasong@redhat.com>
>>>>>> ---
>>>>>>  drivers/pci/pci-driver.c | 11 ++++++++---
>>>>>>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
>>>>>>  2 files changed, 28 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>>>>>> index 0454ca0e4e3f..84a7fd643b4d 100644
>>>>>> --- a/drivers/pci/pci-driver.c
>>>>>> +++ b/drivers/pci/pci-driver.c
>>>>>> @@ -18,6 +18,7 @@
>>>>>>  #include <linux/kexec.h>
>>>>>>  #include <linux/of_device.h>
>>>>>>  #include <linux/acpi.h>
>>>>>> +#include <linux/crash_dump.h>
>>>>>>  #include "pci.h"
>>>>>>  #include "pcie/portdrv.h"
>>>>>>  
>>>>>> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
>>>>>>  	 * If this is a kexec reboot, turn off Bus Master bit on the
>>>>>>  	 * device to tell it to not continue to do DMA. Don't touch
>>>>>>  	 * devices in D3cold or unknown states.
>>>>>> -	 * If it is not a kexec reboot, firmware will hit the PCI
>>>>>> -	 * devices with big hammer and stop their DMA any way.
>>>>>> +	 * If this is kdump kernel, also turn off Bus Master, the device
>>>>>> +	 * could be activated by previous crashed kernel and may block
>>>>>> +	 * it's upstream from shutting down.
>>>>>> +	 * Else, firmware will hit the PCI devices with big hammer
>>>>>> +	 * and stop their DMA any way.
>>>>>>  	 */
>>>>>> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
>>>>>> +	if ((kexec_in_progress || is_kdump_kernel()) &&
>>>>>> +			pci_dev->current_state <= PCI_D3hot)
>>>>>>  		pci_clear_master(pci_dev);
>>>>>
>>>>> I'm clearly missing something because this will turn off bus mastering
>>>>> in cases where we previously left it enabled.
>>>>>
>>>>> I was assuming the crash was related to a device doing DMA when the
>>>>> Root Port had bus mastering disabled.  But that must be wrong.
>>>>>
>>>>> I'd like to understand the crash/hang better because the quirk
>>>>> especially is hard to connect to anything.  If the crash is because of
>>>>> an AER or other PCIe error, maybe another possibility is that we could
>>>>> handle it better or disable signaling of it or something.
>>>>>
>>>>
>>>> I am not understanding this failure mode either. That code in
>>>> pci_device_shutdown() was added originally to address this very issue.
>>>> The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
>>>> shut down any errant DMAs from PCI devices as we kexec a new kernel. In
>>>> this new patch, this is the same code path that will be taken again when
>>>> kdump kernel is shutting down. If the errant DMA problem was not fixed
>>>> by clearing Bus Master bit in this path when kdump kernel was being
>>>> kexec'd, why does the same code path work the second time around when
>>>> kdump kernel is shutting down? Is there more going on that we don't
>>>> understand?
>>>>
>>>
>>>   Khalid,
>>>
>>>   I don't believe we execute that code path in the crash case.
>>>
>>>   The variable kexec_in_progress is set true in kernel_kexec() before calling
>>>   machine_kexec().  This is the fast reboot case.
>>>
>>>   I don't see kexec_in_progress set true elsewhere.
>>>
>>>
>>>   The code path for crash is different.
>>>
>>>   For instance, panic() will call
>>> 	-> __crash_kexec()  which calls
>>> 		-> machine_kexec().
>>>
>>>  So the setting of kexec_in_progress is bypassed.
>>>
>>
>> True, but what that means is if it is an errant DMA causing the issue
>> you are seeing, that errant DMA can happen any time between when we
> 
> Here, there could be misunderstanding. It's not an errant DMA, it's an
> device which may be in DMA transporting state in normal kernel, but in
> kdump kernel it's not manipulated by its driver because we don't use it
> to dump, so exlucde its driver from kdump initramfs for saving space. 
> 

Errant DMA as in currently running kernel did not enable the device to
do DMA and is not ready for it. If a device can issue DMA request in
this state, it could do it well before kdump kernel starts shutting
down. Don't we need to fix this before we start shutting down kdump in
preparation for reboot? I can see the need for this fix, but I am not
sure if this patch places the fix in right place.

--
Khalid
Kairui Song Jan. 11, 2020, 8:46 a.m. UTC | #12
On Sat, Jan 11, 2020 at 5:42 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> Can you help me understand the sequence of events?  If I understand
> correctly, the desired sequence is:
>
>   - user kernel boots
>   - user kernel panics and kexecs to kdump kernel

One thing imported need to be mentioned here, user kernel kexec into
kdump kernel using the fast path, which does very few things, and
leave all the PCI devices untouched. If they are on, or doing DMA,
will just keep doing that, nothing will stop them.

In most cases the on going DMA seems harmless though, as kdump kernel
only live in reserved crash memory.

>   - kdump kernel writes vmcore to network or disk
>   - kdump kernel reboots
>   - user kernel boots
>
> But the problem is that as part of the kdump kernel reboot,
>
>   - kdump kernel disables bus mastering for a Root Port
>   - device below the Root Port attempts DMA
>   - Root Port receives DMA transaction, handles it as Unsupported
>     Request, sends UR Completion to device
>   - device signals uncorrectable error
>   - uncorrectable error causes a crash (Or a hang?  You mention both
>     and I'm not sure which it is)
>
> Is that right so far?

Yes everything else all correct. On the machine I can reproduce it,
system just hanged, even serial console is dead with no output.

>
> > So for kdump, let kernel read the correct hardware power state on boot,
> > and always clear the bus master bit of PCI device upon shutdown if the
> > device is on. PCIe port driver will always shutdown all downstream
> > devices first, so this should ensure all downstream devices have bus
> > master bit off before clearing the bridge's bus master bit.
> >
> > Signed-off-by: Kairui Song <kasong@redhat.com>
> > ---
> >  drivers/pci/pci-driver.c | 11 ++++++++---
> >  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
> >  2 files changed, 28 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > index 0454ca0e4e3f..84a7fd643b4d 100644
> > --- a/drivers/pci/pci-driver.c
> > +++ b/drivers/pci/pci-driver.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/kexec.h>
> >  #include <linux/of_device.h>
> >  #include <linux/acpi.h>
> > +#include <linux/crash_dump.h>
> >  #include "pci.h"
> >  #include "pcie/portdrv.h"
> >
> > @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
> >        * If this is a kexec reboot, turn off Bus Master bit on the
> >        * device to tell it to not continue to do DMA. Don't touch
> >        * devices in D3cold or unknown states.
> > -      * If it is not a kexec reboot, firmware will hit the PCI
> > -      * devices with big hammer and stop their DMA any way.
> > +      * If this is kdump kernel, also turn off Bus Master, the device
> > +      * could be activated by previous crashed kernel and may block
> > +      * it's upstream from shutting down.
> > +      * Else, firmware will hit the PCI devices with big hammer
> > +      * and stop their DMA any way.
> >        */
> > -     if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> > +     if ((kexec_in_progress || is_kdump_kernel()) &&
> > +                     pci_dev->current_state <= PCI_D3hot)
> >               pci_clear_master(pci_dev);
>
> I'm clearly missing something because this will turn off bus mastering
> in cases where we previously left it enabled.
>
> I was assuming the crash was related to a device doing DMA when the
> Root Port had bus mastering disabled.  But that must be wrong.

That is just what is happening. When kdump kernel try to reboot, it
only cleared bus mastering bit of the Root Port, ignoring enabled
device under it, because it's not the kdump kernel that enabled the
device, it's the first kernel enabled it, and kdump kernel don't know
it.

>
> I'd like to understand the crash/hang better because the quirk
> especially is hard to connect to anything.  If the crash is because of
> an AER or other PCIe error, maybe another possibility is that we could
> handle it better or disable signaling of it or something.
>

Maybe if we can solve the problem by properly shutdown the devices in
right order, then better don't disable any error handling features? Or
kernel might miss some real hardware issue.

--
Best Regards,
Kairui Song
Kairui Song Jan. 11, 2020, 9:24 a.m. UTC | #13
On Sat, Jan 11, 2020 at 8:45 AM Baoquan He <bhe@redhat.com> wrote:
> On 01/10/20 at 04:00pm, Jerry Hoemann wrote:
> > > I am not understanding this failure mode either. That code in
> > > pci_device_shutdown() was added originally to address this very issue.
> > > The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> > > shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> > > this new patch, this is the same code path that will be taken again when
> > > kdump kernel is shutting down. If the errant DMA problem was not fixed
> > > by clearing Bus Master bit in this path when kdump kernel was being
> > > kexec'd, why does the same code path work the second time around when
> > > kdump kernel is shutting down? Is there more going on that we don't
> > > understand?
> > >
> >
> >   Khalid,
> >
> >   I don't believe we execute that code path in the crash case.
> >
> >   The variable kexec_in_progress is set true in kernel_kexec() before calling
> >   machine_kexec().  This is the fast reboot case.
> >
> >   I don't see kexec_in_progress set true elsewhere.
> >
> >
> >   The code path for crash is different.
> >
> >   For instance, panic() will call
> >       -> __crash_kexec()  which calls
> >               -> machine_kexec().
> >
> >  So the setting of kexec_in_progress is bypassed.
>
> Yeah, it's a differet behaviour than kexec case. I talked to Kairui, the
> patch log may be not very clear. Below is summary I got from my
> understanding about this issue:
>
> ~~~~~~~~~~~~~~~~~~~~~~~
> Problem:
>
> When crash is triggered, system jumps into kdump kernel to collect
> vmcore and dump out. After dumping is finished, kdump kernel will try
> ty reboot to normal kernel. This hang happened during kdump kernel
> rebooting, when dumping is network dumping, e.g ssh/nfs, local storage
> is HPSA.
>
> Root cause:
>
> When configuring network dumping, only network driver modules are added
> into kdump initramfs. However, the storage HPSA pcie device is enabled
> in 1st kernel, its status is PCI_D3hot. When crashed system jumps to kdump
> kernel, we didn't shutdown any device for safety and efficiency. Then
> during kdump kernel boot up, the pci scan will get hpsa device and only
> initialize its status as pci_dev->current_state = PCI_UNKNOWN. This
> pci_dev->current_state will be manipulated by the relevant device
> driver. So HPSA device will never have chance to calibrate its status,
> and can't be shut down by pci_device_shutdown() called by reboot
> service. It's still PCI_D3hot, then crash happened when system try to
> shutdown its upper bridge.
>
> Fix:
>
> Here, Kairui uses a quirk to get PM state and mask off value bigger than
> PCI_D3cold. Means, all devices will get PM state
> pci_dev->current_state = PCI_D0 or PCI_D3hot

Or to put it simple, I just synced the actual PM state into
pci_dev->current_state using a quirk, for kdump kernel only.

> Finally, during kdump
> reboot stage, this device can be shut down successfully by clearing its
> master bit.
>
> ~~~~~~~~~~~~~~~
>
> About this patch, I think the quirk getting active PM state for all devices
> may be risky, it will impact normal kernel too which doesn't have this issue.
>
> Wondering if there's any other way to fix or work around it.
>

Thank you for the detailed description!
Kairui Song Jan. 11, 2020, 9:35 a.m. UTC | #14
On Sat, Jan 11, 2020 at 11:46 AM Khalid Aziz <khalid@gonehiking.org> wrote:
>
> On 1/10/20 5:50 PM, Baoquan He wrote:
> > On 01/10/20 at 05:18pm, Khalid Aziz wrote:
> >> On 1/10/20 4:00 PM, Jerry Hoemann wrote:
> >>> On Fri, Jan 10, 2020 at 03:25:36PM -0700, Khalid Aziz and Shuah Khan wrote:
> >>>> On 1/10/20 2:42 PM, Bjorn Helgaas wrote:
> >>>>> [+cc Deepa (also working in this area)]
> >>>>>
> >>>>> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> >>>>>> There are reports about kdump hang upon reboot on some HPE machines,
> >>>>>> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> >>>>>> error occurred and crashed the system.
> >>>>>
> >>>>> Details?  Do you have URLs for bug reports, dmesg logs, etc?
> >>>>>
> >>>>>> On the machine I can reproduce this issue, part of the topology
> >>>>>> looks like this:
> >>>>>>
> >>>>>> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
> >>>>>>           +-01.0-[02]--
> >>>>>>           +-01.1-[05]--
> >>>>>>           +-02.0-[06]--+-00.0  tEmulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
> >>>>>>           +-02.1-[0f]--
> >>>>>>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
> >>>>>>
> >>>>>> When shutting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> >>>>>> will hang, depend on which device is reinitialized in kdump kernel.
> >>>>>>
> >>>>>> If force remove unused device then trigger kdump, the problem will never
> >>>>>> happen:
> >>>>>>
> >>>>>>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
> >>>>>>     echo c > /proc/sysrq-trigger
> >>>>>>
> >>>>>>     ... Kdump save vmcore through network, the NIC get reinitialized and
> >>>>>>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
> >>>>>>     instead, shutdown the NIC in first kernel will help)
> >>>>>>
> >>>>>> The cause is that some devices are enabled by the first kernel, but it
> >>>>>> don't have the chance to shutdown the device, and kdump kernel is not
> >>>>>> aware of it, unless it reinitialize the device.
> >>>>>>
> >>>>>> Upon reboot, kdump kernel will skip downstream device shutdown and
> >>>>>> clears its bridge's master bit directly. The downstream device could
> >>>>>> error out as it can still send requests but upstream refuses it.
> >>>>>
> >>>>> Can you help me understand the sequence of events?  If I understand
> >>>>> correctly, the desired sequence is:
> >>>>>
> >>>>>   - user kernel boots
> >>>>>   - user kernel panics and kexecs to kdump kernel
> >>>>>   - kdump kernel writes vmcore to network or disk
> >>>>>   - kdump kernel reboots
> >>>>>   - user kernel boots
> >>>>>
> >>>>> But the problem is that as part of the kdump kernel reboot,
> >>>>>
> >>>>>   - kdump kernel disables bus mastering for a Root Port
> >>>>>   - device below the Root Port attempts DMA
> >>>>>   - Root Port receives DMA transaction, handles it as Unsupported
> >>>>>     Request, sends UR Completion to device
> >>>>>   - device signals uncorrectable error
> >>>>>   - uncorrectable error causes a crash (Or a hang?  You mention both
> >>>>>     and I'm not sure which it is)
> >>>>>
> >>>>> Is that right so far?
> >>>>>
> >>>>>> So for kdump, let kernel read the correct hardware power state on boot,
> >>>>>> and always clear the bus master bit of PCI device upon shutdown if the
> >>>>>> device is on. PCIe port driver will always shutdown all downstream
> >>>>>> devices first, so this should ensure all downstream devices have bus
> >>>>>> master bit off before clearing the bridge's bus master bit.
> >>>>>>
> >>>>>> Signed-off-by: Kairui Song <kasong@redhat.com>
> >>>>>> ---
> >>>>>>  drivers/pci/pci-driver.c | 11 ++++++++---
> >>>>>>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
> >>>>>>  2 files changed, 28 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> >>>>>> index 0454ca0e4e3f..84a7fd643b4d 100644
> >>>>>> --- a/drivers/pci/pci-driver.c
> >>>>>> +++ b/drivers/pci/pci-driver.c
> >>>>>> @@ -18,6 +18,7 @@
> >>>>>>  #include <linux/kexec.h>
> >>>>>>  #include <linux/of_device.h>
> >>>>>>  #include <linux/acpi.h>
> >>>>>> +#include <linux/crash_dump.h>
> >>>>>>  #include "pci.h"
> >>>>>>  #include "pcie/portdrv.h"
> >>>>>>
> >>>>>> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
> >>>>>>           * If this is a kexec reboot, turn off Bus Master bit on the
> >>>>>>           * device to tell it to not continue to do DMA. Don't touch
> >>>>>>           * devices in D3cold or unknown states.
> >>>>>> -         * If it is not a kexec reboot, firmware will hit the PCI
> >>>>>> -         * devices with big hammer and stop their DMA any way.
> >>>>>> +         * If this is kdump kernel, also turn off Bus Master, the device
> >>>>>> +         * could be activated by previous crashed kernel and may block
> >>>>>> +         * it's upstream from shutting down.
> >>>>>> +         * Else, firmware will hit the PCI devices with big hammer
> >>>>>> +         * and stop their DMA any way.
> >>>>>>           */
> >>>>>> -        if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> >>>>>> +        if ((kexec_in_progress || is_kdump_kernel()) &&
> >>>>>> +                        pci_dev->current_state <= PCI_D3hot)
> >>>>>>                  pci_clear_master(pci_dev);
> >>>>>
> >>>>> I'm clearly missing something because this will turn off bus mastering
> >>>>> in cases where we previously left it enabled.
> >>>>>
> >>>>> I was assuming the crash was related to a device doing DMA when the
> >>>>> Root Port had bus mastering disabled.  But that must be wrong.
> >>>>>
> >>>>> I'd like to understand the crash/hang better because the quirk
> >>>>> especially is hard to connect to anything.  If the crash is because of
> >>>>> an AER or other PCIe error, maybe another possibility is that we could
> >>>>> handle it better or disable signaling of it or something.
> >>>>>
> >>>>
> >>>> I am not understanding this failure mode either. That code in
> >>>> pci_device_shutdown() was added originally to address this very issue.
> >>>> The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> >>>> shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> >>>> this new patch, this is the same code path that will be taken again when
> >>>> kdump kernel is shutting down. If the errant DMA problem was not fixed
> >>>> by clearing Bus Master bit in this path when kdump kernel was being
> >>>> kexec'd, why does the same code path work the second time around when
> >>>> kdump kernel is shutting down? Is there more going on that we don't
> >>>> understand?
> >>>>
> >>>
> >>>   Khalid,
> >>>
> >>>   I don't believe we execute that code path in the crash case.
> >>>
> >>>   The variable kexec_in_progress is set true in kernel_kexec() before calling
> >>>   machine_kexec().  This is the fast reboot case.
> >>>
> >>>   I don't see kexec_in_progress set true elsewhere.
> >>>
> >>>
> >>>   The code path for crash is different.
> >>>
> >>>   For instance, panic() will call
> >>>     -> __crash_kexec()  which calls
> >>>             -> machine_kexec().
> >>>
> >>>  So the setting of kexec_in_progress is bypassed.
> >>>
> >>
> >> True, but what that means is if it is an errant DMA causing the issue
> >> you are seeing, that errant DMA can happen any time between when we
> >
> > Here, there could be misunderstanding. It's not an errant DMA, it's an
> > device which may be in DMA transporting state in normal kernel, but in
> > kdump kernel it's not manipulated by its driver because we don't use it
> > to dump, so exlucde its driver from kdump initramfs for saving space.
> >
>
> Errant DMA as in currently running kernel did not enable the device to
> do DMA and is not ready for it. If a device can issue DMA request in
> this state, it could do it well before kdump kernel starts shutting
> down. Don't we need to fix this before we start shutting down kdump in
> preparation for reboot? I can see the need for this fix, but I am not
> sure if this patch places the fix in right place.
>
> --
> Khalid
>

Hi, there are some previous works about this issue, reset PCI devices
in kdump kernel to stop ongoing DMA:

[v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
https://lore.kernel.org/patchwork/cover/343767/

[v2] PCI: Reset PCIe devices to stop ongoing DMA
https://lore.kernel.org/patchwork/patch/379191/

And didn't get merged, that patch are trying to fix some DMAR error
problem, but resetting devices is a bit too destructive, and the
problem is later fixed in IOMMU side. And in most case the DMA seems
harmless, as they targets first kernel's memory and kdump kernel only
live in crash memory.

Also, by the time kdump kernel is able to scan and reset devices,
there are already a very large time window where things could go
wrong.

The currently problem observed only happens upon kdump kernel
shutdown, as the upper bridge is disabled before the device is
disabledm so DMA will raise error. It's more like a problem of wrong
device shutting down order.
Baoquan He Jan. 11, 2020, 10:04 a.m. UTC | #15
On 01/10/20 at 08:45pm, Khalid Aziz wrote:
> >>>>>> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> >>>>>> +	if ((kexec_in_progress || is_kdump_kernel()) &&
> >>>>>> +			pci_dev->current_state <= PCI_D3hot)
> >>>>>>  		pci_clear_master(pci_dev);
> >>>>>
> >>>>> I'm clearly missing something because this will turn off bus mastering
> >>>>> in cases where we previously left it enabled.
> >>>>>
> >>>>> I was assuming the crash was related to a device doing DMA when the
> >>>>> Root Port had bus mastering disabled.  But that must be wrong.
> >>>>>
> >>>>> I'd like to understand the crash/hang better because the quirk
> >>>>> especially is hard to connect to anything.  If the crash is because of
> >>>>> an AER or other PCIe error, maybe another possibility is that we could
> >>>>> handle it better or disable signaling of it or something.
> >>>>>
> >>>>
> >>>> I am not understanding this failure mode either. That code in
> >>>> pci_device_shutdown() was added originally to address this very issue.
> >>>> The patch 4fc9bbf98fd6 ("PCI: Disable Bus Master only on kexec reboot")
> >>>> shut down any errant DMAs from PCI devices as we kexec a new kernel. In
> >>>> this new patch, this is the same code path that will be taken again when
> >>>> kdump kernel is shutting down. If the errant DMA problem was not fixed
> >>>> by clearing Bus Master bit in this path when kdump kernel was being
> >>>> kexec'd, why does the same code path work the second time around when
> >>>> kdump kernel is shutting down? Is there more going on that we don't
> >>>> understand?
> >>>>
> >>>
> >>>   Khalid,
> >>>
> >>>   I don't believe we execute that code path in the crash case.
> >>>
> >>>   The variable kexec_in_progress is set true in kernel_kexec() before calling
> >>>   machine_kexec().  This is the fast reboot case.
> >>>
> >>>   I don't see kexec_in_progress set true elsewhere.
> >>>
> >>>
> >>>   The code path for crash is different.
> >>>
> >>>   For instance, panic() will call
> >>> 	-> __crash_kexec()  which calls
> >>> 		-> machine_kexec().
> >>>
> >>>  So the setting of kexec_in_progress is bypassed.
> >>>
> >>
> >> True, but what that means is if it is an errant DMA causing the issue
> >> you are seeing, that errant DMA can happen any time between when we
> > 
> > Here, there could be misunderstanding. It's not an errant DMA, it's an
> > device which may be in DMA transporting state in normal kernel, but in
> > kdump kernel it's not manipulated by its driver because we don't use it
> > to dump, so exlucde its driver from kdump initramfs for saving space. 
> > 
> 
> Errant DMA as in currently running kernel did not enable the device to
> do DMA and is not ready for it. If a device can issue DMA request in
> this state, it could do it well before kdump kernel starts shutting
> down. Don't we need to fix this before we start shutting down kdump in
> preparation for reboot? I can see the need for this fix, but I am not
> sure if this patch places the fix in right place.

Ah, I could get your point now, but not very sure. Do you mean the HPSA
is in errant DMA state, because it doesn't have driver to re-initilize
to correct state? 

We have been doing like this for kdump kernel always, to only
include needed devices' driver to kdump initrd so that the driver will
initialize and opearte the device. For other unneeded devices by kdump
kernel, we just leave it as is. This error is only seen on this HPE
owned system. Wondering if HPE can do something to check their
firmware/hardware setting.

This patch is using a quirk to adjust all those devices if its
pci_dev->current_state is beyond PCI_D3hot during bootup. Then clear the
master bit of device firstly, next clear its parent bridge's master bit,
to make reboot proceed further. This need you PCI experts to offer help
to evaluate.

Thanks
Baoquan
Deepa Dinamani Jan. 11, 2020, 6:32 p.m. UTC | #16
> Hi, there are some previous works about this issue, reset PCI devices
> in kdump kernel to stop ongoing DMA:
>
> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> https://lore.kernel.org/patchwork/cover/343767/
>
> [v2] PCI: Reset PCIe devices to stop ongoing DMA
> https://lore.kernel.org/patchwork/patch/379191/
>
> And didn't get merged, that patch are trying to fix some DMAR error
> problem, but resetting devices is a bit too destructive, and the
> problem is later fixed in IOMMU side. And in most case the DMA seems
> harmless, as they targets first kernel's memory and kdump kernel only
> live in crash memory.

I was going to ask the same. If the kdump kernel had IOMMU on, would
that still be a problem?

> Also, by the time kdump kernel is able to scan and reset devices,
> there are already a very large time window where things could go
> wrong.
>
> The currently problem observed only happens upon kdump kernel
> shutdown, as the upper bridge is disabled before the device is
> disabledm so DMA will raise error. It's more like a problem of wrong
> device shutting down order.

The way it was described earlier "During this time, the SUT sometimes
gets a PCI error that raises an NMI." suggests that it isn't really
restricted to kexec/kdump.
Any attached device without an active driver might attempt spurious or
malicious DMA and trigger the same during normal operation.
Do you have available some more reporting of what happens during the
PCIe error handling?

"The reaction to the NMI that the kdump kernel takes is problematic."
Or the NMI should not have been triggered to begin with? Where does that happen?

I couldn't access the bug report.

-Deepa
Kairui Song Jan. 13, 2020, 5:07 p.m. UTC | #17
On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>
> > Hi, there are some previous works about this issue, reset PCI devices
> > in kdump kernel to stop ongoing DMA:
> >
> > [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> > https://lore.kernel.org/patchwork/cover/343767/
> >
> > [v2] PCI: Reset PCIe devices to stop ongoing DMA
> > https://lore.kernel.org/patchwork/patch/379191/
> >
> > And didn't get merged, that patch are trying to fix some DMAR error
> > problem, but resetting devices is a bit too destructive, and the
> > problem is later fixed in IOMMU side. And in most case the DMA seems
> > harmless, as they targets first kernel's memory and kdump kernel only
> > live in crash memory.
>
> I was going to ask the same. If the kdump kernel had IOMMU on, would
> that still be a problem?

It will still fail, doing DMA is not a problem, it only go wrong when
a device's upstream bridge is mistakenly shutdown before the device
shutdown.

>
> > Also, by the time kdump kernel is able to scan and reset devices,
> > there are already a very large time window where things could go
> > wrong.
> >
> > The currently problem observed only happens upon kdump kernel
> > shutdown, as the upper bridge is disabled before the device is
> > disabledm so DMA will raise error. It's more like a problem of wrong
> > device shutting down order.
>
> The way it was described earlier "During this time, the SUT sometimes
> gets a PCI error that raises an NMI." suggests that it isn't really
> restricted to kexec/kdump.
> Any attached device without an active driver might attempt spurious or
> malicious DMA and trigger the same during normal operation.
> Do you have available some more reporting of what happens during the
> PCIe error handling?

Let me add more info about this:

On the machine where I can reproduce this issue, the first kernel
always runs fine, and kdump kernel works fine during dumping the
vmcore, even if I keep the kdump kernel running for hours, nothing
goes wrong. If there are DMA during normal operation that will cause
problem, this should have exposed it.

The problem only occur when kdump kernel try to reboot, no matter how
long the kdump kernel have been running (few minutes or hours). The
machine is dead after printing:
[  101.438300] reboot: Restarting system^M
[  101.455360] reboot: machine restart^M

And I can find following logs happend just at that time, in the
"Integrated Management Log" from the iLO web interface:
1254 OS 12/25/2019 09:08 12/25/2019 09:08 1 User Remotely Initiated NMI Switch
1253 System Error 12/25/2019 09:08 12/25/2019 09:08 1 An Unrecoverable
System Error (NMI) has occurred (Service Information: 0x00000000,
0x00000000)
1252 PCI Bus 12/25/2019 09:07 12/25/2019 09:07 1 Uncorrectable PCI
Express Error (Embedded device, Bus 0, Device 2, Function 2, Error
status 0x00100000)
1251 System Error 12/25/2019 09:07 12/25/2019 09:07 1 Unrecoverable
System Error (NMI) has occurred.  System Firmware will log additional
details in a separate IML entry if possible
1250 PCI Bus 12/25/2019 09:07 12/25/2019 09:07 1 PCI Bus Error (Slot
0, Bus 0, Device 2, Function 2)

And the topology is:
[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
          +-01.0-[02]--
          +-01.1-[05]--
          +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
          |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
          |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
          +-02.1-[0f]--
          +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array
Gen9 Controllers

It's a bridge reporting the error. It should be an unsupported request
error, bacause downstream device is still alive and sending request,
but the port have bus mastering off. If I manually shutdown the "Smart
Array" (HPSA) device before kdump reboot, it will always reboot just
fine.

And as the patch descriptions said, the HPSA is used in first kernel,
but didn't get reset in kdump kernel because driver is not loaded.
When shutting down a bridge, kernel should shutdown downstream device
first, and then shutdown and clear bus master bit of the bridge. But
in kdump case, kernel skipped some device shutdown due to driver not
loaded issue, and kernel don't know they are enabled.

This problem is not limited to HPSA, the NIC listed in above topology
maybe also make the bridge error out, if HPSA get loaded in kdump
kernel and NIC get ignored.

>
> "The reaction to the NMI that the kdump kernel takes is problematic."
> Or the NMI should not have been triggered to begin with? Where does that happen?

The NMI is triggered by firmware upon the bridge error mentioned
above, it should have triggered a kernel panic, but on the test
system, it just hanged and no longer give any output, so I can't post
any log about it.
Deepa Dinamani Jan. 15, 2020, 1:16 a.m. UTC | #18
On Mon, Jan 13, 2020 at 9:07 AM Kairui Song <kasong@redhat.com> wrote:
>
> On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
> >
> > > Hi, there are some previous works about this issue, reset PCI devices
> > > in kdump kernel to stop ongoing DMA:
> > >
> > > [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> > > https://lore.kernel.org/patchwork/cover/343767/
> > >
> > > [v2] PCI: Reset PCIe devices to stop ongoing DMA
> > > https://lore.kernel.org/patchwork/patch/379191/
> > >
> > > And didn't get merged, that patch are trying to fix some DMAR error
> > > problem, but resetting devices is a bit too destructive, and the
> > > problem is later fixed in IOMMU side. And in most case the DMA seems
> > > harmless, as they targets first kernel's memory and kdump kernel only
> > > live in crash memory.
> >
> > I was going to ask the same. If the kdump kernel had IOMMU on, would
> > that still be a problem?
>
> It will still fail, doing DMA is not a problem, it only go wrong when
> a device's upstream bridge is mistakenly shutdown before the device
> shutdown.
>
> >
> > > Also, by the time kdump kernel is able to scan and reset devices,
> > > there are already a very large time window where things could go
> > > wrong.
> > >
> > > The currently problem observed only happens upon kdump kernel
> > > shutdown, as the upper bridge is disabled before the device is
> > > disabledm so DMA will raise error. It's more like a problem of wrong
> > > device shutting down order.
> >
> > The way it was described earlier "During this time, the SUT sometimes
> > gets a PCI error that raises an NMI." suggests that it isn't really
> > restricted to kexec/kdump.
> > Any attached device without an active driver might attempt spurious or
> > malicious DMA and trigger the same during normal operation.
> > Do you have available some more reporting of what happens during the
> > PCIe error handling?
>
> Let me add more info about this:
>
> On the machine where I can reproduce this issue, the first kernel
> always runs fine, and kdump kernel works fine during dumping the
> vmcore, even if I keep the kdump kernel running for hours, nothing
> goes wrong. If there are DMA during normal operation that will cause
> problem, this should have exposed it.
>
> The problem only occur when kdump kernel try to reboot, no matter how
> long the kdump kernel have been running (few minutes or hours). The
> machine is dead after printing:
> [  101.438300] reboot: Restarting system^M
> [  101.455360] reboot: machine restart^M
>
> And I can find following logs happend just at that time, in the
> "Integrated Management Log" from the iLO web interface:
> 1254 OS 12/25/2019 09:08 12/25/2019 09:08 1 User Remotely Initiated NMI Switch
> 1253 System Error 12/25/2019 09:08 12/25/2019 09:08 1 An Unrecoverable
> System Error (NMI) has occurred (Service Information: 0x00000000,
> 0x00000000)
> 1252 PCI Bus 12/25/2019 09:07 12/25/2019 09:07 1 Uncorrectable PCI
> Express Error (Embedded device, Bus 0, Device 2, Function 2, Error
> status 0x00100000)
> 1251 System Error 12/25/2019 09:07 12/25/2019 09:07 1 Unrecoverable
> System Error (NMI) has occurred.  System Firmware will log additional
> details in a separate IML entry if possible
> 1250 PCI Bus 12/25/2019 09:07 12/25/2019 09:07 1 PCI Bus Error (Slot
> 0, Bus 0, Device 2, Function 2)
>
> And the topology is:
> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>           +-01.0-[02]--
>           +-01.1-[05]--
>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
>           +-02.1-[0f]--
>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array
> Gen9 Controllers
>
> It's a bridge reporting the error. It should be an unsupported request
> error, bacause downstream device is still alive and sending request,
> but the port have bus mastering off. If I manually shutdown the "Smart
> Array" (HPSA) device before kdump reboot, it will always reboot just
> fine.
>
> And as the patch descriptions said, the HPSA is used in first kernel,
> but didn't get reset in kdump kernel because driver is not loaded.
> When shutting down a bridge, kernel should shutdown downstream device
> first, and then shutdown and clear bus master bit of the bridge. But
> in kdump case, kernel skipped some device shutdown due to driver not
> loaded issue, and kernel don't know they are enabled.
>
> This problem is not limited to HPSA, the NIC listed in above topology
> maybe also make the bridge error out, if HPSA get loaded in kdump
> kernel and NIC get ignored.

It looks like the right answer is for the kernel to handle such cases
gracefully. From what I recall, we can only trust the bus mastering at
root ports. So, it is possible that the endpoint devices can always
try to DMA, but it can be blocked by the root port. So the right fix
seems to teach kernel how to handle these insted of hacking the
shutdown code.
-Deepa


-Deepa
Kairui Song Jan. 15, 2020, 7:56 a.m. UTC | #19
On Wed, Jan 15, 2020 at 9:17 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>
> On Mon, Jan 13, 2020 at 9:07 AM Kairui Song <kasong@redhat.com> wrote:
> >
> > On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
> > >
> > > > Hi, there are some previous works about this issue, reset PCI devices
> > > > in kdump kernel to stop ongoing DMA:
> > > >
> > > > [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> > > > https://lore.kernel.org/patchwork/cover/343767/
> > > >
> > > > [v2] PCI: Reset PCIe devices to stop ongoing DMA
> > > > https://lore.kernel.org/patchwork/patch/379191/
> > > >
> > > > And didn't get merged, that patch are trying to fix some DMAR error
> > > > problem, but resetting devices is a bit too destructive, and the
> > > > problem is later fixed in IOMMU side. And in most case the DMA seems
> > > > harmless, as they targets first kernel's memory and kdump kernel only
> > > > live in crash memory.
> > >
> > > I was going to ask the same. If the kdump kernel had IOMMU on, would
> > > that still be a problem?
> >
> > It will still fail, doing DMA is not a problem, it only go wrong when
> > a device's upstream bridge is mistakenly shutdown before the device
> > shutdown.
> >
> > >
> > > > Also, by the time kdump kernel is able to scan and reset devices,
> > > > there are already a very large time window where things could go
> > > > wrong.
> > > >
> > > > The currently problem observed only happens upon kdump kernel
> > > > shutdown, as the upper bridge is disabled before the device is
> > > > disabledm so DMA will raise error. It's more like a problem of wrong
> > > > device shutting down order.
> > >
> > > The way it was described earlier "During this time, the SUT sometimes
> > > gets a PCI error that raises an NMI." suggests that it isn't really
> > > restricted to kexec/kdump.
> > > Any attached device without an active driver might attempt spurious or
> > > malicious DMA and trigger the same during normal operation.
> > > Do you have available some more reporting of what happens during the
> > > PCIe error handling?
> >
> > Let me add more info about this:
> >
> > On the machine where I can reproduce this issue, the first kernel
> > always runs fine, and kdump kernel works fine during dumping the
> > vmcore, even if I keep the kdump kernel running for hours, nothing
> > goes wrong. If there are DMA during normal operation that will cause
> > problem, this should have exposed it.
> >
> > The problem only occur when kdump kernel try to reboot, no matter how
> > long the kdump kernel have been running (few minutes or hours). The
> > machine is dead after printing:
> > [  101.438300] reboot: Restarting system^M
> > [  101.455360] reboot: machine restart^M
> >
> > And I can find following logs happend just at that time, in the
> > "Integrated Management Log" from the iLO web interface:
> > 1254 OS 12/25/2019 09:08 12/25/2019 09:08 1 User Remotely Initiated NMI Switch
> > 1253 System Error 12/25/2019 09:08 12/25/2019 09:08 1 An Unrecoverable
> > System Error (NMI) has occurred (Service Information: 0x00000000,
> > 0x00000000)
> > 1252 PCI Bus 12/25/2019 09:07 12/25/2019 09:07 1 Uncorrectable PCI
> > Express Error (Embedded device, Bus 0, Device 2, Function 2, Error
> > status 0x00100000)
> > 1251 System Error 12/25/2019 09:07 12/25/2019 09:07 1 Unrecoverable
> > System Error (NMI) has occurred.  System Firmware will log additional
> > details in a separate IML entry if possible
> > 1250 PCI Bus 12/25/2019 09:07 12/25/2019 09:07 1 PCI Bus Error (Slot
> > 0, Bus 0, Device 2, Function 2)
> >
> > And the topology is:
> > [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
> >           +-01.0-[02]--
> >           +-01.1-[05]--
> >           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
> >           +-02.1-[0f]--
> >           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array
> > Gen9 Controllers
> >
> > It's a bridge reporting the error. It should be an unsupported request
> > error, bacause downstream device is still alive and sending request,
> > but the port have bus mastering off. If I manually shutdown the "Smart
> > Array" (HPSA) device before kdump reboot, it will always reboot just
> > fine.
> >
> > And as the patch descriptions said, the HPSA is used in first kernel,
> > but didn't get reset in kdump kernel because driver is not loaded.
> > When shutting down a bridge, kernel should shutdown downstream device
> > first, and then shutdown and clear bus master bit of the bridge. But
> > in kdump case, kernel skipped some device shutdown due to driver not
> > loaded issue, and kernel don't know they are enabled.
> >
> > This problem is not limited to HPSA, the NIC listed in above topology
> > maybe also make the bridge error out, if HPSA get loaded in kdump
> > kernel and NIC get ignored.
>
> It looks like the right answer is for the kernel to handle such cases
> gracefully. From what I recall, we can only trust the bus mastering at
> root ports. So, it is possible that the endpoint devices can always
> try to DMA, but it can be blocked by the root port. So the right fix
> seems to teach kernel how to handle these insted of hacking the
> shutdown code.
> -Deepa
>

Kexec have been disabling bus mastering on endpoints to prevent memory
corruption since commit b566a22c2332 ("PCI: disable Bus Master on PCI
device shutdown"), later improved by 4fc9bbf98fd6 ("PCI: Disable Bus
Master only on kexec reboot") and 6e0eda3c3898 ("PCI: Don't try to
disable Bus Master on disconnected PCI devices").

That's done before 2014, it worked for kexec reboot, so I think
disable bus mastering on endpoints should work in most cases.

It's true that on some machine disabling bus mastering doesn't work
well, but I guess kexec/kdump will also not work well on such machine?
And for kdump, the hardware is already in an unstable status, and
kernel don't know how to shutdown/reset some device due to driver not
loaded. And loading all drivers in kdump kernel is usually not
practical. So I think disable bus mastering bit is the best effort
kernel could do at that time (upon shutdown, at least ensure device
shutdown is in right order)?

And about the UR error and Bus Mastering bit, from the spec:
"Endpoints:When this bit is Set, the PCI Express Function is allowed
to issue Memory or I/O Requests. When this bit is Clear, the PCI
Express Function is not allowed to issue any Memory or I/O Requests."
"Root and Switch Ports: This bit controls forwarding of Memory or I/O
Requests by a Switch or Root Port in the Upstream direction. When this
bit is 0b, Memory and I/O Requests received at a Root Port or the
Downstream side of a Switch Port must be handled as Unsupported
Requests (UR), and for Non-Posted Requests a Completion with UR
completion status must be returned."

Hardware is supposed to report UR error, but this hanging problem is
only observed on some HPE machines, the firmware will send NMI (and
supposed to panic the kernel) when it happened. So maybe this could be
fixed with the firmware or apply some workaround for HPE (eg in
hpwdt)?




--
Best Regards,
Kairui Song
Khalid Aziz Jan. 15, 2020, 5:30 p.m. UTC | #20
On 1/13/20 10:07 AM, Kairui Song wrote:
> On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>>
>>> Hi, there are some previous works about this issue, reset PCI devices
>>> in kdump kernel to stop ongoing DMA:
>>>
>>> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
>>> https://lore.kernel.org/patchwork/cover/343767/
>>>
>>> [v2] PCI: Reset PCIe devices to stop ongoing DMA
>>> https://lore.kernel.org/patchwork/patch/379191/
>>>
>>> And didn't get merged, that patch are trying to fix some DMAR error
>>> problem, but resetting devices is a bit too destructive, and the
>>> problem is later fixed in IOMMU side. And in most case the DMA seems
>>> harmless, as they targets first kernel's memory and kdump kernel only
>>> live in crash memory.
>>
>> I was going to ask the same. If the kdump kernel had IOMMU on, would
>> that still be a problem?
> 
> It will still fail, doing DMA is not a problem, it only go wrong when
> a device's upstream bridge is mistakenly shutdown before the device
> shutdown.
> 
>>
>>> Also, by the time kdump kernel is able to scan and reset devices,
>>> there are already a very large time window where things could go
>>> wrong.
>>>
>>> The currently problem observed only happens upon kdump kernel
>>> shutdown, as the upper bridge is disabled before the device is
>>> disabledm so DMA will raise error. It's more like a problem of wrong
>>> device shutting down order.
>>
>> The way it was described earlier "During this time, the SUT sometimes
>> gets a PCI error that raises an NMI." suggests that it isn't really
>> restricted to kexec/kdump.
>> Any attached device without an active driver might attempt spurious or
>> malicious DMA and trigger the same during normal operation.
>> Do you have available some more reporting of what happens during the
>> PCIe error handling?
> 
> Let me add more info about this:
> 
> On the machine where I can reproduce this issue, the first kernel
> always runs fine, and kdump kernel works fine during dumping the
> vmcore, even if I keep the kdump kernel running for hours, nothing
> goes wrong. If there are DMA during normal operation that will cause
> problem, this should have exposed it.
> 

This is the part that is puzzling me. Error shows up only when kdump
kernel is being shut down. kdump kernel can run for hours without this
issue. What is the operation from downstream device that is resulting in
uncorrectable error - is it indeed a DMA request? Why does that
operation from downstream device not happen until shutdown?

I just want to make sure we fix the right problem in the right way.

--
Khalid
Kairui Song Jan. 15, 2020, 6:05 p.m. UTC | #21
On Thu, Jan 16, 2020 at 1:31 AM Khalid Aziz <khalid@gonehiking.org> wrote:
>
> On 1/13/20 10:07 AM, Kairui Song wrote:
> > On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
> >>
> >>> Hi, there are some previous works about this issue, reset PCI devices
> >>> in kdump kernel to stop ongoing DMA:
> >>>
> >>> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> >>> https://lore.kernel.org/patchwork/cover/343767/
> >>>
> >>> [v2] PCI: Reset PCIe devices to stop ongoing DMA
> >>> https://lore.kernel.org/patchwork/patch/379191/
> >>>
> >>> And didn't get merged, that patch are trying to fix some DMAR error
> >>> problem, but resetting devices is a bit too destructive, and the
> >>> problem is later fixed in IOMMU side. And in most case the DMA seems
> >>> harmless, as they targets first kernel's memory and kdump kernel only
> >>> live in crash memory.
> >>
> >> I was going to ask the same. If the kdump kernel had IOMMU on, would
> >> that still be a problem?
> >
> > It will still fail, doing DMA is not a problem, it only go wrong when
> > a device's upstream bridge is mistakenly shutdown before the device
> > shutdown.
> >
> >>
> >>> Also, by the time kdump kernel is able to scan and reset devices,
> >>> there are already a very large time window where things could go
> >>> wrong.
> >>>
> >>> The currently problem observed only happens upon kdump kernel
> >>> shutdown, as the upper bridge is disabled before the device is
> >>> disabledm so DMA will raise error. It's more like a problem of wrong
> >>> device shutting down order.
> >>
> >> The way it was described earlier "During this time, the SUT sometimes
> >> gets a PCI error that raises an NMI." suggests that it isn't really
> >> restricted to kexec/kdump.
> >> Any attached device without an active driver might attempt spurious or
> >> malicious DMA and trigger the same during normal operation.
> >> Do you have available some more reporting of what happens during the
> >> PCIe error handling?
> >
> > Let me add more info about this:
> >
> > On the machine where I can reproduce this issue, the first kernel
> > always runs fine, and kdump kernel works fine during dumping the
> > vmcore, even if I keep the kdump kernel running for hours, nothing
> > goes wrong. If there are DMA during normal operation that will cause
> > problem, this should have exposed it.
> >
>
> This is the part that is puzzling me. Error shows up only when kdump
> kernel is being shut down. kdump kernel can run for hours without this
> issue. What is the operation from downstream device that is resulting in
> uncorrectable error - is it indeed a DMA request? Why does that
> operation from downstream device not happen until shutdown?
>
> I just want to make sure we fix the right problem in the right way.
>

Actually the device could keep sending request with no problem during
kdump kernel running. Eg. keep sending DMA, and all DMA targets first
kernel's system memory, so kdump runs fine as long as nothing touch
the reserved crash memory. And the error is reported by the port, when
shutdown it has bus master bit, and downstream request will cause
error.

I'm not sure what request it really is either, it could depend on
device. On that machine, error could be reproduced when either the NIC
or HPSA is not reset in kdump, and from the bug report, the reporter
used a different NIC card and it's also reproducible.
The NIC is much less like to cause bridge error though (HPSA is about
7/10 reproducible, NIC is about 3/10), so the device could send
different requests but fail in the same way (UR error reported from
the bridge).

Will try to do more debug, but I'm not sure how can I intercept the
PCIe operation to get some info about what is actually causing the
issue, do you have any suggestion?
Khalid Aziz Jan. 15, 2020, 9:17 p.m. UTC | #22
On 1/15/20 11:05 AM, Kairui Song wrote:
> On Thu, Jan 16, 2020 at 1:31 AM Khalid Aziz <khalid@gonehiking.org> wrote:
>>
>> On 1/13/20 10:07 AM, Kairui Song wrote:
>>> On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>>>>
>>>>> Hi, there are some previous works about this issue, reset PCI devices
>>>>> in kdump kernel to stop ongoing DMA:
>>>>>
>>>>> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
>>>>> https://lore.kernel.org/patchwork/cover/343767/
>>>>>
>>>>> [v2] PCI: Reset PCIe devices to stop ongoing DMA
>>>>> https://lore.kernel.org/patchwork/patch/379191/
>>>>>
>>>>> And didn't get merged, that patch are trying to fix some DMAR error
>>>>> problem, but resetting devices is a bit too destructive, and the
>>>>> problem is later fixed in IOMMU side. And in most case the DMA seems
>>>>> harmless, as they targets first kernel's memory and kdump kernel only
>>>>> live in crash memory.
>>>>
>>>> I was going to ask the same. If the kdump kernel had IOMMU on, would
>>>> that still be a problem?
>>>
>>> It will still fail, doing DMA is not a problem, it only go wrong when
>>> a device's upstream bridge is mistakenly shutdown before the device
>>> shutdown.
>>>
>>>>
>>>>> Also, by the time kdump kernel is able to scan and reset devices,
>>>>> there are already a very large time window where things could go
>>>>> wrong.
>>>>>
>>>>> The currently problem observed only happens upon kdump kernel
>>>>> shutdown, as the upper bridge is disabled before the device is
>>>>> disabledm so DMA will raise error. It's more like a problem of wrong
>>>>> device shutting down order.
>>>>
>>>> The way it was described earlier "During this time, the SUT sometimes
>>>> gets a PCI error that raises an NMI." suggests that it isn't really
>>>> restricted to kexec/kdump.
>>>> Any attached device without an active driver might attempt spurious or
>>>> malicious DMA and trigger the same during normal operation.
>>>> Do you have available some more reporting of what happens during the
>>>> PCIe error handling?
>>>
>>> Let me add more info about this:
>>>
>>> On the machine where I can reproduce this issue, the first kernel
>>> always runs fine, and kdump kernel works fine during dumping the
>>> vmcore, even if I keep the kdump kernel running for hours, nothing
>>> goes wrong. If there are DMA during normal operation that will cause
>>> problem, this should have exposed it.
>>>
>>
>> This is the part that is puzzling me. Error shows up only when kdump
>> kernel is being shut down. kdump kernel can run for hours without this
>> issue. What is the operation from downstream device that is resulting in
>> uncorrectable error - is it indeed a DMA request? Why does that
>> operation from downstream device not happen until shutdown?
>>
>> I just want to make sure we fix the right problem in the right way.
>>
> 
> Actually the device could keep sending request with no problem during
> kdump kernel running. Eg. keep sending DMA, and all DMA targets first
> kernel's system memory, so kdump runs fine as long as nothing touch
> the reserved crash memory. And the error is reported by the port, when
> shutdown it has bus master bit, and downstream request will cause
> error.
> 

Problem really is there are active devices while kdump kernel is
running. You did say earlier - "And in most case the DMA seems
harmless, as they targets first kernel's memory and kdump kernel only
live in crash memory.". Even if this holds today, it is going to break
one of these days. There is the "reset_devices" option but that does not
work if driver is not loaded by kdump kernel. Can we try to shut down
devices in machine_crash_shutdown() before we start kdump kernel?

--
Khalid
Dave Young Jan. 17, 2020, 3:24 a.m. UTC | #23
On 01/15/20 at 02:17pm, Khalid Aziz wrote:
> On 1/15/20 11:05 AM, Kairui Song wrote:
> > On Thu, Jan 16, 2020 at 1:31 AM Khalid Aziz <khalid@gonehiking.org> wrote:
> >>
> >> On 1/13/20 10:07 AM, Kairui Song wrote:
> >>> On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
> >>>>
> >>>>> Hi, there are some previous works about this issue, reset PCI devices
> >>>>> in kdump kernel to stop ongoing DMA:
> >>>>>
> >>>>> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> >>>>> https://lore.kernel.org/patchwork/cover/343767/
> >>>>>
> >>>>> [v2] PCI: Reset PCIe devices to stop ongoing DMA
> >>>>> https://lore.kernel.org/patchwork/patch/379191/
> >>>>>
> >>>>> And didn't get merged, that patch are trying to fix some DMAR error
> >>>>> problem, but resetting devices is a bit too destructive, and the
> >>>>> problem is later fixed in IOMMU side. And in most case the DMA seems
> >>>>> harmless, as they targets first kernel's memory and kdump kernel only
> >>>>> live in crash memory.
> >>>>
> >>>> I was going to ask the same. If the kdump kernel had IOMMU on, would
> >>>> that still be a problem?
> >>>
> >>> It will still fail, doing DMA is not a problem, it only go wrong when
> >>> a device's upstream bridge is mistakenly shutdown before the device
> >>> shutdown.
> >>>
> >>>>
> >>>>> Also, by the time kdump kernel is able to scan and reset devices,
> >>>>> there are already a very large time window where things could go
> >>>>> wrong.
> >>>>>
> >>>>> The currently problem observed only happens upon kdump kernel
> >>>>> shutdown, as the upper bridge is disabled before the device is
> >>>>> disabledm so DMA will raise error. It's more like a problem of wrong
> >>>>> device shutting down order.
> >>>>
> >>>> The way it was described earlier "During this time, the SUT sometimes
> >>>> gets a PCI error that raises an NMI." suggests that it isn't really
> >>>> restricted to kexec/kdump.
> >>>> Any attached device without an active driver might attempt spurious or
> >>>> malicious DMA and trigger the same during normal operation.
> >>>> Do you have available some more reporting of what happens during the
> >>>> PCIe error handling?
> >>>
> >>> Let me add more info about this:
> >>>
> >>> On the machine where I can reproduce this issue, the first kernel
> >>> always runs fine, and kdump kernel works fine during dumping the
> >>> vmcore, even if I keep the kdump kernel running for hours, nothing
> >>> goes wrong. If there are DMA during normal operation that will cause
> >>> problem, this should have exposed it.
> >>>
> >>
> >> This is the part that is puzzling me. Error shows up only when kdump
> >> kernel is being shut down. kdump kernel can run for hours without this
> >> issue. What is the operation from downstream device that is resulting in
> >> uncorrectable error - is it indeed a DMA request? Why does that
> >> operation from downstream device not happen until shutdown?
> >>
> >> I just want to make sure we fix the right problem in the right way.
> >>
> > 
> > Actually the device could keep sending request with no problem during
> > kdump kernel running. Eg. keep sending DMA, and all DMA targets first
> > kernel's system memory, so kdump runs fine as long as nothing touch
> > the reserved crash memory. And the error is reported by the port, when
> > shutdown it has bus master bit, and downstream request will cause
> > error.
> > 
> 
> Problem really is there are active devices while kdump kernel is
> running. You did say earlier - "And in most case the DMA seems
> harmless, as they targets first kernel's memory and kdump kernel only
> live in crash memory.". Even if this holds today, it is going to break
> one of these days. There is the "reset_devices" option but that does not
> work if driver is not loaded by kdump kernel. Can we try to shut down
> devices in machine_crash_shutdown() before we start kdump kernel?

It is not a good idea :)  We do not add extra logic after a panic
because the kernel is not stable and we want a correct vmcore.

Similar suggestions had been rejected a lot of times..

Thanks
Dave
Baoquan He Jan. 17, 2020, 3:46 a.m. UTC | #24
On 01/17/20 at 11:24am, Dave Young wrote:
> On 01/15/20 at 02:17pm, Khalid Aziz wrote:
> > On 1/15/20 11:05 AM, Kairui Song wrote:
> > > On Thu, Jan 16, 2020 at 1:31 AM Khalid Aziz <khalid@gonehiking.org> wrote:
> > >>
> > >> On 1/13/20 10:07 AM, Kairui Song wrote:
> > >>> On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
> > >>>>
> > >>>>> Hi, there are some previous works about this issue, reset PCI devices
> > >>>>> in kdump kernel to stop ongoing DMA:
> > >>>>>
> > >>>>> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> > >>>>> https://lore.kernel.org/patchwork/cover/343767/
> > >>>>>
> > >>>>> [v2] PCI: Reset PCIe devices to stop ongoing DMA
> > >>>>> https://lore.kernel.org/patchwork/patch/379191/
> > >>>>>
> > >>>>> And didn't get merged, that patch are trying to fix some DMAR error
> > >>>>> problem, but resetting devices is a bit too destructive, and the
> > >>>>> problem is later fixed in IOMMU side. And in most case the DMA seems
> > >>>>> harmless, as they targets first kernel's memory and kdump kernel only
> > >>>>> live in crash memory.
> > >>>>
> > >>>> I was going to ask the same. If the kdump kernel had IOMMU on, would
> > >>>> that still be a problem?
> > >>>
> > >>> It will still fail, doing DMA is not a problem, it only go wrong when
> > >>> a device's upstream bridge is mistakenly shutdown before the device
> > >>> shutdown.
> > >>>
> > >>>>
> > >>>>> Also, by the time kdump kernel is able to scan and reset devices,
> > >>>>> there are already a very large time window where things could go
> > >>>>> wrong.
> > >>>>>
> > >>>>> The currently problem observed only happens upon kdump kernel
> > >>>>> shutdown, as the upper bridge is disabled before the device is
> > >>>>> disabledm so DMA will raise error. It's more like a problem of wrong
> > >>>>> device shutting down order.
> > >>>>
> > >>>> The way it was described earlier "During this time, the SUT sometimes
> > >>>> gets a PCI error that raises an NMI." suggests that it isn't really
> > >>>> restricted to kexec/kdump.
> > >>>> Any attached device without an active driver might attempt spurious or
> > >>>> malicious DMA and trigger the same during normal operation.
> > >>>> Do you have available some more reporting of what happens during the
> > >>>> PCIe error handling?
> > >>>
> > >>> Let me add more info about this:
> > >>>
> > >>> On the machine where I can reproduce this issue, the first kernel
> > >>> always runs fine, and kdump kernel works fine during dumping the
> > >>> vmcore, even if I keep the kdump kernel running for hours, nothing
> > >>> goes wrong. If there are DMA during normal operation that will cause
> > >>> problem, this should have exposed it.
> > >>>
> > >>
> > >> This is the part that is puzzling me. Error shows up only when kdump
> > >> kernel is being shut down. kdump kernel can run for hours without this
> > >> issue. What is the operation from downstream device that is resulting in
> > >> uncorrectable error - is it indeed a DMA request? Why does that
> > >> operation from downstream device not happen until shutdown?
> > >>
> > >> I just want to make sure we fix the right problem in the right way.
> > >>
> > > 
> > > Actually the device could keep sending request with no problem during
> > > kdump kernel running. Eg. keep sending DMA, and all DMA targets first
> > > kernel's system memory, so kdump runs fine as long as nothing touch
> > > the reserved crash memory. And the error is reported by the port, when
> > > shutdown it has bus master bit, and downstream request will cause
> > > error.
> > > 
> > 
> > Problem really is there are active devices while kdump kernel is
> > running. You did say earlier - "And in most case the DMA seems
> > harmless, as they targets first kernel's memory and kdump kernel only
> > live in crash memory.". Even if this holds today, it is going to break
> > one of these days. There is the "reset_devices" option but that does not
> > work if driver is not loaded by kdump kernel. Can we try to shut down
> > devices in machine_crash_shutdown() before we start kdump kernel?
> 
> It is not a good idea :)  We do not add extra logic after a panic
> because the kernel is not stable and we want a correct vmcore.
> 
> Similar suggestions had been rejected a lot of times..

Yes.

About this bug, I think HPE may need to check if their firmware/hardware
has special design. I checked other vendors' system which has many
devices under several bridges, this issue is not seen.

Say the issue again, we have been excluding the unneeded devices drivers
from kdump initramfs. This issue is seen first time. And we don't suggest
customer to close IOMMU in kdump kernel if hardware IOMMU is deployed.

So for a system with hardware IOMMU, those devices will go through these
life cycles:

1) devices handling DMA transferring in 1st kernel;
2) after crash, we don't shutdown them. Means they are keeping alive,
and handling DMA transferring;
3) during kdump kernel boot, hardware IOMMU initialization will copy the
old iommu translation table to make the on-flight DMA continue;
4) device driver will initialize the relevant device, to reset it;
   - if no device driver loaded, the device will keep active in the
     whole kdump kernel life cycle
5) when vmcore dumping is done, we try to reboot kdump kernel, shutdown
all devices;

The issue happened in the 5) step. As Kairui provided log, shutting down
the bridge which subordinate devices are not controlled by loaded driver
will cause the UR and NMI, then hang happened.

Thanks
Baoquan
Khalid Aziz Jan. 17, 2020, 3:44 p.m. UTC | #25
On 1/16/20 8:24 PM, Dave Young wrote:
> On 01/15/20 at 02:17pm, Khalid Aziz wrote:
>> On 1/15/20 11:05 AM, Kairui Song wrote:
>>> On Thu, Jan 16, 2020 at 1:31 AM Khalid Aziz <khalid@gonehiking.org> wrote:
>>>>
>>>> On 1/13/20 10:07 AM, Kairui Song wrote:
>>>>> On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, there are some previous works about this issue, reset PCI devices
>>>>>>> in kdump kernel to stop ongoing DMA:
>>>>>>>
>>>>>>> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
>>>>>>> https://lore.kernel.org/patchwork/cover/343767/
>>>>>>>
>>>>>>> [v2] PCI: Reset PCIe devices to stop ongoing DMA
>>>>>>> https://lore.kernel.org/patchwork/patch/379191/
>>>>>>>
>>>>>>> And didn't get merged, that patch are trying to fix some DMAR error
>>>>>>> problem, but resetting devices is a bit too destructive, and the
>>>>>>> problem is later fixed in IOMMU side. And in most case the DMA seems
>>>>>>> harmless, as they targets first kernel's memory and kdump kernel only
>>>>>>> live in crash memory.
>>>>>>
>>>>>> I was going to ask the same. If the kdump kernel had IOMMU on, would
>>>>>> that still be a problem?
>>>>>
>>>>> It will still fail, doing DMA is not a problem, it only go wrong when
>>>>> a device's upstream bridge is mistakenly shutdown before the device
>>>>> shutdown.
>>>>>
>>>>>>
>>>>>>> Also, by the time kdump kernel is able to scan and reset devices,
>>>>>>> there are already a very large time window where things could go
>>>>>>> wrong.
>>>>>>>
>>>>>>> The currently problem observed only happens upon kdump kernel
>>>>>>> shutdown, as the upper bridge is disabled before the device is
>>>>>>> disabledm so DMA will raise error. It's more like a problem of wrong
>>>>>>> device shutting down order.
>>>>>>
>>>>>> The way it was described earlier "During this time, the SUT sometimes
>>>>>> gets a PCI error that raises an NMI." suggests that it isn't really
>>>>>> restricted to kexec/kdump.
>>>>>> Any attached device without an active driver might attempt spurious or
>>>>>> malicious DMA and trigger the same during normal operation.
>>>>>> Do you have available some more reporting of what happens during the
>>>>>> PCIe error handling?
>>>>>
>>>>> Let me add more info about this:
>>>>>
>>>>> On the machine where I can reproduce this issue, the first kernel
>>>>> always runs fine, and kdump kernel works fine during dumping the
>>>>> vmcore, even if I keep the kdump kernel running for hours, nothing
>>>>> goes wrong. If there are DMA during normal operation that will cause
>>>>> problem, this should have exposed it.
>>>>>
>>>>
>>>> This is the part that is puzzling me. Error shows up only when kdump
>>>> kernel is being shut down. kdump kernel can run for hours without this
>>>> issue. What is the operation from downstream device that is resulting in
>>>> uncorrectable error - is it indeed a DMA request? Why does that
>>>> operation from downstream device not happen until shutdown?
>>>>
>>>> I just want to make sure we fix the right problem in the right way.
>>>>
>>>
>>> Actually the device could keep sending request with no problem during
>>> kdump kernel running. Eg. keep sending DMA, and all DMA targets first
>>> kernel's system memory, so kdump runs fine as long as nothing touch
>>> the reserved crash memory. And the error is reported by the port, when
>>> shutdown it has bus master bit, and downstream request will cause
>>> error.
>>>
>>
>> Problem really is there are active devices while kdump kernel is
>> running. You did say earlier - "And in most case the DMA seems
>> harmless, as they targets first kernel's memory and kdump kernel only
>> live in crash memory.". Even if this holds today, it is going to break
>> one of these days. There is the "reset_devices" option but that does not
>> work if driver is not loaded by kdump kernel. Can we try to shut down
>> devices in machine_crash_shutdown() before we start kdump kernel?
> 
> It is not a good idea :)  We do not add extra logic after a panic
> because the kernel is not stable and we want a correct vmcore.
> 

I agree any extra code in panic path opens up door to more trouble. For
kdump kernel if hardware is not in a good shape when it boots up, we are
still in no better place. There may be some room to do minimal and
absolutely essential hardware shutdown in panic path to ensure kdump
kernel can work reliably, but such code has to be approached very
carefully. For this specific problem, it seems to stems from not loading
the same drivers in kdump kernel that were running in previous kernel.
Recommendation for situations like this just might be that one must load
all the same driver and use "reset_devices" option if they want to
ensure stability for kdump kernel. I can see reluctance to load any more
drivers than absolutely needed in kdump kernel, but not doing that has
side-effects as we are seeing in this case.

--
Khalid
Bjorn Helgaas Feb. 22, 2020, 4:56 p.m. UTC | #26
[+cc Khalid, Deepa, Randy, Dave, Myron]

On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> There are reports about kdump hang upon reboot on some HPE machines,
> kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> error occurred and crashed the system.

Did we ever make progress on this?  This definitely sounds like a
problem that needs to be fixed, but I don't see a resolution here.

> On the machine I can reproduce this issue, part of the topology
> looks like this:
> 
> [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>           +-01.0-[02]--
>           +-01.1-[05]--
>           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
>           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
>           +-02.1-[0f]--
>           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
> 
> When shuting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> will hang, depend on which device is reinitialized in kdump kernel.
> 
> If force remove unused device then trigger kdump, the problem will never
> happen:
> 
>     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
>     echo c > /proc/sysrq-trigger
> 
>     ... Kdump save vmcore through network, the NIC get reinitialized and
>     hpsa is untouched. Then reboot with no problem. (If hpsa is used
>     instead, shutdown the NIC in first kernel will help)
> 
> The cause is that some devices are enabled by the first kernel, but it
> don't have the chance to shutdown the device, and kdump kernel is not
> aware of it, unless it reinitialize the device.
> 
> Upon reboot, kdump kernel will skip downstream device shutdown and
> clears its bridge's master bit directly. The downstream device could
> error out as it can still send requests but upstream refuses it.
> 
> So for kdump, let kernel read the correct hardware power state on boot,
> and always clear the bus master bit of PCI device upon shutdown if the
> device is on. PCIe port driver will always shutdown all downstream
> devices first, so this should ensure all downstream devices have bus
> master bit off before clearing the bridge's bus master bit.
> 
> Signed-off-by: Kairui Song <kasong@redhat.com>
> ---
>  drivers/pci/pci-driver.c | 11 ++++++++---
>  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
>  2 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 0454ca0e4e3f..84a7fd643b4d 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -18,6 +18,7 @@
>  #include <linux/kexec.h>
>  #include <linux/of_device.h>
>  #include <linux/acpi.h>
> +#include <linux/crash_dump.h>
>  #include "pci.h"
>  #include "pcie/portdrv.h"
>  
> @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
>  	 * If this is a kexec reboot, turn off Bus Master bit on the
>  	 * device to tell it to not continue to do DMA. Don't touch
>  	 * devices in D3cold or unknown states.
> -	 * If it is not a kexec reboot, firmware will hit the PCI
> -	 * devices with big hammer and stop their DMA any way.
> +	 * If this is kdump kernel, also turn off Bus Master, the device
> +	 * could be activated by previous crashed kernel and may block
> +	 * it's upstream from shutting down.
> +	 * Else, firmware will hit the PCI devices with big hammer
> +	 * and stop their DMA any way.
>  	 */
> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> +	if ((kexec_in_progress || is_kdump_kernel()) &&
> +			pci_dev->current_state <= PCI_D3hot)
>  		pci_clear_master(pci_dev);
>  }
>  
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4937a088d7d8..c65d11ab3939 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -28,6 +28,7 @@
>  #include <linux/platform_data/x86/apple.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/switchtec.h>
> +#include <linux/crash_dump.h>
>  #include <asm/dma.h>	/* isa_dma_bridge_buggy */
>  #include "pci.h"
>  
> @@ -192,6 +193,25 @@ static int __init pci_apply_final_quirks(void)
>  }
>  fs_initcall_sync(pci_apply_final_quirks);
>  
> +/*
> + * Read the device state even if it's not enabled. The device could be
> + * activated by previous crashed kernel, this will read and correct the
> + * cached state.
> + */
> +static void quirk_read_pm_state_in_kdump(struct pci_dev *dev)
> +{
> +	u16 pmcsr;
> +
> +	if (!is_kdump_kernel())
> +		return;
> +
> +	if (dev->pm_cap) {
> +		pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> +		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
> +	}
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_read_pm_state_in_kdump);
> +
>  /*
>   * Decoding should be disabled for a PCI device during BAR sizing to avoid
>   * conflict. But doing so may cause problems on host bridge and perhaps other
> -- 
> 2.24.1
>
Dave Young Feb. 24, 2020, 4:56 a.m. UTC | #27
Hi Bjorn,
On 02/22/20 at 10:56am, Bjorn Helgaas wrote:
> [+cc Khalid, Deepa, Randy, Dave, Myron]
> 
> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> > There are reports about kdump hang upon reboot on some HPE machines,
> > kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> > error occurred and crashed the system.
> 
> Did we ever make progress on this?  This definitely sounds like a
> problem that needs to be fixed, but I don't see a resolution here.
> 

I'm not familar with the PCI details,  but if this only adds a quirk for
kdump use and no other risks added then it should be good to have.

Or we can provide a kernel parameter for the quirk?  Then it is even
limited to only be effective when in-kdump && the-param-is-used

Anyway still prefer to people who know more about this area to evaluate
the risk.

Thanks
Dave
Kairui Song Feb. 24, 2020, 5:30 p.m. UTC | #28
Hi,

Thanks for the reply, I don't have any better idea than this RFC patch
yet. The patch is hold as previous discussion suggests this just work
around the problem, the real fix should be let crash kernel load every
required kernel module and reset whichever hardware that is not in a
good status. However, user may struggle to find out which driver is
actually needed, and it's not practical to load all drivers in kdump
kernel. (actually kdump have been trying to load as less driver as
possible to save memory).

So as Dave Y suggested in another reply, will it better to apply this
quirk with a kernel param controlling it? If such problem happens, the
option could be turned on as a fix.


On Sun, Feb 23, 2020 at 12:59 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Khalid, Deepa, Randy, Dave, Myron]
>
> On Thu, Dec 26, 2019 at 03:21:18AM +0800, Kairui Song wrote:
> > There are reports about kdump hang upon reboot on some HPE machines,
> > kernel hanged when trying to shutdown a PCIe port, an uncorrectable
> > error occurred and crashed the system.
>
> Did we ever make progress on this?  This definitely sounds like a
> problem that needs to be fixed, but I don't see a resolution here.
>
> > On the machine I can reproduce this issue, part of the topology
> > looks like this:
> >
> > [0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
> >           +-01.0-[02]--
> >           +-01.1-[05]--
> >           +-02.0-[06]--+-00.0  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.1  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.2  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.3  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.4  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.5  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            +-00.6  Emulex Corporation OneConnect NIC (Skyhawk)
> >           |            \-00.7  Emulex Corporation OneConnect NIC (Skyhawk)
> >           +-02.1-[0f]--
> >           +-02.2-[07]----00.0  Hewlett-Packard Company Smart Array Gen9 Controllers
> >
> > When shuting down PCIe port 0000:00:02.2 or 0000:00:02.0, the machine
> > will hang, depend on which device is reinitialized in kdump kernel.
> >
> > If force remove unused device then trigger kdump, the problem will never
> > happen:
> >
> >     echo 1 > /sys/bus/pci/devices/0000\:00\:02.2/0000\:07\:00.0/remove
> >     echo c > /proc/sysrq-trigger
> >
> >     ... Kdump save vmcore through network, the NIC get reinitialized and
> >     hpsa is untouched. Then reboot with no problem. (If hpsa is used
> >     instead, shutdown the NIC in first kernel will help)
> >
> > The cause is that some devices are enabled by the first kernel, but it
> > don't have the chance to shutdown the device, and kdump kernel is not
> > aware of it, unless it reinitialize the device.
> >
> > Upon reboot, kdump kernel will skip downstream device shutdown and
> > clears its bridge's master bit directly. The downstream device could
> > error out as it can still send requests but upstream refuses it.
> >
> > So for kdump, let kernel read the correct hardware power state on boot,
> > and always clear the bus master bit of PCI device upon shutdown if the
> > device is on. PCIe port driver will always shutdown all downstream
> > devices first, so this should ensure all downstream devices have bus
> > master bit off before clearing the bridge's bus master bit.
> >
> > Signed-off-by: Kairui Song <kasong@redhat.com>
> > ---
> >  drivers/pci/pci-driver.c | 11 ++++++++---
> >  drivers/pci/quirks.c     | 20 ++++++++++++++++++++
> >  2 files changed, 28 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > index 0454ca0e4e3f..84a7fd643b4d 100644
> > --- a/drivers/pci/pci-driver.c
> > +++ b/drivers/pci/pci-driver.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/kexec.h>
> >  #include <linux/of_device.h>
> >  #include <linux/acpi.h>
> > +#include <linux/crash_dump.h>
> >  #include "pci.h"
> >  #include "pcie/portdrv.h"
> >
> > @@ -488,10 +489,14 @@ static void pci_device_shutdown(struct device *dev)
> >        * If this is a kexec reboot, turn off Bus Master bit on the
> >        * device to tell it to not continue to do DMA. Don't touch
> >        * devices in D3cold or unknown states.
> > -      * If it is not a kexec reboot, firmware will hit the PCI
> > -      * devices with big hammer and stop their DMA any way.
> > +      * If this is kdump kernel, also turn off Bus Master, the device
> > +      * could be activated by previous crashed kernel and may block
> > +      * it's upstream from shutting down.
> > +      * Else, firmware will hit the PCI devices with big hammer
> > +      * and stop their DMA any way.
> >        */
> > -     if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> > +     if ((kexec_in_progress || is_kdump_kernel()) &&
> > +                     pci_dev->current_state <= PCI_D3hot)
> >               pci_clear_master(pci_dev);
> >  }
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 4937a088d7d8..c65d11ab3939 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -28,6 +28,7 @@
> >  #include <linux/platform_data/x86/apple.h>
> >  #include <linux/pm_runtime.h>
> >  #include <linux/switchtec.h>
> > +#include <linux/crash_dump.h>
> >  #include <asm/dma.h> /* isa_dma_bridge_buggy */
> >  #include "pci.h"
> >
> > @@ -192,6 +193,25 @@ static int __init pci_apply_final_quirks(void)
> >  }
> >  fs_initcall_sync(pci_apply_final_quirks);
> >
> > +/*
> > + * Read the device state even if it's not enabled. The device could be
> > + * activated by previous crashed kernel, this will read and correct the
> > + * cached state.
> > + */
> > +static void quirk_read_pm_state_in_kdump(struct pci_dev *dev)
> > +{
> > +     u16 pmcsr;
> > +
> > +     if (!is_kdump_kernel())
> > +             return;
> > +
> > +     if (dev->pm_cap) {
> > +             pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > +             dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
> > +     }
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_read_pm_state_in_kdump);
> > +
> >  /*
> >   * Decoding should be disabled for a PCI device during BAR sizing to avoid
> >   * conflict. But doing so may cause problems on host bridge and perhaps other
> > --
> > 2.24.1
> >
>


--
Best Regards,
Kairui Song
Deepa Dinamani Feb. 28, 2020, 7:53 p.m. UTC | #29
I think the quirk clearing BME might not solve all the problems, since
endpoint devices can sometimes not honor BME settings.

Better solution to me seems like not letting devices that we cannot
identify send us interrupts. I was working on a patch for this. I'm on
vacation for a couple of weeks and can post an update after I test out
my patch internally.

-Deepa
Deepa Dinamani March 3, 2020, 9:01 p.m. UTC | #30
I looked at this some more. Looks like we do not clear irqs when we do
a kexec reboot. And, the bootup code maintains the same table for the
kexec-ed kernel. I'm looking at the following code in
intel_irq_remapping.c:

        if (ir_pre_enabled(iommu)) {
                if (!is_kdump_kernel()) {
                        pr_warn("IRQ remapping was enabled on %s but
we are not in kdump mode\n",
                                iommu->name);
                        clear_ir_pre_enabled(iommu);
                        iommu_disable_irq_remapping(iommu);
                } else if (iommu_load_old_irte(iommu))
                        pr_err("Failed to copy IR table for %s from
previous kernel\n",
                               iommu->name);
                else
                        pr_info("Copied IR table for %s from previous kernel\n",
                                iommu->name);
        }

Would cleaning the interrupts(like in the non kdump path above) just
before shutdown help here? This should clear the interrupts enabled
for all the devices in the current kernel. So when kdump kernel
starts, it starts clean. This should probably help block out the
interrupts from a device that does not have a driver.

-Deepa
Baoquan He March 5, 2020, 3:53 a.m. UTC | #31
+Joerg to CC.

On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> I looked at this some more. Looks like we do not clear irqs when we do
> a kexec reboot. And, the bootup code maintains the same table for the
> kexec-ed kernel. I'm looking at the following code in

I guess you are talking about kdump reboot here, right? Kexec and kdump
boot take the similar mechanism, but differ a little.

> intel_irq_remapping.c:
> 
>         if (ir_pre_enabled(iommu)) {
>                 if (!is_kdump_kernel()) {
>                         pr_warn("IRQ remapping was enabled on %s but
> we are not in kdump mode\n",
>                                 iommu->name);
>                         clear_ir_pre_enabled(iommu);
>                         iommu_disable_irq_remapping(iommu);
>                 } else if (iommu_load_old_irte(iommu))

Here, it's for kdump kernel to copy old ir table from 1st kernel.

>                         pr_err("Failed to copy IR table for %s from
> previous kernel\n",
>                                iommu->name);
>                 else
>                         pr_info("Copied IR table for %s from previous kernel\n",
>                                 iommu->name);
>         }
> 
> Would cleaning the interrupts(like in the non kdump path above) just
> before shutdown help here? This should clear the interrupts enabled
> for all the devices in the current kernel. So when kdump kernel
> starts, it starts clean. This should probably help block out the
> interrupts from a device that does not have a driver.

I think stopping those devices out of control from continue sending
interrupts is a good idea. While not sure if only clearing the interrupt 
will be enough. Those devices which will be initialized by their driver
will brake, but devices which drivers are not loaded into kdump kernel
may continue acting. Even though interrupts are cleaning at this time,
the on-flight DMA could continue triggerring interrupt since the ir
table and iopage table are rebuilt.
Deepa Dinamani March 5, 2020, 4:53 a.m. UTC | #32
On Wed, Mar 4, 2020 at 7:53 PM Baoquan He <bhe@redhat.com> wrote:
>
> +Joerg to CC.
>
> On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > I looked at this some more. Looks like we do not clear irqs when we do
> > a kexec reboot. And, the bootup code maintains the same table for the
> > kexec-ed kernel. I'm looking at the following code in
>
> I guess you are talking about kdump reboot here, right? Kexec and kdump
> boot take the similar mechanism, but differ a little.

Right I meant kdump kernel here. And, clearly the is_kdump_kernel() case below.

>
> > intel_irq_remapping.c:
> >
> >         if (ir_pre_enabled(iommu)) {
> >                 if (!is_kdump_kernel()) {
> >                         pr_warn("IRQ remapping was enabled on %s but
> > we are not in kdump mode\n",
> >                                 iommu->name);
> >                         clear_ir_pre_enabled(iommu);
> >                         iommu_disable_irq_remapping(iommu);
> >                 } else if (iommu_load_old_irte(iommu))
>
> Here, it's for kdump kernel to copy old ir table from 1st kernel.

Correct.

> >                         pr_err("Failed to copy IR table for %s from
> > previous kernel\n",
> >                                iommu->name);
> >                 else
> >                         pr_info("Copied IR table for %s from previous kernel\n",
> >                                 iommu->name);
> >         }
> >
> > Would cleaning the interrupts(like in the non kdump path above) just
> > before shutdown help here? This should clear the interrupts enabled
> > for all the devices in the current kernel. So when kdump kernel
> > starts, it starts clean. This should probably help block out the
> > interrupts from a device that does not have a driver.
>
> I think stopping those devices out of control from continue sending
> interrupts is a good idea. While not sure if only clearing the interrupt
> will be enough. Those devices which will be initialized by their driver
> will brake, but devices which drivers are not loaded into kdump kernel
> may continue acting. Even though interrupts are cleaning at this time,
> the on-flight DMA could continue triggerring interrupt since the ir
> table and iopage table are rebuilt.

This should be handled by the IOMMU, right? And, hence you are getting
UR. This seems like the correct execution flow to me.

Anyway, you could just test this theory by removing the
is_kdump_kernel() check above and see if it solves your problem.
Obviously, check the VT-d spec to figure out the exact sequence to
turn off the IR.

Note that the device that is causing the problem here is a legit
device. We want to have interrupts from devices we don't know about
blocked anyway because we can have compromised firmware/ devices that
could cause a DoS attack. So blocking the unwanted interrupts seems
like the right thing to do here.

-Deepa
Deepa Dinamani March 5, 2020, 6:06 a.m. UTC | #33
On Wed, Mar 4, 2020 at 8:53 PM Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 7:53 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > +Joerg to CC.
> >
> > On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > > I looked at this some more. Looks like we do not clear irqs when we do
> > > a kexec reboot. And, the bootup code maintains the same table for the
> > > kexec-ed kernel. I'm looking at the following code in
> >
> > I guess you are talking about kdump reboot here, right? Kexec and kdump
> > boot take the similar mechanism, but differ a little.
>
> Right I meant kdump kernel here. And, clearly the is_kdump_kernel() case below.
>
> >
> > > intel_irq_remapping.c:
> > >
> > >         if (ir_pre_enabled(iommu)) {
> > >                 if (!is_kdump_kernel()) {
> > >                         pr_warn("IRQ remapping was enabled on %s but
> > > we are not in kdump mode\n",
> > >                                 iommu->name);
> > >                         clear_ir_pre_enabled(iommu);
> > >                         iommu_disable_irq_remapping(iommu);
> > >                 } else if (iommu_load_old_irte(iommu))
> >
> > Here, it's for kdump kernel to copy old ir table from 1st kernel.
>
> Correct.
>
> > >                         pr_err("Failed to copy IR table for %s from
> > > previous kernel\n",
> > >                                iommu->name);
> > >                 else
> > >                         pr_info("Copied IR table for %s from previous kernel\n",
> > >                                 iommu->name);
> > >         }
> > >
> > > Would cleaning the interrupts(like in the non kdump path above) just
> > > before shutdown help here? This should clear the interrupts enabled
> > > for all the devices in the current kernel. So when kdump kernel
> > > starts, it starts clean. This should probably help block out the
> > > interrupts from a device that does not have a driver.
> >
> > I think stopping those devices out of control from continue sending
> > interrupts is a good idea. While not sure if only clearing the interrupt
> > will be enough. Those devices which will be initialized by their driver
> > will brake, but devices which drivers are not loaded into kdump kernel
> > may continue acting. Even though interrupts are cleaning at this time,
> > the on-flight DMA could continue triggerring interrupt since the ir
> > table and iopage table are rebuilt.
>
> This should be handled by the IOMMU, right? And, hence you are getting
> UR. This seems like the correct execution flow to me.

One small correction, I meant the IOMMU and BME here.
Baoquan He March 6, 2020, 9:38 a.m. UTC | #34
On 03/04/20 at 08:53pm, Deepa Dinamani wrote:
> On Wed, Mar 4, 2020 at 7:53 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > +Joerg to CC.
> >
> > On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > > I looked at this some more. Looks like we do not clear irqs when we do
> > > a kexec reboot. And, the bootup code maintains the same table for the
> > > kexec-ed kernel. I'm looking at the following code in
> >
> > I guess you are talking about kdump reboot here, right? Kexec and kdump
> > boot take the similar mechanism, but differ a little.
> 
> Right I meant kdump kernel here. And, clearly the is_kdump_kernel() case below.
> 
> >
> > > intel_irq_remapping.c:
> > >
> > >         if (ir_pre_enabled(iommu)) {
> > >                 if (!is_kdump_kernel()) {
> > >                         pr_warn("IRQ remapping was enabled on %s but
> > > we are not in kdump mode\n",
> > >                                 iommu->name);
> > >                         clear_ir_pre_enabled(iommu);
> > >                         iommu_disable_irq_remapping(iommu);
> > >                 } else if (iommu_load_old_irte(iommu))
> >
> > Here, it's for kdump kernel to copy old ir table from 1st kernel.
> 
> Correct.
> 
> > >                         pr_err("Failed to copy IR table for %s from
> > > previous kernel\n",
> > >                                iommu->name);
> > >                 else
> > >                         pr_info("Copied IR table for %s from previous kernel\n",
> > >                                 iommu->name);
> > >         }
> > >
> > > Would cleaning the interrupts(like in the non kdump path above) just
> > > before shutdown help here? This should clear the interrupts enabled
> > > for all the devices in the current kernel. So when kdump kernel
> > > starts, it starts clean. This should probably help block out the
> > > interrupts from a device that does not have a driver.
> >
> > I think stopping those devices out of control from continue sending
> > interrupts is a good idea. While not sure if only clearing the interrupt
> > will be enough. Those devices which will be initialized by their driver
> > will brake, but devices which drivers are not loaded into kdump kernel
> > may continue acting. Even though interrupts are cleaning at this time,
> > the on-flight DMA could continue triggerring interrupt since the ir
> > table and iopage table are rebuilt.
> 
> This should be handled by the IOMMU, right? And, hence you are getting
> UR. This seems like the correct execution flow to me.

Sorry for late reply.
Yes, this is initializing IOMMU device.

> 
> Anyway, you could just test this theory by removing the
> is_kdump_kernel() check above and see if it solves your problem.
> Obviously, check the VT-d spec to figure out the exact sequence to
> turn off the IR.

OK, I will talk to Kairui and get a machine to test it. Thanks for your
nice idea, if you have a draft patch, we are happy to test it.

> 
> Note that the device that is causing the problem here is a legit
> device. We want to have interrupts from devices we don't know about
> blocked anyway because we can have compromised firmware/ devices that
> could cause a DoS attack. So blocking the unwanted interrupts seems
> like the right thing to do here.

Kairui said it's a device which driver is not loaded in kdump kernel
because it's not needed by kdump. We try to only load kernel modules
which are needed, e.g one device is the dump target, its driver has to
be loaded in. In this case, the device is more like a out of control
device to kdump kernel.
Kairui Song July 22, 2020, 2:52 p.m. UTC | #35
On Fri, Mar 6, 2020 at 5:38 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 03/04/20 at 08:53pm, Deepa Dinamani wrote:
> > On Wed, Mar 4, 2020 at 7:53 PM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > +Joerg to CC.
> > >
> > > On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > > > I looked at this some more. Looks like we do not clear irqs when we do
> > > > a kexec reboot. And, the bootup code maintains the same table for the
> > > > kexec-ed kernel. I'm looking at the following code in
> > >
> > > I guess you are talking about kdump reboot here, right? Kexec and kdump
> > > boot take the similar mechanism, but differ a little.
> >
> > Right I meant kdump kernel here. And, clearly the is_kdump_kernel() case below.
> >
> > >
> > > > intel_irq_remapping.c:
> > > >
> > > >         if (ir_pre_enabled(iommu)) {
> > > >                 if (!is_kdump_kernel()) {
> > > >                         pr_warn("IRQ remapping was enabled on %s but
> > > > we are not in kdump mode\n",
> > > >                                 iommu->name);
> > > >                         clear_ir_pre_enabled(iommu);
> > > >                         iommu_disable_irq_remapping(iommu);
> > > >                 } else if (iommu_load_old_irte(iommu))
> > >
> > > Here, it's for kdump kernel to copy old ir table from 1st kernel.
> >
> > Correct.
> >
> > > >                         pr_err("Failed to copy IR table for %s from
> > > > previous kernel\n",
> > > >                                iommu->name);
> > > >                 else
> > > >                         pr_info("Copied IR table for %s from previous kernel\n",
> > > >                                 iommu->name);
> > > >         }
> > > >
> > > > Would cleaning the interrupts(like in the non kdump path above) just
> > > > before shutdown help here? This should clear the interrupts enabled
> > > > for all the devices in the current kernel. So when kdump kernel
> > > > starts, it starts clean. This should probably help block out the
> > > > interrupts from a device that does not have a driver.
> > >
> > > I think stopping those devices out of control from continue sending
> > > interrupts is a good idea. While not sure if only clearing the interrupt
> > > will be enough. Those devices which will be initialized by their driver
> > > will brake, but devices which drivers are not loaded into kdump kernel
> > > may continue acting. Even though interrupts are cleaning at this time,
> > > the on-flight DMA could continue triggerring interrupt since the ir
> > > table and iopage table are rebuilt.
> >
> > This should be handled by the IOMMU, right? And, hence you are getting
> > UR. This seems like the correct execution flow to me.
>
> Sorry for late reply.
> Yes, this is initializing IOMMU device.
>
> >
> > Anyway, you could just test this theory by removing the
> > is_kdump_kernel() check above and see if it solves your problem.
> > Obviously, check the VT-d spec to figure out the exact sequence to
> > turn off the IR.
>
> OK, I will talk to Kairui and get a machine to test it. Thanks for your
> nice idea, if you have a draft patch, we are happy to test it.
>
> >
> > Note that the device that is causing the problem here is a legit
> > device. We want to have interrupts from devices we don't know about
> > blocked anyway because we can have compromised firmware/ devices that
> > could cause a DoS attack. So blocking the unwanted interrupts seems
> > like the right thing to do here.
>
> Kairui said it's a device which driver is not loaded in kdump kernel
> because it's not needed by kdump. We try to only load kernel modules
> which are needed, e.g one device is the dump target, its driver has to
> be loaded in. In this case, the device is more like a out of control
> device to kdump kernel.
>

Hi Bao, Deepa, sorry for this very late response. The test machine was
not available for sometime, and I restarted to work on this problem.

For the workaround mention by Deepa (by remote the is_kdump_kernel()
check), it didn't work, the machine still hangs upon shutdown.
The devices that were left in an unknown state and sending interrupt
could be a problem, but it's irrelevant to this hanging problem.

I think I didn't make one thing clear, The PCI UR error never arrives
in kernel, it's the iLo BMC on that HPE machine caught the error, and
send kernel an NMI. kernel is panicked by NMI, I'm still trying to
figure out why the NMI hanged kernel, even with panic=-1,
panic_on_io_nmi, panic_on_unknown_nmi all set. But if we can avoid the
NMI by shutdown the devices in right order, that's also a solution.

--
Best Regards,
Kairui Song
Bjorn Helgaas July 22, 2020, 3:21 p.m. UTC | #36
On Wed, Jul 22, 2020 at 10:52:26PM +0800, Kairui Song wrote:
> On Fri, Mar 6, 2020 at 5:38 PM Baoquan He <bhe@redhat.com> wrote:
> > On 03/04/20 at 08:53pm, Deepa Dinamani wrote:
> > > On Wed, Mar 4, 2020 at 7:53 PM Baoquan He <bhe@redhat.com> wrote:
> > > > On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > > > > I looked at this some more. Looks like we do not clear irqs
> > > > > when we do a kexec reboot. And, the bootup code maintains
> > > > > the same table for the kexec-ed kernel. I'm looking at the
> > > > > following code in
> > > >
> > > > I guess you are talking about kdump reboot here, right? Kexec
> > > > and kdump boot take the similar mechanism, but differ a
> > > > little.
> > >
> > > Right I meant kdump kernel here. And, clearly the
> > > is_kdump_kernel() case below.
> > >
> > > > > intel_irq_remapping.c:
> > > > >
> > > > >         if (ir_pre_enabled(iommu)) {
> > > > >                 if (!is_kdump_kernel()) {
> > > > >                         pr_warn("IRQ remapping was enabled on %s but
> > > > > we are not in kdump mode\n",
> > > > >                                 iommu->name);
> > > > >                         clear_ir_pre_enabled(iommu);
> > > > >                         iommu_disable_irq_remapping(iommu);
> > > > >                 } else if (iommu_load_old_irte(iommu))
> > > >
> > > > Here, it's for kdump kernel to copy old ir table from 1st kernel.
> > >
> > > Correct.
> > >
> > > > >                         pr_err("Failed to copy IR table for %s from
> > > > > previous kernel\n",
> > > > >                                iommu->name);
> > > > >                 else
> > > > >                         pr_info("Copied IR table for %s from previous kernel\n",
> > > > >                                 iommu->name);
> > > > >         }
> > > > >
> > > > > Would cleaning the interrupts(like in the non kdump path
> > > > > above) just before shutdown help here? This should clear the
> > > > > interrupts enabled for all the devices in the current
> > > > > kernel. So when kdump kernel starts, it starts clean. This
> > > > > should probably help block out the interrupts from a device
> > > > > that does not have a driver.
> > > >
> > > > I think stopping those devices out of control from continue
> > > > sending interrupts is a good idea. While not sure if only
> > > > clearing the interrupt will be enough. Those devices which
> > > > will be initialized by their driver will brake, but devices
> > > > which drivers are not loaded into kdump kernel may continue
> > > > acting. Even though interrupts are cleaning at this time, the
> > > > on-flight DMA could continue triggerring interrupt since the
> > > > ir table and iopage table are rebuilt.
> > >
> > > This should be handled by the IOMMU, right? And, hence you are
> > > getting UR. This seems like the correct execution flow to me.
> >
> > Sorry for late reply.
> > Yes, this is initializing IOMMU device.
> >
> > > Anyway, you could just test this theory by removing the
> > > is_kdump_kernel() check above and see if it solves your problem.
> > > Obviously, check the VT-d spec to figure out the exact sequence to
> > > turn off the IR.
> >
> > OK, I will talk to Kairui and get a machine to test it. Thanks for your
> > nice idea, if you have a draft patch, we are happy to test it.
> >
> > > Note that the device that is causing the problem here is a legit
> > > device. We want to have interrupts from devices we don't know about
> > > blocked anyway because we can have compromised firmware/ devices that
> > > could cause a DoS attack. So blocking the unwanted interrupts seems
> > > like the right thing to do here.
> >
> > Kairui said it's a device which driver is not loaded in kdump kernel
> > because it's not needed by kdump. We try to only load kernel modules
> > which are needed, e.g one device is the dump target, its driver has to
> > be loaded in. In this case, the device is more like a out of control
> > device to kdump kernel.
> 
> Hi Bao, Deepa, sorry for this very late response. The test machine was
> not available for sometime, and I restarted to work on this problem.
> 
> For the workaround mention by Deepa (by remote the is_kdump_kernel()
> check), it didn't work, the machine still hangs upon shutdown.
> The devices that were left in an unknown state and sending interrupt
> could be a problem, but it's irrelevant to this hanging problem.
> 
> I think I didn't make one thing clear, The PCI UR error never arrives
> in kernel, it's the iLo BMC on that HPE machine caught the error, and
> send kernel an NMI. kernel is panicked by NMI, I'm still trying to
> figure out why the NMI hanged kernel, even with panic=-1,
> panic_on_io_nmi, panic_on_unknown_nmi all set. But if we can avoid the
> NMI by shutdown the devices in right order, that's also a solution.

I'm not sure how much sympathy to have for this situation.  A PCIe UR
is fatal for the transaction and maybe even the device, but from the
overall system point of view, it *should* be a recoverable error and
we shouldn't panic.

Errors like that should be reported via the normal AER or ACPI/APEI
mechanisms.  It sounds like in this case, the platform has decided
these aren't enough and it is trying to force a reboot?  If this is
"special" platform behavior, I'm not sure how much we need to cater
for it.

Bjorn
Jerry Hoemann July 22, 2020, 9:50 p.m. UTC | #37
On Wed, Jul 22, 2020 at 10:21:23AM -0500, Bjorn Helgaas wrote:
> On Wed, Jul 22, 2020 at 10:52:26PM +0800, Kairui Song wrote:
> > On Fri, Mar 6, 2020 at 5:38 PM Baoquan He <bhe@redhat.com> wrote:
> > > On 03/04/20 at 08:53pm, Deepa Dinamani wrote:
> > > > On Wed, Mar 4, 2020 at 7:53 PM Baoquan He <bhe@redhat.com> wrote:
> > > > > On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > > > > > I looked at this some more. Looks like we do not clear irqs
> > > > > > when we do a kexec reboot. And, the bootup code maintains
> > > > > > the same table for the kexec-ed kernel. I'm looking at the
> > > > > > following code in
> > > > >
> > > > > I guess you are talking about kdump reboot here, right? Kexec
> > > > > and kdump boot take the similar mechanism, but differ a
> > > > > little.
> > > >
> > > > Right I meant kdump kernel here. And, clearly the
> > > > is_kdump_kernel() case below.
> > > >
> > > > > > intel_irq_remapping.c:
> > > > > >
> > > > > >         if (ir_pre_enabled(iommu)) {
> > > > > >                 if (!is_kdump_kernel()) {
> > > > > >                         pr_warn("IRQ remapping was enabled on %s but
> > > > > > we are not in kdump mode\n",
> > > > > >                                 iommu->name);
> > > > > >                         clear_ir_pre_enabled(iommu);
> > > > > >                         iommu_disable_irq_remapping(iommu);
> > > > > >                 } else if (iommu_load_old_irte(iommu))
> > > > >
> > > > > Here, it's for kdump kernel to copy old ir table from 1st kernel.
> > > >
> > > > Correct.
> > > >
> > > > > >                         pr_err("Failed to copy IR table for %s from
> > > > > > previous kernel\n",
> > > > > >                                iommu->name);
> > > > > >                 else
> > > > > >                         pr_info("Copied IR table for %s from previous kernel\n",
> > > > > >                                 iommu->name);
> > > > > >         }
> > > > > >
> > > > > > Would cleaning the interrupts(like in the non kdump path
> > > > > > above) just before shutdown help here? This should clear the
> > > > > > interrupts enabled for all the devices in the current
> > > > > > kernel. So when kdump kernel starts, it starts clean. This
> > > > > > should probably help block out the interrupts from a device
> > > > > > that does not have a driver.
> > > > >
> > > > > I think stopping those devices out of control from continue
> > > > > sending interrupts is a good idea. While not sure if only
> > > > > clearing the interrupt will be enough. Those devices which
> > > > > will be initialized by their driver will brake, but devices
> > > > > which drivers are not loaded into kdump kernel may continue
> > > > > acting. Even though interrupts are cleaning at this time, the
> > > > > on-flight DMA could continue triggerring interrupt since the
> > > > > ir table and iopage table are rebuilt.
> > > >
> > > > This should be handled by the IOMMU, right? And, hence you are
> > > > getting UR. This seems like the correct execution flow to me.
> > >
> > > Sorry for late reply.
> > > Yes, this is initializing IOMMU device.
> > >
> > > > Anyway, you could just test this theory by removing the
> > > > is_kdump_kernel() check above and see if it solves your problem.
> > > > Obviously, check the VT-d spec to figure out the exact sequence to
> > > > turn off the IR.
> > >
> > > OK, I will talk to Kairui and get a machine to test it. Thanks for your
> > > nice idea, if you have a draft patch, we are happy to test it.
> > >
> > > > Note that the device that is causing the problem here is a legit
> > > > device. We want to have interrupts from devices we don't know about
> > > > blocked anyway because we can have compromised firmware/ devices that
> > > > could cause a DoS attack. So blocking the unwanted interrupts seems
> > > > like the right thing to do here.
> > >
> > > Kairui said it's a device which driver is not loaded in kdump kernel
> > > because it's not needed by kdump. We try to only load kernel modules
> > > which are needed, e.g one device is the dump target, its driver has to
> > > be loaded in. In this case, the device is more like a out of control
> > > device to kdump kernel.
> > 
> > Hi Bao, Deepa, sorry for this very late response. The test machine was
> > not available for sometime, and I restarted to work on this problem.
> > 
> > For the workaround mention by Deepa (by remote the is_kdump_kernel()
> > check), it didn't work, the machine still hangs upon shutdown.
> > The devices that were left in an unknown state and sending interrupt
> > could be a problem, but it's irrelevant to this hanging problem.
> > 
> > I think I didn't make one thing clear, The PCI UR error never arrives
> > in kernel, it's the iLo BMC on that HPE machine caught the error, and
> > send kernel an NMI. kernel is panicked by NMI, I'm still trying to
> > figure out why the NMI hanged kernel, even with panic=-1,
> > panic_on_io_nmi, panic_on_unknown_nmi all set. But if we can avoid the
> > NMI by shutdown the devices in right order, that's also a solution.
> 
> I'm not sure how much sympathy to have for this situation.  A PCIe UR
> is fatal for the transaction and maybe even the device, but from the
> overall system point of view, it *should* be a recoverable error and
> we shouldn't panic.
> 
> Errors like that should be reported via the normal AER or ACPI/APEI
> mechanisms.  It sounds like in this case, the platform has decided
> these aren't enough and it is trying to force a reboot?  If this is
> "special" platform behavior, I'm not sure how much we need to cater
> for it.
> 

Are these AER errors the type processed by the GHES code?

I'll note that RedHat runs their crash kernel with:  hest_disable.
So, the ghes code is disabled in the crash kernel.
Bjorn Helgaas July 23, 2020, midnight UTC | #38
On Wed, Jul 22, 2020 at 03:50:48PM -0600, Jerry Hoemann wrote:
> On Wed, Jul 22, 2020 at 10:21:23AM -0500, Bjorn Helgaas wrote:
> > On Wed, Jul 22, 2020 at 10:52:26PM +0800, Kairui Song wrote:

> > > I think I didn't make one thing clear, The PCI UR error never arrives
> > > in kernel, it's the iLo BMC on that HPE machine caught the error, and
> > > send kernel an NMI. kernel is panicked by NMI, I'm still trying to
> > > figure out why the NMI hanged kernel, even with panic=-1,
> > > panic_on_io_nmi, panic_on_unknown_nmi all set. But if we can avoid the
> > > NMI by shutdown the devices in right order, that's also a solution.

ACPI v6.3, chapter 18, does mention NMIs several times, e.g., Table
18-394 and sec 18.4.  I'm not familiar enough with APEI to know
whether Linux correctly supports all those cases.  Maybe this is a
symptom that we don't?

> > I'm not sure how much sympathy to have for this situation.  A PCIe UR
> > is fatal for the transaction and maybe even the device, but from the
> > overall system point of view, it *should* be a recoverable error and
> > we shouldn't panic.
> > 
> > Errors like that should be reported via the normal AER or ACPI/APEI
> > mechanisms.  It sounds like in this case, the platform has decided
> > these aren't enough and it is trying to force a reboot?  If this is
> > "special" platform behavior, I'm not sure how much we need to cater
> > for it.
> 
> Are these AER errors the type processed by the GHES code?

My understanding from ACPI v6.3, sec 18.3.2, is that the Hardware
Error Source Table may contain Error Source Descriptors of types like:

  IA-32 Machine Check Exception
  IA-32 Corrected Machine Check
  IA-32 Non-Maskable Interrupt
  PCIe Root Port AER
  PCIe Device AER
  Generic Hardware Error Source (GHES)
  Hardware Error Notification
  IA-32 Deferred Machine Check

I would naively expect PCIe UR errors to be reported via one of the
PCIe Error Sources, not GHES, but maybe there's some reason to use
GHES.

The kernel should already know how to deal with the PCIe AER errors,
but we'd have to add new device-specific code to handle things
reported via GHES, along the lines of what Shiju is doing here:

  https://lore.kernel.org/r/20200722104245.1060-1-shiju.jose@huawei.com

> I'll note that RedHat runs their crash kernel with:  hest_disable.
> So, the ghes code is disabled in the crash kernel.

That would disable all the HEST error sources, including the PCIe AER
ones as well as GHES ones.  If we turn off some of the normal error
handling mechanisms, I guess we have to expect that some errors won't
be handled correctly.

Bjorn
Kairui Song July 23, 2020, 6:34 p.m. UTC | #39
On Thu, Jul 23, 2020 at 8:00 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jul 22, 2020 at 03:50:48PM -0600, Jerry Hoemann wrote:
> > On Wed, Jul 22, 2020 at 10:21:23AM -0500, Bjorn Helgaas wrote:
> > > On Wed, Jul 22, 2020 at 10:52:26PM +0800, Kairui Song wrote:
>
> > > > I think I didn't make one thing clear, The PCI UR error never arrives
> > > > in kernel, it's the iLo BMC on that HPE machine caught the error, and
> > > > send kernel an NMI. kernel is panicked by NMI, I'm still trying to
> > > > figure out why the NMI hanged kernel, even with panic=-1,
> > > > panic_on_io_nmi, panic_on_unknown_nmi all set. But if we can avoid the
> > > > NMI by shutdown the devices in right order, that's also a solution.
>
> ACPI v6.3, chapter 18, does mention NMIs several times, e.g., Table
> 18-394 and sec 18.4.  I'm not familiar enough with APEI to know
> whether Linux correctly supports all those cases.  Maybe this is a
> symptom that we don't?
>
> > > I'm not sure how much sympathy to have for this situation.  A PCIe UR
> > > is fatal for the transaction and maybe even the device, but from the
> > > overall system point of view, it *should* be a recoverable error and
> > > we shouldn't panic.
> > >
> > > Errors like that should be reported via the normal AER or ACPI/APEI
> > > mechanisms.  It sounds like in this case, the platform has decided
> > > these aren't enough and it is trying to force a reboot?  If this is
> > > "special" platform behavior, I'm not sure how much we need to cater
> > > for it.
> >
> > Are these AER errors the type processed by the GHES code?
>
> My understanding from ACPI v6.3, sec 18.3.2, is that the Hardware
> Error Source Table may contain Error Source Descriptors of types like:
>
>   IA-32 Machine Check Exception
>   IA-32 Corrected Machine Check
>   IA-32 Non-Maskable Interrupt
>   PCIe Root Port AER
>   PCIe Device AER
>   Generic Hardware Error Source (GHES)
>   Hardware Error Notification
>   IA-32 Deferred Machine Check
>
> I would naively expect PCIe UR errors to be reported via one of the
> PCIe Error Sources, not GHES, but maybe there's some reason to use
> GHES.
>
> The kernel should already know how to deal with the PCIe AER errors,
> but we'd have to add new device-specific code to handle things
> reported via GHES, along the lines of what Shiju is doing here:
>
>   https://lore.kernel.org/r/20200722104245.1060-1-shiju.jose@huawei.com
>
> > I'll note that RedHat runs their crash kernel with:  hest_disable.
> > So, the ghes code is disabled in the crash kernel.
>
> That would disable all the HEST error sources, including the PCIe AER
> ones as well as GHES ones.  If we turn off some of the normal error
> handling mechanisms, I guess we have to expect that some errors won't
> be handled correctly.


Hi, that's true, hest_disable is added by default to reduce memory
usage in special cases.
But even if I remove hest_disable and have GHES enabled, but the
hanging issue still exists, from the iLO console log, it's still
sending an NMI to kernel, and kernel hanged.

The NMI won't hang the kernel for 100 percent, sometime it will just
panic and reboot and sometimes it hangs. This behavior didn't change
after/before enabled the GHES.

Maybe this is a "special platform behavior". I'm also not 100 percent
sure if/how we can cover this in a good way for now.
I'll try to figure how the NMI actually hanged the kernel and see if
it could be fixed in other ways.

Patch
diff mbox series

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 0454ca0e4e3f..84a7fd643b4d 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -18,6 +18,7 @@ 
 #include <linux/kexec.h>
 #include <linux/of_device.h>
 #include <linux/acpi.h>
+#include <linux/crash_dump.h>
 #include "pci.h"
 #include "pcie/portdrv.h"
 
@@ -488,10 +489,14 @@  static void pci_device_shutdown(struct device *dev)
 	 * If this is a kexec reboot, turn off Bus Master bit on the
 	 * device to tell it to not continue to do DMA. Don't touch
 	 * devices in D3cold or unknown states.
-	 * If it is not a kexec reboot, firmware will hit the PCI
-	 * devices with big hammer and stop their DMA any way.
+	 * If this is kdump kernel, also turn off Bus Master, the device
+	 * could be activated by previous crashed kernel and may block
+	 * it's upstream from shutting down.
+	 * Else, firmware will hit the PCI devices with big hammer
+	 * and stop their DMA any way.
 	 */
-	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
+	if ((kexec_in_progress || is_kdump_kernel()) &&
+			pci_dev->current_state <= PCI_D3hot)
 		pci_clear_master(pci_dev);
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4937a088d7d8..c65d11ab3939 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -28,6 +28,7 @@ 
 #include <linux/platform_data/x86/apple.h>
 #include <linux/pm_runtime.h>
 #include <linux/switchtec.h>
+#include <linux/crash_dump.h>
 #include <asm/dma.h>	/* isa_dma_bridge_buggy */
 #include "pci.h"
 
@@ -192,6 +193,25 @@  static int __init pci_apply_final_quirks(void)
 }
 fs_initcall_sync(pci_apply_final_quirks);
 
+/*
+ * Read the device state even if it's not enabled. The device could be
+ * activated by previous crashed kernel, this will read and correct the
+ * cached state.
+ */
+static void quirk_read_pm_state_in_kdump(struct pci_dev *dev)
+{
+	u16 pmcsr;
+
+	if (!is_kdump_kernel())
+		return;
+
+	if (dev->pm_cap) {
+		pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
+		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
+	}
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_read_pm_state_in_kdump);
+
 /*
  * Decoding should be disabled for a PCI device during BAR sizing to avoid
  * conflict. But doing so may cause problems on host bridge and perhaps other