* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
       [not found] <1589251566-32126-1-git-send-email-pkushwaha@marvell.com>
@ 2020-05-12 22:03 ` Bjorn Helgaas
  2020-05-14  7:17   ` Prabhakar Kushwaha
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2020-05-12 22:03 UTC (permalink / raw)
  To: Prabhakar Kushwaha
  Cc: linux-arm-kernel, kexec, robin.murphy, maz, will, gkulkarni,
	bhsharma, prabhakar.pkin, linux-pci

[+cc linux-pci]

On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> An SMMU Stream table is created by the primary kernel. This table is
> used by the SMMU to perform address translations for device-originated
> transactions. A crash, if it happens, launches the kdump kernel, which
> re-creates the SMMU Stream table. New transactions will be translated
> via this new table.
> 
> There are scenarios where devices still have old pending transactions
> (configured in the primary kernel). These transactions arrive between
> Stream table creation and device-driver probe. As the new Stream table
> has no entries for these older transactions, they are aborted by the
> SMMU.
> 
> Similar behaviour was observed with a PCIe Intel 82576 Gigabit
> Network card: it sends an old Memory Read transaction in the kdump
> kernel. Transactions configured for older Stream table entries, which
> no longer exist in the new table, cause a PCIe Completion Abort.

That sounds like exactly what we want, doesn't it?

Or do you *want* DMA from the previous kernel to complete?  That will
read or scribble on something, but maybe that's not terrible as long
as it's not memory used by the kdump kernel.

> The returned PCIe Completion Abort further leads to AER errors from the
> APEI Generic Hardware Error Source (GHES) with a completion timeout.
> A network device hang is observed even after continuous reset/recovery
> attempts from the driver; hence the device is no longer usable.

The fact that the device is no longer usable is definitely a problem.
But in principle we *should* be able to recover from these errors.  If
we could recover and reliably use the device after the error, that
seems like it would be a more robust solution than having to add
special cases in every IOMMU driver.

If you have details about this sort of error, I'd like to try to fix
it because we want to recover from that sort of error in normal
(non-crash) situations as well.

> So, if we are in a kdump kernel, copy the SMMU Stream table from the
> primary/old kernel to preserve the mappings until the device driver
> takes over.
> 
> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> ---
> Changes for v2: Used memremap in place of ioremap
> 
> The v2 patch has been sanity tested.
> 
> The v1 patch has been tested with
> A) a PCIe Intel 82576 Gigabit Network card in the following
> configurations with no AER error. Each iteration has been tested on
> both the SUSE kdump rootfs and the default CentOS distro rootfs.
> 
>  1)  with 2 level stream table 
>        ----------------------------------------------------
>        SMMU               |  Normal Ping   | Flood Ping
>        -----------------------------------------------------
>        Default Operation  |  100 times     | 10 times
>        -----------------------------------------------------
>        IOMMU bypass       |  41 times      | 10 times
>        -----------------------------------------------------
> 
>  2)  with Linear stream table. 
>        -----------------------------------------------------
>        SMMU               |  Normal Ping   | Flood Ping
>        ------------------------------------------------------
>        Default Operation  |  100 times     | 10 times
>        ------------------------------------------------------
>        IOMMU bypass       |  55 times      | 10 times
>        -------------------------------------------------------
> 
> B) This patch is also tested with a Micron Technology Inc 9200 PRO NVMe
> SSD card with a 2-level stream table using "fio" in mixed read/write
> and read-only configurations. It is tested for both Default Operation
> and IOMMU bypass mode for a minimum of 10 iterations across the CentOS
> kdump rootfs and the default CentOS distro rootfs.
> 
> This patch is not a foolproof solution. The issue can still occur
> between the point the device is discovered and the driver probe being
> called. This patch has reduced the window from "SMMU Stream table
> creation - device-driver probe" to "device discovery - device-driver
> probe". Usually the time from device discovery to device-driver probe
> is very short, so the probability is very low.
> 
> Note: device discovery will overwrite existing Stream table entries
> with both SMMU stages set to bypass.
> 
> 
>  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 82508730feb7..d492d92c2dd7 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  			break;
>  		case STRTAB_STE_0_CFG_S1_TRANS:
>  		case STRTAB_STE_0_CFG_S2_TRANS:
> -			ste_live = true;
> +			/*
> +			 * As kdump kernel copy STE table from previous
> +			 * kernel. It still may have valid stream table entries.
> +			 * Forcing entry as false to allow overwrite.
> +			 */
> +			if (!is_kdump_kernel())
> +				ste_live = true;
>  			break;
>  		case STRTAB_STE_0_CFG_ABORT:
>  			BUG_ON(!disable_bypass);
> @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
>  		return -ENOMEM;
>  	}
>  
> +	if (is_kdump_kernel())
> +		return 0;
> +
>  	for (i = 0; i < cfg->num_l1_ents; ++i) {
>  		arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
>  		strtab += STRTAB_L1_DESC_DWORDS << 3;
> @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
>  	return 0;
>  }
>  
> +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> +			       struct arm_smmu_strtab_cfg *cfg, u32 size)
> +{
> +	struct arm_smmu_strtab_cfg rdcfg;
> +
> +	rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> +	rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> +					      + ARM_SMMU_STRTAB_BASE_CFG);
> +
> +	rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> +	rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> +
> +	memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> +
> +	cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> +}
> +
>  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
>  {
>  	void *strtab;
> @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
>  	reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
>  	cfg->strtab_base_cfg = reg;
>  
> +	if (is_kdump_kernel())
> +		arm_smmu_copy_table(smmu, cfg, l1size);
> +
>  	return arm_smmu_init_l1_strtab(smmu);
>  }
>  
> @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>  	reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
>  	cfg->strtab_base_cfg = reg;
>  
> +	if (is_kdump_kernel()) {
> +		arm_smmu_copy_table(smmu, cfg, size);
> +		return 0;
> +	}
> +
>  	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
>  	return 0;
>  }
> -- 
> 2.18.2
> 


* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-12 22:03 ` [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel Bjorn Helgaas
@ 2020-05-14  7:17   ` Prabhakar Kushwaha
  2020-05-19 23:22     ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Prabhakar Kushwaha @ 2020-05-14  7:17 UTC (permalink / raw)
  To: Bjorn Helgaas, Robin Murphy, linux-arm-kernel,
	kexec mailing list, linux-pci
  Cc: Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha

Thanks, Bjorn, for replying to this thread.

On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc linux-pci]
>
> On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > An SMMU Stream table is created by the primary kernel. This table is
> > used by the SMMU to perform address translations for device-originated
> > transactions. Any crash (if happened) launches the kdump kernel which
> > re-creates the SMMU Stream table. New transactions will be translated
> > via this new table..
> >
> > There are scenarios, where devices are still having old pending
> > transactions (configured in the primary kernel). These transactions
> > come in-between Stream table creation and device-driver probe.
> > As new stream table does not have entry for older transactions,
> > it will be aborted by SMMU.
> >
> > Similar observations were found with PCIe-Intel 82576 Gigabit
> > Network card. It sends old Memory Read transaction in kdump kernel.
> > Transactions configured for older Stream table entries, that do not
> > exist any longer in the new table, will cause a PCIe Completion Abort.
>
> That sounds like exactly what we want, doesn't it?
>
> Or do you *want* DMA from the previous kernel to complete?  That will
> read or scribble on something, but maybe that's not terrible as long
> as it's not memory used by the kdump kernel.
>

Yes, the abort should happen, but it should happen in the context of
the driver. Currently the abort happens because of the SMMU, with no
driver/PCIe setup present at that moment.

The solution to this issue should be at two places:
a) SMMU level: I still believe this patch has the potential to overcome
the issue until the driver's probe finally takes over.
b) Device level: even if something goes wrong, the driver/device should
be able to recover.


> > Returned PCIe completion abort further leads to AER Errors from APEI
> > Generic Hardware Error Source (GHES) with completion timeout.
> > A network device hang is observed even after continuous
> > reset/recovery from driver, Hence device is no more usable.
>
> The fact that the device is no longer usable is definitely a problem.
> But in principle we *should* be able to recover from these errors.  If
> we could recover and reliably use the device after the error, that
> seems like it would be a more robust solution that having to add
> special cases in every IOMMU driver.
>
> If you have details about this sort of error, I'd like to try to fix
> it because we want to recover from that sort of error in normal
> (non-crash) situations as well.
>
The Completion Abort case should be handled gracefully, and the device
should always remain usable.

There are two scenarios which I am testing with the PCIe Intel 82576
Gigabit Network card.

I)  Crash testing using the kdump root file system: de-facto scenario
    -  The kdump file system does not have the Ethernet driver.
    -  A lot of AER prints [1], making it impossible to work on the
shell of the kdump root file system.
    -  Note: the kdump shell allows use of the makedumpfile and
vmcore-dmesg applications.

II) Crash testing using the default root file system: specific case to
test the Ethernet driver in the second kernel
   -  The default root file system has the Ethernet driver.
   -  The AER error comes even before the driver probe starts.
   -  The driver resets the Ethernet card as part of probe, but with no success.
   -  AER also tries to recover, but with no success.  [2]
   -  I also tried to remove the AER errors by using the "pci=noaer"
bootarg and commenting out ghes_handle_aer() from the GHES driver;
then a different set of errors comes, which also never recovers. [3]

As per my understanding, the possible solutions are
 - copy the SMMU table, i.e. this patch,
OR
 - do pci_reset_function() during the enumeration phase (a rough sketch
of this idea follows below).
I also tried clearing the Bus Master ("M") bit using pci_clear_master()
during enumeration, but it did not help, because the driver re-sets the
M bit, causing the same AER error again.
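
For illustration only, here is a minimal sketch of the
pci_reset_function()-during-enumeration idea as a PCI fixup. It is not
part of the posted patch and has not been tested; the quirk name and
the hard-coded 82576 device ID (0x10c9, taken from the logs below) are
assumptions made just for this sketch:

  /*
   * Illustrative only, untested: in the kdump kernel, reset the stale
   * NIC function from a fixup so pending DMA from the crashed kernel
   * is stopped before a new stream table (and later the driver) takes
   * over.  A FINAL fixup runs after the device is set up but before a
   * driver can bind to it.
   */
  #include <linux/pci.h>
  #include <linux/crash_dump.h>

  static void quirk_kdump_reset_82576(struct pci_dev *pdev)
  {
          if (!is_kdump_kernel())
                  return;

          /* Stop the device from mastering the bus, then reset it. */
          pci_clear_master(pdev);
          if (pci_reset_function(pdev))
                  pci_warn(pdev, "kdump: function reset failed\n");
  }
  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x10c9, quirk_kdump_reset_82576);

Whether a fixup is the right hook, and whether it runs early enough on
a given platform to beat the stale transactions, is an open question.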


-pk

---------------------------------------------------------------------------------------------------------------------------
[1] with bootargs having pci=noaer

[   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[   22.512773] {4}[Hardware Error]: event severity: recoverable
[   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
[   22.544804] {4}[Hardware Error]:   section_type: PCIe error
[   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
[   22.556268] {4}[Hardware Error]:   version: 3.0
[   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
[   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
[   22.582323] {4}[Hardware Error]:   slot: 0
[   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
[   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   22.608900] {4}[Hardware Error]:   class_code: 000002
[   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
total mem (8153768 kB)
[   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
[   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
[   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
[   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)


----------------------------------------------------------------------------------------------------------------------------
[2] Normal bootargs.

[   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[   54.265827] {6}[Hardware Error]: event severity: recoverable
[   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
[   54.281605] {6}[Hardware Error]:   section_type: PCIe error
[   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
[   54.296955] {6}[Hardware Error]:   version: 3.0
[   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
[   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
[   54.317991] {6}[Hardware Error]:   slot: 0
[   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
[   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   54.333797] {6}[Hardware Error]:   class_code: 000002
[   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   54.358001] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
[   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   54.551370] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.705214] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.758703] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.865445] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
[   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   55.057272] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.571401] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.686138] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.786134] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.886141] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  397.792897] Workqueue: events aer_recover_work_func
[  397.797760] Call trace:
[  397.800199]  __switch_to+0xcc/0x108
[  397.803675]  __schedule+0x2c0/0x700
[  397.807150]  schedule+0x58/0xe8
[  397.810283]  schedule_preempt_disabled+0x18/0x28
[  397.810788] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
[  397.814890]  __mutex_lock_slowpath+0x1c/0x28
[  397.830962]  mutex_lock+0x4c/0x68
[  397.834264]  report_slot_reset+0x30/0xa0
[  397.838178]  pci_walk_bus+0x68/0xc0
[  397.841653]  pcie_do_recovery+0xe8/0x248
[  397.845562]  aer_recover_work_func+0x100/0x138
[  397.849995]  process_one_work+0x1bc/0x458
[  397.853991]  worker_thread+0x150/0x500
[  397.857727]  kthread+0x114/0x118
[  397.860945]  ret_from_fork+0x10/0x18
[  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
[  397.871564]       Not tainted 5.7.0-rc3+ #68
[  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  397.883638] kworker/223:2   D    0  2939      2 0x00000228
[  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
[  397.894505] Call trace:
[  397.896940]  __switch_to+0xcc/0x108
[  397.900419]  __schedule+0x2c0/0x700
[  397.903894]  schedule+0x58/0xe8
[  397.907023]  schedule_preempt_disabled+0x18/0x28
[  397.910798] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
[  397.923440]  __mutex_lock_slowpath+0x1c/0x28
[  397.927696]  mutex_lock+0x4c/0x68
[  397.931005]  rtnl_lock+0x24/0x30
[  397.934220]  addrconf_verify_work+0x18/0x30
[  397.938394]  process_one_work+0x1bc/0x458
[  397.942390]  worker_thread+0x150/0x500
[  397.946126]  kthread+0x114/0x118
[  397.949345]  ret_from_fork+0x10/0x18

---------------------------------------------------------------------------------------------------------------------------------
[3] with bootargs as pci=noaer and ghes_handle_aer() commented out from the GHES driver

[   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
[   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   69.365121] {9}[Hardware Error]: event severity: corrected
[   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
[   69.376064] {9}[Hardware Error]:   section_type: PCIe error
[   69.381623] {9}[Hardware Error]:   port_type: 4, root port
[   69.387094] {9}[Hardware Error]:   version: 3.0
[   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
[   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
[   69.403248] {9}[Hardware Error]:   slot: 0
[   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
[   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   69.419055] {9}[Hardware Error]:   class_code: 000406
[   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX
[   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
[   73.995068]   Tx Queue             <2>
[   73.995068]   TDH                  <0>
[   73.995068]   TDT                  <1>
[   73.995068]   next_to_use          <1>
[   73.995068]   next_to_clean        <0>
[   73.995068] buffer_info[next_to_clean]
[   73.995068]   time_stamp           <ffff9c1a>
[   73.995068]   next_to_watch        <0000000097d42934>
[   73.995068]   jiffies              <ffff9cd0>
[   73.995068]   desc.status          <168000>
[   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
[   75.987323]   Tx Queue             <2>
[   75.987323]   TDH                  <0>
[   75.987323]   TDT                  <1>
[   75.987323]   next_to_use          <1>
[   75.987323]   next_to_clean        <0>
[   75.987323] buffer_info[next_to_clean]
[   75.987323]   time_stamp           <ffff9c1a>
[   75.987323]   next_to_watch        <0000000097d42934>
[   75.987323]   jiffies              <ffff9d98>
[   75.987323]   desc.status          <168000>
[   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[   77.971790] {10}[Hardware Error]: event severity: recoverable
[   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
[   77.983254] {10}[Hardware Error]:   section_type: PCIe error
[   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
[   78.005922] {10}[Hardware Error]:   version: 3.0
[   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
[   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
[   78.033107] {10}[Hardware Error]:   slot: 0
[   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
[   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   78.072940] {10}[Hardware Error]:   class_code: 000002
[   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
[   78.096202]   Tx Queue             <2>
[   78.096202]   TDH                  <0>
[   78.096202]   TDT                  <1>
[   78.096202]   next_to_use          <1>
[   78.096202]   next_to_clean        <0>
[   78.096202] buffer_info[next_to_clean]
[   78.096202]   time_stamp           <ffff9c1a>
[   78.096202]   next_to_watch        <0000000097d42934>
[   78.096202]   jiffies              <ffff9e6a>
[   78.096202]   desc.status          <168000>
[   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   79.604254] {11}[Hardware Error]: event severity: corrected
[   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
[   79.615371] {11}[Hardware Error]:   section_type: PCIe error
[   79.621016] {11}[Hardware Error]:   port_type: 4, root port
[   79.626574] {11}[Hardware Error]:   version: 3.0
[   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
[   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
[   79.642988] {11}[Hardware Error]:   slot: 0
[   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
[   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   79.659055] {11}[Hardware Error]:   class_code: 000406
[   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
[   79.987052]   Tx Queue             <2>
[   79.987052]   TDH                  <0>
[   79.987052]   TDT                  <1>
[   79.987052]   next_to_use          <1>
[   79.987052]   next_to_clean        <0>
[   79.987052] buffer_info[next_to_clean]
[   79.987052]   time_stamp           <ffff9c1a>
[   79.987052]   next_to_watch        <0000000097d42934>
[   79.987052]   jiffies              <ffff9f28>
[   79.987052]   desc.status          <168000>
[   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
[   79.987056]   Tx Queue             <3>
[   79.987056]   TDH                  <0>
[   79.987056]   TDT                  <1>
[   79.987056]   next_to_use          <1>
[   79.987056]   next_to_clean        <0>
[   79.987056] buffer_info[next_to_clean]
[   79.987056]   time_stamp           <ffff9e43>
[   79.987056]   next_to_watch        <000000008da33deb>
[   79.987056]   jiffies              <ffff9f28>
[   79.987056]   desc.status          <514000>
[   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
[   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
[   81.986842]   Tx Queue             <2>
[   81.986842]   TDH                  <0>
[   81.986842]   TDT                  <1>
[   81.986842]   next_to_use          <1>
[   81.986842]   next_to_clean        <0>
[   81.986842] buffer_info[next_to_clean]
[   81.986842]   time_stamp           <ffff9c1a>
[   81.986842]   next_to_watch        <0000000097d42934>
[   81.986842]   jiffies              <ffff9ff0>
[   81.986842]   desc.status          <168000>
[   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
[   81.986844]   Tx Queue             <3>
[   81.986844]   TDH                  <0>
[   81.986844]   TDT                  <1>
[   81.986844]   next_to_use          <1>
[   81.986844]   next_to_clean        <0>
[   81.986844] buffer_info[next_to_clean]
[   81.986844]   time_stamp           <ffff9e43>
[   81.986844]   next_to_watch        <000000008da33deb>
[   81.986844]   jiffies              <ffff9ff0>
[   81.986844]   desc.status          <514000>
[   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   85.363365] {12}[Hardware Error]: event severity: corrected
[   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
[   85.374483] {12}[Hardware Error]:   section_type: PCIe error
[   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
[   85.386121] {12}[Hardware Error]:   version: 3.0
[   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
[   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
[   85.402540] {12}[Hardware Error]:   slot: 0
[   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
[   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   85.418609] {12}[Hardware Error]:   class_code: 000002
[   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX





> > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > primary/old kernel to preserve the mappings until the device driver
> > takes over.
> >
> > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > ---
> > Changes for v2: Used memremap in-place of ioremap
> >
> > V2 patch has been sanity tested.
> >
> > V1 patch has been tested with
> > A) PCIe-Intel 82576 Gigabit Network card in following
> > configurations with "no AER error". Each iteration has
> > been tested on both Suse kdump rfs And default Centos distro rfs.
> >
> >  1)  with 2 level stream table
> >        ----------------------------------------------------
> >        SMMU               |  Normal Ping   | Flood Ping
> >        -----------------------------------------------------
> >        Default Operation  |  100 times     | 10 times
> >        -----------------------------------------------------
> >        IOMMU bypass       |  41 times      | 10 times
> >        -----------------------------------------------------
> >
> >  2)  with Linear stream table.
> >        -----------------------------------------------------
> >        SMMU               |  Normal Ping   | Flood Ping
> >        ------------------------------------------------------
> >        Default Operation  |  100 times     | 10 times
> >        ------------------------------------------------------
> >        IOMMU bypass       |  55 times      | 10 times
> >        -------------------------------------------------------
> >
> > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > SSD card with 2 level stream table using "fio" in mixed read/write and
> > only read configurations. It is tested for both Default Operation and
> > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > default Centos ditstro rfs.
> >
> > This patch is not full proof solution. Issue can still come
> > from the point device is discovered and driver probe called.
> > This patch has reduced window of scenario from "SMMU Stream table
> > creation - device-driver" to "device discovery - device-driver".
> > Usually, device discovery to device-driver is very small time. So
> > the probability is very low.
> >
> > Note: device-discovery will overwrite existing stream table entries
> > with both SMMU stage as by-pass.
> >
> >
> >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 35 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 82508730feb7..d492d92c2dd7 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> >                       break;
> >               case STRTAB_STE_0_CFG_S1_TRANS:
> >               case STRTAB_STE_0_CFG_S2_TRANS:
> > -                     ste_live = true;
> > +                     /*
> > +                      * As kdump kernel copy STE table from previous
> > +                      * kernel. It still may have valid stream table entries.
> > +                      * Forcing entry as false to allow overwrite.
> > +                      */
> > +                     if (!is_kdump_kernel())
> > +                             ste_live = true;
> >                       break;
> >               case STRTAB_STE_0_CFG_ABORT:
> >                       BUG_ON(!disable_bypass);
> > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> >               return -ENOMEM;
> >       }
> >
> > +     if (is_kdump_kernel())
> > +             return 0;
> > +
> >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> >       return 0;
> >  }
> >
> > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > +{
> > +     struct arm_smmu_strtab_cfg rdcfg;
> > +
> > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > +
> > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > +
> > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > +
> > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > +}
> > +
> >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> >  {
> >       void *strtab;
> > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> >       cfg->strtab_base_cfg = reg;
> >
> > +     if (is_kdump_kernel())
> > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > +
> >       return arm_smmu_init_l1_strtab(smmu);
> >  }
> >
> > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> >       cfg->strtab_base_cfg = reg;
> >
> > +     if (is_kdump_kernel()) {
> > +             arm_smmu_copy_table(smmu, cfg, size);
> > +             return 0;
> > +     }
> > +
> >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> >       return 0;
> >  }
> > --
> > 2.18.2
> >


* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-14  7:17   ` Prabhakar Kushwaha
@ 2020-05-19 23:22     ` Bjorn Helgaas
  2020-05-21  3:58       ` Prabhakar Kushwaha
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2020-05-19 23:22 UTC (permalink / raw)
  To: Prabhakar Kushwaha
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

[+cc Sathy, Vijay, Myron]

On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > An SMMU Stream table is created by the primary kernel. This table is
> > > used by the SMMU to perform address translations for device-originated
> > > transactions. Any crash (if happened) launches the kdump kernel which
> > > re-creates the SMMU Stream table. New transactions will be translated
> > > via this new table..
> > >
> > > There are scenarios, where devices are still having old pending
> > > transactions (configured in the primary kernel). These transactions
> > > come in-between Stream table creation and device-driver probe.
> > > As new stream table does not have entry for older transactions,
> > > it will be aborted by SMMU.
> > >
> > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > Transactions configured for older Stream table entries, that do not
> > > exist any longer in the new table, will cause a PCIe Completion Abort.
> >
> > That sounds like exactly what we want, doesn't it?
> >
> > Or do you *want* DMA from the previous kernel to complete?  That will
> > read or scribble on something, but maybe that's not terrible as long
> > as it's not memory used by the kdump kernel.
> 
> Yes, Abort should happen. But it should happen in context of driver.
> But current abort is happening because of SMMU and no driver/pcie
> setup present at this moment.

I don't understand what you mean by "in context of driver."  The whole
problem is that we can't control *when* the abort happens, so it may
happen in *any* context.  It may happen when a NIC receives a packet
or at some other unpredictable time.

> Solution of this issue should be at 2 place
> a) SMMU level: I still believe, this patch has potential to overcome
> issue till finally driver's probe takeover.
> b) Device level: Even if something goes wrong. Driver/device should
> able to recover.
> 
> > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > Generic Hardware Error Source (GHES) with completion timeout.
> > > A network device hang is observed even after continuous
> > > reset/recovery from driver, Hence device is no more usable.
> >
> > The fact that the device is no longer usable is definitely a problem.
> > But in principle we *should* be able to recover from these errors.  If
> > we could recover and reliably use the device after the error, that
> > seems like it would be a more robust solution that having to add
> > special cases in every IOMMU driver.
> >
> > If you have details about this sort of error, I'd like to try to fix
> > it because we want to recover from that sort of error in normal
> > (non-crash) situations as well.
> >
> Completion abort case should be gracefully handled.  And device should
> always remain usable.
> 
> There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> 82576 Gigabit Network card.
> 
> I)  Crash testing using kdump root file system: De-facto scenario
>     -  kdump file system does not have Ethernet driver
>     -  A lot of AER prints [1], making it impossible to work on shell
> of kdump root file system.

In this case, I think report_error_detected() is deciding that because
the device has no driver, we can't do anything.  The flow is like
this:

  aer_recover_work_func               # aer_recover_work
    kfifo_get(aer_recover_ring, entry)
    dev = pci_get_domain_bus_and_slot
    cper_print_aer(dev, ...)
      pci_err("AER: aer_status:")
      pci_err("AER:   [14] CmpltTO")
      pci_err("AER: aer_layer=")
    if (AER_NONFATAL)
      pcie_do_recovery(dev, pci_channel_io_normal)
	status = CAN_RECOVER
        pci_walk_bus(report_normal_detected)
	  report_error_detected
	    if (!dev->driver)
	      vote = NO_AER_DRIVER
	      pci_info("can't recover (no error_detected callback)")
	    *result = merge_result(*, NO_AER_DRIVER)
	    # always NO_AER_DRIVER
	status is now NO_AER_DRIVER

So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(), 
and status is not RECOVERED, so it skips .resume().

I don't remember the history there, but if a device has no driver and
the device generates errors, it seems like we ought to be able to
reset it.

We should be able to field one (or a few) AER errors, reset the
device, and you should be able to use the shell in the kdump kernel.

>     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> 
> II) Crash testing using default root file system: Specific case to
> test Ethernet driver in second kernel
>    -  Default root file system have Ethernet driver
>    -  AER error comes even before the driver probe starts.
>    -  Driver does reset Ethernet card as part of probe but no success.
>    -  AER also tries to recover. but no success.  [2]
>    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> and commenting ghes_handle_aer() from GHES driver..
>           than different set of errors come which also never able to recover [3]
> 
> As per my understanding, possible solutions are
>  - Copy SMMU table i.e. this patch
> OR
>  - Doing pci_reset_function() during enumeration phase.
> I also tried clearing "M" bit using pci_clear_master during
> enumeration but it did not help. Because driver re-set M bit causing
> same AER error again.
> 
> 
> -pk
> 
> ---------------------------------------------------------------------------------------------------------------------------
> [1] with bootargs having pci=noaer
> 
> [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [   22.512773] {4}[Hardware Error]: event severity: recoverable
> [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> [   22.556268] {4}[Hardware Error]:   version: 3.0
> [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> [   22.582323] {4}[Hardware Error]:   slot: 0
> [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   22.608900] {4}[Hardware Error]:   class_code: 000002
> [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> total mem (8153768 kB)
> [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> 
> 
> ----------------------------------------------------------------------------------------------------------------------------
> [2] Normal bootargs.
> 
> [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [   54.265827] {6}[Hardware Error]: event severity: recoverable
> [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> [   54.296955] {6}[Hardware Error]:   version: 3.0
> [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> [   54.317991] {6}[Hardware Error]:   slot: 0
> [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   54.333797] {6}[Hardware Error]:   class_code: 000002
> [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  397.792897] Workqueue: events aer_recover_work_func
> [  397.797760] Call trace:
> [  397.800199]  __switch_to+0xcc/0x108
> [  397.803675]  __schedule+0x2c0/0x700
> [  397.807150]  schedule+0x58/0xe8
> [  397.810283]  schedule_preempt_disabled+0x18/0x28
> [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> [  397.830962]  mutex_lock+0x4c/0x68
> [  397.834264]  report_slot_reset+0x30/0xa0
> [  397.838178]  pci_walk_bus+0x68/0xc0
> [  397.841653]  pcie_do_recovery+0xe8/0x248
> [  397.845562]  aer_recover_work_func+0x100/0x138
> [  397.849995]  process_one_work+0x1bc/0x458
> [  397.853991]  worker_thread+0x150/0x500
> [  397.857727]  kthread+0x114/0x118
> [  397.860945]  ret_from_fork+0x10/0x18
> [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> [  397.871564]       Not tainted 5.7.0-rc3+ #68
> [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> [  397.894505] Call trace:
> [  397.896940]  __switch_to+0xcc/0x108
> [  397.900419]  __schedule+0x2c0/0x700
> [  397.903894]  schedule+0x58/0xe8
> [  397.907023]  schedule_preempt_disabled+0x18/0x28
> [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> [  397.927696]  mutex_lock+0x4c/0x68
> [  397.931005]  rtnl_lock+0x24/0x30
> [  397.934220]  addrconf_verify_work+0x18/0x30
> [  397.938394]  process_one_work+0x1bc/0x458
> [  397.942390]  worker_thread+0x150/0x500
> [  397.946126]  kthread+0x114/0x118
> [  397.949345]  ret_from_fork+0x10/0x18
> 
> ---------------------------------------------------------------------------------------------------------------------------------
> [3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver
> 
> [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   69.365121] {9}[Hardware Error]: event severity: corrected
> [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> [   69.387094] {9}[Hardware Error]:   version: 3.0
> [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> [   69.403248] {9}[Hardware Error]:   slot: 0
> [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   69.419055] {9}[Hardware Error]:   class_code: 000406
> [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> 1000 Mbps Full Duplex, Flow Control: RX
> [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> [   73.995068]   Tx Queue             <2>
> [   73.995068]   TDH                  <0>
> [   73.995068]   TDT                  <1>
> [   73.995068]   next_to_use          <1>
> [   73.995068]   next_to_clean        <0>
> [   73.995068] buffer_info[next_to_clean]
> [   73.995068]   time_stamp           <ffff9c1a>
> [   73.995068]   next_to_watch        <0000000097d42934>
> [   73.995068]   jiffies              <ffff9cd0>
> [   73.995068]   desc.status          <168000>
> [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> [   75.987323]   Tx Queue             <2>
> [   75.987323]   TDH                  <0>
> [   75.987323]   TDT                  <1>
> [   75.987323]   next_to_use          <1>
> [   75.987323]   next_to_clean        <0>
> [   75.987323] buffer_info[next_to_clean]
> [   75.987323]   time_stamp           <ffff9c1a>
> [   75.987323]   next_to_watch        <0000000097d42934>
> [   75.987323]   jiffies              <ffff9d98>
> [   75.987323]   desc.status          <168000>
> [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [   77.971790] {10}[Hardware Error]: event severity: recoverable
> [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> [   78.005922] {10}[Hardware Error]:   version: 3.0
> [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> [   78.033107] {10}[Hardware Error]:   slot: 0
> [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   78.072940] {10}[Hardware Error]:   class_code: 000002
> [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> [   78.096202]   Tx Queue             <2>
> [   78.096202]   TDH                  <0>
> [   78.096202]   TDT                  <1>
> [   78.096202]   next_to_use          <1>
> [   78.096202]   next_to_clean        <0>
> [   78.096202] buffer_info[next_to_clean]
> [   78.096202]   time_stamp           <ffff9c1a>
> [   78.096202]   next_to_watch        <0000000097d42934>
> [   78.096202]   jiffies              <ffff9e6a>
> [   78.096202]   desc.status          <168000>
> [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   79.604254] {11}[Hardware Error]: event severity: corrected
> [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> [   79.626574] {11}[Hardware Error]:   version: 3.0
> [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> [   79.642988] {11}[Hardware Error]:   slot: 0
> [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   79.659055] {11}[Hardware Error]:   class_code: 000406
> [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> [   79.987052]   Tx Queue             <2>
> [   79.987052]   TDH                  <0>
> [   79.987052]   TDT                  <1>
> [   79.987052]   next_to_use          <1>
> [   79.987052]   next_to_clean        <0>
> [   79.987052] buffer_info[next_to_clean]
> [   79.987052]   time_stamp           <ffff9c1a>
> [   79.987052]   next_to_watch        <0000000097d42934>
> [   79.987052]   jiffies              <ffff9f28>
> [   79.987052]   desc.status          <168000>
> [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> [   79.987056]   Tx Queue             <3>
> [   79.987056]   TDH                  <0>
> [   79.987056]   TDT                  <1>
> [   79.987056]   next_to_use          <1>
> [   79.987056]   next_to_clean        <0>
> [   79.987056] buffer_info[next_to_clean]
> [   79.987056]   time_stamp           <ffff9e43>
> [   79.987056]   next_to_watch        <000000008da33deb>
> [   79.987056]   jiffies              <ffff9f28>
> [   79.987056]   desc.status          <514000>
> [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> [   81.986842]   Tx Queue             <2>
> [   81.986842]   TDH                  <0>
> [   81.986842]   TDT                  <1>
> [   81.986842]   next_to_use          <1>
> [   81.986842]   next_to_clean        <0>
> [   81.986842] buffer_info[next_to_clean]
> [   81.986842]   time_stamp           <ffff9c1a>
> [   81.986842]   next_to_watch        <0000000097d42934>
> [   81.986842]   jiffies              <ffff9ff0>
> [   81.986842]   desc.status          <168000>
> [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> [   81.986844]   Tx Queue             <3>
> [   81.986844]   TDH                  <0>
> [   81.986844]   TDT                  <1>
> [   81.986844]   next_to_use          <1>
> [   81.986844]   next_to_clean        <0>
> [   81.986844] buffer_info[next_to_clean]
> [   81.986844]   time_stamp           <ffff9e43>
> [   81.986844]   next_to_watch        <000000008da33deb>
> [   81.986844]   jiffies              <ffff9ff0>
> [   81.986844]   desc.status          <514000>
> [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   85.363365] {12}[Hardware Error]: event severity: corrected
> [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> [   85.386121] {12}[Hardware Error]:   version: 3.0
> [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> [   85.402540] {12}[Hardware Error]:   slot: 0
> [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   85.418609] {12}[Hardware Error]:   class_code: 000002
> [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> 1000 Mbps Full Duplex, Flow Control: RX
> 
> 
> 
> 
> 
> > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > primary/old kernel to preserve the mappings until the device driver
> > > takes over.
> > >
> > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > ---
> > > Changes for v2: Used memremap in-place of ioremap
> > >
> > > V2 patch has been sanity tested.
> > >
> > > V1 patch has been tested with
> > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > configurations with "no AER error". Each iteration has
> > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > >
> > >  1)  with 2 level stream table
> > >        ----------------------------------------------------
> > >        SMMU               |  Normal Ping   | Flood Ping
> > >        -----------------------------------------------------
> > >        Default Operation  |  100 times     | 10 times
> > >        -----------------------------------------------------
> > >        IOMMU bypass       |  41 times      | 10 times
> > >        -----------------------------------------------------
> > >
> > >  2)  with Linear stream table.
> > >        -----------------------------------------------------
> > >        SMMU               |  Normal Ping   | Flood Ping
> > >        ------------------------------------------------------
> > >        Default Operation  |  100 times     | 10 times
> > >        ------------------------------------------------------
> > >        IOMMU bypass       |  55 times      | 10 times
> > >        -------------------------------------------------------
> > >
> > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > only read configurations. It is tested for both Default Operation and
> > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > default Centos ditstro rfs.
> > >
> > > This patch is not full proof solution. Issue can still come
> > > from the point device is discovered and driver probe called.
> > > This patch has reduced window of scenario from "SMMU Stream table
> > > creation - device-driver" to "device discovery - device-driver".
> > > Usually, device discovery to device-driver is very small time. So
> > > the probability is very low.
> > >
> > > Note: device-discovery will overwrite existing stream table entries
> > > with both SMMU stage as by-pass.
> > >
> > >
> > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > index 82508730feb7..d492d92c2dd7 100644
> > > --- a/drivers/iommu/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > >                       break;
> > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > -                     ste_live = true;
> > > +                     /*
> > > +                      * As kdump kernel copy STE table from previous
> > > +                      * kernel. It still may have valid stream table entries.
> > > +                      * Forcing entry as false to allow overwrite.
> > > +                      */
> > > +                     if (!is_kdump_kernel())
> > > +                             ste_live = true;
> > >                       break;
> > >               case STRTAB_STE_0_CFG_ABORT:
> > >                       BUG_ON(!disable_bypass);
> > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > >               return -ENOMEM;
> > >       }
> > >
> > > +     if (is_kdump_kernel())
> > > +             return 0;
> > > +
> > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > >       return 0;
> > >  }
> > >
> > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > +{
> > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > +
> > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > +
> > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > +
> > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > +
> > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > +}
> > > +
> > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > >  {
> > >       void *strtab;
> > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > >       cfg->strtab_base_cfg = reg;
> > >
> > > +     if (is_kdump_kernel())
> > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > +
> > >       return arm_smmu_init_l1_strtab(smmu);
> > >  }
> > >
> > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > >       cfg->strtab_base_cfg = reg;
> > >
> > > +     if (is_kdump_kernel()) {
> > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > +             return 0;
> > > +     }
> > > +
> > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > >       return 0;
> > >  }
> > > --
> > > 2.18.2
> > >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-19 23:22     ` Bjorn Helgaas
@ 2020-05-21  3:58       ` Prabhakar Kushwaha
  2020-05-21 22:49         ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Prabhakar Kushwaha @ 2020-05-21  3:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

Hi Bjorn,

On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Sathy, Vijay, Myron]
>
> On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > used by the SMMU to perform address translations for device-originated
> > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > via this new table..
> > > >
> > > > There are scenarios, where devices are still having old pending
> > > > transactions (configured in the primary kernel). These transactions
> > > > come in-between Stream table creation and device-driver probe.
> > > > As new stream table does not have entry for older transactions,
> > > > it will be aborted by SMMU.
> > > >
> > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > Transactions configured for older Stream table entries, that do not
> > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > >
> > > That sounds like exactly what we want, doesn't it?
> > >
> > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > read or scribble on something, but maybe that's not terrible as long
> > > as it's not memory used by the kdump kernel.
> >
> > Yes, Abort should happen. But it should happen in context of driver.
> > But current abort is happening because of SMMU and no driver/pcie
> > setup present at this moment.
>
> I don't understand what you mean by "in context of driver."  The whole
> problem is that we can't control *when* the abort happens, so it may
> happen in *any* context.  It may happen when a NIC receives a packet
> or at some other unpredictable time.
>
> > Solution of this issue should be at 2 place
> > a) SMMU level: I still believe, this patch has potential to overcome
> > issue till finally driver's probe takeover.
> > b) Device level: Even if something goes wrong. Driver/device should
> > able to recover.
> >
> > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > A network device hang is observed even after continuous
> > > > reset/recovery from driver, Hence device is no more usable.
> > >
> > > The fact that the device is no longer usable is definitely a problem.
> > > But in principle we *should* be able to recover from these errors.  If
> > > we could recover and reliably use the device after the error, that
> > > seems like it would be a more robust solution that having to add
> > > special cases in every IOMMU driver.
> > >
> > > If you have details about this sort of error, I'd like to try to fix
> > > it because we want to recover from that sort of error in normal
> > > (non-crash) situations as well.
> > >
> > Completion abort case should be gracefully handled.  And device should
> > always remain usable.
> >
> > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > 82576 Gigabit Network card.
> >
> > I)  Crash testing using kdump root file system: De-facto scenario
> >     -  kdump file system does not have Ethernet driver
> >     -  A lot of AER prints [1], making it impossible to work on shell
> > of kdump root file system.
>
> In this case, I think report_error_detected() is deciding that because
> the device has no driver, we can't do anything.  The flow is like
> this:
>
>   aer_recover_work_func               # aer_recover_work
>     kfifo_get(aer_recover_ring, entry)
>     dev = pci_get_domain_bus_and_slot
>     cper_print_aer(dev, ...)
>       pci_err("AER: aer_status:")
>       pci_err("AER:   [14] CmpltTO")
>       pci_err("AER: aer_layer=")
>     if (AER_NONFATAL)
>       pcie_do_recovery(dev, pci_channel_io_normal)
>         status = CAN_RECOVER
>         pci_walk_bus(report_normal_detected)
>           report_error_detected
>             if (!dev->driver)
>               vote = NO_AER_DRIVER
>               pci_info("can't recover (no error_detected callback)")
>             *result = merge_result(*, NO_AER_DRIVER)
>             # always NO_AER_DRIVER
>         status is now NO_AER_DRIVER
>
> So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> and status is not RECOVERED, so it skips .resume().
>
> I don't remember the history there, but if a device has no driver and
> the device generates errors, it seems like we ought to be able to
> reset it.
>

But how can the device be reset when there is no driver bound to it?
Hypothetically, this case should be handled by the PCIe subsystem,
performing the reset at the PCIe level.

> We should be able to field one (or a few) AER errors, reset the
> device, and you should be able to use the shell in the kdump kernel.
>
Here the kdump shell is usable; the only problem is the flood of AER
errors, which makes it impossible to see what one is typing.

> >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> >
> > II) Crash testing using default root file system: Specific case to
> > test Ethernet driver in second kernel
> >    -  Default root file system have Ethernet driver
> >    -  AER error comes even before the driver probe starts.
> >    -  Driver does reset Ethernet card as part of probe but no success.
> >    -  AER also tries to recover. but no success.  [2]
> >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > and commenting ghes_handle_aer() from GHES driver..
> >           than different set of errors come which also never able to recover [3]
> >

Please share your view on this case. Here the driver is present
(drivers/net/ethernet/intel/igb/igb_main.c).
In this case AER errors start even before the driver probe begins.
After probe, the driver resets the device with no success, and even AER
recovery does not work.

The problems mentioned in cases I and II go away if pci_reset_function()
is done during the enumeration phase of the kdump kernel.
Can we think of doing pci_reset_function() for all devices in the kdump
kernel, or of a device-specific quirk?
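
To make the idea concrete, here is a rough, untested sketch of the
"reset everything" variant (the initcall level, the pci_clear_master()
call and the function name are my own assumptions, not something taken
from the posted patch):

#include <linux/init.h>
#include <linux/pci.h>
#include <linux/crash_dump.h>

/*
 * Hypothetical: quiesce every PCI function once during kdump boot so
 * stale DMA left over from the crashed kernel is stopped before the
 * drivers probe.
 */
static int __init kdump_reset_pci_devices(void)
{
	struct pci_dev *pdev = NULL;

	if (!is_kdump_kernel())
		return 0;

	for_each_pci_dev(pdev) {
		/* Stop bus mastering first, then try a function-level reset. */
		pci_clear_master(pdev);
		if (pci_reset_function(pdev))
			pci_info(pdev, "kdump: reset not supported, skipping\n");
	}

	return 0;
}
late_initcall(kdump_reset_pci_devices);

The device-specific alternative would be the same two calls wrapped in a
DECLARE_PCI_FIXUP_FINAL() quirk for the affected device IDs, guarded by
is_kdump_kernel().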

--pk


> > As per my understanding, possible solutions are
> >  - Copy SMMU table i.e. this patch
> > OR
> >  - Doing pci_reset_function() during enumeration phase.
> > I also tried clearing "M" bit using pci_clear_master during
> > enumeration but it did not help. Because driver re-set M bit causing
> > same AER error again.
> >
> >
> > -pk
> >
> > ---------------------------------------------------------------------------------------------------------------------------
> > [1] with bootargs having pci=noaer
> >
> > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > [   22.582323] {4}[Hardware Error]:   slot: 0
> > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > total mem (8153768 kB)
> > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> >
> >
> > ----------------------------------------------------------------------------------------------------------------------------
> > [2] Normal bootargs.
> >
> > [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [   54.265827] {6}[Hardware Error]: event severity: recoverable
> > [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> > [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> > [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> > [   54.296955] {6}[Hardware Error]:   version: 3.0
> > [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> > [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> > [   54.317991] {6}[Hardware Error]:   slot: 0
> > [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> > [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   54.333797] {6}[Hardware Error]:   class_code: 000002
> > [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> > [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> > [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  397.792897] Workqueue: events aer_recover_work_func
> > [  397.797760] Call trace:
> > [  397.800199]  __switch_to+0xcc/0x108
> > [  397.803675]  __schedule+0x2c0/0x700
> > [  397.807150]  schedule+0x58/0xe8
> > [  397.810283]  schedule_preempt_disabled+0x18/0x28
> > [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> > [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> > [  397.830962]  mutex_lock+0x4c/0x68
> > [  397.834264]  report_slot_reset+0x30/0xa0
> > [  397.838178]  pci_walk_bus+0x68/0xc0
> > [  397.841653]  pcie_do_recovery+0xe8/0x248
> > [  397.845562]  aer_recover_work_func+0x100/0x138
> > [  397.849995]  process_one_work+0x1bc/0x458
> > [  397.853991]  worker_thread+0x150/0x500
> > [  397.857727]  kthread+0x114/0x118
> > [  397.860945]  ret_from_fork+0x10/0x18
> > [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> > [  397.871564]       Not tainted 5.7.0-rc3+ #68
> > [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> > [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> > [  397.894505] Call trace:
> > [  397.896940]  __switch_to+0xcc/0x108
> > [  397.900419]  __schedule+0x2c0/0x700
> > [  397.903894]  schedule+0x58/0xe8
> > [  397.907023]  schedule_preempt_disabled+0x18/0x28
> > [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> > [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> > [  397.927696]  mutex_lock+0x4c/0x68
> > [  397.931005]  rtnl_lock+0x24/0x30
> > [  397.934220]  addrconf_verify_work+0x18/0x30
> > [  397.938394]  process_one_work+0x1bc/0x458
> > [  397.942390]  worker_thread+0x150/0x500
> > [  397.946126]  kthread+0x114/0x118
> > [  397.949345]  ret_from_fork+0x10/0x18
> >
> > ---------------------------------------------------------------------------------------------------------------------------------
> > [3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver
> >
> > [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   69.365121] {9}[Hardware Error]: event severity: corrected
> > [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> > [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> > [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> > [   69.387094] {9}[Hardware Error]:   version: 3.0
> > [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> > [   69.403248] {9}[Hardware Error]:   slot: 0
> > [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> > [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   69.419055] {9}[Hardware Error]:   class_code: 000406
> > [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > 1000 Mbps Full Duplex, Flow Control: RX
> > [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   73.995068]   Tx Queue             <2>
> > [   73.995068]   TDH                  <0>
> > [   73.995068]   TDT                  <1>
> > [   73.995068]   next_to_use          <1>
> > [   73.995068]   next_to_clean        <0>
> > [   73.995068] buffer_info[next_to_clean]
> > [   73.995068]   time_stamp           <ffff9c1a>
> > [   73.995068]   next_to_watch        <0000000097d42934>
> > [   73.995068]   jiffies              <ffff9cd0>
> > [   73.995068]   desc.status          <168000>
> > [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   75.987323]   Tx Queue             <2>
> > [   75.987323]   TDH                  <0>
> > [   75.987323]   TDT                  <1>
> > [   75.987323]   next_to_use          <1>
> > [   75.987323]   next_to_clean        <0>
> > [   75.987323] buffer_info[next_to_clean]
> > [   75.987323]   time_stamp           <ffff9c1a>
> > [   75.987323]   next_to_watch        <0000000097d42934>
> > [   75.987323]   jiffies              <ffff9d98>
> > [   75.987323]   desc.status          <168000>
> > [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [   77.971790] {10}[Hardware Error]: event severity: recoverable
> > [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> > [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> > [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> > [   78.005922] {10}[Hardware Error]:   version: 3.0
> > [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> > [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> > [   78.033107] {10}[Hardware Error]:   slot: 0
> > [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> > [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   78.072940] {10}[Hardware Error]:   class_code: 000002
> > [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   78.096202]   Tx Queue             <2>
> > [   78.096202]   TDH                  <0>
> > [   78.096202]   TDT                  <1>
> > [   78.096202]   next_to_use          <1>
> > [   78.096202]   next_to_clean        <0>
> > [   78.096202] buffer_info[next_to_clean]
> > [   78.096202]   time_stamp           <ffff9c1a>
> > [   78.096202]   next_to_watch        <0000000097d42934>
> > [   78.096202]   jiffies              <ffff9e6a>
> > [   78.096202]   desc.status          <168000>
> > [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   79.604254] {11}[Hardware Error]: event severity: corrected
> > [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> > [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> > [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> > [   79.626574] {11}[Hardware Error]:   version: 3.0
> > [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> > [   79.642988] {11}[Hardware Error]:   slot: 0
> > [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> > [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   79.659055] {11}[Hardware Error]:   class_code: 000406
> > [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   79.987052]   Tx Queue             <2>
> > [   79.987052]   TDH                  <0>
> > [   79.987052]   TDT                  <1>
> > [   79.987052]   next_to_use          <1>
> > [   79.987052]   next_to_clean        <0>
> > [   79.987052] buffer_info[next_to_clean]
> > [   79.987052]   time_stamp           <ffff9c1a>
> > [   79.987052]   next_to_watch        <0000000097d42934>
> > [   79.987052]   jiffies              <ffff9f28>
> > [   79.987052]   desc.status          <168000>
> > [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   79.987056]   Tx Queue             <3>
> > [   79.987056]   TDH                  <0>
> > [   79.987056]   TDT                  <1>
> > [   79.987056]   next_to_use          <1>
> > [   79.987056]   next_to_clean        <0>
> > [   79.987056] buffer_info[next_to_clean]
> > [   79.987056]   time_stamp           <ffff9e43>
> > [   79.987056]   next_to_watch        <000000008da33deb>
> > [   79.987056]   jiffies              <ffff9f28>
> > [   79.987056]   desc.status          <514000>
> > [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   81.986842]   Tx Queue             <2>
> > [   81.986842]   TDH                  <0>
> > [   81.986842]   TDT                  <1>
> > [   81.986842]   next_to_use          <1>
> > [   81.986842]   next_to_clean        <0>
> > [   81.986842] buffer_info[next_to_clean]
> > [   81.986842]   time_stamp           <ffff9c1a>
> > [   81.986842]   next_to_watch        <0000000097d42934>
> > [   81.986842]   jiffies              <ffff9ff0>
> > [   81.986842]   desc.status          <168000>
> > [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   81.986844]   Tx Queue             <3>
> > [   81.986844]   TDH                  <0>
> > [   81.986844]   TDT                  <1>
> > [   81.986844]   next_to_use          <1>
> > [   81.986844]   next_to_clean        <0>
> > [   81.986844] buffer_info[next_to_clean]
> > [   81.986844]   time_stamp           <ffff9e43>
> > [   81.986844]   next_to_watch        <000000008da33deb>
> > [   81.986844]   jiffies              <ffff9ff0>
> > [   81.986844]   desc.status          <514000>
> > [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   85.363365] {12}[Hardware Error]: event severity: corrected
> > [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> > [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> > [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> > [   85.386121] {12}[Hardware Error]:   version: 3.0
> > [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> > [   85.402540] {12}[Hardware Error]:   slot: 0
> > [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> > [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   85.418609] {12}[Hardware Error]:   class_code: 000002
> > [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > 1000 Mbps Full Duplex, Flow Control: RX
> >
> >
> >
> >
> >
> > > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > > primary/old kernel to preserve the mappings until the device driver
> > > > takes over.
> > > >
> > > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > > ---
> > > > Changes for v2: Used memremap in-place of ioremap
> > > >
> > > > V2 patch has been sanity tested.
> > > >
> > > > V1 patch has been tested with
> > > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > > configurations with "no AER error". Each iteration has
> > > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > > >
> > > >  1)  with 2 level stream table
> > > >        ----------------------------------------------------
> > > >        SMMU               |  Normal Ping   | Flood Ping
> > > >        -----------------------------------------------------
> > > >        Default Operation  |  100 times     | 10 times
> > > >        -----------------------------------------------------
> > > >        IOMMU bypass       |  41 times      | 10 times
> > > >        -----------------------------------------------------
> > > >
> > > >  2)  with Linear stream table.
> > > >        -----------------------------------------------------
> > > >        SMMU               |  Normal Ping   | Flood Ping
> > > >        ------------------------------------------------------
> > > >        Default Operation  |  100 times     | 10 times
> > > >        ------------------------------------------------------
> > > >        IOMMU bypass       |  55 times      | 10 times
> > > >        -------------------------------------------------------
> > > >
> > > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > > only read configurations. It is tested for both Default Operation and
> > > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > > default Centos ditstro rfs.
> > > >
> > > > This patch is not full proof solution. Issue can still come
> > > > from the point device is discovered and driver probe called.
> > > > This patch has reduced window of scenario from "SMMU Stream table
> > > > creation - device-driver" to "device discovery - device-driver".
> > > > Usually, device discovery to device-driver is very small time. So
> > > > the probability is very low.
> > > >
> > > > Note: device-discovery will overwrite existing stream table entries
> > > > with both SMMU stage as by-pass.
> > > >
> > > >
> > > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > > index 82508730feb7..d492d92c2dd7 100644
> > > > --- a/drivers/iommu/arm-smmu-v3.c
> > > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > > >                       break;
> > > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > > -                     ste_live = true;
> > > > +                     /*
> > > > +                      * As kdump kernel copy STE table from previous
> > > > +                      * kernel. It still may have valid stream table entries.
> > > > +                      * Forcing entry as false to allow overwrite.
> > > > +                      */
> > > > +                     if (!is_kdump_kernel())
> > > > +                             ste_live = true;
> > > >                       break;
> > > >               case STRTAB_STE_0_CFG_ABORT:
> > > >                       BUG_ON(!disable_bypass);
> > > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > >               return -ENOMEM;
> > > >       }
> > > >
> > > > +     if (is_kdump_kernel())
> > > > +             return 0;
> > > > +
> > > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > >       return 0;
> > > >  }
> > > >
> > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > +{
> > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > +
> > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > +
> > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > +
> > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > +
> > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > > +}
> > > > +
> > > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > >  {
> > > >       void *strtab;
> > > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > > >       cfg->strtab_base_cfg = reg;
> > > >
> > > > +     if (is_kdump_kernel())
> > > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > > +
> > > >       return arm_smmu_init_l1_strtab(smmu);
> > > >  }
> > > >
> > > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > > >       cfg->strtab_base_cfg = reg;
> > > >
> > > > +     if (is_kdump_kernel()) {
> > > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > > +             return 0;
> > > > +     }
> > > > +
> > > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > > >       return 0;
> > > >  }
> > > > --
> > > > 2.18.2
> > > >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-21  3:58       ` Prabhakar Kushwaha
@ 2020-05-21 22:49         ` Bjorn Helgaas
  2020-05-27 11:44           ` Prabhakar Kushwaha
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2020-05-21 22:49 UTC (permalink / raw)
  To: Prabhakar Kushwaha
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > used by the SMMU to perform address translations for device-originated
> > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > via this new table..
> > > > >
> > > > > There are scenarios, where devices are still having old pending
> > > > > transactions (configured in the primary kernel). These transactions
> > > > > come in-between Stream table creation and device-driver probe.
> > > > > As new stream table does not have entry for older transactions,
> > > > > it will be aborted by SMMU.
> > > > >
> > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > Transactions configured for older Stream table entries, that do not
> > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > >
> > > > That sounds like exactly what we want, doesn't it?
> > > >
> > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > read or scribble on something, but maybe that's not terrible as long
> > > > as it's not memory used by the kdump kernel.
> > >
> > > Yes, Abort should happen. But it should happen in context of driver.
> > > But current abort is happening because of SMMU and no driver/pcie
> > > setup present at this moment.
> >
> > I don't understand what you mean by "in context of driver."  The whole
> > problem is that we can't control *when* the abort happens, so it may
> > happen in *any* context.  It may happen when a NIC receives a packet
> > or at some other unpredictable time.
> >
> > > Solution of this issue should be at 2 place
> > > a) SMMU level: I still believe, this patch has potential to overcome
> > > issue till finally driver's probe takeover.
> > > b) Device level: Even if something goes wrong. Driver/device should
> > > able to recover.
> > >
> > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > A network device hang is observed even after continuous
> > > > > reset/recovery from driver, Hence device is no more usable.
> > > >
> > > > The fact that the device is no longer usable is definitely a problem.
> > > > But in principle we *should* be able to recover from these errors.  If
> > > > we could recover and reliably use the device after the error, that
> > > > seems like it would be a more robust solution that having to add
> > > > special cases in every IOMMU driver.
> > > >
> > > > If you have details about this sort of error, I'd like to try to fix
> > > > it because we want to recover from that sort of error in normal
> > > > (non-crash) situations as well.
> > > >
> > > Completion abort case should be gracefully handled.  And device should
> > > always remain usable.
> > >
> > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > 82576 Gigabit Network card.
> > >
> > > I)  Crash testing using kdump root file system: De-facto scenario
> > >     -  kdump file system does not have Ethernet driver
> > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > of kdump root file system.
> >
> > In this case, I think report_error_detected() is deciding that because
> > the device has no driver, we can't do anything.  The flow is like
> > this:
> >
> >   aer_recover_work_func               # aer_recover_work
> >     kfifo_get(aer_recover_ring, entry)
> >     dev = pci_get_domain_bus_and_slot
> >     cper_print_aer(dev, ...)
> >       pci_err("AER: aer_status:")
> >       pci_err("AER:   [14] CmpltTO")
> >       pci_err("AER: aer_layer=")
> >     if (AER_NONFATAL)
> >       pcie_do_recovery(dev, pci_channel_io_normal)
> >         status = CAN_RECOVER
> >         pci_walk_bus(report_normal_detected)
> >           report_error_detected
> >             if (!dev->driver)
> >               vote = NO_AER_DRIVER
> >               pci_info("can't recover (no error_detected callback)")
> >             *result = merge_result(*, NO_AER_DRIVER)
> >             # always NO_AER_DRIVER
> >         status is now NO_AER_DRIVER
> >
> > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > and status is not RECOVERED, so it skips .resume().
> >
> > I don't remember the history there, but if a device has no driver and
> > the device generates errors, it seems like we ought to be able to
> > reset it.
> 
> But how can the device be reset when there is no driver bound to it?
> Hypothetically, this case should be handled by the PCIe subsystem,
> performing the reset at the PCIe level.

I don't understand your question.  The PCI core (not the device
driver) already does the reset.  When pcie_do_recovery() calls
reset_link(), all devices on the other side of the link are reset.

> > We should be able to field one (or a few) AER errors, reset the
> > device, and you should be able to use the shell in the kdump kernel.
> >
> Here the kdump shell is usable; the only problem is the flood of AER
> errors, which makes it impossible to see what one is typing.

Right, that's what I expect.  If the PCI core resets the device, you
should get just a few AER errors, and they should stop after the
device is reset.

> > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > >
> > > II) Crash testing using default root file system: Specific case to
> > > test Ethernet driver in second kernel
> > >    -  Default root file system have Ethernet driver
> > >    -  AER error comes even before the driver probe starts.
> > >    -  Driver does reset Ethernet card as part of probe but no success.
> > >    -  AER also tries to recover. but no success.  [2]
> > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > and commenting ghes_handle_aer() from GHES driver..
> > >           than different set of errors come which also never able to recover [3]
> > >
> 
> Please share your view on this case. Here the driver is present
> (drivers/net/ethernet/intel/igb/igb_main.c).
> In this case AER errors start even before the driver probe begins.
> After probe, the driver resets the device with no success, and even AER
> recovery does not work.

This case should be the same as the one above.  If we can change the
PCI core so it can reset the device when there's no driver, that would
apply to case I (where there will never be a driver) and to case II
(where there is no driver now, but a driver will probe the device
later).
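
Roughly what I have in mind, as a sketch only (the helper name is made
up and the real plumbing in drivers/pci/pcie/err.c looks different, so
don't read this as a literal patch):

#include <linux/pci.h>

/* Hypothetical helper: pick the recovery vote for one device. */
static pci_ers_result_t vote_for_device(struct pci_dev *dev,
					pci_channel_state_t state)
{
	const struct pci_error_handlers *err =
		dev->driver ? dev->driver->err_handler : NULL;

	if (err && err->error_detected)
		return err->error_detected(dev, state);

	/*
	 * No driver bound (e.g. kdump kernel before probe).  Today this
	 * turns into PCI_ERS_RESULT_NO_AER_DRIVER, which vetoes recovery
	 * for the whole subtree.  Voting NEED_RESET instead would let
	 * pcie_do_recovery() go on and reset the device, stopping the
	 * stale DMA whether or not a driver ever shows up.
	 */
	return PCI_ERS_RESULT_NEED_RESET;
}

Whether NEED_RESET (or even DISCONNECT) is the right vote for a
driverless device is the policy question; the mechanics are about that
simple.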

> The problems mentioned in cases I and II go away if pci_reset_function()
> is done during the enumeration phase of the kdump kernel.
> Can we think of doing pci_reset_function() for all devices in the kdump
> kernel, or of a device-specific quirk?
> 
> --pk
> 
> 
> > > As per my understanding, possible solutions are
> > >  - Copy SMMU table i.e. this patch
> > > OR
> > >  - Doing pci_reset_function() during enumeration phase.
> > > I also tried clearing "M" bit using pci_clear_master during
> > > enumeration but it did not help. Because driver re-set M bit causing
> > > same AER error again.
> > >
> > >
> > > -pk
> > >
> > > ---------------------------------------------------------------------------------------------------------------------------
> > > [1] with bootargs having pci=noaer
> > >
> > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 1
> > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > total mem (8153768 kB)
> > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > >
> > >
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > [2] Normal bootargs.
> > >
> > > [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 1
> > > [   54.265827] {6}[Hardware Error]: event severity: recoverable
> > > [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> > > [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> > > [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   54.296955] {6}[Hardware Error]:   version: 3.0
> > > [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> > > [   54.317991] {6}[Hardware Error]:   slot: 0
> > > [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> > > [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   54.333797] {6}[Hardware Error]:   class_code: 000002
> > > [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> > > [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> > > [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  397.792897] Workqueue: events aer_recover_work_func
> > > [  397.797760] Call trace:
> > > [  397.800199]  __switch_to+0xcc/0x108
> > > [  397.803675]  __schedule+0x2c0/0x700
> > > [  397.807150]  schedule+0x58/0xe8
> > > [  397.810283]  schedule_preempt_disabled+0x18/0x28
> > > [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> > > [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> > > [  397.830962]  mutex_lock+0x4c/0x68
> > > [  397.834264]  report_slot_reset+0x30/0xa0
> > > [  397.838178]  pci_walk_bus+0x68/0xc0
> > > [  397.841653]  pcie_do_recovery+0xe8/0x248
> > > [  397.845562]  aer_recover_work_func+0x100/0x138
> > > [  397.849995]  process_one_work+0x1bc/0x458
> > > [  397.853991]  worker_thread+0x150/0x500
> > > [  397.857727]  kthread+0x114/0x118
> > > [  397.860945]  ret_from_fork+0x10/0x18
> > > [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> > > [  397.871564]       Not tainted 5.7.0-rc3+ #68
> > > [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> > > [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> > > [  397.894505] Call trace:
> > > [  397.896940]  __switch_to+0xcc/0x108
> > > [  397.900419]  __schedule+0x2c0/0x700
> > > [  397.903894]  schedule+0x58/0xe8
> > > [  397.907023]  schedule_preempt_disabled+0x18/0x28
> > > [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> > > [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> > > [  397.927696]  mutex_lock+0x4c/0x68
> > > [  397.931005]  rtnl_lock+0x24/0x30
> > > [  397.934220]  addrconf_verify_work+0x18/0x30
> > > [  397.938394]  process_one_work+0x1bc/0x458
> > > [  397.942390]  worker_thread+0x150/0x500
> > > [  397.946126]  kthread+0x114/0x118
> > > [  397.949345]  ret_from_fork+0x10/0x18
> > >
> > > ---------------------------------------------------------------------------------------------------------------------------------
> > > [3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver
> > >
> > > [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 0
> > > [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> > > requires no further action
> > > [   69.365121] {9}[Hardware Error]: event severity: corrected
> > > [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> > > [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> > > [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> > > [   69.387094] {9}[Hardware Error]:   version: 3.0
> > > [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> > > [   69.403248] {9}[Hardware Error]:   slot: 0
> > > [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> > > [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > [   69.419055] {9}[Hardware Error]:   class_code: 000406
> > > [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> > > 0x6000, control: 0x0002
> > > [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > 1000 Mbps Full Duplex, Flow Control: RX
> > > [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   73.995068]   Tx Queue             <2>
> > > [   73.995068]   TDH                  <0>
> > > [   73.995068]   TDT                  <1>
> > > [   73.995068]   next_to_use          <1>
> > > [   73.995068]   next_to_clean        <0>
> > > [   73.995068] buffer_info[next_to_clean]
> > > [   73.995068]   time_stamp           <ffff9c1a>
> > > [   73.995068]   next_to_watch        <0000000097d42934>
> > > [   73.995068]   jiffies              <ffff9cd0>
> > > [   73.995068]   desc.status          <168000>
> > > [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   75.987323]   Tx Queue             <2>
> > > [   75.987323]   TDH                  <0>
> > > [   75.987323]   TDT                  <1>
> > > [   75.987323]   next_to_use          <1>
> > > [   75.987323]   next_to_clean        <0>
> > > [   75.987323] buffer_info[next_to_clean]
> > > [   75.987323]   time_stamp           <ffff9c1a>
> > > [   75.987323]   next_to_watch        <0000000097d42934>
> > > [   75.987323]   jiffies              <ffff9d98>
> > > [   75.987323]   desc.status          <168000>
> > > [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 1
> > > [   77.971790] {10}[Hardware Error]: event severity: recoverable
> > > [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> > > [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> > > [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   78.005922] {10}[Hardware Error]:   version: 3.0
> > > [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> > > [   78.033107] {10}[Hardware Error]:   slot: 0
> > > [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> > > [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   78.072940] {10}[Hardware Error]:   class_code: 000002
> > > [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   78.096202]   Tx Queue             <2>
> > > [   78.096202]   TDH                  <0>
> > > [   78.096202]   TDT                  <1>
> > > [   78.096202]   next_to_use          <1>
> > > [   78.096202]   next_to_clean        <0>
> > > [   78.096202] buffer_info[next_to_clean]
> > > [   78.096202]   time_stamp           <ffff9c1a>
> > > [   78.096202]   next_to_watch        <0000000097d42934>
> > > [   78.096202]   jiffies              <ffff9e6a>
> > > [   78.096202]   desc.status          <168000>
> > > [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 0
> > > [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> > > requires no further action
> > > [   79.604254] {11}[Hardware Error]: event severity: corrected
> > > [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> > > [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> > > [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> > > [   79.626574] {11}[Hardware Error]:   version: 3.0
> > > [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> > > [   79.642988] {11}[Hardware Error]:   slot: 0
> > > [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> > > [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > [   79.659055] {11}[Hardware Error]:   class_code: 000406
> > > [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> > > 0x6000, control: 0x0002
> > > [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   79.987052]   Tx Queue             <2>
> > > [   79.987052]   TDH                  <0>
> > > [   79.987052]   TDT                  <1>
> > > [   79.987052]   next_to_use          <1>
> > > [   79.987052]   next_to_clean        <0>
> > > [   79.987052] buffer_info[next_to_clean]
> > > [   79.987052]   time_stamp           <ffff9c1a>
> > > [   79.987052]   next_to_watch        <0000000097d42934>
> > > [   79.987052]   jiffies              <ffff9f28>
> > > [   79.987052]   desc.status          <168000>
> > > [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   79.987056]   Tx Queue             <3>
> > > [   79.987056]   TDH                  <0>
> > > [   79.987056]   TDT                  <1>
> > > [   79.987056]   next_to_use          <1>
> > > [   79.987056]   next_to_clean        <0>
> > > [   79.987056] buffer_info[next_to_clean]
> > > [   79.987056]   time_stamp           <ffff9e43>
> > > [   79.987056]   next_to_watch        <000000008da33deb>
> > > [   79.987056]   jiffies              <ffff9f28>
> > > [   79.987056]   desc.status          <514000>
> > > [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   81.986842]   Tx Queue             <2>
> > > [   81.986842]   TDH                  <0>
> > > [   81.986842]   TDT                  <1>
> > > [   81.986842]   next_to_use          <1>
> > > [   81.986842]   next_to_clean        <0>
> > > [   81.986842] buffer_info[next_to_clean]
> > > [   81.986842]   time_stamp           <ffff9c1a>
> > > [   81.986842]   next_to_watch        <0000000097d42934>
> > > [   81.986842]   jiffies              <ffff9ff0>
> > > [   81.986842]   desc.status          <168000>
> > > [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   81.986844]   Tx Queue             <3>
> > > [   81.986844]   TDH                  <0>
> > > [   81.986844]   TDT                  <1>
> > > [   81.986844]   next_to_use          <1>
> > > [   81.986844]   next_to_clean        <0>
> > > [   81.986844] buffer_info[next_to_clean]
> > > [   81.986844]   time_stamp           <ffff9e43>
> > > [   81.986844]   next_to_watch        <000000008da33deb>
> > > [   81.986844]   jiffies              <ffff9ff0>
> > > [   81.986844]   desc.status          <514000>
> > > [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 0
> > > [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> > > requires no further action
> > > [   85.363365] {12}[Hardware Error]: event severity: corrected
> > > [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> > > [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> > > [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   85.386121] {12}[Hardware Error]:   version: 3.0
> > > [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> > > [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> > > [   85.402540] {12}[Hardware Error]:   slot: 0
> > > [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> > > [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   85.418609] {12}[Hardware Error]:   class_code: 000002
> > > [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > 1000 Mbps Full Duplex, Flow Control: RX
> > >
> > >
> > >
> > >
> > >
> > > > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > > > primary/old kernel to preserve the mappings until the device driver
> > > > > takes over.
> > > > >
> > > > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > > > ---
> > > > > Changes for v2: Used memremap in-place of ioremap
> > > > >
> > > > > V2 patch has been sanity tested.
> > > > >
> > > > > V1 patch has been tested with
> > > > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > > > configurations with "no AER error". Each iteration has
> > > > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > > > >
> > > > >  1)  with 2 level stream table
> > > > >        ----------------------------------------------------
> > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > >        -----------------------------------------------------
> > > > >        Default Operation  |  100 times     | 10 times
> > > > >        -----------------------------------------------------
> > > > >        IOMMU bypass       |  41 times      | 10 times
> > > > >        -----------------------------------------------------
> > > > >
> > > > >  2)  with Linear stream table.
> > > > >        -----------------------------------------------------
> > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > >        ------------------------------------------------------
> > > > >        Default Operation  |  100 times     | 10 times
> > > > >        ------------------------------------------------------
> > > > >        IOMMU bypass       |  55 times      | 10 times
> > > > >        -------------------------------------------------------
> > > > >
> > > > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > > > only read configurations. It is tested for both Default Operation and
> > > > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > > > default Centos ditstro rfs.
> > > > >
> > > > > This patch is not full proof solution. Issue can still come
> > > > > from the point device is discovered and driver probe called.
> > > > > This patch has reduced window of scenario from "SMMU Stream table
> > > > > creation - device-driver" to "device discovery - device-driver".
> > > > > Usually, device discovery to device-driver is very small time. So
> > > > > the probability is very low.
> > > > >
> > > > > Note: device-discovery will overwrite existing stream table entries
> > > > > with both SMMU stage as by-pass.
> > > > >
> > > > >
> > > > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > > > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > > > index 82508730feb7..d492d92c2dd7 100644
> > > > > --- a/drivers/iommu/arm-smmu-v3.c
> > > > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > > > >                       break;
> > > > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > > > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > > > -                     ste_live = true;
> > > > > +                     /*
> > > > > +                      * As kdump kernel copy STE table from previous
> > > > > +                      * kernel. It still may have valid stream table entries.
> > > > > +                      * Forcing entry as false to allow overwrite.
> > > > > +                      */
> > > > > +                     if (!is_kdump_kernel())
> > > > > +                             ste_live = true;
> > > > >                       break;
> > > > >               case STRTAB_STE_0_CFG_ABORT:
> > > > >                       BUG_ON(!disable_bypass);
> > > > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > >               return -ENOMEM;
> > > > >       }
> > > > >
> > > > > +     if (is_kdump_kernel())
> > > > > +             return 0;
> > > > > +
> > > > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > > > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > > > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > >       return 0;
> > > > >  }
> > > > >
> > > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > > +{
> > > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > > +
> > > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > > +
> > > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > > +
> > > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > > +
> > > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > > > +}
> > > > > +
> > > > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > >  {
> > > > >       void *strtab;
> > > > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > > > >       cfg->strtab_base_cfg = reg;
> > > > >
> > > > > +     if (is_kdump_kernel())
> > > > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > > > +
> > > > >       return arm_smmu_init_l1_strtab(smmu);
> > > > >  }
> > > > >
> > > > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > > > >       cfg->strtab_base_cfg = reg;
> > > > >
> > > > > +     if (is_kdump_kernel()) {
> > > > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > > > +             return 0;
> > > > > +     }
> > > > > +
> > > > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > > > >       return 0;
> > > > >  }
> > > > > --
> > > > > 2.18.2
> > > > >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-21 22:49         ` Bjorn Helgaas
@ 2020-05-27 11:44           ` Prabhakar Kushwaha
  2020-05-27 20:18             ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Prabhakar Kushwaha @ 2020-05-27 11:44 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

Hi Bjorn,

On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > via this new table.
> > > > > >
> > > > > > There are scenarios, where devices are still having old pending
> > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > As new stream table does not have entry for older transactions,
> > > > > > it will be aborted by SMMU.
> > > > > >
> > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > >
> > > > > That sounds like exactly what we want, doesn't it?
> > > > >
> > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > as it's not memory used by the kdump kernel.
> > > >
> > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > But current abort is happening because of SMMU and no driver/pcie
> > > > setup present at this moment.
> > >
> > > I don't understand what you mean by "in context of driver."  The whole
> > > problem is that we can't control *when* the abort happens, so it may
> > > happen in *any* context.  It may happen when a NIC receives a packet
> > > or at some other unpredictable time.
> > >
> > > > Solution of this issue should be at 2 place
> > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > issue till finally driver's probe takeover.
> > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > able to recover.
> > > >
> > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > A network device hang is observed even after continuous
> > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > >
> > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > we could recover and reliably use the device after the error, that
> > > > > seems like it would be a more robust solution that having to add
> > > > > special cases in every IOMMU driver.
> > > > >
> > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > it because we want to recover from that sort of error in normal
> > > > > (non-crash) situations as well.
> > > > >
> > > > Completion abort case should be gracefully handled.  And device should
> > > > always remain usable.
> > > >
> > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > 82576 Gigabit Network card.
> > > >
> > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > >     -  kdump file system does not have Ethernet driver
> > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > of kdump root file system.
> > >
> > > In this case, I think report_error_detected() is deciding that because
> > > the device has no driver, we can't do anything.  The flow is like
> > > this:
> > >
> > >   aer_recover_work_func               # aer_recover_work
> > >     kfifo_get(aer_recover_ring, entry)
> > >     dev = pci_get_domain_bus_and_slot
> > >     cper_print_aer(dev, ...)
> > >       pci_err("AER: aer_status:")
> > >       pci_err("AER:   [14] CmpltTO")
> > >       pci_err("AER: aer_layer=")
> > >     if (AER_NONFATAL)
> > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > >         status = CAN_RECOVER
> > >         pci_walk_bus(report_normal_detected)
> > >           report_error_detected
> > >             if (!dev->driver)
> > >               vote = NO_AER_DRIVER
> > >               pci_info("can't recover (no error_detected callback)")
> > >             *result = merge_result(*, NO_AER_DRIVER)
> > >             # always NO_AER_DRIVER
> > >         status is now NO_AER_DRIVER
> > >
> > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > and status is not RECOVERED, so it skips .resume().
> > >
> > > I don't remember the history there, but if a device has no driver and
> > > the device generates errors, it seems like we ought to be able to
> > > reset it.
> >
> > But how to reset the device considering there is no driver.
> > Hypothetically, this case should be taken care by PCIe subsystem to
> > perform reset at PCIe level.
>
> I don't understand your question.  The PCI core (not the device
> driver) already does the reset.  When pcie_do_recovery() calls
> reset_link(), all devices on the other side of the link are reset.
>
> > > We should be able to field one (or a few) AER errors, reset the
> > > device, and you should be able to use the shell in the kdump kernel.
> > >
> > here kdump shell is usable only problem is a "lot of AER Errors". One
> > cannot see what they are typing.
>
> Right, that's what I expect.  If the PCI core resets the device, you
> should get just a few AER errors, and they should stop after the
> device is reset.
>
> > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > >
> > > > II) Crash testing using default root file system: Specific case to
> > > > test Ethernet driver in second kernel
> > > >    -  Default root file system have Ethernet driver
> > > >    -  AER error comes even before the driver probe starts.
> > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > >    -  AER also tries to recover. but no success.  [2]
> > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > and commenting ghes_handle_aer() from GHES driver..
> > > >           than different set of errors come which also never able to recover [3]
> > > >
> >
> > Please suggest your view on this case. Here the driver is present.
> > (driver/net/ethernet/intel/igb/igb_main.c)
> > In this case AER errors starts even before driver probe starts.
> > After probe, driver does the device reset with no success and even AER
> > recovery does not work.
>
> This case should be the same as the one above.  If we can change the
> PCI core so it can reset the device when there's no driver,  that would
> apply to case I (where there will never be a driver) and to case II
> (where there is no driver now, but a driver will probe the device
> later).
>

Does this mean changes are required in the PCI core?
I tried the following changes in pcie_do_recovery(), but it did not help.
Same error as before.

--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
        pci_info(dev, "broadcast resume message\n");
        pci_walk_bus(bus, report_resume, &status);
@@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
        return status;

 failed:
        pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+       pci_reset_function(dev);
+       pci_aer_clear_device_status(dev);
+       pci_aer_clear_nonfatal_status(dev);

--pk

> > Problem mentioned in case I and II goes away if do pci_reset_function
> > during enumeration phase of kdump kernel.
> > can we thought of doing pci_reset_function for all devices in kdump
> > kernel or device specific quirk.
> >
> > --pk
> >
> >
> > > > As per my understanding, possible solutions are
> > > >  - Copy SMMU table i.e. this patch
> > > > OR
> > > >  - Doing pci_reset_function() during enumeration phase.
> > > > I also tried clearing "M" bit using pci_clear_master during
> > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > same AER error again.
> > > >
> > > >
> > > > -pk
> > > >
> > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > [1] with bootargs having pci=noaer
> > > >
> > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 1
> > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > total mem (8153768 kB)
> > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > >
> > > >
> > > > ----------------------------------------------------------------------------------------------------------------------------
> > > > [2] Normal bootargs.
> > > >
> > > > [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 1
> > > > [   54.265827] {6}[Hardware Error]: event severity: recoverable
> > > > [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> > > > [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> > > > [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   54.296955] {6}[Hardware Error]:   version: 3.0
> > > > [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> > > > [   54.317991] {6}[Hardware Error]:   slot: 0
> > > > [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> > > > [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   54.333797] {6}[Hardware Error]:   class_code: 000002
> > > > [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> > > > [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> > > > [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  397.792897] Workqueue: events aer_recover_work_func
> > > > [  397.797760] Call trace:
> > > > [  397.800199]  __switch_to+0xcc/0x108
> > > > [  397.803675]  __schedule+0x2c0/0x700
> > > > [  397.807150]  schedule+0x58/0xe8
> > > > [  397.810283]  schedule_preempt_disabled+0x18/0x28
> > > > [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> > > > [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> > > > [  397.830962]  mutex_lock+0x4c/0x68
> > > > [  397.834264]  report_slot_reset+0x30/0xa0
> > > > [  397.838178]  pci_walk_bus+0x68/0xc0
> > > > [  397.841653]  pcie_do_recovery+0xe8/0x248
> > > > [  397.845562]  aer_recover_work_func+0x100/0x138
> > > > [  397.849995]  process_one_work+0x1bc/0x458
> > > > [  397.853991]  worker_thread+0x150/0x500
> > > > [  397.857727]  kthread+0x114/0x118
> > > > [  397.860945]  ret_from_fork+0x10/0x18
> > > > [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> > > > [  397.871564]       Not tainted 5.7.0-rc3+ #68
> > > > [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > disables this message.
> > > > [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> > > > [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> > > > [  397.894505] Call trace:
> > > > [  397.896940]  __switch_to+0xcc/0x108
> > > > [  397.900419]  __schedule+0x2c0/0x700
> > > > [  397.903894]  schedule+0x58/0xe8
> > > > [  397.907023]  schedule_preempt_disabled+0x18/0x28
> > > > [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> > > > [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> > > > [  397.927696]  mutex_lock+0x4c/0x68
> > > > [  397.931005]  rtnl_lock+0x24/0x30
> > > > [  397.934220]  addrconf_verify_work+0x18/0x30
> > > > [  397.938394]  process_one_work+0x1bc/0x458
> > > > [  397.942390]  worker_thread+0x150/0x500
> > > > [  397.946126]  kthread+0x114/0x118
> > > > [  397.949345]  ret_from_fork+0x10/0x18
> > > >
> > > > ---------------------------------------------------------------------------------------------------------------------------------
> > > > [3] with bootargs as pci=noaer and commenting out ghes_handle_aer() from the GHES driver
> > > >
> > > > [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > > [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 0
> > > > [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> > > > requires no further action
> > > > [   69.365121] {9}[Hardware Error]: event severity: corrected
> > > > [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> > > > [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> > > > [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> > > > [   69.387094] {9}[Hardware Error]:   version: 3.0
> > > > [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > > [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> > > > [   69.403248] {9}[Hardware Error]:   slot: 0
> > > > [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> > > > [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > > [   69.419055] {9}[Hardware Error]:   class_code: 000406
> > > > [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> > > > 0x6000, control: 0x0002
> > > > [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > > 1000 Mbps Full Duplex, Flow Control: RX
> > > > [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   73.995068]   Tx Queue             <2>
> > > > [   73.995068]   TDH                  <0>
> > > > [   73.995068]   TDT                  <1>
> > > > [   73.995068]   next_to_use          <1>
> > > > [   73.995068]   next_to_clean        <0>
> > > > [   73.995068] buffer_info[next_to_clean]
> > > > [   73.995068]   time_stamp           <ffff9c1a>
> > > > [   73.995068]   next_to_watch        <0000000097d42934>
> > > > [   73.995068]   jiffies              <ffff9cd0>
> > > > [   73.995068]   desc.status          <168000>
> > > > [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   75.987323]   Tx Queue             <2>
> > > > [   75.987323]   TDH                  <0>
> > > > [   75.987323]   TDT                  <1>
> > > > [   75.987323]   next_to_use          <1>
> > > > [   75.987323]   next_to_clean        <0>
> > > > [   75.987323] buffer_info[next_to_clean]
> > > > [   75.987323]   time_stamp           <ffff9c1a>
> > > > [   75.987323]   next_to_watch        <0000000097d42934>
> > > > [   75.987323]   jiffies              <ffff9d98>
> > > > [   75.987323]   desc.status          <168000>
> > > > [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 1
> > > > [   77.971790] {10}[Hardware Error]: event severity: recoverable
> > > > [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> > > > [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> > > > [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   78.005922] {10}[Hardware Error]:   version: 3.0
> > > > [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> > > > [   78.033107] {10}[Hardware Error]:   slot: 0
> > > > [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> > > > [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   78.072940] {10}[Hardware Error]:   class_code: 000002
> > > > [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   78.096202]   Tx Queue             <2>
> > > > [   78.096202]   TDH                  <0>
> > > > [   78.096202]   TDT                  <1>
> > > > [   78.096202]   next_to_use          <1>
> > > > [   78.096202]   next_to_clean        <0>
> > > > [   78.096202] buffer_info[next_to_clean]
> > > > [   78.096202]   time_stamp           <ffff9c1a>
> > > > [   78.096202]   next_to_watch        <0000000097d42934>
> > > > [   78.096202]   jiffies              <ffff9e6a>
> > > > [   78.096202]   desc.status          <168000>
> > > > [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 0
> > > > [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> > > > requires no further action
> > > > [   79.604254] {11}[Hardware Error]: event severity: corrected
> > > > [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> > > > [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> > > > [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> > > > [   79.626574] {11}[Hardware Error]:   version: 3.0
> > > > [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > > [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> > > > [   79.642988] {11}[Hardware Error]:   slot: 0
> > > > [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> > > > [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > > [   79.659055] {11}[Hardware Error]:   class_code: 000406
> > > > [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> > > > 0x6000, control: 0x0002
> > > > [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   79.987052]   Tx Queue             <2>
> > > > [   79.987052]   TDH                  <0>
> > > > [   79.987052]   TDT                  <1>
> > > > [   79.987052]   next_to_use          <1>
> > > > [   79.987052]   next_to_clean        <0>
> > > > [   79.987052] buffer_info[next_to_clean]
> > > > [   79.987052]   time_stamp           <ffff9c1a>
> > > > [   79.987052]   next_to_watch        <0000000097d42934>
> > > > [   79.987052]   jiffies              <ffff9f28>
> > > > [   79.987052]   desc.status          <168000>
> > > > [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   79.987056]   Tx Queue             <3>
> > > > [   79.987056]   TDH                  <0>
> > > > [   79.987056]   TDT                  <1>
> > > > [   79.987056]   next_to_use          <1>
> > > > [   79.987056]   next_to_clean        <0>
> > > > [   79.987056] buffer_info[next_to_clean]
> > > > [   79.987056]   time_stamp           <ffff9e43>
> > > > [   79.987056]   next_to_watch        <000000008da33deb>
> > > > [   79.987056]   jiffies              <ffff9f28>
> > > > [   79.987056]   desc.status          <514000>
> > > > [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > > [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   81.986842]   Tx Queue             <2>
> > > > [   81.986842]   TDH                  <0>
> > > > [   81.986842]   TDT                  <1>
> > > > [   81.986842]   next_to_use          <1>
> > > > [   81.986842]   next_to_clean        <0>
> > > > [   81.986842] buffer_info[next_to_clean]
> > > > [   81.986842]   time_stamp           <ffff9c1a>
> > > > [   81.986842]   next_to_watch        <0000000097d42934>
> > > > [   81.986842]   jiffies              <ffff9ff0>
> > > > [   81.986842]   desc.status          <168000>
> > > > [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   81.986844]   Tx Queue             <3>
> > > > [   81.986844]   TDH                  <0>
> > > > [   81.986844]   TDT                  <1>
> > > > [   81.986844]   next_to_use          <1>
> > > > [   81.986844]   next_to_clean        <0>
> > > > [   81.986844] buffer_info[next_to_clean]
> > > > [   81.986844]   time_stamp           <ffff9e43>
> > > > [   81.986844]   next_to_watch        <000000008da33deb>
> > > > [   81.986844]   jiffies              <ffff9ff0>
> > > > [   81.986844]   desc.status          <514000>
> > > > [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 0
> > > > [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> > > > requires no further action
> > > > [   85.363365] {12}[Hardware Error]: event severity: corrected
> > > > [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> > > > [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> > > > [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   85.386121] {12}[Hardware Error]:   version: 3.0
> > > > [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> > > > [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> > > > [   85.402540] {12}[Hardware Error]:   slot: 0
> > > > [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> > > > [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   85.418609] {12}[Hardware Error]:   class_code: 000002
> > > > [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > > 1000 Mbps Full Duplex, Flow Control: RX
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > > > > primary/old kernel to preserve the mappings until the device driver
> > > > > > takes over.
> > > > > >
> > > > > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > > > > ---
> > > > > > Changes for v2: Used memremap in-place of ioremap
> > > > > >
> > > > > > V2 patch has been sanity tested.
> > > > > >
> > > > > > V1 patch has been tested with
> > > > > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > > > > configurations with "no AER error". Each iteration has
> > > > > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > > > > >
> > > > > >  1)  with 2 level stream table
> > > > > >        ----------------------------------------------------
> > > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > > >        -----------------------------------------------------
> > > > > >        Default Operation  |  100 times     | 10 times
> > > > > >        -----------------------------------------------------
> > > > > >        IOMMU bypass       |  41 times      | 10 times
> > > > > >        -----------------------------------------------------
> > > > > >
> > > > > >  2)  with Linear stream table.
> > > > > >        -----------------------------------------------------
> > > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > > >        ------------------------------------------------------
> > > > > >        Default Operation  |  100 times     | 10 times
> > > > > >        ------------------------------------------------------
> > > > > >        IOMMU bypass       |  55 times      | 10 times
> > > > > >        -------------------------------------------------------
> > > > > >
> > > > > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > > > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > > > > only read configurations. It is tested for both Default Operation and
> > > > > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > > > > default Centos distro rfs.
> > > > > >
> > > > > > This patch is not a foolproof solution. An issue can still come
> > > > > > from the point device is discovered and driver probe called.
> > > > > > This patch has reduced window of scenario from "SMMU Stream table
> > > > > > creation - device-driver" to "device discovery - device-driver".
> > > > > > Usually, device discovery to device-driver is very small time. So
> > > > > > the probability is very low.
> > > > > >
> > > > > > Note: device-discovery will overwrite existing stream table entries
> > > > > > with both SMMU stage as by-pass.
> > > > > >
> > > > > >
> > > > > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > > > > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > > > > index 82508730feb7..d492d92c2dd7 100644
> > > > > > --- a/drivers/iommu/arm-smmu-v3.c
> > > > > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > > > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > > > > >                       break;
> > > > > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > > > > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > > > > -                     ste_live = true;
> > > > > > +                     /*
> > > > > > +                      * As kdump kernel copy STE table from previous
> > > > > > +                      * kernel. It still may have valid stream table entries.
> > > > > > +                      * Forcing entry as false to allow overwrite.
> > > > > > +                      */
> > > > > > +                     if (!is_kdump_kernel())
> > > > > > +                             ste_live = true;
> > > > > >                       break;
> > > > > >               case STRTAB_STE_0_CFG_ABORT:
> > > > > >                       BUG_ON(!disable_bypass);
> > > > > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > > >               return -ENOMEM;
> > > > > >       }
> > > > > >
> > > > > > +     if (is_kdump_kernel())
> > > > > > +             return 0;
> > > > > > +
> > > > > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > > > > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > > > > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > > >       return 0;
> > > > > >  }
> > > > > >
> > > > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > > > +{
> > > > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > > > +
> > > > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > > > +
> > > > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > > > +
> > > > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > > > +
> > > > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > > > > +}
> > > > > > +
> > > > > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > > >  {
> > > > > >       void *strtab;
> > > > > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > > > > >       cfg->strtab_base_cfg = reg;
> > > > > >
> > > > > > +     if (is_kdump_kernel())
> > > > > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > > > > +
> > > > > >       return arm_smmu_init_l1_strtab(smmu);
> > > > > >  }
> > > > > >
> > > > > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > > > > >       cfg->strtab_base_cfg = reg;
> > > > > >
> > > > > > +     if (is_kdump_kernel()) {
> > > > > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > > > > +             return 0;
> > > > > > +     }
> > > > > > +
> > > > > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > > > > >       return 0;
> > > > > >  }
> > > > > > --
> > > > > > 2.18.2
> > > > > >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-27 11:44           ` Prabhakar Kushwaha
@ 2020-05-27 20:18             ` Bjorn Helgaas
  2020-05-29 14:18               ` Prabhakar Kushwaha
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2020-05-27 20:18 UTC (permalink / raw)
  To: Prabhakar Kushwaha
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > via this new table.
> > > > > > >
> > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > it will be aborted by SMMU.
> > > > > > >
> > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > >
> > > > > > That sounds like exactly what we want, doesn't it?
> > > > > >
> > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > as it's not memory used by the kdump kernel.
> > > > >
> > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > setup present at this moment.
> > > >
> > > > I don't understand what you mean by "in context of driver."  The whole
> > > > problem is that we can't control *when* the abort happens, so it may
> > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > or at some other unpredictable time.
> > > >
> > > > > Solution of this issue should be at 2 place
> > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > issue till finally driver's probe takeover.
> > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > able to recover.
> > > > >
> > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > A network device hang is observed even after continuous
> > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > >
> > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > we could recover and reliably use the device after the error, that
> > > > > > seems like it would be a more robust solution that having to add
> > > > > > special cases in every IOMMU driver.
> > > > > >
> > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > it because we want to recover from that sort of error in normal
> > > > > > (non-crash) situations as well.
> > > > > >
> > > > > Completion abort case should be gracefully handled.  And device should
> > > > > always remain usable.
> > > > >
> > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > 82576 Gigabit Network card.
> > > > >
> > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > >     -  kdump file system does not have Ethernet driver
> > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > of kdump root file system.
> > > >
> > > > In this case, I think report_error_detected() is deciding that because
> > > > the device has no driver, we can't do anything.  The flow is like
> > > > this:
> > > >
> > > >   aer_recover_work_func               # aer_recover_work
> > > >     kfifo_get(aer_recover_ring, entry)
> > > >     dev = pci_get_domain_bus_and_slot
> > > >     cper_print_aer(dev, ...)
> > > >       pci_err("AER: aer_status:")
> > > >       pci_err("AER:   [14] CmpltTO")
> > > >       pci_err("AER: aer_layer=")
> > > >     if (AER_NONFATAL)
> > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > >         status = CAN_RECOVER
> > > >         pci_walk_bus(report_normal_detected)
> > > >           report_error_detected
> > > >             if (!dev->driver)
> > > >               vote = NO_AER_DRIVER
> > > >               pci_info("can't recover (no error_detected callback)")
> > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > >             # always NO_AER_DRIVER
> > > >         status is now NO_AER_DRIVER
> > > >
> > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > and status is not RECOVERED, so it skips .resume().
> > > >
> > > > I don't remember the history there, but if a device has no driver and
> > > > the device generates errors, it seems like we ought to be able to
> > > > reset it.
> > >
> > > But how to reset the device considering there is no driver.
> > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > perform reset at PCIe level.
> >
> > I don't understand your question.  The PCI core (not the device
> > driver) already does the reset.  When pcie_do_recovery() calls
> > reset_link(), all devices on the other side of the link are reset.
> >
> > > > We should be able to field one (or a few) AER errors, reset the
> > > > device, and you should be able to use the shell in the kdump kernel.
> > > >
> > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > cannot see what they are typing.
> >
> > Right, that's what I expect.  If the PCI core resets the device, you
> > should get just a few AER errors, and they should stop after the
> > device is reset.
> >
> > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > >
> > > > > II) Crash testing using default root file system: Specific case to
> > > > > test Ethernet driver in second kernel
> > > > >    -  Default root file system have Ethernet driver
> > > > >    -  AER error comes even before the driver probe starts.
> > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > >    -  AER also tries to recover. but no success.  [2]
> > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > >           than different set of errors come which also never able to recover [3]
> > > > >
> > >
> > > Please suggest your view on this case. Here the driver is present.
> > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > In this case AER errors starts even before driver probe starts.
> > > After probe, driver does the device reset with no success and even AER
> > > recovery does not work.
> >
> > This case should be the same as the one above.  If we can change the
> > PCI core so it can reset the device when there's no driver,  that would
> > apply to case I (where there will never be a driver) and to case II
> > (where there is no driver now, but a driver will probe the device
> > later).
> 
> Does this mean changes are required in the PCI core?

Yes, I am suggesting that the PCI core does not do the right thing
here.

> I tried the following changes in pcie_do_recovery(), but it did not help.
> Same error as before.
> 
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
>         pci_info(dev, "broadcast resume message\n");
>         pci_walk_bus(bus, report_resume, &status);
> @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>         return status;
> 
>  failed:
>         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> +       pci_reset_function(dev);
> +       pci_aer_clear_device_status(dev);
> +       pci_aer_clear_nonfatal_status(dev);

Did you confirm that this resets the devices in question (0000:09:00.0
and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
PM, etc)?
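
For reference, a rough sketch of the fallback order the PCI core walks
when it resets a function (modeled on a v5.7-era
__pci_reset_function_locked(); exact helpers and ordering may differ
between kernel versions):

        /* Sketch only: each method is tried until one accepts the
         * device, i.e. returns something other than -ENOTTY.
         */
        rc = pci_dev_specific_reset(dev, 0);    /* per-device quirks */
        if (rc != -ENOTTY)
                return rc;
        if (pcie_has_flr(dev)) {                /* PCIe Function Level Reset */
                rc = pcie_flr(dev);
                if (rc != -ENOTTY)
                        return rc;
        }
        rc = pci_af_flr(dev, 0);                /* Advanced Features FLR */
        if (rc != -ENOTTY)
                return rc;
        rc = pci_pm_reset(dev, 0);              /* D3hot -> D0 transition */
        if (rc != -ENOTTY)
                return rc;
        /* ...then slot reset and secondary (parent) bus reset */

Which mechanism gets used depends on what the endpoint advertises;
"lspci -vv" should show FLReset+ in DevCap when FLR is available.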

Case I is using APEI, and it looks like that can queue up 16 errors
(AER_RECOVER_RING_SIZE), so that queue could be completely full before
we even get a chance to reset the device.  But I would think that the
reset should *eventually* stop the errors, even though we might log
30+ of them first.

As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
see if it reduces the logging.
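
A minimal way to run that experiment (assuming the define still lives in
drivers/pci/pcie/aer.c as in v5.7; kfifo sizes must be a power of two,
which both 1 and 2 are):

-#define AER_RECOVER_RING_SIZE          16
+#define AER_RECOVER_RING_SIZE          2

Note this only limits how many events the GHES path can queue at once
(overflowing events are dropped with the one-line "Buffer overflow when
recovering AER" message); it changes the logging, not the underlying
aborted DMA.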

> > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > during enumeration phase of kdump kernel.
> > > can we thought of doing pci_reset_function for all devices in kdump
> > > kernel or device specific quirk.
> > >
> > > --pk
> > >
> > >
> > > > > As per my understanding, possible solutions are
> > > > >  - Copy SMMU table i.e. this patch
> > > > > OR
> > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > same AER error again.
> > > > >
> > > > >
> > > > > -pk
> > > > >
> > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > [1] with bootargs having pci=noaer
> > > > >
> > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > Hardware Error Source: 1
> > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > aer_mask: 0x00000000
> > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > aer_agent=Requester ID
> > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > total mem (8153768 kB)
> > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > aer_mask: 0x00000000
> > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > aer_agent=Requester ID
> > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)

<snip>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-27 20:18             ` Bjorn Helgaas
@ 2020-05-29 14:18               ` Prabhakar Kushwaha
  2020-05-29 19:33                 ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Prabhakar Kushwaha @ 2020-05-29 14:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

Hi Bjorn,

On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > via this new table.
> > > > > > > >
> > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > it will be aborted by SMMU.
> > > > > > > >
> > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > >
> > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > >
> > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > as it's not memory used by the kdump kernel.
> > > > > >
> > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > setup present at this moment.
> > > > >
> > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > or at some other unpredictable time.
> > > > >
> > > > > > Solution of this issue should be at 2 place
> > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > issue till finally driver's probe takeover.
> > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > able to recover.
> > > > > >
> > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > A network device hang is observed even after continuous
> > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > >
> > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > special cases in every IOMMU driver.
> > > > > > >
> > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > (non-crash) situations as well.
> > > > > > >
> > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > always remain usable.
> > > > > >
> > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > 82576 Gigabit Network card.
> > > > > >
> > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > >     -  kdump file system does not have Ethernet driver
> > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > of kdump root file system.
> > > > >
> > > > > In this case, I think report_error_detected() is deciding that because
> > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > this:
> > > > >
> > > > >   aer_recover_work_func               # aer_recover_work
> > > > >     kfifo_get(aer_recover_ring, entry)
> > > > >     dev = pci_get_domain_bus_and_slot
> > > > >     cper_print_aer(dev, ...)
> > > > >       pci_err("AER: aer_status:")
> > > > >       pci_err("AER:   [14] CmpltTO")
> > > > >       pci_err("AER: aer_layer=")
> > > > >     if (AER_NONFATAL)
> > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > >         status = CAN_RECOVER
> > > > >         pci_walk_bus(report_normal_detected)
> > > > >           report_error_detected
> > > > >             if (!dev->driver)
> > > > >               vote = NO_AER_DRIVER
> > > > >               pci_info("can't recover (no error_detected callback)")
> > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > >             # always NO_AER_DRIVER
> > > > >         status is now NO_AER_DRIVER
> > > > >
> > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > and status is not RECOVERED, so it skips .resume().
> > > > >
> > > > > I don't remember the history there, but if a device has no driver and
> > > > > the device generates errors, it seems like we ought to be able to
> > > > > reset it.
> > > >
> > > > But how to reset the device considering there is no driver.
> > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > perform reset at PCIe level.
> > >
> > > I don't understand your question.  The PCI core (not the device
> > > driver) already does the reset.  When pcie_do_recovery() calls
> > > reset_link(), all devices on the other side of the link are reset.
> > >
> > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > >
> > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > cannot see what they are typing.
> > >
> > > Right, that's what I expect.  If the PCI core resets the device, you
> > > should get just a few AER errors, and they should stop after the
> > > device is reset.
> > >
> > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > >
> > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > test Ethernet driver in second kernel
> > > > > >    -  Default root file system have Ethernet driver
> > > > > >    -  AER error comes even before the driver probe starts.
> > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > >           than different set of errors come which also never able to recover [3]
> > > > > >
> > > >
> > > > Please suggest your view on this case. Here driver is preset.
> > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > In this case AER errors starts even before driver probe starts.
> > > > After probe, driver does the device reset with no success and even AER
> > > > recovery does not work.
> > >
> > > This case should be the same as the one above.  If we can change the
> > > PCI core so it can reset the device when there's no driver,  that would
> > > apply to case I (where there will never be a driver) and to case II
> > > (where there is no driver now, but a driver will probe the device
> > > later).
> >
> > Does this means change are required in PCI core.
>
> Yes, I am suggesting that the PCI core does not do the right thing
> here.
>
> > I tried following changes in pcie_do_recovery() but it did not help.
> > Same error as before.
> >
> > -- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> >         pci_info(dev, "broadcast resume message\n");
> >         pci_walk_bus(bus, report_resume, &status);
> > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >         return status;
> >
> >  failed:
> >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > +       pci_reset_function(dev);
> > +       pci_aer_clear_device_status(dev);
> > +       pci_aer_clear_nonfatal_status(dev);
>
> Did you confirm that this resets the devices in question (0000:09:00.0
> and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> PM, etc)?
>

Earlier the reset was happening on the P2P bridge (0000:00:09.0), which
is why it had no effect. After making the following changes, both
devices are now getting reset.
Both devices are reset via FLR.

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 117c0a2b2ba4..26b908f55aef 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no error_detected callback)\n");
+
+                       pci_save_state(dev);
+                       pci_cfg_access_lock(dev);
+
+                       /* Quiesce the device completely */
+                       pci_write_config_word(dev, PCI_COMMAND,
+                             PCI_COMMAND_INTX_DISABLE);
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "recovered via pci level reset\n");
+                       }
+
+                       pci_cfg_access_unlock(dev);
+                       pci_restore_state(dev);
                } else {
                        vote = PCI_ERS_RESULT_NONE;
                }

In order to take care of case 2 (the driver comes along later), the
following code needs to be added to avoid a crash during igb_probe().
It looks like a race condition between AER recovery and igb_probe().

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b46bff8fe056..c48f0a54bb95 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3012,6 +3012,11 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
        /* Catch broken hardware that put the wrong VF device ID in
         * the PCIe SR-IOV capability.
         */
+       if (pci_dev_trylock(pdev)) {
+               mdelay(1000);
+               pci_info(pdev,"device is locked, try waiting 1 sec\n");
+       }
+
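
Instead of the unconditional mdelay() hack above, the probe could poll
for the device lock so that it only waits while error recovery actually
holds the device. A minimal sketch, assuming pci_dev_trylock() and
pci_dev_unlock() are usable from the driver (as in the hack above); the
helper name and its placement are hypothetical:

#include <linux/delay.h>
#include <linux/pci.h>

/* Illustrative only: back off while error recovery holds the device. */
static void igb_wait_for_err_recovery(struct pci_dev *pdev)
{
        int retries = 100;

        while (retries--) {
                if (pci_dev_trylock(pdev)) {
                        /* nothing (e.g. AER recovery) holds the device */
                        pci_dev_unlock(pdev);
                        return;
                }
                msleep(10);     /* recovery still running, poll again */
        }
        pci_info(pdev, "timed out waiting for error recovery\n");
}

With something like this, igb_probe() would sleep only as long as
recovery is actually in progress instead of a fixed one second.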

Here are the observations with all of the above changes:
A) AER errors are fewer, but they are still there for both case 1 (no
driver at all) and case 2 (driver comes after some time).
B) Each AER error (NON_FATAL) causes both devices to reset. This happens many times.
C) After that, AER errors [1] come only for device 0000:09:00.0. This
is strange, as this PCI device is not being used during the test;
ping/ssh are happening over 0000:09:01.0.
D) If I wait some more time, no more AER errors come from any device.
E) Ping is working fine in case 2.

09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
Connection (rev 01)
09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
Connection (rev 01)

# lspci -t -v

 \-[0000:00]-+-00.0  Cavium, Inc. CN99xx [ThunderX2] Integrated PCI Host bridge
             +-01.0-[01]--
             +-02.0-[02]--
             +-03.0-[03]--
             +-04.0-[04]--
             +-05.0-[05]--+-00.0  Broadcom Inc. and subsidiaries
BCM57840 NetXtreme II 10 Gigabit Ethernet
             |            \-00.1  Broadcom Inc. and subsidiaries
BCM57840 NetXtreme II 10 Gigabit Ethernet
             +-06.0-[06]--
             +-07.0-[07]--
             +-08.0-[08]--
             +-09.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit
Network Connection
             |               \-00.1  Intel Corporation 82576 Gigabit
Network Connection


[1] AER error which comes for 09:00.0:

[   81.659825] {7}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   81.668080] {7}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   81.676503] {7}[Hardware Error]: event severity: corrected
[   81.681975] {7}[Hardware Error]:  Error 0, type: corrected
[   81.687447] {7}[Hardware Error]:   section_type: PCIe error
[   81.693004] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   81.698908] {7}[Hardware Error]:   version: 3.0
[   81.703424] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   81.709589] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   81.715059] {7}[Hardware Error]:   slot: 0
[   81.719141] {7}[Hardware Error]:   secondary_bus: 0x00
[   81.724265] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   81.730864] {7}[Hardware Error]:   class_code: 000002
[   81.735901] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   81.742587] {7}[Hardware Error]:  Error 1, type: corrected
[   81.748058] {7}[Hardware Error]:   section_type: PCIe error
[   81.753615] {7}[Hardware Error]:   port_type: 4, root port
[   81.759086] {7}[Hardware Error]:   version: 3.0
[   81.763602] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   81.769767] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   81.775237] {7}[Hardware Error]:   slot: 0
[   81.779319] {7}[Hardware Error]:   secondary_bus: 0x09
[   81.784442] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   81.791041] {7}[Hardware Error]:   class_code: 000406
[   81.796078] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   81.803806] {7}[Hardware Error]:  Error 2, type: corrected
[   81.809276] {7}[Hardware Error]:   section_type: PCIe error
[   81.814834] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   81.820738] {7}[Hardware Error]:   version: 3.0
[   81.825254] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   81.831419] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   81.836889] {7}[Hardware Error]:   slot: 0
[   81.840971] {7}[Hardware Error]:   secondary_bus: 0x00
[   81.846094] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   81.852693] {7}[Hardware Error]:   class_code: 000002
[   81.857730] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   81.864416] {7}[Hardware Error]:  Error 3, type: corrected
[   81.869886] {7}[Hardware Error]:   section_type: PCIe error
[   81.875444] {7}[Hardware Error]:   port_type: 4, root port
[   81.880914] {7}[Hardware Error]:   version: 3.0
[   81.885430] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   81.891595] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   81.897066] {7}[Hardware Error]:   slot: 0
[   81.901147] {7}[Hardware Error]:   secondary_bus: 0x09
[   81.906271] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   81.912870] {7}[Hardware Error]:   class_code: 000406
[   81.917906] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   81.925634] {7}[Hardware Error]:  Error 4, type: corrected
[   81.931104] {7}[Hardware Error]:   section_type: PCIe error
[   81.936662] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   81.942566] {7}[Hardware Error]:   version: 3.0
[   81.947082] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   81.953247] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   81.958717] {7}[Hardware Error]:   slot: 0
[   81.962799] {7}[Hardware Error]:   secondary_bus: 0x00
[   81.967923] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   81.974522] {7}[Hardware Error]:   class_code: 000002
[   81.979558] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   81.986244] {7}[Hardware Error]:  Error 5, type: corrected
[   81.991715] {7}[Hardware Error]:   section_type: PCIe error
[   81.997272] {7}[Hardware Error]:   port_type: 4, root port
[   82.002743] {7}[Hardware Error]:   version: 3.0
[   82.007259] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.013424] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.018894] {7}[Hardware Error]:   slot: 0
[   82.022976] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.028099] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.034698] {7}[Hardware Error]:   class_code: 000406
[   82.039735] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.047463] {7}[Hardware Error]:  Error 6, type: corrected
[   82.052933] {7}[Hardware Error]:   section_type: PCIe error
[   82.058491] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.064395] {7}[Hardware Error]:   version: 3.0
[   82.068911] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.075076] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.080547] {7}[Hardware Error]:   slot: 0
[   82.084628] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.089752] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.096351] {7}[Hardware Error]:   class_code: 000002
[   82.101387] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.108073] {7}[Hardware Error]:  Error 7, type: corrected
[   82.113544] {7}[Hardware Error]:   section_type: PCIe error
[   82.119101] {7}[Hardware Error]:   port_type: 4, root port
[   82.124572] {7}[Hardware Error]:   version: 3.0
[   82.129087] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.135252] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.140723] {7}[Hardware Error]:   slot: 0
[   82.144805] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.149928] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.156527] {7}[Hardware Error]:   class_code: 000406
[   82.161564] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.169291] {7}[Hardware Error]:  Error 8, type: corrected
[   82.174762] {7}[Hardware Error]:   section_type: PCIe error
[   82.180319] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.186224] {7}[Hardware Error]:   version: 3.0
[   82.190739] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.196904] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.202375] {7}[Hardware Error]:   slot: 0
[   82.206456] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.211580] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.218179] {7}[Hardware Error]:   class_code: 000002
[   82.223216] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.229901] {7}[Hardware Error]:  Error 9, type: corrected
[   82.235372] {7}[Hardware Error]:   section_type: PCIe error
[   82.240929] {7}[Hardware Error]:   port_type: 4, root port
[   82.246400] {7}[Hardware Error]:   version: 3.0
[   82.250916] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.257081] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.262551] {7}[Hardware Error]:   slot: 0
[   82.266633] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.271756] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.278355] {7}[Hardware Error]:   class_code: 000406
[   82.283392] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.291119] {7}[Hardware Error]:  Error 10, type: corrected
[   82.296676] {7}[Hardware Error]:   section_type: PCIe error
[   82.302234] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.308138] {7}[Hardware Error]:   version: 3.0
[   82.312654] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.318819] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.324290] {7}[Hardware Error]:   slot: 0
[   82.328371] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.333495] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.340094] {7}[Hardware Error]:   class_code: 000002
[   82.345131] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.351816] {7}[Hardware Error]:  Error 11, type: corrected
[   82.357374] {7}[Hardware Error]:   section_type: PCIe error
[   82.362931] {7}[Hardware Error]:   port_type: 4, root port
[   82.368402] {7}[Hardware Error]:   version: 3.0
[   82.372917] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.379082] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.384553] {7}[Hardware Error]:   slot: 0
[   82.388635] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.393758] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.400357] {7}[Hardware Error]:   class_code: 000406
[   82.405394] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.413121] {7}[Hardware Error]:  Error 12, type: corrected
[   82.418678] {7}[Hardware Error]:   section_type: PCIe error
[   82.424236] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.430140] {7}[Hardware Error]:   version: 3.0
[   82.434656] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.440821] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.446291] {7}[Hardware Error]:   slot: 0
[   82.450373] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.455497] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.462096] {7}[Hardware Error]:   class_code: 000002
[   82.467132] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.473818] {7}[Hardware Error]:  Error 13, type: corrected
[   82.479375] {7}[Hardware Error]:   section_type: PCIe error
[   82.484933] {7}[Hardware Error]:   port_type: 4, root port
[   82.490403] {7}[Hardware Error]:   version: 3.0
[   82.494919] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.501084] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.506555] {7}[Hardware Error]:   slot: 0
[   82.510636] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.515760] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.522359] {7}[Hardware Error]:   class_code: 000406
[   82.527395] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.535171] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.542476] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.550301] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.558032] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.566296] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.573597] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.581421] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.589151] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.597411] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.604711] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.612535] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.620271] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.628525] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.635826] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.643649] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.651385] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.659645] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.666940] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.674763] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.682498] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.690759] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.698053] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.705876] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.713612] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.721872] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.729167] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.736990] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.744725] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   88.059225] {8}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   88.067478] {8}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   88.075899] {8}[Hardware Error]: event severity: corrected
[   88.081370] {8}[Hardware Error]:  Error 0, type: corrected
[   88.086841] {8}[Hardware Error]:   section_type: PCIe error
[   88.092399] {8}[Hardware Error]:   port_type: 0, PCIe end point
[   88.098303] {8}[Hardware Error]:   version: 3.0
[   88.102819] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
[   88.108984] {8}[Hardware Error]:   device_id: 0000:09:00.0
[   88.114455] {8}[Hardware Error]:   slot: 0
[   88.118536] {8}[Hardware Error]:   secondary_bus: 0x00
[   88.123660] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   88.130259] {8}[Hardware Error]:   class_code: 000002
[   88.135296] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   88.141981] {8}[Hardware Error]:  Error 1, type: corrected
[   88.147452] {8}[Hardware Error]:   section_type: PCIe error
[   88.153009] {8}[Hardware Error]:   port_type: 4, root port
[   88.158480] {8}[Hardware Error]:   version: 3.0
[   88.162995] {8}[Hardware Error]:   command: 0x0106, status: 0x4010
[   88.169161] {8}[Hardware Error]:   device_id: 0000:00:09.0
[   88.174633] {8}[Hardware Error]:   slot: 0
[   88.180018] {8}[Hardware Error]:   secondary_bus: 0x09
[   88.185142] {8}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   88.191914] {8}[Hardware Error]:   class_code: 000406
[   88.196951] {8}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   88.204852] {8}[Hardware Error]:  Error 2, type: corrected
[   88.210323] {8}[Hardware Error]:   section_type: PCIe error
[   88.215881] {8}[Hardware Error]:   port_type: 0, PCIe end point
[   88.221786] {8}[Hardware Error]:   version: 3.0
[   88.226301] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
[   88.232466] {8}[Hardware Error]:   device_id: 0000:09:00.0
[   88.237937] {8}[Hardware Error]:   slot: 0
[   88.242019] {8}[Hardware Error]:   secondary_bus: 0x00
[   88.247142] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   88.253741] {8}[Hardware Error]:   class_code: 000002
[   88.258778] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   88.265509] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   88.272812] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   88.280635] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   88.288363] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   88.296622] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   88.305391] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID

> Case I is using APEI, and it looks like that can queue up 16 errors
> (AER_RECOVER_RING_SIZE), so that queue could be completely full before
> we even get a chance to reset the device.  But I would think that the
> reset should *eventually* stop the errors, even though we might log
> 30+ of them first.
>
> As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
> see if it reduces the logging.

I did not try this experiment. I believe it is not required now.

--pk

>
> > > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > > during enumeration phase of kdump kernel.
> > > > can we thought of doing pci_reset_function for all devices in kdump
> > > > kernel or device specific quirk.
> > > >
> > > > --pk
> > > >
> > > >
> > > > > > As per my understanding, possible solutions are
> > > > > >  - Copy SMMU table i.e. this patch
> > > > > > OR
> > > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > > same AER error again.
> > > > > >
> > > > > >
> > > > > > -pk
> > > > > >
> > > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > > [1] with bootargs having pci=noaer
> > > > > >
> > > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > > Hardware Error Source: 1
> > > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > aer_mask: 0x00000000
> > > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > aer_agent=Requester ID
> > > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > > total mem (8153768 kB)
> > > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > aer_mask: 0x00000000
> > > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > aer_agent=Requester ID
> > > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
>
> <snip>


* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-29 14:18               ` Prabhakar Kushwaha
@ 2020-05-29 19:33                 ` Bjorn Helgaas
  2020-06-03 17:42                   ` Prabhakar Kushwaha
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2020-05-29 19:33 UTC (permalink / raw)
  To: Prabhakar Kushwaha
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > via this new table..
> > > > > > > > >
> > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > it will be aborted by SMMU.
> > > > > > > > >
> > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > >
> > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > >
> > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > >
> > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > setup present at this moment.
> > > > > >
> > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > or at some other unpredictable time.
> > > > > >
> > > > > > > Solution of this issue should be at 2 place
> > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > issue till finally driver's probe takeover.
> > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > able to recover.
> > > > > > >
> > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > >
> > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > special cases in every IOMMU driver.
> > > > > > > >
> > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > (non-crash) situations as well.
> > > > > > > >
> > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > always remain usable.
> > > > > > >
> > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > 82576 Gigabit Network card.
> > > > > > >
> > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > of kdump root file system.
> > > > > >
> > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > this:
> > > > > >
> > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > >     cper_print_aer(dev, ...)
> > > > > >       pci_err("AER: aer_status:")
> > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > >       pci_err("AER: aer_layer=")
> > > > > >     if (AER_NONFATAL)
> > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > >         status = CAN_RECOVER
> > > > > >         pci_walk_bus(report_normal_detected)
> > > > > >           report_error_detected
> > > > > >             if (!dev->driver)
> > > > > >               vote = NO_AER_DRIVER
> > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > >             # always NO_AER_DRIVER
> > > > > >         status is now NO_AER_DRIVER
> > > > > >
> > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > >
> > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > reset it.
> > > > >
> > > > > But how to reset the device considering there is no driver.
> > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > perform reset at PCIe level.
> > > >
> > > > I don't understand your question.  The PCI core (not the device
> > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > reset_link(), all devices on the other side of the link are reset.
> > > >
> > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > >
> > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > cannot see what they are typing.
> > > >
> > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > should get just a few AER errors, and they should stop after the
> > > > device is reset.
> > > >
> > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > >
> > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > test Ethernet driver in second kernel
> > > > > > >    -  Default root file system have Ethernet driver
> > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > >
> > > > >
> > > > > Please suggest your view on this case. Here driver is preset.
> > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > In this case AER errors starts even before driver probe starts.
> > > > > After probe, driver does the device reset with no success and even AER
> > > > > recovery does not work.
> > > >
> > > > This case should be the same as the one above.  If we can change the
> > > > PCI core so it can reset the device when there's no driver,  that would
> > > > apply to case I (where there will never be a driver) and to case II
> > > > (where there is no driver now, but a driver will probe the device
> > > > later).
> > >
> > > Does this means change are required in PCI core.
> >
> > Yes, I am suggesting that the PCI core does not do the right thing
> > here.
> >
> > > I tried following changes in pcie_do_recovery() but it did not help.
> > > Same error as before.
> > >
> > > -- a/drivers/pci/pcie/err.c
> > > +++ b/drivers/pci/pcie/err.c
> > >         pci_info(dev, "broadcast resume message\n");
> > >         pci_walk_bus(bus, report_resume, &status);
> > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > >         return status;
> > >
> > >  failed:
> > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > +       pci_reset_function(dev);
> > > +       pci_aer_clear_device_status(dev);
> > > +       pci_aer_clear_nonfatal_status(dev);
> >
> > Did you confirm that this resets the devices in question (0000:09:00.0
> > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > PM, etc)?
> 
> Earlier the reset was happening on the P2P bridge (0000:00:09.0), which
> is why it had no effect. After making the following changes, both
> devices are now getting reset.
> Both devices are reset via FLR.
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 117c0a2b2ba4..26b908f55aef 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
>                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>                         pci_info(dev, "can't recover (no error_detected callback)\n");
> +
> +                       pci_save_state(dev);
> +                       pci_cfg_access_lock(dev);
> +
> +                       /* Quiesce the device completely */
> +                       pci_write_config_word(dev, PCI_COMMAND,
> +                             PCI_COMMAND_INTX_DISABLE);
> +                       if (!__pci_reset_function_locked(dev)) {
> +                               vote = PCI_ERS_RESULT_RECOVERED;
> +                               pci_info(dev, "recovered via pci level reset\n");
> +                       }

Why do we need to save the state and quiesce the device?  The reset
should disable interrupts anyway.  In this particular case where
there's no driver, I don't think we should have to restore the state.
We maybe should *remove* the device and re-enumerate it after the
reset, but the state from before the reset should be irrelevant.
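
Roughly what I have in mind, as an untested sketch (assuming it runs
from process context, e.g. the recovery path; this is not actual
kernel code):

#include <linux/pci.h>

static void reset_and_reenumerate(struct pci_dev *dev)
{
        struct pci_bus *parent = dev->bus;      /* grab before removal */

        pci_reset_function(dev);                /* FLR/PM/etc., whatever works */

        pci_lock_rescan_remove();
        pci_stop_and_remove_bus_device(dev);    /* forget the stale state */
        pci_rescan_bus(parent);                 /* rediscover the function */
        pci_unlock_rescan_remove();
}

That way the device comes back with freshly enumerated config state,
and nothing saved from before the reset needs to be restored.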

> +                       pci_cfg_access_unlock(dev);
> +                       pci_restore_state(dev);
>                 } else {
>                         vote = PCI_ERS_RESULT_NONE;
>                 }
> 
> In order to take care of case 2 (the driver comes along later), the
> following code needs to be added to avoid a crash during igb_probe().
> It looks like a race condition between AER recovery and igb_probe().
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index b46bff8fe056..c48f0a54bb95 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3012,6 +3012,11 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>         /* Catch broken hardware that put the wrong VF device ID in
>          * the PCIe SR-IOV capability.
>          */
> +       if (pci_dev_trylock(pdev)) {
> +               mdelay(1000);
> +               pci_info(pdev,"device is locked, try waiting 1 sec\n");
> +       }

This is interesting to learn about the AER/driver interaction, but of
course, we wouldn't want to add code like this permanently.

> Here are the observations with all of the above changes:
> A) AER errors are fewer, but they are still there for both case 1 (no
> driver at all) and case 2 (driver comes after some time).

We'll certainly get *some* AER errors.  We have to get one before we
know to reset the device.

> B) Each AER error (NON_FATAL) causes both devices to reset. This happens many times.

I'm not sure why we reset both devices.  Are we seeing errors from
both, or could we be more selective in the code?
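
For example (hypothetical sketch only, not the existing code): the
recovery path could reset just the function that reported the error
when it has no driver bound, rather than every driverless function
found by the bus walk:

#include <linux/pci.h>

static pci_ers_result_t reset_driverless_dev(struct pci_dev *dev)
{
        if (dev->driver)
                return PCI_ERS_RESULT_NONE;     /* let the driver handle it */

        if (pci_reset_function(dev) == 0) {
                pci_info(dev, "reset driverless device after AER error\n");
                return PCI_ERS_RESULT_RECOVERED;
        }

        return PCI_ERS_RESULT_DISCONNECT;
}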

> C) After that, AER errors [1] come only for device 0000:09:00.0. This
> is strange, as this PCI device is not being used during the test;
> ping/ssh are happening over 0000:09:01.0.
> D) If I wait some more time, no more AER errors come from any device.
> E) Ping is working fine in case 2.
> 
> 09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
> Connection (rev 01)
> 09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
> Connection (rev 01)
> 
> # lspci -t -v
> 
>  \-[0000:00]-+-00.0  Cavium, Inc. CN99xx [ThunderX2] Integrated PCI Host bridge
>              +-01.0-[01]--
>              +-02.0-[02]--
>              +-03.0-[03]--
>              +-04.0-[04]--
>              +-05.0-[05]--+-00.0  Broadcom Inc. and subsidiaries
> BCM57840 NetXtreme II 10 Gigabit Ethernet
>              |            \-00.1  Broadcom Inc. and subsidiaries
> BCM57840 NetXtreme II 10 Gigabit Ethernet
>              +-06.0-[06]--
>              +-07.0-[07]--
>              +-08.0-[08]--
>              +-09.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit
> Network Connection
>              |               \-00.1  Intel Corporation 82576 Gigabit
> Network Connection
> 
> 
> [1] AER error which comes for 09:00.0:
> 
> <snip>
> 
> > Case I is using APEI, and it looks like that can queue up 16 errors
> > (AER_RECOVER_RING_SIZE), so that queue could be completely full before
> > we even get a chance to reset the device.  But I would think that the
> > reset should *eventually* stop the errors, even though we might log
> > 30+ of them first.
> >
> > As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
> > see if it reduces the logging.
> 
> I did not try this experiment. I believe it is not required now.
> 
> --pk
> 
> >
> > > > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > > > during enumeration phase of kdump kernel.
> > > > > can we thought of doing pci_reset_function for all devices in kdump
> > > > > kernel or device specific quirk.
> > > > >
> > > > > --pk
> > > > >
> > > > >
> > > > > > > As per my understanding, possible solutions are
> > > > > > >  - Copy SMMU table i.e. this patch
> > > > > > > OR
> > > > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > > > same AER error again.
> > > > > > >
> > > > > > >
> > > > > > > -pk
> > > > > > >
> > > > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > > > [1] with bootargs having pci=noaer
> > > > > > >
> > > > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > > > Hardware Error Source: 1
> > > > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > aer_mask: 0x00000000
> > > > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > aer_agent=Requester ID
> > > > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > > > total mem (8153768 kB)
> > > > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > aer_mask: 0x00000000
> > > > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > aer_agent=Requester ID
> > > > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> >
> > <snip>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-05-29 19:33                 ` Bjorn Helgaas
@ 2020-06-03 17:42                   ` Prabhakar Kushwaha
  2020-06-04  0:02                     ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Prabhakar Kushwaha @ 2020-06-03 17:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

Hi Bjorn,

On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> > On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > > via this new table..
> > > > > > > > > >
> > > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > > it will be aborted by SMMU.
> > > > > > > > > >
> > > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > > >
> > > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > > >
> > > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > > >
> > > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > > setup present at this moment.
> > > > > > >
> > > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > > or at some other unpredictable time.
> > > > > > >
> > > > > > > > Solution of this issue should be at 2 place
> > > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > > issue till finally driver's probe takeover.
> > > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > > able to recover.
> > > > > > > >
> > > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > > >
> > > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > > special cases in every IOMMU driver.
> > > > > > > > >
> > > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > > (non-crash) situations as well.
> > > > > > > > >
> > > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > > always remain usable.
> > > > > > > >
> > > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > > 82576 Gigabit Network card.
> > > > > > > >
> > > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > > of kdump root file system.
> > > > > > >
> > > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > > this:
> > > > > > >
> > > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > > >     cper_print_aer(dev, ...)
> > > > > > >       pci_err("AER: aer_status:")
> > > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > > >       pci_err("AER: aer_layer=")
> > > > > > >     if (AER_NONFATAL)
> > > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > > >         status = CAN_RECOVER
> > > > > > >         pci_walk_bus(report_normal_detected)
> > > > > > >           report_error_detected
> > > > > > >             if (!dev->driver)
> > > > > > >               vote = NO_AER_DRIVER
> > > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > > >             # always NO_AER_DRIVER
> > > > > > >         status is now NO_AER_DRIVER
> > > > > > >
> > > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > > >
> > > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > > reset it.
> > > > > >
> > > > > > But how to reset the device considering there is no driver.
> > > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > > perform reset at PCIe level.
> > > > >
> > > > > I don't understand your question.  The PCI core (not the device
> > > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > > reset_link(), all devices on the other side of the link are reset.
> > > > >
> > > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > > >
> > > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > > cannot see what they are typing.
> > > > >
> > > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > > should get just a few AER errors, and they should stop after the
> > > > > device is reset.
> > > > >
> > > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > > >
> > > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > > test Ethernet driver in second kernel
> > > > > > > >    -  Default root file system have Ethernet driver
> > > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > > >
> > > > > >
> > > > > > Please suggest your view on this case. Here driver is preset.
> > > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > > In this case AER errors starts even before driver probe starts.
> > > > > > After probe, driver does the device reset with no success and even AER
> > > > > > recovery does not work.
> > > > >
> > > > > This case should be the same as the one above.  If we can change the
> > > > > PCI core so it can reset the device when there's no driver,  that would
> > > > > apply to case I (where there will never be a driver) and to case II
> > > > > (where there is no driver now, but a driver will probe the device
> > > > > later).
> > > >
> > > > Does this means change are required in PCI core.
> > >
> > > Yes, I am suggesting that the PCI core does not do the right thing
> > > here.
> > >
> > > > I tried following changes in pcie_do_recovery() but it did not help.
> > > > Same error as before.
> > > >
> > > > -- a/drivers/pci/pcie/err.c
> > > > +++ b/drivers/pci/pcie/err.c
> > > >         pci_info(dev, "broadcast resume message\n");
> > > >         pci_walk_bus(bus, report_resume, &status);
> > > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > > >         return status;
> > > >
> > > >  failed:
> > > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > > +       pci_reset_function(dev);
> > > > +       pci_aer_clear_device_status(dev);
> > > > +       pci_aer_clear_nonfatal_status(dev);
> > >
> > > Did you confirm that this resets the devices in question (0000:09:00.0
> > > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > > PM, etc)?
> >
> > Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
> > reason no effect. After making following changes,  both devices are
> > now getting reset.
> > Both devices are using FLR.
> >
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index 117c0a2b2ba4..26b908f55aef 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> >                         pci_info(dev, "can't recover (no
> > error_detected callback)\n");
> > +
> > +                       pci_save_state(dev);
> > +                       pci_cfg_access_lock(dev);
> > +
> > +                       /* Quiesce the device completely */
> > +                       pci_write_config_word(dev, PCI_COMMAND,
> > +                             PCI_COMMAND_INTX_DISABLE);
> > +                       if (!__pci_reset_function_locked(dev)) {
> > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > +                               pci_info(dev, "recovered via pci level
> > reset\n");
> > +                       }
>
> Why do we need to save the state and quiesce the device?  The reset
> should disable interrupts anyway.  In this particular case where
> there's no driver, I don't think we should have to restore the state.
> We maybe should *remove* the device and re-enumerate it after the
> reset, but the state from before the reset should be irrelevant.
>

I tried pci_reset_function_locked() without the save/restore and then got the
synchronous external abort during igb_probe (case 2, i.e. with the driver). This is
100% reproducible.
It looks like pci_reset_function_locked() leaves the PCI configuration
space in a random state. The same is mentioned here:
https://www.kernel.org/doc/html/latest/driver-api/pci/pci.html
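
In other words, condensed from the diff further below, the difference is
roughly this (sketch only):

    /* crashes later in igb_probe(): config space (BARs etc.) is left
     * at its post-reset values */
    __pci_reset_function_locked(dev);

    /* works: snapshot config space and restore it around the reset */
    pci_save_state(dev);
    __pci_reset_function_locked(dev);
    pci_restore_state(dev);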

[   16.492586] Internal error: synchronous external abort: 96000610 [#1] SMP
[   16.499362] Modules linked in: mpt3sas(+) igb(+) nvme nvme_core
raid_class scsi_transport_sas i2c_algo_bit mdio libcrc32c gpio_xlp
i2c_xlp9xx(+) uas usb_storage
[   16.513696] CPU: 0 PID: 477 Comm: systemd-udevd Not tainted 5.7.0-rc3+ #132
[   16.520644] Hardware name: Cavium Inc. Saber/Saber, BIOS
TX2-FW-Release-3.1-build_01-2803-g74253a541a mm/dd/yyyy
[   16.530805] pstate: 60400009 (nZCv daif +PAN -UAO)
[   16.535598] pc : igb_rd32+0x24/0xe0 [igb]
[   16.539603] lr : igb_get_invariants_82575+0xb0/0xde8 [igb]
[   16.545074] sp : ffffffc012e2b7e0
[   16.548375] x29: ffffffc012e2b7e0 x28: ffffffc008baa4d8
[   16.553674] x27: 0000000000000001 x26: ffffffc008b99a70
[   16.558972] x25: ffffff8cdef60900 x24: ffffff8cdef60e48
[   16.564270] x23: ffffff8cf30b50b0 x22: ffffffc011359988
[   16.569568] x21: ffffff8cdef612e0 x20: ffffff8cdef60e68
[   16.574866] x19: ffffffc0140a0018 x18: 0000000000000000
[   16.580164] x17: 0000000000000000 x16: 0000000000000000
[   16.585463] x15: 0000000000000000 x14: 0000000000000000
[   16.590761] x13: 0000000000000000 x12: 0000000000000000
[   16.596059] x11: ffffffc008b86b08 x10: 0000000000000000
[   16.601357] x9 : ffffffc008b88888 x8 : ffffffc008b81050
[   16.606655] x7 : 0000000000000000 x6 : ffffff8cdef611a8
[   16.611952] x5 : ffffffc008b887d8 x4 : ffffffc008ba7a68
[   16.617250] x3 : 0000000000000000 x2 : ffffffc0140a0000
[   16.622548] x1 : 0000000000000018 x0 : ffffff8cdef60e48
[   16.627846] Call trace:
[   16.630288]  igb_rd32+0x24/0xe0 [igb]
[   16.633943]  igb_get_invariants_82575+0xb0/0xde8 [igb]
[   16.639073]  igb_probe+0x264/0xed8 [igb]
[   16.642989]  local_pci_probe+0x48/0xb8
[   16.646727]  pci_device_probe+0x120/0x1b8
[   16.650735]  really_probe+0xe4/0x448
[   16.654298]  driver_probe_device+0xe8/0x140
[   16.658469]  device_driver_attach+0x7c/0x88
[   16.662638]  __driver_attach+0xac/0x178
[   16.666462]  bus_for_each_dev+0x7c/0xd0
[   16.670284]  driver_attach+0x2c/0x38
[   16.673846]  bus_add_driver+0x1a8/0x240
[   16.677670]  driver_register+0x6c/0x128
[   16.681492]  __pci_register_driver+0x4c/0x58
[   16.685754]  igb_init_module+0x64/0x1000 [igb]
[   16.690189]  do_one_initcall+0x54/0x228
[   16.694021]  do_init_module+0x60/0x240
[   16.697757]  load_module+0x1614/0x1970
[   16.701493]  __do_sys_finit_module+0xb4/0x118
[   16.705837]  __arm64_sys_finit_module+0x28/0x38
[   16.710367]  do_el0_svc+0xf8/0x1b8
[   16.713761]  el0_sync_handler+0x12c/0x20c
[   16.717757]  el0_sync+0x158/0x180
[   16.721062] Code: a90153f3 f9400402 b4000482 8b214053 (b9400273)
[   16.727144] ---[ end trace 95523d7d37f1d883 ]---
[   16.731748] Kernel panic - not syncing: Fatal exception
[   16.736962] Kernel Offset: disabled
[   16.740438] CPU features: 0x084002,22000c38
[   16.744607] Memory Limit: none

> > +                       pci_cfg_access_unlock(dev);
> > +                       pci_restore_state(dev);
> >                 } else {
> >                         vote = PCI_ERS_RESULT_NONE;
> >                 }
> >
> > in order to take care of case 2 (driver comes after sometime) ==>
> > following code needs to be added to avoid crash during igb_probe.  It
> > looks to be a race condition between AER and igb_probe().
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index b46bff8fe056..c48f0a54bb95 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -3012,6 +3012,11 @@ static int igb_probe(struct pci_dev *pdev,
> > const struct pci_device_id *ent)
> >         /* Catch broken hardware that put the wrong VF device ID in
> >          * the PCIe SR-IOV capability.
> >          */
> > +       if (pci_dev_trylock(pdev)) {
> > +               mdelay(1000);
> > +               pci_info(pdev,"device is locked, try waiting 1 sec\n");
> > +       }
>
> This is interesting to learn about the AER/driver interaction, but of
> course, we wouldn't want to add code like this permanently.
>
> > Here are the observation with all above changes
> > A) AER errors are less but they are still there for both case 1 (No
> > driver at all) and case 2 (driver comes after some time)
>
> We'll certainly get *some* AER errors.  We have to get one before we
> know to reset the device.
>
> > B) Each AER error(NON_FATAL) causes both devices to reset. It happens many times
>
> I'm not sure why we reset both devices.  Are we seeing errors from
> both, or could we be more selective in the code?
>

I tried even with a reset of 09:00.1 *only*, but again AER errors were
reported from 09:00.0 as mentioned in the previous mail.
So whether we reset one or both devices, AER errors from 09:00.0 are
inevitable. It is better to do a reset for all devices connected to the bus.

The following changes look to be working, with these observations for case
1 (no driver at all) and case 2 (driver comes after some time):
A) There are fewer AER errors.
B) For NON_FATAL AER errors both devices get reset.
C) A few AER errors (neither NON_FATAL nor FATAL) for 09:00.0 still
come. (Note this device is never used for networking in the primary
kernel.)
D) No action is taken for "C" as the changes below do not cover it.
E) No AER errors from any device after some time (at least 8-10 AER
errors, all from 09:00.0).
F) Ping/SSH is working fine in case 2 for the kdump kernel.

Please let me know your view. I can send a patch after detailed testing.

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..585a43b9c0da 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,19 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no error_detected callback)\n");
+
+                       pci_save_state(dev);
+                       pci_cfg_access_lock(dev);
+
+                       pci_write_config_word(dev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "Recovered via pci level reset\n");
+                       }
+
+                       pci_cfg_access_unlock(dev);
+                       pci_restore_state(dev);
                } else {
                        vote = PCI_ERS_RESULT_NONE;

--pk


> > C) After that AER errors [1] comes is only for device 0000:09:00.0.
> > This is strange as this pci device is not being used during test.
> > Ping/ssh are happening with 0000:09:01.0
> > D) If wait for some more time. No more AER errors from any device
> > E) Ping is working fine in case 2.
> >
> > 09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
> > Connection (rev 01)
> > 09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
> > Connection (rev 01)
> >
> > # lspci -t -v
> >
> >  \-[0000:00]-+-00.0  Cavium, Inc. CN99xx [ThunderX2] Integrated PCI Host bridge
> >              +-01.0-[01]--
> >              +-02.0-[02]--
> >              +-03.0-[03]--
> >              +-04.0-[04]--
> >              +-05.0-[05]--+-00.0  Broadcom Inc. and subsidiaries
> > BCM57840 NetXtreme II 10 Gigabit Ethernet
> >              |            \-00.1  Broadcom Inc. and subsidiaries
> > BCM57840 NetXtreme II 10 Gigabit Ethernet
> >              +-06.0-[06]--
> >              +-07.0-[07]--
> >              +-08.0-[08]--
> >              +-09.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit
> > Network Connection
> >              |               \-00.1  Intel Corporation 82576 Gigabit
> > Network Connection
> >
> >
> > [1] AER error which comes for 09:00.0:
> >
> > [   81.659825] {7}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   81.668080] {7}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   81.676503] {7}[Hardware Error]: event severity: corrected
> > [   81.681975] {7}[Hardware Error]:  Error 0, type: corrected
> > [   81.687447] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.693004] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   81.698908] {7}[Hardware Error]:   version: 3.0
> > [   81.703424] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   81.709589] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   81.715059] {7}[Hardware Error]:   slot: 0
> > [   81.719141] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   81.724265] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   81.730864] {7}[Hardware Error]:   class_code: 000002
> > [   81.735901] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   81.742587] {7}[Hardware Error]:  Error 1, type: corrected
> > [   81.748058] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.753615] {7}[Hardware Error]:   port_type: 4, root port
> > [   81.759086] {7}[Hardware Error]:   version: 3.0
> > [   81.763602] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   81.769767] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   81.775237] {7}[Hardware Error]:   slot: 0
> > [   81.779319] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   81.784442] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   81.791041] {7}[Hardware Error]:   class_code: 000406
> > [   81.796078] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   81.803806] {7}[Hardware Error]:  Error 2, type: corrected
> > [   81.809276] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.814834] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   81.820738] {7}[Hardware Error]:   version: 3.0
> > [   81.825254] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   81.831419] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   81.836889] {7}[Hardware Error]:   slot: 0
> > [   81.840971] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   81.846094] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   81.852693] {7}[Hardware Error]:   class_code: 000002
> > [   81.857730] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   81.864416] {7}[Hardware Error]:  Error 3, type: corrected
> > [   81.869886] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.875444] {7}[Hardware Error]:   port_type: 4, root port
> > [   81.880914] {7}[Hardware Error]:   version: 3.0
> > [   81.885430] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   81.891595] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   81.897066] {7}[Hardware Error]:   slot: 0
> > [   81.901147] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   81.906271] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   81.912870] {7}[Hardware Error]:   class_code: 000406
> > [   81.917906] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   81.925634] {7}[Hardware Error]:  Error 4, type: corrected
> > [   81.931104] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.936662] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   81.942566] {7}[Hardware Error]:   version: 3.0
> > [   81.947082] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   81.953247] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   81.958717] {7}[Hardware Error]:   slot: 0
> > [   81.962799] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   81.967923] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   81.974522] {7}[Hardware Error]:   class_code: 000002
> > [   81.979558] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   81.986244] {7}[Hardware Error]:  Error 5, type: corrected
> > [   81.991715] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.997272] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.002743] {7}[Hardware Error]:   version: 3.0
> > [   82.007259] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.013424] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.018894] {7}[Hardware Error]:   slot: 0
> > [   82.022976] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.028099] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.034698] {7}[Hardware Error]:   class_code: 000406
> > [   82.039735] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.047463] {7}[Hardware Error]:  Error 6, type: corrected
> > [   82.052933] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.058491] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.064395] {7}[Hardware Error]:   version: 3.0
> > [   82.068911] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.075076] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.080547] {7}[Hardware Error]:   slot: 0
> > [   82.084628] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.089752] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.096351] {7}[Hardware Error]:   class_code: 000002
> > [   82.101387] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.108073] {7}[Hardware Error]:  Error 7, type: corrected
> > [   82.113544] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.119101] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.124572] {7}[Hardware Error]:   version: 3.0
> > [   82.129087] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.135252] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.140723] {7}[Hardware Error]:   slot: 0
> > [   82.144805] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.149928] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.156527] {7}[Hardware Error]:   class_code: 000406
> > [   82.161564] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.169291] {7}[Hardware Error]:  Error 8, type: corrected
> > [   82.174762] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.180319] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.186224] {7}[Hardware Error]:   version: 3.0
> > [   82.190739] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.196904] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.202375] {7}[Hardware Error]:   slot: 0
> > [   82.206456] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.211580] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.218179] {7}[Hardware Error]:   class_code: 000002
> > [   82.223216] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.229901] {7}[Hardware Error]:  Error 9, type: corrected
> > [   82.235372] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.240929] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.246400] {7}[Hardware Error]:   version: 3.0
> > [   82.250916] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.257081] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.262551] {7}[Hardware Error]:   slot: 0
> > [   82.266633] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.271756] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.278355] {7}[Hardware Error]:   class_code: 000406
> > [   82.283392] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.291119] {7}[Hardware Error]:  Error 10, type: corrected
> > [   82.296676] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.302234] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.308138] {7}[Hardware Error]:   version: 3.0
> > [   82.312654] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.318819] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.324290] {7}[Hardware Error]:   slot: 0
> > [   82.328371] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.333495] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.340094] {7}[Hardware Error]:   class_code: 000002
> > [   82.345131] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.351816] {7}[Hardware Error]:  Error 11, type: corrected
> > [   82.357374] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.362931] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.368402] {7}[Hardware Error]:   version: 3.0
> > [   82.372917] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.379082] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.384553] {7}[Hardware Error]:   slot: 0
> > [   82.388635] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.393758] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.400357] {7}[Hardware Error]:   class_code: 000406
> > [   82.405394] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.413121] {7}[Hardware Error]:  Error 12, type: corrected
> > [   82.418678] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.424236] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.430140] {7}[Hardware Error]:   version: 3.0
> > [   82.434656] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.440821] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.446291] {7}[Hardware Error]:   slot: 0
> > [   82.450373] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.455497] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.462096] {7}[Hardware Error]:   class_code: 000002
> > [   82.467132] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.473818] {7}[Hardware Error]:  Error 13, type: corrected
> > [   82.479375] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.484933] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.490403] {7}[Hardware Error]:   version: 3.0
> > [   82.494919] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.501084] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.506555] {7}[Hardware Error]:   slot: 0
> > [   82.510636] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.515760] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.522359] {7}[Hardware Error]:   class_code: 000406
> > [   82.527395] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.535171] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.542476] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.550301] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.558032] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.566296] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.573597] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.581421] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.589151] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.597411] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.604711] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.612535] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.620271] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.628525] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.635826] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.643649] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.651385] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.659645] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.666940] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.674763] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.682498] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.690759] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.698053] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.705876] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.713612] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.721872] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.729167] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.736990] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.744725] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   88.059225] {8}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   88.067478] {8}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   88.075899] {8}[Hardware Error]: event severity: corrected
> > [   88.081370] {8}[Hardware Error]:  Error 0, type: corrected
> > [   88.086841] {8}[Hardware Error]:   section_type: PCIe error
> > [   88.092399] {8}[Hardware Error]:   port_type: 0, PCIe end point
> > [   88.098303] {8}[Hardware Error]:   version: 3.0
> > [   88.102819] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   88.108984] {8}[Hardware Error]:   device_id: 0000:09:00.0
> > [   88.114455] {8}[Hardware Error]:   slot: 0
> > [   88.118536] {8}[Hardware Error]:   secondary_bus: 0x00
> > [   88.123660] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   88.130259] {8}[Hardware Error]:   class_code: 000002
> > [   88.135296] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   88.141981] {8}[Hardware Error]:  Error 1, type: corrected
> > [   88.147452] {8}[Hardware Error]:   section_type: PCIe error
> > [   88.153009] {8}[Hardware Error]:   port_type: 4, root port
> > [   88.158480] {8}[Hardware Error]:   version: 3.0
> > [   88.162995] {8}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   88.169161] {8}[Hardware Error]:   device_id: 0000:00:09.0
> > [   88.174633] {8}[Hardware Error]:   slot: 0
> > [   88.180018] {8}[Hardware Error]:   secondary_bus: 0x09
> > [   88.185142] {8}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   88.191914] {8}[Hardware Error]:   class_code: 000406
> > [   88.196951] {8}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   88.204852] {8}[Hardware Error]:  Error 2, type: corrected
> > [   88.210323] {8}[Hardware Error]:   section_type: PCIe error
> > [   88.215881] {8}[Hardware Error]:   port_type: 0, PCIe end point
> > [   88.221786] {8}[Hardware Error]:   version: 3.0
> > [   88.226301] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   88.232466] {8}[Hardware Error]:   device_id: 0000:09:00.0
> > [   88.237937] {8}[Hardware Error]:   slot: 0
> > [   88.242019] {8}[Hardware Error]:   secondary_bus: 0x00
> > [   88.247142] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   88.253741] {8}[Hardware Error]:   class_code: 000002
> > [   88.258778] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   88.265509] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   88.272812] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   88.280635] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   88.288363] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   88.296622] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   88.305391] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> >
> > > Case I is using APEI, and it looks like that can queue up 16 errors
> > > (AER_RECOVER_RING_SIZE), so that queue could be completely full before
> > > we even get a chance to reset the device.  But I would think that the
> > > reset should *eventually* stop the errors, even though we might log
> > > 30+ of them first.
> > >
> > > As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
> > > see if it reduces the logging.
> >
> > Did not tried this experiment. I believe it is not required now
> >
> > --pk
> >
> > >
> > > > > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > > > > during enumeration phase of kdump kernel.
> > > > > > can we thought of doing pci_reset_function for all devices in kdump
> > > > > > kernel or device specific quirk.
> > > > > >
> > > > > > --pk
> > > > > >
> > > > > >
> > > > > > > > As per my understanding, possible solutions are
> > > > > > > >  - Copy SMMU table i.e. this patch
> > > > > > > > OR
> > > > > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > > > > same AER error again.
> > > > > > > >
> > > > > > > >
> > > > > > > > -pk
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > > > > [1] with bootargs having pci=noaer
> > > > > > > >
> > > > > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > > > > Hardware Error Source: 1
> > > > > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > > aer_mask: 0x00000000
> > > > > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > > aer_agent=Requester ID
> > > > > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > > > > total mem (8153768 kB)
> > > > > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > > aer_mask: 0x00000000
> > > > > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > > aer_agent=Requester ID
> > > > > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > >
> > > <snip>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-06-03 17:42                   ` Prabhakar Kushwaha
@ 2020-06-04  0:02                     ` Bjorn Helgaas
  2020-06-07  8:30                       ` Prabhakar Kushwaha
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2020-06-04  0:02 UTC (permalink / raw)
  To: Prabhakar Kushwaha
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

On Wed, Jun 03, 2020 at 11:12:48PM +0530, Prabhakar Kushwaha wrote:
> On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> > > On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >
> > > > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > > > via this new table..
> > > > > > > > > > >
> > > > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > > > it will be aborted by SMMU.
> > > > > > > > > > >
> > > > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > > > >
> > > > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > > > >
> > > > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > > > >
> > > > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > > > setup present at this moment.
> > > > > > > >
> > > > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > > > or at some other unpredictable time.
> > > > > > > >
> > > > > > > > > Solution of this issue should be at 2 place
> > > > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > > > issue till finally driver's probe takeover.
> > > > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > > > able to recover.
> > > > > > > > >
> > > > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > > > >
> > > > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > > > special cases in every IOMMU driver.
> > > > > > > > > >
> > > > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > > > (non-crash) situations as well.
> > > > > > > > > >
> > > > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > > > always remain usable.
> > > > > > > > >
> > > > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > > > 82576 Gigabit Network card.
> > > > > > > > >
> > > > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > > > of kdump root file system.
> > > > > > > >
> > > > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > > > this:
> > > > > > > >
> > > > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > > > >     cper_print_aer(dev, ...)
> > > > > > > >       pci_err("AER: aer_status:")
> > > > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > > > >       pci_err("AER: aer_layer=")
> > > > > > > >     if (AER_NONFATAL)
> > > > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > > > >         status = CAN_RECOVER
> > > > > > > >         pci_walk_bus(report_normal_detected)
> > > > > > > >           report_error_detected
> > > > > > > >             if (!dev->driver)
> > > > > > > >               vote = NO_AER_DRIVER
> > > > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > > > >             # always NO_AER_DRIVER
> > > > > > > >         status is now NO_AER_DRIVER
> > > > > > > >
> > > > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > > > >
> > > > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > > > reset it.
> > > > > > >
> > > > > > > But how to reset the device considering there is no driver.
> > > > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > > > perform reset at PCIe level.
> > > > > >
> > > > > > I don't understand your question.  The PCI core (not the device
> > > > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > > > reset_link(), all devices on the other side of the link are reset.
> > > > > >
> > > > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > > > >
> > > > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > > > cannot see what they are typing.
> > > > > >
> > > > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > > > should get just a few AER errors, and they should stop after the
> > > > > > device is reset.
> > > > > >
> > > > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > > > >
> > > > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > > > test Ethernet driver in second kernel
> > > > > > > > >    -  Default root file system have Ethernet driver
> > > > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > > > >
> > > > > > >
> > > > > > > Please suggest your view on this case. Here driver is preset.
> > > > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > > > In this case AER errors starts even before driver probe starts.
> > > > > > > After probe, driver does the device reset with no success and even AER
> > > > > > > recovery does not work.
> > > > > >
> > > > > > This case should be the same as the one above.  If we can change the
> > > > > > PCI core so it can reset the device when there's no driver,  that would
> > > > > > apply to case I (where there will never be a driver) and to case II
> > > > > > (where there is no driver now, but a driver will probe the device
> > > > > > later).
> > > > >
> > > > > Does this means change are required in PCI core.
> > > >
> > > > Yes, I am suggesting that the PCI core does not do the right thing
> > > > here.
> > > >
> > > > > I tried following changes in pcie_do_recovery() but it did not help.
> > > > > Same error as before.
> > > > >
> > > > > -- a/drivers/pci/pcie/err.c
> > > > > +++ b/drivers/pci/pcie/err.c
> > > > >         pci_info(dev, "broadcast resume message\n");
> > > > >         pci_walk_bus(bus, report_resume, &status);
> > > > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > > > >         return status;
> > > > >
> > > > >  failed:
> > > > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > > > +       pci_reset_function(dev);
> > > > > +       pci_aer_clear_device_status(dev);
> > > > > +       pci_aer_clear_nonfatal_status(dev);
> > > >
> > > > Did you confirm that this resets the devices in question (0000:09:00.0
> > > > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > > > PM, etc)?
> > >
> > > Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
> > > reason no effect. After making following changes,  both devices are
> > > now getting reset.
> > > Both devices are using FLR.
> > >
> > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > > index 117c0a2b2ba4..26b908f55aef 100644
> > > --- a/drivers/pci/pcie/err.c
> > > +++ b/drivers/pci/pcie/err.c
> > > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> > >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> > >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> > >                         pci_info(dev, "can't recover (no
> > > error_detected callback)\n");
> > > +
> > > +                       pci_save_state(dev);
> > > +                       pci_cfg_access_lock(dev);
> > > +
> > > +                       /* Quiesce the device completely */
> > > +                       pci_write_config_word(dev, PCI_COMMAND,
> > > +                             PCI_COMMAND_INTX_DISABLE);
> > > +                       if (!__pci_reset_function_locked(dev)) {
> > > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > > +                               pci_info(dev, "recovered via pci level
> > > reset\n");
> > > +                       }
> >
> > Why do we need to save the state and quiesce the device?  The reset
> > should disable interrupts anyway.  In this particular case where
> > there's no driver, I don't think we should have to restore the state.
> > We maybe should *remove* the device and re-enumerate it after the
> > reset, but the state from before the reset should be irrelevant.
> 
> I tried pci_reset_function_locked without save/restore then I got the
> synchronous abort during igb_probe (case 2 i.e. with driver). This is
> 100% reproducible.
> looks like pci_reset_function_locked is causing PCI configuration
> space random. Same is mentioned here
> https://www.kernel.org/doc/html/latest/driver-api/pci/pci.html

That documentation is poorly worded.  A reset doesn't make the
contents of config space "random," but of course it sets config space
registers to their initialization values, including things like the
device BARs.  After a reset, the device BARs are zero, so it won't
respond at the address we expect, and I'm sure that's what's causing
the external abort.
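
To make that concrete (just a sketch, not something to merge): after the
reset the BARs read back as zero, so nothing decodes the old MMIO range
any more:

    u32 bar0;

    __pci_reset_function_locked(dev);
    pci_read_config_dword(dev, PCI_BASE_ADDRESS_0, &bar0);
    /* bar0 is now 0, so the device no longer claims its old MMIO
     * window and a readl() through the old mapping aborts */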

So I guess we *do* need to save the state before the reset and restore
it (either that or enumerate the device from scratch just like we
would if it had been hot-added).  I'm not really thrilled with trying
to save the state after the device has already reported an error.  I'd
rather do it earlier, maybe during enumeration, like in
pci_init_capabilities().  But I don't understand all the subtleties of
dev->state_saved, so that requires some legwork.
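
Something like this is what I have in mind, purely as a sketch of the
idea (it glosses over dev->state_saved completely):

    /* In drivers/pci/probe.c, roughly: */
    static void pci_init_capabilities(struct pci_dev *dev)
    {
            /* ... existing capability setup ... */

            /* Keep a pristine snapshot of config space so error
             * recovery can restore BARs etc. after resetting a
             * device that has no driver. */
            pci_save_state(dev);
    }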

I don't think we should set INTX_DISABLE; the reset will make whatever
we do with it irrelevant anyway.

Remind me why the pci_cfg_access_lock()?

Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-06-04  0:02                     ` Bjorn Helgaas
@ 2020-06-07  8:30                       ` Prabhakar Kushwaha
  2020-06-11 23:03                         ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Prabhakar Kushwaha @ 2020-06-07  8:30 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

Hi Bjorn,

On Thu, Jun 4, 2020 at 5:32 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jun 03, 2020 at 11:12:48PM +0530, Prabhakar Kushwaha wrote:
> > On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> > > > On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >
> > > > > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > > > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > > > > via this new table..
> > > > > > > > > > > >
> > > > > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > > > > it will be aborted by SMMU.
> > > > > > > > > > > >
> > > > > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > > > > >
> > > > > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > > > > >
> > > > > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > > > > >
> > > > > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > > > > setup present at this moment.
> > > > > > > > >
> > > > > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > > > > or at some other unpredictable time.
> > > > > > > > >
> > > > > > > > > > Solution of this issue should be at 2 place
> > > > > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > > > > issue till finally driver's probe takeover.
> > > > > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > > > > able to recover.
> > > > > > > > > >
> > > > > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > > > > >
> > > > > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > > > > special cases in every IOMMU driver.
> > > > > > > > > > >
> > > > > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > > > > (non-crash) situations as well.
> > > > > > > > > > >
> > > > > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > > > > always remain usable.
> > > > > > > > > >
> > > > > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > > > > 82576 Gigabit Network card.
> > > > > > > > > >
> > > > > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > > > > of kdump root file system.
> > > > > > > > >
> > > > > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > > > > this:
> > > > > > > > >
> > > > > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > > > > >     cper_print_aer(dev, ...)
> > > > > > > > >       pci_err("AER: aer_status:")
> > > > > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > > > > >       pci_err("AER: aer_layer=")
> > > > > > > > >     if (AER_NONFATAL)
> > > > > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > > > > >         status = CAN_RECOVER
> > > > > > > > >         pci_walk_bus(report_normal_detected)
> > > > > > > > >           report_error_detected
> > > > > > > > >             if (!dev->driver)
> > > > > > > > >               vote = NO_AER_DRIVER
> > > > > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > > > > >             # always NO_AER_DRIVER
> > > > > > > > >         status is now NO_AER_DRIVER
> > > > > > > > >
> > > > > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > > > > >
> > > > > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > > > > reset it.
> > > > > > > >
> > > > > > > > But how to reset the device considering there is no driver.
> > > > > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > > > > perform reset at PCIe level.
> > > > > > >
> > > > > > > I don't understand your question.  The PCI core (not the device
> > > > > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > > > > reset_link(), all devices on the other side of the link are reset.
> > > > > > >
> > > > > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > > > > >
> > > > > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > > > > cannot see what they are typing.
> > > > > > >
> > > > > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > > > > should get just a few AER errors, and they should stop after the
> > > > > > > device is reset.
> > > > > > >
> > > > > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > > > > >
> > > > > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > > > > test Ethernet driver in second kernel
> > > > > > > > > >    -  Default root file system have Ethernet driver
> > > > > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > > > > >
> > > > > > > >
> > > > > > > > Please suggest your view on this case. Here driver is preset.
> > > > > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > > > > In this case AER errors starts even before driver probe starts.
> > > > > > > > After probe, driver does the device reset with no success and even AER
> > > > > > > > recovery does not work.
> > > > > > >
> > > > > > > This case should be the same as the one above.  If we can change the
> > > > > > > PCI core so it can reset the device when there's no driver,  that would
> > > > > > > apply to case I (where there will never be a driver) and to case II
> > > > > > > (where there is no driver now, but a driver will probe the device
> > > > > > > later).
> > > > > >
> > > > > > Does this means change are required in PCI core.
> > > > >
> > > > > Yes, I am suggesting that the PCI core does not do the right thing
> > > > > here.
> > > > >
> > > > > > I tried following changes in pcie_do_recovery() but it did not help.
> > > > > > Same error as before.
> > > > > >
> > > > > > -- a/drivers/pci/pcie/err.c
> > > > > > +++ b/drivers/pci/pcie/err.c
> > > > > >         pci_info(dev, "broadcast resume message\n");
> > > > > >         pci_walk_bus(bus, report_resume, &status);
> > > > > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > > > > >         return status;
> > > > > >
> > > > > >  failed:
> > > > > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > > > > +       pci_reset_function(dev);
> > > > > > +       pci_aer_clear_device_status(dev);
> > > > > > +       pci_aer_clear_nonfatal_status(dev);
> > > > >
> > > > > Did you confirm that this resets the devices in question (0000:09:00.0
> > > > > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > > > > PM, etc)?
> > > >
> > > > Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
> > > > reason no effect. After making following changes,  both devices are
> > > > now getting reset.
> > > > Both devices are using FLR.
> > > >
> > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > > > index 117c0a2b2ba4..26b908f55aef 100644
> > > > --- a/drivers/pci/pcie/err.c
> > > > +++ b/drivers/pci/pcie/err.c
> > > > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> > > >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> > > >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> > > >                         pci_info(dev, "can't recover (no error_detected callback)\n");
> > > > +
> > > > +                       pci_save_state(dev);
> > > > +                       pci_cfg_access_lock(dev);
> > > > +
> > > > +                       /* Quiesce the device completely */
> > > > +                       pci_write_config_word(dev, PCI_COMMAND,
> > > > +                             PCI_COMMAND_INTX_DISABLE);
> > > > +                       if (!__pci_reset_function_locked(dev)) {
> > > > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > > > +                               pci_info(dev, "recovered via pci level reset\n");
> > > > +                       }
> > >
> > > Why do we need to save the state and quiesce the device?  The reset
> > > should disable interrupts anyway.  In this particular case where
> > > there's no driver, I don't think we should have to restore the state.
> > > We maybe should *remove* the device and re-enumerate it after the
> > > reset, but the state from before the reset should be irrelevant.
> >
> > I tried pci_reset_function_locked without save/restore then I got the
> > synchronous abort during igb_probe (case 2 i.e. with driver). This is
> > 100% reproducible.
> > looks like pci_reset_function_locked is causing PCI configuration
> > space random. Same is mentioned here
> > https://www.kernel.org/doc/html/latest/driver-api/pci/pci.html
>
> That documentation is poorly worded.  A reset doesn't make the
> contents of config space "random," but of course it sets config space
> registers to their initialization values, including things like the
> device BARs.  After a reset, the device BARs are zero, so it won't
> respond at the address we expect, and I'm sure that's what's causing
> the external abort.
>
> So I guess we *do* need to save the state before the reset and restore
> it (either that or enumerate the device from scratch just like we
> would if it had been hot-added).  I'm not really thrilled with trying
> to save the state after the device has already reported an error.  I'd
> rather do it earlier, maybe during enumeration, like in
> pci_init_capabilities().  But I don't understand all the subtleties of
> dev->state_saved, so that requires some legwork.
>

I tried moving pci_save_state() earlier. All observations are the same
as mentioned in the earlier discussions.

Some modification is also required in pci_restore_state(), because by
default it sets dev->state_saved = false after restoring.
So the next AER event causes the earlier mentioned crash
(igb_get_invariants_82575 --> igb_rd32), because pci_restore_state()
then returns without restoring any state.
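
Roughly, the sequence with the unmodified pci_restore_state() is (an
illustrative trace reconstructed from the observations above, not an
actual log):

  first AER event
    report_error_detected()
      __pci_reset_function_locked(dev)   # BARs cleared by the reset
      pci_restore_state(dev)             # BARs restored, state_saved = false
  next AER event
    report_error_detected()
      __pci_reset_function_locked(dev)   # BARs cleared again
      pci_restore_state(dev)             # returns early, !state_saved
  igb_probe()
    igb_get_invariants_82575()
      igb_rd32()                         # MMIO to an unprogrammed BAR
                                         # -> synchronous external abort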

Code changes are below [1]

> I don't think we should set INTX_DISABLE; the reset will make whatever
> we do with it irrelevant anyway.
>
Yes, it is not required.

> Remind me why the pci_cfg_access_lock()?

I added it because I was worried about race conditions between the AER
save/restore and igb_probe. It is not actually required, as locking is
already handled by the framework in both the AER bus walk and
igb_probe.

[1]
root@localhost$ git diff
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 595fcf59843f..35396eb4fd9e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1537,11 +1537,7 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
        }
 }

-/**
- * pci_restore_state - Restore the saved state of a PCI device
- * @dev: PCI device that we're dealing with
- */
-void pci_restore_state(struct pci_dev *dev)
+void __pci_restore_state(struct pci_dev *dev, int retain_state)
 {
        if (!dev->state_saved)
                return;
@@ -1572,10 +1568,26 @@ void pci_restore_state(struct pci_dev *dev)
        pci_enable_acs(dev);
        pci_restore_iov_state(dev);

-       dev->state_saved = false;
+       if (!retain_state)
+               dev->state_saved = false;
+}
+
+/**
+ * pci_restore_state - Restore the saved state of a PCI device
+ * @dev: PCI device that we're dealing with
+ */
+void pci_restore_state(struct pci_dev *dev)
+{
+       __pci_restore_state(dev, 0);
 }
 EXPORT_SYMBOL(pci_restore_state);

+void pci_restore_retain_state(struct pci_dev *dev)
+{
+       __pci_restore_state(dev, 1);
+}
+EXPORT_SYMBOL(pci_restore_retain_state);
+
 struct pci_saved_state {
        u32 config_space[16];
        struct pci_cap_saved_data cap[0];
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..621eaa34bf9f 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,13 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no
error_detected callback)\n");
+
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "Recovered via pci level reset\n");
+                       }
+
+                       pci_restore_retain_state(dev);
                } else {
                        vote = PCI_ERS_RESULT_NONE;
                }
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 77b8a145c39b..af4e27c95421 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2448,6 +2448,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)

        pci_init_capabilities(dev);

+       pci_save_state(dev);
+
        /*
         * Add the device to our list of discovered devices
         * and the bus list for fixup functions, etc.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 83ce1cdf5676..42ab7ef850b7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1234,6 +1234,7 @@ void pci_unmap_rom(struct pci_dev *pdev, void __iomem *rom);

 /* Power management related routines */
 int pci_save_state(struct pci_dev *dev);
+void pci_restore_retain_state(struct pci_dev *dev);
 void pci_restore_state(struct pci_dev *dev);
 struct pci_saved_state *pci_store_saved_state(struct pci_dev *dev);
 int pci_load_saved_state(struct pci_dev *dev,

--pk

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel
  2020-06-07  8:30                       ` Prabhakar Kushwaha
@ 2020-06-11 23:03                         ` Bjorn Helgaas
  0 siblings, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2020-06-11 23:03 UTC (permalink / raw)
  To: Prabhakar Kushwaha
  Cc: Robin Murphy, linux-arm-kernel, kexec mailing list, linux-pci,
	Marc Zyngier, Will Deacon, Ganapatrao Prabhakerrao Kulkarni,
	Bhupesh Sharma, Prabhakar Kushwaha, Kuppuswamy Sathyanarayanan,
	Vijay Mohan Pandarathil, Myron Stowe

On Sun, Jun 07, 2020 at 02:00:35PM +0530, Prabhakar Kushwaha wrote:
> On Thu, Jun 4, 2020 at 5:32 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Jun 03, 2020 at 11:12:48PM +0530, Prabhakar Kushwaha wrote:
> > > On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
<snip>

> > > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > > > > index 117c0a2b2ba4..26b908f55aef 100644
> > > > > --- a/drivers/pci/pcie/err.c
> > > > > +++ b/drivers/pci/pcie/err.c
> > > > > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> > > > >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> > > > >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> > > > >                         pci_info(dev, "can't recover (no error_detected callback)\n");
> > > > > +
> > > > > +                       pci_save_state(dev);
> > > > > +                       pci_cfg_access_lock(dev);
> > > > > +
> > > > > +                       /* Quiesce the device completely */
> > > > > +                       pci_write_config_word(dev, PCI_COMMAND,
> > > > > +                             PCI_COMMAND_INTX_DISABLE);
> > > > > +                       if (!__pci_reset_function_locked(dev)) {
> > > > > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > > > > +                               pci_info(dev, "recovered via pci level reset\n");
> > > > > +                       }
> >
> > So I guess we *do* need to save the state before the reset and restore
> > it (either that or enumerate the device from scratch just like we
> > would if it had been hot-added).  I'm not really thrilled with trying
> > to save the state after the device has already reported an error.  I'd
> > rather do it earlier, maybe during enumeration, like in
> > pci_init_capabilities().  But I don't understand all the subtleties of
> > dev->state_saved, so that requires some legwork.
> 
> I tried moving pci_save_state earlier. All observations are the same
> as mentioned in earlier discussions.

By "legwork", I didn't mean just trying things to see whether they
seem to work.  I meant researching the history to find out *why* it's
designed the way it is so that when we change it, we don't break
things.

For example, these commits are obviously important to understand:

  aa8c6c93747f ("PCI PM: Restore standard config registers of all devices early")
  c82f63e411f1 ("PCI: check saved state before restore")
  4b77b0a2ba27 ("PCI: Clear saved_state after the state has been restored")

I think we need to step back and separate this AER issue from the
whole SMMU table copying thing.  Then do the research and start a
new thread with a patch to fix just the AER issue.

The ARM guys would probably be grateful to be dropped from the AER
thread because it really has nothing to do with ARM.

Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-06-11 23:03 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1589251566-32126-1-git-send-email-pkushwaha@marvell.com>
2020-05-12 22:03 ` [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel Bjorn Helgaas
2020-05-14  7:17   ` Prabhakar Kushwaha
2020-05-19 23:22     ` Bjorn Helgaas
2020-05-21  3:58       ` Prabhakar Kushwaha
2020-05-21 22:49         ` Bjorn Helgaas
2020-05-27 11:44           ` Prabhakar Kushwaha
2020-05-27 20:18             ` Bjorn Helgaas
2020-05-29 14:18               ` Prabhakar Kushwaha
2020-05-29 19:33                 ` Bjorn Helgaas
2020-06-03 17:42                   ` Prabhakar Kushwaha
2020-06-04  0:02                     ` Bjorn Helgaas
2020-06-07  8:30                       ` Prabhakar Kushwaha
2020-06-11 23:03                         ` Bjorn Helgaas
