* [PATCH RFC 0/2] fix vt-d hard lockup when hotplugging ATS capable device
@ 2023-12-13  3:46 Ethan Zhao
  2023-12-13  3:46 ` [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
  2023-12-13  3:46 ` [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected Ethan Zhao
  0 siblings, 2 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-13  3:46 UTC (permalink / raw)
  To: bhelgaas, baolu.lu, dwmw2, will, robin.murphy
  Cc: linux-pci, iommu, linux-kernel, haifeng.zhao

 This patch series fixes a VT-d hard lockup reported when an ATS capable
 endpoint device is connected to the system through a PCIe switch below a
 root port, in the following topology.

     +-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
     |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
     |           +-00.2  Intel Corporation Ice Lake RAS
     |           +-00.4  Intel Corporation Device 0b23
     |           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0  NVIDIA Corporation Device 2324
     |                                           +-01.0-[19]----00.0  Mellanox Technologies MT2910 Family [ConnectX-7]

 The user brought endpoint device 19:00.0's link down by flapping the link
 control register of its hotplug capable slot 17:01.0. As the DLLSC
 response sequence, pciehp_ist() unloads the device driver and powers the
 device off. While the device driver is unloading, an IOMMU devTLB flush
 request is issued to the link-down device, and the long completion/timeout
 wait in interrupt context causes continuous hard lockup warnings and a
 system hang.

[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE    kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS:  0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629]  intel_iommu_release_device+0x1f/0x30
[ 4223.822629]  iommu_release_device+0x33/0x60
[ 4223.822629]  iommu_bus_notifier+0x7f/0x90
[ 4223.822630]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822630]  device_del+0x2e5/0x420
[ 4223.822630]  pci_remove_bus_device+0x70/0x110
[ 4223.822630]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631]  pciehp_disable_slot+0x6b/0x100
[ 4223.822631]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631]  pciehp_ist+0x176/0x180
[ 4223.822631]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632]  irq_thread_fn+0x19/0x50
[ 4223.822632]  irq_thread+0x104/0x190
[ 4223.822632]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633]  kthread+0x114/0x130
[ 4223.822633]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822633]  ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE     kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634]  <NMI>
[ 4223.822635]  dump_stack+0x6d/0x88
[ 4223.822635]  panic+0x101/0x2d0
[ 4223.822635]  ? ret_from_fork+0x11/0x30
[ 4223.822635]  nmi_panic.cold.14+0xc/0xc
[ 4223.822636]  watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636]  __perf_event_overflow+0x4f/0xf0
[ 4223.822636]  handle_pmi_common+0x1ef/0x290
[ 4223.822636]  ? __set_pte_vaddr+0x28/0x40
[ 4223.822637]  ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637]  ? __native_set_fixmap+0x24/0x30
[ 4223.822637]  ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637]  ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637]  intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638]  perf_event_nmi_handler+0x24/0x40
[ 4223.822638]  nmi_handle+0x4d/0xf0
[ 4223.822638]  default_do_nmi+0x49/0x100
[ 4223.822638]  exc_nmi+0x134/0x180
[ 4223.822639]  end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  </NMI>
[ 4223.822642]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643]  intel_iommu_release_device+0x1f/0x30
[ 4223.822643]  iommu_release_device+0x33/0x60
[ 4223.822643]  iommu_bus_notifier+0x7f/0x90
[ 4223.822644]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822644]  device_del+0x2e5/0x420
[ 4223.822644]  pci_remove_bus_device+0x70/0x110
[ 4223.822644]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644]  pciehp_disable_slot+0x6b/0x100
[ 4223.822645]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645]  pciehp_ist+0x176/0x180
[ 4223.822645]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645]  irq_thread_fn+0x19/0x50
[ 4223.822646]  irq_thread+0x104/0x190
[ 4223.822646]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646]  kthread+0x114/0x130
[ 4223.822647]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822647]  ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)

Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device whose error_state has been set to
pci_channel_io_perm_failure before it is powered off in

pciehp_ist()
   pciehp_handle_presence_or_link_change()
     pciehp_disable_slot()
       remove_board()
         pciehp_unconfigure_device()
 

 This patchset was tested by yehaorong@bytedance.com on stable 6.7-rc4
 and is sent out for further comments.

Ethan Zhao (2):
  PCI: make pci_dev_is_disconnected() helper public for other drivers
  iommu/vt-d: don't issue devTLB flush request when device is
    disconnected

 drivers/iommu/intel/pasid.c | 21 ++++++++++++++++++++-
 drivers/pci/pci.h           |  5 -----
 include/linux/pci.h         |  5 +++++
 3 files changed, 25 insertions(+), 6 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers
  2023-12-13  3:46 [PATCH RFC 0/2] fix vt-d hard lockup when hotplugging ATS capable device Ethan Zhao
@ 2023-12-13  3:46 ` Ethan Zhao
  2023-12-13 10:49   ` Lukas Wunner
  2023-12-13  3:46 ` [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected Ethan Zhao
  1 sibling, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2023-12-13  3:46 UTC (permalink / raw)
  To: bhelgaas, baolu.lu, dwmw2, will, robin.murphy
  Cc: linux-pci, iommu, linux-kernel, haifeng.zhao, Haorong Ye

Move pci_dev_is_disconnected() from drivers/pci/pci.h to the public
include/linux/pci.h so that other drivers can reference it.
No functional change.

Tested-by: Haorong Ye <yehaorong@bytedance.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
---
 drivers/pci/pci.h   | 5 -----
 include/linux/pci.h | 5 +++++
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..75fa2084492f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -366,11 +366,6 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 	return 0;
 }
 
-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
-	return dev->error_state == pci_channel_io_perm_failure;
-}
-
 /* pci_dev priv_flags */
 #define PCI_DEV_ADDED 0
 #define PCI_DPC_RECOVERED 1
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 60ca768bc867..869f2ec97a84 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2503,6 +2503,11 @@ static inline struct pci_dev *pcie_find_root_port(struct pci_dev *dev)
 	return NULL;
 }
 
+static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
+{
+	return dev->error_state == pci_channel_io_perm_failure;
+}
+
 void pci_request_acs(void);
 bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
 bool pci_acs_path_enabled(struct pci_dev *start,
-- 
2.31.1



* [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected
  2023-12-13  3:46 [PATCH RFC 0/2] fix vt-d hard lockup when hotplugging ATS capable device Ethan Zhao
  2023-12-13  3:46 ` [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
@ 2023-12-13  3:46 ` Ethan Zhao
  2023-12-13 10:44   ` Lukas Wunner
  2023-12-13 11:59   ` Baolu Lu
  1 sibling, 2 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-13  3:46 UTC (permalink / raw)
  To: bhelgaas, baolu.lu, dwmw2, will, robin.murphy
  Cc: linux-pci, iommu, linux-kernel, haifeng.zhao, Haorong Ye

For endpoint devices connected to the system via hotplug capable ports,
users can request a warm reset of a device by flapping the device's link
through the slot's link control register. As part of the pciehp_ist()
DLLSC interrupt sequence response, pciehp unloads the device driver and
then powers the device off, which causes an IOMMU devTLB flush request
for the device to be sent and a long completion/timeout wait in
interrupt context.

That results in the following continuous hard lockup warning and system
hang:

[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE    kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS:  0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629]  intel_iommu_release_device+0x1f/0x30
[ 4223.822629]  iommu_release_device+0x33/0x60
[ 4223.822629]  iommu_bus_notifier+0x7f/0x90
[ 4223.822630]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822630]  device_del+0x2e5/0x420
[ 4223.822630]  pci_remove_bus_device+0x70/0x110
[ 4223.822630]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631]  pciehp_disable_slot+0x6b/0x100
[ 4223.822631]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631]  pciehp_ist+0x176/0x180
[ 4223.822631]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632]  irq_thread_fn+0x19/0x50
[ 4223.822632]  irq_thread+0x104/0x190
[ 4223.822632]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633]  kthread+0x114/0x130
[ 4223.822633]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822633]  ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE     kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634]  <NMI>
[ 4223.822635]  dump_stack+0x6d/0x88
[ 4223.822635]  panic+0x101/0x2d0
[ 4223.822635]  ? ret_from_fork+0x11/0x30
[ 4223.822635]  nmi_panic.cold.14+0xc/0xc
[ 4223.822636]  watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636]  __perf_event_overflow+0x4f/0xf0
[ 4223.822636]  handle_pmi_common+0x1ef/0x290
[ 4223.822636]  ? __set_pte_vaddr+0x28/0x40
[ 4223.822637]  ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637]  ? __native_set_fixmap+0x24/0x30
[ 4223.822637]  ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637]  ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637]  intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638]  perf_event_nmi_handler+0x24/0x40
[ 4223.822638]  nmi_handle+0x4d/0xf0
[ 4223.822638]  default_do_nmi+0x49/0x100
[ 4223.822638]  exc_nmi+0x134/0x180
[ 4223.822639]  end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  </NMI>
[ 4223.822642]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643]  intel_iommu_release_device+0x1f/0x30
[ 4223.822643]  iommu_release_device+0x33/0x60
[ 4223.822643]  iommu_bus_notifier+0x7f/0x90
[ 4223.822644]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822644]  device_del+0x2e5/0x420
[ 4223.822644]  pci_remove_bus_device+0x70/0x110
[ 4223.822644]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644]  pciehp_disable_slot+0x6b/0x100
[ 4223.822645]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645]  pciehp_ist+0x176/0x180
[ 4223.822645]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645]  irq_thread_fn+0x19/0x50
[ 4223.822646]  irq_thread+0x104/0x190
[ 4223.822646]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646]  kthread+0x114/0x130
[ 4223.822647]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822647]  ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)

Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device whose error_state has been set to
pci_channel_io_perm_failure before it is powered off in

pciehp_ist()
   pciehp_handle_presence_or_link_change()
     pciehp_disable_slot()
       remove_board()
         pciehp_unconfigure_device()

Tested-by: Haorong Ye <yehaorong@bytedance.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
---
 drivers/iommu/intel/pasid.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..8557b6dee22f 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -476,6 +476,23 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 {
 	struct device_domain_info *info;
 	u16 sid, qdep, pfsid;
+	struct pci_dev *pdev;
+
+	pdev = to_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	/*
+	 * If the endpoint device's link was brought down by a user's config
+	 * space access to its hotplug slot's link control register, then as
+	 * the DLLSC response sequence pciehp_ist() sets the device's
+	 * error_state to pci_channel_io_perm_failure. Check the device state
+	 * here to avoid issuing a meaningless devTLB flush request to it,
+	 * which might cause a lockup warning or deadlock due to a long wait
+	 * in interrupt context.
+	 */
+
+	if (pci_dev_is_disconnected(pdev))
+		return;
 
 	info = dev_iommu_priv_get(dev);
 	if (!info || !info->ats_enabled)
@@ -495,6 +512,8 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 		qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
 	else
 		qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
+
+	return;
 }
 
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
-- 
2.31.1



* Re: [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected
  2023-12-13  3:46 ` [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected Ethan Zhao
@ 2023-12-13 10:44   ` Lukas Wunner
  2023-12-13 11:54     ` Robin Murphy
  2023-12-14  2:16     ` Ethan Zhao
  2023-12-13 11:59   ` Baolu Lu
  1 sibling, 2 replies; 31+ messages in thread
From: Lukas Wunner @ 2023-12-13 10:44 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye

On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register,

Well, users could just *unplug* the device, right?  Why is it relevant
that they could fiddle with registers in config space?


> as pciehp_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

A completion timeout should be on the order of usecs or msecs, why does it
cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
delay between hot removal and watchdog reaction.


> Fix it by checking the device's error_state in
> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> request to link down device that is set to pci_channel_io_perm_failure and
> then powered off in

This doesn't seem to be a proper fix.  It will work most of the time
but not always.  A user might bring down the slot via sysfs, then yank
the card from the slot just when the iommu flush occurs such that the
pci_dev_is_disconnected(pdev) check returns false but the card is
physically gone immediately afterwards.  In other words, you've shrunk
the time window during which the issue may occur, but haven't eliminated
it completely.

Thanks,

Lukas


* Re: [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers
  2023-12-13  3:46 ` [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
@ 2023-12-13 10:49   ` Lukas Wunner
  2023-12-14  0:58     ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Lukas Wunner @ 2023-12-13 10:49 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye

On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
> move pci_dev_is_disconnected() from driver/pci/pci.h to public
> include/linux/pci.h for other driver's reference.
> no function change.

That's merely a prose description of the code.  A reader can already
see from the code what it's doing.  You need to explain the *reason*
for the change instead.  E.g.:  "Make pci_dev_is_disconnected() public
so that it can be called from $DRIVER to speed up hot removal
handling which may otherwise take seconds because of $REASONS."

Thanks,

Lukas


* Re: [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected
  2023-12-13 10:44   ` Lukas Wunner
@ 2023-12-13 11:54     ` Robin Murphy
  2023-12-14  2:40       ` Ethan Zhao
  2023-12-21 10:42       ` Lukas Wunner
  2023-12-14  2:16     ` Ethan Zhao
  1 sibling, 2 replies; 31+ messages in thread
From: Robin Murphy @ 2023-12-13 11:54 UTC (permalink / raw)
  To: Lukas Wunner, Ethan Zhao
  Cc: bhelgaas, baolu.lu, dwmw2, will, linux-pci, iommu, linux-kernel,
	Haorong Ye

On 13/12/2023 10:44 am, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
> 
> Well, users could just *unplug* the device, right?  Why is it relevant
> that they could fiddle with registers in config space?
> 
> 
>> as pciehp_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> 
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.

The PCIe spec only requires an endpoint to respond to an ATS invalidate 
within a rather hilarious 90 seconds, so it's primarily a question of 
how patient the root complex and bridges in between are prepared to be.

>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> 
> This doesn't seem to be a proper fix.  It will work most of the time
> but not always.  A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards.  In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.

Yeah, I think we have a subtle but fundamental issue here in that the 
iommu_release_device() callback is hooked to BUS_NOTIFY_REMOVED_DEVICE, 
so in general probably shouldn't be assuming it's safe to do anything 
with the device itself *after* it's already been removed from its bus - 
this step is primarily about cleaning up any of the IOMMU's own state 
relating to the given device.

I think if we want to ensure ATCs are invalidated on hot-unplug we need 
an additional pre-removal notifier to take care of that, and that step 
would then want to distinguish between an orderly removal where cleaning 
up is somewhat meaningful, and a surprise removal where it definitely isn't.

Thanks,
Robin.


* Re: [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected
  2023-12-13  3:46 ` [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected Ethan Zhao
  2023-12-13 10:44   ` Lukas Wunner
@ 2023-12-13 11:59   ` Baolu Lu
  2023-12-14  2:26     ` Ethan Zhao
  1 sibling, 1 reply; 31+ messages in thread
From: Baolu Lu @ 2023-12-13 11:59 UTC (permalink / raw)
  To: Ethan Zhao, bhelgaas, dwmw2, will, robin.murphy
  Cc: baolu.lu, linux-pci, iommu, linux-kernel, Haorong Ye

On 2023/12/13 11:46, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehp_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off.

Is it possible for pciehp to disable ATS on the device before unloading
the driver? Or should the user follow some specific steps to warm
reset the device?

What happens if IOMMU issues device TLB invalidation after link down but
before pci_dev_is_disconnected() returns true?

Best regards,
baolu


* Re: [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers
  2023-12-13 10:49   ` Lukas Wunner
@ 2023-12-14  0:58     ` Ethan Zhao
  2023-12-21 10:51       ` Lukas Wunner
  0 siblings, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2023-12-14  0:58 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye


On 12/13/2023 6:49 PM, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
>> move pci_dev_is_disconnected() from driver/pci/pci.h to public
>> include/linux/pci.h for other driver's reference.
>> no function change.
> That's merely a prose description of the code.  A reader can already
> see from the code what it's doing.  You need to explain the *reason*
> for the change instead.  E.g.:  "Make pci_dev_is_disconnected() public
> so that it can be called from $DRIVER to speed up hot removal
> handling which may otherwise take seconds because of $REASONS."

Yup, that's why I made it public. Then how about:

"Make pci_dev_is_disconnected() public so that it can be called from the
Intel VT-d driver to check the device's hotplug removal state before
issuing a devTLB flush request."


Thanks,

Ethan

>
> Thanks,
>
> Lukas


* Re: [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected
  2023-12-13 10:44   ` Lukas Wunner
  2023-12-13 11:54     ` Robin Murphy
@ 2023-12-14  2:16     ` Ethan Zhao
  2023-12-15  0:43       ` Ethan Zhao
  1 sibling, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2023-12-14  2:16 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye


On 12/13/2023 6:44 PM, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
> Well, users could just *unplug* the device, right?  Why is it relevant
> that they could fiddle with registers in config space?
>
Yes, if the device and its slot are hotplug capable, users could just
'unplug' the device.

But in the reported case, users tried to do a warm reset with a tool
command like:

   mlxfwreset -d <busid> -y reset

which actually accesses configuration space just as

  setpci -s 0000:17:01.0 0x78.L=0x21050010

does. Well, we couldn't tell users not to fiddle with PCIe config space
registers like that.

>> as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.
>
In my understanding, the devTLB flush request sent to an ATS capable
device is a non-posted request. If the ATS transaction is broken by an
endpoint link-down or power-off event, the timeout can take up to 60
seconds +50%/-0%; see the "Invalidate Completion Timeout" part of
chapter 10.3.1 "Invalidate Request" in the PCIe spec 6.1:

"
IMPLEMENTATION NOTE:
INVALIDATE COMPLETION TIMEOUT

Devices should respond to Invalidate Requests within 1 minute (+50%
-0%). Having a bounded time permits an ATPT to implement Invalidate
Completion Timeouts and reuse the associated ITag values. ATPT designs
are implementation specific. As such, Invalidate Completion Timeouts and
their associated error handling are outside the scope of this
specification.
"

>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> This doesn't seem to be a proper fix.  It will work most of the time
> but not always.  A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards.  In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.

If you mean disabling the slot via sysfs, that's SAFE_REMOVAL, right?

That would issue the devTLB invalidation first and power off the device
later, so it wouldn't trigger the hard lockup even though
pci_dev_is_disconnected() returns false. This fix works for such a case.


Thanks,

Ethan



>
> Thanks,
>
> Lukas


* Re: [PATCH 2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected
  2023-12-13 11:59   ` Baolu Lu
@ 2023-12-14  2:26     ` Ethan Zhao
  2023-12-15  1:03       ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2023-12-14  2:26 UTC (permalink / raw)
  To: Baolu Lu, bhelgaas, dwmw2, will, robin.murphy
  Cc: linux-pci, iommu, linux-kernel, Haorong Ye


On 12/13/2023 7:59 PM, Baolu Lu wrote:
> On 2023/12/13 11:46, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off.
>
> Is it possible for pciehp to disable ATS on the device before unloading
> the driver? Or should the device follow some specific steps to warm
> reset the device?
>
In this case the link goes down first, and then pciehp_ist() learns of
it via the DLLSC interrupt. I don't think it makes sense to disable
device ATS here, though pciehp could flag the device as ATS-disabled;
"disconnected" is enough to let software like the VT-d driver know the
device state.


> What happens if IOMMU issues device TLB invalidation after link down but
> before pci_dev_is_disconnected() returns true?

It seems that wouldn't happen in the hotplug cases, either safe_removal
or surprise_removal.



Thanks,

Ethan

>
> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-13 11:54     ` Robin Murphy
@ 2023-12-14  2:40       ` Ethan Zhao
  2023-12-21 10:42       ` Lukas Wunner
  1 sibling, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-14  2:40 UTC (permalink / raw)
  To: Robin Murphy, Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, linux-pci, iommu, linux-kernel,
	Haorong Ye


On 12/13/2023 7:54 PM, Robin Murphy wrote:
> On 13/12/2023 10:44 am, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's 
>>> link
>>> through setting the slot's link control register,
>>
>> Well, users could just *unplug* the device, right?  Why is it relevant
>> that thay could fiddle with registers in config space?
>>
>>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for 
>>> device to
>>> be sent and a long time completion/timeout waiting in interrupt 
>>> context.
>>
>> A completion timeout should be on the order of usecs or msecs, why 
>> does it
>> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 
>> *second*
>> delay between hot removal and watchdog reaction.
>
> The PCIe spec only requires an endpoint to respond to an ATS 
> invalidate within a rather hilarious 90 seconds, so it's primarily a 
> question of how patient the root complex and bridges in between are 
> prepared to be.

The reported issue only reproduces when the endpoint device connects to
the system via a PCIe switch (which only has a read-tracking feature);
those switches don't seem to be aware of ATS transactions, while the
root port is aware of ATS when the ATS transaction is broken
(invalidation sent, but the link goes down, the device is removed,
etc.). However, I didn't find any public spec about that.

>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB 
>>> flush
>>> request to link down device that is set to 
>>> pci_channel_io_perm_failure and
>>> then powered off in
>>
>> This doesn't seem to be a proper fix.  It will work most of the time
>> but not always.  A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards.  In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> Yeah, I think we have a subtle but fundamental issue here in that the 
> iommu_release_device() callback is hooked to 
> BUS_NOTIFY_REMOVED_DEVICE, so in general probably shouldn't be 
> assuming it's safe to do anything with the device itself *after* it's 
> already been removed from its bus - this step is primarily about 
> cleaning up any of the IOMMU's own state relating to the given device.
>
> I think if we want to ensure ATCs are invalidated on hot-unplug we 
> need an additional pre-removal notifier to take care of that, and that 
> step would then want to distinguish between an orderly removal where 
> cleaning up is somewhat meaningful, and a surprise removal where it 
> definitely isn't.

So, at the very least, we should check the device state before issuing a devTLB invalidation.


Thanks,

Ethan

>
>
> Thanks,
> Robin.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-14  2:16     ` Ethan Zhao
@ 2023-12-15  0:43       ` Ethan Zhao
  0 siblings, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-15  0:43 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye


On 12/14/2023 10:16 AM, Ethan Zhao wrote:
>
> On 12/13/2023 6:44 PM, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's 
>>> link
>>> through setting the slot's link control register,
>> Well, users could just *unplug* the device, right?  Why is it relevant
>> that thay could fiddle with registers in config space?
>>
> Yes, if the device and it's slot are hotplug capable, users could just
>
> 'unplug' the device.
>
> But this case reported, users try to do a warm reset with a tool
>
> command like:
>
>   mlxfwreset -d <busid> -y reset
>
> Actually, it will access configuration space  just as
>
>  setpci -s 0000:17:01.0 0x78.L=0x21050010
>
> Well, we couldn't say don't fiddle PCIe config space registers like
>
> that.
>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for 
>>> device to
>>> be sent and a long time completion/timeout waiting in interrupt 
>>> context.
>> A completion timeout should be on the order of usecs or msecs, why 
>> does it
>> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 
>> *second*
>> delay between hot removal and watchdog reaction.
>>
> In my understanding, the devTLB flush request sent to ATS capable devcie
>
> is non-posted request, if the ATS transaction is broken by endpoint link
>
> -down, power-off event, the timeout will take up to 60 seconds+-30,
>
> see "Invalidate Completion Timeout " part of
>
> chapter 10.3.1 Invalidate Request
>
> In PCIe spec 6.1
>
> "
>
> IMPLEMENTATION NOTE:
>
> INVALIDATE COMPLETION TIMEOUT
>
> Devices should respond to Invalidate Requests within 1 minute (+50% 
> -0%).Having a bounded time
>
> permits an ATPT to implement Invalidate Completion Timeouts and reuse 
> the associated ITag values.
>
> ATPT designs are implementation specific. As such, Invalidate 
> Completion Timeouts and their
>
> associated error handling are outside the scope of this specification
>
> "
>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB 
>>> flush
>>> request to link down device that is set to 
>>> pci_channel_io_perm_failure and
>>> then powered off in
>> This doesn't seem to be a proper fix.  It will work most of the time
>> but not always.  A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards.  In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> If you mean disable the slot via sysfs, that's SAFE_REMOVAL, right ?
>
> that would issse devTLB invalidation first, power off device later, it
>
> wouldn't trigger the hard lockup, though the
>
> pci_dev_is_disconnected() return false. this fix works such case.

Could you help point out whether there are any other windows to close?

Thanks,

Ethan


>
>
> Thanks,
>
> Ethan
>
>
>
>>
>> Thanks,
>>
>> Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-14  2:26     ` Ethan Zhao
@ 2023-12-15  1:03       ` Ethan Zhao
  2023-12-15  1:34         ` Baolu Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2023-12-15  1:03 UTC (permalink / raw)
  To: Baolu Lu, bhelgaas, dwmw2, will, robin.murphy
  Cc: linux-pci, iommu, linux-kernel, Haorong Ye


On 12/14/2023 10:26 AM, Ethan Zhao wrote:
>
> On 12/13/2023 7:59 PM, Baolu Lu wrote:
>> On 2023/12/13 11:46, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's 
>>> link
>>> through setting the slot's link control register, as pciehpt_ist() 
>>> DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off.
>>
>> Is it possible for pciehp to disable ATS on the device before unloading
>> the driver? Or should the device follow some specific steps to warm
>> reset the device?
>>
> In this case, link down first, then pciehp_ist() got DLLSC interrupt 
> to know
>
> that, I don't think it makes sense to disable device ATS here, but it 
> could
>
> flag the device is ATS disabled, well,  "disconnected" is enough to let
>
> vt-d like software knows the device state.
>
>
For hot-unplug cases:

1. safe_removal

   The user requests unplugging the device via sysfs or presses the
   attention button. pciehp then responds by unconfiguring the device,
   unloading the device driver, and powering the device off, after which
   the device is ready to remove. In this case the devTLB invalidate
   request is sent before the device link is brought down or the device
   power is turned off, so it doesn't trigger the hard lockup.

2. surprise_removal

   The user removes the device directly, or first brings the device link
   down or turns off device power by writing PCI config space.
   Link-down, not-present and power-off events are all handled by pciehp
   the same way, as surprise removal. In such cases pciehp_ist() will
   flag the device as "disconnected" first, then unconfigure the device,
   unload the driver, let the iommu release the device (issuing the
   devTLB flush), and delete the device. So checking the device state
   covers these cases.

But I am still thinking about whether there are other windows.


Thanks,

Ethan


>> What happens if IOMMU issues device TLB invalidation after link down but
>> before pci_dev_is_disconnected() returns true?
>
> Seems it wouldn't happen with hotplug cases, safe_removal or
>
> supprise_removal.
>
>
>
> Thanks,
>
> Ethan
>
>>
>> Best regards,
>> baolu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-15  1:03       ` Ethan Zhao
@ 2023-12-15  1:34         ` Baolu Lu
  2023-12-15  1:51           ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Baolu Lu @ 2023-12-15  1:34 UTC (permalink / raw)
  To: Ethan Zhao, bhelgaas, dwmw2, will, robin.murphy
  Cc: baolu.lu, linux-pci, iommu, linux-kernel, Haorong Ye

On 2023/12/15 9:03, Ethan Zhao wrote:
> 
> 2. supprise_removal
> 
>   Users remove the devece directly or bring the device link down/turn off
> 
>   device power first by setting pci config space, link-down/not-present/
> 
>   power-off are all handled by pciehp the same way "supprise_removal", in
> 
>   such case, pciehp_ist() will flag the device as "disconnected" first, 
> then
> 
>   unconfig the devcie, unload driver, iommu release device(issing devTLB 
> flush)
> 
>   delete device. so we checking the device state could work such cases.

If so, then it is fine for the iommu driver. As Robin said, if the
device needs more cleanup, the iommu core should register an appropriate
callback with the driver core and handle it before the device goes away.

Disabling PCI features seems to be a reasonable device cleanup. This
gives us another reason to move ATS enabling/disabling out from the
iommu subsystem. Once this is done, the device driver will enable ATS
during its probe and disable it during its release. There will be no
such workaround in the iommu driver anymore.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-15  1:34         ` Baolu Lu
@ 2023-12-15  1:51           ` Ethan Zhao
  0 siblings, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-15  1:51 UTC (permalink / raw)
  To: Baolu Lu, bhelgaas, dwmw2, will, robin.murphy
  Cc: linux-pci, iommu, linux-kernel, Haorong Ye


On 12/15/2023 9:34 AM, Baolu Lu wrote:
> On 2023/12/15 9:03, Ethan Zhao wrote:
>>
>> 2. supprise_removal
>>
>>   Users remove the devece directly or bring the device link down/turn 
>> off
>>
>>   device power first by setting pci config space, link-down/not-present/
>>
>>   power-off are all handled by pciehp the same way 
>> "supprise_removal", in
>>
>>   such case, pciehp_ist() will flag the device as "disconnected" 
>> first, then
>>
>>   unconfig the devcie, unload driver, iommu release device(issing 
>> devTLB flush)
>>
>>   delete device. so we checking the device state could work such cases.
>
> If so, then it is fine for the iommu driver. As Robin said, if the
> device needs more cleanup, the iommu core should register a right
> callback to the driver core and handle it before the device goes away.
>
> Disabling PCI features seems to be a reasonable device cleanup. This
> gives us another reason to move ATS enabling/disabling out from the

For surprise_removal, the device has already been removed and powered
off by the time the iommu device-release notification or cleanup
callback is invoked, so it would be disabling ATS on a device that is no
longer present. I don't see the point of doing that; perhaps I
misunderstand?

Thanks,

Ethan

> iommu subsystem. Once this is done, the device driver will enable ATS
> during its probe and disable it during its release. There will be no
> such workaround in the iommu driver anymore.
>
> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v4 0/2] fix vt-d hard lockup when hotplug ATS capable device
@ 2023-12-20  0:51 Ethan Zhao
  2023-12-20  0:51 ` [PATCH v4 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
  2023-12-20  0:51 ` [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected Ethan Zhao
  0 siblings, 2 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-20  0:51 UTC (permalink / raw)
  To: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, lukas
  Cc: linux-pci, iommu, linux-kernel

 This patchset fixes a VT-d hard lockup reported when an ATS capable
 endpoint device that connects to the system via a PCIe switch is
 surprise-unplugged, in the following topology.

     +-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
     |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
     |           +-00.2  Intel Corporation Ice Lake RAS
     |           +-00.4  Intel Corporation Device 0b23
     |           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0 
                                           NVIDIA Corporation Device 2324
     |                                           +-01.0-[19]----00.0
                          Mellanox Technologies MT2910 Family [ConnectX-7]

 The user brought endpoint device 19:00.0's link down by flapping its
 hotplug capable slot 17:01.0's link control register. As part of the
 DLLSC response sequence, pciehp_ist() will unload the device driver and
 power the device off. While the device driver is unloading, an iommu
 devTLB flush request is issued to that link-down device; the resulting
 long completion/timeout wait in interrupt context causes a continuous
 hard lockup warning and a system hang.

[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE    kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS:  0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629]  intel_iommu_release_device+0x1f/0x30
[ 4223.822629]  iommu_release_device+0x33/0x60
[ 4223.822629]  iommu_bus_notifier+0x7f/0x90
[ 4223.822630]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822630]  device_del+0x2e5/0x420
[ 4223.822630]  pci_remove_bus_device+0x70/0x110
[ 4223.822630]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631]  pciehp_disable_slot+0x6b/0x100
[ 4223.822631]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631]  pciehp_ist+0x176/0x180
[ 4223.822631]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632]  irq_thread_fn+0x19/0x50
[ 4223.822632]  irq_thread+0x104/0x190
[ 4223.822632]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633]  kthread+0x114/0x130
[ 4223.822633]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822633]  ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE     kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634]  <NMI>
[ 4223.822635]  dump_stack+0x6d/0x88
[ 4223.822635]  panic+0x101/0x2d0
[ 4223.822635]  ? ret_from_fork+0x11/0x30
[ 4223.822635]  nmi_panic.cold.14+0xc/0xc
[ 4223.822636]  watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636]  __perf_event_overflow+0x4f/0xf0
[ 4223.822636]  handle_pmi_common+0x1ef/0x290
[ 4223.822636]  ? __set_pte_vaddr+0x28/0x40
[ 4223.822637]  ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637]  ? __native_set_fixmap+0x24/0x30
[ 4223.822637]  ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637]  ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637]  intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638]  perf_event_nmi_handler+0x24/0x40
[ 4223.822638]  nmi_handle+0x4d/0xf0
[ 4223.822638]  default_do_nmi+0x49/0x100
[ 4223.822638]  exc_nmi+0x134/0x180
[ 4223.822639]  end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  </NMI>
[ 4223.822642]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643]  intel_iommu_release_device+0x1f/0x30
[ 4223.822643]  iommu_release_device+0x33/0x60
[ 4223.822643]  iommu_bus_notifier+0x7f/0x90
[ 4223.822644]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822644]  device_del+0x2e5/0x420
[ 4223.822644]  pci_remove_bus_device+0x70/0x110
[ 4223.822644]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644]  pciehp_disable_slot+0x6b/0x100
[ 4223.822645]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645]  pciehp_ist+0x176/0x180
[ 4223.822645]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645]  irq_thread_fn+0x19/0x50
[ 4223.822646]  irq_thread+0x104/0x190
[ 4223.822646]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646]  kthread+0x114/0x130
[ 4223.822647]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822647]  ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)

Make a quick fix by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device that has been set to
pci_channel_io_perm_failure and is then powered off in

pciehp_ist()
   pciehp_handle_presence_or_link_change()
     pciehp_disable_slot()
       remove_board()
         pciehp_unconfigure_device()
 
A safe_removal unplug doesn't trigger this issue, and the fix works for
all surprise_removal unplug operations.

This patchset was tested by yehaorong@bytedance.com on stable-6.7rc4.


change log:

v4: 
- move the PCI device state check after the ATS check, per Baolu's suggestion.
v3:
- fix commit description typo.
v2:
- revise commit[1] description part according to Lukas' suggestion.
- revise commit[2] description to clarify the issue's impact.
v1:
- https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@linux.intel.com/T/



Thanks,
Ethan


Ethan Zhao (2):
  PCI: make pci_dev_is_disconnected() helper public for other drivers
  iommu/vt-d: don's issue devTLB flush request when device is
    disconnected

 drivers/iommu/intel/pasid.c | 21 ++++++++++++++++++++-
 drivers/pci/pci.h           |  5 -----
 include/linux/pci.h         |  5 +++++
 3 files changed, 25 insertions(+), 6 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v4 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers
  2023-12-20  0:51 [PATCH v4 0/2] fix vt-d hard lockup when hotplug ATS capable device Ethan Zhao
@ 2023-12-20  0:51 ` Ethan Zhao
  2023-12-20  0:51 ` [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected Ethan Zhao
  1 sibling, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-20  0:51 UTC (permalink / raw)
  To: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, lukas
  Cc: linux-pci, iommu, linux-kernel

Make pci_dev_is_disconnected() public so that it can be called from the
Intel VT-d driver to quickly fix/work around the surprise-removal unplug
hang issue for ATS capable devices on PCIe switch downstream ports.

Tested-by: Haorong Ye <yehaorong@bytedance.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
---
 drivers/pci/pci.h   | 5 -----
 include/linux/pci.h | 5 +++++
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..75fa2084492f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -366,11 +366,6 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 	return 0;
 }
 
-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
-	return dev->error_state == pci_channel_io_perm_failure;
-}
-
 /* pci_dev priv_flags */
 #define PCI_DEV_ADDED 0
 #define PCI_DPC_RECOVERED 1
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 60ca768bc867..869f2ec97a84 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2503,6 +2503,11 @@ static inline struct pci_dev *pcie_find_root_port(struct pci_dev *dev)
 	return NULL;
 }
 
+static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
+{
+	return dev->error_state == pci_channel_io_perm_failure;
+}
+
 void pci_request_acs(void);
 bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
 bool pci_acs_path_enabled(struct pci_dev *start,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-20  0:51 [PATCH v4 0/2] fix vt-d hard lockup when hotplug ATS capable device Ethan Zhao
  2023-12-20  0:51 ` [PATCH v4 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
@ 2023-12-20  0:51 ` Ethan Zhao
  2023-12-21 10:39   ` Lukas Wunner
  1 sibling, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2023-12-20  0:51 UTC (permalink / raw)
  To: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, lukas
  Cc: linux-pci, iommu, linux-kernel

For endpoint devices that connect to the system via hotplug capable
ports, users could request a warm reset of the device by flapping the
device's link through the slot's link control register. As part of the
pciehp_ist() DLLSC interrupt sequence response, pciehp will unload the
device driver and then power the device off, causing an IOMMU devTLB
flush request for the device to be sent and a long completion/timeout
wait in interrupt context.

That causes the following continuous hard lockup warning and system hang:

[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE    kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS:  0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629]  intel_iommu_release_device+0x1f/0x30
[ 4223.822629]  iommu_release_device+0x33/0x60
[ 4223.822629]  iommu_bus_notifier+0x7f/0x90
[ 4223.822630]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822630]  device_del+0x2e5/0x420
[ 4223.822630]  pci_remove_bus_device+0x70/0x110
[ 4223.822630]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631]  pciehp_disable_slot+0x6b/0x100
[ 4223.822631]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631]  pciehp_ist+0x176/0x180
[ 4223.822631]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632]  irq_thread_fn+0x19/0x50
[ 4223.822632]  irq_thread+0x104/0x190
[ 4223.822632]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633]  kthread+0x114/0x130
[ 4223.822633]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822633]  ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE     kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634]  <NMI>
[ 4223.822635]  dump_stack+0x6d/0x88
[ 4223.822635]  panic+0x101/0x2d0
[ 4223.822635]  ? ret_from_fork+0x11/0x30
[ 4223.822635]  nmi_panic.cold.14+0xc/0xc
[ 4223.822636]  watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636]  __perf_event_overflow+0x4f/0xf0
[ 4223.822636]  handle_pmi_common+0x1ef/0x290
[ 4223.822636]  ? __set_pte_vaddr+0x28/0x40
[ 4223.822637]  ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637]  ? __native_set_fixmap+0x24/0x30
[ 4223.822637]  ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637]  ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637]  intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638]  perf_event_nmi_handler+0x24/0x40
[ 4223.822638]  nmi_handle+0x4d/0xf0
[ 4223.822638]  default_do_nmi+0x49/0x100
[ 4223.822638]  exc_nmi+0x134/0x180
[ 4223.822639]  end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  </NMI>
[ 4223.822642]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643]  intel_iommu_release_device+0x1f/0x30
[ 4223.822643]  iommu_release_device+0x33/0x60
[ 4223.822643]  iommu_bus_notifier+0x7f/0x90
[ 4223.822644]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822644]  device_del+0x2e5/0x420
[ 4223.822644]  pci_remove_bus_device+0x70/0x110
[ 4223.822644]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644]  pciehp_disable_slot+0x6b/0x100
[ 4223.822645]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645]  pciehp_ist+0x176/0x180
[ 4223.822645]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645]  irq_thread_fn+0x19/0x50
[ 4223.822646]  irq_thread+0x104/0x190
[ 4223.822646]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646]  kthread+0x114/0x130
[ 4223.822647]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822647]  ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)

Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device that has been marked
pci_channel_io_perm_failure and is then powered off in

pciehp_ist()
   pciehp_handle_presence_or_link_change()
     pciehp_disable_slot()
       remove_board()
         pciehp_unconfigure_device()

For safe (orderly) removal unplug, the link is still up when the iommu
releases the device and issues the devTLB invalidate request, so it
wouldn't trigger such an issue.

This patch works for all kinds of surprise-removal unplug operations.

Tested-by: Haorong Ye <yehaorong@bytedance.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
---
 drivers/iommu/intel/pasid.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..7dbee9931eb6 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
+	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+		return;
+
 	sid = info->bus << 8 | info->devfn;
 	qdep = info->ats_qdep;
 	pfsid = info->pfsid;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-20  0:51 ` [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected Ethan Zhao
@ 2023-12-21 10:39   ` Lukas Wunner
  2023-12-21 11:01     ` Lukas Wunner
  2023-12-22  1:56     ` Ethan Zhao
  0 siblings, 2 replies; 31+ messages in thread
From: Lukas Wunner @ 2023-12-21 10:39 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel

On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

I think the problem is in the "waiting in interrupt context".

Can you change qi_submit_sync() to *sleep* until the queue is done?
Instead of busy-waiting in atomic context?

Is the hardware capable of sending an interrupt once the queue is done?
If it is not capable, would it be viable to poll with exponential backoff
and sleep in-between polling once the polling delay increases beyond, say,
10 usec?

Again, the proposed patch is not a proper solution.  It will paper over
the issue most of the time but every once in a while someone will still
get a hard lockup splat and it will then be more difficult to reproduce
and fix if the proposed patch is accepted.


> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>          OE    kernel version xxxx

I don't see any reason to hide the kernel version.
This isn't Intel Confidential information.


> [ 4223.822628] Call Trace:
> [ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50

__dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0
one and a half years ago, so the stack trace appears to be from
an older kernel version.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-13 11:54     ` Robin Murphy
  2023-12-14  2:40       ` Ethan Zhao
@ 2023-12-21 10:42       ` Lukas Wunner
  2023-12-21 11:01         ` Robin Murphy
  2023-12-22  3:20         ` Ethan Zhao
  1 sibling, 2 replies; 31+ messages in thread
From: Lukas Wunner @ 2023-12-21 10:42 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Ethan Zhao, bhelgaas, baolu.lu, dwmw2, will, linux-pci, iommu,
	linux-kernel, Haorong Ye

On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
> additional pre-removal notifier to take care of that, and that step would
> then want to distinguish between an orderly removal where cleaning up is
> somewhat meaningful, and a surprise removal where it definitely isn't.

Even if a user starts the process for orderly removal, the device may be
surprise-removed *during* that process.  So we cannot assume that the
device is actually accessible if orderly removal has been initiated.
If the form factor supports surprise removal, the device may be gone
at any time.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers
  2023-12-14  0:58     ` Ethan Zhao
@ 2023-12-21 10:51       ` Lukas Wunner
  2023-12-22  2:35         ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Lukas Wunner @ 2023-12-21 10:51 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye

On Thu, Dec 14, 2023 at 08:58:49AM +0800, Ethan Zhao wrote:
> On 12/13/2023 6:49 PM, Lukas Wunner wrote:
> > On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
> > > move pci_dev_is_disconnected() from driver/pci/pci.h to public
> > > include/linux/pci.h for other driver's reference.
> > > no function change.
> > 
> > That's merely a prose description of the code.  A reader can already
> > see from the code what it's doing.  You need to explain the *reason*
> > for the change instead.  E.g.:  "Make pci_dev_is_disconnected() public
> > so that it can be called from $DRIVER to speed up hot removal
> > handling which may otherwise take seconds because of $REASONS."
> 
> Yup, why I made it public. then how about
> 
> "Make pci_dev_is_disconnected() public so that it can be called from
> Intel vt-d driver to check the device's hotplug removal state when
> issue devTLB flush request."

Much better.

You may optionally want to point out the location of the file in the
source tree because not everyone may be familiar where to find the
"Intel vt-d driver".  Also, not every reader may know where issuing
of devTLB flush requests occurs, so it might make sense to name the
function where that happens.  Finally, it is common to adhere to terms
used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
might be preferable to "devTLB flush request".

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-21 10:39   ` Lukas Wunner
@ 2023-12-21 11:01     ` Lukas Wunner
  2023-12-22  2:08       ` Ethan Zhao
  2023-12-22  3:56       ` Ethan Zhao
  2023-12-22  1:56     ` Ethan Zhao
  1 sibling, 2 replies; 31+ messages in thread
From: Lukas Wunner @ 2023-12-21 11:01 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel

On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> > For those endpoint devices connect to system via hotplug capable ports,
> > users could request a warm reset to the device by flapping device's link
> > through setting the slot's link control register, as pciehpt_ist() DLLSC
> > interrupt sequence response, pciehp will unload the device driver and
> > then power it off. thus cause an IOMMU devTLB flush request for device to
> > be sent and a long time completion/timeout waiting in interrupt context.
> 
> I think the problem is in the "waiting in interrupt context".

I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
capability to reduce the Invalidate Completion Timeout to a sane value?
Could you check whether that's supported?

Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
but that's not even a "must", it's a "should".  So devices are free to
take even longer.  We have to cut off at *some* point.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-21 10:42       ` Lukas Wunner
@ 2023-12-21 11:01         ` Robin Murphy
  2023-12-21 11:07           ` Lukas Wunner
  2023-12-22  3:20         ` Ethan Zhao
  1 sibling, 1 reply; 31+ messages in thread
From: Robin Murphy @ 2023-12-21 11:01 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Ethan Zhao, bhelgaas, baolu.lu, dwmw2, will, linux-pci, iommu,
	linux-kernel, Haorong Ye

On 2023-12-21 10:42 am, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
> 
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process.  So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone
> at any time.

Sure, whatever we do there's always going to be some unavoidable 
time-of-check-to-time-of-use race window so we can never guarantee that 
sending a request to the device will succeed. I was just making the 
point that if we *have* already detected a surprise removal, then 
cleaning up its leftover driver model state should still generate a 
BUS_NOTIFY_REMOVE_DEVICE call, but in that case we can know there's no 
point trying to send any requests to the device that's already gone.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-21 11:01         ` Robin Murphy
@ 2023-12-21 11:07           ` Lukas Wunner
  0 siblings, 0 replies; 31+ messages in thread
From: Lukas Wunner @ 2023-12-21 11:07 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Ethan Zhao, bhelgaas, baolu.lu, dwmw2, will, linux-pci, iommu,
	linux-kernel, Haorong Ye

On Thu, Dec 21, 2023 at 11:01:56AM +0000, Robin Murphy wrote:
> On 2023-12-21 10:42 am, Lukas Wunner wrote:
> > On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
> > > I think if we want to ensure ATCs are invalidated on hot-unplug we need an
> > > additional pre-removal notifier to take care of that, and that step would
> > > then want to distinguish between an orderly removal where cleaning up is
> > > somewhat meaningful, and a surprise removal where it definitely isn't.
> > 
> > Even if a user starts the process for orderly removal, the device may be
> > surprise-removed *during* that process.  So we cannot assume that the
> > device is actually accessible if orderly removal has been initiated.
> > If the form factor supports surprise removal, the device may be gone
> > at any time.
> 
> Sure, whatever we do there's always going to be some unavoidable
> time-of-check-to-time-of-use race window so we can never guarantee that
> sending a request to the device will succeed. I was just making the point
> that if we *have* already detected a surprise removal, then cleaning up its
> leftover driver model state should still generate a BUS_NOTIFY_REMOVE_DEVICE
> call, but in that case we can know there's no point trying to send any
> requests to the device that's already gone.

Right, using pci_dev_is_disconnected() as a *speedup* when we
definitely know the device is gone, that's perfectly fine.

So in that sense the proposed patch is acceptable *after* this
series has been extended to make sure hard lockups can *never*
occur on unplug.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-21 10:39   ` Lukas Wunner
  2023-12-21 11:01     ` Lukas Wunner
@ 2023-12-22  1:56     ` Ethan Zhao
  2023-12-22  8:14       ` Lukas Wunner
  1 sibling, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2023-12-22  1:56 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel


On 12/21/2023 6:39 PM, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> I think the problem is in the "waiting in interrupt context".
>
> Can you change qi_submit_sync() to *sleep* until the queue is done?
> Instead of busy-waiting in atomic context?

If you read that function carefully, you wouldn't suggest sleeping
there; it is meant to be synchronous.

>
> Is the hardware capable of sending an interrupt once the queue is done?
> If it is not capable, would it be viable to poll with exponential backoff
> and sleep in-between polling once the polling delay increases beyond, say,
> 10 usec?

I don't know whether polling plus sleeping for completion of a meaningless
devTLB invalidation request blindly sent to a (removed/powered-down/
link-down) device makes sense or not.

But according to PCIe spec 6.1, sec 10.3.1:

"Software ensures no invalidations are issued to a Function when its
ATS capability is disabled."

>
> Again, the proposed patch is not a proper solution.  It will paper over
> the issue most of the time but every once in a while someone will still
> get a hard lockup splat and it will then be more difficult to reproduce
> and fix if the proposed patch is accepted.

Could you point out why it is not proper? Is there any other window in
which the hard lockup could still happen in the ATS-capable device
surprise-removal case if we check the connection state first?

Please help elaborate.

>
>
>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>>           OE    kernel version xxxx
> I don't see any reason to hide the kernel version.
> This isn't Intel Confidential information.
>
Yes, this is an old kernel stack trace, but the customer also tried the
latest 6.7-rc4 (doesn't work) and the patched 6.7-rc4 (fixed).


Thanks,

Ethan

>> [ 4223.822628] Call Trace:
>> [ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
> __dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0
> one and a half years ago, so the stack trace appears to be from
> an older kernel version.
>
> Thanks,
>
> Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-21 11:01     ` Lukas Wunner
@ 2023-12-22  2:08       ` Ethan Zhao
  2023-12-22  3:56       ` Ethan Zhao
  1 sibling, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-22  2:08 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel


On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device to
>>> be sent and a long time completion/timeout waiting in interrupt context.
>> I think the problem is in the "waiting in interrupt context".
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?

Per my understanding it is not about the Intel vt-d's capability but about
the third-party PCIe switch's: the switch is not aware of ATS transactions
at all, so if its downstream port's endpoint device is removed/powered-off/
link-down, it cannot feed back a fault/completion/timeout to the upstream
iommu for the broken ATS transaction, while the root port can (verified).

>
> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
  Correction: it is 1 min (60 sec) + 50%.
> but that's not even a "must", it's a "should".  So devices are free to

It could happen if we blindly wait here, so we should avoid such a case.


Thanks,

Ethan

> take even longer.  We have to cut off at *some* point.
>
> Thanks,
>
> Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers
  2023-12-21 10:51       ` Lukas Wunner
@ 2023-12-22  2:35         ` Ethan Zhao
  0 siblings, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-22  2:35 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye


On 12/21/2023 6:51 PM, Lukas Wunner wrote:
> On Thu, Dec 14, 2023 at 08:58:49AM +0800, Ethan Zhao wrote:
>> On 12/13/2023 6:49 PM, Lukas Wunner wrote:
>>> On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
>>>> move pci_dev_is_disconnected() from driver/pci/pci.h to public
>>>> include/linux/pci.h for other driver's reference.
>>>> no function change.
>>> That's merely a prose description of the code.  A reader can already
>>> see from the code what it's doing.  You need to explain the *reason*
>>> for the change instead.  E.g.:  "Make pci_dev_is_disconnected() public
>>> so that it can be called from $DRIVER to speed up hot removal
>>> handling which may otherwise take seconds because of $REASONS."
>> Yup, why I made it public. then how about
>>
>> "Make pci_dev_is_disconnected() public so that it can be called from
>> Intel vt-d driver to check the device's hotplug removal state when
>> issue devTLB flush request."
> Much better.
>
> You may optionally want to point out the location of the file in the
> source tree because not everyone may be familiar where to find the
> "Intel vt-d driver".  Also, not every reader may know where issuing
> of devTLB flush requests occurs, so it might make sense to name the
> function where that happens.  Finally, it is common to adhere to terms
> used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
> might be preferable to "devTLB flush request".

"ATS Invalidate Request"? "devTLB flush request" has the same meaning;
I thought all iommu/PCIe folks could understand it.


Thanks,

Ethan

>
> Thanks,
>
> Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-21 10:42       ` Lukas Wunner
  2023-12-21 11:01         ` Robin Murphy
@ 2023-12-22  3:20         ` Ethan Zhao
  1 sibling, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-22  3:20 UTC (permalink / raw)
  To: Lukas Wunner, Robin Murphy
  Cc: bhelgaas, baolu.lu, dwmw2, will, linux-pci, iommu, linux-kernel,
	Haorong Ye


On 12/21/2023 6:42 PM, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process.  So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone

There is no hardware lock to prevent the user from powering off/removing
a surprise-removal-capable device after the device connection state is
checked but before the ATS invalidation request is issued, so a request
with no target could still possibly be sent.


Thanks,

Ethan

> at any time.
>
> Thanks,
>
> Lukas
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-21 11:01     ` Lukas Wunner
  2023-12-22  2:08       ` Ethan Zhao
@ 2023-12-22  3:56       ` Ethan Zhao
  1 sibling, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-22  3:56 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel


On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device to
>>> be sent and a long time completion/timeout waiting in interrupt context.
>> I think the problem is in the "waiting in interrupt context".
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?
>
> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
> but that's not even a "must", it's a "should".  So devices are free to
> take even longer.  We have to cut off at *some* point.

I really "expected" there to be an interrupt signal to the iommu hardware
when a PCIe switch downstream device is gone, or some internal polling/
heartbeating of the endpoint device to detect ATS breakage, but so far it
seems there are only hotplug interrupts to the downstream control.

...

How do we define the point, "some" msec, at which the software times out
and breaks out of the waiting loop? Or polls whether the target is gone?


Thanks,

Ethan



>
> Thanks,
>
> Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-22  1:56     ` Ethan Zhao
@ 2023-12-22  8:14       ` Lukas Wunner
  2023-12-22  9:01         ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Lukas Wunner @ 2023-12-22  8:14 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye

On Fri, Dec 22, 2023 at 09:56:39AM +0800, Ethan Zhao wrote:
> I don't know if the polling along sleeping for completion of meanningless
> devTLB invalidation request blindly sent to (removed/powered down/link down)
> device makes sense or not.

If you have a way to get to the struct pci_dev * which you're waiting for
in qi_submit_sync() then I guess you could check for its presence and bail
out if it's gone, instead of issuing a cpu_relax().


> > Again, the proposed patch is not a proper solution.  It will paper over
> > the issue most of the time but every once in a while someone will still
> > get a hard lockup splat and it will then be more difficult to reproduce
> > and fix if the proposed patch is accepted.
> 
> Could you point out why is not proper ? Is there any other window
> the hard lockup still could happen with the ATS capable devcie
> supprise_removal case if we checked the connection state first ?
> Please help to elaberate it.

Even though user space may have initiated orderly removal via sysfs,
the device may be yanked from the slot (surprise removed) while the
orderly removal is happening.


> Yes, this is the old kernel stack trace, but customer also tried lasted
> 6.7rc4

If you could provide a stacktrace for a contemporary kernel,
I think that would be preferred.


> (doesn't work) and the patched 6.7rc4 (fixed).

Why is it fixed in v6.7-rc4?  Is the present patch thus unnecessary?


> > Finally, it is common to adhere to terms
> > used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
> > might be preferable to "devTLB flush request".
> 
> ATS Invalidate Request ? devTLB flush request has the same meaning,
> 
> I thought all iommu/PCIe guys could understand.

I'm just pointing out the preferred way to write commit messages
in the PCI subsystem (as I've perceived it over the years) so that
you can reduce the number of iterations you have to go through
due to maintainer feedback.  I'm just trying to be helpful.


> How to define the point "some" msec to timeout while software
> break out the waiting loop ? or polling if the target is gone ?

I'd say adhere to the 1 min + 50% number provided in the spec.

If you know the device is gone before that then you can break out
of the loop in qi_submit_sync() of course.

The question is, does the Intel IOMMU have a timeout at all for
Invalidate Requests?  I guess we don't really know that because
in the stack trace you've provided, the watchdog stops the machine
before a timeout occurs.  So it's at least 12 sec.  Or there's
no timeout at all.

If the Intel IOMMU doesn't enforce a timeout, you should probably amend
qi_submit_sync() to break out of the loop once the 1 min + 50% limit
is exceeded.  And you need to amend the function to sleep instead of
polling in interrupt context.

Can you check with hardware engineers whether there's a timeout?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected
  2023-12-22  8:14       ` Lukas Wunner
@ 2023-12-22  9:01         ` Ethan Zhao
  0 siblings, 0 replies; 31+ messages in thread
From: Ethan Zhao @ 2023-12-22  9:01 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas, baolu.lu, dwmw2, will, robin.murphy, linux-pci, iommu,
	linux-kernel, Haorong Ye


On 12/22/2023 4:14 PM, Lukas Wunner wrote:
> On Fri, Dec 22, 2023 at 09:56:39AM +0800, Ethan Zhao wrote:
>> I don't know if the polling along sleeping for completion of meanningless
>> devTLB invalidation request blindly sent to (removed/powered down/link down)
>> device makes sense or not.
> If you have a way to get to the struct pci_dev * which you're waiting for
> in qi_submit_sync() then I guess you could check for its presence and bail
> out if it's gone, instead of issuing a cpu_relax().
One option to bail out of the loop.
>
>>> Again, the proposed patch is not a proper solution.  It will paper over
>>> the issue most of the time but every once in a while someone will still
>>> get a hard lockup splat and it will then be more difficult to reproduce
>>> and fix if the proposed patch is accepted.
>> Could you point out why is not proper ? Is there any other window
>> the hard lockup still could happen with the ATS capable devcie
>> supprise_removal case if we checked the connection state first ?
>> Please help to elaberate it.
> Even though user space may have initiated orderly removal via sysfs,
> the device may be yanked from the slot (surprise removed) while the
> orderly removal is happening.

Yes, just after the wait descriptor is submitted and before waiting in
the loop; the rare but worst case.

>
>
>> Yes, this is the old kernel stack trace, but customer also tried lasted
>> 6.7rc4
> If you could provide a stacktrace for a contemporary kernel,
> I think that would be preferred.
The customer tried, but they didn't provide me the latest trace.
>
>
>> (doesn't work) and the patched 6.7rc4 (fixed).
> Why is it fixed in v6.7-rc4?  Is the present patch thus unnecessary?
Not fixed in v6.7-rc4; with this patch applied, they said the unplug works.
>
>>> Finally, it is common to adhere to terms
>>> used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
>>> might be preferable to "devTLB flush request".
>> ATS Invalidate Request ? devTLB flush request has the same meaning,
>>
>> I thought all iommu/PCIe guys could understand.
> I'm just pointing out the preferred way to write commit messages
> in the PCI subsystem (as I've perceived it over the years) so that
> you can reduce the number of iterations you have to go through
> due to maintainer feedback.  I'm just trying to be helpful.
>
Understand.
>> How to define the point "some" msec to timeout while software
>> break out the waiting loop ? or polling if the target is gone ?
> I'd say adhere to the 1 min + 50% number provided in the spec.
>
> If you know the device is gone before that then you can break out
> of the loop in qi_submit_sync() of course.

I am trying to find a way to break out of the loop in qi_submit_sync() by
checking the device state there, but that seems awkward in this low-level
iommu code and would require some interfaces to be modified.

That would cost me many more hours to make the rare case work. To be
complete:

1. check the PCI device state in the loop

2. modify the invalidation descriptor status in the
pciehp_ist() -> intel_iommu_release_device() call.

>
> The question is, does the Intel IOMMU have a timeout at all for
> Invalidate Requests?  I guess we don't really know that because
> in the stack trace you've provided, the watchdog stops the machine
> before a timeout occurs.  So it's at least 12 sec.  Or there's
> no timeout at all.

The call trace wouldn't tell us whether there is really a timeout of
1 min + 50% or not; even if there is, it is meaningless.

> If the Intel IOMMU doesn't enforce a timeout, you should probably amend
> qi_submit_sync() to break out of the loop once the 1 min + 50% limit
> is exceeded.  And you need to amend the function to sleep instead of
> polling in interrupt context.

Too many paths call this function; revising it to be non-sync would
impact too many things.

>
> Can you check with hardware engineers whether there's a timeout?

Combined with third-party PCIe switch chips?


Thanks,

Ethan

>
> Thanks,
>
> Lukas
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2023-12-22  9:01 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-13  3:46 [PATCH RFC 0/2] fix vt-d hard lockup when hotplug ATS capable device Ethan Zhao
2023-12-13  3:46 ` [PATCH 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
2023-12-13 10:49   ` Lukas Wunner
2023-12-14  0:58     ` Ethan Zhao
2023-12-21 10:51       ` Lukas Wunner
2023-12-22  2:35         ` Ethan Zhao
2023-12-13  3:46 ` [PATCH 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected Ethan Zhao
2023-12-13 10:44   ` Lukas Wunner
2023-12-13 11:54     ` Robin Murphy
2023-12-14  2:40       ` Ethan Zhao
2023-12-21 10:42       ` Lukas Wunner
2023-12-21 11:01         ` Robin Murphy
2023-12-21 11:07           ` Lukas Wunner
2023-12-22  3:20         ` Ethan Zhao
2023-12-14  2:16     ` Ethan Zhao
2023-12-15  0:43       ` Ethan Zhao
2023-12-13 11:59   ` Baolu Lu
2023-12-14  2:26     ` Ethan Zhao
2023-12-15  1:03       ` Ethan Zhao
2023-12-15  1:34         ` Baolu Lu
2023-12-15  1:51           ` Ethan Zhao
2023-12-20  0:51 [PATCH v4 0/2] fix vt-d hard lockup when hotplug ATS capable device Ethan Zhao
2023-12-20  0:51 ` [PATCH v4 1/2] PCI: make pci_dev_is_disconnected() helper public for other drivers Ethan Zhao
2023-12-20  0:51 ` [PATCH v4 2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected Ethan Zhao
2023-12-21 10:39   ` Lukas Wunner
2023-12-21 11:01     ` Lukas Wunner
2023-12-22  2:08       ` Ethan Zhao
2023-12-22  3:56       ` Ethan Zhao
2023-12-22  1:56     ` Ethan Zhao
2023-12-22  8:14       ` Lukas Wunner
2023-12-22  9:01         ` Ethan Zhao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.