linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug
@ 2023-03-28  4:51 Dexuan Cui
  2023-03-28  4:51 ` [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations() Dexuan Cui
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  4:51 UTC (permalink / raw)
  To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
	leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
	wei.liu, longli, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

Before the guest finishes probing a device, the host may be already starting
to remove the device. Currently there are multiple race condition bugs in the
pci-hyperv driver, which can cause the guest to panic.  The patchset fixes
the crashes.

The patchset also does some cleanup work: patch 3 removes the useless
hv_pcichild_state, and patch 4 reverts an old patch which is not really
useful (without patch 4, it would be hard to make patch 5 clean).

Patch 6 removes the use of a global mutex lock, and enables async-probing
to allow concurrent device probing for faster boot.

The patchset is also availsble in my github branch:
https://github.com/dcui/tdx/commits/decui/vpci/v6.3-rc3-v1

Please review. Thanks!

Dexuan Cui (6):
  PCI: hv: fix a race condition bug in hv_pci_query_relations()
  PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
  PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
  Revert "PCI: hv: Fix a timing issue which causes kdump to fail
    occasionally"
  PCI: hv: Add a per-bus mutex state_lock
  PCI: hv: Use async probing to reduce boot time

 drivers/pci/controller/pci-hyperv.c | 143 +++++++++++++++++-----------
 1 file changed, 85 insertions(+), 58 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
@ 2023-03-28  4:51 ` Dexuan Cui
  2023-03-28  5:29   ` [EXTERNAL] " Saurabh Singh Sengar
  2023-03-28 16:48   ` Long Li
  2023-03-28  4:51 ` [PATCH 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic Dexuan Cui
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  4:51 UTC (permalink / raw)
  To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
	leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
	wei.liu, longli, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

Fix the longstanding race between hv_pci_query_relations() and
survey_child_resources() by flushing the workqueue before we exit from
hv_pci_query_relations().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>

---
 drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

With the below debug code:

@@ -2103,6 +2103,8 @@ static void survey_child_resources(struct hv_pcibus_device *hbus)
 	}

 	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
+	ssleep(15);
+	printk("%s: completing %px\n", __func__, event);
 	complete(event);
 }

@@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device *hdev)

 	ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
 			       0, VM_PKT_DATA_INBAND, 0);
-	if (!ret)
+	if (!ret) {
+		ssleep(10); // unassign the PCI device on the host during the 10s
 		ret = wait_for_response(hdev, &comp);
+		printk("%s: comp=%px is becoming invalid! ret=%d\n",
+			__func__, &comp, ret);
+	}

 	return ret;
 }
@@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,

 retry:
 	ret = hv_pci_query_relations(hdev);
+	printk("hv_pci_query_relations() exited\n");
+
 	if (ret)
 		goto free_irq_domain;

I'm able to repro the below hang issue:

[   74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[   84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[   84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming invalid! ret=-19
[   84.797505] hv_pci_query_relations() exited
[   89.652268] survey_child_resources: completing ffffa7504012fb58
[  150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  150.398447] rcu:     15-...0: (2 ticks this GP) idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
[  150.405851] rcu:     (detected by 14, t=15004 jiffies, g=2553, q=4833 ncpus=16)
[  150.410870] Sending NMI from CPU 14 to CPUs 15:
[  150.414836] NMI backtrace for cpu 15
[  150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G        W   E      6.3.0-rc3-decui-dirty #34
...
[  150.414849] Workqueue: hv_pci_468b pci_devices_present_work [pci_hyperv]
[  150.414866] RIP: 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
...
[  150.414905] Call Trace:
[  150.414907]  <TASK>
[  150.414911]  _raw_spin_lock_irqsave+0x40/0x50
[  150.414917]  complete+0x1d/0x60
[  150.414924]  pci_devices_present_work+0x5dd/0x680 [pci_hyperv]
[  150.414946]  process_one_work+0x21f/0x430
[  150.414952]  worker_thread+0x4a/0x3c0

With this patch, the hang issue goes away:

[  186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[  186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming invalid! ret=-19
[  191.263611] survey_child_resources: completing ffffa7cfd0aa3b50
[  191.267732] hv_pci_query_relations() exited

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index f33370b75628..b82c7cde19e6 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device *hdev)
 	if (!ret)
 		ret = wait_for_response(hdev, &comp);
 
+	/*
+	 * In the case of fast device addition/removal, it's possible that
+	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+	 * already got a PCI_BUS_RELATIONS* message from the host and the
+	 * channel callback already scheduled a work to hbus->wq, which can be
+	 * running survey_child_resources() -> complete(&hbus->survey_event),
+	 * even after hv_pci_query_relations() exits and the stack variable
+	 * 'comp' is no longer valid. This can cause a strange hang issue
+	 * or sometimes a page fault. Flush hbus->wq before we exit from
+	 * hv_pci_query_relations() to avoid the issues.
+	 */
+	flush_workqueue(hbus->wq);
+
 	return ret;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
  2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
  2023-03-28  4:51 ` [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations() Dexuan Cui
@ 2023-03-28  4:51 ` Dexuan Cui
  2023-03-28  4:51 ` [PATCH 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev Dexuan Cui
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  4:51 UTC (permalink / raw)
  To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
	leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
	wei.liu, longli, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

When the host tries to remove a PCI device, the host first sends a
PCI_EJECT message to the guest, and the guest is supposed to gracefully
remove the PCI device and send a PCI_EJECTION_COMPLETE message to the host;
the host then sends a VMBus message CHANNELMSG_RESCIND_CHANNELOFFER to
the guest (when the guest receives this message, the device is already
unassigned from the guest) and the guest can do some final cleanup work;
if the guest fails to respond to the PCI_EJECT message within one minute,
the host sends the VMBus message CHANNELMSG_RESCIND_CHANNELOFFER and
removes the PCI device forcibly.

In the case of fast device addition/removal, it's possible that the PCI
device driver is still configuring MSI-X interrupts when the guest receives
the PCI_EJECT message; the channel callback calls hv_pci_eject_device(),
which sets hpdev->state to hv_pcichild_ejecting, and schedules a work
hv_eject_device_work(); if the PCI device driver is calling
pci_alloc_irq_vectors() -> ... -> hv_compose_msi_msg(), we can break the
while loop in hv_compose_msi_msg() due to the updated hpdev->state, and
leave data->chip_data with its default value of NULL; later, when the PCI
device driver calls request_irq() -> ... -> hv_irq_unmask(), the guest
crashes in hv_arch_irq_unmask() due to data->chip_data being NULL.

Fix the issue by not testing hpdev->state in the while loop: when the
guest receives PCI_EJECT, the device is still assigned to the guest, and
the guest has one minute to finish the device removal gracefully. We don't
really need to (and we should not) test hpdev->state in the loop.

Fixes: de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")
Signed-off-by: Dexuan Cui <decui@microsoft.com>

---
 drivers/pci/controller/pci-hyperv.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

With the below debug code:

@@ -643,6 +643,9 @@ static void hv_arch_irq_unmask(struct irq_data *data)
 	pbus = pdev->bus;
 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
 	int_desc = data->chip_data;
+	if (!int_desc)
+		dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
+			 __func__, data->irq);

 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);

@@ -1865,6 +1868,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		goto free_int_desc;
 	}

+	printk("%s: line %d: irq=%u\n", __func__, __LINE__, data->irq);
+	{
+		static bool delayed; //remove the device within the 10s.
+		if (!delayed) { delayed = true; mdelay(10000); }
+	}
 	ret = vmbus_sendpacket_getid(hpdev->hbus->hdev->channel, &ctxt.int_pkts,
 				     size, (unsigned long)&ctxt.pci_pkt,
 				     &trans_id, VM_PKT_DATA_INBAND,

I'm able to repro the below panic:

[   23.258674] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   23.271313] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI host bridge to bus 468b:00
[   23.274554] pci_bus 468b:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
[   23.277733] pci_bus 468b:00: No busn resource found for root bus, will use [bus 00-ff]
[   23.283845] pci 468b:00:02.0: [15b3:1016] type 00 class 0x020000
[   23.289796] pci 468b:00:02.0: reg 0x10: [mem 0xfe0000000-0xfe00fffff 64bit pref]
[   23.296463] pci 468b:00:02.0: enabling Extended Tags
...
[   23.331300] pci_bus 468b:00: busn_res: [bus 00-ff] end is updated to 00
[   23.334130] pci 468b:00:02.0: BAR 0: assigned [mem 0xfe0000000-0xfe00fffff 64bit pref]
[   23.507985] mlx5_core 468b:00:02.0: no default pinctrl state
[   23.510834] mlx5_core 468b:00:02.0: enabling device (0000 -> 0002)
[   23.516843] mlx5_core 468b:00:02.0: firmware version: 14.25.8102
[   23.745069] hv_compose_msi_msg: line 1871: irq=24
[   33.685554] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: the device is being ejected
[   33.690855] hv_compose_msi_msg: line 1871: irq=25
[   33.694797] hv_compose_msi_msg: line 1871: irq=26
[   33.698884] hv_compose_msi_msg: line 1871: irq=27
[   33.702910] hv_compose_msi_msg: line 1871: irq=28
[   33.705726] hv_compose_msi_msg: line 1871: irq=29
[   33.709644] hv_compose_msi_msg: line 1871: irq=29
[   33.712182] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: hv_arch_irq_unmask() can not unmask irq 29
[   33.716625] BUG: kernel NULL pointer dereference, address: 0000000000000008
...
[   33.737426] Workqueue: events work_for_cpu_fn
[   33.739562] RIP: 0010:hv_irq_unmask+0xc2/0x400 [pci_hyperv]
...
[   33.778511] Call Trace:
[   33.779533]  <TASK>
[   33.780428]  unmask_irq.part.0+0x23/0x40
[   33.781994]  irq_enable+0x60/0x70
[   33.783336]  __irq_startup+0x5b/0x80
[   33.784772]  irq_startup+0x75/0x140
[   33.786175]  __setup_irq+0x3ae/0x840
[   33.787586]  request_threaded_irq+0x112/0x180
[   33.789298]  mlx5_irq_alloc+0x111/0x310 [mlx5_core]
[   33.791464]  irq_pool_request_vector+0x72/0x80 [mlx5_core]
[   33.794449]  mlx5_ctrl_irq_request+0xc9/0x160 [mlx5_core]
[   33.797454]  mlx5_eq_table_create+0x9e/0xb30 [mlx5_core]
[   33.802127]  mlx5_load+0x54/0x3b0 [mlx5_core]
[   33.804157]  mlx5_init_one+0x1e6/0x550 [mlx5_core]
[   33.806347]  probe_one+0x2e5/0x460 [mlx5_core]
[   33.808664]  local_pci_probe+0x4b/0xb0
[   33.810377]  work_for_cpu_fn+0x1a/0x30
[   33.812275]  process_one_work+0x21f/0x430
[   33.814700]  worker_thread+0x1fa/0x3c0

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index b82c7cde19e6..1b11cf739193 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data *data)
 	pbus = pdev->bus;
 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
 	int_desc = data->chip_data;
+	if (!int_desc) {
+		dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
+			 __func__, data->irq);
+		return;
+	}
 
 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
 
@@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		hv_pci_onchannelcallback(hbus);
 		spin_unlock_irqrestore(&channel->sched_lock, flags);
 
-		if (hpdev->state == hv_pcichild_ejecting) {
-			dev_err_once(&hbus->hdev->device,
-				     "the device is being ejected\n");
-			goto enable_tasklet;
-		}
-
 		udelay(100);
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
  2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
  2023-03-28  4:51 ` [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations() Dexuan Cui
  2023-03-28  4:51 ` [PATCH 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic Dexuan Cui
@ 2023-03-28  4:51 ` Dexuan Cui
  2023-03-28  4:51 ` [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally" Dexuan Cui
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  4:51 UTC (permalink / raw)
  To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
	leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
	wei.liu, longli, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

The hpdev->state is never really useful. The only use in
hv_pci_eject_device() and hv_eject_device_work() is not really necessary.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 1b11cf739193..46df6d093d68 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -553,19 +553,10 @@ struct hv_dr_state {
 	struct hv_pcidev_description func[];
 };
 
-enum hv_pcichild_state {
-	hv_pcichild_init = 0,
-	hv_pcichild_requirements,
-	hv_pcichild_resourced,
-	hv_pcichild_ejecting,
-	hv_pcichild_maximum
-};
-
 struct hv_pci_dev {
 	/* List protected by pci_rescan_remove_lock */
 	struct list_head list_entry;
 	refcount_t refs;
-	enum hv_pcichild_state state;
 	struct pci_slot *pci_slot;
 	struct hv_pcidev_description desc;
 	bool reported_missing;
@@ -2750,8 +2741,6 @@ static void hv_eject_device_work(struct work_struct *work)
 	hpdev = container_of(work, struct hv_pci_dev, wrk);
 	hbus = hpdev->hbus;
 
-	WARN_ON(hpdev->state != hv_pcichild_ejecting);
-
 	/*
 	 * Ejection can come before or after the PCI bus has been set up, so
 	 * attempt to find it and tear down the bus state, if it exists.  This
@@ -2808,7 +2797,6 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
 		return;
 	}
 
-	hpdev->state = hv_pcichild_ejecting;
 	get_pcichild(hpdev);
 	INIT_WORK(&hpdev->wrk, hv_eject_device_work);
 	queue_work(hbus->wq, &hpdev->wrk);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
  2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
                   ` (2 preceding siblings ...)
  2023-03-28  4:51 ` [PATCH 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev Dexuan Cui
@ 2023-03-28  4:51 ` Dexuan Cui
  2023-03-28  6:33   ` Dexuan Cui
  2023-03-28  4:51 ` [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock Dexuan Cui
  2023-03-28  4:51 ` [PATCH 6/6] PCI: hv: Use async probing to reduce boot time Dexuan Cui
  5 siblings, 1 reply; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  4:51 UTC (permalink / raw)
  To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
	leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
	wei.liu, longli, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.

The statement "the hv_pci_bus_exit() call releases structures of all its
child devices" in commit d6af2ed29c7c is not true: in the path
hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the
parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
child "struct hv_pci_dev *hpdev" that is created earlier in
pci_devices_present_work() -> new_pcichild_device().

The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
where the old version of hv_pci_bus_exit() was used; when the commit was
rebased and merged into the upstream, people didn't notice that it's
not really necessary. The commit itself doesn't cause any issue, but it
makes hv_pci_probe() more complicated. Revert it to facilitate some
upcoming changes to hv_pci_probe().

Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 71 ++++++++++++++---------------
 1 file changed, 34 insertions(+), 37 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 46df6d093d68..48feab095a14 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3225,8 +3225,10 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
 	struct pci_bus_d0_entry *d0_entry;
 	struct hv_pci_compl comp_pkt;
 	struct pci_packet *pkt;
+	bool retry = true;
 	int ret;
 
+enter_d0_retry:
 	/*
 	 * Tell the host that the bus is ready to use, and moved into the
 	 * powered-on state.  This includes telling the host which region
@@ -3253,6 +3255,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
 	if (ret)
 		goto exit;
 
+	/*
+	 * In certain case (Kdump) the pci device of interest was
+	 * not cleanly shut down and resource is still held on host
+	 * side, the host could return invalid device status.
+	 * We need to explicitly request host to release the resource
+	 * and try to enter D0 again.
+	 */
+	if (comp_pkt.completion_status < 0 && retry) {
+		retry = false;
+
+		dev_err(&hdev->device, "Retrying D0 Entry\n");
+
+		/*
+		 * Hv_pci_bus_exit() calls hv_send_resource_released()
+		 * to free up resources of its child devices.
+		 * In the kdump kernel we need to set the
+		 * wslot_res_allocated to 255 so it scans all child
+		 * devices to release resources allocated in the
+		 * normal kernel before panic happened.
+		 */
+		hbus->wslot_res_allocated = 255;
+
+		ret = hv_pci_bus_exit(hdev, true);
+
+		if (ret == 0) {
+			kfree(pkt);
+			goto enter_d0_retry;
+		}
+		dev_err(&hdev->device,
+			"Retrying D0 failed with ret %d\n", ret);
+	}
+
 	if (comp_pkt.completion_status < 0) {
 		dev_err(&hdev->device,
 			"PCI Pass-through VSP failed D0 Entry with status %x\n",
@@ -3493,7 +3527,6 @@ static int hv_pci_probe(struct hv_device *hdev,
 	struct hv_pcibus_device *hbus;
 	u16 dom_req, dom;
 	char *name;
-	bool enter_d0_retry = true;
 	int ret;
 
 	/*
@@ -3633,47 +3666,11 @@ static int hv_pci_probe(struct hv_device *hdev,
 	if (ret)
 		goto free_fwnode;
 
-retry:
 	ret = hv_pci_query_relations(hdev);
 	if (ret)
 		goto free_irq_domain;
 
 	ret = hv_pci_enter_d0(hdev);
-	/*
-	 * In certain case (Kdump) the pci device of interest was
-	 * not cleanly shut down and resource is still held on host
-	 * side, the host could return invalid device status.
-	 * We need to explicitly request host to release the resource
-	 * and try to enter D0 again.
-	 * Since the hv_pci_bus_exit() call releases structures
-	 * of all its child devices, we need to start the retry from
-	 * hv_pci_query_relations() call, requesting host to send
-	 * the synchronous child device relations message before this
-	 * information is needed in hv_send_resources_allocated()
-	 * call later.
-	 */
-	if (ret == -EPROTO && enter_d0_retry) {
-		enter_d0_retry = false;
-
-		dev_err(&hdev->device, "Retrying D0 Entry\n");
-
-		/*
-		 * Hv_pci_bus_exit() calls hv_send_resources_released()
-		 * to free up resources of its child devices.
-		 * In the kdump kernel we need to set the
-		 * wslot_res_allocated to 255 so it scans all child
-		 * devices to release resources allocated in the
-		 * normal kernel before panic happened.
-		 */
-		hbus->wslot_res_allocated = 255;
-		ret = hv_pci_bus_exit(hdev, true);
-
-		if (ret == 0)
-			goto retry;
-
-		dev_err(&hdev->device,
-			"Retrying D0 failed with ret %d\n", ret);
-	}
 	if (ret)
 		goto free_irq_domain;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock
  2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
                   ` (3 preceding siblings ...)
  2023-03-28  4:51 ` [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally" Dexuan Cui
@ 2023-03-28  4:51 ` Dexuan Cui
  2023-03-29 16:41   ` Michael Kelley (LINUX)
  2023-03-28  4:51 ` [PATCH 6/6] PCI: hv: Use async probing to reduce boot time Dexuan Cui
  5 siblings, 1 reply; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  4:51 UTC (permalink / raw)
  To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
	leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
	wei.liu, longli, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

In the case of fast device addition/removal, it's possible that
hv_eject_device_work() can start to run before create_root_hv_pci_bus()
starts to run; as a result, the pci_get_domain_bus_and_slot() in
hv_eject_device_work() can return a 'pdev' of NULL, and
hv_eject_device_work() can remove the 'hpdev', and immediately send a
message PCI_EJECTION_COMPLETE to the host, and the host immediately
unassigns the PCI device from the guest; meanwhile,
create_root_hv_pci_bus() and the PCI device driver can be probing the
dead PCI device and reporting timeout errors.

Fix the issue by adding a per-bus mutex 'state_lock' and grabbing the
mutex before powering on the PCI bus in hv_pci_enter_d0(): when
hv_eject_device_work() starts to run, it's able to find the 'pdev' and call
pci_stop_and_remove_bus_device(pdev): if the PCI device driver has
loaded, the PCI device driver's probe() function is already called in
create_root_hv_pci_bus() -> pci_bus_add_devices(), and now
hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able
to call the PCI device driver's remove() function and remove the device
reliably; if the PCI device driver hasn't loaded yet, the function call
hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to
remove the PCI device reliably and the PCI device driver's probe()
function won't be called; if the PCI device driver's probe() is already
running (e.g., systemd-udev is loading the PCI device driver), it must
be holding the per-device lock, and after the probe() finishes and releases
the lock, hv_eject_device_work() -> pci_stop_and_remove_bus_device() is
able to proceed to remove the device reliably.

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>

---
 drivers/pci/controller/pci-hyperv.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

With the below debug code:

Changes to /drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1727,6 +1727,8 @@ static int probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct devlink *devlink;
 	int err;

+	printk("%s: line %d: sleep 20s: \n", __func__, __LINE__); ssleep(20);
+	printk("%s: line %d: sleep 20s: done\n", __func__, __LINE__);
 	devlink = mlx5_devlink_alloc(&pdev->dev);
 	if (!devlink) {
 		dev_err(&pdev->dev, "devlink alloc failed\n");
Changes to drivers/pci/controller/pci-hyperv.c
@@ -2749,6 +2749,7 @@ static void hv_eject_device_work(struct work_struct *work)
 	 */
 	wslot = wslot_to_devfn(hpdev->desc.win_slot.slot);
 	pdev = pci_get_domain_bus_and_slot(hbus->bridge->domain_nr, 0, wslot);
+	printk("%s: 1: line %d: pdev=%px\n", __func__, __LINE__, pdev);
 	if (pdev) {
 		pci_lock_rescan_remove();
 		pci_stop_and_remove_bus_device(pdev);
@@ -2756,9 +2757,12 @@ static void hv_eject_device_work(struct work_struct *work)
 		pci_unlock_rescan_remove();
 	}

+	printk("%s: 2: line %d: pdev=%px: sleeping 10s\n", __func__, __LINE__, pdev); ssleep(10);
+	printk("%s: 3: line %d: pdev=%px: sleeping 10s: done: removing hpdev\n", __func__, __LINE__, pdev);
 	spin_lock_irqsave(&hbus->device_list_lock, flags);
 	list_del(&hpdev->list_entry);
 	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
+	printk("%s: 4: line %d: pdev=%px: hpdev is removed!!!\n", __func__, __LINE__, pdev);

 	if (hpdev->pci_slot)
 		pci_destroy_slot(hpdev->pci_slot);
@@ -2770,6 +2774,7 @@ static void hv_eject_device_work(struct work_struct *work)
 	vmbus_sendpacket(hbus->hdev->channel, ejct_pkt,
 			 sizeof(*ejct_pkt), 0,
 			 VM_PKT_DATA_INBAND, 0);
+	printk("%s: 5: line %d: pdev=%px: sent PCI_EJECTION_COMPLETE\n", __func__, __LINE__, pdev);

 	/* For the get_pcichild() in hv_pci_eject_device() */
 	put_pcichild(hpdev);
@@ -3686,7 +3691,10 @@ static int hv_pci_probe(struct hv_device *hdev,

 	hbus->state = hv_pcibus_probed;

+	printk("%s: line %d: create_root_hv_pci_bus=%d: sleeping 10s\n", __func__, __LINE__, ret); ssleep(10);
+	printk("%s: line %d: create_root_hv_pci_bus=%d: sleeping 10s: done\n", __func__, __LINE__, ret);
 	ret = create_root_hv_pci_bus(hbus);
+	printk("%s: line %d: create_root_hv_pci_bus=%d\n", __func__, __LINE__, ret);
 	if (ret)
 		goto free_windows;

I'm able to repro the below timeout errors (remove the PCI VF NIC within 10 seconds after it's assigned)

[   31.445306] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   31.452133] hv_pci_probe: line 3694: create_root_hv_pci_bus=0: sleeping 10s
[   34.345256] hv_eject_device_work: 1: line 2752: pdev=0000000000000000
[   34.350175] hv_eject_device_work: 2: line 2760: pdev=0000000000000000: sleeping 10s
[   41.650330] hv_pci_probe: line 3695: create_root_hv_pci_bus=0: sleeping 10s: done
[   41.668443] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI host bridge to bus 468b:00
[   41.674201] pci_bus 468b:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
[   41.680284] pci_bus 468b:00: No busn resource found for root bus, will use [bus 00-ff]
[   41.688901] pci 468b:00:02.0: [15b3:1016] type 00 class 0x020000
[   41.695554] pci 468b:00:02.0: reg 0x10: [mem 0xfe0000000-0xfe00fffff 64bit pref]
[   41.704304] pci 468b:00:02.0: enabling Extended Tags
...
[   41.745847] hv_pci_probe: line 3697: create_root_hv_pci_bus=0
[   41.938950] mlx5_core 468b:00:02.0: no default pinctrl state
[   41.941462] probe_one: line 1730: sleep 20s:
[   44.466334] hv_eject_device_work: 3: line 2761: pdev=0000000000000000: sleeping 10s: done: removing hpdev
[   44.472691] hv_eject_device_work: 4: line 2765: pdev=0000000000000000: hpdev is removed!!!
[   44.478007] hv_eject_device_work: 5: line 2777: pdev=0000000000000000: sent PCI_EJECTION_COMPLETE
[   44.480213] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[   63.154376] probe_one: line 1731: sleep 20s: done
[   63.161610] mlx5_core 468b:00:02.0: firmware version: 65535.65535.65535
[   83.174538] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 100s (0xffffffff)
[  103.190543] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 79s (0xffffffff)
[  123.202507] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 59s (0xffffffff)
[  143.214547] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 39s (0xffffffff)
[  163.222594] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 19s (0xffffffff)
[  183.178629] mlx5_core 468b:00:02.0: mlx5_function_setup:1130:(pid 17): Firmware over 120000 MS in pre-initializing state, aborting
[  183.186289] mlx5_core 468b:00:02.0: probe_one:1764:(pid 17): mlx5_init_one failed with error code -16
[  183.192623] mlx5_core: probe of 468b:00:02.0 failed with error -16
[  183.202701] pci_bus 468b:00: busn_res: [bus 00] is released

With the fix, I'm no longer seeing the timeout errors:

[   31.144066] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   31.149207] hv_pci_probe: line 3708: create_root_hv_pci_bus=0: sleeping 10s
[   41.378274] hv_pci_probe: line 3709: create_root_hv_pci_bus=0: sleeping 10s: done
[   41.397577] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI host bridge to bus 468b:00
[   41.402799] pci_bus 468b:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
...
[   41.415586] pci 468b:00:02.0: [15b3:1016] type 00 class 0x020000
[   41.423132] pci 468b:00:02.0: reg 0x10: [mem 0xfe0000000-0xfe00fffff 64bit pref]
...
[   41.471366] hv_pci_probe: line 3711: create_root_hv_pci_bus=0
[   41.484371] hv_eject_device_work: 1: line 2761: pdev=ffff90f4c4858000
[   41.491644] hv_eject_device_work: 2: line 2769: pdev=ffff90f4c4858000: sleeping 10s
[   51.618268] hv_eject_device_work: 3: line 2770: pdev=ffff90f4c4858000: sleeping 10s: done: removing hpdev
[   51.625094] hv_eject_device_work: 4: line 2774: pdev=ffff90f4c4858000: hpdev is removed!!!
[   51.630234] hv_eject_device_work: 5: line 2786: pdev=ffff90f4c4858000: sent PCI_EJECTION_COMPLETE
[   51.632148] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[   51.647014] pci_bus 468b:00: busn_res: [bus 00] is released

Now the mlx5_core driver is loaded; I try fast device addition/removal again
and I'm still not seeing the timeout errors:

[  495.622678] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[  495.627791] hv_pci_probe: line 3708: create_root_hv_pci_bus=0: sleeping 10s
[  505.762550] hv_pci_probe: line 3709: create_root_hv_pci_bus=0: sleeping 10s: done
[  505.779496] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI host bridge to bus 468b:00
[  505.784872] pci_bus 468b:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
...
[  505.798323] pci 468b:00:02.0: [15b3:1016] type 00 class 0x020000
...
[  505.868908] probe_one: line 1730: sleep 20s:
[  525.986578] probe_one: line 1731: sleep 20s: done
[  525.990717] mlx5_core 468b:00:02.0: enabling device (0000 -> 0002)
[  525.998339] mlx5_core 468b:00:02.0: firmware version: 14.25.8102
...
[  526.381211] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF registering: eth1
[  526.385432] mlx5_core 468b:00:02.0 eth1: joined to eth0
[  526.389076] mlx5_core 468b:00:02.0 eth1: Disabling LRO, not supported in legacy RQ
[  526.397276] mlx5_core 468b:00:02.0 eth1: Disabling LRO, not supported in legacy RQ
[  526.400330] mlx5_core 468b:00:02.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
...
[  526.430803] hv_pci_probe: line 3711: create_root_hv_pci_bus=0
[  526.443633] hv_eject_device_work: 1: line 2761: pdev=ffff90f4e369b000
[  526.454819] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF unregistering: enP18059s1
[  526.457828] mlx5_core 468b:00:02.0 enP18059s1 (unregistering): Disabling LRO, not supported in legacy RQ
[  527.680118] mlx5_core 468b:00:02.0: devm_attr_group_remove: removing group 00000000074d6d6b
[  527.692794] hv_eject_device_work: 2: line 2769: pdev=ffff90f4e369b000: sleeping 10s
[  537.762575] hv_eject_device_work: 3: line 2770: pdev=ffff90f4e369b000: sleeping 10s: done: removing hpdev
[  537.768831] hv_eject_device_work: 4: line 2774: pdev=ffff90f4e369b000: hpdev is removed!!!
[  537.774313] hv_eject_device_work: 5: line 2786: pdev=ffff90f4e369b000: sent PCI_EJECTION_COMPLETE
[  537.777737] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[  537.794038] pci_bus 468b:00: busn_res: [bus 00] is released

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 48feab095a14..2c0b86b20408 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -489,7 +489,10 @@ struct hv_pcibus_device {
 	struct fwnode_handle *fwnode;
 	/* Protocol version negotiated with the host */
 	enum pci_protocol_version_t protocol_version;
+
+	struct mutex state_lock;
 	enum hv_pcibus_state state;
+
 	struct hv_device *hdev;
 	resource_size_t low_mmio_space;
 	resource_size_t high_mmio_space;
@@ -2512,6 +2515,8 @@ static void pci_devices_present_work(struct work_struct *work)
 	if (!dr)
 		return;
 
+	mutex_lock(&hbus->state_lock);
+
 	/* First, mark all existing children as reported missing. */
 	spin_lock_irqsave(&hbus->device_list_lock, flags);
 	list_for_each_entry(hpdev, &hbus->children, list_entry) {
@@ -2593,6 +2598,8 @@ static void pci_devices_present_work(struct work_struct *work)
 		break;
 	}
 
+	mutex_unlock(&hbus->state_lock);
+
 	kfree(dr);
 }
 
@@ -2741,6 +2748,8 @@ static void hv_eject_device_work(struct work_struct *work)
 	hpdev = container_of(work, struct hv_pci_dev, wrk);
 	hbus = hpdev->hbus;
 
+	mutex_lock(&hbus->state_lock);
+
 	/*
 	 * Ejection can come before or after the PCI bus has been set up, so
 	 * attempt to find it and tear down the bus state, if it exists.  This
@@ -2777,6 +2786,8 @@ static void hv_eject_device_work(struct work_struct *work)
 	put_pcichild(hpdev);
 	put_pcichild(hpdev);
 	/* hpdev has been freed. Do not use it any more. */
+
+	mutex_unlock(&hbus->state_lock);
 }
 
 /**
@@ -3562,6 +3573,7 @@ static int hv_pci_probe(struct hv_device *hdev,
 		return -ENOMEM;
 
 	hbus->bridge = bridge;
+	mutex_init(&hbus->state_lock);
 	hbus->state = hv_pcibus_init;
 	hbus->wslot_res_allocated = -1;
 
@@ -3670,9 +3682,11 @@ static int hv_pci_probe(struct hv_device *hdev,
 	if (ret)
 		goto free_irq_domain;
 
+	mutex_lock(&hbus->state_lock);
+
 	ret = hv_pci_enter_d0(hdev);
 	if (ret)
-		goto free_irq_domain;
+		goto release_state_lock;
 
 	ret = hv_pci_allocate_bridge_windows(hbus);
 	if (ret)
@@ -3690,12 +3704,15 @@ static int hv_pci_probe(struct hv_device *hdev,
 	if (ret)
 		goto free_windows;
 
+	mutex_unlock(&hbus->state_lock);
 	return 0;
 
 free_windows:
 	hv_pci_free_bridge_windows(hbus);
 exit_d0:
 	(void) hv_pci_bus_exit(hdev, true);
+release_state_lock:
+	mutex_unlock(&hbus->state_lock);
 free_irq_domain:
 	irq_domain_remove(hbus->irq_domain);
 free_fwnode:
@@ -3945,20 +3962,26 @@ static int hv_pci_resume(struct hv_device *hdev)
 	if (ret)
 		goto out;
 
+	mutex_lock(&hbus->state_lock);
+
 	ret = hv_pci_enter_d0(hdev);
 	if (ret)
 		goto out;
 
 	ret = hv_send_resources_allocated(hdev);
 	if (ret)
-		goto out;
+		goto release_state_lock;
 
 	prepopulate_bars(hbus);
 
 	hv_pci_restore_msi_state(hbus);
 
 	hbus->state = hv_pcibus_installed;
+	mutex_unlock(&hbus->state_lock);
 	return 0;
+
+release_state_lock:
+	mutex_unlock(&hbus->state_lock);
 out:
 	vmbus_close(hdev->channel);
 	return ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] PCI: hv: Use async probing to reduce boot time
  2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
                   ` (4 preceding siblings ...)
  2023-03-28  4:51 ` [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock Dexuan Cui
@ 2023-03-28  4:51 ` Dexuan Cui
  2023-03-29 16:55   ` Michael Kelley (LINUX)
  5 siblings, 1 reply; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  4:51 UTC (permalink / raw)
  To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
	leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
	wei.liu, longli, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added the
pci_lock_rescan_remove() and pci_unlock_rescan_remove() to address the
race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
doesn't really work well.

Now with hbus->state_lock and other fixes, the race is resolved, so
remove pci_{lock,unlock}_rescan_remove().

Also enable async probing to reduce boot time.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 2c0b86b20408..08ab389e27cc 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2312,12 +2312,16 @@ static int create_root_hv_pci_bus(struct hv_pcibus_device *hbus)
 	if (error)
 		return error;
 
-	pci_lock_rescan_remove();
+	/*
+	 * pci_lock_rescan_remove() and pci_unlock_rescan_remove() are
+	 * unnecessary here, because we hold the hbus->state_lock, meaning
+	 * hv_eject_device_work() and pci_devices_present_work() can't race
+	 * with create_root_hv_pci_bus().
+	 */
 	hv_pci_assign_numa_node(hbus);
 	pci_bus_assign_resources(bridge->bus);
 	hv_pci_assign_slots(hbus);
 	pci_bus_add_devices(bridge->bus);
-	pci_unlock_rescan_remove();
 	hbus->state = hv_pcibus_installed;
 	return 0;
 }
@@ -4003,6 +4007,9 @@ static struct hv_driver hv_pci_drv = {
 	.remove		= hv_pci_remove,
 	.suspend	= hv_pci_suspend,
 	.resume		= hv_pci_resume,
+	.driver = {
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
+	},
 };
 
 static void __exit exit_hv_pci_drv(void)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-28  4:51 ` [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations() Dexuan Cui
@ 2023-03-28  5:29   ` Saurabh Singh Sengar
  2023-03-28  5:38     ` Dexuan Cui
  2023-03-28 16:48   ` Long Li
  1 sibling, 1 reply; 21+ messages in thread
From: Saurabh Singh Sengar @ 2023-03-28  5:29 UTC (permalink / raw)
  To: Dexuan Cui, bhelgaas, davem, Dexuan Cui, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev



> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, March 28, 2023 10:21 AM
> To: bhelgaas@google.com; davem@davemloft.net; Dexuan Cui
> <decui@microsoft.com>; edumazet@google.com; Haiyang Zhang
> <haiyangz@microsoft.com>; Jake Oshins <jakeo@microsoft.com>;
> kuba@kernel.org; kw@linux.com; KY Srinivasan <kys@microsoft.com>;
> leon@kernel.org; linux-pci@vger.kernel.org; lpieralisi@kernel.org; Michael
> Kelley (LINUX) <mikelley@microsoft.com>; pabeni@redhat.com;
> robh@kernel.org; saeedm@nvidia.com; wei.liu@kernel.org; Long Li
> <longli@microsoft.com>; boqun.feng@gmail.com
> Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org
> Subject: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in
> hv_pci_query_relations()
> 
> Fix the longstanding race between hv_pci_query_relations() and
> survey_child_resources() by flushing the workqueue before we exit from
> hv_pci_query_relations().
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft
> Hyper-V VMs")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> ---
>  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> With the below debug code:
> 
> @@ -2103,6 +2103,8 @@ static void survey_child_resources(struct
> hv_pcibus_device *hbus)
>  	}
> 
>  	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
> +	ssleep(15);
> +	printk("%s: completing %px\n", __func__, event);
>  	complete(event);
>  }
> 
> @@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
> 
>  	ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
>  			       0, VM_PKT_DATA_INBAND, 0);
> -	if (!ret)
> +	if (!ret) {
> +		ssleep(10); // unassign the PCI device on the host during the
> 10s
>  		ret = wait_for_response(hdev, &comp);
> +		printk("%s: comp=%px is becoming invalid! ret=%d\n",
> +			__func__, &comp, ret);
> +	}
> 
>  	return ret;
>  }
> @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> 
>  retry:
>  	ret = hv_pci_query_relations(hdev);
> +	printk("hv_pci_query_relations() exited\n");

Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).

> +
>  	if (ret)
>  		goto free_irq_domain;
> 
> I'm able to repro the below hang issue:
> 
> [   74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus
> probing: Using version 0x10004
> [   76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot
> 1 removed
> [   84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [   84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming
> invalid! ret=-19
> [   84.797505] hv_pci_query_relations() exited
> [   89.652268] survey_child_resources: completing ffffa7504012fb58
> [  150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  150.398447] rcu:     15-...0: (2 ticks this GP)
> idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
> [  150.405851] rcu:     (detected by 14, t=15004 jiffies, g=2553, q=4833
> ncpus=16)
> [  150.410870] Sending NMI from CPU 14 to CPUs 15:
> [  150.414836] NMI backtrace for cpu 15
> [  150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G        W   E
> 6.3.0-rc3-decui-dirty #34
> ...
> [  150.414849] Workqueue: hv_pci_468b pci_devices_present_work
> [pci_hyperv] [  150.414866] RIP:
> 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
> ...
> [  150.414905] Call Trace:
> [  150.414907]  <TASK>
> [  150.414911]  _raw_spin_lock_irqsave+0x40/0x50 [  150.414917]
> complete+0x1d/0x60 [  150.414924]  pci_devices_present_work+0x5dd/0x680
> [pci_hyperv] [  150.414946]  process_one_work+0x21f/0x430 [  150.414952]
> worker_thread+0x4a/0x3c0
> 
> With this patch, the hang issue goes away:
> 
> [  186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [  186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming
> invalid! ret=-19 [  191.263611] survey_child_resources: completing
> ffffa7cfd0aa3b50 [  191.267732] hv_pci_query_relations() exited
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index f33370b75628..b82c7cde19e6 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
>  	if (!ret)
>  		ret = wait_for_response(hdev, &comp);
> 
> +	/*
> +	 * In the case of fast device addition/removal, it's possible that
> +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> we
> +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> +	 * channel callback already scheduled a work to hbus->wq, which can
> be
> +	 * running survey_child_resources() -> complete(&hbus-
> >survey_event),
> +	 * even after hv_pci_query_relations() exits and the stack variable
> +	 * 'comp' is no longer valid. This can cause a strange hang issue
> +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> +	 * hv_pci_query_relations() to avoid the issues.
> +	 */
> +	flush_workqueue(hbus->wq);
> +
>  	return ret;
>  }
> 
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-28  5:29   ` [EXTERNAL] " Saurabh Singh Sengar
@ 2023-03-28  5:38     ` Dexuan Cui
  2023-03-28 17:24       ` Bjorn Helgaas
  0 siblings, 1 reply; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  5:38 UTC (permalink / raw)
  To: Saurabh Singh Sengar, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

> From: Saurabh Singh Sengar <ssengar@microsoft.com>
> Sent: Monday, March 27, 2023 10:29 PM
> > ...
> > ---

Please note this special line "---". 
Anything after the special line and before the line "diff --git" is discarded
automaticaly by 'git' and 'patch'. 

> >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> >
> >  retry:
> >  	ret = hv_pci_query_relations(hdev);
> > +	printk("hv_pci_query_relations() exited\n");
> 
> Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).

This is not part of the real patch :-)
I just thought the debug code can help understand the issues
resolved by the patches.
I'll remove the debug code to avoid confusion if I need to post v2.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
  2023-03-28  4:51 ` [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally" Dexuan Cui
@ 2023-03-28  6:33   ` Dexuan Cui
  2023-03-28 12:46     ` Wei Hu
  2023-03-30  5:17     ` Wei Hu
  0 siblings, 2 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-28  6:33 UTC (permalink / raw)
  To: bhelgaas, davem, edumazet, Haiyang Zhang, Jake Oshins, kuba, kw,
	KY Srinivasan, leon, linux-pci, lpieralisi,
	Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng, Wei Hu
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

> From: Dexuan Cui <decui@microsoft.com>
> Sent: Monday, March 27, 2023 9:51 PM
> To: bhelgaas@google.com; davem@davemloft.net; Dexuan Cui
> <decui@microsoft.com>; edumazet@google.com; Haiyang Zhang
> <haiyangz@microsoft.com>; Jake Oshins <jakeo@microsoft.com>;
> kuba@kernel.org; kw@linux.com; KY Srinivasan <kys@microsoft.com>;
> leon@kernel.org; linux-pci@vger.kernel.org; lpieralisi@kernel.org; Michael
> Kelley (LINUX) <mikelley@microsoft.com>; pabeni@redhat.com;
> robh@kernel.org; saeedm@nvidia.com; wei.liu@kernel.org; Long Li
> <longli@microsoft.com>; boqun.feng@gmail.com
> Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-rdma@vger.kernel.org; netdev@vger.kernel.org
> Subject: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to
> fail occasionally"
> 
> This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
> 
> The statement "the hv_pci_bus_exit() call releases structures of all its
> child devices" in commit d6af2ed29c7c is not true: in the path
> hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the
> parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
> child "struct hv_pci_dev *hpdev" that is created earlier in
> pci_devices_present_work() -> new_pcichild_device().
> 
> The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
> where the old version of hv_pci_bus_exit() was used; when the commit was
> rebased and merged into the upstream, people didn't notice that it's
> not really necessary. The commit itself doesn't cause any issue, but it
> makes hv_pci_probe() more complicated. Revert it to facilitate some
> upcoming changes to hv_pci_probe().
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 71 ++++++++++++++---------------
>  1 file changed, 34 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c
> b/drivers/pci/controller/pci-hyperv.c
> index 46df6d093d68..48feab095a14 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3225,8 +3225,10 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
>  	struct pci_bus_d0_entry *d0_entry;
>  	struct hv_pci_compl comp_pkt;
>  	struct pci_packet *pkt;
> +	bool retry = true;
>  	int ret;
> 
> +enter_d0_retry:
>  	/*
>  	 * Tell the host that the bus is ready to use, and moved into the
>  	 * powered-on state.  This includes telling the host which region
> @@ -3253,6 +3255,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
>  	if (ret)
>  		goto exit;
> 
> +	/*
> +	 * In certain case (Kdump) the pci device of interest was
> +	 * not cleanly shut down and resource is still held on host
> +	 * side, the host could return invalid device status.
> +	 * We need to explicitly request host to release the resource
> +	 * and try to enter D0 again.
> +	 */
> +	if (comp_pkt.completion_status < 0 && retry) {
> +		retry = false;
> +
> +		dev_err(&hdev->device, "Retrying D0 Entry\n");
> +
> +		/*
> +		 * Hv_pci_bus_exit() calls hv_send_resource_released()
> +		 * to free up resources of its child devices.
> +		 * In the kdump kernel we need to set the
> +		 * wslot_res_allocated to 255 so it scans all child
> +		 * devices to release resources allocated in the
> +		 * normal kernel before panic happened.
> +		 */
> +		hbus->wslot_res_allocated = 255;
> +
> +		ret = hv_pci_bus_exit(hdev, true);
> +
> +		if (ret == 0) {
> +			kfree(pkt);
> +			goto enter_d0_retry;
> +		}
> +		dev_err(&hdev->device,
> +			"Retrying D0 failed with ret %d\n", ret);
> +	}
> +
>  	if (comp_pkt.completion_status < 0) {
>  		dev_err(&hdev->device,
>  			"PCI Pass-through VSP failed D0 Entry with status %x\n",
> @@ -3493,7 +3527,6 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	struct hv_pcibus_device *hbus;
>  	u16 dom_req, dom;
>  	char *name;
> -	bool enter_d0_retry = true;
>  	int ret;
> 
>  	/*
> @@ -3633,47 +3666,11 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	if (ret)
>  		goto free_fwnode;
> 
> -retry:
>  	ret = hv_pci_query_relations(hdev);
>  	if (ret)
>  		goto free_irq_domain;
> 
>  	ret = hv_pci_enter_d0(hdev);
> -	/*
> -	 * In certain case (Kdump) the pci device of interest was
> -	 * not cleanly shut down and resource is still held on host
> -	 * side, the host could return invalid device status.
> -	 * We need to explicitly request host to release the resource
> -	 * and try to enter D0 again.
> -	 * Since the hv_pci_bus_exit() call releases structures
> -	 * of all its child devices, we need to start the retry from
> -	 * hv_pci_query_relations() call, requesting host to send
> -	 * the synchronous child device relations message before this
> -	 * information is needed in hv_send_resources_allocated()
> -	 * call later.
> -	 */
> -	if (ret == -EPROTO && enter_d0_retry) {
> -		enter_d0_retry = false;
> -
> -		dev_err(&hdev->device, "Retrying D0 Entry\n");
> -
> -		/*
> -		 * Hv_pci_bus_exit() calls hv_send_resources_released()
> -		 * to free up resources of its child devices.
> -		 * In the kdump kernel we need to set the
> -		 * wslot_res_allocated to 255 so it scans all child
> -		 * devices to release resources allocated in the
> -		 * normal kernel before panic happened.
> -		 */
> -		hbus->wslot_res_allocated = 255;
> -		ret = hv_pci_bus_exit(hdev, true);
> -
> -		if (ret == 0)
> -			goto retry;
> -
> -		dev_err(&hdev->device,
> -			"Retrying D0 failed with ret %d\n", ret);
> -	}
>  	if (ret)
>  		goto free_irq_domain;
> 
> --
> 2.25.1

+ Wei Hu.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
  2023-03-28  6:33   ` Dexuan Cui
@ 2023-03-28 12:46     ` Wei Hu
  2023-03-30  5:17     ` Wei Hu
  1 sibling, 0 replies; 21+ messages in thread
From: Wei Hu @ 2023-03-28 12:46 UTC (permalink / raw)
  To: Dexuan Cui, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev



> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, March 28, 2023 2:33 PM
> To: bhelgaas@google.com; davem@davemloft.net; edumazet@google.com;
> Haiyang Zhang <haiyangz@microsoft.com>; Jake Oshins
> <jakeo@microsoft.com>; kuba@kernel.org; kw@linux.com; KY Srinivasan
> <kys@microsoft.com>; leon@kernel.org; linux-pci@vger.kernel.org;
> lpieralisi@kernel.org; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> pabeni@redhat.com; robh@kernel.org; saeedm@nvidia.com;
> wei.liu@kernel.org; Long Li <longli@microsoft.com>; boqun.feng@gmail.com;
> Wei Hu <weh@microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org
> Subject: RE: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes
> kdump to fail occasionally"
> 
> > From: Dexuan Cui <decui@microsoft.com>
> > Sent: Monday, March 27, 2023 9:51 PM
> > To: bhelgaas@google.com; davem@davemloft.net; Dexuan Cui
> > <decui@microsoft.com>; edumazet@google.com; Haiyang Zhang
> > <haiyangz@microsoft.com>; Jake Oshins <jakeo@microsoft.com>;
> > kuba@kernel.org; kw@linux.com; KY Srinivasan <kys@microsoft.com>;
> > leon@kernel.org; linux-pci@vger.kernel.org; lpieralisi@kernel.org;
> > Michael Kelley (LINUX) <mikelley@microsoft.com>; pabeni@redhat.com;
> > robh@kernel.org; saeedm@nvidia.com; wei.liu@kernel.org; Long Li
> > <longli@microsoft.com>; boqun.feng@gmail.com
> > Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-rdma@vger.kernel.org; netdev@vger.kernel.org
> > Subject: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes
> > kdump to fail occasionally"
> >
> > This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
> >
> > The statement "the hv_pci_bus_exit() call releases structures of all
> > its child devices" in commit d6af2ed29c7c is not true: in the path
> > hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true):
> > the parameter "keep_devs" is true, so hv_pci_bus_exit() does *not*
> > release the child "struct hv_pci_dev *hpdev" that is created earlier
> > in
> > pci_devices_present_work() -> new_pcichild_device().
> >
> > The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
> > where the old version of hv_pci_bus_exit() was used; when the commit
> > was rebased and merged into the upstream, people didn't notice that
> > it's not really necessary. The commit itself doesn't cause any issue,
> > but it makes hv_pci_probe() more complicated. Revert it to facilitate
> > some upcoming changes to hv_pci_probe().
> >
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 71
> > ++++++++++++++---------------
> >  1 file changed, 34 insertions(+), 37 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci-hyperv.c
> > index 46df6d093d68..48feab095a14 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -3225,8 +3225,10 @@ static int hv_pci_enter_d0(struct hv_device
> *hdev)
> >  	struct pci_bus_d0_entry *d0_entry;
> >  	struct hv_pci_compl comp_pkt;
> >  	struct pci_packet *pkt;
> > +	bool retry = true;
> >  	int ret;
> >
> > +enter_d0_retry:
> >  	/*
> >  	 * Tell the host that the bus is ready to use, and moved into the
> >  	 * powered-on state.  This includes telling the host which region @@
> > -3253,6 +3255,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
> >  	if (ret)
> >  		goto exit;
> >
> > +	/*
> > +	 * In certain case (Kdump) the pci device of interest was
> > +	 * not cleanly shut down and resource is still held on host
> > +	 * side, the host could return invalid device status.
> > +	 * We need to explicitly request host to release the resource
> > +	 * and try to enter D0 again.
> > +	 */
> > +	if (comp_pkt.completion_status < 0 && retry) {
> > +		retry = false;
> > +
> > +		dev_err(&hdev->device, "Retrying D0 Entry\n");
> > +
> > +		/*
> > +		 * Hv_pci_bus_exit() calls hv_send_resource_released()
> > +		 * to free up resources of its child devices.
> > +		 * In the kdump kernel we need to set the
> > +		 * wslot_res_allocated to 255 so it scans all child
> > +		 * devices to release resources allocated in the
> > +		 * normal kernel before panic happened.
> > +		 */
> > +		hbus->wslot_res_allocated = 255;
> > +
> > +		ret = hv_pci_bus_exit(hdev, true);
> > +
> > +		if (ret == 0) {
> > +			kfree(pkt);
> > +			goto enter_d0_retry;
> > +		}
> > +		dev_err(&hdev->device,
> > +			"Retrying D0 failed with ret %d\n", ret);
> > +	}
> > +
> >  	if (comp_pkt.completion_status < 0) {
> >  		dev_err(&hdev->device,
> >  			"PCI Pass-through VSP failed D0 Entry with
> status %x\n", @@
> > -3493,7 +3527,6 @@ static int hv_pci_probe(struct hv_device *hdev,
> >  	struct hv_pcibus_device *hbus;
> >  	u16 dom_req, dom;
> >  	char *name;
> > -	bool enter_d0_retry = true;
> >  	int ret;
> >
> >  	/*
> > @@ -3633,47 +3666,11 @@ static int hv_pci_probe(struct hv_device *hdev,
> >  	if (ret)
> >  		goto free_fwnode;
> >
> > -retry:
> >  	ret = hv_pci_query_relations(hdev);
> >  	if (ret)
> >  		goto free_irq_domain;
> >
> >  	ret = hv_pci_enter_d0(hdev);
> > -	/*
> > -	 * In certain case (Kdump) the pci device of interest was
> > -	 * not cleanly shut down and resource is still held on host
> > -	 * side, the host could return invalid device status.
> > -	 * We need to explicitly request host to release the resource
> > -	 * and try to enter D0 again.
> > -	 * Since the hv_pci_bus_exit() call releases structures
> > -	 * of all its child devices, we need to start the retry from
> > -	 * hv_pci_query_relations() call, requesting host to send
> > -	 * the synchronous child device relations message before this
> > -	 * information is needed in hv_send_resources_allocated()
> > -	 * call later.
> > -	 */
> > -	if (ret == -EPROTO && enter_d0_retry) {
> > -		enter_d0_retry = false;
> > -
> > -		dev_err(&hdev->device, "Retrying D0 Entry\n");
> > -
> > -		/*
> > -		 * Hv_pci_bus_exit() calls hv_send_resources_released()
> > -		 * to free up resources of its child devices.
> > -		 * In the kdump kernel we need to set the
> > -		 * wslot_res_allocated to 255 so it scans all child
> > -		 * devices to release resources allocated in the
> > -		 * normal kernel before panic happened.
> > -		 */
> > -		hbus->wslot_res_allocated = 255;
> > -		ret = hv_pci_bus_exit(hdev, true);
> > -
> > -		if (ret == 0)
> > -			goto retry;
> > -
> > -		dev_err(&hdev->device,
> > -			"Retrying D0 failed with ret %d\n", ret);
> > -	}
> >  	if (ret)
> >  		goto free_irq_domain;
> >
> > --
> > 2.25.1

Looks good to me. Thanks for fixing this.

Wei

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-28  4:51 ` [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations() Dexuan Cui
  2023-03-28  5:29   ` [EXTERNAL] " Saurabh Singh Sengar
@ 2023-03-28 16:48   ` Long Li
  2023-03-29  8:21     ` Dexuan Cui
  1 sibling, 1 reply; 21+ messages in thread
From: Long Li @ 2023-03-28 16:48 UTC (permalink / raw)
  To: Dexuan Cui, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index f33370b75628..b82c7cde19e6 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
>  	if (!ret)
>  		ret = wait_for_response(hdev, &comp);
> 
> +	/*
> +	 * In the case of fast device addition/removal, it's possible that
> +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> we
> +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> +	 * channel callback already scheduled a work to hbus->wq, which can
> be
> +	 * running survey_child_resources() -> complete(&hbus-
> >survey_event),
> +	 * even after hv_pci_query_relations() exits and the stack variable
> +	 * 'comp' is no longer valid. This can cause a strange hang issue
> +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> +	 * hv_pci_query_relations() to avoid the issues.
> +	 */
> +	flush_workqueue(hbus->wq);

Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling flush_workqueue(hbus->wq)?

> +
>  	return ret;
>  }
> 
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-28  5:38     ` Dexuan Cui
@ 2023-03-28 17:24       ` Bjorn Helgaas
  2023-03-29  8:37         ` Dexuan Cui
  0 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2023-03-28 17:24 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Saurabh Singh Sengar, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng, linux-hyperv,
	linux-kernel, linux-rdma, netdev

On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > Sent: Monday, March 27, 2023 10:29 PM
> > > ...
> > > ---
> 
> Please note this special line "---". 
> Anything after the special line and before the line "diff --git" is discarded
> automaticaly by 'git' and 'patch'. 
> 
> > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > >
> > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > >
> > >  retry:
> > >  	ret = hv_pci_query_relations(hdev);
> > > +	printk("hv_pci_query_relations() exited\n");
> > 
> > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> 
> This is not part of the real patch :-)
> I just thought the debug code can help understand the issues
> resolved by the patches.
> I'll remove the debug code to avoid confusion if I need to post v2.

I guess that means you *will* post a v2, right?  Or do you expect
somebody else to remove the debug code?  If you do keep any debug or
other logging, use pci_info() (or dev_info()) whenever possible.

Also capitalize the subject line to match the others in the series.

Bjorn

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-28 16:48   ` Long Li
@ 2023-03-29  8:21     ` Dexuan Cui
  0 siblings, 0 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-29  8:21 UTC (permalink / raw)
  To: Long Li, bhelgaas, davem, edumazet, Haiyang Zhang, Jake Oshins,
	kuba, kw, KY Srinivasan, leon, linux-pci, lpieralisi,
	Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

> From: Long Li <longli@microsoft.com>
> Sent: Tuesday, March 28, 2023 9:49 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct
> hv_device
> > *hdev)
> >  	if (!ret)
> >  		ret = wait_for_response(hdev, &comp);
> >
> > +	/*
> > +	 * In the case of fast device addition/removal, it's possible that
> > +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> > we
> > +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> > +	 * channel callback already scheduled a work to hbus->wq, which can
> > be
> > +	 * running survey_child_resources() -> complete(&hbus-
> > >survey_event),
> > +	 * even after hv_pci_query_relations() exits and the stack variable
> > +	 * 'comp' is no longer valid. This can cause a strange hang issue
> > +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> > +	 * hv_pci_query_relations() to avoid the issues.
> > +	 */
> > +	flush_workqueue(hbus->wq);
> 
> Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling
> flush_workqueue(hbus->wq)?

It's possible, but that doesn't matter:

hv_pci_query_relations() is called only once, and it sets hbus->survey_event
to point to the stack variable 'comp'. The first survey_child_resources()
calls complete() for the 'comp' and sets hbus->survey_event to NULL.

When the second survey_child_resources() is called, hbus->survey_event
is NULL, so survey_child_resources() returns immediately.

According to my test, after hv_pci_enter_d0() posts PCI_BUS_D0ENTRY,
the guest receives a second PCI_BUS_RELATIONS2 message, which is
the same as the first PCI_BUS_RELATIONS2 message, which is basically
a no-op in pci_devices_present_work(), especially with the
newly-introduced per-hbus state_lock mutex.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-28 17:24       ` Bjorn Helgaas
@ 2023-03-29  8:37         ` Dexuan Cui
  2023-03-29 16:06           ` Bjorn Helgaas
  0 siblings, 1 reply; 21+ messages in thread
From: Dexuan Cui @ 2023-03-29  8:37 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Saurabh Singh Sengar, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng, linux-hyperv,
	linux-kernel, linux-rdma, netdev

> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, March 28, 2023 10:25 AM
> To: Dexuan Cui <decui@microsoft.com>
> ...
> On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > Sent: Monday, March 27, 2023 10:29 PM
> > > > ...
> > > > ---
> >
> > Please note this special line "---".
> > Anything after the special line and before the line "diff --git" is discarded
> > automaticaly by 'git' and 'patch'.
> >
> > > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > >  1 file changed, 13 insertions(+)
> > > >
> > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > >
> > > >  retry:
> > > >  	ret = hv_pci_query_relations(hdev);
> > > > +	printk("hv_pci_query_relations() exited\n");
> > >
> > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> >
> > This is not part of the real patch :-)
> > I just thought the debug code can help understand the issues
> > resolved by the patches.
> > I'll remove the debug code to avoid confusion if I need to post v2.
> 
> I guess that means you *will* post a v2, right?  

I guess I didn't make myself clear, sorry. The "debug code" is not
part of the real patch body -- if we run the "patch" program or "git am"
to apply the patches, the "debug code" is automatically dropped because
it's between the special "---" line and the real start of the patch body (i.e.
the "diff --git" line). 

So far, IMO I don't have to post v2 because the patch body and the patch
description (except for the part that's automatically removed by 'patch'
and 'git') don't need any change.

> Or do you expect
> somebody else to remove the debug code?  If you do keep any debug or
> other logging, use pci_info() (or dev_info()) whenever possible.

As I explained above, 'patch' and 'git' automatically remove the part that
don't have to be in the git history.

> 
> Also capitalize the subject line to match the others in the series.
> 
> Bjorn

Thanks for catching this! If this is the only thing to be fixed, I hope the
PCI folks can help fix this when accepting the patch. If you think I should
post v2, please let me know. 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
  2023-03-29  8:37         ` Dexuan Cui
@ 2023-03-29 16:06           ` Bjorn Helgaas
  0 siblings, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2023-03-29 16:06 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Saurabh Singh Sengar, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng, linux-hyperv,
	linux-kernel, linux-rdma, netdev

On Wed, Mar 29, 2023 at 08:37:20AM +0000, Dexuan Cui wrote:
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Sent: Tuesday, March 28, 2023 10:25 AM
> > To: Dexuan Cui <decui@microsoft.com>
> > ...
> > On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > > Sent: Monday, March 27, 2023 10:29 PM
> > > > > ...
> > > > > ---
> > >
> > > Please note this special line "---".
> > > Anything after the special line and before the line "diff --git" is discarded
> > > automaticaly by 'git' and 'patch'.
> > >
> > > > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > > >  1 file changed, 13 insertions(+)
> > > > >
> > > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > > >
> > > > >  retry:
> > > > >  	ret = hv_pci_query_relations(hdev);
> > > > > +	printk("hv_pci_query_relations() exited\n");
> > > >
> > > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> > >
> > > This is not part of the real patch :-)
> > > I just thought the debug code can help understand the issues
> > > resolved by the patches.
> > > I'll remove the debug code to avoid confusion if I need to post v2.
> > 
> > I guess that means you *will* post a v2, right?  
> 
> I guess I didn't make myself clear, sorry. The "debug code" is not
> part of the real patch body -- if we run the "patch" program or "git am"
> to apply the patches, the "debug code" is automatically dropped because
> it's between the special "---" line and the real start of the patch body (i.e.
> the "diff --git" line). 
> 
> So far, IMO I don't have to post v2 because the patch body and the patch
> description (except for the part that's automatically removed by 'patch'
> and 'git') don't need any change.

Ah, sorry, I didn't notice that, even though you explicitly pointed it
out earlier.  No need to repost just to capitalize the subject,
Lorenzo can do it or not as he chooses.

Bjorn

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock
  2023-03-28  4:51 ` [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock Dexuan Cui
@ 2023-03-29 16:41   ` Michael Kelley (LINUX)
  2023-03-29 16:54     ` Dexuan Cui
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Kelley (LINUX) @ 2023-03-29 16:41 UTC (permalink / raw)
  To: Dexuan Cui, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, March 27, 2023 9:51 PM
> 

[snip]

> @@ -3945,20 +3962,26 @@ static int hv_pci_resume(struct hv_device *hdev)
>  	if (ret)
>  		goto out;
> 
> +	mutex_lock(&hbus->state_lock);
> +
>  	ret = hv_pci_enter_d0(hdev);
>  	if (ret)
>  		goto out;

Shouldn't this be goto release_state_lock?

> 
>  	ret = hv_send_resources_allocated(hdev);
>  	if (ret)
> -		goto out;
> +		goto release_state_lock;
> 
>  	prepopulate_bars(hbus);
> 
>  	hv_pci_restore_msi_state(hbus);
> 
>  	hbus->state = hv_pcibus_installed;
> +	mutex_unlock(&hbus->state_lock);
>  	return 0;
> +
> +release_state_lock:
> +	mutex_unlock(&hbus->state_lock);
>  out:
>  	vmbus_close(hdev->channel);
>  	return ret;
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock
  2023-03-29 16:41   ` Michael Kelley (LINUX)
@ 2023-03-29 16:54     ` Dexuan Cui
  0 siblings, 0 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-29 16:54 UTC (permalink / raw)
  To: Michael Kelley (LINUX),
	bhelgaas, davem, edumazet, Haiyang Zhang, Jake Oshins, kuba, kw,
	KY Srinivasan, leon, linux-pci, lpieralisi, pabeni, robh, saeedm,
	wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

> From: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Sent: Wednesday, March 29, 2023 9:42 AM
>  ...
> From: Dexuan Cui <decui@microsoft.com> Sent: Monday, March 27, 2023 9:51
> PM
> >
> 
> [snip]
> 
> > @@ -3945,20 +3962,26 @@ static int hv_pci_resume(struct hv_device
> *hdev)
> >  	if (ret)
> >  		goto out;
> >
> > +	mutex_lock(&hbus->state_lock);
> > +
> >  	ret = hv_pci_enter_d0(hdev);
> >  	if (ret)
> >  		goto out;
> 
> Shouldn't this be goto release_state_lock?

Hmm, somehow I missed this. Will fix this in v2.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 6/6] PCI: hv: Use async probing to reduce boot time
  2023-03-28  4:51 ` [PATCH 6/6] PCI: hv: Use async probing to reduce boot time Dexuan Cui
@ 2023-03-29 16:55   ` Michael Kelley (LINUX)
  2023-03-29 22:42     ` Dexuan Cui
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Kelley (LINUX) @ 2023-03-29 16:55 UTC (permalink / raw)
  To: Dexuan Cui, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

From: Dexuan Cui <decui@microsoft.com>
> 
> Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added the
> pci_lock_rescan_remove() and pci_unlock_rescan_remove() to address the
> race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
> doesn't really work well.
> 
> Now with hbus->state_lock and other fixes, the race is resolved, so
> remove pci_{lock,unlock}_rescan_remove().

Commit 414428c5da1c added the calls to pci_lock/unlock_rescan_remove()
in both create_root_hv_pci_bus() and in hv_eject_device_work().  This patch
removes the calls only in create_reboot_hv_pci_bus(), but leaves them in
hv_eject_device_work(), in hv_pci_remove(), and in pci_devices_present_work().
So evidently it is still needed to provide exclusion for other cases.  Perhaps your
commit message could clarify that only the exclusion of create_root_hv_pci_bus()
is now superfluous because of the hbus->state_lock.  And commit 414428c5da1c
isn't fully reverted because evidently the lock is still needed in
hv_eject_device_work().

Michael

> 
> Also enable async probing to reduce boot time.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 2c0b86b20408..08ab389e27cc 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2312,12 +2312,16 @@ static int create_root_hv_pci_bus(struct
> hv_pcibus_device *hbus)
>  	if (error)
>  		return error;
> 
> -	pci_lock_rescan_remove();
> +	/*
> +	 * pci_lock_rescan_remove() and pci_unlock_rescan_remove() are
> +	 * unnecessary here, because we hold the hbus->state_lock, meaning
> +	 * hv_eject_device_work() and pci_devices_present_work() can't race
> +	 * with create_root_hv_pci_bus().
> +	 */
>  	hv_pci_assign_numa_node(hbus);
>  	pci_bus_assign_resources(bridge->bus);
>  	hv_pci_assign_slots(hbus);
>  	pci_bus_add_devices(bridge->bus);
> -	pci_unlock_rescan_remove();
>  	hbus->state = hv_pcibus_installed;
>  	return 0;
>  }
> @@ -4003,6 +4007,9 @@ static struct hv_driver hv_pci_drv = {
>  	.remove		= hv_pci_remove,
>  	.suspend	= hv_pci_suspend,
>  	.resume		= hv_pci_resume,
> +	.driver = {
> +		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
> +	},
>  };
> 
>  static void __exit exit_hv_pci_drv(void)
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 6/6] PCI: hv: Use async probing to reduce boot time
  2023-03-29 16:55   ` Michael Kelley (LINUX)
@ 2023-03-29 22:42     ` Dexuan Cui
  0 siblings, 0 replies; 21+ messages in thread
From: Dexuan Cui @ 2023-03-29 22:42 UTC (permalink / raw)
  To: Michael Kelley (LINUX),
	bhelgaas, davem, edumazet, Haiyang Zhang, Jake Oshins, kuba, kw,
	KY Srinivasan, leon, linux-pci, lpieralisi, pabeni, robh, saeedm,
	wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

> From: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Sent: Wednesday, March 29, 2023 9:55 AM
> 
> From: Dexuan Cui <decui@microsoft.com>
> >
> > Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added the
> > pci_lock_rescan_remove() and pci_unlock_rescan_remove() to address the
> > race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
> > doesn't really work well.
> >
> > Now with hbus->state_lock and other fixes, the race is resolved, so
> > remove pci_{lock,unlock}_rescan_remove().
> 
> Commit 414428c5da1c added the calls to pci_lock/unlock_rescan_remove()
> in both create_root_hv_pci_bus() and in hv_eject_device_work().  This
> patch removes the calls only in create_reboot_hv_pci_bus(), but leaves
> them in hv_eject_device_work(), in hv_pci_remove(), and in 
> pci_devices_present_work().
> So evidently it is still needed to provide exclusion for other cases.  Perhaps
> your commit message could clarify that only the exclusion of
> create_root_hv_pci_bus() > is now superfluous because of the
> hbus->state_lock.  And commit 414428c5da1c > isn't fully reverted
> because evidently the lock is still needed in hv_eject_device_work().
> 
> Michael

Thanks! I'll update the commit message in v2.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
  2023-03-28  6:33   ` Dexuan Cui
  2023-03-28 12:46     ` Wei Hu
@ 2023-03-30  5:17     ` Wei Hu
  1 sibling, 0 replies; 21+ messages in thread
From: Wei Hu @ 2023-03-30  5:17 UTC (permalink / raw)
  To: Dexuan Cui, bhelgaas, davem, edumazet, Haiyang Zhang,
	Jake Oshins, kuba, kw, KY Srinivasan, leon, linux-pci,
	lpieralisi, Michael Kelley (LINUX),
	pabeni, robh, saeedm, wei.liu, Long Li, boqun.feng
  Cc: linux-hyperv, linux-kernel, linux-rdma, netdev

> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, March 28, 2023 2:33 PM
> To: bhelgaas@google.com; davem@davemloft.net; edumazet@google.com;
> Haiyang Zhang <haiyangz@microsoft.com>; Jake Oshins
> <jakeo@microsoft.com>; kuba@kernel.org; kw@linux.com; KY Srinivasan
> <kys@microsoft.com>; leon@kernel.org; linux-pci@vger.kernel.org;
> lpieralisi@kernel.org; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> pabeni@redhat.com; robh@kernel.org; saeedm@nvidia.com;
> wei.liu@kernel.org; Long Li <longli@microsoft.com>; boqun.feng@gmail.com;
> Wei Hu <weh@microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org
> Subject: RE: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes
> kdump to fail occasionally"
> 
> > From: Dexuan Cui <decui@microsoft.com>
> > Sent: Monday, March 27, 2023 9:51 PM
> > To: bhelgaas@google.com; davem@davemloft.net; Dexuan Cui
> > <decui@microsoft.com>; edumazet@google.com; Haiyang Zhang
> > <haiyangz@microsoft.com>; Jake Oshins <jakeo@microsoft.com>;
> > kuba@kernel.org; kw@linux.com; KY Srinivasan <kys@microsoft.com>;
> > leon@kernel.org; linux-pci@vger.kernel.org; lpieralisi@kernel.org;
> > Michael Kelley (LINUX) <mikelley@microsoft.com>; pabeni@redhat.com;
> > robh@kernel.org; saeedm@nvidia.com; wei.liu@kernel.org; Long Li
> > <longli@microsoft.com>; boqun.feng@gmail.com
> > Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-rdma@vger.kernel.org; netdev@vger.kernel.org
> > Subject: [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes
> > kdump to fail occasionally"
> >
> > This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
> >
> > The statement "the hv_pci_bus_exit() call releases structures of all
> > its child devices" in commit d6af2ed29c7c is not true: in the path
> > hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true):
> > the parameter "keep_devs" is true, so hv_pci_bus_exit() does *not*
> > release the child "struct hv_pci_dev *hpdev" that is created earlier
> > in
> > pci_devices_present_work() -> new_pcichild_device().
> >
> > The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
> > where the old version of hv_pci_bus_exit() was used; when the commit
> > was rebased and merged into the upstream, people didn't notice that
> > it's not really necessary. The commit itself doesn't cause any issue,
> > but it makes hv_pci_probe() more complicated. Revert it to facilitate
> > some upcoming changes to hv_pci_probe().
> >
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 71
> > ++++++++++++++---------------
> >  1 file changed, 34 insertions(+), 37 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci-hyperv.c
> > index 46df6d093d68..48feab095a14 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -3225,8 +3225,10 @@ static int hv_pci_enter_d0(struct hv_device
> *hdev)
> >  	struct pci_bus_d0_entry *d0_entry;
> >  	struct hv_pci_compl comp_pkt;
> >  	struct pci_packet *pkt;
> > +	bool retry = true;
> >  	int ret;
> >
> > +enter_d0_retry:
> >  	/*
> >  	 * Tell the host that the bus is ready to use, and moved into the
> >  	 * powered-on state.  This includes telling the host which region @@
> > -3253,6 +3255,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
> >  	if (ret)
> >  		goto exit;
> >
> > +	/*
> > +	 * In certain case (Kdump) the pci device of interest was
> > +	 * not cleanly shut down and resource is still held on host
> > +	 * side, the host could return invalid device status.
> > +	 * We need to explicitly request host to release the resource
> > +	 * and try to enter D0 again.
> > +	 */
> > +	if (comp_pkt.completion_status < 0 && retry) {
> > +		retry = false;
> > +
> > +		dev_err(&hdev->device, "Retrying D0 Entry\n");
> > +
> > +		/*
> > +		 * Hv_pci_bus_exit() calls hv_send_resource_released()
> > +		 * to free up resources of its child devices.
> > +		 * In the kdump kernel we need to set the
> > +		 * wslot_res_allocated to 255 so it scans all child
> > +		 * devices to release resources allocated in the
> > +		 * normal kernel before panic happened.
> > +		 */
> > +		hbus->wslot_res_allocated = 255;
> > +
> > +		ret = hv_pci_bus_exit(hdev, true);
> > +
> > +		if (ret == 0) {
> > +			kfree(pkt);
> > +			goto enter_d0_retry;
> > +		}
> > +		dev_err(&hdev->device,
> > +			"Retrying D0 failed with ret %d\n", ret);
> > +	}
> > +
> >  	if (comp_pkt.completion_status < 0) {
> >  		dev_err(&hdev->device,
> >  			"PCI Pass-through VSP failed D0 Entry with
> status %x\n", @@
> > -3493,7 +3527,6 @@ static int hv_pci_probe(struct hv_device *hdev,
> >  	struct hv_pcibus_device *hbus;
> >  	u16 dom_req, dom;
> >  	char *name;
> > -	bool enter_d0_retry = true;
> >  	int ret;
> >
> >  	/*
> > @@ -3633,47 +3666,11 @@ static int hv_pci_probe(struct hv_device *hdev,
> >  	if (ret)
> >  		goto free_fwnode;
> >
> > -retry:
> >  	ret = hv_pci_query_relations(hdev);
> >  	if (ret)
> >  		goto free_irq_domain;
> >
> >  	ret = hv_pci_enter_d0(hdev);
> > -	/*
> > -	 * In certain case (Kdump) the pci device of interest was
> > -	 * not cleanly shut down and resource is still held on host
> > -	 * side, the host could return invalid device status.
> > -	 * We need to explicitly request host to release the resource
> > -	 * and try to enter D0 again.
> > -	 * Since the hv_pci_bus_exit() call releases structures
> > -	 * of all its child devices, we need to start the retry from
> > -	 * hv_pci_query_relations() call, requesting host to send
> > -	 * the synchronous child device relations message before this
> > -	 * information is needed in hv_send_resources_allocated()
> > -	 * call later.
> > -	 */
> > -	if (ret == -EPROTO && enter_d0_retry) {
> > -		enter_d0_retry = false;
> > -
> > -		dev_err(&hdev->device, "Retrying D0 Entry\n");
> > -
> > -		/*
> > -		 * Hv_pci_bus_exit() calls hv_send_resources_released()
> > -		 * to free up resources of its child devices.
> > -		 * In the kdump kernel we need to set the
> > -		 * wslot_res_allocated to 255 so it scans all child
> > -		 * devices to release resources allocated in the
> > -		 * normal kernel before panic happened.
> > -		 */
> > -		hbus->wslot_res_allocated = 255;
> > -		ret = hv_pci_bus_exit(hdev, true);
> > -
> > -		if (ret == 0)
> > -			goto retry;
> > -
> > -		dev_err(&hdev->device,
> > -			"Retrying D0 failed with ret %d\n", ret);
> > -	}
> >  	if (ret)
> >  		goto free_irq_domain;
> >
> > --
> > 2.25.1
>

Acked-by: Wei Hu <weh@microsoft.com>



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-03-30  5:17 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
2023-03-28  4:51 ` [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations() Dexuan Cui
2023-03-28  5:29   ` [EXTERNAL] " Saurabh Singh Sengar
2023-03-28  5:38     ` Dexuan Cui
2023-03-28 17:24       ` Bjorn Helgaas
2023-03-29  8:37         ` Dexuan Cui
2023-03-29 16:06           ` Bjorn Helgaas
2023-03-28 16:48   ` Long Li
2023-03-29  8:21     ` Dexuan Cui
2023-03-28  4:51 ` [PATCH 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic Dexuan Cui
2023-03-28  4:51 ` [PATCH 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev Dexuan Cui
2023-03-28  4:51 ` [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally" Dexuan Cui
2023-03-28  6:33   ` Dexuan Cui
2023-03-28 12:46     ` Wei Hu
2023-03-30  5:17     ` Wei Hu
2023-03-28  4:51 ` [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock Dexuan Cui
2023-03-29 16:41   ` Michael Kelley (LINUX)
2023-03-29 16:54     ` Dexuan Cui
2023-03-28  4:51 ` [PATCH 6/6] PCI: hv: Use async probing to reduce boot time Dexuan Cui
2023-03-29 16:55   ` Michael Kelley (LINUX)
2023-03-29 22:42     ` Dexuan Cui

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).