linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] PCI: hv: Fix a timing issue which causes kdump to fail occasionally
@ 2020-07-17  2:55 Wei Hu
  2020-07-17 20:11 ` Bjorn Helgaas
  0 siblings, 1 reply; 3+ messages in thread
From: Wei Hu @ 2020-07-17  2:55 UTC (permalink / raw)
  To: kys, haiyangz, sthemmin, wei.liu, lorenzo.pieralisi, robh,
	bhelgaas, linux-hyperv, linux-pci, linux-kernel, decui, mikelley
  Cc: Wei Hu

Kdump could fail sometime on HyperV guest over Accerlated Network
interface. This is because the retry in hv_pci_enter_d0() relies on
an asynchronous host event to arrive guest before calling
hv_send_resources_allocated(). This fixes the problem by moving retry
to hv_pci_probe(), removing this dependence and making the calling
sequence synchronous.

v2: Adding Fixes tag according to Michael Kelley's review comment.

Fixes: c81992e7f4aa ("PCI: hv: Retry PCI bus D0 entry on invalid device state")
Signed-off-by: Wei Hu <weh@microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 66 ++++++++++++++---------------
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index bf40ff09c99d..738ee30f3334 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2759,10 +2759,8 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
 	struct pci_bus_d0_entry *d0_entry;
 	struct hv_pci_compl comp_pkt;
 	struct pci_packet *pkt;
-	bool retry = true;
 	int ret;
 
-enter_d0_retry:
 	/*
 	 * Tell the host that the bus is ready to use, and moved into the
 	 * powered-on state.  This includes telling the host which region
@@ -2789,38 +2787,6 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
 	if (ret)
 		goto exit;
 
-	/*
-	 * In certain case (Kdump) the pci device of interest was
-	 * not cleanly shut down and resource is still held on host
-	 * side, the host could return invalid device status.
-	 * We need to explicitly request host to release the resource
-	 * and try to enter D0 again.
-	 */
-	if (comp_pkt.completion_status < 0 && retry) {
-		retry = false;
-
-		dev_err(&hdev->device, "Retrying D0 Entry\n");
-
-		/*
-		 * Hv_pci_bus_exit() calls hv_send_resource_released()
-		 * to free up resources of its child devices.
-		 * In the kdump kernel we need to set the
-		 * wslot_res_allocated to 255 so it scans all child
-		 * devices to release resources allocated in the
-		 * normal kernel before panic happened.
-		 */
-		hbus->wslot_res_allocated = 255;
-
-		ret = hv_pci_bus_exit(hdev, true);
-
-		if (ret == 0) {
-			kfree(pkt);
-			goto enter_d0_retry;
-		}
-		dev_err(&hdev->device,
-			"Retrying D0 failed with ret %d\n", ret);
-	}
-
 	if (comp_pkt.completion_status < 0) {
 		dev_err(&hdev->device,
 			"PCI Pass-through VSP failed D0 Entry with status %x\n",
@@ -3058,6 +3024,7 @@ static int hv_pci_probe(struct hv_device *hdev,
 	struct hv_pcibus_device *hbus;
 	u16 dom_req, dom;
 	char *name;
+	bool enter_d0_retry = true;
 	int ret;
 
 	/*
@@ -3178,11 +3145,42 @@ static int hv_pci_probe(struct hv_device *hdev,
 	if (ret)
 		goto free_fwnode;
 
+retry:
 	ret = hv_pci_query_relations(hdev);
 	if (ret)
 		goto free_irq_domain;
 
 	ret = hv_pci_enter_d0(hdev);
+	/*
+	 * In certain case (Kdump) the pci device of interest was
+	 * not cleanly shut down and resource is still held on host
+	 * side, the host could return invalid device status.
+	 * We need to explicitly request host to release the resource
+	 * and try to enter D0 again.
+	 * The retry should start from hv_pci_query_relations() call.
+	 */
+	if (ret == -EPROTO && enter_d0_retry) {
+		enter_d0_retry = false;
+
+		dev_err(&hdev->device, "Retrying D0 Entry\n");
+
+		/*
+		 * Hv_pci_bus_exit() calls hv_send_resources_released()
+		 * to free up resources of its child devices.
+		 * In the kdump kernel we need to set the
+		 * wslot_res_allocated to 255 so it scans all child
+		 * devices to release resources allocated in the
+		 * normal kernel before panic happened.
+		 */
+		hbus->wslot_res_allocated = 255;
+		ret = hv_pci_bus_exit(hdev, true);
+
+		if (ret == 0)
+			goto retry;
+
+		dev_err(&hdev->device,
+			"Retrying D0 failed with ret %d\n", ret);
+	}
 	if (ret)
 		goto free_irq_domain;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] PCI: hv: Fix a timing issue which causes kdump to fail occasionally
  2020-07-17  2:55 [PATCH v2] PCI: hv: Fix a timing issue which causes kdump to fail occasionally Wei Hu
@ 2020-07-17 20:11 ` Bjorn Helgaas
  2020-07-18  3:46   ` Wei Hu
  0 siblings, 1 reply; 3+ messages in thread
From: Bjorn Helgaas @ 2020-07-17 20:11 UTC (permalink / raw)
  To: Wei Hu
  Cc: kys, haiyangz, sthemmin, wei.liu, lorenzo.pieralisi, robh,
	bhelgaas, linux-hyperv, linux-pci, linux-kernel, decui, mikelley

On Fri, Jul 17, 2020 at 10:55:28AM +0800, Wei Hu wrote:
> Kdump could fail sometime on HyperV guest over Accerlated Network
> interface. This is because the retry in hv_pci_enter_d0() relies on
> an asynchronous host event to arrive guest before calling
> hv_send_resources_allocated(). This fixes the problem by moving retry
> to hv_pci_probe(), removing this dependence and making the calling
> sequence synchronous.

Lorenzo, if you apply this, can you fix this typo?

  s/Accerlated/Accelerated/

and maybe even

  s/HyperV/Hyper-V/
  s/This fixes the problem/Fix the problem/
  s/this dependence/this dependency/

Not sure if "relies on ... event to arrive guest" means "relies on ...
event arriving in the guest"?  Or maybe "relies on ... event arriving
before the guest calls ..."?

This would probably all make perfect sense to me if I understood more
about Hyper-V :)

> v2: Adding Fixes tag according to Michael Kelley's review comment.

There's no need to include this "v2" comment in the commit log.  I
think if you put it after a line containing only "---" (or maybe
"--"?), it will be in the email but not in the commit log.

> Fixes: c81992e7f4aa ("PCI: hv: Retry PCI bus D0 entry on invalid device state")
> Signed-off-by: Wei Hu <weh@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 66 ++++++++++++++---------------
>  1 file changed, 32 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index bf40ff09c99d..738ee30f3334 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2759,10 +2759,8 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
>  	struct pci_bus_d0_entry *d0_entry;
>  	struct hv_pci_compl comp_pkt;
>  	struct pci_packet *pkt;
> -	bool retry = true;
>  	int ret;
>  
> -enter_d0_retry:
>  	/*
>  	 * Tell the host that the bus is ready to use, and moved into the
>  	 * powered-on state.  This includes telling the host which region
> @@ -2789,38 +2787,6 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
>  	if (ret)
>  		goto exit;
>  
> -	/*
> -	 * In certain case (Kdump) the pci device of interest was
> -	 * not cleanly shut down and resource is still held on host
> -	 * side, the host could return invalid device status.
> -	 * We need to explicitly request host to release the resource
> -	 * and try to enter D0 again.
> -	 */
> -	if (comp_pkt.completion_status < 0 && retry) {
> -		retry = false;
> -
> -		dev_err(&hdev->device, "Retrying D0 Entry\n");
> -
> -		/*
> -		 * Hv_pci_bus_exit() calls hv_send_resource_released()
> -		 * to free up resources of its child devices.
> -		 * In the kdump kernel we need to set the
> -		 * wslot_res_allocated to 255 so it scans all child
> -		 * devices to release resources allocated in the
> -		 * normal kernel before panic happened.
> -		 */
> -		hbus->wslot_res_allocated = 255;
> -
> -		ret = hv_pci_bus_exit(hdev, true);
> -
> -		if (ret == 0) {
> -			kfree(pkt);
> -			goto enter_d0_retry;
> -		}
> -		dev_err(&hdev->device,
> -			"Retrying D0 failed with ret %d\n", ret);
> -	}
> -
>  	if (comp_pkt.completion_status < 0) {
>  		dev_err(&hdev->device,
>  			"PCI Pass-through VSP failed D0 Entry with status %x\n",
> @@ -3058,6 +3024,7 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	struct hv_pcibus_device *hbus;
>  	u16 dom_req, dom;
>  	char *name;
> +	bool enter_d0_retry = true;
>  	int ret;
>  
>  	/*
> @@ -3178,11 +3145,42 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	if (ret)
>  		goto free_fwnode;
>  
> +retry:
>  	ret = hv_pci_query_relations(hdev);
>  	if (ret)
>  		goto free_irq_domain;
>  
>  	ret = hv_pci_enter_d0(hdev);
> +	/*
> +	 * In certain case (Kdump) the pci device of interest was
> +	 * not cleanly shut down and resource is still held on host
> +	 * side, the host could return invalid device status.
> +	 * We need to explicitly request host to release the resource
> +	 * and try to enter D0 again.
> +	 * The retry should start from hv_pci_query_relations() call.
> +	 */
> +	if (ret == -EPROTO && enter_d0_retry) {
> +		enter_d0_retry = false;
> +
> +		dev_err(&hdev->device, "Retrying D0 Entry\n");
> +
> +		/*
> +		 * Hv_pci_bus_exit() calls hv_send_resources_released()
> +		 * to free up resources of its child devices.
> +		 * In the kdump kernel we need to set the
> +		 * wslot_res_allocated to 255 so it scans all child
> +		 * devices to release resources allocated in the
> +		 * normal kernel before panic happened.
> +		 */
> +		hbus->wslot_res_allocated = 255;
> +		ret = hv_pci_bus_exit(hdev, true);
> +
> +		if (ret == 0)
> +			goto retry;
> +
> +		dev_err(&hdev->device,
> +			"Retrying D0 failed with ret %d\n", ret);
> +	}
>  	if (ret)
>  		goto free_irq_domain;
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [PATCH v2] PCI: hv: Fix a timing issue which causes kdump to fail occasionally
  2020-07-17 20:11 ` Bjorn Helgaas
@ 2020-07-18  3:46   ` Wei Hu
  0 siblings, 0 replies; 3+ messages in thread
From: Wei Hu @ 2020-07-18  3:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, wei.liu,
	lorenzo.pieralisi, robh, bhelgaas, linux-hyperv, linux-pci,
	linux-kernel, Dexuan Cui, Michael Kelley



> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Saturday, July 18, 2020 4:11 AM
> To: Wei Hu <weh@microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> wei.liu@kernel.org; lorenzo.pieralisi@arm.com; robh@kernel.org;
> bhelgaas@google.com; linux-hyperv@vger.kernel.org; linux-
> pci@vger.kernel.org; linux-kernel@vger.kernel.org; Dexuan Cui
> <decui@microsoft.com>; Michael Kelley <mikelley@microsoft.com>
> Subject: Re: [PATCH v2] PCI: hv: Fix a timing issue which causes kdump to fail
> occasionally
> 
> On Fri, Jul 17, 2020 at 10:55:28AM +0800, Wei Hu wrote:
> > Kdump could fail sometime on HyperV guest over Accerlated Network
> > interface. This is because the retry in hv_pci_enter_d0() relies on an
> > asynchronous host event to arrive guest before calling
> > hv_send_resources_allocated(). This fixes the problem by moving retry
> > to hv_pci_probe(), removing this dependence and making the calling
> > sequence synchronous.
> 
> Lorenzo, if you apply this, can you fix this typo?
> 
>   s/Accerlated/Accelerated/
> 
> and maybe even
> 
>   s/HyperV/Hyper-V/
>   s/This fixes the problem/Fix the problem/
>   s/this dependence/this dependency/
> 
> Not sure if "relies on ... event to arrive guest" means "relies on ...
> event arriving in the guest"?  Or maybe "relies on ... event arriving before the
> guest calls ..."?
> 
> This would probably all make perfect sense to me if I understood more about
> Hyper-V :)
> 
> > v2: Adding Fixes tag according to Michael Kelley's review comment.
> 
> There's no need to include this "v2" comment in the commit log.  I think if you
> put it after a line containing only "---" (or maybe "--"?), it will be in the email
> but not in the commit log.
> 
Thanks Bjorn. I will send out a v3 version shortly to correct all these. 
Lorenzo, please pick up my v3 version if you have not started to apply it yet. 

Thanks so much,
Wei

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-07-18  3:46 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-17  2:55 [PATCH v2] PCI: hv: Fix a timing issue which causes kdump to fail occasionally Wei Hu
2020-07-17 20:11 ` Bjorn Helgaas
2020-07-18  3:46   ` Wei Hu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).