linux-kernel.vger.kernel.org archive mirror
* [PATCH] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume()
From: Dexuan Cui @ 2020-09-05  2:55 UTC
  To: wei.liu, kys, haiyangz, sthemmin, linux-hyperv, linux-kernel,
	mikelley, vkuznets
  Cc: Dexuan Cui

After we Stop and later Start a VM that uses Accelerated Networking (NIC
SR-IOV), the VF vmbus device's Instance GUID can currently change. As a
result, after vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer()
cannot find the original vmbus channel of the VF, so
check_ready_for_resume_event() never calls complete() on
vmbus_connection.ready_for_resume_event, and the VM hangs in
vmbus_bus_resume() forever.

Fix the issue by adding a 10-second timeout, so that resume can still
succeed and the saved state is not lost. According to my test, the user can
then disable Accelerated Networking and SSH into the VM for further
recovery. Also prevent the VM in question from suspending again: a later
vmbus_bus_suspend() now returns -EBUSY.

The host will be fixed so that, in the future, the Instance GUID stays the
same across hibernation.

Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/hv/vmbus_drv.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 910b6e90866c..946d0aba101f 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2382,7 +2382,10 @@ static int vmbus_bus_suspend(struct device *dev)
 	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
 		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
 
-	WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0);
+	if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
+		pr_err("Can not suspend due to a previous failed resuming\n");
+		return -EBUSY;
+	}
 
 	mutex_lock(&vmbus_connection.channel_mutex);
 
@@ -2456,7 +2459,9 @@ static int vmbus_bus_resume(struct device *dev)
 
 	vmbus_request_offers();
 
-	wait_for_completion(&vmbus_connection.ready_for_resume_event);
+	if (wait_for_completion_timeout(
+		&vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
+		pr_err("Some vmbus device is missing after suspending?\n");
 
 	/* Reset the event for the next suspend. */
 	reinit_completion(&vmbus_connection.ready_for_suspend_event);
-- 
2.19.1



* RE: [PATCH] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume()
From: Michael Kelley @ 2020-09-08 21:05 UTC
  To: Dexuan Cui, wei.liu, KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, linux-hyperv, linux-kernel, vkuznets

From: Dexuan Cui <decui@microsoft.com> Sent: Friday, September 4, 2020 7:56 PM
> 
> After we Stop and later Start a VM that uses Accelerated Networking (NIC
> SR-IOV), currently the VF vmbus device's Instance GUID can change, so after
> vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find
> the original vmbus channel of the VF, and hence we can't complete()
> vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(),
> and the VM hangs in vmbus_bus_resume() forever.
> 
> Fix the issue by adding a timeout, so the resuming can still succeed, and
> the saved state is not lost, and according to my test, the user can disable
> Accelerated Networking and then will be able to SSH into the VM for
> further recovery. Also prevent the VM in question from suspending again.
> 
> The host will be fixed so in future the Instance GUID will stay the same
> across hibernation.
> 
> Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 910b6e90866c..946d0aba101f 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2382,7 +2382,10 @@ static int vmbus_bus_suspend(struct device *dev)
>  	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
>  		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
> 
> -	WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0);
> +	if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
> +		pr_err("Can not suspend due to a previous failed resuming\n");
> +		return -EBUSY;
> +	}
> 
>  	mutex_lock(&vmbus_connection.channel_mutex);
> 
> @@ -2456,7 +2459,9 @@ static int vmbus_bus_resume(struct device *dev)
> 
>  	vmbus_request_offers();
> 
> -	wait_for_completion(&vmbus_connection.ready_for_resume_event);
> +	if (wait_for_completion_timeout(
> +		&vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
> +		pr_err("Some vmbus device is missing after suspending?\n");
> 
>  	/* Reset the event for the next suspend. */
>  	reinit_completion(&vmbus_connection.ready_for_suspend_event);
> --
> 2.19.1



* Re: [PATCH] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume()
From: Wei Liu @ 2020-09-09 11:37 UTC
  To: Michael Kelley
  Cc: Dexuan Cui, wei.liu, KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, linux-hyperv, linux-kernel, vkuznets

On Tue, Sep 08, 2020 at 09:05:34PM +0000, Michael Kelley wrote:
> From: Dexuan Cui <decui@microsoft.com> Sent: Friday, September 4, 2020 7:56 PM
> > 
> > After we Stop and later Start a VM that uses Accelerated Networking (NIC
> > SR-IOV), currently the VF vmbus device's Instance GUID can change, so after
> > vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find
> > the original vmbus channel of the VF, and hence we can't complete()
> > vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(),
> > and the VM hangs in vmbus_bus_resume() forever.
> > 
> > Fix the issue by adding a timeout, so the resuming can still succeed, and
> > the saved state is not lost, and according to my test, the user can disable
> > Accelerated Networking and then will be able to SSH into the VM for
> > further recovery. Also prevent the VM in question from suspending again.
> > 
> > The host will be fixed so in future the Instance GUID will stay the same
> > across hibernation.
> > 
> > Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >  drivers/hv/vmbus_drv.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> 

Applied to hyperv-fixes. Thanks.

