[PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-11-27 13:09 ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-11-27 13:09 UTC (permalink / raw)
  To: gregkh, linux-kernel, driverdev-devel, olaf, apw, jasowang, kys,
	vkuznets
  Cc: haiyangz

In the case the user-space daemon crashes, hangs or is killed, we
need to down the semaphore, otherwise, after the daemon starts next
time, the obsolete data in fcopy_transaction.message or
fcopy_transaction.fcopy_msg will be used immediately.

Cc: Jason Wang <jasowang@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

v2: I removed the "FCP" prefix as Greg asked.

    I also updated the output message a little:
    "FCP: failed to acquire the semaphore" --> 
    "can not acquire the semaphore: it is benign"

v3: I added the code in fcopy_release() as Jason Wang suggested.
    I removed the pr_debug (it isn't so meaningful)and added a comment instead.

 drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
index 23b2ce2..faa6ba6 100644
--- a/drivers/hv/hv_fcopy.c
+++ b/drivers/hv/hv_fcopy.c
@@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct *dummy)
 	 * process the pending transaction.
 	 */
 	fcopy_respond_to_host(HV_E_FAIL);
+
+	/* In the case the user-space daemon crashes, hangs or is killed, we
+	 * need to down the semaphore, otherwise, after the daemon starts next
+	 * time, the obsolete data in fcopy_transaction.message or
+	 * fcopy_transaction.fcopy_msg will be used immediately.
+	 *
+	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
+	 * still OK, because we've reported the failure to the host.
+	 */
+	if (down_trylock(&fcopy_transaction.read_sema))
+		;
+
 }
 
 static int fcopy_handle_handshake(u32 version)
@@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, struct file *f)
 	 */
 	in_hand_shake = true;
 	opened = false;
+
+	if (cancel_delayed_work_sync(&fcopy_work)) {
+		/* We haven't up()-ed the semaphore(very rare)? */
+		if (down_trylock(&fcopy_transaction.read_sema))
+			;
+		fcopy_respond_to_host(HV_E_FAIL);
+	}
 	return 0;
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-11-27 13:09 ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-11-27 13:09 UTC (permalink / raw)
  To: gregkh, linux-kernel, driverdev-devel, olaf, apw, jasowang, kys,
	vkuznets
  Cc: haiyangz

In the case the user-space daemon crashes, hangs or is killed, we
need to down the semaphore, otherwise, after the daemon starts next
time, the obsolete data in fcopy_transaction.message or
fcopy_transaction.fcopy_msg will be used immediately.

Cc: Jason Wang <jasowang@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

v2: I removed the "FCP" prefix as Greg asked.

    I also updated the output message a little:
    "FCP: failed to acquire the semaphore" --> 
    "can not acquire the semaphore: it is benign"

v3: I added the code in fcopy_release() as Jason Wang suggested.
    I removed the pr_debug (it isn't so meaningful)and added a comment instead.

 drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
index 23b2ce2..faa6ba6 100644
--- a/drivers/hv/hv_fcopy.c
+++ b/drivers/hv/hv_fcopy.c
@@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct *dummy)
 	 * process the pending transaction.
 	 */
 	fcopy_respond_to_host(HV_E_FAIL);
+
+	/* In the case the user-space daemon crashes, hangs or is killed, we
+	 * need to down the semaphore, otherwise, after the daemon starts next
+	 * time, the obsolete data in fcopy_transaction.message or
+	 * fcopy_transaction.fcopy_msg will be used immediately.
+	 *
+	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
+	 * still OK, because we've reported the failure to the host.
+	 */
+	if (down_trylock(&fcopy_transaction.read_sema))
+		;
+
 }
 
 static int fcopy_handle_handshake(u32 version)
@@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, struct file *f)
 	 */
 	in_hand_shake = true;
 	opened = false;
+
+	if (cancel_delayed_work_sync(&fcopy_work)) {
+		/* We haven't up()-ed the semaphore(very rare)? */
+		if (down_trylock(&fcopy_transaction.read_sema))
+			;
+		fcopy_respond_to_host(HV_E_FAIL);
+	}
 	return 0;
 }
 
-- 
1.9.1

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-11-27 13:09 ` Dexuan Cui
@ 2014-11-28  6:55   ` Jason Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-11-28  6:47 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, kys, vkuznets,
	haiyangz



On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com> wrote:
> In the case the user-space daemon crashes, hangs or is killed, we
> need to down the semaphore, otherwise, after the daemon starts next
> time, the obsolete data in fcopy_transaction.message or
> fcopy_transaction.fcopy_msg will be used immediately.
> 
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: K. Y. Srinivasan <kys@microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> 
> v2: I removed the "FCP" prefix as Greg asked.
> 
>     I also updated the output message a little:
>     "FCP: failed to acquire the semaphore" --> 
>     "can not acquire the semaphore: it is benign"
> 
> v3: I added the code in fcopy_release() as Jason Wang suggested.
>     I removed the pr_debug (it isn't so meaningful)and added a 
> comment instead.
> 
>  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> index 23b2ce2..faa6ba6 100644
> --- a/drivers/hv/hv_fcopy.c
> +++ b/drivers/hv/hv_fcopy.c
> @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct 
> *dummy)
>  	 * process the pending transaction.
>  	 */
>  	fcopy_respond_to_host(HV_E_FAIL);
> +
> +	/* In the case the user-space daemon crashes, hangs or is killed, we
> +	 * need to down the semaphore, otherwise, after the daemon starts 
> next
> +	 * time, the obsolete data in fcopy_transaction.message or
> +	 * fcopy_transaction.fcopy_msg will be used immediately.
> +	 *
> +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? 
> We're
> +	 * still OK, because we've reported the failure to the host.
> +	 */
> +	if (down_trylock(&fcopy_transaction.read_sema))
> +		;

Sorry, I'm not quite understand how if () ; can help here.

Btw, a question not relate to this patch.

What happens if a daemon is resume from SIGSTOP and expires the check 
here?
> 
> +
>  }
>  
>  static int fcopy_handle_handshake(u32 version)
> @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, 
> struct file *f)
>  	 */
>  	in_hand_shake = true;
>  	opened = false;
> +
> +	if (cancel_delayed_work_sync(&fcopy_work)) {
> +		/* We haven't up()-ed the semaphore(very rare)? */
> +		if (down_trylock(&fcopy_transaction.read_sema))
> +			;

And this.
> 
> +		fcopy_respond_to_host(HV_E_FAIL);
> +	}
>  	return 0;
>  }
>  
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-11-28  6:55   ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-11-28  6:55 UTC (permalink / raw)
  To: Dexuan Cui; +Cc: olaf, gregkh, haiyangz, driverdev-devel, linux-kernel, apw



On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com> wrote:
> In the case the user-space daemon crashes, hangs or is killed, we
> need to down the semaphore, otherwise, after the daemon starts next
> time, the obsolete data in fcopy_transaction.message or
> fcopy_transaction.fcopy_msg will be used immediately.
> 
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: K. Y. Srinivasan <kys@microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> 
> v2: I removed the "FCP" prefix as Greg asked.
> 
>     I also updated the output message a little:
>     "FCP: failed to acquire the semaphore" --> 
>     "can not acquire the semaphore: it is benign"
> 
> v3: I added the code in fcopy_release() as Jason Wang suggested.
>     I removed the pr_debug (it isn't so meaningful)and added a 
> comment instead.
> 
>  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> index 23b2ce2..faa6ba6 100644
> --- a/drivers/hv/hv_fcopy.c
> +++ b/drivers/hv/hv_fcopy.c
> @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct 
> *dummy)
>  	 * process the pending transaction.
>  	 */
>  	fcopy_respond_to_host(HV_E_FAIL);
> +
> +	/* In the case the user-space daemon crashes, hangs or is killed, we
> +	 * need to down the semaphore, otherwise, after the daemon starts 
> next
> +	 * time, the obsolete data in fcopy_transaction.message or
> +	 * fcopy_transaction.fcopy_msg will be used immediately.
> +	 *
> +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? 
> We're
> +	 * still OK, because we've reported the failure to the host.
> +	 */
> +	if (down_trylock(&fcopy_transaction.read_sema))
> +		;

Sorry, I'm not quite understand how if () ; can help here.

Btw, a question not relate to this patch.

What happens if a daemon is resume from SIGSTOP and expires the check 
here?
> 
> +
>  }
>  
>  static int fcopy_handle_handshake(u32 version)
> @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, 
> struct file *f)
>  	 */
>  	in_hand_shake = true;
>  	opened = false;
> +
> +	if (cancel_delayed_work_sync(&fcopy_work)) {
> +		/* We haven't up()-ed the semaphore(very rare)? */
> +		if (down_trylock(&fcopy_transaction.read_sema))
> +			;

And this.
> 
> +		fcopy_respond_to_host(HV_E_FAIL);
> +	}
>  	return 0;
>  }
>  
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-11-28  6:55   ` Jason Wang
@ 2014-11-28  8:36     ` Dexuan Cui
  -1 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-11-28  8:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 5050 bytes --]

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, November 28, 2014 14:47 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com> wrote:
> > In the case the user-space daemon crashes, hangs or is killed, we
> > need to down the semaphore, otherwise, after the daemon starts next
> > time, the obsolete data in fcopy_transaction.message or
> > fcopy_transaction.fcopy_msg will be used immediately.
> >
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Cc: K. Y. Srinivasan <kys@microsoft.com>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >
> > v2: I removed the "FCP" prefix as Greg asked.
> >
> >     I also updated the output message a little:
> >     "FCP: failed to acquire the semaphore" -->
> >     "can not acquire the semaphore: it is benign"
> >
> > v3: I added the code in fcopy_release() as Jason Wang suggested.
> >     I removed the pr_debug (it isn't so meaningful)and added a
> > comment instead.
> >
> >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> > index 23b2ce2..faa6ba6 100644
> > --- a/drivers/hv/hv_fcopy.c
> > +++ b/drivers/hv/hv_fcopy.c
> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> > *dummy)
> >  	 * process the pending transaction.
> >  	 */
> >  	fcopy_respond_to_host(HV_E_FAIL);
> > +
> > +	/* In the case the user-space daemon crashes, hangs or is killed, we
> > +	 * need to down the semaphore, otherwise, after the daemon starts
> > next
> > +	 * time, the obsolete data in fcopy_transaction.message or
> > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> > +	 *
> > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
> > We're
> > +	 * still OK, because we've reported the failure to the host.
> > +	 */
> > +	if (down_trylock(&fcopy_transaction.read_sema))
> > +		;
> 
> Sorry, I'm not quite understand how if () ; can help here.
> 
> Btw, a question not relate to this patch.
> 
> What happens if a daemon is resume from SIGSTOP and expires the check
> here?
Hi Jason,
My idea is: here we need down_trylock(), but in case we can't get the
semaphore, it's OK anyway:

Scenario 1):
1.1: when the daemon is blocked on the pread(), the daemon receives SIGSTOP;
1.2: the host user runs the PowerShell Copy-VMFile command;
1.3.1: the driver reports the failure to the host user in 5s and
1.3.2: the driver down()-es the semaphore;
1.4: the daemon receives SIGCONT and it will be still blocked on the pread().
Without the down_trylock(), in 1.4, the daemon can receive an obsolete message.
NOTE: in this scenario, the daemon is not killed.

Scenario 2):
In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 and
do down() in fcopy_read(), it will receive the message but: the driver has
reported the failure to the host user and the driver's 1.3.2 can't get the
semaphore -- IMO this is acceptably OK, though in the VM, an incomplete
file will be left there.
BTW, I think in the daemon's hv_start_fcopy() we should add a
close(target_fd) before open()-ing a new one.

> >
> > +
> >  }
> >
> >  static int fcopy_handle_handshake(u32 version)
> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode,
> > struct file *f)
> >  	 */
> >  	in_hand_shake = true;
> >  	opened = false;
> > +
> > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > +		/* We haven't up()-ed the semaphore(very rare)? */
> > +		if (down_trylock(&fcopy_transaction.read_sema))
> > +			;
> 
> And this.

Scenario 3):
When the daemon exits(e.g., SIGKILL received), if there is a fcopy_work
pending (scheduled but not start to run yet), we should cancel the
work (as you suggested) and down() the semaphore, otherwise, the
obsolete message will be received by the next instance of the daemon.

Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
        schedule_delayed_work(&fcopy_work, 5*HZ);
        ----> if fcopy_release() is running on another vcpu, just before the next line?
        fcopy_send_data();

In this case, fcopy_release() can cancel fcopy_work, but
can't get the semaphore since it hasn't been up()-ed.
Hmm, in this case,   fcopy_send_data() will do up()  later, and we'll
buffer an obsolete message in the driver, and the message will be
fetched by the next instance of the daemon... 

Looks we need a spinlock here?

Please share your idea.

> >
> > +		fcopy_respond_to_host(HV_E_FAIL);
> > +	}
> >  	return 0;
> >  }
> >
> > --
> > 1.9.1
> >

-- Dexuan

ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-11-28  8:36     ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-11-28  8:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, November 28, 2014 14:47 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com> wrote:
> > In the case the user-space daemon crashes, hangs or is killed, we
> > need to down the semaphore, otherwise, after the daemon starts next
> > time, the obsolete data in fcopy_transaction.message or
> > fcopy_transaction.fcopy_msg will be used immediately.
> >
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Cc: K. Y. Srinivasan <kys@microsoft.com>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >
> > v2: I removed the "FCP" prefix as Greg asked.
> >
> >     I also updated the output message a little:
> >     "FCP: failed to acquire the semaphore" -->
> >     "can not acquire the semaphore: it is benign"
> >
> > v3: I added the code in fcopy_release() as Jason Wang suggested.
> >     I removed the pr_debug (it isn't so meaningful)and added a
> > comment instead.
> >
> >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> > index 23b2ce2..faa6ba6 100644
> > --- a/drivers/hv/hv_fcopy.c
> > +++ b/drivers/hv/hv_fcopy.c
> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> > *dummy)
> >  	 * process the pending transaction.
> >  	 */
> >  	fcopy_respond_to_host(HV_E_FAIL);
> > +
> > +	/* In the case the user-space daemon crashes, hangs or is killed, we
> > +	 * need to down the semaphore, otherwise, after the daemon starts
> > next
> > +	 * time, the obsolete data in fcopy_transaction.message or
> > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> > +	 *
> > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
> > We're
> > +	 * still OK, because we've reported the failure to the host.
> > +	 */
> > +	if (down_trylock(&fcopy_transaction.read_sema))
> > +		;
> 
> Sorry, I'm not quite understand how if () ; can help here.
> 
> Btw, a question not relate to this patch.
> 
> What happens if a daemon is resume from SIGSTOP and expires the check
> here?
Hi Jason,
My idea is: here we need down_trylock(), but in case we can't get the
semaphore, it's OK anyway:

Scenario 1):
1.1: when the daemon is blocked on the pread(), the daemon receives SIGSTOP;
1.2: the host user runs the PowerShell Copy-VMFile command;
1.3.1: the driver reports the failure to the host user in 5s and
1.3.2: the driver down()-es the semaphore;
1.4: the daemon receives SIGCONT and it will be still blocked on the pread().
Without the down_trylock(), in 1.4, the daemon can receive an obsolete message.
NOTE: in this scenario, the daemon is not killed.

Scenario 2):
In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 and
do down() in fcopy_read(), it will receive the message but: the driver has
reported the failure to the host user and the driver's 1.3.2 can't get the
semaphore -- IMO this is acceptably OK, though in the VM, an incomplete
file will be left there.
BTW, I think in the daemon's hv_start_fcopy() we should add a
close(target_fd) before open()-ing a new one.

> >
> > +
> >  }
> >
> >  static int fcopy_handle_handshake(u32 version)
> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode,
> > struct file *f)
> >  	 */
> >  	in_hand_shake = true;
> >  	opened = false;
> > +
> > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > +		/* We haven't up()-ed the semaphore(very rare)? */
> > +		if (down_trylock(&fcopy_transaction.read_sema))
> > +			;
> 
> And this.

Scenario 3):
When the daemon exits(e.g., SIGKILL received), if there is a fcopy_work
pending (scheduled but not start to run yet), we should cancel the
work (as you suggested) and down() the semaphore, otherwise, the
obsolete message will be received by the next instance of the daemon.

Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
        schedule_delayed_work(&fcopy_work, 5*HZ);
        ----> if fcopy_release() is running on another vcpu, just before the next line?
        fcopy_send_data();

In this case, fcopy_release() can cancel fcopy_work, but
can't get the semaphore since it hasn't been up()-ed.
Hmm, in this case,   fcopy_send_data() will do up()  later, and we'll
buffer an obsolete message in the driver, and the message will be
fetched by the next instance of the daemon... 

Looks we need a spinlock here?

Please share your idea.

> >
> > +		fcopy_respond_to_host(HV_E_FAIL);
> > +	}
> >  	return 0;
> >  }
> >
> > --
> > 1.9.1
> >

-- Dexuan

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
       [not found]     ` <F792CF86EFE20D4AB8064279AFBA51C613E66566@HKNPRD3002MB017.064d.mgd.msft.net >
@ 2014-11-28 10:20         ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-11-28 10:12 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang



On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  Sent: Friday, November 28, 2014 14:47 PM
>>  To: Dexuan Cui
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com> 
>> wrote:
>>  > In the case the user-space daemon crashes, hangs or is killed, we
>>  > need to down the semaphore, otherwise, after the daemon starts 
>> next
>>  > time, the obsolete data in fcopy_transaction.message or
>>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  >
>>  > Cc: Jason Wang <jasowang@redhat.com>
>>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>>  > ---
>>  >
>>  > v2: I removed the "FCP" prefix as Greg asked.
>>  >
>>  >     I also updated the output message a little:
>>  >     "FCP: failed to acquire the semaphore" -->
>>  >     "can not acquire the semaphore: it is benign"
>>  >
>>  > v3: I added the code in fcopy_release() as Jason Wang suggested.
>>  >     I removed the pr_debug (it isn't so meaningful)and added a
>>  > comment instead.
>>  >
>>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>>  >  1 file changed, 19 insertions(+)
>>  >
>>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>>  > index 23b2ce2..faa6ba6 100644
>>  > --- a/drivers/hv/hv_fcopy.c
>>  > +++ b/drivers/hv/hv_fcopy.c
>>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
>>  > *dummy)
>>  >  	 * process the pending transaction.
>>  >  	 */
>>  >  	fcopy_respond_to_host(HV_E_FAIL);
>>  > +
>>  > +	/* In the case the user-space daemon crashes, hangs or is 
>> killed, we
>>  > +	 * need to down the semaphore, otherwise, after the daemon 
>> starts
>>  > next
>>  > +	 * time, the obsolete data in fcopy_transaction.message or
>>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>>  > +	 *
>>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
>>  > We're
>>  > +	 * still OK, because we've reported the failure to the host.
>>  > +	 */
>>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>>  > +		;
>>  
>>  Sorry, I'm not quite understand how if () ; can help here.
>>  
>>  Btw, a question not relate to this patch.
>>  
>>  What happens if a daemon is resume from SIGSTOP and expires the 
>> check
>>  here?
> Hi Jason,
> My idea is: here we need down_trylock(), but in case we can't get the
> semaphore, it's OK anyway:
> 
> Scenario 1):
> 1.1: when the daemon is blocked on the pread(), the daemon receives 
> SIGSTOP;
> 1.2: the host user runs the PowerShell Copy-VMFile command;
> 1.3.1: the driver reports the failure to the host user in 5s and
> 1.3.2: the driver down()-es the semaphore;
> 1.4: the daemon receives SIGCONT and it will be still blocked on the 
> pread().
> Without the down_trylock(), in 1.4, the daemon can receive an 
> obsolete message.
> NOTE: in this scenario, the daemon is not killed.
> 
> Scenario 2):
> In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 
> and
> do down() in fcopy_read(), it will receive the message but: the 
> driver has
> reported the failure to the host user and the driver's 1.3.2 can't 
> get the
> semaphore -- IMO this is acceptably OK, though in the VM, an 
> incomplete
> file will be left there.
> BTW, I think in the daemon's hv_start_fcopy() we should add a
> close(target_fd) before open()-ing a new one.

Right, but how about the case when resuming from SIGSTOP but no timeout?

Looks like in this case userspace() may wait in down_interruptible() 
until timeout. We probably need something like this:

        if (down_interruptible(&fcopy_transaction.read_sema)) {
                up(&fcopy_transaction.read_sema);
                return -EINTR;
        }

This should synchronize with the timeout work for sure.
But how about only schedule it after this?
It does not may sense to start the timer during interrupt
since the file may not even opened and it may take time
to handle signals?

> 
>>  >
>>  > +
>>  >  }
>>  >
>>  >  static int fcopy_handle_handshake(u32 version)
>>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode,
>>  > struct file *f)
>>  >  	 */
>>  >  	in_hand_shake = true;
>>  >  	opened = false;
>>  > +
>>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>>  > +			;
>>  
>>  And this.
> 
> Scenario 3):
> When the daemon exits(e.g., SIGKILL received), if there is a 
> fcopy_work
> pending (scheduled but not start to run yet), we should cancel the
> work (as you suggested) and down() the semaphore, otherwise, the
> obsolete message will be received by the next instance of the daemon.

Yes
> 
> 
> Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>         schedule_delayed_work(&fcopy_work, 5*HZ);
>         ----> if fcopy_release() is running on another vcpu, just 
> before the next line?
>         fcopy_send_data();
> 
> In this case, fcopy_release() can cancel fcopy_work, but
> can't get the semaphore since it hasn't been up()-ed.
> Hmm, in this case,   fcopy_send_data() will do up()  later, and we'll
> buffer an obsolete message in the driver, and the message will be
> fetched by the next instance of the daemon... 
> 
> Looks we need a spinlock here?

Unless fcopy_release() can wait for all data for current transation
to be received. Spinlock won't help.

But an idea is let the daemon the handle such cases. E.g make sure the
processing begins with START_COPY and end with COMPLETE/CANCEL_COPY.
Drop all requests that does not start with START_COPY.

Thought?

Thanks
> 
> 
> Please share your idea.
> 
>>  >
>>  > +		fcopy_respond_to_host(HV_E_FAIL);
>>  > +	}
>>  >  	return 0;
>>  >  }
>>  >
>>  > --
>>  > 1.9.1
>>  >
> 
> -- Dexuan
> 
> ��NrybXǧv^)޺{.n+{zX\x17ܨ}Ơz&j:+v\azZ++zfh~iz\x1ew?&)ߢ^[f^jǫym@Aa\x7f\f0h\x0fi\x7f


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-11-28 10:20         ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-11-28 10:20 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw



On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  Sent: Friday, November 28, 2014 14:47 PM
>>  To: Dexuan Cui
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com> 
>> wrote:
>>  > In the case the user-space daemon crashes, hangs or is killed, we
>>  > need to down the semaphore, otherwise, after the daemon starts 
>> next
>>  > time, the obsolete data in fcopy_transaction.message or
>>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  >
>>  > Cc: Jason Wang <jasowang@redhat.com>
>>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>>  > ---
>>  >
>>  > v2: I removed the "FCP" prefix as Greg asked.
>>  >
>>  >     I also updated the output message a little:
>>  >     "FCP: failed to acquire the semaphore" -->
>>  >     "can not acquire the semaphore: it is benign"
>>  >
>>  > v3: I added the code in fcopy_release() as Jason Wang suggested.
>>  >     I removed the pr_debug (it isn't so meaningful)and added a
>>  > comment instead.
>>  >
>>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>>  >  1 file changed, 19 insertions(+)
>>  >
>>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>>  > index 23b2ce2..faa6ba6 100644
>>  > --- a/drivers/hv/hv_fcopy.c
>>  > +++ b/drivers/hv/hv_fcopy.c
>>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
>>  > *dummy)
>>  >  	 * process the pending transaction.
>>  >  	 */
>>  >  	fcopy_respond_to_host(HV_E_FAIL);
>>  > +
>>  > +	/* In the case the user-space daemon crashes, hangs or is 
>> killed, we
>>  > +	 * need to down the semaphore, otherwise, after the daemon 
>> starts
>>  > next
>>  > +	 * time, the obsolete data in fcopy_transaction.message or
>>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>>  > +	 *
>>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
>>  > We're
>>  > +	 * still OK, because we've reported the failure to the host.
>>  > +	 */
>>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>>  > +		;
>>  
>>  Sorry, I'm not quite understand how if () ; can help here.
>>  
>>  Btw, a question not relate to this patch.
>>  
>>  What happens if a daemon is resume from SIGSTOP and expires the 
>> check
>>  here?
> Hi Jason,
> My idea is: here we need down_trylock(), but in case we can't get the
> semaphore, it's OK anyway:
> 
> Scenario 1):
> 1.1: when the daemon is blocked on the pread(), the daemon receives 
> SIGSTOP;
> 1.2: the host user runs the PowerShell Copy-VMFile command;
> 1.3.1: the driver reports the failure to the host user in 5s and
> 1.3.2: the driver down()-es the semaphore;
> 1.4: the daemon receives SIGCONT and it will be still blocked on the 
> pread().
> Without the down_trylock(), in 1.4, the daemon can receive an 
> obsolete message.
> NOTE: in this scenario, the daemon is not killed.
> 
> Scenario 2):
> In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 
> and
> do down() in fcopy_read(), it will receive the message but: the 
> driver has
> reported the failure to the host user and the driver's 1.3.2 can't 
> get the
> semaphore -- IMO this is acceptably OK, though in the VM, an 
> incomplete
> file will be left there.
> BTW, I think in the daemon's hv_start_fcopy() we should add a
> close(target_fd) before open()-ing a new one.

Right, but how about the case when resuming from SIGSTOP but no timeout?

Looks like in this case userspace() may wait in down_interruptible() 
until timeout. We probably need something like this:

        if (down_interruptible(&fcopy_transaction.read_sema)) {
                up(&fcopy_transaction.read_sema);
                return -EINTR;
        }

This should synchronize with the timeout work for sure.
But how about only schedule it after this?
It does not may sense to start the timer during interrupt
since the file may not even opened and it may take time
to handle signals?

> 
>>  >
>>  > +
>>  >  }
>>  >
>>  >  static int fcopy_handle_handshake(u32 version)
>>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode,
>>  > struct file *f)
>>  >  	 */
>>  >  	in_hand_shake = true;
>>  >  	opened = false;
>>  > +
>>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>>  > +			;
>>  
>>  And this.
> 
> Scenario 3):
> When the daemon exits(e.g., SIGKILL received), if there is a 
> fcopy_work
> pending (scheduled but not start to run yet), we should cancel the
> work (as you suggested) and down() the semaphore, otherwise, the
> obsolete message will be received by the next instance of the daemon.

Yes
> 
> 
> Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>         schedule_delayed_work(&fcopy_work, 5*HZ);
>         ----> if fcopy_release() is running on another vcpu, just 
> before the next line?
>         fcopy_send_data();
> 
> In this case, fcopy_release() can cancel fcopy_work, but
> can't get the semaphore since it hasn't been up()-ed.
> Hmm, in this case,   fcopy_send_data() will do up()  later, and we'll
> buffer an obsolete message in the driver, and the message will be
> fetched by the next instance of the daemon... 
> 
> Looks we need a spinlock here?

Unless fcopy_release() can wait for all data for current transation
to be received. Spinlock won't help.

But an idea is let the daemon the handle such cases. E.g make sure the
processing begins with START_COPY and end with COMPLETE/CANCEL_COPY.
Drop all requests that does not start with START_COPY.

Thought?

Thanks
> 
> 
> Please share your idea.
> 
>>  >
>>  > +		fcopy_respond_to_host(HV_E_FAIL);
>>  > +	}
>>  >  	return 0;
>>  >  }
>>  >
>>  > --
>>  > 1.9.1
>>  >
> 
> -- Dexuan
> 
> ��NrybXǧv^)޺{.n+{zX\x17ܨ}Ơz&j:+v\azZ++zfh~iz\x1ew?&)ߢ^[f^jǫym@Aa\x7f\f0h\x0fi\x7f

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-11-28 10:20         ` Jason Wang
@ 2014-11-28 11:54           ` Dexuan Cui
  -1 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-11-28 11:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 7608 bytes --]

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, November 28, 2014 18:13 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com> wrote:
> >>  -----Original Message-----
> >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  Sent: Friday, November 28, 2014 14:47 PM
> >>  To: Dexuan Cui
> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> driverdev-
> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> >> transfer
> >>  failure
> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com>
> >> wrote:
> >>  > In the case the user-space daemon crashes, hangs or is killed, we
> >>  > need to down the semaphore, otherwise, after the daemon starts
> >> next
> >>  > time, the obsolete data in fcopy_transaction.message or
> >>  > fcopy_transaction.fcopy_msg will be used immediately.
> >>  >
> >>  > Cc: Jason Wang <jasowang@redhat.com>
> >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
> >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >>  > ---
> >>  >
> >>  > v2: I removed the "FCP" prefix as Greg asked.
> >>  >
> >>  >     I also updated the output message a little:
> >>  >     "FCP: failed to acquire the semaphore" -->
> >>  >     "can not acquire the semaphore: it is benign"
> >>  >
> >>  > v3: I added the code in fcopy_release() as Jason Wang suggested.
> >>  >     I removed the pr_debug (it isn't so meaningful)and added a
> >>  > comment instead.
> >>  >
> >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >>  >  1 file changed, 19 insertions(+)
> >>  >
> >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> >>  > index 23b2ce2..faa6ba6 100644
> >>  > --- a/drivers/hv/hv_fcopy.c
> >>  > +++ b/drivers/hv/hv_fcopy.c
> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> >>  > *dummy)
> >>  >  	 * process the pending transaction.
> >>  >  	 */
> >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> >>  > +
> >>  > +	/* In the case the user-space daemon crashes, hangs or is
> >> killed, we
> >>  > +	 * need to down the semaphore, otherwise, after the daemon
> >> starts
> >>  > next
> >>  > +	 * time, the obsolete data in fcopy_transaction.message or
> >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> >>  > +	 *
> >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
> >>  > We're
> >>  > +	 * still OK, because we've reported the failure to the host.
> >>  > +	 */
> >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> >>  > +		;
> >>
> >>  Sorry, I'm not quite understand how if () ; can help here.
> >>
> >>  Btw, a question not relate to this patch.
> >>
> >>  What happens if a daemon is resume from SIGSTOP and expires the
> >> check
> >>  here?
> > Hi Jason,
> > My idea is: here we need down_trylock(), but in case we can't get the
> > semaphore, it's OK anyway:
> >
> > Scenario 1):
> > 1.1: when the daemon is blocked on the pread(), the daemon receives
> > SIGSTOP;
> > 1.2: the host user runs the PowerShell Copy-VMFile command;
> > 1.3.1: the driver reports the failure to the host user in 5s and
> > 1.3.2: the driver down()-es the semaphore;
> > 1.4: the daemon receives SIGCONT and it will be still blocked on the
> > pread().
> > Without the down_trylock(), in 1.4, the daemon can receive an
> > obsolete message.
> > NOTE: in this scenario, the daemon is not killed.
> >
> > Scenario 2):
> > In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2
> > and
> > do down() in fcopy_read(), it will receive the message but: the
> > driver has
> > reported the failure to the host user and the driver's 1.3.2 can't
> > get the
> > semaphore -- IMO this is acceptably OK, though in the VM, an
> > incomplete
> > file will be left there.
> > BTW, I think in the daemon's hv_start_fcopy() we should add a
> > close(target_fd) before open()-ing a new one.
> 
> Right, but how about the case when resuming from SIGSTOP but no timeout?
Sorry, I don't understand this:
if no timeout, fcopy_read() will get the semaphore and fcopy_write()
will try to cancel fcopy_work.

> Looks like in this case userspace() may wait in down_interruptible()
> until timeout. We probably need something like this:
> 
>         if (down_interruptible(&fcopy_transaction.read_sema)) {
>                 up(&fcopy_transaction.read_sema);
>                 return -EINTR;
>         }
until "timeout"?
if the daemon can't get the semaphore, it can only be wake by a signal(the
daemon doesn't install handler, so by default most signals will kill the daemon).
In case a signal waking up the daemon doesn't kill the daemon, why should
we do up()?

> 
> This should synchronize with the timeout work for sure.
> But how about only schedule it after this?
> It does not may sense to start the timer during interrupt
> since the file may not even opened and it may take time
> to handle signals?
> 
> >
> >>  >
> >>  > +
> >>  >  }
> >>  >
> >>  >  static int fcopy_handle_handshake(u32 version)
> >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode,
> >>  > struct file *f)
> >>  >  	 */
> >>  >  	in_hand_shake = true;
> >>  >  	opened = false;
> >>  > +
> >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
> >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
> >>  > +			;
> >>
> >>  And this.
> >
> > Scenario 3):
> > When the daemon exits(e.g., SIGKILL received), if there is a
> > fcopy_work
> > pending (scheduled but not start to run yet), we should cancel the
> > work (as you suggested) and down() the semaphore, otherwise, the
> > obsolete message will be received by the next instance of the daemon.
> 
> Yes
> >
> >
> > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> >         schedule_delayed_work(&fcopy_work, 5*HZ);
> >         ----> if fcopy_release() is running on another vcpu, just
> > before the next line?
> >         fcopy_send_data();
> >
> > In this case, fcopy_release() can cancel fcopy_work, but
> > can't get the semaphore since it hasn't been up()-ed.
> > Hmm, in this case,   fcopy_send_data() will do up()  later, and we'll
> > buffer an obsolete message in the driver, and the message will be
> > fetched by the next instance of the daemon...
> >
> > Looks we need a spinlock here?
> 
> Unless fcopy_release() can wait for all data for current transation
> to be received. Spinlock won't help.
> 
> But an idea is let the daemon the handle such cases. E.g make sure the
> processing begins with START_COPY and end with COMPLETE/CANCEL_COPY.
> Drop all requests that does not start with START_COPY.
> 
> Thought?
Good idea.
I also think we should reinforce the concept of state machine in the
daemon code.

The daemon/driver communication has so many corner cases...

> >
> >>  >
> >>  > +		fcopy_respond_to_host(HV_E_FAIL);
> >>  > +	}
> >>  >  	return 0;
> >>  >  }
> >>  >
> >>  > --
> >>  > 1.9.1
> >>  >

 -- Dexuan

ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-11-28 11:54           ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-11-28 11:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, November 28, 2014 18:13 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com> wrote:
> >>  -----Original Message-----
> >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  Sent: Friday, November 28, 2014 14:47 PM
> >>  To: Dexuan Cui
> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> driverdev-
> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> >> transfer
> >>  failure
> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@microsoft.com>
> >> wrote:
> >>  > In the case the user-space daemon crashes, hangs or is killed, we
> >>  > need to down the semaphore, otherwise, after the daemon starts
> >> next
> >>  > time, the obsolete data in fcopy_transaction.message or
> >>  > fcopy_transaction.fcopy_msg will be used immediately.
> >>  >
> >>  > Cc: Jason Wang <jasowang@redhat.com>
> >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
> >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >>  > ---
> >>  >
> >>  > v2: I removed the "FCP" prefix as Greg asked.
> >>  >
> >>  >     I also updated the output message a little:
> >>  >     "FCP: failed to acquire the semaphore" -->
> >>  >     "can not acquire the semaphore: it is benign"
> >>  >
> >>  > v3: I added the code in fcopy_release() as Jason Wang suggested.
> >>  >     I removed the pr_debug (it isn't so meaningful)and added a
> >>  > comment instead.
> >>  >
> >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >>  >  1 file changed, 19 insertions(+)
> >>  >
> >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> >>  > index 23b2ce2..faa6ba6 100644
> >>  > --- a/drivers/hv/hv_fcopy.c
> >>  > +++ b/drivers/hv/hv_fcopy.c
> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> >>  > *dummy)
> >>  >  	 * process the pending transaction.
> >>  >  	 */
> >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> >>  > +
> >>  > +	/* In the case the user-space daemon crashes, hangs or is
> >> killed, we
> >>  > +	 * need to down the semaphore, otherwise, after the daemon
> >> starts
> >>  > next
> >>  > +	 * time, the obsolete data in fcopy_transaction.message or
> >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> >>  > +	 *
> >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
> >>  > We're
> >>  > +	 * still OK, because we've reported the failure to the host.
> >>  > +	 */
> >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> >>  > +		;
> >>
> >>  Sorry, I'm not quite understand how if () ; can help here.
> >>
> >>  Btw, a question not relate to this patch.
> >>
> >>  What happens if a daemon is resume from SIGSTOP and expires the
> >> check
> >>  here?
> > Hi Jason,
> > My idea is: here we need down_trylock(), but in case we can't get the
> > semaphore, it's OK anyway:
> >
> > Scenario 1):
> > 1.1: when the daemon is blocked on the pread(), the daemon receives
> > SIGSTOP;
> > 1.2: the host user runs the PowerShell Copy-VMFile command;
> > 1.3.1: the driver reports the failure to the host user in 5s and
> > 1.3.2: the driver down()-es the semaphore;
> > 1.4: the daemon receives SIGCONT and it will be still blocked on the
> > pread().
> > Without the down_trylock(), in 1.4, the daemon can receive an
> > obsolete message.
> > NOTE: in this scenario, the daemon is not killed.
> >
> > Scenario 2):
> > In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2
> > and
> > do down() in fcopy_read(), it will receive the message but: the
> > driver has
> > reported the failure to the host user and the driver's 1.3.2 can't
> > get the
> > semaphore -- IMO this is acceptably OK, though in the VM, an
> > incomplete
> > file will be left there.
> > BTW, I think in the daemon's hv_start_fcopy() we should add a
> > close(target_fd) before open()-ing a new one.
> 
> Right, but how about the case when resuming from SIGSTOP but no timeout?
Sorry, I don't understand this:
if no timeout, fcopy_read() will get the semaphore and fcopy_write()
will try to cancel fcopy_work.

> Looks like in this case userspace() may wait in down_interruptible()
> until timeout. We probably need something like this:
> 
>         if (down_interruptible(&fcopy_transaction.read_sema)) {
>                 up(&fcopy_transaction.read_sema);
>                 return -EINTR;
>         }
until "timeout"?
if the daemon can't get the semaphore, it can only be wake by a signal(the
daemon doesn't install handler, so by default most signals will kill the daemon).
In case a signal waking up the daemon doesn't kill the daemon, why should
we do up()?

> 
> This should synchronize with the timeout work for sure.
> But how about only schedule it after this?
> It does not may sense to start the timer during interrupt
> since the file may not even opened and it may take time
> to handle signals?
> 
> >
> >>  >
> >>  > +
> >>  >  }
> >>  >
> >>  >  static int fcopy_handle_handshake(u32 version)
> >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode,
> >>  > struct file *f)
> >>  >  	 */
> >>  >  	in_hand_shake = true;
> >>  >  	opened = false;
> >>  > +
> >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
> >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
> >>  > +			;
> >>
> >>  And this.
> >
> > Scenario 3):
> > When the daemon exits(e.g., SIGKILL received), if there is a
> > fcopy_work
> > pending (scheduled but not start to run yet), we should cancel the
> > work (as you suggested) and down() the semaphore, otherwise, the
> > obsolete message will be received by the next instance of the daemon.
> 
> Yes
> >
> >
> > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> >         schedule_delayed_work(&fcopy_work, 5*HZ);
> >         ----> if fcopy_release() is running on another vcpu, just
> > before the next line?
> >         fcopy_send_data();
> >
> > In this case, fcopy_release() can cancel fcopy_work, but
> > can't get the semaphore since it hasn't been up()-ed.
> > Hmm, in this case,   fcopy_send_data() will do up()  later, and we'll
> > buffer an obsolete message in the driver, and the message will be
> > fetched by the next instance of the daemon...
> >
> > Looks we need a spinlock here?
> 
> Unless fcopy_release() can wait for all data for current transation
> to be received. Spinlock won't help.
> 
> But an idea is let the daemon the handle such cases. E.g make sure the
> processing begins with START_COPY and end with COMPLETE/CANCEL_COPY.
> Drop all requests that does not start with START_COPY.
> 
> Thought?
Good idea.
I also think we should reinforce the concept of state machine in the
daemon code.

The daemon/driver communication has so many corner cases...

> >
> >>  >
> >>  > +		fcopy_respond_to_host(HV_E_FAIL);
> >>  > +	}
> >>  >  	return 0;
> >>  >  }
> >>  >
> >>  > --
> >>  > 1.9.1
> >>  >

 -- Dexuan

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
       [not found]           ` <F792CF86EFE20D4AB8064279AFBA51C613E66AAA@HKNPRD3002MB017.064d.mgd.msft.net >
@ 2014-12-01  8:30               ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-12-01  8:22 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang



On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  Sent: Friday, November 28, 2014 18:13 PM
>>  To: Dexuan Cui
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com> 
>> wrote:
>>  >>  -----Original Message-----
>>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  >>  Sent: Friday, November 28, 2014 14:47 PM
>>  >>  To: Dexuan Cui
>>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>>  >> driverdev-
>>  >>  devel@linuxdriverproject.org; olaf@aepfle.de; 
>> apw@canonical.com; KY
>>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on
>>  >> transfer
>>  >>  failure
>>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui 
>> <decui@microsoft.com>
>>  >> wrote:
>>  >>  > In the case the user-space daemon crashes, hangs or is 
>> killed, we
>>  >>  > need to down the semaphore, otherwise, after the daemon starts
>>  >> next
>>  >>  > time, the obsolete data in fcopy_transaction.message or
>>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  >
>>  >>  > Cc: Jason Wang <jasowang@redhat.com>
>>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>>  >>  > ---
>>  >>  >
>>  >>  > v2: I removed the "FCP" prefix as Greg asked.
>>  >>  >
>>  >>  >     I also updated the output message a little:
>>  >>  >     "FCP: failed to acquire the semaphore" -->
>>  >>  >     "can not acquire the semaphore: it is benign"
>>  >>  >
>>  >>  > v3: I added the code in fcopy_release() as Jason Wang 
>> suggested.
>>  >>  >     I removed the pr_debug (it isn't so meaningful)and added a
>>  >>  > comment instead.
>>  >>  >
>>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>>  >>  >  1 file changed, 19 insertions(+)
>>  >>  >
>>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>>  >>  > index 23b2ce2..faa6ba6 100644
>>  >>  > --- a/drivers/hv/hv_fcopy.c
>>  >>  > +++ b/drivers/hv/hv_fcopy.c
>>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct 
>> work_struct
>>  >>  > *dummy)
>>  >>  >  	 * process the pending transaction.
>>  >>  >  	 */
>>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
>>  >>  > +
>>  >>  > +	/* In the case the user-space daemon crashes, hangs or is
>>  >> killed, we
>>  >>  > +	 * need to down the semaphore, otherwise, after the daemon
>>  >> starts
>>  >>  > next
>>  >>  > +	 * time, the obsolete data in fcopy_transaction.message or
>>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  > +	 *
>>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very 
>> rare)?
>>  >>  > We're
>>  >>  > +	 * still OK, because we've reported the failure to the host.
>>  >>  > +	 */
>>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  > +		;
>>  >>
>>  >>  Sorry, I'm not quite understand how if () ; can help here.
>>  >>
>>  >>  Btw, a question not relate to this patch.
>>  >>
>>  >>  What happens if a daemon is resume from SIGSTOP and expires the
>>  >> check
>>  >>  here?
>>  > Hi Jason,
>>  > My idea is: here we need down_trylock(), but in case we can't get 
>> the
>>  > semaphore, it's OK anyway:
>>  >
>>  > Scenario 1):
>>  > 1.1: when the daemon is blocked on the pread(), the daemon 
>> receives
>>  > SIGSTOP;
>>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
>>  > 1.3.1: the driver reports the failure to the host user in 5s and
>>  > 1.3.2: the driver down()-es the semaphore;
>>  > 1.4: the daemon receives SIGCONT and it will be still blocked on 
>> the
>>  > pread().
>>  > Without the down_trylock(), in 1.4, the daemon can receive an
>>  > obsolete message.
>>  > NOTE: in this scenario, the daemon is not killed.
>>  >
>>  > Scenario 2):
>>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 and 
>> 1.3.2
>>  > and
>>  > do down() in fcopy_read(), it will receive the message but: the
>>  > driver has
>>  > reported the failure to the host user and the driver's 1.3.2 can't
>>  > get the
>>  > semaphore -- IMO this is acceptably OK, though in the VM, an
>>  > incomplete
>>  > file will be left there.
>>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
>>  > close(target_fd) before open()-ing a new one.
>>  
>>  Right, but how about the case when resuming from SIGSTOP but no 
>> timeout?
> Sorry, I don't understand this:
> if no timeout, fcopy_read() will get the semaphore and fcopy_write()
> will try to cancel fcopy_work.

Yes.
> 
> 
>>  Looks like in this case userspace() may wait in down_interruptible()
>>  until timeout. We probably need something like this:
>>  
>>          if (down_interruptible(&fcopy_transaction.read_sema)) {
>>                  up(&fcopy_transaction.read_sema);
>>                  return -EINTR;
>>          }
> until "timeout"?
> if the daemon can't get the semaphore, it can only be wake by a 
> signal(the
> daemon doesn't install handler, so by default most signals will kill 
> the daemon).
> In case a signal waking up the daemon doesn't kill the daemon, why 
> should
> we do up()?

True, no need since we do down_trylock() in release().

Btw, there's no EINTR handling in handling pread() return value,
may add such one which should be useful for something like debugging.

> 
> 
>>  
>>  This should synchronize with the timeout work for sure.
>>  But how about only schedule it after this?
>>  It does not may sense to start the timer during interrupt
>>  since the file may not even opened and it may take time
>>  to handle signals?
>>  
>>  >
>>  >>  >
>>  >>  > +
>>  >>  >  }
>>  >>  >
>>  >>  >  static int fcopy_handle_handshake(u32 version)
>>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode 
>> *inode,
>>  >>  > struct file *f)
>>  >>  >  	 */
>>  >>  >  	in_hand_shake = true;
>>  >>  >  	opened = false;
>>  >>  > +
>>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  > +			;
>>  >>
>>  >>  And this.
>>  >
>>  > Scenario 3):
>>  > When the daemon exits(e.g., SIGKILL received), if there is a
>>  > fcopy_work
>>  > pending (scheduled but not start to run yet), we should cancel the
>>  > work (as you suggested) and down() the semaphore, otherwise, the
>>  > obsolete message will be received by the next instance of the 
>> daemon.
>>  
>>  Yes
>>  >
>>  >
>>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>>  >         ----> if fcopy_release() is running on another vcpu, just
>>  > before the next line?
>>  >         fcopy_send_data();
>>  >
>>  > In this case, fcopy_release() can cancel fcopy_work, but
>>  > can't get the semaphore since it hasn't been up()-ed.
>>  > Hmm, in this case,   fcopy_send_data() will do up()  later, and 
>> we'll
>>  > buffer an obsolete message in the driver, and the message will be
>>  > fetched by the next instance of the daemon...
>>  >
>>  > Looks we need a spinlock here?
>>  
>>  Unless fcopy_release() can wait for all data for current transation
>>  to be received. Spinlock won't help.
>>  
>>  But an idea is let the daemon the handle such cases. E.g make sure 
>> the
>>  processing begins with START_COPY and end with COMPLETE/CANCEL_COPY.
>>  Drop all requests that does not start with START_COPY.
>>  
>>  Thought?
> Good idea.
> I also think we should reinforce the concept of state machine in the
> daemon code.

Yes, it needs.
> 
> 
> The daemon/driver communication has so many corner cases...

Looks so, let's first address the issue mentioned in this patch.

I don't have any more comments other than changing

 if(down_trylock(&fcopy_transaction.read_sema))
     ;

to 

down_trylock(&fcopy_transaction.read_sema);

Thanks
> 
>>  >
>>  >>  >
>>  >>  > +		fcopy_respond_to_host(HV_E_FAIL);
>>  >>  > +	}
>>  >>  >  	return 0;
>>  >>  >  }
>>  >>  >
>>  >>  > --
>>  >>  > 1.9.1
>>  >>  >
> 
>  -- Dexuan
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-01  8:30               ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-12-01  8:30 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw



On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  Sent: Friday, November 28, 2014 18:13 PM
>>  To: Dexuan Cui
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com> 
>> wrote:
>>  >>  -----Original Message-----
>>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  >>  Sent: Friday, November 28, 2014 14:47 PM
>>  >>  To: Dexuan Cui
>>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>>  >> driverdev-
>>  >>  devel@linuxdriverproject.org; olaf@aepfle.de; 
>> apw@canonical.com; KY
>>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on
>>  >> transfer
>>  >>  failure
>>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui 
>> <decui@microsoft.com>
>>  >> wrote:
>>  >>  > In the case the user-space daemon crashes, hangs or is 
>> killed, we
>>  >>  > need to down the semaphore, otherwise, after the daemon starts
>>  >> next
>>  >>  > time, the obsolete data in fcopy_transaction.message or
>>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  >
>>  >>  > Cc: Jason Wang <jasowang@redhat.com>
>>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>>  >>  > ---
>>  >>  >
>>  >>  > v2: I removed the "FCP" prefix as Greg asked.
>>  >>  >
>>  >>  >     I also updated the output message a little:
>>  >>  >     "FCP: failed to acquire the semaphore" -->
>>  >>  >     "can not acquire the semaphore: it is benign"
>>  >>  >
>>  >>  > v3: I added the code in fcopy_release() as Jason Wang 
>> suggested.
>>  >>  >     I removed the pr_debug (it isn't so meaningful)and added a
>>  >>  > comment instead.
>>  >>  >
>>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>>  >>  >  1 file changed, 19 insertions(+)
>>  >>  >
>>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>>  >>  > index 23b2ce2..faa6ba6 100644
>>  >>  > --- a/drivers/hv/hv_fcopy.c
>>  >>  > +++ b/drivers/hv/hv_fcopy.c
>>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct 
>> work_struct
>>  >>  > *dummy)
>>  >>  >  	 * process the pending transaction.
>>  >>  >  	 */
>>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
>>  >>  > +
>>  >>  > +	/* In the case the user-space daemon crashes, hangs or is
>>  >> killed, we
>>  >>  > +	 * need to down the semaphore, otherwise, after the daemon
>>  >> starts
>>  >>  > next
>>  >>  > +	 * time, the obsolete data in fcopy_transaction.message or
>>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  > +	 *
>>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very 
>> rare)?
>>  >>  > We're
>>  >>  > +	 * still OK, because we've reported the failure to the host.
>>  >>  > +	 */
>>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  > +		;
>>  >>
>>  >>  Sorry, I'm not quite understand how if () ; can help here.
>>  >>
>>  >>  Btw, a question not relate to this patch.
>>  >>
>>  >>  What happens if a daemon is resume from SIGSTOP and expires the
>>  >> check
>>  >>  here?
>>  > Hi Jason,
>>  > My idea is: here we need down_trylock(), but in case we can't get 
>> the
>>  > semaphore, it's OK anyway:
>>  >
>>  > Scenario 1):
>>  > 1.1: when the daemon is blocked on the pread(), the daemon 
>> receives
>>  > SIGSTOP;
>>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
>>  > 1.3.1: the driver reports the failure to the host user in 5s and
>>  > 1.3.2: the driver down()-es the semaphore;
>>  > 1.4: the daemon receives SIGCONT and it will be still blocked on 
>> the
>>  > pread().
>>  > Without the down_trylock(), in 1.4, the daemon can receive an
>>  > obsolete message.
>>  > NOTE: in this scenario, the daemon is not killed.
>>  >
>>  > Scenario 2):
>>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 and 
>> 1.3.2
>>  > and
>>  > do down() in fcopy_read(), it will receive the message but: the
>>  > driver has
>>  > reported the failure to the host user and the driver's 1.3.2 can't
>>  > get the
>>  > semaphore -- IMO this is acceptably OK, though in the VM, an
>>  > incomplete
>>  > file will be left there.
>>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
>>  > close(target_fd) before open()-ing a new one.
>>  
>>  Right, but how about the case when resuming from SIGSTOP but no 
>> timeout?
> Sorry, I don't understand this:
> if no timeout, fcopy_read() will get the semaphore and fcopy_write()
> will try to cancel fcopy_work.

Yes.
> 
> 
>>  Looks like in this case userspace() may wait in down_interruptible()
>>  until timeout. We probably need something like this:
>>  
>>          if (down_interruptible(&fcopy_transaction.read_sema)) {
>>                  up(&fcopy_transaction.read_sema);
>>                  return -EINTR;
>>          }
> until "timeout"?
> if the daemon can't get the semaphore, it can only be wake by a 
> signal(the
> daemon doesn't install handler, so by default most signals will kill 
> the daemon).
> In case a signal waking up the daemon doesn't kill the daemon, why 
> should
> we do up()?

True, no need since we do down_trylock() in release().

Btw, there's no EINTR handling in handling pread() return value,
may add such one which should be useful for something like debugging.

> 
> 
>>  
>>  This should synchronize with the timeout work for sure.
>>  But how about only schedule it after this?
>>  It does not may sense to start the timer during interrupt
>>  since the file may not even opened and it may take time
>>  to handle signals?
>>  
>>  >
>>  >>  >
>>  >>  > +
>>  >>  >  }
>>  >>  >
>>  >>  >  static int fcopy_handle_handshake(u32 version)
>>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode 
>> *inode,
>>  >>  > struct file *f)
>>  >>  >  	 */
>>  >>  >  	in_hand_shake = true;
>>  >>  >  	opened = false;
>>  >>  > +
>>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  > +			;
>>  >>
>>  >>  And this.
>>  >
>>  > Scenario 3):
>>  > When the daemon exits(e.g., SIGKILL received), if there is a
>>  > fcopy_work
>>  > pending (scheduled but not start to run yet), we should cancel the
>>  > work (as you suggested) and down() the semaphore, otherwise, the
>>  > obsolete message will be received by the next instance of the 
>> daemon.
>>  
>>  Yes
>>  >
>>  >
>>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>>  >         ----> if fcopy_release() is running on another vcpu, just
>>  > before the next line?
>>  >         fcopy_send_data();
>>  >
>>  > In this case, fcopy_release() can cancel fcopy_work, but
>>  > can't get the semaphore since it hasn't been up()-ed.
>>  > Hmm, in this case,   fcopy_send_data() will do up()  later, and 
>> we'll
>>  > buffer an obsolete message in the driver, and the message will be
>>  > fetched by the next instance of the daemon...
>>  >
>>  > Looks we need a spinlock here?
>>  
>>  Unless fcopy_release() can wait for all data for current transation
>>  to be received. Spinlock won't help.
>>  
>>  But an idea is let the daemon the handle such cases. E.g make sure 
>> the
>>  processing begins with START_COPY and end with COMPLETE/CANCEL_COPY.
>>  Drop all requests that does not start with START_COPY.
>>  
>>  Thought?
> Good idea.
> I also think we should reinforce the concept of state machine in the
> daemon code.

Yes, it needs.
> 
> 
> The daemon/driver communication has so many corner cases...

Looks so, let's first address the issue mentioned in this patch.

I don't have any more comments other than changing

 if(down_trylock(&fcopy_transaction.read_sema))
     ;

to 

down_trylock(&fcopy_transaction.read_sema);

Thanks
> 
>>  >
>>  >>  >
>>  >>  > +		fcopy_respond_to_host(HV_E_FAIL);
>>  >>  > +	}
>>  >>  >  	return 0;
>>  >>  >  }
>>  >>  >
>>  >>  > --
>>  >>  > 1.9.1
>>  >>  >
> 
>  -- Dexuan
> 

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-12-01  8:30               ` Jason Wang
@ 2014-12-01  9:47                 ` Dexuan Cui
  -1 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-12-01  9:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 10098 bytes --]

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Monday, December 1, 2014 16:23 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> wrote:
> >>  -----Original Message-----
> >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  Sent: Friday, November 28, 2014 18:13 PM
> >>  To: Dexuan Cui
> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> driverdev-
> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> >> transfer
> >>  failure
> >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com>
> >> wrote:
> >>  >>  -----Original Message-----
> >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  >>  Sent: Friday, November 28, 2014 14:47 PM
> >>  >>  To: Dexuan Cui
> >>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >>  >> driverdev-
> >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> >> apw@canonical.com; KY
> >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message
> >> on
> >>  >> transfer
> >>  >>  failure
> >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
> >> <decui@microsoft.com>
> >>  >> wrote:
> >>  >>  > In the case the user-space daemon crashes, hangs or is
> >> killed, we
> >>  >>  > need to down the semaphore, otherwise, after the daemon starts
> >>  >> next
> >>  >>  > time, the obsolete data in fcopy_transaction.message or
> >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  >
> >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
> >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
> >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >>  >>  > ---
> >>  >>  >
> >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
> >>  >>  >
> >>  >>  >     I also updated the output message a little:
> >>  >>  >     "FCP: failed to acquire the semaphore" -->
> >>  >>  >     "can not acquire the semaphore: it is benign"
> >>  >>  >
> >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> >> suggested.
> >>  >>  >     I removed the pr_debug (it isn't so meaningful)and added a
> >>  >>  > comment instead.
> >>  >>  >
> >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >>  >>  >  1 file changed, 19 insertions(+)
> >>  >>  >
> >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> >>  >>  > index 23b2ce2..faa6ba6 100644
> >>  >>  > --- a/drivers/hv/hv_fcopy.c
> >>  >>  > +++ b/drivers/hv/hv_fcopy.c
> >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
> >> work_struct
> >>  >>  > *dummy)
> >>  >>  >  	 * process the pending transaction.
> >>  >>  >  	 */
> >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> >>  >>  > +
> >>  >>  > +	/* In the case the user-space daemon crashes, hangs or is
> >>  >> killed, we
> >>  >>  > +	 * need to down the semaphore, otherwise, after the daemon
> >>  >> starts
> >>  >>  > next
> >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message or
> >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  > +	 *
> >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
> >> rare)?
> >>  >>  > We're
> >>  >>  > +	 * still OK, because we've reported the failure to the host.
> >>  >>  > +	 */
> >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  > +		;
> >>  >>
> >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> >>  >>
> >>  >>  Btw, a question not relate to this patch.
> >>  >>
> >>  >>  What happens if a daemon is resume from SIGSTOP and expires the
> >>  >> check
> >>  >>  here?
> >>  > Hi Jason,
> >>  > My idea is: here we need down_trylock(), but in case we can't get
> >> the
> >>  > semaphore, it's OK anyway:
> >>  >
> >>  > Scenario 1):
> >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> >> receives
> >>  > SIGSTOP;
> >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
> >>  > 1.3.1: the driver reports the failure to the host user in 5s and
> >>  > 1.3.2: the driver down()-es the semaphore;
> >>  > 1.4: the daemon receives SIGCONT and it will be still blocked on
> >> the
> >>  > pread().
> >>  > Without the down_trylock(), in 1.4, the daemon can receive an
> >>  > obsolete message.
> >>  > NOTE: in this scenario, the daemon is not killed.
> >>  >
> >>  > Scenario 2):
> >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 and
> >> 1.3.2
> >>  > and
> >>  > do down() in fcopy_read(), it will receive the message but: the
> >>  > driver has
> >>  > reported the failure to the host user and the driver's 1.3.2 can't
> >>  > get the
> >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
> >>  > incomplete
> >>  > file will be left there.
> >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
> >>  > close(target_fd) before open()-ing a new one.
> >>
> >>  Right, but how about the case when resuming from SIGSTOP but no
> >> timeout?
> > Sorry, I don't understand this:
> > if no timeout, fcopy_read() will get the semaphore and fcopy_write()
> > will try to cancel fcopy_work.
> 
> Yes.
> >
> >
> >>  Looks like in this case userspace() may wait in down_interruptible()
> >>  until timeout. We probably need something like this:
> >>
> >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> >>                  up(&fcopy_transaction.read_sema);
> >>                  return -EINTR;
> >>          }
> > until "timeout"?
> > if the daemon can't get the semaphore, it can only be wake by a
> > signal(the
> > daemon doesn't install handler, so by default most signals will kill
> > the daemon).
> > In case a signal waking up the daemon doesn't kill the daemon, why
> > should
> > we do up()?
> 
> True, no need since we do down_trylock() in release().
> 
> Btw, there's no EINTR handling in handling pread() return value,
> may add such one which should be useful for something like debugging.
> 
> >
> >
> >>
> >>  This should synchronize with the timeout work for sure.
> >>  But how about only schedule it after this?
> >>  It does not may sense to start the timer during interrupt
> >>  since the file may not even opened and it may take time
> >>  to handle signals?
> >>
> >>  >
> >>  >>  >
> >>  >>  > +
> >>  >>  >  }
> >>  >>  >
> >>  >>  >  static int fcopy_handle_handshake(u32 version)
> >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
> >> *inode,
> >>  >>  > struct file *f)
> >>  >>  >  	 */
> >>  >>  >  	in_hand_shake = true;
> >>  >>  >  	opened = false;
> >>  >>  > +
> >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
> >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  > +			;
> >>  >>
> >>  >>  And this.
> >>  >
> >>  > Scenario 3):
> >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> >>  > fcopy_work
> >>  > pending (scheduled but not start to run yet), we should cancel the
> >>  > work (as you suggested) and down() the semaphore, otherwise, the
> >>  > obsolete message will be received by the next instance of the
> >> daemon.
> >>
> >>  Yes
> >>  >
> >>  >
> >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> >>  >         ----> if fcopy_release() is running on another vcpu, just
> >>  > before the next line?
> >>  >         fcopy_send_data();
> >>  >
> >>  > In this case, fcopy_release() can cancel fcopy_work, but
> >>  > can't get the semaphore since it hasn't been up()-ed.
> >>  > Hmm, in this case,   fcopy_send_data() will do up()  later, and
> >> we'll
> >>  > buffer an obsolete message in the driver, and the message will be
> >>  > fetched by the next instance of the daemon...
> >>  >
> >>  > Looks we need a spinlock here?
> >>
> >>  Unless fcopy_release() can wait for all data for current transation
> >>  to be received. Spinlock won't help.
> >>
> >>  But an idea is let the daemon the handle such cases. E.g make sure
> >> the
> >>  processing begins with START_COPY and end with
> COMPLETE/CANCEL_COPY.
> >>  Drop all requests that does not start with START_COPY.
> >>
> >>  Thought?
> > Good idea.
> > I also think we should reinforce the concept of state machine in the
> > daemon code.
> 
> Yes, it needs.
I agree.
Obviously we can do something to make the daemon/driver work better
in the corner cases.

> >
> > The daemon/driver communication has so many corner cases...
> 
> Looks so, let's first address the issue mentioned in this patch.
OK.

> I don't have any more comments other than changing
> 
>  if(down_trylock(&fcopy_transaction.read_sema))
>      ;
> 
> to
> 
> down_trylock(&fcopy_transaction.read_sema);
Hi Jason,
This is to address Vitaly's comment in the bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5

down_trylock(&fcopy_transaction.read_sema) will

"
produces the following compile warning:
drivers/hv/hv_fcopy.c: In function â€˜fcopy_work_funcâ€™:
drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of â€˜down_trylockâ€™, declared with attribute warn_unused_result [-Wunused-result]
  (void)down_trylock(&fcopy_transaction.read_sema);
"

Actually I personally don't care about the warning, because we only
see it when we run some kind of code checker program.  :-)

I can change my v3  to the "normal" style you prefer, if
there is no strong objection from Vitaly?

Thanks,
-- Dexuan

ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-01  9:47                 ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-12-01  9:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Monday, December 1, 2014 16:23 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> wrote:
> >>  -----Original Message-----
> >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  Sent: Friday, November 28, 2014 18:13 PM
> >>  To: Dexuan Cui
> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> driverdev-
> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> >> transfer
> >>  failure
> >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com>
> >> wrote:
> >>  >>  -----Original Message-----
> >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  >>  Sent: Friday, November 28, 2014 14:47 PM
> >>  >>  To: Dexuan Cui
> >>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >>  >> driverdev-
> >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> >> apw@canonical.com; KY
> >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message
> >> on
> >>  >> transfer
> >>  >>  failure
> >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
> >> <decui@microsoft.com>
> >>  >> wrote:
> >>  >>  > In the case the user-space daemon crashes, hangs or is
> >> killed, we
> >>  >>  > need to down the semaphore, otherwise, after the daemon starts
> >>  >> next
> >>  >>  > time, the obsolete data in fcopy_transaction.message or
> >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  >
> >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
> >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
> >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >>  >>  > ---
> >>  >>  >
> >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
> >>  >>  >
> >>  >>  >     I also updated the output message a little:
> >>  >>  >     "FCP: failed to acquire the semaphore" -->
> >>  >>  >     "can not acquire the semaphore: it is benign"
> >>  >>  >
> >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> >> suggested.
> >>  >>  >     I removed the pr_debug (it isn't so meaningful)and added a
> >>  >>  > comment instead.
> >>  >>  >
> >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >>  >>  >  1 file changed, 19 insertions(+)
> >>  >>  >
> >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> >>  >>  > index 23b2ce2..faa6ba6 100644
> >>  >>  > --- a/drivers/hv/hv_fcopy.c
> >>  >>  > +++ b/drivers/hv/hv_fcopy.c
> >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
> >> work_struct
> >>  >>  > *dummy)
> >>  >>  >  	 * process the pending transaction.
> >>  >>  >  	 */
> >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> >>  >>  > +
> >>  >>  > +	/* In the case the user-space daemon crashes, hangs or is
> >>  >> killed, we
> >>  >>  > +	 * need to down the semaphore, otherwise, after the daemon
> >>  >> starts
> >>  >>  > next
> >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message or
> >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  > +	 *
> >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
> >> rare)?
> >>  >>  > We're
> >>  >>  > +	 * still OK, because we've reported the failure to the host.
> >>  >>  > +	 */
> >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  > +		;
> >>  >>
> >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> >>  >>
> >>  >>  Btw, a question not relate to this patch.
> >>  >>
> >>  >>  What happens if a daemon is resume from SIGSTOP and expires the
> >>  >> check
> >>  >>  here?
> >>  > Hi Jason,
> >>  > My idea is: here we need down_trylock(), but in case we can't get
> >> the
> >>  > semaphore, it's OK anyway:
> >>  >
> >>  > Scenario 1):
> >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> >> receives
> >>  > SIGSTOP;
> >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
> >>  > 1.3.1: the driver reports the failure to the host user in 5s and
> >>  > 1.3.2: the driver down()-es the semaphore;
> >>  > 1.4: the daemon receives SIGCONT and it will be still blocked on
> >> the
> >>  > pread().
> >>  > Without the down_trylock(), in 1.4, the daemon can receive an
> >>  > obsolete message.
> >>  > NOTE: in this scenario, the daemon is not killed.
> >>  >
> >>  > Scenario 2):
> >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 and
> >> 1.3.2
> >>  > and
> >>  > do down() in fcopy_read(), it will receive the message but: the
> >>  > driver has
> >>  > reported the failure to the host user and the driver's 1.3.2 can't
> >>  > get the
> >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
> >>  > incomplete
> >>  > file will be left there.
> >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
> >>  > close(target_fd) before open()-ing a new one.
> >>
> >>  Right, but how about the case when resuming from SIGSTOP but no
> >> timeout?
> > Sorry, I don't understand this:
> > if no timeout, fcopy_read() will get the semaphore and fcopy_write()
> > will try to cancel fcopy_work.
> 
> Yes.
> >
> >
> >>  Looks like in this case userspace() may wait in down_interruptible()
> >>  until timeout. We probably need something like this:
> >>
> >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> >>                  up(&fcopy_transaction.read_sema);
> >>                  return -EINTR;
> >>          }
> > until "timeout"?
> > if the daemon can't get the semaphore, it can only be wake by a
> > signal(the
> > daemon doesn't install handler, so by default most signals will kill
> > the daemon).
> > In case a signal waking up the daemon doesn't kill the daemon, why
> > should
> > we do up()?
> 
> True, no need since we do down_trylock() in release().
> 
> Btw, there's no EINTR handling in handling pread() return value,
> may add such one which should be useful for something like debugging.
> 
> >
> >
> >>
> >>  This should synchronize with the timeout work for sure.
> >>  But how about only schedule it after this?
> >>  It does not may sense to start the timer during interrupt
> >>  since the file may not even opened and it may take time
> >>  to handle signals?
> >>
> >>  >
> >>  >>  >
> >>  >>  > +
> >>  >>  >  }
> >>  >>  >
> >>  >>  >  static int fcopy_handle_handshake(u32 version)
> >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
> >> *inode,
> >>  >>  > struct file *f)
> >>  >>  >  	 */
> >>  >>  >  	in_hand_shake = true;
> >>  >>  >  	opened = false;
> >>  >>  > +
> >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
> >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  > +			;
> >>  >>
> >>  >>  And this.
> >>  >
> >>  > Scenario 3):
> >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> >>  > fcopy_work
> >>  > pending (scheduled but not start to run yet), we should cancel the
> >>  > work (as you suggested) and down() the semaphore, otherwise, the
> >>  > obsolete message will be received by the next instance of the
> >> daemon.
> >>
> >>  Yes
> >>  >
> >>  >
> >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> >>  >         ----> if fcopy_release() is running on another vcpu, just
> >>  > before the next line?
> >>  >         fcopy_send_data();
> >>  >
> >>  > In this case, fcopy_release() can cancel fcopy_work, but
> >>  > can't get the semaphore since it hasn't been up()-ed.
> >>  > Hmm, in this case,   fcopy_send_data() will do up()  later, and
> >> we'll
> >>  > buffer an obsolete message in the driver, and the message will be
> >>  > fetched by the next instance of the daemon...
> >>  >
> >>  > Looks we need a spinlock here?
> >>
> >>  Unless fcopy_release() can wait for all data for current transation
> >>  to be received. Spinlock won't help.
> >>
> >>  But an idea is let the daemon the handle such cases. E.g make sure
> >> the
> >>  processing begins with START_COPY and end with
> COMPLETE/CANCEL_COPY.
> >>  Drop all requests that does not start with START_COPY.
> >>
> >>  Thought?
> > Good idea.
> > I also think we should reinforce the concept of state machine in the
> > daemon code.
> 
> Yes, it needs.
I agree.
Obviously we can do something to make the daemon/driver work better
in the corner cases.

> >
> > The daemon/driver communication has so many corner cases...
> 
> Looks so, let's first address the issue mentioned in this patch.
OK.

> I don't have any more comments other than changing
> 
>  if(down_trylock(&fcopy_transaction.read_sema))
>      ;
> 
> to
> 
> down_trylock(&fcopy_transaction.read_sema);
Hi Jason,
This is to address Vitaly's comment in the bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5

down_trylock(&fcopy_transaction.read_sema) will

"
produces the following compile warning:
drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of ‘down_trylock’, declared with attribute warn_unused_result [-Wunused-result]
  (void)down_trylock(&fcopy_transaction.read_sema);
"

Actually I personally don't care about the warning, because we only
see it when we run some kind of code checker program.  :-)

I can change my v3  to the "normal" style you prefer, if
there is no strong objection from Vitaly?

Thanks,
-- Dexuan

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-12-01  9:47                 ` Dexuan Cui
@ 2014-12-01  9:55                   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 34+ messages in thread
From: Vitaly Kuznetsov @ 2014-12-01  9:55 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Jason Wang, gregkh, linux-kernel, driverdev-devel, olaf, apw,
	KY Srinivasan, Haiyang Zhang

Dexuan Cui <decui@microsoft.com> writes:

>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Monday, December 1, 2014 16:23 PM
>> To: Dexuan Cui
>> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
>> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
>> failure
>> On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> wrote:
>> >>  -----Original Message-----
>> >>  From: Jason Wang [mailto:jasowang@redhat.com]
>> >>  Sent: Friday, November 28, 2014 18:13 PM
>> >>  To: Dexuan Cui
>> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>> >> driverdev-
>> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>> >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
>> >> transfer
>> >>  failure
>> >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com>
>> >> wrote:
>> >>  >>  -----Original Message-----
>> >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>> >>  >>  Sent: Friday, November 28, 2014 14:47 PM
>> >>  >>  To: Dexuan Cui
>> >>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>> >>  >> driverdev-
>> >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
>> >> apw@canonical.com; KY
>> >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>> >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message
>> >> on
>> >>  >> transfer
>> >>  >>  failure
>> >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
>> >> <decui@microsoft.com>
>> >>  >> wrote:
>> >>  >>  > In the case the user-space daemon crashes, hangs or is
>> >> killed, we
>> >>  >>  > need to down the semaphore, otherwise, after the daemon starts
>> >>  >> next
>> >>  >>  > time, the obsolete data in fcopy_transaction.message or
>> >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
>> >>  >>  >
>> >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
>> >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>> >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>> >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>> >>  >>  > ---
>> >>  >>  >
>> >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
>> >>  >>  >
>> >>  >>  >     I also updated the output message a little:
>> >>  >>  >     "FCP: failed to acquire the semaphore" -->
>> >>  >>  >     "can not acquire the semaphore: it is benign"
>> >>  >>  >
>> >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
>> >> suggested.
>> >>  >>  >     I removed the pr_debug (it isn't so meaningful)and added a
>> >>  >>  > comment instead.
>> >>  >>  >
>> >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>> >>  >>  >  1 file changed, 19 insertions(+)
>> >>  >>  >
>> >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>> >>  >>  > index 23b2ce2..faa6ba6 100644
>> >>  >>  > --- a/drivers/hv/hv_fcopy.c
>> >>  >>  > +++ b/drivers/hv/hv_fcopy.c
>> >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
>> >> work_struct
>> >>  >>  > *dummy)
>> >>  >>  >  	 * process the pending transaction.
>> >>  >>  >  	 */
>> >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
>> >>  >>  > +
>> >>  >>  > +	/* In the case the user-space daemon crashes, hangs or is
>> >>  >> killed, we
>> >>  >>  > +	 * need to down the semaphore, otherwise, after the daemon
>> >>  >> starts
>> >>  >>  > next
>> >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message or
>> >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>> >>  >>  > +	 *
>> >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
>> >> rare)?
>> >>  >>  > We're
>> >>  >>  > +	 * still OK, because we've reported the failure to the host.
>> >>  >>  > +	 */
>> >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>> >>  >>  > +		;
>> >>  >>
>> >>  >>  Sorry, I'm not quite understand how if () ; can help here.
>> >>  >>
>> >>  >>  Btw, a question not relate to this patch.
>> >>  >>
>> >>  >>  What happens if a daemon is resume from SIGSTOP and expires the
>> >>  >> check
>> >>  >>  here?
>> >>  > Hi Jason,
>> >>  > My idea is: here we need down_trylock(), but in case we can't get
>> >> the
>> >>  > semaphore, it's OK anyway:
>> >>  >
>> >>  > Scenario 1):
>> >>  > 1.1: when the daemon is blocked on the pread(), the daemon
>> >> receives
>> >>  > SIGSTOP;
>> >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
>> >>  > 1.3.1: the driver reports the failure to the host user in 5s and
>> >>  > 1.3.2: the driver down()-es the semaphore;
>> >>  > 1.4: the daemon receives SIGCONT and it will be still blocked on
>> >> the
>> >>  > pread().
>> >>  > Without the down_trylock(), in 1.4, the daemon can receive an
>> >>  > obsolete message.
>> >>  > NOTE: in this scenario, the daemon is not killed.
>> >>  >
>> >>  > Scenario 2):
>> >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 and
>> >> 1.3.2
>> >>  > and
>> >>  > do down() in fcopy_read(), it will receive the message but: the
>> >>  > driver has
>> >>  > reported the failure to the host user and the driver's 1.3.2 can't
>> >>  > get the
>> >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
>> >>  > incomplete
>> >>  > file will be left there.
>> >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
>> >>  > close(target_fd) before open()-ing a new one.
>> >>
>> >>  Right, but how about the case when resuming from SIGSTOP but no
>> >> timeout?
>> > Sorry, I don't understand this:
>> > if no timeout, fcopy_read() will get the semaphore and fcopy_write()
>> > will try to cancel fcopy_work.
>> 
>> Yes.
>> >
>> >
>> >>  Looks like in this case userspace() may wait in down_interruptible()
>> >>  until timeout. We probably need something like this:
>> >>
>> >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
>> >>                  up(&fcopy_transaction.read_sema);
>> >>                  return -EINTR;
>> >>          }
>> > until "timeout"?
>> > if the daemon can't get the semaphore, it can only be wake by a
>> > signal(the
>> > daemon doesn't install handler, so by default most signals will kill
>> > the daemon).
>> > In case a signal waking up the daemon doesn't kill the daemon, why
>> > should
>> > we do up()?
>> 
>> True, no need since we do down_trylock() in release().
>> 
>> Btw, there's no EINTR handling in handling pread() return value,
>> may add such one which should be useful for something like debugging.
>> 
>> >
>> >
>> >>
>> >>  This should synchronize with the timeout work for sure.
>> >>  But how about only schedule it after this?
>> >>  It does not may sense to start the timer during interrupt
>> >>  since the file may not even opened and it may take time
>> >>  to handle signals?
>> >>
>> >>  >
>> >>  >>  >
>> >>  >>  > +
>> >>  >>  >  }
>> >>  >>  >
>> >>  >>  >  static int fcopy_handle_handshake(u32 version)
>> >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
>> >> *inode,
>> >>  >>  > struct file *f)
>> >>  >>  >  	 */
>> >>  >>  >  	in_hand_shake = true;
>> >>  >>  >  	opened = false;
>> >>  >>  > +
>> >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>> >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>> >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>> >>  >>  > +			;
>> >>  >>
>> >>  >>  And this.
>> >>  >
>> >>  > Scenario 3):
>> >>  > When the daemon exits(e.g., SIGKILL received), if there is a
>> >>  > fcopy_work
>> >>  > pending (scheduled but not start to run yet), we should cancel the
>> >>  > work (as you suggested) and down() the semaphore, otherwise, the
>> >>  > obsolete message will be received by the next instance of the
>> >> daemon.
>> >>
>> >>  Yes
>> >>  >
>> >>  >
>> >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>> >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>> >>  >         ----> if fcopy_release() is running on another vcpu, just
>> >>  > before the next line?
>> >>  >         fcopy_send_data();
>> >>  >
>> >>  > In this case, fcopy_release() can cancel fcopy_work, but
>> >>  > can't get the semaphore since it hasn't been up()-ed.
>> >>  > Hmm, in this case,   fcopy_send_data() will do up()  later, and
>> >> we'll
>> >>  > buffer an obsolete message in the driver, and the message will be
>> >>  > fetched by the next instance of the daemon...
>> >>  >
>> >>  > Looks we need a spinlock here?
>> >>
>> >>  Unless fcopy_release() can wait for all data for current transation
>> >>  to be received. Spinlock won't help.
>> >>
>> >>  But an idea is let the daemon the handle such cases. E.g make sure
>> >> the
>> >>  processing begins with START_COPY and end with
>> COMPLETE/CANCEL_COPY.
>> >>  Drop all requests that does not start with START_COPY.
>> >>
>> >>  Thought?
>> > Good idea.
>> > I also think we should reinforce the concept of state machine in the
>> > daemon code.
>> 
>> Yes, it needs.
> I agree.
> Obviously we can do something to make the daemon/driver work better
> in the corner cases.
>
>> >
>> > The daemon/driver communication has so many corner cases...
>> 
>> Looks so, let's first address the issue mentioned in this patch.
> OK.
>
>> I don't have any more comments other than changing
>> 
>>  if(down_trylock(&fcopy_transaction.read_sema))
>>      ;
>> 
>> to
>> 
>> down_trylock(&fcopy_transaction.read_sema);
> Hi Jason,
> This is to address Vitaly's comment in the bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
>
> down_trylock(&fcopy_transaction.read_sema) will
>
> "
> produces the following compile warning:
> drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
> drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of ‘down_trylock’, declared with attribute warn_unused_result [-Wunused-result]
>   (void)down_trylock(&fcopy_transaction.read_sema);
> "
>
> Actually I personally don't care about the warning, because we only
> see it when we run some kind of code checker program.  :-)
>
> I can change my v3  to the "normal" style you prefer, if
> there is no strong objection from Vitaly?

Well, I still prefer pr_debug() there: these are being compiled out on
non-debug builds. I'm not objecting to any solution here though, really
a minor issue.

>
> Thanks,
> -- Dexuan

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-01  9:55                   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 34+ messages in thread
From: Vitaly Kuznetsov @ 2014-12-01  9:55 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: olaf, gregkh, Jason Wang, driverdev-devel, linux-kernel, apw,
	Haiyang Zhang

Dexuan Cui <decui@microsoft.com> writes:

>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Monday, December 1, 2014 16:23 PM
>> To: Dexuan Cui
>> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
>> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
>> failure
>> On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> wrote:
>> >>  -----Original Message-----
>> >>  From: Jason Wang [mailto:jasowang@redhat.com]
>> >>  Sent: Friday, November 28, 2014 18:13 PM
>> >>  To: Dexuan Cui
>> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>> >> driverdev-
>> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>> >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
>> >> transfer
>> >>  failure
>> >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@microsoft.com>
>> >> wrote:
>> >>  >>  -----Original Message-----
>> >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>> >>  >>  Sent: Friday, November 28, 2014 14:47 PM
>> >>  >>  To: Dexuan Cui
>> >>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>> >>  >> driverdev-
>> >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
>> >> apw@canonical.com; KY
>> >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>> >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message
>> >> on
>> >>  >> transfer
>> >>  >>  failure
>> >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
>> >> <decui@microsoft.com>
>> >>  >> wrote:
>> >>  >>  > In the case the user-space daemon crashes, hangs or is
>> >> killed, we
>> >>  >>  > need to down the semaphore, otherwise, after the daemon starts
>> >>  >> next
>> >>  >>  > time, the obsolete data in fcopy_transaction.message or
>> >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
>> >>  >>  >
>> >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
>> >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>> >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>> >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>> >>  >>  > ---
>> >>  >>  >
>> >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
>> >>  >>  >
>> >>  >>  >     I also updated the output message a little:
>> >>  >>  >     "FCP: failed to acquire the semaphore" -->
>> >>  >>  >     "can not acquire the semaphore: it is benign"
>> >>  >>  >
>> >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
>> >> suggested.
>> >>  >>  >     I removed the pr_debug (it isn't so meaningful)and added a
>> >>  >>  > comment instead.
>> >>  >>  >
>> >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>> >>  >>  >  1 file changed, 19 insertions(+)
>> >>  >>  >
>> >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>> >>  >>  > index 23b2ce2..faa6ba6 100644
>> >>  >>  > --- a/drivers/hv/hv_fcopy.c
>> >>  >>  > +++ b/drivers/hv/hv_fcopy.c
>> >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
>> >> work_struct
>> >>  >>  > *dummy)
>> >>  >>  >  	 * process the pending transaction.
>> >>  >>  >  	 */
>> >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
>> >>  >>  > +
>> >>  >>  > +	/* In the case the user-space daemon crashes, hangs or is
>> >>  >> killed, we
>> >>  >>  > +	 * need to down the semaphore, otherwise, after the daemon
>> >>  >> starts
>> >>  >>  > next
>> >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message or
>> >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>> >>  >>  > +	 *
>> >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
>> >> rare)?
>> >>  >>  > We're
>> >>  >>  > +	 * still OK, because we've reported the failure to the host.
>> >>  >>  > +	 */
>> >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>> >>  >>  > +		;
>> >>  >>
>> >>  >>  Sorry, I'm not quite understand how if () ; can help here.
>> >>  >>
>> >>  >>  Btw, a question not relate to this patch.
>> >>  >>
>> >>  >>  What happens if a daemon is resume from SIGSTOP and expires the
>> >>  >> check
>> >>  >>  here?
>> >>  > Hi Jason,
>> >>  > My idea is: here we need down_trylock(), but in case we can't get
>> >> the
>> >>  > semaphore, it's OK anyway:
>> >>  >
>> >>  > Scenario 1):
>> >>  > 1.1: when the daemon is blocked on the pread(), the daemon
>> >> receives
>> >>  > SIGSTOP;
>> >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
>> >>  > 1.3.1: the driver reports the failure to the host user in 5s and
>> >>  > 1.3.2: the driver down()-es the semaphore;
>> >>  > 1.4: the daemon receives SIGCONT and it will be still blocked on
>> >> the
>> >>  > pread().
>> >>  > Without the down_trylock(), in 1.4, the daemon can receive an
>> >>  > obsolete message.
>> >>  > NOTE: in this scenario, the daemon is not killed.
>> >>  >
>> >>  > Scenario 2):
>> >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 and
>> >> 1.3.2
>> >>  > and
>> >>  > do down() in fcopy_read(), it will receive the message but: the
>> >>  > driver has
>> >>  > reported the failure to the host user and the driver's 1.3.2 can't
>> >>  > get the
>> >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
>> >>  > incomplete
>> >>  > file will be left there.
>> >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
>> >>  > close(target_fd) before open()-ing a new one.
>> >>
>> >>  Right, but how about the case when resuming from SIGSTOP but no
>> >> timeout?
>> > Sorry, I don't understand this:
>> > if no timeout, fcopy_read() will get the semaphore and fcopy_write()
>> > will try to cancel fcopy_work.
>> 
>> Yes.
>> >
>> >
>> >>  Looks like in this case userspace() may wait in down_interruptible()
>> >>  until timeout. We probably need something like this:
>> >>
>> >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
>> >>                  up(&fcopy_transaction.read_sema);
>> >>                  return -EINTR;
>> >>          }
>> > until "timeout"?
>> > if the daemon can't get the semaphore, it can only be wake by a
>> > signal(the
>> > daemon doesn't install handler, so by default most signals will kill
>> > the daemon).
>> > In case a signal waking up the daemon doesn't kill the daemon, why
>> > should
>> > we do up()?
>> 
>> True, no need since we do down_trylock() in release().
>> 
>> Btw, there's no EINTR handling in handling pread() return value,
>> may add such one which should be useful for something like debugging.
>> 
>> >
>> >
>> >>
>> >>  This should synchronize with the timeout work for sure.
>> >>  But how about only schedule it after this?
>> >>  It does not may sense to start the timer during interrupt
>> >>  since the file may not even opened and it may take time
>> >>  to handle signals?
>> >>
>> >>  >
>> >>  >>  >
>> >>  >>  > +
>> >>  >>  >  }
>> >>  >>  >
>> >>  >>  >  static int fcopy_handle_handshake(u32 version)
>> >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
>> >> *inode,
>> >>  >>  > struct file *f)
>> >>  >>  >  	 */
>> >>  >>  >  	in_hand_shake = true;
>> >>  >>  >  	opened = false;
>> >>  >>  > +
>> >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>> >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>> >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>> >>  >>  > +			;
>> >>  >>
>> >>  >>  And this.
>> >>  >
>> >>  > Scenario 3):
>> >>  > When the daemon exits(e.g., SIGKILL received), if there is a
>> >>  > fcopy_work
>> >>  > pending (scheduled but not start to run yet), we should cancel the
>> >>  > work (as you suggested) and down() the semaphore, otherwise, the
>> >>  > obsolete message will be received by the next instance of the
>> >> daemon.
>> >>
>> >>  Yes
>> >>  >
>> >>  >
>> >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>> >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>> >>  >         ----> if fcopy_release() is running on another vcpu, just
>> >>  > before the next line?
>> >>  >         fcopy_send_data();
>> >>  >
>> >>  > In this case, fcopy_release() can cancel fcopy_work, but
>> >>  > can't get the semaphore since it hasn't been up()-ed.
>> >>  > Hmm, in this case,   fcopy_send_data() will do up()  later, and
>> >> we'll
>> >>  > buffer an obsolete message in the driver, and the message will be
>> >>  > fetched by the next instance of the daemon...
>> >>  >
>> >>  > Looks we need a spinlock here?
>> >>
>> >>  Unless fcopy_release() can wait for all data for current transation
>> >>  to be received. Spinlock won't help.
>> >>
>> >>  But an idea is let the daemon the handle such cases. E.g make sure
>> >> the
>> >>  processing begins with START_COPY and end with
>> COMPLETE/CANCEL_COPY.
>> >>  Drop all requests that does not start with START_COPY.
>> >>
>> >>  Thought?
>> > Good idea.
>> > I also think we should reinforce the concept of state machine in the
>> > daemon code.
>> 
>> Yes, it needs.
> I agree.
> Obviously we can do something to make the daemon/driver work better
> in the corner cases.
>
>> >
>> > The daemon/driver communication has so many corner cases...
>> 
>> Looks so, let's first address the issue mentioned in this patch.
> OK.
>
>> I don't have any more comments other than changing
>> 
>>  if(down_trylock(&fcopy_transaction.read_sema))
>>      ;
>> 
>> to
>> 
>> down_trylock(&fcopy_transaction.read_sema);
> Hi Jason,
> This is to address Vitaly's comment in the bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
>
> down_trylock(&fcopy_transaction.read_sema) will
>
> "
> produces the following compile warning:
> drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
> drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of ‘down_trylock’, declared with attribute warn_unused_result [-Wunused-result]
>   (void)down_trylock(&fcopy_transaction.read_sema);
> "
>
> Actually I personally don't care about the warning, because we only
> see it when we run some kind of code checker program.  :-)
>
> I can change my v3  to the "normal" style you prefer, if
> there is no strong objection from Vitaly?

Well, I still prefer pr_debug() there: these are being compiled out on
non-debug builds. I'm not objecting to any solution here though, really
a minor issue.

>
> Thanks,
> -- Dexuan

-- 
  Vitaly
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
       [not found]                 ` <F792CF86EFE20D4AB8064279AFBA51C613E691B8@HKNPRD3002MB017.064d.mgd.msft.net >
@ 2014-12-01 10:26                     ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-12-01 10:18 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang



On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  Sent: Monday, December 1, 2014 16:23 PM
>>  To: Dexuan Cui
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> 
>> wrote:
>>  >>  -----Original Message-----
>>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  >>  Sent: Friday, November 28, 2014 18:13 PM
>>  >>  To: Dexuan Cui
>>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>>  >> driverdev-
>>  >>  devel@linuxdriverproject.org; olaf@aepfle.de; 
>> apw@canonical.com; KY
>>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on
>>  >> transfer
>>  >>  failure
>>  >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui 
>> <decui@microsoft.com>
>>  >> wrote:
>>  >>  >>  -----Original Message-----
>>  >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  >>  >>  Sent: Friday, November 28, 2014 14:47 PM
>>  >>  >>  To: Dexuan Cui
>>  >>  >>  Cc: gregkh@linuxfoundation.org; 
>> linux-kernel@vger.kernel.org;
>>  >>  >> driverdev-
>>  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
>>  >> apw@canonical.com; KY
>>  >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete 
>> message
>>  >> on
>>  >>  >> transfer
>>  >>  >>  failure
>>  >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
>>  >> <decui@microsoft.com>
>>  >>  >> wrote:
>>  >>  >>  > In the case the user-space daemon crashes, hangs or is
>>  >> killed, we
>>  >>  >>  > need to down the semaphore, otherwise, after the daemon 
>> starts
>>  >>  >> next
>>  >>  >>  > time, the obsolete data in fcopy_transaction.message or
>>  >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  >>  >
>>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
>>  >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>>  >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>>  >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>>  >>  >>  > ---
>>  >>  >>  >
>>  >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
>>  >>  >>  >
>>  >>  >>  >     I also updated the output message a little:
>>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
>>  >>  >>  >     "can not acquire the semaphore: it is benign"
>>  >>  >>  >
>>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
>>  >> suggested.
>>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and 
>> added a
>>  >>  >>  > comment instead.
>>  >>  >>  >
>>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>>  >>  >>  >  1 file changed, 19 insertions(+)
>>  >>  >>  >
>>  >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>>  >>  >>  > index 23b2ce2..faa6ba6 100644
>>  >>  >>  > --- a/drivers/hv/hv_fcopy.c
>>  >>  >>  > +++ b/drivers/hv/hv_fcopy.c
>>  >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
>>  >> work_struct
>>  >>  >>  > *dummy)
>>  >>  >>  >  	 * process the pending transaction.
>>  >>  >>  >  	 */
>>  >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
>>  >>  >>  > +
>>  >>  >>  > +	/* In the case the user-space daemon crashes, hangs or 
>> is
>>  >>  >> killed, we
>>  >>  >>  > +	 * need to down the semaphore, otherwise, after the 
>> daemon
>>  >>  >> starts
>>  >>  >>  > next
>>  >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message 
>> or
>>  >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  >>  > +	 *
>>  >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
>>  >> rare)?
>>  >>  >>  > We're
>>  >>  >>  > +	 * still OK, because we've reported the failure to the 
>> host.
>>  >>  >>  > +	 */
>>  >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  >>  > +		;
>>  >>  >>
>>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
>>  >>  >>
>>  >>  >>  Btw, a question not relate to this patch.
>>  >>  >>
>>  >>  >>  What happens if a daemon is resume from SIGSTOP and expires 
>> the
>>  >>  >> check
>>  >>  >>  here?
>>  >>  > Hi Jason,
>>  >>  > My idea is: here we need down_trylock(), but in case we can't 
>> get
>>  >> the
>>  >>  > semaphore, it's OK anyway:
>>  >>  >
>>  >>  > Scenario 1):
>>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
>>  >> receives
>>  >>  > SIGSTOP;
>>  >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
>>  >>  > 1.3.1: the driver reports the failure to the host user in 5s 
>> and
>>  >>  > 1.3.2: the driver down()-es the semaphore;
>>  >>  > 1.4: the daemon receives SIGCONT and it will be still blocked 
>> on
>>  >> the
>>  >>  > pread().
>>  >>  > Without the down_trylock(), in 1.4, the daemon can receive an
>>  >>  > obsolete message.
>>  >>  > NOTE: in this scenario, the daemon is not killed.
>>  >>  >
>>  >>  > Scenario 2):
>>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 
>> and
>>  >> 1.3.2
>>  >>  > and
>>  >>  > do down() in fcopy_read(), it will receive the message but: 
>> the
>>  >>  > driver has
>>  >>  > reported the failure to the host user and the driver's 1.3.2 
>> can't
>>  >>  > get the
>>  >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
>>  >>  > incomplete
>>  >>  > file will be left there.
>>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
>>  >>  > close(target_fd) before open()-ing a new one.
>>  >>
>>  >>  Right, but how about the case when resuming from SIGSTOP but no
>>  >> timeout?
>>  > Sorry, I don't understand this:
>>  > if no timeout, fcopy_read() will get the semaphore and 
>> fcopy_write()
>>  > will try to cancel fcopy_work.
>>  
>>  Yes.
>>  >
>>  >
>>  >>  Looks like in this case userspace() may wait in 
>> down_interruptible()
>>  >>  until timeout. We probably need something like this:
>>  >>
>>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
>>  >>                  up(&fcopy_transaction.read_sema);
>>  >>                  return -EINTR;
>>  >>          }
>>  > until "timeout"?
>>  > if the daemon can't get the semaphore, it can only be wake by a
>>  > signal(the
>>  > daemon doesn't install handler, so by default most signals will 
>> kill
>>  > the daemon).
>>  > In case a signal waking up the daemon doesn't kill the daemon, why
>>  > should
>>  > we do up()?
>>  
>>  True, no need since we do down_trylock() in release().
>>  
>>  Btw, there's no EINTR handling in handling pread() return value,
>>  may add such one which should be useful for something like 
>> debugging.
>>  
>>  >
>>  >
>>  >>
>>  >>  This should synchronize with the timeout work for sure.
>>  >>  But how about only schedule it after this?
>>  >>  It does not may sense to start the timer during interrupt
>>  >>  since the file may not even opened and it may take time
>>  >>  to handle signals?
>>  >>
>>  >>  >
>>  >>  >>  >
>>  >>  >>  > +
>>  >>  >>  >  }
>>  >>  >>  >
>>  >>  >>  >  static int fcopy_handle_handshake(u32 version)
>>  >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
>>  >> *inode,
>>  >>  >>  > struct file *f)
>>  >>  >>  >  	 */
>>  >>  >>  >  	in_hand_shake = true;
>>  >>  >>  >  	opened = false;
>>  >>  >>  > +
>>  >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>>  >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>>  >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  >>  > +			;
>>  >>  >>
>>  >>  >>  And this.
>>  >>  >
>>  >>  > Scenario 3):
>>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
>>  >>  > fcopy_work
>>  >>  > pending (scheduled but not start to run yet), we should 
>> cancel the
>>  >>  > work (as you suggested) and down() the semaphore, otherwise, 
>> the
>>  >>  > obsolete message will be received by the next instance of the
>>  >> daemon.
>>  >>
>>  >>  Yes
>>  >>  >
>>  >>  >
>>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>>  >>  >         ----> if fcopy_release() is running on another vcpu, 
>> just
>>  >>  > before the next line?
>>  >>  >         fcopy_send_data();
>>  >>  >
>>  >>  > In this case, fcopy_release() can cancel fcopy_work, but
>>  >>  > can't get the semaphore since it hasn't been up()-ed.
>>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later, 
>> and
>>  >> we'll
>>  >>  > buffer an obsolete message in the driver, and the message 
>> will be
>>  >>  > fetched by the next instance of the daemon...
>>  >>  >
>>  >>  > Looks we need a spinlock here?
>>  >>
>>  >>  Unless fcopy_release() can wait for all data for current 
>> transation
>>  >>  to be received. Spinlock won't help.
>>  >>
>>  >>  But an idea is let the daemon the handle such cases. E.g make 
>> sure
>>  >> the
>>  >>  processing begins with START_COPY and end with
>>  COMPLETE/CANCEL_COPY.
>>  >>  Drop all requests that does not start with START_COPY.
>>  >>
>>  >>  Thought?
>>  > Good idea.
>>  > I also think we should reinforce the concept of state machine in 
>> the
>>  > daemon code.
>>  
>>  Yes, it needs.
> I agree.
> Obviously we can do something to make the daemon/driver work better
> in the corner cases.
> 
>>  >
>>  > The daemon/driver communication has so many corner cases...
>>  
>>  Looks so, let's first address the issue mentioned in this patch.
> OK.
> 
>>  I don't have any more comments other than changing
>>  
>>   if(down_trylock(&fcopy_transaction.read_sema))
>>       ;
>>  
>>  to
>>  
>>  down_trylock(&fcopy_transaction.read_sema);
> Hi Jason,
> This is to address Vitaly's comment in the bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> 
> down_trylock(&fcopy_transaction.read_sema) will
> 
> "
> produces the following compile warning:
> drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
> drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of 
> ‘down_trylock’, declared with attribute warn_unused_result 
> [-Wunused-result]
>   (void)down_trylock(&fcopy_transaction.read_sema);
> "
> 
> Actually I personally don't care about the warning, because we only
> see it when we run some kind of code checker program.  :-)
> 
> I can change my v3  to the "normal" style you prefer, if
> there is no strong objection from Vitaly?

Ah, I see the point. Then no objection for this patch.

Since Vitaly said he does not has objection.

Acked-by: Jason Wang <jasowang@redhat.com>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-01 10:26                     ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-12-01 10:26 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw



On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  Sent: Monday, December 1, 2014 16:23 PM
>>  To: Dexuan Cui
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
>>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com> 
>> wrote:
>>  >>  -----Original Message-----
>>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  >>  Sent: Friday, November 28, 2014 18:13 PM
>>  >>  To: Dexuan Cui
>>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>>  >> driverdev-
>>  >>  devel@linuxdriverproject.org; olaf@aepfle.de; 
>> apw@canonical.com; KY
>>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on
>>  >> transfer
>>  >>  failure
>>  >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui 
>> <decui@microsoft.com>
>>  >> wrote:
>>  >>  >>  -----Original Message-----
>>  >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  >>  >>  Sent: Friday, November 28, 2014 14:47 PM
>>  >>  >>  To: Dexuan Cui
>>  >>  >>  Cc: gregkh@linuxfoundation.org; 
>> linux-kernel@vger.kernel.org;
>>  >>  >> driverdev-
>>  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
>>  >> apw@canonical.com; KY
>>  >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete 
>> message
>>  >> on
>>  >>  >> transfer
>>  >>  >>  failure
>>  >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
>>  >> <decui@microsoft.com>
>>  >>  >> wrote:
>>  >>  >>  > In the case the user-space daemon crashes, hangs or is
>>  >> killed, we
>>  >>  >>  > need to down the semaphore, otherwise, after the daemon 
>> starts
>>  >>  >> next
>>  >>  >>  > time, the obsolete data in fcopy_transaction.message or
>>  >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  >>  >
>>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
>>  >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>>  >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
>>  >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
>>  >>  >>  > ---
>>  >>  >>  >
>>  >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
>>  >>  >>  >
>>  >>  >>  >     I also updated the output message a little:
>>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
>>  >>  >>  >     "can not acquire the semaphore: it is benign"
>>  >>  >>  >
>>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
>>  >> suggested.
>>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and 
>> added a
>>  >>  >>  > comment instead.
>>  >>  >>  >
>>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>>  >>  >>  >  1 file changed, 19 insertions(+)
>>  >>  >>  >
>>  >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
>>  >>  >>  > index 23b2ce2..faa6ba6 100644
>>  >>  >>  > --- a/drivers/hv/hv_fcopy.c
>>  >>  >>  > +++ b/drivers/hv/hv_fcopy.c
>>  >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
>>  >> work_struct
>>  >>  >>  > *dummy)
>>  >>  >>  >  	 * process the pending transaction.
>>  >>  >>  >  	 */
>>  >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
>>  >>  >>  > +
>>  >>  >>  > +	/* In the case the user-space daemon crashes, hangs or 
>> is
>>  >>  >> killed, we
>>  >>  >>  > +	 * need to down the semaphore, otherwise, after the 
>> daemon
>>  >>  >> starts
>>  >>  >>  > next
>>  >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message 
>> or
>>  >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
>>  >>  >>  > +	 *
>>  >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
>>  >> rare)?
>>  >>  >>  > We're
>>  >>  >>  > +	 * still OK, because we've reported the failure to the 
>> host.
>>  >>  >>  > +	 */
>>  >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  >>  > +		;
>>  >>  >>
>>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
>>  >>  >>
>>  >>  >>  Btw, a question not relate to this patch.
>>  >>  >>
>>  >>  >>  What happens if a daemon is resume from SIGSTOP and expires 
>> the
>>  >>  >> check
>>  >>  >>  here?
>>  >>  > Hi Jason,
>>  >>  > My idea is: here we need down_trylock(), but in case we can't 
>> get
>>  >> the
>>  >>  > semaphore, it's OK anyway:
>>  >>  >
>>  >>  > Scenario 1):
>>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
>>  >> receives
>>  >>  > SIGSTOP;
>>  >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
>>  >>  > 1.3.1: the driver reports the failure to the host user in 5s 
>> and
>>  >>  > 1.3.2: the driver down()-es the semaphore;
>>  >>  > 1.4: the daemon receives SIGCONT and it will be still blocked 
>> on
>>  >> the
>>  >>  > pread().
>>  >>  > Without the down_trylock(), in 1.4, the daemon can receive an
>>  >>  > obsolete message.
>>  >>  > NOTE: in this scenario, the daemon is not killed.
>>  >>  >
>>  >>  > Scenario 2):
>>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1 
>> and
>>  >> 1.3.2
>>  >>  > and
>>  >>  > do down() in fcopy_read(), it will receive the message but: 
>> the
>>  >>  > driver has
>>  >>  > reported the failure to the host user and the driver's 1.3.2 
>> can't
>>  >>  > get the
>>  >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
>>  >>  > incomplete
>>  >>  > file will be left there.
>>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
>>  >>  > close(target_fd) before open()-ing a new one.
>>  >>
>>  >>  Right, but how about the case when resuming from SIGSTOP but no
>>  >> timeout?
>>  > Sorry, I don't understand this:
>>  > if no timeout, fcopy_read() will get the semaphore and 
>> fcopy_write()
>>  > will try to cancel fcopy_work.
>>  
>>  Yes.
>>  >
>>  >
>>  >>  Looks like in this case userspace() may wait in 
>> down_interruptible()
>>  >>  until timeout. We probably need something like this:
>>  >>
>>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
>>  >>                  up(&fcopy_transaction.read_sema);
>>  >>                  return -EINTR;
>>  >>          }
>>  > until "timeout"?
>>  > if the daemon can't get the semaphore, it can only be wake by a
>>  > signal(the
>>  > daemon doesn't install handler, so by default most signals will 
>> kill
>>  > the daemon).
>>  > In case a signal waking up the daemon doesn't kill the daemon, why
>>  > should
>>  > we do up()?
>>  
>>  True, no need since we do down_trylock() in release().
>>  
>>  Btw, there's no EINTR handling in handling pread() return value,
>>  may add such one which should be useful for something like 
>> debugging.
>>  
>>  >
>>  >
>>  >>
>>  >>  This should synchronize with the timeout work for sure.
>>  >>  But how about only schedule it after this?
>>  >>  It does not may sense to start the timer during interrupt
>>  >>  since the file may not even opened and it may take time
>>  >>  to handle signals?
>>  >>
>>  >>  >
>>  >>  >>  >
>>  >>  >>  > +
>>  >>  >>  >  }
>>  >>  >>  >
>>  >>  >>  >  static int fcopy_handle_handshake(u32 version)
>>  >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
>>  >> *inode,
>>  >>  >>  > struct file *f)
>>  >>  >>  >  	 */
>>  >>  >>  >  	in_hand_shake = true;
>>  >>  >>  >  	opened = false;
>>  >>  >>  > +
>>  >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
>>  >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
>>  >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
>>  >>  >>  > +			;
>>  >>  >>
>>  >>  >>  And this.
>>  >>  >
>>  >>  > Scenario 3):
>>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
>>  >>  > fcopy_work
>>  >>  > pending (scheduled but not start to run yet), we should 
>> cancel the
>>  >>  > work (as you suggested) and down() the semaphore, otherwise, 
>> the
>>  >>  > obsolete message will be received by the next instance of the
>>  >> daemon.
>>  >>
>>  >>  Yes
>>  >>  >
>>  >>  >
>>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
>>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>>  >>  >         ----> if fcopy_release() is running on another vcpu, 
>> just
>>  >>  > before the next line?
>>  >>  >         fcopy_send_data();
>>  >>  >
>>  >>  > In this case, fcopy_release() can cancel fcopy_work, but
>>  >>  > can't get the semaphore since it hasn't been up()-ed.
>>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later, 
>> and
>>  >> we'll
>>  >>  > buffer an obsolete message in the driver, and the message 
>> will be
>>  >>  > fetched by the next instance of the daemon...
>>  >>  >
>>  >>  > Looks we need a spinlock here?
>>  >>
>>  >>  Unless fcopy_release() can wait for all data for current 
>> transation
>>  >>  to be received. Spinlock won't help.
>>  >>
>>  >>  But an idea is let the daemon the handle such cases. E.g make 
>> sure
>>  >> the
>>  >>  processing begins with START_COPY and end with
>>  COMPLETE/CANCEL_COPY.
>>  >>  Drop all requests that does not start with START_COPY.
>>  >>
>>  >>  Thought?
>>  > Good idea.
>>  > I also think we should reinforce the concept of state machine in 
>> the
>>  > daemon code.
>>  
>>  Yes, it needs.
> I agree.
> Obviously we can do something to make the daemon/driver work better
> in the corner cases.
> 
>>  >
>>  > The daemon/driver communication has so many corner cases...
>>  
>>  Looks so, let's first address the issue mentioned in this patch.
> OK.
> 
>>  I don't have any more comments other than changing
>>  
>>   if(down_trylock(&fcopy_transaction.read_sema))
>>       ;
>>  
>>  to
>>  
>>  down_trylock(&fcopy_transaction.read_sema);
> Hi Jason,
> This is to address Vitaly's comment in the bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> 
> down_trylock(&fcopy_transaction.read_sema) will
> 
> "
> produces the following compile warning:
> drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
> drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of 
> ‘down_trylock’, declared with attribute warn_unused_result 
> [-Wunused-result]
>   (void)down_trylock(&fcopy_transaction.read_sema);
> "
> 
> Actually I personally don't care about the warning, because we only
> see it when we run some kind of code checker program.  :-)
> 
> I can change my v3  to the "normal" style you prefer, if
> there is no strong objection from Vitaly?

Ah, I see the point. Then no objection for this patch.

Since Vitaly said he does not has objection.

Acked-by: Jason Wang <jasowang@redhat.com>

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-12-01 10:26                     ` Jason Wang
@ 2014-12-01 11:00                       ` Dexuan Cui
  -1 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-12-01 11:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 12355 bytes --]

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Monday, December 1, 2014 18:18 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> 
> 
> On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com> wrote:
> >>  -----Original Message-----
> >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  Sent: Monday, December 1, 2014 16:23 PM
> >>  To: Dexuan Cui
> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> driverdev-
> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> >> transfer
> >>  failure
> >>  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com>
> >> wrote:
> >>  >>  -----Original Message-----
> >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  >>  Sent: Friday, November 28, 2014 18:13 PM
> >>  >>  To: Dexuan Cui
> >>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >>  >> driverdev-
> >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> >> apw@canonical.com; KY
> >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message
> >> on
> >>  >> transfer
> >>  >>  failure
> >>  >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui
> >> <decui@microsoft.com>
> >>  >> wrote:
> >>  >>  >>  -----Original Message-----
> >>  >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  >>  >>  Sent: Friday, November 28, 2014 14:47 PM
> >>  >>  >>  To: Dexuan Cui
> >>  >>  >>  Cc: gregkh@linuxfoundation.org;
> >> linux-kernel@vger.kernel.org;
> >>  >>  >> driverdev-
> >>  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> >>  >> apw@canonical.com; KY
> >>  >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete
> >> message
> >>  >> on
> >>  >>  >> transfer
> >>  >>  >>  failure
> >>  >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
> >>  >> <decui@microsoft.com>
> >>  >>  >> wrote:
> >>  >>  >>  > In the case the user-space daemon crashes, hangs or is
> >>  >> killed, we
> >>  >>  >>  > need to down the semaphore, otherwise, after the daemon
> >> starts
> >>  >>  >> next
> >>  >>  >>  > time, the obsolete data in fcopy_transaction.message or
> >>  >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  >>  >
> >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
> >>  >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> >>  >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
> >>  >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >>  >>  >>  > ---
> >>  >>  >>  >
> >>  >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
> >>  >>  >>  >
> >>  >>  >>  >     I also updated the output message a little:
> >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
> >>  >>  >>  >     "can not acquire the semaphore: it is benign"
> >>  >>  >>  >
> >>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> >>  >> suggested.
> >>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and
> >> added a
> >>  >>  >>  > comment instead.
> >>  >>  >>  >
> >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >>  >>  >>  >  1 file changed, 19 insertions(+)
> >>  >>  >>  >
> >>  >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> >>  >>  >>  > index 23b2ce2..faa6ba6 100644
> >>  >>  >>  > --- a/drivers/hv/hv_fcopy.c
> >>  >>  >>  > +++ b/drivers/hv/hv_fcopy.c
> >>  >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
> >>  >> work_struct
> >>  >>  >>  > *dummy)
> >>  >>  >>  >  	 * process the pending transaction.
> >>  >>  >>  >  	 */
> >>  >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> >>  >>  >>  > +
> >>  >>  >>  > +	/* In the case the user-space daemon crashes, hangs or
> >> is
> >>  >>  >> killed, we
> >>  >>  >>  > +	 * need to down the semaphore, otherwise, after the
> >> daemon
> >>  >>  >> starts
> >>  >>  >>  > next
> >>  >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message
> >> or
> >>  >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  >>  > +	 *
> >>  >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
> >>  >> rare)?
> >>  >>  >>  > We're
> >>  >>  >>  > +	 * still OK, because we've reported the failure to the
> >> host.
> >>  >>  >>  > +	 */
> >>  >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  >>  > +		;
> >>  >>  >>
> >>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> >>  >>  >>
> >>  >>  >>  Btw, a question not relate to this patch.
> >>  >>  >>
> >>  >>  >>  What happens if a daemon is resume from SIGSTOP and expires
> >> the
> >>  >>  >> check
> >>  >>  >>  here?
> >>  >>  > Hi Jason,
> >>  >>  > My idea is: here we need down_trylock(), but in case we can't
> >> get
> >>  >> the
> >>  >>  > semaphore, it's OK anyway:
> >>  >>  >
> >>  >>  > Scenario 1):
> >>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> >>  >> receives
> >>  >>  > SIGSTOP;
> >>  >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
> >>  >>  > 1.3.1: the driver reports the failure to the host user in 5s
> >> and
> >>  >>  > 1.3.2: the driver down()-es the semaphore;
> >>  >>  > 1.4: the daemon receives SIGCONT and it will be still blocked
> >> on
> >>  >> the
> >>  >>  > pread().
> >>  >>  > Without the down_trylock(), in 1.4, the daemon can receive an
> >>  >>  > obsolete message.
> >>  >>  > NOTE: in this scenario, the daemon is not killed.
> >>  >>  >
> >>  >>  > Scenario 2):
> >>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1
> >> and
> >>  >> 1.3.2
> >>  >>  > and
> >>  >>  > do down() in fcopy_read(), it will receive the message but:
> >> the
> >>  >>  > driver has
> >>  >>  > reported the failure to the host user and the driver's 1.3.2
> >> can't
> >>  >>  > get the
> >>  >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
> >>  >>  > incomplete
> >>  >>  > file will be left there.
> >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
> >>  >>  > close(target_fd) before open()-ing a new one.
> >>  >>
> >>  >>  Right, but how about the case when resuming from SIGSTOP but no
> >>  >> timeout?
> >>  > Sorry, I don't understand this:
> >>  > if no timeout, fcopy_read() will get the semaphore and
> >> fcopy_write()
> >>  > will try to cancel fcopy_work.
> >>
> >>  Yes.
> >>  >
> >>  >
> >>  >>  Looks like in this case userspace() may wait in
> >> down_interruptible()
> >>  >>  until timeout. We probably need something like this:
> >>  >>
> >>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> >>  >>                  up(&fcopy_transaction.read_sema);
> >>  >>                  return -EINTR;
> >>  >>          }
> >>  > until "timeout"?
> >>  > if the daemon can't get the semaphore, it can only be wake by a
> >>  > signal(the
> >>  > daemon doesn't install handler, so by default most signals will
> >> kill
> >>  > the daemon).
> >>  > In case a signal waking up the daemon doesn't kill the daemon, why
> >>  > should
> >>  > we do up()?
> >>
> >>  True, no need since we do down_trylock() in release().
> >>
> >>  Btw, there's no EINTR handling in handling pread() return value,
> >>  may add such one which should be useful for something like
> >> debugging.
> >>
> >>  >
> >>  >
> >>  >>
> >>  >>  This should synchronize with the timeout work for sure.
> >>  >>  But how about only schedule it after this?
> >>  >>  It does not may sense to start the timer during interrupt
> >>  >>  since the file may not even opened and it may take time
> >>  >>  to handle signals?
> >>  >>
> >>  >>  >
> >>  >>  >>  >
> >>  >>  >>  > +
> >>  >>  >>  >  }
> >>  >>  >>  >
> >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)
> >>  >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
> >>  >> *inode,
> >>  >>  >>  > struct file *f)
> >>  >>  >>  >  	 */
> >>  >>  >>  >  	in_hand_shake = true;
> >>  >>  >>  >  	opened = false;
> >>  >>  >>  > +
> >>  >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> >>  >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
> >>  >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  >>  > +			;
> >>  >>  >>
> >>  >>  >>  And this.
> >>  >>  >
> >>  >>  > Scenario 3):
> >>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> >>  >>  > fcopy_work
> >>  >>  > pending (scheduled but not start to run yet), we should
> >> cancel the
> >>  >>  > work (as you suggested) and down() the semaphore, otherwise,
> >> the
> >>  >>  > obsolete message will be received by the next instance of the
> >>  >> daemon.
> >>  >>
> >>  >>  Yes
> >>  >>  >
> >>  >>  >
> >>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> >>  >>  >         ----> if fcopy_release() is running on another vcpu,
> >> just
> >>  >>  > before the next line?
> >>  >>  >         fcopy_send_data();
> >>  >>  >
> >>  >>  > In this case, fcopy_release() can cancel fcopy_work, but
> >>  >>  > can't get the semaphore since it hasn't been up()-ed.
> >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later,
> >> and
> >>  >> we'll
> >>  >>  > buffer an obsolete message in the driver, and the message
> >> will be
> >>  >>  > fetched by the next instance of the daemon...
> >>  >>  >
> >>  >>  > Looks we need a spinlock here?
> >>  >>
> >>  >>  Unless fcopy_release() can wait for all data for current
> >> transation
> >>  >>  to be received. Spinlock won't help.
> >>  >>
> >>  >>  But an idea is let the daemon the handle such cases. E.g make
> >> sure
> >>  >> the
> >>  >>  processing begins with START_COPY and end with
> >>  COMPLETE/CANCEL_COPY.
> >>  >>  Drop all requests that does not start with START_COPY.
> >>  >>
> >>  >>  Thought?
> >>  > Good idea.
> >>  > I also think we should reinforce the concept of state machine in
> >> the
> >>  > daemon code.
> >>
> >>  Yes, it needs.
> > I agree.
> > Obviously we can do something to make the daemon/driver work better
> > in the corner cases.
> >
> >>  >
> >>  > The daemon/driver communication has so many corner cases...
> >>
> >>  Looks so, let's first address the issue mentioned in this patch.
> > OK.
> >
> >>  I don't have any more comments other than changing
> >>
> >>   if(down_trylock(&fcopy_transaction.read_sema))
> >>       ;
> >>
> >>  to
> >>
> >>  down_trylock(&fcopy_transaction.read_sema);
> > Hi Jason,
> > This is to address Vitaly's comment in the bugzilla:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> >
> > down_trylock(&fcopy_transaction.read_sema) will
> >
> > "
> > produces the following compile warning:
> > drivers/hv/hv_fcopy.c: In function â€˜fcopy_work_funcâ€™:
> > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
> > â€˜down_trylockâ€™, declared with attribute warn_unused_result
> > [-Wunused-result]
> >   (void)down_trylock(&fcopy_transaction.read_sema);
> > "
> >
> > Actually I personally don't care about the warning, because we only
> > see it when we run some kind of code checker program.  :-)
> >
> > I can change my v3  to the "normal" style you prefer, if
> > there is no strong objection from Vitaly?
> 
> Ah, I see the point. Then no objection for this patch.
> 
> Since Vitaly said he does not has objection.
> 
> Acked-by: Jason Wang <jasowang@redhat.com>

I realized non-trivial effort might be needed to better handle corner cases.
This is only the first step.

E.g., I found a new corner case:
after we unload the hv_utils/fcopy driver, the fcopy daemon can
exit and fcopy_release() will be invoked -- at this time fcopy_release() has
become obsolete and invalid, and as a result, a kernel panic can happen:  I
did see the panic once.

-- Dexuan
ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-01 11:00                       ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-12-01 11:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, KY Srinivasan,
	vkuznets, Haiyang Zhang

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Monday, December 1, 2014 18:18 PM
> To: Dexuan Cui
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> 
> 
> On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com> wrote:
> >>  -----Original Message-----
> >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  Sent: Monday, December 1, 2014 16:23 PM
> >>  To: Dexuan Cui
> >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> driverdev-
> >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> >> transfer
> >>  failure
> >>  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui <decui@microsoft.com>
> >> wrote:
> >>  >>  -----Original Message-----
> >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  >>  Sent: Friday, November 28, 2014 18:13 PM
> >>  >>  To: Dexuan Cui
> >>  >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >>  >> driverdev-
> >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> >> apw@canonical.com; KY
> >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message
> >> on
> >>  >> transfer
> >>  >>  failure
> >>  >>  On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui
> >> <decui@microsoft.com>
> >>  >> wrote:
> >>  >>  >>  -----Original Message-----
> >>  >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]
> >>  >>  >>  Sent: Friday, November 28, 2014 14:47 PM
> >>  >>  >>  To: Dexuan Cui
> >>  >>  >>  Cc: gregkh@linuxfoundation.org;
> >> linux-kernel@vger.kernel.org;
> >>  >>  >> driverdev-
> >>  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> >>  >> apw@canonical.com; KY
> >>  >>  >>  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> >>  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete
> >> message
> >>  >> on
> >>  >>  >> transfer
> >>  >>  >>  failure
> >>  >>  >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
> >>  >> <decui@microsoft.com>
> >>  >>  >> wrote:
> >>  >>  >>  > In the case the user-space daemon crashes, hangs or is
> >>  >> killed, we
> >>  >>  >>  > need to down the semaphore, otherwise, after the daemon
> >> starts
> >>  >>  >> next
> >>  >>  >>  > time, the obsolete data in fcopy_transaction.message or
> >>  >>  >>  > fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  >>  >
> >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>
> >>  >>  >>  > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> >>  >>  >>  > Cc: K. Y. Srinivasan <kys@microsoft.com>
> >>  >>  >>  > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >>  >>  >>  > ---
> >>  >>  >>  >
> >>  >>  >>  > v2: I removed the "FCP" prefix as Greg asked.
> >>  >>  >>  >
> >>  >>  >>  >     I also updated the output message a little:
> >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
> >>  >>  >>  >     "can not acquire the semaphore: it is benign"
> >>  >>  >>  >
> >>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> >>  >> suggested.
> >>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and
> >> added a
> >>  >>  >>  > comment instead.
> >>  >>  >>  >
> >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >>  >>  >>  >  1 file changed, 19 insertions(+)
> >>  >>  >>  >
> >>  >>  >>  > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> >>  >>  >>  > index 23b2ce2..faa6ba6 100644
> >>  >>  >>  > --- a/drivers/hv/hv_fcopy.c
> >>  >>  >>  > +++ b/drivers/hv/hv_fcopy.c
> >>  >>  >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
> >>  >> work_struct
> >>  >>  >>  > *dummy)
> >>  >>  >>  >  	 * process the pending transaction.
> >>  >>  >>  >  	 */
> >>  >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> >>  >>  >>  > +
> >>  >>  >>  > +	/* In the case the user-space daemon crashes, hangs or
> >> is
> >>  >>  >> killed, we
> >>  >>  >>  > +	 * need to down the semaphore, otherwise, after the
> >> daemon
> >>  >>  >> starts
> >>  >>  >>  > next
> >>  >>  >>  > +	 * time, the obsolete data in fcopy_transaction.message
> >> or
> >>  >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> >>  >>  >>  > +	 *
> >>  >>  >>  > +	 * NOTE: fcopy_read() happens to get the semaphore (very
> >>  >> rare)?
> >>  >>  >>  > We're
> >>  >>  >>  > +	 * still OK, because we've reported the failure to the
> >> host.
> >>  >>  >>  > +	 */
> >>  >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  >>  > +		;
> >>  >>  >>
> >>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> >>  >>  >>
> >>  >>  >>  Btw, a question not relate to this patch.
> >>  >>  >>
> >>  >>  >>  What happens if a daemon is resume from SIGSTOP and expires
> >> the
> >>  >>  >> check
> >>  >>  >>  here?
> >>  >>  > Hi Jason,
> >>  >>  > My idea is: here we need down_trylock(), but in case we can't
> >> get
> >>  >> the
> >>  >>  > semaphore, it's OK anyway:
> >>  >>  >
> >>  >>  > Scenario 1):
> >>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> >>  >> receives
> >>  >>  > SIGSTOP;
> >>  >>  > 1.2: the host user runs the PowerShell Copy-VMFile command;
> >>  >>  > 1.3.1: the driver reports the failure to the host user in 5s
> >> and
> >>  >>  > 1.3.2: the driver down()-es the semaphore;
> >>  >>  > 1.4: the daemon receives SIGCONT and it will be still blocked
> >> on
> >>  >> the
> >>  >>  > pread().
> >>  >>  > Without the down_trylock(), in 1.4, the daemon can receive an
> >>  >>  > obsolete message.
> >>  >>  > NOTE: in this scenario, the daemon is not killed.
> >>  >>  >
> >>  >>  > Scenario 2):
> >>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1
> >> and
> >>  >> 1.3.2
> >>  >>  > and
> >>  >>  > do down() in fcopy_read(), it will receive the message but:
> >> the
> >>  >>  > driver has
> >>  >>  > reported the failure to the host user and the driver's 1.3.2
> >> can't
> >>  >>  > get the
> >>  >>  > semaphore -- IMO this is acceptably OK, though in the VM, an
> >>  >>  > incomplete
> >>  >>  > file will be left there.
> >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add a
> >>  >>  > close(target_fd) before open()-ing a new one.
> >>  >>
> >>  >>  Right, but how about the case when resuming from SIGSTOP but no
> >>  >> timeout?
> >>  > Sorry, I don't understand this:
> >>  > if no timeout, fcopy_read() will get the semaphore and
> >> fcopy_write()
> >>  > will try to cancel fcopy_work.
> >>
> >>  Yes.
> >>  >
> >>  >
> >>  >>  Looks like in this case userspace() may wait in
> >> down_interruptible()
> >>  >>  until timeout. We probably need something like this:
> >>  >>
> >>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> >>  >>                  up(&fcopy_transaction.read_sema);
> >>  >>                  return -EINTR;
> >>  >>          }
> >>  > until "timeout"?
> >>  > if the daemon can't get the semaphore, it can only be wake by a
> >>  > signal(the
> >>  > daemon doesn't install handler, so by default most signals will
> >> kill
> >>  > the daemon).
> >>  > In case a signal waking up the daemon doesn't kill the daemon, why
> >>  > should
> >>  > we do up()?
> >>
> >>  True, no need since we do down_trylock() in release().
> >>
> >>  Btw, there's no EINTR handling in handling pread() return value,
> >>  may add such one which should be useful for something like
> >> debugging.
> >>
> >>  >
> >>  >
> >>  >>
> >>  >>  This should synchronize with the timeout work for sure.
> >>  >>  But how about only schedule it after this?
> >>  >>  It does not may sense to start the timer during interrupt
> >>  >>  since the file may not even opened and it may take time
> >>  >>  to handle signals?
> >>  >>
> >>  >>  >
> >>  >>  >>  >
> >>  >>  >>  > +
> >>  >>  >>  >  }
> >>  >>  >>  >
> >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)
> >>  >>  >>  > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode
> >>  >> *inode,
> >>  >>  >>  > struct file *f)
> >>  >>  >>  >  	 */
> >>  >>  >>  >  	in_hand_shake = true;
> >>  >>  >>  >  	opened = false;
> >>  >>  >>  > +
> >>  >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> >>  >>  >>  > +		/* We haven't up()-ed the semaphore(very rare)? */
> >>  >>  >>  > +		if (down_trylock(&fcopy_transaction.read_sema))
> >>  >>  >>  > +			;
> >>  >>  >>
> >>  >>  >>  And this.
> >>  >>  >
> >>  >>  > Scenario 3):
> >>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> >>  >>  > fcopy_work
> >>  >>  > pending (scheduled but not start to run yet), we should
> >> cancel the
> >>  >>  > work (as you suggested) and down() the semaphore, otherwise,
> >> the
> >>  >>  > obsolete message will be received by the next instance of the
> >>  >> daemon.
> >>  >>
> >>  >>  Yes
> >>  >>  >
> >>  >>  >
> >>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> >>  >>  >         ----> if fcopy_release() is running on another vcpu,
> >> just
> >>  >>  > before the next line?
> >>  >>  >         fcopy_send_data();
> >>  >>  >
> >>  >>  > In this case, fcopy_release() can cancel fcopy_work, but
> >>  >>  > can't get the semaphore since it hasn't been up()-ed.
> >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later,
> >> and
> >>  >> we'll
> >>  >>  > buffer an obsolete message in the driver, and the message
> >> will be
> >>  >>  > fetched by the next instance of the daemon...
> >>  >>  >
> >>  >>  > Looks we need a spinlock here?
> >>  >>
> >>  >>  Unless fcopy_release() can wait for all data for current
> >> transation
> >>  >>  to be received. Spinlock won't help.
> >>  >>
> >>  >>  But an idea is let the daemon the handle such cases. E.g make
> >> sure
> >>  >> the
> >>  >>  processing begins with START_COPY and end with
> >>  COMPLETE/CANCEL_COPY.
> >>  >>  Drop all requests that does not start with START_COPY.
> >>  >>
> >>  >>  Thought?
> >>  > Good idea.
> >>  > I also think we should reinforce the concept of state machine in
> >> the
> >>  > daemon code.
> >>
> >>  Yes, it needs.
> > I agree.
> > Obviously we can do something to make the daemon/driver work better
> > in the corner cases.
> >
> >>  >
> >>  > The daemon/driver communication has so many corner cases...
> >>
> >>  Looks so, let's first address the issue mentioned in this patch.
> > OK.
> >
> >>  I don't have any more comments other than changing
> >>
> >>   if(down_trylock(&fcopy_transaction.read_sema))
> >>       ;
> >>
> >>  to
> >>
> >>  down_trylock(&fcopy_transaction.read_sema);
> > Hi Jason,
> > This is to address Vitaly's comment in the bugzilla:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> >
> > down_trylock(&fcopy_transaction.read_sema) will
> >
> > "
> > produces the following compile warning:
> > drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
> > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
> > ‘down_trylock’, declared with attribute warn_unused_result
> > [-Wunused-result]
> >   (void)down_trylock(&fcopy_transaction.read_sema);
> > "
> >
> > Actually I personally don't care about the warning, because we only
> > see it when we run some kind of code checker program.  :-)
> >
> > I can change my v3  to the "normal" style you prefer, if
> > there is no strong objection from Vitaly?
> 
> Ah, I see the point. Then no objection for this patch.
> 
> Since Vitaly said he does not has objection.
> 
> Acked-by: Jason Wang <jasowang@redhat.com>

I realized non-trivial effort might be needed to better handle corner cases.
This is only the first step.

E.g., I found a new corner case:
after we unload the hv_utils/fcopy driver, the fcopy daemon can
exit and fcopy_release() will be invoked -- at this time fcopy_release() has
become obsolete and invalid, and as a result, a kernel panic can happen:  I
did see the panic once.

-- Dexuan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-12-01 11:00                       ` Dexuan Cui
@ 2014-12-01 15:54                         ` KY Srinivasan
  -1 siblings, 0 replies; 34+ messages in thread
From: KY Srinivasan @ 2014-12-01 15:54 UTC (permalink / raw)
  To: Dexuan Cui, Jason Wang
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, vkuznets,
	Haiyang Zhang

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 13183 bytes --]



> -----Original Message-----
> From: Dexuan Cui
> Sent: Monday, December 1, 2014 3:01 AM
> To: Jason Wang
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> > -----Original Message-----
> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Monday, December 1, 2014 18:18 PM
> > To: Dexuan Cui
> > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
> > apw@canonical.com; KY Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > transfer failure
> >
> >
> >
> > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com>
> wrote:
> > >>  -----Original Message-----
> > >>  From: Jason Wang [mailto:jasowang@redhat.com]
> > >>  Sent: Monday, December 1, 2014 16:23 PM
> > >>  To: Dexuan Cui
> > >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > >> driverdev-
> > >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> > >> KY  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > >> transfer  failure  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui
> > >> <decui@microsoft.com>
> > >> wrote:
> > >>  >>  -----Original Message-----
> > >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]  >>  Sent:
> > >> Friday, November 28, 2014 18:13 PM  >>  To: Dexuan Cui  >>  Cc:
> > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>
> > >> driverdev-  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > >> apw@canonical.com; KY  >>  Srinivasan; vkuznets@redhat.com; Haiyang
> > >> Zhang  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete
> > >> message on  >> transfer  >>  failure  >>  On Fri, Nov 28, 2014 at
> > >> 4:36 PM, Dexuan Cui <decui@microsoft.com>  >> wrote:
> > >>  >>  >>  -----Original Message-----  >>  >>  From: Jason Wang
> > >> [mailto:jasowang@redhat.com]  >>  >>  Sent: Friday, November 28,
> > >> 2014 14:47 PM  >>  >>  To: Dexuan Cui  >>  >>  Cc:
> > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>  >>
> > >> driverdev-  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > >> >> apw@canonical.com; KY  >>  >>  Srinivasan; vkuznets@redhat.com;
> > >> Haiyang Zhang  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop
> > >> the obsolete message  >> on  >>  >> transfer  >>  >>  failure  >>
> > >> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui  >>
> > >> <decui@microsoft.com>  >>  >> wrote:
> > >>  >>  >>  > In the case the user-space daemon crashes, hangs or is
> > >> >> killed, we  >>  >>  > need to down the semaphore, otherwise,
> > >> after the daemon starts  >>  >> next  >>  >>  > time, the obsolete
> > >> data in fcopy_transaction.message or  >>  >>  >
> > >> fcopy_transaction.fcopy_msg will be used immediately.
> > >>  >>  >>  >
> > >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>  >>  >>  > Cc:
> > >> Vitaly Kuznetsov <vkuznets@redhat.com>  >>  >>  > Cc: K. Y.
> > >> Srinivasan <kys@microsoft.com>  >>  >>  > Signed-off-by: Dexuan Cui
> > >> <decui@microsoft.com>  >>  >>  > ---  >>  >>  >  >>  >>  > v2: I
> > >> removed the "FCP" prefix as Greg asked.
> > >>  >>  >>  >
> > >>  >>  >>  >     I also updated the output message a little:
> > >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
> > >>  >>  >>  >     "can not acquire the semaphore: it is benign"
> > >>  >>  >>  >
> > >>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> > >> >> suggested.
> > >>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and
> > >> added a
> > >>  >>  >>  > comment instead.
> > >>  >>  >>  >
> > >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++  >>  >>
> > >> >  1 file changed, 19 insertions(+)  >>  >>  >  >>  >>  > diff
> > >> --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c  >>  >>  >
> > >> index 23b2ce2..faa6ba6 100644  >>  >>  > ---
> > >> a/drivers/hv/hv_fcopy.c  >>  >>  > +++ b/drivers/hv/hv_fcopy.c  >>
> > >> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct  >>
> > >> work_struct  >>  >>  > *dummy)
> > >>  >>  >>  >  	 * process the pending transaction.
> > >>  >>  >>  >  	 */
> > >>  >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> > >>  >>  >>  > +
> > >>  >>  >>  > +	/* In the case the user-space daemon crashes, hangs
> or
> > >> is
> > >>  >>  >> killed, we
> > >>  >>  >>  > +	 * need to down the semaphore, otherwise, after
> the
> > >> daemon
> > >>  >>  >> starts
> > >>  >>  >>  > next
> > >>  >>  >>  > +	 * time, the obsolete data in
> fcopy_transaction.message
> > >> or
> > >>  >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used
> immediately.
> > >>  >>  >>  > +	 *
> > >>  >>  >>  > +	 * NOTE: fcopy_read() happens to get the
> semaphore (very
> > >>  >> rare)?
> > >>  >>  >>  > We're
> > >>  >>  >>  > +	 * still OK, because we've reported the failure to the
> > >> host.
> > >>  >>  >>  > +	 */
> > >>  >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> > >>  >>  >>  > +		;
> > >>  >>  >>
> > >>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> > >>  >>  >>
> > >>  >>  >>  Btw, a question not relate to this patch.
> > >>  >>  >>
> > >>  >>  >>  What happens if a daemon is resume from SIGSTOP and
> > >> expires the  >>  >> check  >>  >>  here?
> > >>  >>  > Hi Jason,
> > >>  >>  > My idea is: here we need down_trylock(), but in case we
> > >> can't get  >> the  >>  > semaphore, it's OK anyway:
> > >>  >>  >
> > >>  >>  > Scenario 1):
> > >>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> > >> >> receives  >>  > SIGSTOP;  >>  > 1.2: the host user runs the
> > >> PowerShell Copy-VMFile command;  >>  > 1.3.1: the driver reports
> > >> the failure to the host user in 5s and  >>  > 1.3.2: the driver
> > >> down()-es the semaphore;  >>  > 1.4: the daemon receives SIGCONT
> > >> and it will be still blocked on  >> the  >>  > pread().
> > >>  >>  > Without the down_trylock(), in 1.4, the daemon can receive
> > >> an  >>  > obsolete message.
> > >>  >>  > NOTE: in this scenario, the daemon is not killed.
> > >>  >>  >
> > >>  >>  > Scenario 2):
> > >>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1
> > >> and  >> 1.3.2  >>  > and  >>  > do down() in fcopy_read(), it will
> > >> receive the message but:
> > >> the
> > >>  >>  > driver has
> > >>  >>  > reported the failure to the host user and the driver's 1.3.2
> > >> can't  >>  > get the  >>  > semaphore -- IMO this is acceptably OK,
> > >> though in the VM, an  >>  > incomplete  >>  > file will be left
> > >> there.
> > >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add
> > >> a  >>  > close(target_fd) before open()-ing a new one.
> > >>  >>
> > >>  >>  Right, but how about the case when resuming from SIGSTOP but
> > >> no  >> timeout?
> > >>  > Sorry, I don't understand this:
> > >>  > if no timeout, fcopy_read() will get the semaphore and
> > >> fcopy_write()
> > >>  > will try to cancel fcopy_work.
> > >>
> > >>  Yes.
> > >>  >
> > >>  >
> > >>  >>  Looks like in this case userspace() may wait in
> > >> down_interruptible()
> > >>  >>  until timeout. We probably need something like this:
> > >>  >>
> > >>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> > >>  >>                  up(&fcopy_transaction.read_sema);
> > >>  >>                  return -EINTR;
> > >>  >>          }
> > >>  > until "timeout"?
> > >>  > if the daemon can't get the semaphore, it can only be wake by a
> > >> > signal(the  > daemon doesn't install handler, so by default most
> > >> signals will kill  > the daemon).
> > >>  > In case a signal waking up the daemon doesn't kill the daemon,
> > >> why  > should  > we do up()?
> > >>
> > >>  True, no need since we do down_trylock() in release().
> > >>
> > >>  Btw, there's no EINTR handling in handling pread() return value,
> > >> may add such one which should be useful for something like
> > >> debugging.
> > >>
> > >>  >
> > >>  >
> > >>  >>
> > >>  >>  This should synchronize with the timeout work for sure.
> > >>  >>  But how about only schedule it after this?
> > >>  >>  It does not may sense to start the timer during interrupt  >>
> > >> since the file may not even opened and it may take time  >>  to
> > >> handle signals?
> > >>  >>
> > >>  >>  >
> > >>  >>  >>  >
> > >>  >>  >>  > +
> > >>  >>  >>  >  }
> > >>  >>  >>  >
> > >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)  >>  >>
> > >> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode  >>
> > >> *inode,  >>  >>  > struct file *f)
> > >>  >>  >>  >  	 */
> > >>  >>  >>  >  	in_hand_shake = true;
> > >>  >>  >>  >  	opened = false;
> > >>  >>  >>  > +
> > >>  >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > >>  >>  >>  > +		/* We haven't up()-ed the semaphore(very
> rare)? */
> > >>  >>  >>  > +		if
> (down_trylock(&fcopy_transaction.read_sema))
> > >>  >>  >>  > +			;
> > >>  >>  >>
> > >>  >>  >>  And this.
> > >>  >>  >
> > >>  >>  > Scenario 3):
> > >>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> > >> >>  > fcopy_work  >>  > pending (scheduled but not start to run
> > >> yet), we should cancel the  >>  > work (as you suggested) and
> > >> down() the semaphore, otherwise, the  >>  > obsolete message will
> > >> be received by the next instance of the  >> daemon.
> > >>  >>
> > >>  >>  Yes
> > >>  >>  >
> > >>  >>  >
> > >>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> > >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> > >>  >>  >         ----> if fcopy_release() is running on another vcpu,
> > >> just
> > >>  >>  > before the next line?
> > >>  >>  >         fcopy_send_data();
> > >>  >>  >
> > >>  >>  > In this case, fcopy_release() can cancel fcopy_work, but  >>
> > >> > can't get the semaphore since it hasn't been up()-ed.
> > >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later,
> > >> and
> > >>  >> we'll
> > >>  >>  > buffer an obsolete message in the driver, and the message
> > >> will be  >>  > fetched by the next instance of the daemon...
> > >>  >>  >
> > >>  >>  > Looks we need a spinlock here?
> > >>  >>
> > >>  >>  Unless fcopy_release() can wait for all data for current
> > >> transation  >>  to be received. Spinlock won't help.
> > >>  >>
> > >>  >>  But an idea is let the daemon the handle such cases. E.g make
> > >> sure  >> the  >>  processing begins with START_COPY and end with
> > >> COMPLETE/CANCEL_COPY.
> > >>  >>  Drop all requests that does not start with START_COPY.
> > >>  >>
> > >>  >>  Thought?
> > >>  > Good idea.
> > >>  > I also think we should reinforce the concept of state machine in
> > >> the  > daemon code.
> > >>
> > >>  Yes, it needs.
> > > I agree.
> > > Obviously we can do something to make the daemon/driver work better
> > > in the corner cases.
> > >
> > >>  >
> > >>  > The daemon/driver communication has so many corner cases...
> > >>
> > >>  Looks so, let's first address the issue mentioned in this patch.
> > > OK.
> > >
> > >>  I don't have any more comments other than changing
> > >>
> > >>   if(down_trylock(&fcopy_transaction.read_sema))
> > >>       ;
> > >>
> > >>  to
> > >>
> > >>  down_trylock(&fcopy_transaction.read_sema);
> > > Hi Jason,
> > > This is to address Vitaly's comment in the bugzilla:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> > >
> > > down_trylock(&fcopy_transaction.read_sema) will
> > >
> > > "
> > > produces the following compile warning:
> > > drivers/hv/hv_fcopy.c: In function â€˜fcopy_work_funcâ€™:
> > > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
> > > â€˜down_trylockâ€™, declared with attribute warn_unused_result
> > > [-Wunused-result]
> > >   (void)down_trylock(&fcopy_transaction.read_sema);
> > > "
> > >
> > > Actually I personally don't care about the warning, because we only
> > > see it when we run some kind of code checker program.  :-)
> > >
> > > I can change my v3  to the "normal" style you prefer, if there is no
> > > strong objection from Vitaly?
> >
> > Ah, I see the point. Then no objection for this patch.
> >
> > Since Vitaly said he does not has objection.
> >
> > Acked-by: Jason Wang <jasowang@redhat.com>
> 
> I realized non-trivial effort might be needed to better handle corner cases.
> This is only the first step.
> 
> E.g., I found a new corner case:
> after we unload the hv_utils/fcopy driver, the fcopy daemon can exit and
> fcopy_release() will be invoked -- at this time fcopy_release() has become
> obsolete and invalid, and as a result, a kernel panic can happen:  I did see the
> panic once.

You should not be able to unload the driver when there is still an open reference from the
Daemon.

K. Y
> 
> -- Dexuan

ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-01 15:54                         ` KY Srinivasan
  0 siblings, 0 replies; 34+ messages in thread
From: KY Srinivasan @ 2014-12-01 15:54 UTC (permalink / raw)
  To: Dexuan Cui, Jason Wang
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw



> -----Original Message-----
> From: Dexuan Cui
> Sent: Monday, December 1, 2014 3:01 AM
> To: Jason Wang
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> > -----Original Message-----
> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Monday, December 1, 2014 18:18 PM
> > To: Dexuan Cui
> > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
> > apw@canonical.com; KY Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > transfer failure
> >
> >
> >
> > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com>
> wrote:
> > >>  -----Original Message-----
> > >>  From: Jason Wang [mailto:jasowang@redhat.com]
> > >>  Sent: Monday, December 1, 2014 16:23 PM
> > >>  To: Dexuan Cui
> > >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > >> driverdev-
> > >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> > >> KY  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > >> transfer  failure  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui
> > >> <decui@microsoft.com>
> > >> wrote:
> > >>  >>  -----Original Message-----
> > >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]  >>  Sent:
> > >> Friday, November 28, 2014 18:13 PM  >>  To: Dexuan Cui  >>  Cc:
> > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>
> > >> driverdev-  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > >> apw@canonical.com; KY  >>  Srinivasan; vkuznets@redhat.com; Haiyang
> > >> Zhang  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete
> > >> message on  >> transfer  >>  failure  >>  On Fri, Nov 28, 2014 at
> > >> 4:36 PM, Dexuan Cui <decui@microsoft.com>  >> wrote:
> > >>  >>  >>  -----Original Message-----  >>  >>  From: Jason Wang
> > >> [mailto:jasowang@redhat.com]  >>  >>  Sent: Friday, November 28,
> > >> 2014 14:47 PM  >>  >>  To: Dexuan Cui  >>  >>  Cc:
> > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>  >>
> > >> driverdev-  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > >> >> apw@canonical.com; KY  >>  >>  Srinivasan; vkuznets@redhat.com;
> > >> Haiyang Zhang  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop
> > >> the obsolete message  >> on  >>  >> transfer  >>  >>  failure  >>
> > >> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui  >>
> > >> <decui@microsoft.com>  >>  >> wrote:
> > >>  >>  >>  > In the case the user-space daemon crashes, hangs or is
> > >> >> killed, we  >>  >>  > need to down the semaphore, otherwise,
> > >> after the daemon starts  >>  >> next  >>  >>  > time, the obsolete
> > >> data in fcopy_transaction.message or  >>  >>  >
> > >> fcopy_transaction.fcopy_msg will be used immediately.
> > >>  >>  >>  >
> > >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>  >>  >>  > Cc:
> > >> Vitaly Kuznetsov <vkuznets@redhat.com>  >>  >>  > Cc: K. Y.
> > >> Srinivasan <kys@microsoft.com>  >>  >>  > Signed-off-by: Dexuan Cui
> > >> <decui@microsoft.com>  >>  >>  > ---  >>  >>  >  >>  >>  > v2: I
> > >> removed the "FCP" prefix as Greg asked.
> > >>  >>  >>  >
> > >>  >>  >>  >     I also updated the output message a little:
> > >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
> > >>  >>  >>  >     "can not acquire the semaphore: it is benign"
> > >>  >>  >>  >
> > >>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> > >> >> suggested.
> > >>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and
> > >> added a
> > >>  >>  >>  > comment instead.
> > >>  >>  >>  >
> > >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++  >>  >>
> > >> >  1 file changed, 19 insertions(+)  >>  >>  >  >>  >>  > diff
> > >> --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c  >>  >>  >
> > >> index 23b2ce2..faa6ba6 100644  >>  >>  > ---
> > >> a/drivers/hv/hv_fcopy.c  >>  >>  > +++ b/drivers/hv/hv_fcopy.c  >>
> > >> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct  >>
> > >> work_struct  >>  >>  > *dummy)
> > >>  >>  >>  >  	 * process the pending transaction.
> > >>  >>  >>  >  	 */
> > >>  >>  >>  >  	fcopy_respond_to_host(HV_E_FAIL);
> > >>  >>  >>  > +
> > >>  >>  >>  > +	/* In the case the user-space daemon crashes, hangs
> or
> > >> is
> > >>  >>  >> killed, we
> > >>  >>  >>  > +	 * need to down the semaphore, otherwise, after
> the
> > >> daemon
> > >>  >>  >> starts
> > >>  >>  >>  > next
> > >>  >>  >>  > +	 * time, the obsolete data in
> fcopy_transaction.message
> > >> or
> > >>  >>  >>  > +	 * fcopy_transaction.fcopy_msg will be used
> immediately.
> > >>  >>  >>  > +	 *
> > >>  >>  >>  > +	 * NOTE: fcopy_read() happens to get the
> semaphore (very
> > >>  >> rare)?
> > >>  >>  >>  > We're
> > >>  >>  >>  > +	 * still OK, because we've reported the failure to the
> > >> host.
> > >>  >>  >>  > +	 */
> > >>  >>  >>  > +	if (down_trylock(&fcopy_transaction.read_sema))
> > >>  >>  >>  > +		;
> > >>  >>  >>
> > >>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> > >>  >>  >>
> > >>  >>  >>  Btw, a question not relate to this patch.
> > >>  >>  >>
> > >>  >>  >>  What happens if a daemon is resume from SIGSTOP and
> > >> expires the  >>  >> check  >>  >>  here?
> > >>  >>  > Hi Jason,
> > >>  >>  > My idea is: here we need down_trylock(), but in case we
> > >> can't get  >> the  >>  > semaphore, it's OK anyway:
> > >>  >>  >
> > >>  >>  > Scenario 1):
> > >>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> > >> >> receives  >>  > SIGSTOP;  >>  > 1.2: the host user runs the
> > >> PowerShell Copy-VMFile command;  >>  > 1.3.1: the driver reports
> > >> the failure to the host user in 5s and  >>  > 1.3.2: the driver
> > >> down()-es the semaphore;  >>  > 1.4: the daemon receives SIGCONT
> > >> and it will be still blocked on  >> the  >>  > pread().
> > >>  >>  > Without the down_trylock(), in 1.4, the daemon can receive
> > >> an  >>  > obsolete message.
> > >>  >>  > NOTE: in this scenario, the daemon is not killed.
> > >>  >>  >
> > >>  >>  > Scenario 2):
> > >>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1
> > >> and  >> 1.3.2  >>  > and  >>  > do down() in fcopy_read(), it will
> > >> receive the message but:
> > >> the
> > >>  >>  > driver has
> > >>  >>  > reported the failure to the host user and the driver's 1.3.2
> > >> can't  >>  > get the  >>  > semaphore -- IMO this is acceptably OK,
> > >> though in the VM, an  >>  > incomplete  >>  > file will be left
> > >> there.
> > >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add
> > >> a  >>  > close(target_fd) before open()-ing a new one.
> > >>  >>
> > >>  >>  Right, but how about the case when resuming from SIGSTOP but
> > >> no  >> timeout?
> > >>  > Sorry, I don't understand this:
> > >>  > if no timeout, fcopy_read() will get the semaphore and
> > >> fcopy_write()
> > >>  > will try to cancel fcopy_work.
> > >>
> > >>  Yes.
> > >>  >
> > >>  >
> > >>  >>  Looks like in this case userspace() may wait in
> > >> down_interruptible()
> > >>  >>  until timeout. We probably need something like this:
> > >>  >>
> > >>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> > >>  >>                  up(&fcopy_transaction.read_sema);
> > >>  >>                  return -EINTR;
> > >>  >>          }
> > >>  > until "timeout"?
> > >>  > if the daemon can't get the semaphore, it can only be wake by a
> > >> > signal(the  > daemon doesn't install handler, so by default most
> > >> signals will kill  > the daemon).
> > >>  > In case a signal waking up the daemon doesn't kill the daemon,
> > >> why  > should  > we do up()?
> > >>
> > >>  True, no need since we do down_trylock() in release().
> > >>
> > >>  Btw, there's no EINTR handling in handling pread() return value,
> > >> may add such one which should be useful for something like
> > >> debugging.
> > >>
> > >>  >
> > >>  >
> > >>  >>
> > >>  >>  This should synchronize with the timeout work for sure.
> > >>  >>  But how about only schedule it after this?
> > >>  >>  It does not may sense to start the timer during interrupt  >>
> > >> since the file may not even opened and it may take time  >>  to
> > >> handle signals?
> > >>  >>
> > >>  >>  >
> > >>  >>  >>  >
> > >>  >>  >>  > +
> > >>  >>  >>  >  }
> > >>  >>  >>  >
> > >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)  >>  >>
> > >> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode  >>
> > >> *inode,  >>  >>  > struct file *f)
> > >>  >>  >>  >  	 */
> > >>  >>  >>  >  	in_hand_shake = true;
> > >>  >>  >>  >  	opened = false;
> > >>  >>  >>  > +
> > >>  >>  >>  > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > >>  >>  >>  > +		/* We haven't up()-ed the semaphore(very
> rare)? */
> > >>  >>  >>  > +		if
> (down_trylock(&fcopy_transaction.read_sema))
> > >>  >>  >>  > +			;
> > >>  >>  >>
> > >>  >>  >>  And this.
> > >>  >>  >
> > >>  >>  > Scenario 3):
> > >>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> > >> >>  > fcopy_work  >>  > pending (scheduled but not start to run
> > >> yet), we should cancel the  >>  > work (as you suggested) and
> > >> down() the semaphore, otherwise, the  >>  > obsolete message will
> > >> be received by the next instance of the  >> daemon.
> > >>  >>
> > >>  >>  Yes
> > >>  >>  >
> > >>  >>  >
> > >>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> > >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> > >>  >>  >         ----> if fcopy_release() is running on another vcpu,
> > >> just
> > >>  >>  > before the next line?
> > >>  >>  >         fcopy_send_data();
> > >>  >>  >
> > >>  >>  > In this case, fcopy_release() can cancel fcopy_work, but  >>
> > >> > can't get the semaphore since it hasn't been up()-ed.
> > >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later,
> > >> and
> > >>  >> we'll
> > >>  >>  > buffer an obsolete message in the driver, and the message
> > >> will be  >>  > fetched by the next instance of the daemon...
> > >>  >>  >
> > >>  >>  > Looks we need a spinlock here?
> > >>  >>
> > >>  >>  Unless fcopy_release() can wait for all data for current
> > >> transation  >>  to be received. Spinlock won't help.
> > >>  >>
> > >>  >>  But an idea is let the daemon the handle such cases. E.g make
> > >> sure  >> the  >>  processing begins with START_COPY and end with
> > >> COMPLETE/CANCEL_COPY.
> > >>  >>  Drop all requests that does not start with START_COPY.
> > >>  >>
> > >>  >>  Thought?
> > >>  > Good idea.
> > >>  > I also think we should reinforce the concept of state machine in
> > >> the  > daemon code.
> > >>
> > >>  Yes, it needs.
> > > I agree.
> > > Obviously we can do something to make the daemon/driver work better
> > > in the corner cases.
> > >
> > >>  >
> > >>  > The daemon/driver communication has so many corner cases...
> > >>
> > >>  Looks so, let's first address the issue mentioned in this patch.
> > > OK.
> > >
> > >>  I don't have any more comments other than changing
> > >>
> > >>   if(down_trylock(&fcopy_transaction.read_sema))
> > >>       ;
> > >>
> > >>  to
> > >>
> > >>  down_trylock(&fcopy_transaction.read_sema);
> > > Hi Jason,
> > > This is to address Vitaly's comment in the bugzilla:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> > >
> > > down_trylock(&fcopy_transaction.read_sema) will
> > >
> > > "
> > > produces the following compile warning:
> > > drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
> > > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
> > > ‘down_trylock’, declared with attribute warn_unused_result
> > > [-Wunused-result]
> > >   (void)down_trylock(&fcopy_transaction.read_sema);
> > > "
> > >
> > > Actually I personally don't care about the warning, because we only
> > > see it when we run some kind of code checker program.  :-)
> > >
> > > I can change my v3  to the "normal" style you prefer, if there is no
> > > strong objection from Vitaly?
> >
> > Ah, I see the point. Then no objection for this patch.
> >
> > Since Vitaly said he does not has objection.
> >
> > Acked-by: Jason Wang <jasowang@redhat.com>
> 
> I realized non-trivial effort might be needed to better handle corner cases.
> This is only the first step.
> 
> E.g., I found a new corner case:
> after we unload the hv_utils/fcopy driver, the fcopy daemon can exit and
> fcopy_release() will be invoked -- at this time fcopy_release() has become
> obsolete and invalid, and as a result, a kernel panic can happen:  I did see the
> panic once.

You should not be able to unload the driver when there is still an open reference from the
Daemon.

K. Y
> 
> -- Dexuan

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-12-01 15:54                         ` KY Srinivasan
@ 2014-12-02  5:33                           ` Dexuan Cui
  -1 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-12-02  5:33 UTC (permalink / raw)
  To: KY Srinivasan, Jason Wang
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, vkuznets,
	Haiyang Zhang

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 17744 bytes --]

> -----Original Message-----
> From: KY Srinivasan
> Sent: Monday, December 1, 2014 23:55 PM
> To: Dexuan Cui; Jason Wang
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> 
> 
> > -----Original Message-----
> > From: Dexuan Cui
> > Sent: Monday, December 1, 2014 3:01 AM
> > To: Jason Wang
> > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> > devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> > Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> > failure
> >
> > > -----Original Message-----
> > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > Sent: Monday, December 1, 2014 18:18 PM
> > > To: Dexuan Cui
> > > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
> > > apw@canonical.com; KY Srinivasan; vkuznets@redhat.com; Haiyang
> Zhang
> > > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > > transfer failure
> > >
> > >
> > >
> > > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com>
> > wrote:
> > > >>  -----Original Message-----
> > > >>  From: Jason Wang [mailto:jasowang@redhat.com]
> > > >>  Sent: Monday, December 1, 2014 16:23 PM
> > > >>  To: Dexuan Cui
> > > >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > > >> driverdev-
> > > >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> > > >> KY  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > > >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > > >> transfer  failure  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui
> > > >> <decui@microsoft.com>
> > > >> wrote:
> > > >>  >>  -----Original Message-----
> > > >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]  >>  Sent:
> > > >> Friday, November 28, 2014 18:13 PM  >>  To: Dexuan Cui  >>  Cc:
> > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>
> > > >> driverdev-  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > > >> apw@canonical.com; KY  >>  Srinivasan; vkuznets@redhat.com;
> Haiyang
> > > >> Zhang  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete
> > > >> message on  >> transfer  >>  failure  >>  On Fri, Nov 28, 2014 at
> > > >> 4:36 PM, Dexuan Cui <decui@microsoft.com>  >> wrote:
> > > >>  >>  >>  -----Original Message-----  >>  >>  From: Jason Wang
> > > >> [mailto:jasowang@redhat.com]  >>  >>  Sent: Friday, November 28,
> > > >> 2014 14:47 PM  >>  >>  To: Dexuan Cui  >>  >>  Cc:
> > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>  >>
> > > >> driverdev-  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > > >> >> apw@canonical.com; KY  >>  >>  Srinivasan; vkuznets@redhat.com;
> > > >> Haiyang Zhang  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop
> > > >> the obsolete message  >> on  >>  >> transfer  >>  >>  failure  >>
> > > >> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui  >>
> > > >> <decui@microsoft.com>  >>  >> wrote:
> > > >>  >>  >>  > In the case the user-space daemon crashes, hangs or is
> > > >> >> killed, we  >>  >>  > need to down the semaphore, otherwise,
> > > >> after the daemon starts  >>  >> next  >>  >>  > time, the obsolete
> > > >> data in fcopy_transaction.message or  >>  >>  >
> > > >> fcopy_transaction.fcopy_msg will be used immediately.
> > > >>  >>  >>  >
> > > >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>  >>  >>  > Cc:
> > > >> Vitaly Kuznetsov <vkuznets@redhat.com>  >>  >>  > Cc: K. Y.
> > > >> Srinivasan <kys@microsoft.com>  >>  >>  > Signed-off-by: Dexuan Cui
> > > >> <decui@microsoft.com>  >>  >>  > ---  >>  >>  >  >>  >>  > v2: I
> > > >> removed the "FCP" prefix as Greg asked.
> > > >>  >>  >>  >
> > > >>  >>  >>  >     I also updated the output message a little:
> > > >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
> > > >>  >>  >>  >     "can not acquire the semaphore: it is benign"
> > > >>  >>  >>  >
> > > >>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> > > >> >> suggested.
> > > >>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and
> > > >> added a
> > > >>  >>  >>  > comment instead.
> > > >>  >>  >>  >
> > > >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++  >>  >>
> > > >> >  1 file changed, 19 insertions(+)  >>  >>  >  >>  >>  > diff
> > > >> --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c  >>  >>  >
> > > >> index 23b2ce2..faa6ba6 100644  >>  >>  > ---
> > > >> a/drivers/hv/hv_fcopy.c  >>  >>  > +++ b/drivers/hv/hv_fcopy.c  >>
> > > >> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct  >>
> > > >> work_struct  >>  >>  > *dummy)
> > > >>  >>  >>  >   * process the pending transaction.
> > > >>  >>  >>  >   */
> > > >>  >>  >>  >  fcopy_respond_to_host(HV_E_FAIL);
> > > >>  >>  >>  > +
> > > >>  >>  >>  > +/* In the case the user-space daemon crashes, hangs
> > or
> > > >> is
> > > >>  >>  >> killed, we
> > > >>  >>  >>  > + * need to down the semaphore, otherwise, after
> > the
> > > >> daemon
> > > >>  >>  >> starts
> > > >>  >>  >>  > next
> > > >>  >>  >>  > + * time, the obsolete data in
> > fcopy_transaction.message
> > > >> or
> > > >>  >>  >>  > + * fcopy_transaction.fcopy_msg will be used
> > immediately.
> > > >>  >>  >>  > + *
> > > >>  >>  >>  > + * NOTE: fcopy_read() happens to get the
> > semaphore (very
> > > >>  >> rare)?
> > > >>  >>  >>  > We're
> > > >>  >>  >>  > + * still OK, because we've reported the failure to the
> > > >> host.
> > > >>  >>  >>  > + */
> > > >>  >>  >>  > +if (down_trylock(&fcopy_transaction.read_sema))
> > > >>  >>  >>  > +;
> > > >>  >>  >>
> > > >>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> > > >>  >>  >>
> > > >>  >>  >>  Btw, a question not relate to this patch.
> > > >>  >>  >>
> > > >>  >>  >>  What happens if a daemon is resume from SIGSTOP and
> > > >> expires the  >>  >> check  >>  >>  here?
> > > >>  >>  > Hi Jason,
> > > >>  >>  > My idea is: here we need down_trylock(), but in case we
> > > >> can't get  >> the  >>  > semaphore, it's OK anyway:
> > > >>  >>  >
> > > >>  >>  > Scenario 1):
> > > >>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> > > >> >> receives  >>  > SIGSTOP;  >>  > 1.2: the host user runs the
> > > >> PowerShell Copy-VMFile command;  >>  > 1.3.1: the driver reports
> > > >> the failure to the host user in 5s and  >>  > 1.3.2: the driver
> > > >> down()-es the semaphore;  >>  > 1.4: the daemon receives SIGCONT
> > > >> and it will be still blocked on  >> the  >>  > pread().
> > > >>  >>  > Without the down_trylock(), in 1.4, the daemon can receive
> > > >> an  >>  > obsolete message.
> > > >>  >>  > NOTE: in this scenario, the daemon is not killed.
> > > >>  >>  >
> > > >>  >>  > Scenario 2):
> > > >>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1
> > > >> and  >> 1.3.2  >>  > and  >>  > do down() in fcopy_read(), it will
> > > >> receive the message but:
> > > >> the
> > > >>  >>  > driver has
> > > >>  >>  > reported the failure to the host user and the driver's 1.3.2
> > > >> can't  >>  > get the  >>  > semaphore -- IMO this is acceptably OK,
> > > >> though in the VM, an  >>  > incomplete  >>  > file will be left
> > > >> there.
> > > >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add
> > > >> a  >>  > close(target_fd) before open()-ing a new one.
> > > >>  >>
> > > >>  >>  Right, but how about the case when resuming from SIGSTOP but
> > > >> no  >> timeout?
> > > >>  > Sorry, I don't understand this:
> > > >>  > if no timeout, fcopy_read() will get the semaphore and
> > > >> fcopy_write()
> > > >>  > will try to cancel fcopy_work.
> > > >>
> > > >>  Yes.
> > > >>  >
> > > >>  >
> > > >>  >>  Looks like in this case userspace() may wait in
> > > >> down_interruptible()
> > > >>  >>  until timeout. We probably need something like this:
> > > >>  >>
> > > >>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> > > >>  >>                  up(&fcopy_transaction.read_sema);
> > > >>  >>                  return -EINTR;
> > > >>  >>          }
> > > >>  > until "timeout"?
> > > >>  > if the daemon can't get the semaphore, it can only be wake by a
> > > >> > signal(the  > daemon doesn't install handler, so by default most
> > > >> signals will kill  > the daemon).
> > > >>  > In case a signal waking up the daemon doesn't kill the daemon,
> > > >> why  > should  > we do up()?
> > > >>
> > > >>  True, no need since we do down_trylock() in release().
> > > >>
> > > >>  Btw, there's no EINTR handling in handling pread() return value,
> > > >> may add such one which should be useful for something like
> > > >> debugging.
> > > >>
> > > >>  >
> > > >>  >
> > > >>  >>
> > > >>  >>  This should synchronize with the timeout work for sure.
> > > >>  >>  But how about only schedule it after this?
> > > >>  >>  It does not may sense to start the timer during interrupt  >>
> > > >> since the file may not even opened and it may take time  >>  to
> > > >> handle signals?
> > > >>  >>
> > > >>  >>  >
> > > >>  >>  >>  >
> > > >>  >>  >>  > +
> > > >>  >>  >>  >  }
> > > >>  >>  >>  >
> > > >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)  >>  >>
> > > >> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode  >>
> > > >> *inode,  >>  >>  > struct file *f)
> > > >>  >>  >>  >   */
> > > >>  >>  >>  >  in_hand_shake = true;
> > > >>  >>  >>  >  opened = false;
> > > >>  >>  >>  > +
> > > >>  >>  >>  > +if (cancel_delayed_work_sync(&fcopy_work)) {
> > > >>  >>  >>  > +/* We haven't up()-ed the semaphore(very
> > rare)? */
> > > >>  >>  >>  > +if
> > (down_trylock(&fcopy_transaction.read_sema))
> > > >>  >>  >>  > +;
> > > >>  >>  >>
> > > >>  >>  >>  And this.
> > > >>  >>  >
> > > >>  >>  > Scenario 3):
> > > >>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> > > >> >>  > fcopy_work  >>  > pending (scheduled but not start to run
> > > >> yet), we should cancel the  >>  > work (as you suggested) and
> > > >> down() the semaphore, otherwise, the  >>  > obsolete message will
> > > >> be received by the next instance of the  >> daemon.
> > > >>  >>
> > > >>  >>  Yes
> > > >>  >>  >
> > > >>  >>  >
> > > >>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> > > >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> > > >>  >>  >         ----> if fcopy_release() is running on another vcpu,
> > > >> just
> > > >>  >>  > before the next line?
> > > >>  >>  >         fcopy_send_data();
> > > >>  >>  >
> > > >>  >>  > In this case, fcopy_release() can cancel fcopy_work, but  >>
> > > >> > can't get the semaphore since it hasn't been up()-ed.
> > > >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later,
> > > >> and
> > > >>  >> we'll
> > > >>  >>  > buffer an obsolete message in the driver, and the message
> > > >> will be  >>  > fetched by the next instance of the daemon...
> > > >>  >>  >
> > > >>  >>  > Looks we need a spinlock here?
> > > >>  >>
> > > >>  >>  Unless fcopy_release() can wait for all data for current
> > > >> transation  >>  to be received. Spinlock won't help.
> > > >>  >>
> > > >>  >>  But an idea is let the daemon the handle such cases. E.g make
> > > >> sure  >> the  >>  processing begins with START_COPY and end with
> > > >> COMPLETE/CANCEL_COPY.
> > > >>  >>  Drop all requests that does not start with START_COPY.
> > > >>  >>
> > > >>  >>  Thought?
> > > >>  > Good idea.
> > > >>  > I also think we should reinforce the concept of state machine in
> > > >> the  > daemon code.
> > > >>
> > > >>  Yes, it needs.
> > > > I agree.
> > > > Obviously we can do something to make the daemon/driver work better
> > > > in the corner cases.
> > > >
> > > >>  >
> > > >>  > The daemon/driver communication has so many corner cases...
> > > >>
> > > >>  Looks so, let's first address the issue mentioned in this patch.
> > > > OK.
> > > >
> > > >>  I don't have any more comments other than changing
> > > >>
> > > >>   if(down_trylock(&fcopy_transaction.read_sema))
> > > >>       ;
> > > >>
> > > >>  to
> > > >>
> > > >>  down_trylock(&fcopy_transaction.read_sema);
> > > > Hi Jason,
> > > > This is to address Vitaly's comment in the bugzilla:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> > > >
> > > > down_trylock(&fcopy_transaction.read_sema) will
> > > >
> > > > "
> > > > produces the following compile warning:
> > > > drivers/hv/hv_fcopy.c: In function â€˜fcopy_work_funcâ€™:
> > > > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
> > > > â€˜down_trylockâ€™, declared with attribute warn_unused_result
> > > > [-Wunused-result]
> > > >   (void)down_trylock(&fcopy_transaction.read_sema);
> > > > "
> > > >
> > > > Actually I personally don't care about the warning, because we only
> > > > see it when we run some kind of code checker program.  :-)
> > > >
> > > > I can change my v3  to the "normal" style you prefer, if there is no
> > > > strong objection from Vitaly?
> > >
> > > Ah, I see the point. Then no objection for this patch.
> > >
> > > Since Vitaly said he does not has objection.
> > >
> > > Acked-by: Jason Wang <jasowang@redhat.com>
> >
> > I realized non-trivial effort might be needed to better handle corner cases.
> > This is only the first step.
> >
> > E.g., I found a new corner case:
> > after we unload the hv_utils/fcopy driver, the fcopy daemon can exit and
> > fcopy_release() will be invoked -- at this time fcopy_release() has become
> > obsolete and invalid, and as a result, a kernel panic can happen:  I did see
> the
> > panic once.
> 
> You should not be able to unload the driver when there is still an open
> reference from the
> Daemon.
> 
> K. Y

Hi KY, I'm able to reproduce the below issue every time:
(I ran the below in Ubuntu 14.10 but I think we're able to repro it in any distro):

1) start hv_fcopy_daemon and use ps to make sure it's running in the background;
2) kill -STOP `pidof hv_fcopy_daemon`, and use ps to make sure the daemon is in
the state of "Ts";
3) rmmod hv_utils  # dexuan: this can runs fine. 
4) kill -CONT `pidof hv_fcopy_daemon` and we see an oops at once:

[ 9699.421564] BUG: unable to handle kernel paging request at ffffffffc019c230
[ 9699.422701] IP: [<ffffffff811de294>] filp_close+0x24/0x70
[ 9699.423154] PGD 1c16067 PUD 1c18067 PMD 366d8067 PTE 0
[ 9699.423154] Oops: 0000 [#1] SMP
[ 9699.423154] Modules linked in: hv_utils rfcomm bnep bluetooth 6lowpan_iphc joydev hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd hid_hyperv serio_raw hid hyperv_keyboard hv_balloon mac_hid hyperv_fb i2c_piix4 parport_pc ppdev lp parport hv_storvsc psmouse floppy hv_vmbus tulip pata_acpi [last unloaded: hv_utils]
[ 9699.423154] CPU: 0 PID: 4659 Comm: hv_fcopy_daemon Not tainted 3.16.0-23-generic #31-Ubuntu
[ 9699.423154] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[ 9699.423154] task: ffff880019ce5bb0 ti: ffff8800c7be0000 task.ti: ffff8800c7be0000
[ 9699.423154] RIP: 0010:[<ffffffff811de294>]  [<ffffffff811de294>] filp_close+0x24/0x70
[ 9699.423154] RSP: 0018:ffff8800c7be3e48  EFLAGS: 00010246
[ 9699.423154] RAX: ffffffffc019c1c0 RBX: ffff8800c7b9c200 RCX: ffff8800396548d8
[ 9699.423154] RDX: 0000000080000000 RSI: ffff880039654880 RDI: ffff8800c7b9c200
[ 9699.423154] RBP: ffff8800c7be3e60 R08: 0000000000000001 R09: 00000001802a0017
[ 9699.423154] R10: ffff88003de17640 R11: ffff880087963140 R12: 0000000000000000
[ 9699.423154] R13: ffff880039654880 R14: ffff880039654890 R15: 0000000000000004
[ 9699.423154] FS:  0000000000000000(0000) GS:ffff88003de00000(0000) knlGS:0000000000000000
[ 9699.423154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9699.423154] CR2: ffffffffc019c230 CR3: 0000000001c13000 CR4: 00000000001406f0
[ 9699.423154] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9699.423154] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9699.423154] Stack:
[ 9699.423154]  0000000000000001 0000000000000000 ffff880039654880 ffff8800c7be3ea0
[ 9699.423154]  ffffffff811ff208 00000001034a4c80 ffff880019ce5bb0 ffff880039654880
[ 9699.423154]  ffff880019ce6308 ffff8800126b8e00 ffff8800126b8e60 ffff8800c7be3ec8
[ 9699.423154] Call Trace:
[ 9699.423154]  [<ffffffff811ff208>] put_files_struct+0x78/0xd0
[ 9699.423154]  [<ffffffff811ff307>] exit_files+0x47/0x50
[ 9699.423154]  [<ffffffff810725a7>] do_exit+0x357/0xab0
[ 9699.423154]  [<ffffffff81072d83>] do_group_exit+0x43/0xc0
[ 9699.423154]  [<ffffffff81072e14>] SyS_exit_group+0x14/0x20
[ 9699.423154]  [<ffffffff81787ced>] system_call_fastpath+0x1a/0x1f
[ 9699.423154] Code: ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 47 38 48 89 fb 48 85 c0 74 46 48 8b 47 28 45 31 e4 49 89 f5 <48> 8b 40 70 48 85 c0 74 05 ff d0 41 89 c4 f6 43 45 40 75 16 4c
[ 9699.423154] RIP  [<ffffffff811de294>] filp_close+0x24/0x70
[ 9699.423154]  RSP <ffff8800c7be3e48>
[ 9699.423154] CR2: ffffffffc019c230
[ 9699.423154] ---[ end trace 0278723740833669 ]---
[ 9699.423154] Fixing recursive fault but reboot is needed!

I understand an administrator should not unload the driver before stopping the
daemon, but this could happen by accident.

 -- Dexuan

ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-02  5:33                           ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2014-12-02  5:33 UTC (permalink / raw)
  To: KY Srinivasan, Jason Wang
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw

> -----Original Message-----
> From: KY Srinivasan
> Sent: Monday, December 1, 2014 23:55 PM
> To: Dexuan Cui; Jason Wang
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> vkuznets@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> 
> 
> > -----Original Message-----
> > From: Dexuan Cui
> > Sent: Monday, December 1, 2014 3:01 AM
> > To: Jason Wang
> > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> > devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; KY
> > Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> > failure
> >
> > > -----Original Message-----
> > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > Sent: Monday, December 1, 2014 18:18 PM
> > > To: Dexuan Cui
> > > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
> > > apw@canonical.com; KY Srinivasan; vkuznets@redhat.com; Haiyang
> Zhang
> > > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > > transfer failure
> > >
> > >
> > >
> > > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com>
> > wrote:
> > > >>  -----Original Message-----
> > > >>  From: Jason Wang [mailto:jasowang@redhat.com]
> > > >>  Sent: Monday, December 1, 2014 16:23 PM
> > > >>  To: Dexuan Cui
> > > >>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > > >> driverdev-
> > > >>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> > > >> KY  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
> > > >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > > >> transfer  failure  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui
> > > >> <decui@microsoft.com>
> > > >> wrote:
> > > >>  >>  -----Original Message-----
> > > >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]  >>  Sent:
> > > >> Friday, November 28, 2014 18:13 PM  >>  To: Dexuan Cui  >>  Cc:
> > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>
> > > >> driverdev-  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > > >> apw@canonical.com; KY  >>  Srinivasan; vkuznets@redhat.com;
> Haiyang
> > > >> Zhang  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete
> > > >> message on  >> transfer  >>  failure  >>  On Fri, Nov 28, 2014 at
> > > >> 4:36 PM, Dexuan Cui <decui@microsoft.com>  >> wrote:
> > > >>  >>  >>  -----Original Message-----  >>  >>  From: Jason Wang
> > > >> [mailto:jasowang@redhat.com]  >>  >>  Sent: Friday, November 28,
> > > >> 2014 14:47 PM  >>  >>  To: Dexuan Cui  >>  >>  Cc:
> > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>  >>
> > > >> driverdev-  >>  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
> > > >> >> apw@canonical.com; KY  >>  >>  Srinivasan; vkuznets@redhat.com;
> > > >> Haiyang Zhang  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop
> > > >> the obsolete message  >> on  >>  >> transfer  >>  >>  failure  >>
> > > >> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui  >>
> > > >> <decui@microsoft.com>  >>  >> wrote:
> > > >>  >>  >>  > In the case the user-space daemon crashes, hangs or is
> > > >> >> killed, we  >>  >>  > need to down the semaphore, otherwise,
> > > >> after the daemon starts  >>  >> next  >>  >>  > time, the obsolete
> > > >> data in fcopy_transaction.message or  >>  >>  >
> > > >> fcopy_transaction.fcopy_msg will be used immediately.
> > > >>  >>  >>  >
> > > >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>  >>  >>  > Cc:
> > > >> Vitaly Kuznetsov <vkuznets@redhat.com>  >>  >>  > Cc: K. Y.
> > > >> Srinivasan <kys@microsoft.com>  >>  >>  > Signed-off-by: Dexuan Cui
> > > >> <decui@microsoft.com>  >>  >>  > ---  >>  >>  >  >>  >>  > v2: I
> > > >> removed the "FCP" prefix as Greg asked.
> > > >>  >>  >>  >
> > > >>  >>  >>  >     I also updated the output message a little:
> > > >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
> > > >>  >>  >>  >     "can not acquire the semaphore: it is benign"
> > > >>  >>  >>  >
> > > >>  >>  >>  > v3: I added the code in fcopy_release() as Jason Wang
> > > >> >> suggested.
> > > >>  >>  >>  >     I removed the pr_debug (it isn't so meaningful)and
> > > >> added a
> > > >>  >>  >>  > comment instead.
> > > >>  >>  >>  >
> > > >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++  >>  >>
> > > >> >  1 file changed, 19 insertions(+)  >>  >>  >  >>  >>  > diff
> > > >> --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c  >>  >>  >
> > > >> index 23b2ce2..faa6ba6 100644  >>  >>  > ---
> > > >> a/drivers/hv/hv_fcopy.c  >>  >>  > +++ b/drivers/hv/hv_fcopy.c  >>
> > > >> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct  >>
> > > >> work_struct  >>  >>  > *dummy)
> > > >>  >>  >>  >   * process the pending transaction.
> > > >>  >>  >>  >   */
> > > >>  >>  >>  >  fcopy_respond_to_host(HV_E_FAIL);
> > > >>  >>  >>  > +
> > > >>  >>  >>  > +/* In the case the user-space daemon crashes, hangs
> > or
> > > >> is
> > > >>  >>  >> killed, we
> > > >>  >>  >>  > + * need to down the semaphore, otherwise, after
> > the
> > > >> daemon
> > > >>  >>  >> starts
> > > >>  >>  >>  > next
> > > >>  >>  >>  > + * time, the obsolete data in
> > fcopy_transaction.message
> > > >> or
> > > >>  >>  >>  > + * fcopy_transaction.fcopy_msg will be used
> > immediately.
> > > >>  >>  >>  > + *
> > > >>  >>  >>  > + * NOTE: fcopy_read() happens to get the
> > semaphore (very
> > > >>  >> rare)?
> > > >>  >>  >>  > We're
> > > >>  >>  >>  > + * still OK, because we've reported the failure to the
> > > >> host.
> > > >>  >>  >>  > + */
> > > >>  >>  >>  > +if (down_trylock(&fcopy_transaction.read_sema))
> > > >>  >>  >>  > +;
> > > >>  >>  >>
> > > >>  >>  >>  Sorry, I'm not quite understand how if () ; can help here.
> > > >>  >>  >>
> > > >>  >>  >>  Btw, a question not relate to this patch.
> > > >>  >>  >>
> > > >>  >>  >>  What happens if a daemon is resume from SIGSTOP and
> > > >> expires the  >>  >> check  >>  >>  here?
> > > >>  >>  > Hi Jason,
> > > >>  >>  > My idea is: here we need down_trylock(), but in case we
> > > >> can't get  >> the  >>  > semaphore, it's OK anyway:
> > > >>  >>  >
> > > >>  >>  > Scenario 1):
> > > >>  >>  > 1.1: when the daemon is blocked on the pread(), the daemon
> > > >> >> receives  >>  > SIGSTOP;  >>  > 1.2: the host user runs the
> > > >> PowerShell Copy-VMFile command;  >>  > 1.3.1: the driver reports
> > > >> the failure to the host user in 5s and  >>  > 1.3.2: the driver
> > > >> down()-es the semaphore;  >>  > 1.4: the daemon receives SIGCONT
> > > >> and it will be still blocked on  >> the  >>  > pread().
> > > >>  >>  > Without the down_trylock(), in 1.4, the daemon can receive
> > > >> an  >>  > obsolete message.
> > > >>  >>  > NOTE: in this scenario, the daemon is not killed.
> > > >>  >>  >
> > > >>  >>  > Scenario 2):
> > > >>  >>  > In senario 1), if the daemon receives SIGCONT between 1.3.1
> > > >> and  >> 1.3.2  >>  > and  >>  > do down() in fcopy_read(), it will
> > > >> receive the message but:
> > > >> the
> > > >>  >>  > driver has
> > > >>  >>  > reported the failure to the host user and the driver's 1.3.2
> > > >> can't  >>  > get the  >>  > semaphore -- IMO this is acceptably OK,
> > > >> though in the VM, an  >>  > incomplete  >>  > file will be left
> > > >> there.
> > > >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we should add
> > > >> a  >>  > close(target_fd) before open()-ing a new one.
> > > >>  >>
> > > >>  >>  Right, but how about the case when resuming from SIGSTOP but
> > > >> no  >> timeout?
> > > >>  > Sorry, I don't understand this:
> > > >>  > if no timeout, fcopy_read() will get the semaphore and
> > > >> fcopy_write()
> > > >>  > will try to cancel fcopy_work.
> > > >>
> > > >>  Yes.
> > > >>  >
> > > >>  >
> > > >>  >>  Looks like in this case userspace() may wait in
> > > >> down_interruptible()
> > > >>  >>  until timeout. We probably need something like this:
> > > >>  >>
> > > >>  >>          if (down_interruptible(&fcopy_transaction.read_sema)) {
> > > >>  >>                  up(&fcopy_transaction.read_sema);
> > > >>  >>                  return -EINTR;
> > > >>  >>          }
> > > >>  > until "timeout"?
> > > >>  > if the daemon can't get the semaphore, it can only be wake by a
> > > >> > signal(the  > daemon doesn't install handler, so by default most
> > > >> signals will kill  > the daemon).
> > > >>  > In case a signal waking up the daemon doesn't kill the daemon,
> > > >> why  > should  > we do up()?
> > > >>
> > > >>  True, no need since we do down_trylock() in release().
> > > >>
> > > >>  Btw, there's no EINTR handling in handling pread() return value,
> > > >> may add such one which should be useful for something like
> > > >> debugging.
> > > >>
> > > >>  >
> > > >>  >
> > > >>  >>
> > > >>  >>  This should synchronize with the timeout work for sure.
> > > >>  >>  But how about only schedule it after this?
> > > >>  >>  It does not may sense to start the timer during interrupt  >>
> > > >> since the file may not even opened and it may take time  >>  to
> > > >> handle signals?
> > > >>  >>
> > > >>  >>  >
> > > >>  >>  >>  >
> > > >>  >>  >>  > +
> > > >>  >>  >>  >  }
> > > >>  >>  >>  >
> > > >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)  >>  >>
> > > >> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode  >>
> > > >> *inode,  >>  >>  > struct file *f)
> > > >>  >>  >>  >   */
> > > >>  >>  >>  >  in_hand_shake = true;
> > > >>  >>  >>  >  opened = false;
> > > >>  >>  >>  > +
> > > >>  >>  >>  > +if (cancel_delayed_work_sync(&fcopy_work)) {
> > > >>  >>  >>  > +/* We haven't up()-ed the semaphore(very
> > rare)? */
> > > >>  >>  >>  > +if
> > (down_trylock(&fcopy_transaction.read_sema))
> > > >>  >>  >>  > +;
> > > >>  >>  >>
> > > >>  >>  >>  And this.
> > > >>  >>  >
> > > >>  >>  > Scenario 3):
> > > >>  >>  > When the daemon exits(e.g., SIGKILL received), if there is a
> > > >> >>  > fcopy_work  >>  > pending (scheduled but not start to run
> > > >> yet), we should cancel the  >>  > work (as you suggested) and
> > > >> down() the semaphore, otherwise, the  >>  > obsolete message will
> > > >> be received by the next instance of the  >> daemon.
> > > >>  >>
> > > >>  >>  Yes
> > > >>  >>  >
> > > >>  >>  >
> > > >>  >>  > Scenario 4):  in the driver's hv_fcopy_onchannelcallback():
> > > >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
> > > >>  >>  >         ----> if fcopy_release() is running on another vcpu,
> > > >> just
> > > >>  >>  > before the next line?
> > > >>  >>  >         fcopy_send_data();
> > > >>  >>  >
> > > >>  >>  > In this case, fcopy_release() can cancel fcopy_work, but  >>
> > > >> > can't get the semaphore since it hasn't been up()-ed.
> > > >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  later,
> > > >> and
> > > >>  >> we'll
> > > >>  >>  > buffer an obsolete message in the driver, and the message
> > > >> will be  >>  > fetched by the next instance of the daemon...
> > > >>  >>  >
> > > >>  >>  > Looks we need a spinlock here?
> > > >>  >>
> > > >>  >>  Unless fcopy_release() can wait for all data for current
> > > >> transation  >>  to be received. Spinlock won't help.
> > > >>  >>
> > > >>  >>  But an idea is let the daemon the handle such cases. E.g make
> > > >> sure  >> the  >>  processing begins with START_COPY and end with
> > > >> COMPLETE/CANCEL_COPY.
> > > >>  >>  Drop all requests that does not start with START_COPY.
> > > >>  >>
> > > >>  >>  Thought?
> > > >>  > Good idea.
> > > >>  > I also think we should reinforce the concept of state machine in
> > > >> the  > daemon code.
> > > >>
> > > >>  Yes, it needs.
> > > > I agree.
> > > > Obviously we can do something to make the daemon/driver work better
> > > > in the corner cases.
> > > >
> > > >>  >
> > > >>  > The daemon/driver communication has so many corner cases...
> > > >>
> > > >>  Looks so, let's first address the issue mentioned in this patch.
> > > > OK.
> > > >
> > > >>  I don't have any more comments other than changing
> > > >>
> > > >>   if(down_trylock(&fcopy_transaction.read_sema))
> > > >>       ;
> > > >>
> > > >>  to
> > > >>
> > > >>  down_trylock(&fcopy_transaction.read_sema);
> > > > Hi Jason,
> > > > This is to address Vitaly's comment in the bugzilla:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
> > > >
> > > > down_trylock(&fcopy_transaction.read_sema) will
> > > >
> > > > "
> > > > produces the following compile warning:
> > > > drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
> > > > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
> > > > ‘down_trylock’, declared with attribute warn_unused_result
> > > > [-Wunused-result]
> > > >   (void)down_trylock(&fcopy_transaction.read_sema);
> > > > "
> > > >
> > > > Actually I personally don't care about the warning, because we only
> > > > see it when we run some kind of code checker program.  :-)
> > > >
> > > > I can change my v3  to the "normal" style you prefer, if there is no
> > > > strong objection from Vitaly?
> > >
> > > Ah, I see the point. Then no objection for this patch.
> > >
> > > Since Vitaly said he does not has objection.
> > >
> > > Acked-by: Jason Wang <jasowang@redhat.com>
> >
> > I realized non-trivial effort might be needed to better handle corner cases.
> > This is only the first step.
> >
> > E.g., I found a new corner case:
> > after we unload the hv_utils/fcopy driver, the fcopy daemon can exit and
> > fcopy_release() will be invoked -- at this time fcopy_release() has become
> > obsolete and invalid, and as a result, a kernel panic can happen:  I did see
> the
> > panic once.
> 
> You should not be able to unload the driver when there is still an open
> reference from the
> Daemon.
> 
> K. Y

Hi KY, I'm able to reproduce the below issue every time:
(I ran the below in Ubuntu 14.10 but I think we're able to repro it in any distro):

1) start hv_fcopy_daemon and use ps to make sure it's running in the background;
2) kill -STOP `pidof hv_fcopy_daemon`, and use ps to make sure the daemon is in
the state of "Ts";
3) rmmod hv_utils  # dexuan: this can runs fine. 
4) kill -CONT `pidof hv_fcopy_daemon` and we see an oops at once:

[ 9699.421564] BUG: unable to handle kernel paging request at ffffffffc019c230
[ 9699.422701] IP: [<ffffffff811de294>] filp_close+0x24/0x70
[ 9699.423154] PGD 1c16067 PUD 1c18067 PMD 366d8067 PTE 0
[ 9699.423154] Oops: 0000 [#1] SMP
[ 9699.423154] Modules linked in: hv_utils rfcomm bnep bluetooth 6lowpan_iphc joydev hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd hid_hyperv serio_raw hid hyperv_keyboard hv_balloon mac_hid hyperv_fb i2c_piix4 parport_pc ppdev lp parport hv_storvsc psmouse floppy hv_vmbus tulip pata_acpi [last unloaded: hv_utils]
[ 9699.423154] CPU: 0 PID: 4659 Comm: hv_fcopy_daemon Not tainted 3.16.0-23-generic #31-Ubuntu
[ 9699.423154] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[ 9699.423154] task: ffff880019ce5bb0 ti: ffff8800c7be0000 task.ti: ffff8800c7be0000
[ 9699.423154] RIP: 0010:[<ffffffff811de294>]  [<ffffffff811de294>] filp_close+0x24/0x70
[ 9699.423154] RSP: 0018:ffff8800c7be3e48  EFLAGS: 00010246
[ 9699.423154] RAX: ffffffffc019c1c0 RBX: ffff8800c7b9c200 RCX: ffff8800396548d8
[ 9699.423154] RDX: 0000000080000000 RSI: ffff880039654880 RDI: ffff8800c7b9c200
[ 9699.423154] RBP: ffff8800c7be3e60 R08: 0000000000000001 R09: 00000001802a0017
[ 9699.423154] R10: ffff88003de17640 R11: ffff880087963140 R12: 0000000000000000
[ 9699.423154] R13: ffff880039654880 R14: ffff880039654890 R15: 0000000000000004
[ 9699.423154] FS:  0000000000000000(0000) GS:ffff88003de00000(0000) knlGS:0000000000000000
[ 9699.423154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9699.423154] CR2: ffffffffc019c230 CR3: 0000000001c13000 CR4: 00000000001406f0
[ 9699.423154] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9699.423154] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9699.423154] Stack:
[ 9699.423154]  0000000000000001 0000000000000000 ffff880039654880 ffff8800c7be3ea0
[ 9699.423154]  ffffffff811ff208 00000001034a4c80 ffff880019ce5bb0 ffff880039654880
[ 9699.423154]  ffff880019ce6308 ffff8800126b8e00 ffff8800126b8e60 ffff8800c7be3ec8
[ 9699.423154] Call Trace:
[ 9699.423154]  [<ffffffff811ff208>] put_files_struct+0x78/0xd0
[ 9699.423154]  [<ffffffff811ff307>] exit_files+0x47/0x50
[ 9699.423154]  [<ffffffff810725a7>] do_exit+0x357/0xab0
[ 9699.423154]  [<ffffffff81072d83>] do_group_exit+0x43/0xc0
[ 9699.423154]  [<ffffffff81072e14>] SyS_exit_group+0x14/0x20
[ 9699.423154]  [<ffffffff81787ced>] system_call_fastpath+0x1a/0x1f
[ 9699.423154] Code: ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 47 38 48 89 fb 48 85 c0 74 46 48 8b 47 28 45 31 e4 49 89 f5 <48> 8b 40 70 48 85 c0 74 05 ff d0 41 89 c4 f6 43 45 40 75 16 4c
[ 9699.423154] RIP  [<ffffffff811de294>] filp_close+0x24/0x70
[ 9699.423154]  RSP <ffff8800c7be3e48>
[ 9699.423154] CR2: ffffffffc019c230
[ 9699.423154] ---[ end trace 0278723740833669 ]---
[ 9699.423154] Fixing recursive fault but reboot is needed!

I understand an administrator should not unload the driver before stopping the
daemon, but this could happen by accident.

 -- Dexuan

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
       [not found]                           ` <F792CF86EFE20D4AB8064279AFBA51C613E6B93B@HKNPRD3002MB017.064d.mgd.msft.net >
@ 2014-12-02  6:58                               ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-12-02  6:58 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: KY Srinivasan, gregkh, linux-kernel, driverdev-devel, olaf, apw,
	vkuznets, Haiyang Zhang



On Tue, Dec 2, 2014 at 1:33 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: KY Srinivasan
>>  Sent: Monday, December 1, 2014 23:55 PM
>>  To: Dexuan Cui; Jason Wang
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
>>  vkuznets@redhat.com; Haiyang Zhang
>>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  
>>  
>>  
>>  > -----Original Message-----
>>  > From: Dexuan Cui
>>  > Sent: Monday, December 1, 2014 3:01 AM
>>  > To: Jason Wang
>>  > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  > devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; 
>> KY
>>  > Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on transfer
>>  > failure
>>  >
>>  > > -----Original Message-----
>>  > > From: Jason Wang [mailto:jasowang@redhat.com]
>>  > > Sent: Monday, December 1, 2014 18:18 PM
>>  > > To: Dexuan Cui
>>  > > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>>  > > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
>>  > > apw@canonical.com; KY Srinivasan; vkuznets@redhat.com; Haiyang
>>  Zhang
>>  > > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on
>>  > > transfer failure
>>  > >
>>  > >
>>  > >
>>  > > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com>
>>  > wrote:
>>  > > >>  -----Original Message-----
>>  > > >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  > > >>  Sent: Monday, December 1, 2014 16:23 PM
>>  > > >>  To: Dexuan Cui
>>  > > >>  Cc: gregkh@linuxfoundation.org; 
>> linux-kernel@vger.kernel.org;
>>  > > >> driverdev-
>>  > > >>  devel@linuxdriverproject.org; olaf@aepfle.de; 
>> apw@canonical.com;
>>  > > >> KY  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  > > >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete 
>> message on
>>  > > >> transfer  failure  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan 
>> Cui
>>  > > >> <decui@microsoft.com>
>>  > > >> wrote:
>>  > > >>  >>  -----Original Message-----
>>  > > >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]  >>  Sent:
>>  > > >> Friday, November 28, 2014 18:13 PM  >>  To: Dexuan Cui  >>  
>> Cc:
>>  > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>
>>  > > >> driverdev-  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
>>  > > >> apw@canonical.com; KY  >>  Srinivasan; vkuznets@redhat.com;
>>  Haiyang
>>  > > >> Zhang  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the 
>> obsolete
>>  > > >> message on  >> transfer  >>  failure  >>  On Fri, Nov 28, 
>> 2014 at
>>  > > >> 4:36 PM, Dexuan Cui <decui@microsoft.com>  >> wrote:
>>  > > >>  >>  >>  -----Original Message-----  >>  >>  From: Jason Wang
>>  > > >> [mailto:jasowang@redhat.com]  >>  >>  Sent: Friday, November 
>> 28,
>>  > > >> 2014 14:47 PM  >>  >>  To: Dexuan Cui  >>  >>  Cc:
>>  > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  
>> >>  >>
>>  > > >> driverdev-  >>  >>  devel@linuxdriverproject.org; 
>> olaf@aepfle.de;
>>  > > >> >> apw@canonical.com; KY  >>  >>  Srinivasan; 
>> vkuznets@redhat.com;
>>  > > >> Haiyang Zhang  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: 
>> drop
>>  > > >> the obsolete message  >> on  >>  >> transfer  >>  >>  
>> failure  >>
>>  > > >> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui  >>
>>  > > >> <decui@microsoft.com>  >>  >> wrote:
>>  > > >>  >>  >>  > In the case the user-space daemon crashes, hangs 
>> or is
>>  > > >> >> killed, we  >>  >>  > need to down the semaphore, 
>> otherwise,
>>  > > >> after the daemon starts  >>  >> next  >>  >>  > time, the 
>> obsolete
>>  > > >> data in fcopy_transaction.message or  >>  >>  >
>>  > > >> fcopy_transaction.fcopy_msg will be used immediately.
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>  >>  >>  > 
>> Cc:
>>  > > >> Vitaly Kuznetsov <vkuznets@redhat.com>  >>  >>  > Cc: K. Y.
>>  > > >> Srinivasan <kys@microsoft.com>  >>  >>  > Signed-off-by: 
>> Dexuan Cui
>>  > > >> <decui@microsoft.com>  >>  >>  > ---  >>  >>  >  >>  >>  > 
>> v2: I
>>  > > >> removed the "FCP" prefix as Greg asked.
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  >     I also updated the output message a little:
>>  > > >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
>>  > > >>  >>  >>  >     "can not acquire the semaphore: it is benign"
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  > v3: I added the code in fcopy_release() as Jason 
>> Wang
>>  > > >> >> suggested.
>>  > > >>  >>  >>  >     I removed the pr_debug (it isn't so 
>> meaningful)and
>>  > > >> added a
>>  > > >>  >>  >>  > comment instead.
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++  
>> >>  >>
>>  > > >> >  1 file changed, 19 insertions(+)  >>  >>  >  >>  >>  > 
>> diff
>>  > > >> --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c  >>  
>> >>  >
>>  > > >> index 23b2ce2..faa6ba6 100644  >>  >>  > ---
>>  > > >> a/drivers/hv/hv_fcopy.c  >>  >>  > +++ 
>> b/drivers/hv/hv_fcopy.c  >>
>>  > > >> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct  
>> >>
>>  > > >> work_struct  >>  >>  > *dummy)
>>  > > >>  >>  >>  >   * process the pending transaction.
>>  > > >>  >>  >>  >   */
>>  > > >>  >>  >>  >  fcopy_respond_to_host(HV_E_FAIL);
>>  > > >>  >>  >>  > +
>>  > > >>  >>  >>  > +/* In the case the user-space daemon crashes, 
>> hangs
>>  > or
>>  > > >> is
>>  > > >>  >>  >> killed, we
>>  > > >>  >>  >>  > + * need to down the semaphore, otherwise, after
>>  > the
>>  > > >> daemon
>>  > > >>  >>  >> starts
>>  > > >>  >>  >>  > next
>>  > > >>  >>  >>  > + * time, the obsolete data in
>>  > fcopy_transaction.message
>>  > > >> or
>>  > > >>  >>  >>  > + * fcopy_transaction.fcopy_msg will be used
>>  > immediately.
>>  > > >>  >>  >>  > + *
>>  > > >>  >>  >>  > + * NOTE: fcopy_read() happens to get the
>>  > semaphore (very
>>  > > >>  >> rare)?
>>  > > >>  >>  >>  > We're
>>  > > >>  >>  >>  > + * still OK, because we've reported the failure 
>> to the
>>  > > >> host.
>>  > > >>  >>  >>  > + */
>>  > > >>  >>  >>  > +if (down_trylock(&fcopy_transaction.read_sema))
>>  > > >>  >>  >>  > +;
>>  > > >>  >>  >>
>>  > > >>  >>  >>  Sorry, I'm not quite understand how if () ; can 
>> help here.
>>  > > >>  >>  >>
>>  > > >>  >>  >>  Btw, a question not relate to this patch.
>>  > > >>  >>  >>
>>  > > >>  >>  >>  What happens if a daemon is resume from SIGSTOP and
>>  > > >> expires the  >>  >> check  >>  >>  here?
>>  > > >>  >>  > Hi Jason,
>>  > > >>  >>  > My idea is: here we need down_trylock(), but in case 
>> we
>>  > > >> can't get  >> the  >>  > semaphore, it's OK anyway:
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 1):
>>  > > >>  >>  > 1.1: when the daemon is blocked on the pread(), the 
>> daemon
>>  > > >> >> receives  >>  > SIGSTOP;  >>  > 1.2: the host user runs 
>> the
>>  > > >> PowerShell Copy-VMFile command;  >>  > 1.3.1: the driver 
>> reports
>>  > > >> the failure to the host user in 5s and  >>  > 1.3.2: the 
>> driver
>>  > > >> down()-es the semaphore;  >>  > 1.4: the daemon receives 
>> SIGCONT
>>  > > >> and it will be still blocked on  >> the  >>  > pread().
>>  > > >>  >>  > Without the down_trylock(), in 1.4, the daemon can 
>> receive
>>  > > >> an  >>  > obsolete message.
>>  > > >>  >>  > NOTE: in this scenario, the daemon is not killed.
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 2):
>>  > > >>  >>  > In senario 1), if the daemon receives SIGCONT between 
>> 1.3.1
>>  > > >> and  >> 1.3.2  >>  > and  >>  > do down() in fcopy_read(), 
>> it will
>>  > > >> receive the message but:
>>  > > >> the
>>  > > >>  >>  > driver has
>>  > > >>  >>  > reported the failure to the host user and the 
>> driver's 1.3.2
>>  > > >> can't  >>  > get the  >>  > semaphore -- IMO this is 
>> acceptably OK,
>>  > > >> though in the VM, an  >>  > incomplete  >>  > file will be 
>> left
>>  > > >> there.
>>  > > >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we 
>> should add
>>  > > >> a  >>  > close(target_fd) before open()-ing a new one.
>>  > > >>  >>
>>  > > >>  >>  Right, but how about the case when resuming from 
>> SIGSTOP but
>>  > > >> no  >> timeout?
>>  > > >>  > Sorry, I don't understand this:
>>  > > >>  > if no timeout, fcopy_read() will get the semaphore and
>>  > > >> fcopy_write()
>>  > > >>  > will try to cancel fcopy_work.
>>  > > >>
>>  > > >>  Yes.
>>  > > >>  >
>>  > > >>  >
>>  > > >>  >>  Looks like in this case userspace() may wait in
>>  > > >> down_interruptible()
>>  > > >>  >>  until timeout. We probably need something like this:
>>  > > >>  >>
>>  > > >>  >>          if 
>> (down_interruptible(&fcopy_transaction.read_sema)) {
>>  > > >>  >>                  up(&fcopy_transaction.read_sema);
>>  > > >>  >>                  return -EINTR;
>>  > > >>  >>          }
>>  > > >>  > until "timeout"?
>>  > > >>  > if the daemon can't get the semaphore, it can only be 
>> wake by a
>>  > > >> > signal(the  > daemon doesn't install handler, so by 
>> default most
>>  > > >> signals will kill  > the daemon).
>>  > > >>  > In case a signal waking up the daemon doesn't kill the 
>> daemon,
>>  > > >> why  > should  > we do up()?
>>  > > >>
>>  > > >>  True, no need since we do down_trylock() in release().
>>  > > >>
>>  > > >>  Btw, there's no EINTR handling in handling pread() return 
>> value,
>>  > > >> may add such one which should be useful for something like
>>  > > >> debugging.
>>  > > >>
>>  > > >>  >
>>  > > >>  >
>>  > > >>  >>
>>  > > >>  >>  This should synchronize with the timeout work for sure.
>>  > > >>  >>  But how about only schedule it after this?
>>  > > >>  >>  It does not may sense to start the timer during 
>> interrupt  >>
>>  > > >> since the file may not even opened and it may take time  >>  
>> to
>>  > > >> handle signals?
>>  > > >>  >>
>>  > > >>  >>  >
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  > +
>>  > > >>  >>  >>  >  }
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)  
>> >>  >>
>>  > > >> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode 
>>  >>
>>  > > >> *inode,  >>  >>  > struct file *f)
>>  > > >>  >>  >>  >   */
>>  > > >>  >>  >>  >  in_hand_shake = true;
>>  > > >>  >>  >>  >  opened = false;
>>  > > >>  >>  >>  > +
>>  > > >>  >>  >>  > +if (cancel_delayed_work_sync(&fcopy_work)) {
>>  > > >>  >>  >>  > +/* We haven't up()-ed the semaphore(very
>>  > rare)? */
>>  > > >>  >>  >>  > +if
>>  > (down_trylock(&fcopy_transaction.read_sema))
>>  > > >>  >>  >>  > +;
>>  > > >>  >>  >>
>>  > > >>  >>  >>  And this.
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 3):
>>  > > >>  >>  > When the daemon exits(e.g., SIGKILL received), if 
>> there is a
>>  > > >> >>  > fcopy_work  >>  > pending (scheduled but not start to 
>> run
>>  > > >> yet), we should cancel the  >>  > work (as you suggested) and
>>  > > >> down() the semaphore, otherwise, the  >>  > obsolete message 
>> will
>>  > > >> be received by the next instance of the  >> daemon.
>>  > > >>  >>
>>  > > >>  >>  Yes
>>  > > >>  >>  >
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 4):  in the driver's 
>> hv_fcopy_onchannelcallback():
>>  > > >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>>  > > >>  >>  >         ----> if fcopy_release() is running on 
>> another vcpu,
>>  > > >> just
>>  > > >>  >>  > before the next line?
>>  > > >>  >>  >         fcopy_send_data();
>>  > > >>  >>  >
>>  > > >>  >>  > In this case, fcopy_release() can cancel fcopy_work, 
>> but  >>
>>  > > >> > can't get the semaphore since it hasn't been up()-ed.
>>  > > >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  
>> later,
>>  > > >> and
>>  > > >>  >> we'll
>>  > > >>  >>  > buffer an obsolete message in the driver, and the 
>> message
>>  > > >> will be  >>  > fetched by the next instance of the daemon...
>>  > > >>  >>  >
>>  > > >>  >>  > Looks we need a spinlock here?
>>  > > >>  >>
>>  > > >>  >>  Unless fcopy_release() can wait for all data for current
>>  > > >> transation  >>  to be received. Spinlock won't help.
>>  > > >>  >>
>>  > > >>  >>  But an idea is let the daemon the handle such cases. 
>> E.g make
>>  > > >> sure  >> the  >>  processing begins with START_COPY and end 
>> with
>>  > > >> COMPLETE/CANCEL_COPY.
>>  > > >>  >>  Drop all requests that does not start with START_COPY.
>>  > > >>  >>
>>  > > >>  >>  Thought?
>>  > > >>  > Good idea.
>>  > > >>  > I also think we should reinforce the concept of state 
>> machine in
>>  > > >> the  > daemon code.
>>  > > >>
>>  > > >>  Yes, it needs.
>>  > > > I agree.
>>  > > > Obviously we can do something to make the daemon/driver work 
>> better
>>  > > > in the corner cases.
>>  > > >
>>  > > >>  >
>>  > > >>  > The daemon/driver communication has so many corner 
>> cases...
>>  > > >>
>>  > > >>  Looks so, let's first address the issue mentioned in this 
>> patch.
>>  > > > OK.
>>  > > >
>>  > > >>  I don't have any more comments other than changing
>>  > > >>
>>  > > >>   if(down_trylock(&fcopy_transaction.read_sema))
>>  > > >>       ;
>>  > > >>
>>  > > >>  to
>>  > > >>
>>  > > >>  down_trylock(&fcopy_transaction.read_sema);
>>  > > > Hi Jason,
>>  > > > This is to address Vitaly's comment in the bugzilla:
>>  > > > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
>>  > > >
>>  > > > down_trylock(&fcopy_transaction.read_sema) will
>>  > > >
>>  > > > "
>>  > > > produces the following compile warning:
>>  > > > drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
>>  > > > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
>>  > > > ‘down_trylock’, declared with attribute warn_unused_result
>>  > > > [-Wunused-result]
>>  > > >   (void)down_trylock(&fcopy_transaction.read_sema);
>>  > > > "
>>  > > >
>>  > > > Actually I personally don't care about the warning, because 
>> we only
>>  > > > see it when we run some kind of code checker program.  :-)
>>  > > >
>>  > > > I can change my v3  to the "normal" style you prefer, if 
>> there is no
>>  > > > strong objection from Vitaly?
>>  > >
>>  > > Ah, I see the point. Then no objection for this patch.
>>  > >
>>  > > Since Vitaly said he does not has objection.
>>  > >
>>  > > Acked-by: Jason Wang <jasowang@redhat.com>
>>  >
>>  > I realized non-trivial effort might be needed to better handle 
>> corner cases.
>>  > This is only the first step.
>>  >
>>  > E.g., I found a new corner case:
>>  > after we unload the hv_utils/fcopy driver, the fcopy daemon can 
>> exit and
>>  > fcopy_release() will be invoked -- at this time fcopy_release() 
>> has become
>>  > obsolete and invalid, and as a result, a kernel panic can happen: 
>>  I did see
>>  the
>>  > panic once.
>>  
>>  You should not be able to unload the driver when there is still an 
>> open
>>  reference from the
>>  Daemon.
>>  
>>  K. Y
> 
> Hi KY, I'm able to reproduce the below issue every time:
> (I ran the below in Ubuntu 14.10 but I think we're able to repro it 
> in any distro):
> 
> 1) start hv_fcopy_daemon and use ps to make sure it's running in the 
> background;
> 2) kill -STOP `pidof hv_fcopy_daemon`, and use ps to make sure the 
> daemon is in
> the state of "Ts";
> 3) rmmod hv_utils  # dexuan: this can runs fine. 

Probably the opened file should hold a refcnt of the module.
> 
> 4) kill -CONT `pidof hv_fcopy_daemon` and we see an oops at once:
> 
> [ 9699.421564] BUG: unable to handle kernel paging request at 
> ffffffffc019c230
> [ 9699.422701] IP: [<ffffffff811de294>] filp_close+0x24/0x70
> [ 9699.423154] PGD 1c16067 PUD 1c18067 PMD 366d8067 PTE 0
> [ 9699.423154] Oops: 0000 [#1] SMP
> [ 9699.423154] Modules linked in: hv_utils rfcomm bnep bluetooth 
> 6lowpan_iphc joydev hid_generic crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper 
> ablk_helper cryptd hid_hyperv serio_raw hid hyperv_keyboard 
> hv_balloon mac_hid hyperv_fb i2c_piix4 parport_pc ppdev lp parport 
> hv_storvsc psmouse floppy hv_vmbus tulip pata_acpi [last unloaded: 
> hv_utils]
> [ 9699.423154] CPU: 0 PID: 4659 Comm: hv_fcopy_daemon Not tainted 
> 3.16.0-23-generic #31-Ubuntu
> [ 9699.423154] Hardware name: Microsoft Corporation Virtual 
> Machine/Virtual Machine, BIOS 090006  05/23/2012
> [ 9699.423154] task: ffff880019ce5bb0 ti: ffff8800c7be0000 task.ti: 
> ffff8800c7be0000
> [ 9699.423154] RIP: 0010:[<ffffffff811de294>]  [<ffffffff811de294>] 
> filp_close+0x24/0x70
> [ 9699.423154] RSP: 0018:ffff8800c7be3e48  EFLAGS: 00010246
> [ 9699.423154] RAX: ffffffffc019c1c0 RBX: ffff8800c7b9c200 RCX: 
> ffff8800396548d8
> [ 9699.423154] RDX: 0000000080000000 RSI: ffff880039654880 RDI: 
> ffff8800c7b9c200
> [ 9699.423154] RBP: ffff8800c7be3e60 R08: 0000000000000001 R09: 
> 00000001802a0017
> [ 9699.423154] R10: ffff88003de17640 R11: ffff880087963140 R12: 
> 0000000000000000
> [ 9699.423154] R13: ffff880039654880 R14: ffff880039654890 R15: 
> 0000000000000004
> [ 9699.423154] FS:  0000000000000000(0000) GS:ffff88003de00000(0000) 
> knlGS:0000000000000000
> [ 9699.423154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9699.423154] CR2: ffffffffc019c230 CR3: 0000000001c13000 CR4: 
> 00000000001406f0
> [ 9699.423154] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [ 9699.423154] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [ 9699.423154] Stack:
> [ 9699.423154]  0000000000000001 0000000000000000 ffff880039654880 
> ffff8800c7be3ea0
> [ 9699.423154]  ffffffff811ff208 00000001034a4c80 ffff880019ce5bb0 
> ffff880039654880
> [ 9699.423154]  ffff880019ce6308 ffff8800126b8e00 ffff8800126b8e60 
> ffff8800c7be3ec8
> [ 9699.423154] Call Trace:
> [ 9699.423154]  [<ffffffff811ff208>] put_files_struct+0x78/0xd0
> [ 9699.423154]  [<ffffffff811ff307>] exit_files+0x47/0x50
> [ 9699.423154]  [<ffffffff810725a7>] do_exit+0x357/0xab0
> [ 9699.423154]  [<ffffffff81072d83>] do_group_exit+0x43/0xc0
> [ 9699.423154]  [<ffffffff81072e14>] SyS_exit_group+0x14/0x20
> [ 9699.423154]  [<ffffffff81787ced>] system_call_fastpath+0x1a/0x1f
> [ 9699.423154] Code: ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 
> 41 55 41 54 53 48 8b 47 38 48 89 fb 48 85 c0 74 46 48 8b 47 28 45 31 
> e4 49 89 f5 <48> 8b 40 70 48 85 c0 74 05 ff d0 41 89 c4 f6 43 45 40 
> 75 16 4c
> [ 9699.423154] RIP  [<ffffffff811de294>] filp_close+0x24/0x70
> [ 9699.423154]  RSP <ffff8800c7be3e48>
> [ 9699.423154] CR2: ffffffffc019c230
> [ 9699.423154] ---[ end trace 0278723740833669 ]---
> [ 9699.423154] Fixing recursive fault but reboot is needed!
> 
> I understand an administrator should not unload the driver before 
> stopping the
> daemon, but this could happen by accident.
> 
>  -- Dexuan
> 
> �NrybXǧv^)޺{.n+{zX\x17ܨ}Ơz&j:+v\azZ++zfh~iz\x1ew?&)ߢ^[f^jǫym@Aa\x7f\f0h\x0fi\x7f


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2014-12-02  6:58                               ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2014-12-02  6:58 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: KY Srinivasan, gregkh, linux-kernel, driverdev-devel, olaf, apw,
	vkuznets, Haiyang Zhang



On Tue, Dec 2, 2014 at 1:33 PM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: KY Srinivasan
>>  Sent: Monday, December 1, 2014 23:55 PM
>>  To: Dexuan Cui; Jason Wang
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
>>  vkuznets@redhat.com; Haiyang Zhang
>>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  
>>  
>>  
>>  > -----Original Message-----
>>  > From: Dexuan Cui
>>  > Sent: Monday, December 1, 2014 3:01 AM
>>  > To: Jason Wang
>>  > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  > devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com; 
>> KY
>>  > Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on transfer
>>  > failure
>>  >
>>  > > -----Original Message-----
>>  > > From: Jason Wang [mailto:jasowang@redhat.com]
>>  > > Sent: Monday, December 1, 2014 18:18 PM
>>  > > To: Dexuan Cui
>>  > > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>>  > > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
>>  > > apw@canonical.com; KY Srinivasan; vkuznets@redhat.com; Haiyang
>>  Zhang
>>  > > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message 
>> on
>>  > > transfer failure
>>  > >
>>  > >
>>  > >
>>  > > On Mon, Dec 1, 2014 at 5:47 PM, Dexuan Cui <decui@microsoft.com>
>>  > wrote:
>>  > > >>  -----Original Message-----
>>  > > >>  From: Jason Wang [mailto:jasowang@redhat.com]
>>  > > >>  Sent: Monday, December 1, 2014 16:23 PM
>>  > > >>  To: Dexuan Cui
>>  > > >>  Cc: gregkh@linuxfoundation.org; 
>> linux-kernel@vger.kernel.org;
>>  > > >> driverdev-
>>  > > >>  devel@linuxdriverproject.org; olaf@aepfle.de; 
>> apw@canonical.com;
>>  > > >> KY  Srinivasan; vkuznets@redhat.com; Haiyang Zhang
>>  > > >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete 
>> message on
>>  > > >> transfer  failure  On Fri, Nov 28, 2014 at 7:54 PM, Dexuan 
>> Cui
>>  > > >> <decui@microsoft.com>
>>  > > >> wrote:
>>  > > >>  >>  -----Original Message-----
>>  > > >>  >>  From: Jason Wang [mailto:jasowang@redhat.com]  >>  Sent:
>>  > > >> Friday, November 28, 2014 18:13 PM  >>  To: Dexuan Cui  >>  
>> Cc:
>>  > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  >>
>>  > > >> driverdev-  >>  devel@linuxdriverproject.org; olaf@aepfle.de;
>>  > > >> apw@canonical.com; KY  >>  Srinivasan; vkuznets@redhat.com;
>>  Haiyang
>>  > > >> Zhang  >>  Subject: RE: [PATCH v3] hv: hv_fcopy: drop the 
>> obsolete
>>  > > >> message on  >> transfer  >>  failure  >>  On Fri, Nov 28, 
>> 2014 at
>>  > > >> 4:36 PM, Dexuan Cui <decui@microsoft.com>  >> wrote:
>>  > > >>  >>  >>  -----Original Message-----  >>  >>  From: Jason Wang
>>  > > >> [mailto:jasowang@redhat.com]  >>  >>  Sent: Friday, November 
>> 28,
>>  > > >> 2014 14:47 PM  >>  >>  To: Dexuan Cui  >>  >>  Cc:
>>  > > >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;  
>> >>  >>
>>  > > >> driverdev-  >>  >>  devel@linuxdriverproject.org; 
>> olaf@aepfle.de;
>>  > > >> >> apw@canonical.com; KY  >>  >>  Srinivasan; 
>> vkuznets@redhat.com;
>>  > > >> Haiyang Zhang  >>  >>  Subject: Re: [PATCH v3] hv: hv_fcopy: 
>> drop
>>  > > >> the obsolete message  >> on  >>  >> transfer  >>  >>  
>> failure  >>
>>  > > >> >>  On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui  >>
>>  > > >> <decui@microsoft.com>  >>  >> wrote:
>>  > > >>  >>  >>  > In the case the user-space daemon crashes, hangs 
>> or is
>>  > > >> >> killed, we  >>  >>  > need to down the semaphore, 
>> otherwise,
>>  > > >> after the daemon starts  >>  >> next  >>  >>  > time, the 
>> obsolete
>>  > > >> data in fcopy_transaction.message or  >>  >>  >
>>  > > >> fcopy_transaction.fcopy_msg will be used immediately.
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  > Cc: Jason Wang <jasowang@redhat.com>  >>  >>  > 
>> Cc:
>>  > > >> Vitaly Kuznetsov <vkuznets@redhat.com>  >>  >>  > Cc: K. Y.
>>  > > >> Srinivasan <kys@microsoft.com>  >>  >>  > Signed-off-by: 
>> Dexuan Cui
>>  > > >> <decui@microsoft.com>  >>  >>  > ---  >>  >>  >  >>  >>  > 
>> v2: I
>>  > > >> removed the "FCP" prefix as Greg asked.
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  >     I also updated the output message a little:
>>  > > >>  >>  >>  >     "FCP: failed to acquire the semaphore" -->
>>  > > >>  >>  >>  >     "can not acquire the semaphore: it is benign"
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  > v3: I added the code in fcopy_release() as Jason 
>> Wang
>>  > > >> >> suggested.
>>  > > >>  >>  >>  >     I removed the pr_debug (it isn't so 
>> meaningful)and
>>  > > >> added a
>>  > > >>  >>  >>  > comment instead.
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++  
>> >>  >>
>>  > > >> >  1 file changed, 19 insertions(+)  >>  >>  >  >>  >>  > 
>> diff
>>  > > >> --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c  >>  
>> >>  >
>>  > > >> index 23b2ce2..faa6ba6 100644  >>  >>  > ---
>>  > > >> a/drivers/hv/hv_fcopy.c  >>  >>  > +++ 
>> b/drivers/hv/hv_fcopy.c  >>
>>  > > >> >>  > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct  
>> >>
>>  > > >> work_struct  >>  >>  > *dummy)
>>  > > >>  >>  >>  >   * process the pending transaction.
>>  > > >>  >>  >>  >   */
>>  > > >>  >>  >>  >  fcopy_respond_to_host(HV_E_FAIL);
>>  > > >>  >>  >>  > +
>>  > > >>  >>  >>  > +/* In the case the user-space daemon crashes, 
>> hangs
>>  > or
>>  > > >> is
>>  > > >>  >>  >> killed, we
>>  > > >>  >>  >>  > + * need to down the semaphore, otherwise, after
>>  > the
>>  > > >> daemon
>>  > > >>  >>  >> starts
>>  > > >>  >>  >>  > next
>>  > > >>  >>  >>  > + * time, the obsolete data in
>>  > fcopy_transaction.message
>>  > > >> or
>>  > > >>  >>  >>  > + * fcopy_transaction.fcopy_msg will be used
>>  > immediately.
>>  > > >>  >>  >>  > + *
>>  > > >>  >>  >>  > + * NOTE: fcopy_read() happens to get the
>>  > semaphore (very
>>  > > >>  >> rare)?
>>  > > >>  >>  >>  > We're
>>  > > >>  >>  >>  > + * still OK, because we've reported the failure 
>> to the
>>  > > >> host.
>>  > > >>  >>  >>  > + */
>>  > > >>  >>  >>  > +if (down_trylock(&fcopy_transaction.read_sema))
>>  > > >>  >>  >>  > +;
>>  > > >>  >>  >>
>>  > > >>  >>  >>  Sorry, I'm not quite understand how if () ; can 
>> help here.
>>  > > >>  >>  >>
>>  > > >>  >>  >>  Btw, a question not relate to this patch.
>>  > > >>  >>  >>
>>  > > >>  >>  >>  What happens if a daemon is resume from SIGSTOP and
>>  > > >> expires the  >>  >> check  >>  >>  here?
>>  > > >>  >>  > Hi Jason,
>>  > > >>  >>  > My idea is: here we need down_trylock(), but in case 
>> we
>>  > > >> can't get  >> the  >>  > semaphore, it's OK anyway:
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 1):
>>  > > >>  >>  > 1.1: when the daemon is blocked on the pread(), the 
>> daemon
>>  > > >> >> receives  >>  > SIGSTOP;  >>  > 1.2: the host user runs 
>> the
>>  > > >> PowerShell Copy-VMFile command;  >>  > 1.3.1: the driver 
>> reports
>>  > > >> the failure to the host user in 5s and  >>  > 1.3.2: the 
>> driver
>>  > > >> down()-es the semaphore;  >>  > 1.4: the daemon receives 
>> SIGCONT
>>  > > >> and it will be still blocked on  >> the  >>  > pread().
>>  > > >>  >>  > Without the down_trylock(), in 1.4, the daemon can 
>> receive
>>  > > >> an  >>  > obsolete message.
>>  > > >>  >>  > NOTE: in this scenario, the daemon is not killed.
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 2):
>>  > > >>  >>  > In senario 1), if the daemon receives SIGCONT between 
>> 1.3.1
>>  > > >> and  >> 1.3.2  >>  > and  >>  > do down() in fcopy_read(), 
>> it will
>>  > > >> receive the message but:
>>  > > >> the
>>  > > >>  >>  > driver has
>>  > > >>  >>  > reported the failure to the host user and the 
>> driver's 1.3.2
>>  > > >> can't  >>  > get the  >>  > semaphore -- IMO this is 
>> acceptably OK,
>>  > > >> though in the VM, an  >>  > incomplete  >>  > file will be 
>> left
>>  > > >> there.
>>  > > >>  >>  > BTW, I think in the daemon's hv_start_fcopy() we 
>> should add
>>  > > >> a  >>  > close(target_fd) before open()-ing a new one.
>>  > > >>  >>
>>  > > >>  >>  Right, but how about the case when resuming from 
>> SIGSTOP but
>>  > > >> no  >> timeout?
>>  > > >>  > Sorry, I don't understand this:
>>  > > >>  > if no timeout, fcopy_read() will get the semaphore and
>>  > > >> fcopy_write()
>>  > > >>  > will try to cancel fcopy_work.
>>  > > >>
>>  > > >>  Yes.
>>  > > >>  >
>>  > > >>  >
>>  > > >>  >>  Looks like in this case userspace() may wait in
>>  > > >> down_interruptible()
>>  > > >>  >>  until timeout. We probably need something like this:
>>  > > >>  >>
>>  > > >>  >>          if 
>> (down_interruptible(&fcopy_transaction.read_sema)) {
>>  > > >>  >>                  up(&fcopy_transaction.read_sema);
>>  > > >>  >>                  return -EINTR;
>>  > > >>  >>          }
>>  > > >>  > until "timeout"?
>>  > > >>  > if the daemon can't get the semaphore, it can only be 
>> wake by a
>>  > > >> > signal(the  > daemon doesn't install handler, so by 
>> default most
>>  > > >> signals will kill  > the daemon).
>>  > > >>  > In case a signal waking up the daemon doesn't kill the 
>> daemon,
>>  > > >> why  > should  > we do up()?
>>  > > >>
>>  > > >>  True, no need since we do down_trylock() in release().
>>  > > >>
>>  > > >>  Btw, there's no EINTR handling in handling pread() return 
>> value,
>>  > > >> may add such one which should be useful for something like
>>  > > >> debugging.
>>  > > >>
>>  > > >>  >
>>  > > >>  >
>>  > > >>  >>
>>  > > >>  >>  This should synchronize with the timeout work for sure.
>>  > > >>  >>  But how about only schedule it after this?
>>  > > >>  >>  It does not may sense to start the timer during 
>> interrupt  >>
>>  > > >> since the file may not even opened and it may take time  >>  
>> to
>>  > > >> handle signals?
>>  > > >>  >>
>>  > > >>  >>  >
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  > +
>>  > > >>  >>  >>  >  }
>>  > > >>  >>  >>  >
>>  > > >>  >>  >>  >  static int fcopy_handle_handshake(u32 version)  
>> >>  >>
>>  > > >> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode 
>>  >>
>>  > > >> *inode,  >>  >>  > struct file *f)
>>  > > >>  >>  >>  >   */
>>  > > >>  >>  >>  >  in_hand_shake = true;
>>  > > >>  >>  >>  >  opened = false;
>>  > > >>  >>  >>  > +
>>  > > >>  >>  >>  > +if (cancel_delayed_work_sync(&fcopy_work)) {
>>  > > >>  >>  >>  > +/* We haven't up()-ed the semaphore(very
>>  > rare)? */
>>  > > >>  >>  >>  > +if
>>  > (down_trylock(&fcopy_transaction.read_sema))
>>  > > >>  >>  >>  > +;
>>  > > >>  >>  >>
>>  > > >>  >>  >>  And this.
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 3):
>>  > > >>  >>  > When the daemon exits(e.g., SIGKILL received), if 
>> there is a
>>  > > >> >>  > fcopy_work  >>  > pending (scheduled but not start to 
>> run
>>  > > >> yet), we should cancel the  >>  > work (as you suggested) and
>>  > > >> down() the semaphore, otherwise, the  >>  > obsolete message 
>> will
>>  > > >> be received by the next instance of the  >> daemon.
>>  > > >>  >>
>>  > > >>  >>  Yes
>>  > > >>  >>  >
>>  > > >>  >>  >
>>  > > >>  >>  > Scenario 4):  in the driver's 
>> hv_fcopy_onchannelcallback():
>>  > > >>  >>  >         schedule_delayed_work(&fcopy_work, 5*HZ);
>>  > > >>  >>  >         ----> if fcopy_release() is running on 
>> another vcpu,
>>  > > >> just
>>  > > >>  >>  > before the next line?
>>  > > >>  >>  >         fcopy_send_data();
>>  > > >>  >>  >
>>  > > >>  >>  > In this case, fcopy_release() can cancel fcopy_work, 
>> but  >>
>>  > > >> > can't get the semaphore since it hasn't been up()-ed.
>>  > > >>  >>  > Hmm, in this case,   fcopy_send_data() will do up()  
>> later,
>>  > > >> and
>>  > > >>  >> we'll
>>  > > >>  >>  > buffer an obsolete message in the driver, and the 
>> message
>>  > > >> will be  >>  > fetched by the next instance of the daemon...
>>  > > >>  >>  >
>>  > > >>  >>  > Looks we need a spinlock here?
>>  > > >>  >>
>>  > > >>  >>  Unless fcopy_release() can wait for all data for current
>>  > > >> transation  >>  to be received. Spinlock won't help.
>>  > > >>  >>
>>  > > >>  >>  But an idea is let the daemon the handle such cases. 
>> E.g make
>>  > > >> sure  >> the  >>  processing begins with START_COPY and end 
>> with
>>  > > >> COMPLETE/CANCEL_COPY.
>>  > > >>  >>  Drop all requests that does not start with START_COPY.
>>  > > >>  >>
>>  > > >>  >>  Thought?
>>  > > >>  > Good idea.
>>  > > >>  > I also think we should reinforce the concept of state 
>> machine in
>>  > > >> the  > daemon code.
>>  > > >>
>>  > > >>  Yes, it needs.
>>  > > > I agree.
>>  > > > Obviously we can do something to make the daemon/driver work 
>> better
>>  > > > in the corner cases.
>>  > > >
>>  > > >>  >
>>  > > >>  > The daemon/driver communication has so many corner 
>> cases...
>>  > > >>
>>  > > >>  Looks so, let's first address the issue mentioned in this 
>> patch.
>>  > > > OK.
>>  > > >
>>  > > >>  I don't have any more comments other than changing
>>  > > >>
>>  > > >>   if(down_trylock(&fcopy_transaction.read_sema))
>>  > > >>       ;
>>  > > >>
>>  > > >>  to
>>  > > >>
>>  > > >>  down_trylock(&fcopy_transaction.read_sema);
>>  > > > Hi Jason,
>>  > > > This is to address Vitaly's comment in the bugzilla:
>>  > > > https://bugzilla.redhat.com/show_bug.cgi?id=1162100#c5
>>  > > >
>>  > > > down_trylock(&fcopy_transaction.read_sema) will
>>  > > >
>>  > > > "
>>  > > > produces the following compile warning:
>>  > > > drivers/hv/hv_fcopy.c: In function ‘fcopy_work_func’:
>>  > > > drivers/hv/hv_fcopy.c:95:2: warning: ignoring return value of
>>  > > > ‘down_trylock’, declared with attribute warn_unused_result
>>  > > > [-Wunused-result]
>>  > > >   (void)down_trylock(&fcopy_transaction.read_sema);
>>  > > > "
>>  > > >
>>  > > > Actually I personally don't care about the warning, because 
>> we only
>>  > > > see it when we run some kind of code checker program.  :-)
>>  > > >
>>  > > > I can change my v3  to the "normal" style you prefer, if 
>> there is no
>>  > > > strong objection from Vitaly?
>>  > >
>>  > > Ah, I see the point. Then no objection for this patch.
>>  > >
>>  > > Since Vitaly said he does not has objection.
>>  > >
>>  > > Acked-by: Jason Wang <jasowang@redhat.com>
>>  >
>>  > I realized non-trivial effort might be needed to better handle 
>> corner cases.
>>  > This is only the first step.
>>  >
>>  > E.g., I found a new corner case:
>>  > after we unload the hv_utils/fcopy driver, the fcopy daemon can 
>> exit and
>>  > fcopy_release() will be invoked -- at this time fcopy_release() 
>> has become
>>  > obsolete and invalid, and as a result, a kernel panic can happen: 
>>  I did see
>>  the
>>  > panic once.
>>  
>>  You should not be able to unload the driver when there is still an 
>> open
>>  reference from the
>>  Daemon.
>>  
>>  K. Y
> 
> Hi KY, I'm able to reproduce the below issue every time:
> (I ran the below in Ubuntu 14.10 but I think we're able to repro it 
> in any distro):
> 
> 1) start hv_fcopy_daemon and use ps to make sure it's running in the 
> background;
> 2) kill -STOP `pidof hv_fcopy_daemon`, and use ps to make sure the 
> daemon is in
> the state of "Ts";
> 3) rmmod hv_utils  # dexuan: this can runs fine. 

Probably the opened file should hold a refcnt of the module.
> 
> 4) kill -CONT `pidof hv_fcopy_daemon` and we see an oops at once:
> 
> [ 9699.421564] BUG: unable to handle kernel paging request at 
> ffffffffc019c230
> [ 9699.422701] IP: [<ffffffff811de294>] filp_close+0x24/0x70
> [ 9699.423154] PGD 1c16067 PUD 1c18067 PMD 366d8067 PTE 0
> [ 9699.423154] Oops: 0000 [#1] SMP
> [ 9699.423154] Modules linked in: hv_utils rfcomm bnep bluetooth 
> 6lowpan_iphc joydev hid_generic crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper 
> ablk_helper cryptd hid_hyperv serio_raw hid hyperv_keyboard 
> hv_balloon mac_hid hyperv_fb i2c_piix4 parport_pc ppdev lp parport 
> hv_storvsc psmouse floppy hv_vmbus tulip pata_acpi [last unloaded: 
> hv_utils]
> [ 9699.423154] CPU: 0 PID: 4659 Comm: hv_fcopy_daemon Not tainted 
> 3.16.0-23-generic #31-Ubuntu
> [ 9699.423154] Hardware name: Microsoft Corporation Virtual 
> Machine/Virtual Machine, BIOS 090006  05/23/2012
> [ 9699.423154] task: ffff880019ce5bb0 ti: ffff8800c7be0000 task.ti: 
> ffff8800c7be0000
> [ 9699.423154] RIP: 0010:[<ffffffff811de294>]  [<ffffffff811de294>] 
> filp_close+0x24/0x70
> [ 9699.423154] RSP: 0018:ffff8800c7be3e48  EFLAGS: 00010246
> [ 9699.423154] RAX: ffffffffc019c1c0 RBX: ffff8800c7b9c200 RCX: 
> ffff8800396548d8
> [ 9699.423154] RDX: 0000000080000000 RSI: ffff880039654880 RDI: 
> ffff8800c7b9c200
> [ 9699.423154] RBP: ffff8800c7be3e60 R08: 0000000000000001 R09: 
> 00000001802a0017
> [ 9699.423154] R10: ffff88003de17640 R11: ffff880087963140 R12: 
> 0000000000000000
> [ 9699.423154] R13: ffff880039654880 R14: ffff880039654890 R15: 
> 0000000000000004
> [ 9699.423154] FS:  0000000000000000(0000) GS:ffff88003de00000(0000) 
> knlGS:0000000000000000
> [ 9699.423154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9699.423154] CR2: ffffffffc019c230 CR3: 0000000001c13000 CR4: 
> 00000000001406f0
> [ 9699.423154] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [ 9699.423154] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [ 9699.423154] Stack:
> [ 9699.423154]  0000000000000001 0000000000000000 ffff880039654880 
> ffff8800c7be3ea0
> [ 9699.423154]  ffffffff811ff208 00000001034a4c80 ffff880019ce5bb0 
> ffff880039654880
> [ 9699.423154]  ffff880019ce6308 ffff8800126b8e00 ffff8800126b8e60 
> ffff8800c7be3ec8
> [ 9699.423154] Call Trace:
> [ 9699.423154]  [<ffffffff811ff208>] put_files_struct+0x78/0xd0
> [ 9699.423154]  [<ffffffff811ff307>] exit_files+0x47/0x50
> [ 9699.423154]  [<ffffffff810725a7>] do_exit+0x357/0xab0
> [ 9699.423154]  [<ffffffff81072d83>] do_group_exit+0x43/0xc0
> [ 9699.423154]  [<ffffffff81072e14>] SyS_exit_group+0x14/0x20
> [ 9699.423154]  [<ffffffff81787ced>] system_call_fastpath+0x1a/0x1f
> [ 9699.423154] Code: ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 
> 41 55 41 54 53 48 8b 47 38 48 89 fb 48 85 c0 74 46 48 8b 47 28 45 31 
> e4 49 89 f5 <48> 8b 40 70 48 85 c0 74 05 ff d0 41 89 c4 f6 43 45 40 
> 75 16 4c
> [ 9699.423154] RIP  [<ffffffff811de294>] filp_close+0x24/0x70
> [ 9699.423154]  RSP <ffff8800c7be3e48>
> [ 9699.423154] CR2: ffffffffc019c230
> [ 9699.423154] ---[ end trace 0278723740833669 ]---
> [ 9699.423154] Fixing recursive fault but reboot is needed!
> 
> I understand an administrator should not unload the driver before 
> stopping the
> daemon, but this could happen by accident.
> 
>  -- Dexuan
> 
> �NrybXǧv^)޺{.n+{zX\x17ܨ}Ơz&j:+v\azZ++zfh~iz\x1ew?&)ߢ^[f^jǫym@Aa\x7f\f0h\x0fi\x7f

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2014-11-27 13:09 ` Dexuan Cui
@ 2015-01-13 13:52   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 34+ messages in thread
From: Vitaly Kuznetsov @ 2015-01-13 13:52 UTC (permalink / raw)
  To: Dexuan Cui, kys
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, jasowang, haiyangz

Dexuan Cui <decui@microsoft.com> writes:

> In the case the user-space daemon crashes, hangs or is killed, we
> need to down the semaphore, otherwise, after the daemon starts next
> time, the obsolete data in fcopy_transaction.message or
> fcopy_transaction.fcopy_msg will be used immediately.

It seems this patch got lost and I don't see it in recent 'resend'
series. K.Y., Dexuan, can you please take a look?

>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: K. Y. Srinivasan <kys@microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>
> v2: I removed the "FCP" prefix as Greg asked.
>
>     I also updated the output message a little:
>     "FCP: failed to acquire the semaphore" --> 
>     "can not acquire the semaphore: it is benign"
>
> v3: I added the code in fcopy_release() as Jason Wang suggested.
>     I removed the pr_debug (it isn't so meaningful)and added a comment instead.
>
>  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> index 23b2ce2..faa6ba6 100644
> --- a/drivers/hv/hv_fcopy.c
> +++ b/drivers/hv/hv_fcopy.c
> @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct *dummy)
>  	 * process the pending transaction.
>  	 */
>  	fcopy_respond_to_host(HV_E_FAIL);
> +
> +	/* In the case the user-space daemon crashes, hangs or is killed, we
> +	 * need to down the semaphore, otherwise, after the daemon starts next
> +	 * time, the obsolete data in fcopy_transaction.message or
> +	 * fcopy_transaction.fcopy_msg will be used immediately.
> +	 *
> +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
> +	 * still OK, because we've reported the failure to the host.
> +	 */
> +	if (down_trylock(&fcopy_transaction.read_sema))
> +		;
> +
>  }
>
>  static int fcopy_handle_handshake(u32 version)
> @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, struct file *f)
>  	 */
>  	in_hand_shake = true;
>  	opened = false;
> +
> +	if (cancel_delayed_work_sync(&fcopy_work)) {
> +		/* We haven't up()-ed the semaphore(very rare)? */
> +		if (down_trylock(&fcopy_transaction.read_sema))
> +			;
> +		fcopy_respond_to_host(HV_E_FAIL);
> +	}
>  	return 0;
>  }

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2015-01-13 13:52   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 34+ messages in thread
From: Vitaly Kuznetsov @ 2015-01-13 13:52 UTC (permalink / raw)
  To: Dexuan Cui, kys
  Cc: olaf, gregkh, jasowang, driverdev-devel, linux-kernel, apw, haiyangz

Dexuan Cui <decui@microsoft.com> writes:

> In the case the user-space daemon crashes, hangs or is killed, we
> need to down the semaphore, otherwise, after the daemon starts next
> time, the obsolete data in fcopy_transaction.message or
> fcopy_transaction.fcopy_msg will be used immediately.

It seems this patch got lost and I don't see it in recent 'resend'
series. K.Y., Dexuan, can you please take a look?

>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: K. Y. Srinivasan <kys@microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>
> v2: I removed the "FCP" prefix as Greg asked.
>
>     I also updated the output message a little:
>     "FCP: failed to acquire the semaphore" --> 
>     "can not acquire the semaphore: it is benign"
>
> v3: I added the code in fcopy_release() as Jason Wang suggested.
>     I removed the pr_debug (it isn't so meaningful)and added a comment instead.
>
>  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> index 23b2ce2..faa6ba6 100644
> --- a/drivers/hv/hv_fcopy.c
> +++ b/drivers/hv/hv_fcopy.c
> @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct *dummy)
>  	 * process the pending transaction.
>  	 */
>  	fcopy_respond_to_host(HV_E_FAIL);
> +
> +	/* In the case the user-space daemon crashes, hangs or is killed, we
> +	 * need to down the semaphore, otherwise, after the daemon starts next
> +	 * time, the obsolete data in fcopy_transaction.message or
> +	 * fcopy_transaction.fcopy_msg will be used immediately.
> +	 *
> +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
> +	 * still OK, because we've reported the failure to the host.
> +	 */
> +	if (down_trylock(&fcopy_transaction.read_sema))
> +		;
> +
>  }
>
>  static int fcopy_handle_handshake(u32 version)
> @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, struct file *f)
>  	 */
>  	in_hand_shake = true;
>  	opened = false;
> +
> +	if (cancel_delayed_work_sync(&fcopy_work)) {
> +		/* We haven't up()-ed the semaphore(very rare)? */
> +		if (down_trylock(&fcopy_transaction.read_sema))
> +			;
> +		fcopy_respond_to_host(HV_E_FAIL);
> +	}
>  	return 0;
>  }

-- 
  Vitaly
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2015-01-13 13:52   ` Vitaly Kuznetsov
@ 2015-01-14  1:43     ` Dexuan Cui
  -1 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2015-01-14  1:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov, KY Srinivasan
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, jasowang,
	Haiyang Zhang

> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
> Sent: Tuesday, January 13, 2015 21:52 PM
> To: Dexuan Cui; KY Srinivasan
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> jasowang@redhat.com; Haiyang Zhang
> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> Dexuan Cui <decui@microsoft.com> writes:
> 
> > In the case the user-space daemon crashes, hangs or is killed, we
> > need to down the semaphore, otherwise, after the daemon starts next
> > time, the obsolete data in fcopy_transaction.message or
> > fcopy_transaction.fcopy_msg will be used immediately.
> 
> It seems this patch got lost and I don't see it in recent 'resend'
> series. K.Y., Dexuan, can you please take a look?
Hi Vitaly, Jason,
The patch can't fix all the corner cases (it would need non-trivial efforts for that)
as we discussed, but I think it would be better for us to have it as it can indeed fix
an obvious issue and doesn't introduce new issues.

I think I can document the known discussed corner cases in the patch as TODOs
and resend the patch.

Please let me know if you have different opinions. :-) 

Thanks,
-- Dexuan

> 
> >
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Cc: K. Y. Srinivasan <kys@microsoft.com>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >
> > v2: I removed the "FCP" prefix as Greg asked.
> >
> >     I also updated the output message a little:
> >     "FCP: failed to acquire the semaphore" -->
> >     "can not acquire the semaphore: it is benign"
> >
> > v3: I added the code in fcopy_release() as Jason Wang suggested.
> >     I removed the pr_debug (it isn't so meaningful)and added a comment
> instead.
> >
> >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> > index 23b2ce2..faa6ba6 100644
> > --- a/drivers/hv/hv_fcopy.c
> > +++ b/drivers/hv/hv_fcopy.c
> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> *dummy)
> >  	 * process the pending transaction.
> >  	 */
> >  	fcopy_respond_to_host(HV_E_FAIL);
> > +
> > +	/* In the case the user-space daemon crashes, hangs or is killed, we
> > +	 * need to down the semaphore, otherwise, after the daemon starts
> next
> > +	 * time, the obsolete data in fcopy_transaction.message or
> > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> > +	 *
> > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
> > +	 * still OK, because we've reported the failure to the host.
> > +	 */
> > +	if (down_trylock(&fcopy_transaction.read_sema))
> > +		;
> > +
> >  }
> >
> >  static int fcopy_handle_handshake(u32 version)
> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, struct
> file *f)
> >  	 */
> >  	in_hand_shake = true;
> >  	opened = false;
> > +
> > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > +		/* We haven't up()-ed the semaphore(very rare)? */
> > +		if (down_trylock(&fcopy_transaction.read_sema))
> > +			;
> > +		fcopy_respond_to_host(HV_E_FAIL);
> > +	}
> >  	return 0;
> >  }
> 
> --
>   Vitaly

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2015-01-14  1:43     ` Dexuan Cui
  0 siblings, 0 replies; 34+ messages in thread
From: Dexuan Cui @ 2015-01-14  1:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov, KY Srinivasan
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, jasowang,
	Haiyang Zhang

> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
> Sent: Tuesday, January 13, 2015 21:52 PM
> To: Dexuan Cui; KY Srinivasan
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> jasowang@redhat.com; Haiyang Zhang
> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> Dexuan Cui <decui@microsoft.com> writes:
> 
> > In the case the user-space daemon crashes, hangs or is killed, we
> > need to down the semaphore, otherwise, after the daemon starts next
> > time, the obsolete data in fcopy_transaction.message or
> > fcopy_transaction.fcopy_msg will be used immediately.
> 
> It seems this patch got lost and I don't see it in recent 'resend'
> series. K.Y., Dexuan, can you please take a look?
Hi Vitaly, Jason,
The patch can't fix all the corner cases (it would need non-trivial efforts for that)
as we discussed, but I think it would be better for us to have it as it can indeed fix
an obvious issue and doesn't introduce new issues.

I think I can document the known discussed corner cases in the patch as TODOs
and resend the patch.

Please let me know if you have different opinions. :-) 

Thanks,
-- Dexuan

> 
> >
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Cc: K. Y. Srinivasan <kys@microsoft.com>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >
> > v2: I removed the "FCP" prefix as Greg asked.
> >
> >     I also updated the output message a little:
> >     "FCP: failed to acquire the semaphore" -->
> >     "can not acquire the semaphore: it is benign"
> >
> > v3: I added the code in fcopy_release() as Jason Wang suggested.
> >     I removed the pr_debug (it isn't so meaningful)and added a comment
> instead.
> >
> >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> > index 23b2ce2..faa6ba6 100644
> > --- a/drivers/hv/hv_fcopy.c
> > +++ b/drivers/hv/hv_fcopy.c
> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> *dummy)
> >  	 * process the pending transaction.
> >  	 */
> >  	fcopy_respond_to_host(HV_E_FAIL);
> > +
> > +	/* In the case the user-space daemon crashes, hangs or is killed, we
> > +	 * need to down the semaphore, otherwise, after the daemon starts
> next
> > +	 * time, the obsolete data in fcopy_transaction.message or
> > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> > +	 *
> > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
> > +	 * still OK, because we've reported the failure to the host.
> > +	 */
> > +	if (down_trylock(&fcopy_transaction.read_sema))
> > +		;
> > +
> >  }
> >
> >  static int fcopy_handle_handshake(u32 version)
> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, struct
> file *f)
> >  	 */
> >  	in_hand_shake = true;
> >  	opened = false;
> > +
> > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > +		/* We haven't up()-ed the semaphore(very rare)? */
> > +		if (down_trylock(&fcopy_transaction.read_sema))
> > +			;
> > +		fcopy_respond_to_host(HV_E_FAIL);
> > +	}
> >  	return 0;
> >  }
> 
> --
>   Vitaly

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
       [not found]     ` <F792CF86EFE20D4AB8064279AFBA51C61EA7CFF9@SIXPRD3002MB028.064d.mgd.msft.net >
@ 2015-01-14  2:36         ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2015-01-14  2:28 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Vitaly Kuznetsov, KY Srinivasan, gregkh, linux-kernel,
	driverdev-devel, olaf, apw, Haiyang Zhang



On Wed, Jan 14, 2015 at 9:43 AM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
>>  Sent: Tuesday, January 13, 2015 21:52 PM
>>  To: Dexuan Cui; KY Srinivasan
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
>>  jasowang@redhat.com; Haiyang Zhang
>>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  
>>  Dexuan Cui <decui@microsoft.com> writes:
>>  
>>  > In the case the user-space daemon crashes, hangs or is killed, we
>>  > need to down the semaphore, otherwise, after the daemon starts 
>> next
>>  > time, the obsolete data in fcopy_transaction.message or
>>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  
>>  It seems this patch got lost and I don't see it in recent 'resend'
>>  series. K.Y., Dexuan, can you please take a look?
> Hi Vitaly, Jason,
> The patch can't fix all the corner cases (it would need non-trivial 
> efforts for that)
> as we discussed, but I think it would be better for us to have it as 
> it can indeed fix
> an obvious issue and doesn't introduce new issues.
> 
> I think I can document the known discussed corner cases in the patch 
> as TODOs
> and resend the patch.
> 
> Please let me know if you have different opinions. :-) 
> 
> Thanks,
> -- Dexuan

Yes, I think it's ok.
We can do other fixes on top.

Thanks


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2015-01-14  2:36         ` Jason Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2015-01-14  2:36 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: olaf, gregkh, Haiyang Zhang, driverdev-devel, linux-kernel, apw



On Wed, Jan 14, 2015 at 9:43 AM, Dexuan Cui <decui@microsoft.com> wrote:
>>  -----Original Message-----
>>  From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
>>  Sent: Tuesday, January 13, 2015 21:52 PM
>>  To: Dexuan Cui; KY Srinivasan
>>  Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; 
>> driverdev-
>>  devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
>>  jasowang@redhat.com; Haiyang Zhang
>>  Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on 
>> transfer
>>  failure
>>  
>>  Dexuan Cui <decui@microsoft.com> writes:
>>  
>>  > In the case the user-space daemon crashes, hangs or is killed, we
>>  > need to down the semaphore, otherwise, after the daemon starts 
>> next
>>  > time, the obsolete data in fcopy_transaction.message or
>>  > fcopy_transaction.fcopy_msg will be used immediately.
>>  
>>  It seems this patch got lost and I don't see it in recent 'resend'
>>  series. K.Y., Dexuan, can you please take a look?
> Hi Vitaly, Jason,
> The patch can't fix all the corner cases (it would need non-trivial 
> efforts for that)
> as we discussed, but I think it would be better for us to have it as 
> it can indeed fix
> an obvious issue and doesn't introduce new issues.
> 
> I think I can document the known discussed corner cases in the patch 
> as TODOs
> and resend the patch.
> 
> Please let me know if you have different opinions. :-) 
> 
> Thanks,
> -- Dexuan

Yes, I think it's ok.
We can do other fixes on top.

Thanks

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
  2015-01-14  1:43     ` Dexuan Cui
@ 2015-01-14  3:21       ` KY Srinivasan
  -1 siblings, 0 replies; 34+ messages in thread
From: KY Srinivasan @ 2015-01-14  3:21 UTC (permalink / raw)
  To: Dexuan Cui, Vitaly Kuznetsov
  Cc: gregkh, linux-kernel, driverdev-devel, olaf, apw, jasowang,
	Haiyang Zhang



> -----Original Message-----
> From: Dexuan Cui
> Sent: Tuesday, January 13, 2015 5:44 PM
> To: Vitaly Kuznetsov; KY Srinivasan
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> jasowang@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> > -----Original Message-----
> > From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
> > Sent: Tuesday, January 13, 2015 21:52 PM
> > To: Dexuan Cui; KY Srinivasan
> > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
> > apw@canonical.com; jasowang@redhat.com; Haiyang Zhang
> > Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > transfer failure
> >
> > Dexuan Cui <decui@microsoft.com> writes:
> >
> > > In the case the user-space daemon crashes, hangs or is killed, we
> > > need to down the semaphore, otherwise, after the daemon starts next
> > > time, the obsolete data in fcopy_transaction.message or
> > > fcopy_transaction.fcopy_msg will be used immediately.
> >
> > It seems this patch got lost and I don't see it in recent 'resend'
> > series. K.Y., Dexuan, can you please take a look?
> Hi Vitaly, Jason,
> The patch can't fix all the corner cases (it would need non-trivial efforts for
> that) as we discussed, but I think it would be better for us to have it as it can
> indeed fix an obvious issue and doesn't introduce new issues.
> 
> I think I can document the known discussed corner cases in the patch as
> TODOs and resend the patch.
> 
> Please let me know if you have different opinions. :-)

Resend the patch.

K. Y
> 
> Thanks,
> -- Dexuan
> 
> >
> > >
> > > Cc: Jason Wang <jasowang@redhat.com>
> > > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > Cc: K. Y. Srinivasan <kys@microsoft.com>
> > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > ---
> > >
> > > v2: I removed the "FCP" prefix as Greg asked.
> > >
> > >     I also updated the output message a little:
> > >     "FCP: failed to acquire the semaphore" -->
> > >     "can not acquire the semaphore: it is benign"
> > >
> > > v3: I added the code in fcopy_release() as Jason Wang suggested.
> > >     I removed the pr_debug (it isn't so meaningful)and added a
> > > comment
> > instead.
> > >
> > >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > >
> > > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index
> > > 23b2ce2..faa6ba6 100644
> > > --- a/drivers/hv/hv_fcopy.c
> > > +++ b/drivers/hv/hv_fcopy.c
> > > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> > *dummy)
> > >  	 * process the pending transaction.
> > >  	 */
> > >  	fcopy_respond_to_host(HV_E_FAIL);
> > > +
> > > +	/* In the case the user-space daemon crashes, hangs or is killed, we
> > > +	 * need to down the semaphore, otherwise, after the daemon starts
> > next
> > > +	 * time, the obsolete data in fcopy_transaction.message or
> > > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> > > +	 *
> > > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
> We're
> > > +	 * still OK, because we've reported the failure to the host.
> > > +	 */
> > > +	if (down_trylock(&fcopy_transaction.read_sema))
> > > +		;
> > > +
> > >  }
> > >
> > >  static int fcopy_handle_handshake(u32 version) @@ -351,6 +363,13 @@
> > > static int fcopy_release(struct inode *inode, struct
> > file *f)
> > >  	 */
> > >  	in_hand_shake = true;
> > >  	opened = false;
> > > +
> > > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > > +		/* We haven't up()-ed the semaphore(very rare)? */
> > > +		if (down_trylock(&fcopy_transaction.read_sema))
> > > +			;
> > > +		fcopy_respond_to_host(HV_E_FAIL);
> > > +	}
> > >  	return 0;
> > >  }
> >
> > --
> >   Vitaly


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
@ 2015-01-14  3:21       ` KY Srinivasan
  0 siblings, 0 replies; 34+ messages in thread
From: KY Srinivasan @ 2015-01-14  3:21 UTC (permalink / raw)
  To: Dexuan Cui, Vitaly Kuznetsov
  Cc: olaf, gregkh, jasowang, driverdev-devel, linux-kernel, apw,
	Haiyang Zhang



> -----Original Message-----
> From: Dexuan Cui
> Sent: Tuesday, January 13, 2015 5:44 PM
> To: Vitaly Kuznetsov; KY Srinivasan
> Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> devel@linuxdriverproject.org; olaf@aepfle.de; apw@canonical.com;
> jasowang@redhat.com; Haiyang Zhang
> Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> 
> > -----Original Message-----
> > From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
> > Sent: Tuesday, January 13, 2015 21:52 PM
> > To: Dexuan Cui; KY Srinivasan
> > Cc: gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > driverdev- devel@linuxdriverproject.org; olaf@aepfle.de;
> > apw@canonical.com; jasowang@redhat.com; Haiyang Zhang
> > Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
> > transfer failure
> >
> > Dexuan Cui <decui@microsoft.com> writes:
> >
> > > In the case the user-space daemon crashes, hangs or is killed, we
> > > need to down the semaphore, otherwise, after the daemon starts next
> > > time, the obsolete data in fcopy_transaction.message or
> > > fcopy_transaction.fcopy_msg will be used immediately.
> >
> > It seems this patch got lost and I don't see it in recent 'resend'
> > series. K.Y., Dexuan, can you please take a look?
> Hi Vitaly, Jason,
> The patch can't fix all the corner cases (it would need non-trivial efforts for
> that) as we discussed, but I think it would be better for us to have it as it can
> indeed fix an obvious issue and doesn't introduce new issues.
> 
> I think I can document the known discussed corner cases in the patch as
> TODOs and resend the patch.
> 
> Please let me know if you have different opinions. :-)

Resend the patch.

K. Y
> 
> Thanks,
> -- Dexuan
> 
> >
> > >
> > > Cc: Jason Wang <jasowang@redhat.com>
> > > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > Cc: K. Y. Srinivasan <kys@microsoft.com>
> > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > ---
> > >
> > > v2: I removed the "FCP" prefix as Greg asked.
> > >
> > >     I also updated the output message a little:
> > >     "FCP: failed to acquire the semaphore" -->
> > >     "can not acquire the semaphore: it is benign"
> > >
> > > v3: I added the code in fcopy_release() as Jason Wang suggested.
> > >     I removed the pr_debug (it isn't so meaningful)and added a
> > > comment
> > instead.
> > >
> > >  drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > >
> > > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index
> > > 23b2ce2..faa6ba6 100644
> > > --- a/drivers/hv/hv_fcopy.c
> > > +++ b/drivers/hv/hv_fcopy.c
> > > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> > *dummy)
> > >  	 * process the pending transaction.
> > >  	 */
> > >  	fcopy_respond_to_host(HV_E_FAIL);
> > > +
> > > +	/* In the case the user-space daemon crashes, hangs or is killed, we
> > > +	 * need to down the semaphore, otherwise, after the daemon starts
> > next
> > > +	 * time, the obsolete data in fcopy_transaction.message or
> > > +	 * fcopy_transaction.fcopy_msg will be used immediately.
> > > +	 *
> > > +	 * NOTE: fcopy_read() happens to get the semaphore (very rare)?
> We're
> > > +	 * still OK, because we've reported the failure to the host.
> > > +	 */
> > > +	if (down_trylock(&fcopy_transaction.read_sema))
> > > +		;
> > > +
> > >  }
> > >
> > >  static int fcopy_handle_handshake(u32 version) @@ -351,6 +363,13 @@
> > > static int fcopy_release(struct inode *inode, struct
> > file *f)
> > >  	 */
> > >  	in_hand_shake = true;
> > >  	opened = false;
> > > +
> > > +	if (cancel_delayed_work_sync(&fcopy_work)) {
> > > +		/* We haven't up()-ed the semaphore(very rare)? */
> > > +		if (down_trylock(&fcopy_transaction.read_sema))
> > > +			;
> > > +		fcopy_respond_to_host(HV_E_FAIL);
> > > +	}
> > >  	return 0;
> > >  }
> >
> > --
> >   Vitaly

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2015-01-14  3:21 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-27 13:09 [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure Dexuan Cui
2014-11-27 13:09 ` Dexuan Cui
2014-11-28  6:47 ` Jason Wang
2014-11-28  6:55   ` Jason Wang
2014-11-28  8:36   ` Dexuan Cui
2014-11-28  8:36     ` Dexuan Cui
     [not found]     ` <F792CF86EFE20D4AB8064279AFBA51C613E66566@HKNPRD3002MB017.064d.mgd.msft.net >
2014-11-28 10:12       ` Jason Wang
2014-11-28 10:20         ` Jason Wang
2014-11-28 11:54         ` Dexuan Cui
2014-11-28 11:54           ` Dexuan Cui
     [not found]           ` <F792CF86EFE20D4AB8064279AFBA51C613E66AAA@HKNPRD3002MB017.064d.mgd.msft.net >
2014-12-01  8:22             ` Jason Wang
2014-12-01  8:30               ` Jason Wang
2014-12-01  9:47               ` Dexuan Cui
2014-12-01  9:47                 ` Dexuan Cui
2014-12-01  9:55                 ` Vitaly Kuznetsov
2014-12-01  9:55                   ` Vitaly Kuznetsov
     [not found]                 ` <F792CF86EFE20D4AB8064279AFBA51C613E691B8@HKNPRD3002MB017.064d.mgd.msft.net >
2014-12-01 10:18                   ` Jason Wang
2014-12-01 10:26                     ` Jason Wang
2014-12-01 11:00                     ` Dexuan Cui
2014-12-01 11:00                       ` Dexuan Cui
2014-12-01 15:54                       ` KY Srinivasan
2014-12-01 15:54                         ` KY Srinivasan
2014-12-02  5:33                         ` Dexuan Cui
2014-12-02  5:33                           ` Dexuan Cui
     [not found]                           ` <F792CF86EFE20D4AB8064279AFBA51C613E6B93B@HKNPRD3002MB017.064d.mgd.msft.net >
2014-12-02  6:58                             ` Jason Wang
2014-12-02  6:58                               ` Jason Wang
2015-01-13 13:52 ` Vitaly Kuznetsov
2015-01-13 13:52   ` Vitaly Kuznetsov
2015-01-14  1:43   ` Dexuan Cui
2015-01-14  1:43     ` Dexuan Cui
     [not found]     ` <F792CF86EFE20D4AB8064279AFBA51C61EA7CFF9@SIXPRD3002MB028.064d.mgd.msft.net >
2015-01-14  2:28       ` Jason Wang
2015-01-14  2:36         ` Jason Wang
2015-01-14  3:21     ` KY Srinivasan
2015-01-14  3:21       ` KY Srinivasan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.