linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] VMCI: Release resource if the work is already queued
@ 2019-08-20 20:26 Nadav Amit
  2019-08-22  6:35 ` Vishnu Dasa
  2019-09-07 18:47 ` Greg Kroah-Hartman
  0 siblings, 2 replies; 5+ messages in thread
From: Nadav Amit @ 2019-08-20 20:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Arnd Bergmann
  Cc: linux-kernel, Francois Rigault, Nadav Amit, Jorgen Hansen,
	Adit Ranadive, Alexios Zavras, Vishnu DASA, stable

Francois reported that VMware balloon gets stuck after a balloon reset,
when the VMCI doorbell is removed. A similar error can occur when the
balloon driver is removed with the following splat:

[ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds.
[ 1088.622035]       Tainted: G        W         5.2.0 #4
[ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.622205] modprobe        D    0  3565   1450 0x00000000
[ 1088.622210] Call Trace:
[ 1088.622246]  __schedule+0x2a8/0x690
[ 1088.622248]  schedule+0x2d/0x90
[ 1088.622250]  schedule_timeout+0x1d3/0x2f0
[ 1088.622252]  wait_for_completion+0xba/0x140
[ 1088.622320]  ? wake_up_q+0x80/0x80
[ 1088.622370]  vmci_resource_remove+0xb9/0xc0 [vmw_vmci]
[ 1088.622373]  vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci]
[ 1088.622379]  vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon]
[ 1088.622381]  vmballoon_exit+0x18/0xcc8 [vmw_balloon]
[ 1088.622394]  __x64_sys_delete_module+0x146/0x280
[ 1088.622408]  do_syscall_64+0x5a/0x130
[ 1088.622410]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1088.622415] RIP: 0033:0x7f54f62791b7
[ 1088.622421] Code: Bad RIP value.
[ 1088.622421] RSP: 002b:00007fff2a949008 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1088.622426] RAX: ffffffffffffffda RBX: 000055dff8b55d00 RCX: 00007f54f62791b7
[ 1088.622426] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055dff8b55d68
[ 1088.622427] RBP: 000055dff8b55d00 R08: 00007fff2a947fb1 R09: 0000000000000000
[ 1088.622427] R10: 00007f54f62f5cc0 R11: 0000000000000206 R12: 000055dff8b55d68
[ 1088.622428] R13: 0000000000000001 R14: 000055dff8b55d68 R15: 00007fff2a94a3f0

The cause for the bug is that when the "delayed" doorbell is invoked, it
takes a reference on the doorbell entry and schedules work that is
supposed to run the appropriate code and drop the doorbell entry
reference. The code ignores the fact that if the work is already queued,
it will not be scheduled to run one more time. As a result one of the
references would not be dropped. When the code waits for the reference
to get to zero, during balloon reset or module removal, it gets stuck.

Fix it. Drop the reference if schedule_work() indicates that the work is
already queued.

Note that this bug got more apparent (or apparent at all) due to
commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").

Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.")
Reported-by: Francois Rigault <rigault.francois@gmail.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Adit Ranadive <aditr@vmware.com>
Cc: Alexios Zavras <alexios.zavras@intel.com>
Cc: Vishnu DASA <vdasa@vmware.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 drivers/misc/vmw_vmci/vmci_doorbell.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_vmci/vmci_doorbell.c b/drivers/misc/vmw_vmci/vmci_doorbell.c
index bad89b6e0802..345addd9306d 100644
--- a/drivers/misc/vmw_vmci/vmci_doorbell.c
+++ b/drivers/misc/vmw_vmci/vmci_doorbell.c
@@ -310,7 +310,8 @@ int vmci_dbell_host_context_notify(u32 src_cid, struct vmci_handle handle)
 
 	entry = container_of(resource, struct dbell_entry, resource);
 	if (entry->run_delayed) {
-		schedule_work(&entry->work);
+		if (!schedule_work(&entry->work))
+			vmci_resource_put(resource);
 	} else {
 		entry->notify_cb(entry->client_data);
 		vmci_resource_put(resource);
@@ -361,7 +362,8 @@ static void dbell_fire_entries(u32 notify_idx)
 		    atomic_read(&dbell->active) == 1) {
 			if (dbell->run_delayed) {
 				vmci_resource_get(&dbell->resource);
-				schedule_work(&dbell->work);
+				if (!schedule_work(&dbell->work))
+					vmci_resource_put(&dbell->resource);
 			} else {
 				dbell->notify_cb(dbell->client_data);
 			}
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] VMCI: Release resource if the work is already queued
  2019-08-20 20:26 [PATCH] VMCI: Release resource if the work is already queued Nadav Amit
@ 2019-08-22  6:35 ` Vishnu Dasa
  2019-09-07 18:47 ` Greg Kroah-Hartman
  1 sibling, 0 replies; 5+ messages in thread
From: Vishnu Dasa @ 2019-08-22  6:35 UTC (permalink / raw)
  To: Nadav Amit, Greg Kroah-Hartman, Arnd Bergmann
  Cc: linux-kernel, Francois Rigault, Jorgen Hansen, Adit Ranadive,
	Alexios Zavras, stable

On 8/20/19, 8:48 PM, "Nadav Amit" <namit@vmware.com> wrote:
> Francois reported that VMware balloon gets stuck after a balloon reset,
> when the VMCI doorbell is removed. A similar error can occur when the
> balloon driver is removed with the following splat:
> 
> [ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds.
> [ 1088.622035]       Tainted: G        W         5.2.0 #4
> [ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1088.622205] modprobe        D    0  3565   1450 0x00000000
> [ 1088.622210] Call Trace:
> [ 1088.622246]  __schedule+0x2a8/0x690
> [ 1088.622248]  schedule+0x2d/0x90
> [ 1088.622250]  schedule_timeout+0x1d3/0x2f0
> [ 1088.622252]  wait_for_completion+0xba/0x140
> [ 1088.622320]  ? wake_up_q+0x80/0x80
> [ 1088.622370]  vmci_resource_remove+0xb9/0xc0 [vmw_vmci]
> [ 1088.622373]  vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci]
> [ 1088.622379]  vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon]
> [ 1088.622381]  vmballoon_exit+0x18/0xcc8 [vmw_balloon]
> [ 1088.622394]  __x64_sys_delete_module+0x146/0x280
> [ 1088.622408]  do_syscall_64+0x5a/0x130
> [ 1088.622410]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1088.622415] RIP: 0033:0x7f54f62791b7
> [ 1088.622421] Code: Bad RIP value.
> [ 1088.622421] RSP: 002b:00007fff2a949008 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [ 1088.622426] RAX: ffffffffffffffda RBX: 000055dff8b55d00 RCX: 00007f54f62791b7
> [ 1088.622426] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055dff8b55d68
> [ 1088.622427] RBP: 000055dff8b55d00 R08: 00007fff2a947fb1 R09: 0000000000000000
> [ 1088.622427] R10: 00007f54f62f5cc0 R11: 0000000000000206 R12: 000055dff8b55d68
> [ 1088.622428] R13: 0000000000000001 R14: 000055dff8b55d68 R15: 00007fff2a94a3f0
> 
> The cause for the bug is that when the "delayed" doorbell is invoked, it
> takes a reference on the doorbell entry and schedules work that is
> supposed to run the appropriate code and drop the doorbell entry
> reference. The code ignores the fact that if the work is already queued,
> it will not be scheduled to run one more time. As a result one of the
> references would not be dropped. When the code waits for the reference
> to get to zero, during balloon reset or module removal, it gets stuck.
>
> Fix it. Drop the reference if schedule_work() indicates that the work is
> already queued.
>
> Note that this bug got more apparent (or apparent at all) due to
> commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").
>
> Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.")
> Reported-by: Francois Rigault <rigault.francois@gmail.com>
> Cc: Jorgen Hansen <jhansen@vmware.com>
> Cc: Adit Ranadive <aditr@vmware.com>
> Cc: Alexios Zavras <alexios.zavras@intel.com>
> Cc: Vishnu DASA <vdasa@vmware.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
> drivers/misc/vmw_vmci/vmci_doorbell.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)

Thanks for the fix, looks good to me.
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>

--
vishnu


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] VMCI: Release resource if the work is already queued
  2019-08-20 20:26 [PATCH] VMCI: Release resource if the work is already queued Nadav Amit
  2019-08-22  6:35 ` Vishnu Dasa
@ 2019-09-07 18:47 ` Greg Kroah-Hartman
  2019-09-08  4:04   ` Nadav Amit
  1 sibling, 1 reply; 5+ messages in thread
From: Greg Kroah-Hartman @ 2019-09-07 18:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Arnd Bergmann, linux-kernel, Francois Rigault, Jorgen Hansen,
	Adit Ranadive, Alexios Zavras, Vishnu DASA, stable

On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote:
> Francois reported that VMware balloon gets stuck after a balloon reset,
> when the VMCI doorbell is removed. A similar error can occur when the
> balloon driver is removed with the following splat:

<snip>

Note, google thinks your email is spam as you are not sending this from
a valid vmware.com email server.  Please fix this up if you want to make
sure your patches actually make it through...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] VMCI: Release resource if the work is already queued
  2019-09-07 18:47 ` Greg Kroah-Hartman
@ 2019-09-08  4:04   ` Nadav Amit
  2019-09-08 10:19     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 5+ messages in thread
From: Nadav Amit @ 2019-09-08  4:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Arnd Bergmann, linux-kernel, Francois Rigault, Jorgen Hansen,
	Adit Ranadive, Alexios Zavras, Vishnu Dasa, stable

> On Sep 7, 2019, at 9:47 PM, Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> 
> On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote:
>> Francois reported that VMware balloon gets stuck after a balloon reset,
>> when the VMCI doorbell is removed. A similar error can occur when the
>> balloon driver is removed with the following splat:
> 
> <snip>
> 
> Note, google thinks your email is spam as you are not sending this from
> a valid vmware.com email server.  Please fix this up if you want to make
> sure your patches actually make it through...

I used gmail servers since I get complaints (mainly from Linus) that our
email servers use DMARC, and this causes mailing list postings to be
rejected.

I will check again if our open source people got to a solution, or I
would just send it using my gmail identity, while keeping the vmware
authorship.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] VMCI: Release resource if the work is already queued
  2019-09-08  4:04   ` Nadav Amit
@ 2019-09-08 10:19     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 5+ messages in thread
From: Greg Kroah-Hartman @ 2019-09-08 10:19 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Arnd Bergmann, linux-kernel, Francois Rigault, Jorgen Hansen,
	Adit Ranadive, Alexios Zavras, Vishnu Dasa, stable

On Sun, Sep 08, 2019 at 04:04:39AM +0000, Nadav Amit wrote:
> > On Sep 7, 2019, at 9:47 PM, Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> > 
> > On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote:
> >> Francois reported that VMware balloon gets stuck after a balloon reset,
> >> when the VMCI doorbell is removed. A similar error can occur when the
> >> balloon driver is removed with the following splat:
> > 
> > <snip>
> > 
> > Note, google thinks your email is spam as you are not sending this from
> > a valid vmware.com email server.  Please fix this up if you want to make
> > sure your patches actually make it through...
> 
> I used gmail servers since I get complaints (mainly from Linus) that our
> email servers use DMARC, and this causes mailing list postings to be
> rejected.

Well the email servers might not reject it now, but gmail itself sure
does as you are lying about what your domain is :(

> I will check again if our open source people got to a solution, or I
> would just send it using my gmail identity, while keeping the vmware
> authorship.

Please do, this is not a viable workflow.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-09-08 10:19 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-20 20:26 [PATCH] VMCI: Release resource if the work is already queued Nadav Amit
2019-08-22  6:35 ` Vishnu Dasa
2019-09-07 18:47 ` Greg Kroah-Hartman
2019-09-08  4:04   ` Nadav Amit
2019-09-08 10:19     ` Greg Kroah-Hartman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).