All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH] char: kick main loop after adding a watch
@ 2017-03-31 16:43 Stefan Hajnoczi
  2017-03-31 16:53 ` Paolo Bonzini
  2017-03-31 17:09 ` Richard W.M. Jones
  0 siblings, 2 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2017-03-31 16:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Marc-André Lureau, Richard W . M . Jones,
	Stefan Hajnoczi

The ISA serial port device's output can hang when the pipe on stdout
becomes full.  This is a race condition where the vcpu thread executing
serial emulation code adds a watch on stdout while the main loop thread
is blocked in ppoll(2).  If no timer or other event wakes up the main
loop, there will be no further output from the serial device even when
the pipe becomes writable.

Richard W. M. Jones was able to reproduce the hang on recent versions of
guestfs-tools-c and libglib2 on Fedora 26 hosts.

This patch kicks the main loop so the next iteration invokes ppoll(2)
with the watch fd.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
Reported-by: Richard W. M. Jones <rjones@redhat.com>
Tested-by: Richard W. M. Jones <rjones@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 chardev/char.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/chardev/char.c b/chardev/char.c
index 3df1163..6c99c34 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
     tag = g_source_attach(src, NULL);
     g_source_unref(src);
 
+    /* The main loop may be in blocked waiting on events in another thread.
+     * Kick it so the new watch will be added.
+     */
+    qemu_notify_event();
+
     return tag;
 }
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
  2017-03-31 16:43 [Qemu-devel] [PATCH] char: kick main loop after adding a watch Stefan Hajnoczi
@ 2017-03-31 16:53 ` Paolo Bonzini
  2017-03-31 17:31   ` Richard W.M. Jones
  2017-04-03 13:20   ` Stefan Hajnoczi
  2017-03-31 17:09 ` Richard W.M. Jones
  1 sibling, 2 replies; 8+ messages in thread
From: Paolo Bonzini @ 2017-03-31 16:53 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Richard W.M. Jones, Marc-André Lureau, Frediano Ziglio



On 31/03/2017 18:43, Stefan Hajnoczi wrote:
> The ISA serial port device's output can hang when the pipe on stdout
> becomes full.  This is a race condition where the vcpu thread executing
> serial emulation code adds a watch on stdout while the main loop thread
> is blocked in ppoll(2).  If no timer or other event wakes up the main
> loop, there will be no further output from the serial device even when
> the pipe becomes writable.
> 
> Richard W. M. Jones was able to reproduce the hang on recent versions of
> guestfs-tools-c and libglib2 on Fedora 26 hosts.
> 
> This patch kicks the main loop so the next iteration invokes ppoll(2)
> with the watch fd.
> 
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
> Reported-by: Richard W. M. Jones <rjones@redhat.com>
> Tested-by: Richard W. M. Jones <rjones@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  chardev/char.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/chardev/char.c b/chardev/char.c
> index 3df1163..6c99c34 100644
> --- a/chardev/char.c
> +++ b/chardev/char.c
> @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
>      tag = g_source_attach(src, NULL);
>      g_source_unref(src);
>  
> +    /* The main loop may be in blocked waiting on events in another thread.
> +     * Kick it so the new watch will be added.
> +     */
> +    qemu_notify_event();
> +
>      return tag;
>  }
>  
> 

Thanks for looking at this, I was quite stuck and now I understand 
what's going on.  However, I don't believe your patch is the right 
solution.

According to Richard's bisection, the bug was introduced by the patch 
at https://bug761102.bugzilla-attachments.gnome.org/attachment.cgi?id=319699.

The g_wakeup_signal that is removed (actually made conditional) in that 
patch is doing exactly the same thing as qemu_notify_event, which is
fishy...  It would still be a QEMU bug according to the theory below but,
depending on how they handle backwards-compatibility, they might 
consider undoing this change.

glib is expecting QEMU to use g_main_context_acquire around accesses to 
GMainContext.  However QEMU is not doing that, instead it is taking its 
own mutex.  So we should add g_main_context_acquire and
g_main_context_release in the two implementations of 
os_host_main_loop_wait; these should undo the effect of Frediano's 
glib patch.

In all fairness, the docs do say "You must be the owner of a context 
before you can call g_main_context_prepare(), g_main_context_query(), 
g_main_context_check(), g_main_context_dispatch()".  However, it has
worked until now and the documentation does not say exactly why that
is necessary.

Paolo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
  2017-03-31 16:43 [Qemu-devel] [PATCH] char: kick main loop after adding a watch Stefan Hajnoczi
  2017-03-31 16:53 ` Paolo Bonzini
@ 2017-03-31 17:09 ` Richard W.M. Jones
  1 sibling, 0 replies; 8+ messages in thread
From: Richard W.M. Jones @ 2017-03-31 17:09 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, Paolo Bonzini, Marc-André Lureau

On Fri, Mar 31, 2017 at 05:43:22PM +0100, Stefan Hajnoczi wrote:
> The ISA serial port device's output can hang when the pipe on stdout
> becomes full.  This is a race condition where the vcpu thread executing
> serial emulation code adds a watch on stdout while the main loop thread
> is blocked in ppoll(2).  If no timer or other event wakes up the main
> loop, there will be no further output from the serial device even when
> the pipe becomes writable.
> 
> Richard W. M. Jones was able to reproduce the hang on recent versions of
> guestfs-tools-c and libglib2 on Fedora 26 hosts.
> 
> This patch kicks the main loop so the next iteration invokes ppoll(2)
> with the watch fd.
> 
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
> Reported-by: Richard W. M. Jones <rjones@redhat.com>
> Tested-by: Richard W. M. Jones <rjones@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  chardev/char.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/chardev/char.c b/chardev/char.c
> index 3df1163..6c99c34 100644
> --- a/chardev/char.c
> +++ b/chardev/char.c
> @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
>      tag = g_source_attach(src, NULL);
>      g_source_unref(src);
>  
> +    /* The main loop may be in blocked waiting on events in another thread.
> +     * Kick it so the new watch will be added.
> +     */
> +    qemu_notify_event();
> +
>      return tag;
>  }
>  

Obviously Paolo has provided further insights in the other part of
this thread, but from my point of view this does appear to fix the
problem.  I tested it on 3 different physical machines which
previously showed hangs, and on all 3 the test ran for > 30 mins
[before I killed the test] which is far longer than seen before.

Therefore: ACK.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
  2017-03-31 16:53 ` Paolo Bonzini
@ 2017-03-31 17:31   ` Richard W.M. Jones
  2017-03-31 17:31     ` Paolo Bonzini
  2017-04-01 10:08     ` Paolo Bonzini
  2017-04-03 13:20   ` Stefan Hajnoczi
  1 sibling, 2 replies; 8+ messages in thread
From: Richard W.M. Jones @ 2017-03-31 17:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, qemu-devel, Marc-André Lureau, Frediano Ziglio

[-- Attachment #1: Type: text/plain, Size: 819 bytes --]

On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
> glib is expecting QEMU to use g_main_context_acquire around accesses to 
> GMainContext.  However QEMU is not doing that, instead it is taking its 
> own mutex.  So we should add g_main_context_acquire and
> g_main_context_release in the two implementations of 
> os_host_main_loop_wait; these should undo the effect of Frediano's 
> glib patch.

Based on this paragraph, I'm testing the attached patch, and it does
also appear to solve the hanging serial port problem.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

[-- Attachment #2: 0001-main-loop-Acquire-main_context-lock-around-os_host_m.patch --]
[-- Type: text/plain, Size: 1656 bytes --]

>From d46afa25dae0f1d057cfac3a0b6ae94771be51f2 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjones@redhat.com>
Date: Fri, 31 Mar 2017 18:29:01 +0100
Subject: [PATCH] main-loop: Acquire main_context lock around
 os_host_main_loop_wait.

---
 util/main-loop.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/util/main-loop.c b/util/main-loop.c
index 4534c89..19cad6b 100644
--- a/util/main-loop.c
+++ b/util/main-loop.c
@@ -218,9 +218,12 @@ static void glib_pollfds_poll(void)
 
 static int os_host_main_loop_wait(int64_t timeout)
 {
+    GMainContext *context = g_main_context_default();
     int ret;
     static int spin_counter;
 
+    g_main_context_acquire(context);
+
     glib_pollfds_fill(&timeout);
 
     /* If the I/O thread is very busy or we are incorrectly busy waiting in
@@ -256,6 +259,9 @@ static int os_host_main_loop_wait(int64_t timeout)
     }
 
     glib_pollfds_poll();
+
+    g_main_context_release(context);
+
     return ret;
 }
 #else
@@ -412,12 +418,15 @@ static int os_host_main_loop_wait(int64_t timeout)
     fd_set rfds, wfds, xfds;
     int nfds;
 
+    g_main_context_acquire(context);
+
     /* XXX: need to suppress polling by better using win32 events */
     ret = 0;
     for (pe = first_polling_entry; pe != NULL; pe = pe->next) {
         ret |= pe->func(pe->opaque);
     }
     if (ret != 0) {
+        g_main_context_release(context);
         return ret;
     }
 
@@ -472,6 +481,8 @@ static int os_host_main_loop_wait(int64_t timeout)
         g_main_context_dispatch(context);
     }
 
+    g_main_context_release(context);
+
     return select_ret || g_poll_ret;
 }
 #endif
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
  2017-03-31 17:31   ` Richard W.M. Jones
@ 2017-03-31 17:31     ` Paolo Bonzini
  2017-03-31 17:40       ` Richard W.M. Jones
  2017-04-01 10:08     ` Paolo Bonzini
  1 sibling, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2017-03-31 17:31 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: Stefan Hajnoczi, qemu-devel, Marc-André Lureau, Frediano Ziglio



On 31/03/2017 19:31, Richard W.M. Jones wrote:
> On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
>> glib is expecting QEMU to use g_main_context_acquire around accesses to 
>> GMainContext.  However QEMU is not doing that, instead it is taking its 
>> own mutex.  So we should add g_main_context_acquire and
>> g_main_context_release in the two implementations of 
>> os_host_main_loop_wait; these should undo the effect of Frediano's 
>> glib patch.
> 
> Based on this paragraph, I'm testing the attached patch, and it does
> also appear to solve the hanging serial port problem.

Great, can you do more testing and/or send the patch formally as a
toplevel message?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
  2017-03-31 17:31     ` Paolo Bonzini
@ 2017-03-31 17:40       ` Richard W.M. Jones
  0 siblings, 0 replies; 8+ messages in thread
From: Richard W.M. Jones @ 2017-03-31 17:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc-André Lureau, qemu-devel, Stefan Hajnoczi, Frediano Ziglio

On Fri, Mar 31, 2017 at 07:31:57PM +0200, Paolo Bonzini wrote:
> 
> 
> On 31/03/2017 19:31, Richard W.M. Jones wrote:
> > On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
> >> glib is expecting QEMU to use g_main_context_acquire around accesses to 
> >> GMainContext.  However QEMU is not doing that, instead it is taking its 
> >> own mutex.  So we should add g_main_context_acquire and
> >> g_main_context_release in the two implementations of 
> >> os_host_main_loop_wait; these should undo the effect of Frediano's 
> >> glib patch.
> > 
> > Based on this paragraph, I'm testing the attached patch, and it does
> > also appear to solve the hanging serial port problem.
> 
> Great, can you do more testing and/or send the patch formally as a
> toplevel message?

Sure, I'm going to test it a lot more overnight, and then
if it survives all that I'll post it properly.

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
  2017-03-31 17:31   ` Richard W.M. Jones
  2017-03-31 17:31     ` Paolo Bonzini
@ 2017-04-01 10:08     ` Paolo Bonzini
  1 sibling, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2017-04-01 10:08 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: Marc-André Lureau, qemu-devel, Stefan Hajnoczi, Frediano Ziglio



On 31/03/2017 19:31, Richard W.M. Jones wrote:
> On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
>> glib is expecting QEMU to use g_main_context_acquire around accesses to 
>> GMainContext.  However QEMU is not doing that, instead it is taking its 
>> own mutex.  So we should add g_main_context_acquire and
>> g_main_context_release in the two implementations of 
>> os_host_main_loop_wait; these should undo the effect of Frediano's 
>> glib patch.
> 
> Based on this paragraph, I'm testing the attached patch, and it does
> also appear to solve the hanging serial port problem.

Turn out that it may actually be a glib bug.  Your patch would enable
Frediano's optimizations so it's worth applying anyway.

https://bugzilla.gnome.org/show_bug.cgi?id=761102

Paolo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
  2017-03-31 16:53 ` Paolo Bonzini
  2017-03-31 17:31   ` Richard W.M. Jones
@ 2017-04-03 13:20   ` Stefan Hajnoczi
  1 sibling, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2017-04-03 13:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, Richard W.M. Jones, Marc-André Lureau, Frediano Ziglio

[-- Attachment #1: Type: text/plain, Size: 3077 bytes --]

On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
> 
> 
> On 31/03/2017 18:43, Stefan Hajnoczi wrote:
> > The ISA serial port device's output can hang when the pipe on stdout
> > becomes full.  This is a race condition where the vcpu thread executing
> > serial emulation code adds a watch on stdout while the main loop thread
> > is blocked in ppoll(2).  If no timer or other event wakes up the main
> > loop, there will be no further output from the serial device even when
> > the pipe becomes writable.
> > 
> > Richard W. M. Jones was able to reproduce the hang on recent versions of
> > guestfs-tools-c and libglib2 on Fedora 26 hosts.
> > 
> > This patch kicks the main loop so the next iteration invokes ppoll(2)
> > with the watch fd.
> > 
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
> > Reported-by: Richard W. M. Jones <rjones@redhat.com>
> > Tested-by: Richard W. M. Jones <rjones@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  chardev/char.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/chardev/char.c b/chardev/char.c
> > index 3df1163..6c99c34 100644
> > --- a/chardev/char.c
> > +++ b/chardev/char.c
> > @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
> >      tag = g_source_attach(src, NULL);
> >      g_source_unref(src);
> >  
> > +    /* The main loop may be in blocked waiting on events in another thread.
> > +     * Kick it so the new watch will be added.
> > +     */
> > +    qemu_notify_event();
> > +
> >      return tag;
> >  }
> >  
> > 
> 
> Thanks for looking at this, I was quite stuck and now I understand 
> what's going on.  However, I don't believe your patch is the right 
> solution.
> 
> According to Richard's bisection, the bug was introduced by the patch 
> at https://bug761102.bugzilla-attachments.gnome.org/attachment.cgi?id=319699.
> 
> The g_wakeup_signal that is removed (actually made conditional) in that 
> patch is doing exactly the same thing as qemu_notify_event, which is
> fishy...  It would still be a QEMU bug according to the theory below but,
> depending on how they handle backwards-compatibility, they might 
> consider undoing this change.
> 
> glib is expecting QEMU to use g_main_context_acquire around accesses to 
> GMainContext.  However QEMU is not doing that, instead it is taking its 
> own mutex.  So we should add g_main_context_acquire and
> g_main_context_release in the two implementations of 
> os_host_main_loop_wait; these should undo the effect of Frediano's 
> glib patch.
> 
> In all fairness, the docs do say "You must be the owner of a context 
> before you can call g_main_context_prepare(), g_main_context_query(), 
> g_main_context_check(), g_main_context_dispatch()".  However, it has
> worked until now and the documentation does not say exactly why that
> is necessary.

Thanks Paolo, very interesting.  I never realized that glib has its own
inter-thread signal.

NACK

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-04-03 13:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-31 16:43 [Qemu-devel] [PATCH] char: kick main loop after adding a watch Stefan Hajnoczi
2017-03-31 16:53 ` Paolo Bonzini
2017-03-31 17:31   ` Richard W.M. Jones
2017-03-31 17:31     ` Paolo Bonzini
2017-03-31 17:40       ` Richard W.M. Jones
2017-04-01 10:08     ` Paolo Bonzini
2017-04-03 13:20   ` Stefan Hajnoczi
2017-03-31 17:09 ` Richard W.M. Jones

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.