* [PATCH] monitor/qmp: resume monitor when clearing its queue
@ 2019-10-02  8:30 Wolfgang Bumiller
  2019-10-09  8:39 ` Markus Armbruster
  0 siblings, 1 reply; 6+ messages in thread
From: Wolfgang Bumiller @ 2019-10-02  8:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Michael Roth, Markus Armbruster, qemu-stable

When a monitor's queue is filled up in handle_qmp_command()
it gets suspended. It's the dispatcher bh's job currently to
resume the monitor, which it does after processing an event
from the queue. However, it is possible for a
CHR_EVENT_CLOSED event to be processed before the bh
is scheduled, which will clear the queue without resuming
the monitor, thereby preventing the dispatcher from reaching
the resume() call.
Fix this by resuming the monitor when clearing a queue which
was filled up.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
---
@Michael, we ran into this with qemu 4.0, so if the logic in this patch
is correct it may make sense to include it in the 4.0.1 roundup.
A backport is at [1] as 4.0 was before the monitor/ dir split.

[1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714

 monitor/qmp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/monitor/qmp.c b/monitor/qmp.c
index 9d9e5d8b27..c1db5bf940 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
 /* Caller must hold mon->qmp.qmp_queue_lock */
 static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
 {
+    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
+        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
     while (!g_queue_is_empty(mon->qmp_requests)) {
         qmp_request_free(g_queue_pop_head(mon->qmp_requests));
     }
+    if (need_resume) {
+        /*
+         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
+         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
+         * bh got scheduled.
+         */
+        monitor_resume(&mon->common);
+    }
 }
 
 static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
-- 
2.20.1





* Re: [PATCH] monitor/qmp: resume monitor when clearing its queue
  2019-10-02  8:30 [PATCH] monitor/qmp: resume monitor when clearing its queue Wolfgang Bumiller
@ 2019-10-09  8:39 ` Markus Armbruster
  2019-10-09 10:10   ` Wolfgang Bumiller
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Armbruster @ 2019-10-09  8:39 UTC (permalink / raw)
  To: Wolfgang Bumiller
  Cc: Marc-André Lureau, qemu-stable, qemu-devel, Michael Roth

Cc: Marc-André for additional monitor and chardev expertise.

Wolfgang Bumiller <w.bumiller@proxmox.com> writes:

> When a monitor's queue is filled up in handle_qmp_command()
> it gets suspended. It's the dispatcher bh's job currently to
> resume the monitor, which it does after processing an event
> from the queue. However, it is possible for a
> CHR_EVENT_CLOSED event to be processed before the bh
> is scheduled, which will clear the queue without resuming
> the monitor, thereby preventing the dispatcher from reaching
> the resume() call.

Because with the request queue cleared, there's nothing for
monitor_qmp_requests_pop_any_with_lock() to pop, so
monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
suspended forever.  Correct?

Observable effect for the monitor's user?

> Fix this by resuming the monitor when clearing a queue which
> was filled up.
>
> Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
> ---
> @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> is correct it may make sense to include it in the 4.0.1 roundup.
> A backport is at [1] as 4.0 was before the monitor/ dir split.
>
> [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>
>  monitor/qmp.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/monitor/qmp.c b/monitor/qmp.c
> index 9d9e5d8b27..c1db5bf940 100644
> --- a/monitor/qmp.c
> +++ b/monitor/qmp.c
> @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>  /* Caller must hold mon->qmp.qmp_queue_lock */
>  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>  {
> +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
> +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;

Can you explain why this condition is correct?

>      while (!g_queue_is_empty(mon->qmp_requests)) {
>          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>      }
> +    if (need_resume) {
> +        /*
> +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
> +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
> +         * bh got scheduled.
> +         */
> +        monitor_resume(&mon->common);
> +    }
>  }
>  
>  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)

Is monitor_qmp_cleanup_req_queue_locked() the correct place?

It's called from

* monitor_qmp_event() case CHR_EVENT_CLOSED via
  monitor_qmp_cleanup_queues(), as part of destroying the monitor's
  session state.

  This is the case you're trying to fix.  Correct?

  I figure monitor_resume() is safe because we haven't really destroyed
  anything, yet, we merely flushed the request queue.  Correct?

* monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
  the monitor.

  Can need_resume be true in this case?  If yes, is monitor_resume()
  still safe?  We're in the middle of destroying the monitor...



* Re: [PATCH] monitor/qmp: resume monitor when clearing its queue
  2019-10-09  8:39 ` Markus Armbruster
@ 2019-10-09 10:10   ` Wolfgang Bumiller
  2019-10-09 19:18     ` Markus Armbruster
  0 siblings, 1 reply; 6+ messages in thread
From: Wolfgang Bumiller @ 2019-10-09 10:10 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Marc-André Lureau, qemu-stable, qemu-devel, Michael Roth

On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
> Cc: Marc-André for additional monitor and chardev expertise.
> 
> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
> 
> > When a monitor's queue is filled up in handle_qmp_command()
> > it gets suspended. It's the dispatcher bh's job currently to
> > resume the monitor, which it does after processing an event
> > from the queue. However, it is possible for a
> > CHR_EVENT_CLOSED event to be processed before the bh
> > is scheduled, which will clear the queue without resuming
> > the monitor, thereby preventing the dispatcher from reaching
> > the resume() call.
> 
> Because with the request queue cleared, there's nothing for
> monitor_qmp_requests_pop_any_with_lock() to pop, so
> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
> suspended forever.  Correct?
> 
> Observable effect for the monitor's user?

Yes. More easily triggered now with oob. We ran into this a while
ago, but our only reliable trigger was a customized version of
-loadstate which loads the state from a separate file instead of the
vmstate region of a qcow2. Turns out that doing this on slow storage
(~12s to load the data) caused our status daemon to try to poll the qmp
socket during the load-state and give up after a 3s timeout. And since
the BH runs in the main loop, which is not even entered until after the
loadstate has finished, while the iothread handling the qmp socket does
fill and clear the queue, the qmp socket always ended up unusable
afterwards.

Aside from that we have users reporting the same symptom (hanging qmp)
appearing randomly on busy systems.

> > Fix this by resuming the monitor when clearing a queue which
> > was filled up.
> >
> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
> > ---
> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> > is correct it may make sense to include it in the 4.0.1 roundup.
> > A backport is at [1] as 4.0 was before the monitor/ dir split.
> >
> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
> >
> >  monitor/qmp.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> > index 9d9e5d8b27..c1db5bf940 100644
> > --- a/monitor/qmp.c
> > +++ b/monitor/qmp.c
> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
> >  /* Caller must hold mon->qmp.qmp_queue_lock */
> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
> >  {
> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
> 
> Can you explain why this condition is correct?

Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
which does the following *after* popping 1 element off the queue:

    need_resume = !qmp_oob_enabled(mon) ||
        mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
    qemu_mutex_unlock(&mon->qmp_queue_lock);

It's supposed to be the same condition, but evaluated _before_ popping
an element off the queue (hence no `- 1`). The queue also shouldn't be
empty, since otherwise the `monitor_suspend()` in `handle_qmp_command()`
hasn't happened; though on second thought, we could probably just
return early in that case.
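
For illustration, a rough sketch of that early-return variant (untested,
only reusing the names already visible in the patch; the condition
mirrors monitor_qmp_bh_dispatcher()'s, just without the `- 1`):

    /* Caller must hold mon->qmp.qmp_queue_lock */
    static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
    {
        bool need_resume;

        if (g_queue_is_empty(mon->qmp_requests)) {
            /* nothing was queued, so handle_qmp_command() didn't suspend */
            return;
        }

        /* same test as the dispatcher, but before popping an element */
        need_resume = !qmp_oob_enabled(mon)
            || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;

        while (!g_queue_is_empty(mon->qmp_requests)) {
            qmp_request_free(g_queue_pop_head(mon->qmp_requests));
        }

        if (need_resume) {
            monitor_resume(&mon->common);
        }
    }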

> 
> >      while (!g_queue_is_empty(mon->qmp_requests)) {
> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
> >      }
> > +    if (need_resume) {
> > +        /*
> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
> > +         * bh got scheduled.
> > +         */
> > +        monitor_resume(&mon->common);
> > +    }
> >  }
> >  
> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
> 
> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
> 
> It's called from
> 
> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>   session state.
> 
>   This is the case you're trying to fix.  Correct?
> 
>   I figure monitor_resume() is safe because we haven't really destroyed
>   anything, yet, we merely flushed the request queue.  Correct?
> 
> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
>   the monitor.
> 
>   Can need_resume be true in this case?  If yes, is monitor_resume()
>   still safe?  We're in the middle of destroying the monitor...

I thought so when first reading through it, but on second thought, we
should probably avoid this for sanity's sake.
Maybe with a flag, or an extra parameter.
Or we could introduce a "bool queue_filled" we set in handle_qmp_command()
instead of "calculating" `need_resume` in 2 places and unset it in
`monitor_data_destroy()` before clearing the queue?
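
Something like this, perhaps (only a sketch; `queue_filled` is the
hypothetical new MonitorQMP field, and monitor_qmp_bh_dispatcher() would
clear it as well when it does its own resume):

    /* handle_qmp_command(): right where the queue has just filled up
     * and the monitor gets suspended */
    monitor_suspend(&mon->common);
    mon->queue_filled = true;

    /* monitor_qmp_cleanup_req_queue_locked(): consult the flag instead
     * of recomputing need_resume */
    if (mon->queue_filled) {
        mon->queue_filled = false;
        monitor_resume(&mon->common);
    }

    /* monitor_data_destroy_qmp(): drop the flag before clearing the
     * queue, so a monitor that is being destroyed is never resumed */
    mon->queue_filled = false;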




* Re: [PATCH] monitor/qmp: resume monitor when clearing its queue
  2019-10-09 10:10   ` Wolfgang Bumiller
@ 2019-10-09 19:18     ` Markus Armbruster
  2019-10-10  8:12       ` Wolfgang Bumiller
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Armbruster @ 2019-10-09 19:18 UTC (permalink / raw)
  To: Wolfgang Bumiller
  Cc: Michael Roth, Marc-André Lureau, qemu-stable, qemu-devel

Wolfgang Bumiller <w.bumiller@proxmox.com> writes:

> On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
>> Cc: Marc-André for additional monitor and chardev expertise.
>> 
>> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>> 
>> > When a monitor's queue is filled up in handle_qmp_command()
>> > it gets suspended. It's the dispatcher bh's job currently to
>> > resume the monitor, which it does after processing an event
>> > from the queue. However, it is possible for a
>> > CHR_EVENT_CLOSED event to be processed before the bh
>> > is scheduled, which will clear the queue without resuming
>> > the monitor, thereby preventing the dispatcher from reaching
>> > the resume() call.
>> 
>> Because with the request queue cleared, there's nothing for
>> monitor_qmp_requests_pop_any_with_lock() to pop, so
>> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
>> suspended forever.  Correct?
>> 
>> Observable effect for the monitor's user?
>
> Yes.

I was too terse, let me try again: what exactly breaks for the monitor's
user?

>      More easily triggered now with oob. We ran into this a while
> ago, but our only reliable trigger was a customized version of
> -loadstate which loads the state from a separate file instead of the
> vmstate region of a qcow2. Turns out that doing this on slow storage
> (~12s to load the data) caused our status daemon to try to poll the qmp
> socket during the load-state and give up after a 3s timeout. And since
> the BH runs in the main loop, which is not even entered until after the
> loadstate has finished, while the iothread handling the qmp socket does
> fill and clear the queue, the qmp socket always ended up unusable
> afterwards.
>
> Aside from that we have users reporting the same symptom (hanging qmp)
> appearing randomly on busy systems.
>
>> > Fix this by resuming the monitor when clearing a queue which
>> > was filled up.
>> >
>> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
>> > ---
>> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
>> > is correct it may make sense to include it in the 4.0.1 roundup.
>> > A backport is at [1] as 4.0 was before the monitor/ dir split.
>> >
>> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>> >
>> >  monitor/qmp.c | 10 ++++++++++
>> >  1 file changed, 10 insertions(+)
>> >
>> > diff --git a/monitor/qmp.c b/monitor/qmp.c
>> > index 9d9e5d8b27..c1db5bf940 100644
>> > --- a/monitor/qmp.c
>> > +++ b/monitor/qmp.c
>> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>> >  /* Caller must hold mon->qmp.qmp_queue_lock */
>> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>> >  {
>> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
>> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>> 
>> Can you explain why this condition is correct?
>
> Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
> which does the following *after* popping 1 element off the queue:
>
>     need_resume = !qmp_oob_enabled(mon) ||
>         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
>     qemu_mutex_unlock(&mon->qmp_queue_lock);
>
> It's supposed to be the same condition, but evaluated _before_ popping
> an element off the queue (hence no `- 1`). The queue also shouldn't be
> empty, since otherwise the `monitor_suspend()` in `handle_qmp_command()`
> hasn't happened; though on second thought, we could probably just
> return early in that case.

I see.

Could we monitor_resume() unconditionally here?

>> >      while (!g_queue_is_empty(mon->qmp_requests)) {
>> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>> >      }
>> > +    if (need_resume) {
>> > +        /*
>> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
>> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
>> > +         * bh got scheduled.
>> > +         */
>> > +        monitor_resume(&mon->common);
>> > +    }
>> >  }
>> >  
>> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>> 
>> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>> 
>> It's called from
>> 
>> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>>   session state.
>> 
>>   This is the case you're trying to fix.  Correct?
>> 
>>   I figure monitor_resume() is safe because we haven't really destroyed
>>   anything, yet, we merely flushed the request queue.  Correct?
>> 
>> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
>>   the monitor.
>> 
>>   Can need_resume be true in this case?  If yes, is monitor_resume()
>>   still safe?  We're in the middle of destroying the monitor...
>
> I thought so when first reading through it, but on second thought, we
> should probably avoid this for sanity's sake.
> Maybe with a flag, or an extra parameter.
> Or we could introduce a "bool queue_filled" we set in handle_qmp_command()
> instead of "calculating" `need_resume` in 2 places and unset it in
> `monitor_data_destroy()` before clearing the queue?

Could we simply call monitor_resume() in monitor_qmp_event() right after
monitor_qmp_cleanup_queues()?



* Re: [PATCH] monitor/qmp: resume monitor when clearing its queue
  2019-10-09 19:18     ` Markus Armbruster
@ 2019-10-10  8:12       ` Wolfgang Bumiller
  2019-10-10  9:03         ` Markus Armbruster
  0 siblings, 1 reply; 6+ messages in thread
From: Wolfgang Bumiller @ 2019-10-10  8:12 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Marc-André Lureau, Michael Roth, qemu-stable

On Wed, Oct 09, 2019 at 09:18:04PM +0200, Markus Armbruster wrote:
> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
> 
> > On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
> >> Cc: Marc-André for additional monitor and chardev expertise.
> >> 
> >> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
> >> 
> >> > When a monitor's queue is filled up in handle_qmp_command()
> >> > it gets suspended. It's the dispatcher bh's job currently to
> >> > resume the monitor, which it does after processing an event
> >> > from the queue. However, it is possible for a
> >> > CHR_EVENT_CLOSED event to be processed before the bh
> >> > is scheduled, which will clear the queue without resuming
> >> > the monitor, thereby preventing the dispatcher from reaching
> >> > the resume() call.
> >> 
> >> Because with the request queue cleared, there's nothing for
> >> monitor_qmp_requests_pop_any_with_lock() to pop, so
> >> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
> >> suspended forever.  Correct?
> >> 
> >> Observable effect for the monitor's user?
> >
> > Yes.
> 
> I was too terse, let me try again: what exactly breaks for the monitor's
> user?

Any new connections to qmp will be accept()ed and show the greeting
({"QMP":{"version"...}}), but won't respond to any messages sent
afterwards (though in my tests, after a few more attempts to
reconnect/talk to the same qmp socket, sometimes not even the greeting
would appear anymore).
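
Roughly what a client then sees (illustrative, not a captured log):

    <- {"QMP": {"version": {...}, "capabilities": ["oob"]}}
    -> {"execute": "qmp_capabilities"}
    (no reply; the connection just sits there)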

> >      More easily triggered now with oob. We ran into this a while
> > ago, but our only reliable trigger was a customized version of
> > -loadstate which loads the state from a separate file instead of the
> > vmstate region of a qcow2. Turns out that doing this on slow storage
> > (~12s to load the data) caused our status daemon to try to poll the qmp
> > socket during the load-state and give up after a 3s timeout. And since
> > the BH runs in the main loop, which is not even entered until after the
> > loadstate has finished, while the iothread handling the qmp socket does
> > fill and clear the queue, the qmp socket always ended up unusable
> > afterwards.
> >
> > Aside from that we have users reporting the same symptom (hanging qmp)
> > appearing randomly on busy systems.
> >
> >> > Fix this by resuming the monitor when clearing a queue which
> >> > was filled up.
> >> >
> >> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
> >> > ---
> >> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> >> > is correct it may make sense to include it in the 4.0.1 roundup.
> >> > A backport is at [1] as 4.0 was before the monitor/ dir split.
> >> >
> >> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
> >> >
> >> >  monitor/qmp.c | 10 ++++++++++
> >> >  1 file changed, 10 insertions(+)
> >> >
> >> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> >> > index 9d9e5d8b27..c1db5bf940 100644
> >> > --- a/monitor/qmp.c
> >> > +++ b/monitor/qmp.c
> >> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
> >> >  /* Caller must hold mon->qmp.qmp_queue_lock */
> >> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
> >> >  {
> >> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
> >> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
> >> 
> >> Can you explain why this condition is correct?
> >
> > Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
> > which does the following *after* popping 1 element off the queue:
> >
> >     need_resume = !qmp_oob_enabled(mon) ||
> >         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
> >     qemu_mutex_unlock(&mon->qmp_queue_lock);
> >
> > It's supposed to be the same condition, but evaluated _before_ popping
> > an element off the queue (hence no `- 1`). The queue also shouldn't be
> > empty, since otherwise the `monitor_suspend()` in `handle_qmp_command()`
> > hasn't happened; though on second thought, we could probably just
> > return early in that case.
> 
> I see.
> 
> Could we monitor_resume() unconditionally here?

We can't, because suspend()/resume() employ a counter, which would
become unbalanced and could easily become negative by queuing a command
without filling the queue up and quickly disconnecting.
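
Schematically (not actual call sites, just to illustrate the counter):

    monitor_suspend(&mon->common);  /* suspend_cnt 0 -> 1 */
    monitor_resume(&mon->common);   /* suspend_cnt 1 -> 0 */
    monitor_resume(&mon->common);   /* suspend_cnt 0 -> -1: unbalanced; the
                                     * next suspend() only brings it back to
                                     * 0, so it no longer suspends anything */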

> 
> >> >      while (!g_queue_is_empty(mon->qmp_requests)) {
> >> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
> >> >      }
> >> > +    if (need_resume) {
> >> > +        /*
> >> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
> >> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
> >> > +         * bh got scheduled.
> >> > +         */
> >> > +        monitor_resume(&mon->common);
> >> > +    }
> >> >  }
> >> >  
> >> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
> >> 
> >> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
> >> 
> >> It's called from
> >> 
> >> * monitor_qmp_event() case CHR_EVENT_CLOSED via
> >>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
> >>   session state.
> >> 
> >>   This is the case you're trying to fix.  Correct?
> >> 
> >>   I figure monitor_resume() is safe because we haven't really destroyed
> >>   anything, yet, we merely flushed the request queue.  Correct?
> >> 
> >> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
> >>   the monitor.
> >> 
> >>   Can need_resume be true in this case?  If yes, is monitor_resume()
> >>   still safe?  We're in the middle of destroying the monitor...
> >
> > I thought so when first reading through it, but on second thought, we
> > should probably avoid this for sanity's sake.
> > Maybe with a flag, or an extra parameter.
> > Or we could introduce a "bool queue_filled" we set in handle_qmp_command()
> > instead of "calculating" `need_resume` in 2 places and unset it in
> > `monitor_data_destroy()` before clearing the queue?
> 
> Could we simply call monitor_resume() in monitor_qmp_event() right after
> monitor_qmp_cleanup_queues()?

We'd still need to make sure the suspend counter is balanced out with
the corresponding suspend() calls.




* Re: [PATCH] monitor/qmp: resume monitor when clearing its queue
  2019-10-10  8:12       ` Wolfgang Bumiller
@ 2019-10-10  9:03         ` Markus Armbruster
  0 siblings, 0 replies; 6+ messages in thread
From: Markus Armbruster @ 2019-10-10  9:03 UTC (permalink / raw)
  To: Wolfgang Bumiller
  Cc: Marc-André Lureau, qemu-stable, Gerd Hoffmann, qemu-devel,
	Michael Roth

Wolfgang Bumiller <w.bumiller@proxmox.com> writes:

> On Wed, Oct 09, 2019 at 09:18:04PM +0200, Markus Armbruster wrote:
>> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>> 
>> > On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
>> >> Cc: Marc-André for additional monitor and chardev expertise.
>> >> 
>> >> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>> >> 
>> >> > When a monitor's queue is filled up in handle_qmp_command()
>> >> > it gets suspended. It's the dispatcher bh's job currently to
>> >> > resume the monitor, which it does after processing an event
>> >> > from the queue. However, it is possible for a
>> >> > CHR_EVENT_CLOSED event to be processed before the bh
>> >> > is scheduled, which will clear the queue without resuming
>> >> > the monitor, thereby preventing the dispatcher from reaching
>> >> > the resume() call.
>> >> 
>> >> Because with the request queue cleared, there's nothing for
>> >> monitor_qmp_requests_pop_any_with_lock() to pop, so
>> >> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
>> >> suspended forever.  Correct?
>> >> 
>> >> Observable effect for the monitor's user?
>> >
>> > Yes.
>> 
>> I was too terse, let me try again: what exactly breaks for the monitor's
>> user?
>
> Any new connections to qmp will be accept()ed and show the greeting
> ({"QMP":{"version"...}}), but won't respond to any messages sent
> afterwards (though in my tests, after a few more attempts to
> reconnect/talk to the same qmp socket, sometimes not even the greeting
> would appear anymore).

Work this into your commit message, please :)

>> >      More easily triggered now with oob. We ran into this a while
>> > ago, but our only reliable trigger was a customized version of
>> > -loadstate which loads the state from a separate file instead of the
>> > vmstate region of a qcow2. Turns out that doing this on slow storage
>> > (~12s to load the data) caused our status daemon to try to poll the qmp
>> > socket during the load-state and give up after a 3s timeout. And since
>> > the BH runs in the main loop, which is not even entered until after the
>> > loadstate has finished, while the iothread handling the qmp socket does
>> > fill and clear the queue, the qmp socket always ended up unusable
>> > afterwards.
>> >
>> > Aside from that we have users reporting the same symptom (hanging qmp)
>> > appearing randomly on busy systems.
>> >
>> >> > Fix this by resuming the monitor when clearing a queue which
>> >> > was filled up.
>> >> >
>> >> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
>> >> > ---
>> >> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
>> >> > is correct it may make sense to include it in the 4.0.1 roundup.
>> >> > A backport is at [1] as 4.0 was before the monitor/ dir split.
>> >> >
>> >> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>> >> >
>> >> >  monitor/qmp.c | 10 ++++++++++
>> >> >  1 file changed, 10 insertions(+)
>> >> >
>> >> > diff --git a/monitor/qmp.c b/monitor/qmp.c
>> >> > index 9d9e5d8b27..c1db5bf940 100644
>> >> > --- a/monitor/qmp.c
>> >> > +++ b/monitor/qmp.c
>> >> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>> >> >  /* Caller must hold mon->qmp.qmp_queue_lock */
>> >> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>> >> >  {
>> >> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
>> >> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>> >> 
>> >> Can you explain why this condition is correct?
>> >
>> > Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
>> > which does the following *after* popping 1 element off the queue:
>> >
>> >     need_resume = !qmp_oob_enabled(mon) ||
>> >         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
>> >     qemu_mutex_unlock(&mon->qmp_queue_lock);
>> >
>> > It's supposed to be the same condition, but evaluated _before_ popping
>> > an element off the queue (hence no `- 1`). The queue also shouldn't be
>> > empty, since otherwise the `monitor_suspend()` in `handle_qmp_command()`
>> > hasn't happened; though on second thought, we could probably just
>> > return early in that case.
>> 
>> I see.
>> 
>> Could we monitor_resume() unconditionally here?
>
> We can't, because suspend()/resume() employ a counter, which would
> become unbalanced and could easily become negative by queuing a command
> without filling the queue up and quickly disconnecting.
>
>> 
>> >> >      while (!g_queue_is_empty(mon->qmp_requests)) {
>> >> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>> >> >      }
>> >> > +    if (need_resume) {
>> >> > +        /*
>> >> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
>> >> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
>> >> > +         * bh got scheduled.
>> >> > +         */
>> >> > +        monitor_resume(&mon->common);
>> >> > +    }
>> >> >  }
>> >> >  
>> >> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>> >> 
>> >> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>> >> 
>> >> It's called from
>> >> 
>> >> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>> >>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>> >>   session state.
>> >> 
>> >>   This is the case you're trying to fix.  Correct?
>> >> 
>> >>   I figure monitor_resume() is safe because we haven't really destroyed
>> >>   anything, yet, we merely flushed the request queue.  Correct?
>> >> 
>> >> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
>> >>   the monitor.
>> >> 
>> >>   Can need_resume be true in this case?  If yes, is monitor_resume()
>> >>   still safe?  We're in the middle of destroying the monitor...
>> >
>> > I thought so when first reading through it, but on second thought, we
>> > should probably avoid this for sanity's sake.
>> > Maybe with a flag, or an extra parameter.
>> > Or we could introduce a "bool queue_filled" we set in handle_qmp_command()
>> > instead of "calculating" `need_resume` in 2 places and unset it in
>> > `monitor_data_destroy()` before clearing the queue?
>> 
>> Could we simply call monitor_resume() in monitor_qmp_event() right after
>> monitor_qmp_cleanup_queues()?
>
> We'd still need to make sure the suspend counter is balanced out with
> the corresponding suspend() calls.

You made me look at ->suspend_cnt, and now I feel nauseous.

monitor_suspend() increments, and monitor_resume() decrements.  This
permits "recursive" suspension.  Fair enough.  I'd expect
monitor_resume() to assert the decrement doesn't go negative, but that's
detail.

The nauseating part is in monitor_event():

    switch (event) {
    case CHR_EVENT_MUX_IN:
        qemu_mutex_lock(&mon->mon_lock);
        mon->mux_out = 0;
        qemu_mutex_unlock(&mon->mon_lock);
        if (mon->reset_seen) {
            readline_restart(hmp_mon->rs);
--->        monitor_resume(mon);
            monitor_flush(mon);
        } else {
--->        atomic_mb_set(&mon->suspend_cnt, 0);
        }
        break;

    case CHR_EVENT_MUX_OUT:
        if (mon->reset_seen) {
            if (atomic_mb_read(&mon->suspend_cnt) == 0) {
                monitor_printf(mon, "\n");
            }
            monitor_flush(mon);
--->        monitor_suspend(mon);
        } else {
--->        atomic_inc(&mon->suspend_cnt);
        }
        qemu_mutex_lock(&mon->mon_lock);
        mon->mux_out = 1;
        qemu_mutex_unlock(&mon->mon_lock);
        break;


If mon->reset_seen is set, we use monitor_resume() and monitor_suspend() like
a good boy.

Else, we mess with mon->suspend_cnt directly.  Worse, we don't just
decrement, we go all the way to zero!

Gerd, this goes back to your a7aec5da4d "monitor: fix muxing".  It's
been ten years, but I have to ask anyway: why?



