qemu-devel.nongnu.org archive mirror
* [Qemu-devel] lock-free monitor?
@ 2016-02-08 15:17 Dr. David Alan Gilbert
  2016-02-09 13:47 ` Stefan Hajnoczi
  2016-02-09 16:57 ` Markus Armbruster
  0 siblings, 2 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2016-02-08 15:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: xiecl.fnst, armbru, stefanha

Hi,
  I wondered what it would take to be able to do a lock-free monitor;
i.e. one that could respond to (some) commands while the qemu big lock is held.

My reason for asking is that there are cases in migration and colo
where a network block device has failed and is blocking, but it fails
at just the wrong time while the lock is held, and now, you don't have
any way to unblock it since you can't do anything on the monitors.
If I could issue a command then I could have it end up doing a shutdown(2)
on the network connection and free things up.

Maybe this would also help real-time people?

I was thinking then, we could:
   a) Have a monitor that wasn't tied to the main loop at all - each instance
would be its own thread and have its own poll() (probably using the new
IO channels?)
   b) for most commands it would take the big lock before executing the command
and release it afterwards - so there would be no risk in those commands.
   c) Some commands you could mark as 'lock free' - they would be required
not to take any locks or wait for *anything* and for those the monitor would
not take the lock (see the sketch below).
   d) Perhaps have some way of restricting a particular monitor session to only
allowing non-blocking commands.
   e) Then I'd have to have a way of finding the broken device in a lock-free
way (RCU list walk?) and issuing the shutdown(2).

Bonus points to anyone who can statically check commands in (c).
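
To make (b) and (c) concrete, here's a rough sketch of what the
per-monitor dispatch could look like.  MonitorCommand, its flags field
and monitor_dispatch() are all made up for illustration; only
qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() are the real
big-lock calls:

  /* Illustrative only - none of this exists in QEMU today. */
  typedef struct Monitor Monitor;       /* stand-ins for QEMU's types */
  typedef struct QDict QDict;

  typedef struct MonitorCommand {
      const char *name;
      unsigned flags;                   /* e.g. MONITOR_CMD_LOCK_FREE */
      void (*handler)(Monitor *mon, const QDict *args);
  } MonitorCommand;

  #define MONITOR_CMD_LOCK_FREE 0x1

  void qemu_mutex_lock_iothread(void);  /* the big lock, as today */
  void qemu_mutex_unlock_iothread(void);

  static void monitor_dispatch(Monitor *mon, const MonitorCommand *cmd,
                               const QDict *args)
  {
      if (cmd->flags & MONITOR_CMD_LOCK_FREE) {
          /* (c): handler must not take locks or wait for anything. */
          cmd->handler(mon, args);
          return;
      }
      /* (b): behave exactly like a command run from the main loop. */
      qemu_mutex_lock_iothread();
      cmd->handler(mon, args);
      qemu_mutex_unlock_iothread();
  }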

Does this make sense to everyone else, or does anyone have any better
suggestions?

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] lock-free monitor?
  2016-02-08 15:17 [Qemu-devel] lock-free monitor? Dr. David Alan Gilbert
@ 2016-02-09 13:47 ` Stefan Hajnoczi
  2016-02-09 13:52   ` Dr. David Alan Gilbert
                     ` (2 more replies)
  2016-02-09 16:57 ` Markus Armbruster
  1 sibling, 3 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2016-02-09 13:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: xiecl.fnst, qemu-devel, armbru


On Mon, Feb 08, 2016 at 03:17:23PM +0000, Dr. David Alan Gilbert wrote:
> Does this make sense to everyone else, or does anyone have any better
> suggestions?

As a concrete example, any monitor command that calls bdrv_drain_all()
can hang forever with the QEMU global mutex held if I/O requests are
stuck (e.g. NFS mount is unreachable).

bdrv_aio_cancel() can also hang but is mostly exposed to device
emulation, not the monitor.

One solution for these block layer functions is to add a timeout
argument and let them return an error.  This way the monitor and device
emulation do not hang forever.
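
The shape would be roughly this (bdrv_requests_pending_all() here is an
invented stand-in for whatever condition the real drain loop waits on;
the clock and aio_poll() calls are the existing APIs):

  /* Sketch: a drain that gives up after a deadline instead of spinning
   * in aio_poll() forever. */
  int bdrv_drain_all_timed(int64_t timeout_ms)
  {
      int64_t deadline = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + timeout_ms;

      while (bdrv_requests_pending_all()) {
          if (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) >= deadline) {
              return -ETIME;    /* caller decides how to recover */
          }
          aio_poll(qemu_get_aio_context(), true /* blocking */);
      }
      return 0;
  }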

The benefit of the timeout is that both monitor and device emulation
hangs are tackled.  It also doesn't require monitor changes.

I'm not sure who chooses the timeout value and which value makes sense
(policy vs mechanism separation)...

Stefan



* Re: [Qemu-devel] lock-free monitor?
  2016-02-09 13:47 ` Stefan Hajnoczi
@ 2016-02-09 13:52   ` Dr. David Alan Gilbert
  2016-02-14  6:22   ` Fam Zheng
  2016-02-15 12:59   ` Kevin Wolf
  2 siblings, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2016-02-09 13:52 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: xiecl.fnst, qemu-devel, armbru

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Mon, Feb 08, 2016 at 03:17:23PM +0000, Dr. David Alan Gilbert wrote:
> > Does this make sense to everyone else, or does anyone have any better
> > suggestions?
> 
> As a concrete example, any monitor command that calls bdrv_drain_all()
> can hang forever with the QEMU global mutex held if I/O requests are
> stuck (e.g. NFS mount is unreachable).
> 
> bdrv_aio_cancel() can also hang but is mostly exposed to device
> emulation, not the monitor.
> 
> One solution for these block layer functions is to add a timeout
> argument and let them return an error.  This way the monitor and device
> emulation do not hang forever.
> 
> The benefit of the timeout is that both monitor and device emulation
> hangs are tackled.  It also doesn't require monitor changes.
> 
> I'm not sure who chooses the timeout value and which value makes sense
> (policy vs mechanism separation)...

Choosing that value tends to be very difficult - for example if it's
iSCSI then you have to make all the same type of choices as multipath
does and worry about switch reconfiguration and SANs failing over between
controllers etc.  And then what do you do when you time out - anywhere
that already calls bdrv_drain_all would have to make a decision.

> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] lock-free monitor?
  2016-02-08 15:17 [Qemu-devel] lock-free monitor? Dr. David Alan Gilbert
  2016-02-09 13:47 ` Stefan Hajnoczi
@ 2016-02-09 16:57 ` Markus Armbruster
  2016-02-10  8:52   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 12+ messages in thread
From: Markus Armbruster @ 2016-02-09 16:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: xiecl.fnst, qemu-devel, stefanha

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> Hi,
>   I wondered what it would take to be able to do a lock-free monitor;
> i.e. one that could respond to (some) commands while the qemu big lock is held.

Requires a careful audit of the monitor code.

For instance, cur_mon needs to be made thread local for running multiple
monitors concurrently.
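
That is, roughly (the two declarations are the before/after, not meant
to coexist):

  Monitor *cur_mon;            /* today: a single global */
  __thread Monitor *cur_mon;   /* per-thread, for concurrent monitors */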

> My reason for asking is that there are cases in migration and colo
> where a network block device has failed and is blocking, but it fails
> at just the wrong time while the lock is held, and now, you don't have

Is it okay to block while holding the BQL?

> any way to unblock it since you can't do anything on the monitors.
> If I could issue a command then I could have it end up doing a shutdown(2)
> on the network connection and free things up.
>
> Maybe this would also help real-time people?
>
> I was thinking then, we could:
>    a) Have a monitor that wasn't tied to the main loop at all - each instance
> would be its own thread and have its own poll() (probably using the new
> IO channels?)
>    b) for most commands it would take the big lock before executing the command
> and release it afterwards - so there would be no risk in those commands.

Saves you auditing their code, which would probably be impractical.

>    c) Some commands you could mark as 'lock free' - they would be required
> not to take any locks or wait for *anything* and for those the monitor would
> not take the lock.

They also must not access shared state without suitable synchronization.

>    d) Perhaps have some way of restricting a particular monitor session to only
> allowing non-blocking commands.

Why?  To ensure you always have an emergency monitor available?

>    e) Then I'd have to have a way of finding the broken device in a lock-free
> way (RCU list walk?) and issuing the shutdown(2).
>
> Bonus points to anyone who can statically check commands in (c).

Tall order.

> Does this make sense to everyone else, or does anyone have any better
> suggestions?

Sounds like a big, hairy task to me.

Any alternatives?


* Re: [Qemu-devel] lock-free monitor?
  2016-02-09 16:57 ` Markus Armbruster
@ 2016-02-10  8:52   ` Dr. David Alan Gilbert
  2016-02-10 15:12     ` Markus Armbruster
  0 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2016-02-10  8:52 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: xiecl.fnst, qemu-devel, stefanha

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > Hi,
> >   I wondered what it would take to be able to do a lock-free monitor;
> > i.e. one that could respond to (some) commands while the qemu big lock is held.
> 
> Requires a careful audit of the monitor code.
> 
> For instance, cur_mon needs to be made thread local for running multiple
> monitors concurrently.

Hmm that's fun;  and cur_mon is used all over the place when 'fd' passing
is used - although that probably should be cleaned up somehow.

> > My reason for asking is that there are cases in migration and colo
> > where a network block device has failed and is blocking, but it fails
> > at just the wrong time while the lock is held, and now, you don't have
> 
> Is it okay to block while holding the BQL?

I'd hope the simple answer was no; unfortunately the more complex answer
is that we keep finding places that do.  The cases I'm aware of are
mostly bdrv_drain_all calls in migration/colo, where one of the block
devices (e.g. an NBD network device) fails.  Generally these are called
at the end of a migration cycle when we're just ensuring that everything
really is drained and are normally called with the lock held to ensure
that nothing else is generating new traffic as we're clearing everything else.
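
The shape of the problem path is roughly this (heavily simplified from
the migration completion code):

  qemu_mutex_lock_iothread();
  vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
  bdrv_drain_all();     /* loops in aio_poll() until every request
                         * completes - forever, if e.g. the NBD server
                         * has gone away - with the BQL held, so all
                         * the monitors are stuck too */
  qemu_mutex_unlock_iothread();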

> > any way to unblock it since you can't do anything on the monitors.
> > If I could issue a command then I could have it end up doing a shutdown(2)
> > on the network connection and free things up.
> >
> > Maybe this would also help real-time people?
> >
> > I was thinking then, we could:
> >    a) Have a monitor that wasn't tied to the main loop at all - each instance
> > would be its own thread and have its own poll() (probably using the new
> > IO channels?)
> >    b) for most commands it would take the big lock before executing the command
> > and release it afterwards - so there would be no risk in those commands.
> 
> Saves you auditing their code, which would probably be impractical.
> 
> >    c) Some commands you could mark as 'lock free' - they would be required
> > not to take any locks or wait for *anything* and for those the monitor would
> > not take the lock.
> 
> They also must not access shared state without suitable synchronization.

Right; I'd expect most of the cases to either be reading simple status information,
or having to find whatever they need to do using rcu list walks.
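
For the shutdown(2) case from (e), something of this shape, assuming the
connections hang off an RCU-protected list (NetConn, its fields and the
net_conns list are invented; the queue.h/rcu_queue.h macros,
rcu_read_lock() and shutdown(2) are real):

  typedef struct NetConn {
      const char *id;
      int fd;
      QLIST_ENTRY(NetConn) next;
  } NetConn;

  static QLIST_HEAD(, NetConn) net_conns;

  static void lock_free_shutdown(const char *id)
  {
      NetConn *c;

      rcu_read_lock();                     /* never blocks */
      QLIST_FOREACH_RCU(c, &net_conns, next) {
          if (!strcmp(c->id, id)) {
              /* shutdown(2) doesn't block; it makes the stuck reads and
               * writes on the fd fail, which lets the drain in the main
               * loop finally make progress. */
              shutdown(c->fd, SHUT_RDWR);
              break;
          }
      }
      rcu_read_unlock();
  }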

> >    d) Perhaps have some way of restricting a particular monitor session to only
> > allowing non-blocking commands.
> 
> Why?  To ensure you always have an emergency monitor available?

Yes, mostly to stop screwups of putting a potentially blocking command down your
emergency route.

> >    e) Then I'd have to have a way of finding the broken device in a lock-free
> > way (RCU list walk?) and issuing the shutdown(2).
> >
> > Bonus points to anyone who can statically check commands in (c).
> 
> Tall order.

Yes :-)   A (compile-time selected) dynamic check might be possible
where the normal global lock functions and blocking RCU functions check if they're
in a monitor thread and assert.
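
Something like this (all names invented):

  /* Set on entry to a lock-free monitor thread; checked, under a
   * compile-time option, at the top of qemu_mutex_lock_iothread(),
   * synchronize_rcu() and friends. */
  static __thread bool in_lock_free_monitor;

  static inline void assert_may_block(void)
  {
  #ifdef CONFIG_DEBUG_LOCK_FREE_MONITOR
      assert(!in_lock_free_monitor);
  #endif
  }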

> > Does this make sense to everyone else, or does anyone have any better
> > suggestions?
> 
> Sounds like a big, hairy task to me.
> 
> Any alternatives?

I hope so; although isn't the idea of making a lock-free monitor a generally
good one that we should do anyway?

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] lock-free monitor?
  2016-02-10  8:52   ` Dr. David Alan Gilbert
@ 2016-02-10 15:12     ` Markus Armbruster
  2016-02-10 15:33       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 12+ messages in thread
From: Markus Armbruster @ 2016-02-10 15:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: xiecl.fnst, qemu-devel, stefanha

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>> 
>> > Hi,
>> >   I wondered what it would take to be able to do a lock-free monitor;
>> > i.e. one that could respond to (some) commands while the qemu big lock is held.
>> 
>> Requires a careful audit of the monitor code.
>> 
>> For instance, cur_mon needs to be made thread local for running multiple
>> monitors concurrently.
>
> Hmm that's fun;  and cur_mon is used all over the place when 'fd' passing
> is used - although that probably should be cleaned up somehow.

When cur_mon got created, we had no threads.  Even now, monitors only
ever run under the BQL, so cur_mon's still fine.

Having a current monitor is simpler than passing one around, especially
when you go through layers that don't know about it.  Such cases exist,
and they made us abandon commit 376253e's hope to eliminate cur_mon.
Unfortunately, I've since forgotten the details, so I can't point you to
an example.

>> > My reason for asking is that there are cases in migration and colo
>> > where a network block device has failed and is blocking, but it fails
>> > at just the wrong time while the lock is held, and now, you don't have
>> 
>> Is it okay to block while holding the BQL?
>
> I'd hope the simple answer was no; unfortunately the more complex answer
> is that we keep finding places that do.

Would you call these bugs?

Even if you do, we may want to recover from them.

>                                          The cases I'm aware of are
> mostly bdrv_drain_all calls in migration/colo, where one of the block
> devices (e.g. an NBD network device) fails.  Generally these are called
> at the end of a migration cycle when we're just ensuring that everything
> really is drained and are normally called with the lock held to ensure
> that nothing else is generating new traffic as we're clearing everything else.

Using the BQL for that drives a cork into the I/O pipeline with a Really
Big Hammer.  Can we do better?

The answer may be "yes, but don't hold your breath."  Then we may need
means to better cope with the bugs.

>> > any way to unblock it since you can't do anything on the monitors.
>> > If I could issue a command then I could have it end up doing a shutdown(2)
>> > on the network connection and free things up.
>> >
>> > Maybe this would also help real-time people?
>> >
>> > I was thinking then, we could:
>> >    a) Have a monitor that wasn't tied to the main loop at all - each instance
>> > would be its own thread and have its own poll() (probably using the new
>> > IO channels?)
>> >    b) for most commands it would take the big lock before executing the command
>> > and release it afterwards - so there would be no risk in those commands.
>> 
>> Saves you auditing their code, which would probably be impractical.
>> 
>> >    c) Some commands you could mark as 'lock free' - they would be required
>> > not to take any locks or wait for *anything* and for those the monitor would
>> > not take the lock.
>> 
>> They also must not access shared state without suitable synchronization.
>
> Right; I'd expect most of the cases to either be reading simple status information,
> or having to find whatever they need to do using rcu list walks.
>
>> >    d) Perhaps have some way of restricting a particular monitor session to only
>> > allowing non-blocking commands.
>> 
>> Why?  To ensure you always have an emergency monitor available?
>
> Yes, mostly to stop screwups of putting a potentially blocking command down your
> emergency route.

Understand.

>> >    e) Then I'd have to have a way of finding the broken device in a lock-free
>> > way (RCU list walk?) and issuing the shutdown(2).
>> >
>> > Bonus points to anyone who can statically check commands in (c).
>> 
>> Tall order.
>
> Yes :-)   A (compile-time selected) dynamic check might be possible
> where the normal global lock functions and blocking RCU functions check if they're
> in a monitor thread and assert.
>
>> > Does this make sense to everyone else, or does anyone have any better
>> > suggestions?
>> 
>> Sounds like a big, hairy task to me.
>> 
>> Any alternatives?
>
> I hope so; although isn't the idea of making a lock-free monitor a generally
> good one that we should do anyway?

There's no shortage of good ideas, but our bandwidth to implement,
review, document and test them has limits.

The general plan is to reduce the scope of the BQL gradually.
Liberating the monitor from the BQL is one step on the road to complete
BQL elimination.  The question is whether this step needs to be taken
*now*.


* Re: [Qemu-devel] lock-free monitor?
  2016-02-10 15:12     ` Markus Armbruster
@ 2016-02-10 15:33       ` Dr. David Alan Gilbert
  2016-02-11  8:33         ` Markus Armbruster
  0 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2016-02-10 15:33 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: xiecl.fnst, qemu-devel, stefanha

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >> 
> >> > Hi,
> >> >   I wondered what it would take to be able to do a lock-free monitor;
> >> > i.e. one that could respond to (some) commands while the qemu big lock is held.
> >> 
> >> Requires a careful audit of the monitor code.
> >> 
> >> For instance, cur_mon needs to be made thread local for running multiple
> >> monitors concurrently.
> >
> > Hmm that's fun;  and cur_mon is used all over the place when 'fd' passing
> > is used - although that probably should be cleaned up somehow.
> 
> When cur_mon got created, we had no threads.  Even now, monitors only
> ever run under the BQL, so cur_mon's still fine.
> 
> Having a current monitor is simpler than passing one around, especially
> when you go through layers that don't know about it.  Such cases exist,
> and they made us abandon commit 376253e's hope to eliminate cur_mon.
> Unfortunately, I've since forgotten the details, so I can't point you to
> an example.

Yes, I think if we try and remove the uses of cur_mon outside of the
monitor one at a time, it might move things in the right direction.

> >> > My reason for asking is that there are cases in migration and colo
> >> > where a network block device has failed and is blocking, but it fails
> >> > at just the wrong time while the lock is held, and now, you don't have
> >> 
> >> Is it okay to block while holding the BQL?
> >
> > I'd hope the simple answer was no; unfortunately the more complex answer
> > is that we keep finding places that do.
> 
> Would you call these bugs?
> 
> Even if you do, we may want to recover from them.

They're a bug in the end result that needs fixing; so for example a 
crashing NBD server shouldn't lock you out of the monitor during a migrate,
and I don't think we currently have other solutions.

> >                                          The cases I'm aware of are
> > mostly bdrv_drain_all calls in migration/colo, where one of the block
> > devices (e.g. an NBD network device) fails.  Generally these are called
> > at the end of a migration cycle when we're just ensuring that everything
> > really is drained and are normally called with the lock held to ensure
> > that nothing else is generating new traffic as we're clearing everything else.
> 
> Using the BQL for that drives a cork into the I/O pipeline with a Really
> Big Hammer.  Can we do better?
> 
> The answer may be "yes, but don't hold your breath."  Then we may need
> means to better cope with the bugs.

Yeh, there are some places where the migration code holds the BQL and
I don't really understand all the things it's guarding against.

> >> > any way to unblock it since you can't do anything on the monitors.
> >> > If I could issue a command then I could have it end up doing a shutdown(2)
> >> > on the network connection and free things up.
> >> >
> >> > Maybe this would also help real-time people?
> >> >
> >> > I was thinking then, we could:
> >> >    a) Have a monitor that wasn't tied to the main loop at all - each instance
> >> > would be its own thread and have its own poll() (probably using the new
> >> > IO channels?)
> >> >    b) for most commands it would take the big lock before executing the command
> >> > and release it afterwards - so there would be no risk in those commands.
> >> 
> >> Saves you auditing their code, which would probably be impractical.
> >> 
> >> >    c) Some commands you could mark as 'lock free' - they would be required
> >> > not to take any locks or wait for *anything* and for those the monitor would
> >> > not take the lock.
> >> 
> >> They also must not access shared state without suitable synchronization.
> >
> > Right; I'd expect most of the cases to either be reading simple status information,
> > or having to find whatever they need to do using rcu list walks.
> >
> >> >    d) Perhaps have some way of restricting a particular monitor session to only
> >> > allowing non-blocking commands.
> >> 
> >> Why?  To ensure you always have an emergency monitor available?
> >
> > Yes, mostly to stop screwups of putting a potentially blocking command down your
> > emergency route.
> 
> Understand.
> 
> >> >    e) Then I'd have to have a way of finding the broken device in a lock-free
> >> > way (RCU list walk?) and issuing the shutdown(2).
> >> >
> >> > Bonus points to anyone who can statically check commands in (c).
> >> 
> >> Tall order.
> >
> > Yes :-)   A (compile-time selected) dynamic check might be possible
> > where the normal global lock functions and blocking RCU functions check if they're
> > in a monitor thread and assert.
> >
> >> > Does this make sense to everyone else, or does anyone have any better
> >> > suggestions?
> >> 
> >> Sounds like a big, hairy task to me.
> >> 
> >> Any alternatives?
> >
> > I hope so; although isn't the idea of making a lock-free monitor a generally
> > good one that we should do anyway?
> 
> There's no shortage of good ideas, but our bandwidth to implement,
> review, document and test them has limits.
> 
> The general plan is to reduce the scope of the BQL gradually.
> Liberating the monitor from the BQL is one step on the road to complete
> BQL elimination.  The question is whether this step needs to be taken
> *now*.

COLO probably needs something; although in its case I'm going to investigate
if some evil with iptables might be able to force a recovery by causing the socket
to close (after all we're assuming that case involves the destination being dead - but
I don't want to wait for a full TCP timeout).

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] lock-free monitor?
  2016-02-10 15:33       ` Dr. David Alan Gilbert
@ 2016-02-11  8:33         ` Markus Armbruster
  0 siblings, 0 replies; 12+ messages in thread
From: Markus Armbruster @ 2016-02-11  8:33 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: xiecl.fnst, qemu-devel, stefanha

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>> 
>> > * Markus Armbruster (armbru@redhat.com) wrote:
>> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>> >> 
>> >> > Hi,
>> >> >   I wondered what it would take to be able to do a lock-free monitor;
>> >> > i.e. one that could respond to (some) commands while the qemu big lock is held.
>> >> 
>> >> Requires a careful audit of the monitor code.
>> >> 
>> >> For instance, cur_mon needs to be made thread local for running multiple
>> >> monitors concurrently.
>> >
>> > Hmm that's fun;  and cur_mon is used all over the place when 'fd' passing
>> > is used - although that probably should be cleaned up somehow.
>> 
>> When cur_mon got created, we had no threads.  Even now, monitors only
>> ever run under the BQL, so cur_mon's still fine.
>> 
>> Having a current monitor is simpler than passing one around, especially
>> when you go through layers that don't know about it.  Such cases exist,
>> and they made us abandon commit 376253e's hope to eliminate cur_mon.
>> Unfortunately, I've since forgotten the details, so I can't point you to
>> an example.
>
> Yes, I think if we try and remove the uses of cur_mon outside of the
> monitor one at a time, it might move things in the right direction.

For historical reasons, some interfaces assume the current monitor,
while others take Monitor * parameters.  The typical cur_mon use outside
the monitor is such an argument.

Whether such a Monitor * interface actually works when passed something
other than cur_mon is anybody's guess.  Moving monitors out of the BQL
is unlikely to increase my confidence.

If a function can safely work on non-current monitors, a Monitor *
parameter is fine.  If not, it shouldn't take one, and just use cur_mon.
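
For instance, the two styles look roughly like this today (prototypes
from memory, so treat the details with suspicion):

  /* explicit monitor - may work on non-current monitors: */
  void monitor_printf(Monitor *mon, const char *fmt, ...);

  /* implicit - output ends up on cur_mon when a monitor is current: */
  void error_report(const char *fmt, ...);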

>> >> > My reason for asking is that there are cases in migration and colo
>> >> > where a network block device has failed and is blocking, but it fails
>> >> > at just the wrong time while the lock is held, and now, you don't have
>> >> 
>> >> Is it okay to block while holding the BQL?
>> >
>> > I'd hope the simple answer was no; unfortunately the more complex answer
>> > is that we keep finding places that do.
>> 
>> Would you call these bugs?
>> 
>> Even if you do, we may want to recover from them.
>
> They're a bug in the end result that needs fixing; so for example a 
> crashing NBD server shouldn't lock you out of the monitor during a migrate,
> and I don't think we currently have other solutions.

It shouldn't lock you out of anything.  The monitor is just one such
thing.  It's special only because it can let the user (human or
management application) recover from certain bugs with monitor commands.

I'd see enabling that as a partial work-around for a class of bugs that
is hard to fix.  Work-around is better than nothing.

However, it can reduce the incentive to actually fix the bug.  No reason
to block work-arounds, but I'd like to see an earnest commitment to
actual bug fixing.

>> >                                          The cases I'm aware of are
>> > mostly bdrv_drain_all calls in migration/colo, where one of the block
>> > devices (e.g. an NBD network device) fails.  Generally these are called
>> > at the end of a migration cycle when we're just ensuring that everything
>> > really is drained and are normally called with the lock held to ensure
>> > that nothing else is generating new traffic as we're clearing everything else.
>> 
>> Using the BQL for that drives a cork into the I/O pipeline with a Really
>> Big Hammer.  Can we do better?
>> 
>> The answer may be "yes, but don't hold your breath."  Then we may need
>> means to better cope with the bugs.
>
> Yeh, there are some places where the migration code holds the BQL and
> I don't really understand all the things it's guarding against.

I guess actual bug fixing would start with figuring that out.

>> >> > any way to unblock it since you can't do anything on the monitors.
>> >> > If I could issue a command then I could have it end up doing a shutdown(2)
>> >> > on the network connection and free things up.
>> >> >
>> >> > Maybe this would also help real-time people?
>> >> >
>> >> > I was thinking then, we could:
>> >> >    a) Have a monitor that wasn't tied to the main loop at all - each instance
>> >> > would be its own thread and have its own poll() (probably using the new
>> >> > IO channels?)
>> >> >    b) for most commands it would take the big lock before executing the command
>> >> > and release it afterwards - so there would be no risk in those commands.
>> >> 
>> >> Saves you auditing their code, which would probably be impractical.
>> >> 
>> >> >    c) Some commands you could mark as 'lock free' - they would be required
>> >> > not to take any locks or wait for *anything* and for those the monitor would
>> >> > not take the lock.
>> >> 
>> >> They also must not access shared state without suitable synchronization.
>> >
>> > Right; I'd expect most of the cases to either be reading simple status information,
>> > or having to find whatever they need to do using rcu list walks.
>> >
>> >> >    d) Perhaps have some way of restricting a particular monitor session to only
>> >> > allowing non-blocking commands.
>> >> 
>> >> Why?  To ensure you always have an emergency monitor available?
>> >
>> > Yes, mostly to stop screwups of putting a potentially blocking command down your
>> > emergency route.
>> 
>> Understand.
>> 
>> >> >    e) Then I'd have to have a way of finding the broken device in a lock-free
>> >> > way (RCU list walk?) and issuing the shutdown(2).
>> >> >
>> >> > Bonus points to anyone who can statically check commands in (c).
>> >> 
>> >> Tall order.
>> >
>> > Yes :-)   A (compile-time selected) dynamic check might be possible
>> > where the normal global lock functions and blocking RCU functions check if they're
>> > in a monitor thread and assert.
>> >
>> >> > Does this make sense to everyone else, or does anyone have any better
>> >> > suggestions?
>> >> 
>> >> Sounds like a big, hairy task to me.
>> >> 
>> >> Any alternatives?
>> >
>> > I hope so; although isn't the idea of making a lock-free monitor a generally
>> > good one that we should do anyway?
>> 
>> There's no shortage of good ideas, but our bandwidth to implement,
>> review, document and test them has limits.
>> 
>> The general plan is to reduce the scope of the BQL gradually.
>> Liberating the monitor from the BQL is one step on the road to complete
>> BQL elimination.  The question is whether this step needs to be taken
>> *now*.
>
> COLO probably needs something; although in its case I'm going to investigate
> if some evil with iptables might be able to force a recovery by causing the socket
> to close (after all we're assuming that case involves the destination being dead - but
> I don't want to wait for a full TCP timeout).

As always when feature X is considered sound, but there's doubt whether
it's worth doing *now*: somebody wanting it badly enough to implement it
is pretty compelling evidence it's worth doing for him.  Especially when
said somebody has plenty of other important things to do in QEMU.


* Re: [Qemu-devel] lock-free monitor?
  2016-02-09 13:47 ` Stefan Hajnoczi
  2016-02-09 13:52   ` Dr. David Alan Gilbert
@ 2016-02-14  6:22   ` Fam Zheng
  2016-02-15 13:42     ` Stefan Hajnoczi
  2016-02-15 12:59   ` Kevin Wolf
  2 siblings, 1 reply; 12+ messages in thread
From: Fam Zheng @ 2016-02-14  6:22 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: armbru, xiecl.fnst, Dr. David Alan Gilbert, qemu-devel

On Tue, 02/09 13:47, Stefan Hajnoczi wrote:
> On Mon, Feb 08, 2016 at 03:17:23PM +0000, Dr. David Alan Gilbert wrote:
> > Does this make sense to everyone else, or does anyone have any better
> > suggestions?
> 
> As a concrete example, any monitor command that calls bdrv_drain_all()
> can hang forever with the QEMU global mutex held if I/O requests are
> stuck (e.g. NFS mount is unreachable).
> 
> bdrv_aio_cancel() can also hang but is mostly exposed to device
> emulation, not the monitor.
> 
> One solution for these block layer functions is to add a timeout
> argument and let them return an error.  This way the monitor and device
> emulation do not hang forever.

Yes, there are a few places in the block layer invoking aio_poll() in a loop
waiting for certain events, and a disconnected network link could make QEMU
hang. In these cases a timeout is a huge improvement.  Maybe we can mark the
BDS as "hanging" (-EIO is returned for all further requests) and let
bdrv_drain_all() return.

> 
> The benefit of the timeout is that both monitor and device emulation
> hangs are tackled.  It also doesn't require monitor changes.
> 
> I'm not sure who chooses the timeout value and which value makes sense
> (policy vs mechanism separation)...

Default to 30 seconds like Linux, and make it tunable through command line
options as well as QMP?

Fam


* Re: [Qemu-devel] lock-free monitor?
  2016-02-09 13:47 ` Stefan Hajnoczi
  2016-02-09 13:52   ` Dr. David Alan Gilbert
  2016-02-14  6:22   ` Fam Zheng
@ 2016-02-15 12:59   ` Kevin Wolf
  2 siblings, 0 replies; 12+ messages in thread
From: Kevin Wolf @ 2016-02-15 12:59 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: armbru, xiecl.fnst, Dr. David Alan Gilbert, qemu-devel


On 09.02.2016 at 14:47, Stefan Hajnoczi wrote:
> On Mon, Feb 08, 2016 at 03:17:23PM +0000, Dr. David Alan Gilbert wrote:
> > Does this make sense to everyone else, or does anyone have any better
> > suggestions?
> 
> As a concrete example, any monitor command that calls bdrv_drain_all()
> can hang forever with the QEMU global mutex held if I/O requests are
> stuck (e.g. NFS mount is unreachable).
> 
> bdrv_aio_cancel() can also hang but is mostly exposed to device
> emulation, not the monitor.
> 
> One solution for these block layer functions is to add a timeout
> argument and let them return an error.  This way the monitor and device
> emulation do not hang forever.
> 
> The benefit of the timeout is that both monitor and device emulation
> hangs are tackled.  It also doesn't require monitor changes.
> 
> I'm not sure who chooses the timeout value and which value makes sense
> (policy vs mechanism separation)...

You would still have a monitor hang for half a minute.

And anyway, the only reasonable action to take on a timeout is to make the
image completely unusable, with no option to resume operation on it
without restarting the VM, because we don't know what state the image is
in.

That's a pretty large impact for something that qemu does automatically.
It would be much nicer if the monitor kept working all the time and the
management would actively tell qemu to give up on a given image.

Kevin



* Re: [Qemu-devel] lock-free monitor?
  2016-02-14  6:22   ` Fam Zheng
@ 2016-02-15 13:42     ` Stefan Hajnoczi
  2016-02-15 14:19       ` Markus Armbruster
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2016-02-15 13:42 UTC (permalink / raw)
  To: Fam Zheng; +Cc: armbru, xiecl.fnst, Dr. David Alan Gilbert, qemu-devel


On Sun, Feb 14, 2016 at 02:22:10PM +0800, Fam Zheng wrote:
> On Tue, 02/09 13:47, Stefan Hajnoczi wrote:
> > On Mon, Feb 08, 2016 at 03:17:23PM +0000, Dr. David Alan Gilbert wrote:
> > > Does this make sense to everyone else, or does anyone have any better
> > > suggestions?
> > 
> > As a concrete example, any monitor command that calls bdrv_drain_all()
> > can hang forever with the QEMU global mutex held if I/O requests are
> > stuck (e.g. NFS mount is unreachable).
> > 
> > bdrv_aio_cancel() can also hang but is mostly exposed to device
> > emulation, not the monitor.
> > 
> > One solution for these block layer functions is to add a timeout
> > argument and let them return an error.  This way the monitor and device
> > emulation do not hang forever.
> 
> Yes, there are a few places in the block layer invoking aio_poll() in a loop
> waiting for certain events, and a disconnected network link could make QEMU
> hang. In these cases a timeout is a huge improvement.  Maybe we can mark the
> BDS as "hanging" (-EIO is returned for all further requests) and let
> bdrv_drain_all() return.

No, we need to be very careful about hung requests that are still in
flight.  They could complete after a long time and modify the disk or
guest RAM.

The only thing bdrv_drain_all() can return is -ETIME.  Beyond that it
cannot pretend that requests have been drained since that could lead to
data corruption/loss.

The caller has to abort whatever it was trying to do since it has no
guarantee that requests have drained.  Maybe the remaining requests will
fail and go away, maybe they will complete.  We don't know...

> > 
> > The benefit of the timeout is that both monitor and device emulation
> > hangs are tackled.  It also doesn't require monitor changes.
> > 
> > I'm not sure who chooses the timeout value and which value makes sense
> > (policy vs mechanism separation)...
> 
> Default to 30 seconds like Linux, and make it tunable through command line
> options as well as QMP?

Yes.  Either commands that block can take an optional timeout (seconds)
parameter, or we could have a global disk I/O timeout value.  I prefer
passing an optional per-command value.

Libvirt and other sophisticated clients could begin using it in a
backwards-compatible way.  Existing code wouldn't use it and still
suffer from the hang problem (but at least QEMU is
backwards-compatible).

Stefan



* Re: [Qemu-devel] lock-free monitor?
  2016-02-15 13:42     ` Stefan Hajnoczi
@ 2016-02-15 14:19       ` Markus Armbruster
  0 siblings, 0 replies; 12+ messages in thread
From: Markus Armbruster @ 2016-02-15 14:19 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: xiecl.fnst, Fam Zheng, Dr. David Alan Gilbert, qemu-devel

Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Sun, Feb 14, 2016 at 02:22:10PM +0800, Fam Zheng wrote:
>> On Tue, 02/09 13:47, Stefan Hajnoczi wrote:
>> > On Mon, Feb 08, 2016 at 03:17:23PM +0000, Dr. David Alan Gilbert wrote:
>> > > Does this make sense to everyone else, or does anyone have any better
>> > > suggestions?
>> > 
>> > As a concrete example, any monitor command that calls bdrv_drain_all()
>> > can hang forever with the QEMU global mutex held if I/O requests are
>> > stuck (e.g. NFS mount is unreachable).
>> > 
>> > bdrv_aio_cancel() can also hang but is mostly exposed to device
>> > emulation, not the monitor.
>> > 
>> > One solution for these block layer functions is to add a timeout
>> > argument and let them return an error.  This way the monitor and device
>> > emulation do not hang forever.
>> 
>> Yes, there are a few places in the block layer invoking aio_poll() in a loop
>> waiting for certain events, and a disconnected network link could make QEMU
>> hang. In these cases a timeout is a huge improvement.  Maybe we can mark the
>> BDS as "hanging" (-EIO is returned for all further requests) and let
>> bdrv_drain_all() return.
>
> No, we need to be very careful about hung requests that are still in
> flight.  They could complete after a long time and modify the disk or
> guest RAM.
>
> The only thing bdrv_drain_all() can return is -ETIME.  Beyond that it
> cannot pretend that requests have been drained since that could lead to
> data corruption/loss.
>
> The caller has to abort whatever it was trying to do since it has no
> guarantee that requests have drained.  Maybe the remaining requests will
> fail and go away, maybe they will complete.  We don't know...
>
>> > 
>> > The benefit of the timeout is that both monitor and device emulation
>> > hangs are tackled.  It also doesn't require monitor changes.
>> > 
>> > I'm not sure who chooses the timeout value and which value makes sense
>> > (policy vs mechanism separation)...
>> 
>> Default to 30 seconds like Linux, and make it tunable through command line
>> options as well as QMP?
>
> Yes.  Either commands that block can take an optional timeout (seconds)
> parameter, or we could have a global disk I/O timeout value.  I prefer
> passing an optional per-command value.
>
> Libvirt and other sophisticated clients could begin using it in a
> backwards-compatible way.  Existing code wouldn't use it and still
> suffer from the hang problem (but at least QEMU is
> backwards-compatible).

I'm not arguing against timeouts, I just want to remind everyone that
blocking while holding the BQL is a bug, and limiting the block time
with timeouts is a work-around, no more, no less.

