* [Qemu-devel] ping RE: question: I found a qemu crash about migration
@ 2017-09-28  7:38 wangjie (P)
  2017-09-28 17:01 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread
From: wangjie (P) @ 2017-09-28  7:38 UTC (permalink / raw)
  To: qemu-devel, kwolf, pbonzini
  Cc: fuweiwei (C), eblake, kchamart, dgilbert, famz

[-- Attachment #1: Type: text/plain, Size: 3032 bytes --]

Ping?

From: wangjie (P)
Sent: Tuesday, September 26, 2017 9:10 PM
To: qemu-devel@nongnu.org; kwolf@redhat.com; pbonzini@redhat.com
Cc: wangjie (P) <wangjie88@huawei.com>; fuweiwei (C) <fuweiwei2@huawei.com>; eblake@redhat.com; kchamart@redhat.com; dgilbert@redhat.com; famz@redhat.com; Wubin (H) <wu.wubin@huawei.com>
Subject: question: I found a qemu crash about migration

Hi,

When I use qemuMigrationRun to migrate both memory and storage, with some I/O pressure in the VM and iothreads configured, we trigger the following assertion failure (I am using the current qemu master branch):
"bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed"

I reviewed the code and examined the core dump with gdb; I think the following case can trigger the assertion:

Case:

migration_thread()
      migration_completion() ----------> last iteration of memory migration
            vm_stop_force_state() --------------> stops the VM and calls bdrv_drain_all(); but in the core file I found that the dirty bitmap count of the drive-mirror job is not 0 and 16 mirror I/O requests are still in flight
                  bdrv_inactivate_all() ----------------> inactivates the images and sets the INACTIVE flag
      -> bdrv_co_do_pwritev() --------------> a mirror I/O request handled afterwards then trips the assertion `!(bs->open_flags & 0x0800)' and qemu crashes




As we can see from the above, migration_completion() calls bdrv_inactivate_all() to inactivate the images, but mirror_run() is not finished yet (it still has dirty clusters), so the mirror I/O issued afterwards triggers the assertion "bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed".

It seems that memory migration and storage mirroring complete independently, and the ordering of the two is essentially random.

How can I solve this problem? Should we not set the INACTIVE flag on the drive-mirror BlockDriverState?
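
For reference, 0x0800 is the BDRV_O_INACTIVE open flag, and the failing check in bdrv_co_pwritev() (block/io.c) boils down to something like the following. This is only a minimal, self-contained sketch to illustrate the check, not the actual QEMU code:

#include <assert.h>

#define BDRV_O_INACTIVE 0x0800   /* node has been handed over for migration */

typedef struct BlockDriverState {
    unsigned int open_flags;
} BlockDriverState;

/* Stand-in for the guard near the top of bdrv_co_pwritev() */
static void pwritev_check(BlockDriverState *bs)
{
    /* A mirror write issued after bdrv_inactivate_all() has set
     * BDRV_O_INACTIVE on the node trips this assertion and aborts. */
    assert(!(bs->open_flags & BDRV_O_INACTIVE));
}

int main(void)
{
    BlockDriverState bs = { .open_flags = BDRV_O_INACTIVE };
    pwritev_check(&bs);   /* aborts here, matching the reported backtrace */
    return 0;
}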

Qemu Crash bt:
(gdb) bt
#0  0x00007f6b6e2a71d7 in raise () from /usr/lib64/libc.so.6
#1  0x00007f6b6e2a88c8 in abort () from /usr/lib64/libc.so.6
#2  0x00007f6b6e2a0146 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00007f6b6e2a01f2 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00000000007b9211 in bdrv_co_pwritev (child=<optimized out>, offset=offset@entry=7034896384, bytes=bytes@entry=65536,
    qiov=qiov@entry=0x7f69cc09b068, flags=0) at block/io.c:1536
#5  0x00000000007a6f02 in blk_co_pwritev (blk=0x2f92750, offset=7034896384, bytes=65536, qiov=0x7f69cc09b068,
    flags=<optimized out>) at block/block_backend.c:851
#6  0x00000000007a6fc1 in blk_aio_write_entry (opaque=0x301dad0) at block/block_backend.c:1043
#7  0x0000000000835e2a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine_ucontext.c:79
#8  0x00007f6b6e2b8cf0 in ?? () from /usr/lib64/libc.so.6
#9  0x00007f6a1bcfc780 in ?? ()
#10 0x0000000000000000 in ?? ()

And I can see that mirror_run() is not finished; gdb info as follows:
[cid:image001.png@01D3386F.DBC9FF10]


Src VM qemu log:

[cid:image002.png@01D3386F.DBC9FF10]

[-- Attachment #2: image001.png --]
[-- Type: image/png, Size: 61384 bytes --]

[-- Attachment #3: image002.png --]
[-- Type: image/png, Size: 153960 bytes --]


* Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
  2017-09-28  7:38 [Qemu-devel] ping RE: question: I found a qemu crash about migration wangjie (P)
@ 2017-09-28 17:01 ` Dr. David Alan Gilbert
  2017-09-29  9:35   ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-28 17:01 UTC (permalink / raw)
  To: wangjie (P)
  Cc: qemu-devel, kwolf, mreitz, quintela, pbonzini, fuweiwei (C),
	eblake, berrange@redhat.com, kchamart@redhat.com, famz

* wangjie (P) (wangjie88@huawei.com) wrote:
> Ping?
> 
> From: wangjie (P)
> Sent: Tuesday, September 26, 2017 9:10 PM
> To: qemu-devel@nongnu.org; kwolf@redhat.com; pbonzini@redhat.com
> Cc: wangjie (P) <wangjie88@huawei.com>; fuweiwei (C) <fuweiwei2@huawei.com>; eblake@redhat.com; kchamart@redhat.com; dgilbert@redhat.com; famz@redhat.com; Wubin (H) <wu.wubin@huawei.com>
> Subject: question: I found a qemu crash about migration
> 
> Hi,
> 
> When I use qemuMigrationRun to migrate both memory and storage with some IO press in VM, and configured iothreads. We triggered a error reports:  (I use the current qemu master branch)
> " bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed",
> 
> I reviewed the code, and gdb the coredump file, I think one case can trigger the error reports
> 
> Case:
> 
> Migration_thread()
>       Migration_completion() ----------> last iteration of memory migration
>             Vm_stop_force_state()--------------> Stop the VM, and call bdrv_drain_all, but I gdb the core file, and found the cnt of dirty bitmap of driver-mirror is not 0, and in_flight mirror IO is 16,
>                   Bdrv_inactivate_all()----------------> inactivate images and set the INACTIVE label.
>       -> bdrv_co_do_pwritev()-------------->then the mirror IO handled after will trigger the Assertion `!(bs->open_flags & 0x0800)' and qemu crashed
> 
> 
> 
> 
> As we can see from above,  Migration_completion call Bdrv_inactivate_all to inactivate images, but the mirror_run is not done (still has dirty clusters), the mirror_run IO issued later will triggered error reports: " bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed",
> 
> It seems that memory migration and storage mirror is done independently and the sequence of the two progresses are quite random.
> 
> How can I solve this problem, should we not set  INACTIVE label for drive-mirror BlockDriverState?

Hi,
  This is a 'fun' bug;  I had a good chat to kwolf about it earlier.
A proper fix really needs to be done together with libvirt so that we
can sequence:
   a) The stopping of the CPU on the source
   b) The termination of the mirroring block job
   c) The inactivation of the block devices on the source
       (bdrv_inactivate_all)
   d) The activation of the block devices on the destination
       (bdrv_invalidate_cache_all)
   e) The start of the CPU on the destination

It looks like you're hitting a race between b/c;  we've had races
between c/d in the past and moved the bdrv_inactivate_all.

During the discussion we ended up with two proposed solutions;
both of them require one extra command and one extra migration
capability.

The block way
-------------
   1) Add a new migration capability pause-at-complete
   2) Add a new migration state almost-complete
   3) After saving devices, if pause-at-complete is set,
      transition to almost-complete
   4) Add a new command (migration-continue) that 
      causes the migration to inactivate the devices (c)
      and send the final EOF to the destination.

You set pause-at-complete, wait until migrate hits almost-complete;
cleanup the mirror job, and then do migration-continue.  When it
completes do 'cont' on the destination.

The migration way
-----------------
   1) Stop doing (d) when the destination is started with -S
      since it happens anyway when 'cont' is issued
   2) Add a new migration capability ext-manage-storage
   3) When 'ext-manage-storage' is set, we don't bother doing (c)
   4) Add a new command 'block-inactivate' on the source

You set ext-manage-storage, do the migrate and when it's finished
clean up the block job, block-inactivate on the source, and
then cont on the destination.
  

My worry about the 'block way' is that the point at which we
do the pause seems pretty interesting;  it probably is best
done after the final device save but before the inactivate,
but could be done before it.  But it probably becomes API
and something might become dependent on where we did it.

I think Kevin's worry about the 'migration way' is that
it's a bit of a block-specific fudge, which is probably right.


I've not really thought what happens when you have a mix of shared and
non-shared storage.

Could we do any hack that isn't libvirt-visible for existing versions?
I guess maybe hack drive-mirror so it interlocks with the migration
code somehow to hold off on that inactivate?

This code is visible probably from 2.9-ish with the new locking code;
but really that b/c race has been there forever - there's maybe
always the chance that the last few blocks of mirroring might have
happened too late?

Thoughts?
What is the libvirt view on the preferred solution?

Dave


> Qemu Crash bt:
> (gdb) bt
> #0  0x00007f6b6e2a71d7 in raise () from /usr/lib64/libc.so.6
> #1  0x00007f6b6e2a88c8 in abort () from /usr/lib64/libc.so.6
> #2  0x00007f6b6e2a0146 in __assert_fail_base () from /usr/lib64/libc.so.6
> #3  0x00007f6b6e2a01f2 in __assert_fail () from /usr/lib64/libc.so.6
> #4  0x00000000007b9211 in bdrv_co_pwritev (child=<optimized out>, offset=offset@entry=7034896384, bytes=bytes@entry=65536,
>     qiov=qiov@entry=0x7f69cc09b068, flags=0) at block/io.c:1536
> #5  0x00000000007a6f02 in blk_co_pwritev (blk=0x2f92750, offset=7034896384, bytes=65536, qiov=0x7f69cc09b068,
>     flags=<optimized out>) at block/block_backend.c:851
> #6  0x00000000007a6fc1 in blk_aio_write_entry (opaque=0x301dad0) at block/block_backend.c:1043
> #7  0x0000000000835e2a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine_ucontext.c:79
> #8  0x00007f6b6e2b8cf0 in ?? () from /usr/lib64/libc.so.6
> #9  0x00007f6a1bcfc780 in ?? ()
> #10 0x0000000000000000 in ?? ()
> 
> And I see the mirror_run is not done,  gdb info as following:
> [cid:image001.png@01D3386F.DBC9FF10]
> 
> 
> Src VM qemu log:
> 
> [cid:image002.png@01D3386F.DBC9FF10]
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
  2017-09-28 17:01 ` Dr. David Alan Gilbert
@ 2017-09-29  9:35   ` Kevin Wolf
  2017-09-29 19:06     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2017-09-29  9:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: wangjie (P), qemu-devel, mreitz, quintela, pbonzini, fuweiwei (C),
	eblake, berrange@redhat.com, kchamart@redhat.com, famz

On 28.09.2017 at 19:01, Dr. David Alan Gilbert wrote:
> Hi,
>   This is a 'fun' bug;  I had a good chat to kwolf about it earlier.
> A proper fix really needs to be done together with libvirt so that we
> can sequence:
>    a) The stopping of the CPU on the source
>    b) The termination of the mirroring block job
>    c) The inactivation of the block devices on the source
>        (bdrv_inactivate_all)
>    d) The activation of the block devices on the destination
>        (bdrv_invalidate_cache_all)
>    e) The start of the CPU on the destination
> 
> It looks like you're hitting a race between b/c;  we've had races
> between c/d in the past and moved the bdrv_inactivate_all.
> 
> During the discussion we ended up with two proposed solutions;
> both of them require one extra command and one extra migration
> capability.
> 
> The block way
> -------------
>    1) Add a new migration capability pause-at-complete
>    2) Add a new migration state almost-complete
>    3) After saving devices, if pause-at-complete is set,
>       transition to almost-complete
>    4) Add a new command (migration-continue) that 
>       causes the migration to inactivate the devices (c)
>       and send the final EOF to the destination.
> 
> You set pause-at-complete, wait until migrate hits almost-complete;
> cleanup the mirror job, and then do migration-continue.  When it
> completes do 'cont' on the destination.

Especially since you're concerned about the exact position of the pause,
the following would be a little safer (using the opportunity to suggest
slightly different names, too):

1) Add a new migration capability pause-at-complete
2) Add a new migration state ready
3) After migration has converged (i.e. the background phase of the
   migration has completed) and we are ready to actually switch to the
   destination, stop the VM, transition to 'ready' and emit a QMP event
4) Continue processing QMP commands in the 'ready' phase. This is where
   libvirt is supposed to tear down any running non-VM activities like
   block jobs, NBD servers, etc.
5) Add a new command (migration-complete) that does the final part of
   migration, including sending the remaining dirty memory, device state
   and the final EOF that signals the destination to resume the VM if -S
   isn't given.

The main reason why I advocate this "block way" is that it looks a lot
less hacky, but more future-proof to me. Instead of adding a one-off
hack for the problem at hand, it introduces a phase where libvirt can
cleanly tear down any activity it has still running on the source qemu.


We got away without doing this because we never did a clean handover of
resources from the source to the destination. From the perspective of a
management tool, we had these distinct phases during migration:

a) 'migrate' command:
   Source VM is still running and in control of all resources (like disk
   images), migrating some state in the background

b) Migration converges:
   Both VMs are stopped (assuming -S is given on the destination,
   otherwise this phase is skipped), the destination is in control of
   the resources

c) 'cont' command:
   Destination VM is running and in control of the resources

There is an asymmetry here because there is no phase where the VMs are
stopped, but the source is in control of the resources (this is the
additional state that my proposal introduces).

As long as we didn't really manage control of the resources with locks,
we could abuse b) for things where the source still accessed the
resources if we knew that the destination wouldn't use them yet (it's
not an assumption that I'm willing to trust too much) and it would still
kind of work in the common cases. But it's not really the correct thing
to do, it has probably always caused bugs in corner cases, and the
locking mechanisms are pointing that out now.

We're now noticing this with block devices because we're using locking,
but I wouldn't be surprised if we found sooner or later that there are
more resources that should really be handed over in a cleanly designed
way and that work only by accident as long as you don't do that. So I
don't think a local hack for mirroring (or even all block jobs) is
sufficient to be future-proof.

> The migration way
> -----------------
>    1) Stop doing (d) when the destination is started with -S
>       since it happens anyway when 'cont' is issued
>    2) Add a new migration capability ext-manage-storage
>    3) When 'ext-manage-storage' is set, we don't bother doing (c)
>    4) Add a new command 'block-inactivate' on the source
> 
> You set ext-manage-storage, do the migrate and when it's finished
> clean up the block job, block-inactivate on the source, and
> then cont on the destination.

You would have to inactivate every single block node, which is not
a thing libvirt can currently do (they don't even really manage
individual nodes yet). Apart from that I believe it is much too
low-level and users shouldn't have to worry about what implications
migration has for block devices. A clean migration phase model should be
much easier to understand.

Kevin


* Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
  2017-09-29  9:35   ` Kevin Wolf
@ 2017-09-29 19:06     ` Dr. David Alan Gilbert
  2017-09-29 20:44       ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-29 19:06 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: wangjie (P), qemu-devel, mreitz, quintela, pbonzini, fuweiwei (C),
	eblake, berrange@redhat.com, kchamart@redhat.com, famz

* Kevin Wolf (kwolf@redhat.com) wrote:
> Am 28.09.2017 um 19:01 hat Dr. David Alan Gilbert geschrieben:
> > Hi,
> >   This is a 'fun' bug;  I had a good chat to kwolf about it earlier.
> > A proper fix really needs to be done together with libvirt so that we
> > can sequence:
> >    a) The stopping of the CPU on the source
> >    b) The termination of the mirroring block job
> >    c) The inactivation of the block devices on the source
> >        (bdrv_inactivate_all)
> >    d) The activation of the block devices on the destination
> >        (bdrv_invalidate_cache_all)
> >    e) The start of the CPU on the destination
> > 
> > It looks like you're hitting a race between b/c;  we've had races
> > between c/d in the past and moved the bdrv_inactivate_all.
> > 
> > During the discussion we ended up with two proposed solutions;
> > both of them require one extra command and one extra migration
> > capability.
> > 
> > The block way
> > -------------
> >    1) Add a new migration capability pause-at-complete
> >    2) Add a new migration state almost-complete
> >    3) After saving devices, if pause-at-complete is set,
> >       transition to almost-complete
> >    4) Add a new command (migration-continue) that 
> >       causes the migration to inactivate the devices (c)
> >       and send the final EOF to the destination.
> > 
> > You set pause-at-complete, wait until migrate hits almost-complete;
> > cleanup the mirror job, and then do migration-continue.  When it
> > completes do 'cont' on the destination.
> 
> Especially since you're concerned about the exact position of the pause,
> the following would be a little safer (using the opportunity to suggest
> slightly different names,too):
> 
> 1) Add a new migration capability pause-at-complete
> 2) Add a new migration state ready
> 3) After migration has converged (i.e. the background phase of the
>    migration has completed) and we are ready to actually switch to the
>    destination, stop the VM, transition to 'ready' and emit a QMP event
> 4) Continue processing QMP commands in the 'ready' phase. This is where
>    libvirt is supposed to tear down any running non-VM activities like
>    block jobs, NBD servers, etc.
> 5) Add a new command (migration-complete) that does the final part of
>    migration, including sending the remaining dirty memory, device state
>    and the final EOF that signals the destination to resume the VM if -S
>    isn't given.

What worries me here is whether some other subsystem is going to say
that this pause must happen after the device state save, e.g. if the
device state save causes them to push out the last block/packet/etc -
and I've got no real way to say whether that point is any better or
worse than the point I was suggesting.
And the position in the cycle when the pause happens is part of the API.

> The main reason why I advocate this "block way" is that it looks a lot
> less hacky, but more future-proof to me. Instead of adding a one-off
> hack for the problem at hand, it introduces a phase where libvirt can
> cleanly tear down any activity it has still running on the source qemu.

Yes, if we get that phase right.

> We got away without doing this because we never did a clean handover of
> resources from the source to the destination. From the perspective of a
> management tool, we had these distinct phases during migration:
> 
> a) 'migrate' command:
>    Source VM is still running and in control of all resources (like disk
>    images), migrating some state in the background
> 
> b) Migration converges:
>    Both VMs are stopped (assuming -S is given on the destination,
>    otherwise this phase is skipped), the destination is in control of
>    the resources
> 
> c) 'cont' command:
>    Destination VM is running and in control of the resources
> 
> There is an asymmetry here because there is no phase where the VMs are
> stopped, but the source is in control of the resources (this is the
> additional state that my proposal introduces).
> 
> As long as we didn't really manage control of the resources with locks,
> we could abuse b) for things where the source still accessed the
> resources if we knew that the destination wouldn't use them yet (it's
> not an assumption that I'm willing to trust too much) and it would still
> kind of work in the common cases. But it's not really the correct thing
> to do, it has probably always caused bugs in corner cases, and the
> locking mechanisms are pointing that out now.
> 
> We're now noticing this with block device because we're using locking,
> but I wouldn't be surprised if we found sooner or later that there are
> more resources that should really be handed over in a cleanly designed
> way and that work only by accident as long as you don't do that. So I
> don't think a local hack for mirroring (or even all block jobs) is
> sufficient to be future-proof.

Without the block-job we were actually pretty safe - up until the EOF
we knew the destination wouldn't start, and similarly we wouldn't start
using a device on the destination before the source had saved that
device.

My worry is where the edge of 'the source is in control of the device' is -
is that up to the point at which the device save ends? Before that?
After that? What?  It seems to be pretty arbitrary.  Unless we can
understand why it's where it is, I'm not sure if it's a 'clean
migration phase model'.


> > The migration way
> > -----------------
> >    1) Stop doing (d) when the destination is started with -S
> >       since it happens anyway when 'cont' is issued
> >    2) Add a new migration capability ext-manage-storage
> >    3) When 'ext-manage-storage' is set, we don't bother doing (c)
> >    4) Add a new command 'block-inactivate' on the source
> > 
> > You set ext-manage-storage, do the migrate and when it's finished
> > clean up the block job, block-inactivate on the source, and
> > then cont on the destination.
> 
> You would have to inactivate every single block node, which is not
> a thing libvirt can currently do (they don't even really manage
> individual nodes yet). Apart from that I believe it is much too
> low-level and users shouldn't have to worry about what implications
> migration has to block devices. A clean migration phase model should be
> much easier to understand.

Yep, I see that.
(Having said that, this extra round-trip with management is adding extra
downtime, and we'll have to work out things like what happens if you
cancel or fail in that time).

Dave
> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
  2017-09-29 19:06     ` Dr. David Alan Gilbert
@ 2017-09-29 20:44       ` Kevin Wolf
  2017-10-09 11:55         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2017-09-29 20:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: wangjie (P), qemu-devel, mreitz, quintela, pbonzini, fuweiwei (C),
	eblake, berrange@redhat.com, kchamart@redhat.com, famz

On 29.09.2017 at 21:06, Dr. David Alan Gilbert wrote:
> * Kevin Wolf (kwolf@redhat.com) wrote:
> > Am 28.09.2017 um 19:01 hat Dr. David Alan Gilbert geschrieben:
> > > Hi,
> > >   This is a 'fun' bug;  I had a good chat to kwolf about it earlier.
> > > A proper fix really needs to be done together with libvirt so that we
> > > can sequence:
> > >    a) The stopping of the CPU on the source
> > >    b) The termination of the mirroring block job
> > >    c) The inactivation of the block devices on the source
> > >        (bdrv_inactivate_all)
> > >    d) The activation of the block devices on the destination
> > >        (bdrv_invalidate_cache_all)
> > >    e) The start of the CPU on the destination
> > > 
> > > It looks like you're hitting a race between b/c;  we've had races
> > > between c/d in the past and moved the bdrv_inactivate_all.
> > > 
> > > During the discussion we ended up with two proposed solutions;
> > > both of them require one extra command and one extra migration
> > > capability.
> > > 
> > > The block way
> > > -------------
> > >    1) Add a new migration capability pause-at-complete
> > >    2) Add a new migration state almost-complete
> > >    3) After saving devices, if pause-at-complete is set,
> > >       transition to almost-complete
> > >    4) Add a new command (migration-continue) that 
> > >       causes the migration to inactivate the devices (c)
> > >       and send the final EOF to the destination.
> > > 
> > > You set pause-at-complete, wait until migrate hits almost-complete;
> > > cleanup the mirror job, and then do migration-continue.  When it
> > > completes do 'cont' on the destination.
> > 
> > Especially since you're concerned about the exact position of the pause,
> > the following would be a little safer (using the opportunity to suggest
> > slightly different names,too):
> > 
> > 1) Add a new migration capability pause-at-complete
> > 2) Add a new migration state ready
> > 3) After migration has converged (i.e. the background phase of the
> >    migration has completed) and we are ready to actually switch to the
> >    destination, stop the VM, transition to 'ready' and emit a QMP event
> > 4) Continue processing QMP commands in the 'ready' phase. This is where
> >    libvirt is supposed to tear down any running non-VM activities like
> >    block jobs, NBD servers, etc.
> > 5) Add a new command (migration-complete) that does the final part of
> >    migration, including sending the remaining dirty memory, device state
> >    and the final EOF that signals the destination to resume the VM if -S
> >    isn't given.
> 
> What worries me here is whether some other subsection is going to say
> that this pause must happen after the device state save, e.g. if the
> device state save causes them to push out the last block/packet/etc -
> and I've got no real way to say whether that point is any better or
> worse than the point I was suggestion.
> And the position in the cycle when the pause happens is part of the API.

The important point here is that the VM is stopped. If device emulation
is doing anything by itself while the VM is stopped, that is a bug.
In other words, that last block/packet/etc. must already have been
pushed out when we stopped the VM, so there is nothing left to push out
when we save the state.

With correctly written device code, there is no difference whether we
save the device state first or last, without external influence it will
be the same.

The obvious external thing that I can see is monitor commands, e.g. the
monitor allows access to I/O ports, which can change the device state.
However, monitor commands are completely under the control of the user,
so they can always choose the order that they need.

In any case, saving the device state only immediately before doing the
switch makes sure that we consider any changes that have still been
made.

> > The main reason why I advocate this "block way" is that it looks a lot
> > less hacky, but more future-proof to me. Instead of adding a one-off
> > hack for the problem at hand, it introduces a phase where libvirt can
> > cleanly tear down any activity it has still running on the source qemu.
> 
> Yes, if we get that phase right.
> 
> > We got away without doing this because we never did a clean handover of
> > resources from the source to the destination. From the perspective of a
> > management tool, we had these distinct phases during migration:
> > 
> > a) 'migrate' command:
> >    Source VM is still running and in control of all resources (like disk
> >    images), migrating some state in the background
> > 
> > b) Migration converges:
> >    Both VMs are stopped (assuming -S is given on the destination,
> >    otherwise this phase is skipped), the destination is in control of
> >    the resources
> > 
> > c) 'cont' command:
> >    Destination VM is running and in control of the resources
> > 
> > There is an asymmetry here because there is no phase where the VMs are
> > stopped, but the source is in control of the resources (this is the
> > additional state that my proposal introduces).
> > 
> > As long as we didn't really manage control of the resources with locks,
> > we could abuse b) for things where the source still accessed the
> > resources if we knew that the destination wouldn't use them yet (it's
> > not an assumption that I'm willing to trust too much) and it would still
> > kind of work in the common cases. But it's not really the correct thing
> > to do, it has probably always caused bugs in corner cases, and the
> > locking mechanisms are pointing that out now.
> > 
> > We're now noticing this with block device because we're using locking,
> > but I wouldn't be surprised if we found sooner or later that there are
> > more resources that should really be handed over in a cleanly designed
> > way and that work only by accident as long as you don't do that. So I
> > don't think a local hack for mirroring (or even all block jobs) is
> > sufficient to be future-proof.
> 
> Without the block-job we were actually pretty safe - up until the EOF
> we knew the destination wouldn't start, and similarly we wouldn't start
> using a device on the destination before the source had saved that
> device.

s/Without the block-job/With only the guest device/

As long as only a guest device is using a resource, we're pretty safe,
yes. This is because the VM is stopped when we actually migrate, so the
user is already shut down without management tool interaction and the
resource isn't actually in use at this point.

Anything that keeps using a resource while the VM is stopped is a
potential problem. Apart from block jobs, this includes at least the
built-in NBD server, which also needs to be shut down before we can
complete migration. And probably other things I'm forgetting right now.

> My worry is where the edge of 'source is in control of the device is' -
> is that upto the point at which the end of the device save? Before that?
> After that? What?  It seems to be pretty arbitrary.  Unless we can
> understand why it's where it is, then I'm not sure if it's a 'clean
> migration phase model'.

I can tell you where in the phase model it is: The source gives up
control of the image during 'migration-complete', i.e. at the transition
between the 'ready' and the 'completed' phase on the source.

Within this transition, it's less clear where it has to be, but that's
the same with its current place. I expect all of this code to stay
mostly unchanged, but just called a bit later, so the place where we
inactivate images today sounds like a good candidate. It's important
that it's done before sending the EOF packet that triggers the
destination to try and take control of the images.
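
As a toy model (not QEMU code) of that ordering constraint - the source has to give up the images before the EOF that makes the destination take them over:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool source_owns_images = true;

/* (c) on the source: the bdrv_inactivate_all() step */
static void source_inactivate_all(void)
{
    source_owns_images = false;
}

/* Sending the EOF makes the destination do (d), i.e. take the images;
 * that is only safe if the source has already given them up. */
static void send_eof_then_destination_activates(void)
{
    assert(!source_owns_images);   /* would fail if EOF came before (c) */
    printf("destination now owns the images\n");
}

int main(void)
{
    source_inactivate_all();               /* must happen first */
    send_eof_then_destination_activates();
    return 0;
}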

> > > The migration way
> > > -----------------
> > >    1) Stop doing (d) when the destination is started with -S
> > >       since it happens anyway when 'cont' is issued
> > >    2) Add a new migration capability ext-manage-storage
> > >    3) When 'ext-manage-storage' is set, we don't bother doing (c)
> > >    4) Add a new command 'block-inactivate' on the source
> > > 
> > > You set ext-manage-storage, do the migrate and when it's finished
> > > clean up the block job, block-inactivate on the source, and
> > > then cont on the destination.
> > 
> > You would have to inactivate every single block node, which is not
> > a thing libvirt can currently do (they don't even really manage
> > individual nodes yet). Apart from that I believe it is much too
> > low-level and users shouldn't have to worry about what implications
> > migration has to block devices. A clean migration phase model should be
> > much easier to understand.
> 
> Yep, I see that.
> (Having said that, this extra round-trip with management is adding extra
> downtime, and we'll have to work out things like what happens if you
> cancel or fail in that time).

libvirt can opt in to the additional phase only when it actually needs
it.

I'm not sure if the requirement can change while migration is already
running. I don't think so currently, but if yes, we might need a way for
management tools to change the capability setting after the fact.

Kevin


* Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
  2017-09-29 20:44       ` Kevin Wolf
@ 2017-10-09 11:55         ` Dr. David Alan Gilbert
  2017-10-11  1:34           ` wangjie (P)
  0 siblings, 1 reply; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2017-10-09 11:55 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: wangjie (P), qemu-devel, mreitz, quintela, pbonzini, fuweiwei (C),
	eblake, berrange@redhat.com, kchamart@redhat.com, famz

* Kevin Wolf (kwolf@redhat.com) wrote:
> Am 29.09.2017 um 21:06 hat Dr. David Alan Gilbert geschrieben:
> > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > Am 28.09.2017 um 19:01 hat Dr. David Alan Gilbert geschrieben:
> > > > Hi,
> > > >   This is a 'fun' bug;  I had a good chat to kwolf about it earlier.
> > > > A proper fix really needs to be done together with libvirt so that we
> > > > can sequence:
> > > >    a) The stopping of the CPU on the source
> > > >    b) The termination of the mirroring block job
> > > >    c) The inactivation of the block devices on the source
> > > >        (bdrv_inactivate_all)
> > > >    d) The activation of the block devices on the destination
> > > >        (bdrv_invalidate_cache_all)
> > > >    e) The start of the CPU on the destination
> > > > 
> > > > It looks like you're hitting a race between b/c;  we've had races
> > > > between c/d in the past and moved the bdrv_inactivate_all.
> > > > 
> > > > During the discussion we ended up with two proposed solutions;
> > > > both of them require one extra command and one extra migration
> > > > capability.
> > > > 
> > > > The block way
> > > > -------------
> > > >    1) Add a new migration capability pause-at-complete
> > > >    2) Add a new migration state almost-complete
> > > >    3) After saving devices, if pause-at-complete is set,
> > > >       transition to almost-complete
> > > >    4) Add a new command (migration-continue) that 
> > > >       causes the migration to inactivate the devices (c)
> > > >       and send the final EOF to the destination.
> > > > 
> > > > You set pause-at-complete, wait until migrate hits almost-complete;
> > > > cleanup the mirror job, and then do migration-continue.  When it
> > > > completes do 'cont' on the destination.
> > > 
> > > Especially since you're concerned about the exact position of the pause,
> > > the following would be a little safer (using the opportunity to suggest
> > > slightly different names,too):
> > > 
> > > 1) Add a new migration capability pause-at-complete
> > > 2) Add a new migration state ready
> > > 3) After migration has converged (i.e. the background phase of the
> > >    migration has completed) and we are ready to actually switch to the
> > >    destination, stop the VM, transition to 'ready' and emit a QMP event
> > > 4) Continue processing QMP commands in the 'ready' phase. This is where
> > >    libvirt is supposed to tear down any running non-VM activities like
> > >    block jobs, NBD servers, etc.
> > > 5) Add a new command (migration-complete) that does the final part of
> > >    migration, including sending the remaining dirty memory, device state
> > >    and the final EOF that signals the destination to resume the VM if -S
> > >    isn't given.
> > 
> > What worries me here is whether some other subsection is going to say
> > that this pause must happen after the device state save, e.g. if the
> > device state save causes them to push out the last block/packet/etc -
> > and I've got no real way to say whether that point is any better or
> > worse than the point I was suggestion.
> > And the position in the cycle when the pause happens is part of the API.
> 
> The important point here is that the VM is stopped. If device emulation
> is doing anything by itself while the VM is stopped, that is a bug.
> In other words, that last block/packet/etc. must already have been
> pushed out when we stopped the VM, so there is nothing left to push out
> when we save the state.
> 
> With correctly written device code, there is no difference whether we
> save the device state first or last, without external influence it will
> be the same.
> 
> The obvious external thing that I can see is monitor commands, e.g. the
> monitor allows to access I/O ports, which can change the device state.
> However, monitor commands are completely under the control of the user,
> so they can always choose the order that they need.
> 
> In any case, saving the device state only immediately before doing the
> switch makes sure that we consider any changes that have still been
> made.
> 
> > > The main reason why I advocate this "block way" is that it looks a lot
> > > less hacky, but more future-proof to me. Instead of adding a one-off
> > > hack for the problem at hand, it introduces a phase where libvirt can
> > > cleanly tear down any activity it has still running on the source qemu.
> > 
> > Yes, if we get that phase right.
> > 
> > > We got away without doing this because we never did a clean handover of
> > > resources from the source to the destination. From the perspective of a
> > > management tool, we had these distinct phases during migration:
> > > 
> > > a) 'migrate' command:
> > >    Source VM is still running and in control of all resources (like disk
> > >    images), migrating some state in the background
> > > 
> > > b) Migration converges:
> > >    Both VMs are stopped (assuming -S is given on the destination,
> > >    otherwise this phase is skipped), the destination is in control of
> > >    the resources
> > > 
> > > c) 'cont' command:
> > >    Destination VM is running and in control of the resources
> > > 
> > > There is an asymmetry here because there is no phase where the VMs are
> > > stopped, but the source is in control of the resources (this is the
> > > additional state that my proposal introduces).
> > > 
> > > As long as we didn't really manage control of the resources with locks,
> > > we could abuse b) for things where the source still accessed the
> > > resources if we knew that the destination wouldn't use them yet (it's
> > > not an assumption that I'm willing to trust too much) and it would still
> > > kind of work in the common cases. But it's not really the correct thing
> > > to do, it has probably always caused bugs in corner cases, and the
> > > locking mechanisms are pointing that out now.
> > > 
> > > We're now noticing this with block device because we're using locking,
> > > but I wouldn't be surprised if we found sooner or later that there are
> > > more resources that should really be handed over in a cleanly designed
> > > way and that work only by accident as long as you don't do that. So I
> > > don't think a local hack for mirroring (or even all block jobs) is
> > > sufficient to be future-proof.
> > 
> > Without the block-job we were actually pretty safe - up until the EOF
> > we knew the destination wouldn't start, and similarly we wouldn't start
> > using a device on the destination before the source had saved that
> > device.
> 
> s/Without the block-job/With only the guest device/
> 
> As long as only a guest device is using a resource, we're pretty safe,
> yes. This is because the VM is stopped when we actually migrate, so the
> user is already shut down without management tool interaction and the
> resource isn't actually in use at this point.
> 
> Anything that keeps using a resource while the VM is stopped is a
> potential problem. Apart from block jobs, this includes at least the
> built-in NBD server, which also needs to be shut down before we can
> complete migration. And probably other things I'm forgetting right now.
> 
> > My worry is where the edge of 'source is in control of the device is' -
> > is that upto the point at which the end of the device save? Before that?
> > After that? What?  It seems to be pretty arbitrary.  Unless we can
> > understand why it's where it is, then I'm not sure if it's a 'clean
> > migration phase model'.
> 
> I can tell you where in the phase model it is: The source gives up
> control of the image during 'migration-complete', i.e. at the transition
> between the 'ready' and the 'completed' phase on the source.
> 
> Within this transition, it's less clear where it has to be, but that's
> the same with its current place. I expect all of this code to stay
> mostly unchanged, but just called a bit later, so the place where we
> inactivate images today sounds like a good candidate. It's important
> that it's done before sending the EOF packet that triggers the
> destination to try and take control of the images.
> 
> > > > The migration way
> > > > -----------------
> > > >    1) Stop doing (d) when the destination is started with -S
> > > >       since it happens anyway when 'cont' is issued
> > > >    2) Add a new migration capability ext-manage-storage
> > > >    3) When 'ext-manage-storage' is set, we don't bother doing (c)
> > > >    4) Add a new command 'block-inactivate' on the source
> > > > 
> > > > You set ext-manage-storage, do the migrate and when it's finished
> > > > clean up the block job, block-inactivate on the source, and
> > > > then cont on the destination.
> > > 
> > > You would have to inactivate every single block node, which is not
> > > a thing libvirt can currently do (they don't even really manage
> > > individual nodes yet). Apart from that I believe it is much too
> > > low-level and users shouldn't have to worry about what implications
> > > migration has to block devices. A clean migration phase model should be
> > > much easier to understand.
> > 
> > Yep, I see that.
> > (Having said that, this extra round-trip with management is adding extra
> > downtime, and we'll have to work out things like what happens if you
> > cancel or fail in that time).
> 
> libvirt can opt in to the additional phase only when it actually needs
> it.
> 
> I'm not sure if the requirement can change while migration is already
> running. I don't think so currently, but if yes, we might need a way for
> management tools to change the capability setting after the fact.

OK, I'm going to look at doing this in the next few days.

Dave

> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
  2017-10-09 11:55         ` Dr. David Alan Gilbert
@ 2017-10-11  1:34           ` wangjie (P)
  2017-10-11  8:28             ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread
From: wangjie (P) @ 2017-10-11  1:34 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Kevin Wolf
  Cc: qemu-devel, mreitz, quintela, pbonzini, fuweiwei (C),
	eblake, berrange@redhat.com, kchamart@redhat.com, famz

I found the real culprit; the bug started to occur with the following commit:
https://github.com/qemu/qemu/commit/e253f4b89796967d03a455d1df2ae6bda8cc7d01

Since that patch was committed, the mirror target bs uses a BlockBackend, so the mirror target bs is also inactivated in bdrv_inactivate_all().
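
Roughly what that change means for this bug, as a self-contained illustration (not real QEMU code): bdrv_inactivate_all() walks every node reachable through a BlockBackend, and once the mirror target owns a BlockBackend it is included in that walk, so a late mirror write then hits the INACTIVE assertion:

#include <assert.h>
#include <stdio.h>

#define BDRV_O_INACTIVE 0x0800

typedef struct Node {
    const char *name;
    unsigned int open_flags;
} Node;

static Node guest_disk    = { "guest disk",    0 };
static Node mirror_target = { "mirror target", 0 };

/* Before that commit only the guest disk sat behind a BlockBackend; since
 * then the mirror target does too, so the walk over BlockBackend roots in
 * bdrv_inactivate_all() now reaches it as well. */
static Node *blk_roots[] = { &guest_disk, &mirror_target };

static void inactivate_all(void)
{
    for (unsigned i = 0; i < sizeof(blk_roots) / sizeof(blk_roots[0]); i++) {
        blk_roots[i]->open_flags |= BDRV_O_INACTIVE;
        printf("inactivated %s\n", blk_roots[i]->name);
    }
}

static void mirror_write(Node *target)
{
    /* Same check as in bdrv_co_pwritev(): a mirror write that arrives after
     * inactivation aborts, which is exactly the crash reported here. */
    assert(!(target->open_flags & BDRV_O_INACTIVE));
}

int main(void)
{
    inactivate_all();               /* done during migration completion */
    mirror_write(&mirror_target);   /* late mirror I/O -> assertion failure */
    return 0;
}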


-----Original Message-----
From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com] 
Sent: Monday, October 09, 2017 7:56 PM
To: Kevin Wolf <kwolf@redhat.com>
Cc: wangjie (P) <wangjie88@huawei.com>; qemu-devel@nongnu.org; mreitz@redhat.com; quintela@redhat.com; pbonzini@redhat.com; fuweiwei (C) <fuweiwei2@huawei.com>; eblake@redhat.com; berrange@redhat.com kchamart@redhat.com <kchamart@redhat.com>; famz@redhat.com
Subject: Re: ping RE: question: I found a qemu crash about migration

* Kevin Wolf (kwolf@redhat.com) wrote:
> Am 29.09.2017 um 21:06 hat Dr. David Alan Gilbert geschrieben:
> > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > Am 28.09.2017 um 19:01 hat Dr. David Alan Gilbert geschrieben:
> > > > Hi,
> > > >   This is a 'fun' bug;  I had a good chat to kwolf about it earlier.
> > > > A proper fix really needs to be done together with libvirt so 
> > > > that we can sequence:
> > > >    a) The stopping of the CPU on the source
> > > >    b) The termination of the mirroring block job
> > > >    c) The inactivation of the block devices on the source
> > > >        (bdrv_inactivate_all)
> > > >    d) The activation of the block devices on the destination
> > > >        (bdrv_invalidate_cache_all)
> > > >    e) The start of the CPU on the destination
> > > > 
> > > > It looks like you're hitting a race between b/c;  we've had 
> > > > races between c/d in the past and moved the bdrv_inactivate_all.
> > > > 
> > > > During the discussion we ended up with two proposed solutions; 
> > > > both of them require one extra command and one extra migration 
> > > > capability.
> > > > 
> > > > The block way
> > > > -------------
> > > >    1) Add a new migration capability pause-at-complete
> > > >    2) Add a new migration state almost-complete
> > > >    3) After saving devices, if pause-at-complete is set,
> > > >       transition to almost-complete
> > > >    4) Add a new command (migration-continue) that 
> > > >       causes the migration to inactivate the devices (c)
> > > >       and send the final EOF to the destination.
> > > > 
> > > > You set pause-at-complete, wait until migrate hits 
> > > > almost-complete; cleanup the mirror job, and then do 
> > > > migration-continue.  When it completes do 'cont' on the destination.
> > > 
> > > Especially since you're concerned about the exact position of the 
> > > pause, the following would be a little safer (using the 
> > > opportunity to suggest slightly different names,too):
> > > 
> > > 1) Add a new migration capability pause-at-complete
> > > 2) Add a new migration state ready
> > > 3) After migration has converged (i.e. the background phase of the
> > >    migration has completed) and we are ready to actually switch to the
> > >    destination, stop the VM, transition to 'ready' and emit a QMP 
> > > event
> > > 4) Continue processing QMP commands in the 'ready' phase. This is where
> > >    libvirt is supposed to tear down any running non-VM activities like
> > >    block jobs, NBD servers, etc.
> > > 5) Add a new command (migration-complete) that does the final part of
> > >    migration, including sending the remaining dirty memory, device state
> > >    and the final EOF that signals the destination to resume the VM if -S
> > >    isn't given.
> > 
> > What worries me here is whether some other subsection is going to 
> > say that this pause must happen after the device state save, e.g. if 
> > the device state save causes them to push out the last 
> > block/packet/etc - and I've got no real way to say whether that 
> > point is any better or worse than the point I was suggestion.
> > And the position in the cycle when the pause happens is part of the API.
> 
> The important point here is that the VM is stopped. If device 
> emulation is doing anything by itself while the VM is stopped, that is a bug.
> In other words, that last block/packet/etc. must already have been 
> pushed out when we stopped the VM, so there is nothing left to push 
> out when we save the state.
> 
> With correctly written device code, there is no difference whether we 
> save the device state first or last, without external influence it 
> will be the same.
> 
> The obvious external thing that I can see is monitor commands, e.g. 
> the monitor allows to access I/O ports, which can change the device state.
> However, monitor commands are completely under the control of the 
> user, so they can always choose the order that they need.
> 
> In any case, saving the device state only immediately before doing the 
> switch makes sure that we consider any changes that have still been 
> made.
> 
> > > The main reason why I advocate this "block way" is that it looks a 
> > > lot less hacky, but more future-proof to me. Instead of adding a 
> > > one-off hack for the problem at hand, it introduces a phase where 
> > > libvirt can cleanly tear down any activity it has still running on the source qemu.
> > 
> > Yes, if we get that phase right.
> > 
> > > We got away without doing this because we never did a clean 
> > > handover of resources from the source to the destination. From the 
> > > perspective of a management tool, we had these distinct phases during migration:
> > > 
> > > a) 'migrate' command:
> > >    Source VM is still running and in control of all resources (like disk
> > >    images), migrating some state in the background
> > > 
> > > b) Migration converges:
> > >    Both VMs are stopped (assuming -S is given on the destination,
> > >    otherwise this phase is skipped), the destination is in control of
> > >    the resources
> > > 
> > > c) 'cont' command:
> > >    Destination VM is running and in control of the resources
> > > 
> > > There is an asymmetry here because there is no phase where the VMs 
> > > are stopped, but the source is in control of the resources (this 
> > > is the additional state that my proposal introduces).
> > > 
> > > As long as we didn't really manage control of the resources with 
> > > locks, we could abuse b) for things where the source still 
> > > accessed the resources if we knew that the destination wouldn't 
> > > use them yet (it's not an assumption that I'm willing to trust too 
> > > much) and it would still kind of work in the common cases. But 
> > > it's not really the correct thing to do, it has probably always 
> > > caused bugs in corner cases, and the locking mechanisms are pointing that out now.
> > > 
> > > We're now noticing this with block device because we're using 
> > > locking, but I wouldn't be surprised if we found sooner or later 
> > > that there are more resources that should really be handed over in 
> > > a cleanly designed way and that work only by accident as long as 
> > > you don't do that. So I don't think a local hack for mirroring (or 
> > > even all block jobs) is sufficient to be future-proof.
> > 
> > Without the block-job we were actually pretty safe - up until the 
> > EOF we knew the destination wouldn't start, and similarly we 
> > wouldn't start using a device on the destination before the source 
> > had saved that device.
> 
> s/Without the block-job/With only the guest device/
> 
> As long as only a guest device is using a resource, we're pretty safe, 
> yes. This is because the VM is stopped when we actually migrate, so 
> the user is already shut down without management tool interaction and 
> the resource isn't actually in use at this point.
> 
> Anything that keeps using a resource while the VM is stopped is a 
> potential problem. Apart from block jobs, this includes at least the 
> built-in NBD server, which also needs to be shut down before we can 
> complete migration. And probably other things I'm forgetting right now.
> 
> > My worry is where the edge of 'source is in control of the device 
> > is' - is that upto the point at which the end of the device save? Before that?
> > After that? What?  It seems to be pretty arbitrary.  Unless we can 
> > understand why it's where it is, then I'm not sure if it's a 'clean 
> > migration phase model'.
> 
> I can tell you where in the phase model it is: The source gives up 
> control of the image during 'migration-complete', i.e. at the 
> transition between the 'ready' and the 'completed' phase on the source.
> 
> Within this transition, it's less clear where it has to be, but that's 
> the same with its current place. I expect all of this code to stay 
> mostly unchanged, but just called a bit later, so the place where we 
> inactivate images today sounds like a good candidate. It's important 
> that it's done before sending the EOF packet that triggers the 
> destination to try and take control of the images.
> 
> > > > The migration way
> > > > -----------------
> > > >    1) Stop doing (d) when the destination is started with -S
> > > >       since it happens anyway when 'cont' is issued
> > > >    2) Add a new migration capability ext-manage-storage
> > > >    3) When 'ext-manage-storage' is set, we don't bother doing (c)
> > > >    4) Add a new command 'block-inactivate' on the source
> > > > 
> > > > You set ext-manage-storage, do the migrate and when it's 
> > > > finished clean up the block job, block-inactivate on the source, 
> > > > and then cont on the destination.
> > > 
> > > You would have to inactivate every single block node, which is not 
> > > a thing libvirt can currently do (they don't even really manage 
> > > individual nodes yet). Apart from that I believe it is much too 
> > > low-level and users shouldn't have to worry about what 
> > > implications migration has to block devices. A clean migration 
> > > phase model should be much easier to understand.
> > 
> > Yep, I see that.
> > (Having said that, this extra round-trip with management is adding 
> > extra downtime, and we'll have to work out things like what happens 
> > if you cancel or fail in that time).
> 
> libvirt can opt in to the additional phase only when it actually needs 
> it.
> 
> I'm not sure if the requirement can change while migration is already 
> running. I don't think so currently, but if yes, we might need a way 
> for management tools to change the capability setting after the fact.

OK, I'm going to look at doing this in the next few days.

Dave

> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
  2017-10-11  1:34           ` wangjie (P)
@ 2017-10-11  8:28             ` Kevin Wolf
  0 siblings, 0 replies; 8+ messages in thread
From: Kevin Wolf @ 2017-10-11  8:28 UTC (permalink / raw)
  To: wangjie (P)
  Cc: Dr. David Alan Gilbert, qemu-devel, mreitz, quintela, pbonzini,
	fuweiwei (C),
	eblake, berrange@redhat.com, kchamart@redhat.com, famz

On 11.10.2017 at 03:34, wangjie (P) wrote:
> I found the real culprit, the bug begin to occur since the committed as following:
> https://github.com/qemu/qemu/commit/e253f4b89796967d03a455d1df2ae6bda8cc7d01
> 
> since the patch committed, the mirror target bs begin to use BlockBackend, So the mirror target bs will be inactived in bdrv_inactivate_all

Despite being the first commit that shows this behaviour, this is a bug
fix, not "the real culprit". The target image must be inactivated to
avoid corrupting it. We went from potential silent corruption to very
visible failure. This is a step forward.

The real "real culprit" is what we already discussed two weeks ago and
what Dave is going to fix now, a bad process for image handover during
migration.

Also, please don't top-post on this mailing list.

Kevin

> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com] 
> Sent: Monday, October 09, 2017 7:56 PM
> To: Kevin Wolf <kwolf@redhat.com>
> Cc: wangjie (P) <wangjie88@huawei.com>; qemu-devel@nongnu.org; mreitz@redhat.com; quintela@redhat.com; pbonzini@redhat.com; fuweiwei (C) <fuweiwei2@huawei.com>; eblake@redhat.com; berrange@redhat.com kchamart@redhat.com <kchamart@redhat.com>; famz@redhat.com
> Subject: Re: ping RE: question: I found a qemu crash about migration
> 
> * Kevin Wolf (kwolf@redhat.com) wrote:
> > Am 29.09.2017 um 21:06 hat Dr. David Alan Gilbert geschrieben:
> > > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > > Am 28.09.2017 um 19:01 hat Dr. David Alan Gilbert geschrieben:
> > > > > Hi,
> > > > >   This is a 'fun' bug;  I had a good chat to kwolf about it earlier.
> > > > > A proper fix really needs to be done together with libvirt so 
> > > > > that we can sequence:
> > > > >    a) The stopping of the CPU on the source
> > > > >    b) The termination of the mirroring block job
> > > > >    c) The inactivation of the block devices on the source
> > > > >        (bdrv_inactivate_all)
> > > > >    d) The activation of the block devices on the destination
> > > > >        (bdrv_invalidate_cache_all)
> > > > >    e) The start of the CPU on the destination
> > > > > 
> > > > > It looks like you're hitting a race between b/c;  we've had 
> > > > > races between c/d in the past and moved the bdrv_inactivate_all.
> > > > > 
> > > > > During the discussion we ended up with two proposed solutions; 
> > > > > both of them require one extra command and one extra migration 
> > > > > capability.
> > > > > 
> > > > > The block way
> > > > > -------------
> > > > >    1) Add a new migration capability pause-at-complete
> > > > >    2) Add a new migration state almost-complete
> > > > >    3) After saving devices, if pause-at-complete is set,
> > > > >       transition to almost-complete
> > > > >    4) Add a new command (migration-continue) that 
> > > > >       causes the migration to inactivate the devices (c)
> > > > >       and send the final EOF to the destination.
> > > > > 
> > > > > You set pause-at-complete, wait until the migration hits 
> > > > > almost-complete, clean up the mirror job, and then issue 
> > > > > migration-continue.  When it completes, do 'cont' on the destination.
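
As an illustration only, a minimal management-side sketch of this proposed
flow might look like the following (QMP over a Unix socket, driven from
Python).  The capability 'pause-at-complete', the state 'almost-complete'
and the command 'migration-continue' are just the names proposed above and
do not exist in QEMU; the socket paths, migration URI and the job name
'drive0' are invented for the example:

    # Sketch of the proposed "block way"; pause-at-complete, almost-complete
    # and migration-continue are proposals from this thread, not existing
    # QEMU interfaces.
    import json
    import socket
    import time

    def qmp_open(path):
        """Connect to a QMP monitor socket and negotiate capabilities."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(path)
        f = sock.makefile("rw")
        json.loads(f.readline())                    # server greeting
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        json.loads(f.readline())                    # empty 'return'
        return f

    def qmp(f, cmd, **args):
        """Send one command and return its reply, skipping async events."""
        f.write(json.dumps({"execute": cmd, "arguments": args}) + "\n")
        f.flush()
        while True:
            msg = json.loads(f.readline())
            if "return" in msg or "error" in msg:
                return msg

    src = qmp_open("/tmp/src.qmp")                  # source QEMU (invented path)
    dst = qmp_open("/tmp/dst.qmp")                  # destination, started with -S

    qmp(src, "migrate-set-capabilities",
        capabilities=[{"capability": "pause-at-complete", "state": True}])
    qmp(src, "migrate", uri="tcp:dest:4444")

    # Wait for the proposed 'almost-complete' state (devices saved, no EOF yet).
    while qmp(src, "query-migrate")["return"].get("status") != "almost-complete":
        time.sleep(0.1)

    # Tear down the mirror while the source still owns the images; a real
    # client would also wait for the job's completion event before going on.
    qmp(src, "block-job-cancel", device="drive0")

    qmp(src, "migration-continue")                  # proposed: inactivate + final EOF
    qmp(dst, "cont")                                # resume the guest on the destination
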
> > > > 
> > > > Especially since you're concerned about the exact position of the 
> > > > pause, the following would be a little safer (using the 
> > > > opportunity to suggest slightly different names, too):
> > > > 
> > > > 1) Add a new migration capability pause-at-complete
> > > > 2) Add a new migration state ready
> > > > 3) After migration has converged (i.e. the background phase of the
> > > >    migration has completed) and we are ready to actually switch to the
> > > >    destination, stop the VM, transition to 'ready' and emit a QMP 
> > > > event
> > > > 4) Continue processing QMP commands in the 'ready' phase. This is where
> > > >    libvirt is supposed to tear down any running non-VM activities like
> > > >    block jobs, NBD servers, etc.
> > > > 5) Add a new command (migration-complete) that does the final part of
> > > >    migration, including sending the remaining dirty memory, device state
> > > >    and the final EOF that signals the destination to resume the VM if -S
> > > >    isn't given.
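
As a rough illustration, the management-side sequence for this variant would
look much the same as the sketch further up; the following reuses the
hypothetical qmp()/qmp_open() helpers and invented names from that sketch,
and 'pause-at-complete', 'ready' and 'migration-complete' are again only the
names proposed here:

    # Variant for the proposal above; relies on the qmp()/qmp_open() helpers
    # from the earlier sketch.  'ready' and 'migration-complete' are proposed
    # names, not an existing QEMU interface.
    qmp(src, "migrate-set-capabilities",
        capabilities=[{"capability": "pause-at-complete", "state": True}])
    qmp(src, "migrate", uri="tcp:dest:4444")

    # A real client would wait for the proposed QMP event announcing the
    # 'ready' state; polling keeps the sketch short.
    while qmp(src, "query-migrate")["return"].get("status") != "ready":
        time.sleep(0.1)

    # The VM is stopped now: tear down non-VM activity still using the images.
    qmp(src, "block-job-cancel", device="drive0")   # mirror job
    qmp(src, "nbd-server-stop")                     # built-in NBD server, if any

    qmp(src, "migration-complete")   # proposed: remaining RAM, device state, EOF
    qmp(dst, "cont")                 # only needed if the destination used -S
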
> > > 
> > > What worries me here is whether some other subsystem is going to 
> > > say that this pause must happen after the device state save, e.g. if 
> > > the device state save causes them to push out the last 
> > > block/packet/etc - and I've got no real way to say whether that 
> > > point is any better or worse than the point I was suggesting.
> > > And the position in the cycle when the pause happens is part of the API.
> > 
> > The important point here is that the VM is stopped. If device 
> > emulation is doing anything by itself while the VM is stopped, that is a bug.
> > In other words, that last block/packet/etc. must already have been 
> > pushed out when we stopped the VM, so there is nothing left to push 
> > out when we save the state.
> > 
> > With correctly written device code, there is no difference whether we 
> > save the device state first or last, without external influence it 
> > will be the same.
> > 
> > The obvious external thing that I can see is monitor commands, e.g. 
> > the monitor allows to access I/O ports, which can change the device state.
> > However, monitor commands are completely under the control of the 
> > user, so they can always choose the order that they need.
> > 
> > In any case, saving the device state only immediately before doing the 
> > switch makes sure that we consider any changes that have still been 
> > made.
> > 
> > > > The main reason why I advocate this "block way" is that it looks a 
> > > > lot less hacky, but more future-proof to me. Instead of adding a 
> > > > one-off hack for the problem at hand, it introduces a phase where 
> > > > libvirt can cleanly tear down any activity it has still running on the source qemu.
> > > 
> > > Yes, if we get that phase right.
> > > 
> > > > We got away without doing this because we never did a clean 
> > > > handover of resources from the source to the destination. From the 
> > > > perspective of a management tool, we had these distinct phases during migration:
> > > > 
> > > > a) 'migrate' command:
> > > >    Source VM is still running and in control of all resources (like disk
> > > >    images), migrating some state in the background
> > > > 
> > > > b) Migration converges:
> > > >    Both VMs are stopped (assuming -S is given on the destination,
> > > >    otherwise this phase is skipped), the destination is in control of
> > > >    the resources
> > > > 
> > > > c) 'cont' command:
> > > >    Destination VM is running and in control of the resources
> > > > 
> > > > There is an asymmetry here because there is no phase where the VMs 
> > > > are stopped, but the source is in control of the resources (this 
> > > > is the additional state that my proposal introduces).
> > > > 
> > > > As long as we didn't really manage control of the resources with 
> > > > locks, we could abuse b) for things where the source still 
> > > > accessed the resources if we knew that the destination wouldn't 
> > > > use them yet (it's not an assumption that I'm willing to trust too 
> > > > much) and it would still kind of work in the common cases. But 
> > > > it's not really the correct thing to do, it has probably always 
> > > > caused bugs in corner cases, and the locking mechanisms are pointing that out now.
> > > > 
> > > > We're now noticing this with block device because we're using 
> > > > locking, but I wouldn't be surprised if we found sooner or later 
> > > > that there are more resources that should really be handed over in 
> > > > a cleanly designed way and that work only by accident as long as 
> > > > you don't do that. So I don't think a local hack for mirroring (or 
> > > > even all block jobs) is sufficient to be future-proof.
> > > 
> > > Without the block-job we were actually pretty safe - up until the 
> > > EOF we knew the destination wouldn't start, and similarly we 
> > > wouldn't start using a device on the destination before the source 
> > > had saved that device.
> > 
> > s/Without the block-job/With only the guest device/
> > 
> > As long as only a guest device is using a resource, we're pretty safe, 
> > yes. This is because the VM is stopped when we actually migrate, so 
> > the user is already shut down without management tool interaction and 
> > the resource isn't actually in use at this point.
> > 
> > Anything that keeps using a resource while the VM is stopped is a 
> > potential problem. Apart from block jobs, this includes at least the 
> > built-in NBD server, which also needs to be shut down before we can 
> > complete migration. And probably other things I'm forgetting right now.
> > 
> > > My worry is where the edge of 'source is in control of the device' 
> > > is - is that up to the point at which the device save ends? Before that?
> > > After that? What?  It seems to be pretty arbitrary.  Unless we can 
> > > understand why it's where it is, I'm not sure if it's a 'clean 
> > > migration phase model'.
> > 
> > I can tell you where in the phase model it is: The source gives up 
> > control of the image during 'migration-complete', i.e. at the 
> > transition between the 'ready' and the 'completed' phase on the source.
> > 
> > Within this transition, it's less clear where it has to be, but that's 
> > the same with its current place. I expect all of this code to stay 
> > mostly unchanged, but just called a bit later, so the place where we 
> > inactivate images today sounds like a good candidate. It's important 
> > that it's done before sending the EOF packet that triggers the 
> > destination to try and take control of the images.
> > 
> > > > > The migration way
> > > > > -----------------
> > > > >    1) Stop doing (d) when the destination is started with -S
> > > > >       since it happens anyway when 'cont' is issued
> > > > >    2) Add a new migration capability ext-manage-storage
> > > > >    3) When 'ext-manage-storage' is set, we don't bother doing (c)
> > > > >    4) Add a new command 'block-inactivate' on the source
> > > > > 
> > > > > You set ext-manage-storage, do the migrate and when it's 
> > > > > finished clean up the block job, block-inactivate on the source, 
> > > > > and then cont on the destination.
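
Roughly, and again only as a sketch reusing the hypothetical qmp() helper and
invented names from the first example; 'ext-manage-storage' and
'block-inactivate' are just the proposed names, and the node names are made
up:

    # Sketch of the proposed "migration way"; ext-manage-storage and
    # block-inactivate are proposals, not existing QEMU commands.
    qmp(src, "migrate-set-capabilities",
        capabilities=[{"capability": "ext-manage-storage", "state": True}])
    qmp(src, "migrate", uri="tcp:dest:4444")

    while qmp(src, "query-migrate")["return"].get("status") != "completed":
        time.sleep(0.1)

    qmp(src, "block-job-cancel", device="drive0")    # finish the mirror first
    for node in ("drive0", "mirror-target"):         # invented node names
        qmp(src, "block-inactivate", device=node)    # proposed command
    qmp(dst, "cont")                                 # destination started with -S
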
> > > > 
> > > > You would have to inactivate every single block node, which is not 
> > > > a thing libvirt can currently do (they don't even really manage 
> > > > individual nodes yet). Apart from that I believe it is much too 
> > > > low-level and users shouldn't have to worry about what 
> > > > implications migration has to block devices. A clean migration 
> > > > phase model should be much easier to understand.
> > > 
> > > Yep, I see that.
> > > (Having said that, this extra round-trip with management is adding 
> > > extra downtime, and we'll have to work out things like what happens 
> > > if you cancel or fail in that time).
> > 
> > libvirt can opt in to the additional phase only when it actually needs 
> > it.
> > 
> > I'm not sure if the requirement can change while migration is already 
> > running. I don't think so currently, but if yes, we might need a way 
> > for management tools to change the capability setting after the fact.
> 
> OK, I'm going to look at doing this in the next few days.
> 
> Dave
> 
> > Kevin
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-10-11  8:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-28  7:38 [Qemu-devel] ping RE: question: I found a qemu crash about migration wangjie (P)
2017-09-28 17:01 ` Dr. David Alan Gilbert
2017-09-29  9:35   ` Kevin Wolf
2017-09-29 19:06     ` Dr. David Alan Gilbert
2017-09-29 20:44       ` Kevin Wolf
2017-10-09 11:55         ` Dr. David Alan Gilbert
2017-10-11  1:34           ` wangjie (P)
2017-10-11  8:28             ` Kevin Wolf
