* Question regarding live-migration with drive-mirror
From: Fiona Ebner @ 2022-09-28 14:10 UTC (permalink / raw)
  To: open list:Network Block Dev..., QEMU Developers
  Cc: jsnow, vsementsov, Kevin Wolf, Hanna Reitz, quintela,
	Dr. David Alan Gilbert, Thomas Lamprecht,
	Fabian Grünbichler, Wolfgang Bumiller

Hi,
recently one of our users provided a backtrace[0] for the following
assertion failure during a live migration that uses drive-mirror to sync
a local disk:
> bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed

The way we do migration with a local disk is essentially:
1. start target instance with a suitable NBD export
2. start drive-mirror on the source side and wait for it to become ready
once
3. issue 'migrate' QMP command
4. cancel drive-mirror blockjob after the migration has finished
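In QMP terms, the sequence looks roughly like the sketch below (Python only
for readability; socket paths, addresses, ports and device/node names are
made-up placeholders, and event handling is glossed over):

import json
import socket

def qmp_command(sock_path, command, arguments=None):
    """Send one QMP command over a UNIX monitor socket and return the reply.
    Deliberately simplistic: opens a fresh connection per command and does
    not separate asynchronous events from command replies."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        f = s.makefile('rw')
        f.readline()                                   # QMP greeting
        f.write(json.dumps({'execute': 'qmp_capabilities'}) + '\n')
        f.flush()
        f.readline()                                   # capabilities reply
        msg = {'execute': command}
        if arguments is not None:
            msg['arguments'] = arguments
        f.write(json.dumps(msg) + '\n')
        f.flush()
        return json.loads(f.readline())

TARGET_QMP = '/run/qemu-target.qmp'    # placeholder monitor sockets
SOURCE_QMP = '/run/qemu-source.qmp'

# 1. target: start an NBD server and export the pre-created local disk
qmp_command(TARGET_QMP, 'nbd-server-start',
            {'addr': {'type': 'inet',
                      'data': {'host': '10.0.0.2', 'port': '10809'}}})
qmp_command(TARGET_QMP, 'block-export-add',
            {'type': 'nbd', 'id': 'mirror-export', 'node-name': 'drive-scsi0',
             'name': 'drive-scsi0', 'writable': True})  # node-name is a placeholder

# 2. source: mirror the local disk to the NBD export, then wait for the
#    BLOCK_JOB_READY event (the waiting is not shown here)
qmp_command(SOURCE_QMP, 'drive-mirror',
            {'device': 'drive-scsi0', 'sync': 'full', 'mode': 'existing',
             'format': 'raw',
             'target': 'nbd:10.0.0.2:10809:exportname=drive-scsi0'})

# 3. migrate the VM state
qmp_command(SOURCE_QMP, 'migrate', {'uri': 'tcp:10.0.0.2:60000'})

# 4. after migration has finished: cancel the mirror; for a job that is
#    "ready", this finishes with a consistent point-in-time copy
qmp_command(SOURCE_QMP, 'block-job-cancel', {'device': 'drive-scsi0'})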

I reproduced the issue with the following fio script running in the
guest (to dirty lots of clusters):
> fio --name=make-mirror-work --size=100M --direct=1 --rw=randwrite \
>     --bs=4k --ioengine=psync --numjobs=5 --runtime=60 --time_based

AFAIU, the issue is that nothing guarantees that the drive mirror is
ready when the migration inactivates the block drives.

Is using copy-mode=write-blocking for drive-mirror the only way to avoid
this issue? The downside there is that the network (used by the mirror)
would become a bottleneck for IO in the guest, while the blocking behavior
would really only be needed during the final phase.
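
For reference, using that mode just means passing 'copy-mode' when starting
the job; a minimal sketch (device and target are placeholders as above):

import json

# drive-mirror started in write-blocking mode: every guest write also waits
# for the corresponding write to the mirror target (default is 'background').
cmd = {
    'execute': 'drive-mirror',
    'arguments': {
        'device': 'drive-scsi0',
        'sync': 'full',
        'mode': 'existing',
        'format': 'raw',
        'target': 'nbd:10.0.0.2:10809:exportname=drive-scsi0',
        'copy-mode': 'write-blocking',
    },
}
print(json.dumps(cmd, indent=2))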

I guess the assert should be avoided in any case. Here's a few ideas
that came to mind:
1. migration should fail gracefully
2. migration should wait for the mirror-jobs to become ready before
inactivating the block drives - that would increase the downtime in
these situations of course
2A. additionally, drive-mirror could be taken into account when
converging the migration somehow?

I noticed the following comment in the mirror implementation
>         /* Note that even when no rate limit is applied we need to yield
>          * periodically with no pending I/O so that bdrv_drain_all() returns.
>          * We do so every BLKOCK_JOB_SLICE_TIME nanoseconds, or when there is
>          * an error, or when the source is clean, whichever comes first. */

3. change draining behavior after the job was ready once, so that
bdrv_drain_all() will only return when the job is ready again? Hope I'm
not completely misunderstanding this.

Best Regards,
Fiona

[0]:
> Thread 1 (Thread 0x7f3389d4a000 (LWP 2297576) "kvm"):
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f339488d537 in __GI_abort () at abort.c:79
> #2  0x00007f339488d40f in __assert_fail_base (fmt=0x7f3394a056a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5595f85bfd70 "!(bs->open_flags & BDRV_O_INACTIVE)", file=0x5595f85cb576 "../block/io.c", line=2026, function=<optimized out>) at assert.c:92
> #3  0x00007f339489c662 in __GI___assert_fail (assertion=assertion@entry=0x5595f85bfd70 "!(bs->open_flags & BDRV_O_INACTIVE)", file=file@entry=0x5595f85cb576 "../block/io.c", line=line@entry=2026, function=function@entry=0x5595f85cc510 <__PRETTY_FUNCTION__.8> "bdrv_co_write_req_prepare") at assert.c:101
> #4  0x00005595f83218f2 in bdrv_co_write_req_prepare (child=0x5595f91cab90, offset=60867018752, bytes=196608, req=0x7f324a2e9d70, flags=0) at ../block/io.c:2026
> #5  0x00005595f8323384 in bdrv_aligned_pwritev (child=child@entry=0x5595f91cab90, req=req@entry=0x7f324a2e9d70, offset=60867018752, bytes=196608, align=align@entry=1, qiov=0x5595f9030d58, qiov_offset=0, flags=0) at ../block/io.c:2140
> #6  0x00005595f832485a in bdrv_co_pwritev_part (child=0x5595f91cab90, offset=<optimized out>, offset@entry=60867018752, bytes=<optimized out>, bytes@entry=196608, qiov=<optimized out>, qiov@entry=0x5595f9030d58, qiov_offset=<optimized out>, qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:2353
> #7  0x00005595f8315a09 in blk_co_do_pwritev_part (blk=blk@entry=0x5595f91db8c0, offset=60867018752, bytes=196608, qiov=qiov@entry=0x5595f9030d58, qiov_offset=qiov_offset@entry=0, flags=flags@entry=0) at ../block/block-backend.c:1365
> #8  0x00005595f8315bdd in blk_co_pwritev_part (flags=0, qiov_offset=0, qiov=qiov@entry=0x5595f9030d58, bytes=<optimized out>, offset=<optimized out>, blk=0x5595f91db8c0) at ../block/block-backend.c:1380
> #9  blk_co_pwritev (blk=0x5595f91db8c0, offset=<optimized out>, bytes=<optimized out>, qiov=qiov@entry=0x5595f9030d58, flags=flags@entry=0) at ../block/block-backend.c:1391
> #10 0x00005595f8328a59 in mirror_read_complete (ret=0, op=0x5595f9030d50) at ../block/mirror.c:260
> #11 mirror_co_read (opaque=0x5595f9030d50) at ../block/mirror.c:400
> #12 0x00005595f843a39b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #13 0x00007f33948b8d40 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #14 0x00007f324a3ea6e0 in ?? ()
> #15 0x0000000000000000 in ?? ()




* Re: Question regarding live-migration with drive-mirror
From: Dr. David Alan Gilbert @ 2022-09-28 18:53 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: open list:Network Block Dev...,
	QEMU Developers, jsnow, vsementsov, Kevin Wolf, Hanna Reitz,
	quintela, Thomas Lamprecht, Fabian Grünbichler,
	Wolfgang Bumiller

* Fiona Ebner (f.ebner@proxmox.com) wrote:
> Hi,
> recently one of our users provided a backtrace[0] for the following
> assertion failure during a live migration that uses drive-mirror to sync
> a local disk:
> > bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed
> 
> The way we do migration with a local disk is essentially:
> 1. start target instance with a suitable NBD export
> 2. start drive-mirror on the source side and wait for it to become ready
> once
> 3. issue 'migrate' QMP command
> 4. cancel drive-mirror blockjob after the migration has finished
> 
> I reproduced the issue with the following fio script running in the
> guest (to dirty lots of clusters):
> > fio --name=make-mirror-work --size=100M --direct=1 --rw=randwrite \
> >     --bs=4k --ioengine=psync --numjobs=5 --runtime=60 --time_based
> 
> AFAIU, the issue is that nothing guarantees that the drive mirror is
> ready when the migration inactivates the block drives.

I don't know the block code well enough; I don't think I'd realised
that a drive-mirror could become unready.

> Is using copy-mode=write-blocking for drive-mirror the only way to avoid
> this issue? The downside there is that the network (used by the mirror)
> would become a bottleneck for IO in the guest, while the blocking behavior
> would really only be needed during the final phase.

It sounds like you need a way to switch to the blocking mode.

> I guess the assert should be avoided in any case. Here's a few ideas
> that came to mind:
> 1. migration should fail gracefully
> 2. migration should wait for the mirror-jobs to become ready before
> inactivating the block drives - that would increase the downtime in
> these situations of course
> 2A. additionally, drive-mirror could be taken into account when
> converging the migration somehow?

Does the migration capability 'pause-before-switchover' help you here?
If enabled, it causes the VM to pause just before the bdrv_inactivate_all
call (you then use migrate-continue to tell it to carry on).
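
Roughly, something like this (capability, command and state names as in the
QMP schema; the migration URI is just a placeholder):

import json

# Enable pause-before-switchover, start the migration, and once the MIGRATION
# event reports status 'pre-switchover', resume the switchover manually.
commands = [
    {'execute': 'migrate-set-capabilities',
     'arguments': {'capabilities': [
         {'capability': 'pause-before-switchover', 'state': True}]}},
    {'execute': 'migrate', 'arguments': {'uri': 'tcp:10.0.0.2:60000'}},
    # ... do whatever needs doing while the VM is paused ...
    {'execute': 'migrate-continue', 'arguments': {'state': 'pre-switchover'}},
]
print('\n'.join(json.dumps(c) for c in commands))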

Dave

> I noticed the following comment in the mirror implementation
> >         /* Note that even when no rate limit is applied we need to yield
> >          * periodically with no pending I/O so that bdrv_drain_all() returns.
> >          * We do so every BLKOCK_JOB_SLICE_TIME nanoseconds, or when there is
> >          * an error, or when the source is clean, whichever comes first. */
> 
> 3. change draining behavior after the job was ready once, so that
> bdrv_drain_all() will only return when the job is ready again? Hope I'm
> not completely misunderstanding this.
> 
> Best Regards,
> Fiona
> 
> [0]:
> > Thread 1 (Thread 0x7f3389d4a000 (LWP 2297576) "kvm"):
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > #1  0x00007f339488d537 in __GI_abort () at abort.c:79
> > #2  0x00007f339488d40f in __assert_fail_base (fmt=0x7f3394a056a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5595f85bfd70 "!(bs->open_flags & BDRV_O_INACTIVE)", file=0x5595f85cb576 "../block/io.c", line=2026, function=<optimized out>) at assert.c:92
> > #3  0x00007f339489c662 in __GI___assert_fail (assertion=assertion@entry=0x5595f85bfd70 "!(bs->open_flags & BDRV_O_INACTIVE)", file=file@entry=0x5595f85cb576 "../block/io.c", line=line@entry=2026, function=function@entry=0x5595f85cc510 <__PRETTY_FUNCTION__.8> "bdrv_co_write_req_prepare") at assert.c:101
> > #4  0x00005595f83218f2 in bdrv_co_write_req_prepare (child=0x5595f91cab90, offset=60867018752, bytes=196608, req=0x7f324a2e9d70, flags=0) at ../block/io.c:2026
> > #5  0x00005595f8323384 in bdrv_aligned_pwritev (child=child@entry=0x5595f91cab90, req=req@entry=0x7f324a2e9d70, offset=60867018752, bytes=196608, align=align@entry=1, qiov=0x5595f9030d58, qiov_offset=0, flags=0) at ../block/io.c:2140
> > #6  0x00005595f832485a in bdrv_co_pwritev_part (child=0x5595f91cab90, offset=<optimized out>, offset@entry=60867018752, bytes=<optimized out>, bytes@entry=196608, qiov=<optimized out>, qiov@entry=0x5595f9030d58, qiov_offset=<optimized out>, qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:2353
> > #7  0x00005595f8315a09 in blk_co_do_pwritev_part (blk=blk@entry=0x5595f91db8c0, offset=60867018752, bytes=196608, qiov=qiov@entry=0x5595f9030d58, qiov_offset=qiov_offset@entry=0, flags=flags@entry=0) at ../block/block-backend.c:1365
> > #8  0x00005595f8315bdd in blk_co_pwritev_part (flags=0, qiov_offset=0, qiov=qiov@entry=0x5595f9030d58, bytes=<optimized out>, offset=<optimized out>, blk=0x5595f91db8c0) at ../block/block-backend.c:1380
> > #9  blk_co_pwritev (blk=0x5595f91db8c0, offset=<optimized out>, bytes=<optimized out>, qiov=qiov@entry=0x5595f9030d58, flags=flags@entry=0) at ../block/block-backend.c:1391
> > #10 0x00005595f8328a59 in mirror_read_complete (ret=0, op=0x5595f9030d50) at ../block/mirror.c:260
> > #11 mirror_co_read (opaque=0x5595f9030d50) at ../block/mirror.c:400
> > #12 0x00005595f843a39b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> > #13 0x00007f33948b8d40 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> > #14 0x00007f324a3ea6e0 in ?? ()
> > #15 0x0000000000000000 in ?? ()
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: Question regarding live-migration with drive-mirror
From: Fiona Ebner @ 2022-09-29  9:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: open list:Network Block Dev...,
	QEMU Developers, jsnow, vsementsov, Kevin Wolf, Hanna Reitz,
	quintela, Thomas Lamprecht, Fabian Grünbichler,
	Wolfgang Bumiller

Am 28.09.22 um 20:53 schrieb Dr. David Alan Gilbert:
> * Fiona Ebner (f.ebner@proxmox.com) wrote:
>> Hi,
>> recently one of our users provided a backtrace[0] for the following
>> assertion failure during a live migration that uses drive-mirror to sync
>> a local disk:
>>> bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed
>>
>> The way we do migration with a local disk is essentially:
>> 1. start target instance with a suitable NBD export
>> 2. start drive-mirror on the source side and wait for it to become ready
>> once
>> 3. issue 'migrate' QMP command
>> 4. cancel drive-mirror blockjob after the migration has finished
>>
>> I reproduced the issue with the following fio script running in the
>> guest (to dirty lots of clusters):
>>> fio --name=make-mirror-work --size=100M --direct=1 --rw=randwrite \
>>>     --bs=4k --ioengine=psync --numjobs=5 --runtime=60 --time_based
>>
>> AFAIU, the issue is that nothing guarantees that the drive mirror is
>> ready when the migration inactivates the block drives.
> 
> I don't know the block code well enough; I don't think I'd realised
> that a drive-mirror could become unready.

I actually shouldn't have used "ready" here, because "ready" just means
that the job is ready to be completed, and indeed it will stay "ready".
But with the default copy-mode=background, new guest writes mean that
there can be left-over work lying around. Completing/canceling the job
will do that work, but currently migration doesn't do that automatically.

> 
>> Is using copy-mode=write-blocking for drive-mirror the only way to avoid
>> this issue? The downside there is that the network (used by the mirror)
>> would become a bottleneck for IO in the guest, while the blocking behavior
>> would really only be needed during the final phase.
> 
> It sounds like you need a way to switch to the blocking mode.

Yes, that would help. I guess it would be:
1. wait for the drive-mirror(s) to become ready
2. switch to blocking mode
3. wait for the drive-mirror(s) to not have any background work left,
i.e. ensure that from now on we're always in sync
4. start state migration

Not sure if step 3 can be achieved currently. The BlockJobInfo object
has a "busy" field, but I guess it's possible to have background work
left even if there's no pending IO. At least the comment about draining
below sounds like that could happen.
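
For step 3, something like the following polling sketch is what I have in
mind (placeholder names; simplified QMP handling), though as said, I'm not
sure the check is actually sufficient:

import json
import socket
import time

def query_block_jobs(sock_path):
    """Fetch the block job list via QMP (simplified: ignores async events)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        f = s.makefile('rw')
        f.readline()                                   # greeting
        f.write(json.dumps({'execute': 'qmp_capabilities'}) + '\n')
        f.flush()
        f.readline()
        f.write(json.dumps({'execute': 'query-block-jobs'}) + '\n')
        f.flush()
        return json.loads(f.readline())['return']

def wait_until_maybe_in_sync(sock_path, job_id, interval=0.1):
    # Heuristic only: "ready" with no pending I/O and offset == len might mean
    # there is no background work left -- exactly the part I'm unsure about.
    while True:
        for job in query_block_jobs(sock_path):
            if (job['device'] == job_id and job['ready']
                    and not job['busy'] and job['offset'] == job['len']):
                return
        time.sleep(interval)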

Might still not be perfect, because migration with a lot of RAM (or slow
network) can take a while, so the guest IO would still be bottlenecked
during that period. But I guess at /some/ point it has to be ;)

> 
>> I guess the assert should be avoided in any case. Here's a few ideas
>> that came to mind:
>> 1. migration should fail gracefully
>> 2. migration should wait for the mirror-jobs to become ready before
>> inactivating the block drives - that would increase the downtime in
>> these situations of course
>> 2A. additionally, drive-mirror could be taken into account when
>> converging the migration somehow?
> 
> Does the migration capability 'pause-before-switchover' help you here?
> If enabled, it causes the VM to pause just before the bdrv_inactivate_all
> call (you then use migrate-continue to tell it to carry on).
> 
> Dave
> 

Thank you for the suggestion! Using the capability and canceling the
block job before issuing 'migrate-continue' is an alternative. I'm just
a bit worried about the longer downtime, but maybe it's not too bad.
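
Concretely, I imagine it would look something like this on our side (a
sketch with placeholder names; the waits on events are only indicated by
comments):

import json

# Pause before the switchover, finish the mirror by canceling it while the
# VM is paused (for a "ready" mirror this yields a consistent copy), then
# let the migration carry on.
steps = [
    {'execute': 'migrate-set-capabilities',
     'arguments': {'capabilities': [
         {'capability': 'pause-before-switchover', 'state': True}]}},
    {'execute': 'migrate', 'arguments': {'uri': 'tcp:10.0.0.2:60000'}},
    # ... wait for the MIGRATION event with status 'pre-switchover' ...
    {'execute': 'block-job-cancel', 'arguments': {'device': 'drive-scsi0'}},
    # ... wait for the mirror job to finish ...
    {'execute': 'migrate-continue', 'arguments': {'state': 'pre-switchover'}},
]
print('\n'.join(json.dumps(s) for s in steps))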

Best Regards,
Fiona

>> I noticed the following comment in the mirror implementation
>>>         /* Note that even when no rate limit is applied we need to yield
>>>          * periodically with no pending I/O so that bdrv_drain_all() returns.
>>>          * We do so every BLKOCK_JOB_SLICE_TIME nanoseconds, or when there is
>>>          * an error, or when the source is clean, whichever comes first. */




