* QMP (without OOB) function running in thread different from the main thread as part of aio_poll
@ 2023-04-19 14:09 Fiona Ebner
  2023-04-19 16:59 ` Paolo Bonzini
  2023-04-20  6:11 ` Markus Armbruster
  0 siblings, 2 replies; 21+ messages in thread
From: Fiona Ebner @ 2023-04-19 14:09 UTC (permalink / raw)
  To: QEMU Developers
  Cc: open list:Network Block Dev...,
	michael.roth, Markus Armbruster, Fam Zheng, Stefan Hajnoczi,
	Paolo Bonzini, Thomas Lamprecht

Hi,
while debugging a completely different issue, I was surprised to see
do_qmp_dispatch_bh being run in a vCPU thread. I was under the
impression that QMP functions are supposed to be executed in the main
thread. Is that wrong?

I managed to reproduce the scenario with a build of upstream QEMU
v8.0.0-rc4 once more (again with GDB attached), but it's not a simple
reproducer and it's racy. The backtrace is below[0].

My attempt at explaining the situation is:
1. In qapi/qmp-dispatch.c, the main thread schedules do_qmp_dispatch_bh,
because the coroutine context doesn't match.
2. The host OS switches to the vCPU thread.
3. Something in the vCPU thread triggers aio_poll with the main thread's
AioContext (in my case, a write to a pflash drive).
4. do_qmp_dispatch_bh is executed in the vCPU thread.

Could this be an artifact of running with a debugger?

I CC'ed the maintainers of util/aio-posix.c and qapi/qmp-dispatch.c
hoping that is not too far off.

Best Regards,
Fiona

[0]:
> Thread 5 "CPU 0/KVM" hit Breakpoint 2, do_qmp_dispatch_bh (opaque=0x7ffff2e96e30) at ../qapi/qmp-dispatch.c:124
> 124	    QmpDispatchBH *data = opaque;
> #0  do_qmp_dispatch_bh (opaque=0x7ffff2e96e30) at ../qapi/qmp-dispatch.c:124
> #1  0x000055555604f50a in aio_bh_call (bh=0x7fffe4005430) at ../util/async.c:155
> #2  0x000055555604f615 in aio_bh_poll (ctx=0x555556b3e730) at ../util/async.c:184
> #3  0x00005555560337b8 in aio_poll (ctx=0x555556b3e730, blocking=true) at ../util/aio-posix.c:721
> #4  0x0000555555e8cf1c in bdrv_poll_co (s=0x7ffff1a45eb0) at /home/febner/repos/qemu/block/block-gen.h:43
> #5  0x0000555555e8fc3a in blk_pwrite (blk=0x555556daf840, offset=159232, bytes=512, buf=0x7ffee3226e00, flags=0) at block/block-gen.c:1650
> #6  0x0000555555908078 in pflash_update (pfl=0x555556d92300, offset=159232, size=1) at ../hw/block/pflash_cfi01.c:394
> #7  0x0000555555908749 in pflash_write (pfl=0x555556d92300, offset=159706, value=127, width=1, be=0) at ../hw/block/pflash_cfi01.c:522
> #8  0x0000555555908cda in pflash_mem_write_with_attrs (opaque=0x555556d92300, addr=159706, value=127, len=1, attrs=...) at ../hw/block/pflash_cfi01.c:681
> #9  0x0000555555d8936a in memory_region_write_with_attrs_accessor (mr=0x555556d926c0, addr=159706, value=0x7ffff1a460c8, size=1, shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
> #10 0x0000555555d894a9 in access_with_adjusted_size (addr=159706, value=0x7ffff1a460c8, size=1, access_size_min=1, access_size_max=4, access_fn=0x555555d89270 <memory_region_write_with_attrs_accessor>, mr=0x555556d926c0, attrs=...) at ../softmmu/memory.c:555
> #11 0x0000555555d8c5de in memory_region_dispatch_write (mr=0x555556d926c0, addr=159706, data=127, op=MO_8, attrs=...) at ../softmmu/memory.c:1522
> #12 0x0000555555d996f4 in flatview_write_continue (fv=0x7fffe843cc60, addr=4290932698, attrs=..., ptr=0x7ffff7fc5028, len=1, addr1=159706, l=1, mr=0x555556d926c0) at ../softmmu/physmem.c:2641
> #13 0x0000555555d99857 in flatview_write (fv=0x7fffe843cc60, addr=4290932698, attrs=..., buf=0x7ffff7fc5028, len=1) at ../softmmu/physmem.c:2683
> #14 0x0000555555d99c07 in address_space_write (as=0x555556a01b20 <address_space_memory>, addr=4290932698, attrs=..., buf=0x7ffff7fc5028, len=1) at ../softmmu/physmem.c:2779
> #15 0x0000555555d99c74 in address_space_rw (as=0x555556a01b20 <address_space_memory>, addr=4290932698, attrs=..., buf=0x7ffff7fc5028, len=1, is_write=true) at ../softmmu/physmem.c:2789
> #16 0x0000555555e2da88 in kvm_cpu_exec (cpu=0x555556ea10d0) at ../accel/kvm/kvm-all.c:2989
> #17 0x0000555555e3079a in kvm_vcpu_thread_fn (arg=0x555556ea10d0) at ../accel/kvm/kvm-accel-ops.c:51
> #18 0x000055555603825f in qemu_thread_start (args=0x555556b3c3b0) at ../util/qemu-thread-posix.c:541
> #19 0x00007ffff7125ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
> #20 0x00007ffff62c4a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-19 14:09 QMP (without OOB) function running in thread different from the main thread as part of aio_poll Fiona Ebner
@ 2023-04-19 16:59 ` Paolo Bonzini
  2023-04-20  6:11 ` Markus Armbruster
  1 sibling, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2023-04-19 16:59 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: QEMU Developers, open list:Network Block Dev...,
	Michael Roth, Markus Armbruster, Fam Zheng, Stefan Hajnoczi,
	Thomas Lamprecht


On Wed, 19 Apr 2023 at 16:11, Fiona Ebner <f.ebner@proxmox.com> wrote:

> Hi,
> while debugging a completely different issue, I was surprised to see
> do_qmp_dispatch_bh being run in a vCPU thread. I was under the
> impression that QMP functions are supposed to be executed in the main
> thread. Is that wrong?



While monitor commands in general start only in the main thread, bottom
halves can also run (via aio_poll) in the vCPU threads, always under the
BQL.

If that's not valid, perhaps you can schedule the monitor's bottom half in
iohandler_get_aio_context() instead?
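
For concreteness, a minimal sketch of that idea, assuming the call site
in qmp_dispatch() currently passes qemu_get_aio_context() to
aio_bh_schedule_oneshot() and that "data" is the QmpDispatchBH already
built there (a sketch, not a tested patch):

    /* The iohandler context is only ever polled by the main loop, so
     * this bottom half could no longer run from a vCPU thread's
     * aio_poll(). */
    aio_bh_schedule_oneshot(iohandler_get_aio_context(),
                            do_qmp_dispatch_bh, &data);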

Paolo





* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-19 14:09 QMP (without OOB) function running in thread different from the main thread as part of aio_poll Fiona Ebner
  2023-04-19 16:59 ` Paolo Bonzini
@ 2023-04-20  6:11 ` Markus Armbruster
  2023-04-20  6:55   ` Paolo Bonzini
  1 sibling, 1 reply; 21+ messages in thread
From: Markus Armbruster @ 2023-04-20  6:11 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: QEMU Developers, qemu-block, michael.roth, Fam Zheng,
	Stefan Hajnoczi, Paolo Bonzini, Thomas Lamprecht, Kevin Wolf,
	Peter Xu

Fiona Ebner <f.ebner@proxmox.com> writes:

> Hi,
> while debugging a completely different issue, I was surprised to see
> do_qmp_dispatch_bh being run in a vCPU thread. I was under the
> impression that QMP functions are supposed to be executed in the main
> thread. Is that wrong?
>
> I managed to reproduced the scenario with a build of upstream QEMU
> v8.0.0-rc4 once more (again with GDB open), but it's not a simple
> reproducer and racy. The backtrace is below[0].
>
> My attempt at explaining the situation is:
> 1. In qapi/qmp-dispatch.c, the main thread schedules do_qmp_dispatch_bh,
> because the coroutine context doesn't match.
> 2. The host OS switches to the vCPU thread.
> 3. Something in the vCPU thread triggers aio_poll with the main thread's
> AioContext (in my case, a write to a pflash drive).
> 4. do_qmp_dispatch_bh is executed in the vCPU thread.
>
> Could this be an artifact of running with a debugger?
>
> I CC'ed the maintainers of util/aio-posix.c and qapi/qmp-dispatch.c
> hoping that is not too far off.

Fiona, thanks for reporting this!

The monitor core has become awfully complex.

Instead of trying to explain everything at once, let me recount its
history.  If you're familiar with the monitor, you might want to skip
ahead to "Paolo wrote".

In the beginning, there was just one monitor, and it ran entirely in the
main loop (callback when input is available).  To keep the main loop
going, monitor commands better complete quickly.

Then we got multiple monitors.  Same story, just multiple input streams,
each with a callback.

We also got additional threads.  When I say "main loop", I mean the main
thread's main loop.

"Must complete quickly" means no blocking I/O and such.  Writing code
that way is somewhere between hard and impractical.  Much code called by
monitor commands wasn't written that way.

"Doctor, doctor, running code in the main loop hurts".  Sadly, the
doctor's recommended remedy "don't do that then" is really hard to
apply: a lot of code has been written assuming "running in the main
loop, with the big QEMU lock held".

The first small step towards that remedy was taken to enable the "out-of-band"
feature.  We moved the QMP monitor proper out of the main loop into a
monitor I/O thread.  The monitor commands get funneled to the main loop.
Instead of the main loop calling the monitor when a file descriptor has
input, it now calls the command dispatcher when a funnel queue has a
command.  This is implemented with a bottom half (merge commit
4bdc24fa018, March 2018).

Why bother?  Because now we can execute special "out-of-band" commands
right away, in the I/O thread, regardless of the main loop's state.
Peter Xu wanted this badly enough for postcopy recovery to code it up.
It was hard.  It's not generally useful, as the restrictions on what OOB
commands can do are severe.

The next step was the coroutine feature.  Quite a few of the problematic
monitor commands are actually running coroutine-capable code: when
running in coroutine context, the code yields instead of blocking.
Running such commands in monitor context improves things from "blocks
the main loop" to "blocks all monitor commands".

Sadly, code exists that falls apart in coroutine context.  So we had to
make running in coroutine context opt-in.  Right now only two commands
opt in: block_resize and screendump.  Hopefully, we can get to the point
where most or all do.

Until all do, the dispatcher needs to run some commands in coroutine
context, and others outside coroutine context.  How?

The QMP command dispatcher runs in a coroutine in the main loop (HMP
works differently, but let's ignore it here).

If the command can be executed in coroutine context, the dispatcher
calls its handler as before.  Right now, we take this path just for
block_resize and screendump.

Else, we create a bottom half that calls the handler, and schedule it to
run in the main loop.  Right now, we take this path for all the other
commands.

This if / else is in qmp_dispatch().  The else is the only way to get
do_qmp_dispatch_bh() to run.  Merge commit b7092cda1b3, October 2020.
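
For reference, the shape of that if / else, paraphrased from memory of
the v8.0 code (treat it as a sketch rather than a verbatim excerpt of
qapi/qmp-dispatch.c):

    if (!!(cmd->options & QCO_COROUTINE) == qemu_in_coroutine()) {
        /* Handler is fine with the current (coroutine) context:
         * call it right here, in the dispatcher's coroutine. */
        monitor_set_cur(qemu_coroutine_self(), cur_mon);
        cmd->fn(args, &ret, &err);
        monitor_set_cur(qemu_coroutine_self(), NULL);
    } else {
        /* Bounce the handler to the main loop via a one-shot bottom
         * half and yield until do_qmp_dispatch_bh() wakes us again. */
        QmpDispatchBH data = {
            .cur_mon = cur_mon,
            .cmd = cmd,
            .args = args,
            .ret = &ret,
            .errp = &err,
            .co = qemu_coroutine_self(),
        };
        aio_bh_schedule_oneshot(qemu_get_aio_context(),
                                do_qmp_dispatch_bh, &data);
        qemu_coroutine_yield();
    }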

Paolo wrote:

    While monitor commands in general start only in the main thread,
    bottom halves can also run (via aio_poll) in the vCPU threads,
    always under the BQL.

Today I learned...

So, splicing in a bottom half unmoored monitor commands from the main
loop.  We weren't aware of that, as our commit messages show.

I guess the commands themselves don't care; all they need is the BQL.

However, did we unwittingly change what can get blocked?  Before,
monitor commands could block only the main thread.  Now they can also
block vCPU threads.  Impact?




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-20  6:11 ` Markus Armbruster
@ 2023-04-20  6:55   ` Paolo Bonzini
  2023-04-26 14:31     ` Fiona Ebner
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2023-04-20  6:55 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Fiona Ebner, QEMU Developers, open list:Block layer core,
	Michael Roth, Fam Zheng, Stefan Hajnoczi, Thomas Lamprecht,
	Kevin Wolf, Peter Xu


On Thu, 20 Apr 2023 at 08:11, Markus Armbruster <armbru@redhat.com> wrote:

> So, splicing in a bottom half unmoored monitor commands from the main
> loop.  We weren't aware of that, as our commit messages show.
>
> I guess the commands themselves don't care; all they need is the BQL.
>
> However, did we unwittingly change what can get blocked?  Before,
> monitor commands could block only the main thread.  Now they can also
> block vCPU threads.  Impact?
>

Monitor commands could always block vCPU threads through the BQL(*).
However, aio_poll() only runs in the vCPU threads in very special cases,
typically associated with resetting a device, which causes a blk_drain() on
the device's BlockBackend. So it is not a performance issue.

However, liberal reuse of the main block layer AioContext could indeed be a
*correctness* issue. I need to re-read Fiona's report instead of stopping
at the first three lines because it's the evening. :)

Paolo

(*) Some of you may remember that when migration ran under the BQL it was
also a very effective way to ensure that vCPU threads were slowed down, so
it helped migration converge. Once that hurdle was removed, we had to add
manual sleeping in the vCPU threads!






* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-20  6:55   ` Paolo Bonzini
@ 2023-04-26 14:31     ` Fiona Ebner
  2023-04-27 11:03       ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Fiona Ebner @ 2023-04-26 14:31 UTC (permalink / raw)
  To: Paolo Bonzini, Markus Armbruster
  Cc: QEMU Developers, open list:Block layer core, Michael Roth,
	Fam Zheng, Stefan Hajnoczi, Thomas Lamprecht, Kevin Wolf,
	Peter Xu

On 20.04.23 at 08:55, Paolo Bonzini wrote:
> 
> 
> On Thu, 20 Apr 2023 at 08:11, Markus Armbruster <armbru@redhat.com> wrote:
> 
>     So, splicing in a bottom half unmoored monitor commands from the main
>     loop.  We weren't aware of that, as our commit messages show.
> 
>     I guess the commands themselves don't care; all they need is the BQL.
> 
>     However, did we unwittingly change what can get blocked?  Before,
>     monitor commands could block only the main thread.  Now they can also
>     block vCPU threads.  Impact?
> 
> 
> Monitor commands could always block vCPU threads through the BQL(*).
> However, aio_poll() only runs in the vCPU threads in very special cases;
> typically associated to resetting a device which causes a blk_drain() on
> the device's BlockBackend. So it is not a performance issue.
> 

AFAIU, all generated coroutine wrappers use aio_poll. In my backtrace
aio_poll happens via blk_pwrite for a pflash device. So a bit more often
than "very special cases" ;)

> However, liberal reuse of the main block layer AioContext could indeed
> be a *correctness* issue. I need to re-read Fiona's report instead of
> stopping at the first three lines because it's the evening. :)

For me, being called in a vCPU thread caused problems with a custom QMP
function patched in by Proxmox. The function uses a newly opened
BlockBackend and calls qemu_mutex_unlock_iothread() after which
qemu_get_current_aio_context() returns 0x0 (when running in the main
thread, it still returns the main thread's AioContext). It then calls
blk_pwritev which is also a generated coroutine wrapper and the
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
in the else branch of the AIO_WAIT_WHILE_INTERNAL macro fails.

Sounds like there's room for improvement in our code :/ I'm not aware of
something similar in upstream QEMU.

Thanks to Markus for the detailed history lesson!

Best Regards,
Fiona




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-26 14:31     ` Fiona Ebner
@ 2023-04-27 11:03       ` Kevin Wolf
  2023-04-27 12:27         ` Fiona Ebner
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2023-04-27 11:03 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

On 26.04.2023 at 16:31, Fiona Ebner wrote:
> On 20.04.23 at 08:55, Paolo Bonzini wrote:
> > 
> > 
> > On Thu, 20 Apr 2023 at 08:11, Markus Armbruster <armbru@redhat.com> wrote:
> > 
> >     So, splicing in a bottom half unmoored monitor commands from the main
> >     loop.  We weren't aware of that, as our commit messages show.
> > 
> >     I guess the commands themselves don't care; all they need is the BQL.
> > 
> >     However, did we unwittingly change what can get blocked?  Before,
> >     monitor commands could block only the main thread.  Now they can also
> >     block vCPU threads.  Impact?
> > 
> > 
> > Monitor commands could always block vCPU threads through the BQL(*).
> > However, aio_poll() only runs in the vCPU threads in very special cases;
> > typically associated to resetting a device which causes a blk_drain() on
> > the device's BlockBackend. So it is not a performance issue.
> > 
> 
> AFAIU, all generated coroutine wrappers use aio_poll. In my backtrace
> aio_poll happens via blk_pwrite for a pflash device. So a bit more
> often than "very special cases" ;)

Yes, it's a common thing for devices that start requests from the vcpu
thread when handling I/O (as opposed to devices that use an eventfd or
similar mechanisms).

> > However, liberal reuse of the main block layer AioContext could indeed
> > be a *correctness* issue. I need to re-read Fiona's report instead of
> > stopping at the first three lines because it's the evening. :)
> 
> For me, being called in a vCPU thread caused problems with a custom QMP
> function patched in by Proxmox. The function uses a newly opened
> BlockBackend and calls qemu_mutex_unlock_iothread() after which
> qemu_get_current_aio_context() returns 0x0 (when running in the main
> thread, it still returns the main thread's AioContext). It then calls
> blk_pwritev which is also a generated coroutine wrapper and the
> assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> in the else branch of the AIO_WAIT_WHILE_INTERNAL macro fails.
> 
> Sounds like there's room for improvement in our code :/ I'm not aware
> of something similar in upstream QEMU.

Yes, even if it didn't crash immediately, calling blk_*() without
holding a lock is invalid. In many cases, this is the BQL. If you don't
hold it while calling the function from a vcpu thread, you could run
into races with the main thread, which would probably be very painful to
debug.
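
As a hedged illustration of that rule (write_with_bql() and its
arguments are made up for this sketch; blk_pwrite() and the BQL helpers
are the real ones), a caller outside the main thread would need
something along the lines of:

    /* Sketch only: hold the BQL around blk_*() calls made outside the
     * main thread. */
    static int write_with_bql(BlockBackend *blk, int64_t offset,
                              int64_t bytes, const void *buf)
    {
        int ret;

        qemu_mutex_lock_iothread();
        ret = blk_pwrite(blk, offset, bytes, buf, 0);
        qemu_mutex_unlock_iothread();
        return ret;
    }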

Kevin




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-27 11:03       ` Kevin Wolf
@ 2023-04-27 12:27         ` Fiona Ebner
  2023-04-27 14:36           ` Juan Quintela
  0 siblings, 1 reply; 21+ messages in thread
From: Fiona Ebner @ 2023-04-27 12:27 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu, quintela

On 27.04.23 at 13:03, Kevin Wolf wrote:
> On 26.04.2023 at 16:31, Fiona Ebner wrote:
>> On 20.04.23 at 08:55, Paolo Bonzini wrote:
>>>
>>>
>>> On Thu, 20 Apr 2023 at 08:11, Markus Armbruster <armbru@redhat.com> wrote:
>>>
>>>     So, splicing in a bottom half unmoored monitor commands from the main
>>>     loop.  We weren't aware of that, as our commit messages show.
>>>
>>>     I guess the commands themselves don't care; all they need is the BQL.
>>>
>>>     However, did we unwittingly change what can get blocked?  Before,
>>>     monitor commands could block only the main thread.  Now they can also
>>>     block vCPU threads.  Impact?
>>>
>>>
>>> Monitor commands could always block vCPU threads through the BQL(*).
>>> However, aio_poll() only runs in the vCPU threads in very special cases;
>>> typically associated to resetting a device which causes a blk_drain() on
>>> the device's BlockBackend. So it is not a performance issue.
>>>
>>
>> AFAIU, all generated coroutine wrappers use aio_poll. In my backtrace
>> aio_poll happens via blk_pwrite for a pflash device. So a bit more
>> often than "very special cases" ;)
> 
> Yes, it's a common thing for devices that start requests from the vcpu
> thread when handling I/O (as opposed to devices that use an eventfd or
> similar mechanisms).
> 
>>> However, liberal reuse of the main block layer AioContext could indeed
>>> be a *correctness* issue. I need to re-read Fiona's report instead of
>>> stopping at the first three lines because it's the evening. :)
>>
>> For me, being called in a vCPU thread caused problems with a custom QMP
>> function patched in by Proxmox. The function uses a newly opened
>> BlockBackend and calls qemu_mutex_unlock_iothread() after which
>> qemu_get_current_aio_context() returns 0x0 (when running in the main
>> thread, it still returns the main thread's AioContext). It then calls
>> blk_pwritev which is also a generated coroutine wrapper and the
>> assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>> in the else branch of the AIO_WAIT_WHILE_INTERNAL macro fails.
>>
>> Sounds like there's room for improvement in our code :/ I'm not aware
>> of something similar in upstream QEMU.
> 
> Yes, even if it didn't crash immediately, calling blk_*() without
> holding a lock is invalid. In many cases, this is the BQL. If you don't
> hold it while calling the function from a vcpu thread, you could run
> into races with the main thread, which would probably be very painful to
> debug.
> 
> Kevin
> 

Our function is a custom variant of saving a snapshot and uses
qemu_savevm_state_setup(), which is why the qemu_mutex_unlock_iothread()
is there. I looked for inspiration for how upstream does things and it
turns out that upstream QEMU v8.0.0 has essentially the same issue with
snapshot-save. When snapshot_save_job_bh runs in a vCPU thread instead
of the main thread, the situation is the same: after
qemu_mutex_unlock_iothread(), qemu_get_current_aio_context() will return
0x0 and then the assertion in the AIO_WAIT_WHILE_INTERNAL macro fails
(this time the generated coroutine wrapper is bdrv_writev_vmstate)[0].


So all bottom halves scheduled for the main thread's AioContext can
potentially get to run in a vCPU thread and need to be very careful with
things like qemu_mutex_unlock_iothread.

Is qemu_get_current_aio_context() returning 0x0 expected? I haven't
looked into why it happens yet. Does there need to be a way to drop the
BQL without also giving up the main thread's AioContext or would it be
enough to re-acquire the context?

CC-ing Juan as the migration maintainer.

Best Regards,
Fiona

[0]:
> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f9027b3e537 in __GI_abort () at abort.c:79
> #2  0x00007f9027b3e40f in __assert_fail_base (fmt=0x7f9027cb66a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x558ed44fcec0 "qemu_get_current_aio_context() == qemu_get_aio_context()", file=0x558ed44fce80 "/home/febner/repos/qemu/block/block-gen.h", line=43, function=<optimized out>) at assert.c:92
> #3  0x00007f9027b4d662 in __GI___assert_fail (assertion=0x558ed44fcec0 "qemu_get_current_aio_context() == qemu_get_aio_context()", file=0x558ed44fce80 "/home/febner/repos/qemu/block/block-gen.h", line=43, function=0x558ed44fcf80 <__PRETTY_FUNCTION__.14> "bdrv_poll_co") at assert.c:101
> #4  0x0000558ed412df5f in bdrv_poll_co (s=0x7f8ffcff37a0) at /home/febner/repos/qemu/block/block-gen.h:43
> #5  0x0000558ed412f4cd in bdrv_writev_vmstate (bs=0x558ed60536a0, qiov=0x7f8ffcff3840, pos=0) at block/block-gen.c:809
> #6  0x0000558ed3df36d0 in qio_channel_block_writev (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0, flags=0, errp=0x7f8ffcff39c0) at ../migration/channel-block.c:89
> #7  0x0000558ed40feedb in qio_channel_writev_full (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0, flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:108
> #8  0x0000558ed40ff3c3 in qio_channel_writev_full_all (ioc=0x7f8ff40ac060, iov=0x7f8ff4648040, niov=1, fds=0x0, nfds=0, flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:263
> #9  0x0000558ed40ff2e4 in qio_channel_writev_all (ioc=0x7f8ff40ac060, iov=0x7f8ff4648040, niov=1, errp=0x7f8ffcff39c0) at ../io/channel.c:242
> #10 0x0000558ed3dee4dc in qemu_fflush (f=0x7f8ff4640000) at ../migration/qemu-file.c:302
> #11 0x0000558ed4050f91 in ram_save_setup (f=0x7f8ff4640000, opaque=0x558ed4ca34c0 <ram_state>) at ../migration/ram.c:3302
> #12 0x0000558ed3e141c8 in qemu_savevm_state_setup (f=0x7f8ff4640000) at ../migration/savevm.c:1266
> #13 0x0000558ed3e14eed in qemu_savevm_state (f=0x7f8ff4640000, errp=0x558ed68c5238) at ../migration/savevm.c:1626
> #14 0x0000558ed3e1755e in save_snapshot (name=0x558ed72af790 "snap0", overwrite=false, vmstate=0x558ed6708ce0 "scsi0", has_devices=true, devices=0x558ed66d6a60, errp=0x558ed68c5238) at ../migration/savevm.c:2954
> #15 0x0000558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at ../migration/savevm.c:3253
> #16 0x0000558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at ../util/async.c:155
> #17 0x0000558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at ../util/async.c:184
> #18 0x0000558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at ../util/aio-posix.c:721
> #19 0x0000558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at /home/febner/repos/qemu/block/block-gen.h:43
> #20 0x0000558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60, offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at block/block-gen.c:1650
> #21 0x0000558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394
> #22 0x0000558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30, offset=231232, value=0, width=1, be=0) at ../hw/block/pflash_cfi01.c:522
> #23 0x0000558ed3ba9cda in pflash_mem_write_with_attrs (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at ../hw/block/pflash_cfi01.c:681
> #24 0x0000558ed402a36a in memory_region_write_with_attrs_accessor (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1, shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
> #25 0x0000558ed402a4a9 in access_with_adjusted_size (addr=231232, value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4, access_fn=0x558ed402a270 <memory_region_write_with_attrs_accessor>, mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555
> #26 0x0000558ed402d5de in memory_region_dispatch_write (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at ../softmmu/memory.c:1522
> #27 0x0000558ed403a6f4 in flatview_write_continue (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028, len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at ../softmmu/physmem.c:2641
> #28 0x0000558ed403a857 in flatview_write (fv=0x558ed66d62c0, addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2683
> #29 0x0000558ed403ac07 in address_space_write (as=0x558ed4ca2b20 <address_space_memory>, addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779
> #30 0x0000558ed403ac74 in address_space_rw (as=0x558ed4ca2b20 <address_space_memory>, addr=4291004224, attrs=..., buf=0x7f9029957028, len=1, is_write=true) at ../softmmu/physmem.c:2789
> #31 0x0000558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at ../accel/kvm/kvm-all.c:2989
> #32 0x0000558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at ../accel/kvm/kvm-accel-ops.c:51
> #33 0x0000558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at ../util/qemu-thread-posix.c:541
> #34 0x00007f9028ab7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
> #35 0x00007f9027c18a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95





* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-27 12:27         ` Fiona Ebner
@ 2023-04-27 14:36           ` Juan Quintela
  2023-04-27 14:56             ` Peter Xu
  2023-04-28  7:23             ` Fiona Ebner
  0 siblings, 2 replies; 21+ messages in thread
From: Juan Quintela @ 2023-04-27 14:36 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: Kevin Wolf, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Fiona Ebner <f.ebner@proxmox.com> wrote:
> On 27.04.23 at 13:03, Kevin Wolf wrote:
>> On 26.04.2023 at 16:31, Fiona Ebner wrote:
>>> On 20.04.23 at 08:55, Paolo Bonzini wrote:

Hi

> Our function is a custom variant of saving a snapshot and uses
> qemu_savevm_state_setup(), which is why the qemu_mutex_unlock_iothread()
> is there. I looked for inspiration for how upstream does things and it
> turns out that upstream QEMU v8.0.0 has essentially the same issue with
> snapshot-save. When snapshot_save_job_bh runs in a vCPU thread instead
> of the main thread, the situation is the same: after
> qemu_mutex_unlock_iothread(), qemu_get_current_aio_context() will return
> 0x0 and then the assertion in the AIO_WAIT_WHILE_INTERNAL macro fails
> (this time the generated coroutine wrapper is bdrv_writev_vmstate)[0].
>
>
> So all bottom halves scheduled for the main thread's AioContext can
> potentially get to run in a vCPU thread and need to be very careful with
> things like qemu_mutex_unlock_iothread.
>
> Is qemu_get_current_aio_context() returning 0x0 expected? I haven't
> looked into why it happens yet. Does there need to be a way to drop the
> BQL without also giving up the main thread's AioContext or would it be
> enough to re-acquire the context?
>
> CC-ing Juan as the migration maintainer.

This is the world backwards.
The tradition is that migration people blame block layer people for
breaking things and for help, not the other way around O:-)

> Best Regards,
> Fiona
>
> [0]:
>> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
>> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #1  0x00007f9027b3e537 in __GI_abort () at abort.c:79
>> #2 0x00007f9027b3e40f in __assert_fail_base (fmt=0x7f9027cb66a8
>> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
>> assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
>> qemu_get_aio_context()", file=0x558ed44fce80
>> "/home/febner/repos/qemu/block/block-gen.h", line=43,
>> function=<optimized out>) at assert.c:92
>> #3 0x00007f9027b4d662 in __GI___assert_fail
>> (assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
>> qemu_get_aio_context()", file=0x558ed44fce80
>> "/home/febner/repos/qemu/block/block-gen.h", line=43,
>> function=0x558ed44fcf80 <__PRETTY_FUNCTION__.14> "bdrv_poll_co") at
>> assert.c:101
>> #4  0x0000558ed412df5f in bdrv_poll_co (s=0x7f8ffcff37a0) at /home/febner/repos/qemu/block/block-gen.h:43
>> #5  0x0000558ed412f4cd in bdrv_writev_vmstate (bs=0x558ed60536a0, qiov=0x7f8ffcff3840, pos=0) at block/block-gen.c:809
>> #6 0x0000558ed3df36d0 in qio_channel_block_writev
>> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
>> flags=0, errp=0x7f8ffcff39c0) at ../migration/channel-block.c:89
>> #7 0x0000558ed40feedb in qio_channel_writev_full
>> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
>> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:108
>> #8 0x0000558ed40ff3c3 in qio_channel_writev_full_all
>> (ioc=0x7f8ff40ac060, iov=0x7f8ff4648040, niov=1, fds=0x0, nfds=0,
>> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:263
>> #9 0x0000558ed40ff2e4 in qio_channel_writev_all (ioc=0x7f8ff40ac060,
>> iov=0x7f8ff4648040, niov=1, errp=0x7f8ffcff39c0) at
>> ../io/channel.c:242
>> #10 0x0000558ed3dee4dc in qemu_fflush (f=0x7f8ff4640000) at ../migration/qemu-file.c:302
>> #11 0x0000558ed4050f91 in ram_save_setup (f=0x7f8ff4640000, opaque=0x558ed4ca34c0 <ram_state>) at ../migration/ram.c:3302
>> #12 0x0000558ed3e141c8 in qemu_savevm_state_setup (f=0x7f8ff4640000) at ../migration/savevm.c:1266
>> #13 0x0000558ed3e14eed in qemu_savevm_state (f=0x7f8ff4640000, errp=0x558ed68c5238) at ../migration/savevm.c:1626
>> #14 0x0000558ed3e1755e in save_snapshot (name=0x558ed72af790
>> "snap0", overwrite=false, vmstate=0x558ed6708ce0 "scsi0",
>> has_devices=true, devices=0x558ed66d6a60, errp=0x558ed68c5238) at
>> ../migration/savevm.c:2954
>> #15 0x0000558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at ../migration/savevm.c:3253
>> #16 0x0000558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at ../util/async.c:155
>> #17 0x0000558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at ../util/async.c:184
>> #18 0x0000558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at ../util/aio-posix.c:721
>> #19 0x0000558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at /home/febner/repos/qemu/block/block-gen.h:43
>> #20 0x0000558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60,
>> offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at
>> block/block-gen.c:1650
>> #21 0x0000558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394
>> #22 0x0000558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30,
>> offset=231232, value=0, width=1, be=0) at
>> ../hw/block/pflash_cfi01.c:522
>> #23 0x0000558ed3ba9cda in pflash_mem_write_with_attrs
>> (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at
>> ../hw/block/pflash_cfi01.c:681
>> #24 0x0000558ed402a36a in memory_region_write_with_attrs_accessor
>> (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1,
>> shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
>> #25 0x0000558ed402a4a9 in access_with_adjusted_size (addr=231232,
>> value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4,
>> access_fn=0x558ed402a270 <memory_region_write_with_attrs_accessor>,
>> mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555
>> #26 0x0000558ed402d5de in memory_region_dispatch_write
>> (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at
>> ../softmmu/memory.c:1522
>> #27 0x0000558ed403a6f4 in flatview_write_continue
>> (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028,
>> len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at
>> ../softmmu/physmem.c:2641
>> #28 0x0000558ed403a857 in flatview_write (fv=0x558ed66d62c0,
>> addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at
>> ../softmmu/physmem.c:2683
>> #29 0x0000558ed403ac07 in address_space_write (as=0x558ed4ca2b20
>> <address_space_memory>, addr=4291004224, attrs=...,
>> buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779
>> #30 0x0000558ed403ac74 in address_space_rw (as=0x558ed4ca2b20
>> <address_space_memory>, addr=4291004224, attrs=...,
>> buf=0x7f9029957028, len=1, is_write=true) at
>> ../softmmu/physmem.c:2789
>> #31 0x0000558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at ../accel/kvm/kvm-all.c:2989
>> #32 0x0000558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at ../accel/kvm/kvm-accel-ops.c:51
>> #33 0x0000558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at ../util/qemu-thread-posix.c:541
>> #34 0x00007f9028ab7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
>> #35 0x00007f9027c18a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

To see that I am understading this right:

- you create a thread
- that calls a memory_region operation
- that calls a device write function
- that calls the block layer
- that creates a snapshot
- that calls the migration code
- that calls the block layer again

Without further investigation, I have no clue what is going on here,
sorry.

Later, Juan.




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-27 14:36           ` Juan Quintela
@ 2023-04-27 14:56             ` Peter Xu
  2023-04-28  7:53               ` Fiona Ebner
  2023-04-28  7:23             ` Fiona Ebner
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-04-27 14:56 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Fiona Ebner, Kevin Wolf, Paolo Bonzini, Markus Armbruster,
	QEMU Developers, open list:Block layer core, Michael Roth,
	Fam Zheng, Stefan Hajnoczi, Thomas Lamprecht

On Thu, Apr 27, 2023 at 04:36:14PM +0200, Juan Quintela wrote:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
> > On 27.04.23 at 13:03, Kevin Wolf wrote:
> >> On 26.04.2023 at 16:31, Fiona Ebner wrote:
> >>> On 20.04.23 at 08:55, Paolo Bonzini wrote:
> 
> Hi
> 
> > Our function is a custom variant of saving a snapshot and uses
> > qemu_savevm_state_setup(), which is why the qemu_mutex_unlock_iothread()
> > is there. I looked for inspiration for how upstream does things and it
> > turns out that upstream QEMU v8.0.0 has essentially the same issue with
> > snapshot-save. When snapshot_save_job_bh runs in a vCPU thread instead
> > of the main thread, the situation is the same: after
> > qemu_mutex_unlock_iothread(), qemu_get_current_aio_context() will return
> > 0x0 and then the assertion in the AIO_WAIT_WHILE_INTERNAL macro fails
> > (this time the generated coroutine wrapper is bdrv_writev_vmstate)[0].
> >
> >
> > So all bottom halves scheduled for the main thread's AioContext can
> > potentially get to run in a vCPU thread and need to be very careful with
> > things like qemu_mutex_unlock_iothread.
> >
> > Is qemu_get_current_aio_context() returning 0x0 expected? I haven't
> > looked into why it happens yet. Does there need to be a way to drop the
> > BQL without also giving up the main thread's AioContext or would it be
> > enough to re-acquire the context?
> >
> > CC-ing Juan as the migration maintainer.
> 
> This is the world backwards.
> The tradition is that migration people blame block layer people for
> breaking things and for help, not the other way around O:-)
> 
> > Best Regards,
> > Fiona
> >
> > [0]:
> >> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
> >> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> >> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> >> (gdb) bt
> >> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> >> #1  0x00007f9027b3e537 in __GI_abort () at abort.c:79
> >> #2 0x00007f9027b3e40f in __assert_fail_base (fmt=0x7f9027cb66a8
> >> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
> >> assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
> >> qemu_get_aio_context()", file=0x558ed44fce80
> >> "/home/febner/repos/qemu/block/block-gen.h", line=43,
> >> function=<optimized out>) at assert.c:92
> >> #3 0x00007f9027b4d662 in __GI___assert_fail
> >> (assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
> >> qemu_get_aio_context()", file=0x558ed44fce80
> >> "/home/febner/repos/qemu/block/block-gen.h", line=43,
> >> function=0x558ed44fcf80 <__PRETTY_FUNCTION__.14> "bdrv_poll_co") at
> >> assert.c:101
> >> #4  0x0000558ed412df5f in bdrv_poll_co (s=0x7f8ffcff37a0) at /home/febner/repos/qemu/block/block-gen.h:43
> >> #5  0x0000558ed412f4cd in bdrv_writev_vmstate (bs=0x558ed60536a0, qiov=0x7f8ffcff3840, pos=0) at block/block-gen.c:809
> >> #6 0x0000558ed3df36d0 in qio_channel_block_writev
> >> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
> >> flags=0, errp=0x7f8ffcff39c0) at ../migration/channel-block.c:89
> >> #7 0x0000558ed40feedb in qio_channel_writev_full
> >> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
> >> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:108
> >> #8 0x0000558ed40ff3c3 in qio_channel_writev_full_all
> >> (ioc=0x7f8ff40ac060, iov=0x7f8ff4648040, niov=1, fds=0x0, nfds=0,
> >> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:263
> >> #9 0x0000558ed40ff2e4 in qio_channel_writev_all (ioc=0x7f8ff40ac060,
> >> iov=0x7f8ff4648040, niov=1, errp=0x7f8ffcff39c0) at
> >> ../io/channel.c:242
> >> #10 0x0000558ed3dee4dc in qemu_fflush (f=0x7f8ff4640000) at ../migration/qemu-file.c:302
> >> #11 0x0000558ed4050f91 in ram_save_setup (f=0x7f8ff4640000, opaque=0x558ed4ca34c0 <ram_state>) at ../migration/ram.c:3302
> >> #12 0x0000558ed3e141c8 in qemu_savevm_state_setup (f=0x7f8ff4640000) at ../migration/savevm.c:1266
> >> #13 0x0000558ed3e14eed in qemu_savevm_state (f=0x7f8ff4640000, errp=0x558ed68c5238) at ../migration/savevm.c:1626
> >> #14 0x0000558ed3e1755e in save_snapshot (name=0x558ed72af790
> >> "snap0", overwrite=false, vmstate=0x558ed6708ce0 "scsi0",
> >> has_devices=true, devices=0x558ed66d6a60, errp=0x558ed68c5238) at
> >> ../migration/savevm.c:2954
> >> #15 0x0000558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at ../migration/savevm.c:3253
> >> #16 0x0000558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at ../util/async.c:155
> >> #17 0x0000558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at ../util/async.c:184
> >> #18 0x0000558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at ../util/aio-posix.c:721
> >> #19 0x0000558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at /home/febner/repos/qemu/block/block-gen.h:43
> >> #20 0x0000558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60,
> >> offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at
> >> block/block-gen.c:1650
> >> #21 0x0000558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394
> >> #22 0x0000558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30,
> >> offset=231232, value=0, width=1, be=0) at
> >> ../hw/block/pflash_cfi01.c:522
> >> #23 0x0000558ed3ba9cda in pflash_mem_write_with_attrs
> >> (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at
> >> ../hw/block/pflash_cfi01.c:681
> >> #24 0x0000558ed402a36a in memory_region_write_with_attrs_accessor
> >> (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1,
> >> shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
> >> #25 0x0000558ed402a4a9 in access_with_adjusted_size (addr=231232,
> >> value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4,
> >> access_fn=0x558ed402a270 <memory_region_write_with_attrs_accessor>,
> >> mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555
> >> #26 0x0000558ed402d5de in memory_region_dispatch_write
> >> (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at
> >> ../softmmu/memory.c:1522
> >> #27 0x0000558ed403a6f4 in flatview_write_continue
> >> (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028,
> >> len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at
> >> ../softmmu/physmem.c:2641
> >> #28 0x0000558ed403a857 in flatview_write (fv=0x558ed66d62c0,
> >> addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at
> >> ../softmmu/physmem.c:2683
> >> #29 0x0000558ed403ac07 in address_space_write (as=0x558ed4ca2b20
> >> <address_space_memory>, addr=4291004224, attrs=...,
> >> buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779
> >> #30 0x0000558ed403ac74 in address_space_rw (as=0x558ed4ca2b20
> >> <address_space_memory>, addr=4291004224, attrs=...,
> >> buf=0x7f9029957028, len=1, is_write=true) at
> >> ../softmmu/physmem.c:2789
> >> #31 0x0000558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at ../accel/kvm/kvm-all.c:2989
> >> #32 0x0000558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at ../accel/kvm/kvm-accel-ops.c:51
> >> #33 0x0000558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at ../util/qemu-thread-posix.c:541
> >> #34 0x00007f9028ab7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
> >> #35 0x00007f9027c18a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Totally unfamiliar with block jobs, but... does it mean that
snapshot_*_job_bh()s should just always make sure BQL taken?
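
Purely as a sketch of what that could mean (do_snapshot_work() is a
made-up placeholder, not an upstream function; the BQL helpers are the
real ones):

    /* Placeholder for the real body (save_snapshot() and friends). */
    static void do_snapshot_work(void *opaque)
    {
        (void)opaque;
    }

    static void snapshot_save_job_bh(void *opaque)
    {
        bool had_bql = qemu_mutex_iothread_locked();

        if (!had_bql) {
            qemu_mutex_lock_iothread();
        }
        do_snapshot_work(opaque);
        if (!had_bql) {
            qemu_mutex_unlock_iothread();
        }
    }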

Thanks,

-- 
Peter Xu




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-27 14:36           ` Juan Quintela
  2023-04-27 14:56             ` Peter Xu
@ 2023-04-28  7:23             ` Fiona Ebner
  2023-04-28  7:47               ` Juan Quintela
  1 sibling, 1 reply; 21+ messages in thread
From: Fiona Ebner @ 2023-04-28  7:23 UTC (permalink / raw)
  To: quintela
  Cc: Kevin Wolf, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

On 27.04.23 at 16:36, Juan Quintela wrote:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
>> On 27.04.23 at 13:03, Kevin Wolf wrote:
>>> On 26.04.2023 at 16:31, Fiona Ebner wrote:
>>>> On 20.04.23 at 08:55, Paolo Bonzini wrote:
> 
> Hi
> 
>> Our function is a custom variant of saving a snapshot and uses
>> qemu_savevm_state_setup(), which is why the qemu_mutex_unlock_iothread()
>> is there. I looked for inspiration for how upstream does things and it
>> turns out that upstream QEMU v8.0.0 has essentially the same issue with
>> snapshot-save. When snapshot_save_job_bh runs in a vCPU thread instead
>> of the main thread, the situation is the same: after
>> qemu_mutex_unlock_iothread(), qemu_get_current_aio_context() will return
>> 0x0 and then the assertion in the AIO_WAIT_WHILE_INTERNAL macro fails
>> (this time the generated coroutine wrapper is bdrv_writev_vmstate)[0].
>>
>>
>> So all bottom halves scheduled for the main thread's AioContext can
>> potentially get to run in a vCPU thread and need to be very careful with
>> things like qemu_mutex_unlock_iothread.
>>
>> Is qemu_get_current_aio_context() returning 0x0 expected? I haven't
>> looked into why it happens yet. Does there need to be a way to drop the
>> BQL without also giving up the main thread's AioContext or would it be
>> enough to re-acquire the context?
>>
>> CC-ing Juan as the migration maintainer.
> 
> This is the world backwards.
> The tradition is that migration people blame block layer people for
> breaking things and for help, not the other way around O:-)

Sorry if I didn't provide enough context/explanation. See below for my
attempt to re-iterate. I CC'ed you, because the issue happens as part of
snapshot-save and in particular the qemu_mutex_unlock_iothread call in
qemu_savevm_state is one of the ingredients leading to the problem.

> 
>> Best Regards,
>> Fiona
>>
>> [0]:
>>> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
>>> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>>> (gdb) bt
>>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> #1  0x00007f9027b3e537 in __GI_abort () at abort.c:79
>>> #2 0x00007f9027b3e40f in __assert_fail_base (fmt=0x7f9027cb66a8
>>> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
>>> assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
>>> qemu_get_aio_context()", file=0x558ed44fce80
>>> "/home/febner/repos/qemu/block/block-gen.h", line=43,
>>> function=<optimized out>) at assert.c:92
>>> #3 0x00007f9027b4d662 in __GI___assert_fail
>>> (assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
>>> qemu_get_aio_context()", file=0x558ed44fce80
>>> "/home/febner/repos/qemu/block/block-gen.h", line=43,
>>> function=0x558ed44fcf80 <__PRETTY_FUNCTION__.14> "bdrv_poll_co") at
>>> assert.c:101
>>> #4  0x0000558ed412df5f in bdrv_poll_co (s=0x7f8ffcff37a0) at /home/febner/repos/qemu/block/block-gen.h:43
>>> #5  0x0000558ed412f4cd in bdrv_writev_vmstate (bs=0x558ed60536a0, qiov=0x7f8ffcff3840, pos=0) at block/block-gen.c:809
>>> #6 0x0000558ed3df36d0 in qio_channel_block_writev
>>> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
>>> flags=0, errp=0x7f8ffcff39c0) at ../migration/channel-block.c:89
>>> #7 0x0000558ed40feedb in qio_channel_writev_full
>>> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
>>> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:108
>>> #8 0x0000558ed40ff3c3 in qio_channel_writev_full_all
>>> (ioc=0x7f8ff40ac060, iov=0x7f8ff4648040, niov=1, fds=0x0, nfds=0,
>>> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:263
>>> #9 0x0000558ed40ff2e4 in qio_channel_writev_all (ioc=0x7f8ff40ac060,
>>> iov=0x7f8ff4648040, niov=1, errp=0x7f8ffcff39c0) at
>>> ../io/channel.c:242
>>> #10 0x0000558ed3dee4dc in qemu_fflush (f=0x7f8ff4640000) at ../migration/qemu-file.c:302
>>> #11 0x0000558ed4050f91 in ram_save_setup (f=0x7f8ff4640000, opaque=0x558ed4ca34c0 <ram_state>) at ../migration/ram.c:3302
>>> #12 0x0000558ed3e141c8 in qemu_savevm_state_setup (f=0x7f8ff4640000) at ../migration/savevm.c:1266
>>> #13 0x0000558ed3e14eed in qemu_savevm_state (f=0x7f8ff4640000, errp=0x558ed68c5238) at ../migration/savevm.c:1626
>>> #14 0x0000558ed3e1755e in save_snapshot (name=0x558ed72af790
>>> "snap0", overwrite=false, vmstate=0x558ed6708ce0 "scsi0",
>>> has_devices=true, devices=0x558ed66d6a60, errp=0x558ed68c5238) at
>>> ../migration/savevm.c:2954
>>> #15 0x0000558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at ../migration/savevm.c:3253
>>> #16 0x0000558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at ../util/async.c:155
>>> #17 0x0000558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at ../util/async.c:184
>>> #18 0x0000558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at ../util/aio-posix.c:721
>>> #19 0x0000558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at /home/febner/repos/qemu/block/block-gen.h:43
>>> #20 0x0000558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60,
>>> offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at
>>> block/block-gen.c:1650
>>> #21 0x0000558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394
>>> #22 0x0000558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30,
>>> offset=231232, value=0, width=1, be=0) at
>>> ../hw/block/pflash_cfi01.c:522
>>> #23 0x0000558ed3ba9cda in pflash_mem_write_with_attrs
>>> (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at
>>> ../hw/block/pflash_cfi01.c:681
>>> #24 0x0000558ed402a36a in memory_region_write_with_attrs_accessor
>>> (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1,
>>> shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
>>> #25 0x0000558ed402a4a9 in access_with_adjusted_size (addr=231232,
>>> value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4,
>>> access_fn=0x558ed402a270 <memory_region_write_with_attrs_accessor>,
>>> mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555
>>> #26 0x0000558ed402d5de in memory_region_dispatch_write
>>> (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at
>>> ../softmmu/memory.c:1522
>>> #27 0x0000558ed403a6f4 in flatview_write_continue
>>> (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028,
>>> len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at
>>> ../softmmu/physmem.c:2641
>>> #28 0x0000558ed403a857 in flatview_write (fv=0x558ed66d62c0,
>>> addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at
>>> ../softmmu/physmem.c:2683
>>> #29 0x0000558ed403ac07 in address_space_write (as=0x558ed4ca2b20
>>> <address_space_memory>, addr=4291004224, attrs=...,
>>> buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779
>>> #30 0x0000558ed403ac74 in address_space_rw (as=0x558ed4ca2b20
>>> <address_space_memory>, addr=4291004224, attrs=...,
>>> buf=0x7f9029957028, len=1, is_write=true) at
>>> ../softmmu/physmem.c:2789
>>> #31 0x0000558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at ../accel/kvm/kvm-all.c:2989
>>> #32 0x0000558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at ../accel/kvm/kvm-accel-ops.c:51
>>> #33 0x0000558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at ../util/qemu-thread-posix.c:541
>>> #34 0x00007f9028ab7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
>>> #35 0x00007f9027c18a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> 
> To see that I am understading this right:
> 
> - you create a thread
> - that calls a memory_region operation
> - that calls a device write function
> - that calls the block layer
> - that creates a snapshot
> - that calls the migration code
> - that calls the block layer again
> 
> Without further investigation, I have no clue what is going on here,
> sorry.
> 
> Later, Juan.
> 

All I'm doing is using QEMU (a build of upstream's v8.0.0) in intended
ways, I promise! In particular, I'm doing two things at the same time
repeatedly:
1. Write to a pflash drive from within the guest.
2. Issue a snapshot-save QMP command (in a way that doesn't lead to an
early error).

(I actually also used a debugger to break on pflash_update and
snapshot_save_job_bh, manually continuing until I triggered the
problematic situation. It's very racy, because it depends on the host OS
to switch threads at the correct time.)

Now we need to be aware of two things:
1. As discussed earlier in the mail thread, if the host OS switches
threads at an inconvenient time, it can happen that a bottom half
scheduled for the main thread's AioContext can be executed as part of a
vCPU thread's aio_poll.
2. Generated coroutine wrappers for block layer functions spawn the
coroutine and use AIO_WAIT_WHILE/aio_poll to wait for it to finish.

What happens in the backtrace above is:
1. The write to the pflash drive uses blk_pwrite which leads to an
aio_poll in the vCPU thread.
2. The snapshot_save_job_bh bottom half, that was scheduled for the main
thread's AioContext, is executed as part of the vCPU thread's aio_poll.
3. qemu_savevm_state is called.
4. qemu_mutex_unlock_iothread is called. Now
qemu_get_current_aio_context returns 0x0 (see the sketch after this
list for why). Usually, snapshot_save_job_bh runs in the main thread,
in which case qemu_get_current_aio_context still returns the main
thread's AioContext at this point.
5. bdrv_writev_vmstate is executed as part of the usual savevm setup.
6. bdrv_writev_vmstate is a generated coroutine wrapper, so it uses
AIO_WAIT_WHILE.
7. The assertion to have the main thread's AioContext inside the
AIO_WAIT_WHILE macro fails.
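
For reference, the 0x0 in step 4 seems to come from the thread-local
lookup in qemu_get_current_aio_context(); roughly, from memory of the
v8.0 util/async.c (so treat this as a sketch, not a verbatim quote):

    AioContext *qemu_get_current_aio_context(void)
    {
        /* Thread-local; set for the main thread and iothreads, but not
         * for vCPU threads. */
        AioContext *ctx = get_my_aiocontext();

        if (ctx) {
            return ctx;
        }
        if (qemu_mutex_iothread_locked()) {
            /* Possibly a vCPU thread that holds the BQL. */
            return qemu_get_aio_context();
        }
        /* E.g. a vCPU thread after qemu_mutex_unlock_iothread(). */
        return NULL;
    }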

Best Regards,
Fiona




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-28  7:23             ` Fiona Ebner
@ 2023-04-28  7:47               ` Juan Quintela
  2023-04-28  8:30                 ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Quintela @ 2023-04-28  7:47 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: Kevin Wolf, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Fiona Ebner <f.ebner@proxmox.com> wrote:
> On 27.04.23 at 16:36, Juan Quintela wrote:
>> Fiona Ebner <f.ebner@proxmox.com> wrote:
>>> On 27.04.23 at 13:03, Kevin Wolf wrote:
>>>> On 26.04.2023 at 16:31, Fiona Ebner wrote:
>>>>> On 20.04.23 at 08:55, Paolo Bonzini wrote:
>> 
>> Hi
>> 
>>> Our function is a custom variant of saving a snapshot and uses
>>> qemu_savevm_state_setup(), which is why the qemu_mutex_unlock_iothread()
>>> is there. I looked for inspiration for how upstream does things and it
>>> turns out that upstream QEMU v8.0.0 has essentially the same issue with
>>> snapshot-save. When snapshot_save_job_bh runs in a vCPU thread instead
>>> of the main thread, the situation is the same: after
>>> qemu_mutex_unlock_iothread(), qemu_get_current_aio_context() will return
>>> 0x0 and then the assertion in the AIO_WAIT_WHILE_INTERNAL macro fails
>>> (this time the generated coroutine wrapper is bdrv_writev_vmstate)[0].
>>>
>>>
>>> So all bottom halves scheduled for the main thread's AioContext can
>>> potentially get to run in a vCPU thread and need to be very careful with
>>> things like qemu_mutex_unlock_iothread.
>>>
>>> Is qemu_get_current_aio_context() returning 0x0 expected? I haven't
>>> looked into why it happens yet. Does there need to be a way to drop the
>>> BQL without also giving up the main thread's AioContext or would it be
>>> enough to re-acquire the context?
>>>
>>> CC-ing Juan as the migration maintainer.
>> 
>> This is the world backwards.
>> The tradition is that migration people blame block layer people for
>> breaking things and for help, not the other way around O:-)
>
> Sorry, if I didn't provide enough context/explanation. See below for my
> attempt to re-iterate. I CC'ed you, because the issue happens as part of
> snapshot-save and in particular the qemu_mutex_unlock_iothread call in
> qemu_savevm_state is one of the ingredients leading to the problem.

This was a joke O:-)

>> To see that I am understading this right:
>> 
>> - you create a thread
>> - that calls a memory_region operation
>> - that calls a device write function
>> - that calls the block layer
>> - that creates a snapshot
>> - that calls the migration code
>> - that calls the block layer again
>> 
>> Without further investigation, I have no clue what is going on here,
>> sorry.
>> 
>> Later, Juan.
>> 
>
> All I'm doing is using QEMU (a build of upstream's v8.0.0) in intended
> ways, I promise! In particular, I'm doing two things at the same time
> repeatedly:
> 1. Write to a pflash drive from within the guest.
> 2. Issue a snapshot-save QMP command (in a way that doesn't lead to an
> early error).
>
> (I actually also used a debugger to break on pflash_update and
> snapshot_save_job_bh, manually continuing until I triggered the
> problematic situation. It's very racy, because it depends on the host OS
> to switch threads at the correct time.)

I think the block layer is the problem (famous last words)

>
> Now we need to be aware of two things:
> 1. As discussed earlier in the mail thread, if the host OS switches
> threads at an inconvenient time, it can happen that a bottom half
> scheduled for the main thread's AioContext can be executed as part of a
> vCPU thread's aio_poll.
> 2. Generated coroutine wrappers for block layer functions spawn the
> coroutine and use AIO_WAIT_WHILE/aio_poll to wait for it to finish.
>
> What happens in the backtrace above is:
> 1. The write to the pflash drive uses blk_pwrite which leads to an
> aio_poll in the vCPU thread.
> 2. The snapshot_save_job_bh bottom half, that was scheduled for the main
> thread's AioContext, is executed as part of the vCPU thread's aio_poll.
> 3. qemu_savevm_state is called.
> 4. qemu_mutex_unlock_iothread is called. Now
> qemu_get_current_aio_context returns 0x0. Usually, snapshot_save_job_bh
> runs in the main thread, in which case qemu_get_current_aio_context
> still returns the main thread's AioContext at this point.

I am perhaps a bit naive here, but is there a way to convince qemu
that snapshot_save_job_bh *HAS* to run on the main thread?

> 5. bdrv_writev_vmstate is executed as part of the usual savevm setup.
> 6. bdrv_writev_vmstate is a generated coroutine wrapper, so it uses
> AIO_WAIT_WHILE.
> 7. The assertion to have the main thread's AioContext inside the
> AIO_WAIT_WHILE macro fails.

several problems here:
a- There is no "test/qtest/snapshot-test" around
   Hint, Hint.
   Basically snapshots are the bastard sibling of migration, and nobody
   really tests them.
b- In an ideal world, migration shouldn't need a bottom_handler
   Remember, it has its own thread.  But there are functions that can
   only be called from the main thread.  And no, I don't remember
   which, I just try very hard not to touch that part of the code.
c- At that point the vcpus are stopped, so for migration it doesn't
   matter a lot(*) to have to use a bottom handler.
d- snapshots are a completely different beast that doesn't really stop
   the guest in the same way at that point, and sometimes it shows in
   these subtle details.

I am sorry that I can't be of more help.  I still think that the block
layer people should have a clue about what is going on here.

Later, Juan.

*: That is not completely true.  There is no way to get back from the
   bottom handler to the iterative stage of migration.  If something
   happens at that point, we just have to cancel migration. But we will
   leave this discussion for another day.




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-27 14:56             ` Peter Xu
@ 2023-04-28  7:53               ` Fiona Ebner
  0 siblings, 0 replies; 21+ messages in thread
From: Fiona Ebner @ 2023-04-28  7:53 UTC (permalink / raw)
  To: Peter Xu, Juan Quintela
  Cc: Kevin Wolf, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht

Am 27.04.23 um 16:56 schrieb Peter Xu:
> On Thu, Apr 27, 2023 at 04:36:14PM +0200, Juan Quintela wrote:
>> Fiona Ebner <f.ebner@proxmox.com> wrote:
>>> Am 27.04.23 um 13:03 schrieb Kevin Wolf:
>>>> Am 26.04.2023 um 16:31 hat Fiona Ebner geschrieben:
>>>>> Am 20.04.23 um 08:55 schrieb Paolo Bonzini:
>>
>> Hi
>>
>>> Our function is a custom variant of saving a snapshot and uses
>>> qemu_savevm_state_setup(), which is why the qemu_mutex_unlock_iothread()
>>> is there. I looked for inspiration for how upstream does things and it
>>> turns out that upstream QEMU v8.0.0 has essentially the same issue with
>>> snapshot-save. When snapshot_save_job_bh runs in a vCPU thread instead
>>> of the main thread, the situation is the same: after
>>> qemu_mutex_unlock_iothread(), qemu_get_current_aio_context() will return
>>> 0x0 and then the assertion in the AIO_WAIT_WHILE_INTERNAL macro fails
>>> (this time the generated coroutine wrapper is bdrv_writev_vmstate)[0].
>>>
>>>
>>> So all bottom halves scheduled for the main thread's AioContext can
>>> potentially get to run in a vCPU thread and need to be very careful with
>>> things like qemu_mutex_unlock_iothread.
>>>
>>> Is qemu_get_current_aio_context() returning 0x0 expected? I haven't
>>> looked into why it happens yet. Does there need to be a way to drop the
>>> BQL without also giving up the main thread's AioContext or would it be
>>> enough to re-acquire the context?
>>>
>>> CC-ing Juan as the migration maintainer.
>>
>> This is the world backwards.
>> The tradition is that migration people blame block layer people for
>> breaking things and for help, not the other way around O:-)
>>
>>> Best Regards,
>>> Fiona
>>>
>>> [0]:
>>>> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
>>>> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>>> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>>>> (gdb) bt
>>>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>>> #1  0x00007f9027b3e537 in __GI_abort () at abort.c:79
>>>> #2 0x00007f9027b3e40f in __assert_fail_base (fmt=0x7f9027cb66a8
>>>> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
>>>> assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
>>>> qemu_get_aio_context()", file=0x558ed44fce80
>>>> "/home/febner/repos/qemu/block/block-gen.h", line=43,
>>>> function=<optimized out>) at assert.c:92
>>>> #3 0x00007f9027b4d662 in __GI___assert_fail
>>>> (assertion=0x558ed44fcec0 "qemu_get_current_aio_context() ==
>>>> qemu_get_aio_context()", file=0x558ed44fce80
>>>> "/home/febner/repos/qemu/block/block-gen.h", line=43,
>>>> function=0x558ed44fcf80 <__PRETTY_FUNCTION__.14> "bdrv_poll_co") at
>>>> assert.c:101
>>>> #4  0x0000558ed412df5f in bdrv_poll_co (s=0x7f8ffcff37a0) at /home/febner/repos/qemu/block/block-gen.h:43
>>>> #5  0x0000558ed412f4cd in bdrv_writev_vmstate (bs=0x558ed60536a0, qiov=0x7f8ffcff3840, pos=0) at block/block-gen.c:809
>>>> #6 0x0000558ed3df36d0 in qio_channel_block_writev
>>>> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
>>>> flags=0, errp=0x7f8ffcff39c0) at ../migration/channel-block.c:89
>>>> #7 0x0000558ed40feedb in qio_channel_writev_full
>>>> (ioc=0x7f8ff40ac060, iov=0x7f8ff43f6350, niov=1, fds=0x0, nfds=0,
>>>> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:108
>>>> #8 0x0000558ed40ff3c3 in qio_channel_writev_full_all
>>>> (ioc=0x7f8ff40ac060, iov=0x7f8ff4648040, niov=1, fds=0x0, nfds=0,
>>>> flags=0, errp=0x7f8ffcff39c0) at ../io/channel.c:263
>>>> #9 0x0000558ed40ff2e4 in qio_channel_writev_all (ioc=0x7f8ff40ac060,
>>>> iov=0x7f8ff4648040, niov=1, errp=0x7f8ffcff39c0) at
>>>> ../io/channel.c:242
>>>> #10 0x0000558ed3dee4dc in qemu_fflush (f=0x7f8ff4640000) at ../migration/qemu-file.c:302
>>>> #11 0x0000558ed4050f91 in ram_save_setup (f=0x7f8ff4640000, opaque=0x558ed4ca34c0 <ram_state>) at ../migration/ram.c:3302
>>>> #12 0x0000558ed3e141c8 in qemu_savevm_state_setup (f=0x7f8ff4640000) at ../migration/savevm.c:1266
>>>> #13 0x0000558ed3e14eed in qemu_savevm_state (f=0x7f8ff4640000, errp=0x558ed68c5238) at ../migration/savevm.c:1626
>>>> #14 0x0000558ed3e1755e in save_snapshot (name=0x558ed72af790
>>>> "snap0", overwrite=false, vmstate=0x558ed6708ce0 "scsi0",
>>>> has_devices=true, devices=0x558ed66d6a60, errp=0x558ed68c5238) at
>>>> ../migration/savevm.c:2954
>>>> #15 0x0000558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at ../migration/savevm.c:3253
>>>> #16 0x0000558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at ../util/async.c:155
>>>> #17 0x0000558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at ../util/async.c:184
>>>> #18 0x0000558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at ../util/aio-posix.c:721
>>>> #19 0x0000558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at /home/febner/repos/qemu/block/block-gen.h:43
>>>> #20 0x0000558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60,
>>>> offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at
>>>> block/block-gen.c:1650
>>>> #21 0x0000558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394
>>>> #22 0x0000558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30,
>>>> offset=231232, value=0, width=1, be=0) at
>>>> ../hw/block/pflash_cfi01.c:522
>>>> #23 0x0000558ed3ba9cda in pflash_mem_write_with_attrs
>>>> (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at
>>>> ../hw/block/pflash_cfi01.c:681
>>>> #24 0x0000558ed402a36a in memory_region_write_with_attrs_accessor
>>>> (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1,
>>>> shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
>>>> #25 0x0000558ed402a4a9 in access_with_adjusted_size (addr=231232,
>>>> value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4,
>>>> access_fn=0x558ed402a270 <memory_region_write_with_attrs_accessor>,
>>>> mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555
>>>> #26 0x0000558ed402d5de in memory_region_dispatch_write
>>>> (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at
>>>> ../softmmu/memory.c:1522
>>>> #27 0x0000558ed403a6f4 in flatview_write_continue
>>>> (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028,
>>>> len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at
>>>> ../softmmu/physmem.c:2641
>>>> #28 0x0000558ed403a857 in flatview_write (fv=0x558ed66d62c0,
>>>> addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at
>>>> ../softmmu/physmem.c:2683
>>>> #29 0x0000558ed403ac07 in address_space_write (as=0x558ed4ca2b20
>>>> <address_space_memory>, addr=4291004224, attrs=...,
>>>> buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779
>>>> #30 0x0000558ed403ac74 in address_space_rw (as=0x558ed4ca2b20
>>>> <address_space_memory>, addr=4291004224, attrs=...,
>>>> buf=0x7f9029957028, len=1, is_write=true) at
>>>> ../softmmu/physmem.c:2789
>>>> #31 0x0000558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at ../accel/kvm/kvm-all.c:2989
>>>> #32 0x0000558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at ../accel/kvm/kvm-accel-ops.c:51
>>>> #33 0x0000558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at ../util/qemu-thread-posix.c:541
>>>> #34 0x00007f9028ab7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
>>>> #35 0x00007f9027c18a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> 
> Totally unfamiliar with block jobs, but... does it mean that
> snapshot_*_job_bh()s should just always make sure BQL taken?
> 

It is taken in the beginning, but qemu_savevm_state drops it. I suppose
there is a good reason for it, because when patching out the
qemu_mutex_unlock_iothread call, I get
../softmmu/cpus.c:503:qemu_mutex_lock_iothread_impl: assertion failed:
(!qemu_mutex_iothread_locked()) when using the snapshot-save QMP command.

To avoid the assertion failure from the backtrace, I think it would be
enough to ensure that the main thread's AioContext is still held between
the qemu_mutex_unlock_iothread and qemu_mutex_lock_iothread calls (which
currently is not the case when snapshot_save_job_bh runs in a vCPU
thread). But I'm not sure if that is enough for full correctness.
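
To make the problematic window concrete, the pattern in qemu_savevm_state()
is roughly the following (simplified from memory, not an exact quote of the
upstream code):

    qemu_mutex_unlock_iothread();
    /* From here on, if we are in a vCPU thread,
     * qemu_get_current_aio_context() returns NULL, so any generated
     * co-wrapper reached during setup (e.g. bdrv_writev_vmstate() via
     * qemu_fflush()) trips its assertion. */
    qemu_savevm_state_header(f);
    qemu_savevm_state_setup(f);
    qemu_mutex_lock_iothread();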

Best Regards,
Fiona




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-28  7:47               ` Juan Quintela
@ 2023-04-28  8:30                 ` Kevin Wolf
  2023-04-28  8:38                   ` Juan Quintela
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2023-04-28  8:30 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Fiona Ebner, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Am 28.04.2023 um 09:47 hat Juan Quintela geschrieben:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
> > Am 27.04.23 um 16:36 schrieb Juan Quintela:
> >> Fiona Ebner <f.ebner@proxmox.com> wrote:
> >>> Am 27.04.23 um 13:03 schrieb Kevin Wolf:
> >>>> Am 26.04.2023 um 16:31 hat Fiona Ebner geschrieben:
> >>>>> Am 20.04.23 um 08:55 schrieb Paolo Bonzini:
> >> 
> >> Hi
> >> 
> >>> Our function is a custom variant of saving a snapshot and uses
> >>> qemu_savevm_state_setup(), which is why the qemu_mutex_unlock_iothread()
> >>> is there. I looked for inspiration for how upstream does things and it
> >>> turns out that upstream QEMU v8.0.0 has essentially the same issue with
> >>> snapshot-save. When snapshot_save_job_bh runs in a vCPU thread instead
> >>> of the main thread, the situation is the same: after
> >>> qemu_mutex_unlock_iothread(), qemu_get_current_aio_context() will return
> >>> 0x0 and then the assertion in the AIO_WAIT_WHILE_INTERNAL macro fails
> >>> (this time the generated coroutine wrapper is bdrv_writev_vmstate)[0].
> >>>
> >>>
> >>> So all bottom halves scheduled for the main thread's AioContext can
> >>> potentially get to run in a vCPU thread and need to be very careful with
> >>> things like qemu_mutex_unlock_iothread.
> >>>
> >>> Is qemu_get_current_aio_context() returning 0x0 expected? I haven't
> >>> looked into why it happens yet. Does there need to be a way to drop the
> >>> BQL without also giving up the main thread's AioContext or would it be
> >>> enough to re-acquire the context?
> >>>
> >>> CC-ing Juan as the migration maintainer.
> >> 
> >> This is the world backwards.
> >> The tradition is that migration people blame block layer people for
> >> breaking things and for help, not the other way around O:-)
> >
> > Sorry, if I didn't provide enough context/explanation. See below for my
> > attempt to re-iterate. I CC'ed you, because the issue happens as part of
> > snapshot-save and in particular the qemu_mutex_unlock_iothread call in
> > qemu_savevm_state is one of the ingredients leading to the problem.
> 
> This was a joke O:-)
> 
> >> To see that I am understading this right:
> >> 
> >> - you create a thread
> >> - that calls a memory_region operation
> >> - that calls a device write function
> >> - that calls the block layer
> >> - that creates a snapshot
> >> - that calls the migration code
> >> - that calls the block layer again
> >> 
> >> Without further investigation, I have no clue what is going on here,
> >> sorry.
> >> 
> >> Later, Juan.
> >> 
> >
> > All I'm doing is using QEMU (a build of upstream's v8.0.0) in intended
> > ways, I promise! In particular, I'm doing two things at the same time
> > repeatedly:
> > 1. Write to a pflash drive from within the guest.
> > 2. Issue a snapshot-save QMP command (in a way that doesn't lead to an
> > early error).
> >
> > (I actually also used a debugger to break on pflash_update and
> > snapshot_save_job_bh, manually continuing until I triggered the
> > problematic situation. It's very racy, because it depends on the host OS
> > to switch threads at the correct time.)
> 
> I think the block layer is the problem (famous last words)
> 
> >
> > Now we need to be aware of two things:
> > 1. As discussed earlier in the mail thread, if the host OS switches
> > threads at an inconvenient time, it can happen that a bottom half
> > scheduled for the main thread's AioContext can be executed as part of a
> > vCPU thread's aio_poll.
> > 2. Generated coroutine wrappers for block layer functions spawn the
> > coroutine and use AIO_WAIT_WHILE/aio_poll to wait for it to finish.
> >
> > What happens in the backtrace above is:
> > 1. The write to the pflash drive uses blk_pwrite which leads to an
> > aio_poll in the vCPU thread.
> > 2. The snapshot_save_job_bh bottom half, that was scheduled for the main
> > thread's AioContext, is executed as part of the vCPU thread's aio_poll.
> > 3. qemu_savevm_state is called.
> > 4. qemu_mutex_unlock_iothread is called. Now
> > qemu_get_current_aio_context returns 0x0. Usually, snapshot_save_job_bh
> > runs in the main thread, in which case qemu_get_current_aio_context
> > still returns the main thread's AioContext at this point.
> 
> I am perhaps a bit ingenuous here, but it is there a way to convince
> qemu that snapshot_save_job_bh *HAS* to run on the main thread?

I believe we're talking about a technicality here. I asked another more
fundamental question that nobody has answered yet:

Why do you think that it's ok to call bdrv_writev_vmstate() without
holding the BQL?

Because if we come to the conclusion that it's not ok (which is what I
think), then it doesn't matter whether we violate the condition in the
main thread or a vcpu thread. It's wrong in both cases, just the failure
mode differs - one crashes spectacularly with an assertion failure, the
other has a race condition. Moving from the assertion failure to a race
condition is not a proper fix.

> > 5. bdrv_writev_vmstate is executed as part of the usual savevm setup.
> > 6. bdrv_writev_vmstate is a generated coroutine wrapper, so it uses
> > AIO_WAIT_WHILE.
> > 7. The assertion to have the main thread's AioContext inside the
> > AIO_WAIT_WHILE macro fails.

Kevin




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-28  8:30                 ` Kevin Wolf
@ 2023-04-28  8:38                   ` Juan Quintela
  2023-04-28 12:22                     ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Quintela @ 2023-04-28  8:38 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fiona Ebner, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Kevin Wolf <kwolf@redhat.com> wrote:

>> 
>> I am perhaps a bit ingenuous here, but it is there a way to convince
>> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>
> I believe we're talking about a technicality here. I asked another more
> fundamental question that nobody has answered yet:
>
> Why do you think that it's ok to call bdrv_writev_vmstate() without
> holding the BQL?

I will say this function starts with bdrv_ (i.e. block layer people) and
ends with _vmstate (i.e. migration people).

To be honest, I don't know.  That is why I _supposed_ you have an idea.

> Because if we come to the conclusion that it's not ok (which is what I
> think), then it doesn't matter whether we violate the condition in the
> main thread or a vcpu thread. It's wrong in both cases, just the failure
> mode differs - one crashes spectacularly with an assertion failure, the
> other has a race condition. Moving from the assertion failure to a race
> condition is not a proper fix.

Fully agree there.  Just that I don't know the answer.

Later, Juan.




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-28  8:38                   ` Juan Quintela
@ 2023-04-28 12:22                     ` Kevin Wolf
  2023-04-28 16:54                       ` Juan Quintela
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2023-04-28 12:22 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Fiona Ebner, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben:
> Kevin Wolf <kwolf@redhat.com> wrote:
> >> I am perhaps a bit ingenuous here, but it is there a way to convince
> >> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
> >
> > I believe we're talking about a technicality here. I asked another more
> > fundamental question that nobody has answered yet:
> >
> > Why do you think that it's ok to call bdrv_writev_vmstate() without
> > holding the BQL?
> 
> I will say this function starts by bdrv_ (i.e. block layer people) and
> endes with _vmstate (i.e. migration people).
> 
> To be honest, I don't know.  That is why I _supposed_ you have an idea.

My idea is that bdrv_*() can only be called when you hold the BQL, or
for BlockDriverStates in an iothread the AioContext lock.
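
In code terms, what I expect callers to do is something like this (just a
sketch to illustrate the rule, not a concrete patch; bs, qiov and pos are
placeholders):

    /* BlockDriverState handled by the main loop: hold the BQL. */
    assert(qemu_mutex_iothread_locked());
    bdrv_writev_vmstate(bs, &qiov, pos);

    /* BlockDriverState in an iothread: hold its AioContext lock. */
    aio_context_acquire(bdrv_get_aio_context(bs));
    bdrv_writev_vmstate(bs, &qiov, pos);
    aio_context_release(bdrv_get_aio_context(bs));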

Apparently dropping the BQL in migration code was introduced in Paolo's
commit 9b095037527. I'm not sure what this was supposed to improve in
the case of snapshots because the VM is stopped anyway.

Would anything bad happen if we removed the BQL unlock/lock section in
qemu_savevm_state() again?

Kevin




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-28 12:22                     ` Kevin Wolf
@ 2023-04-28 16:54                       ` Juan Quintela
  2023-05-02 10:03                         ` Fiona Ebner
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Quintela @ 2023-04-28 16:54 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fiona Ebner, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Kevin Wolf <kwolf@redhat.com> wrote:
> Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben:
>> Kevin Wolf <kwolf@redhat.com> wrote:
>> >> I am perhaps a bit ingenuous here, but it is there a way to convince
>> >> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>> >
>> > I believe we're talking about a technicality here. I asked another more
>> > fundamental question that nobody has answered yet:
>> >
>> > Why do you think that it's ok to call bdrv_writev_vmstate() without
>> > holding the BQL?
>> 
>> I will say this function starts by bdrv_ (i.e. block layer people) and
>> endes with _vmstate (i.e. migration people).
>> 
>> To be honest, I don't know.  That is why I _supposed_ you have an idea.
>
> My idea is that bdrv_*() can only be called when you hold the BQL, or
> for BlockDriverStates in an iothread the AioContext lock.
>
> Apparently dropping the BQL in migration code was introduced in Paolo's
> commit 9b095037527.

Damn.  I reviewed it, so I am as guilty as the author.
After 10 years without problems, I will not blame that patch.

I guess we changed something else that broke doing it without the lock.

But no, I still don't have suggestions/ideas.

> I'm not sure what this was supposed to improve in
> the case of snapshots because the VM is stopped anyway.
>
> Would anything bad happen if we removed the BQL unlock/lock section in
> qemu_savevm_state() again?

Dunno.

For what it's worth, I can say that it survives migration-test, but don't
ask me why/how/...

Fiona, can you check if it fixes your troubles?

Later, Juan.

> Kevin




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-04-28 16:54                       ` Juan Quintela
@ 2023-05-02 10:03                         ` Fiona Ebner
  2023-05-02 10:25                           ` Fiona Ebner
  2023-05-02 10:30                           ` Juan Quintela
  0 siblings, 2 replies; 21+ messages in thread
From: Fiona Ebner @ 2023-05-02 10:03 UTC (permalink / raw)
  To: quintela, Kevin Wolf
  Cc: Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Am 28.04.23 um 18:54 schrieb Juan Quintela:
> Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben:
>>> Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> I am perhaps a bit ingenuous here, but it is there a way to convince
>>>>> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>>>>
>>>> I believe we're talking about a technicality here. I asked another more
>>>> fundamental question that nobody has answered yet:
>>>>
>>>> Why do you think that it's ok to call bdrv_writev_vmstate() without
>>>> holding the BQL?
>>>
>>> I will say this function starts by bdrv_ (i.e. block layer people) and
>>> endes with _vmstate (i.e. migration people).
>>>
>>> To be honest, I don't know.  That is why I _supposed_ you have an idea.
>>
>> My idea is that bdrv_*() can only be called when you hold the BQL, or
>> for BlockDriverStates in an iothread the AioContext lock.
>>
>> Apparently dropping the BQL in migration code was introduced in Paolo's
>> commit 9b095037527.
> 
> Damn.  I reviewed it, so I am as guilty as the author.
> 10 years later without problems I will not blame that patch.
> 
> I guess we changed something else that broke doing it without the lock.
> 
> But no, I still don't have suggestions/ideas.
>

I do feel like the issue might be very difficult to trigger under normal
circumstances. Depending on the configuration and what you do in the
guest, aio_poll in a vCPU thread does not happen often and I imagine
snapshot-save is also not a super frequent operation for most people. It
still takes me a while to trigger the issue by issuing lots of pflash
writes and running snapshot-save in a loop; I'd guess about 30-60
snapshots. Another reason might be that generated co-wrappers were less
common in the past?

>> I'm not sure what this was supposed to improve in
>> the case of snapshots because the VM is stopped anyway.

Is it? Quoting Juan:
> d- snapshots are a completely different beast, that don't really stop
>    the guest in the same way at that point, and sometimes it shows in
>    this subtle details.

>> Would anything bad happen if we removed the BQL unlock/lock section in
>> qemu_savevm_state() again?
> 
> Dunno.
> 
> For what is worth, I can say that it survives migration-test, but don't
> ask me why/how/...
> 
> Fiona, can you check if it fixes your troubles?
> 

Just removing the single section in qemu_savevm_state() breaks even the
case where snapshot_save_job_bh() is executed in the main thread,
because ram_init_bitmaps() will call qemu_mutex_lock_iothread_impl()
which asserts that it's not already locked.
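
Sketched out, the conflict looks like this (hypothetical condensation, not
the actual ram.c code):

    static void ram_init_bitmaps_sketch(void)
    {
        /* qemu_mutex_lock_iothread() asserts !qemu_mutex_iothread_locked(),
         * so once qemu_savevm_state() no longer drops the BQL before
         * qemu_savevm_state_setup(), this aborts even in the main thread. */
        qemu_mutex_lock_iothread();
        /* bitmap setup elided */
        qemu_mutex_unlock_iothread();
    }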

Also removing the lock/unlock pair in ram_init_bitmaps() seems to work.
But I'm not sure what else a full semantic revert of commit 9b095037527
would entail.

Best Regards,
Fiona




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-05-02 10:03                         ` Fiona Ebner
@ 2023-05-02 10:25                           ` Fiona Ebner
  2023-05-02 10:35                             ` Juan Quintela
  2023-05-02 10:30                           ` Juan Quintela
  1 sibling, 1 reply; 21+ messages in thread
From: Fiona Ebner @ 2023-05-02 10:25 UTC (permalink / raw)
  To: quintela, Kevin Wolf
  Cc: Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Am 02.05.23 um 12:03 schrieb Fiona Ebner:
> Am 28.04.23 um 18:54 schrieb Juan Quintela:
>> Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben:
>>>> Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>> I am perhaps a bit ingenuous here, but it is there a way to convince
>>>>>> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>>>>>
>>>>> I believe we're talking about a technicality here. I asked another more
>>>>> fundamental question that nobody has answered yet:
>>>>>
>>>>> Why do you think that it's ok to call bdrv_writev_vmstate() without
>>>>> holding the BQL?
>>>>
>>>> I will say this function starts by bdrv_ (i.e. block layer people) and
>>>> endes with _vmstate (i.e. migration people).
>>>>
>>>> To be honest, I don't know.  That is why I _supposed_ you have an idea.
>>>
>>> My idea is that bdrv_*() can only be called when you hold the BQL, or
>>> for BlockDriverStates in an iothread the AioContext lock.
>>>
>>> Apparently dropping the BQL in migration code was introduced in Paolo's
>>> commit 9b095037527.
>>
>> Damn.  I reviewed it, so I am as guilty as the author.
>> 10 years later without problems I will not blame that patch.
>>
>> I guess we changed something else that broke doing it without the lock.
>>
>> But no, I still don't have suggestions/ideas.
>>
> 
> I do feel like the issue might be very difficult to trigger under normal
> circumstances. Depending on the configuration and what you do in the
> guest, aio_poll in a vCPU thread does not happen often and I imagine
> snapshot-save is also not a super frequent operation for most people. It
> still takes me a while to trigger the issue by issuing lots of pflash
> writes and running snapshot-save in a loop, I'd guess about 30-60
> snapshots. Another reason might be that generated co-wrappers were less
> common in the past?
> 
>>> I'm not sure what this was supposed to improve in
>>> the case of snapshots because the VM is stopped anyway.
> 
> Is it? Quoting Juan:> d- snapshots are a completely different beast,
> that don't really stop
>>    the guest in the same way at that point, and sometimes it shows in
>>    this subtle details.
> 
>>> Would anything bad happen if we removed the BQL unlock/lock section in
>>> qemu_savevm_state() again?
>>
>> Dunno.
>>
>> For what is worth, I can say that it survives migration-test, but don't
>> ask me why/how/...
>>
>> Fiona, can you check if it fixes your troubles?
>>
> 
> Just removing the single section in qemu_savevm_state() breaks even the
> case where snapshot_save_job_bh() is executed in the main thread,
> because ram_init_bitmaps() will call qemu_mutex_lock_iothread_impl()
> which asserts that it's not already locked.
> 
> Also removing the lock/unlock pair in ram_init_bitmaps() seems to work. 

Well, after a few more attempts, I got a new failure (running with the
two changes mentioned above), but it seems to happen later and, at first
glance, doesn't seem to be related to the lock anymore:

> qemu-system-x86_64: ../hw/net/virtio-net.c:3811: virtio_net_pre_save: Assertion `!n->vhost_started' failed.

> (gdb) call qemu_mutex_iothread_locked()
> $1 = true

Backtrace:

> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7fd291ffb700 (LWP 136620)]
> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007fd2b8b3e537 in __GI_abort () at abort.c:79
> #2  0x00007fd2b8b3e40f in __assert_fail_base (fmt=0x7fd2b8cb66a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x557e0ebb7193 "!n->vhost_started", file=0x557e0ebb65f0 "../hw/net/virtio-net.c", line=3811, function=<optimized out>) at assert.c:92
> #3  0x00007fd2b8b4d662 in __GI___assert_fail (assertion=0x557e0ebb7193 "!n->vhost_started", file=0x557e0ebb65f0 "../hw/net/virtio-net.c", line=3811, function=0x557e0ebb7740 <__PRETTY_FUNCTION__.3> "virtio_net_pre_save") at assert.c:101
> #4  0x0000557e0e6c21c9 in virtio_net_pre_save (opaque=0x557e12c480b0) at ../hw/net/virtio-net.c:3811
> #5  0x0000557e0e4f63ee in vmstate_save_state_v (f=0x7fd28442f730, vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0, vmdesc=0x7fd2840978e0, version_id=11) at ../migration/vmstate.c:330
> #6  0x0000557e0e4f6390 in vmstate_save_state (f=0x7fd28442f730, vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0, vmdesc_id=0x7fd2840978e0) at ../migration/vmstate.c:318
> #7  0x0000557e0e51c80a in vmstate_save (f=0x7fd28442f730, se=0x557e12c4c430, vmdesc=0x7fd2840978e0) at ../migration/savevm.c:1000
> #8  0x0000557e0e51d942 in qemu_savevm_state_complete_precopy_non_iterable (f=0x7fd28442f730, in_postcopy=false, inactivate_disks=false) at ../migration/savevm.c:1463
> #9  0x0000557e0e51db33 in qemu_savevm_state_complete_precopy (f=0x7fd28442f730, iterable_only=false, inactivate_disks=false) at ../migration/savevm.c:1529
> #10 0x0000557e0e51df3d in qemu_savevm_state (f=0x7fd28442f730, errp=0x7fd28425e1f8) at ../migration/savevm.c:1635
> #11 0x0000557e0e520548 in save_snapshot (name=0x7fd28425e2b0 "snap0", overwrite=false, vmstate=0x7fd284479c20 "scsi0", has_devices=true, devices=0x7fd284097920, errp=0x7fd28425e1f8) at ../migration/savevm.c:2952
> #12 0x0000557e0e520f9b in snapshot_save_job_bh (opaque=0x7fd28425e130) at ../migration/savevm.c:3251
> #13 0x0000557e0e9f94de in aio_bh_call (bh=0x7fd2844703f0) at ../util/async.c:155
> #14 0x0000557e0e9f95e9 in aio_bh_poll (ctx=0x557e110ae910) at ../util/async.c:184
> #15 0x0000557e0e9dd78c in aio_poll (ctx=0x557e110ae910, blocking=true) at ../util/aio-posix.c:721
> #16 0x0000557e0e836ef0 in bdrv_poll_co (s=0x7fd291ff5eb0) at /home/febner/repos/qemu/block/block-gen.h:43
> #17 0x0000557e0e839c0e in blk_pwrite (blk=0x557e11320f40, offset=58880, bytes=512, buf=0x7fd290c0e600, flags=0) at block/block-gen.c:1650
> #18 0x0000557e0e2b2078 in pflash_update (pfl=0x557e11303b30, offset=58880, size=1) at ../hw/block/pflash_cfi01.c:394
> #19 0x0000557e0e2b2749 in pflash_write (pfl=0x557e11303b30, offset=59338, value=62, width=1, be=0) at ../hw/block/pflash_cfi01.c:522
> #20 0x0000557e0e2b2cda in pflash_mem_write_with_attrs (opaque=0x557e11303b30, addr=59338, value=62, len=1, attrs=...) at ../hw/block/pflash_cfi01.c:681
> #21 0x0000557e0e733354 in memory_region_write_with_attrs_accessor (mr=0x557e11303ef0, addr=59338, value=0x7fd291ff60c8, size=1, shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
> #22 0x0000557e0e733493 in access_with_adjusted_size (addr=59338, value=0x7fd291ff60c8, size=1, access_size_min=1, access_size_max=4, access_fn=0x557e0e73325a <memory_region_write_with_attrs_accessor>, mr=0x557e11303ef0, attrs=...) at ../softmmu/memory.c:555
> #23 0x0000557e0e7365c8 in memory_region_dispatch_write (mr=0x557e11303ef0, addr=59338, data=62, op=MO_8, attrs=...) at ../softmmu/memory.c:1522
> #24 0x0000557e0e7436de in flatview_write_continue (fv=0x7fd2848bce90, addr=4290832330, attrs=..., ptr=0x7fd2ba9b7028, len=1, addr1=59338, l=1, mr=0x557e11303ef0) at ../softmmu/physmem.c:2641
> #25 0x0000557e0e743841 in flatview_write (fv=0x7fd2848bce90, addr=4290832330, attrs=..., buf=0x7fd2ba9b7028, len=1) at ../softmmu/physmem.c:2683
> #26 0x0000557e0e743bf1 in address_space_write (as=0x557e0f3abb20 <address_space_memory>, addr=4290832330, attrs=..., buf=0x7fd2ba9b7028, len=1) at ../softmmu/physmem.c:2779
> #27 0x0000557e0e743c5e in address_space_rw (as=0x557e0f3abb20 <address_space_memory>, addr=4290832330, attrs=..., buf=0x7fd2ba9b7028, len=1, is_write=true) at ../softmmu/physmem.c:2789
> #28 0x0000557e0e7d7a5c in kvm_cpu_exec (cpu=0x557e11676910) at ../accel/kvm/kvm-all.c:2989
> #29 0x0000557e0e7da76e in kvm_vcpu_thread_fn (arg=0x557e11676910) at ../accel/kvm/kvm-accel-ops.c:51
> #30 0x0000557e0e9e2233 in qemu_thread_start (args=0x557e11553fb0) at ../util/qemu-thread-posix.c:541
> #31 0x00007fd2b9b17ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
> #32 0x00007fd2b8c18a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-05-02 10:03                         ` Fiona Ebner
  2023-05-02 10:25                           ` Fiona Ebner
@ 2023-05-02 10:30                           ` Juan Quintela
  1 sibling, 0 replies; 21+ messages in thread
From: Juan Quintela @ 2023-05-02 10:30 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: Kevin Wolf, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Fiona Ebner <f.ebner@proxmox.com> wrote:
> Am 28.04.23 um 18:54 schrieb Juan Quintela:
>> Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben:
>>>> Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>> I am perhaps a bit ingenuous here, but it is there a way to convince
>>>>>> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>>>>>
>>>>> I believe we're talking about a technicality here. I asked another more
>>>>> fundamental question that nobody has answered yet:
>>>>>
>>>>> Why do you think that it's ok to call bdrv_writev_vmstate() without
>>>>> holding the BQL?
>>>>
>>>> I will say this function starts by bdrv_ (i.e. block layer people) and
>>>> endes with _vmstate (i.e. migration people).
>>>>
>>>> To be honest, I don't know.  That is why I _supposed_ you have an idea.
>>>
>>> My idea is that bdrv_*() can only be called when you hold the BQL, or
>>> for BlockDriverStates in an iothread the AioContext lock.
>>>
>>> Apparently dropping the BQL in migration code was introduced in Paolo's
>>> commit 9b095037527.
>> 
>> Damn.  I reviewed it, so I am as guilty as the author.
>> 10 years later without problems I will not blame that patch.
>> 
>> I guess we changed something else that broke doing it without the lock.
>> 
>> But no, I still don't have suggestions/ideas.
>>
>
> I do feel like the issue might be very difficult to trigger under normal
> circumstances. Depending on the configuration and what you do in the
> guest, aio_poll in a vCPU thread does not happen often and I imagine
> snapshot-save is also not a super frequent operation for most people. It
> still takes me a while to trigger the issue by issuing lots of pflash
> writes and running snapshot-save in a loop, I'd guess about 30-60
> snapshots. Another reason might be that generated co-wrappers were less
> common in the past?

May I suggest that you add a test in migration-test (or even better
create snapshot-test, hint, hint!) if you ever find a way to reproduce
it.

>>> I'm not sure what this was supposed to improve in
>>> the case of snapshots because the VM is stopped anyway.
>
> Is it? Quoting Juan:> d- snapshots are a completely different beast,
> that don't really stop
>>    the guest in the same way at that point, and sometimes it shows in
>>    this subtle details.

Exhibit A:

qemu/migration/migration.c:migration_thread()

Exhibit B:

qemu/migration/savevm.c:qemu_savevm_state()

They both call (from a 10,000-foot view):

qemu_savevm_state_header()
qemu_savevm_state_setup()
qemu_savevm_iterate()
qemu_savevm_state_cleanup()

But there are subtle details in the middle that are different in both
places, and historically we have forgotten to add fixes to
qemu_savevm_state(), principally because almost nobody uses it compared
with migration, and because upstream we test migration with
migration-test and several Avocado tests, and we do the same for
enterprise distros.  As said before, snapshots are the bastard child of
migration: they are there, but nobody looks at them.

I tried in the past to merge both code paths, but as said, the
differences are tricky and subtle and I never finished that project.

>>> Would anything bad happen if we removed the BQL unlock/lock section in
>>> qemu_savevm_state() again?
>> 
>> Dunno.
>> 
>> For what is worth, I can say that it survives migration-test, but don't
>> ask me why/how/...
>> 
>> Fiona, can you check if it fixes your troubles?
>> 
>
> Just removing the single section in qemu_savevm_state() breaks even the
> case where snapshot_save_job_bh() is executed in the main thread,
> because ram_init_bitmaps() will call qemu_mutex_lock_iothread_impl()
> which asserts that it's not already locked.
>
> Also removing the lock/unlock pair in ram_init_bitmaps() seems to work.
> But I'm not sure what else a full semantic revert of commit 9b095037527
> would entail.

I have no idea.

I will let Paolo to answer this one.

I can guess this was a solution to improve coroutines or rcu or
whatever, but I don't remember the details.

Later, Juan.




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-05-02 10:25                           ` Fiona Ebner
@ 2023-05-02 10:35                             ` Juan Quintela
  2023-05-02 12:49                               ` Fiona Ebner
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Quintela @ 2023-05-02 10:35 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: Kevin Wolf, Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Fiona Ebner <f.ebner@proxmox.com> wrote:
> Am 02.05.23 um 12:03 schrieb Fiona Ebner:
>> Am 28.04.23 um 18:54 schrieb Juan Quintela:
>>> Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben:
>>>>> Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>> I am perhaps a bit ingenuous here, but it is there a way to convince
>>>>>>> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>>>>>>
>>>>>> I believe we're talking about a technicality here. I asked another more
>>>>>> fundamental question that nobody has answered yet:
>>>>>>
>>>>>> Why do you think that it's ok to call bdrv_writev_vmstate() without
>>>>>> holding the BQL?
>>>>>
>>>>> I will say this function starts by bdrv_ (i.e. block layer people) and
>>>>> endes with _vmstate (i.e. migration people).
>>>>>
>>>>> To be honest, I don't know.  That is why I _supposed_ you have an idea.
>>>>
>>>> My idea is that bdrv_*() can only be called when you hold the BQL, or
>>>> for BlockDriverStates in an iothread the AioContext lock.
>>>>
>>>> Apparently dropping the BQL in migration code was introduced in Paolo's
>>>> commit 9b095037527.
>>>
>>> Damn.  I reviewed it, so I am as guilty as the author.
>>> 10 years later without problems I will not blame that patch.
>>>
>>> I guess we changed something else that broke doing it without the lock.
>>>
>>> But no, I still don't have suggestions/ideas.
>>>
>> 
>> I do feel like the issue might be very difficult to trigger under normal
>> circumstances. Depending on the configuration and what you do in the
>> guest, aio_poll in a vCPU thread does not happen often and I imagine
>> snapshot-save is also not a super frequent operation for most people. It
>> still takes me a while to trigger the issue by issuing lots of pflash
>> writes and running snapshot-save in a loop, I'd guess about 30-60
>> snapshots. Another reason might be that generated co-wrappers were less
>> common in the past?
>> 
>>>> I'm not sure what this was supposed to improve in
>>>> the case of snapshots because the VM is stopped anyway.
>> 
>> Is it? Quoting Juan:> d- snapshots are a completely different beast,
>> that don't really stop
>>>    the guest in the same way at that point, and sometimes it shows in
>>>    this subtle details.
>> 
>>>> Would anything bad happen if we removed the BQL unlock/lock section in
>>>> qemu_savevm_state() again?
>>>
>>> Dunno.
>>>
>>> For what is worth, I can say that it survives migration-test, but don't
>>> ask me why/how/...
>>>
>>> Fiona, can you check if it fixes your troubles?
>>>
>> 
>> Just removing the single section in qemu_savevm_state() breaks even the
>> case where snapshot_save_job_bh() is executed in the main thread,
>> because ram_init_bitmaps() will call qemu_mutex_lock_iothread_impl()
>> which asserts that it's not already locked.
>> 
>> Also removing the lock/unlock pair in ram_init_bitmaps() seems to work. 
>
> Well, after a few more attempts, I got a new failure (running with the
> two changes mentioned above), but it seems to happen later and, at a
> first glance, doesn't seem to be related to the lock anymore:

Can you revert the whole commit:

commmit 9b0950375277467fd74a9075624477ae43b9bb22
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Fri Feb 22 17:36:28 2013 +0100

    migration: run setup callbacks out of big lock

Because we are again at:
>
>> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
>> [Switching to Thread 0x7fd291ffb700 (LWP 136620)]
>> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #1  0x00007fd2b8b3e537 in __GI_abort () at abort.c:79
>> #2 0x00007fd2b8b3e40f in __assert_fail_base (fmt=0x7fd2b8cb66a8
>> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
>> assertion=0x557e0ebb7193 "!n->vhost_started", file=0x557e0ebb65f0
>> "../hw/net/virtio-net.c", line=3811, function=<optimized out>) at
>> assert.c:92
>> #3 0x00007fd2b8b4d662 in __GI___assert_fail
>> (assertion=0x557e0ebb7193 "!n->vhost_started", file=0x557e0ebb65f0
>> "../hw/net/virtio-net.c", line=3811, function=0x557e0ebb7740
>> <__PRETTY_FUNCTION__.3> "virtio_net_pre_save") at assert.c:101
>> #4  0x0000557e0e6c21c9 in virtio_net_pre_save (opaque=0x557e12c480b0) at ../hw/net/virtio-net.c:3811
>> #5 0x0000557e0e4f63ee in vmstate_save_state_v (f=0x7fd28442f730,
>> vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0,
>> vmdesc=0x7fd2840978e0, version_id=11) at ../migration/vmstate.c:330
>> #6 0x0000557e0e4f6390 in vmstate_save_state (f=0x7fd28442f730,
>> vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0,
>> vmdesc_id=0x7fd2840978e0) at ../migration/vmstate.c:318
>> #7 0x0000557e0e51c80a in vmstate_save (f=0x7fd28442f730,
>> se=0x557e12c4c430, vmdesc=0x7fd2840978e0) at
>> ../migration/savevm.c:1000
>> #8 0x0000557e0e51d942 in
>> qemu_savevm_state_complete_precopy_non_iterable (f=0x7fd28442f730,
>> in_postcopy=false, inactivate_disks=false) at
>> ../migration/savevm.c:1463
>> #9 0x0000557e0e51db33 in qemu_savevm_state_complete_precopy
>> (f=0x7fd28442f730, iterable_only=false, inactivate_disks=false) at
>> ../migration/savevm.c:1529
>> #10 0x0000557e0e51df3d in qemu_savevm_state (f=0x7fd28442f730, errp=0x7fd28425e1f8) at ../migration/savevm.c:1635
>> #11 0x0000557e0e520548 in save_snapshot (name=0x7fd28425e2b0
>> "snap0", overwrite=false, vmstate=0x7fd284479c20 "scsi0",
>> has_devices=true, devices=0x7fd284097920, errp=0x7fd28425e1f8) at
>> ../migration/savevm.c:2952
>> #12 0x0000557e0e520f9b in snapshot_save_job_bh (opaque=0x7fd28425e130) at ../migration/savevm.c:3251

Here we are again: aio_bh_call() ends up calling snapshot_save_job_bh(),
and it didn't end well.

It appears that there is not an easy way to guarantee that snapshot code
is only run on the main io thread, so my only other suggestion right now
is that you check snapshot_save_job_bh() and see if it ever happens on a
non-vcpu thread when you get the test to run correctly.
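
Something like this at the top of snapshot_save_job_bh() should tell you
(just an untested sketch, not a proper patch):

    static void snapshot_save_job_bh(void *opaque)
    {
        /* untested: warn if the BH was polled outside the main loop thread */
        if (qemu_get_current_aio_context() != qemu_get_aio_context()) {
            warn_report("snapshot_save_job_bh ran outside the main thread");
        }
        /* existing body unchanged */
    }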

Later, Juan.




* Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
  2023-05-02 10:35                             ` Juan Quintela
@ 2023-05-02 12:49                               ` Fiona Ebner
  0 siblings, 0 replies; 21+ messages in thread
From: Fiona Ebner @ 2023-05-02 12:49 UTC (permalink / raw)
  To: quintela, Kevin Wolf
  Cc: Paolo Bonzini, Markus Armbruster, QEMU Developers,
	open list:Block layer core, Michael Roth, Fam Zheng,
	Stefan Hajnoczi, Thomas Lamprecht, Peter Xu

Am 02.05.23 um 12:35 schrieb Juan Quintela:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
>> Am 02.05.23 um 12:03 schrieb Fiona Ebner:
>>> Am 28.04.23 um 18:54 schrieb Juan Quintela:
>>>> Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben:
>>>>>> Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>> I am perhaps a bit ingenuous here, but it is there a way to convince
>>>>>>>> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>>>>>>>
>>>>>>> I believe we're talking about a technicality here. I asked another more
>>>>>>> fundamental question that nobody has answered yet:
>>>>>>>
>>>>>>> Why do you think that it's ok to call bdrv_writev_vmstate() without
>>>>>>> holding the BQL?
>>>>>>
>>>>>> I will say this function starts by bdrv_ (i.e. block layer people) and
>>>>>> endes with _vmstate (i.e. migration people).
>>>>>>
>>>>>> To be honest, I don't know.  That is why I _supposed_ you have an idea.
>>>>>
>>>>> My idea is that bdrv_*() can only be called when you hold the BQL, or
>>>>> for BlockDriverStates in an iothread the AioContext lock.
>>>>>
>>>>> Apparently dropping the BQL in migration code was introduced in Paolo's
>>>>> commit 9b095037527.
>>>>
>>>> Damn.  I reviewed it, so I am as guilty as the author.
>>>> 10 years later without problems I will not blame that patch.
>>>>
>>>> I guess we changed something else that broke doing it without the lock.
>>>>
>>>> But no, I still don't have suggestions/ideas.
>>>>
>>>
>>> I do feel like the issue might be very difficult to trigger under normal
>>> circumstances. Depending on the configuration and what you do in the
>>> guest, aio_poll in a vCPU thread does not happen often and I imagine
>>> snapshot-save is also not a super frequent operation for most people. It
>>> still takes me a while to trigger the issue by issuing lots of pflash
>>> writes and running snapshot-save in a loop, I'd guess about 30-60
>>> snapshots. Another reason might be that generated co-wrappers were less
>>> common in the past?
>>>
>>>>> I'm not sure what this was supposed to improve in
>>>>> the case of snapshots because the VM is stopped anyway.
>>>
>>> Is it? Quoting Juan:> d- snapshots are a completely different beast,
>>> that don't really stop
>>>>    the guest in the same way at that point, and sometimes it shows in
>>>>    this subtle details.
>>>
>>>>> Would anything bad happen if we removed the BQL unlock/lock section in
>>>>> qemu_savevm_state() again?
>>>>
>>>> Dunno.
>>>>
>>>> For what is worth, I can say that it survives migration-test, but don't
>>>> ask me why/how/...
>>>>
>>>> Fiona, can you check if it fixes your troubles?
>>>>
>>>
>>> Just removing the single section in qemu_savevm_state() breaks even the
>>> case where snapshot_save_job_bh() is executed in the main thread,
>>> because ram_init_bitmaps() will call qemu_mutex_lock_iothread_impl()
>>> which asserts that it's not already locked.
>>>
>>> Also removing the lock/unlock pair in ram_init_bitmaps() seems to work. 
>>
>> Well, after a few more attempts, I got a new failure (running with the
>> two changes mentioned above), but it seems to happen later and, at a
>> first glance, doesn't seem to be related to the lock anymore:
> 
> Can you revert the whole commit:
> 
> commmit 9b0950375277467fd74a9075624477ae43b9bb22
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Fri Feb 22 17:36:28 2013 +0100
> 
>     migration: run setup callbacks out of big lock

Unfortunately, there were just too many changes. git revert won't even
detect the renamed files (e.g. migration.c -> migration/migration.c) for
me. I tried to revert manually, but I noticed another issue. A full
revert would also mean that migration holds the BQL around the
qemu_savevm_state_setup() call. But if we don't want that, we can't
remove the qemu_mutex_lock_iothread() call in ram_init_bitmaps() (and
similar for block_save_setup()), because migration relies on that. I
guess we could make the lock+unlock calls conditional, depending on
whether the BQL is already held, but it doesn't seem too nice either.

> 
> Because we are again at:
>>
>>> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
>>> [Switching to Thread 0x7fd291ffb700 (LWP 136620)]
>>> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> 50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>>> (gdb) bt
>>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> #1  0x00007fd2b8b3e537 in __GI_abort () at abort.c:79
>>> #2 0x00007fd2b8b3e40f in __assert_fail_base (fmt=0x7fd2b8cb66a8
>>> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
>>> assertion=0x557e0ebb7193 "!n->vhost_started", file=0x557e0ebb65f0
>>> "../hw/net/virtio-net.c", line=3811, function=<optimized out>) at
>>> assert.c:92
>>> #3 0x00007fd2b8b4d662 in __GI___assert_fail
>>> (assertion=0x557e0ebb7193 "!n->vhost_started", file=0x557e0ebb65f0
>>> "../hw/net/virtio-net.c", line=3811, function=0x557e0ebb7740
>>> <__PRETTY_FUNCTION__.3> "virtio_net_pre_save") at assert.c:101
>>> #4  0x0000557e0e6c21c9 in virtio_net_pre_save (opaque=0x557e12c480b0) at ../hw/net/virtio-net.c:3811
>>> #5 0x0000557e0e4f63ee in vmstate_save_state_v (f=0x7fd28442f730,
>>> vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0,
>>> vmdesc=0x7fd2840978e0, version_id=11) at ../migration/vmstate.c:330
>>> #6 0x0000557e0e4f6390 in vmstate_save_state (f=0x7fd28442f730,
>>> vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0,
>>> vmdesc_id=0x7fd2840978e0) at ../migration/vmstate.c:318
>>> #7 0x0000557e0e51c80a in vmstate_save (f=0x7fd28442f730,
>>> se=0x557e12c4c430, vmdesc=0x7fd2840978e0) at
>>> ../migration/savevm.c:1000
>>> #8 0x0000557e0e51d942 in
>>> qemu_savevm_state_complete_precopy_non_iterable (f=0x7fd28442f730,
>>> in_postcopy=false, inactivate_disks=false) at
>>> ../migration/savevm.c:1463
>>> #9 0x0000557e0e51db33 in qemu_savevm_state_complete_precopy
>>> (f=0x7fd28442f730, iterable_only=false, inactivate_disks=false) at
>>> ../migration/savevm.c:1529
>>> #10 0x0000557e0e51df3d in qemu_savevm_state (f=0x7fd28442f730, errp=0x7fd28425e1f8) at ../migration/savevm.c:1635
>>> #11 0x0000557e0e520548 in save_snapshot (name=0x7fd28425e2b0
>>> "snap0", overwrite=false, vmstate=0x7fd284479c20 "scsi0",
>>> has_devices=true, devices=0x7fd284097920, errp=0x7fd28425e1f8) at
>>> ../migration/savevm.c:2952
>>> #12 0x0000557e0e520f9b in snapshot_save_job_bh (opaque=0x7fd28425e130) at ../migration/savevm.c:3251
> 
> Here we are
> aio_bh_call()
> 
> ends calling snapshot_save_job_bh, and it didn't end well.
> 
> It appears that there is not an easy way to warantee that snapshot code
> is only run on the main io thread, so my only other suggestion right now
> is that you check snapshot_save_job_bh() and see if it ever happens on a
> non-vcpu thread when you get the test to run correctly.

Sorry, I'm not sure what you mean by "when you get the test to run
correctly". Yes, the assertion from above only happened in a vCPU thread
for me.

While save_snapshot() does call vm_stop(RUN_STATE_SAVE_VM), I guess the
guest isn't fully stopped at that point because of the following?

> int vm_stop(RunState state)
> {
>     if (qemu_in_vcpu_thread()) {
>         qemu_system_vmstop_request_prepare();
>         qemu_system_vmstop_request(state);
>         /*
>          * FIXME: should not return to device code in case
>          * vm_stop() has been requested.
>          */
>         cpu_stop_current();
>         return 0;
>     }
> 
>     return do_vm_stop(state, true);
> }

Would rescheduling snapshot_save_job_bh() until it runs in the main
thread be an option?
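
What I have in mind is something along these lines (untested sketch):

    static void snapshot_save_job_bh(void *opaque)
    {
        /* untested idea: if the BH was polled from a vCPU thread, re-arm
         * it on the main loop's AioContext and bail out */
        if (qemu_get_current_aio_context() != qemu_get_aio_context()) {
            aio_bh_schedule_oneshot(qemu_get_aio_context(),
                                    snapshot_save_job_bh, opaque);
            return;
        }
        /* existing body unchanged */
    }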

But Kevin said:

> My idea is that bdrv_*() can only be called when you hold the BQL, or
> for BlockDriverStates in an iothread the AioContext lock

So IIUC, having snapshot code keep dropping the BQL during setup would
still not be fully correct even if we ensure that snapshot code only
runs in the main thread. But when doing qemu_mutex_unlock_iothread() in
the main thread, qemu_get_current_aio_context() still returns the main
thread's AioContext (it returns 0x0 when the same thing is done in the
vCPU thread), so is it fine after all?

Another alternative would be to check in the QIOChannelBlock
implementation whether we require the main thread's AioContext and acquire
the BQL before calling the bdrv_*_vmstate() functions? But that seems
rather ugly too.
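
Roughly like this inside qio_channel_block_writev() (sketch only; bs, qiov
and pos stand for whatever the channel implementation actually passes):

    bool need_bql = !qemu_mutex_iothread_locked();

    if (need_bql) {
        qemu_mutex_lock_iothread();
    }
    ret = bdrv_writev_vmstate(bs, &qiov, pos);
    if (need_bql) {
        qemu_mutex_unlock_iothread();
    }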

Best Regards,
Fiona




Thread overview: 21+ messages
2023-04-19 14:09 QMP (without OOB) function running in thread different from the main thread as part of aio_poll Fiona Ebner
2023-04-19 16:59 ` Paolo Bonzini
2023-04-20  6:11 ` Markus Armbruster
2023-04-20  6:55   ` Paolo Bonzini
2023-04-26 14:31     ` Fiona Ebner
2023-04-27 11:03       ` Kevin Wolf
2023-04-27 12:27         ` Fiona Ebner
2023-04-27 14:36           ` Juan Quintela
2023-04-27 14:56             ` Peter Xu
2023-04-28  7:53               ` Fiona Ebner
2023-04-28  7:23             ` Fiona Ebner
2023-04-28  7:47               ` Juan Quintela
2023-04-28  8:30                 ` Kevin Wolf
2023-04-28  8:38                   ` Juan Quintela
2023-04-28 12:22                     ` Kevin Wolf
2023-04-28 16:54                       ` Juan Quintela
2023-05-02 10:03                         ` Fiona Ebner
2023-05-02 10:25                           ` Fiona Ebner
2023-05-02 10:35                             ` Juan Quintela
2023-05-02 12:49                               ` Fiona Ebner
2023-05-02 10:30                           ` Juan Quintela
