From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Peter Maydell <peter.maydell@linaro.org>,
	Alberto Garcia <berto@igalia.com>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
	Qemu-block <qemu-block@nongnu.org>
Subject: Re: iotest 030 still occasionally intermittently failing
Date: Fri, 20 Nov 2020 12:51:20 +0300
Message-ID: <3cfcbb68-2f62-d7f2-65e1-6ba7dd879ac5@virtuozzo.com>
In-Reply-To: <32cd520b-0463-945c-9222-ec20baf8be8a@virtuozzo.com>

19.11.2020 23:31, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2020 22:31, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> 16.11.2020 20:59, Peter Maydell wrote:
>>>>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
>>>>>> Do you know if there is a core dump or stack trace available ?
>>>>>
>>>>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>>>>> produce, so if you want more diagnostics on failures you have to
>>>>> arrange for the test harness to produce them...
>>>>>
>>>>> thanks
>>>>> -- PMM
>>>>>
>>>>
>>>> Hi!
>>>>
>>>> After some iterations I've reproduced a SIGABRT:
>>>>
>>>> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
>>>> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
>>>> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
>>>> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>>>> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00) at ../block.c:2777
>>>> #6  0x000055a93374fd9c in bdrv_root_unref_child (child=0x55a9363a3a00) at ../block.c:2789
>>>> #7  0x000055a933722e8b in block_job_remove_all_bdrv (job=0x55a935f4f4b0) at ../blockjob.c:191
>>>> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at ../blockjob.c:88
>>>> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at ../job.c:380
>>>> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at ../job.c:894
>>>> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at ../util/async.c:136
>>>> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at ../util/async.c:164
>>>> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at ../util/aio-posix.c:381
>>>> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0, callback=0x0, user_data=0x0)
>>>>      at ../util/async.c:306
>>>> #15 0x00007feb71349ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>>>> #16 0x000055a933840300 in glib_pollfds_poll () at ../util/main-loop.c:221
>>>> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
>>>> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
>>>> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138, envp=0x7ffd48c311e0)
>>>>
>>>> (gdb) fr 4
>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>>>> 2648            assert(tighten_restrictions == false);
>>>> (gdb) list
>>>> 2643            int ret;
>>>> 2644
>>>> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
>>>> 2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
>>>> 2647                                  &tighten_restrictions, NULL);
>>>> 2648            assert(tighten_restrictions == false);
>>>> 2649            if (ret < 0) {
>>>> 2650                /* We only tried to loosen restrictions, so errors are not fatal */
>>>> 2651                bdrv_abort_perm_update(old_bs);
>>>> 2652            } else {
>>>> (gdb) p tighten_restrictions
>>>> $1 = true
>>>>
>>>>
>>>
>>> I've modified the code a bit, to crash when we actually want to set tighten_restrictions to true, and got the following bt:
>>> #0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
>>> #1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
>>> #2  0x000055b9174104d7 in bdrv_check_perm
>>>      (bs=0x55b918f09720, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2009
>>> #3  0x000055b917410ec0 in bdrv_check_update_perm
>>>      (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2280
>>> #4  0x000055b917410f38 in bdrv_child_check_perm
>>>      (c=0x55b91921fcf0, q=0x0, perm=1, shared=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2294
>>> #5  0x000055b91741078c in bdrv_check_perm
>>>      (bs=0x55b918defd90, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x0, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2076
>>> #6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200, new_bs=0x0) at ../block.c:2666
>>> #7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200) at ../block.c:2798
>>> #8  0x000055b917411f5f in bdrv_root_unref_child (child=0x55b919cf6200) at ../block.c:2810
>>> #9  0x000055b9173e4d88 in block_job_remove_all_bdrv (job=0x55b918f06a60) at ../blockjob.c:191
>>> #10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at ../blockjob.c:88
>>> #11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at ../job.c:380
>>> #12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at ../job.c:894
>>> #13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at ../util/async.c:136
>>> #14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at ../util/async.c:164
>>> #15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at ../util/aio-posix.c:381
>>> #16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0, callback=0x0, user_data=0x0)
>>>      at ../util/async.c:306
>>> #17 0x00007f6dbc62decd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>>> #18 0x000055b9175029d9 in glib_pollfds_poll () at ../util/main-loop.c:221
>>> #19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
>>> #20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
>>> #21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>> #22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208, envp=0x7fff7f81f2b0)
>>>
>>>
>>> and the picture taken at the moment of abort (it is the same as at the moment just before the bdrv_replace_child call) is attached. So it looks like the graph is already corrupted (you can see that backing permissions are not propagated to the node2-node0 child).
>>>
>>> How the graph got corrupted is still the question...
>>>
>>
>> additional info: this all happens during test_stream_commit_1 in iotest 030
>>
> 
> OK, after some debugging and looking at block-graph dumps I tend to think that this is a race between the finish (.prepare) phases of mirror and block-stream. Both do graph updates. Nothing prevents interleaving of graph-updating operations (note that bdrv_replace_child_noperm may do aio_poll), and nothing protects two concurrent graph updates from interfering with each other.
> 
> Any ideas?
> 

Wow, I get the following:

#0  0x00007fa84bcf97b6 in ppoll () at /lib64/libc.so.6
#1  0x000055b33e46321b in qemu_poll_ns (fds=0x55b340d28b60, nfds=1, timeout=600000000000) at ../util/qemu-timer.c:349
#2  0x000055b33e495c33 in fdmon_poll_wait (ctx=0x55b340d026f0, ready_list=0x7ffcc0af5f30, timeout=600000000000) at ../util/fdmon-poll.c:80
#3  0x000055b33e4575b0 in aio_poll (ctx=0x55b340d026f0, blocking=true) at ../util/aio-posix.c:607
#4  0x000055b33e302c41 in mirror_exit_common (job=0x55b341e94d20) at ../block/mirror.c:648
#5  0x000055b33e3031e3 in mirror_prepare (job=0x55b341e94d20) at ../block/mirror.c:776
#6  0x000055b33e39ba6b in job_prepare (job=0x55b341e94d20) at ../job.c:781
#7  0x000055b33e39a31f in job_txn_apply (job=0x55b341e94d20, fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
#8  0x000055b33e39baf4 in job_do_finalize (job=0x55b341e94d20) at ../job.c:798
#9  0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341e94d20) at ../job.c:852
#10 0x000055b33e39bd5b in job_completed (job=0x55b341e94d20) at ../job.c:865
#11 0x000055b33e39bdae in job_exit (opaque=0x55b341e94d20) at ../job.c:885
#12 0x000055b33e485cfc in aio_bh_call (bh=0x7fa830004100) at ../util/async.c:136
#13 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at ../util/async.c:164
#14 0x000055b33e4577a3 in aio_poll (ctx=0x55b340d026f0, blocking=true) at ../util/aio-posix.c:659
#15 0x000055b33e2fa4d9 in bdrv_poll_co (s=0x7ffcc0af62e0) at /work/src/qemu/master/block/block-gen.h:44
#16 0x000055b33e2faff4 in bdrv_pwritev (child=0x55b340ed0f60, offset=0, bytes=65536, qiov=0x7ffcc0af6370, flags=0) at block/block-gen.c:408
#17 0x000055b33e34be9b in bdrv_pwrite (child=0x55b340ed0f60, offset=0, buf=0x55b341d10000, bytes=65536) at ../block/io.c:984
#18 0x000055b33e386c6f in qcow2_update_header (bs=0x55b341023020) at ../block/qcow2.c:3037
#19 0x000055b33e386e53 in qcow2_change_backing_file (bs=0x55b341023020, backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img", backing_fmt=0x55b33e613afc "raw") at ../block/qcow2.c:3074
#20 0x000055b33e379908 in bdrv_change_backing_file (bs=0x55b341023020, backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img", backing_fmt=0x55b33e613afc "raw", warn=false) at ../block.c:4768
#21 0x000055b33e3b2377 in stream_prepare (job=0x55b341030620) at ../block/stream.c:88
#22 0x000055b33e39ba6b in job_prepare (job=0x55b341030620) at ../job.c:781
#23 0x000055b33e39a31f in job_txn_apply (job=0x55b341030620, fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
#24 0x000055b33e39baf4 in job_do_finalize (job=0x55b341030620) at ../job.c:798
#25 0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341030620) at ../job.c:852
#26 0x000055b33e39bd5b in job_completed (job=0x55b341030620) at ../job.c:865
#27 0x000055b33e39bdae in job_exit (opaque=0x55b341030620) at ../job.c:885
#28 0x000055b33e485cfc in aio_bh_call (bh=0x55b340ead840) at ../util/async.c:136
#29 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at ../util/async.c:164
#30 0x000055b33e456e9c in aio_dispatch (ctx=0x55b340d026f0) at ../util/aio-posix.c:381
#31 0x000055b33e486237 in aio_ctx_dispatch (source=0x55b340d026f0, callback=0x0, user_data=0x0) at ../util/async.c:306
#32 0x00007fa84cdceecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#33 0x000055b33e465c7e in glib_pollfds_poll () at ../util/main-loop.c:221
#34 0x000055b33e465cf8 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
#35 0x000055b33e465e00 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
#36 0x000055b33e228979 in qemu_main_loop () at ../softmmu/vl.c:1678
#37 0x000055b33ddb5046 in main (argc=20, argv=0x7ffcc0af6a88, envp=0x7ffcc0af6b30) at ../softmmu/main.c:50


The aio_poll at the start of mirror_exit_common is my addition. But the problem is visible either way: mirror_prepare gets called inside stream_prepare!
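
Here is a rough sketch (simplified, not the real QEMU code paths; the literal file names are made up) of why that nesting is possible:

    /* stream_prepare() rewrites the backing-file entry in the image header ... */
    bdrv_change_backing_file(bs, "base.img", "raw", false);
    /*
     * ... which boils down to a synchronous bdrv_pwrite().  Outside coroutine
     * context a synchronous request is handled roughly like this:
     *
     *     bdrv_coroutine_enter(bs, co);    // start the write coroutine
     *     BDRV_POLL_WHILE(bs, !done);      // nested aio_poll() until it finishes
     *
     * aio_poll() dispatches *any* pending bottom half, so the mirror job's
     * job_exit() BH can run right here, call mirror_prepare(), and start its
     * own graph update while stream_prepare() is still in the middle of one.
     */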

----

OK, moving further: I moved stream_clean, stream_prepare, mirror_exit_common and mirror_complete into coroutine context and protected them all with a new global coroutine mutex. This helps; the bug no longer reproduces.
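
A minimal sketch of that experiment, assuming the completion callbacks have been turned into coroutine_fn; the mutex and wrapper names are hypothetical:

    static CoMutex graph_update_mutex;   /* qemu_co_mutex_init()ed at startup */

    static int coroutine_fn stream_prepare_serialized(Job *job)
    {
        int ret;

        qemu_co_mutex_lock(&graph_update_mutex);
        ret = stream_prepare(job);       /* the graph update now runs exclusively */
        qemu_co_mutex_unlock(&graph_update_mutex);

        return ret;
    }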

Still, I'll try to fully understand what's going on. But my current thinking is:

- graph modification code (permissions, drains, child replacements, etc.) is not safe to call in parallel, for example from different coroutines or through nested aio_poll;

- but these paths all do various kinds of drains, I/O operations, etc., which lead to nested aio_poll calls, and so everything breaks.


-- 
Best regards,
Vladimir

