* iotest 030 still occasionally intermittently failing
@ 2020-11-16 16:16 Peter Maydell
  2020-11-16 17:34 ` Alberto Garcia
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2020-11-16 16:16 UTC (permalink / raw)
  To: QEMU Developers, Qemu-block

Just saw this on a test run on the OpenBSD VM build-and-test,
so this test is still intermittently failing...


  TEST    iotest-qcow2: 030 [fail]
QEMU          --
"/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-system-aarch64"
-nodefaults -display none -accel qtest -machine virt
QEMU_IMG      --
"/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-img"
QEMU_IO       --
"/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-io"
--cache writeback --aio threads -f qcow2
QEMU_NBD      --
"/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-nbd"
IMGFMT        -- qcow2 (compat=1.1)
IMGPROTO      -- file
PLATFORM      -- OpenBSD/amd64 openbsd 6.8
TEST_DIR      -- /home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/scratch
SOCK_DIR      -- /tmp/tmp.XpU6JjBMml
SOCKET_SCM_HELPER --

--- /home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/030.out  Mon
Nov 16 15:33:05 2020
+++ /home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/030.out.bad
 Mon Nov 16 16:03:42 2020
@@ -1,5 +1,47 @@
-...........................
+WARNING:qemu.machine:qemu received signal 6; command:
"/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-system-aarch64
-display none -vga none -chardev
socket,id=mon,path=/tmp/tmp.XpU6JjBMml/qemu-42319-monitor.sock -mon
chardev=mon,mode=control -qtest
unix:path=/tmp/tmp.XpU6JjBMml/qemu-42319-qtest.sock -accel qtest
-nodefaults -display none -accel qtest -machine virt -drive
if=virtio,id=drive0,file=/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/scratch/img-8.img,format=qcow2,cache=writeback,aio=threads,backing.backing.backing.backing.backing.backing.backing.backing.node-name=node0,backing.backing.backing.backing.backing.backing.backing.node-name=node1,backing.backing.backing.backing.backing.backing.node-name=node2,backing.backing.backing.backing.backing.node-name=node3,backing.backing.backing.backing.node-name=node4,backing.backing.backing.node-name=node5,backing.backing.node-name=node6,backing.node-name=node7,node-name=node8"
+............EE..............
+======================================================================
+ERROR: test_stream_commit_2 (__main__.TestParallelOps)
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "030", line 492, in test_stream_commit_2
+    self.assert_no_active_block_jobs()
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/iotests.py",
line 932, in assert_no_active_block_jobs
+    result = self.vm.qmp('query-block-jobs')
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/machine.py",
line 559, in qmp
+    return self._qmp.cmd(cmd, args=qmp_args)
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/qmp.py",
line 278, in cmd
+    return self.cmd_obj(qmp_cmd)
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/qmp.py",
line 259, in cmd_obj
+    raise QMPConnectError("Unexpected empty reply from server")
+qemu.qmp.QMPConnectError: Unexpected empty reply from server
+
+======================================================================
+ERROR: test_stream_commit_2 (__main__.TestParallelOps)
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/machine.py",
line 476, in _do_shutdown
+    self._soft_shutdown(timeout, has_quit)
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/machine.py",
line 456, in _soft_shutdown
+    self._qmp.cmd('quit')
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/qmp.py",
line 278, in cmd
+    return self.cmd_obj(qmp_cmd)
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/qmp.py",
line 256, in cmd_obj
+    self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
+BrokenPipeError: [Errno 32] Broken pipe
+
+The above exception was the direct cause of the following exception:
+
+Traceback (most recent call last):
+  File "030", line 226, in tearDown
+    self.vm.shutdown()
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/machine.py",
line 506, in shutdown
+    self._do_shutdown(timeout, has_quit)
+  File "/home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/../../python/qemu/machine.py",
line 479, in _do_shutdown
+    raise AbnormalShutdown("Could not perform graceful shutdown") \
+qemu.machine.AbnormalShutdown: Could not perform graceful shutdown
+
+----------------------------------------------------------------------
 Ran 27 tests

-OK
+FAILED (errors=2, skipped=1)

thanks
-- PMM



* Re: iotest 030 still occasionally intermittently failing
  2020-11-16 16:16 iotest 030 still occasionally intermittently failing Peter Maydell
@ 2020-11-16 17:34 ` Alberto Garcia
  2020-11-16 17:59   ` Peter Maydell
  0 siblings, 1 reply; 12+ messages in thread
From: Alberto Garcia @ 2020-11-16 17:34 UTC (permalink / raw)
  To: Peter Maydell, QEMU Developers, Qemu-block

On Mon 16 Nov 2020 05:16:35 PM CET, Peter Maydell wrote:
> Just saw this on a test run on the OpenBSD VM build-and-test,
> so this test is still intermittently failing...
>
>
>   TEST    iotest-qcow2: 030 [fail]
> QEMU          --
> "/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-system-aarch64"
> -nodefaults -display none -accel qtest -machine virt
> QEMU_IMG      --
> "/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-img"
> QEMU_IO       --
> "/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-io"
> --cache writeback --aio threads -f qcow2
> QEMU_NBD      --
> "/home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/../../qemu-nbd"
> IMGFMT        -- qcow2 (compat=1.1)
> IMGPROTO      -- file
> PLATFORM      -- OpenBSD/amd64 openbsd 6.8
> TEST_DIR      -- /home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/scratch
> SOCK_DIR      -- /tmp/tmp.XpU6JjBMml
> SOCKET_SCM_HELPER --
>
> --- /home/qemu/qemu-test.h37iBt/src/tests/qemu-iotests/030.out  Mon
> Nov 16 15:33:05 2020
> +++ /home/qemu/qemu-test.h37iBt/build/tests/qemu-iotests/030.out.bad
>  Mon Nov 16 16:03:42 2020
> @@ -1,5 +1,47 @@
> -...........................
> +WARNING:qemu.machine:qemu received signal 6; command:

Do you know if there is a core dump or stack trace available ?

Berto



* Re: iotest 030 still occasionally intermittently failing
  2020-11-16 17:34 ` Alberto Garcia
@ 2020-11-16 17:59   ` Peter Maydell
  2020-11-19 16:11     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2020-11-16 17:59 UTC (permalink / raw)
  To: Alberto Garcia; +Cc: QEMU Developers, Qemu-block

On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
> Do you know if there is a core dump or stack trace available ?

Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
produce, so if you want more diagnostics on failures you have to
arrange for the test harness to produce them...

thanks
-- PMM



* Re: iotest 030 still occasionally intermittently failing
  2020-11-16 17:59   ` Peter Maydell
@ 2020-11-19 16:11     ` Vladimir Sementsov-Ogievskiy
  2020-11-19 16:16       ` Vladimir Sementsov-Ogievskiy
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-11-19 16:11 UTC (permalink / raw)
  To: Peter Maydell, Alberto Garcia; +Cc: QEMU Developers, Qemu-block

16.11.2020 20:59, Peter Maydell wrote:
> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
>> Do you know if there is a core dump or stack trace available ?
> 
> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
> produce, so if you want more diagnostics on failures you have to
> arrange for the test harness to produce them...
> 
> thanks
> -- PMM
> 

Hi!

After some iterations I've reproduced the SIGABRT:

#0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
#1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
#2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
#3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
#4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
#5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00) at ../block.c:2777
#6  0x000055a93374fd9c in bdrv_root_unref_child (child=0x55a9363a3a00) at ../block.c:2789
#7  0x000055a933722e8b in block_job_remove_all_bdrv (job=0x55a935f4f4b0) at ../blockjob.c:191
#8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at ../blockjob.c:88
#9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at ../job.c:380
#10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at ../job.c:894
#11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at ../util/async.c:136
#12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at ../util/async.c:164
#13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at ../util/aio-posix.c:381
#14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0, callback=0x0, user_data=0x0)
     at ../util/async.c:306
#15 0x00007feb71349ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#16 0x000055a933840300 in glib_pollfds_poll () at ../util/main-loop.c:221
#17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
#18 0x000055a933840482 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
#19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
#20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138, envp=0x7ffd48c311e0)

(gdb) fr 4
#4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
2648            assert(tighten_restrictions == false);
(gdb) list
2643            int ret;
2644
2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
2647                                  &tighten_restrictions, NULL);
2648            assert(tighten_restrictions == false);
2649            if (ret < 0) {
2650                /* We only tried to loosen restrictions, so errors are not fatal */
2651                bdrv_abort_perm_update(old_bs);
2652            } else {
(gdb) p tighten_restrictions
$1 = true


Hm, it seems that the problem is that tighten_restrictions is not initialized. I'll send a patch.

-- 
Best regards,
Vladimir



* Re: iotest 030 still occasionally intermittently failing
  2020-11-19 16:11     ` Vladimir Sementsov-Ogievskiy
@ 2020-11-19 16:16       ` Vladimir Sementsov-Ogievskiy
  2020-11-19 18:17       ` Alberto Garcia
  2020-11-19 19:30       ` Vladimir Sementsov-Ogievskiy
  2 siblings, 0 replies; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-11-19 16:16 UTC (permalink / raw)
  To: Peter Maydell, Alberto Garcia; +Cc: QEMU Developers, Qemu-block

19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
> 16.11.2020 20:59, Peter Maydell wrote:
>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
>>> Do you know if there is a core dump or stack trace available ?
>>
>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>> produce, so if you want more diagnostics on failures you have to
>> arrange for the test harness to produce them...
>>
>> thanks
>> -- PMM
>>
> 
> Hi!
> 
> After some iterations I've reproduced on SIGABRT:
> 
> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00) at ../block.c:2777
> #6  0x000055a93374fd9c in bdrv_root_unref_child (child=0x55a9363a3a00) at ../block.c:2789
> #7  0x000055a933722e8b in block_job_remove_all_bdrv (job=0x55a935f4f4b0) at ../blockjob.c:191
> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at ../blockjob.c:88
> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at ../job.c:380
> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at ../job.c:894
> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at ../util/async.c:136
> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at ../util/async.c:164
> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at ../util/aio-posix.c:381
> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0, callback=0x0, user_data=0x0)
>      at ../util/async.c:306
> #15 0x00007feb71349ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> #16 0x000055a933840300 in glib_pollfds_poll () at ../util/main-loop.c:221
> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138, envp=0x7ffd48c311e0)
> 
> (gdb) fr 4
> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
> 2648            assert(tighten_restrictions == false);
> (gdb) list
> 2643            int ret;
> 2644
> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
> 2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
> 2647                                  &tighten_restrictions, NULL);
> 2648            assert(tighten_restrictions == false);
> 2649            if (ret < 0) {
> 2650                /* We only tried to loosen restrictions, so errors are not fatal */
> 2651                bdrv_abort_perm_update(old_bs);
> 2652            } else {
> (gdb) p tighten_restrictions
> $1 = true
> 
> 
> Hm. it seems that the problem is that tighten_restrictions is not initialized. I'll send a patch.
> 

Wrong, bdrv_check_perm initializes it correctly. I'll look more carefully.


-- 
Best regards,
Vladimir



* Re: iotest 030 still occasionally intermittently failing
  2020-11-19 16:11     ` Vladimir Sementsov-Ogievskiy
  2020-11-19 16:16       ` Vladimir Sementsov-Ogievskiy
@ 2020-11-19 18:17       ` Alberto Garcia
  2020-11-19 19:30       ` Vladimir Sementsov-Ogievskiy
  2 siblings, 0 replies; 12+ messages in thread
From: Alberto Garcia @ 2020-11-19 18:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Peter Maydell; +Cc: QEMU Developers, Qemu-block

On Thu 19 Nov 2020 05:11:38 PM CET, Vladimir Sementsov-Ogievskiy wrote:

> After some iterations I've reproduced on SIGABRT:

Great !

I'm however confused about whether this is the same problem that has
been reported in the past; e.g. in this one from August I don't see
any crash:

   https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00136.html

Berto



* Re: iotest 030 still occasionally intermittently failing
  2020-11-19 16:11     ` Vladimir Sementsov-Ogievskiy
  2020-11-19 16:16       ` Vladimir Sementsov-Ogievskiy
  2020-11-19 18:17       ` Alberto Garcia
@ 2020-11-19 19:30       ` Vladimir Sementsov-Ogievskiy
  2020-11-19 19:31         ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-11-19 19:30 UTC (permalink / raw)
  To: Peter Maydell, Alberto Garcia; +Cc: QEMU Developers, Qemu-block

[-- Attachment #1: Type: text/plain, Size: 6140 bytes --]

19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
> 16.11.2020 20:59, Peter Maydell wrote:
>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
>>> Do you know if there is a core dump or stack trace available ?
>>
>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>> produce, so if you want more diagnostics on failures you have to
>> arrange for the test harness to produce them...
>>
>> thanks
>> -- PMM
>>
> 
> Hi!
> 
> After some iterations I've reproduced on SIGABRT:
> 
> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00) at ../block.c:2777
> #6  0x000055a93374fd9c in bdrv_root_unref_child (child=0x55a9363a3a00) at ../block.c:2789
> #7  0x000055a933722e8b in block_job_remove_all_bdrv (job=0x55a935f4f4b0) at ../blockjob.c:191
> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at ../blockjob.c:88
> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at ../job.c:380
> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at ../job.c:894
> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at ../util/async.c:136
> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at ../util/async.c:164
> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at ../util/aio-posix.c:381
> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0, callback=0x0, user_data=0x0)
>      at ../util/async.c:306
> #15 0x00007feb71349ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> #16 0x000055a933840300 in glib_pollfds_poll () at ../util/main-loop.c:221
> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138, envp=0x7ffd48c311e0)
> 
> (gdb) fr 4
> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
> 2648            assert(tighten_restrictions == false);
> (gdb) list
> 2643            int ret;
> 2644
> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
> 2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
> 2647                                  &tighten_restrictions, NULL);
> 2648            assert(tighten_restrictions == false);
> 2649            if (ret < 0) {
> 2650                /* We only tried to loosen restrictions, so errors are not fatal */
> 2651                bdrv_abort_perm_update(old_bs);
> 2652            } else {
> (gdb) p tighten_restrictions
> $1 = true
> 
> 

I've modified the code a bit, to crash when we actually want to set tighten_restrictions to true, and got the following bt:
#0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
#1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
#2  0x000055b9174104d7 in bdrv_check_perm
     (bs=0x55b918f09720, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2009
#3  0x000055b917410ec0 in bdrv_check_update_perm
     (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2280
#4  0x000055b917410f38 in bdrv_child_check_perm
     (c=0x55b91921fcf0, q=0x0, perm=1, shared=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2294
#5  0x000055b91741078c in bdrv_check_perm
     (bs=0x55b918defd90, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x0, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2076
#6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200, new_bs=0x0) at ../block.c:2666
#7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200) at ../block.c:2798
#8  0x000055b917411f5f in bdrv_root_unref_child (child=0x55b919cf6200) at ../block.c:2810
#9  0x000055b9173e4d88 in block_job_remove_all_bdrv (job=0x55b918f06a60) at ../blockjob.c:191
#10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at ../blockjob.c:88
#11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at ../job.c:380
#12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at ../job.c:894
#13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at ../util/async.c:136
#14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at ../util/async.c:164
#15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at ../util/aio-posix.c:381
#16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0, callback=0x0, user_data=0x0)
     at ../util/async.c:306
#17 0x00007f6dbc62decd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#18 0x000055b9175029d9 in glib_pollfds_poll () at ../util/main-loop.c:221
#19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
#20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
#21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
#22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208, envp=0x7fff7f81f2b0)


And the picture taken at the moment of abort (it is the same as at the moment before the bdrv_replace_child call) is attached. So it looks like the graph is already corrupted (you can see that the backing permissions are not propagated to the node2-node0 child).

How the graph was corrupted is still the question...
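(For what it's worth, the abort_on_set_to_true seen in the backtrace above comes from debug instrumentation roughly along the following lines. This is only a sketch to show the idea; the helper name and its placement are illustrative, not the actual patch.)

bool abort_on_set_to_true;

/* Pass &abort_on_set_to_true as the tighten_restrictions out-parameter and
 * abort the moment anything tries to set it, so the backtrace points at the
 * code that tightens the restrictions instead of at the later assert() in
 * bdrv_replace_child(). */
static void note_tighten_restrictions(bool *tighten_restrictions)
{
    if (tighten_restrictions == &abort_on_set_to_true) {
        abort();
    }
    if (tighten_restrictions) {
        *tighten_restrictions = true;
    }
}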

-- 
Best regards,
Vladimir

[-- Attachment #2: abort.png --]
[-- Type: image/png, Size: 77874 bytes --]


* Re: iotest 030 still occasionally intermittently failing
  2020-11-19 19:30       ` Vladimir Sementsov-Ogievskiy
@ 2020-11-19 19:31         ` Vladimir Sementsov-Ogievskiy
  2020-11-19 20:31           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-11-19 19:31 UTC (permalink / raw)
  To: Peter Maydell, Alberto Garcia; +Cc: QEMU Developers, Qemu-block

19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
>> 16.11.2020 20:59, Peter Maydell wrote:
>>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
>>>> Do you know if there is a core dump or stack trace available ?
>>>
>>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>>> produce, so if you want more diagnostics on failures you have to
>>> arrange for the test harness to produce them...
>>>
>>> thanks
>>> -- PMM
>>>
>>
>> Hi!
>>
>> After some iterations I've reproduced on SIGABRT:
>>
>> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
>> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
>> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
>> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00) at ../block.c:2777
>> #6  0x000055a93374fd9c in bdrv_root_unref_child (child=0x55a9363a3a00) at ../block.c:2789
>> #7  0x000055a933722e8b in block_job_remove_all_bdrv (job=0x55a935f4f4b0) at ../blockjob.c:191
>> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at ../blockjob.c:88
>> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at ../job.c:380
>> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at ../job.c:894
>> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at ../util/async.c:136
>> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at ../util/async.c:164
>> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at ../util/aio-posix.c:381
>> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0, callback=0x0, user_data=0x0)
>>      at ../util/async.c:306
>> #15 0x00007feb71349ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>> #16 0x000055a933840300 in glib_pollfds_poll () at ../util/main-loop.c:221
>> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
>> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
>> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
>> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138, envp=0x7ffd48c311e0)
>>
>> (gdb) fr 4
>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>> 2648            assert(tighten_restrictions == false);
>> (gdb) list
>> 2643            int ret;
>> 2644
>> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
>> 2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
>> 2647                                  &tighten_restrictions, NULL);
>> 2648            assert(tighten_restrictions == false);
>> 2649            if (ret < 0) {
>> 2650                /* We only tried to loosen restrictions, so errors are not fatal */
>> 2651                bdrv_abort_perm_update(old_bs);
>> 2652            } else {
>> (gdb) p tighten_restrictions
>> $1 = true
>>
>>
> 
> I've modified code a bit, to crash when we actually want to set tighten_restrictions to true, and get the following bt:
> #0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
> #1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
> #2  0x000055b9174104d7 in bdrv_check_perm
>      (bs=0x55b918f09720, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2009
> #3  0x000055b917410ec0 in bdrv_check_update_perm
>      (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2280
> #4  0x000055b917410f38 in bdrv_child_check_perm
>      (c=0x55b91921fcf0, q=0x0, perm=1, shared=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2294
> #5  0x000055b91741078c in bdrv_check_perm
>      (bs=0x55b918defd90, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x0, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2076
> #6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200, new_bs=0x0) at ../block.c:2666
> #7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200) at ../block.c:2798
> #8  0x000055b917411f5f in bdrv_root_unref_child (child=0x55b919cf6200) at ../block.c:2810
> #9  0x000055b9173e4d88 in block_job_remove_all_bdrv (job=0x55b918f06a60) at ../blockjob.c:191
> #10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at ../blockjob.c:88
> #11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at ../job.c:380
> #12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at ../job.c:894
> #13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at ../util/async.c:136
> #14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at ../util/async.c:164
> #15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at ../util/aio-posix.c:381
> #16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0, callback=0x0, user_data=0x0)
>      at ../util/async.c:306
> #17 0x00007f6dbc62decd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> #18 0x000055b9175029d9 in glib_pollfds_poll () at ../util/main-loop.c:221
> #19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
> #20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
> #21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
> #22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208, envp=0x7fff7f81f2b0)
> 
> 
> and the picture taken at the moment of abort (and it is the same as at the moment before bdrv_replace_child call) is attached. So, it looks like graph is already corrupted (you see that backing permissions are not propagated to node2-node0 child).
> 
> How graph was corrupted it's still the question..
> 

Additional info: it all happens during test_stream_commit_1 in iotest 030.

-- 
Best regards,
Vladimir



* Re: iotest 030 still occasionally intermittently failing
  2020-11-19 19:31         ` Vladimir Sementsov-Ogievskiy
@ 2020-11-19 20:31           ` Vladimir Sementsov-Ogievskiy
  2020-11-20  9:51             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-11-19 20:31 UTC (permalink / raw)
  To: Peter Maydell, Alberto Garcia; +Cc: QEMU Developers, Qemu-block

19.11.2020 22:31, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
>>> 16.11.2020 20:59, Peter Maydell wrote:
>>>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
>>>>> Do you know if there is a core dump or stack trace available ?
>>>>
>>>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>>>> produce, so if you want more diagnostics on failures you have to
>>>> arrange for the test harness to produce them...
>>>>
>>>> thanks
>>>> -- PMM
>>>>
>>>
>>> Hi!
>>>
>>> After some iterations I've reproduced on SIGABRT:
>>>
>>> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
>>> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
>>> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
>>> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>>> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00) at ../block.c:2777
>>> #6  0x000055a93374fd9c in bdrv_root_unref_child (child=0x55a9363a3a00) at ../block.c:2789
>>> #7  0x000055a933722e8b in block_job_remove_all_bdrv (job=0x55a935f4f4b0) at ../blockjob.c:191
>>> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at ../blockjob.c:88
>>> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at ../job.c:380
>>> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at ../job.c:894
>>> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at ../util/async.c:136
>>> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at ../util/async.c:164
>>> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at ../util/aio-posix.c:381
>>> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0, callback=0x0, user_data=0x0)
>>>      at ../util/async.c:306
>>> #15 0x00007feb71349ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>>> #16 0x000055a933840300 in glib_pollfds_poll () at ../util/main-loop.c:221
>>> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
>>> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
>>> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138, envp=0x7ffd48c311e0)
>>>
>>> (gdb) fr 4
>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>>> 2648            assert(tighten_restrictions == false);
>>> (gdb) list
>>> 2643            int ret;
>>> 2644
>>> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
>>> 2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
>>> 2647                                  &tighten_restrictions, NULL);
>>> 2648            assert(tighten_restrictions == false);
>>> 2649            if (ret < 0) {
>>> 2650                /* We only tried to loosen restrictions, so errors are not fatal */
>>> 2651                bdrv_abort_perm_update(old_bs);
>>> 2652            } else {
>>> (gdb) p tighten_restrictions
>>> $1 = true
>>>
>>>
>>
>> I've modified code a bit, to crash when we actually want to set tighten_restrictions to true, and get the following bt:
>> #0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
>> #1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
>> #2  0x000055b9174104d7 in bdrv_check_perm
>>      (bs=0x55b918f09720, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2009
>> #3  0x000055b917410ec0 in bdrv_check_update_perm
>>      (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2280
>> #4  0x000055b917410f38 in bdrv_child_check_perm
>>      (c=0x55b91921fcf0, q=0x0, perm=1, shared=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2294
>> #5  0x000055b91741078c in bdrv_check_perm
>>      (bs=0x55b918defd90, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x0, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2076
>> #6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200, new_bs=0x0) at ../block.c:2666
>> #7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200) at ../block.c:2798
>> #8  0x000055b917411f5f in bdrv_root_unref_child (child=0x55b919cf6200) at ../block.c:2810
>> #9  0x000055b9173e4d88 in block_job_remove_all_bdrv (job=0x55b918f06a60) at ../blockjob.c:191
>> #10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at ../blockjob.c:88
>> #11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at ../job.c:380
>> #12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at ../job.c:894
>> #13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at ../util/async.c:136
>> #14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at ../util/async.c:164
>> #15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at ../util/aio-posix.c:381
>> #16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0, callback=0x0, user_data=0x0)
>>      at ../util/async.c:306
>> #17 0x00007f6dbc62decd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>> #18 0x000055b9175029d9 in glib_pollfds_poll () at ../util/main-loop.c:221
>> #19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
>> #20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
>> #21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
>> #22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208, envp=0x7fff7f81f2b0)
>>
>>
>> and the picture taken at the moment of abort (and it is the same as at the moment before bdrv_replace_child call) is attached. So, it looks like graph is already corrupted (you see that backing permissions are not propagated to node2-node0 child).
>>
>> How graph was corrupted it's still the question..
>>
> 
> additional info: it's all during test_stream_commit_1 in 30 iiotest
> 

OK, after some debugging and looking at block-graph dumps I tend to think that this is a race between the finish (.prepare) of mirror and block-stream. They both do graph updates. Nothing prevents interleaving of graph-updating operations (note that bdrv_replace_child_noperm may do aio_poll), and nothing protects two graph-update processes from intersecting.
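In essence this is plain re-entrancy through a nested event loop. A tiny self-contained toy (plain C, not QEMU code; all names are made up for illustration) shows the shape of the problem:

#include <stdio.h>

static int graph_generation;          /* stands in for the block graph */
static void (*pending_bh)(void);      /* a queued job-completion BH    */

/* stands in for a nested aio_poll() deep inside a graph update */
static void nested_poll(void)
{
    if (pending_bh) {
        void (*bh)(void) = pending_bh;
        pending_bh = NULL;
        bh();
    }
}

static void job_b_complete(void)
{
    printf("job B updates the graph at generation %d\n", graph_generation);
    graph_generation++;
}

static void job_a_complete(void)
{
    int start = graph_generation;
    nested_poll();                    /* e.g. bdrv_replace_child_noperm() polling */
    graph_generation++;
    printf("job A ran from generation %d to %d%s\n", start, graph_generation,
           graph_generation == start + 1 ? "" : " (B interleaved)");
}

int main(void)
{
    pending_bh = job_b_complete;      /* B's completion is already scheduled */
    job_a_complete();                 /* A's graph update gets interrupted   */
    return 0;
}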

Any ideas?

-- 
Best regards,
Vladimir



* Re: iotest 030 still occasionally intermittently failing
  2020-11-19 20:31           ` Vladimir Sementsov-Ogievskiy
@ 2020-11-20  9:51             ` Vladimir Sementsov-Ogievskiy
  2020-11-20 10:34               ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-11-20  9:51 UTC (permalink / raw)
  To: Peter Maydell, Alberto Garcia; +Cc: QEMU Developers, Qemu-block

19.11.2020 23:31, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2020 22:31, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> 16.11.2020 20:59, Peter Maydell wrote:
>>>>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com> wrote:
>>>>>> Do you know if there is a core dump or stack trace available ?
>>>>>
>>>>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>>>>> produce, so if you want more diagnostics on failures you have to
>>>>> arrange for the test harness to produce them...
>>>>>
>>>>> thanks
>>>>> -- PMM
>>>>>
>>>>
>>>> Hi!
>>>>
>>>> After some iterations I've reproduced on SIGABRT:
>>>>
>>>> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
>>>> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
>>>> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
>>>> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>>>> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00) at ../block.c:2777
>>>> #6  0x000055a93374fd9c in bdrv_root_unref_child (child=0x55a9363a3a00) at ../block.c:2789
>>>> #7  0x000055a933722e8b in block_job_remove_all_bdrv (job=0x55a935f4f4b0) at ../blockjob.c:191
>>>> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at ../blockjob.c:88
>>>> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at ../job.c:380
>>>> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at ../job.c:894
>>>> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at ../util/async.c:136
>>>> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at ../util/async.c:164
>>>> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at ../util/aio-posix.c:381
>>>> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0, callback=0x0, user_data=0x0)
>>>>      at ../util/async.c:306
>>>> #15 0x00007feb71349ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>>>> #16 0x000055a933840300 in glib_pollfds_poll () at ../util/main-loop.c:221
>>>> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
>>>> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
>>>> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138, envp=0x7ffd48c311e0)
>>>>
>>>> (gdb) fr 4
>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00, new_bs=0x0) at ../block.c:2648
>>>> 2648            assert(tighten_restrictions == false);
>>>> (gdb) list
>>>> 2643            int ret;
>>>> 2644
>>>> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
>>>> 2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
>>>> 2647                                  &tighten_restrictions, NULL);
>>>> 2648            assert(tighten_restrictions == false);
>>>> 2649            if (ret < 0) {
>>>> 2650                /* We only tried to loosen restrictions, so errors are not fatal */
>>>> 2651                bdrv_abort_perm_update(old_bs);
>>>> 2652            } else {
>>>> (gdb) p tighten_restrictions
>>>> $1 = true
>>>>
>>>>
>>>
>>> I've modified code a bit, to crash when we actually want to set tighten_restrictions to true, and get the following bt:
>>> #0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
>>> #1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
>>> #2  0x000055b9174104d7 in bdrv_check_perm
>>>      (bs=0x55b918f09720, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2009
>>> #3  0x000055b917410ec0 in bdrv_check_update_perm
>>>      (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2280
>>> #4  0x000055b917410f38 in bdrv_child_check_perm
>>>      (c=0x55b91921fcf0, q=0x0, perm=1, shared=21, ignore_children=0x55b918a57b20 = {...}, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2294
>>> #5  0x000055b91741078c in bdrv_check_perm
>>>      (bs=0x55b918defd90, q=0x0, cumulative_perms=1, cumulative_shared_perms=21, ignore_children=0x0, tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>, errp=0x0) at ../block.c:2076
>>> #6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200, new_bs=0x0) at ../block.c:2666
>>> #7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200) at ../block.c:2798
>>> #8  0x000055b917411f5f in bdrv_root_unref_child (child=0x55b919cf6200) at ../block.c:2810
>>> #9  0x000055b9173e4d88 in block_job_remove_all_bdrv (job=0x55b918f06a60) at ../blockjob.c:191
>>> #10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at ../blockjob.c:88
>>> #11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at ../job.c:380
>>> #12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at ../job.c:894
>>> #13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at ../util/async.c:136
>>> #14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at ../util/async.c:164
>>> #15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at ../util/aio-posix.c:381
>>> #16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0, callback=0x0, user_data=0x0)
>>>      at ../util/async.c:306
>>> #17 0x00007f6dbc62decd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>>> #18 0x000055b9175029d9 in glib_pollfds_poll () at ../util/main-loop.c:221
>>> #19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
>>> #20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
>>> #21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>> #22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208, envp=0x7fff7f81f2b0)
>>>
>>>
>>> and the picture taken at the moment of abort (and it is the same as at the moment before bdrv_replace_child call) is attached. So, it looks like graph is already corrupted (you see that backing permissions are not propagated to node2-node0 child).
>>>
>>> How graph was corrupted it's still the question..
>>>
>>
>> additional info: it's all during test_stream_commit_1 in 30 iiotest
>>
> 
> OK, after some debugging and looking at block-graph dumps I tend to think that this a race between finish (.prepare) of mirror and block-stream. They do graph updates. Nothing prevents interleaving of graph-updating operations (note that bdrv_replace_child_noperm may do aio_poll). And nothing protects two processes of graph-update from intersection.
> 
> Any ideas?
> 

Wow, I get the following:

#0  0x00007fa84bcf97b6 in ppoll () at /lib64/libc.so.6
#1  0x000055b33e46321b in qemu_poll_ns (fds=0x55b340d28b60, nfds=1, timeout=600000000000) at ../util/qemu-timer.c:349
#2  0x000055b33e495c33 in fdmon_poll_wait (ctx=0x55b340d026f0, ready_list=0x7ffcc0af5f30, timeout=600000000000) at ../util/fdmon-poll.c:80
#3  0x000055b33e4575b0 in aio_poll (ctx=0x55b340d026f0, blocking=true) at ../util/aio-posix.c:607
#4  0x000055b33e302c41 in mirror_exit_common (job=0x55b341e94d20) at ../block/mirror.c:648
#5  0x000055b33e3031e3 in mirror_prepare (job=0x55b341e94d20) at ../block/mirror.c:776
#6  0x000055b33e39ba6b in job_prepare (job=0x55b341e94d20) at ../job.c:781
#7  0x000055b33e39a31f in job_txn_apply (job=0x55b341e94d20, fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
#8  0x000055b33e39baf4 in job_do_finalize (job=0x55b341e94d20) at ../job.c:798
#9  0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341e94d20) at ../job.c:852
#10 0x000055b33e39bd5b in job_completed (job=0x55b341e94d20) at ../job.c:865
#11 0x000055b33e39bdae in job_exit (opaque=0x55b341e94d20) at ../job.c:885
#12 0x000055b33e485cfc in aio_bh_call (bh=0x7fa830004100) at ../util/async.c:136
#13 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at ../util/async.c:164
#14 0x000055b33e4577a3 in aio_poll (ctx=0x55b340d026f0, blocking=true) at ../util/aio-posix.c:659
#15 0x000055b33e2fa4d9 in bdrv_poll_co (s=0x7ffcc0af62e0) at /work/src/qemu/master/block/block-gen.h:44
#16 0x000055b33e2faff4 in bdrv_pwritev (child=0x55b340ed0f60, offset=0, bytes=65536, qiov=0x7ffcc0af6370, flags=0) at block/block-gen.c:408
#17 0x000055b33e34be9b in bdrv_pwrite (child=0x55b340ed0f60, offset=0, buf=0x55b341d10000, bytes=65536) at ../block/io.c:984
#18 0x000055b33e386c6f in qcow2_update_header (bs=0x55b341023020) at ../block/qcow2.c:3037
#19 0x000055b33e386e53 in qcow2_change_backing_file (bs=0x55b341023020, backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img", backing_fmt=0x55b33e613afc "raw") at ../block/qcow2.c:3074
#20 0x000055b33e379908 in bdrv_change_backing_file (bs=0x55b341023020, backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img", backing_fmt=0x55b33e613afc "raw", warn=false) at ../block.c:4768
#21 0x000055b33e3b2377 in stream_prepare (job=0x55b341030620) at ../block/stream.c:88
#22 0x000055b33e39ba6b in job_prepare (job=0x55b341030620) at ../job.c:781
#23 0x000055b33e39a31f in job_txn_apply (job=0x55b341030620, fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
#24 0x000055b33e39baf4 in job_do_finalize (job=0x55b341030620) at ../job.c:798
#25 0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341030620) at ../job.c:852
#26 0x000055b33e39bd5b in job_completed (job=0x55b341030620) at ../job.c:865
#27 0x000055b33e39bdae in job_exit (opaque=0x55b341030620) at ../job.c:885
#28 0x000055b33e485cfc in aio_bh_call (bh=0x55b340ead840) at ../util/async.c:136
#29 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at ../util/async.c:164
#30 0x000055b33e456e9c in aio_dispatch (ctx=0x55b340d026f0) at ../util/aio-posix.c:381
#31 0x000055b33e486237 in aio_ctx_dispatch (source=0x55b340d026f0, callback=0x0, user_data=0x0) at ../util/async.c:306
#32 0x00007fa84cdceecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#33 0x000055b33e465c7e in glib_pollfds_poll () at ../util/main-loop.c:221
#34 0x000055b33e465cf8 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
#35 0x000055b33e465e00 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
#36 0x000055b33e228979 in qemu_main_loop () at ../softmmu/vl.c:1678
#37 0x000055b33ddb5046 in main (argc=20, argv=0x7ffcc0af6a88, envp=0x7ffcc0af6b30) at ../softmmu/main.c:50


The aio_poll at the start of mirror_exit_common is my addition. But anyway, the problem is here: we call mirror_prepare inside of stream_prepare!

----

OK, moving further: I moved stream_clean, stream_prepare, mirror_exit_common and mirror_complete into coroutine context and protected them all with a new global coroutine mutex. This helps: now the bug doesn't reproduce.
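Roughly, the serialization looks like this (sketch only; the mutex name and the wrapper are illustrative, not the actual patch, and only QEMU's existing CoMutex / qemu_co_mutex_lock / qemu_co_mutex_unlock primitives are assumed):

static CoMutex graph_update_lock;     /* qemu_co_mutex_init() at startup */

/* Every finishing path that rewrites the graph takes the same lock, so two
 * jobs can no longer interleave their graph updates via nested aio_poll(). */
static int coroutine_fn stream_prepare_locked(Job *job)
{
    int ret;

    qemu_co_mutex_lock(&graph_update_lock);
    ret = stream_prepare(job);        /* the existing .prepare body */
    qemu_co_mutex_unlock(&graph_update_lock);

    return ret;
}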

Still, I'll try to finally understand what's going on. But I think now that:

- graph modification code (permissions, drains, child replacements, etc.) is not safe to call in parallel, for example in different coroutines or through nested aio_poll;

- but these paths all do different kinds of drains, I/O operations, etc., which lead to nested aio polls, so everything breaks.


-- 
Best regards,
Vladimir



* Re: iotest 030 still occasionally intermittently failing
  2020-11-20  9:51             ` Vladimir Sementsov-Ogievskiy
@ 2020-11-20 10:34               ` Philippe Mathieu-Daudé
  2020-11-20 12:48                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 12+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-11-20 10:34 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Peter Maydell, Alberto Garcia
  Cc: Paolo Bonzini, Stefan Hajnoczi, QEMU Developers, Qemu-block

Cc'ing Stefan/Paolo

On 11/20/20 10:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2020 23:31, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2020 22:31, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
>>>> 19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 16.11.2020 20:59, Peter Maydell wrote:
>>>>>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com>
>>>>>> wrote:
>>>>>>> Do you know if there is a core dump or stack trace available ?
>>>>>>
>>>>>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>>>>>> produce, so if you want more diagnostics on failures you have to
>>>>>> arrange for the test harness to produce them...
>>>>>>
>>>>>> thanks
>>>>>> -- PMM
>>>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>> After some iterations I've reproduced on SIGABRT:
>>>>>
>>>>> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
>>>>> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
>>>>> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
>>>>> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
>>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
>>>>> new_bs=0x0) at ../block.c:2648
>>>>> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00)
>>>>> at ../block.c:2777
>>>>> #6  0x000055a93374fd9c in bdrv_root_unref_child
>>>>> (child=0x55a9363a3a00) at ../block.c:2789
>>>>> #7  0x000055a933722e8b in block_job_remove_all_bdrv
>>>>> (job=0x55a935f4f4b0) at ../blockjob.c:191
>>>>> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at
>>>>> ../blockjob.c:88
>>>>> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at
>>>>> ../job.c:380
>>>>> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at
>>>>> ../job.c:894
>>>>> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at
>>>>> ../util/async.c:136
>>>>> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at
>>>>> ../util/async.c:164
>>>>> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at
>>>>> ../util/aio-posix.c:381
>>>>> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0,
>>>>> callback=0x0, user_data=0x0)
>>>>>      at ../util/async.c:306
>>>>> #15 0x00007feb71349ecd in g_main_context_dispatch () at
>>>>> /lib64/libglib-2.0.so.0
>>>>> #16 0x000055a933840300 in glib_pollfds_poll () at
>>>>> ../util/main-loop.c:221
>>>>> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at
>>>>> ../util/main-loop.c:244
>>>>> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at
>>>>> ../util/main-loop.c:520
>>>>> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>>> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138,
>>>>> envp=0x7ffd48c311e0)
>>>>>
>>>>> (gdb) fr 4
>>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
>>>>> new_bs=0x0) at ../block.c:2648
>>>>> 2648            assert(tighten_restrictions == false);
>>>>> (gdb) list
>>>>> 2643            int ret;
>>>>> 2644
>>>>> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
>>>>> 2646            ret = bdrv_check_perm(old_bs, NULL, perm,
>>>>> shared_perm, NULL,
>>>>> 2647                                  &tighten_restrictions, NULL);
>>>>> 2648            assert(tighten_restrictions == false);
>>>>> 2649            if (ret < 0) {
>>>>> 2650                /* We only tried to loosen restrictions, so
>>>>> errors are not fatal */
>>>>> 2651                bdrv_abort_perm_update(old_bs);
>>>>> 2652            } else {
>>>>> (gdb) p tighten_restrictions
>>>>> $1 = true
>>>>>
>>>>>
>>>>
>>>> I've modified code a bit, to crash when we actually want to set
>>>> tighten_restrictions to true, and get the following bt:
>>>> #0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
>>>> #1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
>>>> #2  0x000055b9174104d7 in bdrv_check_perm
>>>>      (bs=0x55b918f09720, q=0x0, cumulative_perms=1,
>>>> cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...},
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2009
>>>> #3  0x000055b917410ec0 in bdrv_check_update_perm
>>>>      (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21,
>>>> ignore_children=0x55b918a57b20 = {...},
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2280
>>>> #4  0x000055b917410f38 in bdrv_child_check_perm
>>>>      (c=0x55b91921fcf0, q=0x0, perm=1, shared=21,
>>>> ignore_children=0x55b918a57b20 = {...},
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2294
>>>> #5  0x000055b91741078c in bdrv_check_perm
>>>>      (bs=0x55b918defd90, q=0x0, cumulative_perms=1,
>>>> cumulative_shared_perms=21, ignore_children=0x0,
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2076
>>>> #6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200,
>>>> new_bs=0x0) at ../block.c:2666
>>>> #7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200)
>>>> at ../block.c:2798
>>>> #8  0x000055b917411f5f in bdrv_root_unref_child
>>>> (child=0x55b919cf6200) at ../block.c:2810
>>>> #9  0x000055b9173e4d88 in block_job_remove_all_bdrv
>>>> (job=0x55b918f06a60) at ../blockjob.c:191
>>>> #10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at
>>>> ../blockjob.c:88
>>>> #11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at
>>>> ../job.c:380
>>>> #12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at
>>>> ../job.c:894
>>>> #13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at
>>>> ../util/async.c:136
>>>> #14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at
>>>> ../util/async.c:164
>>>> #15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at
>>>> ../util/aio-posix.c:381
>>>> #16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0,
>>>> callback=0x0, user_data=0x0)
>>>>      at ../util/async.c:306
>>>> #17 0x00007f6dbc62decd in g_main_context_dispatch () at
>>>> /lib64/libglib-2.0.so.0
>>>> #18 0x000055b9175029d9 in glib_pollfds_poll () at
>>>> ../util/main-loop.c:221
>>>> #19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at
>>>> ../util/main-loop.c:244
>>>> #20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at
>>>> ../util/main-loop.c:520
>>>> #21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>> #22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208,
>>>> envp=0x7fff7f81f2b0)
>>>>
>>>>
>>>> and the picture taken at the moment of abort (and it is the same as
>>>> at the moment before bdrv_replace_child call) is attached. So, it
>>>> looks like graph is already corrupted (you see that backing
>>>> permissions are not propagated to node2-node0 child).
>>>>
>>>> How graph was corrupted it's still the question..
>>>>
>>>
>>> additional info: it's all during test_stream_commit_1 in 30 iiotest
>>>
>>
>> OK, after some debugging and looking at block-graph dumps I tend to
>> think that this a race between finish (.prepare) of mirror and
>> block-stream. They do graph updates. Nothing prevents interleaving of
>> graph-updating operations (note that bdrv_replace_child_noperm may do
>> aio_poll). And nothing protects two processes of graph-update from
>> intersection.
>>
>> Any ideas?
>>
> 
> Wow, I get the following:
> 
> #0  0x00007fa84bcf97b6 in ppoll () at /lib64/libc.so.6
> #1  0x000055b33e46321b in qemu_poll_ns (fds=0x55b340d28b60, nfds=1,
> timeout=600000000000) at ../util/qemu-timer.c:349
> #2  0x000055b33e495c33 in fdmon_poll_wait (ctx=0x55b340d026f0,
> ready_list=0x7ffcc0af5f30, timeout=600000000000) at ../util/fdmon-poll.c:80
> #3  0x000055b33e4575b0 in aio_poll (ctx=0x55b340d026f0, blocking=true)
> at ../util/aio-posix.c:607
> #4  0x000055b33e302c41 in mirror_exit_common (job=0x55b341e94d20) at
> ../block/mirror.c:648
> #5  0x000055b33e3031e3 in mirror_prepare (job=0x55b341e94d20) at
> ../block/mirror.c:776
> #6  0x000055b33e39ba6b in job_prepare (job=0x55b341e94d20) at ../job.c:781
> #7  0x000055b33e39a31f in job_txn_apply (job=0x55b341e94d20,
> fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
> #8  0x000055b33e39baf4 in job_do_finalize (job=0x55b341e94d20) at
> ../job.c:798
> #9  0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341e94d20)
> at ../job.c:852
> #10 0x000055b33e39bd5b in job_completed (job=0x55b341e94d20) at
> ../job.c:865
> #11 0x000055b33e39bdae in job_exit (opaque=0x55b341e94d20) at ../job.c:885
> #12 0x000055b33e485cfc in aio_bh_call (bh=0x7fa830004100) at
> ../util/async.c:136
> #13 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
> ../util/async.c:164
> #14 0x000055b33e4577a3 in aio_poll (ctx=0x55b340d026f0, blocking=true)
> at ../util/aio-posix.c:659
> #15 0x000055b33e2fa4d9 in bdrv_poll_co (s=0x7ffcc0af62e0) at
> /work/src/qemu/master/block/block-gen.h:44
> #16 0x000055b33e2faff4 in bdrv_pwritev (child=0x55b340ed0f60, offset=0,
> bytes=65536, qiov=0x7ffcc0af6370, flags=0) at block/block-gen.c:408
> #17 0x000055b33e34be9b in bdrv_pwrite (child=0x55b340ed0f60, offset=0,
> buf=0x55b341d10000, bytes=65536) at ../block/io.c:984
> #18 0x000055b33e386c6f in qcow2_update_header (bs=0x55b341023020) at
> ../block/qcow2.c:3037
> #19 0x000055b33e386e53 in qcow2_change_backing_file (bs=0x55b341023020,
> backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
> backing_fmt=0x55b33e613afc "raw") at ../block/qcow2.c:3074
> #20 0x000055b33e379908 in bdrv_change_backing_file (bs=0x55b341023020,
> backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
> backing_fmt=0x55b33e613afc "raw", warn=false) at ../block.c:4768
> #21 0x000055b33e3b2377 in stream_prepare (job=0x55b341030620) at
> ../block/stream.c:88
> #22 0x000055b33e39ba6b in job_prepare (job=0x55b341030620) at ../job.c:781
> #23 0x000055b33e39a31f in job_txn_apply (job=0x55b341030620,
> fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
> #24 0x000055b33e39baf4 in job_do_finalize (job=0x55b341030620) at
> ../job.c:798
> #25 0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341030620)
> at ../job.c:852
> #26 0x000055b33e39bd5b in job_completed (job=0x55b341030620) at
> ../job.c:865
> #27 0x000055b33e39bdae in job_exit (opaque=0x55b341030620) at ../job.c:885
> #28 0x000055b33e485cfc in aio_bh_call (bh=0x55b340ead840) at
> ../util/async.c:136
> #29 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
> ../util/async.c:164
> #30 0x000055b33e456e9c in aio_dispatch (ctx=0x55b340d026f0) at
> ../util/aio-posix.c:381
> #31 0x000055b33e486237 in aio_ctx_dispatch (source=0x55b340d026f0,
> callback=0x0, user_data=0x0) at ../util/async.c:306
> #32 0x00007fa84cdceecd in g_main_context_dispatch () at
> /lib64/libglib-2.0.so.0
> #33 0x000055b33e465c7e in glib_pollfds_poll () at ../util/main-loop.c:221
> #34 0x000055b33e465cf8 in os_host_main_loop_wait (timeout=0) at
> ../util/main-loop.c:244
> #35 0x000055b33e465e00 in main_loop_wait (nonblocking=0) at
> ../util/main-loop.c:520
> #36 0x000055b33e228979 in qemu_main_loop () at ../softmmu/vl.c:1678
> #37 0x000055b33ddb5046 in main (argc=20, argv=0x7ffcc0af6a88,
> envp=0x7ffcc0af6b30) at ../softmmu/main.c:50
> 
> 
> The aio_poll at the start of mirror_exit_common is my addition. But in
> any case the problem is here: we end up calling mirror_prepare inside
> stream_prepare!
> 
> ----
> 
> Ok, moving further: I moved stream_clean, stream_prepare,
> mirror_exit_common and mirror_complete into coroutine context and
> protected them all with a new global coroutine mutex. This helps; the
> bug no longer reproduces.
> 
> Still, I'll try to understand exactly what's going on. But I think now
> that:
> 
> - graph modification code (permissions, drains, child replacements,
> etc.) is not safe to call in parallel, for example from different
> coroutines or through nested aio_poll;
> 
> - but it all does different kinds of drains, I/O operations, etc.,
> which lead to nested aio_polls, and everything breaks.
> 
> 




* Re: iotest 030 still occasionally intermittently failing
  2020-11-20 10:34               ` Philippe Mathieu-Daudé
@ 2020-11-20 12:48                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-11-20 12:48 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Peter Maydell, Alberto Garcia
  Cc: QEMU Developers, Qemu-block, Stefan Hajnoczi, Paolo Bonzini

20.11.2020 13:34, Philippe Mathieu-Daudé wrote:
> Cc'ing Stefan/Paolo
> 
> On 11/20/20 10:51 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2020 23:31, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.11.2020 22:31, Vladimir Sementsov-Ogievskiy wrote:
>>>> 19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 16.11.2020 20:59, Peter Maydell wrote:
>>>>>>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com>
>>>>>>> wrote:
>>>>>>>> Do you know if there is a core dump or stack trace available ?
>>>>>>>
>>>>>>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>>>>>>> produce, so if you want more diagnostics on failures you have to
>>>>>>> arrange for the test harness to produce them...
>>>>>>>
>>>>>>> thanks
>>>>>>> -- PMM
>>>>>>>
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> After some iterations I've reproduced on SIGABRT:
>>>>>>
>>>>>> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
>>>>>> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
>>>>>> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
>>>>>> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
>>>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
>>>>>> new_bs=0x0) at ../block.c:2648
>>>>>> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00)
>>>>>> at ../block.c:2777
>>>>>> #6  0x000055a93374fd9c in bdrv_root_unref_child
>>>>>> (child=0x55a9363a3a00) at ../block.c:2789
>>>>>> #7  0x000055a933722e8b in block_job_remove_all_bdrv
>>>>>> (job=0x55a935f4f4b0) at ../blockjob.c:191
>>>>>> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at
>>>>>> ../blockjob.c:88
>>>>>> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at
>>>>>> ../job.c:380
>>>>>> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at
>>>>>> ../job.c:894
>>>>>> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at
>>>>>> ../util/async.c:136
>>>>>> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at
>>>>>> ../util/async.c:164
>>>>>> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at
>>>>>> ../util/aio-posix.c:381
>>>>>> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0,
>>>>>> callback=0x0, user_data=0x0)
>>>>>>       at ../util/async.c:306
>>>>>> #15 0x00007feb71349ecd in g_main_context_dispatch () at
>>>>>> /lib64/libglib-2.0.so.0
>>>>>> #16 0x000055a933840300 in glib_pollfds_poll () at
>>>>>> ../util/main-loop.c:221
>>>>>> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at
>>>>>> ../util/main-loop.c:244
>>>>>> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at
>>>>>> ../util/main-loop.c:520
>>>>>> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>>>> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138,
>>>>>> envp=0x7ffd48c311e0)
>>>>>>
>>>>>> (gdb) fr 4
>>>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
>>>>>> new_bs=0x0) at ../block.c:2648
>>>>>> 2648            assert(tighten_restrictions == false);
>>>>>> (gdb) list
>>>>>> 2643            int ret;
>>>>>> 2644
>>>>>> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
>>>>>> 2646            ret = bdrv_check_perm(old_bs, NULL, perm,
>>>>>> shared_perm, NULL,
>>>>>> 2647                                  &tighten_restrictions, NULL);
>>>>>> 2648            assert(tighten_restrictions == false);
>>>>>> 2649            if (ret < 0) {
>>>>>> 2650                /* We only tried to loosen restrictions, so
>>>>>> errors are not fatal */
>>>>>> 2651                bdrv_abort_perm_update(old_bs);
>>>>>> 2652            } else {
>>>>>> (gdb) p tighten_restrictions
>>>>>> $1 = true
>>>>>>
>>>>>>
>>>>>
>>>>> I've modified code a bit, to crash when we actually want to set
>>>>> tighten_restrictions to true, and get the following bt:
>>>>> #0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
>>>>> #1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
>>>>> #2  0x000055b9174104d7 in bdrv_check_perm
>>>>>       (bs=0x55b918f09720, q=0x0, cumulative_perms=1,
>>>>> cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...},
>>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>>> errp=0x0) at ../block.c:2009
>>>>> #3  0x000055b917410ec0 in bdrv_check_update_perm
>>>>>       (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21,
>>>>> ignore_children=0x55b918a57b20 = {...},
>>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>>> errp=0x0) at ../block.c:2280
>>>>> #4  0x000055b917410f38 in bdrv_child_check_perm
>>>>>       (c=0x55b91921fcf0, q=0x0, perm=1, shared=21,
>>>>> ignore_children=0x55b918a57b20 = {...},
>>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>>> errp=0x0) at ../block.c:2294
>>>>> #5  0x000055b91741078c in bdrv_check_perm
>>>>>       (bs=0x55b918defd90, q=0x0, cumulative_perms=1,
>>>>> cumulative_shared_perms=21, ignore_children=0x0,
>>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>>> errp=0x0) at ../block.c:2076
>>>>> #6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200,
>>>>> new_bs=0x0) at ../block.c:2666
>>>>> #7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200)
>>>>> at ../block.c:2798
>>>>> #8  0x000055b917411f5f in bdrv_root_unref_child
>>>>> (child=0x55b919cf6200) at ../block.c:2810
>>>>> #9  0x000055b9173e4d88 in block_job_remove_all_bdrv
>>>>> (job=0x55b918f06a60) at ../blockjob.c:191
>>>>> #10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at
>>>>> ../blockjob.c:88
>>>>> #11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at
>>>>> ../job.c:380
>>>>> #12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at
>>>>> ../job.c:894
>>>>> #13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at
>>>>> ../util/async.c:136
>>>>> #14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at
>>>>> ../util/async.c:164
>>>>> #15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at
>>>>> ../util/aio-posix.c:381
>>>>> #16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0,
>>>>> callback=0x0, user_data=0x0)
>>>>>       at ../util/async.c:306
>>>>> #17 0x00007f6dbc62decd in g_main_context_dispatch () at
>>>>> /lib64/libglib-2.0.so.0
>>>>> #18 0x000055b9175029d9 in glib_pollfds_poll () at
>>>>> ../util/main-loop.c:221
>>>>> #19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at
>>>>> ../util/main-loop.c:244
>>>>> #20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at
>>>>> ../util/main-loop.c:520
>>>>> #21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>>> #22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208,
>>>>> envp=0x7fff7f81f2b0)
>>>>>
>>>>>
>>>>> and the picture taken at the moment of the abort (it is the same as
>>>>> at the moment before the bdrv_replace_child call) is attached. So it
>>>>> looks like the graph is already corrupted (you can see that backing
>>>>> permissions are not propagated to the node2-node0 child).
>>>>>
>>>>> How the graph got corrupted is still the question..
>>>>>
>>>>
>>>> additional info: it all happens during test_stream_commit_1 in iotest 030
>>>>
>>>
>>> OK, after some debugging and looking at block-graph dumps I tend to
>>> think that this is a race between the finish (.prepare) phases of
>>> mirror and block-stream. Both do graph updates. Nothing prevents
>>> interleaving of graph-updating operations (note that
>>> bdrv_replace_child_noperm may call aio_poll), and nothing protects two
>>> concurrent graph updates from intersecting.
>>>
>>> Any ideas?
>>>
>>
>> Wow, I get the following:
>>
>> #0  0x00007fa84bcf97b6 in ppoll () at /lib64/libc.so.6
>> #1  0x000055b33e46321b in qemu_poll_ns (fds=0x55b340d28b60, nfds=1,
>> timeout=600000000000) at ../util/qemu-timer.c:349
>> #2  0x000055b33e495c33 in fdmon_poll_wait (ctx=0x55b340d026f0,
>> ready_list=0x7ffcc0af5f30, timeout=600000000000) at ../util/fdmon-poll.c:80
>> #3  0x000055b33e4575b0 in aio_poll (ctx=0x55b340d026f0, blocking=true)
>> at ../util/aio-posix.c:607
>> #4  0x000055b33e302c41 in mirror_exit_common (job=0x55b341e94d20) at
>> ../block/mirror.c:648
>> #5  0x000055b33e3031e3 in mirror_prepare (job=0x55b341e94d20) at
>> ../block/mirror.c:776
>> #6  0x000055b33e39ba6b in job_prepare (job=0x55b341e94d20) at ../job.c:781
>> #7  0x000055b33e39a31f in job_txn_apply (job=0x55b341e94d20,
>> fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
>> #8  0x000055b33e39baf4 in job_do_finalize (job=0x55b341e94d20) at
>> ../job.c:798
>> #9  0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341e94d20)
>> at ../job.c:852
>> #10 0x000055b33e39bd5b in job_completed (job=0x55b341e94d20) at
>> ../job.c:865
>> #11 0x000055b33e39bdae in job_exit (opaque=0x55b341e94d20) at ../job.c:885
>> #12 0x000055b33e485cfc in aio_bh_call (bh=0x7fa830004100) at
>> ../util/async.c:136
>> #13 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
>> ../util/async.c:164
>> #14 0x000055b33e4577a3 in aio_poll (ctx=0x55b340d026f0, blocking=true)
>> at ../util/aio-posix.c:659
>> #15 0x000055b33e2fa4d9 in bdrv_poll_co (s=0x7ffcc0af62e0) at
>> /work/src/qemu/master/block/block-gen.h:44
>> #16 0x000055b33e2faff4 in bdrv_pwritev (child=0x55b340ed0f60, offset=0,
>> bytes=65536, qiov=0x7ffcc0af6370, flags=0) at block/block-gen.c:408
>> #17 0x000055b33e34be9b in bdrv_pwrite (child=0x55b340ed0f60, offset=0,
>> buf=0x55b341d10000, bytes=65536) at ../block/io.c:984
>> #18 0x000055b33e386c6f in qcow2_update_header (bs=0x55b341023020) at
>> ../block/qcow2.c:3037
>> #19 0x000055b33e386e53 in qcow2_change_backing_file (bs=0x55b341023020,
>> backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
>> backing_fmt=0x55b33e613afc "raw") at ../block/qcow2.c:3074
>> #20 0x000055b33e379908 in bdrv_change_backing_file (bs=0x55b341023020,
>> backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
>> backing_fmt=0x55b33e613afc "raw", warn=false) at ../block.c:4768
>> #21 0x000055b33e3b2377 in stream_prepare (job=0x55b341030620) at
>> ../block/stream.c:88
>> #22 0x000055b33e39ba6b in job_prepare (job=0x55b341030620) at ../job.c:781
>> #23 0x000055b33e39a31f in job_txn_apply (job=0x55b341030620,
>> fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
>> #24 0x000055b33e39baf4 in job_do_finalize (job=0x55b341030620) at
>> ../job.c:798
>> #25 0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341030620)
>> at ../job.c:852
>> #26 0x000055b33e39bd5b in job_completed (job=0x55b341030620) at
>> ../job.c:865
>> #27 0x000055b33e39bdae in job_exit (opaque=0x55b341030620) at ../job.c:885
>> #28 0x000055b33e485cfc in aio_bh_call (bh=0x55b340ead840) at
>> ../util/async.c:136
>> #29 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
>> ../util/async.c:164
>> #30 0x000055b33e456e9c in aio_dispatch (ctx=0x55b340d026f0) at
>> ../util/aio-posix.c:381
>> #31 0x000055b33e486237 in aio_ctx_dispatch (source=0x55b340d026f0,
>> callback=0x0, user_data=0x0) at ../util/async.c:306
>> #32 0x00007fa84cdceecd in g_main_context_dispatch () at
>> /lib64/libglib-2.0.so.0
>> #33 0x000055b33e465c7e in glib_pollfds_poll () at ../util/main-loop.c:221
>> #34 0x000055b33e465cf8 in os_host_main_loop_wait (timeout=0) at
>> ../util/main-loop.c:244
>> #35 0x000055b33e465e00 in main_loop_wait (nonblocking=0) at
>> ../util/main-loop.c:520
>> #36 0x000055b33e228979 in qemu_main_loop () at ../softmmu/vl.c:1678
>> #37 0x000055b33ddb5046 in main (argc=20, argv=0x7ffcc0af6a88,
>> envp=0x7ffcc0af6b30) at ../softmmu/main.c:50
>>
>>
>> The aio_poll at the start of mirror_exit_common is my addition. But in
>> any case the problem is here: we end up calling mirror_prepare inside
>> stream_prepare!
>>
>> ----
>>
>> Ok, moving further: I moved stream_clean, stream_prepare,
>> mirror_exit_common and mirror_complete into coroutine context and
>> protected them all with a new global coroutine mutex. This helps; the
>> bug no longer reproduces.
>>
>> Still, I'll try to understand exactly what's going on. But I think now
>> that:
>>
>> - graph modification code (permissions, drains, child replacements,
>> etc.) is not safe to call in parallel, for example from different
>> coroutines or through nested aio_poll;
>>
>> - but it all does different kinds of drains, I/O operations, etc.,
>> which lead to nested aio_polls, and everything breaks.
>>


My debugging log of the problem:

stream_start start
../block/stream.c:287 setro 0
bdrv_reopen_multiple enter
bdrv_reopen_multiple exit
stream_start end
../block/mirror.c:1889 setro 0
bdrv_reopen_multiple enter
bdrv_reopen_multiple exit
mirror_start_job start
mirror_start_job end
job_exit
bdrv_co_stream_prepare start
bdrv_co_stream_prepare end
bdrv_co_mirror_complete start
bdrv_co_mirror_complete end
stream_clean start
../block/stream.c:117 setro 1
bdrv_reopen_multiple enter
job_exit
bdrv_co_mirror_exit_common start
set no perm for mirror_top
replace mirror_top
replaceD mirror_top
bdrv_co_mirror_exit_common end
job_exit end
bdrv_reopen_multiple exit
stream_clean exit


So, we start bdrv_reopen_multiple in stream_clean and then switch to mirror_exit_common. Note that mirror_exit_common at some point drops permissions on the mirror_top node, which is still in the graph, so the empty permissions may propagate to the whole backing chain. In any case we have a nested permission update, which leads to unpredictable results.
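
To make the pattern concrete, here is a minimal, self-contained sketch
(plain Python, not QEMU code; all names are invented for illustration)
of how a nested poll inside one job's completion step can run the other
job's completion in the middle of it:

import collections

pending = collections.deque()   # stands in for the AioContext BH list
graph_log = []                  # records how the "graph edits" interleave

def poll_once():
    # stand-in for a nested aio_poll(): run whatever BH is pending
    if pending:
        pending.popleft()()

def toy_stream_prepare():
    graph_log.append("stream: begin graph update")
    # the real stream_prepare does I/O (qcow2 header update); waiting
    # for that I/O is a nested poll that may run another job's BH
    poll_once()
    graph_log.append("stream: end graph update")

def toy_mirror_prepare():
    graph_log.append("mirror: begin graph update")
    graph_log.append("mirror: end graph update")

pending.append(toy_mirror_prepare)   # mirror's completion BH is queued
toy_stream_prepare()                 # stream completes first, polls inside
print("\n".join(graph_log))
# stream: begin graph update
# mirror: begin graph update   <-- runs *inside* the stream's update
# mirror: end graph update
# stream: end graph update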

I don't want to continue digging; it doesn't seem to make sense anymore.


So, what I can summarize:

1. The bug can be worked around by moving the graph-changing operations in block jobs into coroutine context and serializing them with a coroutine mutex (a toy sketch of this idea follows after point 2 below). Still, it's not a generic solution.

2. The core problem is that graph-modification functions may trigger a context switch due to nested aio_polls, which leads to (for example) nested permission updates.
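
As a rough illustration of the workaround in point 1 (again a toy
Python model, not QEMU code; asyncio only plays the role of the
coroutine machinery here), serializing all graph-changing completion
steps behind one lock keeps each update atomic with respect to the
other:

import asyncio

graph_log = []

async def toy_stream_prepare(graph_lock):
    async with graph_lock:
        graph_log.append("stream: begin graph update")
        await asyncio.sleep(0)   # scheduling point (nested-poll analogue)
        graph_log.append("stream: end graph update")

async def toy_mirror_prepare(graph_lock):
    async with graph_lock:
        graph_log.append("mirror: begin graph update")
        graph_log.append("mirror: end graph update")

async def main():
    graph_lock = asyncio.Lock()   # stands in for the global co-mutex
    await asyncio.gather(toy_stream_prepare(graph_lock),
                         toy_mirror_prepare(graph_lock))
    print("\n".join(graph_log))  # each update now completes as a whole

asyncio.run(main())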


I have a big series (not yet sent) that reworks graph modifications (mostly around permissions) to solve some other problems and drop the .active field workaround in block jobs. But it doesn't solve the aio_poll problem.

See also my recent "question about bdrv_replace_node" mail; it's about the same thing.


-- 
Best regards,
Vladimir



Thread overview: 12+ messages
2020-11-16 16:16 iotest 030 still occasionally intermittently failing Peter Maydell
2020-11-16 17:34 ` Alberto Garcia
2020-11-16 17:59   ` Peter Maydell
2020-11-19 16:11     ` Vladimir Sementsov-Ogievskiy
2020-11-19 16:16       ` Vladimir Sementsov-Ogievskiy
2020-11-19 18:17       ` Alberto Garcia
2020-11-19 19:30       ` Vladimir Sementsov-Ogievskiy
2020-11-19 19:31         ` Vladimir Sementsov-Ogievskiy
2020-11-19 20:31           ` Vladimir Sementsov-Ogievskiy
2020-11-20  9:51             ` Vladimir Sementsov-Ogievskiy
2020-11-20 10:34               ` Philippe Mathieu-Daudé
2020-11-20 12:48                 ` Vladimir Sementsov-Ogievskiy
