From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter Lieven
Date: Mon, 24 Apr 2017 20:16:56 +0200
Subject: Re: [Qemu-devel] [Qemu-stable] [PATCH v2 1/1] qemu-img: wait for convert coroutines to complete
In-Reply-To: <15b32f67-fbce-fe9e-5a5f-ed25eae49941@virtuozzo.com>
References: <1492769076-64466-1-git-send-email-anton.nefedov@virtuozzo.com>
 <98514b87-d2a3-254d-f03e-0c6d3cd73ff6@kamp.de>
 <88259101-7040-f057-5ed4-76bd36f2b1a0@virtuozzo.com>
 <15b32f67-fbce-fe9e-5a5f-ed25eae49941@virtuozzo.com>
To: Anton Nefedov
Cc: qemu-devel@nongnu.org, kwolf@redhat.com, mreitz@redhat.com, den@virtuozzo.com, qemu-block@nongnu.org, qemu-stable@nongnu.org

> On 24.04.2017 at 18:27, Anton Nefedov wrote:
>
>> On 04/21/2017 03:37 PM, Peter Lieven wrote:
>>> On 21.04.2017 at 14:19, Anton Nefedov wrote:
>>>> On 04/21/2017 01:44 PM, Peter Lieven wrote:
>>>>> On 21.04.2017 at 12:04, Anton Nefedov wrote:
>>>>> On the error path (e.g. an I/O error in one of the coroutines), it is
>>>>> required to:
>>>>>   - wait for coroutine completion before cleaning up the common structures
>>>>>   - re-enter dependent coroutines so they can finish
>>>>>
>>>>> Introduced in 2d9187bc65.
>>>>>
>>>>> Signed-off-by: Anton Nefedov
>>>>> ---
>>>>> [..]
>>>>>
>>>>
>>>>
>>>> And what if we error out in the read path? Wouldn't something like this be easier?
>>>>
>>>>
>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>> index 22f559a..4ff1085 100644
>>>> --- a/qemu-img.c
>>>> +++ b/qemu-img.c
>>>> @@ -1903,6 +1903,16 @@ static int convert_do_copy(ImgConvertState *s)
>>>>          main_loop_wait(false);
>>>>      }
>>>>
>>>> +    /* on error path we need to enter all coroutines that are still
>>>> +     * running before cleaning up common structures */
>>>> +    if (s->ret) {
>>>> +        for (i = 0; i < s->num_coroutines; i++) {
>>>> +            if (s->co[i]) {
>>>> +                qemu_coroutine_enter(s->co[i]);
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>>      if (s->compressed && !s->ret) {
>>>>          /* signal EOF to align */
>>>>          ret = blk_pwrite_compressed(s->target, 0, NULL, 0);
>>>>
>>>>
>>>> Peter
>>>>
>>>
>>> It seemed a bit too daring to me to re-enter every coroutine, potentially including the ones that yielded waiting for I/O completion.
>>> If that's OK, then that is for sure easier :)
>>
>> I think we should enter every coroutine that is still running and have it terminate correctly. It was my mistake that I did not
>> do this in the original patch. Can you check if the above fixes your test cases that triggered the bug?
>>
>
> Hi, sorry I'm late with the answer.
>
> This segfaults in bdrv_close(). It looks like it tries to finish some I/O whose coroutine we have already entered and terminated?
>
> (gdb) run
> Starting program: /vz/anefedov/qemu-build/us/./qemu-img convert -O qcow2 ./harddisk.hdd.c ./harddisk.hdd
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [New Thread 0x7fffeac2d700 (LWP 436020)]
> [New Thread 0x7fffe4ed6700 (LWP 436021)]
> qemu-img: error while reading sector 20480: Input/output error
> qemu-img: error while writing sector 19712: Operation now in progress
>
> Program received signal SIGSEGV, Segmentation fault.
> aio_co_wake (co=0x0) at /mnt/code/us-qemu/util/async.c:454
> 454         ctx = atomic_read(&co->ctx);
> (gdb) bt
> #0  aio_co_wake (co=0x0) at /mnt/code/us-qemu/util/async.c:454
> /* [Anton]: thread_pool_co_cb () here */
> #1  0x0000555555634629 in thread_pool_completion_bh (opaque=0x555555cfe020) at /mnt/code/us-qemu/util/thread-pool.c:189
> #2  0x0000555555633b31 in aio_bh_call (bh=0x555555cfe0f0) at /mnt/code/us-qemu/util/async.c:90
> #3  aio_bh_poll (ctx=ctx@entry=0x555555cee6d0) at /mnt/code/us-qemu/util/async.c:118
> #4  0x0000555555636f14 in aio_poll (ctx=ctx@entry=0x555555cee6d0, blocking=) at /mnt/code/us-qemu/util/aio-posix.c:682
> #5  0x00005555555c52d4 in bdrv_drain_recurse (bs=bs@entry=0x555555d22560) at /mnt/code/us-qemu/block/io.c:164
> #6  0x00005555555c5aed in bdrv_drained_begin (bs=bs@entry=0x555555d22560) at /mnt/code/us-qemu/block/io.c:248
> #7  0x0000555555581443 in bdrv_close (bs=0x555555d22560) at /mnt/code/us-qemu/block.c:2909
> #8  bdrv_delete (bs=0x555555d22560) at /mnt/code/us-qemu/block.c:3100
> #9  bdrv_unref (bs=0x555555d22560) at /mnt/code/us-qemu/block.c:4087
> #10 0x00005555555baf44 in blk_remove_bs (blk=blk@entry=0x555555d22380) at /mnt/code/us-qemu/block/block-backend.c:552
> #11 0x00005555555bb173 in blk_delete (blk=0x555555d22380) at /mnt/code/us-qemu/block/block-backend.c:238
> #12 blk_unref (blk=blk@entry=0x555555d22380) at /mnt/code/us-qemu/block/block-backend.c:282
> #13 0x000055555557a22c in img_convert (argc=, argv=) at /mnt/code/us-qemu/qemu-img.c:2359
> #14 0x0000555555574189 in main (argc=5, argv=0x7fffffffe4a0) at /mnt/code/us-qemu/qemu-img.c:4464
>
>
>> Peter
>>
>
> /Anton
>

It seems that this is a bit tricky. Can you share how your test case works?

Thanks,
Peter