Re: [Qemu-devel] [Qemu-block] [RFC PATCH V4] qemu-img: make convert async

From: Peter Lieven <pl@kamp.de>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: qemu-devel@nongnu.org, kwolf@redhat.com, famz@redhat.com,
	qemu-block@nongnu.org, mreitz@redhat.com
Subject: Re: [Qemu-devel] [Qemu-block] [RFC PATCH V4] qemu-img: make convert async
Date: Mon, 20 Feb 2017 15:59:42 +0100	[thread overview]
Message-ID: <73412b92-353d-7109-cfbe-9c61bf91e7eb@kamp.de> (raw)
In-Reply-To: <20170220145011.GJ21255@stefanha-x1.localdomain>

Am 20.02.2017 um 15:50 schrieb Stefan Hajnoczi:
> On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
>> this is something I have been thinking about for almost 2 years now.
>> we heavily have the following two use cases when using qemu-img convert.
>>
>> a) reading from NFS and writing to iSCSI for deploying templates
>> b) reading from iSCSI and writing to NFS for backups
>>
>> In both processes we use libiscsi and libnfs so we have no kernel pagecache.
>> As qemu-img convert is implemented with sync operations that means we
>> read one buffer and then write it. No parallelism and each sync request
>> takes as long as it takes until it is completed.
>>
>> This is version 4 of the approach using coroutine worker "threads".
>>
>> So far I have the following runtimes when reading an uncompressed QCOW2 from
>> NFS and writing it to iSCSI (raw):
>>
>> qemu-img (master)
>>   nfs -> iscsi 22.8 secs
>>   nfs -> ram   11.7 secs
>>   ram -> iscsi 12.3 secs
>>
>> qemu-img-async (8 coroutines, in-order write disabled)
>>   nfs -> iscsi 11.0 secs
>>   nfs -> ram   10.4 secs
>>   ram -> iscsi  9.0 secs
>>
>> The following are the runtimes found with different settings between V3 and V4.
>> This is always the best runtime out of 10 runs when converting from nfs to iscsi.
>> Please note that in V4 in-order write scenarios show a very high jitter. I think
>> this is because the get_block_status on the NFS share is delayed by concurrent read
>> requests.
>>
>>                         in-order        out-of-order
>> V3  - 16 coroutines    12.4 seconds    11.1 seconds
>>      -  8 coroutines    12.2 seconds    11.3 seconds
>>      -  4 coroutines    12.5 seconds    11.1 seconds
>>      -  2 coroutines    14.8 seconds    14.9 seconds
>>
>> V4  - 32 coroutines    15.9 seconds    11.5 seconds
>>      - 16 coroutines    12.5 seconds    11.0 seconds
>>      -  8 coroutines    12.9 seconds    11.0 seconds
>>      -  4 coroutines    14.1 seconds    11.5 seconds
>>      -  2 coroutines    16.9 seconds    13.2 seconds
> Does this patch work with compressed images?  Especially the
> out-of-order write mode may be problematic with a compressed qcow2 image.

It does, but you are right, out-of-order writes and compression should
be mutually exclusive.

>
> How should a user decide between in-order and out-of-order?

In general, out of order is ok, when we write to a preallocated device (e.g. all raw
devices or qcow2 with preallocation=full). default is in-order write. the user has
to manually enable out of order.

>
>> @@ -1651,12 +1680,117 @@ static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
>>       return 0;
>>   }
>>   
>> -static int convert_do_copy(ImgConvertState *s)
>> +static void convert_co_do_copy(void *opaque)
> Missing coroutine_fn here and for convert_co_read()/convert_co_write().
> Functions that must be called from coroutine context (because they
> yield, use coroutine mutexes, etc) need to be marked as such.

okay.

>
>> +        if (s->wr_in_order) {
>> +            /* reenter the coroutine that might have waited
>> +             * for this write to complete */
>> +            s->wr_offs = sector_num + n;
>> +            for (i = 0; i < s->num_coroutines; i++) {
>> +                if (s->co[i] && s->wait_sector_num[i] == s->wr_offs) {
>> +                    qemu_coroutine_enter(s->co[i]);
>> +                    break;
> This qemu_coroutine_enter() call relies on the yield pattern between
> sibling coroutines having no recursive qemu_coroutine_enter() calls.
> QEMU aborts if there is a code path where coroutine A enters B and then
> B enters A again before yielding.
>
> Paolo's new aio_co_wake() API solves this issue by deferring the
> qemu_coroutine_enter() to the event loop.  It's similar to CoQueue
> wakeup.  aio_co_wake() is part of my latest block pull request (should
> be merged into qemu.git soon).
>
> I *think* this patch has no A -> B -> A situation thanks to yields in
> the code path, but it would be nicer to use aio_co_wake() where this can
> never happen.

I also believe this can't happen. Would it be also okay to use
qemu_coroutine_enter_if_inactive() here?

>
>> +        case 'm':
>> +            num_coroutines = atoi(optarg);
>> +            if (num_coroutines > MAX_COROUTINES) {
>> +                error_report("Maximum allowed number of coroutines is %d",
>> +                             MAX_COROUTINES);
>> +                ret = -1;
>> +                goto fail_getopt;
>> +            }
> Missing input validation for the < 1 case.

Right.

Thank you,
Peter