From: Anthony Liguori
Date: Wed, 26 Jan 2011 11:12:27 -0600
Subject: Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
To: Avi Kivity
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org
Message-ID: <4D4055FB.6090104@codemonkey.ws>
In-Reply-To: <4D404E1F.50809@redhat.com>

On 01/26/2011 10:38 AM, Avi Kivity wrote:
> On 01/26/2011 06:28 PM, Anthony Liguori wrote:
>> On 01/26/2011 10:13 AM, Avi Kivity wrote:
>>> Serializing against a global mutex has the advantage that it can be
>>> treated as a global lock that is decomposed into fine-grained locks.
>>>
>>> For example, we can start the code conversion from an explicit async
>>> model to a threaded sync model by converting the mutex into a
>>> shared/exclusive lock. Operations like read and write take the lock
>>> for shared access (and take a fine-grained mutex on the metadata
>>> cache entry), while operations like creating a snapshot take the lock
>>> for exclusive access. That doesn't work with freeze/thaw.
>>
>> The trouble with this is that you increase the amount of re-entrance,
>> whereas freeze/thaw doesn't.
>>
>> The code from the beginning of the request to where the mutex is
>> acquired will be executed for every single request, even while
>> requests are blocked at the mutex acquisition.
>
> It's just a few instructions.
>
>> With freeze/thaw, you freeze the queue and prevent any request from
>> starting until you thaw. You only thaw and return control to allow
>> another request to execute when you begin executing an asynchronous
>> I/O callback.
>
> What do you actually save? The longjmp() to the coroutine code,
> linking in to the mutex wait queue, and another switch back to the
> main coroutine? Given that we don't expect to block often, it seems
> hardly a cost worth optimizing.

It's a matter of correctness, not optimization. Consider the following
example:

coroutine {
    l2 = find_l2(offset);
    // allocates but does not increment max cluster offset
    l2[l2_offset(offset)] = alloc_cluster();

    co_mutex_lock(&lock);
    write_l2(l2);
    co_mutex_unlock(&lock);

    l1[l1_offset(offset)] = l2;

    co_mutex_lock(&lock);
    write_l1(l1);
    co_mutex_unlock(&lock);

    commit_cluster(l2[l2_offset(offset)]);
}

This code is incorrect. The code before the first co_mutex_lock() may
be called a dozen times before anyone finishes this routine. That means
the same cluster is reused multiple times. This code was correct before
we introduced coroutines because we effectively had one big lock around
the whole thing.
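To make the race concrete, here is one possible interleaving, written
as a hypothetical trace in the same pseudocode (the cluster number and
the two offsets are made up for illustration):

/* A blocks inside write_l2() at the first co_mutex_lock() and yields;
 * nothing has bumped the max cluster offset yet. */
coroutine A:  l2[l2_offset(off_a)] = alloc_cluster();  /* returns cluster 42 */
coroutine A:  co_mutex_lock(&lock); write_l2(l2);      /* I/O, A yields */
coroutine B:  l2[l2_offset(off_b)] = alloc_cluster();  /* returns 42 again */
coroutine B:  co_mutex_lock(&lock);                    /* queues behind A */
/* Both coroutines eventually commit_cluster() and write guest data
 * into cluster 42, corrupting the image. */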
Regards,

Anthony Liguori

>
>> I think my previous example was wrong; you really want to do:
>>
>> qcow2_aio_writev() {
>>     coroutine {
>>         freeze();
>>         sync_io();  // existing qcow2 code
>>         thaw();
>>         // existing non I/O code
>>         bdrv_aio_writev(callback);  // no explicit freeze/thaw needed
>>     }
>> }
>>
>> This is equivalent to our existing code because no new re-entrance is
>> introduced. The only re-entrancy points are in the
>> bdrv_aio_{readv,writev} calls.
>
> This requires you to know which code is sync, and which code is
> async. My conversion allows you to wrap the code blindly with a
> mutex, and have it do the right thing automatically. This is most
> useful where the code can be sync or async depending on data (which
> is the case for qcow2).
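For comparison, a minimal sketch of the "wrap it blindly with a mutex"
shape Avi describes, reusing the hypothetical names from the example
above. Holding one CoMutex across the whole metadata update means a
coroutine that yields inside write_l2()/write_l1() still holds the
lock, so no other request can reach alloc_cluster() until the cluster
has been committed:

coroutine {
    co_mutex_lock(&lock);   /* held across the entire metadata update */

    l2 = find_l2(offset);
    l2[l2_offset(offset)] = alloc_cluster();
    write_l2(l2);

    l1[l1_offset(offset)] = l2;
    write_l1(l1);

    commit_cluster(l2[l2_offset(offset)]);
    co_mutex_unlock(&lock); /* only now can another request allocate */
}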