* [Qemu-devel] qcow2 performance plan
@ 2010-09-14 13:07 Avi Kivity
  2010-09-14 13:43 ` Anthony Liguori
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Avi Kivity @ 2010-09-14 13:07 UTC (permalink / raw)
  To: qemu-devel, Kevin Wolf

  Here's a draft of a plan that should improve qcow2 performance.  It's 
written in wiki syntax for eventual upload to wiki.qemu.org; lines 
starting with # are numbered lists, not comments.

= Basics =

At the minimum level, no operation should block the main thread.  This
could be done in two ways: extending the state machine so that each
blocking operation can be performed asynchronously (<code>bdrv_aio_*</code>)
or by threading: each new operation is handed off to a worker thread.
Since a full state machine is prohibitively complex, this document
will discuss threading.

== Basic threading strategy ==

A first iteration of qcow2 threading adds a single mutex to an image.
The existing qcow2 code is then executed within a worker thread,
acquiring the mutex before starting any operation and releasing it
after completion.  Concurrent operations will simply block until the
operation is complete.  For operations which are already asynchronous,
the blocking time will be negligible since the code will call
<code>bdrv_aio_{read,write}</code> and return, releasing the mutex.
The immediate benefit is that currently blocking operations no longer block
the main thread; instead they just block the block operation, which is
blocking anyway.
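
As an illustration only (the type and function names below are invented for
the sketch, not actual QEMU identifiers), the first iteration could look
roughly like this with pthreads:

    /* Sketch only: one mutex per image; each request runs the existing
     * synchronous qcow2 code inside a worker thread. */
    #include <pthread.h>

    typedef struct Qcow2Image {
        pthread_mutex_t lock;      /* serializes all qcow2 operations */
        /* ... existing qcow2 state ... */
    } Qcow2Image;

    typedef struct Qcow2Request {
        Qcow2Image *img;
        void (*op)(Qcow2Image *img, void *opaque);  /* existing blocking code */
        void *opaque;
    } Qcow2Request;

    static void *qcow2_worker(void *arg)
    {
        Qcow2Request *req = arg;

        pthread_mutex_lock(&req->img->lock);   /* concurrent requests block here */
        req->op(req->img, req->opaque);        /* run today's qcow2 path */
        pthread_mutex_unlock(&req->img->lock);
        return NULL;
    }

The main thread only hands a request to the thread pool and returns, so it
never blocks on qcow2 metadata I/O.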

== Eliminating the threading penalty ==

We can eliminate pointless context switches by using the worker thread
context we're in to issue the I/O.  This is trivial for synchronous calls
(<code>bdrv_read</code> and <code>bdrv_write</code>); we simply issue the I/O
from the same thread we're currently in.  The underlying raw block format
driver threading code needs to recognize we're in a worker thread context so
it doesn't need to use a worker thread of its own; perhaps using a thread
variable to see if it is in the main thread or an I/O worker thread.
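
As a sketch of that thread variable (the flag and helper names are
hypothetical):

    #include <stdbool.h>

    /* Set on entry to a block I/O worker thread; the raw driver checks it
     * to decide whether to issue the I/O in place or queue the request to
     * its own thread pool. */
    static __thread bool in_io_worker;

    static bool bdrv_in_worker_thread(void)
    {
        return in_io_worker;
    }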

For asynchronous operations, this is harder.  We may add a
<code>bdrv_queue_aio_read</code> and <code>bdrv_queue_aio_write</code>
to replace a

     bdrv_aio_read()
     mutex_unlock(bs.mutex)
     return;

sequence.  Alternatively, we can just eliminate asynchronous calls.  To
retain concurrency we drop the mutex while performing the operation,
converting a <code>bdrv_aio_read</code> to:

     mutex_unlock(bs.mutex)
     bdrv_read()
     mutex_lock(bs.mutex)

This allows the operations to proceed in parallel.

For asynchronous metadata operations, the code is simplified considerably.
Dependency lists that are maintained in metadata caches are replaced by a
mutex; instead of adding an operation to a dependency list, acquire the mutex.
Then issue your metadata update synchronously.  If there is a lot of
contention on the resource, we can batch all updates into a single write:

    mutex_lock(l1.mutex)
    if not l1.dirty:
        l1.future = l1.data            # start a new batch of pending updates
        l1.dirty = True
    l1.future[idx] = cluster           # add our update to the batch
    mutex_lock(l1.write_mutex)         # only one writer at a time
    if l1.dirty:
        tmp = l1.future
        mutex_unlock(l1.mutex)         # let others keep adding to the batch
        bdrv_write(tmp)                # one write covers the whole batch
        sync
        mutex_lock(l1.mutex)
        l1.dirty = tmp != l1.future    # more updates arrived while writing?
    mutex_unlock(l1.write_mutex)

== Special casing linux-aio ==

There is one case where a worker thread approach is detrimental:
<code>cache=none</code> together with <code>aio=native</code>.  We can solve
this by checking for the case where we're ready to issue the operation with
no metadata I/O:

     if mutex_trylock(bs.mutex):
         m = metadata_lookup(offset, length)
         if m:
             bdrv_aio_read(bs, m, offset, length, callback) # or write
             mutex_unlock(bs.mutex)
             return
         mutex_unlock(bs.mutex) # metadata not cached; fall through to a worker
     queue_task(operation, offset, length, callback)

= Speeding up allocation =

When a write grows a qcow2 image, the following operations take place:

# clusters are allocated, and the refcount table is updated to reflect this
# sync to ensure the allocation is committed
# the data is written to the clusters
# the L2 table is located; if it doesn't exist, it is allocated and linked
# the L2 table is updated
# sync to ensure the L2->data pointer is committed

We can avoid the first sync by maintaining a volatile list of allocated
but not yet linked clusters.  This requires a tradeoff between the risk of
losing those clusters on an abort, and the performance gain.  To minimize
the risk, the list is flushed if there is no demand for it.

# we maintain low and high thresholds for the volatile free list
# if we're under the low threshold, we start a task to allocate clusters up to the midpoint
# if we're above the high threshold, we start a task to return clusters down to the midpoint
# if we ever need a cluster (extent) and find that the volatile list is empty, we double the low and high thresholds (up to a limit)
# once a second, we decrease the thresholds by 25%

This ensures that sustained writes will not block on allocation.
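
A rough sketch of such an adaptive free list (the names, the limit and the
omitted locking are all illustrative):

    #include <stdint.h>

    #define MAX_LOW_THRESHOLD 1024

    typedef struct FreeList {
        uint64_t *clusters;   /* allocated but not yet linked clusters */
        int count;
        int low, high;        /* refill below low, trim above high */
    } FreeList;

    static uint64_t take_cluster(FreeList *fl)
    {
        if (fl->count == 0) {
            /* demand outran the list: widen the window, up to a limit,
             * and fall back to allocating from the refcount table */
            if (fl->low * 2 <= MAX_LOW_THRESHOLD) {
                fl->low *= 2;
                fl->high *= 2;
            }
            return 0;
        }
        if (fl->count < fl->low) {
            /* kick off a background task refilling to (low + high) / 2 */
        }
        return fl->clusters[--fl->count];
    }

    /* run once a second: shrink the window again by 25% */
    static void decay_thresholds(FreeList *fl)
    {
        fl->low -= fl->low / 4;
        fl->high -= fl->high / 4;
    }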

Note that a lost cluster is simply leaked; no data loss is involved.
The free list can be rebuilt if an unclean shutdown is detected.  Older
implementations can ignore these leaks.  To transport an image, it is
recommended to run qemu-img to reclaim any leaked clusters in case the
image was shut down uncleanly.

== Alternative implementation ==

We can avoid a volatile list by relying on guest concurrency.  We replace
<code>bdrv_aio_write</code> by <code>bdrv_aio_submit</code>, which issues
many operations in parallel (but completes each one separately).  This mimics
SCSI and virtio devices, which can trigger multiple ops with a single call
to the hardware.  We make a first pass over all write operations, seeing how
many clusters need to be allocated, allocate them all in a single operation,
then submit all of the allocating writes.  Reads and non-allocating writes
can proceed in parallel.
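
For illustration, the submission path could make two passes over a batch
roughly as follows (the request structure and the commented-out helpers are
hypothetical):

    typedef struct WriteReq {
        long cluster_index;   /* guest cluster touched by this write */
        int needs_alloc;      /* filled in by the metadata lookup pass */
    } WriteReq;

    static void bdrv_aio_submit_sketch(WriteReq *reqs, int n)
    {
        int to_alloc = 0;
        int i;

        /* first pass: count the clusters that still need allocation */
        for (i = 0; i < n; i++) {
            if (reqs[i].needs_alloc) {
                to_alloc++;
            }
        }

        /* one refcount update (and at most one sync) covers the whole batch */
        /* allocate_clusters(to_alloc); */

        /* second pass: issue every write; allocating and non-allocating
         * writes now proceed in parallel */
        for (i = 0; i < n; i++) {
            /* issue_aio_write(&reqs[i]); */
        }
    }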

Note that this implementation (as well as the current qcow2 code) may
leak clusters if qemu aborts in the wrong place.  Avoiding leaks
completely requires either journalling, allocate-on-write, or a free
list rebuild.  The first two are slow due to the need for barriers.

= Avoiding L2 syncs =

Currently, after updating an L2 table with a cluster pointer, we sync to
avoid loss of a cluster.  We can avoid this since the guest is required to
sync if it wants to ensure the data is on disk.  We only need to sync if we
UNMAP the cluster, before we free it in the refcount table.

= Copying L1 tables =

qcow2 requires copying of L1 tables in two cases: taking a snapshot, and 
growing the physical image size beyond a certain boundary.  Since L1s 
are relatively small, even for very large images, and growing L1 is very 
rare, we can exclude all write operations by having a global 
shared/exclusive lock taken for shared access by write operations, and 
for exclusive access by grow/snapshot operations.

If concurrent growing and writing is desired, it can be achieved by 
having a thread copy L1, and requiring each L1 update to update both 
copies (for the region already copied) or just the source (for the 
region that was not yet copied).
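
The shared/exclusive lock maps directly onto a reader/writer lock; a minimal
sketch with invented function names:

    #include <pthread.h>

    static pthread_rwlock_t l1_lock = PTHREAD_RWLOCK_INITIALIZER;

    static void qcow2_guest_write(void)
    {
        pthread_rwlock_rdlock(&l1_lock);   /* shared: many writes in flight */
        /* ... update L2/refcounts, write data ... */
        pthread_rwlock_unlock(&l1_lock);
    }

    static void qcow2_snapshot_or_grow(void)
    {
        pthread_rwlock_wrlock(&l1_lock);   /* exclusive: no write in flight */
        /* ... copy or grow the L1 table ... */
        pthread_rwlock_unlock(&l1_lock);
    }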

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 13:07 [Qemu-devel] qcow2 performance plan Avi Kivity
@ 2010-09-14 13:43 ` Anthony Liguori
  2010-09-14 15:11   ` Kevin Wolf
  2010-09-14 14:01 ` [Qemu-devel] " Kevin Wolf
  2010-09-14 15:25 ` [Qemu-devel] " Anthony Liguori
  2 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-09-14 13:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel

Hi Avi,

On 09/14/2010 08:07 AM, Avi Kivity wrote:
>  Here's a draft of a plan that should improve qcow2 performance.  It's 
> written in wiki syntax for eventual upload to wiki.qemu.org; lines 
> starting with # are numbered lists, not comments.

Thanks for putting this together.  I think it's really useful to think 
through the problem before anyone jumps in and starts coding.

> = Basics =
>
> At the minimum level, no operation should block the main thread.  This
> could be done in two ways: extending the state machine so that each
> blocking operation can be performed asynchronously 
> (<code>bdrv_aio_*</code>)
> or by threading: each new operation is handed off to a worker thread.
> Since a full state machine is prohibitively complex, this document
> will discuss threading.

There are two distinct requirements that must be satisfied by a fast block 
device.  The device must have fast implementations of aio functions and 
it must support concurrent request processing.

If an aio function blocks in the process of submitting the request, it's 
by definition slow.  But even if you make the aio functions fast, you 
still need to be able to support concurrent request processing in order 
to achieve high throughput.

I'm not going to comment in depth on your threading proposal.  When it 
comes to adding concurrency, I think any approach will require a rewrite 
of the qcow2 code and if the author of that rewrite is more comfortable 
implementing concurrency with threads than with a state machine, I'm 
happy with a threaded implementation.

I'd suggest avoiding hyperbole like "a full state machine is 
prohibitively complex".  QED is a full state machine.  qcow2 adds a 
number of additional states because of the additional metadata and sync 
operations but it's not an exponential increase in complexity.

There are alternative models of concurrency that are worth exploring.  
For instance, coroutines are extremely appealing to me for this type of 
concurrency.  Let me give an example of why.  To handle a read request, 
qed loosely looks like:

aio_read():
    read_table(l1_table, aio_read_l2_table, opaque)

aio_read_l2_table():
    read_table(l2_table, aio_read_data, opaque)

aio_read_data():
    read_data(offset, aio_read_complete, opaque)

aio_read_complete():
    complete_request()

There's no parallelism so there's no need for locking.  With coroutines, 
there is still no added parallelism, but you can encode the state machine's 
transitions using control flow, yielding to the next state.  QED aio_read() 
ends up looking like:

aio_read():
    launch_coroutine(aio_read_impl)

aio_read_impl():
    read_table(l1_table)
    read_table(l2_table)
    read_data(offset)
    complete_request()

There's a subtle difference from threads, though.  Instead of being 
re-entrant by default, you are only re-entrant at well-defined points in 
time.

What's nice is that coroutines can be implemented using threads or 
something more sophisticated (like ucontext).  Code written via state 
machine can be converted to coroutines without really thinking.
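
For illustration, a minimal coroutine built on ucontext might look like the
following; this is a sketch of the idea, not QEMU or QED code:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;

    static void qemu_coroutine_yield(void)
    {
        swapcontext(&co_ctx, &main_ctx);   /* suspend at a well-defined point */
    }

    static void aio_read_impl(void)
    {
        printf("read L1 table\n");  qemu_coroutine_yield();  /* "wait" for I/O */
        printf("read L2 table\n");  qemu_coroutine_yield();
        printf("read data\n");      qemu_coroutine_yield();
        printf("complete request\n");
    }

    int main(void)
    {
        static char stack[64 * 1024];
        int i;

        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = sizeof(stack);
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, aio_read_impl, 0);

        /* each swap resumes the coroutine until its next yield, playing the
         * role of an I/O completion callback re-entering the state machine */
        for (i = 0; i < 4; i++) {
            swapcontext(&main_ctx, &co_ctx);
        }
        return 0;
    }

The read path reads top to bottom, yet control returns to the caller at every
yield, exactly where the state machine would have returned to its callback.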

I'd like to get QED merged first, but my plan is to attempt to introduce 
coroutines to QEMU with something like QED.  I think coroutines are a 
much better fit for concurrency in QEMU than using pre-emptive threads.

> = Speeding up allocation =
>
> When a write grows a qcow2 image, the following operations take place:
>
> # clusters are allocated, and the refcount table is updated to reflect 
> this
> # sync to ensure the allocation is committed
> # the data is written to the clusters
> # the L2 table is located; if it doesn't exist, it is allocated and 
> linked
> # the L2 table is updated
> # sync to ensure the L2->data pointer is committed
>
> We can avoid the first sync by maintaining a volatile list of allocated
> but not yet linked clusters.  This requires a tradeoff between the 
> risk of
> losing those clusters on an abort, and the performance gain.  To 
> minimize the
> risk, the list is flushed if there is no demand for it.
>
> # we maintain low and high theresholds for the volatile free list
> # if we're under the low threshold, we start a task to allocate 
> clusters up to the midpoint
> # if we're above the high threshold, we start a task to return 
> clusters down to the midpoint
> # if we ever need a cluster (extent) and find that the volatile list 
> is empty, we double the low and thresholds (up to a limit)
> # once a second, we decrease the thresholds by 25%
>
> This ensures that sustained writes will not block on allocation.
>
> Note that a lost cluster is simply leaked; no data loss is involved.  
> The free list can be rebuilt if an unclean shutdown is detected.  
> Older implementations can ignore this those leaks.  To transport an 
> image, it is recommended to run qemu-img to reclaim any clusters in 
> case it was shut down uncleanly.

I'm not sure that letting older implementations ignore leaks is a 
reasonable approach.

>
> == Alternative implementation ==
>
> We can avoid a volatile list by relying on guest concurrency.  We replace
> <code>bdrv_aio_write</code> by <code>bdrv_aio_submit</code>, which issues
> many operations in parallel (but completes each one separately).  This 
> mimics
> SCSI and virtio devices, which can trigger multiple ops with a single 
> call
> to the hardware.  We make a first pass over all write operations, 
> seeing how
> many clusters need to be allocated, allocate that in a single 
> operation, then
> submit all of the allocating writes.

The difficulty here is that making a first pass means you have to read 
the metadata for each operation.  Each write() may have many writes to 
different clusters, which means that you potentially have to read a 
lot of metadata.

This means that you may have one write that you know needs a new cluster, 
but you wait to do the actual write until you finish other metadata 
operations.  It's not obvious without benchmarking whether this is the 
best strategy.

Regards,

Anthony Liguori

> = Avoiding L2 syncs =
>
> Currently after updating an L2 table with a cluster pointer, we sync 
> to avoid
> loss of a cluster.  We can avoid this since the guest is required to sync
> if it wants to ensure the data is on disk.  We need only to sync if we 
> UNMAP
> the cluster, before we free it in the refcount table.
>
> = Copying L1 tables =
>
> qcow2 requires copying of L1 tables in two cases: taking a snapshot, 
> and growing the physical image size beyond a certain boundary.  Since 
> L1s are relatively small, even for very large images, and growing L1 
> is very rare, we can exclude all write operations by having a global 
> shared/exclusive lock taken for shared access by write operations, and 
> for exclusive access by grow/snapshot operations.
>
> If concurrent growing and writing is desired, it can be achieved by 
> having a thread copy L1, and requiring each L1 update to update both 
> copies (for the region already copied) or just the source (for the 
> region that was not yet copied).
>


* [Qemu-devel] Re: qcow2 performance plan
  2010-09-14 13:07 [Qemu-devel] qcow2 performance plan Avi Kivity
  2010-09-14 13:43 ` Anthony Liguori
@ 2010-09-14 14:01 ` Kevin Wolf
  2010-09-14 15:14   ` Stefan Hajnoczi
  2010-09-14 15:25 ` [Qemu-devel] " Anthony Liguori
  2 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-09-14 14:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 14.09.2010 15:07, Avi Kivity wrote:
>   Here's a draft of a plan that should improve qcow2 performance.  It's 
> written in wiki syntax for eventual upload to wiki.qemu.org; lines 
> starting with # are numbered lists, not comments.
> 
> = Basics =
> 
> At the minimum level, no operation should block the main thread.  This
> could be done in two ways: extending the state machine so that each
> blocking operation can be performed asynchronously (<code>bdrv_aio_*</code>)
> or by threading: each new operation is handed off to a worker thread.
> Since a full state machine is prohibitively complex, this document
> will discuss threading.
> 
> == Basic threading strategy ==
> 
> A first iteration of qcow2 threading adds a single mutex to an image.
> The existing qcow2 code is then executed within a worker thread,
> acquiring the mutex before starting any operation and releasing it
> after completion.  Concurrent operations will simply block until the
> operation is complete.  For operations which are already asynchronous,
> the blocking time will be negligible since the code will call
> <code>bdrv_aio_{read,write}</code> and return, releasing the mutex.
> The immediate benefit is that currently blocking operations no long block
> the main thread, instead they just block the block operation which is
> blocking anyway.
> 
> == Eliminating the threading penalty ==
> 
> We can eliminate pointless context switches by using the worker thread
> context we're in to issue the I/O.  This is trivial for synchronous calls
> (<code>bdrv_read</code> and <code>bdrv_write</code>); we simply issue 
> the I/O
> from the same thread we're currently in.  The underlying raw block format
> driver threading code needs to recognize we're in a worker thread context so
> it doesn't need to use a worker thread of its own; perhaps using a thread
> variable to see if it is in the main thread or an I/O worker thread.
> 
> For asynchronous operations, this is harder.  We may add a
> <code>bdrv_queue_aio_read</code> and <code>bdrv_queue_aio_write</code> if
> to replace a
> 
>      bdrv_aio_read()
>      mutex_unlock(bs.mutex)
>      return;
> 
> sequence.  Alternatively, we can just eliminate asynchronous calls.  To
> retain concurrency we drop the mutex while performing the operation:
> an convert a <code>bdrv_aio_read</code> to:
> 
>      mutex_unlock(bs.mutex)
>      bdrv_read()
>      mutex_lock(bs.mutex)
> 
> This allows the operations to proceed in parallel.
> 
> For asynchronous metadata operations, the code is simplified considerably.
> Dependency lists that are maintained in metadata caches are replaced by a
> mutex; instead of adding an operation to a dependency list, acquire the 
> mutex.
> Then issue your metadata update synchronously.  If there is a lot of 
> contention
> on the resource, we can batch all updates into a single write:
> 
>     mutex_lock(l1.mutex)
>     if not l1.dirty:
>         l1.future = l1.data
>         l1.dirty = True
>     l1.future[idx] = cluster
>     mutex_lock(l1.write_mutex)
>     if l1.dirty:
>         tmp = l1.future
>         mutex_unlock(l1.mutex)
>         bdrv_write(tmp)
>         sync
>         mutex_lock(l1.mutex)
>         l1.dirty = tmp != l1.future
>     mutex_unlock(l1.write_mutex)
> 
> == Special casing linux-aio ==
> 
> There is one case where a worker thread approach is detrimental:
> <code>cache=none</code> together with <code>aio=native</code>.  We can solve
> this by checking for the case where we're ready to issue the operation with
> no metadata I/O:
> 
>      if mutex_trylock(bs.mutex):
>         m = metadata_loopup(offset, length)
>         if m:
>             bdrv_aio_read(bs, m, offset, length, callback) # or write
>             mutex_unlock(bs.mutex)
>             return
>      queue_task(operation, offset, length, callback)
> 
> = Speeding up allocation =
> 
> When a write grows a qcow2 image, the following operations take place:
> 
> # clusters are allocated, and the refcount table is updated to reflect this
> # sync to ensure the allocation is committed
> # the data is written to the clusters
> # the L2 table is located; if it doesn't exist, it is allocated and linked
> # the L2 table is updated
> # sync to ensure the L2->data pointer is committed

I have been thinking about changing this into:

# clusters are allocated, and the refcount table is updated to reflect this
# the data is written to the clusters
# Return success to the caller and schedule a second part to be run
immediately before the next sync is done for other reasons (e.g. the
guest requesting a flush)

This is done only before the next sync:

# sync to ensure the allocation is committed
# the L2 table is located; if it doesn't exist, it is allocated and linked
# the L2 table is updated
# One sync for everything that has accumulated

This would leave us with no sync in the typical cluster allocation case.
The downside is that we're in trouble if we get an I/O error during the
pre-sync writes. Losing this data is actually okay, though, because the
guest hasn't flushed yet.

If you extend this to more complicated things like a COW, this will
involve a few more syncs. We can probably have a queue like "before the
next sync, do A, B and C; after the second one D; after the third one E
and F".
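
A minimal sketch of such a deferred-update queue (names are illustrative;
real code would need per-image state and error handling):

    #include <stdlib.h>

    typedef void (*DeferredFn)(void *opaque);

    typedef struct DeferredUpdate {
        DeferredFn fn;
        void *opaque;
        struct DeferredUpdate *next;
    } DeferredUpdate;

    static DeferredUpdate *pending;   /* metadata updates owed before the next sync */

    static void defer_until_flush(DeferredFn fn, void *opaque)
    {
        DeferredUpdate *u = malloc(sizeof(*u));
        u->fn = fn;
        u->opaque = opaque;
        u->next = pending;
        pending = u;
    }

    static void qcow2_flush(void)     /* guest flush or other forced sync */
    {
        /* first sync commits the allocations and data written earlier */
        /* bdrv_flush(bs->file); */
        while (pending) {
            DeferredUpdate *u = pending;
            pending = u->next;
            u->fn(u->opaque);         /* e.g. write the L2 entry */
            free(u);
        }
        /* one sync for everything that accumulated */
        /* bdrv_flush(bs->file); */
    }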

Kevin


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 13:43 ` Anthony Liguori
@ 2010-09-14 15:11   ` Kevin Wolf
  2010-09-14 15:20     ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-09-14 15:11 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

On 14.09.2010 15:43, Anthony Liguori wrote:
> Hi Avi,
> 
> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>  Here's a draft of a plan that should improve qcow2 performance.  It's 
>> written in wiki syntax for eventual upload to wiki.qemu.org; lines 
>> starting with # are numbered lists, not comments.
> 
> Thanks for putting this together.  I think it's really useful to think 
> through the problem before anyone jumps in and starts coding.
> 
>> = Basics =
>>
>> At the minimum level, no operation should block the main thread.  This
>> could be done in two ways: extending the state machine so that each
>> blocking operation can be performed asynchronously 
>> (<code>bdrv_aio_*</code>)
>> or by threading: each new operation is handed off to a worker thread.
>> Since a full state machine is prohibitively complex, this document
>> will discuss threading.
> 
> There's two distinct requirements that must be satisfied by a fast block 
> device.  The device must have fast implementations of aio functions and 
> it must support concurrent request processing.
> 
> If an aio function blocks in the process of submitting the request, it's 
> by definition slow.  But even if you may the aio functions fast, you 
> still need to be able to support concurrent request processing in order 
> to achieve high throughput.
> 
> I'm not going to comment in depth on your threading proposal.  When it 
> comes to adding concurrency, I think any approach will require a rewrite 
> of the qcow2 code and if the author of that rewrite is more comfortable 
> implementing concurrency with threads than with a state machine, I'm 
> happy with a threaded implementation.
> 
> I'd suggest avoiding hyperbole like "a full state machine is 
> prohibitively complex".  QED is a full state machine.  qcow2 adds a 
> number of additional states because of the additional metadata and sync 
> operations but it's not an exponential increase in complexity.

qcow2 will bring in quite a few additional states, but I
suspect the really hard thing is getting the dependencies between
requests right.

I just had a look at how QED is doing this, and it seems to take the
easy solution, namely allowing only one allocation at the same time. So
this is roughly equivalent to Avi's worker thread that runs today's
qcow2 code and is protected by a global mutex.

No doubt, compared to real concurrency, this is relatively easy to achieve
in either model (but probably easier with threads).

Kevin


* Re: [Qemu-devel] Re: qcow2 performance plan
  2010-09-14 14:01 ` [Qemu-devel] " Kevin Wolf
@ 2010-09-14 15:14   ` Stefan Hajnoczi
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Hajnoczi @ 2010-09-14 15:14 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Avi Kivity, qemu-devel

On Tue, Sep 14, 2010 at 3:01 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 14.09.2010 15:07, schrieb Avi Kivity:
>>   Here's a draft of a plan that should improve qcow2 performance.  It's
>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>> starting with # are numbered lists, not comments.
>>
>> = Basics =
>>
>> At the minimum level, no operation should block the main thread.  This
>> could be done in two ways: extending the state machine so that each
>> blocking operation can be performed asynchronously (<code>bdrv_aio_*</code>)
>> or by threading: each new operation is handed off to a worker thread.
>> Since a full state machine is prohibitively complex, this document
>> will discuss threading.
>>
>> == Basic threading strategy ==
>>
>> A first iteration of qcow2 threading adds a single mutex to an image.
>> The existing qcow2 code is then executed within a worker thread,
>> acquiring the mutex before starting any operation and releasing it
>> after completion.  Concurrent operations will simply block until the
>> operation is complete.  For operations which are already asynchronous,
>> the blocking time will be negligible since the code will call
>> <code>bdrv_aio_{read,write}</code> and return, releasing the mutex.
>> The immediate benefit is that currently blocking operations no long block
>> the main thread, instead they just block the block operation which is
>> blocking anyway.
>>
>> == Eliminating the threading penalty ==
>>
>> We can eliminate pointless context switches by using the worker thread
>> context we're in to issue the I/O.  This is trivial for synchronous calls
>> (<code>bdrv_read</code> and <code>bdrv_write</code>); we simply issue
>> the I/O
>> from the same thread we're currently in.  The underlying raw block format
>> driver threading code needs to recognize we're in a worker thread context so
>> it doesn't need to use a worker thread of its own; perhaps using a thread
>> variable to see if it is in the main thread or an I/O worker thread.
>>
>> For asynchronous operations, this is harder.  We may add a
>> <code>bdrv_queue_aio_read</code> and <code>bdrv_queue_aio_write</code> if
>> to replace a
>>
>>      bdrv_aio_read()
>>      mutex_unlock(bs.mutex)
>>      return;
>>
>> sequence.  Alternatively, we can just eliminate asynchronous calls.  To
>> retain concurrency we drop the mutex while performing the operation:
>> an convert a <code>bdrv_aio_read</code> to:
>>
>>      mutex_unlock(bs.mutex)
>>      bdrv_read()
>>      mutex_lock(bs.mutex)
>>
>> This allows the operations to proceed in parallel.
>>
>> For asynchronous metadata operations, the code is simplified considerably.
>> Dependency lists that are maintained in metadata caches are replaced by a
>> mutex; instead of adding an operation to a dependency list, acquire the
>> mutex.
>> Then issue your metadata update synchronously.  If there is a lot of
>> contention
>> on the resource, we can batch all updates into a single write:
>>
>>     mutex_lock(l1.mutex)
>>     if not l1.dirty:
>>         l1.future = l1.data
>>         l1.dirty = True
>>     l1.future[idx] = cluster
>>     mutex_lock(l1.write_mutex)
>>     if l1.dirty:
>>         tmp = l1.future
>>         mutex_unlock(l1.mutex)
>>         bdrv_write(tmp)
>>         sync
>>         mutex_lock(l1.mutex)
>>         l1.dirty = tmp != l1.future
>>     mutex_unlock(l1.write_mutex)
>>
>> == Special casing linux-aio ==
>>
>> There is one case where a worker thread approach is detrimental:
>> <code>cache=none</code> together with <code>aio=native</code>.  We can solve
>> this by checking for the case where we're ready to issue the operation with
>> no metadata I/O:
>>
>>      if mutex_trylock(bs.mutex):
>>         m = metadata_loopup(offset, length)
>>         if m:
>>             bdrv_aio_read(bs, m, offset, length, callback) # or write
>>             mutex_unlock(bs.mutex)
>>             return
>>      queue_task(operation, offset, length, callback)
>>
>> = Speeding up allocation =
>>
>> When a write grows a qcow2 image, the following operations take place:
>>
>> # clusters are allocated, and the refcount table is updated to reflect this
>> # sync to ensure the allocation is committed
>> # the data is written to the clusters
>> # the L2 table is located; if it doesn't exist, it is allocated and linked
>> # the L2 table is updated
>> # sync to ensure the L2->data pointer is committed
>
> I have been thinking about changing this into:
>
> # clusters are allocated, and the refcount table is updated to reflect this
> # the data is written to the clusters
> # Return success to the caller and schedule a second part to be run
> immediately before the next sync is done for other reasons (e.g. the
> guest requesting a flush)
>
> This is done only before the next sync:
>
> # sync to ensure the allocation is committed
> # the L2 table is located; if it doesn't exist, it is allocated and linked
> # the L2 table is updated
> # One sync for everything that has accumulated
>
> This would leave us with no sync in the typical cluster allocation case.
> The downside is that we're in trouble if we get an I/O error during the
> pre-sync writes. Losing this data is actually okay, though, because the
> guest hasn't flushed yet.
>
> If you extend this to more complicated things like a COW this will
> involve a bit more syncs. We can probably have a queue like "before the
> next sync, do A, B and C; after the second one D; after the third one E
> and F".

If the metadata update code is smart enough it may be able to combine
multiple L2 updates contained in the same sector, for example.

If you have support for barriers/dependencies between metadata
updates, can this mechanism also be used to update the refcount table?
For example:

# Write new data to a free cluster

Out-of-line:
# Increment the refcount
# Then, the L2 table is located; if it doesn't exist, it is allocated
# the L2 table is updated
# Then, if the L2 table was allocated, the L1 table is updated

(where "Then," is an ordering relationship using a sync)

It feels like this route is powerful but becomes very complex, and
errors violating data integrity are subtle and may not be noticed.

Stefan


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 15:11   ` Kevin Wolf
@ 2010-09-14 15:20     ` Anthony Liguori
  2010-09-14 15:47       ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-09-14 15:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Avi Kivity, qemu-devel

On 09/14/2010 10:11 AM, Kevin Wolf wrote:
> Am 14.09.2010 15:43, schrieb Anthony Liguori:
>    
>> Hi Avi,
>>
>> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>      
>>>   Here's a draft of a plan that should improve qcow2 performance.  It's
>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>>> starting with # are numbered lists, not comments.
>>>        
>> Thanks for putting this together.  I think it's really useful to think
>> through the problem before anyone jumps in and starts coding.
>>
>>      
>>> = Basics =
>>>
>>> At the minimum level, no operation should block the main thread.  This
>>> could be done in two ways: extending the state machine so that each
>>> blocking operation can be performed asynchronously
>>> (<code>bdrv_aio_*</code>)
>>> or by threading: each new operation is handed off to a worker thread.
>>> Since a full state machine is prohibitively complex, this document
>>> will discuss threading.
>>>        
>> There's two distinct requirements that must be satisfied by a fast block
>> device.  The device must have fast implementations of aio functions and
>> it must support concurrent request processing.
>>
>> If an aio function blocks in the process of submitting the request, it's
>> by definition slow.  But even if you may the aio functions fast, you
>> still need to be able to support concurrent request processing in order
>> to achieve high throughput.
>>
>> I'm not going to comment in depth on your threading proposal.  When it
>> comes to adding concurrency, I think any approach will require a rewrite
>> of the qcow2 code and if the author of that rewrite is more comfortable
>> implementing concurrency with threads than with a state machine, I'm
>> happy with a threaded implementation.
>>
>> I'd suggest avoiding hyperbole like "a full state machine is
>> prohibitively complex".  QED is a full state machine.  qcow2 adds a
>> number of additional states because of the additional metadata and sync
>> operations but it's not an exponential increase in complexity.
>>      
> It will be quite some additional states that qcow2 brings in, but I
> suspect the really hard thing is getting the dependencies between
> requests right.
>
> I just had a look at how QED is doing this, and it seems to take the
> easy solution, namely allowing only one allocation at the same time.

One L2 allocation, not cluster allocations.  You can allocate multiple 
clusters concurrently and you can read/write L2s concurrently.

Since L2 allocation only happens every 2GB, it's a rare event.

>   So
> this is roughly equivalent to Avi's worker thread that runs today's
> qcow2 code and is protected by a global mutex.
>    

No.  First, you would have to dive deeply into the code and drop the 
qemu_mutex any time there is a synchronous IO op.  Moreover, the code is 
likely making assumptions today that the introduction of re-entrance 
into QEMU could break.

Basically, any time there's an unlock of qemu_mutex(), any state outside 
of qcow2 may have changed.  Are you 100% confident that's okay in the 
current code?

But with Avi's proposal, you don't have concurrency with any cluster 
allocation or read/write of metadata so it's not the same as QED today.

> No doubt compared to real concurrency this is relatively easy to achieve
> in either model (but probably easier with threads).
>    

FWIW, concurrent L2 allocation is not that hard in QED.  It just wasn't 
that important for the first implementation but it involves adding a 
work queue to each CachedL2Table such that you can stall multiple 
requests to the same L2 table while it's being allocated.

It doesn't change the state machine at all.

Regards,

Anthony Liguori

> Kevin
>    


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 13:07 [Qemu-devel] qcow2 performance plan Avi Kivity
  2010-09-14 13:43 ` Anthony Liguori
  2010-09-14 14:01 ` [Qemu-devel] " Kevin Wolf
@ 2010-09-14 15:25 ` Anthony Liguori
  2010-09-14 16:30   ` Avi Kivity
  2 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-09-14 15:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel

On 09/14/2010 08:07 AM, Avi Kivity wrote:
>  Here's a draft of a plan that should improve qcow2 performance.  It's 
> written in wiki syntax for eventual upload to wiki.qemu.org; lines 
> starting with # are numbered lists, not comments.
>
> = Basics =
>
> At the minimum level, no operation should block the main thread.  This
> could be done in two ways: extending the state machine so that each
> blocking operation can be performed asynchronously 
> (<code>bdrv_aio_*</code>)
> or by threading: each new operation is handed off to a worker thread.
> Since a full state machine is prohibitively complex, this document
> will discuss threading.
>
> == Basic threading strategy ==
>
> A first iteration of qcow2 threading adds a single mutex to an image.
> The existing qcow2 code is then executed within a worker thread,
> acquiring the mutex before starting any operation and releasing it
> after completion.  Concurrent operations will simply block until the
> operation is complete.  For operations which are already asynchronous,
> the blocking time will be negligible since the code will call
> <code>bdrv_aio_{read,write}</code> and return, releasing the mutex.
> The immediate benefit is that currently blocking operations no long block
> the main thread, instead they just block the block operation which is
> blocking anyway.
>
> == Eliminating the threading penalty ==
>
> We can eliminate pointless context switches by using the worker thread
> context we're in to issue the I/O.  This is trivial for synchronous calls
> (<code>bdrv_read</code> and <code>bdrv_write</code>); we simply issue 
> the I/O
> from the same thread we're currently in.  The underlying raw block format
> driver threading code needs to recognize we're in a worker thread 
> context so
> it doesn't need to use a worker thread of its own; perhaps using a thread
> variable to see if it is in the main thread or an I/O worker thread.
>
> For asynchronous operations, this is harder.  We may add a
> <code>bdrv_queue_aio_read</code> and <code>bdrv_queue_aio_write</code> if
> to replace a
>
>     bdrv_aio_read()
>     mutex_unlock(bs.mutex)
>     return;
>
> sequence.  Alternatively, we can just eliminate asynchronous calls.  To
> retain concurrency we drop the mutex while performing the operation:
> an convert a <code>bdrv_aio_read</code> to:
>
>     mutex_unlock(bs.mutex)
>     bdrv_read()
>     mutex_lock(bs.mutex)

The incremental version of this is hard for me to understand.  
bdrv_read() may be implemented in terms of bdrv_aio_read() + 
qemu_io_wait() which dispatches bottom halves.  This is done through a 
shared resource so if you allow bdrv_read() to be called in parallel, 
there's a very real possibility that you'll get corruption of a shared 
resource.

You'd have to first instrument bdrv_read() to be re-entrant by acquiring 
bs.mutex() in every bdrv_read() caller.  You would then need to modify 
the file protocol so that it could safely be called in parallel.

IOW, you've got to make the whole block layer thread safe before you can 
begin to make qcow2 thread safe.

Regards,

Anthony Liguori


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 15:20     ` Anthony Liguori
@ 2010-09-14 15:47       ` Kevin Wolf
  2010-09-14 16:03         ` Stefan Hajnoczi
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-09-14 15:47 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

On 14.09.2010 17:20, Anthony Liguori wrote:
> On 09/14/2010 10:11 AM, Kevin Wolf wrote:
>> Am 14.09.2010 15:43, schrieb Anthony Liguori:
>>    
>>> Hi Avi,
>>>
>>> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>>      
>>>>   Here's a draft of a plan that should improve qcow2 performance.  It's
>>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>>>> starting with # are numbered lists, not comments.
>>>>        
>>> Thanks for putting this together.  I think it's really useful to think
>>> through the problem before anyone jumps in and starts coding.
>>>
>>>      
>>>> = Basics =
>>>>
>>>> At the minimum level, no operation should block the main thread.  This
>>>> could be done in two ways: extending the state machine so that each
>>>> blocking operation can be performed asynchronously
>>>> (<code>bdrv_aio_*</code>)
>>>> or by threading: each new operation is handed off to a worker thread.
>>>> Since a full state machine is prohibitively complex, this document
>>>> will discuss threading.
>>>>        
>>> There's two distinct requirements that must be satisfied by a fast block
>>> device.  The device must have fast implementations of aio functions and
>>> it must support concurrent request processing.
>>>
>>> If an aio function blocks in the process of submitting the request, it's
>>> by definition slow.  But even if you may the aio functions fast, you
>>> still need to be able to support concurrent request processing in order
>>> to achieve high throughput.
>>>
>>> I'm not going to comment in depth on your threading proposal.  When it
>>> comes to adding concurrency, I think any approach will require a rewrite
>>> of the qcow2 code and if the author of that rewrite is more comfortable
>>> implementing concurrency with threads than with a state machine, I'm
>>> happy with a threaded implementation.
>>>
>>> I'd suggest avoiding hyperbole like "a full state machine is
>>> prohibitively complex".  QED is a full state machine.  qcow2 adds a
>>> number of additional states because of the additional metadata and sync
>>> operations but it's not an exponential increase in complexity.
>>>      
>> It will be quite some additional states that qcow2 brings in, but I
>> suspect the really hard thing is getting the dependencies between
>> requests right.
>>
>> I just had a look at how QED is doing this, and it seems to take the
>> easy solution, namely allowing only one allocation at the same time.
> 
> One L2 allocation, not cluster allocations.  You can allocate multiple 
> clusters concurrently and you can read/write L2s concurrently.
> 
> Since L2 allocation only happens every 2GB, it's a rare event.

Then your state machine is too complicated for me to understand. :-)

Let me try to chase function pointers for a simple cluster allocation:

bdrv_qed_aio_writev
qed_aio_setup
qed_aio_next_io
qed_find_cluster
qed_read_l2_table
...
qed_find_cluster_cb

This function contains the code to check if the cluster is already
allocated, right?

    n = qed_count_contiguous_clusters(s, request->l2_table->table,
                                      index, n, &offset);
    ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;

The callback called from there is qed_aio_write_data(..., ret =
QED_CLUSTER_L2, ...) which means

    bool need_alloc = ret != QED_CLUSTER_FOUND;
    /* Freeze this request if another allocating write is in progress */
    if (need_alloc) {
    ...

So where did I start to follow the path of an L2 table allocation instead
of a simple cluster allocation?

>>   So
>> this is roughly equivalent to Avi's worker thread that runs today's
>> qcow2 code and is protected by a global mutex.
>>    
> 
> No.  First, you would have to dive deeply into the code and drop the 
> qemu_mutex any time there is a synchronous IO op.  However, the code is 
> likely making assumptions today whereas the introduction of re-entrance 
> in QEMU could break things.

No need to drop the mutex for metadata writes in this simplistic
approach. The requests are processed one after another if they touch
metadata. It would only stop blocking the device model rather than
really processing more requests in parallel.

Kevin


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 15:47       ` Kevin Wolf
@ 2010-09-14 16:03         ` Stefan Hajnoczi
  2010-09-14 16:16           ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Hajnoczi @ 2010-09-14 16:03 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Avi Kivity, qemu-devel

On Tue, Sep 14, 2010 at 4:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 14.09.2010 17:20, schrieb Anthony Liguori:
>> On 09/14/2010 10:11 AM, Kevin Wolf wrote:
>>> Am 14.09.2010 15:43, schrieb Anthony Liguori:
>>>
>>>> Hi Avi,
>>>>
>>>> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>>>
>>>>>   Here's a draft of a plan that should improve qcow2 performance.  It's
>>>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>>>>> starting with # are numbered lists, not comments.
>>>>>
>>>> Thanks for putting this together.  I think it's really useful to think
>>>> through the problem before anyone jumps in and starts coding.
>>>>
>>>>
>>>>> = Basics =
>>>>>
>>>>> At the minimum level, no operation should block the main thread.  This
>>>>> could be done in two ways: extending the state machine so that each
>>>>> blocking operation can be performed asynchronously
>>>>> (<code>bdrv_aio_*</code>)
>>>>> or by threading: each new operation is handed off to a worker thread.
>>>>> Since a full state machine is prohibitively complex, this document
>>>>> will discuss threading.
>>>>>
>>>> There's two distinct requirements that must be satisfied by a fast block
>>>> device.  The device must have fast implementations of aio functions and
>>>> it must support concurrent request processing.
>>>>
>>>> If an aio function blocks in the process of submitting the request, it's
>>>> by definition slow.  But even if you may the aio functions fast, you
>>>> still need to be able to support concurrent request processing in order
>>>> to achieve high throughput.
>>>>
>>>> I'm not going to comment in depth on your threading proposal.  When it
>>>> comes to adding concurrency, I think any approach will require a rewrite
>>>> of the qcow2 code and if the author of that rewrite is more comfortable
>>>> implementing concurrency with threads than with a state machine, I'm
>>>> happy with a threaded implementation.
>>>>
>>>> I'd suggest avoiding hyperbole like "a full state machine is
>>>> prohibitively complex".  QED is a full state machine.  qcow2 adds a
>>>> number of additional states because of the additional metadata and sync
>>>> operations but it's not an exponential increase in complexity.
>>>>
>>> It will be quite some additional states that qcow2 brings in, but I
>>> suspect the really hard thing is getting the dependencies between
>>> requests right.
>>>
>>> I just had a look at how QED is doing this, and it seems to take the
>>> easy solution, namely allowing only one allocation at the same time.
>>
>> One L2 allocation, not cluster allocations.  You can allocate multiple
>> clusters concurrently and you can read/write L2s concurrently.
>>
>> Since L2 allocation only happens every 2GB, it's a rare event.
>
> Then your state machine is too complicated for me to understand. :-)
>
> Let me try to chase function pointers for a simple cluster allocation:
>
> bdrv_qed_aio_writev
> qed_aio_setup
> qed_aio_next_io
> qed_find_cluster
> qed_read_l2_table
> ...
> qed_find_cluster_cb
>
> This function contains the code to check if the cluster is already
> allocated, right?
>
>    n = qed_count_contiguous_clusters(s, request->l2_table->table,
>                                      index, n, &offset);
>    ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
>
> The callback called from there is qed_aio_write_data(..., ret =
> QED_CLUSTER_L2, ...) which means
>
>    bool need_alloc = ret != QED_CLUSTER_FOUND;
>    /* Freeze this request if another allocating write is in progress */
>    if (need_alloc) {
>    ...
>
> So where did I start to follow the path of a L2 table allocation instead
> of a simple cluster allocation?

qed_aio_write_main() writes the main body of data into the cluster.
Then it decides whether to update/allocate L2 tables if this is an
allocating write.

qed_aio_write_l2_update() is the function that gets called to touch
the L2 table (it also handles the allocation case).

Stefan


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 16:03         ` Stefan Hajnoczi
@ 2010-09-14 16:16           ` Anthony Liguori
  2010-09-14 16:28             ` Avi Kivity
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-09-14 16:16 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Avi Kivity, qemu-devel

On 09/14/2010 11:03 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 14, 2010 at 4:47 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>    
>> Am 14.09.2010 17:20, schrieb Anthony Liguori:
>>      
>>> On 09/14/2010 10:11 AM, Kevin Wolf wrote:
>>>        
>>>> Am 14.09.2010 15:43, schrieb Anthony Liguori:
>>>>
>>>>          
>>>>> Hi Avi,
>>>>>
>>>>> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>>>>
>>>>>            
>>>>>>    Here's a draft of a plan that should improve qcow2 performance.  It's
>>>>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>>>>>> starting with # are numbered lists, not comments.
>>>>>>
>>>>>>              
>>>>> Thanks for putting this together.  I think it's really useful to think
>>>>> through the problem before anyone jumps in and starts coding.
>>>>>
>>>>>
>>>>>            
>>>>>> = Basics =
>>>>>>
>>>>>> At the minimum level, no operation should block the main thread.  This
>>>>>> could be done in two ways: extending the state machine so that each
>>>>>> blocking operation can be performed asynchronously
>>>>>> (<code>bdrv_aio_*</code>)
>>>>>> or by threading: each new operation is handed off to a worker thread.
>>>>>> Since a full state machine is prohibitively complex, this document
>>>>>> will discuss threading.
>>>>>>
>>>>>>              
>>>>> There's two distinct requirements that must be satisfied by a fast block
>>>>> device.  The device must have fast implementations of aio functions and
>>>>> it must support concurrent request processing.
>>>>>
>>>>> If an aio function blocks in the process of submitting the request, it's
>>>>> by definition slow.  But even if you may the aio functions fast, you
>>>>> still need to be able to support concurrent request processing in order
>>>>> to achieve high throughput.
>>>>>
>>>>> I'm not going to comment in depth on your threading proposal.  When it
>>>>> comes to adding concurrency, I think any approach will require a rewrite
>>>>> of the qcow2 code and if the author of that rewrite is more comfortable
>>>>> implementing concurrency with threads than with a state machine, I'm
>>>>> happy with a threaded implementation.
>>>>>
>>>>> I'd suggest avoiding hyperbole like "a full state machine is
>>>>> prohibitively complex".  QED is a full state machine.  qcow2 adds a
>>>>> number of additional states because of the additional metadata and sync
>>>>> operations but it's not an exponential increase in complexity.
>>>>>
>>>>>            
>>>> It will be quite some additional states that qcow2 brings in, but I
>>>> suspect the really hard thing is getting the dependencies between
>>>> requests right.
>>>>
>>>> I just had a look at how QED is doing this, and it seems to take the
>>>> easy solution, namely allowing only one allocation at the same time.
>>>>          
>>> One L2 allocation, not cluster allocations.  You can allocate multiple
>>> clusters concurrently and you can read/write L2s concurrently.
>>>
>>> Since L2 allocation only happens every 2GB, it's a rare event.
>>>        
>> Then your state machine is too complicated for me to understand. :-)
>>
>> Let me try to chase function pointers for a simple cluster allocation:
>>
>> bdrv_qed_aio_writev
>> qed_aio_setup
>> qed_aio_next_io
>> qed_find_cluster
>> qed_read_l2_table
>> ...
>> qed_find_cluster_cb
>>
>> This function contains the code to check if the cluster is already
>> allocated, right?
>>
>>     n = qed_count_contiguous_clusters(s, request->l2_table->table,
>>                                       index, n,&offset);
>>     ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
>>
>> The callback called from there is qed_aio_write_data(..., ret =
>> QED_CLUSTER_L2, ...) which means
>>
>>     bool need_alloc = ret != QED_CLUSTER_FOUND;
>>     /* Freeze this request if another allocating write is in progress */
>>     if (need_alloc) {
>>     ...
>>
>> So where did I start to follow the path of a L2 table allocation instead
>> of a simple cluster allocation?
>>      
> qed_aio_write_main() writes the main body of data into the cluster.
> Then it decides whether to update/allocate L2 tables if this is an
> allocating write.
>    

Right, it should only freeze if the L2 table needs to be allocated, not 
if it only needs to be updated.  IOW,

diff --git a/block/qed.c b/block/qed.c
index 4c4e7a2..0357c03 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -948,7 +948,7 @@ static void qed_aio_write_data(void *opaque, int ret,
      }

      /* Freeze this request if another allocating write is in progress */
-    if (need_alloc) {
+    if (ret == QED_CLUSTER_L1) {
          if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
              QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
          }

It's being a bit more conservative than it needs to be.

Regards,

Anthony Liguori

> qed_aio_write_l2_update() is the function that gets called to touch
> the L2 table (it also handles the allocation case).
>
> Stefan
>
>    


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 16:16           ` Anthony Liguori
@ 2010-09-14 16:28             ` Avi Kivity
  2010-09-14 17:08               ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2010-09-14 16:28 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

  On 09/14/2010 06:16 PM, Anthony Liguori wrote:
>
> Right, it should only freeze if the L2 table needs to be allocated, 
> not if it only needs to be updated.  IOW,
>
> diff --git a/block/qed.c b/block/qed.c
> index 4c4e7a2..0357c03 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -948,7 +948,7 @@ static void qed_aio_write_data(void *opaque, int ret,
>      }
>
>      /* Freeze this request if another allocating write is in progress */
> -    if (need_alloc) {
> +    if (ret == QED_CLUSTER_L1) {
>          if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
>              QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
>          }
>
> It's being a bit more conservative than it needs to be.

Yes, I hit this too.  So without this patch, it does serialize all 
allocating writes?

If multiple requests need to update pointers in L2, will those updates 
generate one write per request, or just two writes (one write from the 
first request, another from all those that serialized after it)?

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 15:25 ` [Qemu-devel] " Anthony Liguori
@ 2010-09-14 16:30   ` Avi Kivity
  0 siblings, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2010-09-14 16:30 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel

  On 09/14/2010 05:25 PM, Anthony Liguori wrote:
>
> The incremental version of this is hard for me to understand.  
> bdrv_read() may be implemented in terms of bdrv_aio_read() + 
> qemu_io_wait() which dispatches bottom halves.  This is done through a 
> shared resource so if you allow bdrv_read() to be called in parallel, 
> there's a very real possibility that you'll get corruption of a shared 
> resource.
>
> You'd have to first instrument bdrv_read() to be re-entrant by 
> acquiring bs.mutex() in every bdrv_read() caller.  You would then need 
> to modify the file protocol so that it could safely be called in 
> parallel.
>
> IOW, you've got to make the whole block layer thread safe before you 
> can begin to make qcow2 thread safe.
>

It does rely on CONFIG_IO_THREAD and qemu_mutex.  However, we don't need 
to make the whole block layer thread safe, just the synchronous parts (drop 
qemu_mutex after I/O is issued, rely on the image's mutex for internal 
synchronization).

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 16:28             ` Avi Kivity
@ 2010-09-14 17:08               ` Anthony Liguori
  2010-09-14 17:23                 ` Avi Kivity
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-09-14 17:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On 09/14/2010 11:28 AM, Avi Kivity wrote:
>  On 09/14/2010 06:16 PM, Anthony Liguori wrote:
>>
>> Right, it should only freeze if the L2 table needs to be allocated, 
>> not if it only needs to be updated.  IOW,
>>
>> diff --git a/block/qed.c b/block/qed.c
>> index 4c4e7a2..0357c03 100644
>> --- a/block/qed.c
>> +++ b/block/qed.c
>> @@ -948,7 +948,7 @@ static void qed_aio_write_data(void *opaque, int 
>> ret,
>>      }
>>
>>      /* Freeze this request if another allocating write is in 
>> progress */
>> -    if (need_alloc) {
>> +    if (ret == QED_CLUSTER_L1) {
>>          if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
>>              QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
>>          }
>>
>> It's being a bit more conservative than it needs to be.
>
> Yes, I hit this too.  So without this patch, it does serialize all 
> allocating writes?

Yes, but my patch is not enough as it turns out.

When dealing with O_DIRECT, we have to handle read-modify-write (RMW) on 
our own, which means we need to serialize access to the same sector.

The way we're planning on addressing this in the short term is to break 
the single allocator queue into a per-L2 table queue.  So writes to the 
same L2 would be serialized but writes to different L2s would not be 
serialized.

Regards,

Anthony Liguori

> If multiple requests need to update pointers in L2, will those updates 
> generate one write per request, or just two writes (one write from the 
> first request, another from all those that serialized after it)?
>


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 17:08               ` Anthony Liguori
@ 2010-09-14 17:23                 ` Avi Kivity
  2010-09-14 18:58                   ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2010-09-14 17:23 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

  On 09/14/2010 07:08 PM, Anthony Liguori wrote:
>> Yes, I hit this too.  So without this patch, it does serialize all 
>> allocating writes?
>
>
> Yes, but my patch is not enough as it turns out.
>
> When dealing with O_DIRECT, we have to handle RMW on our own which 
> means we need to serialize access to the same sector.
>
> The way we're planning on addressing this in the short term is to 
> break the single allocator queue into a per-L2 table queue.  So writes 
> to the same L2 would be serialized but writes to different L2s would 
> not be serialized.
>

So at least I read the code correctly.

The next step (also addressed in the qcow2 performance plan) is to batch 
writes to L2.  You'd actually expect to have many concurrent allocating 
writes to one L2.  The first is sent to disk, but the following ones 
just mark the L2 dirty.  When the write returns, it sees it's still 
dirty and goes back to disk again.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] qcow2 performance plan
  2010-09-14 17:23                 ` Avi Kivity
@ 2010-09-14 18:58                   ` Anthony Liguori
  0 siblings, 0 replies; 15+ messages in thread
From: Anthony Liguori @ 2010-09-14 18:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On 09/14/2010 12:23 PM, Avi Kivity wrote:
>  On 09/14/2010 07:08 PM, Anthony Liguori wrote:
>>> Yes, I hit this too.  So without this patch, it does serialize all 
>>> allocating writes?
>>
>>
>> Yes, but my patch is not enough as it turns out.
>>
>> When dealing with O_DIRECT, we have to handle RMW on our own which 
>> means we need to serialize access to the same sector.
>>
>> The way we're planning on addressing this in the short term is to 
>> break the single allocator queue into a per-L2 table queue.  So 
>> writes to the same L2 would be serialized but writes to different L2s 
>> would not be serialized.
>>
>
> So at least I read the code correctly.
>
> The next step (also addressed in the qcow2 performance plan) is to 
> batch writes to L2.  You'd actually expect to have many concurrent 
> allocating writes to one L2.  The first is sent to disk, but the 
> following ones just mark the L2 dirty.  When the write returns, it 
> sees it's still dirty and goes back to disk again.

Yeah, I have to think through batching quite a bit more but I agree that 
batching should be a natural next step and can further reduce the cost 
of updating metadata in a streaming workload.

Regards,

Anthony Liguori
