* [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
@ 2010-09-07 13:41 Anthony Liguori
  2010-09-07 14:01 ` Alexander Graf
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 13:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: libvir-list, Stefan Hajnoczi

Hi,

We've got copy-on-read and image streaming working in QED and before 
going much further, I wanted to bounce some interfaces off of the 
libvirt folks to make sure our final interface makes sense.

Here's the basic idea:

Today, you can create images based on base images that are copy on 
write.  With QED, we also support copy on read which forces a copy from 
the backing image on read requests and write requests.

In addition to copy on read, we introduce a notion of streaming a 
block device which means that we search for an unallocated region of the 
leaf image and force a copy-on-read operation.

The combination of copy-on-read and streaming means that you can start a 
guest based on slow storage (like over the network) and bring in blocks 
on demand while also having a deterministic mechanism to complete the 
transfer.

The interface for copy-on-read is just an option within qemu-img 
create.  Streaming, on the other hand, requires a bit more thought.  
Today, I have a monitor command that does the following:

stream <device> <sector offset>

Which will try to stream the minimal amount of data for a single I/O 
operation and then return how many sectors were successfully streamed.

The idea about how to drive this interface is a loop like:

offset = 0;
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness.  
The thing I'm not sure about is 1) would libvirt want to expose a 
similar stream interface and let management software determine idle time, 
or 2) attempt to detect idle time on its own and provide a higher level 
interface.  If (2), the question then becomes whether we should try to 
do this within qemu and provide libvirt a higher level interface.

A related topic is block migration.  Today we support pre-copy migration 
which means we transfer the block device and then do a live migration.  
Another approach is to do a live migration, and on the source, run a 
block server using image streaming on the destination to move the device.

With QED, to implement this one would:

1) launch qemu-nbd on the source while the guest is running
2) create a qed file on the destination with copy-on-read enabled and a 
backing file using nbd: to point to the source qemu-nbd
3) run qemu -incoming on the destination with the qed file
4) execute the migration
5) when migration completes, begin streaming on the destination to 
complete the copy
6) when the streaming is complete, shut down the qemu-nbd instance on 
the source
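
As a rough sketch, steps 1-3 might look like this on the command line
(the copy_on_read creation option is hypothetical and the exact syntax
is illustrative, not final):

# on the source: export the running guest's image over NBD
qemu-nbd --port 10809 guest.qed

# on the destination: create a COR image backed by the source's NBD export
qemu-img create -f qed -o copy_on_read=on -b nbd:src-host:10809 guest.qed

# on the destination: start QEMU waiting for the incoming migration
qemu -hda guest.qed -incoming tcp:0:4444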

This is a bit involved and we could potentially automate some of this in 
qemu by launching qemu-nbd and providing commands to do some of this.  
Again though, I think the question is what type of interfaces would 
libvirt prefer?  Low level interfaces + recipes on how to do high level 
things or higher level interfaces?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 13:41 [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration Anthony Liguori
@ 2010-09-07 14:01 ` Alexander Graf
  2010-09-07 14:31   ` Anthony Liguori
  2010-09-07 14:33 ` Stefan Hajnoczi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Alexander Graf @ 2010-09-07 14:01 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi


On 07.09.2010, at 15:41, Anthony Liguori wrote:

> Hi,
> 
> We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
> 
> Here's the basic idea:
> 
> Today, you can create images based on base images that are copy on write.  With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
> 
> In addition to copy on read, we introduce a notion of streaming a block device which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
> 
> The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
> 
> The interface for copy-on-read is just an option within qemu-img create.  Streaming, on the other hand, requires a bit more thought.  Today, I have a monitor command that does the following:
> 
> stream <device> <sector offset>
> 
> Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
> 
> The idea about how to drive this interface is a loop like:
> 
> offset = 0;
> while offset < image_size:
>   wait_for_idle_time()
>   count = stream(device, offset)
>   offset += count
> 
> Obviously, the "wait_for_idle_time()" requires wide system awareness.  The thing I'm not sure about is 1) would libvirt want to expose a similar stream interface and let management software determine idle time 2) attempt to detect idle time on its own and provide a higher level interface.  If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.

I'm torn here too. Why not expose both? Have a qemu internal daemon available that gets a sleep time as a parameter, plus an external "pull sectors" command. We'll see which one is more useful, and I don't think it's so much code that we'd be justified in having only one of the two. And the internal daemon could be started using a command line parameter, which helps non-managed users.

> 
> A related topic is block migration.  Today we support pre-copy migration which means we transfer the block device and then do a live migration.  Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
> 
> With QED, to implement this one would:
> 
> 1) launch qemu-nbd on the source while the guest is running
> 2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd
> 3) run qemu -incoming on the destination with the qed file
> 4) execute the migration
> 5) when migration completes, begin streaming on the destination to complete the copy
> 6) when the streaming is complete, shut down the qemu-nbd instance on the source
> 
> This is a bit involved and we could potentially automate some of this in qemu by launching qemu-nbd and providing commands to do some of this.  Again though, I think the question is what type of interfaces would libvirt prefer?  Low level interfaces + recipes on how to do high level things or higher level interfaces?

Is there anything keeping us from making the QMP socket multiplexable? I was thinking of something like:

{ command = "nbd_server" ; block = "qemu_block_name" }
{ result = "done" }
<qmp socket turns into nbd socket>

This way we don't require yet another port, don't have to care about conflicts and get internal qemu block names for free.
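
In actual QMP framing, that exchange might look like the following (the
nbd_server command itself is hypothetical):

{ "execute": "nbd_server", "arguments": { "device": "qemu_block_name" } }
{ "return": {} }
<qmp socket turns into nbd socket>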


Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:01 ` Alexander Graf
@ 2010-09-07 14:31   ` Anthony Liguori
  0 siblings, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 14:31 UTC (permalink / raw)
  To: Alexander Graf; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 09:01 AM, Alexander Graf wrote:
> I'm torn here too. Why not expose both? Have a qemu internal daemon available that gets a sleep time as parameter and an external "pull sectors" command. We'll see which one is more useful, but I don't think it's too much code to justify only having one of the two. And the internal daemon could be started using a command line parameter, which helps non-managed users.
>    

Let me turn it around and ask, how would libvirt do this?  Would they 
just use a sleep time parameter and make use of our command, or 
would they do something more clever and attempt to detect system idle?  
Could we just do that in qemu?

Or would they punt to the end user?

>> A related topic is block migration.  Today we support pre-copy migration which means we transfer the block device and then do a live migration.  Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
>>
>> With QED, to implement this one would:
>>
>> 1) launch qemu-nbd on the source while the guest is running
>> 2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd
>> 3) run qemu -incoming on the destination with the qed file
>> 4) execute the migration
>> 5) when migration completes, begin streaming on the destination to complete the copy
>> 6) when the streaming is complete, shut down the qemu-nbd instance on the source
>>
>> This is a bit involved and we could potentially automate some of this in qemu by launching qemu-nbd and providing commands to do some of this.  Again though, I think the question is what type of interfaces would libvirt prefer?  Low level interfaces + recipes on how to do high level things or higher level interfaces?
>>      
> Is there anything keeping us from making the QMP socket multiplexable? I was thinking of something like:
>
> { command = "nbd_server" ; block = "qemu_block_name" }
> { result = "done" }
> <qmp socket turns into nbd socket>
>
> This way we don't require yet another port, don't have to care about conflicts and get internal qemu block names for free.
>    

Possibly, but something that complicates life here is that an nbd 
session would be source -> destination but there's no QMP session 
between source and destination.  Instead, there's a session from source 
-> management node and destination -> management node so you'd have to 
proxy nbd traffic between the two.  That gets ugly quick.

Regards,

Anthony Liguori

> Alex
>
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 13:41 [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration Anthony Liguori
  2010-09-07 14:01 ` Alexander Graf
@ 2010-09-07 14:33 ` Stefan Hajnoczi
  2010-09-07 14:51   ` Anthony Liguori
  2010-09-07 14:34 ` Kevin Wolf
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2010-09-07 14:33 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori
<aliguori@linux.vnet.ibm.com> wrote:
> The interface for copy-on-read is just an option within qemu-img create.
>  Streaming, on the other hand, requires a bit more thought.  Today, I have a
> monitor command that does the following:
>
> stream <device> <sector offset>
>
> Which will try to stream the minimal amount of data for a single I/O
> operation and then return how many sectors were successfully streamed.
>
> The idea about how to drive this interface is a loop like:
>
> offset = 0;
> while offset < image_size:
>   wait_for_idle_time()
>   count = stream(device, offset)
>   offset += count
>
> Obviously, the "wait_for_idle_time()" requires wide system awareness.  The
> thing I'm not sure about is 1) would libvirt want to expose a similar stream
> interface and let management software determine idle time 2) attempt to
> detect idle time on its own and provide a higher level interface.  If (2),
> the question then becomes whether we should try to do this within qemu and
> provide libvirt a higher level interface.

A self-tuning solution is attractive because it reduces the need for
other components (management stack) or the user to get involved.  In
this case self-tuning should be possible.  We need to detect periods
of I/O inactivity, for example by tracking the number of in-flight
requests and then setting a grace timer when it reaches zero.  When
the grace timer expires, we start streaming until the guest initiates
I/O again.
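
A minimal sketch of that grace-timer idea against the current timer API
(the request-accounting hooks and the grace period constant are
assumptions):

static QEMUTimer *grace_timer;  /* created elsewhere with
                                   qemu_new_timer(rt_clock, start_streaming, bs) */
static unsigned int in_flight;

#define GRACE_PERIOD_MS 500

/* call when the guest submits a request: pause streaming */
static void guest_io_started(void)
{
    in_flight++;
    qemu_del_timer(grace_timer);
}

/* call when a guest request completes: arm the grace timer once idle */
static void guest_io_completed(void)
{
    if (--in_flight == 0) {
        qemu_mod_timer(grace_timer,
                       qemu_get_clock(rt_clock) + GRACE_PERIOD_MS);
    }
}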

Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 13:41 [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration Anthony Liguori
  2010-09-07 14:01 ` Alexander Graf
  2010-09-07 14:33 ` Stefan Hajnoczi
@ 2010-09-07 14:34 ` Kevin Wolf
  2010-09-07 14:49   ` Stefan Hajnoczi
  2010-09-07 14:49   ` Anthony Liguori
  2010-09-07 15:03 ` [Qemu-devel] " Daniel P. Berrange
  2010-09-12 10:55 ` [Qemu-devel] " Avi Kivity
  4 siblings, 2 replies; 32+ messages in thread
From: Kevin Wolf @ 2010-09-07 14:34 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 07.09.2010 15:41, Anthony Liguori wrote:
> Hi,
> 
> We've got copy-on-read and image streaming working in QED and before 
> going much further, I wanted to bounce some interfaces off of the 
> libvirt folks to make sure our final interface makes sense.
> 
> Here's the basic idea:
> 
> Today, you can create images based on base images that are copy on 
> write.  With QED, we also support copy on read which forces a copy from 
> the backing image on read requests and write requests.
> 
> In addition to copy on read, we introduce a notion of streaming a 
> block device which means that we search for an unallocated region of the 
> leaf image and force a copy-on-read operation.
> 
> The combination of copy-on-read and streaming means that you can start a 
> guest based on slow storage (like over the network) and bring in blocks 
> on demand while also having a deterministic mechanism to complete the 
> transfer.
> 
> The interface for copy-on-read is just an option within qemu-img 
> create.  

Shouldn't it be a runtime option? You can use the very same image with
copy-on-read or copy-on-write and it will behave the same (except for
performance), so it's not an inherent feature of the image file.

Doing it this way has the additional advantage that you need no image
format support for this, so we could implement copy-on-read for other
formats, too.

> Streaming, on the other hand, requires a bit more thought.  
> Today, I have a monitor command that does the following:
> 
> stream <device> <sector offset>
> 
> Which will try to stream the minimal amount of data for a single I/O 
> operation and then return how many sectors were successfully streamed.
> 
> The idea about how to drive this interface is a loop like:
> 
> offset = 0;
> while offset < image_size:
>     wait_for_idle_time()
>     count = stream(device, offset)
>     offset += count
> 
> Obviously, the "wait_for_idle_time()" requires wide system awareness.  
> The thing I'm not sure about is 1) would libvirt want to expose a 
> similar stream interface and let management software determine idle time 
> 2) attempt to detect idle time on its own and provide a higher level 
> interface.  If (2), the question then becomes whether we should try to 
> do this within qemu and provide libvirt a higher level interface.

I think libvirt shouldn't have to care about sector offsets. You should
just tell qemu to fetch the image and it should do so. We could have
something like -drive backing_mode=[cow|cor|stream].

> A related topic is block migration.  Today we support pre-copy migration 
> which means we transfer the block device and then do a live migration.  
> Another approach is to do a live migration, and on the source, run a 
> block server using image streaming on the destination to move the device.
> 
> With QED, to implement this one would:
> 
> 1) launch qemu-nbd on the source while the guest is running
> 2) create a qed file on the destination with copy-on-read enabled and a 
> backing file using nbd: to point to the source qemu-nbd
> 3) run qemu -incoming on the destination with the qed file
> 4) execute the migration
> 5) when migration completes, begin streaming on the destination to 
> complete the copy
> 6) when the streaming is complete, shut down the qemu-nbd instance on 
> the source

Hm, that's an interesting idea. :-)

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:34 ` Kevin Wolf
@ 2010-09-07 14:49   ` Stefan Hajnoczi
  2010-09-07 14:57     ` Anthony Liguori
  2010-09-07 14:49   ` Anthony Liguori
  1 sibling, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2010-09-07 14:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: libvir-list, Anthony Liguori, qemu-devel, Stefan Hajnoczi

On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 07.09.2010 15:41, Anthony Liguori wrote:
>> Hi,
>>
>> We've got copy-on-read and image streaming working in QED and before
>> going much further, I wanted to bounce some interfaces off of the
>> libvirt folks to make sure our final interface makes sense.
>>
>> Here's the basic idea:
>>
>> Today, you can create images based on base images that are copy on
>> write.  With QED, we also support copy on read which forces a copy from
>> the backing image on read requests and write requests.
>>
>> In addition to copy on read, we introduce a notion of streaming a
>> block device which means that we search for an unallocated region of the
>> leaf image and force a copy-on-read operation.
>>
>> The combination of copy-on-read and streaming means that you can start a
>> guest based on slow storage (like over the network) and bring in blocks
>> on demand while also having a deterministic mechanism to complete the
>> transfer.
>>
>> The interface for copy-on-read is just an option within qemu-img
>> create.
>
> Shouldn't it be a runtime option? You can use the very same image with
> copy-on-read or copy-on-write and it will behave the same (except for
> performance), so it's not an inherent feature of the image file.
>
> Doing it this way has the additional advantage that you need no image
> format support for this, so we could implement copy-on-read for other
> formats, too.

I agree that streaming should be generic, like block migration.  The
trivial generic implementation is:

void bdrv_stream(BlockDriverState *bs)
{
    /* note: bdrv_getlength() returns bytes, not sectors */
    int64_t sector, total = bdrv_getlength(bs) / BDRV_SECTOR_SIZE;
    int n;

    for (sector = 0; sector < total; sector += n) {
        if (!bdrv_is_allocated(bs, sector, total - sector, &n)) {
            uint8_t *buf = qemu_malloc(n * BDRV_SECTOR_SIZE);
            /* the read falls through to the backing file; the write
             * allocates the data in the leaf image */
            bdrv_read(bs, sector, buf, n);
            bdrv_write(bs, sector, buf, n);
            qemu_free(buf);
        }
    }
}

>
>> Streaming, on the other hand, requires a bit more thought.
>> Today, I have a monitor command that does the following:
>>
>> stream <device> <sector offset>
>>
>> Which will try to stream the minimal amount of data for a single I/O
>> operation and then return how many sectors were successfully streamed.
>>
>> The idea about how to drive this interface is a loop like:
>>
>> offset = 0;
>> while offset < image_size:
>>     wait_for_idle_time()
>>     count = stream(device, offset)
>>     offset += count
>>
>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>> The thing I'm not sure about is 1) would libvirt want to expose a
>> similar stream interface and let management software determine idle time
>> 2) attempt to detect idle time on its own and provide a higher level
>> interface.  If (2), the question then becomes whether we should try to
>> do this within qemu and provide libvirt a higher level interface.
>
> I think libvirt shouldn't have to care about sector offsets. You should
> just tell qemu to fetch the image and it should do so. We could have
> something like -drive backing_mode=[cow|cor|stream].
>
>> A related topic is block migration.  Today we support pre-copy migration
>> which means we transfer the block device and then do a live migration.
>> Another approach is to do a live migration, and on the source, run a
>> block server using image streaming on the destination to move the device.
>>
>> With QED, to implement this one would:
>>
>> 1) launch qemu-nbd on the source while the guest is running
>> 2) create a qed file on the destination with copy-on-read enabled and a
>> backing file using nbd: to point to the source qemu-nbd
>> 3) run qemu -incoming on the destination with the qed file
>> 4) execute the migration
>> 5) when migration completes, begin streaming on the destination to
>> complete the copy
>> 6) when the streaming is complete, shut down the qemu-nbd instance on
>> the source
>
> Hm, that's an interesting idea. :-)
>
> Kevin
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:34 ` Kevin Wolf
  2010-09-07 14:49   ` Stefan Hajnoczi
@ 2010-09-07 14:49   ` Anthony Liguori
  2010-09-07 15:02     ` Kevin Wolf
  1 sibling, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 14:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 09:34 AM, Kevin Wolf wrote:
> On 07.09.2010 15:41, Anthony Liguori wrote:
>    
>> Hi,
>>
>> We've got copy-on-read and image streaming working in QED and before
>> going much further, I wanted to bounce some interfaces off of the
>> libvirt folks to make sure our final interface makes sense.
>>
>> Here's the basic idea:
>>
>> Today, you can create images based on base images that are copy on
>> write.  With QED, we also support copy on read which forces a copy from
>> the backing image on read requests and write requests.
>>
>> In addition to copy on read, we introduce a notion of streaming a
>> block device which means that we search for an unallocated region of the
>> leaf image and force a copy-on-read operation.
>>
>> The combination of copy-on-read and streaming means that you can start a
>> guest based on slow storage (like over the network) and bring in blocks
>> on demand while also having a deterministic mechanism to complete the
>> transfer.
>>
>> The interface for copy-on-read is just an option within qemu-img
>> create.
>>      
> Shouldn't it be a runtime option? You can use the very same image with
> copy-on-read or copy-on-write and it will behave the same (except for
> performance), so it's not an inherent feature of the image file.
>    

The way it's implemented in QED is that it's a compatible feature.  This 
means that implementations are allowed to ignore it if they want to.  
It's really a suggestion.
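
To sketch what "compatible feature" means in practice (field and flag
names here are assumptions modeled on the QED header, not its actual
definitions):

if (header.features & ~QED_FEATURES_KNOWN) {
    return -ENOTSUP;  /* unknown *incompatible* feature: must refuse */
}
/* compat bits may simply be ignored by implementations that don't
 * understand them */
copy_on_read = !!(header.compat_features & QED_CF_COPY_ON_READ);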

So yes, you could have a run time switch that overrides the feature bit 
on disk and forces copy-on-read either on or off.

Do we have a way to pass block drivers run time options?

> Doing it this way has the additional advantage that you need no image
> format support for this, so we could implement copy-on-read for other
> formats, too.
>    

To do it efficiently, it really needs to be in the format for the same 
reason that copy-on-write is part of the format.

You need to understand the cluster boundaries in order to optimize the 
metadata updates.  Sure, you can expose interfaces to the block layer to 
give all of this info, but then you're solving the same problem you'd 
have to solve to do copy-on-write at the block level.

The other challenge is that for copy-on-read to be efficient, you 
really need a format that can distinguish between unallocated sectors 
and zero sectors and do zero detection during the copy-on-read 
operation.  Otherwise, if you have a 10G virtual disk with a backing 
file that's 1GB in size, copy-on-read will result in the leaf being 10G 
instead of ~1GB.
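
For concreteness, a sketch of the kind of zero detection meant here (the
helper and the way it hooks into the copy-on-read path are illustrative,
not QED's actual code):

static int buffer_is_zero(const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/* in the copy-on-read path, after reading n sectors from the backing file */
if (buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
    /* record a zero cluster in the metadata; don't allocate data */
} else {
    bdrv_write(bs, sector, buf, n);  /* allocate and fill the cluster */
}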

>> Streaming, on the other hand, requires a bit more thought.
>> Today, I have a monitor command that does the following:
>>
>> stream <device> <sector offset>
>>
>> Which will try to stream the minimal amount of data for a single I/O
>> operation and then return how many sectors were successfully streamed.
>>
>> The idea about how to drive this interface is a loop like:
>>
>> offset = 0;
>> while offset < image_size:
>>      wait_for_idle_time()
>>      count = stream(device, offset)
>>      offset += count
>>
>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>> The thing I'm not sure about is 1) would libvirt want to expose a
>> similar stream interface and let management software determine idle time
>> 2) attempt to detect idle time on its own and provide a higher level
>> interface.  If (2), the question then becomes whether we should try to
>> do this within qemu and provide libvirt a higher level interface.
>>      
> I think libvirt shouldn't have to care about sector offsets. You should
> just tell qemu to fetch the image and it should do so. We could have
> something like -drive backing_mode=[cow|cor|stream].
>    

This interface lets libvirt decide when the I/O system is idle.  The 
sector is really just a token to keep track of our overall progress.

One thing I envisioned was that a tool like virt-manager could have a 
progress bar showing the streaming progress.  It could update the 
progress bar based on (offset * 512) / image_size.

If libvirt isn't driving it, we need to detect idle I/O time and we need 
to provide an interface to query status.  Not a huge problem but I'm not 
sure that a single QEMU instance can properly detect idle I/O time.

Regards,

Anthony Liguori

>> A related topic is block migration.  Today we support pre-copy migration
>> which means we transfer the block device and then do a live migration.
>> Another approach is to do a live migration, and on the source, run a
>> block server using image streaming on the destination to move the device.
>>
>> With QED, to implement this one would:
>>
>> 1) launch qemu-nbd on the source while the guest is running
>> 2) create a qed file on the destination with copy-on-read enabled and a
>> backing file using nbd: to point to the source qemu-nbd
>> 3) run qemu -incoming on the destination with the qed file
>> 4) execute the migration
>> 5) when migration completes, begin streaming on the destination to
>> complete the copy
>> 6) when the streaming is complete, shut down the qemu-nbd instance on
>> the source
>>      
> Hm, that's an interesting idea. :-)
>
> Kevin
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:33 ` Stefan Hajnoczi
@ 2010-09-07 14:51   ` Anthony Liguori
  2010-09-07 14:55     ` Stefan Hajnoczi
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 14:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori
> <aliguori@linux.vnet.ibm.com>  wrote:
>    
>> The interface for copy-on-read is just an option within qemu-img create.
>>   Streaming, on the other hand, requires a bit more thought.  Today, I have a
>> monitor command that does the following:
>>
>> stream <device> <sector offset>
>>
>> Which will try to stream the minimal amount of data for a single I/O
>> operation and then return how many sectors were successfully streamed.
>>
>> The idea about how to drive this interface is a loop like:
>>
>> offset = 0;
>> while offset < image_size:
>>    wait_for_idle_time()
>>    count = stream(device, offset)
>>    offset += count
>>
>> Obviously, the "wait_for_idle_time()" requires wide system awareness.  The
>> thing I'm not sure about is 1) would libvirt want to expose a similar stream
>> interface and let management software determine idle time 2) attempt to
>> detect idle time on its own and provide a higher level interface.  If (2),
>> the question then becomes whether we should try to do this within qemu and
>> provide libvirt a higher level interface.
>>      
> A self-tuning solution is attractive because it reduces the need for
> other components (management stack) or the user to get involved.  In
> this case self-tuning should be possible.  We need to detect periods
> of I/O inactivity, for example tracking the number of in-flight
> requests and then setting a grace timer when it reaches zero.  When
> the grace timer expires, we start streaming until the guest initiates
> I/O again.
>    

That detects idle I/O within a single QEMU guest, but you might have 
another guest running that's I/O bound, which means that from an overall 
system throughput perspective, you really don't want to stream.

I think libvirt might be able to do a better job here by looking at 
overall system I/O usage.  But I'm not sure, hence this RFC :-)

Regards,

Anthony Liguori

> Stefan
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:51   ` Anthony Liguori
@ 2010-09-07 14:55     ` Stefan Hajnoczi
  2010-09-07 15:00       ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2010-09-07 14:55 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori
<aliguori@linux.vnet.ibm.com> wrote:
> On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
>>
>> On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori
>> <aliguori@linux.vnet.ibm.com>  wrote:
>>
>>>
>>> The interface for copy-on-read is just an option within qemu-img create.
>>>  Streaming, on the other hand, requires a bit more thought.  Today, I
>>> have a
>>> monitor command that does the following:
>>>
>>> stream <device> <sector offset>
>>>
>>> Which will try to stream the minimal amount of data for a single I/O
>>> operation and then return how many sectors were successfully streamed.
>>>
>>> The idea about how to drive this interface is a loop like:
>>>
>>> offset = 0;
>>> while offset < image_size:
>>>   wait_for_idle_time()
>>>   count = stream(device, offset)
>>>   offset += count
>>>
>>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>>>  The
>>> thing I'm not sure about is 1) would libvirt want to expose a similar
>>> stream
>>> interface and let management software determine idle time 2) attempt to
>>> detect idle time on its own and provide a higher level interface.  If
>>> (2),
>>> the question then becomes whether we should try to do this within qemu
>>> and
>>> provide libvirt a higher level interface.
>>>
>>
>> A self-tuning solution is attractive because it reduces the need for
>> other components (management stack) or the user to get involved.  In
>> this case self-tuning should be possible.  We need to detect periods
>> of I/O inactivity, for example tracking the number of in-flight
>> requests and then setting a grace timer when it reaches zero.  When
>> the grace timer expires, we start streaming until the guest initiates
>> I/O again.
>>
>
> That detects idle I/O within a single QEMU guest, but you might have another
> guest running that's I/O bound which means that from an overall system
> throughput perspective, you really don't want to stream.
>
> I think libvirt might be able to do a better job here by looking at overall
> system I/O usage.  But I'm not sure hence this RFC :-)

Isn't this what the block I/O controller cgroup is meant to solve?  If
you give vm-1 50% block bandwidth and vm-2 50% block bandwidth then
vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
Also, I'm not sure we should worry about the priority of the I/O too
much: perhaps the user wants their vm to stream more than they want an
unimportant local vm that is currently I/O bound to have all resources
to itself.  So I think it makes sense to defer this and not try for
system-wide knowledge inside a QEMU process.

Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:49   ` Stefan Hajnoczi
@ 2010-09-07 14:57     ` Anthony Liguori
  2010-09-07 15:05       ` Stefan Hajnoczi
  2010-09-12 12:41       ` Avi Kivity
  0 siblings, 2 replies; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 14:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 09:49 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>    
> On 07.09.2010 15:41, Anthony Liguori wrote:
>>      
>>> Hi,
>>>
>>> We've got copy-on-read and image streaming working in QED and before
>>> going much further, I wanted to bounce some interfaces off of the
>>> libvirt folks to make sure our final interface makes sense.
>>>
>>> Here's the basic idea:
>>>
>>> Today, you can create images based on base images that are copy on
>>> write.  With QED, we also support copy on read which forces a copy from
>>> the backing image on read requests and write requests.
>>>
>>> In addition to copy on read, we introduce a notion of streaming a
>>> block device which means that we search for an unallocated region of the
>>> leaf image and force a copy-on-read operation.
>>>
>>> The combination of copy-on-read and streaming means that you can start a
>>> guest based on slow storage (like over the network) and bring in blocks
>>> on demand while also having a deterministic mechanism to complete the
>>> transfer.
>>>
>>> The interface for copy-on-read is just an option within qemu-img
>>> create.
>>>        
>> Shouldn't it be a runtime option? You can use the very same image with
>> copy-on-read or copy-on-write and it will behave the same (except for
>> performance), so it's not an inherent feature of the image file.
>>
>> Doing it this way has the additional advantage that you need no image
>> format support for this, so we could implement copy-on-read for other
>> formats, too.
>>      
> I agree that streaming should be generic, like block migration.  The
> trivial generic implementation is:
>
> void bdrv_stream(BlockDriverState* bs)
> {
>      for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
>          if (!bdrv_is_allocated(bs, sector, &n)) {
>    

Three problems here.  First problem is that bdrv_is_allocated is 
synchronous.  The second problem is that streaming makes the most sense 
when it's the smallest useful piece of work whereas bdrv_is_allocated() 
may return a very large range.

You could cap it here, but you then need to make sure that the cap is at 
least cluster_size to avoid a lot of unnecessary I/O.

The QED streaming implementation is only 140 lines of code, so you 
quickly end up 
adding more code to the block formats to support these new interfaces 
than it takes to just implement it in the block format.

Third problem is that streaming really requires being able to do zero 
write detection in a meaningful way.  You don't want to always do zero 
write detection so you need another interface to mark a specific write 
as a write that should be checked for zeros.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:55     ` Stefan Hajnoczi
@ 2010-09-07 15:00       ` Anthony Liguori
  2010-09-07 15:09         ` Stefan Hajnoczi
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 15:00 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 09:55 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori
> <aliguori@linux.vnet.ibm.com>  wrote:
>    
>> On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
>>      
>>> On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori
>>> <aliguori@linux.vnet.ibm.com>    wrote:
>>>
>>>        
>>>> The interface for copy-on-read is just an option within qemu-img create.
>>>>   Streaming, on the other hand, requires a bit more thought.  Today, I
>>>> have a
>>>> monitor command that does the following:
>>>>
>>>> stream <device> <sector offset>
>>>>
>>>> Which will try to stream the minimal amount of data for a single I/O
>>>> operation and then return how many sectors were successfully streamed.
>>>>
>>>> The idea about how to drive this interface is a loop like:
>>>>
>>>> offset = 0;
>>>> while offset < image_size:
>>>>    wait_for_idle_time()
>>>>    count = stream(device, offset)
>>>>    offset += count
>>>>
>>>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>>>>   The
>>>> thing I'm not sure about is 1) would libvirt want to expose a similar
>>>> stream
>>>> interface and let management software determine idle time 2) attempt to
>>>> detect idle time on its own and provide a higher level interface.  If
>>>> (2),
>>>> the question then becomes whether we should try to do this within qemu
>>>> and
>>>> provide libvirt a higher level interface.
>>>>
>>>>          
>>> A self-tuning solution is attractive because it reduces the need for
>>> other components (management stack) or the user to get involved.  In
>>> this case self-tuning should be possible.  We need to detect periods
>>> of I/O inactivity, for example tracking the number of in-flight
>>> requests and then setting a grace timer when it reaches zero.  When
>>> the grace timer expires, we start streaming until the guest initiates
>>> I/O again.
>>>
>>>        
>> That detects idle I/O within a single QEMU guest, but you might have another
>> guest running that's I/O bound which means that from an overall system
>> throughput perspective, you really don't want to stream.
>>
>> I think libvirt might be able to do a better job here by looking at overall
>> system I/O usage.  But I'm not sure hence this RFC :-)
>>      
> Isn't this what block I/O controller cgroups is meant to solve?  If
> you give vm-1 50% block bandwidth and vm-2 50% block bandwidth then
> vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
>    

That assumes you're capping I/O.  But sometimes you care about overall 
system throughput more than you care about any individual VM.

Another way to look at it may be, a user waits for a cron job that runs 
at midnight and starts streaming as necessary.  However, the user wants 
to be able to interrupt the streaming should there be a sudden demand.

If the user drives the streaming through an interface like I've 
specified, they're in full control.  It's pretty simple to build 
interfaces on top of this that implement streaming as an aggressive or 
conservative background task too.

>   Also, I'm not sure we should worry about the priority of the I/O too
> much: perhaps the user wants their vm to stream more than they want an
> unimportant local vm that is currently I/O bound to have all resources
> to itself.  So I think it makes sense to defer this and not try for
> system-wide knowledge inside a QEMU process.
>    

Right, so that argues for an incremental interface like I started with :-)

BTW, this whole discussion is also relevant for other background tasks 
like online defragmentation so keep that use-case in mind too.

Regards,

Anthony Liguori

> Stefan
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:49   ` Anthony Liguori
@ 2010-09-07 15:02     ` Kevin Wolf
  2010-09-07 15:11       ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2010-09-07 15:02 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 07.09.2010 16:49, Anthony Liguori wrote:
>> Shouldn't it be a runtime option? You can use the very same image with
>> copy-on-read or copy-on-write and it will behave the same (except for
>> performance), so it's not an inherent feature of the image file.
>>    
> 
> The way it's implemented in QED is that it's a compatible feature.  This 
> means that implementations are allowed to ignore it if they want to.  
> It's really a suggestion.

Well, the point is that I see no reason why an image should contain this
suggestion. There's really nothing about an image that could reasonably
indicate "use this better with copy-on-read than with copy-on-write".

It's a decision you make when using the image.

> So yes, you could have a run time switch that overrides the feature bit 
> on disk and either forces copy-on-read on or off.
> 
> Do we have a way to pass block drivers run time options?

We'll get them with -blockdev. Today we're using colons for
format-specific options and separate -drive options for generic things.

>> Doing it this way has the additional advantage that you need no image
>> format support for this, so we could implement copy-on-read for other
>> formats, too.
>>    
> 
> To do it efficiently, it really needs to be in the format for the same 
> reason that copy-on-write is part of the format.

Copy-on-write is not part of the format, it's a way of using this
format. Backing files are part of the format, and they are used for both
copy-on-write and copy-on-read. Any driver implementing a format that
has support for backing files should be able to implement copy-on-read.

> You need to understand the cluster boundaries in order to optimize the 
> metadata updates.  Sure, you can expose interfaces to the block layer to 
> give all of this info but that's solving the same problem for doing 
> block level copy-on-write.
> 
> The other challenge is that for copy-on-read to be efficient, you 
> really need a format that can distinguish between unallocated sectors 
> and zero sectors and do zero detection during the copy-on-read 
> operation.  Otherwise, if you have a 10G virtual disk with a backing 
> file that's 1GB in size, copy-on-read will result in the leaf being 10G 
> instead of ~1GB.

That's a good point. But it's not a reason to make the interface
specific to QED just because other formats would probably not implement
it as efficiently.

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 13:41 [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration Anthony Liguori
                   ` (2 preceding siblings ...)
  2010-09-07 14:34 ` Kevin Wolf
@ 2010-09-07 15:03 ` Daniel P. Berrange
  2010-09-07 15:16   ` Anthony Liguori
  2010-09-12 10:55 ` [Qemu-devel] " Avi Kivity
  4 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrange @ 2010-09-07 15:03 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On Tue, Sep 07, 2010 at 08:41:44AM -0500, Anthony Liguori wrote:
> Hi,
> 
> We've got copy-on-read and image streaming working in QED and before 
> going much further, I wanted to bounce some interfaces off of the 
> libvirt folks to make sure our final interface makes sense.
> 
> Here's the basic idea:

[snip]

> A related topic is block migration.  Today we support pre-copy migration 
> which means we transfer the block device and then do a live migration.  
> Another approach is to do a live migration, and on the source, run a 
> block server using image streaming on the destination to move the device.
> 
> With QED, to implement this one would:
> 
> 1) launch qemu-nbd on the source while the guest is running
> 2) create a qed file on the destination with copy-on-read enabled and a 
> backing file using nbd: to point to the source qemu-nbd
> 3) run qemu -incoming on the destination with the qed file
> 4) execute the migration
> 5) when migration completes, begin streaming on the destination to 
> complete the copy
> 6) when the streaming is complete, shut down the qemu-nbd instance on 
> the source

IMHO, adding further network sockets is the one thing we absolutely
don't want to do to migration. I don't much like the idea of launching
extra daemons either.

> This is a bit involved and we could potentially automate some of this in 
> qemu by launching qemu-nbd and providing commands to do some of this.  
> Again though, I think the question is what type of interfaces would 
> libvirt prefer?  Low level interfaces + recipes on how to do high level 
> things or higher level interfaces?

I think it should be done entirely within the main QEMU migration
socket. I know this isn't possible with the current impl, since it
is unidirectional, preventing the target from sending the source requests
for specific data blocks. If we made the migration socket bi-directional
I think we could do it all within qemu with no external helpers
or extra sockets:

 1. Create an empty qed file on the destination with copy on read
    enabled and a backing file pointing to a special 'migrate:' protocol
 2. Run qemu -incoming on the destination with the qed file
 3. Execute the migration
 4. When migration completes, the target QEMU continues streaming blocks
    from the source qemu.
 5. When streaming is complete, the source qemu can shut down.


Both your original proposal and mine here seem to have a pretty
bad failure scenario though. After the cut-over point where the
VM cpus start running on the destination QEMU, AFAICT, any failure
on the source before block streaming completes leaves you dead in
the water.  The source VM no longer has up-to-date RAM contents and
the destination VM does not yet have a complete disk image.

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:57     ` Anthony Liguori
@ 2010-09-07 15:05       ` Stefan Hajnoczi
  2010-09-07 15:23         ` Anthony Liguori
  2010-09-12 12:41       ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2010-09-07 15:05 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, libvir-list, qemu-devel, Stefan Hajnoczi

On Tue, Sep 7, 2010 at 3:57 PM, Anthony Liguori
<aliguori@linux.vnet.ibm.com> wrote:
> On 09/07/2010 09:49 AM, Stefan Hajnoczi wrote:
>>
>> On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>
>>>
>>> On 07.09.2010 15:41, Anthony Liguori wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> We've got copy-on-read and image streaming working in QED and before
>>>> going much further, I wanted to bounce some interfaces off of the
>>>> libvirt folks to make sure our final interface makes sense.
>>>>
>>>> Here's the basic idea:
>>>>
>>>> Today, you can create images based on base images that are copy on
>>>> write.  With QED, we also support copy on read which forces a copy from
>>>> the backing image on read requests and write requests.
>>>>
>>>> In addition to copy on read, we introduce a notion of streaming a
>>>> block device which means that we search for an unallocated region of the
>>>> leaf image and force a copy-on-read operation.
>>>>
>>>> The combination of copy-on-read and streaming means that you can start a
>>>> guest based on slow storage (like over the network) and bring in blocks
>>>> on demand while also having a deterministic mechanism to complete the
>>>> transfer.
>>>>
>>>> The interface for copy-on-read is just an option within qemu-img
>>>> create.
>>>>
>>>
>>> Shouldn't it be a runtime option? You can use the very same image with
>>> copy-on-read or copy-on-write and it will behave the same (except for
>>> performance), so it's not an inherent feature of the image file.
>>>
>>> Doing it this way has the additional advantage that you need no image
>>> format support for this, so we could implement copy-on-read for other
>>> formats, too.
>>>
>>
>> I agree that streaming should be generic, like block migration.  The
>> trivial generic implementation is:
>>
>> void bdrv_stream(BlockDriverState* bs)
>> {
>>     for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
>>         if (!bdrv_is_allocated(bs, sector, &n)) {
>>
>
> Three problems here.  First problem is that bdrv_is_allocated is
> synchronous.  The second problem is that streaming makes the most sense when
> it's the smallest useful piece of work whereas bdrv_is_allocated() may
> return a very large range.
>
> You could cap it here but you then need to make sure that cap is at least
> cluster_size to avoid a lot of unnecessary I/O.
>
> The QED streaming implementation is 140 LOCs too so you quickly end up
> adding more code to the block formats to support these new interfaces than
> it takes to just implement it in the block format.
>
> Third problem is that  streaming really requires being able to do zero write
> detection in a meaningful way.  You don't want to always do zero write
> detection so you need another interface to mark a specific write as a write
> that should be checked for zeros.

Good points.  I agree that it is easiest to write features into the
block driver, but there is a significant amount of code duplication,
plus the barrier for enabling other block drivers with these features
is increased.  These points (except the lines of code argument) can be
addressed with the proper extensions to the block driver interface.

Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:00       ` Anthony Liguori
@ 2010-09-07 15:09         ` Stefan Hajnoczi
  2010-09-07 15:20           ` Anthony Liguori
  2010-09-08  8:26           ` Kevin Wolf
  0 siblings, 2 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2010-09-07 15:09 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On Tue, Sep 7, 2010 at 4:00 PM, Anthony Liguori
<aliguori@linux.vnet.ibm.com> wrote:
> On 09/07/2010 09:55 AM, Stefan Hajnoczi wrote:
>>
>> On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori
>> <aliguori@linux.vnet.ibm.com>  wrote:
>>
>>>
>>> On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
>>>
>>>>
>>>> On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori
>>>> <aliguori@linux.vnet.ibm.com>    wrote:
>>>>
>>>>
>>>>>
>>>>> The interface for copy-on-read is just an option within qemu-img
>>>>> create.
>>>>>  Streaming, on the other hand, requires a bit more thought.  Today, I
>>>>> have a
>>>>> monitor command that does the following:
>>>>>
>>>>> stream <device> <sector offset>
>>>>>
>>>>> Which will try to stream the minimal amount of data for a single I/O
>>>>> operation and then return how many sectors were successfully streamed.
>>>>>
>>>>> The idea about how to drive this interface is a loop like:
>>>>>
>>>>> offset = 0;
>>>>> while offset < image_size:
>>>>>   wait_for_idle_time()
>>>>>   count = stream(device, offset)
>>>>>   offset += count
>>>>>
>>>>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>>>>>  The
>>>>> thing I'm not sure about is 1) would libvirt want to expose a similar
>>>>> stream
>>>>> interface and let management software determine idle time 2) attempt to
>>>>> detect idle time on its own and provide a higher level interface.  If
>>>>> (2),
>>>>> the question then becomes whether we should try to do this within qemu
>>>>> and
>>>>> provide libvirt a higher level interface.
>>>>>
>>>>>
>>>>
>>>> A self-tuning solution is attractive because it reduces the need for
>>>> other components (management stack) or the user to get involved.  In
>>>> this case self-tuning should be possible.  We need to detect periods
>>>> of I/O inactivity, for example tracking the number of in-flight
>>>> requests and then setting a grace timer when it reaches zero.  When
>>>> the grace timer expires, we start streaming until the guest initiates
>>>> I/O again.
>>>>
>>>>
>>>
>>> That detects idle I/O within a single QEMU guest, but you might have
>>> another
>>> guest running that's I/O bound which means that from an overall system
>>> throughput perspective, you really don't want to stream.
>>>
>>> I think libvirt might be able to do a better job here by looking at
>>> overall
>>> system I/O usage.  But I'm not sure hence this RFC :-)
>>>
>>
>> Isn't this what block I/O controller cgroups is meant to solve?  If
>> you give vm-1 50% block bandwidth and vm-2 50% block bandwidth then
>> vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
>>
>
> That assumes you're capping I/O.  But sometimes you care about overall
> system throughput more than you care about any individual VM.
>
> Another way to look at it may be, a user waits for a cron job that runs at
> midnight and starts streaming as necessary.  However, the user wants to be
> able to interrupt the streaming should there be a sudden demand.
>
> If the user drives the streaming through an interface like I've specified,
> they're in full control.  It's pretty simple to build interfaces on top of
> this that implement streaming as an aggressive or conservative background task
> too.
>
>>  Also, I'm not sure we should worry about the priority of the I/O too
>> much: perhaps the user wants their vm to stream more than they want an
>> unimportant local vm that is currently I/O bound to have all resources
>> to itself.  So I think it makes sense to defer this and not try for
>> system-wide knowledge inside a QEMU process.
>>
>
> Right, so that argues for an incremental interface like I started with :-)
>
> BTW, this whole discussion is also relevant for other background tasks like
> online defragmentation so keep that use-case in mind too.

Right, I'm a little hesitant to get too far into discussing the
management interface because I remember long threads about polling and
async.  I never fully read them but I bet some wisdom came out of them
that applies here.

There are two ways to do a long running (async?) task:
1. Multiple smaller pokes.  Perhaps completion of a single poke is
async.  But the key is that the interface is incremental and driven by
the management stack.
2. State.  Turn on streaming and watch it go.  You can find out its
current state using another command, which reports whether streaming is
enabled or disabled and how far it has progressed.  Use a command to
disable it.
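
On the wire, option 2 might look something like this (all command and
field names here are hypothetical):

{ "execute": "block_stream_start", "arguments": { "device": "virtio0" } }
{ "return": {} }
{ "execute": "query-block-stream", "arguments": { "device": "virtio0" } }
{ "return": { "enabled": true, "offset": 1073741824, "len": 10737418240 } }
{ "execute": "block_stream_stop", "arguments": { "device": "virtio0" } }
{ "return": {} }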

Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:02     ` Kevin Wolf
@ 2010-09-07 15:11       ` Anthony Liguori
  2010-09-07 15:20         ` Kevin Wolf
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 15:11 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 10:02 AM, Kevin Wolf wrote:
> On 07.09.2010 16:49, Anthony Liguori wrote:
>    
>>> Shouldn't it be a runtime option? You can use the very same image with
>>> copy-on-read or copy-on-write and it will behave the same (execpt for
>>> performance), so it's not an inherent feature of the image file.
>>>
>>>        
>> The way it's implemented in QED is that it's a compatible feature.  This
>> means that implementations are allowed to ignore it if they want to.
>> It's really a suggestion.
>>      
> Well, the point is that I see no reason why an image should contain this
> suggestion. There's really nothing about an image that could reasonably
> indicate "use this better with copy-on-read than with copy-on-write".
>
> It's a decision you make when using the image.
>    

Copy-on-read is, in many cases, a property of the backing file because 
it suggests that the backing file is either very slow or potentially 
volatile.

IOW, let's say I'm an image distributor and I want to provide my images 
in a QED format that actually streams the image from an http server.  I 
could provide a QED file without a copy-on-read bit set but I'd really 
like to convey this information as part of the image.

You can argue that I should provide a config file that contains the 
copy-on-read flag, but you could make the same argument about backing 
files too.

>> So yes, you could have a run time switch that overrides the feature bit
>> on disk and either forces copy-on-read on or off.
>>
>> Do we have a way to pass block drivers run time options?
>>      
> We'll get them with -blockdev. Today we're using colons for format
> specific and separate -drive options for generic things.
>    

That's right.  I think I'd rather wait for -blockdev.

>> You need to understand the cluster boundaries in order to optimize the
>> metadata updates.  Sure, you can expose interfaces to the block layer to
>> give all of this info but that's solving the same problem for doing
>> block level copy-on-write.
>>
>> The other challenge is that for copy-on-read to be efficient, you
>> really need a format that can distinguish between unallocated sectors
>> and zero sectors and do zero detection during the copy-on-read
>> operation.  Otherwise, if you have a 10G virtual disk with a backing
>> file that's 1GB in size, copy-on-read will result in the leaf being 10G
>> instead of ~1GB.
>>      
> That's a good point. But it's not a reason to make the interface
> specific to QED just because other formats would probably not implement
> it as efficiently.
>    

You really can't do as good of a job in the block layer because you have 
very little info about the characteristics of the disk image.

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:03 ` [Qemu-devel] " Daniel P. Berrange
@ 2010-09-07 15:16   ` Anthony Liguori
  0 siblings, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 15:16 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 10:03 AM, Daniel P. Berrange wrote:
> On Tue, Sep 07, 2010 at 08:41:44AM -0500, Anthony Liguori wrote:
>    
>> Hi,
>>
>> We've got copy-on-read and image streaming working in QED and before
>> going much further, I wanted to bounce some interfaces off of the
>> libvirt folks to make sure our final interface makes sense.
>>
>> Here's the basic idea:
>>      
> [snip]
>
>    
>> A related topic is block migration.  Today we support pre-copy migration
>> which means we transfer the block device and then do a live migration.
>> Another approach is to do a live migration, and on the source, run a
>> block server using image streaming on the destination to move the device.
>>
>> With QED, to implement this one would:
>>
>> 1) launch qemu-nbd on the source while the guest is running
>> 2) create a qed file on the destination with copy-on-read enabled and a
>> backing file using nbd: to point to the source qemu-nbd
>> 3) run qemu -incoming on the destination with the qed file
>> 4) execute the migration
>> 5) when migration completes, begin streaming on the destination to
>> complete the copy
>> 6) when the streaming is complete, shut down the qemu-nbd instance on
>> the source
>>      
> IMHO, adding further network sockets is the one thing we absolutely
> don't want to do to migration. I don't much like the idea of launching
> extra daemons either.
>    

One of the use cases I'm trying to accommodate is migration to free 
resources.  By launching a qemu-nbd daemon, we can kill the source qemu 
process and free up all of the associated memory.
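
As a concrete sketch, steps 1 and 2 of the recipe might look like this
(the copy_on_read creation option is an assumption since we haven't
settled on a name; host names, ports and file names are made up):

    import subprocess

    # Step 1, on the source: export the image over NBD while the guest
    # is still running.
    subprocess.Popen(["qemu-nbd", "-p", "10809", "guest.qed"])

    # Step 2, on the destination: create a copy-on-read QED file whose
    # backing file is the source's NBD export.
    subprocess.check_call([
        "qemu-img", "create", "-f", "qed",
        # the copy_on_read option name is a guess at this point
        "-o", "backing_file=nbd:source-host:10809,copy_on_read=on",
        "guest.qed",
    ])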

>> This is a bit involved and we could potentially automate some of this in
>> qemu by launching qemu-nbd and providing commands to do some of this.
>> Again though, I think the question is what type of interfaces would
>> libvirt prefer?  Low level interfaces + recipes on how to do high level
>> things or higher level interfaces?
>>      
> I think it should be done entirely within the main QEMU migration
> socket. I know this isn't possible with the current impl, since it
> is unidirectional, preventing the target sending the source requests
> for specific data blocks. If we made the migration socket bi-directional
> I think we could do it all within qemu with no external helpers
> or extra sockets.
>
>   1. Create an empty qed file on the destination with copy-on-read
>      enabled and a backing file pointing to a special 'migrate:' protocol
>    

Why not just point migration and nbd to a unix domain socket and then 
multiplex the two protocols at a higher level?

>   2. Run qemu -incoming on the destination with the qed file
>   3. execute the migration
>   4. when migration completes, target QEMU continues streaming blocks
>      from the source qemu.
>   5. when streaming is complete, source qemu can shutdown.
>
>
> Both your original proposal and mine here seem to have a pretty
> bad failure scenario though. After the cut-over point where the
> VM cpus start running on the destination QEMU, AFAICT, any failure
> on the source before block streaming complete leaves you dead in
> the water.  The source VM no longer has up-to-date RAM contents and
> the destination VM does not yet have a complete disk image.
>    

Yes.  It's a trade off.  However, pre-copy doesn't really change your 
likelihood of catastrophic failure because if you were going to fail on 
the source, it was going to happen before you completed the block 
transfer anyway.

The advantage of post-copy is that you immediately free resources on the 
source, so it's tremendously useful as a reaction to pressure from 
overcommit.

I still think pre-copy has its place though.

Regards,

Anthony Liguori


> Regards,
> Daniel
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:09         ` Stefan Hajnoczi
@ 2010-09-07 15:20           ` Anthony Liguori
  2010-09-08  8:26           ` Kevin Wolf
  1 sibling, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 15:20 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 10:09 AM, Stefan Hajnoczi wrote:
>> Right, so that argues for an incremental interface like I started with :-)
>>
>> BTW, this whole discussion is also relevant for other background tasks like
>> online defragmentation so keep that use-case in mind too.
>>      
> Right, I'm a little hesitant to get too far into discussing the
> management interface because I remember long threads about polling and
> async.  I never fully read them but I bet some wisdom came out of them
> that applies here.
>
> There are two ways to do a long running (async?) task:
> 1. Multiple smaller pokes.  Perhaps completion of a single poke is
> async.  But the key is that the interface is incremental and driven by
> the management stack.
> 2. State.  Turn on streaming and watch it go.  You can find out its
> current state using another command which will tell you whether it is
> enabled/disabled and progress.  Use a command to disable it.
>    

If everyone is going to do (1) by just doing a tight loop or just using 
the same simple mechanism (a sleep(5)), then I agree, we should do (2).

I can envision people wanting to do very complex decisions about the 
right time to do the next poke though and I'm looking for feedback about 
what other people think.

I expected people to do complex heuristics with respect to migration 
convergence but in reality, I don't think anyone does today.  So while I 
generally like being flexible, I realize that too much flexibility isn't 
always a good thing :-)

Regards,

Anthony Liguori

> Stefan
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:11       ` Anthony Liguori
@ 2010-09-07 15:20         ` Kevin Wolf
  2010-09-07 15:30           ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2010-09-07 15:20 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

Am 07.09.2010 17:11, schrieb Anthony Liguori:
> On 09/07/2010 10:02 AM, Kevin Wolf wrote:
>> Am 07.09.2010 16:49, schrieb Anthony Liguori:
>>    
>>>> Shouldn't it be a runtime option? You can use the very same image with
>>>> copy-on-read or copy-on-write and it will behave the same (except for
>>>> performance), so it's not an inherent feature of the image file.
>>>>
>>>>        
>>> The way it's implemented in QED is that it's a compatible feature.  This
>>> means that implementations are allowed to ignore it if they want to.
>>> It's really a suggestion.
>>>      
>> Well, the point is that I see no reason why an image should contain this
>> suggestion. There's really nothing about an image that could reasonably
>> indicate "use this better with copy-on-read than with copy-on-write".
>>
>> It's a decision you make when using the image.
>>    
> 
> Copy-on-read is, in many cases, a property of the backing file because 
> it suggests that the backing file is either very slow or potentially 
> volatile.

The simple copy-on-read without actively streaming the rest of the image
is not enough anyway for volatile backing files.

> IOW, let's say I'm an image distributor and I want to provide my images 
> in a QED format that actually streams the image from an http server.  I 
> could provide a QED file without a copy-on-read bit set but I'd really 
> like to convey this information as part of the image.
> 
> You can argue that I should instead provide a config file that contains 
> the copy-on-read flag, but you could make the same argument about 
> backing files.

No. The image is perfectly readable when using COW instead of COR. On
the other hand, it's completely meaningless without its backing file.

>>> So yes, you could have a run time switch that overrides the feature bit
>>> on disk and either forces copy-on-read on or off.
>>>
>>> Do we have a way to pass block drivers run time options?
>>>      
>> We'll get them with -blockdev. Today we're using colons for format
>> specific and separate -drive options for generic things.
>>    
> 
> That's right.  I think I'd rather wait for -blockdev.

Well, then I consider -blockdev a dependency of QED (the copy-on-read
part at least) and we can't merge it before we have -blockdev.

>>> You need to understand the cluster boundaries in order to optimize the
>>> metadata updates.  Sure, you can expose interfaces to the block layer to
>>> give all of this info but that's solving the same problem for doing
>>> block level copy-on-write.
>>>
>>> The other challenge is that for copy-on-read to be efficient, you
>>> really need a format that can distinguish between unallocated sectors
>>> and zero sectors and do zero detection during the copy-on-read
>>> operation.  Otherwise, if you have a 10G virtual disk with a backing
>>> file that's 1GB in size, copy-on-read will result in the leaf being 10G
>>> instead of ~1GB.
>>>      
>> That's a good point. But it's not a reason to make the interface
>> specific to QED just because other formats would probably not implement
>> it as efficiently.
> 
> You really can't do as good of a job in the block layer because you have 
> very little info about the characteristics of the disk image.

I'm not saying that the generic block layer should implement
copy-on-read. I just think that it should pass a run-time option to the
driver - maybe just a BDRV_O_COPY_ON_READ flag - instead of having the
information in the image file. From a user perspective it should look
the same for qed, qcow2 and whatever else (like copy-on-write today).

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:05       ` Stefan Hajnoczi
@ 2010-09-07 15:23         ` Anthony Liguori
  0 siblings, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 15:23 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 10:05 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 7, 2010 at 3:57 PM, Anthony Liguori
> <aliguori@linux.vnet.ibm.com> wrote:
>    
>> On 09/07/2010 09:49 AM, Stefan Hajnoczi wrote:
>>      
>>> On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>
>>>        
>>>> Am 07.09.2010 15:41, schrieb Anthony Liguori:
>>>>
>>>>          
>>>>> Hi,
>>>>>
>>>>> We've got copy-on-read and image streaming working in QED and before
>>>>> going much further, I wanted to bounce some interfaces off of the
>>>>> libvirt folks to make sure our final interface makes sense.
>>>>>
>>>>> Here's the basic idea:
>>>>>
>>>>> Today, you can create images based on base images that are copy on
>>>>> write.  With QED, we also support copy on read which forces a copy from
>>>>> the backing image on read requests and write requests.
>>>>>
>>>>> In additional to copy on read, we introduce a notion of streaming a
>>>>> block device which means that we search for an unallocated region of the
>>>>> leaf image and force a copy-on-read operation.
>>>>>
>>>>> The combination of copy-on-read and streaming means that you can start a
>>>>> guest based on slow storage (like over the network) and bring in blocks
>>>>> on demand while also having a deterministic mechanism to complete the
>>>>> transfer.
>>>>>
>>>>> The interface for copy-on-read is just an option within qemu-img
>>>>> create.
>>>>>
>>>>>            
>>>> Shouldn't it be a runtime option? You can use the very same image with
>>>> copy-on-read or copy-on-write and it will behave the same (except for
>>>> performance), so it's not an inherent feature of the image file.
>>>>
>>>> Doing it this way has the additional advantage that you need no image
>>>> format support for this, so we could implement copy-on-read for other
>>>> formats, too.
>>>>
>>>>          
>>> I agree that streaming should be generic, like block migration.  The
>>> trivial generic implementation is:
>>>
>>> void bdrv_stream(BlockDriverState* bs)
>>> {
>>>      for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
>>>          if (!bdrv_is_allocated(bs, sector, &n)) {
>>>
>>>        
>> Three problems here.  First problem is that bdrv_is_allocated is
>> synchronous.  The second problem is that streaming makes the most sense when
>> it's the smallest useful piece of work whereas bdrv_is_allocated() may
>> return a very large range.
>>
>> You could cap it here but you then need to make sure that cap is at least
>> cluster_size to avoid a lot of unnecessary I/O.
>>
>> The QED streaming implementation is 140 LOCs too so you quickly end up
>> adding more code to the block formats to support these new interfaces than
>> it takes to just implement it in the block format.
>>
>> Third problem is that  streaming really requires being able to do zero write
>> detection in a meaningful way.  You don't want to always do zero write
>> detection so you need another interface to mark a specific write as a write
>> that should be checked for zeros.
>>      
> Good points.  I agree that it is easiest to write features into the
> block driver, but there is a significant amount of code duplication,
>    

There's two ways to attack code duplication.  The first is to move the 
feature into block.c and add interfaces to the block drivers to support 
it.  The second is to keep it in qed.c but to abstract out things that 
could really be common to multiple drivers (like the find_cluster 
functionality and some of the request handling functionality).

I prefer the latter approach because it keeps a high quality 
implementation of copy-on-read whereas the former is almost certainly 
going to dumb down the implementation.
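
As a sketch of what that could look like (all names here are
hypothetical), the generic layer would offer helpers that a format opts
into, rather than block.c owning the loop:

    class StreamHelpers:
        # Hypothetical library-style helper a block driver could reuse.
        def __init__(self, bdrv, cluster_size):
            self.bdrv = bdrv
            self.cluster_size = cluster_size

        def find_unallocated(self, offset, max_len):
            # Scan for the next unallocated region, capping the length
            # but never returning less than a cluster, to avoid
            # splitting a copy-on-read into sub-cluster I/O.
            while offset < self.bdrv.length():
                allocated, n = self.bdrv.is_allocated(offset, max_len)
                if not allocated:
                    return offset, max(self.cluster_size, min(n, max_len))
                offset += n
            return None, 0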

> plus the barrier for enabling other block drivers with these features
> is increased.  These points (except the lines of code argument) can be
> addressed with the proper extensions to the block driver interface.
>    

Regards,

Anthony Liguori


> Stefan
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:20         ` Kevin Wolf
@ 2010-09-07 15:30           ` Anthony Liguori
  2010-09-07 15:39             ` Kevin Wolf
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 15:30 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 10:20 AM, Kevin Wolf wrote:
> Am 07.09.2010 17:11, schrieb Anthony Liguori:
>    
>> On 09/07/2010 10:02 AM, Kevin Wolf wrote:
>>      
>>> Am 07.09.2010 16:49, schrieb Anthony Liguori:
>>>
>>>        
>>>>> Shouldn't it be a runtime option? You can use the very same image with
>>>>> copy-on-read or copy-on-write and it will behave the same (except for
>>>>> performance), so it's not an inherent feature of the image file.
>>>>>
>>>>>
>>>>>            
>>>> The way it's implemented in QED is that it's a compatible feature.  This
>>>> means that implementations are allowed to ignore it if they want to.
>>>> It's really a suggestion.
>>>>
>>>>          
>>> Well, the point is that I see no reason why an image should contain this
>>> suggestion. There's really nothing about an image that could reasonably
>>> indicate "use this better with copy-on-read than with copy-on-write".
>>>
>>> It's a decision you make when using the image.
>>>
>>>        
>> Copy-on-read is, in many cases, a property of the backing file because
>> it suggests that the backing file is either very slow or potentially
>> volatile.
>>      
> The simple copy-on-read without actively streaming the rest of the image
> is not enough anyway for volatile backing files.
>    

But as a web site owner, it's extremely useful for me to associate 
copy-on-read with an image because it significantly reduces my bandwidth.

I have a hard time believing this isn't a valuable use-case and not one 
that's actually pretty common.

>> IOW, let's say I'm an image distributor and I want to provide my images
>> in a QED format that actually streams the image from an http server.  I
>> could provide a QED file without a copy-on-read bit set but I'd really
>> like to convey this information as part of the image.
>>
>> You can argue that I should instead provide a config file that contains
>> the copy-on-read flag, but you could make the same argument about
>> backing files.
>>      
> No. The image is perfectly readable when using COW instead of COR. On
> the other hand, it's completely meaningless without its backing file.
>    

N.B. the whole concept of compat features in QED is that if the features 
are ignored, the image is still perfectly readable.  It's extra 
information that lets an implementation do smarter things with a given 
image.

>>>> So yes, you could have a run time switch that overrides the feature bit
>>>> on disk and either forces copy-on-read on or off.
>>>>
>>>> Do we have a way to pass block drivers run time options?
>>>>
>>>>          
>>> We'll get them with -blockdev. Today we're using colons for format
>>> specific and separate -drive options for generic things.
>>>
>>>        
>> That's right.  I think I'd rather wait for -blockdev.
>>      
> Well, then I consider -blockdev a dependency of QED (the copy-on-read
> part at least) and we can't merge it before we have -blockdev.
>    

If we determine that having copy-on-read be a part of the image is 
universally a bad idea, then I'd agree with you.  Keep in mind, I don't 
expect to merge the cor or streaming stuff with the first merge of QED.

I'm still not convinced that having cor as a compat feature is a bad 
idea though.

>> You really can't do as good of a job in the block layer because you have
>> very little info about the characteristics of the disk image.
>>      
> I'm not saying that the generic block layer should implement
> copy-on-read. I just think that it should pass a run-time option to the
> driver - maybe just a BDRV_O_COPY_ON_READ flag - instead of having the
> information in the image file. From a user perspective it should look
> the same for qed, qcow2 and whatever else (like copy-on-write today).
>    

Okay, the only place I'm disagreeing slightly is that I think an image 
format should be able to request copy_on_read such that the default 
behavior if an explicit flag isn't specified is to do what the image 
suggests we do.

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:30           ` Anthony Liguori
@ 2010-09-07 15:39             ` Kevin Wolf
  2010-09-07 16:00               ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2010-09-07 15:39 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

Am 07.09.2010 17:30, schrieb Anthony Liguori:
> On 09/07/2010 10:20 AM, Kevin Wolf wrote:
>> Am 07.09.2010 17:11, schrieb Anthony Liguori:
>>> Copy-on-read is, in many cases, a property of the backing file because
>>> it suggests that the backing file is either very slow or potentially
>>> volatile.
>>>      
>> The simple copy-on-read without actively streaming the rest of the image
>> is not enough anyway for volatile backing files.
>>    
> 
> But as a web site owner, it's extremely useful for me to associate 
> copy-on-read with an image because it significantly reduces my bandwidth.
> 
> I have a hard time believing this isn't a valuable use-case and not one 
> that's actually pretty common.

As a web site user, I don't necessarily want you to control the
behaviour of my qemu. :-)

But I do see your point there.

>>> You really can't do as good of a job in the block layer because you have
>>> very little info about the characteristics of the disk image.
>>>      
>> I'm not saying that the generic block layer should implement
>> copy-on-read. I just think that it should pass a run-time option to the
>> driver - maybe just a BDRV_O_COPY_ON_READ flag - instead of having the
>> information in the image file. From a user perspective it should look
>> the same for qed, qcow2 and whatever else (like copy-on-write today).
>>    
> 
> Okay, the only place I'm disagreeing slightly is that I think an image 
> format should be able to request copy_on_read such that the default 
> behavior if an explicit flag isn't specified is to do what the image 
> suggests we do.

Maybe we can agree on that. I'm not completely decided yet if allowing
the image to contain such a hint is a good or a bad thing.

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:39             ` Kevin Wolf
@ 2010-09-07 16:00               ` Anthony Liguori
  0 siblings, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2010-09-07 16:00 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

On 09/07/2010 10:39 AM, Kevin Wolf wrote:
> Am 07.09.2010 17:30, schrieb Anthony Liguori:
>    
>> On 09/07/2010 10:20 AM, Kevin Wolf wrote:
>>      
>>> Am 07.09.2010 17:11, schrieb Anthony Liguori:
>>>        
>>>> Copy-on-read is, in many cases, a property of the backing file because
>>>> it suggests that the backing file is either very slow or potentially
>>>> volatile.
>>>>
>>>>          
>>> The simple copy-on-read without actively streaming the rest of the image
>>> is not enough anyway for volatile backing files.
>>>
>>>        
>> But as a web site owner, it's extremely useful for me to associate
>> copy-on-read with an image because it significantly reduces my bandwidth.
>>
>> I have a hard time believing this isn't a valuable use-case and not one
>> that's actually pretty common.
>>      
> As a web site user, I don't necessarily want you to control the
> behaviour of my qemu. :-)
>    

That's why I understand your argument about -blockdev and making sure 
all compat features can be overridden.  I'm happy with that as a 
requirement.

>> Okay, the only place I'm disagreeing slightly is that I think an image
>> format should be able to request copy_on_read such that the default
>> behavior if an explicit flag isn't specified is to do what the image
>> suggests we do.
>>      
> Maybe we can agree on that. I'm not completely decided yet if allowing
> the image to contain such a hint is a good or a bad thing.
>    

It's a tough space.  We don't want to include crazy amounts of metadata 
(and basically become OVF) but there's metadata that we would like to have.

backing_format is a good example.  It's a suggestion and it's something 
you really want to let a user override.

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 15:09         ` Stefan Hajnoczi
  2010-09-07 15:20           ` Anthony Liguori
@ 2010-09-08  8:26           ` Kevin Wolf
  1 sibling, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2010-09-08  8:26 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: libvir-list, Anthony Liguori, qemu-devel, Stefan Hajnoczi

Am 07.09.2010 17:09, schrieb Stefan Hajnoczi:
> Right, I'm a little hesitant to get too far into discussing the
> management interface because I remember long threads about polling and
> async.  I never fully read them but I bet some wisdom came out of them
> that applies here.
> 
> There are two ways to do a long running (async?) task:
> 1. Multiple smaller pokes.  Perhaps completion of a single poke is
> async.  But the key is that the interface is incremental and driven by
> the management stack.
> 2. State.  Turn on streaming and watch it go.  You can find out its
> current state using another command which will tell you whether it is
> enabled/disabled and progress.  Use a command to disable it.

I think we need option 2 in any case for users not using libvirt. I for
one wouldn't really love to type in monitor commands every few seconds
to get the streaming done. ;-)

Let's start with this. We can always add option 1 for more sophisticated
cases later if it's desired by users.

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 13:41 [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration Anthony Liguori
                   ` (3 preceding siblings ...)
  2010-09-07 15:03 ` [Qemu-devel] " Daniel P. Berrange
@ 2010-09-12 10:55 ` Avi Kivity
  4 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2010-09-12 10:55 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: libvir-list, qemu-devel, Stefan Hajnoczi

  On 09/07/2010 04:41 PM, Anthony Liguori wrote:
> Hi,
>
> We've got copy-on-read and image streaming working in QED and before 
> going much further, I wanted to bounce some interfaces off of the 
> libvirt folks to make sure our final interface makes sense.
>
> Here's the basic idea:
>
> Today, you can create images based on base images that are copy on 
> write.  With QED, we also support copy on read which forces a copy 
> from the backing image on read requests and write requests.

Is copy on read QED specific?  It looks very similar to the commit 
command, except with I/O directions reversed.

IIRC, commit looks like

   for each sector:
     if image.mapped(sector):
         backing_image.write(sector, image.read(sector))

whereas copy-on-read looks like:

   def copy_on_read():
     set_ioprio(idle)
     for each sector:
       if not image.mapped(sector):
           image.write(sector, backing_image.read(sector))

   run_in_thread(copy_on_read)

With appropriate locking.

>
> In additional to copy on read, we introduce a notion of streaming a 
> block device which means that we search for an unallocated region of 
> the leaf image and force a copy-on-read operation.
>
> The combination of copy-on-read and streaming means that you can start 
> a guest based on slow storage (like over the network) and bring in 
> blocks on demand while also having a deterministic mechanism to 
> complete the transfer.
>
> The interface for copy-on-read is just an option within qemu-img 
> create.  Streaming, on the other hand, requires a bit more thought.  
> Today, I have a monitor command that does the following:
>
> stream <device> <sector offset>
>
> Which will try to stream the minimal amount of data for a single I/O 
> operation and then return how many sectors were successfully streamed.
>
> The idea about how to drive this interface is a loop like:
>
> offset = 0;
> while offset < image_size:
>    wait_for_idle_time()
>    count = stream(device, offset)
>    offset += count
>

This is way too low level for the management stack.

Have you considered using the idle class I/O priority to implement 
this?  That would allow host-wide prioritization.  Not sure how to do it 
cluster-wide; I don't think NFS has the concept of I/O priority.
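
For reference, a sketch of how a streaming thread could put itself in the
idle class on Linux.  There's no libc wrapper for ioprio_set, so this
calls the syscall directly; the syscall number below is x86-64 only:

    import ctypes

    IOPRIO_WHO_PROCESS = 1
    IOPRIO_CLASS_IDLE = 3
    IOPRIO_CLASS_SHIFT = 13
    NR_ioprio_set = 251                     # x86-64 syscall number

    def set_idle_ioprio():
        # Same effect as "ionice -c 3" on the calling thread.
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        ioprio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT
        if libc.syscall(NR_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio) != 0:
            raise OSError(ctypes.get_errno(), "ioprio_set failed")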


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-07 14:57     ` Anthony Liguori
  2010-09-07 15:05       ` Stefan Hajnoczi
@ 2010-09-12 12:41       ` Avi Kivity
  2010-09-12 13:25         ` Anthony Liguori
  1 sibling, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-09-12 12:41 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, libvir-list, Stefan Hajnoczi

  On 09/07/2010 05:57 PM, Anthony Liguori wrote:
>> I agree that streaming should be generic, like block migration.  The
>> trivial generic implementation is:
>>
>> void bdrv_stream(BlockDriverState* bs)
>> {
>>      for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
>>          if (!bdrv_is_allocated(bs, sector, &n)) {
>
> Three problems here.  First problem is that bdrv_is_allocated is 
> synchronous. 

Put the whole thing in a thread.

> The second problem is that streaming makes the most sense when it's 
> the smallest useful piece of work whereas bdrv_is_allocated() may 
> return a very large range.
>
> You could cap it here but you then need to make sure that cap is at 
> least cluster_size to avoid a lot of unnecessary I/O.

That seems like a nice solution.  You probably want a multiple of the 
cluster size to retain efficiency.

>
> The QED streaming implementation is 140 LOCs too so you quickly end up 
> adding more code to the block formats to support these new interfaces 
> than it takes to just implement it in the block format.

bdrv_is_allocated() already exists (and is needed for commit), what else 
is needed?  cluster size?

> Third problem is that  streaming really requires being able to do zero 
> write detection in a meaningful way.  You don't want to always do zero 
> write detection so you need another interface to mark a specific write 
> as a write that should be checked for zeros.

You can do that in bdrv_stream(), above, before the actual write, and 
call bdrv_unmap() if you detect zeros.
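
Roughly, as a synchronous sketch (bdrv_unmap() is the proposed interface,
not an existing one, and the bs.* methods stand in for the block layer
calls):

    def bdrv_stream(bs, max_len):
        sector = 0
        while sector < bs.num_sectors():
            allocated, n = bs.is_allocated(sector, max_len)
            if not allocated:
                buf = bs.backing.read(sector, n)
                if not any(buf):               # all-zero region
                    bs.unmap(sector, n)        # record "known zero"
                else:
                    bs.write(sector, buf)      # populate the leaf
            sector += n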

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-12 12:41       ` Avi Kivity
@ 2010-09-12 13:25         ` Anthony Liguori
  2010-09-12 13:40           ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-12 13:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, libvir-list, Stefan Hajnoczi

On 09/12/2010 07:41 AM, Avi Kivity wrote:
>  On 09/07/2010 05:57 PM, Anthony Liguori wrote:
>>> I agree that streaming should be generic, like block migration.  The
>>> trivial generic implementation is:
>>>
>>> void bdrv_stream(BlockDriverState* bs)
>>> {
>>>      for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
>>>          if (!bdrv_is_allocated(bs, sector, &n)) {
>>
>> Three problems here.  First problem is that bdrv_is_allocated is 
>> synchronous. 
>
> Put the whole thing in a thread.

It doesn't fix anything.  You don't want stream to serialize all I/O 
operations.

>> The second problem is that streaming makes the most sense when it's 
>> the smallest useful piece of work whereas bdrv_is_allocated() may 
>> return a very large range.
>>
>> You could cap it here but you then need to make sure that cap is at 
>> least cluster_size to avoid a lot of unnecessary I/O.
>
> That seems like a nice solution.  You probably want a multiple of the 
> cluster size to retain efficiency.

What you basically do is:

stream_step_three():
    complete()

stream_step_two(offset, length):
    bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
     bdrv_aio_find_free_cluster(stream_step_two)

And that's exactly what the current code looks like.  The only change 
this implies to the patch is making some of qed's internals into block 
layer interfaces.

One of the things Stefan has mentioned is that a lot of the QED code 
could be reused by other formats.  All formats implement things like CoW 
on their own today but if you exposed interfaces like 
bdrv_aio_find_free_cluster(), you could actually implement a lot more in 
the generic block layer.

So, I agree with you in principle that this all should be common code.  
I think it's a larger effort though.

>>
>> The QED streaming implementation is 140 LOCs too so you quickly end 
>> up adding more code to the block formats to support these new 
>> interfaces than it takes to just implement it in the block format.
>
> bdrv_is_allocated() already exists (and is needed for commit), what 
> else is needed?  cluster size?

Synchronous implementations are not reusable to implement asynchronous 
anything.  But you need the code to be cluster aware too.

>> Third problem is that  streaming really requires being able to do 
>> zero write detection in a meaningful way.  You don't want to always 
>> do zero write detection so you need another interface to mark a 
>> specific write as a write that should be checked for zeros.
>
> You can do that in bdrv_stream(), above, before the actual write, and 
> call bdrv_unmap() if you detect zeros.

My QED branch now does that FWIW.  At the moment, it only detects zero 
reads to unallocated clusters and writes a special zero cluster marker.  
However, the detection code is in the generic path so once the fsck() 
logic is working, we can implement a free list in QED.

In QED, the detection code needs to have a lot of knowledge about 
cluster boundaries and the format of the device.  In principle, this 
should be common code but it's not for the same reason copy-on-write is 
not common code today.
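
Roughly, the read path does something like this (a sketch; the marker
value here is illustrative, in QED it's a special L2 entry):

    ZERO = object()   # illustrative stand-in for the zero cluster marker

    def cor_read(l2, cluster, backing, image):
        # Only reads of unallocated clusters are candidates for zero
        # detection, so the common allocated-read path pays nothing.
        if l2[cluster] is None:                    # unallocated
            data = backing.read(cluster)
            if not any(data):
                l2[cluster] = ZERO                 # no data cluster written
            else:
                l2[cluster] = image.alloc_and_write(cluster, data)
            return data
        if l2[cluster] is ZERO:
            return bytes(image.cluster_size)
        return image.read(l2[cluster])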

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-12 13:25         ` Anthony Liguori
@ 2010-09-12 13:40           ` Avi Kivity
  2010-09-12 15:23             ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-09-12 13:40 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, libvir-list, Stefan Hajnoczi

  On 09/12/2010 03:25 PM, Anthony Liguori wrote:
> On 09/12/2010 07:41 AM, Avi Kivity wrote:
>>  On 09/07/2010 05:57 PM, Anthony Liguori wrote:
>>>> I agree that streaming should be generic, like block migration.  The
>>>> trivial generic implementation is:
>>>>
>>>> void bdrv_stream(BlockDriverState* bs)
>>>> {
>>>>      for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
>>>>          if (!bdrv_is_allocated(bs, sector, &n)) {
>>>
>>> Three problems here.  First problem is that bdrv_is_allocated is 
>>> synchronous. 
>>
>> Put the whole thing in a thread.
>
> It doesn't fix anything.  You don't want stream to serialize all I/O 
> operations.

Why would it serialize all I/O operations?  It's just like another vcpu 
issuing reads.

>
>>> The second problem is that streaming makes the most sense when it's 
>>> the smallest useful piece of work whereas bdrv_is_allocated() may 
>>> return a very large range.
>>>
>>> You could cap it here but you then need to make sure that cap is at 
>>> least cluster_size to avoid a lot of unnecessary I/O.
>>
>> That seems like a nice solution.  You probably want a multiple of the 
>> cluster size to retain efficiency.
>
> What you basically do is:
>
> stream_step_three():
>    complete()
>
> stream_step_two(offset, length):
>    bdrv_aio_readv(offset, length, buffer, stream_step_three)
>
> bdrv_aio_stream():
>     bdrv_aio_find_free_cluster(stream_step_two)

Isn't there a write() missing somewhere?

>
> And that's exactly what the current code looks like.  The only change 
> this implies to the patch is making some of qed's internals into block 
> layer interfaces.

Why do you need find_free_cluster()?  That's a physical offset thing.  
Just write to the same logical offset.

IOW:

     bdrv_aio_stream():
         bdrv_aio_read(offset, stream_2)

     stream_2():
         if all zeros:
             increment offset
             if more:
                 bdrv_aio_stream()
             return
         bdrv_aio_write(offset, stream_3)

     stream_3():
         bdrv_aio_write(offset, stream_4)

     stream_4():
         increment offset
         if more:
              bdrv_aio_stream()


Of course, need to serialize wrt guest writes, which adds a bit more 
complexity.  I'll leave it to you to code the state machine for that.

>
> One of the things Stefan has mentioned is that a lot of the QED code 
> could be reused by other formats.  All formats implement things like 
> CoW on their own today but if you exposed interfaces like 
> bdrv_aio_find_free_cluster(), you could actually implement a lot more 
> in the generic block layer.
>
> So, I agree with you in principle that this all should be common 
> code.  I think it's a larger effort though.

Not that large I think; and it will make commit async as a side effect.

>>>
>>> The QED streaming implementation is 140 LOCs too so you quickly end 
>>> up adding more code to the block formats to support these new 
>>> interfaces than it takes to just implement it in the block format.
>>
>> bdrv_is_allocated() already exists (and is needed for commit), what 
>> else is needed?  cluster size?
>
> Synchronous implementations are not reusable to implement asynchronous 
> anything. 

Surely this is easy to fix, at least for qed.

What we need is thread infrastructure that allows us to convert between 
the two methods.

> But you need the code to be cluster aware too.

Yes, another variable in BlockDriverState.

>
>>> Third problem is that  streaming really requires being able to do 
>>> zero write detection in a meaningful way.  You don't want to always 
>>> do zero write detection so you need another interface to mark a 
>>> specific write as a write that should be checked for zeros.
>>
>> You can do that in bdrv_stream(), above, before the actual write, and 
>> call bdrv_unmap() if you detect zeros.
>
> My QED branch now does that FWIW.  At the moment, it only detects zero 
> reads to unallocated clusters and writes a special zero cluster 
> marker.  However, the detection code is in the generic path so once 
> the fsck() logic is working, we can implement a free list in QED.
>
> In QED, the detection code needs to have a lot of knowledge about 
> cluster boundaries and the format of the device.  In principle, this 
> should be common code but it's not for the same reason copy-on-write 
> is not common code today.

Parts of it are: commit.  Of course, that's horribly synchronous.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-12 13:40           ` Avi Kivity
@ 2010-09-12 15:23             ` Anthony Liguori
  2010-09-12 16:45               ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-12 15:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, libvir-list, Stefan Hajnoczi

On 09/12/2010 08:40 AM, Avi Kivity wrote:
> Why would it serialize all I/O operations?  It's just like another 
> vcpu issuing reads.

Because the block layer isn't re-entrant.

>> What you basically do is:
>>
>> stream_step_three():
>>    complete()
>>
>> stream_step_two(offset, length):
>>    bdrv_aio_readv(offset, length, buffer, stream_step_three)
>>
>> bdrv_aio_stream():
>>     bdrv_aio_find_free_cluster(stream_step_two)
>
> Isn't there a write() missing somewhere?

Streaming relies on copy-on-read to do the writing.
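
That is, the stream operation is just a read through the COR-enabled
device whose buffer is thrown away; copy-on-read does the allocation and
the write as a side effect.  Schematically (find_unallocated() is an
assumed interface):

    def stream_one(bs, offset, done):
        # Read an unallocated range and discard the data; with
        # copy-on-read enabled, the read itself populates the leaf.
        off, n = bs.find_unallocated(offset)
        if off is None:
            done(0)                        # image fully populated
        else:
            bs.aio_read(off, n, lambda buf: done(n))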

>>
>> And that's exactly what the current code looks like.  The only change 
>> to the patch that this does is make some of qed's internals be block 
>> layer interfaces.
>
> Why do you need find_free_cluster()?  That's a physical offset thing.  
> Just write to the same logical offset.
>
> IOW:
>
>     bdrv_aio_stream():
>         bdrv_aio_read(offset, stream_2)

It's an optimization.  If you've got a fully missing L1 entry, then 
you're going to memset() 2GB worth of zeros.  That's just wasted work.  
With a 1TB image with a 1GB allocation, it's a huge amount of wasted work.

>     stream_2():
>         if all zeros:
>             increment offset
>             if more:
>                 bdrv_aio_stream()
>             return
>         bdrv_aio_write(offset, stream_3)
>
>     stream_3():
>         bdrv_aio_write(offset, stream_4)

I don't understand why stream_3() is needed.

>     stream_4():
>         increment offset
>         if more:
>              bdrv_aio_stream()
>
>
> Of course, need to serialize wrt guest writes, which adds a bit more 
> complexity.  I'll leave it to you to code the state machine for that.

http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d

>>
>>>> Third problem is that  streaming really requires being able to do 
>>>> zero write detection in a meaningful way.  You don't want to always 
>>>> do zero write detection so you need another interface to mark a 
>>>> specific write as a write that should be checked for zeros.
>>>
>>> You can do that in bdrv_stream(), above, before the actual write, 
>>> and call bdrv_unmap() if you detect zeros.
>>
>> My QED branch now does that FWIW.  At the moment, it only detects 
>> zero reads to unallocated clusters and writes a special zero cluster 
>> marker.  However, the detection code is in the generic path so once 
>> the fsck() logic is working, we can implement a free list in QED.
>>
>> In QED, the detection code needs to have a lot of knowledge about 
>> cluster boundaries and the format of the device.  In principle, this 
>> should be common code but it's not for the same reason copy-on-write 
>> is not common code today.
>
> Parts of it are: commit.  Of course, that's horribly synchronous.

If you've got AIO internally, making commit work is pretty easy.  Doing 
asynchronous commit at a generic layer is not easy though unless you 
expose lots of details.

Generally, I think the block layer makes more sense if the interface to 
the formats is high level and code sharing is achieved not by mandating 
a world view but rather by making libraries of common functionality.   
This is more akin to how the FS layer works in Linux.

So IMHO, we ought to add a bdrv_aio_commit function, turn the current 
code into a generic_aio_commit, implement a qed_aio_commit, then somehow 
do qcow2_aio_commit, and look at what we can refactor into common code.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-12 15:23             ` Anthony Liguori
@ 2010-09-12 16:45               ` Avi Kivity
  2010-09-12 17:19                 ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-09-12 16:45 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, libvir-list, Stefan Hajnoczi

  On 09/12/2010 05:23 PM, Anthony Liguori wrote:
> On 09/12/2010 08:40 AM, Avi Kivity wrote:
>> Why would it serialize all I/O operations?  It's just like another 
>> vcpu issuing reads.
>
> Because the block layer isn't re-entrant.

A threaded block layer is reentrant.  Of course pushing the thing into a 
thread requires that.

>
>>> What you basically do is:
>>>
>>> stream_step_three():
>>>    complete()
>>>
>>> stream_step_two(offset, length):
>>>    bdrv_aio_readv(offset, length, buffer, stream_step_three)
>>>
>>> bdrv_aio_stream():
>>>     bdrv_aio_find_free_cluster(stream_step_two)
>>
>> Isn't there a write() missing somewhere?
>
> Streaming relies on copy-on-read to do the writing.

Ah.  You can avoid the copy-on-read implementation in the block format 
driver and do it completely in generic code.

>
>>>
>>> And that's exactly what the current code looks like.  The only 
>>> change this implies to the patch is making some of qed's internals 
>>> into block layer interfaces.
>>
>> Why do you need find_free_cluster()?  That's a physical offset 
>> thing.  Just write to the same logical offset.
>>
>> IOW:
>>
>>     bdrv_aio_stream():
>>         bdrv_aio_read(offset, stream_2)
>
> It's an optimization.  If you've got a fully missing L1 entry, then 
> you're going to memset() 2GB worth of zeros.  That's just wasted 
> work.  With a 1TB image with a 1GB allocation, it's a huge amount of 
> wasted work.

Ok.  And it's a logical offset, not physical as I thought, which 
confused me.

>
>>     stream_2():
>>         if all zeros:
>>             increment offset
>>             if more:
>>                 bdrv_aio_stream()
>>             return
>>         bdrv_aio_write(offset, stream_3)
>>
>>     stream_3():
>>         bdrv_aio_write(offset, stream_4)
>
> I don't understand why stream_3() is needed.

This implementation doesn't rely on copy-on-read code in the block 
format driver.  It is generic and uses existing block layer interfaces.  
It would need copy-on-read support in the generic block layer as well.

>
>>     stream_4():
>>         increment offset
>>         if more:
>>              bdrv_aio_stream()
>>
>>
>> Of course, need to serialize wrt guest writes, which adds a bit more 
>> complexity.  I'll leave it to you to code the state machine for that.
>
> http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d 
>

Clever - it pushes all the synchronization into the copy-on-read 
implementation.  But the serialization there hardly jumps out of the code.

Do I understand correctly that you can only have one allocating read or 
write running?

>> Parts of it are: commit.  Of course, that's horribly synchronous.
>
> If you've got AIO internally, making commit work is pretty easy.  
> Doing asynchronous commit at a generic layer is not easy though unless 
> you expose lots of details.

I don't see why.  Commit is a simple loop that copies all clusters.  All 
it needs to know is if a cluster is allocated or not.

When commit is running you need additional serialization against guest 
writes, and to redirect guest reads and writes of the committed region to 
the backing file instead of the temporary image.  But the block layer 
already knows of all guest writes.

>
>> Generally, I think the block layer makes more sense if the interface 
>> to the formats is high level and code sharing is achieved not by 
>> mandating a world view but rather by making libraries of common 
>> functionality.   This is more akin to how the FS layer works in Linux.
>
> So IMHO, we ought to add a bdrv_aio_commit function, turn the current 
> code into a generic_aio_commit, implement a qed_aio_commit, then 
> somehow do qcow2_aio_commit, and look at what we can refactor into 
> common code.

What Linux does is have an equivalent of bdrv_generic_aio_commit() which 
most implementations call (or default to), and only do something if they 
want something special.  Something like commit (or copy-on-read, or 
copy-on-write, or streaming) can be implemented 100% in terms of the 
generic functions (and indeed qcow2 backing files can be any format).
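
A synchronous strawman of that, purely in terms of the generic entry
points (the async version is the same loop turned into a callback chain):

    def generic_commit(bs):
        # Copy every allocated range of the leaf into its backing file;
        # works for any format that implements is_allocated().
        sector = 0
        while sector < bs.num_sectors():
            allocated, n = bs.is_allocated(sector)
            if allocated:
                bs.backing.write(sector, bs.read(sector, n))
            sector += n
        bs.backing.flush()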

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-12 16:45               ` Avi Kivity
@ 2010-09-12 17:19                 ` Anthony Liguori
  2010-09-12 17:31                   ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2010-09-12 17:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, libvir-list, Stefan Hajnoczi

On 09/12/2010 11:45 AM, Avi Kivity wrote:
>> Streaming relies on copy-on-read to do the writing.
>
>
> Ah.  You can avoid the copy-on-read implementation in the block format 
> driver and do it completely in generic code.

Copy on read takes advantage of temporal locality.  You wouldn't want to 
stream without copy on read because you decrease your idle I/O time by 
not effectively caching.

>>>     stream_4():
>>>         increment offset
>>>         if more:
>>>              bdrv_aio_stream()
>>>
>>>
>>> Of course, need to serialize wrt guest writes, which adds a bit more 
>>> complexity.  I'll leave it to you to code the state machine for that.
>>
>> http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d 
>>
>
> Clever - it pushes all the synchronization into the copy-on-read 
> implementation.  But the serialization there hardly jumps out of the 
> code.
>
> Do I understand correctly that you can only have one allocating read 
> or write running?

Cluster allocation, L2 cache allocation, or on-disk L2 allocation?

You only have one on-disk L2 allocation at one time.  That's just an 
implementation detail at the moment.  An on-disk L2 allocation happens 
only when writing to a new cluster that requires a totally new L2 
table.  Since L2s cover 2GB of logical space, it's a rare event so this 
turns out to be pretty reasonable for a first implementation.

Parallel on-disk L2 allocation is not that difficult, it's just a 
future TODO.
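
The serialization itself can be as simple as queueing requests that need
a new on-disk L2 behind the one in flight.  A sketch (QED's actual
implementation differs in detail):

    class AllocLock:
        # At most one on-disk L2 allocation in flight; later allocating
        # requests queue up and are kicked when the current one completes.
        def __init__(self):
            self.busy = False
            self.waiters = []

        def acquire(self, start_request):
            if self.busy:
                self.waiters.append(start_request)
            else:
                self.busy = True
                start_request()

        def release(self):
            if self.waiters:
                self.waiters.pop(0)()      # hand off to the next request
            else:
                self.busy = False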

>>
>> Generally, I think the block layer makes more sense if the interface 
>> to the formats is high level and code sharing is achieved not by 
>> mandating a world view but rather by making libraries of common 
>> functionality.   This is more akin to how the FS layer works in Linux.
>>
>> So IMHO, we ought to add a bdrv_aio_commit function, turn the current 
>> code into a generic_aio_commit, implement a qed_aio_commit, then 
>> somehow do qcow2_aio_commit, and look at what we can refactor into 
>> common code.
>
> What Linux does is have an equivalent of bdrv_generic_aio_commit() 
> which most implementations call (or default to), and only do something 
> if they want something special.  Something like commit (or 
> copy-on-read, or copy-on-write, or streaming) can be implemented 100% in 
> terms of the generic functions (and indeed qcow2 backing files can be 
> any format).

Yes, what I'm really saying is that we should take the 
bdrv_generic_aio_commit() approach.  I think we're in agreement here.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
  2010-09-12 17:19                 ` Anthony Liguori
@ 2010-09-12 17:31                   ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2010-09-12 17:31 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, libvir-list, Stefan Hajnoczi

  On 09/12/2010 07:19 PM, Anthony Liguori wrote:
> On 09/12/2010 11:45 AM, Avi Kivity wrote:
>>> Streaming relies on copy-on-read to do the writing.
>>
>>
>> Ah.  You can avoid the copy-on-read implementation in the block 
>> format driver and do it completely in generic code.
>
> Copy on read takes advantage of temporal locality.  You wouldn't want 
> to stream without copy on read because you decrease your idle I/O time 
> by not effectively caching.

I meant, implement copy-on-read in generic code side by side with 
streaming.  Streaming becomes just a prefetch operation (read and 
discard) which lets copy-on-read do the rest.  This is essentially your 
implementation, yes?

>
>>>>     stream_4():
>>>>         increment offset
>>>>         if more:
>>>>              bdrv_aio_stream()
>>>>
>>>>
>>>> Of course, need to serialize wrt guest writes, which adds a bit 
>>>> more complexity.  I'll leave it to you to code the state machine 
>>>> for that.
>>>
>>> http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d 
>>>
>>
>> Clever - it pushes all the synchronization into the copy-on-read 
>> implementation.  But the serialization there hardly jumps out of the 
>> code.
>>
>> Do I understand correctly that you can only have one allocating read 
>> or write running?
>
> Cluster allocation, L2 cache allocation, or on-disk L2 allocation?
>
> You only have one on-disk L2 allocation at one time.  That's just an 
> implementation detail at the moment.  An on-disk L2 allocation happens 
> only when writing to a new cluster that requires a totally new L2 
> table.  Since L2s cover 2GB of logical space, it's a rare event so 
> this turns out to be pretty reasonable for a first implementation.
>
> Parallel on-disk L2 allocation is not that difficult, it's just a 
> future TODO.

Really, you can just preallocate all L2s.  Most filesystems will touch 
all of them very soon.  qcow2 might save some space for snapshots which 
share L2s (doubtful) or for 4k clusters (historical) but for qed with 
64k clusters, it doesn't save any space.

Linear L2s will also make your fsck *much* quicker.  Size is .01% of 
logical image size.  1MB for a 10GB guest, by the time you install 
something on it that's a drop in the bucket.
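
The arithmetic, using the figures already mentioned in this thread (64k
clusters, one L2 table covering 2GB):

    cluster_size = 64 * 1024             # qed default
    l2_coverage  = 2 * 1024**3           # logical space per L2 table
    entry_size   = 8                     # 64-bit offsets

    l2_table = l2_coverage // cluster_size * entry_size   # 262144 = 256KB
    overhead = l2_table / l2_coverage                     # ~0.012%
    per_10g  = overhead * 10 * 1024**3                    # ~1.25MB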

If you install a guest on a 100GB disk, what percentage of L2s are 
allocated?

>
>>>
>>> Generally, I think the block layer makes more sense if the interface 
>>> to the formats is high level and code sharing is achieved not by 
>>> mandating a world view but rather by making libraries of common 
>>> functionality.   This is more akin to how the FS layer works in Linux.
>>>
>>> So IMHO, we ought to add a bdrv_aio_commit function, turn the 
>>> current code into a generic_aio_commit, implement a qed_aio_commit, 
>>> then somehow do qcow2_aio_commit, and look at what we can refactor 
>>> into common code.
>>
>> What Linux does is have an equivalent of bdrv_generic_aio_commit() 
>> which most implementations call (or default to), and only do 
>> something if they want something special.  Something like commit (or 
>> copy-on-read, or copy-on-write, or streaming) can be implemented 100% 
>> in terms of the generic functions (and indeed qcow2 backing files can 
>> be any format).
>
> Yes, what I'm really saying is that we should take the 
> bdrv_generic_aio_commit() approach.  I think we're in agreement here.
>

Strange feeling.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread

Thread overview: 32+ messages
2010-09-07 13:41 [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration Anthony Liguori
2010-09-07 14:01 ` Alexander Graf
2010-09-07 14:31   ` Anthony Liguori
2010-09-07 14:33 ` Stefan Hajnoczi
2010-09-07 14:51   ` Anthony Liguori
2010-09-07 14:55     ` Stefan Hajnoczi
2010-09-07 15:00       ` Anthony Liguori
2010-09-07 15:09         ` Stefan Hajnoczi
2010-09-07 15:20           ` Anthony Liguori
2010-09-08  8:26           ` Kevin Wolf
2010-09-07 14:34 ` Kevin Wolf
2010-09-07 14:49   ` Stefan Hajnoczi
2010-09-07 14:57     ` Anthony Liguori
2010-09-07 15:05       ` Stefan Hajnoczi
2010-09-07 15:23         ` Anthony Liguori
2010-09-12 12:41       ` Avi Kivity
2010-09-12 13:25         ` Anthony Liguori
2010-09-12 13:40           ` Avi Kivity
2010-09-12 15:23             ` Anthony Liguori
2010-09-12 16:45               ` Avi Kivity
2010-09-12 17:19                 ` Anthony Liguori
2010-09-12 17:31                   ` Avi Kivity
2010-09-07 14:49   ` Anthony Liguori
2010-09-07 15:02     ` Kevin Wolf
2010-09-07 15:11       ` Anthony Liguori
2010-09-07 15:20         ` Kevin Wolf
2010-09-07 15:30           ` Anthony Liguori
2010-09-07 15:39             ` Kevin Wolf
2010-09-07 16:00               ` Anthony Liguori
2010-09-07 15:03 ` [Qemu-devel] " Daniel P. Berrange
2010-09-07 15:16   ` Anthony Liguori
2010-09-12 10:55 ` [Qemu-devel] " Avi Kivity
