From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:43079)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1QgfUq-0003sk-U0
	for qemu-devel@nongnu.org; Tue, 12 Jul 2011 12:07:42 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1QgfUo-0005n5-4X
	for qemu-devel@nongnu.org; Tue, 12 Jul 2011 12:07:40 -0400
Received: from mx1.redhat.com ([209.132.183.28]:25636)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1QgfUn-0005mx-HC
	for qemu-devel@nongnu.org; Tue, 12 Jul 2011 12:07:37 -0400
Message-ID: <4E1C71F2.4030507@redhat.com>
Date: Tue, 12 Jul 2011 18:10:26 +0200
From: Kevin Wolf <kwolf@redhat.com>
MIME-Version: 1.0
References: <4E131D0D.307@redhat.com>	<20110711125432.GA19686@stefanha-thinkpad.localdomain>	<CAJSP0QX0ksga-bPPy53SPcPaApT5uXfH-8ORtT0JkPRaskBkhg@mail.gmail.com>	<20110711163226.GA10924@amt.cnet>	<4E1C009C.1010408@redhat.com>
	<CAJSP0QXwEY7wKYD=vQf1H4nLKzccemNSbNauaMSK8GQgoHrbfw@mail.gmail.com>
In-Reply-To: <CAJSP0QXwEY7wKYD=vQf1H4nLKzccemNSbNauaMSK8GQgoHrbfw@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] live block copy/stream/snapshot discussion
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>, Dor Laor <dlaor@redhat.com>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>, Marcelo Tosatti <mtosatti@redhat.com>, qemu-devel <qemu-devel@nongnu.org>, Avi Kivity <avi@redhat.com>, Adam Litke <agl@us.ibm.com>

Am 12.07.2011 17:45, schrieb Stefan Hajnoczi:
>>>> Image streaming API
>>>> ===================
>>>>
>>>> For leaf images with copy-on-read semantics, the stream commands allow the user
>>>> to populate local blocks by manually streaming them from the backing image.
>>>> Once all blocks have been streamed, the dependency on the original backing
>>>> image can be removed.  Therefore, stream commands can be used to implement
>>>> post-copy live block migration and rapid deployment.
>>>>
>>>> The block_stream command can be used to stream a single cluster, to
>>>> start streaming the entire device, and to cancel an active stream.  It
>>>> is easiest to allow the block_stream command to manage streaming for the
>>>> entire device but a managent tool could use single cluster mode to
>>>> throttle the I/O rate.
>>
>> As discussed earlier, having the management send requests for each
>> single cluster doesn't make any sense at all. It wouldn't only throttle
>> the I/O rate but bring it down to a level that makes it unusable. What
>> you really want is to allow the management to give us a range (offset +
>> length) that qemu should stream.
> 
> I feel that an iteration interface is problematic whether the
> management tool or QEMU decide what to stream.  Let's have just the
> background streaming operation.
> 
> The problem with byte ranges is two-fold.  The management tool doesn't
> know which regions of the image are allocated so it may do a lot of
> nop calls to already-allocated regions with no intelligence as to
> where the next sensible offset for streaming is.  Secondly, because
> the progress and performance of image streaming depend largely on
> whether or not clusters are allocated (it is very fast when a cluster
> is already allocated and we have no work to do), offsets are bad
> indicators of progress to the user.  I think it's best not to expose
> these details to the management tool at all.
> 
> The only reason for the iteration interface was to punt I/O throttling
> to the management tool.  I think it would be easier to just throttle
> inside the streaming function.
> 
> Kevin: Are you happy with dropping the iteration interface?
> Adam: Is there a libvirt requirement for iteration or could we support
> background copy only?

Okay, works for me.

>>>> The command synopses are as follows:
>>>>
>>>> block_stream
>>>> ------------
>>>>
>>>> Copy data from a backing file into a block device.
>>>>
>>>> If the optional 'all' argument is true, this operation is performed in the
>>>> background until the entire backing file has been copied.  The status of
>>>> ongoing block_stream operations can be checked with query-block-stream.
>>
>> Not sure if it's a good idea to use a bool argument to turn a command
>> into its opposite. I think having a separate command for stopping would
>> be cleaner. Something for the QMP folks to decide, though.
> 
> git branch new_branch
> git branch -D new_branch
> 
> Makes sense to me :)

I don't think you should compare a command line option to a programming
interface. Having a git_create_branch(const char *name, bool delete)
would really look strange. Anyway, probably a matter of taste.

A hint that separate commands would make sense is that the stop command
won't need the other arguments that the start command gets ('all' and
'base').

>>>> Arguments:
>>>>
>>>> - all:    copy entire device (json-bool, optional)
>>>> - stop:   stop copying to device (json-bool, optional)
>>>> - device: device name (json-string)
>>>
>>> It must be possible to specify backing file that will be
>>> active after streaming finishes (data from that file will not
>>> be streamed into active file, of course).
>>
>> Yes, I think the common base image belongs here.
> 
> Right.  We need to specify it by filename:
> 
>   - base: filename of base file (json-string, optional)
> 
>   Sectors are not copied from the base file and its backing file
>   chain.  The following describes this feature:
>     Before: base <- sn1 <- sn2 <- sn3 <- vm.img
>     After:  base <- vm.img

Does this imply that a rebase -u happens always after completion?

>> With all = false, where does the streaming begin?
> 
> Streaming begins at the start of the image.
> 
>> Do you have something like the "current streaming offset" in the state of each BlockDriverState?
> 
> Yes, there is a StreamState for each block device that has an
> in-progress operation.  The progress is saved between block_stream
> (without -a) invocations so the caller does not need to specify the
> streaming offset as an argument.
> 
> Thanks for pointing out these weaknesses in the documentation.  It
> should really be explained fully.

I think we also need to describe error cases. For example, what happens
if you try to start streaming while it's already in progress?

>>>> Return:
>>>>
>>>> - device: device name (json-string)
>>>> - len:    size of the device, in bytes (json-int)
>>>> - offset: ending offset of the completed I/O, in bytes (json-int)
>>
>> So you only get the reply when the request has completed? With the
>> current monitor, this means that QMP is blocked while we stream, doesn't
>> it? How are you supposed to send the stop command then?
> 
> Incomplete documentation again, sorry.  The block_stream command
> behaves as follows:
> 
> 1. block_stream all returns immediately and the BLOCK_STREAM_COMPLETED
> event is raised when streaming completes either successfully or with
> an error.
> 
> 2. block_stream stop returns when the in-progress streaming operation
> has been safely stopped.
> 
> 3. block_stream returns when one iteration of streaming has completed.
> 
>> Two of three examples below have an empty return value instead, so they
>> are not compliant to this specification.
> 
> I will update the documentation, the non-all invocations do not return anything.

Okay, then I don't understand what the 'offset' return value means. The
text says "offset of the completed I/O". If all=true immediately
returns, shouldn't it always be 0?

>> I find it rather disturbing that a command like 'change' has made it
>> into QMP... Anyway, I don't think this is really what we need.
>>
>> We have two switches to do. The first one happens before starting the
>> copy: Creating the copy, with the source as its backing file, and
>> switching to that. The monitor command to achieve this is snapshot_blkdev.
> 
> I don't think that creating image files in QEMU is going to work when
> running KVM with libvirt (SELinux).  The QEMU process does not have
> the ability to create new image files.  It needs at least a file
> descriptor to an empty file or maybe a file that has been created
> using qemu-img like I showed above.

Independent problem. We're really creating an external snapshot here, so
we should use the function for external snapshots. libvirt can
pre-create an empty image file, so that qemu will write the image format
data into it, but we have discussed this before.

Kevin