* [Qemu-devel] live block copy/stream/snapshot discussion
@ 2011-07-05 14:17 Dor Laor
  2011-07-11 12:54 ` Stefan Hajnoczi
  0 siblings, 1 reply; 13+ messages in thread
From: Dor Laor @ 2011-07-05 14:17 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi, Marcelo Tosatti, Kevin Wolf,
      Avi Kivity, Anthony Liguori

Anthony advised cloning
http://wiki.qemu.org/index.php?title=Features/LiveBlockMigrationFuture
to the list in order to encourage discussion, so here it is:

------------------------------------------------------------------------

qemu is expected to support these features (some already implemented):

= Live features =

== Live block copy ==

   Ability to copy one or more virtual disks from the source backing
   file/block device to a new target that is accessible by the host.
   The copy is supposed to be executed transparently while the VM runs.

== Live snapshots and live snapshot merge ==

   Live snapshot is already incorporated (by Jes) in qemu (it still
   needs virt-agent work to freeze the guest FS).
   Live snapshot merge is required in order to reduce the overhead
   caused by the additional snapshots (sometimes over a raw device).
   We'll use live copy to do the live merge.

== Image streaming (Copy on read) ==

   Ability to start guest execution while the parent image resides
   remotely and each block access is replicated to a local copy (image
   format snapshot).

   Such functionality can be hooked together with live block migration
   instead of the 'post copy' method.

== Live block migration (pre/post) ==

   Beyond live block copy we'll sometimes need to move both the storage
   and the guest. There are two main approaches here:
   - pre copy
     First live copy the image and only then live migrate the VM.
     It is a simpler and safer approach in terms of the management app,
     but if the purpose of the whole live block migration was to
     balance the cpu load, it won't be practical to use since copying
     an image of 100GB will take too long.
   - post copy (streaming / copy on read)
     First live migrate the VM, then stream its blocks online.
     It's a better approach for HA/load balancing but it might make
     management complex (need to keep the source VM alive, handle
     failures).

   In addition there are two cases for the storage access:
   1. Shared storage
      Live block copy enables this capability; it seems like a rare
      case for live block migration.
   2. There are some cases where there is no NFS/SAN storage and live
      migration is needed. It should be similar to VMW's storage VM
      motion.
      http://www.vmware.com/files/pdf/VMware-Storage-VMotion-DS-EN.pdf
      http://www.vmware.com/products/storage-vmotion/features.html

== Using external dirty block bitmap ==

   FVD has an option to use an external dirty block bitmap file in
   addition to the regular mapping/data files.

   We can consider using it for live block migration and live merge too.
   It can also allow additional usages of 3rd party tools to calculate
   diffs between the snapshots.
   There is a big downside though, since it will make management
   complicated and there is the risk of the image and its bitmap file
   getting out of sync. It's a much better choice to have the qemu-img
   tool be the single interface to the dirty block bitmap data.

= Solutions =

== Non shared storage ==

   Either use iscsi (target and initiator) or NBD or a proprietary qemu
   solution. iScsi in theory is the best but there is a problem of
   dealing with COW images - iScsi cannot report the COW level and
   detect un-allocated blocks. This might force us to use a
   proprietary solution.

   An interesting option (by Orit Wasserman) was to use iScsi for
   exporting the images externally to the qemu level, and qemu will
   access them as if they were a local device. This can work well with
   almost no effort. What do we do with chains of COW files? We create
   up to N such iscsi connections, one for every COW file in the chain.

== Live block migration ==

   Use the streaming approach + regular live migration + iscsi:
   Execute regular live migration and at the end of it, start streaming.
   If there is no shared storage, use the external iscsi and behave as
   if the image is local. At the end of the streaming operation there
   will be a new local base image.

== Block mirror layer ==

   Was invented in order to duplicate write IOs for the source and
   destination images. It prevents the potential race when both qemu
   and the management crash at the end of the block copy stage and it
   is unknown whether management should pick the source or the
   destination.

== Streaming ==

   No need for a mirror since only the destination changes and is
   writable.

== Block copy background task ==

   Can be shared between block copy and streaming.

== Live snapshot ==

   It can be seen as a (local) stream that preserves the current COW
   chain.

= Use cases =

 1. Basic streaming, single base master image on source storage, needs
    to be instantiated on destination storage

    The base image is a single level COW format (file or lvm).
    The base is RO and only the new destination is RW. base' is empty
    at the beginning. The base image content is being copied in the
    background to base'. At the end of the operation, base' is a
    standalone image that does not depend on the base image.

    a. Case of a shared storage streaming guest boot

       Before:           src storage: base     dst storage: none
       After:            src storage: base     dst storage: base'

    b. Case of no shared storage streaming guest boot
       Everything is the same; we use an external iscsi target on the
       src host and an external iscsi initiator on the destination host.
       Qemu boots from the destination by using the iscsi access. This
       is transparent to qemu (except for a cmd syntax change). Once the
       streaming is over, we can live drop the usage of iscsi and open
       the image directly (some sort of null live copy).

    c. Live block migration (using streaming) w/ shared storage.
       Exactly like 1.a.
       First create the destination image, then we run live migration
       there w/o data in the new image. Now we stream like the boot
       scenario.

    d. Live block migration (using streaming) w/o shared storage.
       Like 1.b. + 1.c.

    *** There is complexity in handling multiple block devices
        belonging to the same VM. Management will need to track each
        stream finish event and manage failures accordingly.

 2. Basic streaming of raw files/devices

    Here we have an issue - what happens if there is a failure in the
    middle? Regular COW can sustain a failure since the intermediate
    base' contains dirty-block information. Such a base' intermediate
    raw image will be broken. We cannot revert back to the original
    base and start over because new writes were written only to base'.

    Approaches:
    a. Don't support that
    b. Use an intermediate COW image and then live copy it into raw
       (wastes time, IO, space). One can easily add a new COW over the
       source and continue from there.
    c. Use external metadata of dirty-block-bitmap even for raw

    Suggestion: at this stage, do either recommendation #a or #b

 3. Basic live copy, single base master image on source storage, needs
    to be copied to the destination storage

    The base image is a single level COW format or a raw file/device.
    The base image content is being copied in the background to base'.
    At the end of the operation, base' is a standalone image that does
    not depend on the base image. In this case we only take into
    account a running VM; there is no need to do that for the boot
    stage. So it is either a VM running locally and about to change its
    storage or a VM live migration. The plan is to use the mirror
    driver approach. Both src/dst are writable.

    a. Case of a shared storage, a VM changes its block device

       Before:           src storage: base     dst storage: none
       After:            src storage: base     dst storage: base'

       This is a plain live copy w/o moving the VM.
       The case w/o shared storage seems not relevant here.

       We might want to move multiple block devices of the VM.
       It is written here for completeness - it shouldn't change
       anything. Still, management/events will use the block name/id.

    b. Live block migration (w/o streaming) w/ shared storage.
       Unlike in the streaming case, the order here is reversed:
       Run live copy. When it ends and we're in the mirror state,
       run live migration. When it ends, stop the mirroring and make
       the VM continue on the destination.

       That's probably a rare use case.

    c. Live block migration (using streaming) w/o shared storage.
       Like 3.b. by using external iscsi

 4. COW chains that preserve the full structure

    Before: src: base <- sn1 <- snx         dst: none
    After:  src: base <- sn1 <- snx         dst: base' <- sn1' <- snx'

    All of the original snapshot chains should be copied or streamed
    as is to the new storage. With copying we can do all of the non
    leaf images using standard 'cp tools'. If we're to use iscsi, we'll
    need to create N such connections.
    Probably not a common use case for streaming; we might ignore this
    and use this scenario only for copying.

 5. Like 4. but the chain can collapse. In fact this is like a special
    case of #4

    Before: src: base <- sn1 <- sn2 .. <- snx   dst: none
    After:  src: base <- sn1 <- sn2 .. <- snx   dst: base' <- sn1' .. <- sny'

    There is no difference from #4 other than collapsing some chain
    path into the dst leaf.

 6. Live snapshot

    It's here since the interface can be similar. Basically it is
    similar to live copy but instead of copying, we switch to another
    COW on top. The only (separate) addition would be to add a verb to
    ask the guest to flush its file systems.

    Before:           storage: base <- s1 <- sx
    After:            storage: base <- s1 <- sx <- sx+1

== Exceptions ==

 1. Hot unplug of the relevant disk
    Prevent that. (or cancel the operation)

 2. Live migration in the middle of a non migration action from above
    Shall we allow it? It can work but at the end of live migration we
    need to reopen the images (NFS mainly), which might add un-needed
    complexity.
    We had better prevent that.

= Interface =

== Streaming (by Stefan) ==

 1.
    Start a background streaming operation:

    (qemu) block_stream -a ide0-hd

 2. Check the status of the operation:

    (qemu) info block-stream
    Streaming device ide0-hd: Completed 512 of 34359738368 bytes

 3. The status changes when the operation completes:

    (qemu) info block-stream
    No active stream

On completion the image file no longer has a backing file dependency.
When streaming completes QEMU updates the image file metadata to
indicate that no backing file is used.

The QMP interface is similar but provides QMP events to signal
streaming completion and failure. Polling to query the streaming
status is only used when the management application wishes to refresh
progress information.

If guest execution is interrupted by a power failure or QEMU crash,
then the image file is still valid but streaming may be incomplete.
When QEMU is launched again the block_stream command can be issued to
resume streaming.

-----------------

Cheers,
Dor

^ permalink raw reply	[flat|nested] 13+ messages in thread
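
A management tool that polls the HMP interface described above has to turn the "info block-stream" text into numbers. The following is only a minimal sketch under stated assumptions: it matches the two sample outputs shown in this message and nothing else, and the names StreamStatus and parse_stream_status are hypothetical helpers, not part of QEMU.

```python
import re
from typing import NamedTuple, Optional


class StreamStatus(NamedTuple):
    """Parsed form of one 'info block-stream' status line (hypothetical type)."""
    device: str
    completed: int  # bytes streamed so far
    total: int      # device size in bytes

    @property
    def percent(self) -> float:
        # Treat a zero-length device as already complete.
        return 100.0 * self.completed / self.total if self.total else 100.0


# Matches the sample status line quoted in the message above.
_STATUS_RE = re.compile(
    r"Streaming device (?P<device>\S+): "
    r"Completed (?P<completed>\d+) of (?P<total>\d+) bytes"
)


def parse_stream_status(text: str) -> Optional[StreamStatus]:
    """Return the parsed status, or None when no stream is reported."""
    match = _STATUS_RE.search(text)
    if match is None:
        return None  # e.g. the "No active stream" reply
    return StreamStatus(
        match["device"], int(match["completed"]), int(match["total"])
    )
```

As the message notes, QMP events are the intended way to learn about completion and failure, so a parser like this would only be useful for refreshing a progress display.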
* Re: [Qemu-devel] live block copy/stream/snapshot discussion 2011-07-05 14:17 [Qemu-devel] live block copy/stream/snapshot discussion Dor Laor @ 2011-07-11 12:54 ` Stefan Hajnoczi 2011-07-11 14:47 ` Stefan Hajnoczi 0 siblings, 1 reply; 13+ messages in thread From: Stefan Hajnoczi @ 2011-07-11 12:54 UTC (permalink / raw) To: Dor Laor Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity On Tue, Jul 05, 2011 at 05:17:49PM +0300, Dor Laor wrote: > Anthony advised to clone http://wiki.qemu.org/index.php?title=Features/LiveBlockMigrationFuture > to the list in order to encourage discussion, so here it is: > ------------------------------------------------------------------------ > qemu is expected to support these features (some already implemented): > > = Live features = > > == Live block copy == > > Ability to copy 1+ virtual disk from the source backing file/block > device to a new target that is accessible by the host. The copy > supposed to be executed while the VM runs in a transparent way. > > == Live snapshots and live snapshot merge == > > Live snapshot is already incorporated (by Jes) in qemu (still need > virt-agent work to freeze the guest FS). > Live snapshot merge is required in order of reducing the overhead > caused by the additional snapshots (sometimes over raw device). > We'll use live copy to do the live merge This line seems outdated. Kevin and Marcelo have suggested a separate live commit operation that does not use the unified block copy/image streaming mechanism. > = Solutions = > > == Non shared storage == > > Either use iscsi (target and initiator) or NBD or proprietary qemu > solution. iScsi in theory is the best but there is a problem of > dealing with COW images - iScsi cannot report the COW level and > detect un-allocated blocks. This might force us to use > proprietary solution. 
> An interesting option (by Orit Wasserman) was to use iScsi for > exporting the images externally to qemu level and qemu will access > as if they were a local device. This can work well w/o almost any > effort. What do we do with chains of COW files? We create up to N > such iscsi connections for every COW file in the chain. If there is a discovery mechanism to locate LUNs then it would be possible to use this approach. However, using iSCSI but placing all the copy-on-write intelligence into the QEMU initiator is overkill since we need to support SAN/NAS appliances that provide snapshots, copy-on-write, and thin provisioning anyway. If you look at what other hypervisors are doing, they are trying to offload as much storage processing onto the appliance as possible. We probably want the appliance to do those operations for us, so implementing them in the initiator for some cases is duplicating that code and making the system more complex. The real problem is that we're lacking a library interface to manage volumes, including snapshots. I don't think that QEMU needs to drive this interface. It should be libvirt (which deals with storage pools and volumes today already). Once we do have an interface defined, I think it makes less sense implementing all of this in QEMU when this storage management functionality really belongs in NAS/SAN appliances and software targets. > > == Live block migration == > > Use the streaming approach + regular live migration + iscsi: > Execute regular live migration and at the end of it, start streaming. > If there is no shared storage, use the external iscsi and behave as > if the image is local. At the end of the streaming operation there > will be a new local base image. > > == Block mirror layer == > > Was invented in order to duplicate write IOs for the source and > destination images. 
It prevents the potential race when both qemu > and the management crash at the end of the block copy stage and it > is unknown whether management should pick the source or the > destination > > == Streaming == > > No need for mirror since only the destination changes and is > writable. > > == Block copy background task == > > Can be shared between block copy and streaming > > == Live snapshot == > > It can be seen as a (local) stream that preserve the current COW > chain > > = Use cases = > > 1. Basic streaming, single base master image on source storage, need > to be instantiated on destination storage > > The base image is a single level COW format (file or lvm). > The base is RO and only new destination is RW. base' is empty at > the beginning. The base image content is being copied in the > background to base'. At the end of the operation, base' is a > standalone image w/o depending on the base image. > > a. Case of a shared storage streaming guest boot > > Before: src storage: base dst storage: none > After src storage: base dst storage: base' > > b. Case of no shared storage streaming guest boot > Every thing is the same, we use external iscsi target on the > src host and external iscsi initiator on the destination host. > Qemu boots from the destination by using the iscsi access. This > is transparent to qemu (expect cmd syntax change ). Once the > streaming is over, we can live drop the usage of iscsi and open > the image directly (some sort of null live copy) > > c. Live block migration (using streaming) w/ shared storage. > Exactly like 1.a. First create the destination image, then we > run live migration there w/o data in the new image. Now we > stream like the boot scenario. > > d. Live block migration (using streaming) w/o shared storage. > Like 1.b. + 1.c. > > *** There is complexity to handle multiple block device belonging > to the same VM. Management will need to track each stream finish > event and manage failures accordingly. 

This is tangential but recently I've been thinking about the two roles
that libvirt plays:
1. Hypervisor-neutral management API
2. KVM high-level functionality (image lifecycle, CPU affinity,
   networking configuration)

Libvirt is accumulating KVM-specific high-level functionality that
really should be in a KVM API or qemud. Otherwise libvirt will become
lopsided with a significant part of the codebase doing KVM-specific
things simply because there was no other place to put this
functionality.

Image lifecycle is one area where we could help. qemu-img is good but
doesn't meet the needs of libvirt, which reimplements a bunch of image
management functionality.

When we talk about managing backing files or external dirty bitmaps I
fear this imbalance will get worse. libvirt will be *doing* a lot of
the work instead of *delegating* what needs to be done to
virtualization software (like VMware APIs).

I feel we're missing something between qemu (scope: single guest
instance) and libvirt (scope: host-wide hypervisor-neutral management
API).

> 2. Basic streaming of raw files/devices

s/Basic streaming of raw files/Basic streaming to raw files/

I think this makes it clearer that the issue is keeping track of
streamed blocks in the destination file.

>
>    Here we have an issue - what happens if there is a failure in the
>    middle? Regular COW can sustain a failure since the intermediate
>    base' contains information dirty bit block information. Such a
>    base' intermediate raw image will be broken. We cannot revert back
>    to the original base and start over because new writes were written
>    only to the base'.
>
>    Approaches:
>    a. Don't support that
>    b. Use intermediate COW image and then live copy it into raw (waste
>       time, IO, space). One can easily add new COW over the source and
>       continue from there.
>    c. Use external metadata of dirty-block-bitmap even for raw
>
>    Suggestion: at this stage, do either recommendation #a or #b

I think #a is fine as a starting point.
#c can be added as a feature later and should not require major
changes.

>
> 3. Basic live copy, single base master image on source storage, need
>    to be copied to the destination storage
>
>    The base image is a single level COW format or a raw file/device.
>    The base image content is being copied in the background to base'.
>    At the end of the operation, base' is a standalone image w/o
>    depending on the base image. In this case we only take into account
>    a running VM, no need to do that for boot stage.
>    So it is either VM running locally and about to change its storage
>    or a VM live migration. The plan is to use the mirror driver
>    approach. Both src/dst are writable.

I think this is outdated. I believe Marcelo stated that the mirror
driver is not needed and that streaming can be used for live block
migration (pre-copy). So there is no difference between this and
basic streaming.

> == Exceptions ==
>
> 1. Hot unplug of the relevant disk
>    Prevent that. (or cancel the operation)
>
> 1. Live migration in the middle of non migration action from above
>    Shall we allow it? It can work but at the end of live migration we
>    need to reopen the images (NFS mainly), it might add un-needed
>    complexity.
>    We better prevent that.

I think adding state to lock devices is a good idea. It helps prevent
human errors. The only thing to watch out for is that the guest should
not be able to lock devices - otherwise the guest can prevent the
administrator's actions. Never trust the guest :).

Stefan
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-11 12:54 ` Stefan Hajnoczi
@ 2011-07-11 14:47   ` Stefan Hajnoczi
  2011-07-11 16:32     ` Marcelo Tosatti
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2011-07-11 14:47 UTC (permalink / raw)
  To: Marcelo Tosatti, Kevin Wolf
  Cc: Anthony Liguori, Stefan Hajnoczi, Dor Laor, qemu-devel,
      Avi Kivity, Adam Litke

Kevin, Marcelo,
I'd like to reach agreement on the QMP/HMP APIs for live block copy
and image streaming. Libvirt has acked the image streaming APIs that
Adam proposed and I think they are a good fit for the feature. I have
described that API below for your review (it's exactly what the QED
Image Streaming patches provide).

Marcelo: Are you happy with this API for live block copy? Also please
take a look at the switch command that I am proposing.

Image streaming API
===================

For leaf images with copy-on-read semantics, the stream commands allow
the user to populate local blocks by manually streaming them from the
backing image. Once all blocks have been streamed, the dependency on
the original backing image can be removed. Therefore, stream commands
can be used to implement post-copy live block migration and rapid
deployment.

The block_stream command can be used to stream a single cluster, to
start streaming the entire device, and to cancel an active stream. It
is easiest to allow the block_stream command to manage streaming for
the entire device but a management tool could use single cluster mode
to throttle the I/O rate.

The command synopses are as follows:

block_stream
------------

Copy data from a backing file into a block device.

If the optional 'all' argument is true, this operation is performed in
the background until the entire backing file has been copied. The
status of ongoing block_stream operations can be checked with
query-block-stream.

Arguments:

- all:    copy entire device (json-bool, optional)
- stop:   stop copying to device (json-bool, optional)
- device: device name (json-string)

Return:

- device: device name (json-string)
- len:    size of the device, in bytes (json-int)
- offset: ending offset of the completed I/O, in bytes (json-int)

Examples:

-> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
<- { "return":  { "device": "virtio0", "len": 10737418240, "offset": 512 } }

-> { "execute": "block_stream", "arguments": { "all": true,
                                               "device": "virtio0" } }
<- { "return": {} }

-> { "execute": "block_stream", "arguments": { "stop": true,
                                               "device": "virtio0" } }
<- { "return": {} }

query-block-stream
------------------

Show progress of ongoing block_stream operations.

Return a json-array of all operations. If no operation is active then
an empty array will be returned. Each operation is a json-object with
the following data:

- device: device name (json-string)
- len:    size of the device, in bytes (json-int)
- offset: ending offset of the completed I/O, in bytes (json-int)

Example:

-> { "execute": "query-block-stream" }
<- { "return":[
       { "device": "virtio0", "len": 10737418240, "offset": 709632}
    ]
   }

Block device switching API
==========================

Extend the 'change' command to support changing the image file without
media change notification.

Perhaps we should take the opportunity to add a "format" argument for
image files?

change
------

Change a removable medium or VNC configuration.

Arguments:

- "device": device name (json-string)
- "target": filename or item (json-string)
- "arg":    additional argument (json-string, optional)
- "notify": whether to notify guest, defaults to true (json-bool, optional)

Examples:

1. Change a removable medium

-> { "execute": "change",
             "arguments": { "device": "ide1-cd0",
                            "target": "/srv/images/Fedora-12-x86_64-DVD.iso" } }
<- { "return": {} }

2.
   Change a disk without media change notification

-> { "execute": "change",
             "arguments": { "device": "virtio-blk0",
                            "target": "/srv/images/vm_1.img",
                            "notify": false } }

3. Change VNC password

-> { "execute": "change",
             "arguments": { "device": "vnc", "target": "password",
                            "arg": "foobar1" } }
<- { "return": {} }

How live block copy works
=========================

Live block copy does the following:

1. Create the destination file:
   qemu-img create -f $cow_fmt -o backing_file=$base destination.$cow_fmt
2. Switch to the destination file:
   change -n virtio-blk0 /srv/images/vm_1.img
3. Stream the base into the image file:
   block_stream -a virtio-blk0

Stefan
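
The three steps above can be sketched as a small plan builder. This is only an illustration under stated assumptions: the "notify" and "all" arguments are the ones proposed in this message (not a confirmed QEMU interface), the function names are hypothetical, and the image paths in the usage note are placeholders. Nothing is executed; the code only builds the command line and the QMP JSON strings.

```python
import json
from typing import List


def qemu_img_create_argv(cow_fmt: str, base: str, dest: str) -> List[str]:
    """Step 1: argv that creates `dest` with `base` as its backing file."""
    return ["qemu-img", "create", "-f", cow_fmt,
            "-o", f"backing_file={base}", dest]


def qmp_command(name: str, **arguments) -> str:
    """Serialize one QMP command as a JSON string."""
    return json.dumps({"execute": name, "arguments": arguments},
                      sort_keys=True)


def live_block_copy_plan(device: str, cow_fmt: str,
                         base: str, dest: str) -> List[str]:
    """Return the three steps from the message, in order."""
    return [
        # 1. Create the destination file on top of the base image.
        " ".join(qemu_img_create_argv(cow_fmt, base, dest)),
        # 2. Switch the device to the destination without notifying the
        #    guest ("notify" is the argument PROPOSED in this thread).
        qmp_command("change", device=device, target=dest, notify=False),
        # 3. Stream the whole base into the new image in the background
        #    ("all" is likewise the PROPOSED argument).
        qmp_command("block_stream", device=device, all=True),
    ]
```

For example, live_block_copy_plan("virtio-blk0", "qed", "/srv/images/base.img", "/srv/images/vm_1.img") yields one shell command followed by two QMP messages that a management tool would send in that order.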
* Re: [Qemu-devel] live block copy/stream/snapshot discussion 2011-07-11 14:47 ` Stefan Hajnoczi @ 2011-07-11 16:32 ` Marcelo Tosatti 2011-07-12 8:06 ` Kevin Wolf 0 siblings, 1 reply; 13+ messages in thread From: Marcelo Tosatti @ 2011-07-11 16:32 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, Dor Laor, qemu-devel, Avi Kivity, Adam Litke On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote: > Kevin, Marcelo, > I'd like to reach agreement on the QMP/HMP APIs for live block copy > and image streaming. Libvirt has acked the image streaming APIs that > Adam proposed and I think they are a good fit for the feature. I have > described that API below for your review (it's exactly what the QED > Image Streaming patches provide). > > Marcelo: Are you happy with this API for live block copy? Also please > take a look at the switch command that I am proposing. > > Image streaming API > =================== > > For leaf images with copy-on-read semantics, the stream commands allow the user > to populate local blocks by manually streaming them from the backing image. > Once all blocks have been streamed, the dependency on the original backing > image can be removed. Therefore, stream commands can be used to implement > post-copy live block migration and rapid deployment. > > The block_stream command can be used to stream a single cluster, to > start streaming the entire device, and to cancel an active stream. It > is easiest to allow the block_stream command to manage streaming for the > entire device but a managent tool could use single cluster mode to > throttle the I/O rate. > > The command synopses are as follows: > > block_stream > ------------ > > Copy data from a backing file into a block device. > > If the optional 'all' argument is true, this operation is performed in the > background until the entire backing file has been copied. 
The status of > ongoing block_stream operations can be checked with query-block-stream. > > Arguments: > > - all: copy entire device (json-bool, optional) > - stop: stop copying to device (json-bool, optional) > - device: device name (json-string) It must be possible to specify backing file that will be active after streaming finishes (data from that file will not be streamed into active file, of course). > Return: > > - device: device name (json-string) > - len: size of the device, in bytes (json-int) > - offset: ending offset of the completed I/O, in bytes (json-int) > > Examples: > > -> { "execute": "block_stream", "arguments": { "device": "virtio0" } } > <- { "return": { "device": "virtio0", "len": 10737418240, "offset": 512 } } > > -> { "execute": "block_stream", "arguments": { "all": true, "device": > "virtio0" } } > <- { "return": {} } > > -> { "execute": "block_stream", "arguments": { "stop": true, "device": > "virtio0" } } > <- { "return": {} } > > query-block-stream > ------------------ > > Show progress of ongoing block_stream operations. > > Return a json-array of all operations. If no operation is active then an empty > array will be returned. Each operation is a json-object with the following > data: > > - device: device name (json-string) > - len: size of the device, in bytes (json-int) > - offset: ending offset of the completed I/O, in bytes (json-int) > > Example: > > -> { "execute": "query-block-stream" } > <- { "return":[ > { "device": "virtio0", "len": 10737418240, "offset": 709632} > ] > } > > > Block device switching API > ========================== > > Extend the 'change' command to support changing the image file without > media change notification. > > Perhaps we should take the opportunity to add a "format" argument for > image files? > > change > ------ > > Change a removable medium or VNC configuration. 
>
> Arguments:
>
> - "device": device name (json-string)
> - "target": filename or item (json-string)
> - "arg":    additional argument (json-string, optional)
> - "notify": whether to notify guest, defaults to true (json-bool, optional)
>
> Examples:
>
> 1. Change a removable medium
>
> -> { "execute": "change",
>              "arguments": { "device": "ide1-cd0",
>                             "target": "/srv/images/Fedora-12-x86_64-DVD.iso" } }
> <- { "return": {} }
>
> 2. Change a disk without media change notification
>
> -> { "execute": "change",
>              "arguments": { "device": "virtio-blk0",
>                             "target": "/srv/images/vm_1.img",
>                             "notify": false } }
>
> 3. Change VNC password
>
> -> { "execute": "change",
>              "arguments": { "device": "vnc", "target": "password",
>                             "arg": "foobar1" } }
> <- { "return": {} }
>
> How live block copy works
> =========================
>
> Live block copy does the following:
>
> 1. Create the destination file: qemu-img create -f $cow_fmt -o
> backing_file=$base destination.$cow_fmt
> 2. Switch to the destination file: change -n virtio-blk0 /srv/images/vm_1.img

The snapshot command (snapshot_blkdev) can be used for these two steps.

> 3. Stream the base into the image file: block_stream -a virtio-blk0
>
> Stefan
* Re: [Qemu-devel] live block copy/stream/snapshot discussion 2011-07-11 16:32 ` Marcelo Tosatti @ 2011-07-12 8:06 ` Kevin Wolf 2011-07-12 15:45 ` Stefan Hajnoczi 0 siblings, 1 reply; 13+ messages in thread From: Kevin Wolf @ 2011-07-12 8:06 UTC (permalink / raw) To: Marcelo Tosatti Cc: Anthony Liguori, Stefan Hajnoczi, Stefan Hajnoczi, Dor Laor, qemu-devel, Avi Kivity, Adam Litke Am 11.07.2011 18:32, schrieb Marcelo Tosatti: > On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote: >> Kevin, Marcelo, >> I'd like to reach agreement on the QMP/HMP APIs for live block copy >> and image streaming. Libvirt has acked the image streaming APIs that >> Adam proposed and I think they are a good fit for the feature. I have >> described that API below for your review (it's exactly what the QED >> Image Streaming patches provide). >> >> Marcelo: Are you happy with this API for live block copy? Also please >> take a look at the switch command that I am proposing. >> >> Image streaming API >> =================== >> >> For leaf images with copy-on-read semantics, the stream commands allow the user >> to populate local blocks by manually streaming them from the backing image. >> Once all blocks have been streamed, the dependency on the original backing >> image can be removed. Therefore, stream commands can be used to implement >> post-copy live block migration and rapid deployment. >> >> The block_stream command can be used to stream a single cluster, to >> start streaming the entire device, and to cancel an active stream. It >> is easiest to allow the block_stream command to manage streaming for the >> entire device but a managent tool could use single cluster mode to >> throttle the I/O rate. As discussed earlier, having the management send requests for each single cluster doesn't make any sense at all. It wouldn't only throttle the I/O rate but bring it down to a level that makes it unusable. 
What you really want is to allow the management to give us a range
(offset + length) that qemu should stream.

>> The command synopses are as follows:
>>
>> block_stream
>> ------------
>>
>> Copy data from a backing file into a block device.
>>
>> If the optional 'all' argument is true, this operation is performed in the
>> background until the entire backing file has been copied. The status of
>> ongoing block_stream operations can be checked with query-block-stream.

Not sure if it's a good idea to use a bool argument to turn a command
into its opposite. I think having a separate command for stopping would
be cleaner. Something for the QMP folks to decide, though.

>> Arguments:
>>
>> - all: copy entire device (json-bool, optional)
>> - stop: stop copying to device (json-bool, optional)
>> - device: device name (json-string)
>
> It must be possible to specify backing file that will be
> active after streaming finishes (data from that file will not
> be streamed into active file, of course).

Yes, I think the common base image belongs here.

With all = false, where does the streaming begin? Do you have something
like the "current streaming offset" in the state of each
BlockDriverState? As I said above, I would prefer adding offset and
length to the arguments.

>> Return:
>>
>> - device: device name (json-string)
>> - len: size of the device, in bytes (json-int)
>> - offset: ending offset of the completed I/O, in bytes (json-int)

So you only get the reply when the request has completed? With the
current monitor, this means that QMP is blocked while we stream, doesn't
it? How are you supposed to send the stop command then?

Two of the three examples below have an empty return value instead, so
they are not compliant with this specification.

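The range-based interface suggested above (offset + length instead of one request per cluster) could be driven from the management side roughly as follows. This is a sketch under stated assumptions: the "offset" and "length" arguments do not exist in the proposed block_stream command and are purely hypothetical here, as are the function names; the point is only to show how a device is carved into a handful of large requests.

```python
from typing import Dict, Iterator, Tuple


def split_ranges(device_len: int, chunk: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover [0, device_len) in order."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    offset = 0
    while offset < device_len:
        # The final range is shortened so it ends exactly at device_len.
        length = min(chunk, device_len - offset)
        yield offset, length
        offset += length


def stream_range_request(device: str, offset: int, length: int) -> Dict:
    """Build a QMP-style request for one range.

    HYPOTHETICAL: "offset" and "length" are the arguments suggested in
    this reply; they are not part of the block_stream proposal quoted
    above.
    """
    return {
        "execute": "block_stream",
        "arguments": {"device": device, "offset": offset, "length": length},
    }
```

With a 10 GiB device and 1 GiB ranges, the management tool issues ten requests and can pause between them to throttle I/O, instead of sending one request per cluster.
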
>> Examples: >> >> -> { "execute": "block_stream", "arguments": { "device": "virtio0" } } >> <- { "return": { "device": "virtio0", "len": 10737418240, "offset": 512 } } >> >> -> { "execute": "block_stream", "arguments": { "all": true, "device": >> "virtio0" } } >> <- { "return": {} } >> >> -> { "execute": "block_stream", "arguments": { "stop": true, "device": >> "virtio0" } } >> <- { "return": {} } >> >> query-block-stream >> ------------------ >> >> Show progress of ongoing block_stream operations. >> >> Return a json-array of all operations. If no operation is active then an empty >> array will be returned. Each operation is a json-object with the following >> data: >> >> - device: device name (json-string) >> - len: size of the device, in bytes (json-int) >> - offset: ending offset of the completed I/O, in bytes (json-int) >> >> Example: >> >> -> { "execute": "query-block-stream" } >> <- { "return":[ >> { "device": "virtio0", "len": 10737418240, "offset": 709632} >> ] >> } When block_stream is changed, this will have to make the same changes. >> Block device switching API >> ========================== >> >> Extend the 'change' command to support changing the image file without >> media change notification. >> >> Perhaps we should take the opportunity to add a "format" argument for >> image files? >> >> change >> ------ >> >> Change a removable medium or VNC configuration. >> >> Arguments: >> >> - "device": device name (json-string) >> - "target": filename or item (json-string) >> - "arg": additional argument (json-string, optional) >> - "notify": whether to notify guest, defaults to true (json-bool, optional) >> >> Examples: >> >> 1. Change a removable medium >> >> -> { "execute": "change", >> "arguments": { "device": "ide1-cd0", >> "target": "/srv/images/Fedora-12-x86_64-DVD.iso" } } >> <- { "return": {} } >> >> 2. 
Change a disk without media change notification >> >> -> { "execute": "change", >> "arguments": { "device": "virtio-blk0", >> "target": "/srv/images/vm_1.img", >> "notify": false } } >> >> 3. Change VNC password >> >> -> { "execute": "change", >> "arguments": { "device": "vnc", "target": "password", >> "arg": "foobar1" } } >> <- { "return": {} } I find it rather disturbing that a command like 'change' has made it into QMP... Anyway, I don't think this is really what we need. We have two switches to do. The first one happens before starting the copy: Creating the copy, with the source as its backing file, and switching to that. The monitor command to achieve this is snapshot_blkdev. The second switch is after the copy has completed. At this point you can remove the source as the backing file and use the common base image instead. This is a call to bdrv_change_backing_file(), for which a monitor command doesn't exist yet (and unless we want to overload 'change' even more, it's not the right command to do this). Kevin ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-12 8:06 ` Kevin Wolf
@ 2011-07-12 15:45 ` Stefan Hajnoczi
  2011-07-12 16:10 ` Kevin Wolf
  2011-07-12 17:47 ` Adam Litke
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Hajnoczi @ 2011-07-12 15:45 UTC
To: Kevin Wolf
Cc: Anthony Liguori, Dor Laor, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity, Adam Litke

On Tue, Jul 12, 2011 at 9:06 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 11.07.2011 18:32, Marcelo Tosatti wrote:
>> On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote:
>>> Kevin, Marcelo,
>>> I'd like to reach agreement on the QMP/HMP APIs for live block copy
>>> and image streaming. Libvirt has acked the image streaming APIs that
>>> Adam proposed and I think they are a good fit for the feature. I have
>>> described that API below for your review (it's exactly what the QED
>>> Image Streaming patches provide).
>>>
>>> Marcelo: Are you happy with this API for live block copy? Also please
>>> take a look at the switch command that I am proposing.
>>>
>>> Image streaming API
>>> ===================
>>>
>>> For leaf images with copy-on-read semantics, the stream commands allow the user
>>> to populate local blocks by manually streaming them from the backing image.
>>> Once all blocks have been streamed, the dependency on the original backing
>>> image can be removed. Therefore, stream commands can be used to implement
>>> post-copy live block migration and rapid deployment.
>>>
>>> The block_stream command can be used to stream a single cluster, to
>>> start streaming the entire device, and to cancel an active stream. It
>>> is easiest to allow the block_stream command to manage streaming for the
>>> entire device but a management tool could use single cluster mode to
>>> throttle the I/O rate.
>
> As discussed earlier, having the management send requests for each
> single cluster doesn't make any sense at all. It wouldn't only throttle
> the I/O rate but bring it down to a level that makes it unusable. What
> you really want is to allow the management to give us a range (offset +
> length) that qemu should stream.

I feel that an iteration interface is problematic whether the
management tool or QEMU decide what to stream. Let's have just the
background streaming operation.

The problem with byte ranges is two-fold. The management tool doesn't
know which regions of the image are allocated so it may do a lot of
nop calls to already-allocated regions with no intelligence as to
where the next sensible offset for streaming is. Secondly, because
the progress and performance of image streaming depend largely on
whether or not clusters are allocated (it is very fast when a cluster
is already allocated and we have no work to do), offsets are bad
indicators of progress to the user. I think it's best not to expose
these details to the management tool at all.

The only reason for the iteration interface was to punt I/O throttling
to the management tool. I think it would be easier to just throttle
inside the streaming function.

Kevin: Are you happy with dropping the iteration interface?
Adam: Is there a libvirt requirement for iteration or could we support
background copy only?

>>> The command synopses are as follows:
>>>
>>> block_stream
>>> ------------
>>>
>>> Copy data from a backing file into a block device.
>>>
>>> If the optional 'all' argument is true, this operation is performed in the
>>> background until the entire backing file has been copied. The status of
>>> ongoing block_stream operations can be checked with query-block-stream.
>
> Not sure if it's a good idea to use a bool argument to turn a command
> into its opposite. I think having a separate command for stopping would
> be cleaner. Something for the QMP folks to decide, though.

git branch new_branch
git branch -D new_branch

Makes sense to me :)

>>> Arguments:
>>>
>>> - all: copy entire device (json-bool, optional)
>>> - stop: stop copying to device (json-bool, optional)
>>> - device: device name (json-string)
>>
>> It must be possible to specify backing file that will be
>> active after streaming finishes (data from that file will not
>> be streamed into active file, of course).
>
> Yes, I think the common base image belongs here.

Right. We need to specify it by filename:

- base: filename of base file (json-string, optional)

Sectors are not copied from the base file and its backing file
chain. The following describes this feature:
Before: base <- sn1 <- sn2 <- sn3 <- vm.img
After: base <- vm.img

> With all = false, where does the streaming begin?

Streaming begins at the start of the image.

> Do you have something like the "current streaming offset" in the state of each BlockDriverState?

Yes, there is a StreamState for each block device that has an
in-progress operation. The progress is saved between block_stream
(without -a) invocations so the caller does not need to specify the
streaming offset as an argument.

Thanks for pointing out these weaknesses in the documentation. It
should really be explained fully.

>>> Return:
>>>
>>> - device: device name (json-string)
>>> - len: size of the device, in bytes (json-int)
>>> - offset: ending offset of the completed I/O, in bytes (json-int)
>
> So you only get the reply when the request has completed? With the
> current monitor, this means that QMP is blocked while we stream, doesn't
> it? How are you supposed to send the stop command then?

Incomplete documentation again, sorry. The block_stream command
behaves as follows:

1. block_stream all returns immediately and the BLOCK_STREAM_COMPLETED
event is raised when streaming completes either successfully or with
an error.

2. block_stream stop returns when the in-progress streaming operation
has been safely stopped.

3. block_stream returns when one iteration of streaming has completed.

> Two of three examples below have an empty return value instead, so they
> are not compliant to this specification.

I will update the documentation, the non-all invocations do not return anything.

>>> Examples:
>>>
>>> -> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
>>> <- { "return": { "device": "virtio0", "len": 10737418240, "offset": 512 } }
>>>
>>> -> { "execute": "block_stream", "arguments": { "all": true, "device":
>>>      "virtio0" } }
>>> <- { "return": {} }
>>>
>>> -> { "execute": "block_stream", "arguments": { "stop": true, "device":
>>>      "virtio0" } }
>>> <- { "return": {} }
>>>
>>> query-block-stream
>>> ------------------
>>>
>>> Show progress of ongoing block_stream operations.
>>>
>>> Return a json-array of all operations. If no operation is active then an empty
>>> array will be returned. Each operation is a json-object with the following
>>> data:
>>>
>>> - device: device name (json-string)
>>> - len: size of the device, in bytes (json-int)
>>> - offset: ending offset of the completed I/O, in bytes (json-int)
>>>
>>> Example:
>>>
>>> -> { "execute": "query-block-stream" }
>>> <- { "return":[
>>>        { "device": "virtio0", "len": 10737418240, "offset": 709632}
>>>      ]
>>>    }
>
> When block_stream is changed, this will have to make the same changes.
>
>>> Block device switching API
>>> ==========================
>>>
>>> Extend the 'change' command to support changing the image file without
>>> media change notification.
>>>
>>> Perhaps we should take the opportunity to add a "format" argument for
>>> image files?
>>>
>>> change
>>> ------
>>>
>>> Change a removable medium or VNC configuration.
>>>
>>> Arguments:
>>>
>>> - "device": device name (json-string)
>>> - "target": filename or item (json-string)
>>> - "arg": additional argument (json-string, optional)
>>> - "notify": whether to notify guest, defaults to true (json-bool, optional)
>>>
>>> Examples:
>>>
>>> 1. Change a removable medium
>>>
>>> -> { "execute": "change",
>>>      "arguments": { "device": "ide1-cd0",
>>>                     "target": "/srv/images/Fedora-12-x86_64-DVD.iso" } }
>>> <- { "return": {} }
>>>
>>> 2. Change a disk without media change notification
>>>
>>> -> { "execute": "change",
>>>      "arguments": { "device": "virtio-blk0",
>>>                     "target": "/srv/images/vm_1.img",
>>>                     "notify": false } }
>>>
>>> 3. Change VNC password
>>>
>>> -> { "execute": "change",
>>>      "arguments": { "device": "vnc", "target": "password",
>>>                     "arg": "foobar1" } }
>>> <- { "return": {} }
>
> I find it rather disturbing that a command like 'change' has made it
> into QMP... Anyway, I don't think this is really what we need.
>
> We have two switches to do. The first one happens before starting the
> copy: Creating the copy, with the source as its backing file, and
> switching to that. The monitor command to achieve this is snapshot_blkdev.

I don't think that creating image files in QEMU is going to work when
running KVM with libvirt (SELinux). The QEMU process does not have
the ability to create new image files. It needs at least a file
descriptor to an empty file or maybe a file that has been created
using qemu-img like I showed above.

> The second switch is after the copy has completed. At this point you can
> remove the source as the backing file and use the common base image
> instead. This is a call to bdrv_change_backing_file(), for which a
> monitor command doesn't exist yet (and unless we want to overload
> 'change' even more, it's not the right command to do this).

I agree. We need the ability to change the backing file (aka qemu-img
rebase -u).

Stefan
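The effect of the proposed 'base' argument (Before: base <- sn1 <- sn2 <- sn3 <- vm.img, After: base <- vm.img) can be illustrated with a toy model. This is only a sketch under stated assumptions, not QEMU code: images are modelled as Python dicts mapping a cluster index to data, and the names Image, chain and stream are hypothetical helpers invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Image:
    """Toy image: a name, its allocated clusters, and an optional backing image."""
    name: str
    clusters: Dict[int, bytes] = field(default_factory=dict)
    backing: Optional["Image"] = None


def chain(image: Image) -> List[str]:
    """Return the image names from the bottom of the chain up to `image`."""
    names = []
    node: Optional[Image] = image
    while node is not None:
        names.append(node.name)
        node = node.backing
    return list(reversed(names))


def stream(top: Image, base: Optional[Image], num_clusters: int) -> None:
    """Copy clusters held by images strictly between `top` and `base` into `top`.

    Clusters that only `base` (or its own backing chain) provides are not
    copied. Afterwards `top` uses `base` as its backing file; with base=None
    the whole chain is flattened and the backing file is dropped.
    """
    for index in range(num_clusters):
        if index in top.clusters:
            continue  # already allocated in the top image, nothing to do
        node = top.backing
        while node is not None and node is not base:
            if index in node.clusters:
                top.clusters[index] = node.clusters[index]
                break
            node = node.backing
    top.backing = base  # the equivalent of the final "rebase -u" step


# Hypothetical chain matching the Before/After picture in the message.
base = Image("base", {0: b"B0", 1: b"B1", 2: b"B2"})
sn1 = Image("sn1", {1: b"S1"}, base)
sn2 = Image("sn2", {}, sn1)
sn3 = Image("sn3", {2: b"S3"}, sn2)
vm = Image("vm.img", {0: b"V0"}, sn3)

print("Before:", " <- ".join(chain(vm)))
stream(vm, base, num_clusters=3)
print("After: ", " <- ".join(chain(vm)))
```

In this model the top image ends up holding its own cluster plus the ones it inherited from sn1 and sn3, while everything that only base provides is still read through the backing link.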
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-12 15:45 ` Stefan Hajnoczi
@ 2011-07-12 16:10 ` Kevin Wolf
  2011-07-13 9:51 ` Stefan Hajnoczi
  2011-07-12 17:47 ` Adam Litke
  1 sibling, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2011-07-12 16:10 UTC
To: Stefan Hajnoczi
Cc: Anthony Liguori, Dor Laor, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity, Adam Litke

On 12.07.2011 17:45, Stefan Hajnoczi wrote:
>>>> Image streaming API
>>>> ===================
>>>>
>>>> For leaf images with copy-on-read semantics, the stream commands allow the user
>>>> to populate local blocks by manually streaming them from the backing image.
>>>> Once all blocks have been streamed, the dependency on the original backing
>>>> image can be removed. Therefore, stream commands can be used to implement
>>>> post-copy live block migration and rapid deployment.
>>>>
>>>> The block_stream command can be used to stream a single cluster, to
>>>> start streaming the entire device, and to cancel an active stream. It
>>>> is easiest to allow the block_stream command to manage streaming for the
>>>> entire device but a management tool could use single cluster mode to
>>>> throttle the I/O rate.
>>
>> As discussed earlier, having the management send requests for each
>> single cluster doesn't make any sense at all. It wouldn't only throttle
>> the I/O rate but bring it down to a level that makes it unusable. What
>> you really want is to allow the management to give us a range (offset +
>> length) that qemu should stream.
>
> I feel that an iteration interface is problematic whether the
> management tool or QEMU decide what to stream. Let's have just the
> background streaming operation.
>
> The problem with byte ranges is two-fold. The management tool doesn't
> know which regions of the image are allocated so it may do a lot of
> nop calls to already-allocated regions with no intelligence as to
> where the next sensible offset for streaming is. Secondly, because
> the progress and performance of image streaming depend largely on
> whether or not clusters are allocated (it is very fast when a cluster
> is already allocated and we have no work to do), offsets are bad
> indicators of progress to the user. I think it's best not to expose
> these details to the management tool at all.
>
> The only reason for the iteration interface was to punt I/O throttling
> to the management tool. I think it would be easier to just throttle
> inside the streaming function.
>
> Kevin: Are you happy with dropping the iteration interface?
> Adam: Is there a libvirt requirement for iteration or could we support
> background copy only?

Okay, works for me.

>>>> The command synopses are as follows:
>>>>
>>>> block_stream
>>>> ------------
>>>>
>>>> Copy data from a backing file into a block device.
>>>>
>>>> If the optional 'all' argument is true, this operation is performed in the
>>>> background until the entire backing file has been copied. The status of
>>>> ongoing block_stream operations can be checked with query-block-stream.
>>
>> Not sure if it's a good idea to use a bool argument to turn a command
>> into its opposite. I think having a separate command for stopping would
>> be cleaner. Something for the QMP folks to decide, though.
>
> git branch new_branch
> git branch -D new_branch
>
> Makes sense to me :)

I don't think you should compare a command line option to a programming
interface. Having a git_create_branch(const char *name, bool delete)
would really look strange. Anyway, probably a matter of taste.

A hint that separate commands would make sense is that the stop command
won't need the other arguments that the start command gets ('all' and
'base').

>>>> Arguments:
>>>>
>>>> - all: copy entire device (json-bool, optional)
>>>> - stop: stop copying to device (json-bool, optional)
>>>> - device: device name (json-string)
>>>
>>> It must be possible to specify backing file that will be
>>> active after streaming finishes (data from that file will not
>>> be streamed into active file, of course).
>>
>> Yes, I think the common base image belongs here.
>
> Right. We need to specify it by filename:
>
> - base: filename of base file (json-string, optional)
>
> Sectors are not copied from the base file and its backing file
> chain. The following describes this feature:
> Before: base <- sn1 <- sn2 <- sn3 <- vm.img
> After: base <- vm.img

Does this imply that a rebase -u happens always after completion?

>> With all = false, where does the streaming begin?
>
> Streaming begins at the start of the image.
>
>> Do you have something like the "current streaming offset" in the state of each BlockDriverState?
>
> Yes, there is a StreamState for each block device that has an
> in-progress operation. The progress is saved between block_stream
> (without -a) invocations so the caller does not need to specify the
> streaming offset as an argument.
>
> Thanks for pointing out these weaknesses in the documentation. It
> should really be explained fully.

I think we also need to describe error cases. For example, what happens
if you try to start streaming while it's already in progress?

>>>> Return:
>>>>
>>>> - device: device name (json-string)
>>>> - len: size of the device, in bytes (json-int)
>>>> - offset: ending offset of the completed I/O, in bytes (json-int)
>>
>> So you only get the reply when the request has completed? With the
>> current monitor, this means that QMP is blocked while we stream, doesn't
>> it? How are you supposed to send the stop command then?
>
> Incomplete documentation again, sorry. The block_stream command
> behaves as follows:
>
> 1. block_stream all returns immediately and the BLOCK_STREAM_COMPLETED
> event is raised when streaming completes either successfully or with
> an error.
>
> 2. block_stream stop returns when the in-progress streaming operation
> has been safely stopped.
>
> 3. block_stream returns when one iteration of streaming has completed.
>
>> Two of three examples below have an empty return value instead, so they
>> are not compliant to this specification.
>
> I will update the documentation, the non-all invocations do not return anything.

Okay, then I don't understand what the 'offset' return value means. The
text says "offset of the completed I/O". If all=true immediately
returns, shouldn't it always be 0?

>> I find it rather disturbing that a command like 'change' has made it
>> into QMP... Anyway, I don't think this is really what we need.
>>
>> We have two switches to do. The first one happens before starting the
>> copy: Creating the copy, with the source as its backing file, and
>> switching to that. The monitor command to achieve this is snapshot_blkdev.
>
> I don't think that creating image files in QEMU is going to work when
> running KVM with libvirt (SELinux). The QEMU process does not have
> the ability to create new image files. It needs at least a file
> descriptor to an empty file or maybe a file that has been created
> using qemu-img like I showed above.

Independent problem. We're really creating an external snapshot here, so
we should use the function for external snapshots. libvirt can
pre-create an empty image file, so that qemu will write the image format
data into it, but we have discussed this before.

Kevin
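The agreement reached here (drop the iteration interface and throttle inside the streaming function) could look roughly like the sketch below. It is an illustration under stated assumptions and not QEMU's implementation: the cluster size is fixed, is_allocated and copy_cluster are hypothetical callbacks supplied by the caller, and the rate limit is a simple average enforced by sleeping. Already-allocated clusters are skipped without any delay, which also shows why a raw offset says little about the remaining work.

```python
import time
from typing import Callable


def stream_throttled(
    num_clusters: int,
    cluster_size: int,
    is_allocated: Callable[[int], bool],
    copy_cluster: Callable[[int], object],
    max_bytes_per_sec: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Copy every unallocated cluster while keeping the average copy rate
    at or below max_bytes_per_sec. Returns the number of bytes copied."""
    start = clock()
    copied = 0
    for index in range(num_clusters):
        if is_allocated(index):
            continue  # no I/O happens, so there is nothing to throttle
        copy_cluster(index)
        copied += cluster_size
        # Earliest point in time at which `copied` bytes may have been moved.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied


# Small demonstration with a fake clock so that it finishes instantly.
now = [0.0]


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


local = {0, 2}  # clusters that are already allocated in the local image
total = stream_throttled(
    num_clusters=4,
    cluster_size=65536,
    is_allocated=lambda i: i in local,
    copy_cluster=local.add,  # "copying" just marks the cluster as allocated
    max_bytes_per_sec=65536.0,
    sleep=fake_sleep,
    clock=lambda: now[0],
)
print(total, "bytes copied in", now[0], "simulated seconds")
```

Injecting sleep and clock keeps the limiter testable; a real implementation would more likely hook into the block layer's own timers.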
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-12 16:10 ` Kevin Wolf
@ 2011-07-13 9:51 ` Stefan Hajnoczi
  2011-07-14 9:39 ` Stefan Hajnoczi
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2011-07-13 9:51 UTC
To: Kevin Wolf
Cc: Anthony Liguori, Dor Laor, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity, Adam Litke

On Tue, Jul 12, 2011 at 5:10 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 12.07.2011 17:45, Stefan Hajnoczi wrote:
>>>>> The command synopses are as follows:
>>>>>
>>>>> block_stream
>>>>> ------------
>>>>>
>>>>> Copy data from a backing file into a block device.
>>>>>
>>>>> If the optional 'all' argument is true, this operation is performed in the
>>>>> background until the entire backing file has been copied. The status of
>>>>> ongoing block_stream operations can be checked with query-block-stream.
>>>
>>> Not sure if it's a good idea to use a bool argument to turn a command
>>> into its opposite. I think having a separate command for stopping would
>>> be cleaner. Something for the QMP folks to decide, though.
>>
>> git branch new_branch
>> git branch -D new_branch
>>
>> Makes sense to me :)
>
> I don't think you should compare a command line option to a programming
> interface. Having a git_create_branch(const char *name, bool delete)
> would really look strange. Anyway, probably a matter of taste.
>
> A hint that separate commands would make sense is that the stop command
> won't need the other arguments that the start command gets ('all' and
> 'base').

I can see your point. Splitting the command might make the code more
straightforward and eliminate the need for checking invalid argument
combinations.

>>>>> Arguments:
>>>>>
>>>>> - all: copy entire device (json-bool, optional)
>>>>> - stop: stop copying to device (json-bool, optional)
>>>>> - device: device name (json-string)
>>>>
>>>> It must be possible to specify backing file that will be
>>>> active after streaming finishes (data from that file will not
>>>> be streamed into active file, of course).
>>>
>>> Yes, I think the common base image belongs here.
>>
>> Right. We need to specify it by filename:
>>
>> - base: filename of base file (json-string, optional)
>>
>> Sectors are not copied from the base file and its backing file
>> chain. The following describes this feature:
>> Before: base <- sn1 <- sn2 <- sn3 <- vm.img
>> After: base <- vm.img
>
> Does this imply that a rebase -u happens always after completion?

Yes. The current implementation removes the backing file when
streaming completes. I think this is the right thing to do since all
sectors are now allocated - there is no way to use the backing file
anymore. If we don't change the backing file on streaming completion,
then the user has to issue an extra command. There's nothing to gain
by doing that so I think rebase -u should happen on completion.

>>> With all = false, where does the streaming begin?
>>
>> Streaming begins at the start of the image.
>>
>>> Do you have something like the "current streaming offset" in the state of each BlockDriverState?
>>
>> Yes, there is a StreamState for each block device that has an
>> in-progress operation. The progress is saved between block_stream
>> (without -a) invocations so the caller does not need to specify the
>> streaming offset as an argument.
>>
>> Thanks for pointing out these weaknesses in the documentation. It
>> should really be explained fully.
>
> I think we also need to describe error cases. For example, what happens
> if you try to start streaming while it's already in progress?

Yes, will do.

>>>>> Return:
>>>>>
>>>>> - device: device name (json-string)
>>>>> - len: size of the device, in bytes (json-int)
>>>>> - offset: ending offset of the completed I/O, in bytes (json-int)
>>>
>>> So you only get the reply when the request has completed? With the
>>> current monitor, this means that QMP is blocked while we stream, doesn't
>>> it? How are you supposed to send the stop command then?
>>
>> Incomplete documentation again, sorry. The block_stream command
>> behaves as follows:
>>
>> 1. block_stream all returns immediately and the BLOCK_STREAM_COMPLETED
>> event is raised when streaming completes either successfully or with
>> an error.
>>
>> 2. block_stream stop returns when the in-progress streaming operation
>> has been safely stopped.
>>
>> 3. block_stream returns when one iteration of streaming has completed.
>>
>>> Two of three examples below have an empty return value instead, so they
>>> are not compliant to this specification.
>>
>> I will update the documentation, the non-all invocations do not return anything.
>
> Okay, then I don't understand what the 'offset' return value means. The
> text says "offset of the completed I/O". If all=true immediately
> returns, shouldn't it always be 0?

The 'offset' value gives you an indication of progress when using the
iteration interface. You don't need to separately call
query-block-stream, instead you can use the return value from the
iteration interface to get progress information. However, let's drop
iteration.

>>> I find it rather disturbing that a command like 'change' has made it
>>> into QMP... Anyway, I don't think this is really what we need.
>>>
>>> We have two switches to do. The first one happens before starting the
>>> copy: Creating the copy, with the source as its backing file, and
>>> switching to that. The monitor command to achieve this is snapshot_blkdev.
>>
>> I don't think that creating image files in QEMU is going to work when
>> running KVM with libvirt (SELinux). The QEMU process does not have
>> the ability to create new image files. It needs at least a file
>> descriptor to an empty file or maybe a file that has been created
>> using qemu-img like I showed above.
>
> Independent problem. We're really creating an external snapshot here, so
> we should use the function for external snapshots. libvirt can
> pre-create an empty image file, so that qemu will write the image format
> data into it, but we have discussed this before.

Cool. If snapshot_blkdev will be able to work in an sVirt environment
then great.

Stefan
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-13 9:51 ` Stefan Hajnoczi
@ 2011-07-14 9:39 ` Stefan Hajnoczi
  2011-07-14 9:55 ` Kevin Wolf
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2011-07-14 9:39 UTC
To: Kevin Wolf, Adam Litke
Cc: Anthony Liguori, Marcelo Tosatti, Stefan Hajnoczi, Dor Laor, qemu-devel, Avi Kivity

Here is the latest interface, I'm now updating existing patches to
implement and test it (not yet using generic image stream):
http://wiki.qemu.org/Features/LiveBlockMigration/ImageStreamingAPI

=Changelog=

v2:
* Remove iteration interface where management tool drives individual
  copy iterations
* Add block_stream_cancel command (like migrate_cancel)
* Add 'base' common backing file argument to block_stream
* Replace QError object in BLOCK_STREAM_COMPLETED with an error string
* Add error documentation

=Image streaming API=

The stream commands populate an image file by streaming data from its
backing file. Once all blocks have been streamed, the dependency on the
original backing image is removed. The stream commands can be used to
implement post-copy live block migration and rapid deployment.

The block_stream command starts streaming the image file. When
streaming completes successfully or with an error, the
BLOCK_STREAM_COMPLETED event is raised.

The progress of a streaming operation can be polled using
query-block-stream. This returns information regarding how much of the
image has been streamed.

The block_stream_cancel command stops streaming the image file. The
image file retains its backing file. A new streaming operation can be
started at a later time.

The command synopses are as follows:

block_stream
------------

Copy data from a backing file into a block device.

The block streaming operation is performed in the background until the
entire backing file has been copied. This command returns immediately
once streaming has started. The status of ongoing block streaming
operations can be checked with query-block-stream. The operation can be
stopped before it has completed using the block_stream_cancel command.

If a base file is specified then sectors are not copied from that base
file and its backing chain. When streaming completes the image file
will have the base file as its backing file. This can be used to stream
a subset of the backing file chain instead of flattening the entire
image.

On successful completion the image file is updated to drop the backing
file.

Arguments:

- device: device name (json-string)
- base: common backing file (json-string, optional)

Errors:

DeviceInUse: streaming is already active on this device
DeviceNotFound: device name is invalid
NotSupported: image streaming is not supported by this device

Events:

On completion the BLOCK_STREAM_COMPLETED event is raised with the
following fields:

- device: device name (json-string)
- len: size of the device, in bytes (json-int)
- offset: last offset of completed I/O, in bytes (json-int)
- error: error message (json-string, only on error)

The completion event is raised both on success and on failure.

Examples:

-> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
<- { "return": {} }

block_stream_cancel
-------------------

Stop an active block streaming operation.

This command returns once the active block streaming operation has been
stopped. It is an error to call this command if no operation is in
progress.

The image file retains its backing file unless the streaming operation
happens to complete just as it is being cancelled.

A new block streaming operation can be started at a later time to
finish copying all data from the backing file.

Arguments:

- device: device name (json-string)

Errors:

DeviceNotActive: streaming is not active on this device
DeviceInUse: cancellation already in progress

Examples:

-> { "execute": "block_stream_cancel", "arguments": { "device": "virtio0" } }
<- { "return": {} }

query-block-stream
------------------

Show progress of ongoing block_stream operations.

Return a json-array of all block streaming operations. If no operation
is active then return an empty array. Each operation is a json-object
with the following data:

- device: device name (json-string)
- len: size of the device, in bytes (json-int)
- offset: ending offset of the completed I/O, in bytes (json-int)

Example:

-> { "execute": "query-block-stream" }
<- { "return":[
       { "device": "virtio0", "len": 10737418240, "offset": 709632}
     ]
   }

=How live block copy works=

Live block copy does the following:

# Create and switch to the destination file: snapshot_blkdev virtio-blk0 destination.$fmt $fmt
# Stream the base into the image file: block_stream -a virtio-blk0

Stefan
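To make the v2 proposal concrete, here is a minimal sketch of how a management tool might build the three commands and turn a query-block-stream reply into a percentage. It only constructs and inspects JSON messages in the shape given in the proposal; the socket transport to the monitor is deliberately left out, and the helper names (block_stream_cmd, block_stream_cancel_cmd, query_block_stream_cmd, progress) as well as the base file path are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def block_stream_cmd(device: str, base: Optional[str] = None) -> Dict[str, Any]:
    """Build a block_stream command; 'base' is optional, as in the proposal."""
    arguments: Dict[str, Any] = {"device": device}
    if base is not None:
        arguments["base"] = base
    return {"execute": "block_stream", "arguments": arguments}


def block_stream_cancel_cmd(device: str) -> Dict[str, Any]:
    """Build a block_stream_cancel command for one device."""
    return {"execute": "block_stream_cancel", "arguments": {"device": device}}


def query_block_stream_cmd() -> Dict[str, Any]:
    """Build the progress query, which takes no arguments."""
    return {"execute": "query-block-stream"}


def progress(reply: Dict[str, Any]) -> Dict[str, float]:
    """Map each device in a query-block-stream reply to percent streamed."""
    result: Dict[str, float] = {}
    for op in reply["return"]:
        length = op["len"]
        # An empty device has nothing left to stream.
        result[op["device"]] = 100.0 * op["offset"] / length if length else 100.0
    return result


# Commands as they would be sent over the monitor connection.
print(json.dumps(block_stream_cmd("virtio0")))
print(json.dumps(block_stream_cmd("virtio0", base="/srv/images/base.img")))
print(json.dumps(block_stream_cancel_cmd("virtio0")))

# Reply copied from the query-block-stream example in the proposal.
sample_reply = {
    "return": [{"device": "virtio0", "len": 10737418240, "offset": 709632}]
}
for device, percent in progress(sample_reply).items():
    print(f"{device}: {percent:.4f}% streamed")
```

As noted earlier in the thread, offset/len is only a rough indicator: clusters that are already allocated are passed over very quickly, so the percentage does not advance at a steady rate.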
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-14 9:39 ` Stefan Hajnoczi
@ 2011-07-14 9:55 ` Kevin Wolf
  2011-07-14 10:00 ` Stefan Hajnoczi
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2011-07-14 9:55 UTC
To: Stefan Hajnoczi
Cc: Anthony Liguori, Dor Laor, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity, Adam Litke

On 14.07.2011 11:39, Stefan Hajnoczi wrote:
> Events:
>
> On completion the BLOCK_STREAM_COMPLETED event is raised with the following
> fields:
>
> - device: device name (json-string)
> - len: size of the device, in bytes (json-int)
> - offset: last offset of completed I/O, in bytes (json-int)
> - error: error message (json-string, only on error)
>
> The completion event is raised both on success and on failure.

Why do len/offset matter in a completion event?

Other than that it looks good to me.

Kevin
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-14 9:55 ` Kevin Wolf
@ 2011-07-14 10:00 ` Stefan Hajnoczi
  2011-07-14 10:07 ` Kevin Wolf
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2011-07-14 10:00 UTC
To: Kevin Wolf
Cc: Anthony Liguori, Dor Laor, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity, Adam Litke

On Thu, Jul 14, 2011 at 10:55 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 14.07.2011 11:39, Stefan Hajnoczi wrote:
>> Events:
>>
>> On completion the BLOCK_STREAM_COMPLETED event is raised with the following
>> fields:
>>
>> - device: device name (json-string)
>> - len: size of the device, in bytes (json-int)
>> - offset: last offset of completed I/O, in bytes (json-int)
>> - error: error message (json-string, only on error)
>>
>> The completion event is raised both on success and on failure.
>
> Why do len/offset matter in a completion event?

For completeness. You could see it as telling you how much progress
was made before an error occurred. In the success case offset will
always be equal to len. But in the error case you get the last
completed progress before error, which could be useful (for example if
you weren't polling but want to display "Streaming virtio-blk0 failed
at 33%").

Stefan
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-14 10:00 ` Stefan Hajnoczi
@ 2011-07-14 10:07 ` Kevin Wolf
  0 siblings, 0 replies; 13+ messages in thread
From: Kevin Wolf @ 2011-07-14 10:07 UTC
To: Stefan Hajnoczi
Cc: Anthony Liguori, Dor Laor, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity, Adam Litke

On 14.07.2011 12:00, Stefan Hajnoczi wrote:
> On Thu, Jul 14, 2011 at 10:55 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> On 14.07.2011 11:39, Stefan Hajnoczi wrote:
>>> Events:
>>>
>>> On completion the BLOCK_STREAM_COMPLETED event is raised with the following
>>> fields:
>>>
>>> - device: device name (json-string)
>>> - len: size of the device, in bytes (json-int)
>>> - offset: last offset of completed I/O, in bytes (json-int)
>>> - error: error message (json-string, only on error)
>>>
>>> The completion event is raised both on success and on failure.
>>
>> Why do len/offset matter in a completion event?
>
> For completeness. You could see it as telling you how much progress
> was made before an error occurred. In the success case offset will
> always be equal to len. But in the error case you get the last
> completed progress before error, which could be useful (for example if
> you weren't polling but want to display "Streaming virtio-blk0 failed
> at 33%").

Makes sense.

We also need to define the possible error messages, and probably use an
enum instead of a string.

Kevin
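The enum suggestion can be sketched from the consumer's side. The thread does not define the set of errors, so the members below (io-error, no-space, cancelled) and the names StreamError and parse_error are purely hypothetical placeholders. The sketch only shows why a closed set is easier for a management tool to handle than free-form text.

```python
from enum import Enum
from typing import Optional


class StreamError(Enum):
    # Hypothetical members: the real set of errors is not defined in this thread.
    IO_ERROR = "io-error"
    NO_SPACE = "no-space"
    CANCELLED = "cancelled"


def parse_error(value: Optional[str]) -> Optional[StreamError]:
    """Map the optional 'error' field to a StreamError, or None on success."""
    if value is None:
        return None
    try:
        return StreamError(value)
    except ValueError:
        # An unknown value is rejected instead of being passed through as text.
        raise ValueError(f"unknown stream error: {value!r}") from None
```

With a fixed set of values the tool can branch on each case and notice when it is talking to a QEMU that reports an error it does not know, which string matching on a message cannot guarantee.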
* Re: [Qemu-devel] live block copy/stream/snapshot discussion
  2011-07-12 15:45 ` Stefan Hajnoczi
  2011-07-12 16:10 ` Kevin Wolf
@ 2011-07-12 17:47 ` Adam Litke
  1 sibling, 0 replies; 13+ messages in thread
From: Adam Litke @ 2011-07-12 17:47 UTC
To: Stefan Hajnoczi
Cc: Kevin Wolf, Anthony Liguori, Dor Laor, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel, Avi Kivity

On 07/12/2011 10:45 AM, Stefan Hajnoczi wrote:
> On Tue, Jul 12, 2011 at 9:06 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> On 11.07.2011 18:32, Marcelo Tosatti wrote:
>>> On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote:
>>>> Kevin, Marcelo,
>>>> I'd like to reach agreement on the QMP/HMP APIs for live block copy
>>>> and image streaming. Libvirt has acked the image streaming APIs that
>>>> Adam proposed and I think they are a good fit for the feature. I have
>>>> described that API below for your review (it's exactly what the QED
>>>> Image Streaming patches provide).
>>>>
>>>> Marcelo: Are you happy with this API for live block copy? Also please
>>>> take a look at the switch command that I am proposing.
>>>>
>>>> Image streaming API
>>>> ===================
>>>>
>>>> For leaf images with copy-on-read semantics, the stream commands allow the user
>>>> to populate local blocks by manually streaming them from the backing image.
>>>> Once all blocks have been streamed, the dependency on the original backing
>>>> image can be removed. Therefore, stream commands can be used to implement
>>>> post-copy live block migration and rapid deployment.
>>>>
>>>> The block_stream command can be used to stream a single cluster, to
>>>> start streaming the entire device, and to cancel an active stream. It
>>>> is easiest to allow the block_stream command to manage streaming for the
>>>> entire device but a management tool could use single cluster mode to
>>>> throttle the I/O rate.
>>
>> As discussed earlier, having the management send requests for each
>> single cluster doesn't make any sense at all. It wouldn't only throttle
>> the I/O rate but bring it down to a level that makes it unusable. What
>> you really want is to allow the management to give us a range (offset +
>> length) that qemu should stream.
>
> I feel that an iteration interface is problematic whether the
> management tool or QEMU decides what to stream. Let's have just the
> background streaming operation.
>
> The problem with byte ranges is two-fold. The management tool doesn't
> know which regions of the image are allocated so it may do a lot of
> nop calls to already-allocated regions with no intelligence as to
> where the next sensible offset for streaming is. Secondly, because
> the progress and performance of image streaming depend largely on
> whether or not clusters are allocated (it is very fast when a cluster
> is already allocated and we have no work to do), offsets are bad
> indicators of progress to the user. I think it's best not to expose
> these details to the management tool at all.
>
> The only reason for the iteration interface was to punt I/O throttling
> to the management tool. I think it would be easier to just throttle
> inside the streaming function.
>
> Kevin: Are you happy with dropping the iteration interface?
> Adam: Is there a libvirt requirement for iteration or could we support
> background copy only?

There is no hard requirement for iteration in libvirt. However, I think
there is a requirement that we report some sort of progress to an end
user. These operations can easily take many minutes (even hours) and
such a long-running operation needs to report progress. I think the
current information returned by 'query-block-stream' is appropriate for
this purpose and should definitely be maintained.

--
Adam Litke
IBM Linux Technology Center
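The background-only model discussed above (skip allocated clusters, throttle inside the streaming function, still report progress) can be sketched as a toy Python loop. This is not QEMU code. The cluster size, the allocation list, and the copy_cluster, sleep and report callbacks are all assumptions made for illustration; report stands in for whatever a progress query such as 'query-block-stream' would expose.

```python
import time
from typing import Callable, List

CLUSTER_SIZE = 64 * 1024  # assumed cluster size in bytes, for illustration only


def stream_device(
    allocated: List[bool],
    copy_cluster: Callable[[int], None],
    max_bytes_per_sec: int,
    sleep: Callable[[float], None] = time.sleep,
    report: Callable[[int, int], None] = lambda offset, length: None,
) -> int:
    """Populate every unallocated cluster and return how many were copied.

    Throttling happens here, inside the streaming loop, so the caller
    never has to iterate over clusters or byte ranges itself.
    """
    length = len(allocated) * CLUSTER_SIZE
    copied = 0
    for index, present in enumerate(allocated):
        if not present:
            copy_cluster(index)          # fetch the cluster from the backing image
            allocated[index] = True
            copied += 1
            # Crude throttle: pause after each copied cluster so the
            # average copy rate stays at or below max_bytes_per_sec.
            sleep(CLUSTER_SIZE / max_bytes_per_sec)
        # Already-allocated clusters cost nothing, so offset jumps forward
        # without any I/O; offset alone says little about remaining time.
        report((index + 1) * CLUSTER_SIZE, length)
    return copied
```

Passing sleep and report as parameters keeps the loop testable without real delays; a real implementation would need a smarter rate limiter and cancellation, which this sketch leaves out.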
Thread overview: 13+ messages
2011-07-05 14:17 [Qemu-devel] live block copy/stream/snapshot discussion Dor Laor
2011-07-11 12:54 ` Stefan Hajnoczi
2011-07-11 14:47 ` Stefan Hajnoczi
2011-07-11 16:32 ` Marcelo Tosatti
2011-07-12  8:06 ` Kevin Wolf
2011-07-12 15:45 ` Stefan Hajnoczi
2011-07-12 16:10 ` Kevin Wolf
2011-07-13  9:51 ` Stefan Hajnoczi
2011-07-14  9:39 ` Stefan Hajnoczi
2011-07-14  9:55 ` Kevin Wolf
2011-07-14 10:00 ` Stefan Hajnoczi
2011-07-14 10:07 ` Kevin Wolf
2011-07-12 17:47 ` Adam Litke