All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
@ 2012-05-18 17:08 Paolo Bonzini
  2012-05-21  9:29 ` Kevin Wolf
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-18 17:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Federico Simoncelli, Eric Blake, Stefan Hajnoczi,
	Luiz Capitulino

Hi all,

the current block job API is designed for streaming; one property of
streaming is that in case of an error it can be restarted from the point
where it was left.

In QEMU 1.2 I would like to add an implementation of mirroring (live
block copy) based on the block job API and on dirty-block tracking.
Unlike streaming, this operation is not restartable, because canceling
the job turns off dirty-block tracking.

To avoid this problem, my proposal is to add to block jobs options
similar to rerror/werror.  There are a few more details required to get
this to work, and the purpose of this email is to summarize these details.

New QMP commands
================

The following QMP commands are added.

* block-job-pause: Takes a block device (drive), pauses an active
background block operation on that device.
This command returns immediately after marking the active background
block operation for pausing.  It is an error to call this command if no
operation is in progress.  The operation will pause as soon as possible
(it won't pause if the job is being cancelled).  No event is emitted
when the operation is actually paused.  Cancelling a paused job
automatically resumes it.

* block-job-resume: Takes a block device (drive), resume a paused
background block operation on that device.
This command returns immediately after resuming a paused background
block operation.  It is an error to call this command if no operation is
in progress.

A successful block-job-resume operation also resets the iostatus on the
device that is passed.

  Rationale: because block job errors can occur even while the VM is
  stopped, rerror=stop/werror=stop cannot reuse the "cont" monitor
  command to restart a block job.  So block-job-resume is required
  anyway.  Adding block-job-pause makes it simpler to test the new
  feature.

Modified QMP commands
=====================

* block-stream: I propose adding two options to the existing
block-stream command.  If this is rejected, only mirroring will be able
to use rerror/werror.

The new options are of course rerror/werror.  They are enum options,
with the following possible values:

'report': The behavior is the same as in 1.1.  An I/O error,
respectively during a read or a write, will complete the job immediately
with an error code.

'ignore': An I/O error, respectively during a read or a write, will be
ignored.  For streaming, the job will complete with an error and the
backing file will be left in place.  For mirroring, the sector will be
marked again as dirty and re-examined later.

'stop': The VM *and* the job will be paused---the VM is stopped even if
the block device has neither rerror=stop nor werror={stop,enospc}.  The
error is recorded in the block device's iostatus (which can be examined
with query-block).  However, a BLOCK_IO_ERROR event will _never_ pause a
job.

  Rationale: stopping all I/O seems to be the best choice in order
  to limit the number of errors received.  However, due to backwards-
  compatibility with QEMU 1.1 we cannot pause the job when guest-
  initiated I/O causes an error.  We could do that if the block
  device has rerror=stop/werror={stop,enospc}, but it seems more
  complicated to just never do it.

'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

In all cases, even for 'report', the I/O error is reported as a QMP
event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.

It is possible that while stopping the VM a BLOCK_IO_ERROR event will be
reported and will clobber the event from BLOCK_JOB_ERROR, or vice versa.
 This is not really avoidable since stopping the VM completes all
pending I/O requests.  In fact, it is already possible now that a series
of BLOCK_IO_ERROR events are reported with rerror=stop.

After cancelling a job, the job implementation MAY choose to treat stop
and enospc values as report, i.e. complete the job immediately with an
error code, as long as block_job_is_cancelled(job) returns true when the
completion callback is called.

  Open problems: There could be unrecoverable errors in which the job
  will be completed as if rerror/werror were set to report (example:
  error while switching backing files).  Does it make sense to fire an
  event before the point in time where such errors can happen?


Other points specific to mirroring
==================================

* query-block-jobs: The returned JSON object will grow an additional
member, "target".  The target field is a dictionary with two fields,
"info" and "stats" (resembling the output of query-block and
query-blockstat but for the mirroring target).  Member "device" of the
BlockInfo structure will be made optional.

  Rationale: this allows libvirt to observe the high watermark of qcow2
  mirroring targets, and avoids putting a bad iostatus on a working
  migration source.

* cont: even though cont does _not_ restart the block job that reported
an error, the iostatus is reset for all block devices that are attached
to a block job (like the mirroring target).

  Rationale: cont anyway resets the iostatus for the streaming target
  or mirroring source, because there is a single iostatus for the
  device and the job.  It is simpler to do the same also for the
  mirroring target.

* block-job-resume also resets the iostatus on the mirroring target.

* block-job-complete: new command specific to mirroring (switches the
device to the target), not related to the rest of the proposal.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-18 17:08 [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2 Paolo Bonzini
@ 2012-05-21  9:29 ` Kevin Wolf
  2012-05-21 10:02   ` Paolo Bonzini
  2012-05-21 12:20 ` Stefan Hajnoczi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Kevin Wolf @ 2012-05-21  9:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Federico Simoncelli, Luiz Capitulino, Eric Blake, qemu-devel,
	Stefan Hajnoczi

Am 18.05.2012 19:08, schrieb Paolo Bonzini:
> Hi all,
> 
> the current block job API is designed for streaming; one property of
> streaming is that in case of an error it can be restarted from the point
> where it was left.
> 
> In QEMU 1.2 I would like to add an implementation of mirroring (live
> block copy) based on the block job API and on dirty-block tracking.
> Unlike streaming, this operation is not restartable, because canceling
> the job turns off dirty-block tracking.
> 
> To avoid this problem, my proposal is to add to block jobs options
> similar to rerror/werror.  There are a few more details required to get
> this to work, and the purpose of this email is to summarize these details.

I generally like the proposal. For details, I'll comment inline.

> * block-stream: I propose adding two options to the existing
> block-stream command.  If this is rejected, only mirroring will be able
> to use rerror/werror.
> 
> The new options are of course rerror/werror.  They are enum options,
> with the following possible values:

Do we really need separate werror/rerror? For guest operations they
really exist only for historical reasons: werror was there first, and
when we wanted the same functionality, it seemed odd to overload werror
to include reads as well.

For block jobs, where there is no such option yet, we could go with a
single error option, unless there is a use case for separate
werror/rerror options.

> 'report': The behavior is the same as in 1.1.  An I/O error,
> respectively during a read or a write, will complete the job immediately
> with an error code.
> 
> 'ignore': An I/O error, respectively during a read or a write, will be
> ignored.  For streaming, the job will complete with an error and the
> backing file will be left in place.  For mirroring, the sector will be
> marked again as dirty and re-examined later.

This is not really 'ignore' as used for guest operations. There it means
"no matter what the return value is, the operation has succeeded". For
streaming it would mean that it just goes on with the next cluster (and
if we don't cut the backing file link at the end, it would at least not
corrupt anything).

Just like with guest operations it's a mostly useless mode, do we really
need this option?

> 'stop': The VM *and* the job will be paused---the VM is stopped even if
> the block device has neither rerror=stop nor werror={stop,enospc}.  The
> error is recorded in the block device's iostatus (which can be examined
> with query-block).  However, a BLOCK_IO_ERROR event will _never_ pause a
> job.
> 
>   Rationale: stopping all I/O seems to be the best choice in order
>   to limit the number of errors received.  However, due to backwards-
>   compatibility with QEMU 1.1 we cannot pause the job when guest-
>   initiated I/O causes an error.  We could do that if the block
>   device has rerror=stop/werror={stop,enospc}, but it seems more
>   complicated to just never do it.

I don't agree with stopping the VM. Consider a case where the target is
somewhere on the network and you lose the connection, but the primary
image is local on the hard disk. You don't want to stop the VM just
because continuing with the copy isn't possible for the moment.

Of course, this means that you can't reuse the block device's io_status,
but you need a separate job_iostatus.

If the VM is stopped (including BLOCK_IO_ERROR), no I/O should be going
on at all. Do we really keep running the jobs in 1.1? If so, this is a
bug and should be fixed before the release.

> * query-block-jobs: The returned JSON object will grow an additional
> member, "target".  The target field is a dictionary with two fields,
> "info" and "stats" (resembling the output of query-block and
> query-blockstat but for the mirroring target).  Member "device" of the
> BlockInfo structure will be made optional.
> 
>   Rationale: this allows libvirt to observe the high watermark of qcow2
>   mirroring targets, and avoids putting a bad iostatus on a working
>   migration source.

The mirroring target should be present in query-block instead. It is a
user-visible BlockDriverState, so let's treat it like one. We just need
to give it a name.

> * cont: even though cont does _not_ restart the block job that reported
> an error, the iostatus is reset for all block devices that are attached
> to a block job (like the mirroring target).
> 
>   Rationale: cont anyway resets the iostatus for the streaming target
>   or mirroring source, because there is a single iostatus for the
>   device and the job.  It is simpler to do the same also for the
>   mirroring target.

See above, better have a separate iostatus for the job.

> 
> * block-job-resume also resets the iostatus on the mirroring target.
> 
> * block-job-complete: new command specific to mirroring (switches the
> device to the target), not related to the rest of the proposal.

What semantics will block-job-cancel have then for mirroring? Will it be
incompatible with RHEL 6?

Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21  9:29 ` Kevin Wolf
@ 2012-05-21 10:02   ` Paolo Bonzini
  2012-05-21 10:32     ` Kevin Wolf
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-21 10:02 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Federico Simoncelli, Luiz Capitulino, Eric Blake, qemu-devel,
	Stefan Hajnoczi

Il 21/05/2012 11:29, Kevin Wolf ha scritto:
>> * block-stream: I propose adding two options to the existing
>> block-stream command.  If this is rejected, only mirroring will be able
>> to use rerror/werror.
>>
>> The new options are of course rerror/werror.  They are enum options,
>> with the following possible values:
> 
> Do we really need separate werror/rerror? For guest operations they
> really exist only for historical reasons: werror was there first, and
> when we wanted the same functionality, it seemed odd to overload werror
> to include reads as well.
> 
> For block jobs, where there is no such option yet, we could go with a
> single error option, unless there is a use case for separate
> werror/rerror options.

For mirroring rerror=source and werror=target.  I'm not sure there is an
actual usecase, but at least it is more interesting than for devices...

>> 'report': The behavior is the same as in 1.1.  An I/O error,
>> respectively during a read or a write, will complete the job immediately
>> with an error code.
>>
>> 'ignore': An I/O error, respectively during a read or a write, will be
>> ignored.  For streaming, the job will complete with an error and the
>> backing file will be left in place.  For mirroring, the sector will be
>> marked again as dirty and re-examined later.
> 
> This is not really 'ignore' as used for guest operations. There it means
> "no matter what the return value is, the operation has succeeded". For
> streaming it would mean that it just goes on with the next cluster (and
> if we don't cut the backing file link at the end, it would at least not
> corrupt anything).

Yes, for streaming it would mean that it just goes on with the next
cluster and then report an error at the end.

> Just like with guest operations it's a mostly useless mode, do we really
> need this option?

Perhaps we should remove it for guest operations as well; certainly it
makes more sense (if any) for jobs than for guest operations.

>> 'stop': The VM *and* the job will be paused---the VM is stopped even if
>> the block device has neither rerror=stop nor werror={stop,enospc}.  The
>> error is recorded in the block device's iostatus (which can be examined
>> with query-block).  However, a BLOCK_IO_ERROR event will _never_ pause a
>> job.
>>
>>   Rationale: stopping all I/O seems to be the best choice in order
>>   to limit the number of errors received.  However, due to backwards-
>>   compatibility with QEMU 1.1 we cannot pause the job when guest-
>>   initiated I/O causes an error.  We could do that if the block
>>   device has rerror=stop/werror={stop,enospc}, but it seems more
>>   complicated to just never do it.
> 
> I don't agree with stopping the VM. Consider a case where the target is
> somewhere on the network and you lose the connection, but the primary
> image is local on the hard disk. You don't want to stop the VM just
> because continuing with the copy isn't possible for the moment.

I think this is something that management should resolve.  For an error
on the source, stopping the VM makes sense.  I don't think management
cares about what caused an I/O error on a device.  Does it matter if
streaming was active or rather the guest was executing "dd if=/dev/sda
of=/dev/null".

Management may want to keep the VM stopped even for an error on the
target, as long as mirroring has finished the initial synchronization
step.  The VM can perform large amounts of I/O while the job is paused,
and then completing the job can take a large amount of time.

> Of course, this means that you can't reuse the block device's io_status,
> but you need a separate job_iostatus.

For mirroring, source and target are separate devices and have separate
iostatuses anyway.

> If the VM is stopped (including BLOCK_IO_ERROR), no I/O should be going
> on at all. Do we really keep running the jobs in 1.1? If so, this is a
> bug and should be fixed before the release.

Yes, we do.  Do you think it's a problem for migration (thinking more
about it: ouch, yes, it should be)?

We have no pause/resume infrastructure, so we could simply force
synchronous cancellation at the end (before vm_stop_force_state).
Stefan, do you have any free cycle for this?

>> * query-block-jobs: The returned JSON object will grow an additional
>> member, "target".  The target field is a dictionary with two fields,
>> "info" and "stats" (resembling the output of query-block and
>> query-blockstat but for the mirroring target).  Member "device" of the
>> BlockInfo structure will be made optional.
>>
>>   Rationale: this allows libvirt to observe the high watermark of qcow2
>>   mirroring targets, and avoids putting a bad iostatus on a working
>>   migration source.
> 
> The mirroring target should be present in query-block instead. It is a
> user-visible BlockDriverState

It is not user visible, and making it user visible adds a lot more
things to worry about (e.g. making sure you cannot use it in a
device_add).

It reminds me of Xen's renaming of domains (foo->migrating-foo and
foo->zombie-foo), which was an endless source of pain.

I'd rather make the extension of query-block-jobs more generic, with a
list "devices" instead of a member "target", and making up the device
name in the implementation (so you have "device": "target" for mirroring).

>> * block-job-complete: new command specific to mirroring (switches the
>> device to the target), not related to the rest of the proposal.
> 
> What semantics will block-job-cancel have then for mirroring? Will it be
> incompatible with RHEL 6?

They will have the same semantics: make sure that the target matches
some state of the source, and drop it.  block-job-complete adds the
switch to the target (which I'll do with something like bdrv_append,
btw).  It synchronously opens the backing files of the target, and
asynchronously completes the job.

Upstream will not have drive-reopen; this will be incompatible with RHEL6.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 10:02   ` Paolo Bonzini
@ 2012-05-21 10:32     ` Kevin Wolf
  2012-05-21 11:02       ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Kevin Wolf @ 2012-05-21 10:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Federico Simoncelli, Luiz Capitulino, Eric Blake, qemu-devel,
	Stefan Hajnoczi

Am 21.05.2012 12:02, schrieb Paolo Bonzini:
> Il 21/05/2012 11:29, Kevin Wolf ha scritto:
>>> * block-stream: I propose adding two options to the existing
>>> block-stream command.  If this is rejected, only mirroring will be able
>>> to use rerror/werror.
>>>
>>> The new options are of course rerror/werror.  They are enum options,
>>> with the following possible values:
>>
>> Do we really need separate werror/rerror? For guest operations they
>> really exist only for historical reasons: werror was there first, and
>> when we wanted the same functionality, it seemed odd to overload werror
>> to include reads as well.
>>
>> For block jobs, where there is no such option yet, we could go with a
>> single error option, unless there is a use case for separate
>> werror/rerror options.
> 
> For mirroring rerror=source and werror=target.  I'm not sure there is an
> actual usecase, but at least it is more interesting than for devices...

Hm. What if we add an active mirror? Then we can get some kind of COW,
and rerror can happen on the target as well.

If source/target is really the distinction we want to have, should the
available options be specific to the job type, so that you could have
src_error and dst_error for mirroring?

>> Just like with guest operations it's a mostly useless mode, do we really
>> need this option?
> 
> Perhaps we should remove it for guest operations as well; certainly it
> makes more sense (if any) for jobs than for guest operations.

The guest operation one I used once for debugging/reproducing a specific
error condition, and I'm not aware of other users. Maybe it can be
useful for jobs in similar contexts.

>>> 'stop': The VM *and* the job will be paused---the VM is stopped even if
>>> the block device has neither rerror=stop nor werror={stop,enospc}.  The
>>> error is recorded in the block device's iostatus (which can be examined
>>> with query-block).  However, a BLOCK_IO_ERROR event will _never_ pause a
>>> job.
>>>
>>>   Rationale: stopping all I/O seems to be the best choice in order
>>>   to limit the number of errors received.  However, due to backwards-
>>>   compatibility with QEMU 1.1 we cannot pause the job when guest-
>>>   initiated I/O causes an error.  We could do that if the block
>>>   device has rerror=stop/werror={stop,enospc}, but it seems more
>>>   complicated to just never do it.
>>
>> I don't agree with stopping the VM. Consider a case where the target is
>> somewhere on the network and you lose the connection, but the primary
>> image is local on the hard disk. You don't want to stop the VM just
>> because continuing with the copy isn't possible for the moment.
> 
> I think this is something that management should resolve.

Management doesn't necessarily exist.

> For an error
> on the source, stopping the VM makes sense.  I don't think management
> cares about what caused an I/O error on a device.  Does it matter if
> streaming was active or rather the guest was executing "dd if=/dev/sda
> of=/dev/null".

Yes, there's a big difference: If it was a job, the guest can keep
running without any problems. If it was a guest operation, we would have
to return an error to the guest, which may offline the disk in response.

As long as the VM can keep running, we should let it run.

> Management may want to keep the VM stopped even for an error on the
> target, as long as mirroring has finished the initial synchronization
> step.  The VM can perform large amounts of I/O while the job is paused,
> and then completing the job can take a large amount of time.

If management wants to limit the impact of this, it can decide to
explicitly stop the VM when it receives the error event.

>> Of course, this means that you can't reuse the block device's io_status,
>> but you need a separate job_iostatus.
> 
> For mirroring, source and target are separate devices and have separate
> iostatuses anyway.
> 
>> If the VM is stopped (including BLOCK_IO_ERROR), no I/O should be going
>> on at all. Do we really keep running the jobs in 1.1? If so, this is a
>> bug and should be fixed before the release.
> 
> Yes, we do.  Do you think it's a problem for migration (thinking more
> about it: ouch, yes, it should be)?

I'm pretty sure that it is a problem for migration. And it's likely a
problem in more cases.

> We have no pause/resume infrastructure, so we could simply force
> synchronous cancellation at the end (before vm_stop_force_state).
> Stefan, do you have any free cycle for this?

Not very nice, but considering that we're past -rc2 this might be the
only thing we can do now.

>>> * query-block-jobs: The returned JSON object will grow an additional
>>> member, "target".  The target field is a dictionary with two fields,
>>> "info" and "stats" (resembling the output of query-block and
>>> query-blockstat but for the mirroring target).  Member "device" of the
>>> BlockInfo structure will be made optional.
>>>
>>>   Rationale: this allows libvirt to observe the high watermark of qcow2
>>>   mirroring targets, and avoids putting a bad iostatus on a working
>>>   migration source.
>>
>> The mirroring target should be present in query-block instead. It is a
>> user-visible BlockDriverState
> 
> It is not user visible, and making it user visible adds a lot more
> things to worry about (e.g. making sure you cannot use it in a
> device_add).
> 
> It reminds me of Xen's renaming of domains (foo->migrating-foo and
> foo->zombie-foo), which was an endless source of pain.
> 
> I'd rather make the extension of query-block-jobs more generic, with a
> list "devices" instead of a member "target", and making up the device
> name in the implementation (so you have "device": "target" for mirroring).

Well, my idea for blockdev was something like (represented in a -drive
syntax because I don't know what it will look like):

(qemu) blockdev_add file=foo.img,id=src
(qemu) device_add virtio-blk-pci,drive=src
...
(qemu) blockdev_add file=bar.img,id=dst
(qemu) blockdev_mirror foo bar

Once QOM reaches the block layer, I guess we'll want to make all
BlockDriverStates user visible anyway.

Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 10:32     ` Kevin Wolf
@ 2012-05-21 11:02       ` Paolo Bonzini
  2012-05-21 13:07         ` Kevin Wolf
  2012-05-21 13:13         ` Eric Blake
  0 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-21 11:02 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Federico Simoncelli, Luiz Capitulino, Eric Blake, qemu-devel,
	Stefan Hajnoczi

Il 21/05/2012 12:32, Kevin Wolf ha scritto:
> Am 21.05.2012 12:02, schrieb Paolo Bonzini:
>> Il 21/05/2012 11:29, Kevin Wolf ha scritto:
>>>> * block-stream: I propose adding two options to the existing
>>>> block-stream command.  If this is rejected, only mirroring will be able
>>>> to use rerror/werror.
>>>>
>>>> The new options are of course rerror/werror.  They are enum options,
>>>> with the following possible values:
>>>
>>> Do we really need separate werror/rerror? For guest operations they
>>> really exist only for historical reasons: werror was there first, and
>>> when we wanted the same functionality, it seemed odd to overload werror
>>> to include reads as well.
>>>
>>> For block jobs, where there is no such option yet, we could go with a
>>> single error option, unless there is a use case for separate
>>> werror/rerror options.
>>
>> For mirroring rerror=source and werror=target.  I'm not sure there is an
>> actual usecase, but at least it is more interesting than for devices...
> 
> Hm. What if we add an active mirror? Then we can get some kind of COW,
> and rerror can happen on the target as well.

Errors during the read part of COW are always reported as werror.

> If source/target is really the distinction we want to have, should the
> available options be specific to the job type, so that you could have
> src_error and dst_error for mirroring?

Yes, that would make sense.

>>>> 'stop': The VM *and* the job will be paused---the VM is stopped even if
>>>> the block device has neither rerror=stop nor werror={stop,enospc}.  The
>>>> error is recorded in the block device's iostatus (which can be examined
>>>> with query-block).  However, a BLOCK_IO_ERROR event will _never_ pause a
>>>> job.
>>>>
>>>>   Rationale: stopping all I/O seems to be the best choice in order
>>>>   to limit the number of errors received.  However, due to backwards-
>>>>   compatibility with QEMU 1.1 we cannot pause the job when guest-
>>>>   initiated I/O causes an error.  We could do that if the block
>>>>   device has rerror=stop/werror={stop,enospc}, but it seems more
>>>>   complicated to just never do it.
>>>
>>> I don't agree with stopping the VM. Consider a case where the target is
>>> somewhere on the network and you lose the connection, but the primary
>>> image is local on the hard disk. You don't want to stop the VM just
>>> because continuing with the copy isn't possible for the moment.
>>
>> I think this is something that management should resolve.
> 
> Management doesn't necessarily exist.

Even a human sitting at a console is management.  (Though I don't plan
HMP to expose rerror/werror; so you can assume in some sense that
management exists).

>> For an error on the source, stopping the VM makes sense.  I don't
>> think management cares about what caused an I/O error on a device.
>> Does it matter if streaming was active or rather the guest was
>> executing "dd if=/dev/sda of=/dev/null".
> 
> Yes, there's a big difference: If it was a job, the guest can keep
> running without any problems. If it was a guest operation, we would have
> to return an error to the guest, which may offline the disk in response.

Ok, this makes sense.

>> Management may want to keep the VM stopped even for an error on the
>> target, as long as mirroring has finished the initial synchronization
>> step.  The VM can perform large amounts of I/O while the job is paused,
>> and then completing the job can take a large amount of time.
> 
> If management wants to limit the impact of this, it can decide to
> explicitly stop the VM when it receives the error event.

That can be too late.

Eric, is it a problem for libvirt if a pause or target error during
mirroring causes the job to exit steady state?  That means that after a
target error the offset can go back from 100% to <100%.

>>> If the VM is stopped (including BLOCK_IO_ERROR), no I/O should be going
>>> on at all. Do we really keep running the jobs in 1.1? If so, this is a
>>> bug and should be fixed before the release.
>>
>> Yes, we do.  Do you think it's a problem for migration (thinking more
>> about it: ouch, yes, it should be)?
> 
> I'm pretty sure that it is a problem for migration. And it's likely a
> problem in more cases.

On the other hand, in other cases it can be desirable (qemu -S, run
streaming before the VM starts).

>> I'd rather make the extension of query-block-jobs more generic, with a
>> list "devices" instead of a member "target", and making up the device
>> name in the implementation (so you have "device": "target" for mirroring).
> 
> Well, my idea for blockdev was something like (represented in a -drive
> syntax because I don't know what it will look like):
> 
> (qemu) blockdev_add file=foo.img,id=src
> (qemu) device_add virtio-blk-pci,drive=src
> ...
> (qemu) blockdev_add file=bar.img,id=dst
> (qemu) blockdev_mirror foo bar
> 
> Once QOM reaches the block layer, I guess we'll want to make all
> BlockDriverStates user visible anyway.

I don't disagree, but that's very different from what is done with
drive-mirror.

So for now I'll keep my proposed extension of query-block-jobs; later it
can be modified so that the target will have a name if you started the
mirroring with blockdev_mirror instead of drive_mirror.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-18 17:08 [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2 Paolo Bonzini
  2012-05-21  9:29 ` Kevin Wolf
@ 2012-05-21 12:20 ` Stefan Hajnoczi
  2012-05-21 13:59 ` Luiz Capitulino
  2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
  3 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-05-21 12:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Federico Simoncelli, Eric Blake, qemu-devel, Luiz Capitulino

This makes sense given the generic nature of block jobs.  If mirroring
was only for live migration, for example, then we could avoid all this
by choosing a single policy.  As a generic operation it's nice to have
control over error policy.

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 11:02       ` Paolo Bonzini
@ 2012-05-21 13:07         ` Kevin Wolf
  2012-05-21 15:18           ` Paolo Bonzini
  2012-05-21 13:13         ` Eric Blake
  1 sibling, 1 reply; 38+ messages in thread
From: Kevin Wolf @ 2012-05-21 13:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Federico Simoncelli, Luiz Capitulino, Eric Blake, qemu-devel,
	Stefan Hajnoczi

Am 21.05.2012 13:02, schrieb Paolo Bonzini:
> Il 21/05/2012 12:32, Kevin Wolf ha scritto:
>> Am 21.05.2012 12:02, schrieb Paolo Bonzini:
>>> Il 21/05/2012 11:29, Kevin Wolf ha scritto:
>>>>> * block-stream: I propose adding two options to the existing
>>>>> block-stream command.  If this is rejected, only mirroring will be able
>>>>> to use rerror/werror.
>>>>>
>>>>> The new options are of course rerror/werror.  They are enum options,
>>>>> with the following possible values:
>>>>
>>>> Do we really need separate werror/rerror? For guest operations they
>>>> really exist only for historical reasons: werror was there first, and
>>>> when we wanted the same functionality, it seemed odd to overload werror
>>>> to include reads as well.
>>>>
>>>> For block jobs, where there is no such option yet, we could go with a
>>>> single error option, unless there is a use case for separate
>>>> werror/rerror options.
>>>
>>> For mirroring rerror=source and werror=target.  I'm not sure there is an
>>> actual usecase, but at least it is more interesting than for devices...
>>
>> Hm. What if we add an active mirror? Then we can get some kind of COW,
>> and rerror can happen on the target as well.
> 
> Errors during the read part of COW are always reported as werror.

Good point.

Thinking a bit more about it, with an active mirror (i.e. a filter block
driver) things become a bit less clear anyway. The filter would have to
be linked to the job somehow.

Another interesting question is if we'll want to restrict ourselves to
one job at a time forever. But when we stop doing it, we'll need new
APIs anyway.

>> If source/target is really the distinction we want to have, should the
>> available options be specific to the job type, so that you could have
>> src_error and dst_error for mirroring?
> 
> Yes, that would make sense.

Of course, at the same time it also makes the implementation a bit more
complicated.

>>>>> 'stop': The VM *and* the job will be paused---the VM is stopped even if
>>>>> the block device has neither rerror=stop nor werror={stop,enospc}.  The
>>>>> error is recorded in the block device's iostatus (which can be examined
>>>>> with query-block).  However, a BLOCK_IO_ERROR event will _never_ pause a
>>>>> job.
>>>>>
>>>>>   Rationale: stopping all I/O seems to be the best choice in order
>>>>>   to limit the number of errors received.  However, due to backwards-
>>>>>   compatibility with QEMU 1.1 we cannot pause the job when guest-
>>>>>   initiated I/O causes an error.  We could do that if the block
>>>>>   device has rerror=stop/werror={stop,enospc}, but it seems more
>>>>>   complicated to just never do it.
>>>>
>>>> I don't agree with stopping the VM. Consider a case where the target is
>>>> somewhere on the network and you lose the connection, but the primary
>>>> image is local on the hard disk. You don't want to stop the VM just
>>>> because continuing with the copy isn't possible for the moment.
>>>
>>> I think this is something that management should resolve.
>>
>> Management doesn't necessarily exist.
> 
> Even a human sitting at a console is management.  (Though I don't plan
> HMP to expose rerror/werror; so you can assume in some sense that
> management exists).

But it's management that cares about good defaults. :-)

Why not expose the options in HMP?

>>> For an error on the source, stopping the VM makes sense.  I don't
>>> think management cares about what caused an I/O error on a device.
>>> Does it matter if streaming was active or rather the guest was
>>> executing "dd if=/dev/sda of=/dev/null".
>>
>> Yes, there's a big difference: If it was a job, the guest can keep
>> running without any problems. If it was a guest operation, we would have
>> to return an error to the guest, which may offline the disk in response.
> 
> Ok, this makes sense.
> 
>>> Management may want to keep the VM stopped even for an error on the
>>> target, as long as mirroring has finished the initial synchronization
>>> step.  The VM can perform large amounts of I/O while the job is paused,
>>> and then completing the job can take a large amount of time.
>>
>> If management wants to limit the impact of this, it can decide to
>> explicitly stop the VM when it receives the error event.
> 
> That can be too late.
> 
> Eric, is it a problem for libvirt if a pause or target error during
> mirroring causes the job to exit steady state?  That means that after a
> target error the offset can go back from 100% to <100%.

"too late" in what respect? With the passive mirror, we already have a
window in which data is on the source, but not copied to the target.
Does it make a big difference if it is a few bytes more or less?

>>>> If the VM is stopped (including BLOCK_IO_ERROR), no I/O should be going
>>>> on at all. Do we really keep running the jobs in 1.1? If so, this is a
>>>> bug and should be fixed before the release.
>>>
>>> Yes, we do.  Do you think it's a problem for migration (thinking more
>>> about it: ouch, yes, it should be)?
>>
>> I'm pretty sure that it is a problem for migration. And it's likely a
>> problem in more cases.
> 
> On the other hand, in other cases it can be desirable (qemu -S, run
> streaming before the VM starts).

We would have to verify that the whole qemu code can deal with it. I'm
pretty sure that today it can't and we had a related bug before, even
though I can't remember the details.

>>> I'd rather make the extension of query-block-jobs more generic, with a
>>> list "devices" instead of a member "target", and making up the device
>>> name in the implementation (so you have "device": "target" for mirroring).
>>
>> Well, my idea for blockdev was something like (represented in a -drive
>> syntax because I don't know what it will look like):
>>
>> (qemu) blockdev_add file=foo.img,id=src
>> (qemu) device_add virtio-blk-pci,drive=src
>> ...
>> (qemu) blockdev_add file=bar.img,id=dst
>> (qemu) blockdev_mirror foo bar
>>
>> Once QOM reaches the block layer, I guess we'll want to make all
>> BlockDriverStates user visible anyway.
> 
> I don't disagree, but that's very different from what is done with
> drive-mirror.

Yes. Which isn't a problem per se because drive-mirror will be replaced
by blockdev-mirror. However, things like query-block-jobs are probably
going to stay, so they should be designed for the future.

Things like this are why I don't feel overly comfortable with adding
more and more block layer features before we implement -blockdev.

> So for now I'll keep my proposed extension of query-block-jobs; later it
> can be modified so that the target will have a name if you started the
> mirroring with blockdev_mirror instead of drive_mirror.

You mean the same QMP field is a string when the block device was added
with blockdev_add and a dict when it was added with drive_add?
Maintaining this sounds like a nightmare to me.

Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 11:02       ` Paolo Bonzini
  2012-05-21 13:07         ` Kevin Wolf
@ 2012-05-21 13:13         ` Eric Blake
  1 sibling, 0 replies; 38+ messages in thread
From: Eric Blake @ 2012-05-21 13:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Federico Simoncelli, Luiz Capitulino, qemu-devel,
	Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1551 bytes --]

On 05/21/2012 05:02 AM, Paolo Bonzini wrote:

> Eric, is it a problem for libvirt if a pause or target error during
> mirroring causes the job to exit steady state?  That means that after a
> target error the offset can go back from 100% to <100%.

Libvirt would really like to have events present.  If you emit an event
on every change from <100% to 100%, and a counterpart event on every
revert from 100% back to <100%, then that will help tremendously.

For RHEL 6.3, libvirt requires 100% before attempting a drive-reopen.
But since this proposal says qemu 1.2 won't even have a drive-reopen,
but instead have a new block-job-complete, then libvirt should be okay
if block-job-complete gracefully fails any time things are less than
100% (remember, drive-reopen has an attempted effect even if things are
less than 100%, which is why libvirt had to pre-filter for that
situation, but then again, RHEL never downgraded).  In other words, I
think downgrading out of steady state is not a problem, as long as
events exist to help management track that, and as long as the command
to pivot to the destination gracefully requires steady state.

> 
> On the other hand, in other cases it can be desirable (qemu -S, run
> streaming before the VM starts).

Yes, libvirt would really like a way to be able to start a VM with
mirroring active, in order to make it possible to support drive copy on
persistent domains.

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 620 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-18 17:08 [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2 Paolo Bonzini
  2012-05-21  9:29 ` Kevin Wolf
  2012-05-21 12:20 ` Stefan Hajnoczi
@ 2012-05-21 13:59 ` Luiz Capitulino
  2012-05-21 14:09   ` Kevin Wolf
                     ` (2 more replies)
  2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
  3 siblings, 3 replies; 38+ messages in thread
From: Luiz Capitulino @ 2012-05-21 13:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, aliguori, Stefan Hajnoczi, qemu-devel,
	Federico Simoncelli, Eric Blake

On Fri, 18 May 2012 19:08:42 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Modified QMP commands
> =====================

As we have discussed on the ML, we're not going to extend QMP commands.

I understand your reasoning, and since the beginning I thought this was
something useful to do, but we've already settled for not doing this.

I also think that we shouldn't have exceptions, as in practice this means
we're extending commands anyway. So either, we do it or we don't.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 13:59 ` Luiz Capitulino
@ 2012-05-21 14:09   ` Kevin Wolf
  2012-05-21 14:16     ` Anthony Liguori
  2012-05-21 14:17     ` Luiz Capitulino
  2012-05-21 14:10   ` Anthony Liguori
  2012-05-21 14:39   ` Paolo Bonzini
  2 siblings, 2 replies; 38+ messages in thread
From: Kevin Wolf @ 2012-05-21 14:09 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: aliguori, Stefan Hajnoczi, qemu-devel, Federico Simoncelli,
	Paolo Bonzini, Eric Blake

Am 21.05.2012 15:59, schrieb Luiz Capitulino:
> On Fri, 18 May 2012 19:08:42 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> Modified QMP commands
>> =====================
> 
> As we have discussed on the ML, we're not going to extend QMP commands.
> 
> I understand your reasoning, and since the beginning I thought this was
> something useful to do, but we've already settled for not doing this.
> 
> I also think that we shouldn't have exceptions, as in practice this means
> we're extending commands anyway. So either, we do it or we don't.

What's the reason for disallowing command extensions? I'd find it rather
silly to maintain ten different functions that all do basically the
same, just that some of them have more optional parameters than others...

Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 13:59 ` Luiz Capitulino
  2012-05-21 14:09   ` Kevin Wolf
@ 2012-05-21 14:10   ` Anthony Liguori
  2012-05-21 14:16     ` Luiz Capitulino
  2012-05-21 14:17     ` Kevin Wolf
  2012-05-21 14:39   ` Paolo Bonzini
  2 siblings, 2 replies; 38+ messages in thread
From: Anthony Liguori @ 2012-05-21 14:10 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Kevin Wolf, aliguori, Stefan Hajnoczi, qemu-devel,
	Federico Simoncelli, Paolo Bonzini, Eric Blake

On 05/21/2012 08:59 AM, Luiz Capitulino wrote:
> On Fri, 18 May 2012 19:08:42 +0200
> Paolo Bonzini<pbonzini@redhat.com>  wrote:
>
>> Modified QMP commands
>> =====================
>
> As we have discussed on the ML, we're not going to extend QMP commands.
>
> I understand your reasoning, and since the beginning I thought this was
> something useful to do, but we've already settled for not doing this.
>
> I also think that we shouldn't have exceptions, as in practice this means
> we're extending commands anyway. So either, we do it or we don't.

Well, I think we should ask ourselves the following question:

How would a client figure out if the new options are available?

This is the primary reason for not extending existing commands.

Regards,

Anthony Liguori

>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:10   ` Anthony Liguori
@ 2012-05-21 14:16     ` Luiz Capitulino
  2012-05-21 14:19       ` Anthony Liguori
  2012-05-21 14:17     ` Kevin Wolf
  1 sibling, 1 reply; 38+ messages in thread
From: Luiz Capitulino @ 2012-05-21 14:16 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, aliguori, Stefan Hajnoczi, qemu-devel,
	Federico Simoncelli, Paolo Bonzini, Eric Blake

On Mon, 21 May 2012 09:10:40 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> On 05/21/2012 08:59 AM, Luiz Capitulino wrote:
> > On Fri, 18 May 2012 19:08:42 +0200
> > Paolo Bonzini<pbonzini@redhat.com>  wrote:
> >
> >> Modified QMP commands
> >> =====================
> >
> > As we have discussed on the ML, we're not going to extend QMP commands.
> >
> > I understand your reasoning, and since the beginning I thought this was
> > something useful to do, but we've already settled for not doing this.
> >
> > I also think that we shouldn't have exceptions, as in practice this means
> > we're extending commands anyway. So either, we do it or we don't.
> 
> Well, I think we should ask ourselves the following question:
> 
> How would a client figure out if the new options are available?
> 
> This is the primary reason for not extending existing commands.

Yes, I know. But if Paolo implements schema introspection, would you
agree on extending commands.

You seemed to be against even if we had schema introspection.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:09   ` Kevin Wolf
@ 2012-05-21 14:16     ` Anthony Liguori
  2012-05-21 14:17     ` Luiz Capitulino
  1 sibling, 0 replies; 38+ messages in thread
From: Anthony Liguori @ 2012-05-21 14:16 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Federico Simoncelli, Paolo Bonzini, Eric Blake

On 05/21/2012 09:09 AM, Kevin Wolf wrote:
> Am 21.05.2012 15:59, schrieb Luiz Capitulino:
>> On Fri, 18 May 2012 19:08:42 +0200
>> Paolo Bonzini<pbonzini@redhat.com>  wrote:
>>
>>> Modified QMP commands
>>> =====================
>>
>> As we have discussed on the ML, we're not going to extend QMP commands.
>>
>> I understand your reasoning, and since the beginning I thought this was
>> something useful to do, but we've already settled for not doing this.
>>
>> I also think that we shouldn't have exceptions, as in practice this means
>> we're extending commands anyway. So either, we do it or we don't.
>
> What's the reason for disallowing command extensions? I'd find it rather
> silly to maintain ten different functions that all do basically the
> same, just that some of them have more optional parameters than others...

How does a client figure out if the command supports the new options?  And start 
writing the Python code for it before answering...  Even if we exposed the 
schema via QMP, checking via schema interpretation would be a nightmare.

Regards,

Anthony Liguori

>
> Kevin
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:09   ` Kevin Wolf
  2012-05-21 14:16     ` Anthony Liguori
@ 2012-05-21 14:17     ` Luiz Capitulino
  1 sibling, 0 replies; 38+ messages in thread
From: Luiz Capitulino @ 2012-05-21 14:17 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: aliguori, Stefan Hajnoczi, qemu-devel, Federico Simoncelli,
	Paolo Bonzini, Eric Blake

On Mon, 21 May 2012 16:09:28 +0200
Kevin Wolf <kwolf@redhat.com> wrote:

> Am 21.05.2012 15:59, schrieb Luiz Capitulino:
> > On Fri, 18 May 2012 19:08:42 +0200
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> >> Modified QMP commands
> >> =====================
> > 
> > As we have discussed on the ML, we're not going to extend QMP commands.
> > 
> > I understand your reasoning, and since the beginning I thought this was
> > something useful to do, but we've already settled for not doing this.
> > 
> > I also think that we shouldn't have exceptions, as in practice this means
> > we're extending commands anyway. So either, we do it or we don't.
> 
> What's the reason for disallowing command extensions? I'd find it rather
> silly to maintain ten different functions that all do basically the
> same, just that some of them have more optional parameters than others...

See the subthread started by Anthony.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:10   ` Anthony Liguori
  2012-05-21 14:16     ` Luiz Capitulino
@ 2012-05-21 14:17     ` Kevin Wolf
  1 sibling, 0 replies; 38+ messages in thread
From: Kevin Wolf @ 2012-05-21 14:17 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: aliguori, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Federico Simoncelli, Paolo Bonzini, Eric Blake

Am 21.05.2012 16:10, schrieb Anthony Liguori:
> On 05/21/2012 08:59 AM, Luiz Capitulino wrote:
>> On Fri, 18 May 2012 19:08:42 +0200
>> Paolo Bonzini<pbonzini@redhat.com>  wrote:
>>
>>> Modified QMP commands
>>> =====================
>>
>> As we have discussed on the ML, we're not going to extend QMP commands.
>>
>> I understand your reasoning, and since the beginning I thought this was
>> something useful to do, but we've already settled for not doing this.
>>
>> I also think that we shouldn't have exceptions, as in practice this means
>> we're extending commands anyway. So either, we do it or we don't.
> 
> Well, I think we should ask ourselves the following question:
> 
> How would a client figure out if the new options are available?
> 
> This is the primary reason for not extending existing commands.

I agree that trying out if an optional argument is accepted or not isn't
a nice solution, even though it works and is probably better than adding
block-stream-2, ..., block-stream-7.

Didn't you say a while ago that adding a command that would return the
JSON schema wouldn't be all that hard? Maybe it's the right time to
implement it for 1.2.

Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:16     ` Luiz Capitulino
@ 2012-05-21 14:19       ` Anthony Liguori
  2012-05-21 14:26         ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Anthony Liguori @ 2012-05-21 14:19 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Federico Simoncelli,
	Paolo Bonzini, Eric Blake

On 05/21/2012 09:16 AM, Luiz Capitulino wrote:
> On Mon, 21 May 2012 09:10:40 -0500
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>
>> On 05/21/2012 08:59 AM, Luiz Capitulino wrote:
>>> On Fri, 18 May 2012 19:08:42 +0200
>>> Paolo Bonzini<pbonzini@redhat.com>   wrote:
>>>
>>>> Modified QMP commands
>>>> =====================
>>>
>>> As we have discussed on the ML, we're not going to extend QMP commands.
>>>
>>> I understand your reasoning, and since the beginning I thought this was
>>> something useful to do, but we've already settled for not doing this.
>>>
>>> I also think that we shouldn't have exceptions, as in practice this means
>>> we're extending commands anyway. So either, we do it or we don't.
>>
>> Well, I think we should ask ourselves the following question:
>>
>> How would a client figure out if the new options are available?
>>
>> This is the primary reason for not extending existing commands.
>
> Yes, I know. But if Paolo implements schema introspection, would you
> agree on extending commands.
>
> You seemed to be against even if we had schema introspection.

I'm not against it in principle, just in practice.  Today, checking whether a 
command exists is:

commands = qmp.query_commands()

if 'block-stream' in commands:
     # has block-stream

I have a hard time envisioning how schema introspection can be reasonably 
implemented in a client.

If we really feel it's important to extend commands, I'd prefer something like 
per-command capabilities.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:19       ` Anthony Liguori
@ 2012-05-21 14:26         ` Paolo Bonzini
  2012-05-21 14:40           ` Anthony Liguori
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-21 14:26 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Federico Simoncelli, Eric Blake

Il 21/05/2012 16:19, Anthony Liguori ha scritto:
>>
> 
> I'm not against it in principle, just in practice.  Today, checking
> whether a command exists is:
> 
> commands = qmp.query_commands()
> 
> if 'block-stream' in commands:
>     # has block-stream
> 
> I have a hard time envisioning how schema introspection can be
> reasonably implemented in a client.

schema = qmp.query_command_schema('block-stream')
if 'on-error' in schema:
    # has on-error

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 13:59 ` Luiz Capitulino
  2012-05-21 14:09   ` Kevin Wolf
  2012-05-21 14:10   ` Anthony Liguori
@ 2012-05-21 14:39   ` Paolo Bonzini
  2 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-21 14:39 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Kevin Wolf, aliguori, Stefan Hajnoczi, qemu-devel,
	Federico Simoncelli, Eric Blake

Il 21/05/2012 15:59, Luiz Capitulino ha scritto:
> I understand your reasoning, and since the beginning I thought this was
> something useful to do, but we've already settled for not doing this.
> 
> I also think that we shouldn't have exceptions, as in practice this means
> we're extending commands anyway. So either, we do it or we don't.

Ok, I don't really care... streaming works decently even without this
extension, it is only needed for mirroring (and only 1 patch to leave
out of the series).

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:26         ` Paolo Bonzini
@ 2012-05-21 14:40           ` Anthony Liguori
  2012-05-21 14:47             ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Anthony Liguori @ 2012-05-21 14:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Federico Simoncelli, Eric Blake

On 05/21/2012 09:26 AM, Paolo Bonzini wrote:
> Il 21/05/2012 16:19, Anthony Liguori ha scritto:
>>>
>>
>> I'm not against it in principle, just in practice.  Today, checking
>> whether a command exists is:
>>
>> commands = qmp.query_commands()
>>
>> if 'block-stream' in commands:
>>      # has block-stream
>>
>> I have a hard time envisioning how schema introspection can be
>> reasonably implemented in a client.
>
> schema = qmp.query_command_schema('block-stream')

What would schema return?

Did you mean:

if schema['arguments'].has_key('on_error'):

What about adding a parameter to a structure?

BTW, the other problem with adding arguments like this is that it makes a stable 
C API impossible.

Regards,

Anthony Liguori

> if 'on-error' in schema:
>      # has on-error
>
> Paolo
>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:40           ` Anthony Liguori
@ 2012-05-21 14:47             ` Paolo Bonzini
  2012-05-21 15:44               ` Anthony Liguori
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-21 14:47 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Federico Simoncelli, Eric Blake

Il 21/05/2012 16:40, Anthony Liguori ha scritto:
> On 05/21/2012 09:26 AM, Paolo Bonzini wrote:
>> Il 21/05/2012 16:19, Anthony Liguori ha scritto:
>>>>
>>>
>>> I'm not against it in principle, just in practice.  Today, checking
>>> whether a command exists is:
>>>
>>> commands = qmp.query_commands()
>>>
>>> if 'block-stream' in commands:
>>>      # has block-stream
>>>
>>> I have a hard time envisioning how schema introspection can be
>>> reasonably implemented in a client.
>>
>> schema = qmp.query_command_schema('block-stream')
> 
> What would schema return?
> 
> Did you mean:
> 
> if schema['arguments'].has_key('on_error'):

Yes, something like that.

> What about adding a parameter to a structure?

schema = qmp.query_type('foo')
if schema['data'].has_key('xyz')

> BTW, the other problem with adding arguments like this is that it makes
> a stable C API impossible.

Adding fields to structs is not a problem as long as "libqmp" takes care
of all allocations on part of its client.  Adding parameters to commands
requires some smartness, but there are ways to do it:

1) add the first version number to the schema, generate versioned entry
points

qmp_block_stream_v1_1
qmp_block_stream_v1_2

etc.  Provide multiple headers libqmp-1.1.h, libqmp-1.2.h etc. that take
care of #define'ing qmp_block_stream to one of them

2) Same as (1) but use qmp_block_stream for the oldest version having
the command.

3) have fun with the preprocessor (macro with variable arguments,
sizeof, designated initializers, whatever) and emulate keyword arguments
for the C API.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 13:07         ` Kevin Wolf
@ 2012-05-21 15:18           ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-21 15:18 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Federico Simoncelli, Luiz Capitulino, Eric Blake, qemu-devel,
	Stefan Hajnoczi

Il 21/05/2012 15:07, Kevin Wolf ha scritto:
> Am 21.05.2012 13:02, schrieb Paolo Bonzini:
>> Il 21/05/2012 12:32, Kevin Wolf ha scritto:
>>> Am 21.05.2012 12:02, schrieb Paolo Bonzini:
>>>> Il 21/05/2012 11:29, Kevin Wolf ha scritto:
>>> If source/target is really the distinction we want to have, should the
>>> available options be specific to the job type, so that you could have
>>> src_error and dst_error for mirroring?
>>
>> Yes, that would make sense.
> 
> Of course, at the same time it also makes the implementation a bit more
> complicated.

Not much, I'd have rerror/werror already split if I followed my spec.

>>> Management doesn't necessarily exist.
>>
>> Even a human sitting at a console is management.  (Though I don't plan
>> HMP to expose rerror/werror; so you can assume in some sense that
>> management exists).
> 
> But it's management that cares about good defaults. :-)
> 
> Why not expose the options in HMP?

Honestly, because it's a pain.  HMP doesn't have options with arguments,
and I don't really care if HMP users curse at me because they end
out-of-disk-space and their migration is dropped.  If you're using HMP,
more than likely: 1) you can just stop the VM and copy the storage
manually; 2) even if you can't, if you get an EIO you can just stop the
VM and figure the root cause.

As far as storage migration is concerned (and quite a few other
scenarios) HMP is a nice debugging tool, nothing more.

>>>> Management may want to keep the VM stopped even for an error on the
>>>> target, as long as mirroring has finished the initial synchronization
>>>> step.  The VM can perform large amounts of I/O while the job is paused,
>>>> and then completing the job can take a large amount of time.
>>>
>>> If management wants to limit the impact of this, it can decide to
>>> explicitly stop the VM when it receives the error event.
>>
>> That can be too late.
>>
>> Eric, is it a problem for libvirt if a pause or target error during
>> mirroring causes the job to exit steady state?  That means that after a
>> target error the offset can go back from 100% to <100%.
> 
> "too late" in what respect?

Too late to properly show the error... though actually there is no
guarantee that the VM will not clobber the error even with vm_stop, so
it doesn't make a big difference in that, too.

>> So for now I'll keep my proposed extension of query-block-jobs; later it
>> can be modified so that the target will have a name if you started the
>> mirroring with blockdev_mirror instead of drive_mirror.
> 
> You mean the same QMP field is a string when the block device was added
> with blockdev_add and a dict when it was added with drive_add?
> Maintaining this sounds like a nightmare to me.

No, the same QMP field is a dict.  With blockdev_add it will have a
name, and it will just be a shortcut to info block/info blockstat.  With
drive_add it will have no name.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 14:47             ` Paolo Bonzini
@ 2012-05-21 15:44               ` Anthony Liguori
  2012-05-21 15:55                 ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Anthony Liguori @ 2012-05-21 15:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Federico Simoncelli, Eric Blake

On 05/21/2012 09:47 AM, Paolo Bonzini wrote:
> Il 21/05/2012 16:40, Anthony Liguori ha scritto:
>> On 05/21/2012 09:26 AM, Paolo Bonzini wrote:
>>> Il 21/05/2012 16:19, Anthony Liguori ha scritto:
>>>>>
>>>>
>>>> I'm not against it in principle, just in practice.  Today, checking
>>>> whether a command exists is:
>>>>
>>>> commands = qmp.query_commands()
>>>>
>>>> if 'block-stream' in commands:
>>>>       # has block-stream
>>>>
>>>> I have a hard time envisioning how schema introspection can be
>>>> reasonably implemented in a client.
>>>
>>> schema = qmp.query_command_schema('block-stream')
>>
>> What would schema return?
>>
>> Did you mean:
>>
>> if schema['arguments'].has_key('on_error'):
>
> Yes, something like that.
>
>> What about adding a parameter to a structure?
>
> schema = qmp.query_type('foo')
> if schema['data'].has_key('xyz')

This may end up being workable.  I hadn't thought of having granular QMP schema 
querying functions.

>> BTW, the other problem with adding arguments like this is that it makes
>> a stable C API impossible.
>
> Adding fields to structs is not a problem as long as "libqmp" takes care
> of all allocations on part of its client.

Yes, the previous version I wrote handled this padding structures appropriately.

> Adding parameters to commands
> requires some smartness, but there are ways to do it:
>
> 1) add the first version number to the schema, generate versioned entry
> points
>
> qmp_block_stream_v1_1
> qmp_block_stream_v1_2
>
> etc.  Provide multiple headers libqmp-1.1.h, libqmp-1.2.h etc. that take
> care of #define'ing qmp_block_stream to one of them

This is pretty nasty :-/  It also gets very challenging if some options are 
backported and others aren't.

I guess let me ask the following question:

Is the problem that should have an 'options' parameter to this command?  Do we 
anticipate future addition of arguments?  If we can at least minimize the C API 
breakages, I'd be happy.

Regards,

Anthony Liguori

> 2) Same as (1) but use qmp_block_stream for the oldest version having
> the command.
>
> 3) have fun with the preprocessor (macro with variable arguments,
> sizeof, designated initializers, whatever) and emulate keyword arguments
> for the C API.



> Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2
  2012-05-21 15:44               ` Anthony Liguori
@ 2012-05-21 15:55                 ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-21 15:55 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Federico Simoncelli, Eric Blake

Il 21/05/2012 17:44, Anthony Liguori ha scritto:
> It also gets very challenging if some options are backported and others
> aren't.

It gets challenging anyway with backports.  If qmp_block_stream_v1_2
knows of defaults and doesn't send them on the wire, it will work if you
only rely on the subset of options that was backported.  If it doesn't,
downstreams will have to use vendor namespaces.

> I guess let me ask the following question:
> 
> Is the problem that should have an 'options' parameter to this command? 
> Do we anticipate future addition of arguments?

First of all, streaming works nicely as it is.  It is nice to make
things work similarly for streaming and mirroring, but not paramount.

Regarding the latter question: perhaps, but then where do you draw the
line?  Is it okay to add new kinds of data to unions?  (Besides, unions
are quite heavyweight to use in the JSON, and if I wanted to have an
"options" array of unions I would like to put the speed in there too.
So the boat has sailed).

> If we can at least minimize the C API breakages, I'd be happy.

My opinion is this: your goal makes sense, but it is not 100% practical
yet because:

1) we're still far enough from being feature complete (in comparison to
a certain proprietary product that does sport some extra bullets in its
marketing brochures).

2) we don't have a C API, and we don't know how it would look like (e.g.
would it use gio? or libev? or...).  It's good to be in the right
mindset from the beginning, but minimizing changes to hmp.c is by itself
not a particularly useful goal.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-18 17:08 [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2 Paolo Bonzini
                   ` (2 preceding siblings ...)
  2012-05-21 13:59 ` Luiz Capitulino
@ 2012-05-24 13:41 ` Paolo Bonzini
  2012-05-24 14:00   ` Ori Mamluk
                     ` (4 more replies)
  3 siblings, 5 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-24 13:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Eric Blake, Stefan Hajnoczi, Ori Mamluk, Luiz Capitulino

changes from v1:
- added per-job iostatus
- added description of persistent dirty bitmap

The same content is also at
http://wiki.qemu.org/Features/LiveBlockMigration/1.2


QMP changes for error handling
==============================

* query-block-jobs: BlockJobInfo gets two new fields, paused and
io-status.  The job-specific iostatus is completely separate from the
block device iostatus.


* block-stream: I would still like to add on_error to the existing
block-stream command, if only to ease unit testing.  Concerns about the
stability of the API can be handled by adding introspection (exporting
the schema), which is not hard to do.  The new option is an enum with
the following possible values:

'report': The behavior is the same as in 1.1.  An I/O error will
complete the job immediately with an error code.

'ignore': An I/O error, respectively during a read or a write, will be
ignored.  For streaming, the job will complete with an error and the
backing file will be left in place.  For mirroring, the sector will be
marked again as dirty and re-examined later.

'stop': The job will be paused, and the job iostatus (which can be
examined with query-block-jobs) is updated.

'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

In all cases, even for 'report', the I/O error is reported as a QMP
event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.

After cancelling a job, the job implementation MAY choose to treat stop
and enospc values as report, i.e. complete the job immediately with an
error code, as long as block_job_is_cancelled(job) returns true when the
completion callback is called.

  Open problem: There could be unrecoverable errors in which the job
  will always fail as if rerror/werror were set to report (example:
  error while switching backing files).  Does it make sense to fire an
  event before the point in time where such errors can happen?


* block-job-pause: A new QMP command.  Takes a block device (drive),
pauses an active background block operation on that device.  This
command returns immediately after marking the active background block
operation for pausing.  It is an error to call this command if no
operation is in progress.  The operation will pause as soon as possible
(it won't pause if the job is being cancelled).  No event is emitted
when the operation is actually paused.  Cancelling a paused job
automatically resumes it.


* block-job-resume: A new QMP command.  Takes a block device (drive),
resume a paused background block operation on that device.  This command
returns immediately after resuming a paused background block operation.
 It is an error to call this command if no operation is in progress.

A successful block-job-resume operation also resets the iostatus on the
job that is passed.

  Rationale: block-job-resume is required to restart a job that had
  on_error behavior set to 'stop' or 'enospc'.  Adding block-job-pause
  makes it simpler to test the new feature.


Other points specific to mirroring
==================================

* query-block-jobs: The returned JSON object will grow an additional
member, "target".  The target field is a dictionary with two fields,
"info" and "stats" (resembling the output of query-block and
query-blockstat but for the mirroring target).  Member "device" of the
BlockInfo structure will be made optional.

  Rationale: this allows libvirt to observe the high watermark of qcow2
  mirroring targets.

If present, the target has its own iostatus.  It is set when the job is
paused due to an error on the target (together with sending a
BLOCK_JOB_ERROR event). block-job-resume resets it.


* drive-mirror: activates mirroring to a second block device (optionally
creating the image on that second block device).  Compared to the
earlier versions, the "full" argument is replaced by an enum option
"sync" with three values:

- top: copies data in the topmost image to the destination

- full: copies data from all images to the destination

- dirty: copies clusters that are marked in the dirty bitmap to the
destination (see below)


* block-job-complete: force completion of mirroring and switching of the
device to the target, not related to the rest of the proposal.
Synchronously opens backing files if needed, asynchronously completes
the job.


* MIRROR_STATE_CHANGE: new event, triggered every time the
block-job-complete becomes available/unavailable.  Contains the device
name (like device: 'ide0-hd0'), and the state (synced: true/false).


Persistent dirty bitmap
=======================

A persistent dirty bitmap can be used by management for two reasons.
When mirroring is used for continuous replication of storage, to record
I/O operations that happened while the replication server is not
connected or unavailable.  When mirroring is used for storage migration,
to check after a management crash whether the VM must be restarted with
the source or the destination.

The dirty bitmap is synchronized on every bdrv_flush (or on every I/O
operation if the disk operates in writethrough or directsync mode).

The persistent dirty bitmap is created by management, but QEMU needs it
also for drive-mirror.  If so:

* if management has not set up a persistent dirty bitmap, QEMU will use
a simple non-persistent bitmap.

* if management has set up a persistent dirty bitmap and later calls
blockdev-dirty-disable, QEMU will delay the disabling until drive
mirroring also terminates.


The dirty bitmap is managed by these QMP commands:

* blockdev-dirty-enable: takes a file name used for the dirty bitmap,
and an optional granularity.  Setting the granularity will not be
supported in the initial version.

* query-block-dirty: returns statistics about the dirty bitmap: right
now the granularity, the number of bits that are set, and whether QEMU
is using the dirty bitmap or just adding to it.

* blockdev-dirty-disable: disable the dirty bitmap.


The dirty bitmap can also be specified on the command-line with -drive.

The dirty bitmap can be used as follows for storage migration.  To start
migration:

1) blockdev-dirty-enable ide0-hd0 /var/lib/libvirt/dirty/diskname

2) management notes existence of dirty bitmap for /mnt/src/diskname.img
in its private data

3) drive-mirror ide0-hd0 /mnt/dest/diskname.img

4) management notes /mnt/dest/diskname.img as the mirroring target in
its private data

At this point, mirroring has taken a reference to the dirty bitmap.  To
end migration:

5) blockdev-dirty-disable ide0-hd0

6) block-job-complete ide0-hd0

The dirty bitmap remains enabled until the BLOCK_JOB_COMPLETED event is
sent.

7) When management receives the BLOCK_JOB_COMPLETED event, it notes
switch to /mnt/dest/diskname.img (without dirty bitmap nor mirroring
target) in its private data.

If management crashes between (6) and (7), it can examine the dirty
bitmap on disk.  If it is all-zeros, management can restart the virtual
machine with /mnt/dest/diskname.img.  If it has even a single zero bit,
management can restart the virtual machine with the persistent dirty
bitmap enabled, and later issue again a drive-mirror command to restart
from step 4.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
@ 2012-05-24 14:00   ` Ori Mamluk
  2012-05-24 14:19     ` Paolo Bonzini
  2012-05-24 16:57   ` Eric Blake
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Ori Mamluk @ 2012-05-24 14:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Luiz Capitulino, Eric Blake, qemu-devel, Stefan Hajnoczi

On 24/05/2012 16:41, Paolo Bonzini wrote:
> The dirty bitmap is managed by these QMP commands:
>
> * blockdev-dirty-enable: takes a file name used for the dirty bitmap,
> and an optional granularity.  Setting the granularity will not be
> supported in the initial version.
>
> * query-block-dirty: returns statistics about the dirty bitmap: right
> now the granularity, the number of bits that are set, and whether QEMU
> is using the dirty bitmap or just adding to it.
>
> * blockdev-dirty-disable: disable the dirty bitmap.
>
>
When do bits get cleared from the bitmap?
"using the dirty bitmap or just adding to it" - I'm not sure I 
understand what you mean. what's the difference?

Thanks,
Ori

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 14:00   ` Ori Mamluk
@ 2012-05-24 14:19     ` Paolo Bonzini
  2012-05-24 15:32       ` Dor Laor
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-24 14:19 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf, Luiz Capitulino, Eric Blake, qemu-devel, Stefan Hajnoczi

Il 24/05/2012 16:00, Ori Mamluk ha scritto:
> 
>> The dirty bitmap is managed by these QMP commands:
>>
>> * blockdev-dirty-enable: takes a file name used for the dirty bitmap,
>> and an optional granularity.  Setting the granularity will not be
>> supported in the initial version.
>>
>> * query-block-dirty: returns statistics about the dirty bitmap: right
>> now the granularity, the number of bits that are set, and whether QEMU
>> is using the dirty bitmap or just adding to it.
>>
>> * blockdev-dirty-disable: disable the dirty bitmap.
>>
>
> When do bits get cleared from the bitmap?

drive-mirror clears bits from the bitmap as it processes the writes.

In addition to the persistent dirty bitmap, QEMU keeps an in-flight
bitmap.  The in-flight bitmap does not need to be persistent.


Here is how the bitmaps are handled when doing I/O on the source:
- after writing to the source:
  - clear bit in the volatile in-flight bitmap
  - set bit in the persistent dirty bitmap

- after flushing the source:
  - msync the persistent bitmap to disk


Here is how the bitmaps are handled in the drive-mirror coroutine:
- before reading from the source:
  - set bit in the volatile in-flight bitmap

- after writing to the target:
  - if the dirty count will become zero, flush the target
  - if the bit is still set in the in-flight bitmap, clear bit in the
    persistent dirty bitmap
  - clear bit in the volatile in-flight bitmap

> "using the dirty bitmap or just adding to it" - I'm not sure I
> understand what you mean. what's the difference?

Processing the data and removing from the bitmap (mirroring active), or
just setting dirty bits (mirroring inactive).

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 14:19     ` Paolo Bonzini
@ 2012-05-24 15:32       ` Dor Laor
  2012-05-25  8:59         ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Dor Laor @ 2012-05-24 15:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Ori Mamluk, Eric Blake

On 05/24/2012 05:19 PM, Paolo Bonzini wrote:
> Il 24/05/2012 16:00, Ori Mamluk ha scritto:
>>
>>> The dirty bitmap is managed by these QMP commands:
>>>
>>> * blockdev-dirty-enable: takes a file name used for the dirty bitmap,
>>> and an optional granularity.  Setting the granularity will not be
>>> supported in the initial version.
>>>
>>> * query-block-dirty: returns statistics about the dirty bitmap: right
>>> now the granularity, the number of bits that are set, and whether QEMU
>>> is using the dirty bitmap or just adding to it.
>>>
>>> * blockdev-dirty-disable: disable the dirty bitmap.
>>>
>>
>> When do bits get cleared from the bitmap?
>
> drive-mirror clears bits from the bitmap as it processes the writes.
>
> In addition to the persistent dirty bitmap, QEMU keeps an in-flight
> bitmap.  The in-flight bitmap does not need to be persistent.
>
>
> Here is how the bitmaps are handled when doing I/O on the source:
> - after writing to the source:
>    - clear bit in the volatile in-flight bitmap
>    - set bit in the persistent dirty bitmap
>
> - after flushing the source:
>    - msync the persistent bitmap to disk
>
>
> Here is how the bitmaps are handled in the drive-mirror coroutine:
> - before reading from the source:
>    - set bit in the volatile in-flight bitmap
>
> - after writing to the target:
>    - if the dirty count will become zero, flush the target
>    - if the bit is still set in the in-flight bitmap, clear bit in the
>      persistent dirty bitmap
>    - clear bit in the volatile in-flight bitmap

I didn't understand whether the persistent dirty bitmap needs to be 
flushed. This bitmap actually control the persistent known state of the 
destination image. Since w/ mirroring we always have the source in full 
state condition, we can choose to lazy update the destination w/ a risk 
of loosing some content from the last flush (of the destination only side).

This way one can pick the frequency of flushing the persistent bits map 
(and the respective target IO writes). Continuous replication can chose 
a timely based fashion, such as every 5 seconds. A standard mirroring 
job for live copy proposes can pick just to flush once at the end of the 
copy process.

Dor
>
>> "using the dirty bitmap or just adding to it" - I'm not sure I
>> understand what you mean. what's the difference?
>
> Processing the data and removing from the bitmap (mirroring active), or
> just setting dirty bits (mirroring inactive).
>
> Paolo
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
  2012-05-24 14:00   ` Ori Mamluk
@ 2012-05-24 16:57   ` Eric Blake
  2012-05-25  8:48     ` Paolo Bonzini
  2012-05-25  8:28   ` Stefan Hajnoczi
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Eric Blake @ 2012-05-24 16:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Luiz Capitulino, qemu-devel, Ori Mamluk, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 3971 bytes --]

On 05/24/2012 07:41 AM, Paolo Bonzini wrote:
> changes from v1:
> - added per-job iostatus
> - added description of persistent dirty bitmap
> 
> The same content is also at
> http://wiki.qemu.org/Features/LiveBlockMigration/1.2
> 

> * query-block-jobs: BlockJobInfo gets two new fields, paused and
> io-status.  The job-specific iostatus is completely separate from the
> block device iostatus.

Is it still true that for mirror jobs, whether we are mirroring is still
determined by whether 'len'=='offset'?

> * drive-mirror: activates mirroring to a second block device (optionally
> creating the image on that second block device).  Compared to the
> earlier versions, the "full" argument is replaced by an enum option
> "sync" with three values:
> 
> - top: copies data in the topmost image to the destination
> 
> - full: copies data from all images to the destination
> 
> - dirty: copies clusters that are marked in the dirty bitmap to the
> destination (see below)

Different, but at least RHEL used the name __com.redhat_drive-mirror, so
libvirt can cope with the difference.

> 
> 
> * block-job-complete: force completion of mirroring and switching of the
> device to the target, not related to the rest of the proposal.
> Synchronously opens backing files if needed, asynchronously completes
> the job.

Can this be made part of a 'transaction'?  Likewise, can
'block-job-cancel' be made part of a 'transaction'?  Having those two
commands transactionable means that you could copy multiple disks at the
same point in time (block-job-cancel) or pivot multiple disks leaving
the former files consistent at the same point in time
(block-job-complete).  It doesn't have to be done in the first round,
but we should make sure we are not precluding this for future growth.

Also, for the purposes of copying but not pivoting, you only have a safe
copy if 'len'=='offset' at the time of the cancel.  But now that you are
adding the possibility of mirroring reverting to copying, there is a
race where I can probe and see that we are in mirroring, then issue a
'block-job-cancel' to affect a copy operation, but in the meantime
things reverted, and the cancel ends up leaving me with an incomplete
copy.  Maybe 'block-job-complete' should be given an optional boolean
parameter; by default or if the parameter is true, we pivot, but if
false, then we do the same as 'block-job-cancel' to affect a safe copy
if we are in mirroring, while erroring out if we are not in mirroring,
leaving 'block-job-cancel' as a way to always cancel a job but no longer
a safe way to guarantee a copy operation.


> Persistent dirty bitmap
> =======================
> 
> A persistent dirty bitmap can be used by management for two reasons.
> When mirroring is used for continuous replication of storage, to record
> I/O operations that happened while the replication server is not
> connected or unavailable.  When mirroring is used for storage migration,
> to check after a management crash whether the VM must be restarted with
> the source or the destination.

Is there a particular file format for the dirty bitmap?  Is there a
header, or is it just straight bitmap, where the size of the file is an
exact function of size of the file that it maps?

> 
> If management crashes between (6) and (7), it can examine the dirty
> bitmap on disk.  If it is all-zeros,

Obviously, this would be all-zeros in the map portion of the file, any
header portion would not impact this.

> management can restart the virtual
> machine with /mnt/dest/diskname.img.  If it has even a single zero bit,

s/zero/non-zero/

> management can restart the virtual machine with the persistent dirty
> bitmap enabled, and later issue again a drive-mirror command to restart
> from step 4.
> 
> Paolo
> 

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 620 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
  2012-05-24 14:00   ` Ori Mamluk
  2012-05-24 16:57   ` Eric Blake
@ 2012-05-25  8:28   ` Stefan Hajnoczi
  2012-05-25  8:42     ` Kevin Wolf
  2012-05-25  9:43   ` Stefan Hajnoczi
  2012-05-25 16:57   ` Luiz Capitulino
  4 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-05-25  8:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Eric Blake, qemu-devel, Ori Mamluk, Luiz Capitulino

On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
> changes from v1:
> - added per-job iostatus
> - added description of persistent dirty bitmap
> 
> The same content is also at
> http://wiki.qemu.org/Features/LiveBlockMigration/1.2
> 
> 
> QMP changes for error handling
> ==============================
> 
> * query-block-jobs: BlockJobInfo gets two new fields, paused and
> io-status.  The job-specific iostatus is completely separate from the
> block device iostatus.
> 
> 
> * block-stream: I would still like to add on_error to the existing
> block-stream command, if only to ease unit testing.  Concerns about the
> stability of the API can be handled by adding introspection (exporting
> the schema), which is not hard to do.  The new option is an enum with
> the following possible values:
> 
> 'report': The behavior is the same as in 1.1.  An I/O error will
> complete the job immediately with an error code.
> 
> 'ignore': An I/O error, respectively during a read or a write, will be
> ignored.  For streaming, the job will complete with an error and the
> backing file will be left in place.  For mirroring, the sector will be
> marked again as dirty and re-examined later.
> 
> 'stop': The job will be paused, and the job iostatus (which can be
> examined with query-block-jobs) is updated.
> 
> 'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

'stop' and 'enospc' must raise a QMP event so the user is notified when
the job is paused.  Are the details on this missing from this draft?

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-25  8:28   ` Stefan Hajnoczi
@ 2012-05-25  8:42     ` Kevin Wolf
  0 siblings, 0 replies; 38+ messages in thread
From: Kevin Wolf @ 2012-05-25  8:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini, Eric Blake, qemu-devel, Ori Mamluk, Luiz Capitulino

Am 25.05.2012 10:28, schrieb Stefan Hajnoczi:
> On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
>> changes from v1:
>> - added per-job iostatus
>> - added description of persistent dirty bitmap
>>
>> The same content is also at
>> http://wiki.qemu.org/Features/LiveBlockMigration/1.2
>>
>>
>> QMP changes for error handling
>> ==============================
>>
>> * query-block-jobs: BlockJobInfo gets two new fields, paused and
>> io-status.  The job-specific iostatus is completely separate from the
>> block device iostatus.
>>
>>
>> * block-stream: I would still like to add on_error to the existing
>> block-stream command, if only to ease unit testing.  Concerns about the
>> stability of the API can be handled by adding introspection (exporting
>> the schema), which is not hard to do.  The new option is an enum with
>> the following possible values:
>>
>> 'report': The behavior is the same as in 1.1.  An I/O error will
>> complete the job immediately with an error code.
>>
>> 'ignore': An I/O error, respectively during a read or a write, will be
>> ignored.  For streaming, the job will complete with an error and the
>> backing file will be left in place.  For mirroring, the sector will be
>> marked again as dirty and re-examined later.
>>
>> 'stop': The job will be paused, and the job iostatus (which can be
>> examined with query-block-jobs) is updated.
>>
>> 'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

May I quote the next two lines as well?

"In all cases, even for 'report', the I/O error is reported as a QMP
event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR."

> 'stop' and 'enospc' must raise a QMP event so the user is notified when
> the job is paused.  Are the details on this missing from this draft?

No, just from your quote. :-)

Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 16:57   ` Eric Blake
@ 2012-05-25  8:48     ` Paolo Bonzini
  2012-05-25 15:02       ` Eric Blake
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-25  8:48 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Luiz Capitulino, qemu-devel, Ori Mamluk, Stefan Hajnoczi

Il 24/05/2012 18:57, Eric Blake ha scritto:
> On 05/24/2012 07:41 AM, Paolo Bonzini wrote:
>> changes from v1:
>> - added per-job iostatus
>> - added description of persistent dirty bitmap
>>
>> The same content is also at
>> http://wiki.qemu.org/Features/LiveBlockMigration/1.2
>>
> 
>> * query-block-jobs: BlockJobInfo gets two new fields, paused and
>> io-status.  The job-specific iostatus is completely separate from the
>> block device iostatus.
> 
> Is it still true that for mirror jobs, whether we are mirroring is still
> determined by whether 'len'=='offset'?

Yes.

>> * block-job-complete: force completion of mirroring and switching of the
>> device to the target, not related to the rest of the proposal.
>> Synchronously opens backing files if needed, asynchronously completes
>> the job.
> 
> Can this be made part of a 'transaction'?  Likewise, can
> 'block-job-cancel' be made part of a 'transaction'?

Both of them are asynchronous so they would not create an atomic
snapshot.  We could add it later, in the meanwhile you can wrap with
fsfreeze/fsthaw.

> But now that you are adding the possibility of mirroring reverting
> to copying, there is a race where I can probe and see that we are
> in mirroring, then issue a 'block-job-cancel' to affect a copy operation,
> but in the meantime things reverted, and the cancel ends up leaving me
> with an incomplete copy.

Hmm, that's right.  But then this can only happen if you have an error
in the target.  I can make block-job-cancel _not_ resume a paused job.
Would that satisfy your needs?

>> Persistent dirty bitmap
>> =======================
>>
>> A persistent dirty bitmap can be used by management for two reasons.
>> When mirroring is used for continuous replication of storage, to record
>> I/O operations that happened while the replication server is not
>> connected or unavailable.  When mirroring is used for storage migration,
>> to check after a management crash whether the VM must be restarted with
>> the source or the destination.
> 
> Is there a particular file format for the dirty bitmap?  Is there a
> header, or is it just straight bitmap, where the size of the file is an
> exact function of size of the file that it maps?

I think it could be just a straight bitmap.

>> management can restart the virtual
>> machine with /mnt/dest/diskname.img.  If it has even a single zero bit,
> 
> s/zero/non-zero/

Doh, of course.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 15:32       ` Dor Laor
@ 2012-05-25  8:59         ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-25  8:59 UTC (permalink / raw)
  To: dlaor
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Ori Mamluk, Eric Blake

Il 24/05/2012 17:32, Dor Laor ha scritto:
> I didn't understand whether the persistent dirty bitmap needs to be
> flushed. This bitmap actually control the persistent known state of the
> destination image. Since w/ mirroring we always have the source in full
> state condition, we can choose to lazy update the destination w/ a risk
> of loosing some content from the last flush (of the destination only side).

Flushing the dirty bitmap after writing to the target can indeed be
tuned for the application.  However, it is not optional to msync the
bitmap when flushing the source.  If the source has a power loss, it has
to know what to retransmit.

> This way one can pick the frequency of flushing the persistent bits map
> (and the respective target IO writes).  Continuous replication can chose
> a timely based fashion, such as every 5 seconds.

But then the target is not able to restore a consistent state (which is
a state where the dirty bitmap is all-zeros).

The scheme above is roughly what DRBD does.  But in any case,
optimizations need to be worked out with a model checker, it's too delicate.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
                     ` (2 preceding siblings ...)
  2012-05-25  8:28   ` Stefan Hajnoczi
@ 2012-05-25  9:43   ` Stefan Hajnoczi
  2012-05-25 11:17     ` Paolo Bonzini
  2012-05-25 16:57   ` Luiz Capitulino
  4 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-05-25  9:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Eric Blake, qemu-devel, Ori Mamluk, Luiz Capitulino

On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
> Persistent dirty bitmap
> =======================
> 
> A persistent dirty bitmap can be used by management for two reasons.
> When mirroring is used for continuous replication of storage, to record
> I/O operations that happened while the replication server is not
> connected or unavailable.

For incremental backups we also need a dirty bitmap API.  This allows
backup software to determine which blocks are "dirty" in a snapshot.
The backup software will only copy out the dirty blocks.

(For external snapshots the "dirty bitmap" is actually the qemu-io -c
map of is_allocated clusters.  But for internal snapshots it would be a
diff of image metadata which results in a real bitmap.)

So it seems like a dirty bitmap API will be required for continuous
replication (so the server can find out what it missed) and for
incremental backup.  If there is commonality here we should work
together so the same API can do both.

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-25  9:43   ` Stefan Hajnoczi
@ 2012-05-25 11:17     ` Paolo Bonzini
  2012-05-25 12:09       ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-25 11:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Eric Blake, qemu-devel, Ori Mamluk, Luiz Capitulino

Il 25/05/2012 11:43, Stefan Hajnoczi ha scritto:
> On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
>> Persistent dirty bitmap
>> =======================
>>
>> A persistent dirty bitmap can be used by management for two reasons.
>> When mirroring is used for continuous replication of storage, to record
>> I/O operations that happened while the replication server is not
>> connected or unavailable.
> 
> For incremental backups we also need a dirty bitmap API.  This allows
> backup software to determine which blocks are "dirty" in a snapshot.
> The backup software will only copy out the dirty blocks.
> 
> (For external snapshots the "dirty bitmap" is actually the qemu-io -c
> map of is_allocated clusters.  But for internal snapshots it would be a
> diff of image metadata which results in a real bitmap.)

Perhaps that be simply a new qemu-img subcommand?  It should be possible
to run it while the VM is offline.  Then the file that is produced could
be fed to blockdev-dirty-enable.

Paolo

> So it seems like a dirty bitmap API will be required for continuous
> replication (so the server can find out what it missed) and for
> incremental backup.  If there is commonality here we should work
> together so the same API can do both.
> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-25 11:17     ` Paolo Bonzini
@ 2012-05-25 12:09       ` Stefan Hajnoczi
  2012-05-25 13:25         ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-05-25 12:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Eric Blake, qemu-devel, Ori Mamluk, Luiz Capitulino

On Fri, May 25, 2012 at 01:17:04PM +0200, Paolo Bonzini wrote:
> Il 25/05/2012 11:43, Stefan Hajnoczi ha scritto:
> > On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
> >> Persistent dirty bitmap
> >> =======================
> >>
> >> A persistent dirty bitmap can be used by management for two reasons.
> >> When mirroring is used for continuous replication of storage, to record
> >> I/O operations that happened while the replication server is not
> >> connected or unavailable.
> > 
> > For incremental backups we also need a dirty bitmap API.  This allows
> > backup software to determine which blocks are "dirty" in a snapshot.
> > The backup software will only copy out the dirty blocks.
> > 
> > (For external snapshots the "dirty bitmap" is actually the qemu-io -c
> > map of is_allocated clusters.  But for internal snapshots it would be a
> > diff of image metadata which results in a real bitmap.)
> 
> Perhaps that be simply a new qemu-img subcommand?  It should be possible
> to run it while the VM is offline.  Then the file that is produced could
> be fed to blockdev-dirty-enable.

For both continuous replication and incremental backups we cannot
require that the guest is shut down in order to collect the dirty
bitmap, I think.

Also, anything to do with internal snapshots needs to be online since
QEMU will still have the image file open and be writing to it.  With
backing files we have a little bit more wiggle room when QEMU has
backing files open read-only.

I think we really need a libvirt API because a local file not only has
permissions issues but also is not network transparent.  The continuous
replication server runs on another machine, how will it access the dirty
bitmap file?

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-25 12:09       ` Stefan Hajnoczi
@ 2012-05-25 13:25         ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-25 13:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Eric Blake, qemu-devel, Ori Mamluk, Luiz Capitulino

Il 25/05/2012 14:09, Stefan Hajnoczi ha scritto:
>> > 
>> > Perhaps that be simply a new qemu-img subcommand?  It should be possible
>> > to run it while the VM is offline.  Then the file that is produced could
>> > be fed to blockdev-dirty-enable.
> For both continuous replication and incremental backups we cannot
> require that the guest is shut down in order to collect the dirty
> bitmap, I think.

Yes, that is a problem for internal snapshots.  For external snapshots,
see the drive-mirror command's sync parameter.  Perhaps we can add a
blockdev-dirty-fill command that adds allocated sectors up to a given
base to the dirty bitmap.

> I think we really need a libvirt API because a local file not only has
> permissions issues but also is not network transparent.  The continuous
> replication server runs on another machine, how will it access the dirty
> bitmap file?

This is still using a "push" model where the dirty data is sent from
QEMU to the replication server, so the dirty bitmap is not needed on the
machine that runs the replication server---only on the machine that runs
the VM (to preserve the bitmap across VM shutdowns including power
loss).  It has to be stored on shared storage if you plan to run the VM
from multiple hosts.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-25  8:48     ` Paolo Bonzini
@ 2012-05-25 15:02       ` Eric Blake
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Blake @ 2012-05-25 15:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Luiz Capitulino, qemu-devel, Ori Mamluk, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 2893 bytes --]

On 05/25/2012 02:48 AM, Paolo Bonzini wrote:

>>> * block-job-complete: force completion of mirroring and switching of the
>>> device to the target, not related to the rest of the proposal.
>>> Synchronously opens backing files if needed, asynchronously completes
>>> the job.
>>
>> Can this be made part of a 'transaction'?  Likewise, can
>> 'block-job-cancel' be made part of a 'transaction'?
> 
> Both of them are asynchronous so they would not create an atomic
> snapshot.  We could add it later, in the meanwhile you can wrap with
> fsfreeze/fsthaw.

It doesn't have to be right away, I just want to make sure that we
aren't excluding it from a possible future extension, because it _does_
sound useful.

> 
>> But now that you are adding the possibility of mirroring reverting
>> to copying, there is a race where I can probe and see that we are
>> in mirroring, then issue a 'block-job-cancel' to affect a copy operation,
>> but in the meantime things reverted, and the cancel ends up leaving me
>> with an incomplete copy.
> 
> Hmm, that's right.  But then this can only happen if you have an error
> in the target.  I can make block-job-cancel _not_ resume a paused job.
> Would that satisfy your needs?

I'm not sure I follow what you are asking.  My scenario is:

call 'drive-mirror' to start a job
'block-job-complete' fails because job is not ready, but the job is not
affected
wait for the event telling me we are in mirroring phase
start issuing my call to 'block-job-complete' to pivot
something happens where we are no longer mirroring
'block-job-complete' fails because we are not mirroring - good

call 'drive-mirror' to start a job
calling 'block-job-cancel' would abort the job, which is not what I want
wait for the event telling me we are in mirroring phase
start issuing my call to 'block-job-cancel' to cleanly leave the copy behind
something happens where we are no longer mirroring
'block-job-cancel' completes, but did not leave a complete mirror - bad

On the other hand, if I'm _not_ trying to make a clean copy, then I want
'block-job-cancel' to work as fast as possible, no matter what.

I'm not sure why having block-job-cancel resume or not resume a job
would make a difference.  What I really am asking for here is a way to
have some command (perhaps 'block-job-complete' but with an optional
flag set to a non-default value) that says I want to complete the job as
a clean copy, but revert back to the source rather than pivot to the
destination, and to cleanly fail with the job still around for
additional actions if I cannot get a clean copy at the current moment,
in the same way that the default 'block-job-complete' cleanly fails but
does not kill the job if I'm not mirroring yet.

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 620 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
  2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
                     ` (3 preceding siblings ...)
  2012-05-25  9:43   ` Stefan Hajnoczi
@ 2012-05-25 16:57   ` Luiz Capitulino
  4 siblings, 0 replies; 38+ messages in thread
From: Luiz Capitulino @ 2012-05-25 16:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Eric Blake, qemu-devel, Ori Mamluk, Stefan Hajnoczi

On Thu, 24 May 2012 15:41:29 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> * block-stream: I would still like to add on_error to the existing
> block-stream command, if only to ease unit testing.  Concerns about the
> stability of the API can be handled by adding introspection (exporting
> the schema), which is not hard to do.  The new option is an enum with
> the following possible values:
> 
> 'report': The behavior is the same as in 1.1.  An I/O error will
> complete the job immediately with an error code.
> 
> 'ignore': An I/O error, respectively during a read or a write, will be
> ignored.  For streaming, the job will complete with an error and the
> backing file will be left in place.  For mirroring, the sector will be
> marked again as dirty and re-examined later.
> 
> 'stop': The job will be paused, and the job iostatus (which can be
> examined with query-block-jobs) is updated.
> 
> 'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.
> 
> In all cases, even for 'report', the I/O error is reported as a QMP
> event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.
> 
> After cancelling a job, the job implementation MAY choose to treat stop
> and enospc values as report, i.e. complete the job immediately with an
> error code, as long as block_job_is_cancelled(job) returns true when the
> completion callback is called.
> 
>   Open problem: There could be unrecoverable errors in which the job
>   will always fail as if rerror/werror were set to report (example:
>   error while switching backing files).  Does it make sense to fire an
>   event before the point in time where such errors can happen?

You mean, you fire the event before the point the error can happen but
the operation keeps running if it doesn't fail?

If that's the case, I think that the returned error is enough for the mngt app
to decide what to do.

> * block-job-pause: A new QMP command.  Takes a block device (drive),
> pauses an active background block operation on that device.  This
> command returns immediately after marking the active background block
> operation for pausing.  It is an error to call this command if no
> operation is in progress.  The operation will pause as soon as possible
> (it won't pause if the job is being cancelled).  No event is emitted
> when the operation is actually paused.  Cancelling a paused job
> automatically resumes it.

Is pausing guaranteed to succeed?

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2012-05-25 16:57 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-18 17:08 [Qemu-devel] Proposal for extensions of block job commands in QEMU 1.2 Paolo Bonzini
2012-05-21  9:29 ` Kevin Wolf
2012-05-21 10:02   ` Paolo Bonzini
2012-05-21 10:32     ` Kevin Wolf
2012-05-21 11:02       ` Paolo Bonzini
2012-05-21 13:07         ` Kevin Wolf
2012-05-21 15:18           ` Paolo Bonzini
2012-05-21 13:13         ` Eric Blake
2012-05-21 12:20 ` Stefan Hajnoczi
2012-05-21 13:59 ` Luiz Capitulino
2012-05-21 14:09   ` Kevin Wolf
2012-05-21 14:16     ` Anthony Liguori
2012-05-21 14:17     ` Luiz Capitulino
2012-05-21 14:10   ` Anthony Liguori
2012-05-21 14:16     ` Luiz Capitulino
2012-05-21 14:19       ` Anthony Liguori
2012-05-21 14:26         ` Paolo Bonzini
2012-05-21 14:40           ` Anthony Liguori
2012-05-21 14:47             ` Paolo Bonzini
2012-05-21 15:44               ` Anthony Liguori
2012-05-21 15:55                 ` Paolo Bonzini
2012-05-21 14:17     ` Kevin Wolf
2012-05-21 14:39   ` Paolo Bonzini
2012-05-24 13:41 ` [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication] Paolo Bonzini
2012-05-24 14:00   ` Ori Mamluk
2012-05-24 14:19     ` Paolo Bonzini
2012-05-24 15:32       ` Dor Laor
2012-05-25  8:59         ` Paolo Bonzini
2012-05-24 16:57   ` Eric Blake
2012-05-25  8:48     ` Paolo Bonzini
2012-05-25 15:02       ` Eric Blake
2012-05-25  8:28   ` Stefan Hajnoczi
2012-05-25  8:42     ` Kevin Wolf
2012-05-25  9:43   ` Stefan Hajnoczi
2012-05-25 11:17     ` Paolo Bonzini
2012-05-25 12:09       ` Stefan Hajnoczi
2012-05-25 13:25         ` Paolo Bonzini
2012-05-25 16:57   ` Luiz Capitulino

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.