All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC] Removing brdv_read()s from device init functions
@ 2012-07-03  1:08 Peter Crosthwaite
  2012-07-13  8:33 ` Markus Armbruster
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Crosthwaite @ 2012-07-03  1:08 UTC (permalink / raw)
  To: qemu-devel@nongnu.org Developers, Kevin Wolf, Peter Maydell,
	Stefan Hajnoczi, Markus Armbruster
  Cc: Jan Kiszka, Anthony Liguori, Andreas Färber, Paolo Bonzini

Hi All,

This RFC comes from the recent discussion Re coroutines and the block
layer - the current topic of disucussion there has shifted to
"bdrv_read() from device init", so rather than continuing the
discussion as a tangent to the unrelated original topic I'm recreating
the thread.

So anyways, there are several motivations to remove bdrv_read() from
init functions. Kevin points out that it breaks migration as the
reading of machine state at init stage will read out-of-date data. It
also is the source of my coroutine bug. It also is a worst case
performance wise. The big question is how for existing models do we
fix it?

One policy suggested is to ban bdrv_read() from init period and
require devices to Lazy init. That is, on the first load, do the read
you were going to do at init time. Peter Maydell points out that this
has the drawback of late detection of errors, I.E. errors that should
really be picked up at load time are not picked up until first access
(like the bdrv file being too small for the device).

Kevin propsoses a second init function for devices. I guess the way
this is implemented is TYPE_DEVICE or TYPE_OBJECT has  ->init_state()
or some such that is called once the the machine is ready to migrate.
The bdrv_read is moved from ->init() to ->init_state(). You may or may
not convert to bdrv_aio_read() but it doesn't matter for that
approach.

I came up with the idea of just using AIO to delay the read until the
appropriate time. The bdrv_read in init is converted to a
bdrv_aio_read(), and AIO yields the created coroutine when it figures
out that the machine is not migration ready. When the machine is ready
the coroutines is re-entered.

I have CCd some of the QOM people as i noticed they were discussing
some infrastracture for late device init. This may or may not be
related.

Regards,
Peter

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions
  2012-07-03  1:08 [Qemu-devel] [RFC] Removing brdv_read()s from device init functions Peter Crosthwaite
@ 2012-07-13  8:33 ` Markus Armbruster
  2012-07-13  9:18   ` Peter Crosthwaite
  0 siblings, 1 reply; 4+ messages in thread
From: Markus Armbruster @ 2012-07-13  8:33 UTC (permalink / raw)
  To: Peter Crosthwaite
  Cc: Kevin Wolf, Peter Maydell, Anthony Liguori, Jan Kiszka,
	qemu-devel@nongnu.org Developers, Stefan Hajnoczi, Paolo Bonzini,
	Andreas Färber

Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes:

> Hi All,
>
> This RFC comes from the recent discussion Re coroutines and the block
> layer - the current topic of disucussion there has shifted to
> "bdrv_read() from device init", so rather than continuing the
> discussion as a tangent to the unrelated original topic I'm recreating
> the thread.
>
> So anyways, there are several motivations to remove bdrv_read() from
> init functions. Kevin points out that it breaks migration as the
> reading of machine state at init stage will read out-of-date data. It
> also is the source of my coroutine bug. It also is a worst case
> performance wise.

Peter asked me to comment re encypted images.  In my opinion, image
encryption is a "feeling lucky" feature, and we shouldn't complicate QOM
or qdev to work around its deficiencies, such as keys becoming available
too late.  But here goes anyway: it also breaks with encrypted images.
Details in Message-ID: <m3bojl2j0t.fsf@blackfin.pond.sub.org>
http://lists.nongnu.org/archive/html/qemu-devel/2012-07/msg01572.html

>                   The big question is how for existing models do we
> fix it?
>
> One policy suggested is to ban bdrv_read() from init period and
> require devices to Lazy init. That is, on the first load, do the read
> you were going to do at init time. Peter Maydell points out that this
> has the drawback of late detection of errors, I.E. errors that should
> really be picked up at load time are not picked up until first access
> (like the bdrv file being too small for the device).

Outlawing data read/write in init() doesn't mean we must also outlaw
examining image meta-data, such as size.  We just need to be clear what
exactly is permitted.  Even better: catch violations.

> Kevin propsoses a second init function for devices. I guess the way
> this is implemented is TYPE_DEVICE or TYPE_OBJECT has  ->init_state()
> or some such that is called once the the machine is ready to migrate.
> The bdrv_read is moved from ->init() to ->init_state(). You may or may
> not convert to bdrv_aio_read() but it doesn't matter for that
> approach.

We need to have a clear idea of what the various device methods are
supposed to do.

In the case of init() and init_state(): what's their contract?

In particular, which one completes property initialization?  Right now,
we do create() - set properties - init(), and init() checks property
values, supplies defaults, and so forth.  After init(), property values
shouldn't change.

Example: disk geometry.  If the geometry properties haven't been set,
init() supplies defaults.  Computing defaults requires reading the MBR.
If that's no longer allowed in init(), and has to be done in
init_state(), then property initialization completes only in
init_state().  I doubt that's what we want.

The disk geometry problem can be solved by getting rid of the guessing
from MBR.

I believe the more general problem "encryption keys not available in
init()" can also be solved, but it'll involve hairy rearrangement of the
startup code.

> I came up with the idea of just using AIO to delay the read until the
> appropriate time. The bdrv_read in init is converted to a
> bdrv_aio_read(), and AIO yields the created coroutine when it figures
> out that the machine is not migration ready. When the machine is ready
> the coroutines is re-entered.
>
> I have CCd some of the QOM people as i noticed they were discussing
> some infrastracture for late device init. This may or may not be
> related.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions
  2012-07-13  8:33 ` Markus Armbruster
@ 2012-07-13  9:18   ` Peter Crosthwaite
  2012-07-13  9:31     ` Kevin Wolf
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Crosthwaite @ 2012-07-13  9:18 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Kevin Wolf, Peter Maydell, Anthony Liguori, Jan Kiszka,
	qemu-devel@nongnu.org Developers, Stefan Hajnoczi, Paolo Bonzini,
	Andreas Färber

On Fri, Jul 13, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote:
> Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes:
>
>> Hi All,
>>
>> This RFC comes from the recent discussion Re coroutines and the block
>> layer - the current topic of disucussion there has shifted to
>> "bdrv_read() from device init", so rather than continuing the
>> discussion as a tangent to the unrelated original topic I'm recreating
>> the thread.
>>
>> So anyways, there are several motivations to remove bdrv_read() from
>> init functions. Kevin points out that it breaks migration as the
>> reading of machine state at init stage will read out-of-date data. It
>> also is the source of my coroutine bug. It also is a worst case
>> performance wise.
>
> Peter asked me to comment re encypted images.  In my opinion, image
> encryption is a "feeling lucky" feature, and we shouldn't complicate QOM
> or qdev to work around its deficiencies, such as keys becoming available
> too late.  But here goes anyway: it also breaks with encrypted images.
> Details in Message-ID: <m3bojl2j0t.fsf@blackfin.pond.sub.org>
> http://lists.nongnu.org/archive/html/qemu-devel/2012-07/msg01572.html
>
>>                   The big question is how for existing models do we
>> fix it?
>>
>> One policy suggested is to ban bdrv_read() from init period and
>> require devices to Lazy init. That is, on the first load, do the read
>> you were going to do at init time. Peter Maydell points out that this
>> has the drawback of late detection of errors, I.E. errors that should
>> really be picked up at load time are not picked up until first access
>> (like the bdrv file being too small for the device).
>
> Outlawing data read/write in init() doesn't mean we must also outlaw
> examining image meta-data, such as size.  We just need to be clear what
> exactly is permitted.  Even better: catch violations.
>
>> Kevin propsoses a second init function for devices. I guess the way
>> this is implemented is TYPE_DEVICE or TYPE_OBJECT has  ->init_state()
>> or some such that is called once the the machine is ready to migrate.
>> The bdrv_read is moved from ->init() to ->init_state(). You may or may
>> not convert to bdrv_aio_read() but it doesn't matter for that
>> approach.
>
> We need to have a clear idea of what the various device methods are
> supposed to do.
>
> In the case of init() and init_state(): what's their contract?
>

So the concept is the Machine itself (devices and their properties)
are created by init(), and machine state (memory and disk contents,
vmsd type stuff) is created by init_state(). By extension, any
migratable data must not be populated or manipulated until
state_init().

> In particular, which one completes property initialization?  Right now,
> we do create() - set properties - init(), and init() checks property
> values, supplies defaults, and so forth.  After init(), property values
> shouldn't change.
>
> Example: disk geometry.  If the geometry properties haven't been set,
> init() supplies defaults.  Computing defaults requires reading the MBR.
> If that's no longer allowed in init(), and has to be done in
> init_state(), then property initialization completes only in
> init_state().  I doubt that's what we want.

Well another policy is to allow init to bdrv_read, just not allow to
set any device state. Getting geo from the MBR seems OK to me, because
your not actually loading any device state, your acquiring a best
guess at the unspecified device properties. Theres only one corner
case left, and that is changing the MBR in the migration window. But
you can adopt something of a "better than nothin" approach here, as I
dont a see a solution here without changing the Migration process
(that Kevin summarised). This corner case exists even if the code
remains unchanged.

Regards,
Peter
>
> The disk geometry problem can be solved by getting rid of the guessing
> from MBR.
>
> I believe the more general problem "encryption keys not available in
> init()" can also be solved, but it'll involve hairy rearrangement of the
> startup code.
>
>> I came up with the idea of just using AIO to delay the read until the
>> appropriate time. The bdrv_read in init is converted to a
>> bdrv_aio_read(), and AIO yields the created coroutine when it figures
>> out that the machine is not migration ready. When the machine is ready
>> the coroutines is re-entered.
>>
>> I have CCd some of the QOM people as i noticed they were discussing
>> some infrastracture for late device init. This may or may not be
>> related.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions
  2012-07-13  9:18   ` Peter Crosthwaite
@ 2012-07-13  9:31     ` Kevin Wolf
  0 siblings, 0 replies; 4+ messages in thread
From: Kevin Wolf @ 2012-07-13  9:31 UTC (permalink / raw)
  To: Peter Crosthwaite
  Cc: Peter Maydell, Anthony Liguori, Stefan Hajnoczi,
	Markus Armbruster, qemu-devel@nongnu.org Developers, Jan Kiszka,
	Paolo Bonzini, Andreas Färber

Am 13.07.2012 11:18, schrieb Peter Crosthwaite:
> On Fri, Jul 13, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes:
>>> One policy suggested is to ban bdrv_read() from init period and
>>> require devices to Lazy init. That is, on the first load, do the read
>>> you were going to do at init time. Peter Maydell points out that this
>>> has the drawback of late detection of errors, I.E. errors that should
>>> really be picked up at load time are not picked up until first access
>>> (like the bdrv file being too small for the device).
>>
>> Outlawing data read/write in init() doesn't mean we must also outlaw
>> examining image meta-data, such as size.  We just need to be clear what
>> exactly is permitted.  Even better: catch violations.
>>
>>> Kevin propsoses a second init function for devices. I guess the way
>>> this is implemented is TYPE_DEVICE or TYPE_OBJECT has  ->init_state()
>>> or some such that is called once the the machine is ready to migrate.
>>> The bdrv_read is moved from ->init() to ->init_state(). You may or may
>>> not convert to bdrv_aio_read() but it doesn't matter for that
>>> approach.
>>
>> We need to have a clear idea of what the various device methods are
>> supposed to do.
>>
>> In the case of init() and init_state(): what's their contract?
>>
> 
> So the concept is the Machine itself (devices and their properties)
> are created by init(), and machine state (memory and disk contents,
> vmsd type stuff) is created by init_state(). By extension, any
> migratable data must not be populated or manipulated until
> state_init().

Does this mean that we wouldn't call state_init() on a destination VM
that waits for an incoming migration? If so, I think this gives a pretty
clear picture of what belongs where.

>> In particular, which one completes property initialization?  Right now,
>> we do create() - set properties - init(), and init() checks property
>> values, supplies defaults, and so forth.  After init(), property values
>> shouldn't change.
>>
>> Example: disk geometry.  If the geometry properties haven't been set,
>> init() supplies defaults.  Computing defaults requires reading the MBR.
>> If that's no longer allowed in init(), and has to be done in
>> init_state(), then property initialization completes only in
>> init_state().  I doubt that's what we want.
> 
> Well another policy is to allow init to bdrv_read, just not allow to
> set any device state. Getting geo from the MBR seems OK to me, because
> your not actually loading any device state, your acquiring a best
> guess at the unspecified device properties. Theres only one corner
> case left, and that is changing the MBR in the migration window. But
> you can adopt something of a "better than nothin" approach here, as I
> dont a see a solution here without changing the Migration process
> (that Kevin summarised). This corner case exists even if the code
> remains unchanged.

Yes, this is one example of broken code that we have today. Ideally,
bdrv_* shouldn't be allowed at all during init().

In the block layer, it's giving us real trouble that during migration
images can be opened by two qemu instances, and after migration
completion we must invalidate any cached content that the destination
already read (which we can't reliably for the kernel page cache; this is
why migration is only safe with cache=none/directsync in some common
configurations).

So getting rid of bdrv_* in init() wouldn't only fix the obvious
problems without outdated geometry, disk content etc., but also make
life easier in the block layer.

Kevin

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-07-13  9:31 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-03  1:08 [Qemu-devel] [RFC] Removing brdv_read()s from device init functions Peter Crosthwaite
2012-07-13  8:33 ` Markus Armbruster
2012-07-13  9:18   ` Peter Crosthwaite
2012-07-13  9:31     ` Kevin Wolf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.