Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions

From: Kevin Wolf <kwolf@redhat.com>
To: Peter Crosthwaite <peter.crosthwaite@petalogix.com>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
	"Anthony Liguori" <aliguori@us.ibm.com>,
	"Stefan Hajnoczi" <stefanha@gmail.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"qemu-devel@nongnu.org Developers" <qemu-devel@nongnu.org>,
	"Jan Kiszka" <jan.kiszka@siemens.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Andreas Färber" <afaerber@suse.de>
Subject: Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions
Date: Fri, 13 Jul 2012 11:31:16 +0200	[thread overview]
Message-ID: <4FFFEAE4.2040207@redhat.com> (raw)
In-Reply-To: <CAEgOgz4uf-ju8TUpZ8-Q7SxciDE2D3VKaKiv1qisNQJnFD4PBA@mail.gmail.com>

Am 13.07.2012 11:18, schrieb Peter Crosthwaite:
> On Fri, Jul 13, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes:
>>> One policy suggested is to ban bdrv_read() from init period and
>>> require devices to Lazy init. That is, on the first load, do the read
>>> you were going to do at init time. Peter Maydell points out that this
>>> has the drawback of late detection of errors, I.E. errors that should
>>> really be picked up at load time are not picked up until first access
>>> (like the bdrv file being too small for the device).
>>
>> Outlawing data read/write in init() doesn't mean we must also outlaw
>> examining image meta-data, such as size.  We just need to be clear what
>> exactly is permitted.  Even better: catch violations.
>>
>>> Kevin propsoses a second init function for devices. I guess the way
>>> this is implemented is TYPE_DEVICE or TYPE_OBJECT has  ->init_state()
>>> or some such that is called once the the machine is ready to migrate.
>>> The bdrv_read is moved from ->init() to ->init_state(). You may or may
>>> not convert to bdrv_aio_read() but it doesn't matter for that
>>> approach.
>>
>> We need to have a clear idea of what the various device methods are
>> supposed to do.
>>
>> In the case of init() and init_state(): what's their contract?
>>
> 
> So the concept is the Machine itself (devices and their properties)
> are created by init(), and machine state (memory and disk contents,
> vmsd type stuff) is created by init_state(). By extension, any
> migratable data must not be populated or manipulated until
> state_init().

Does this mean that we wouldn't call state_init() on a destination VM
that waits for an incoming migration? If so, I think this gives a pretty
clear picture of what belongs where.

>> In particular, which one completes property initialization?  Right now,
>> we do create() - set properties - init(), and init() checks property
>> values, supplies defaults, and so forth.  After init(), property values
>> shouldn't change.
>>
>> Example: disk geometry.  If the geometry properties haven't been set,
>> init() supplies defaults.  Computing defaults requires reading the MBR.
>> If that's no longer allowed in init(), and has to be done in
>> init_state(), then property initialization completes only in
>> init_state().  I doubt that's what we want.
> 
> Well another policy is to allow init to bdrv_read, just not allow to
> set any device state. Getting geo from the MBR seems OK to me, because
> your not actually loading any device state, your acquiring a best
> guess at the unspecified device properties. Theres only one corner
> case left, and that is changing the MBR in the migration window. But
> you can adopt something of a "better than nothin" approach here, as I
> dont a see a solution here without changing the Migration process
> (that Kevin summarised). This corner case exists even if the code
> remains unchanged.

Yes, this is one example of broken code that we have today. Ideally,
bdrv_* shouldn't be allowed at all during init().

In the block layer, it's giving us real trouble that during migration
images can be opened by two qemu instances, and after migration
completion we must invalidate any cached content that the destination
already read (which we can't reliably for the kernel page cache; this is
why migration is only safe with cache=none/directsync in some common
configurations).

So getting rid of bdrv_* in init() wouldn't only fix the obvious
problems without outdated geometry, disk content etc., but also make
life easier in the block layer.

Kevin