* [Qemu-devel] [RFC] Removing brdv_read()s from device init functions @ 2012-07-03 1:08 Peter Crosthwaite 2012-07-13 8:33 ` Markus Armbruster 0 siblings, 1 reply; 4+ messages in thread From: Peter Crosthwaite @ 2012-07-03 1:08 UTC (permalink / raw) To: qemu-devel@nongnu.org Developers, Kevin Wolf, Peter Maydell, Stefan Hajnoczi, Markus Armbruster Cc: Jan Kiszka, Anthony Liguori, Andreas Färber, Paolo Bonzini Hi All, This RFC comes from the recent discussion Re coroutines and the block layer - the current topic of disucussion there has shifted to "bdrv_read() from device init", so rather than continuing the discussion as a tangent to the unrelated original topic I'm recreating the thread. So anyways, there are several motivations to remove bdrv_read() from init functions. Kevin points out that it breaks migration as the reading of machine state at init stage will read out-of-date data. It also is the source of my coroutine bug. It also is a worst case performance wise. The big question is how for existing models do we fix it? One policy suggested is to ban bdrv_read() from init period and require devices to Lazy init. That is, on the first load, do the read you were going to do at init time. Peter Maydell points out that this has the drawback of late detection of errors, I.E. errors that should really be picked up at load time are not picked up until first access (like the bdrv file being too small for the device). Kevin propsoses a second init function for devices. I guess the way this is implemented is TYPE_DEVICE or TYPE_OBJECT has ->init_state() or some such that is called once the the machine is ready to migrate. The bdrv_read is moved from ->init() to ->init_state(). You may or may not convert to bdrv_aio_read() but it doesn't matter for that approach. I came up with the idea of just using AIO to delay the read until the appropriate time. The bdrv_read in init is converted to a bdrv_aio_read(), and AIO yields the created coroutine when it figures out that the machine is not migration ready. When the machine is ready the coroutines is re-entered. I have CCd some of the QOM people as i noticed they were discussing some infrastracture for late device init. This may or may not be related. Regards, Peter ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions 2012-07-03 1:08 [Qemu-devel] [RFC] Removing brdv_read()s from device init functions Peter Crosthwaite @ 2012-07-13 8:33 ` Markus Armbruster 2012-07-13 9:18 ` Peter Crosthwaite 0 siblings, 1 reply; 4+ messages in thread From: Markus Armbruster @ 2012-07-13 8:33 UTC (permalink / raw) To: Peter Crosthwaite Cc: Kevin Wolf, Peter Maydell, Anthony Liguori, Jan Kiszka, qemu-devel@nongnu.org Developers, Stefan Hajnoczi, Paolo Bonzini, Andreas Färber Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes: > Hi All, > > This RFC comes from the recent discussion Re coroutines and the block > layer - the current topic of disucussion there has shifted to > "bdrv_read() from device init", so rather than continuing the > discussion as a tangent to the unrelated original topic I'm recreating > the thread. > > So anyways, there are several motivations to remove bdrv_read() from > init functions. Kevin points out that it breaks migration as the > reading of machine state at init stage will read out-of-date data. It > also is the source of my coroutine bug. It also is a worst case > performance wise. Peter asked me to comment re encypted images. In my opinion, image encryption is a "feeling lucky" feature, and we shouldn't complicate QOM or qdev to work around its deficiencies, such as keys becoming available too late. But here goes anyway: it also breaks with encrypted images. Details in Message-ID: <m3bojl2j0t.fsf@blackfin.pond.sub.org> http://lists.nongnu.org/archive/html/qemu-devel/2012-07/msg01572.html > The big question is how for existing models do we > fix it? > > One policy suggested is to ban bdrv_read() from init period and > require devices to Lazy init. That is, on the first load, do the read > you were going to do at init time. Peter Maydell points out that this > has the drawback of late detection of errors, I.E. errors that should > really be picked up at load time are not picked up until first access > (like the bdrv file being too small for the device). Outlawing data read/write in init() doesn't mean we must also outlaw examining image meta-data, such as size. We just need to be clear what exactly is permitted. Even better: catch violations. > Kevin propsoses a second init function for devices. I guess the way > this is implemented is TYPE_DEVICE or TYPE_OBJECT has ->init_state() > or some such that is called once the the machine is ready to migrate. > The bdrv_read is moved from ->init() to ->init_state(). You may or may > not convert to bdrv_aio_read() but it doesn't matter for that > approach. We need to have a clear idea of what the various device methods are supposed to do. In the case of init() and init_state(): what's their contract? In particular, which one completes property initialization? Right now, we do create() - set properties - init(), and init() checks property values, supplies defaults, and so forth. After init(), property values shouldn't change. Example: disk geometry. If the geometry properties haven't been set, init() supplies defaults. Computing defaults requires reading the MBR. If that's no longer allowed in init(), and has to be done in init_state(), then property initialization completes only in init_state(). I doubt that's what we want. The disk geometry problem can be solved by getting rid of the guessing from MBR. I believe the more general problem "encryption keys not available in init()" can also be solved, but it'll involve hairy rearrangement of the startup code. > I came up with the idea of just using AIO to delay the read until the > appropriate time. The bdrv_read in init is converted to a > bdrv_aio_read(), and AIO yields the created coroutine when it figures > out that the machine is not migration ready. When the machine is ready > the coroutines is re-entered. > > I have CCd some of the QOM people as i noticed they were discussing > some infrastracture for late device init. This may or may not be > related. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions 2012-07-13 8:33 ` Markus Armbruster @ 2012-07-13 9:18 ` Peter Crosthwaite 2012-07-13 9:31 ` Kevin Wolf 0 siblings, 1 reply; 4+ messages in thread From: Peter Crosthwaite @ 2012-07-13 9:18 UTC (permalink / raw) To: Markus Armbruster Cc: Kevin Wolf, Peter Maydell, Anthony Liguori, Jan Kiszka, qemu-devel@nongnu.org Developers, Stefan Hajnoczi, Paolo Bonzini, Andreas Färber On Fri, Jul 13, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote: > Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes: > >> Hi All, >> >> This RFC comes from the recent discussion Re coroutines and the block >> layer - the current topic of disucussion there has shifted to >> "bdrv_read() from device init", so rather than continuing the >> discussion as a tangent to the unrelated original topic I'm recreating >> the thread. >> >> So anyways, there are several motivations to remove bdrv_read() from >> init functions. Kevin points out that it breaks migration as the >> reading of machine state at init stage will read out-of-date data. It >> also is the source of my coroutine bug. It also is a worst case >> performance wise. > > Peter asked me to comment re encypted images. In my opinion, image > encryption is a "feeling lucky" feature, and we shouldn't complicate QOM > or qdev to work around its deficiencies, such as keys becoming available > too late. But here goes anyway: it also breaks with encrypted images. > Details in Message-ID: <m3bojl2j0t.fsf@blackfin.pond.sub.org> > http://lists.nongnu.org/archive/html/qemu-devel/2012-07/msg01572.html > >> The big question is how for existing models do we >> fix it? >> >> One policy suggested is to ban bdrv_read() from init period and >> require devices to Lazy init. That is, on the first load, do the read >> you were going to do at init time. Peter Maydell points out that this >> has the drawback of late detection of errors, I.E. errors that should >> really be picked up at load time are not picked up until first access >> (like the bdrv file being too small for the device). > > Outlawing data read/write in init() doesn't mean we must also outlaw > examining image meta-data, such as size. We just need to be clear what > exactly is permitted. Even better: catch violations. > >> Kevin propsoses a second init function for devices. I guess the way >> this is implemented is TYPE_DEVICE or TYPE_OBJECT has ->init_state() >> or some such that is called once the the machine is ready to migrate. >> The bdrv_read is moved from ->init() to ->init_state(). You may or may >> not convert to bdrv_aio_read() but it doesn't matter for that >> approach. > > We need to have a clear idea of what the various device methods are > supposed to do. > > In the case of init() and init_state(): what's their contract? > So the concept is the Machine itself (devices and their properties) are created by init(), and machine state (memory and disk contents, vmsd type stuff) is created by init_state(). By extension, any migratable data must not be populated or manipulated until state_init(). > In particular, which one completes property initialization? Right now, > we do create() - set properties - init(), and init() checks property > values, supplies defaults, and so forth. After init(), property values > shouldn't change. > > Example: disk geometry. If the geometry properties haven't been set, > init() supplies defaults. Computing defaults requires reading the MBR. > If that's no longer allowed in init(), and has to be done in > init_state(), then property initialization completes only in > init_state(). I doubt that's what we want. Well another policy is to allow init to bdrv_read, just not allow to set any device state. Getting geo from the MBR seems OK to me, because your not actually loading any device state, your acquiring a best guess at the unspecified device properties. Theres only one corner case left, and that is changing the MBR in the migration window. But you can adopt something of a "better than nothin" approach here, as I dont a see a solution here without changing the Migration process (that Kevin summarised). This corner case exists even if the code remains unchanged. Regards, Peter > > The disk geometry problem can be solved by getting rid of the guessing > from MBR. > > I believe the more general problem "encryption keys not available in > init()" can also be solved, but it'll involve hairy rearrangement of the > startup code. > >> I came up with the idea of just using AIO to delay the read until the >> appropriate time. The bdrv_read in init is converted to a >> bdrv_aio_read(), and AIO yields the created coroutine when it figures >> out that the machine is not migration ready. When the machine is ready >> the coroutines is re-entered. >> >> I have CCd some of the QOM people as i noticed they were discussing >> some infrastracture for late device init. This may or may not be >> related. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [RFC] Removing brdv_read()s from device init functions 2012-07-13 9:18 ` Peter Crosthwaite @ 2012-07-13 9:31 ` Kevin Wolf 0 siblings, 0 replies; 4+ messages in thread From: Kevin Wolf @ 2012-07-13 9:31 UTC (permalink / raw) To: Peter Crosthwaite Cc: Peter Maydell, Anthony Liguori, Stefan Hajnoczi, Markus Armbruster, qemu-devel@nongnu.org Developers, Jan Kiszka, Paolo Bonzini, Andreas Färber Am 13.07.2012 11:18, schrieb Peter Crosthwaite: > On Fri, Jul 13, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote: >> Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes: >>> One policy suggested is to ban bdrv_read() from init period and >>> require devices to Lazy init. That is, on the first load, do the read >>> you were going to do at init time. Peter Maydell points out that this >>> has the drawback of late detection of errors, I.E. errors that should >>> really be picked up at load time are not picked up until first access >>> (like the bdrv file being too small for the device). >> >> Outlawing data read/write in init() doesn't mean we must also outlaw >> examining image meta-data, such as size. We just need to be clear what >> exactly is permitted. Even better: catch violations. >> >>> Kevin propsoses a second init function for devices. I guess the way >>> this is implemented is TYPE_DEVICE or TYPE_OBJECT has ->init_state() >>> or some such that is called once the the machine is ready to migrate. >>> The bdrv_read is moved from ->init() to ->init_state(). You may or may >>> not convert to bdrv_aio_read() but it doesn't matter for that >>> approach. >> >> We need to have a clear idea of what the various device methods are >> supposed to do. >> >> In the case of init() and init_state(): what's their contract? >> > > So the concept is the Machine itself (devices and their properties) > are created by init(), and machine state (memory and disk contents, > vmsd type stuff) is created by init_state(). By extension, any > migratable data must not be populated or manipulated until > state_init(). Does this mean that we wouldn't call state_init() on a destination VM that waits for an incoming migration? If so, I think this gives a pretty clear picture of what belongs where. >> In particular, which one completes property initialization? Right now, >> we do create() - set properties - init(), and init() checks property >> values, supplies defaults, and so forth. After init(), property values >> shouldn't change. >> >> Example: disk geometry. If the geometry properties haven't been set, >> init() supplies defaults. Computing defaults requires reading the MBR. >> If that's no longer allowed in init(), and has to be done in >> init_state(), then property initialization completes only in >> init_state(). I doubt that's what we want. > > Well another policy is to allow init to bdrv_read, just not allow to > set any device state. Getting geo from the MBR seems OK to me, because > your not actually loading any device state, your acquiring a best > guess at the unspecified device properties. Theres only one corner > case left, and that is changing the MBR in the migration window. But > you can adopt something of a "better than nothin" approach here, as I > dont a see a solution here without changing the Migration process > (that Kevin summarised). This corner case exists even if the code > remains unchanged. Yes, this is one example of broken code that we have today. Ideally, bdrv_* shouldn't be allowed at all during init(). In the block layer, it's giving us real trouble that during migration images can be opened by two qemu instances, and after migration completion we must invalidate any cached content that the destination already read (which we can't reliably for the kernel page cache; this is why migration is only safe with cache=none/directsync in some common configurations). So getting rid of bdrv_* in init() wouldn't only fix the obvious problems without outdated geometry, disk content etc., but also make life easier in the block layer. Kevin ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2012-07-13 9:31 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-07-03 1:08 [Qemu-devel] [RFC] Removing brdv_read()s from device init functions Peter Crosthwaite 2012-07-13 8:33 ` Markus Armbruster 2012-07-13 9:18 ` Peter Crosthwaite 2012-07-13 9:31 ` Kevin Wolf
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.