* [Qemu-devel] vm state save/restore question
@ 2012-06-09 10:53 Benjamin Herrenschmidt
  2012-06-09 11:34 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-09 10:53 UTC
  To: qemu-devel; +Cc: Alexander Graf

Hi folks !

I'm looking at sorting out the state save/restore of target-ppc (which
means understanding in general how it works in qemu :-)

So far I've somewhat figured out that there's the "old way" where we
just provide a "bulk" save/restore function pair, and the "new way"
where we have nicely tagged field lists etc...

x86 seems to use the latter for the CPU state; ppc is a mess and uses the
former, with interesting incompatible format changes depending on how qemu
is built :-) So I think that's one area I need to fix.

However, another issue I have, I'm not sure how to fix properly:

When using the "spapr" machine model (pseries), which is in effect a
paravirtualized model, the MMU hash table isn't part of the guest
memory. It's either somewhere in qemu own memory (full emu) or in the
host kernel (kvm) and it's accessed via hcalls.

Now, that per se is fine: I can add a pair of save/restore hooks in
spapr to handle it (in fact using the "old" scheme here makes sense
since it's a potentially big bulk of memory, unless I find a way to
represent that with the new "structured" way, but that's a detail).
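
For concreteness, the shape I have in mind is roughly the following — a
sketch only, with invented handler and field names and no error handling:

/* Sketch: bulk save/restore of the externally-held hash table. The
 * sPAPREnvironment fields (htab, htab_size) are assumed names. */
static void spapr_htab_save(QEMUFile *f, void *opaque)
{
    sPAPREnvironment *spapr = opaque;

    qemu_put_be64(f, spapr->htab_size);
    qemu_put_buffer(f, (uint8_t *)spapr->htab, spapr->htab_size);
}

static int spapr_htab_load(QEMUFile *f, void *opaque, int version_id)
{
    sPAPREnvironment *spapr = opaque;
    uint64_t size = qemu_get_be64(f);

    spapr->htab = qemu_memalign(size, size);    /* the htab is size-aligned */
    spapr->htab_size = size;
    qemu_get_buffer(f, (uint8_t *)spapr->htab, size);
    return 0;
}

/* ...and from the machine init code: */
register_savevm(NULL, "spapr/htab", 0, 1,
                spapr_htab_save, spapr_htab_load, spapr);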

I will probably need some new ioctls in the kernel to be able to dump
the kvm one in a state useful for a saved image, but that too is nothing
we don't know how to do. We'll also have more vcpu state to properly
sort out for save/restore with kvm; we'll deal with that eventually.

The problem I have is in fact with full emu on restore. After I've
re-allocated the hash table, loaded it from disk, etc... I need to
update some data in the vcpu env.

Ie. the vcpu has a pointer to the "external hash" (and its size) which
it uses to perform translation, and that needs to be restored.

cpu_load/save isn't a good place to do that, of course, since that code
knows nothing about the spapr stuff; besides, I don't think I have any
ordering guarantee between those and my spapr htab save/load.

What I'd need is something in spapr that can be used to "resync" bits of
the cpu state with the external htab that gets run after everything is
loaded and before emulation restarts.

Any idea how to do that properly ? I suppose I could also try to iterate
all the vcpus after loading the hash table & update the fields, but not
only is that gross ... I also don't know how to do it :-)
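
For illustration, the rough shape of that resync — where exactly the
cached pointer lives in CPUPPCState is an assumption here:

/* Sketch: after the externally-held table has been re-allocated and
 * reloaded, walk the vcpus and refresh their cached view of it.
 * The env field names are assumptions. */
static void spapr_htab_resync(sPAPREnvironment *spapr)
{
    CPUPPCState *env;

    for (env = first_cpu; env != NULL; env = env->next_cpu) {
        env->external_htab = spapr->htab;       /* assumed field */
        env->htab_mask = spapr->htab_size - 1;  /* assumed encoding */
    }
}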

Cheers,
Ben.
 


* Re: [Qemu-devel] vm state save/restore question
  2012-06-09 10:53 [Qemu-devel] vm state save/restore question Benjamin Herrenschmidt
@ 2012-06-09 11:34 ` Benjamin Herrenschmidt
  2012-06-09 11:37   ` Andreas Färber
  2012-06-19 14:07   ` Alexander Graf
  0 siblings, 2 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-09 11:34 UTC
  To: qemu-devel; +Cc: Alexander Graf

On Sat, 2012-06-09 at 20:53 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !

(After some discussion with Andreas ...)

> I'm looking at sorting out the state save/restore of target-ppc (which
> means understanding in general how it works in qemu :-)
> 
> So far I've somewhat figured out that there's the "old way" where we
> just provide a "bulk" save/restore function pair, and the "new way"
> where we have nicely tagged field lists etc...
> 
> x86 seems to use the latter for the CPU state; ppc is a mess and uses the
> former, with interesting incompatible format changes depending on how qemu
> is built :-) So I think that's one area I need to fix.

Ok, so I'm told there are patches to convert ppc, I haven't seen them in
my list archives, so if somebody has a pointer, please shoot, that will
save me some work :-)

  .../...

> What I'd need is something in spapr that can be used to "resync" bits of
> the cpu state with the external htab that gets run after everything is
> loaded and before emulation restarts.
> 
> Any idea how to do that properly ? I suppose I could also try to iterate
> all the vcpus after loading the hash table & update the fields, but not
> only is that gross ... I also don't know how to do it :-)

So I did an experiment using the "old style" save/restore (bad boy !)
and got that part to work by just iterating the vcpus.

It's a bit nasty but it's the right way I think, ie, what we have here
(the external hash table) is a global object under control/ownership of
the platform code for which a pointer is cached in the CPU state (so the
mmu emulation gets to it easily), so those cached pointers need to be
updated in all CPUs when a new hash table is loaded/allocated.

That leads to another question however... I need to add save/restore to
a bunch more stuff such as the xics (interrupt controller), the various
spapr devices, etc...

So far the VMState stuff is all nice if you have fixed-size arrays.
However, I haven't quite found the right way to use it for things
like:

 - The hash table (mentioned above). This is just a big chunk of memory
(it will routinely be 16M), so I really don't want to start iterating
all elements, just a bulk load will do, and the size might actually be
variable.

 - The XICS (interrupt controller). The actual size of the interrupt
state array can vary (the number of interrupt sources can vary, it's
fixed today by the machine code but I wouldn't rely too much on that and
in any case, from the XICS driver perspective, it's not a constant, it's
a variable it gets passed when initializing).

So in both these cases, I need either code to control the save/load
process (old style ? hard to hook into vmstate as far as I can tell) or
maybe a way to describe the array so that the array size itself is a
pointer to a variable (Andreas mentioned something along those lines).
Is there any doco for that stuff btw ? I haven't seen anything
detailed...
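
To make the first case concrete, here's the kind of description I'd hope
for — sketch only, names invented, and I'm not sure a UINT32-sized
variant of the buffer macro exists yet (its arguments have also moved
around between versions):

/* Sketch: a variable-sized opaque buffer whose length is migrated along
 * with it. The buffer must already be allocated before the load runs. */
static const VMStateDescription vmstate_spapr_htab = {
    .name = "spapr_htab",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(htab_size, sPAPREnvironment),
        VMSTATE_VBUFFER_UINT32(htab, sPAPREnvironment, 0, NULL, 0, htab_size),
        VMSTATE_END_OF_LIST()
    },
};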

Suggestions ?

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-09 11:34 ` Benjamin Herrenschmidt
@ 2012-06-09 11:37   ` Andreas Färber
  2012-06-19 14:07   ` Alexander Graf
  1 sibling, 0 replies; 20+ messages in thread
From: Andreas Färber @ 2012-06-09 11:37 UTC
  To: Benjamin Herrenschmidt; +Cc: Juan Quintela, qemu-devel, Alexander Graf

Hi,

Am 09.06.2012 13:34, schrieb Benjamin Herrenschmidt:
> On Sat, 2012-06-09 at 20:53 +1000, Benjamin Herrenschmidt wrote:
> (After some discussion with Andreas ...)
> 
>> I'm looking at sorting out the state save/restore of target-ppc (which
>> means understanding in general how it works in qemu :-)
>>
>> So far I've somewhat figured out that there's the "old way" where we
>> just provide a "bulk" save/restore function pair, and the "new way"
>> where we have nicely tagged field lists etc...
>>
>> x86 seems to use the latter for the CPU state; ppc is a mess and uses the
>> former, with interesting incompatible format changes depending on how qemu
>> is built :-) So I think that's one area I need to fix.
> 
> Ok, so I'm told there are patches to convert ppc, I haven't seen them in
> my list archives, so if somebody has a pointer, please shoot, that will
> save me some work :-)

https://lists.gnu.org/archive/html/qemu-devel/2012-05/msg00524.html

Cheers,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


* Re: [Qemu-devel] vm state save/restore question
  2012-06-09 11:34 ` Benjamin Herrenschmidt
  2012-06-09 11:37   ` Andreas Färber
@ 2012-06-19 14:07   ` Alexander Graf
  2012-06-19 14:59     ` Juan Quintela
  1 sibling, 1 reply; 20+ messages in thread
From: Alexander Graf @ 2012-06-19 14:07 UTC
  To: Benjamin Herrenschmidt; +Cc: qemu-devel@nongnu.org Developers, Juan Quintela


On 09.06.2012, at 13:34, Benjamin Herrenschmidt wrote:

> On Sat, 2012-06-09 at 20:53 +1000, Benjamin Herrenschmidt wrote:
>> Hi folks !
> 
> (After some discussion with Andreas ...)
> 
>> I'm looking at sorting out the state save/restore of target-ppc (which
>> means understanding in general how it works in qemu :-)
>> 
>> So far I've somewhat figured out that there's the "old way" where we
>> just provide a "bulk" save/restore function pair, and the "new way"
>> where we have nicely tagged field lists etc...
>> 
>> x86 seems to use the latter for the CPU state; ppc is a mess and uses the
>> former, with interesting incompatible format changes depending on how qemu
>> is built :-) So I think that's one area I need to fix.
> 
> Ok, so I'm told there are patches to convert ppc, I haven't seen them in
> my list archives, so if somebody has a pointer, please shoot, that will
> save me some work :-)
> 
>  .../...
> 
>> What I'd need is something in spapr that can be used to "resync" bits of
>> the cpu state with the external htab that gets run after everything is
>> loaded and before emulation restarts.
>> 
>> Any idea how to do that properly ? I suppose I could also try to iterate
>> all the vcpus after loading the hash table & update the fields, but not
>> only is that gross ... I also don't know how to do it :-)
> 
> So I did an experiment using the "old style" save/restore (bad boy !)
> and got that part to work by just iterating the vcpus.
> 
> It's a bit nasty but it's the right way I think, ie, what we have here
> (the external hash table) is a global object under control/ownership of
> the platform code for which a pointer is cached in the CPU state (so the
> mmu emulation gets to it easily), so those cached pointers need to be
> updated in all CPUs when a new hash table is loaded/allocated.
> 
> That leads to another question however... I need to add save/restore to
> a bunch more stuff such as the xics (interrupt controller), the various
> spapr devices, etc...
> 
> So far the VMState stuff is all nice if you have fixed-size arrays.
> However, I haven't quite found the right way to use it for things
> like:
> 
> - The hash table (mentioned above). This is just a big chunk of memory
> (it will routinely be 16M), so I really don't want to start iterating
> all elements, just a bulk load will do, and the size might actually be
> variable.
> 
> - The XICS (interrupt controller). The actual size of the interrupt
> state array can vary (the number of interrupt sources can vary, it's
> fixed today by the machine code but I wouldn't rely too much on that and
> in any case, from the XICS driver perspective, it's not a constant, it's
> a variable it gets passed when initializing).
> 
> So in both these cases, I need either code to control the save/load
> process (old style ? hard to hook into vmstate as far as I can tell) or
> maybe a way to describe the array so that the array size itself is a
> pointer to a variable (Andreas mentioned something along those lines).
> Is there any doco for that stuff btw ? I haven't seen anything
> detailed...

I'm sure Juan knows more there :)


Alex


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 14:07   ` Alexander Graf
@ 2012-06-19 14:59     ` Juan Quintela
  2012-06-19 17:16       ` Andreas Färber
  2012-06-19 20:30       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 20+ messages in thread
From: Juan Quintela @ 2012-06-19 14:59 UTC
  To: Alexander Graf; +Cc: qemu-devel@nongnu.org Developers

Alexander Graf <agraf@suse.de> wrote:
> On 09.06.2012, at 13:34, Benjamin Herrenschmidt wrote:
>
>> On Sat, 2012-06-09 at 20:53 +1000, Benjamin Herrenschmidt wrote:
>>> Hi folks !
>> 
>> (After some discussion with Andreas ...)
>> 
>>> I'm looking at sorting out the state save/restore of target-ppc (which
>>> means understanding in general how it works in qemu :-)
>>> 
>>> So far I've somewhat figured out that there's the "old way" where we
>>> just provide a "bulk" save/restore function pair, and the "new way"
>>> where we have nicely tagged field lists etc...
>>> 
>>> x86 seems to use the latter for the CPU state; ppc is a mess and uses the
>>> former, with interesting incompatible format changes depending on how qemu
>>> is built :-) So I think that's one area I need to fix.
>> 
>> Ok, so I'm told there are patches to convert ppc, I haven't seen them in
>> my list archives, so if somebody has a pointer, please shoot, that will
>> save me some work :-)

I can send a new version tomorrow.

>>> What I'd need is something in spapr that can be used to "resync" bits of
>>> the cpu state with the external htab that gets run after everything is
>>> loaded and before emulation restarts.
>>> 
>>> Any idea how to do that properly ? I suppose I could also try to iterate
>>> all the vcpus after loading the hash table & update the fields, but not
>>> only is that gross ... I also don't know how to do it :-)
>> 
>> So I did an experiment using the "old style" save/restore (bad boy !)
>> and got that part to work by just iterating the vcpus.
>> 
>> It's a bit nasty but it's the right way I think, ie, what we have here
>> (the external hash table) is a global object under control/ownership of
>> the platform code for which a pointer is cached in the CPU state (so the
>> mmu emulation gets to it easily), so those cached pointers need to be
>> updated in all CPUs when a new hash table is loaded/allocated.
>> 
>> That leads to another question however... I need to add save/restore to
>> a bunch more stuff such as the xics (interrupt controller), the various
>> spapr devices, etc...
>> 
>> So far the VMState stuff is all nice if you have fixed-size arrays.
>> However, I haven't quite found the right way to use it for things
>> like:
>> 
>> - The hash table (mentioned above). This is just a big chunk of memory
>> (it will routinely be 16M), so I really don't want to start iterating
>> all elements, just a bulk load will do, and the size might actually be
>> variable.

This is going to kill migration downtime.  With the current setup, we
just send something like 1-2MB in stage 3 (i.e. after the machine is
down).  The default downtime is 30ms, and 16MB is going to take around 1s
on gigabit ethernet.

That said, if you tell me the state that you want to send, I can
take a look.

>> - The XICS (interrupt controller). The actual size of the interrupt
>> state array can vary (the number of interrupt sources can vary, it's
>> fixed today by the machine code but I wouldn't rely too much on that and
>> in any case, from the XICS driver perspective, it's not a constant, it's
>> a variable it gets passed when initializing).

Can you point me at the structure that you want to send?

>> So in both these cases, I need either code to control the save/load
>> process (old style ? hard to hook into vmstate as far as I can tell) or
>> maybe a way to describe the array so that the array size itself is a
>> pointer to a variable (Andreas mentioned something along those lines).
>> Is there any doco for that stuff btw ? I haven't seen anything
>> detailed...
>
> I'm sure Juan knows more there :)

thanks for pointing me to the discussion O:-)

Later, Juan.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 14:59     ` Juan Quintela
@ 2012-06-19 17:16       ` Andreas Färber
  2012-06-19 20:30       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: Andreas Färber @ 2012-06-19 17:16 UTC
  To: quintela
  Cc: Alexander Graf, Anthony Liguori, qemu-devel@nongnu.org Developers

Juan,

Am 19.06.2012 16:59, schrieb Juan Quintela:
> Alexander Graf <agraf@suse.de> wrote:
>> On 09.06.2012, at 13:34, Benjamin Herrenschmidt wrote:
>>
>>> Ok, so I'm told there are patches to convert ppc, I haven't seen them in
>>> my list archives, so if somebody has a pointer, please shoot, that will
>>> save me some work :-)
> 
> I can send a new version tomorrow.

I have a patch queued to convert the CPU to a device, which would give
us a VMSD pointer to associate the existing / to-be-added VMState with
(one of my comments on your previous round, to which you haven't
replied). With QBus applied now, that should be mergeable (although
Anthony wanted to take three steps at once ;) and make qdev available
in linux-user as well).

I'm still unclear about whether a change of the index from cpu_index to
-1, as used in qdev, would cause any issues. If it did, we could
duplicate the vmsd field into CPUState and be independent of deriving
from DeviceState.

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 14:59     ` Juan Quintela
  2012-06-19 17:16       ` Andreas Färber
@ 2012-06-19 20:30       ` Benjamin Herrenschmidt
  2012-06-19 21:00         ` Alexander Graf
  1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-19 20:30 UTC
  To: quintela; +Cc: Alexander Graf, qemu-devel@nongnu.org Developers

On Tue, 2012-06-19 at 16:59 +0200, Juan Quintela wrote:
> >> - The hash table (mentioned above). This is just a big chunk of memory
> >> (it will routinely be 16M), so I really don't want to start iterating
> >> all elements, just a bulk load will do, and the size might actually be
> >> variable.
> 
> This is going to kill migration downtime.  With the current setup, we
> just send something like 1-2MB in stage 3 (i.e. after the machine is
> down).  The default downtime is 30ms, and 16MB is going to take around 1s
> on gigabit ethernet.
> 
> That said, if you tell me the state that you want to send, I can
> take a look.

Well, we don't have much of a choice unless we do something really fancy
but that would be a second step...

The MMU hash table on power is where all our translations go. What we
could do is put in some knowledge about which translations are actually
necessary for the guest and which ones can be rebuilt (faulted in),
essentially by adding knowledge to qemu/kvm about the "bolted" bit that
the guest uses for translations that must not be evicted.

However, that would require at least some interaction between the guest
and qemu/kvm to enable the function, since this bit is a guest SW
construct (unless it got architected in recent PAPR, I need to double
check).

In that case, we save a "compressed" variant of the hash table (RLE
style ? we need to keep the hash location of each entry ...) which
contains only the bolted translations.

The consequence is that post-migration, the guest will temporarily slow
down as most translations will have to be faulted back in.
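
Roughly, the save pass could then look like this — purely illustrative,
since the bolted-bit test is exactly the guest convention in question,
and the layout/field names are made up:

/* Sketch: stream only valid, bolted HPTEs together with their index,
 * since the hash location of each entry must be preserved. HPTE_R_BOLTED
 * stands in for whatever guest convention we end up recognizing. */
static void spapr_htab_save_bolted(QEMUFile *f, sPAPREnvironment *spapr)
{
    uint64_t *hptes = spapr->htab;      /* assumed layout: pte0,pte1 pairs */
    uint64_t i, n = spapr->htab_size / HASH_PTE_SIZE_64;

    for (i = 0; i < n; i++) {
        uint64_t pte0 = hptes[i * 2];
        uint64_t pte1 = hptes[i * 2 + 1];

        if ((pte0 & HPTE_V_VALID) && (pte1 & HPTE_R_BOLTED)) {
            qemu_put_be64(f, i);        /* the hash location */
            qemu_put_be64(f, pte0);
            qemu_put_be64(f, pte1);
        }
    }
    qemu_put_be64(f, (uint64_t)-1);     /* end marker */
}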

> >> - The XICS (interrupt controller). The actual size of the interrupt
> >> state array can vary (the number of interrupt sources can vary, it's
> >> fixed today by the machine code but I wouldn't rely too much on that and
> >> in any case, from the XICS driver perspective, it's not a constant, it's
> >> a variable it gets passed when initializing).
> 
> Can you point me at the structure that you want to send?

hw/xics.c; there are several bits here, but the "array" itself is
struct ics_state, which has a pointer to an array of struct
ics_irq_state.

However, I think I figured out how to do that by using the array macro
which takes the size of the array as a pointer to a variable containing
that size, or am I mistaken ?
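
Ie. something along these lines — sketch only; the macro name/arguments
may differ in the version at hand, and if only an INT32-sized variant
exists today, adding a UINT32 one should be trivial:

/* Sketch: a variable-length array of sub-structs behind a pointer, with
 * the element count itself part of the migrated state. Assumes nr_irqs
 * is (or becomes) a uint32_t field of struct ics_state. */
static const VMStateDescription vmstate_ics = {
    .name = "ics",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(nr_irqs, struct ics_state),
        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(irqs, struct ics_state, nr_irqs,
                                             vmstate_ics_irq,
                                             struct ics_irq_state),
        VMSTATE_END_OF_LIST()
    },
};

(vmstate_ics_irq would be the per-source VMStateDescription.)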

> >> So in both these cases, I need either code to control the save/load
> >> process (old style ? hard to hook into vmstate as far as I can tell) or
> >> maybe a way to describe the array so that the array size itself is a
> >> pointer to a variable (Andreas mentioned something along those lines).
> >> Is there any doco for that stuff btw ? I haven't seen anything
> >> detailed...
> >
> > I'm sure Juan knows more there :)
> 
> thanks for pointing me to the discussion O:-)

Thanks for the help !

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 20:30       ` Benjamin Herrenschmidt
@ 2012-06-19 21:00         ` Alexander Graf
  2012-06-19 21:13           ` Benjamin Herrenschmidt
  2012-06-19 22:55           ` Juan Quintela
  0 siblings, 2 replies; 20+ messages in thread
From: Alexander Graf @ 2012-06-19 21:00 UTC
  To: Benjamin Herrenschmidt; +Cc: qemu-devel@nongnu.org Developers, quintela


On 19.06.2012, at 22:30, Benjamin Herrenschmidt wrote:

> On Tue, 2012-06-19 at 16:59 +0200, Juan Quintela wrote:
>>>> - The hash table (mentioned above). This is just a big chunk of memory
>>>> (it will routinely be 16M), so I really don't want to start iterating
>>>> all elements, just a bulk load will do, and the size might actually be
>>>> variable.
>> 
>> This is going to kill migration downtime.  With the current setup, we
>> just send something like 1-2MB in stage 3 (i.e. after the machine is
>> down).  The default downtime is 30ms, and 16MB is going to take around 1s
>> on gigabit ethernet.
>> 
>> That said, if you tell me the state that you want to send, I can
>> take a look.
> 
> Well, we don't have much of a choice unless we do something really fancy
> but that would be a second step...
> 
> The MMU hash table on power is where all our translations go. What we
> could do is put in some knowledge about which translations are actually
> necessary for the guest and which ones can be rebuilt (faulted in),
> essentially by adding knowledge to qemu/kvm about the "bolted" bit that
> the guest uses for translations that must not be evicted.
> 
> However, that would require at least some interaction between the guest
> and qemu/kvm to enable the function, since this bit is a guest SW
> construct (unless it got architected in recent PAPR, I need to double
> check).

How is the problem different from RAM? It's a 16MB region that can be accessed by the guest even during transfer time, so it can get dirty during the migration. But we only need to really transfer the last small delta at the end of the migration, right?


Alex


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 21:00         ` Alexander Graf
@ 2012-06-19 21:13           ` Benjamin Herrenschmidt
  2012-06-19 21:48             ` Alexander Graf
  2012-06-19 22:55           ` Juan Quintela
  1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-19 21:13 UTC
  To: Alexander Graf; +Cc: qemu-devel@nongnu.org Developers, quintela

On Tue, 2012-06-19 at 23:00 +0200, Alexander Graf wrote:
> How is the problem different from RAM? It's a 16MB region that can be
> accessed by the guest even during transfer time, so it can get dirty
> during the migration. But we only need to really transfer the last
> small delta at the end of the migration, right?

Because with -M pseries it's not mapped into guest space but instead is
a chunk of physically contiguous memory accessed directly in real mode
by KVM. So no dirty tracking here.

We could keep track manually, maybe using some kind of dirty bitmap of
changes to the hash table, but that would add overhead to things like
H_ENTER.
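
If we did track it, the hot path cost would essentially be one bit set
per touched entry — something like this, illustrative only, with the
kvm->arch field invented:

/* Illustrative kernel-side helper: flag the HPTE group that H_ENTER &
 * friends touched. The bitmap would only be allocated while a migration
 * is in progress, so the common case stays a NULL test. */
static inline void kvmppc_htab_mark_dirty(struct kvm *kvm,
                                          unsigned long pte_index)
{
    unsigned long *bitmap = kvm->arch.htab_dirty_bitmap;    /* invented */

    if (bitmap) {
        set_bit(pte_index / HPTES_PER_GROUP, bitmap);
    }
}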

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 21:13           ` Benjamin Herrenschmidt
@ 2012-06-19 21:48             ` Alexander Graf
  2012-06-19 21:51               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Graf @ 2012-06-19 21:48 UTC
  To: Benjamin Herrenschmidt; +Cc: qemu-devel@nongnu.org Developers, quintela



On 19.06.2012, at 23:13, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2012-06-19 at 23:00 +0200, Alexander Graf wrote:
>> How is the problem different from RAM? It's a 16MB region that can be
>> accessed by the guest even during transfer time, so it can get dirty
>> during the migration. But we only need to really transfer the last
>> small delta at the end of the migration, right?
> 
> Because with -M pseries it's not mapped into guest space but instead is
> a chunk of physically contiguous memory accessed directly in real mode
> by KVM. So no dirty tracking here.
> 
> We could keep track manually, maybe using some kind of dirty bitmap of
> changes to the hash table, but that would add overhead to things like
> H_ENTER.

Only during migration, right?

Alex


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 21:48             ` Alexander Graf
@ 2012-06-19 21:51               ` Benjamin Herrenschmidt
  2012-06-19 22:27                 ` Alexander Graf
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-19 21:51 UTC
  To: Alexander Graf; +Cc: qemu-devel@nongnu.org Developers, quintela

On Tue, 2012-06-19 at 23:48 +0200, Alexander Graf wrote:
> > We could keep track manually, maybe using some kind of dirty bitmap of
> > changes to the hash table, but that would add overhead to things like
> > H_ENTER.
> 
> Only during migration, right?

True. It will be an "interesting" user/kernel API tho ... I'll give it more thought.

I need to understand better how to do that vs. qemu save/restore though.
Ie. that means we can't just save the hash as a bulk and reload it, but
we'd have to save bits of it at a time or something like that, no ? Or
do we save it once, then save the diff at the end ?

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 21:51               ` Benjamin Herrenschmidt
@ 2012-06-19 22:27                 ` Alexander Graf
  2012-06-19 22:32                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Graf @ 2012-06-19 22:27 UTC
  To: Benjamin Herrenschmidt; +Cc: qemu-devel@nongnu.org  Developers, quintela


On 19.06.2012, at 23:51, Benjamin Herrenschmidt wrote:

> On Tue, 2012-06-19 at 23:48 +0200, Alexander Graf wrote:
>>> We could keep track manually, maybe using some kind of dirty bitmap of
>>> changes to the hash table, but that would add overhead to things like
>>> H_ENTER.
>> 
>> Only during migration, right?
> 
> True. It will be an "interesting" user/kernel API tho ... I'll give it more thought.

Well, all we need is 2 user space pointers in an ENABLE_CAP call. And maybe a DISABLE_CAP call to disable the syncing again.

void *htab;
u8 *htab_dirty;

ENABLE_CAP(KVM_PPC_SYNC_HTAB, htab, htab_dirty);

which would then make all the current GVA->GPA entries visible to the htab pointer. That view is always current. H_ENTER and friends update it in parallel to the GVA->HPA htab. We don't have to keep H_ENTER super fast during migration, so we can easily go to virtual mode for that one. Any time an entry changes, the dirty bitmap gets updated.

Usually, migration ends in killing the VM. But we shouldn't rely on that. Instead, we should provide an API to stop the synced mode again. Maybe

  ENABLE_CAP(KVM_PPC_SYNC_HTAB, NULL, NULL);

:)

> I need to understand better how to do that vs. qemu save/restore though.
> Ie. that means we can't just save the hash as a bulk and reload it, but
> we'd have to save bits of it at a time or something like that, no ? Or
> do we save it once, then save the diff at the end ?

The best way would be to throw it into the same bucket as RAM. At the end of the day, it really is no different. It'd then be synced during every iteration of the migration.
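
In the QEMU of this era that means a stage-driven live handler registered
with register_savevm_live(), roughly like this — the helper names are
invented and the exact signatures differ between QEMU versions:

/* Sketch: treat the htab like RAM — snapshot at stage 1, stream dirty
 * entries on each stage-2 iteration, send the final delta at stage 3. */
static int htab_save_live(QEMUFile *f, int stage, void *opaque)
{
    sPAPREnvironment *spapr = opaque;

    if (stage == 1) {
        htab_start_dirty_tracking(spapr);       /* invented helper */
    }
    if (stage == 2) {
        /* returns nonzero once the remaining dirty set is small enough */
        return htab_send_dirty(f, spapr, 0);    /* invented helper */
    }
    if (stage == 3) {
        htab_send_dirty(f, spapr, 1);           /* final delta */
        htab_stop_dirty_tracking(spapr);        /* invented helper */
    }
    return 0;
}

register_savevm_live(NULL, "spapr/htab", 0, 1, NULL,
                     htab_save_live, NULL, htab_load, spapr);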


Alex


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 22:27                 ` Alexander Graf
@ 2012-06-19 22:32                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-19 22:32 UTC
  To: Alexander Graf; +Cc: qemu-devel@nongnu.org Developers, quintela

On Wed, 2012-06-20 at 00:27 +0200, Alexander Graf wrote:
> 
> > I need to understand better how to do that vs. qemu save/restore though.
> > Ie. that means we can't just save the hash as a bulk and reload it, but
> > we'd have to save bits of it at a time or something like that, no ? Or
> > do we save it once, then save the diff at the end ?
> 
> The best way would be to throw it into the same bucket as RAM. At the
> end of the day, it really is no different. It'd then be synced during
> every iteration of the migration.

Sure, I understand the principle on paper :-) Not the details of how to
do that in the code :-) I'll dig and see if I can figure it out from the
RAM code later this week. I'm not actually implementing that stuff just
yet, just sizing it.

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 21:00         ` Alexander Graf
  2012-06-19 21:13           ` Benjamin Herrenschmidt
@ 2012-06-19 22:55           ` Juan Quintela
  2012-06-19 23:00             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 20+ messages in thread
From: Juan Quintela @ 2012-06-19 22:55 UTC
  To: Alexander Graf; +Cc: qemu-devel@nongnu.org Developers

Alexander Graf <agraf@suse.de> wrote:
> On 19.06.2012, at 22:30, Benjamin Herrenschmidt wrote:
>
>> On Tue, 2012-06-19 at 16:59 +0200, Juan Quintela wrote:
>>>>> - The hash table (mentioned above). This is just a big chunk of memory
>>>>> (it will routinely be 16M), so I really don't want to start iterating
>>>>> all elements, just a bulk load will do, and the size might actually be
>>>>> variable.
>>> 
>>> This is going to kill migration downtime.  With the current setup, we
>>> just send something like 1-2MB in stage 3 (i.e. after the machine is
>>> down).  The default downtime is 30ms, and 16MB is going to take around 1s
>>> on gigabit ethernet.
>>> 
>>> That said, if you tell me the state that you want to send, I can
>>> take a look.
>> 
>> Well, we don't have much of a choice unless we do something really fancy
>> but that would be a second step...
>> 
>> The MMU hash table on power is where all our translations go. What we
>> could do is put in some knowledge about which translations are actually
>> necessary for the guest and which ones can be rebuilt (faulted in),
>> essentially by adding knowledge to qemu/kvm about the "bolted" bit that
>> the guest uses for translations that must not be evicted.
>> 
>> However, that would require at least some interaction between the guest
>> and qemu/kvm to enable the function, since this bit is a guest SW
>> construct (unless it got architected in recent PAPR, I need to double
>> check).
>
> How is the problem different from RAM? It's a 16MB region that can be
> accessed by the guest even during transfer time, so it can get dirty
> during the migration. But we only need to really transfer the last
> small delta at the end of the migration, right?

This was going to be my question.

If we can do something like send the hash table and get a bitmap of the
entries that changed, we should be good.  Perhaps we need something
"interesting" like removing old entries (no clue if they just get
overwritten, or how they are replaced).

Later, Juan.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 22:55           ` Juan Quintela
@ 2012-06-19 23:00             ` Benjamin Herrenschmidt
  2012-06-19 23:11               ` Juan Quintela
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-19 23:00 UTC
  To: quintela; +Cc: Alexander Graf, qemu-devel@nongnu.org Developers

On Wed, 2012-06-20 at 00:55 +0200, Juan Quintela wrote:
> 
> This was going to be my question.
> 
> If we can do something like send the hash table and get a bitmap of the
> entries that changed, we should be good.  Perhaps we need something
> "interesting" like removing old entries (no clue if they just get
> overwritten, or how they are replaced).

Right, we can do an initial "snapshot" and have the kernel start
tracking changes from there. On the final step, we can then request from
the kernel a new snapshot with a dirty bitmap.

Or we can be even simpler, just do two snapshots and have qemu diff them
itself :-)

I am confident I can come up with something as far as the kernel side
and the qemu <-> kernel interface go. I need to get my head around the
details of how to implement that two-stage save process in qemu though,
and the corresponding restore, which will need to read both snapshots
and apply the diff before shooting it back to the kernel.
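
The restore side I picture like this — sketch only, reusing the invented
(index, pte0, pte1) record format with an end marker from earlier:

/* Sketch: apply a stream of changed HPTEs on top of the base snapshot
 * already loaded into the re-allocated table (htab assumed uint64_t *). */
static int spapr_htab_apply_diff(QEMUFile *f, sPAPREnvironment *spapr)
{
    uint64_t n = spapr->htab_size / HASH_PTE_SIZE_64;

    for (;;) {
        uint64_t index = qemu_get_be64(f);

        if (index == (uint64_t)-1) {
            break;                      /* end marker */
        }
        if (index >= n) {
            return -EINVAL;             /* corrupt stream */
        }
        spapr->htab[index * 2]     = qemu_get_be64(f);
        spapr->htab[index * 2 + 1] = qemu_get_be64(f);
    }
    return 0;
}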

BTW, does migration in pure qemu (full system emu) work similarly, ie,
two-stage ? If it does I can easily prototype everything there.

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 23:00             ` Benjamin Herrenschmidt
@ 2012-06-19 23:11               ` Juan Quintela
  2012-06-19 23:28                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Juan Quintela @ 2012-06-19 23:11 UTC
  To: Benjamin Herrenschmidt; +Cc: Alexander Graf, qemu-devel@nongnu.org Developers

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Wed, 2012-06-20 at 00:55 +0200, Juan Quintela wrote:
>> 
>> This was going to be my question.
>> 
>> If we can do something like send the hash table and get a bitmap of the
>> entries that changed, we should be good.  Perhaps we need something
>> "interesting" like removing old entries (no clue if they just get
>> overwritten, or how they are replaced).
>
> Right, we can do an initial "snapshot" and have the kernel start
> tracking changes from there. On the final step, we can then request from
> the kernel a new snapshot with a dirty bitmap.
>
> Or we can be even simpler, just do two snapshots and have qemu diff them
> itself :-)
>
> I am confident I can come up with something as far as the kernel side
> and the qemu <-> kernel interface go. I need to get my head around the
> details of how to implement that two-stage save process in qemu though,
> and the corresponding restore, which will need to read both snapshots
> and apply the diff before shooting it back to the kernel.
>
> BTW, does migration in pure qemu (full system emu) work similarly, ie,
> two-stage ? If it does I can easily prototype everything there.

It does, but I have no clue how the hashed page tables are implemented
on ppc, i.e. if there is anything specific for bare metal.  Alex?

Later, Juan.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 23:11               ` Juan Quintela
@ 2012-06-19 23:28                 ` Benjamin Herrenschmidt
  2012-06-19 23:30                   ` Alexander Graf
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-19 23:28 UTC
  To: quintela; +Cc: Alexander Graf, qemu-devel@nongnu.org Developers

On Wed, 2012-06-20 at 01:11 +0200, Juan Quintela wrote:
> 
> > I am confident I can come up with something as far as the kernel side
> > and the qemu <-> kernel interface go. I need to get my head around the
> > details of how to implement that two-stage save process in qemu though,
> > and the corresponding restore, which will need to read both snapshots
> > and apply the diff before shooting it back to the kernel.
> >
> > BTW, does migration in pure qemu (full system emu) work similarly, ie,
> > two-stage ? If it does I can easily prototype everything there.
> 
> It does, but I have no clue how the hashed page tables are implemented
> on ppc, i.e. if there is anything specific for bare metal.  Alex?

We support the paravirtualized -M pseries in full emu as well, in which
case the hashed page table is handled by qemu itself, which implements
the H_ENTER & co hypercalls. So it's very similar, except that qemu
doesn't have to ask the kernel to get a snapshot :-)

So I can flesh out the storage format and two-stage process inside qemu,
and then bother with the kvm/kernel interface.

Normal "bare metal" operation in qemu (or even KVM "PR") doesn't require
this, as in that case the hash table is just a normal part of guest
memory; it's only an issue when running a paravirtualized guest such as
pseries (aka PAPR).

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 23:28                 ` Benjamin Herrenschmidt
@ 2012-06-19 23:30                   ` Alexander Graf
  2012-06-19 23:52                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Graf @ 2012-06-19 23:30 UTC
  To: Benjamin Herrenschmidt; +Cc: qemu-devel@nongnu.org Developers, quintela


On 20.06.2012, at 01:28, Benjamin Herrenschmidt wrote:

> On Wed, 2012-06-20 at 01:11 +0200, Juan Quintela wrote:
>> 
>>> I am confident I can come up with something as far as the kernel side
>>> and the qemu <-> kernel interface go. I need to get my head around the
>>> details of how to implement that two-stage save process in qemu though,
>>> and the corresponding restore, which will need to read both snapshots
>>> and apply the diff before shooting it back to the kernel.
>>> 
>>> BTW, does migration in pure qemu (full system emu) work similarly, ie,
>>> two-stage ? If it does I can easily prototype everything there.
>> 
>> It does, but I have no clue how the hashed page tables are implemented
>> on ppc, i.e. if there is anything specific for bare metal.  Alex?
> 
> We support the paravirtualized -M pseries in full emu as well, in which
> case the hashed page table is handled by qemu itself, which implements
> the H_ENTER & co hypercalls. So it's very similar, except that qemu
> doesn't have to ask the kernel to get a snapshot :-)
> 
> So I can flesh out the storage format and two-stage process inside qemu,
> and then bother with the kvm/kernel interface.
> 
> Normal "bare metal" operation in qemu (or even KVM "PR") doesn't require
> this, as in that case the hash table is just a normal part of guest
> memory; it's only an issue when running a paravirtualized guest such as
> pseries (aka PAPR).

IIRC we still allocate it outside of normal guest memory, so you don't get the migration for free :).


Alex


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 23:30                   ` Alexander Graf
@ 2012-06-19 23:52                     ` Benjamin Herrenschmidt
  2012-06-20  0:05                       ` Alexander Graf
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-19 23:52 UTC
  To: Alexander Graf; +Cc: qemu-devel@nongnu.org Developers, quintela

On Wed, 2012-06-20 at 01:30 +0200, Alexander Graf wrote:
> > We support the paravirtualized -M pseries in full emu as well, in which
> > case the hashed page table is handled by qemu itself, which implements
> > the H_ENTER & co hypercalls. So it's very similar, except that qemu
> > doesn't have to ask the kernel to get a snapshot :-)
> > 
> > So I can flesh out the storage format and two-stage process inside qemu,
> > and then bother with the kvm/kernel interface.
> > 
> > Normal "bare metal" operation in qemu (or even KVM "PR") doesn't require
> > this, as in that case the hash table is just a normal part of guest
> > memory; it's only an issue when running a paravirtualized guest such as
> > pseries (aka PAPR).
> 
> IIRC we still allocate it outside of normal guest memory, so you don't get the migration for free :).

You haven't read my post properly :-)

Cheers,
Ben.


* Re: [Qemu-devel] vm state save/restore question
  2012-06-19 23:52                     ` Benjamin Herrenschmidt
@ 2012-06-20  0:05                       ` Alexander Graf
  0 siblings, 0 replies; 20+ messages in thread
From: Alexander Graf @ 2012-06-20  0:05 UTC
  To: Benjamin Herrenschmidt; +Cc: qemu-devel@nongnu.org Developers, quintela


On 20.06.2012, at 01:52, Benjamin Herrenschmidt wrote:

> On Wed, 2012-06-20 at 01:30 +0200, Alexander Graf wrote:
>>> We support the paravirtualized -M pseries in full emu as well, in which
>>> case the hashed page table is handled by qemu itself, which implements
>>> the H_ENTER & co hypercalls. So it's very similar, except that qemu
>>> doesn't have to ask the kernel to get a snapshot :-)
>>> 
>>> So I can flesh out the storage format and two-stage process inside qemu,
>>> and then bother with the kvm/kernel interface.
>>> 
>>> Normal "bare metal" operation in qemu (or even KVM "PR") doesn't require
>>> this, as in that case the hash table is just a normal part of guest
>>> memory; it's only an issue when running a paravirtualized guest such as
>>> pseries (aka PAPR).
>> 
>> IIRC we still allocate it outside of normal guest memory, so you don't get the migration for free :).
> 
> You haven't read my post properly :-)

Ugh. Right. I haven't :). For non-pseries VMs, the HTAB is always in guest RAM, so it's migrated automatically. Only pseries is special in keeping it outside.


Alex

