* RE: DomU crash during migration when suspending source domain
From: Graham, Simon @ 2007-02-14 14:43 UTC
  To: Keir Fraser, xen-devel


> In general we *cannot* expect to support CPUs with different features in
> CPUID. We plan to fix this in two ways:
>  1. Allow a guest to be given a restricted CPUID view (e.g., with features
> masked out, or cacheinfo leaves missing).

Do you plan to do this for PV domains as well as HVM?

>  2. Where a guest has been exposed to extended features and leaves, prevent
> it from being migrated to a less-capable CPU.
>

I guess I'm not sure I fully understand -- since we hot remove all the
processors (except one, which I guess is an issue) and then hot add them
again after migration, you would think it would be OK to hot add a
completely different processor. Of course there will be issues with the
Linux code, given that you can't actually test this on a non-virtualized
system.

> A further option (3) for cache info might be to fake out the leaves for CPUs
> that do not support them. But I'm not sure whether, for example, this would
> be compatible with AMD's CPUID instruction.
> 

Agreed.

> This issue is hardly specific to HA/FT. You can safely build yourself a
> HA/FT cluster out of homogeneous hardware. Building it out of odds and ends
> you have already is going to be hard or impossible to guarantee safety of in
> general. I don't believe anyone sells or supports software to allow you to
> do this, and there's a reason for that.

You misunderstand my point -- in an FT environment, you MUST be able to
upgrade and repair hardware without taking the domain down. Clearly
this would normally be to a system of equivalent or higher functionality,
but we can't guarantee that there won't be a spiffy new processor that
causes this same issue to arise, or that we won't run into some similar
issue when replacing faulty hardware (the original system might no
longer be available, for example).

Simon

* RE: DomU crash during migration when suspending source domain
From: Graham, Simon @ 2007-02-14 15:08 UTC
  To: Keir Fraser, xen-devel

> In this particular case it is quite arguable that
> cache_remove_shared_cpu_map() should check cpuid4_info[i]!=NULL, just as
> done in cache_shared_cpu_map_setup(). I can make this fix in our tree but
> something similar ought to be submitted upstream too. I'm pretty certain
> that this will fix your crash.
> 

Let me try that out here and get back to you -- I can submit a patch
with this specific fix if it solves the problem.
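
For illustration only, the kind of NULL guard being discussed might look
like the following user-space sketch. It is not the kernel code and not a
tested patch: the struct layout, the NR_CPUS and NUM_LEAVES values, and
every name ending in _sketch are assumptions made up for this example,
loosely modeled on the function and array names quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS    4   /* hypothetical small CPU count for the sketch */
#define NUM_LEAVES 3   /* hypothetical number of cache leaves per CPU */

/* Hypothetical stand-in for the per-leaf cache info record. */
struct cpuid4_info_sketch {
    unsigned long shared_cpu_map;   /* bit n set: CPU n shares this cache */
};

/* Per-CPU pointer; NULL means no cache info was ever set up for that CPU. */
static struct cpuid4_info_sketch *cpuid4_info[NR_CPUS];

/*
 * Remove `cpu` from the shared map of every sibling at cache leaf `index`.
 * Siblings (and the CPU itself) whose info pointer is NULL are skipped
 * instead of being dereferenced. Returns the number of maps updated.
 */
static int cache_remove_shared_cpu_map_sketch(unsigned int cpu, int index)
{
    int touched = 0;

    assert(cpu < NR_CPUS && index >= 0 && index < NUM_LEAVES);

    if (cpuid4_info[cpu] == NULL)
        return 0;                       /* nothing to tear down */

    /* Copy the map first, since the loop may clear bits in it. */
    unsigned long map = cpuid4_info[cpu][index].shared_cpu_map;

    for (unsigned int sib = 0; sib < NR_CPUS; sib++) {
        if (!(map & (1UL << sib)))
            continue;                   /* not a sibling at this leaf */
        if (cpuid4_info[sib] == NULL)
            continue;                   /* the guard under discussion */
        cpuid4_info[sib][index].shared_cpu_map &= ~(1UL << cpu);
        touched++;
    }
    return touched;
}
```

The point of the sketch is only that teardown tolerates a sibling whose
cache info was never allocated, which is the situation suspected after
migrating to a CPU without the cacheinfo leaves.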

Since, as you say, this is just one aspect of dealing with hot plugging
completely different processors, I somehow feel that a point fix like
this wouldn't be accepted upstream and that we'd instead need to think
about a more complete solution (if, indeed, this is feasible).

> 
> Upgrading upwards actually tends to be okay. I can't think of any practical
> examples of how that might fail. After all, worst case we can hide the extra
> features from the guest since we have some control over CPUID. *Downgrading*
> is the problem!

Understood... I can conceive of cases where this would not be true, but
I agree that Intel/AMD usually do a good job of ensuring backward
compatibility so we could hide the newer features until all systems have
the newer processors in place and you reboot the domains.
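
As a rough sketch of the two ideas in this exchange (hiding newer features
until every host has them, and refusing to migrate a guest to a
less-capable CPU), feature sets can be treated as bit masks. This is not
Xen code: the function names are hypothetical, and a single 32-bit word
stands in for what would really be several CPUID registers and leaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Features that every host in the pool supports: the bitwise AND of all
 * per-host feature words. Showing a guest only this mask hides newer
 * features until all hosts have been upgraded.
 */
static uint32_t common_feature_mask(const uint32_t *host_features, size_t n)
{
    assert(host_features != NULL && n > 0);

    uint32_t mask = host_features[0];
    for (size_t i = 1; i < n; i++)
        mask &= host_features[i];
    return mask;
}

/*
 * A migration is treated as safe only if every feature the guest has been
 * shown is also present on the target CPU. Returns 1 if safe, 0 otherwise.
 */
static int migration_is_safe(uint32_t guest_visible, uint32_t target_features)
{
    return (guest_visible & ~target_features) == 0;
}
```

With hosts reporting 0xF and 0x7, the common mask is 0x7; a guest that was
shown 0xF could not safely move to the 0x7 host, while a guest shown 0x7
could run on either.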

Simon

* RE: DomU crash during migration when suspending source domain
From: Graham, Simon @ 2007-02-14 13:57 UTC
  To: Keir Fraser, xen-devel

> Are you migrating between unlike boxes? My guess is that the original box
> has processors supporting cacheinfo cpuid leaves and the target box does
> not. Migrating to older less-capable CPUs is definitely hit-and-miss I'm
> afraid. It really is best not to do it!
>

I think this is indeed what is happening -- supporting this is kind of
important for HA/FT - you need to be able to keep the domains running
when upgrading/replacing hardware.

I guess I'm still a tad confused, but presumably the CPU_DEAD processing
is not completely uninitializing the cache info. It seems to me that if
it discarded the cache info and NULLed the pointer in the CPU_DEAD
processing, then the info should get recreated when the CPU_ONLINE is
done, so presumably there is some path where this is not done when it
should be.
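
To make that expectation concrete, here is a minimal user-space sketch of
the lifecycle described above. It is only a model of the reasoning, not
the kernel's hotplug code: the struct, the array, and both functions are
hypothetical, and the num_leaves argument stands in for whatever the newly
onlined CPU reports about its cache leaves.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4   /* hypothetical small CPU count for the sketch */

/* Hypothetical per-CPU cache info record. */
struct cache_info_sketch {
    int num_leaves;
};

static struct cache_info_sketch *cache_info[NR_CPUS];

/* CPU_DEAD side: discard the info and NULL the pointer. */
static void cpu_dead_sketch(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    free(cache_info[cpu]);      /* free(NULL) is a harmless no-op */
    cache_info[cpu] = NULL;
}

/*
 * CPU_ONLINE side: recreate the info if the CPU reports cache leaves.
 * Returns 0 on success, -1 if nothing was set up. In the -1 case the
 * pointer stays NULL, so any later teardown must tolerate NULL.
 */
static int cpu_online_sketch(unsigned int cpu, int num_leaves)
{
    assert(cpu < NR_CPUS);

    if (cache_info[cpu] != NULL)
        return 0;               /* already initialized */
    if (num_leaves <= 0)
        return -1;              /* CPU has no cacheinfo leaves */

    cache_info[cpu] = malloc(sizeof(*cache_info[cpu]));
    if (cache_info[cpu] == NULL)
        return -1;              /* allocation failed */
    cache_info[cpu]->num_leaves = num_leaves;
    return 0;
}
```

In this model, a CPU that comes back online without cache leaves is left
with a NULL pointer, which is one way a later CPU_DEAD path could end up
touching info that was never recreated.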

I'll do some more digging and get back with a proposed fix.
Simon
 

