* DomU crash during migration when suspending source domain
@ 2007-02-14  3:42 Graham, Simon
  2007-02-14 10:13 ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Graham, Simon @ 2007-02-14  3:42 UTC (permalink / raw)
  To: xen-devel

I've just run into an odd DomU crash while live-migrating a 4-VCPU domain (with 3.0.4, but the code looks the same to me in 2.6.18/unstable). The actual panic is attached at the end of this message, but the bottom line is that the code in cache_remove_shared_cpu_map (in arch/i386/kernel/cpu/intel_cacheinfo.c) attempts to clean up the cache info for a processor that does not yet have that info set up -- it dereferences a pointer in the cpuid4_info[] array, and looking at the dump I can see that this pointer is NULL.

My working theory is that we attempted the migration very early, before the array of cache info pointers had been initialized for all processors. It would be relatively easy to protect against this by checking for NULL, but I'm not sure whether that is the correct solution -- if anyone familiar with this code can comment on an appropriate fix, I'd be grateful.
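The NULL check I have in mind would, in miniature, look like this (a userspace sketch of the pattern only, not the actual kernel function -- the struct and field names are simplified stand-ins for intel_cacheinfo.c's types):

```c
#include <stdlib.h>

#define NR_CPUS 32

/* Simplified stand-in for the kernel's struct _cpuid4_info -- one entry
 * per cache leaf; only the field the crash touches is modelled here. */
struct cpuid4_info_entry {
	unsigned long shared_cpu_map;	/* bitmask of CPUs sharing this cache */
};

/* Per-CPU pointer array; NULL until cache info has been allocated for
 * that CPU (mirrors cpuid4_info[NR_CPUS] in intel_cacheinfo.c). */
struct cpuid4_info_entry *cpuid4_info[NR_CPUS];

/* Teardown with the proposed guard: skip CPUs whose cache info was never
 * set up. Returns 0 if the NULL guard fired, 1 if real work was done. */
int cache_remove_shared_cpu_map(unsigned int cpu, int index)
{
	struct cpuid4_info_entry *this_leaf;

	if (cpuid4_info[cpu] == NULL)	/* the proposed NULL check */
		return 0;

	this_leaf = &cpuid4_info[cpu][index];
	this_leaf->shared_cpu_map &= ~(1UL << cpu);
	return 1;
}
```

With this guard in place, the teardown path would simply skip the CPU whose info is missing instead of dereferencing NULL, which is exactly the crash below.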

One thing I'm really not sure about is the timing of marking the CPUs up relative to the 'Initializing CPU#n' trace (see console output below). I can see that all four VCPUs are set up in the cpu_sys_devices array (which is populated by the code that emits that trace), but the array of cache info structures only has an entry for VCPU 0:

crash> cpu_sys_devices
cpu_sys_devices = $3 =
 {0xc0464448, 0xc046448c, 0xc04644d0, 0xc0464514, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0}

crash> cpuid4_info
cpuid4_info = $4 =
 {0xc7971180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}

Any suggestions for appropriate fixes here?
Simon

--- console output ---

Enabling SMP...
Initializing CPU#3
Initializing CPU#2
Initializing CPU#1
eth0: no IPv6 routers present
Unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c010dd3a
0204a000 -> *pde = 00000001:0d8ec001
06a9c000 -> *pme = 00000000:00000000
Oops: 0000 [#1]
SMP 
Modules linked in: ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core binfmt_misc dm_mirror dm_mod bnx2 ata_piix libata mptscsih mptfc mptspi mptsas mptscsi scsi_mod mptbase
CPU:    0
EIP:    0061:[<c010dd3a>]    Tainted: GF    VLI
EFLAGS: 00010202  (2.6.16.29-xen #1) 
EIP is at cache_remove_shared_cpu_map+0x1a/0x90
eax: 00000000  ebx: 00000001  ecx: 00000001  edx: 00000000
esi: 00000000  edi: 00000010  ebp: c3913f14  esp: c3913f08
ds: 007b  es: 007b  ss: 0069
Process suspend (pid: 4038, threadinfo=c3912000 task=c2244570)
Stack: <0>00000001 00000001 00000000 c3913f28 c010e3ba 00000007 00000001 00000007 
      c3913f34 c010e425 c03bd804 c3913f48 c012fae8 ffffffea 00000001 c568c570 
      c3913f7c c013b889 c3913fc0 00000002 00000001 00000000 00000003 00000000 
Call Trace:
 [<c0105401>] show_stack_log_lvl+0xa1/0xe0
 [<c01055f1>] show_registers+0x181/0x200
 [<c0105810>] die+0x100/0x1a0
 [<c01156f6>] do_page_fault+0x3c6/0x8b1
 [<c0105067>] error_code+0x2b/0x30
 [<c010e3ba>] cache_remove_dev+0x2a/0x60
 [<c010e425>] cacheinfo_cpu_callback+0x35/0x40
 [<c012fae8>] notifier_call_chain+0x18/0x40
 [<c013b889>] cpu_down+0x139/0x260
 [<c028bc9f>] smp_suspend+0x7f/0x100
 [<c028ca80>] __do_suspend+0x40/0x180
 [<c0136a06>] kthread+0x96/0xe0
 [<c0102e95>] kernel_thread_helper+0x5/0x10
Code: 0c 5b 5e 5f 5d c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 89 d6 53 89 c3 8d 04 92 8b 14 9d 20 4d 46 c0 8d 04 82 8d 78 10 <8b> 40 10 ba 20 00 00 00 85 c0 74 03 0f bc d0 83 fa 21 b9 20 00 

-and-

crash> bt
PID: 4038   TASK: c2244570  CPU: 0   COMMAND: "suspend"
 #0 [c3913ddc] xen_panic_event at c010a527
 #1 [c3913df8] notifier_call_chain at c012fae6
 #2 [c3913e0c] panic at c0120b16
 #3 [c3913e20] die at c0105866
 #4 [c3913e6c] do_page_fault at c01156f1
 #5 [c3913ed0] error_code (via page_fault) at c0105065
    EAX: 00000000  EBX: 00000001  ECX: 00000001  EDX: 00000000  EBP: c3913f14
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000010
    CS:  0061      EIP: c010dd3a  ERR: ffffffff  EFLAGS: 00010202
 #6 [c3913f04] cache_remove_shared_cpu_map at c010dd3a
 #7 [c3913f18] cache_remove_dev at c010e3b5
 #8 [c3913f2c] cacheinfo_cpu_callback at c010e420
 #9 [c3913f38] notifier_call_chain at c012fae6
#10 [c3913f4c] cpu_down at c013b884
#11 [c3913f80] smp_suspend at c028bc9a
#12 [c3913f98] __do_suspend at c028ca7b
#13 [c3913fc4] kthread at c0136a03
#14 [c3913fe8] kernel_thread_helper at c0102e93
crash>


* Re: DomU crash during migration when suspending source domain
  2007-02-14  3:42 DomU crash during migration when suspending source domain Graham, Simon
@ 2007-02-14 10:13 ` Keir Fraser
  2007-02-14 10:36   ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2007-02-14 10:13 UTC (permalink / raw)
  To: Graham, Simon, xen-devel

Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
hotplug issue in Linux, but we did do lots of testing of that...

 -- Keir

On 14/2/07 03:42, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> Just run into an odd DomU crash doing live migration of a 4-VCPU domain (with
> 3.0.4 but the code looks the same in 2.6.18/unstable to me) - the actual panic
> is attached at the end of this, but the bottom line is that the code in
> cache_remove_shared_cpu_map (in arch/i385/kernel/cpu/intel_cacheinfo.c) is
> attempting to clean up the cache info for a processor that does not yet have
> this info setup - the code is dereferencing a pointer in the cpuid4_info[]
> array and looking at the dump I can see that this is NULL.
> 
> My working theory here is that we attempted the migration waaay early and the
> initialization of the array of cache info pointers was not setup for all
> processors yet; it would be relatively easy to protect against this by
> checking for NULL, but I'm not sure if this is the correct solution or not --
> if anyone is familiar with this code and can comment on an appropriate fix I'd
> be grateful.
> 
> One thing I'm really not sure about is the timing of marking the CPUs up with
> respect to the trace re initializing CPUs (see console output below) -- I can
> see that the four VCPUs are setup in the cpu_sys_devices array (which is setup
> by the code that outputs the 'Initializing CPU#n' trace) but the array of
> cache info structures only has an entry for VCPU 0:


* Re: DomU crash during migration when suspending source domain
  2007-02-14 10:13 ` Keir Fraser
@ 2007-02-14 10:36   ` Keir Fraser
  2007-02-14 10:48     ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2007-02-14 10:36 UTC (permalink / raw)
  To: Graham, Simon, xen-devel

Your theory that the cpu_down() is happening too early sounds plausible,
except that cpu_up() and cpu_down() are both entirely protected by the
hotplug lock. See their definitions in kernel/cpu.c.

The notifier calls of interest are CPU_ONLINE and CPU_DEAD. These are the
events that the cacheinfo code cares about. You can see that both
notifications are broadcast under the cpu_hotplug_lock, so there should be
no race possible in which a CPU starts to be taken down before all
notification work associated with it coming online has completed.
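That lock discipline can be sketched as a userspace analogue (the mutex stands in for cpu_hotplug_lock; this illustrates the ordering guarantee only and is not the kernel/cpu.c code):

```c
#include <pthread.h>

enum cpu_event { CPU_ONLINE, CPU_DEAD };	/* stand-ins for notifier events */

static pthread_mutex_t cpu_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static int cacheinfo_ready[8];			/* set by the ONLINE handler */

/* The cacheinfo callback: sets up state on CPU_ONLINE, tears it down on
 * CPU_DEAD -- always invoked with the hotplug lock held. */
static void cacheinfo_cpu_callback(enum cpu_event ev, unsigned int cpu)
{
	cacheinfo_ready[cpu] = (ev == CPU_ONLINE);
}

/* Both paths broadcast their notification while holding the lock, so a
 * CPU_DEAD for a CPU can never overtake the CPU_ONLINE work for it. */
int cpu_up(unsigned int cpu)
{
	pthread_mutex_lock(&cpu_hotplug_lock);
	cacheinfo_cpu_callback(CPU_ONLINE, cpu);
	pthread_mutex_unlock(&cpu_hotplug_lock);
	return 0;
}

int cpu_down(unsigned int cpu)
{
	int ret;

	pthread_mutex_lock(&cpu_hotplug_lock);
	/* Under this discipline, a CPU whose cpu_up() has returned must
	 * have its cacheinfo set up; -1 flags the "impossible" case. */
	ret = cacheinfo_ready[cpu] ? 0 : -1;
	cacheinfo_cpu_callback(CPU_DEAD, cpu);
	pthread_mutex_unlock(&cpu_hotplug_lock);
	return ret;
}
```

So if the ONLINE-side allocation had simply not run yet, the lock would have prevented the crash; the NULL entries must come from somewhere else.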

 -- Keir

On 14/2/07 10:13, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
> hotplug issue in Linux, but we did so lots of testing of that...
> 
>  -- Keir
> 
> On 14/2/07 03:42, "Graham, Simon" <Simon.Graham@stratus.com> wrote:
> 
>> Just run into an odd DomU crash doing live migration of a 4-VCPU domain (with
>> 3.0.4 but the code looks the same in 2.6.18/unstable to me) - the actual
>> panic
>> is attached at the end of this, but the bottom line is that the code in
>> cache_remove_shared_cpu_map (in arch/i385/kernel/cpu/intel_cacheinfo.c) is
>> attempting to clean up the cache info for a processor that does not yet have
>> this info setup - the code is dereferencing a pointer in the cpuid4_info[]
>> array and looking at the dump I can see that this is NULL.
>> 
>> My working theory here is that we attempted the migration waaay early and the
>> initialization of the array of cache info pointers was not setup for all
>> processors yet; it would be relatively easy to protect against this by
>> checking for NULL, but I'm not sure if this is the correct solution or not --
>> if anyone is familiar with this code and can comment on an appropriate fix
>> I'd
>> be grateful.
>> 
>> One thing I'm really not sure about is the timing of marking the CPUs up with
>> respect to the trace re initializing CPUs (see console output below) -- I can
>> see that the four VCPUs are setup in the cpu_sys_devices array (which is
>> setup
>> by the code that outputs the 'Initializing CPU#n' trace) but the array of
>> cache info structures only has an entry for VCPU 0:
> 


* Re: DomU crash during migration when suspending source domain
  2007-02-14 10:36   ` Keir Fraser
@ 2007-02-14 10:48     ` Keir Fraser
  0 siblings, 0 replies; 10+ messages in thread
From: Keir Fraser @ 2007-02-14 10:48 UTC (permalink / raw)
  To: Keir Fraser, Graham, Simon, xen-devel


Are you migrating between unlike boxes? My guess is that the original box
has processors supporting the cacheinfo cpuid leaves and the target box does
not. Migrating to older, less capable CPUs is definitely hit-and-miss, I'm
afraid. It really is best not to do it!
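For concreteness, "supporting the cacheinfo cpuid leaves" means CPUID leaf 4 enumerates deterministic cache parameters: the driver counts subleaves until the cache-type field (EAX bits 4:0) reads 0. A sketch with mocked CPUID data (the EAX values below are illustrative, not taken from real processors):

```c
/* Mocked EAX values for successive CPUID.4 subleaves; the low 5 bits
 * encode the cache type (1 = data, 2 = instruction, 3 = unified,
 * 0 = no more caches). */
static const unsigned int newer_cpu_leaf4[] = { 0x21, 0x22, 0x43, 0x00 };
static const unsigned int older_cpu_leaf4[] = { 0x00 }; /* leaf 4 unsupported */

/* Analogue of the driver's leaf-counting loop: count subleaves until a
 * cache type of 0 is seen. */
int count_cache_leaves(const unsigned int *eax_by_subleaf)
{
	int i = 0;

	while ((eax_by_subleaf[i] & 0x1f) != 0)
		i++;
	return i;
}
```

If the leaf count was computed on the source box and the resumed guest then sees CPUs that report no leaf-4 data, the allocation and teardown bookkeeping no longer match -- which would be consistent with the NULL cpuid4_info[] entries in the dump.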

 -- Keir

On 14/2/07 10:36, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

> Your theory that the cpu_down() is happening too early sounds plausible
> except that cpu_up/cpu_down are both entirely protected by the hotplug lock.
> See their definitions in kernel/cpu.c.
> 
> The notifier calls of interest are CPU_ONLINE and CPU_DEAD. These are the
> events that the cacheinfo code cares about. You can see that both
> notifications are broadcast under the cpu_hotplug_lock, so there should be
> no race possible in which a CPU starts to be taken down before all
> notification work associated with it coming online has completed.
> 
>  -- Keir
> 
> On 14/2/07 10:13, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:
> 
>> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
>> hotplug issue in Linux, but we did so lots of testing of that...
>> 
>>  -- Keir
>> 
>> On 14/2/07 03:42, "Graham, Simon" <Simon.Graham@stratus.com> wrote:
>> 
>>> Just run into an odd DomU crash doing live migration of a 4-VCPU domain
>>> (with
>>> 3.0.4 but the code looks the same in 2.6.18/unstable to me) - the actual
>>> panic
>>> is attached at the end of this, but the bottom line is that the code in
>>> cache_remove_shared_cpu_map (in arch/i385/kernel/cpu/intel_cacheinfo.c) is
>>> attempting to clean up the cache info for a processor that does not yet have
>>> this info setup - the code is dereferencing a pointer in the cpuid4_info[]
>>> array and looking at the dump I can see that this is NULL.
>>> 
>>> My working theory here is that we attempted the migration waaay early and
>>> the
>>> initialization of the array of cache info pointers was not setup for all
>>> processors yet; it would be relatively easy to protect against this by
>>> checking for NULL, but I'm not sure if this is the correct solution or not
>>> --
>>> if anyone is familiar with this code and can comment on an appropriate fix
>>> I'd
>>> be grateful.
>>> 
>>> One thing I'm really not sure about is the timing of marking the CPUs up
>>> with
>>> respect to the trace re initializing CPUs (see console output below) -- I
>>> can
>>> see that the four VCPUs are setup in the cpu_sys_devices array (which is
>>> setup
>>> by the code that outputs the 'Initializing CPU#n' trace) but the array of
>>> cache info structures only has an entry for VCPU 0:


* Re: DomU crash during migration when suspending source domain
  2007-02-15 12:45 Graham, Simon
  2007-02-15 13:22 ` Keir Fraser
@ 2007-02-15 19:52 ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 10+ messages in thread
From: Jeremy Fitzhardinge @ 2007-02-15 19:52 UTC (permalink / raw)
  To: Graham, Simon; +Cc: xen-devel, Keir Fraser

Graham, Simon wrote:
> You have a better idea of that than I do - however, it seemed to me that
> a complete fix for the cases where allocation failed would be more
> likely to be accepted than something that only fixes one place.
>
> Another thing to consider -- hitting allocation failures is MUCH more
> likely in the Xen environment when live migration is in use; how often
> is a processor taken down and then restarted outside of live migration?
> However, since live migration can occur when the domain has been up for
> a long time, it's much more likely for the allocation to fail (although
> presumably still somewhat low)...
>
> Excuse my ignorance but what's the process for pushing something like
> this upstream?

In addition to all the things Keir mentioned, the other important thing
is to make sure that the comments at the top of the patch explain 1) what
problem you're trying to solve and 2) how the patch solves it. A good,
concise patch subject line helps as well.
See http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt for all the
details.

It seems to me that this patch should be pretty easy to get accepted,
since it is local, fixes a pretty obvious bug, and is not Xen-specific.
But it may need some stylistic attention: for one, kernel style tries to
avoid deeply nested if() statements, and an early return or a goto to a
common exit point might be more appropriate.
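To illustrate the style point: instead of wrapping the whole function body in if (this_leaf) { ... } as the posted patch does, an early return keeps the common path flat. A simplified, self-contained sketch of the shape (not the real function; types and names are stand-ins):

```c
#include <stdlib.h>

struct leaf { unsigned long shared_cpu_map; };
static struct leaf *info[32];	/* NULL until allocated, as in the driver */

int shared_cpu_map_setup(unsigned int cpu, int index)
{
	struct leaf *this_leaf;

	if (info[cpu] == NULL)
		return -1;	/* early return: no extra indent level below */

	this_leaf = &info[cpu][index];
	this_leaf->shared_cpu_map |= 1UL << cpu;
	return 0;
}
```

The behavior is identical to the nested version; the whole main body just stays at one indentation level, which is what reviewers tend to ask for.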

Andi Kleen <ak@muc.de> is probably the best person to send it to (as
well as cc:ing it linux-kernel).

    J


* Re: DomU crash during migration when suspending source domain
  2007-02-15 12:45 Graham, Simon
@ 2007-02-15 13:22 ` Keir Fraser
  2007-02-15 19:52 ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 10+ messages in thread
From: Keir Fraser @ 2007-02-15 13:22 UTC (permalink / raw)
  To: Graham, Simon, Keir Fraser, xen-devel




On 15/2/07 12:45, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> Excuse my ignorance but what's the process for pushing something like
> this upstream?

It can be a bit of a pain: you need to port the patch to at least the latest
git-snapshot from kernel.org. Quite often you're lucky and the surrounding
code hasn't changed much. Then you post it to the linux-kernel mailing list,
perhaps cc'ing some likely interested parties (a couple of the email
addresses from the top of the file you've modified would make sense). Then
sit back and hope for some responses, or for the patch to be picked up.
Make sure your email subject starts with [PATCH], or you'll likely be
ignored.

 -- Keir


* RE: DomU crash during migration when suspending source domain
@ 2007-02-15 12:45 Graham, Simon
  2007-02-15 13:22 ` Keir Fraser
  2007-02-15 19:52 ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 10+ messages in thread
From: Graham, Simon @ 2007-02-15 12:45 UTC (permalink / raw)
  To: Keir Fraser, Keir Fraser, xen-devel

> The question is whether it has a chance of being accepted upstream. Since
> it's rather bigger than the one-liner we originally hoped for, we don't want
> to carry it if it has no chance of acceptance.
> 

You have a better idea of that than I do - however, it seemed to me that
a complete fix for the cases where allocation failed would be more
likely to be accepted than something that only fixes one place.

Another thing to consider -- hitting allocation failures is MUCH more
likely in the Xen environment when live migration is in use; how often
is a processor taken down and then restarted outside of live migration?
And since live migration can occur after the domain has been up for a
long time, an allocation failure at that point is much more likely
(although the absolute probability is presumably still fairly low)...

Excuse my ignorance but what's the process for pushing something like
this upstream?
/simgr


* Re: DomU crash during migration when suspending source domain
  2007-02-15  2:47 Graham, Simon
@ 2007-02-15  8:12 ` Keir Fraser
  0 siblings, 0 replies; 10+ messages in thread
From: Keir Fraser @ 2007-02-15  8:12 UTC (permalink / raw)
  To: Graham, Simon, Keir Fraser, xen-devel

On 15/2/07 02:47, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> Attached is a patch to intel_cacheinfo-xen.c in unstable -- I basically
> found everywhere that you might have a NULL due to allocation failure
> and tested for it; I've run this a little and confirmed that it solves
> the original crash migrating back and forth between the systems with
> different processor families. Will run more extensive regressions on it
> tonight and tomorrow, but I thought I'd send it now for review; will
> resend with signed-off-by and comment if you think it is OK.

The question is whether it has a chance of being accepted upstream. Since
it's rather bigger than the one-liner we originally hoped for, we don't want
to carry it if it has no chance of acceptance.

 -- Keir


* RE: DomU crash during migration when suspending source domain
@ 2007-02-15  2:47 Graham, Simon
  2007-02-15  8:12 ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Graham, Simon @ 2007-02-15  2:47 UTC (permalink / raw)
  To: Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1314 bytes --]

> A broader set of fixes might get accepted upstream because cache_add_dev()
> can fail for other reasons too (at least out-of-memory) and any such failure
> will cause cache_remove_dev() to barf. But it's not such a simple thing to
> fix and it does not solve the general problem for us.

OK -- after some work today, I have a patch that would perhaps be
acceptable upstream -- I started off looking into ways of ensuring the
information on the cache got reset properly after migration but quickly
decided that this was not going to be at all easy (plus it wouldn't
address the larger issues of processor compatibility), so instead I
concentrated on handling all the cases where pointers might be NULL
because of an allocation failure (which also happens to fix the crash we
see here).

Attached is a patch to intel_cacheinfo-xen.c in unstable -- I basically
found everywhere that a pointer might be NULL due to an allocation failure
and tested for it. I've run this a little and confirmed that it fixes
the original crash when migrating back and forth between the systems with
different processor families. I'll run more extensive regressions on it
tonight and tomorrow, but I thought I'd send it now for review; I'll
resend with a Signed-off-by line and a comment if you think it is OK.

Simon


[-- Attachment #2: xen-unstable-intel-cachinfo.patch --]
[-- Type: application/octet-stream, Size: 4888 bytes --]

Index: trunk/linux-2.6-xen-sparse/arch/i386/kernel/cpu/intel_cacheinfo-xen.c
===================================================================
--- trunk/linux-2.6-xen-sparse/arch/i386/kernel/cpu/intel_cacheinfo-xen.c	(revision 10421)
+++ trunk/linux-2.6-xen-sparse/arch/i386/kernel/cpu/intel_cacheinfo-xen.c	(working copy)
@@ -418,7 +418,7 @@
 
 /* pointer to _cpuid4_info array (for each cache leaf) */
 static struct _cpuid4_info *cpuid4_info[NR_CPUS];
-#define CPUID4_INFO_IDX(x,y)    (&((cpuid4_info[x])[y]))
+#define CPUID4_INFO_IDX(x,y)    (cpuid4_info[x]?(&((cpuid4_info[x])[y])):NULL)
 
 #ifdef CONFIG_SMP
 static void __cpuinit cache_shared_cpu_map_setup(unsigned int cpu, int index)
@@ -429,20 +429,23 @@
 	struct cpuinfo_x86 *c = cpu_data;
 
 	this_leaf = CPUID4_INFO_IDX(cpu, index);
-	num_threads_sharing = 1 + this_leaf->eax.split.num_threads_sharing;
+	if (this_leaf) {
+		num_threads_sharing = 1 + this_leaf->eax.split.num_threads_sharing;
 
-	if (num_threads_sharing == 1)
-		cpu_set(cpu, this_leaf->shared_cpu_map);
-	else {
-		index_msb = get_count_order(num_threads_sharing);
+		if (num_threads_sharing == 1)
+			cpu_set(cpu, this_leaf->shared_cpu_map);
+		else {
+			index_msb = get_count_order(num_threads_sharing);
 
-		for_each_online_cpu(i) {
-			if (c[i].apicid >> index_msb ==
-			    c[cpu].apicid >> index_msb) {
-				cpu_set(i, this_leaf->shared_cpu_map);
-				if (i != cpu && cpuid4_info[i])  {
-					sibling_leaf = CPUID4_INFO_IDX(i, index);
-					cpu_set(cpu, sibling_leaf->shared_cpu_map);
+			for_each_online_cpu(i) {
+				if (c[i].apicid >> index_msb ==
+				    c[cpu].apicid >> index_msb) {
+					cpu_set(i, this_leaf->shared_cpu_map);
+					if (i != cpu && cpuid4_info[i])  {
+						sibling_leaf = CPUID4_INFO_IDX(i, index);
+						if (sibling_leaf)
+							cpu_set(cpu, sibling_leaf->shared_cpu_map);
+					}
 				}
 			}
 		}
@@ -454,9 +457,11 @@
 	int sibling;
 
 	this_leaf = CPUID4_INFO_IDX(cpu, index);
-	for_each_cpu_mask(sibling, this_leaf->shared_cpu_map) {
-		sibling_leaf = CPUID4_INFO_IDX(sibling, index);	
-		cpu_clear(cpu, sibling_leaf->shared_cpu_map);
+	if (this_leaf) {
+		for_each_cpu_mask(sibling, this_leaf->shared_cpu_map) {
+			sibling_leaf = CPUID4_INFO_IDX(sibling, index);	
+			cpu_clear(cpu, sibling_leaf->shared_cpu_map);
+		}
 	}
 }
 #else
@@ -496,10 +501,12 @@
 	retval = 0;
 	for (j = 0; j < num_cache_leaves; j++) {
 		this_leaf = CPUID4_INFO_IDX(cpu, j);
-		retval = cpuid4_cache_lookup(j, this_leaf);
-		if (unlikely(retval < 0))
-			break;
-		cache_shared_cpu_map_setup(cpu, j);
+		if (this_leaf) {
+			retval = cpuid4_cache_lookup(j, this_leaf);
+			if (unlikely(retval < 0))
+				break;
+			cache_shared_cpu_map_setup(cpu, j);
+		}
 	}
 	set_cpus_allowed(current, oldmask);
 
@@ -527,7 +534,7 @@
 
 /* pointer to array of kobjects for cpuX/cache/indexY */
 static struct _index_kobject *index_kobject[NR_CPUS];
-#define INDEX_KOBJECT_PTR(x,y)    (&((index_kobject[x])[y]))
+#define INDEX_KOBJECT_PTR(x,y)    (index_kobject[x]?(&((index_kobject[x])[y])):NULL)
 
 #define show_one_plus(file_name, object, val)				\
 static ssize_t show_##file_name						\
@@ -608,12 +615,14 @@
 static ssize_t show(struct kobject * kobj, struct attribute * attr, char * buf)
 {
 	struct _cache_attr *fattr = to_attr(attr);
-	struct _index_kobject *this_leaf = to_object(kobj);
+	struct _index_kobject *this_leaf_obj = to_object(kobj);
+	struct _cpuid4_info *this_leaf = CPUID4_INFO_IDX(this_leaf_obj->cpu,
+							 this_leaf_obj->index);
+
 	ssize_t ret;
 
-	ret = fattr->show ?
-		fattr->show(CPUID4_INFO_IDX(this_leaf->cpu, this_leaf->index),
-			buf) :
+	ret = fattr->show && this_leaf?
+		fattr->show(this_leaf,buf) :
 	       	0;
 	return ret;
 }
@@ -696,6 +705,9 @@
 
 	for (i = 0; i < num_cache_leaves; i++) {
 		this_object = INDEX_KOBJECT_PTR(cpu,i);
+		if (!this_object)
+			continue;
+
 		this_object->cpu = cpu;
 		this_object->index = i;
 		this_object->kobj.parent = cache_kobject[cpu];
@@ -704,8 +716,10 @@
 		retval = kobject_register(&(this_object->kobj));
 		if (unlikely(retval)) {
 			for (j = 0; j < i; j++) {
-				kobject_unregister(
-					&(INDEX_KOBJECT_PTR(cpu,j)->kobj));
+				struct _index_kobject *robj = 
+					INDEX_KOBJECT_PTR(cpu,j);
+				if (robj)
+					kobject_unregister(&robj->kobj);
 			}
 			kobject_unregister(cache_kobject[cpu]);
 			cpuid4_cache_sysfs_exit(cpu);
@@ -719,12 +733,16 @@
 {
 	unsigned int cpu = sys_dev->id;
 	unsigned long i;
+	struct _index_kobject *this_object;
 
 	for (i = 0; i < num_cache_leaves; i++) {
 		cache_remove_shared_cpu_map(cpu, i);
-		kobject_unregister(&(INDEX_KOBJECT_PTR(cpu,i)->kobj));
+		this_object = INDEX_KOBJECT_PTR(cpu,i);
+		if (this_object)
+			kobject_unregister(&this_object->kobj);
 	}
-	kobject_unregister(cache_kobject[cpu]);
+	if (cache_kobject[cpu])
+		kobject_unregister(cache_kobject[cpu]);
 	cpuid4_cache_sysfs_exit(cpu);
 	return;
 }

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* RE: DomU crash during migration when suspending source domain
@ 2007-02-14 13:02 Graham, Simon
  0 siblings, 0 replies; 10+ messages in thread
From: Graham, Simon @ 2007-02-14 13:02 UTC (permalink / raw)
  To: Keir Fraser, xen-devel

> 
> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
> hotplug issue in Linux, but we did do lots of testing of that...
> 
>  -- Keir
>

Yes -- in fact, we're using the 3.0.4 kernel for both Dom0 and DomU.
I have some more info from failures last night:

1. I was wrong; it isn't an overly early migration -- it's a second
   migration. It seems that the first migration leaves the system in a
   bad state.
2. It only happens on specific machines -- interestingly, in every case
   we migrate between dual-core Xeon and quad-core E5310 systems. These
   systems have different cache hierarchies, I think; maybe the CPU
   hotplug code can't handle a change in processor type?

Simon


end of thread [~2007-02-15 19:52 UTC | newest]

Thread overview: 10+ messages
2007-02-14  3:42 DomU crash during migration when suspending source domain Graham, Simon
2007-02-14 10:13 ` Keir Fraser
2007-02-14 10:36   ` Keir Fraser
2007-02-14 10:48     ` Keir Fraser
2007-02-14 13:02 Graham, Simon
2007-02-15  2:47 Graham, Simon
2007-02-15  8:12 ` Keir Fraser
2007-02-15 12:45 Graham, Simon
2007-02-15 13:22 ` Keir Fraser
2007-02-15 19:52 ` Jeremy Fitzhardinge
