DomU crash during migration when suspending source domain

From: Graham, Simon @ 2007-02-14  3:42 UTC
  To: xen-devel

I just ran into an odd DomU crash while live-migrating a 4-VCPU domain (on 3.0.4, but the code looks the same in 2.6.18/unstable to me). The actual panic is attached at the end of this mail, but the bottom line is that cache_remove_shared_cpu_map (in arch/i386/kernel/cpu/intel_cacheinfo.c) is attempting to clean up the cache info for a processor that does not yet have this info set up: the code dereferences a pointer in the cpuid4_info[] array, and looking at the dump I can see that this pointer is NULL.

My working theory is that we attempted the migration very early, before the array of cache info pointers had been initialized for all processors. It would be relatively easy to protect against this by checking for NULL, but I'm not sure whether that is the correct solution. If anyone is familiar with this code and can comment on an appropriate fix, I'd be grateful.
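
To make the NULL-check idea concrete, here is a minimal user-space sketch, not a kernel patch. The names cache_leaf and remove_shared_cpu_map and the single-word CPU mask are hypothetical stand-ins for illustration; only the cpuid4_info[] array name and its 32 entries come from the dump. Whether skipping the cleanup is actually the right fix is exactly the open question.

```c
#include <stddef.h>

#define NR_CPUS 32  /* matches the 32-entry arrays shown by crash */

/* Hypothetical stand-in for the kernel's per-leaf cache info. */
struct cache_leaf {
    unsigned long shared_cpu_map;  /* bit n set => CPU n shares this cache */
};

/* One pointer per CPU; NULL until that CPU's cache info has been set up. */
struct cache_leaf *cpuid4_info[NR_CPUS];

/*
 * Remove `cpu` from the shared map of every sibling for cache leaf `index`.
 * Returns 0 on success, -1 if this CPU's info was never set up.
 */
int remove_shared_cpu_map(unsigned int cpu, int index)
{
    struct cache_leaf *this_leaf;
    unsigned int sibling;

    if (cpu >= NR_CPUS || cpuid4_info[cpu] == NULL)
        return -1;  /* nothing to clean up: skip instead of dereferencing NULL */

    this_leaf = &cpuid4_info[cpu][index];
    for (sibling = 0; sibling < NR_CPUS; sibling++) {
        if (!(this_leaf->shared_cpu_map & (1UL << sibling)))
            continue;  /* this CPU does not share the cache */
        if (cpuid4_info[sibling] == NULL)
            continue;  /* sibling was never set up either */
        cpuid4_info[sibling][index].shared_cpu_map &= ~(1UL << cpu);
    }
    return 0;
}
```

In this model, calling remove_shared_cpu_map(1, 0) while cpuid4_info[1] is still NULL returns -1 and leaves the other CPUs' maps untouched, rather than faulting.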

One thing I'm really not sure about is the timing of marking the CPUs up relative to the 'Initializing CPU#n' trace (see console output below). I can see that all four VCPUs are set up in the cpu_sys_devices array (which is populated by the code that outputs the 'Initializing CPU#n' trace), but the array of cache info structures only has an entry for VCPU 0:

crash> cpu_sys_devices
cpu_sys_devices = $3 =
 {0xc0464448, 0xc046448c, 0xc04644d0, 0xc0464514, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0}

crash> cpuid4_info
cpuid4_info = $4 =
 {0xc7971180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
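
As a sanity check that a NULL cpuid4_info[1] accounts for the faulting address 00000010 in the console output below, the address arithmetic can be sketched as follows. The layout in leaf_model is an assumption (four 32-bit fields followed by a 32-bit CPU mask, 20 bytes in total), chosen to be consistent with the oops registers rather than copied from the kernel source, and shared_map_addr is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 32-bit layout of one cache leaf; not taken from the kernel source. */
struct leaf_model {
    uint32_t eax;
    uint32_t ebx;
    uint32_t ecx;
    uint32_t size;
    uint32_t shared_cpu_map;  /* lands at offset 0x10 under this assumption */
};

/*
 * Address a 32-bit guest would read to fetch shared_cpu_map for leaf
 * `index`, given the pointer value `base` stored in cpuid4_info[cpu].
 */
uint32_t shared_map_addr(uint32_t base, uint32_t index)
{
    return base
         + index * (uint32_t)sizeof(struct leaf_model)
         + (uint32_t)offsetof(struct leaf_model, shared_cpu_map);
}
```

Under these assumptions, shared_map_addr(0, 0) is 0x10, which matches both the reported fault address and the EDI value in the register dump for CPU 1, leaf 0.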

Any suggestions for appropriate fixes here?
Simon

--- console output ---

Enabling SMP...
Initializing CPU#3
Initializing CPU#2
Initializing CPU#1
eth0: no IPv6 routers present
Unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c010dd3a
0204a000 -> *pde = 00000001:0d8ec001
06a9c000 -> *pme = 00000000:00000000
Oops: 0000 [#1]
SMP 
Modules linked in: ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core binfmt_misc dm_mirror dm_mod bnx2 ata_piix libata mptscsih mptfc mptspi mptsas mptscsi scsi_mod mptbase
CPU:    0
EIP:    0061:[<c010dd3a>]    Tainted: GF    VLI
EFLAGS: 00010202  (2.6.16.29-xen #1) 
EIP is at cache_remove_shared_cpu_map+0x1a/0x90
eax: 00000000  ebx: 00000001  ecx: 00000001  edx: 00000000
esi: 00000000  edi: 00000010  ebp: c3913f14  esp: c3913f08
ds: 007b  es: 007b  ss: 0069
Process suspend (pid: 4038, threadinfo=c3912000 task=c2244570)
Stack: <0>00000001 00000001 00000000 c3913f28 c010e3ba 00000007 00000001 00000007 
      c3913f34 c010e425 c03bd804 c3913f48 c012fae8 ffffffea 00000001 c568c570 
      c3913f7c c013b889 c3913fc0 00000002 00000001 00000000 00000003 00000000 
Call Trace:
 [<c0105401>] show_stack_log_lvl+0xa1/0xe0
 [<c01055f1>] show_registers+0x181/0x200
 [<c0105810>] die+0x100/0x1a0
 [<c01156f6>] do_page_fault+0x3c6/0x8b1
 [<c0105067>] error_code+0x2b/0x30
 [<c010e3ba>] cache_remove_dev+0x2a/0x60
 [<c010e425>] cacheinfo_cpu_callback+0x35/0x40
 [<c012fae8>] notifier_call_chain+0x18/0x40
 [<c013b889>] cpu_down+0x139/0x260
 [<c028bc9f>] smp_suspend+0x7f/0x100
 [<c028ca80>] __do_suspend+0x40/0x180
 [<c0136a06>] kthread+0x96/0xe0
 [<c0102e95>] kernel_thread_helper+0x5/0x10
Code: 0c 5b 5e 5f 5d c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 89 d6 53 89 c3 8d 04 92 8b 14 9d 20 4d 46 c0 8d 04 82 8d 78 10 <8b> 40 10 ba 20 00 00 00 85 c0 74 03 0f bc d0 83 fa 21 b9 20 00 

-and-

crash> bt
PID: 4038   TASK: c2244570  CPU: 0   COMMAND: "suspend"
 #0 [c3913ddc] xen_panic_event at c010a527
 #1 [c3913df8] notifier_call_chain at c012fae6
 #2 [c3913e0c] panic at c0120b16
 #3 [c3913e20] die at c0105866
 #4 [c3913e6c] do_page_fault at c01156f1
 #5 [c3913ed0] error_code (via page_fault) at c0105065
    EAX: 00000000  EBX: 00000001  ECX: 00000001  EDX: 00000000  EBP: c3913f14
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000010
    CS:  0061      EIP: c010dd3a  ERR: ffffffff  EFLAGS: 00010202
 #6 [c3913f04] cache_remove_shared_cpu_map at c010dd3a
 #7 [c3913f18] cache_remove_dev at c010e3b5
 #8 [c3913f2c] cacheinfo_cpu_callback at c010e420
 #9 [c3913f38] notifier_call_chain at c012fae6
#10 [c3913f4c] cpu_down at c013b884
#11 [c3913f80] smp_suspend at c028bc9a
#12 [c3913f98] __do_suspend at c028ca7b
#13 [c3913fc4] kthread at c0136a03
#14 [c3913fe8] kernel_thread_helper at c0102e93
crash>
