Re: udev oops, and system boot failure, with 2.6.32.44 as PV guest

From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: xen-devel@lists.xensource.com
Subject: Re: udev oops, and system boot failure, with 2.6.32.44 as PV guest
Date: Thu, 18 Aug 2011 14:19:04 -0700
Message-ID: <4E4D81C8.7000900@goop.org>
In-Reply-To: <20044.62811.430835.316774@mariner.uk.xensource.com>

On 08/18/2011 04:19 AM, Ian Jackson wrote:
> I am currently commissioning some new machines for the Xen.org test
> infrastructure.  I have one pair of machines on which our current
> stable kernel branch does not boot properly as a PV guest (at least,
> when booting 64-bit).
>
> The symptoms are oopses in udevd followed by a failure to continue
> with the boot.  See the attached kernel log, but the core of the first
> oops is this:
>
>  BUG: unable to handle kernel NULL pointer dereference at (null)
>  IP: [<ffffffff8112c6c4>] alloc_fd+0x53/0x137
>  PGD 0 
>  Oops: 0000 [#1] SMP 
>  last sysfs file: /sys/devices/virtual/bdi/1:13/uevent
>  CPU 0 
>  Modules linked in:
>  Pid: 721, comm: udevd Not tainted 2.6.32.44 #1 
>  RIP: e030:[<ffffffff8112c6c4>]  [<ffffffff8112c6c4>] alloc_fd+0x53/0x137
>  RSP: e02b:ffff880007309ed8  EFLAGS: 00010246
>  RAX: ffff88000730c008 RBX: 000000000712d000 RCX: 00000000fc567701
>  RDX: 0000000000000000 RSI: ffffffff819103a0 RDI: 0000000000000006
>  RBP: ffff880007309f18 R08: 0000000000000023 R09: 0000000000000001
>  R10: 0000000000000000 R11: 0000000000000246 R12: ffff88000730c000
>  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  FS:  00007f0774f867a0(0000) GS:ffff880007b61000(0000) knlGS:0000000000000000
>  CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>  CR2: 0000000000000000 CR3: 00000000072f5000 CR4: 0000000000042660
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>  Process udevd (pid: 721, threadinfo ffff880007308000, task ffff8800072dcf20)
>  Stack:
>   0008800007309f18 ffff88000730c008 ffff8800072ef790 000000000712d000
>  <0> 00000000ffffff9c 0000000000088000 00000000000001b6 00007f0775904980
>  <0> ffff880007309f68 ffffffff81115eff 0000000000000000 ffff88000712d000
>  Call Trace:
>   [<ffffffff81115eff>] do_sys_open+0x3f/0x10a
>   [<ffffffff81115ff3>] sys_open+0x1b/0x1d
>   [<ffffffff8103ccc2>] system_call_fastpath+0x16/0x1b
>  Code: 8d bc 24 80 00 00 00 e8 20 1d 40 00 49 8d 44 24 08 48 89 45 c8 48 8b 45 c8 45 8b b4 24 84 00 00 00 4c 8b 28 45 39 f7 45 0f 43 f7 <41> 8b 75 00 41 39 f6 73 11 49 8b 7d 18 44 89 f2 89 f6 e8 d9 e0 
>  RIP  [<ffffffff8112c6c4>] alloc_fd+0x53/0x137
>   RSP <ffff880007309ed8>
>  CR2: 0000000000000000
> ---[ end trace dc9c072b55616b5c ]---
>
> After this the kernel is still somewhat up, although the guest doesn't
> continue with the boot.  At around "[  148.233911]" the test harness
> gives up on the test, and asks for various debug keys, which you can
> see providing output in the guest kernel log.
>
> The host serial console log does not contain anything relating to the
> guest, until it does its own debug keys at "Aug 17 10:23:38" onwards.
>
> The setup is 64-bit xen-unstable, and the same kernel is being used
> for both dom0 (for which it seems to work fine) and guest.  The
> toolstack is xl.
>
> This failure happens only on these two machines, for some reason.
> I haven't tried 32-bit.

At first glance it doesn't really look very Xen-related; alloc_fd isn't
generally a place where anything Xen-specific happens.  Can you decode
the faulting IP (alloc_fd+0x53) to a specific line of code?
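
One way to do that decoding, as a rough sketch: pull the symbol and
offset out of the oops "IP:" line and hand them to gdb against a vmlinux
built with debug info.  The helper names and the "vmlinux" path here are
made up for illustration and assume you still have the matching build
tree; the "list *(symbol+offset)" form is ordinary gdb expression syntax.

```python
import re

# Matches the "[<addr>] symbol+0xoff/0xsize" form used in the oops above.
_IP_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")

def parse_oops_ip(line):
    """Return (address, symbol, offset, size) from an oops IP/RIP line."""
    m = _IP_RE.search(line)
    if m is None:
        raise ValueError("no symbol+offset found in line")
    addr, sym, off, size = m.groups()
    return int(addr, 16), sym, int(off, 16), int(size, 16)

def gdb_list_command(line, vmlinux="vmlinux"):
    """Build a gdb argv that lists the source around the faulting IP.

    The vmlinux path is an assumption: it must be the unstripped image
    from the same build as the kernel that oopsed.
    """
    _, sym, off, _ = parse_oops_ip(line)
    return ["gdb", "-batch", "-ex", "list *(%s+0x%x)" % (sym, off), vmlinux]

if __name__ == "__main__":
    ip_line = "IP: [<ffffffff8112c6c4>] alloc_fd+0x53/0x137"
    print(" ".join(gdb_list_command(ip_line)))
```

Running the resulting command only gives a useful answer if the vmlinux
really matches the running kernel; otherwise the offset will land on the
wrong line.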

I'm wondering if the access to "/sys/devices/virtual/bdi/1:13/uevent" is
pertinent though; it could be that one of our drivers is doing the
wrong thing, which causes alloc_fd to explode.

Is this expected, or does it indicate something wrong with your
(initramfs?) config?

[    0.434574] Write protecting the kernel read-only data: 7760k
Loading, please wait...
mount: mounting none on /dev failed: No such device
W: devtmpfs not available, falling back to tmpfs for /dev
[    0.507700] BUG: unable to handle kernel NULL pointer dereference at (null)
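
To check the kernel config side of that, something along these lines
would show whether devtmpfs is built in.  This is only a sketch: it
assumes the option is named CONFIG_DEVTMPFS in this tree and that you
have the text of the build's .config to hand; the function name is
hypothetical.

```python
def config_option_state(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...).

    Returns None if the option is explicitly "not set" or is absent
    from the given .config text.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == "# %s is not set" % option:
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None

if __name__ == "__main__":
    # Illustrative input only, not taken from the reported build.
    sample = "CONFIG_TMPFS=y\n# CONFIG_DEVTMPFS is not set\n"
    print(config_option_state(sample, "CONFIG_DEVTMPFS"))
```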

	J