* 100% reliable Oops on xen 4.0.1
@ 2012-08-14  0:03 Peter Moody
  2012-08-14  6:46 ` Pasi Kärkkäinen
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Peter Moody @ 2012-08-14  0:03 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2555 bytes --]

This seems to be some combination of Xen and the audit subsystem, but
the attached program crashes my machine 100% of the time.

steps to reproduce the crash:

 *  1) compile with gcc -m32
 *  2) start auditd, install any rule (I've only tested syscall auditing, but any syscall seems to work).
 *     /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
 *  3) run'n wait (this only loops twice for me before dying)
 *     ./a.out
 *  4) bask in instantaneous kernel oops.

here's xm info from dom0

[xen2.atl] root@gntb1:~# xm info
host                   : gntb1.atl.corp.google.com
release                : 3.2.13-ganeti-rx6-xen0
version                : #1 SMP Thu Jun 7 12:59:40 CEST 2012
machine                : x86_64
nr_cpus                : 12
nr_nodes               : 2
cores_per_socket       : 6
threads_per_core       : 1
cpu_mhz                : 2660
hw_caps                : bfebfbff:2c100800:00000000:00001f40:029ee3ff:00000000:00000001:00000000
virt_caps              : hvm
total_memory           : 32755
free_memory            : 22665
node_to_cpu            : node0:0,2,4,6,8,10
                         node1:1,3,5,7,9,11
node_to_memory         : node0:13083
                         node1:9582
node_to_dma32_mem      : node0:0
                         node1:3235
max_node_id            : 1
xen_major              : 4
xen_minor              : 0
xen_extra              : .1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : placeholder dom0_mem=1024M loglvl=all com1=115200,8n1 console=com1 iommu=0
cc_compiler            : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
cc_compile_by          : pmacedo
cc_compile_domain      : google.com
cc_compile_date        : Wed Mar 16 15:24:06 UTC 2011
xend_config_format     : 4

I'm not sure what you need from the domU. It's running 2.6.38.8 (but
I've seen this bug all the way up to 3.5.0-rc7, the latest I've
tested). It's a fairly beefy setup, 32G memory and 6 cpus.

I suspect xen as opposed to auditd because:

 a) this only happens on our xen machines (though not all of them)
 b) one of my stack traces started with

[172577.560441]  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

Does anyone have any idea what's going on?

Cheers,
peter

-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

[-- Attachment #2: crasher.c --]
[-- Type: text/x-csrc, Size: 3827 bytes --]

/*
 * steps:
 *  1) compile with gcc -m32
 *  2) start auditd, install any rule (I've only tested syscall auditing, but any syscall seems to work).
 *     /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
 *  3) run'n wait (this only loops twice for me before dying)
 *     ./a.out
 *  4) bask in instantaneous kernel oops.
 [  571.282777] ------------[ cut here ]------------
 [  571.282786] kernel BUG at fs/buffer.c:1263!
 [  571.282790] invalid opcode: 0000 [#1] SMP
 [  571.282795] last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings
 [  571.282798] CPU 0
 [  571.282802] Pid: 7457, comm: a.out Not tainted 2.6.38.8-gg868-ganetixenu #1
 [  571.282808] RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
 [  571.282819] RSP: e02b:ffff88079b7ddc78  EFLAGS: 00010046
 [  571.282822] RAX: ffff8807bc290000 RBX: ffff8806d9bb9a98 RCX: 00000000023dc17c
 [  571.282826] RDX: 0000000000001000 RSI: 00000000023dc17c RDI: ffff8807fec29a00
 [  571.282830] RBP: ffff88079b7ddcd8 R08: 0000000000000001 R09: ffff8806d9bb99c0
 [  571.282834] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8806d9bb99c4
 [  571.282839] R13: ffff8806d9bb99f0 R14: ffff8807feff9060 R15: 00000000023dc17c
 [  571.282845] FS:  00007f8f6a76a7c0(0000) GS:ffff8807fff26000(0063) knlGS:0000000000000000
 [  571.282849] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
 [  571.282853] CR2: 00000000f76c6970 CR3: 00000007a250b000 CR4: 0000000000002660
 [  571.282857] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [  571.282861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 [  571.282866] Process a.out (pid: 7457, threadinfo ffff88079b7dc000, task ffff8807786843e0)
 [  571.282870] Stack:
 [  571.282872]  ffff88079b7ddc98 ffffffff81654cd1 ffff88079b7ddca8 ffff8806d9bba440
 [  571.282879]  ffff88079b7ddd08 ffffffff811c9294 ffff8807ffffffc3 0000000000000014
 [  571.282887]  ffff8806d9bb9a98 ffff8806d9bb99c4 ffff8806d9bb99f0 ffff8807feff9060
 [  571.282895] Call Trace:
 [  571.282901]  [<ffffffff81654cd1>] ? down_read+0x11/0x30
 [  571.282907]  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
 [  571.282913]  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
 [  571.282918]  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
 [  571.282923]  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
 [  571.282928]  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
 [  571.282933]  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
 [  571.282938]  [<ffffffff8114065f>] evict+0x1f/0xb0
 [  571.282945]  [<ffffffff81006d52>] ? check_events+0x12/0x20
 [  571.282949]  [<ffffffff81140c14>] iput+0x1a4/0x290
 [  571.282955]  [<ffffffff8113ed05>] dput+0x265/0x310
 [  571.282959]  [<ffffffff81132435>] path_put+0x15/0x30
 [  571.282965]  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
 [  571.282971]  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
 [  571.282974] Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89
 [  571.283027] RIP  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
 [  571.283033]  RSP <ffff88079b7ddc78>
 [  571.283036] ---[ end trace 5975ffe20808ecd2 ]---
 *
 */

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define KILLDIR "/usr/local/tmp/crasher/kill_dir"

int main(void) {
  FILE *f;
  char fullpath[512];
  int i = 0;  /* iteration counter, printed to stderr */

  /* Create, populate and remove the same directory in a tight loop. */
  while (1) {
    fprintf(stderr, "%d ", i++);
    mkdir(KILLDIR, 0777);
    chdir(KILLDIR);
    snprintf(fullpath, sizeof(fullpath), "%s/file", KILLDIR);
    f = fopen(fullpath, "w+");
    if (f == NULL) {
      /* e.g. the parent directory of KILLDIR does not exist */
      perror(fullpath);
      return 1;
    }
    fprintf(f, "nothing to see here");
    fclose(f);
    unlink(fullpath);
    rmdir(KILLDIR);
  }
  return 0;
}

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14  0:03 100% reliable Oops on xen 4.0.1 Peter Moody
@ 2012-08-14  6:46 ` Pasi Kärkkäinen
  2012-08-14  9:11   ` Iustin Pop
  2012-08-14  8:27 ` Jan Beulich
  2012-08-14  9:19 ` Ian Campbell
  2 siblings, 1 reply; 13+ messages in thread
From: Pasi Kärkkäinen @ 2012-08-14  6:46 UTC (permalink / raw)
  To: Peter Moody; +Cc: xen-devel

On Mon, Aug 13, 2012 at 05:03:06PM -0700, Peter Moody wrote:
> This seems to be some combination of Xen and the audit subsystem, but
> the attached program crashes my machine 100% of the time.
> 

Did you try with a later Xen version? 4.0.1 is quite old.
For example, the latest in the Xen 4.0.x series, which is 4.0.4? Or Xen 4.1.3?

-- Pasi



* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14  0:03 100% reliable Oops on xen 4.0.1 Peter Moody
  2012-08-14  6:46 ` Pasi Kärkkäinen
@ 2012-08-14  8:27 ` Jan Beulich
  2012-08-14  9:12   ` Iustin Pop
  2012-08-14  9:19 ` Ian Campbell
  2 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2012-08-14  8:27 UTC (permalink / raw)
  To: Peter Moody; +Cc: xen-devel

>>> On 14.08.12 at 02:03, Peter Moody <pmoody@google.com> wrote:
> I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> tested). It's a fairly beefy setup, 32G memory and 6 cpus.

Do these kernel versions refer to plain upstream ones?

Is the subject referring to 4.0.1 in any way meaningful? I.e.
does the problem not occur with other Xen versions?

> I suspect xen as opposed to auditd because:
> 
>  a) this only happens on our xen machines (though not all of them)
>  b) one of my stack traces started with
> 
> [172577.560441]  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

This is a weak indication of a problem with Xen, but could as
well just indicate it's a problem that only gets surfaced under
Xen. It would certainly help if you included the full oops
message (or multiple of them if they're meaningfully different).

Jan


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14  6:46 ` Pasi Kärkkäinen
@ 2012-08-14  9:11   ` Iustin Pop
  0 siblings, 0 replies; 13+ messages in thread
From: Iustin Pop @ 2012-08-14  9:11 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Peter Moody, xen-devel

On Tue, Aug 14, 2012 at 09:46:28AM +0300, Pasi Kärkkäinen wrote:
> On Mon, Aug 13, 2012 at 05:03:06PM -0700, Peter Moody wrote:
> > This seems to be some combination of Xen and the audit subsystem, but
> > the attached program crashes my machine 100% of the time.
> > 
> 
> Did you try with a later Xen version? 4.0.1 is quite old. 
> For example the latest in Xen 4.0.x series which is 4.0.4 ? Or Xen 4.1.3 ? 

This is 4.0.1 from Debian, so it has at least all the CVEs applied. We
haven't tried with a newer Xen yet, though.

regards,
iustin


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14  8:27 ` Jan Beulich
@ 2012-08-14  9:12   ` Iustin Pop
  0 siblings, 0 replies; 13+ messages in thread
From: Iustin Pop @ 2012-08-14  9:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Peter Moody, xen-devel

On Tue, Aug 14, 2012 at 09:27:31AM +0100, Jan Beulich wrote:
> >>> On 14.08.12 at 02:03, Peter Moody <pmoody@google.com> wrote:
> > I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> > I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> > tested). It's a fairly beefy setup, 32G memory and 6 cpus.
> 
> Are these kernel versions refer to plain upstream ones?

They are mostly Ubuntu kernels, so not vanilla.
> 
> Is the subject referring to 4.0.1 in any way meaningful? I.e.
> does the problem not occur with other Xen versions?

I believe this only refers to the version we run, not that it's
fixed in other Xen versions. We will try to test with newer Xens to see.

regards,
iustin


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14  0:03 100% reliable Oops on xen 4.0.1 Peter Moody
  2012-08-14  6:46 ` Pasi Kärkkäinen
  2012-08-14  8:27 ` Jan Beulich
@ 2012-08-14  9:19 ` Ian Campbell
  2012-08-14 14:42   ` Peter Moody
  2 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2012-08-14  9:19 UTC (permalink / raw)
  To: Peter Moody; +Cc: xen-devel

On Tue, 2012-08-14 at 01:03 +0100, Peter Moody wrote:
> This seems to be some combination of Xen and the audit subsystem, but
> the attached program crashes my machine 100% of the time.
> 
> I suspect xen as opposed to auditd because:
> 
>  a) this only happens on our xen machines (though not all of them)
>  b) one of my stack traces started with
> 
> [172577.560441]  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

This is likely to be a coincidence IMHO since this function forces a
call to the hypervisor to trigger the (re)injection of any pending
interrupts (typically after reenabling interrupts), so it is not unusual
for it to be at the bottom of any stack trace which happens in interrupt
context.

The example stack trace in crasher.c doesn't involve Xen -- can you post
any examples of ones which do?

Ian.


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14  9:19 ` Ian Campbell
@ 2012-08-14 14:42   ` Peter Moody
  2012-08-14 14:47     ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Moody @ 2012-08-14 14:42 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel

On Tue, Aug 14, 2012 at 2:19 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:

>>
>>  a) this only happens on our xen machines (though not all of them)
>>  b) one of my stack traces started with
>>
>> [172577.560441]  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>
> This is likely to be a coincidence IMHO since this function forces a
> call to the hypervisor to trigger the (re)injection of any pending
> interrupts (typically after reenabling interrupts), so it is not unusual
> for it to be at the bottom of any stack trace which happens in interrupt
> context.
>
> The example stack trace in crasher.c doesn't involve Xen -- can you post
> any examples of ones which do.

Hi Ian, here's the trace in question. I'm perfectly happy with this
not being a xen issue, if for no other reason than it means that I have
one less thing I need to look at. The python script in question was
essentially doing the same thing as crasher.c, though in the middle of
other, more productive activities.

Cheers,
peter

------------[ cut here ]------------
kernel BUG at fs/buffer.c:1263!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 3
Pid: 27277, comm: python2.6 Not tainted 2.6.38.8-gg868-ganetixenu #1
RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
RSP: e02b:ffff880496cffc78  EFLAGS: 00010046
RAX: ffff8807b9480000 RBX: ffff88049f172de8 RCX: 000000000086dafd
RDX: 0000000000001000 RSI: 000000000086dafd RDI: ffff8807ba4dd380
RBP: ffff880496cffcd8 R08: 0000000000000001 R09: ffff88049f172d10
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88049f172d14
R13: ffff88049f172d40 R14: ffff8807ba4b7228 R15: 000000000086dafd
FS:  00007f667a0ca700(0000) GS:ffff8807fff74000(0063) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000000a130260 CR3: 00000004e978c000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process python2.6 (pid: 27277, threadinfo ffff880496cfe000, task ffff8804b5a72d40)
Stack:
 ffff880496cffc98 ffffffff81654cd1 ffff880496cffca8 ffff88062d8d2440
 ffff880496cffd08 ffffffff811c9294 ffff8804ffffffc3 0000000000000014
 ffff88049f172de8 ffff88049f172d14 ffff88049f172d40 ffff8807ba4b7228
Call Trace:
 [<ffffffff81654cd1>] ? down_read+0x11/0x30
 [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
 [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
 [<ffffffff811bb104>] ext3_free_data+0x114/0x160
 [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
 [<ffffffff812133f5>] ? journal_start+0xb5/0x100
 [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
 [<ffffffff8114065f>] evict+0x1f/0xb0
 [<ffffffff81006d52>] ? check_events+0x12/0x20
 [<ffffffff81140c14>] iput+0x1a4/0x290
 [<ffffffff8113ed05>] dput+0x265/0x310
 [<ffffffff81132435>] path_put+0x15/0x30
 [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
 [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
 [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81006d52>] ? check_events+0x12/0x20
Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89
RIP  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
 RSP <ffff880496cffc78>
---[ end trace d45267c89c4e0548 ]---
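A note on reading the trace above: the byte marked <..> in the "Code:" line is where the instruction pointer was when the trap hit. Here it is 0f followed by 0b, which is the x86 UD2 instruction. That matches the "kernel BUG at fs/buffer.c:1263!" and "invalid opcode" lines, since BUG() on x86 is implemented with UD2. In other words, the kernel hit a deliberate assertion rather than executing garbage. The helper below is only a minimal sketch of that check; the function name is made up for illustration and is not part of any kernel tool.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when the faulting byte marked <..>
 * in an oops "Code:" line, together with the byte after it, forms
 * 0f 0b (x86 UD2). Whitespace between the two bytes may be a space or
 * a newline, because mail clients often wrap the Code: line.
 */
static bool code_line_faults_on_ud2(const char *code)
{
    assert(code != NULL);

    const char *mark = strchr(code, '<');
    if (mark == NULL || strncmp(mark, "<0f>", 4) != 0)
        return false;

    const char *next = mark + 4;
    while (isspace((unsigned char)*next))
        next++;

    return strncmp(next, "0b", 2) == 0;
}
```

Passing the "Code:" line from either trace in this thread returns true; a line whose marked byte is anything other than 0f, or whose following byte is not 0b, returns false.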


-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14 14:42   ` Peter Moody
@ 2012-08-14 14:47     ` Jan Beulich
  2012-08-14 15:55       ` Peter Moody
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2012-08-14 14:47 UTC (permalink / raw)
  To: Peter Moody; +Cc: Ian Campbell, xen-devel

>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
> Hi Ian, here's the trace in question. I'm perfectly happy with this
> not being a xen issue if for no other reason then it means that I have
> one less thing I need to look at. The python script in question was
> essentially doing the same thing as crasher.c, though in the middle of
> other, more productive activities.
> ...
> Call Trace:
>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>  [<ffffffff8114065f>] evict+0x1f/0xb0
>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>  [<ffffffff81140c14>] iput+0x1a4/0x290
>  [<ffffffff8113ed05>] dput+0x265/0x310
>  [<ffffffff81132435>] path_put+0x15/0x30
>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>  [<ffffffff81006d52>] ? check_events+0x12/0x20

This obviously is just a leftover on the stack; one can see clearly
that we're in the middle of a syscall, which would never have
xen_force_evtchn_callback that deep into the stack (i.e. where
we just came from user mode).

Jan
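
For readers following along: entries in a Linux call trace that carry a "?" are addresses the kernel found on the stack but could not confirm as part of the active call chain, which is why both xen_force_evtchn_callback entries discussed here count as leftovers. The following is a minimal sketch, assuming trace lines shaped like the ones posted in this thread, of how one might separate the two kinds of entries when skimming an oops. The function names are hypothetical and not taken from any existing tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns the text after the last "[<address>]" on the line, or NULL. */
static const char *after_address(const char *line)
{
    const char *last = NULL;
    const char *p = line;

    while ((p = strstr(p, ">]")) != NULL) {
        p += 2;
        last = p;
    }
    if (last == NULL)
        return NULL;            /* not a call-trace entry */
    while (*last == ' ')
        last++;
    return last;
}

/* True when the entry is marked '?', i.e. not a confirmed frame. */
static bool trace_entry_is_unreliable(const char *line)
{
    assert(line != NULL);
    const char *rest = after_address(line);
    return rest != NULL && *rest == '?';
}

/* Copies the symbol name (the text before '+') into out. */
static bool trace_entry_symbol(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL && outlen > 0);

    const char *rest = after_address(line);
    if (rest == NULL)
        return false;
    if (*rest == '?') {
        rest++;
        while (*rest == ' ')
            rest++;
    }

    size_t n = strcspn(rest, "+\n");
    if (n == 0 || n >= outlen)
        return false;
    memcpy(out, rest, n);
    out[n] = '\0';
    return true;
}
```

Applied to the trace quoted above, the confirmed chain runs from sysexit_audit up through audit_syscall_exit, path_put, dput, iput, evict and the ext3 truncate path, with no Xen function among the confirmed frames.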


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14 14:47     ` Jan Beulich
@ 2012-08-14 15:55       ` Peter Moody
  2012-08-14 16:09         ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Moody @ 2012-08-14 15:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Campbell, xen-devel

On Tue, Aug 14, 2012 at 7:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
>> Hi Ian, here's the trace in question. I'm perfectly happy with this
>> not being a xen issue if for no other reason then it means that I have
>> one less thing I need to look at. The python script in question was
>> essentially doing the same thing as crasher.c, though in the middle of
>> other, more productive activities.
>> ...
>> Call Trace:
>>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>>  [<ffffffff8114065f>] evict+0x1f/0xb0
>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>  [<ffffffff81140c14>] iput+0x1a4/0x290
>>  [<ffffffff8113ed05>] dput+0x265/0x310
>>  [<ffffffff81132435>] path_put+0x15/0x30
>>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>
> This obviously is just a leftover on the stack, one can see clearly
> that we're in the middle of a syscall (which would never have
> xen_force_evtchn_callback that deep into the stack (i.e. where
> we just came from user mode).

Interesting, thanks. Do you have any idea why something like this
would only be reproducible (thus far anyway, still trying to get my
hands on some other test systems) on xen? And not just xen, but on
this particular xen configuration (huge memory, lots of cpus, etc)? Is
this likely a race condition with the audit subsystem or some other
part of the kernel that this configuration somehow tickles?

Cheers,
peter

> Jan
>



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14 15:55       ` Peter Moody
@ 2012-08-14 16:09         ` Jan Beulich
  2012-08-14 16:16           ` Peter Moody
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2012-08-14 16:09 UTC (permalink / raw)
  To: Peter Moody; +Cc: Ian Campbell, xen-devel

>>> On 14.08.12 at 17:55, Peter Moody <pmoody@google.com> wrote:
> On Tue, Aug 14, 2012 at 7:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
>>> Hi Ian, here's the trace in question. I'm perfectly happy with this
>>> not being a xen issue if for no other reason then it means that I have
>>> one less thing I need to look at. The python script in question was
>>> essentially doing the same thing as crasher.c, though in the middle of
>>> other, more productive activities.
>>> ...
>>> Call Trace:
>>>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>>>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>>>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>>>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>>>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>>>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>>>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>>>  [<ffffffff8114065f>] evict+0x1f/0xb0
>>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>>  [<ffffffff81140c14>] iput+0x1a4/0x290
>>>  [<ffffffff8113ed05>] dput+0x265/0x310
>>>  [<ffffffff81132435>] path_put+0x15/0x30
>>>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>>>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>>>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>
>> This obviously is just a leftover on the stack, one can see clearly
>> that we're in the middle of a syscall (which would never have
>> xen_force_evtchn_callback that deep into the stack (i.e. where
>> we just came from user mode).
> 
> Interesting, thanks. Do you have any idea why something like this
> would only be reproducible (thus far anyway, still trying to get my
> hands on some other test systems) on xen? And not just xen, but on
> this particular xen configuration (huge memory, lots of cpus, etc)? Is
> this likely a race condition with the audit subsystem or some other
> part of the kernel that this configuration somehow tickles?

From the above as well as based on you indicating that the
traces are highly variable between instances, I'd suppose
this is memory corruption of some sort, which can easily be
hidden by all sorts of factors.

Until you can find a pattern, I don't think much can be done
by anyone who doesn't have an affected system available for
debugging.

Jan


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14 16:09         ` Jan Beulich
@ 2012-08-14 16:16           ` Peter Moody
  2012-08-14 16:26             ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Moody @ 2012-08-14 16:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Campbell, xen-devel

On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:

> From the above as well as based on you indicating that the
> traces are highly variable between instances, I'd suppose
> this is memory corruption of some sort, which can easily be
> hidden by all sorts of factors.
>
> Until you can find a pattern, I don't think much can be done
> by anyone who doesn't have an affected system available for
> debugging.

So I have such a system :).

Are there any pointers or tips you can give me to help me track down
the root cause? I realize that's a broad question, and a perfectly
justifiable answer is "read the memory management chapter of
understanding linux device drivers" but at this point basically any
advice you can give me is appreciated (and will most likely get me
closer to the solution).

Cheers,
peter

> Jan
>



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14 16:16           ` Peter Moody
@ 2012-08-14 16:26             ` Jan Beulich
  2012-08-17 20:38               ` Peter Moody
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2012-08-14 16:26 UTC (permalink / raw)
  To: Peter Moody; +Cc: Ian Campbell, xen-devel

>>> On 14.08.12 at 18:16, Peter Moody <pmoody@google.com> wrote:
> On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
> 
>> From the above as well as based on you indicating that the
>> traces are highly variable between instances, I'd suppose
>> this is memory corruption of some sort, which can easily be
>> hidden by all sorts of factors.
>>
>> Until you can find a pattern, I don't think much can be done
>> by anyone who doesn't have an affected system available for
>> debugging.
> 
> So I have such a system :).

That's what I implied.

> Are there any pointers or tips you can give me to help me track down
> the root cause? I realize that's a broad question, and a perfectly
> justifiable answer is "read the memory management chapter of
> understanding linux device drivers" but at this point basically any
> advice you can give me is appreciated (and will most likely get me
> closer to the solution).

As I said, figuring out a pattern in the crashes would likely help
with placing debug prints, breakpoints, or anything similar to aid
in detecting the presumed corruption earlier. Without a pattern,
there's regrettably not much I can suggest.
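
One possible way to start looking for such a pattern is to compare
only the reliable frames across several oopses. In an x86 call trace,
entries marked with "?" are addresses found on the stack that the
unwinder could not confirm as part of the call chain, so they can be
set aside first. The following is a minimal Python sketch, not a
finished tool: the helper names parse_trace and frames_in_common are
made up for illustration, and SAMPLE is simply the trace quoted
earlier in this thread with the quote markers removed.

```python
import re
from collections import Counter

# One frame of an x86_64 oops call trace, for example:
#   [<ffffffff8114065f>] evict+0x1f/0xb0
#   [<ffffffff81654cd1>] ? down_read+0x11/0x30   (unreliable entry)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<stale>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_trace(text):
    """Return (reliable, unreliable) function names in stack order."""
    reliable, unreliable = [], []
    for match in FRAME_RE.finditer(text):
        # A "?" before the symbol marks a possible stack leftover.
        target = unreliable if match.group("stale") else reliable
        target.append(match.group("func"))
    return reliable, unreliable


def frames_in_common(traces):
    """Count in how many traces each reliable function appears."""
    counts = Counter()
    for text in traces:
        reliable, _ = parse_trace(text)
        counts.update(set(reliable))  # count each function once per trace
    return counts


# The trace quoted earlier in this thread (hypothetical use as test input).
SAMPLE = """
 [<ffffffff81654cd1>] ? down_read+0x11/0x30
 [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
 [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
 [<ffffffff811bb104>] ext3_free_data+0x114/0x160
 [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
 [<ffffffff812133f5>] ? journal_start+0xb5/0x100
 [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
 [<ffffffff8114065f>] evict+0x1f/0xb0
 [<ffffffff81006d52>] ? check_events+0x12/0x20
 [<ffffffff81140c14>] iput+0x1a4/0x290
 [<ffffffff8113ed05>] dput+0x265/0x310
 [<ffffffff81132435>] path_put+0x15/0x30
 [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
 [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
 [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81006d52>] ? check_events+0x12/0x20
"""

if __name__ == "__main__":
    reliable, unreliable = parse_trace(SAMPLE)
    print("reliable frames, innermost first:")
    for func in reliable:
        print("  " + func)
    print("set aside as possible stack leftovers:", sorted(set(unreliable)))
```

Feeding every captured oops to frames_in_common() and sorting the
result by count would show which functions recur in most crashes, and
those are the natural candidates for the debug prints mentioned above.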

Jan


* Re: 100% reliable Oops on xen 4.0.1
  2012-08-14 16:26             ` Jan Beulich
@ 2012-08-17 20:38               ` Peter Moody
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Moody @ 2012-08-17 20:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Campbell, xen-devel

Just to close the loop over here. This is an audit bug, not a xen bug.

https://www.redhat.com/archives/linux-audit/2012-August/msg00018.html

Cheers,
peter


-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


Thread overview: 13+ messages
2012-08-14  0:03 100% reliable Oops on xen 4.0.1 Peter Moody
2012-08-14  6:46 ` Pasi Kärkkäinen
2012-08-14  9:11   ` Iustin Pop
2012-08-14  8:27 ` Jan Beulich
2012-08-14  9:12   ` Iustin Pop
2012-08-14  9:19 ` Ian Campbell
2012-08-14 14:42   ` Peter Moody
2012-08-14 14:47     ` Jan Beulich
2012-08-14 15:55       ` Peter Moody
2012-08-14 16:09         ` Jan Beulich
2012-08-14 16:16           ` Peter Moody
2012-08-14 16:26             ` Jan Beulich
2012-08-17 20:38               ` Peter Moody
