All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [2.6.27.24] Kernel coredump to a pipe is failing
@ 2009-05-26 16:33 Paul Smith
  2009-05-26 18:01 ` Paul Smith
  2009-05-26 20:31 ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-26 16:33 UTC (permalink / raw)
  To: linux-kernel

Sorry for not following up to my previous message to get the threading
headers correct: the original is on another system I don't have access
to and I can't find a way to reply from any of the web archived
versions.  Anyway, this is a link to the original FYI:

http://marc.info/?l=linux-kernel&m=124299629611706

In that post I observed that my short cores were being caused by
dump_write() returning 0; taking another look at dump_write():

        static int dump_write(struct file *file, const void *addr, int nr)
        {
        	return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
        }

If the write operation returns an error of any kind, or it fails to
write the complete set of bytes (nr here is always PAGE_SIZE, as this
function is called when it fails), then dump_write() returns 0 and we
get a short core.

So I annotated dump_write() to printk() if this operation is false, and
I get:

        file ffff8803b95d0180: dump_write: -512 < 4096

Well, -512 is ERESTARTSYS.  That, to me, seems like a reasonable error
code to get when we're trying to dump core to a pipe.  Yes?  No?

Shouldn't we be doing some kind of error handling here, at least for
basic things like signals?  Should a process that's dumping core be set
to ignore signals?  Should dump_write() try again on ERESTARTSYS?

Any advice or comments would be greatly appreciated!



^ permalink raw reply	[flat|nested] 27+ messages in thread
* [2.6.27.24] Kernel coredump to a pipe is failing
@ 2009-05-22 12:34 Paul Smith
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-22 12:34 UTC (permalink / raw)
  To: linux-kernel

Hi all; I'm having issues trying to get my system to dump core through a
pipe to a process, by setting /proc/sys/kernel/core_pattern to start
with a "|".  I'm working on a ramdisk-based system that I boot via PXE.
It's a 64bit 4core system with hyperthreading for 8 CPUs reported by
Linux, and I'm running 2.6.27.24.  Unfortunately it would be quite some
effort to try to update to a newer kernel on this system.

Here's my scenario: I'm running a single process on the system.  That
process receives a job, and it forks a number of copies of itself
(between 4 and 8) to do this work.  In this processing there are two
areas of the code.  In the first area, any core is considered
unrecoverable; in this area when one of the workers takes a core a
signal handler is fired that stops all processes in the process group,
dumps the single core, then all the processes are killed so they never
dump core.  This style of single-core dumping works just fine.

In the second area, a core is considered local to that worker process
and so the system as a whole is considered recoverable.  In this area,
each process dumps core without trying to affect its siblings or parent.
Of course, since the workers are running the same code it could be
(depending on the fault condition) that most or all of the workers dump
core at the same time.

In that last situation, when many of them dump core at the same time, if
I use a filename pattern in core_pattern it works fine, but I'm seeing
major problems when I try to pipe the core to a process.  To test this
I've added an explicit SEGV (invalid pointer dereference) to the code
just when it enters the "recoverable" area, so all workers hit it at the
same time and all dump core.

My core handler process does various things with locking, compression of
the core, etc., but I've reproduced my problem with as simple a setup as
this:

        # cat >/tmp/savecore <<EOF
        #!/bin/sh
        exec cat > /path/to/cores/core.$1
        EOF
        # chmod 755 /tmp/savecore
        # echo '|/tmp/savecore %p' > /proc/sys/kernel/core_pattern

With this, when my workers dump core, the best outcome is that all of
the cores are "short".  A normal core for these processes would be about
1G, but these are anywhere from 50M to as small as 64K.  Examining the
core it basically contains the first 50M (or 64K) of a valid core, so
that GDB will even load it, but obviously as soon as you do any
operation like examine backtraces, etc. it says invalid memory address.
Of course I have core size set to unlimited, etc.

I instrumented fs/binfmt_elf.c:elf_core_dump() with some printk's to try
to figure out what's going on.  I've added a printout to show the value
of "offset" which is how large the core is expected to be.  I also added
printk's before each of the "goto end_coredump" jumps inside the VMA
dump loop.  Just before the end_coredump label I print a message saying
all VMAs were successfully dumped, then after the end_coredump label I
show how many VMA entries were dumped and the total size of them (I
added some variables to track this).  So, on my console when I use a
simple filename pattern in core_patterm, I get this output which is
correct; in this example we have 6 helpers:

        file ffff88046e9e8240: foffset=2820 / offset=1085698048 / dataoff=4096
        file ffff8803ba38e300: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88046a264600: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803b9164cc0: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042cf58900: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042cf58d80: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803ba38e300: Completed VMA write
        file ffff8803ba38e300: 41 VMA / 1085562880 B
        file ffff88046a264600: Completed VMA write
        file ffff8803b9164cc0: Completed VMA write
        file ffff8803b9164cc0: 41 VMA / 1085562880 B
        file ffff88042cf58d80: Completed VMA write
        file ffff88042cf58d80: 41 VMA / 1085562880 B
        file ffff88046e9e8240: Completed VMA write
        file ffff88046e9e8240: 42 VMA / 1085693952 B
        file ffff88042cf58900: Completed VMA write
        file ffff88042cf58900: 41 VMA / 1085562880 B
        file ffff88046a264600: 41 VMA / 1085562880 B

Here we see the offset, then the message that the loop dumping VMA
completes, then the size (this is the size of VMA segments, so it's less
than the total core size).

However, when I install the pipe as described above, things do not go so
well at all:

        file ffff88042cf58cc0: foffset=2820 / offset=1085698048 / dataoff=4096
        file ffff88046e9e8540: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803ba38e840: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042fd1a6c0: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042fd1ae40: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803ba245f00: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042cf58cc0: (3540372 > 18446744073709551615 || dump_write() == 0)
        file ffff8803ba245f00: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff8803ba245f00: 1 VMA / 5484544 B
        file ffff88046e9e8540: (5309812 > 18446744073709551615 || dump_write() == 0)
        file ffff88046e9e8540: 1 VMA / 5484544 B
        file ffff8803ba38e840: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff88042fd1a6c0: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff88042fd1ae40: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff8803ba38e840: 1 VMA / 5484544 B
        file ffff88042fd1a6c0: 1 VMA / 5484544 B
        file ffff88042fd1ae40: 1 VMA / 5484544 B
        file ffff88042cf58cc0: 1 VMA / 5484544 B

What this shows is that the cores were supposed to be about the same
size, but that in all cases the VMA dump loop exited early (in this case
after 1 try) at the if-statement here:

        if ((size += PAGE_SIZE) > limit ||
            !dump_write(file, kaddr,
                        PAGE_SIZE)) {
                kunmap(page);
                page_cache_release(page);
          printk("file %p: (%lu > %lu || dump_write() == 0)\n", file, size, limit);
                goto end_coredump;
        }

Basically this means that the file->f_op->write() method returned a 0...
but I don't know why it would return that!  There are no other messages
in dmesg or syslog etc.

Even worse, after 3 or so times of this, I get a kernel panic, every
time.  The details of the panic vary but a sample is included below.

I'm not sure how to go further.  Any pointers, advice, requests, etc.
are gratefully received.


Kernel panic info:

Bad page state in process 'worker'
page:ffffe20010d63d00 flags:0x8000000000000001 mapping:0000000000000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 3346, comm: worker Tainted: P          2.6.27.24-worker #4

Call Trace:
 [<ffffffff80284fd4>] bad_page+0x74/0xc0
 [<ffffffff80286168>] free_hot_cold_page+0x248/0x2f0
 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
 [<ffffffff802a95c6>] kfree+0x86/0x100
 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
 [<ffffffff802f0991>] elf_core_dump+0x611/0x1160
 [<ffffffff802b49ba>] do_coredump+0x9ea/0xa30
 [<ffffffff8048bb31>] _read_unlock+0x11/0x40
 [<ffffffff802801e3>] find_lock_page+0x23/0x80
 [<ffffffff80280c25>] filemap_fault+0x275/0x4e0
 [<ffffffff80249771>] __dequeue_signal+0xe1/0x190
 [<ffffffff8024bb22>] get_signal_to_deliver+0x272/0x3b0
 [<ffffffff8020b6f7>] do_notify_resume+0xd7/0x9c0
 [<ffffffff8022fe9f>] task_rq_lock+0x4f/0xa0
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff802368fe>] try_to_wake_up+0x16e/0x250
 [<ffffffff80226c96>] do_page_fault+0x426/0xbc0
 [<ffffffff802388c7>] kick_process+0x57/0x90
 [<ffffffff80249c2f>] send_signal+0x17f/0x320
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff802487eb>] sys_rt_sigprocmask+0x8b/0x120
 [<ffffffff8048b88f>] _spin_lock_irqsave+0x1f/0x50
 [<ffffffff8022c0e5>] sys32_rt_sigprocmask+0x75/0x120
 [<ffffffff8022c0e5>] sys32_rt_sigprocmask+0x75/0x120
 [<ffffffff8020c90b>] int_signal+0x12/0x17

general protection fault: 0000 [1] PREEMPT SMP 
CPU 3 
Modules linked in: rng_core dock scsi_mod libata ata_piix zlib_inflate bnx2 libphy tg3 ipmi_msghandler ipmi_si ipmi_devintf nsoc sd_mod sg scsi_transport_sas mptbase mptscsih mptsas mptctl md_mod raid1 raid10 raid0 linear dm_mod dm_multipath dm_round_robin jbd ext3 disklog xxds(P)
Pid: 2895, comm: md4_resync Tainted: P    B     2.6.27.24-worker #4
RIP: 0010:[<ffffffff802a989e>]  [<ffffffff802a989e>] kmem_cache_alloc+0x4e/0xc0
RSP: 0000:ffff88046baf3a70  EFLAGS: 00010086
RAX: 0000000000000000 RBX: 8000000000000000 RCX: 0000000000000010
RDX: ffff88002805d5e0 RSI: 0000000000011220 RDI: ffffffff8058cce8
RBP: 0000000000000082 R08: ffffffffa000d2b0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8058cce8
R13: ffffffff80282da5 R14: 0000000000011220 R15: 0000000000000200
FS:  0000000000000000(0000) GS:ffff88046f805a80(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000080a35ac CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process md4_resync (pid: 2895, threadinfo ffff88046baf2000, task ffff88046e505b00)
Stack:  ffff88046a2b6580 0000000000000010 ffff88046fa8f060 0000000000011220
 ffff88046baf3ac0 ffff88046fa8f090 0000000000000000 ffffffff80282da5
 ffff88046a243c00 ffff88046baf3ad8 0000000000001000 ffffffffa0181924
Call Trace:
 [<ffffffff80282da5>] ? mempool_alloc+0x55/0x130
 [<ffffffffa0181924>] ? multipath_map+0x74/0xc0 [dm_multipath]
 [<ffffffff8036c2e4>] ? __sg_alloc_table+0x64/0x130
 [<ffffffffa000d2b0>] ? scsi_sg_alloc+0x0/0x50 [scsi_mod]
 [<ffffffffa000d1f4>] ? scsi_init_sgtable+0x34/0xa0 [scsi_mod]
 [<ffffffffa000d561>] ? scsi_init_io+0x21/0x100 [scsi_mod]
 [<ffffffffa00e37fc>] ? sd_prep_fn+0x9c/0x5b0 [sd_mod]
 [<ffffffff8034c23d>] ? elv_next_request+0x15d/0x290
 [<ffffffffa000cbfa>] ? scsi_request_fn+0x6a/0x3c0 [scsi_mod]
 [<ffffffff8034e884>] ? generic_unplug_device+0x24/0x40
 [<ffffffffa0172b14>] ? dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] ? dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa0172b14>] ? dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] ? dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa015c4d6>] ? unplug_slaves+0xb6/0x140 [raid1]
 [<ffffffffa015c570>] ? raid1_unplug+0x10/0x20 [raid1]
 [<ffffffffa014ae1c>] ? md_do_sync+0x5cc/0xb40 [md_mod]
 [<ffffffff8048baf2>] ? _spin_unlock_irqrestore+0x12/0x40
 [<ffffffffa014b937>] ? md_thread+0x47/0x120 [md_mod]
 [<ffffffffa014b8f0>] ? md_thread+0x0/0x120 [md_mod]
 [<ffffffff80253267>] ? kthread+0x47/0x90
 [<ffffffff80253220>] ? kthread+0x0/0x90
 [<ffffffff8020d5f9>] ? child_rip+0xa/0x11
 [<ffffffff80253220>] ? kthread+0x0/0x90
 [<ffffffff80253220>] ? kthread+0x0/0x90
 [<ffffffff8020d5ef>] ? child_rip+0x0/0x11


Code: 89 fc 41 89 f6 4c 8b 6c 24 38 9c 5d fa e8 bb 3b 0c 00 48 98 49 8b 94 c4 e8 00 00 00 48 8b 1a 44 8b 7a 18 48 85 db 74 4d 8b 42 14 <48> 8b 04 c3 48 89 02 55 9d 66 45 85 f6 79 12 48 85 db 74 0d 44 
RIP  [<ffffffff802a989e>] kmem_cache_alloc+0x4e/0xc0
 RSP <ffff88046baf3a70>
Kernel panic - not syncing: Fatal exception
------------[ cut here ]------------
WARNING: at /home/build/linux/kernel/smp.c:332 smp_call_function_mask+0x25d/0x270()
Modules linked in: rng_core dock scsi_mod libata ata_piix zlib_inflate bnx2 libphy tg3 ipmi_msghandler ipmi_si ipmi_devintf nsoc sd_mod sg scsi_transport_sas mptbase mptscsih mptsas mptctl md_mod raid1 raid10 raid0 linear dm_mod dm_multipath dm_round_robin jbd ext3 disklog xxds(P)
Pid: 2895, comm: md4_resync Tainted: P    B D   2.6.27.24-worker #4

Call Trace:
 [<ffffffff8023bf54>] warn_on_slowpath+0x64/0xb0
 [<ffffffff8048b88f>] _spin_lock_irqsave+0x1f/0x50
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff80257f30>] down_trylock+0x30/0x50
 [<ffffffff8048bbb0>] _spin_unlock+0x10/0x30
 [<ffffffff8023cce2>] vprintk+0x1f2/0x490
 [<ffffffff8048bbb0>] _spin_unlock+0x10/0x30
 [<ffffffff8048b88f>] _spin_lock_irqsave+0x1f/0x50
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff8021c9b0>] stop_this_cpu+0x0/0x30
 [<ffffffff8026234d>] smp_call_function_mask+0x25d/0x270
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffff8048916f>] printk+0xc0/0xd1
 [<ffffffff8048bbb0>] _spin_unlock+0x10/0x30
 [<ffffffff80269d8d>] crash_kexec+0x6d/0x110
 [<ffffffff8020ea84>] show_registers+0x2a4/0x2c0
 [<ffffffff8048916f>] printk+0xc0/0xd1
 [<ffffffff8021c9b0>] stop_this_cpu+0x0/0x30
 [<ffffffff80262398>] smp_call_function+0x38/0x80
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffff8021c9a0>] native_smp_send_stop+0x20/0x30
 [<ffffffff80488ffb>] panic+0x9f/0x153
 [<ffffffff80253729>] autoremove_wake_function+0x9/0x30
 [<ffffffff8022e2ca>] __wake_up_common+0x5a/0x90
 [<ffffffff8022fc13>] __wake_up+0x43/0x70
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffff8020dfa1>] oops_end+0x81/0xd0
 [<ffffffff8048bf69>] error_exit+0x0/0x51
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffffa000d2b0>] scsi_sg_alloc+0x0/0x50 [scsi_mod]
 [<ffffffff802a989e>] kmem_cache_alloc+0x4e/0xc0
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffffa0181924>] multipath_map+0x74/0xc0 [dm_multipath]
 [<ffffffff8036c2e4>] __sg_alloc_table+0x64/0x130
 [<ffffffffa000d2b0>] scsi_sg_alloc+0x0/0x50 [scsi_mod]
 [<ffffffffa000d1f4>] scsi_init_sgtable+0x34/0xa0 [scsi_mod]
 [<ffffffffa000d561>] scsi_init_io+0x21/0x100 [scsi_mod]
 [<ffffffffa00e37fc>] sd_prep_fn+0x9c/0x5b0 [sd_mod]
 [<ffffffff8034c23d>] elv_next_request+0x15d/0x290
 [<ffffffffa000cbfa>] scsi_request_fn+0x6a/0x3c0 [scsi_mod]
 [<ffffffff8034e884>] generic_unplug_device+0x24/0x40
 [<ffffffffa0172b14>] dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa0172b14>] dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa015c4d6>] unplug_slaves+0xb6/0x140 [raid1]
 [<ffffffffa015c570>] raid1_unplug+0x10/0x20 [raid1]
 [<ffffffffa014ae1c>] md_do_sync+0x5cc/0xb40 [md_mod]
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffffa014b937>] md_thread+0x47/0x120 [md_mod]
 [<ffffffffa014b8f0>] md_thread+0x0/0x120 [md_mod]
 [<ffffffff80253267>] kthread+0x47/0x90
 [<ffffffff80253220>] kthread+0x0/0x90
 [<ffffffff8020d5f9>] child_rip+0xa/0x11
 [<ffffffff80253220>] kthread+0x0/0x90
 [<ffffffff80253220>] kthread+0x0/0x90
 [<ffffffff8020d5ef>] child_rip+0x0/0x11

---[ end trace b6e9b2155948c94b ]---


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2009-05-29 12:48 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-26 16:33 [2.6.27.24] Kernel coredump to a pipe is failing Paul Smith
2009-05-26 18:01 ` Paul Smith
2009-05-26 20:31 ` Andi Kleen
2009-05-26 21:09   ` Paul Smith
2009-05-26 23:00   ` Andrew Morton
2009-05-26 23:14     ` Andi Kleen
2009-05-26 23:28       ` Andrew Morton
2009-05-26 23:41         ` Andi Kleen
2009-05-26 23:45           ` Andrew Morton
2009-05-27  0:11             ` Andi Kleen
2009-05-27  0:29               ` Andrew Morton
2009-05-27  6:02                 ` Paul Smith
2009-05-27  6:17                 ` Paul Smith
2009-05-27  7:31                 ` Andi Kleen
2009-05-27  7:45                   ` Andrew Morton
2009-05-27  8:52                     ` Andi Kleen
2009-05-27  8:56                       ` Andrew Morton
2009-05-27 20:25           ` Jesper Juhl
2009-05-29 10:34           ` Pavel Machek
2009-05-27 18:31   ` Oleg Nesterov
2009-05-27 18:50     ` Andi Kleen
2009-05-27 19:05       ` Oleg Nesterov
2009-05-27 19:49         ` Paul Smith
2009-05-27 20:34           ` Oleg Nesterov
2009-05-27 20:04     ` Oleg Nesterov
2009-05-27 20:22       ` Paul Smith
  -- strict thread matches above, loose matches on Subject: below --
2009-05-22 12:34 Paul Smith

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.