All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [2.6.27.24] Kernel coredump to a pipe is failing
@ 2009-05-26 16:33 Paul Smith
  2009-05-26 18:01 ` Paul Smith
  2009-05-26 20:31 ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-26 16:33 UTC (permalink / raw)
  To: linux-kernel

Sorry for not following up to my previous message to get the threading
headers correct: the original is on another system I don't have access
to and I can't find a way to reply from any of the web archived
versions.  Anyway, this is a link to the original FYI:

http://marc.info/?l=linux-kernel&m=124299629611706

In that post I observed that my short cores were being caused by
dump_write() returning 0; taking another look at dump_write():

        static int dump_write(struct file *file, const void *addr, int nr)
        {
        	return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
        }

If the write operation returns an error of any kind, or it fails to
write the complete set of bytes (nr here is always PAGE_SIZE, as this
function is called when it fails), then dump_write() returns 0 and we
get a short core.

So I annotated dump_write() to printk() if this operation is false, and
I get:

        file ffff8803b95d0180: dump_write: -512 < 4096

Well, -512 is ERESTARTSYS.  That, to me, seems like a reasonable error
code to get when we're trying to dump core to a pipe.  Yes?  No?

Shouldn't we be doing some kind of error handling here, at least for
basic things like signals?  Should a process that's dumping core be set
to ignore signals?  Should dump_write() try again on ERESTARTSYS?

Any advice or comments would be greatly appreciated!



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 16:33 [2.6.27.24] Kernel coredump to a pipe is failing Paul Smith
@ 2009-05-26 18:01 ` Paul Smith
  2009-05-26 20:31 ` Andi Kleen
  1 sibling, 0 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-26 18:01 UTC (permalink / raw)
  To: linux-kernel

On Tue, 2009-05-26 at 12:33 -0400, Paul Smith wrote:
> Sorry for not following up to my previous message to get the threading
> headers correct: the original is on another system I don't have access
> to and I can't find a way to reply from any of the web archived
> versions.  Anyway, this is a link to the original FYI:
> 
> http://marc.info/?l=linux-kernel&m=124299629611706
> 
> In that post I observed that my short cores were being caused by
> dump_write() returning 0; taking another look at dump_write():
> 
>         static int dump_write(struct file *file, const void *addr, int nr)
>         {
>         	return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
>         }
> 
> If the write operation returns an error of any kind, or it fails to
> write the complete set of bytes (nr here is always PAGE_SIZE, as this
> function is called when it fails), then dump_write() returns 0 and we
> get a short core.

I have verified that fs/pipe.c:pipe_write() is returning ERESTARTSYS as
a result of this code:

		if (signal_pending(current)) {
			if (!ret)
				ret = -ERESTARTSYS;
			break;
		}

I'm not exactly sure where to go from here, without knowing what the
design SHOULD be for signals that are received during core dumps.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 16:33 [2.6.27.24] Kernel coredump to a pipe is failing Paul Smith
  2009-05-26 18:01 ` Paul Smith
@ 2009-05-26 20:31 ` Andi Kleen
  2009-05-26 21:09   ` Paul Smith
                     ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: Andi Kleen @ 2009-05-26 20:31 UTC (permalink / raw)
  To: paul; +Cc: linux-kernel

Paul Smith <paul@mad-scientist.net> writes:
>
> So I annotated dump_write() to printk() if this operation is false, and
> I get:
>
>         file ffff8803b95d0180: dump_write: -512 < 4096
>
> Well, -512 is ERESTARTSYS.  That, to me, seems like a reasonable error
> code to get when we're trying to dump core to a pipe.  Yes?  No?

Which signal is it? SIGPIPE?

>
> Shouldn't we be doing some kind of error handling here, at least for
> basic things like signals?  Should a process that's dumping core be set
> to ignore signals?  Should dump_write() try again on ERESTARTSYS?

I think it should block signals. Here's a untested patch.

It has the disadvantage that it reports the incorrect blocked mask
in the ELF corefile, but that's probably better than truncated 
coredumps.

-Andi

---

Block signals during core dump

When a signal happens during core dump the core dump to a pipe 
can fail, because the write returns short, but the ELF core dumpers
cannot handle that.

There's no reason to handle signals during core dumping, so just
block them all.

Open issue: ELF puts blocked signals into the core dump and
that will be always fully blocked now.  Need to save it somewhere?

Based on debugging by Paul Smith.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/exec.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6.30-rc5-ak/fs/exec.c
===================================================================
--- linux-2.6.30-rc5-ak.orig/fs/exec.c	2009-05-14 11:46:24.000000000 +0200
+++ linux-2.6.30-rc5-ak/fs/exec.c	2009-05-26 22:22:12.000000000 +0200
@@ -1760,6 +1760,12 @@
 		goto fail;
 	}
 
+	/* block all signals */
+	spin_lock_irq(&current->sighand->siglock);
+	sigfillset(&current->blocked);
+	/* No recalc sigpending */
+	spin_unlock_irq(&current->sighand->siglock);
+
 	down_write(&mm->mmap_sem);
 	/*
 	 * If another thread got here first, or we are not dumpable, bail out.



-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 20:31 ` Andi Kleen
@ 2009-05-26 21:09   ` Paul Smith
  2009-05-26 23:00   ` Andrew Morton
  2009-05-27 18:31   ` Oleg Nesterov
  2 siblings, 0 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-26 21:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Tue, 2009-05-26 at 22:31 +0200, Andi Kleen wrote:
> Paul Smith <paul@mad-scientist.net> writes:
> > Well, -512 is ERESTARTSYS.  That, to me, seems like a reasonable error
> > code to get when we're trying to dump core to a pipe.  Yes?  No?
> 
> Which signal is it? SIGPIPE?

I'm not sure; I'll have to dig in a little further.  I'm not sure
offhand how to determine which signal it was from inside the kernel but
it shouldn't be hard to find.

> >
> > Shouldn't we be doing some kind of error handling here, at least for
> > basic things like signals?  Should a process that's dumping core be set
> > to ignore signals?  Should dump_write() try again on ERESTARTSYS?
> 
> I think it should block signals. Here's a untested patch.
> 
> It has the disadvantage that it reports the incorrect blocked mask
> in the ELF corefile, but that's probably better than truncated 
> coredumps.

As a quick test I changed dump_write() to retry on ERESTARTSYS after
disabling the pending signal, like:

        static int dump_write(struct file *file, const void *addr, int nr)
        {
                while (1) {
                        int r = file->f_op->write(file, addr, nr, &file->f_pos);
                        if (r != -ERESTARTSYS)
                                return r == nr;
        
                        /* We don't handle signals while dumping core. */
                        clear_thread_flag(TIF_SIGPENDING);
                }
        }

I don't know if this is right, but in some quick tests I ran it did
work: my cores were full size.  I haven't finished testing (and I have
to go to soccer practice right now).

This obviously doesn't reset the signal mask in the dumping process, but
it makes the dump_write() more complex and it may cause other issues so
I can't say whether this is the way to go.

> -
> Block signals during core dump

Cool, I'll test this one as well.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 20:31 ` Andi Kleen
  2009-05-26 21:09   ` Paul Smith
@ 2009-05-26 23:00   ` Andrew Morton
  2009-05-26 23:14     ` Andi Kleen
  2009-05-27 18:31   ` Oleg Nesterov
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2009-05-26 23:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paul, linux-kernel

On Tue, 26 May 2009 22:31:41 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> Block signals during core dump
> 
> When a signal happens during core dump the core dump to a pipe 
> can fail, because the write returns short, but the ELF core dumpers
> cannot handle that.
> 
> There's no reason to handle signals during core dumping, so just
> block them all.
> 
> Open issue: ELF puts blocked signals into the core dump and
> that will be always fully blocked now.  Need to save it somewhere?
> 
> Based on debugging by Paul Smith.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

dump_write() doesn't seem right, either.  If ->write() returns, say,
100 then the dump should keep on going.  At present it treats this
return as an error.

I wonder why the signal problem has just turned up now - did we change
the pipe code or something?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 23:00   ` Andrew Morton
@ 2009-05-26 23:14     ` Andi Kleen
  2009-05-26 23:28       ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2009-05-26 23:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, paul, linux-kernel

On Tue, May 26, 2009 at 04:00:17PM -0700, Andrew Morton wrote:
> dump_write() doesn't seem right, either.  If ->write() returns, say,
> 100 then the dump should keep on going.  At present it treats this
> return as an error.

I think that's correct actually. Short write typically means serious 
issue like disk full or broken pipe, so stopping is good.

> I wonder why the signal problem has just turned up now - did we change
> the pipe code or something?

No idea.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 23:14     ` Andi Kleen
@ 2009-05-26 23:28       ` Andrew Morton
  2009-05-26 23:41         ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2009-05-26 23:28 UTC (permalink / raw)
  To: Andi Kleen; +Cc: andi, paul, linux-kernel

On Wed, 27 May 2009 01:14:28 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> On Tue, May 26, 2009 at 04:00:17PM -0700, Andrew Morton wrote:
> > dump_write() doesn't seem right, either.  If ->write() returns, say,
> > 100 then the dump should keep on going.  At present it treats this
> > return as an error.
> 
> I think that's correct actually. Short write typically means serious 
> issue like disk full or broken pipe, so stopping is good.

But we shouldn't assume that.  It could be that the ->write
implementation is perfectly able to absorb the remaining data.

We should only error out of the write() returned zero or -EFOO.
The current code is simply buggy, but got lucky.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 23:28       ` Andrew Morton
@ 2009-05-26 23:41         ` Andi Kleen
  2009-05-26 23:45           ` Andrew Morton
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Andi Kleen @ 2009-05-26 23:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, paul, linux-kernel

On Tue, May 26, 2009 at 04:28:21PM -0700, Andrew Morton wrote:
> On Wed, 27 May 2009 01:14:28 +0200
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > On Tue, May 26, 2009 at 04:00:17PM -0700, Andrew Morton wrote:
> > > dump_write() doesn't seem right, either.  If ->write() returns, say,
> > > 100 then the dump should keep on going.  At present it treats this
> > > return as an error.
> > 
> > I think that's correct actually. Short write typically means serious 
> > issue like disk full or broken pipe, so stopping is good.
> 
> But we shouldn't assume that.  It could be that the ->write
> implementation is perfectly able to absorb the remaining data.

Maybe in theory, but in practice that's unlikely isn't it?
Disk is full or pipe is blocking etc.

> We should only error out of the write() returned zero or -EFOO.
> The current code is simply buggy, but got lucky.

Maybe very pedantically, but I would argue that most programs 
don't do what you're saying (retry on any short write) and
it's actually not very nice to always write a loop for each write.

Also any IO device who relies on that would likely find
that it won't work with a lot of software.

So I think the current behaviour is ok, just need to get 
rid of the signals.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 23:41         ` Andi Kleen
@ 2009-05-26 23:45           ` Andrew Morton
  2009-05-27  0:11             ` Andi Kleen
  2009-05-27 20:25           ` Jesper Juhl
  2009-05-29 10:34           ` Pavel Machek
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2009-05-26 23:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: andi, paul, linux-kernel

On Wed, 27 May 2009 01:41:09 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> On Tue, May 26, 2009 at 04:28:21PM -0700, Andrew Morton wrote:
> > On Wed, 27 May 2009 01:14:28 +0200
> > Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > On Tue, May 26, 2009 at 04:00:17PM -0700, Andrew Morton wrote:
> > > > dump_write() doesn't seem right, either.  If ->write() returns, say,
> > > > 100 then the dump should keep on going.  At present it treats this
> > > > return as an error.
> > > 
> > > I think that's correct actually. Short write typically means serious 
> > > issue like disk full or broken pipe, so stopping is good.
> > 
> > But we shouldn't assume that.  It could be that the ->write
> > implementation is perfectly able to absorb the remaining data.
> 
> Maybe in theory, but in practice that's unlikely isn't it?

I dunno.  Is this true of all linux filesystems in all cases?  Maybe.

> Disk is full or pipe is blocking etc.
> 
> > We should only error out of the write() returned zero or -EFOO.
> > The current code is simply buggy, but got lucky.
> 
> Maybe very pedantically, but I would argue that most programs 
> don't do what you're saying (retry on any short write) and
> it's actually not very nice to always write a loop for each write.
> 
> Also any IO device who relies on that would likely find
> that it won't work with a lot of software.
> 
> So I think the current behaviour is ok, just need to get 
> rid of the signals.

It's buggy!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 23:45           ` Andrew Morton
@ 2009-05-27  0:11             ` Andi Kleen
  2009-05-27  0:29               ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2009-05-27  0:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, paul, linux-kernel

> I dunno.  Is this true of all linux filesystems in all cases?  Maybe.

Assuming one of them is not would you rather want to fix that file system
or 10 zillion user programs (including the kernel core dumper) that 
get it wrong? @)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27  0:11             ` Andi Kleen
@ 2009-05-27  0:29               ` Andrew Morton
  2009-05-27  6:02                 ` Paul Smith
                                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Andrew Morton @ 2009-05-27  0:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paul, linux-kernel

On Wed, 27 May 2009 02:11:04 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > I dunno.  Is this true of all linux filesystems in all cases?  Maybe.
> 
> Assuming one of them is not would you rather want to fix that file system
> or 10 zillion user programs (including the kernel core dumper) that 
> get it wrong? @)
> 

I think that removing one bug is better than adding one.

Many filesystems will return a short write if they hit a memory
allocation failure, for example.  pipe_write() sure will.  Retrying
is appropriate in such a case.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27  0:29               ` Andrew Morton
@ 2009-05-27  6:02                 ` Paul Smith
  2009-05-27  6:17                 ` Paul Smith
  2009-05-27  7:31                 ` Andi Kleen
  2 siblings, 0 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-27  6:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

On Tue, 2009-05-26 at 17:29 -0700, Andrew Morton wrote:
> On Wed, 27 May 2009 02:11:04 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > I dunno.  Is this true of all linux filesystems in all cases?  Maybe.
> > 
> > Assuming one of them is not would you rather want to fix that file system
> > or 10 zillion user programs (including the kernel core dumper) that 
> > get it wrong? @)
> 
> I think that removing one bug is better than adding one.
> 
> Many filesystems will return a short write if they hit a memory
> allocation failure, for example.  pipe_write() sure will.  Retrying
> is appropriate in such a case.

As a mainly userspace guy maybe I'm missing some details for kernel
behavior, but I know I would never write a program that used write(2)
and assumed it would never return a short write.  The documentation for
write(2) is very clear that short writes are possible and any reasonably
robust program will handle this.  Consider things like NFS filesystems,
etc. where who knows what behavior is found.

I'm more concerned with the loss of the signal mask settings in the core
dump in Andi's patch.  This seems to be losing important information.
Andi, why did you prefer that to clearing the pending signal and
retrying the write?  I'm definitely not familiar enough with signal
management in the kernel to know what side-effects there might be from
just clearing the pending flag without doing anything else: I did it
that way because fs/exec.c:do_coredump() does this before it runs the
->core_dump function.

I wonder whether dump_write() shouldn't be rewritten along the lines of
a normal, robust userspace writer, where we handle EAGAIN and EINTR (can
we ever get these at this level, or do we ever just get ERESTARTSYS?),
short writes, etc.


PS. I have a thought about why this happens for me; I doubt I'm getting
SIGPIPE.  In our system it's almost certain that these worker processes
will get a signal (SIGUSR1 or something: I forget exactly which one) if
they are still alive after a few seconds.  I suspect that the core dump
takes long enough that this signal is received in the middle of the core
dump.  It may be that this problem hasn't been noticed before because
it's unlikely you'll receive a signal in the middle of dumping core, and
if you do get one every now and then, and get a short core, it's not
easily reproducible.  I left my debugging in the kernel and I get
exactly one instance of signal_pending() per process, so having the
signal be SIGPIPE seems unlikely.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27  0:29               ` Andrew Morton
  2009-05-27  6:02                 ` Paul Smith
@ 2009-05-27  6:17                 ` Paul Smith
  2009-05-27  7:31                 ` Andi Kleen
  2 siblings, 0 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-27  6:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

On Tue, 2009-05-26 at 17:29 -0700, Andrew Morton wrote:
> Many filesystems will return a short write if they hit a memory
> allocation failure, for example.  pipe_write() sure will.  Retrying
> is appropriate in such a case.

Here's a patch that "works for me" and tries to address the various
issues.  I've no idea what landmines I might have stepped on here.  I
also have no git-fu so this uses simple diff -u format.

Open issues: is it possible to get -EAGAIN or -EINTR at this level of
the kernel?  Or will it always be just -ERESTARTSYS?  Is there evil in
simply running clear_thread_flag(TIF_SIGPENDING) without "handling" the
signal in any way?

---

Retry core dump writes where appropriate

Core dump write operations can be incomplete due to signal reception or
possibly recoverable partial writes.

Previously any incomplete write in the ELF core dumper caused the core
dump to stop, giving short cores in these cases.  Modify the core dumper
to retry the write where appropriate.

Signed-off-by: Paul Smith <paul@mad-scientist.net>

---
--- a/fs/binfmt_elf.c	2009-05-27 01:12:35.000000000 -0400
+++ b/fs/binfmt_elf.c	2009-05-27 01:20:21.000000000 -0400
@@ -1128,7 +1128,25 @@
  */
 static int dump_write(struct file *file, const void *addr, int nr)
 {
-	return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
+	const char *p = addr;
+	while (1) {
+		int r = file->f_op->write(file, p, nr, &file->f_pos);
+
+		if (likely(r == nr))
+			return 1;
+
+		if (r == -ERESTARTSYS || r == -EAGAIN || r == -EINTR)
+			/* Ignore signals during coredump. */
+			clear_thread_flag(TIF_SIGPENDING);
+		else if (r > 0) {
+			/* Partial write: try again with the rest. */
+			p += r;
+			nr -= r;
+		}
+		else
+			/* Lose! */
+			return 0;
+	}
 }
 
 static int dump_seek(struct file *file, loff_t off)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27  0:29               ` Andrew Morton
  2009-05-27  6:02                 ` Paul Smith
  2009-05-27  6:17                 ` Paul Smith
@ 2009-05-27  7:31                 ` Andi Kleen
  2009-05-27  7:45                   ` Andrew Morton
  2 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2009-05-27  7:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, paul, linux-kernel

On Tue, May 26, 2009 at 05:29:35PM -0700, Andrew Morton wrote:
> On Wed, 27 May 2009 02:11:04 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > I dunno.  Is this true of all linux filesystems in all cases?  Maybe.
> > 
> > Assuming one of them is not would you rather want to fix that file system
> > or 10 zillion user programs (including the kernel core dumper) that 
> > get it wrong? @)
> > 
> 
> I think that removing one bug is better than adding one.
> 
> Many filesystems will return a short write if they hit a memory
> allocation failure, for example.  pipe_write() sure will.  Retrying
> is appropriate in such a case.

Sorry but are you really suggesting every program in the world that uses
write() anywhere should put it into a loop? That seems just like really
bad API design to me, requiring such contortions in a fundamental
system call just to work around kernel deficiencies.

I can just imagine the programmers putting nasty comments
about the Linux kernel on top of those loops and they would
be fully deserved.

And the same applies to in-kernel users really.

The memory allocation case more sounds like a bug in these fs and
in pipe.

e.g. the network stack sleeps waiting for memory, perhaps these
file systems should too.

Or it should just always return -ENOMEM. Typically when the 
system is badly out of memory you're gonna lose anyways because
a lot of things start failing.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27  7:31                 ` Andi Kleen
@ 2009-05-27  7:45                   ` Andrew Morton
  2009-05-27  8:52                     ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2009-05-27  7:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paul, linux-kernel

On Wed, 27 May 2009 09:31:36 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> On Tue, May 26, 2009 at 05:29:35PM -0700, Andrew Morton wrote:
> > On Wed, 27 May 2009 02:11:04 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > > I dunno.  Is this true of all linux filesystems in all cases?  Maybe.
> > > 
> > > Assuming one of them is not would you rather want to fix that file system
> > > or 10 zillion user programs (including the kernel core dumper) that 
> > > get it wrong? @)
> > > 
> > 
> > I think that removing one bug is better than adding one.
> > 
> > Many filesystems will return a short write if they hit a memory
> > allocation failure, for example.  pipe_write() sure will.  Retrying
> > is appropriate in such a case.
> 
> Sorry but are you really suggesting every program in the world that uses
> write() anywhere should put it into a loop? That seems just like really
> bad API design to me, requiring such contortions in a fundamental
> system call just to work around kernel deficiencies.
> 
> I can just imagine the programmers putting nasty comments
> about the Linux kernel on top of those loops and they would
> be fully deserved.
> 

Hey, don't look at me - blame Brian Kernighan or George Bush or
someone.

> And the same applies to in-kernel users really.

We could delete a rather nice amount of tricky VFS code if we were to
make this assumption.  But of course we daren't do that.

And as long as we're attempting to correctly handle partial writes all
over the kernel, it's a bit dopey to deliberately avoid doing this at one
particular codesite.

I bet glibc handles partial writes...

> The memory allocation case more sounds like a bug in these fs and
> in pipe.
> 
> e.g. the network stack sleeps waiting for memory, perhaps these
> file systems should too.
> 
> Or it should just always return -ENOMEM. Typically when the 
> system is badly out of memory you're gonna lose anyways because
> a lot of things start failing.

The kernel should only fail if it has no other option.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27  7:45                   ` Andrew Morton
@ 2009-05-27  8:52                     ` Andi Kleen
  2009-05-27  8:56                       ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2009-05-27  8:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, paul, linux-kernel

> Hey, don't look at me - blame Brian Kernighan or George Bush or
> someone.

Heh.

What I meant is: if it makes sense to retry the kernel should do that
on its own. It has better information about the circumstances anyways.
And it would make sense to put all such logic into a single place
instead of all programs.

If it doesn't we shouldn't try to force that to user space. That would be similar
to signal handling without SA_RESTART which I think nearly everyone agrees was one
of the worst APIs in Unix ever.

In most cases (e.g. out of memory) it likely doesn't make much sense to
retry anyways.

So short write always means error.


> > And the same applies to in-kernel users really.
> 
> We could delete a rather nice amount of tricky VFS code if we were to
> make this assumption.  But of course we daren't do that.

What do you mean?

> 
> And as long as we're attempting to correctly handle partial writes all
> over the kernel, it's a bit dopey to deliberately avoid doing this at one
> particular codesite.
> 
> I bet glibc handles partial writes...

I just spent some time nagivating through the glibc stdio me^wmaze 
and I don't see any retry loops.

Also even if it did there would be lots of other users
(google codesearch has >23 million hits for "write")

> > a lot of things start failing.
> 
> The kernel should only fail if it has no other option.

Typically short write means no other option.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27  8:52                     ` Andi Kleen
@ 2009-05-27  8:56                       ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2009-05-27  8:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paul, linux-kernel

On Wed, 27 May 2009 10:52:38 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > Hey, don't look at me - blame Brian Kernighan or George Bush or
> > someone.
> 
> Heh.
> 
> What I meant is: if it makes sense to retry the kernel should do that
> on its own. It has better information about the circumstances anyways.
> And it would make sense to put all such logic into a single place
> instead of all programs.
> 
> If it doesn't we shouldn't try to force that to user space. That would be similar
> to signal handling without SA_RESTART which I think nearly everyone agrees was one
> of the worst APIs in Unix ever.
> 
> In most cases (e.g. out of memory) it likely doesn't make much sense to
> retry anyways.
> 
> So short write always means error.
> 
> 
> > > And the same applies to in-kernel users really.
> > 
> > We could delete a rather nice amount of tricky VFS code if we were to
> > make this assumption.  But of course we daren't do that.
> 
> What do you mean?

There's code all over the VFS IO paths which correctly recognises,
handles and propagates short reads and writes.

Actually, by far the most common case here is that the short read/write
will be followed by a -ENOSPC or -EFAULT or whatever, so forget I said
that ;)

> > 
> > And as long as we're attempting to correctly handle partial writes all
> > over the kernel, it's a bit dopey to deliberately avoid doing this at one
> > particular codesite.
> > 
> > I bet glibc handles partial writes...
> 
> I just spent some time nagivating through the glibc stdio me^wmaze 
> and I don't see any retry loops.

Perhaps it's usually propagated back, dunno.

> Also even if it did there would be lots of other users
> (google codesearch has >23 million hits for "write")
> 
> > > a lot of things start failing.
> > 
> > The kernel should only fail if it has no other option.
> 
> Typically short write means no other option.

typically != always.  It would certainly be peculiar behaviour and
perhaps it was specified that way mainly on behalf of UART drivers and
such, dunno.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 20:31 ` Andi Kleen
  2009-05-26 21:09   ` Paul Smith
  2009-05-26 23:00   ` Andrew Morton
@ 2009-05-27 18:31   ` Oleg Nesterov
  2009-05-27 18:50     ` Andi Kleen
  2009-05-27 20:04     ` Oleg Nesterov
  2 siblings, 2 replies; 27+ messages in thread
From: Oleg Nesterov @ 2009-05-27 18:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paul, linux-kernel, Andrew Morton, Roland McGrath

On 05/26, Andi Kleen wrote:
>
> When a signal happens during core dump the core dump to a pipe
> can fail, because the write returns short, but the ELF core dumpers
> cannot handle that.
>
> There's no reason to handle signals during core dumping, so just
> block them all.

Actually, I think there is a strong reason to handle signals during
core dumping. The coredump can take a lot of time/resources, not good
it looks like unkillable procees to users.

Please look at

	killable/interruptible coredumps
	http://marc.info/?l=linux-kernel&m=121665710711931

at least, I think SIGKILL should terminate core dumping.

> Open issue: ELF puts blocked signals into the core dump and
> that will be always fully blocked now.  Need to save it somewhere?
>
> --- linux-2.6.30-rc5-ak.orig/fs/exec.c	2009-05-14 11:46:24.000000000 +0200
> +++ linux-2.6.30-rc5-ak/fs/exec.c	2009-05-26 22:22:12.000000000 +0200
> @@ -1760,6 +1760,12 @@
>  		goto fail;
>  	}
>
> +	/* block all signals */
> +	spin_lock_irq(&current->sighand->siglock);
> +	sigfillset(&current->blocked);
> +	/* No recalc sigpending */
> +	spin_unlock_irq(&current->sighand->siglock);

Perhaps it makes sense to do

	--- a/kernel/signal.c
	+++ b/kernel/signal.c
	@@ -644,6 +644,7 @@ static int prepare_signal(int sig, struc
			/*
			 * The process is in the middle of dying, nothing to do.
			 */
	+		 return 0;
		} else if (sig_kernel_stop(sig)) {
			/*
			 * This is a stop signal.  Remove SIGCONT from all queues.

instead, this was discussed before. This way the exiting/coredumping
task ignores all signals, and we cab simplify complete_signal() a bit.


This all needs more discussion, but imho for now something like
Paul's patch http://marc.info/?l=linux-kernel&m=124340506200729
is the best workaround. Note that we have the same dump_write()
in binfmt_elf.c and binfmt_aout.c, perhaps it makes sense to
create coredump_file_write() helper in fs/exec.c.

Oleg.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27 18:31   ` Oleg Nesterov
@ 2009-05-27 18:50     ` Andi Kleen
  2009-05-27 19:05       ` Oleg Nesterov
  2009-05-27 20:04     ` Oleg Nesterov
  1 sibling, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2009-05-27 18:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andi Kleen, paul, linux-kernel, Andrew Morton, Roland McGrath

> Actually, I think there is a strong reason to handle signals during
> core dumping. The coredump can take a lot of time/resources, not good
> it looks like unkillable procees to users.

One problem with that is if you send a process a string of signals that cause
a core dump and then kill. In the old case you would just get a full core dump
on the first signal and be done. With your change it would process
the second signal too and stop the dumping and you get none or a partial 
core dump. That might well break existing setups.

> 
> Perhaps it makes sense to do

Probably, my patch is not usable as is anyways because it would
need to save the mask somewhere so that the ELF dumper can 
build the correct mask. Your version would at least avoid that 
problem.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27 18:50     ` Andi Kleen
@ 2009-05-27 19:05       ` Oleg Nesterov
  2009-05-27 19:49         ` Paul Smith
  0 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2009-05-27 19:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paul, linux-kernel, Andrew Morton, Roland McGrath

On 05/27, Andi Kleen wrote:
>
> > Actually, I think there is a strong reason to handle signals during
> > core dumping. The coredump can take a lot of time/resources, not good
> > it looks like unkillable procees to users.
>
> One problem with that is if you send a process a string of signals that cause
> a core dump and then kill. In the old case you would just get a full core dump
> on the first signal and be done. With your change it would process
> the second signal too and stop the dumping and you get none or a partial
> core dump. That might well break existing setups.

I don't think we should worry about this particular case. Suppose a user
does

	kill(pid, SIGQUIT);
	kill(pid, SIGKILL);

In this case, most likely the core dump will not start. Because SIGKILL
will set SIGNAL_GROUP_EXIT before the process dequeues SIGQUIT and start
do_coredump() which checks signal_group_exit() in zap_threads().

But yes, I agree, this change is user-visible and should be discussed.

Oleg.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27 19:05       ` Oleg Nesterov
@ 2009-05-27 19:49         ` Paul Smith
  2009-05-27 20:34           ` Oleg Nesterov
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Smith @ 2009-05-27 19:49 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andi Kleen, linux-kernel, Andrew Morton, Roland McGrath

On Wed, 2009-05-27 at 21:05 +0200, Oleg Nesterov wrote:
> On 05/27, Andi Kleen wrote:
> >
> > > Actually, I think there is a strong reason to handle signals during
> > > core dumping. The coredump can take a lot of time/resources, not good
> > > it looks like unkillable procees to users.
> >
> > One problem with that is if you send a process a string of signals that cause
> > a core dump and then kill. In the old case you would just get a full core dump
> > on the first signal and be done. With your change it would process
> > the second signal too and stop the dumping and you get none or a partial
> > core dump. That might well break existing setups.
> 
> I don't think we should worry about this particular case. Suppose a user
> does
> 
> 	kill(pid, SIGQUIT);
> 	kill(pid, SIGKILL);

I'm not sure about this.  Why even bother with SIGQUIT (or anything
else) if you're just going to immediately SIGKILL afterwards?  What
people do all the time, and I think should be supported, is something
like this:

	<do 5 times>
		kill(pid, SIGINT);
		sleep(1);
		<if pid is dead break>
	kill(pid, SIGKILL);

Often with other signals in the mix like SIGHUP or whatever.  The idea
is to give the process a chance to do "whatever it does" to clean up and
then, if it's still there we consider it too wedged to respond and send
a SIGKILL.  If the cleanup operations invoked by receiving the SIGINT
caused a core dump, then you wouldn't want the SIGKILL to stop the core
dump.

On the other hand I do agree that it would be nice to be able to smash a
core dump that was taking a long time or trying to write to an
unavailable resource like a stalled NFS mount or whatever.  Sigh.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27 18:31   ` Oleg Nesterov
  2009-05-27 18:50     ` Andi Kleen
@ 2009-05-27 20:04     ` Oleg Nesterov
  2009-05-27 20:22       ` Paul Smith
  1 sibling, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2009-05-27 20:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paul, linux-kernel, Andrew Morton, Roland McGrath

On 05/27, Oleg Nesterov wrote:
>
> On 05/26, Andi Kleen wrote:
> >
> > When a signal happens during core dump the core dump to a pipe
> > can fail, because the write returns short, but the ELF core dumpers
> > cannot handle that.
> >
> > There's no reason to handle signals during core dumping, so just
> > block them all.
>
> Actually, I think there is a strong reason to handle signals during
> core dumping. The coredump can take a lot of time/resources, not good
> it looks like unkillable procees to users.
>
> Please look at
>
> 	killable/interruptible coredumps
> 	http://marc.info/?l=linux-kernel&m=121665710711931
>
> at least, I think SIGKILL should terminate core dumping.

Forgot to mention, and we have problems with OOM. Not only the coredumping
task can't be killed (and it can populate the memory via get_user_pages).
The coredump just disables OOM, if select_bad_process() sees the PF_EXITING
task with ->mm == NULL it returns -1.

> This all needs more discussion, but imho for now something like
> Paul's patch http://marc.info/?l=linux-kernel&m=124340506200729
> is the best workaround. Note that we have the same dump_write()
> in binfmt_elf.c and binfmt_aout.c, perhaps it makes sense to
> create coredump_file_write() helper in fs/exec.c.


But I didn't notice Paul also reports the kernel panic:

	page:ffffe20010d63d00 flags:0x8000000000000001 mapping:0000000000000000 mapcount:0 \
	count:0 Trying to fix it up, but a reboot is needed
	Backtrace:
	Pid: 3346, comm: worker Tainted: P          2.6.27.24-worker #4

	Call Trace:
	 [<ffffffff80284fd4>] bad_page+0x74/0xc0
	 [<ffffffff80286168>] free_hot_cold_page+0x248/0x2f0
	 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
	 [<ffffffff802a95c6>] kfree+0x86/0x100
	 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
	 [<ffffffff802f0991>] elf_core_dump+0x611/0x1160

At first glance, this looks like a bug outside of coredump.c,
we are trying to free PG_locked page?

Oleg.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27 20:04     ` Oleg Nesterov
@ 2009-05-27 20:22       ` Paul Smith
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-27 20:22 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andi Kleen, linux-kernel, Andrew Morton, Roland McGrath

On Wed, 2009-05-27 at 22:04 +0200, Oleg Nesterov wrote:
> Forgot to mention, and we have problems with OOM. Not only the coredumping
> task can't be killed (and it can populate the memory via get_user_pages).
> The coredump just disables OOM, if select_bad_process() sees the PF_EXITING
> task with ->mm == NULL it returns -1.
> 
> > This all needs more discussion, but imho for now something like
> > Paul's patch http://marc.info/?l=linux-kernel&m=124340506200729
> > is the best workaround. Note that we have the same dump_write()
> > in binfmt_elf.c and binfmt_aout.c, perhaps it makes sense to
> > create coredump_file_write() helper in fs/exec.c.
> 
> But I didn't notice Paul also reports the kernel panic:
> 
> 	page:ffffe20010d63d00 flags:0x8000000000000001 mapping:0000000000000000 mapcount:0 \
> 	count:0 Trying to fix it up, but a reboot is needed
> 	Backtrace:
> 	Pid: 3346, comm: worker Tainted: P          2.6.27.24-worker #4
> 
> 	Call Trace:
> 	 [<ffffffff80284fd4>] bad_page+0x74/0xc0
> 	 [<ffffffff80286168>] free_hot_cold_page+0x248/0x2f0
> 	 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
> 	 [<ffffffff802a95c6>] kfree+0x86/0x100
> 	 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
> 	 [<ffffffff802f0991>] elf_core_dump+0x611/0x1160
> 
> At first glance, this looks like a bug outside of coredump.c,
> we are trying to free PG_locked page?

This might be something different, or a side-effect that's not
understood; I haven't seen this happen again since I applied my change,
and I used to be able to make it happen every time within 2 or 3
invocations of my "failing" core dump procedure.  Now I have dumped core
using my "failing" procedure 10-15 times in a row with no ill-effects.

I'll keep an eye out for this one though.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 23:41         ` Andi Kleen
  2009-05-26 23:45           ` Andrew Morton
@ 2009-05-27 20:25           ` Jesper Juhl
  2009-05-29 10:34           ` Pavel Machek
  2 siblings, 0 replies; 27+ messages in thread
From: Jesper Juhl @ 2009-05-27 20:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, paul, linux-kernel

On Wed, 27 May 2009, Andi Kleen wrote:

> On Tue, May 26, 2009 at 04:28:21PM -0700, Andrew Morton wrote:
> > On Wed, 27 May 2009 01:14:28 +0200
> > Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > On Tue, May 26, 2009 at 04:00:17PM -0700, Andrew Morton wrote:
> > > > dump_write() doesn't seem right, either.  If ->write() returns, say,
> > > > 100 then the dump should keep on going.  At present it treats this
> > > > return as an error.
> > > 
> > > I think that's correct actually. Short write typically means serious 
> > > issue like disk full or broken pipe, so stopping is good.
> > 
> > But we shouldn't assume that.  It could be that the ->write
> > implementation is perfectly able to absorb the remaining data.
> 
> Maybe in theory, but in practice that's unlikely isn't it?
> Disk is full or pipe is blocking etc.
> 
> > We should only error out of the write() returned zero or -EFOO.
> > The current code is simply buggy, but got lucky.
> 
> Maybe very pedantically, but I would argue that most programs 
> don't do what you're saying (retry on any short write) and
> it's actually not very nice to always write a loop for each write.
> 
My experience, from userspace, both with the codebases I currently work on 
professionally and my own hobby projects, is that people usually write a 
"write wrapper" function that deals with short writes, interrupted system 
call etc and then just call the wrapper so one does not have to open-code 
a loop everywhere one wants to call write() - one just calls the wrapper 
function that does the right thing.

-- 
Jesper Juhl <jj@chaosbits.net>             http://www.chaosbits.net/
Plain text mails only, please      http://www.expita.com/nomime.html
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-27 19:49         ` Paul Smith
@ 2009-05-27 20:34           ` Oleg Nesterov
  0 siblings, 0 replies; 27+ messages in thread
From: Oleg Nesterov @ 2009-05-27 20:34 UTC (permalink / raw)
  To: Paul Smith; +Cc: Andi Kleen, linux-kernel, Andrew Morton, Roland McGrath

On 05/27, Paul Smith wrote:
>
> On Wed, 2009-05-27 at 21:05 +0200, Oleg Nesterov wrote:
> > On 05/27, Andi Kleen wrote:
> > >
> > > > Actually, I think there is a strong reason to handle signals during
> > > > core dumping. The coredump can take a lot of time/resources, not good
> > > > it looks like unkillable procees to users.
> > >
> > > One problem with that is if you send a process a string of signals that cause
> > > a core dump and then kill. In the old case you would just get a full core dump
> > > on the first signal and be done. With your change it would process
> > > the second signal too and stop the dumping and you get none or a partial
> > > core dump. That might well break existing setups.
> >
> > I don't think we should worry about this particular case. Suppose a user
> > does
> >
> > 	kill(pid, SIGQUIT);
> > 	kill(pid, SIGKILL);
>
> I'm not sure about this.  Why even bother with SIGQUIT (or anything
> else) if you're just going to immediately SIGKILL afterwards?

Probably I misunderstood what Andi meant.

> What
> people do all the time, and I think should be supported, is something
> like this:
>
> 	<do 5 times>
> 		kill(pid, SIGINT);
> 		sleep(1);
> 		<if pid is dead break>
> 	kill(pid, SIGKILL);
>
> Often with other signals in the mix like SIGHUP or whatever.  The idea
> is to give the process a chance to do "whatever it does" to clean up and
> then, if it's still there we consider it too wedged to respond and send
> a SIGKILL.  If the cleanup operations invoked by receiving the SIGINT
> caused a core dump, then you wouldn't want the SIGKILL to stop the core
> dump.

Yes. Once again, this change is user-visble and it can confuse/break
existing setups, I agree. As almost any user-visible change ;)

> On the other hand I do agree that it would be nice to be able to smash a
> core dump that was taking a long time or trying to write to an
> unavailable resource like a stalled NFS mount or whatever.  Sigh.

Yes.

Oleg.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [2.6.27.24] Kernel coredump to a pipe is failing
  2009-05-26 23:41         ` Andi Kleen
  2009-05-26 23:45           ` Andrew Morton
  2009-05-27 20:25           ` Jesper Juhl
@ 2009-05-29 10:34           ` Pavel Machek
  2 siblings, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2009-05-29 10:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, paul, linux-kernel

On Wed 2009-05-27 01:41:09, Andi Kleen wrote:
> On Tue, May 26, 2009 at 04:28:21PM -0700, Andrew Morton wrote:
> > On Wed, 27 May 2009 01:14:28 +0200
> > Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > On Tue, May 26, 2009 at 04:00:17PM -0700, Andrew Morton wrote:
> > > > dump_write() doesn't seem right, either.  If ->write() returns, say,
> > > > 100 then the dump should keep on going.  At present it treats this
> > > > return as an error.
> > > 
> > > I think that's correct actually. Short write typically means serious 
> > > issue like disk full or broken pipe, so stopping is good.
> > 
> > But we shouldn't assume that.  It could be that the ->write
> > implementation is perfectly able to absorb the remaining data.
> 
> Maybe in theory, but in practice that's unlikely isn't it?
> Disk is full or pipe is blocking etc.
> 
> > We should only error out of the write() returned zero or -EFOO.
> > The current code is simply buggy, but got lucky.
> 
> Maybe very pedantically, but I would argue that most programs 
> don't do what you're saying (retry on any short write) and
> it's actually not very nice to always write a loop for each write.
> 
> Also any IO device who relies on that would likely find
> that it won't work with a lot of software.
> 
> So I think the current behaviour is ok, just need to get 
> rid of the signals.

Short writes are normal at least for pipes and sockets... better fix
the sw.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [2.6.27.24] Kernel coredump to a pipe is failing
@ 2009-05-22 12:34 Paul Smith
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Smith @ 2009-05-22 12:34 UTC (permalink / raw)
  To: linux-kernel

Hi all; I'm having issues trying to get my system to dump core through a
pipe to a process, by setting /proc/sys/kernel/core_pattern to start
with a "|".  I'm working on a ramdisk-based system that I boot via PXE.
It's a 64bit 4core system with hyperthreading for 8 CPUs reported by
Linux, and I'm running 2.6.27.24.  Unfortunately it would be quite some
effort to try to update to a newer kernel on this system.

Here's my scenario: I'm running a single process on the system.  That
process receives a job, and it forks a number of copies of itself
(between 4 and 8) to do this work.  In this processing there are two
areas of the code.  In the first area, any core is considered
unrecoverable; in this area when one of the workers takes a core a
signal handler is fired that stops all processes in the process group,
dumps the single core, then all the processes are killed so they never
dump core.  This style of single-core dumping works just fine.

In the second area, a core is considered local to that worker process
and so the system as a whole is considered recoverable.  In this area,
each process dumps core without trying to affect its siblings or parent.
Of course, since the workers are running the same code it could be
(depending on the fault condition) that most or all of the workers dump
core at the same time.

In that last situation, when many of them dump core at the same time, if
I use a filename pattern in core_pattern it works fine, but I'm seeing
major problems when I try to pipe the core to a process.  To test this
I've added an explicit SEGV (invalid pointer dereference) to the code
just when it enters the "recoverable" area, so all workers hit it at the
same time and all dump core.

My core handler process does various things with locking, compression of
the core, etc., but I've reproduced my problem with as simple a setup as
this:

        # cat >/tmp/savecore <<EOF
        #!/bin/sh
        exec cat > /path/to/cores/core.$1
        EOF
        # chmod 755 /tmp/savecore
        # echo '|/tmp/savecore %p' > /proc/sys/kernel/core_pattern

With this, when my workers dump core, the best outcome is that all of
the cores are "short".  A normal core for these processes would be about
1G, but these are anywhere from 50M to as small as 64K.  Examining the
core it basically contains the first 50M (or 64K) of a valid core, so
that GDB will even load it, but obviously as soon as you do any
operation like examine backtraces, etc. it says invalid memory address.
Of course I have core size set to unlimited, etc.

I instrumented fs/binfmt_elf.c:elf_core_dump() with some printk's to try
to figure out what's going on.  I've added a printout to show the value
of "offset" which is how large the core is expected to be.  I also added
printk's before each of the "goto end_coredump" jumps inside the VMA
dump loop.  Just before the end_coredump label I print a message saying
all VMAs were successfully dumped, then after the end_coredump label I
show how many VMA entries were dumped and the total size of them (I
added some variables to track this).  So, on my console when I use a
simple filename pattern in core_patterm, I get this output which is
correct; in this example we have 6 helpers:

        file ffff88046e9e8240: foffset=2820 / offset=1085698048 / dataoff=4096
        file ffff8803ba38e300: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88046a264600: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803b9164cc0: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042cf58900: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042cf58d80: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803ba38e300: Completed VMA write
        file ffff8803ba38e300: 41 VMA / 1085562880 B
        file ffff88046a264600: Completed VMA write
        file ffff8803b9164cc0: Completed VMA write
        file ffff8803b9164cc0: 41 VMA / 1085562880 B
        file ffff88042cf58d80: Completed VMA write
        file ffff88042cf58d80: 41 VMA / 1085562880 B
        file ffff88046e9e8240: Completed VMA write
        file ffff88046e9e8240: 42 VMA / 1085693952 B
        file ffff88042cf58900: Completed VMA write
        file ffff88042cf58900: 41 VMA / 1085562880 B
        file ffff88046a264600: 41 VMA / 1085562880 B

Here we see the offset, then the message that the loop dumping VMA
completes, then the size (this is the size of VMA segments, so it's less
than the total core size).

However, when I install the pipe as described above, things do not go so
well at all:

        file ffff88042cf58cc0: foffset=2820 / offset=1085698048 / dataoff=4096
        file ffff88046e9e8540: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803ba38e840: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042fd1a6c0: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042fd1ae40: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff8803ba245f00: foffset=2788 / offset=1085566976 / dataoff=4096
        file ffff88042cf58cc0: (3540372 > 18446744073709551615 || dump_write() == 0)
        file ffff8803ba245f00: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff8803ba245f00: 1 VMA / 5484544 B
        file ffff88046e9e8540: (5309812 > 18446744073709551615 || dump_write() == 0)
        file ffff88046e9e8540: 1 VMA / 5484544 B
        file ffff8803ba38e840: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff88042fd1a6c0: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff88042fd1ae40: (5277044 > 18446744073709551615 || dump_write() == 0)
        file ffff8803ba38e840: 1 VMA / 5484544 B
        file ffff88042fd1a6c0: 1 VMA / 5484544 B
        file ffff88042fd1ae40: 1 VMA / 5484544 B
        file ffff88042cf58cc0: 1 VMA / 5484544 B

What this shows is that the cores were supposed to be about the same
size, but that in all cases the VMA dump loop exited early (in this case
after 1 try) at the if-statement here:

        if ((size += PAGE_SIZE) > limit ||
            !dump_write(file, kaddr,
                        PAGE_SIZE)) {
                kunmap(page);
                page_cache_release(page);
          printk("file %p: (%lu > %lu || dump_write() == 0)\n", file, size, limit);
                goto end_coredump;
        }

Basically this means that the file->f_op->write() method returned a 0...
but I don't know why it would return that!  There are no other messages
in dmesg or syslog etc.

Even worse, after 3 or so times of this, I get a kernel panic, every
time.  The details of the panic vary but a sample is included below.

I'm not sure how to go further.  Any pointers, advice, requests, etc.
are gratefully received.


Kernel panic info:

Bad page state in process 'worker'
page:ffffe20010d63d00 flags:0x8000000000000001 mapping:0000000000000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 3346, comm: worker Tainted: P          2.6.27.24-worker #4

Call Trace:
 [<ffffffff80284fd4>] bad_page+0x74/0xc0
 [<ffffffff80286168>] free_hot_cold_page+0x248/0x2f0
 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
 [<ffffffff802a95c6>] kfree+0x86/0x100
 [<ffffffff802f4096>] free_wr_note_data+0x56/0x70
 [<ffffffff802f0991>] elf_core_dump+0x611/0x1160
 [<ffffffff802b49ba>] do_coredump+0x9ea/0xa30
 [<ffffffff8048bb31>] _read_unlock+0x11/0x40
 [<ffffffff802801e3>] find_lock_page+0x23/0x80
 [<ffffffff80280c25>] filemap_fault+0x275/0x4e0
 [<ffffffff80249771>] __dequeue_signal+0xe1/0x190
 [<ffffffff8024bb22>] get_signal_to_deliver+0x272/0x3b0
 [<ffffffff8020b6f7>] do_notify_resume+0xd7/0x9c0
 [<ffffffff8022fe9f>] task_rq_lock+0x4f/0xa0
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff802368fe>] try_to_wake_up+0x16e/0x250
 [<ffffffff80226c96>] do_page_fault+0x426/0xbc0
 [<ffffffff802388c7>] kick_process+0x57/0x90
 [<ffffffff80249c2f>] send_signal+0x17f/0x320
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff802487eb>] sys_rt_sigprocmask+0x8b/0x120
 [<ffffffff8048b88f>] _spin_lock_irqsave+0x1f/0x50
 [<ffffffff8022c0e5>] sys32_rt_sigprocmask+0x75/0x120
 [<ffffffff8022c0e5>] sys32_rt_sigprocmask+0x75/0x120
 [<ffffffff8020c90b>] int_signal+0x12/0x17

general protection fault: 0000 [1] PREEMPT SMP 
CPU 3 
Modules linked in: rng_core dock scsi_mod libata ata_piix zlib_inflate bnx2 libphy tg3 ipmi_msghandler ipmi_si ipmi_devintf nsoc sd_mod sg scsi_transport_sas mptbase mptscsih mptsas mptctl md_mod raid1 raid10 raid0 linear dm_mod dm_multipath dm_round_robin jbd ext3 disklog xxds(P)
Pid: 2895, comm: md4_resync Tainted: P    B     2.6.27.24-worker #4
RIP: 0010:[<ffffffff802a989e>]  [<ffffffff802a989e>] kmem_cache_alloc+0x4e/0xc0
RSP: 0000:ffff88046baf3a70  EFLAGS: 00010086
RAX: 0000000000000000 RBX: 8000000000000000 RCX: 0000000000000010
RDX: ffff88002805d5e0 RSI: 0000000000011220 RDI: ffffffff8058cce8
RBP: 0000000000000082 R08: ffffffffa000d2b0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8058cce8
R13: ffffffff80282da5 R14: 0000000000011220 R15: 0000000000000200
FS:  0000000000000000(0000) GS:ffff88046f805a80(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000080a35ac CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process md4_resync (pid: 2895, threadinfo ffff88046baf2000, task ffff88046e505b00)
Stack:  ffff88046a2b6580 0000000000000010 ffff88046fa8f060 0000000000011220
 ffff88046baf3ac0 ffff88046fa8f090 0000000000000000 ffffffff80282da5
 ffff88046a243c00 ffff88046baf3ad8 0000000000001000 ffffffffa0181924
Call Trace:
 [<ffffffff80282da5>] ? mempool_alloc+0x55/0x130
 [<ffffffffa0181924>] ? multipath_map+0x74/0xc0 [dm_multipath]
 [<ffffffff8036c2e4>] ? __sg_alloc_table+0x64/0x130
 [<ffffffffa000d2b0>] ? scsi_sg_alloc+0x0/0x50 [scsi_mod]
 [<ffffffffa000d1f4>] ? scsi_init_sgtable+0x34/0xa0 [scsi_mod]
 [<ffffffffa000d561>] ? scsi_init_io+0x21/0x100 [scsi_mod]
 [<ffffffffa00e37fc>] ? sd_prep_fn+0x9c/0x5b0 [sd_mod]
 [<ffffffff8034c23d>] ? elv_next_request+0x15d/0x290
 [<ffffffffa000cbfa>] ? scsi_request_fn+0x6a/0x3c0 [scsi_mod]
 [<ffffffff8034e884>] ? generic_unplug_device+0x24/0x40
 [<ffffffffa0172b14>] ? dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] ? dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa0172b14>] ? dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] ? dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa015c4d6>] ? unplug_slaves+0xb6/0x140 [raid1]
 [<ffffffffa015c570>] ? raid1_unplug+0x10/0x20 [raid1]
 [<ffffffffa014ae1c>] ? md_do_sync+0x5cc/0xb40 [md_mod]
 [<ffffffff8048baf2>] ? _spin_unlock_irqrestore+0x12/0x40
 [<ffffffffa014b937>] ? md_thread+0x47/0x120 [md_mod]
 [<ffffffffa014b8f0>] ? md_thread+0x0/0x120 [md_mod]
 [<ffffffff80253267>] ? kthread+0x47/0x90
 [<ffffffff80253220>] ? kthread+0x0/0x90
 [<ffffffff8020d5f9>] ? child_rip+0xa/0x11
 [<ffffffff80253220>] ? kthread+0x0/0x90
 [<ffffffff80253220>] ? kthread+0x0/0x90
 [<ffffffff8020d5ef>] ? child_rip+0x0/0x11


Code: 89 fc 41 89 f6 4c 8b 6c 24 38 9c 5d fa e8 bb 3b 0c 00 48 98 49 8b 94 c4 e8 00 00 00 48 8b 1a 44 8b 7a 18 48 85 db 74 4d 8b 42 14 <48> 8b 04 c3 48 89 02 55 9d 66 45 85 f6 79 12 48 85 db 74 0d 44 
RIP  [<ffffffff802a989e>] kmem_cache_alloc+0x4e/0xc0
 RSP <ffff88046baf3a70>
Kernel panic - not syncing: Fatal exception
------------[ cut here ]------------
WARNING: at /home/build/linux/kernel/smp.c:332 smp_call_function_mask+0x25d/0x270()
Modules linked in: rng_core dock scsi_mod libata ata_piix zlib_inflate bnx2 libphy tg3 ipmi_msghandler ipmi_si ipmi_devintf nsoc sd_mod sg scsi_transport_sas mptbase mptscsih mptsas mptctl md_mod raid1 raid10 raid0 linear dm_mod dm_multipath dm_round_robin jbd ext3 disklog xxds(P)
Pid: 2895, comm: md4_resync Tainted: P    B D   2.6.27.24-worker #4

Call Trace:
 [<ffffffff8023bf54>] warn_on_slowpath+0x64/0xb0
 [<ffffffff8048b88f>] _spin_lock_irqsave+0x1f/0x50
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff80257f30>] down_trylock+0x30/0x50
 [<ffffffff8048bbb0>] _spin_unlock+0x10/0x30
 [<ffffffff8023cce2>] vprintk+0x1f2/0x490
 [<ffffffff8048bbb0>] _spin_unlock+0x10/0x30
 [<ffffffff8048b88f>] _spin_lock_irqsave+0x1f/0x50
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffff8021c9b0>] stop_this_cpu+0x0/0x30
 [<ffffffff8026234d>] smp_call_function_mask+0x25d/0x270
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffff8048916f>] printk+0xc0/0xd1
 [<ffffffff8048bbb0>] _spin_unlock+0x10/0x30
 [<ffffffff80269d8d>] crash_kexec+0x6d/0x110
 [<ffffffff8020ea84>] show_registers+0x2a4/0x2c0
 [<ffffffff8048916f>] printk+0xc0/0xd1
 [<ffffffff8021c9b0>] stop_this_cpu+0x0/0x30
 [<ffffffff80262398>] smp_call_function+0x38/0x80
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffff8021c9a0>] native_smp_send_stop+0x20/0x30
 [<ffffffff80488ffb>] panic+0x9f/0x153
 [<ffffffff80253729>] autoremove_wake_function+0x9/0x30
 [<ffffffff8022e2ca>] __wake_up_common+0x5a/0x90
 [<ffffffff8022fc13>] __wake_up+0x43/0x70
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffff8020dfa1>] oops_end+0x81/0xd0
 [<ffffffff8048bf69>] error_exit+0x0/0x51
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffffa000d2b0>] scsi_sg_alloc+0x0/0x50 [scsi_mod]
 [<ffffffff802a989e>] kmem_cache_alloc+0x4e/0xc0
 [<ffffffff80282da5>] mempool_alloc+0x55/0x130
 [<ffffffffa0181924>] multipath_map+0x74/0xc0 [dm_multipath]
 [<ffffffff8036c2e4>] __sg_alloc_table+0x64/0x130
 [<ffffffffa000d2b0>] scsi_sg_alloc+0x0/0x50 [scsi_mod]
 [<ffffffffa000d1f4>] scsi_init_sgtable+0x34/0xa0 [scsi_mod]
 [<ffffffffa000d561>] scsi_init_io+0x21/0x100 [scsi_mod]
 [<ffffffffa00e37fc>] sd_prep_fn+0x9c/0x5b0 [sd_mod]
 [<ffffffff8034c23d>] elv_next_request+0x15d/0x290
 [<ffffffffa000cbfa>] scsi_request_fn+0x6a/0x3c0 [scsi_mod]
 [<ffffffff8034e884>] generic_unplug_device+0x24/0x40
 [<ffffffffa0172b14>] dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa0172b14>] dm_table_unplug_all+0x34/0x50 [dm_mod]
 [<ffffffffa01713fd>] dm_unplug_all+0x1d/0x40 [dm_mod]
 [<ffffffffa015c4d6>] unplug_slaves+0xb6/0x140 [raid1]
 [<ffffffffa015c570>] raid1_unplug+0x10/0x20 [raid1]
 [<ffffffffa014ae1c>] md_do_sync+0x5cc/0xb40 [md_mod]
 [<ffffffff8048baf2>] _spin_unlock_irqrestore+0x12/0x40
 [<ffffffffa014b937>] md_thread+0x47/0x120 [md_mod]
 [<ffffffffa014b8f0>] md_thread+0x0/0x120 [md_mod]
 [<ffffffff80253267>] kthread+0x47/0x90
 [<ffffffff80253220>] kthread+0x0/0x90
 [<ffffffff8020d5f9>] child_rip+0xa/0x11
 [<ffffffff80253220>] kthread+0x0/0x90
 [<ffffffff80253220>] kthread+0x0/0x90
 [<ffffffff8020d5ef>] child_rip+0x0/0x11

---[ end trace b6e9b2155948c94b ]---


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2009-05-29 12:48 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-26 16:33 [2.6.27.24] Kernel coredump to a pipe is failing Paul Smith
2009-05-26 18:01 ` Paul Smith
2009-05-26 20:31 ` Andi Kleen
2009-05-26 21:09   ` Paul Smith
2009-05-26 23:00   ` Andrew Morton
2009-05-26 23:14     ` Andi Kleen
2009-05-26 23:28       ` Andrew Morton
2009-05-26 23:41         ` Andi Kleen
2009-05-26 23:45           ` Andrew Morton
2009-05-27  0:11             ` Andi Kleen
2009-05-27  0:29               ` Andrew Morton
2009-05-27  6:02                 ` Paul Smith
2009-05-27  6:17                 ` Paul Smith
2009-05-27  7:31                 ` Andi Kleen
2009-05-27  7:45                   ` Andrew Morton
2009-05-27  8:52                     ` Andi Kleen
2009-05-27  8:56                       ` Andrew Morton
2009-05-27 20:25           ` Jesper Juhl
2009-05-29 10:34           ` Pavel Machek
2009-05-27 18:31   ` Oleg Nesterov
2009-05-27 18:50     ` Andi Kleen
2009-05-27 19:05       ` Oleg Nesterov
2009-05-27 19:49         ` Paul Smith
2009-05-27 20:34           ` Oleg Nesterov
2009-05-27 20:04     ` Oleg Nesterov
2009-05-27 20:22       ` Paul Smith
  -- strict thread matches above, loose matches on Subject: below --
2009-05-22 12:34 Paul Smith

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.