linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
       [not found]   ` <87h7i694ij.fsf_-_@disp2133>
@ 2021-06-09 20:33     ` Linus Torvalds
  2021-06-09 20:48       ` Eric W. Biederman
  2021-06-09 21:02       ` Olivier Langlois
  2021-10-22 14:13     ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Pavel Begunkov
  1 sibling, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2021-06-09 20:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, linux-fsdevel, io-uring,
	Alexander Viro, Olivier Langlois, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Wed, Jun 9, 2021 at 1:17 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
> In short the coredump code deliberately supports being interrupted by
> SIGKILL, and depends upon prepare_signal to filter out all other
> signals.

Hmm.

I have to say, that looks like the core reason for the bug: if you
want to be interrupted by a fatal signal, you shouldn't use
signal_pending(), you should use fatal_signal_pending().

Now, the fact that we haven't cleared TIF_NOTIFY_SIGNAL for the first
signal is clearly the immediate cause of this, but at the same time I
really get the feeling that that coredump aborting code should always
had used fatal_signal_pending().

We do want to be able to abort core-dumps (stuck network filesystems
is the traditional reason), but the fact that it used signal_pending()
looks buggy.

In fact, the very comment in that dump_interrupted() function seems to
acknowledge that signal_pending() is all kinds of silly.

So regardless of the fact that io_uring does seem to have messed up
this part of signals, I think the fix is not to change
signal_pending() to task_sigpending(), but to just do what the comment
suggests we should do.

But also:

> With the io_uring code comes an extra test in signal_pending
> for TIF_NOTIFY_SIGNAL (which is something about asking a task to run
> task_work_run).

Jens, is this still relevant? Maybe we can revert that whole series
now, and make the confusing difference between signal_pending() and
task_sigpending() go away again?

               Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-09 20:33     ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Linus Torvalds
@ 2021-06-09 20:48       ` Eric W. Biederman
  2021-06-09 20:52         ` Linus Torvalds
  2021-06-09 21:02       ` Olivier Langlois
  1 sibling, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-09 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, io-uring,
	Alexander Viro, Olivier Langlois, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 9, 2021 at 1:17 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>> In short the coredump code deliberately supports being interrupted by
>> SIGKILL, and depends upon prepare_signal to filter out all other
>> signals.
>
> Hmm.
>
> I have to say, that looks like the core reason for the bug: if you
> want to be interrupted by a fatal signal, you shouldn't use
> signal_pending(), you should use fatal_signal_pending().
>
> Now, the fact that we haven't cleared TIF_NOTIFY_SIGNAL for the first
> signal is clearly the immediate cause of this, but at the same time I
> really get the feeling that that coredump aborting code should always
> had used fatal_signal_pending().
>
> We do want to be able to abort core-dumps (stuck network filesystems
> is the traditional reason), but the fact that it used signal_pending()
> looks buggy.
>
> In fact, the very comment in that dump_interrupted() function seems to
> acknowledge that signal_pending() is all kinds of silly.
>
> So regardless of the fact that io_uring does seem to have messed up
> this part of signals, I think the fix is not to change
> signal_pending() to task_sigpending(), but to just do what the comment
> suggests we should do.

It looks like it would need to be:

static bool dump_interrupted(void)
{
	return fatal_signal_pending() || freezing();
}

As the original implementation of dump_interrupted 528f827ee0bb
("coredump: introduce dump_interrupted()") is deliberately allowing the
freezer to terminate the core dumps to allow for reliable system
suspend.

>
> But also:
>
>> With the io_uring code comes an extra test in signal_pending
>> for TIF_NOTIFY_SIGNAL (which is something about asking a task to run
>> task_work_run).
>
> Jens, is this still relevant? Maybe we can revert that whole series
> now, and make the confusing difference between signal_pending() and
> task_sigpending() go away again?
>
>                Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-09 20:48       ` Eric W. Biederman
@ 2021-06-09 20:52         ` Linus Torvalds
  0 siblings, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2021-06-09 20:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, linux-fsdevel, io-uring,
	Alexander Viro, Olivier Langlois, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Wed, Jun 9, 2021 at 1:48 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> It looks like it would need to be:
>
> static bool dump_interrupted(void)
> {
>         return fatal_signal_pending() || freezing();
> }
>
> As the original implementation of dump_interrupted 528f827ee0bb
> ("coredump: introduce dump_interrupted()") is deliberately allowing the
> freezer to terminate the core dumps to allow for reliable system
> suspend.

Ack. That would seem to be the right conversion to do.

Now, I'm not sure if system suspend really should abort a core dump,
but it's clearly what we have done in the past.

Maybe we'd like to remove that "|| freezing()" at some point, but
that's a separate discussion, I think. At least having it in that form
makes it all very explicit, instead of the current very subtle exact
interaction with the TIF_NOTIFY_SIGNAL bit (that has other meanings).

Hmm?

             Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-09 20:33     ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Linus Torvalds
  2021-06-09 20:48       ` Eric W. Biederman
@ 2021-06-09 21:02       ` Olivier Langlois
  2021-06-09 21:05         ` Eric W. Biederman
  1 sibling, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2021-06-09 21:02 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Linux Kernel Mailing List, linux-fsdevel, io-uring,
	Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Wed, 2021-06-09 at 13:33 -0700, Linus Torvalds wrote:
> Now, the fact that we haven't cleared TIF_NOTIFY_SIGNAL for the first
> signal is clearly the immediate cause of this, but at the same time I
> really get the feeling that that coredump aborting code should always
> had used fatal_signal_pending().

I need clarify what does happen with the io_uring situation. If
somehow, TIF_NOTIFY_SIGNAL wasn't cleared, I would get all the time a 0
byte size core dump because do_coredump() does check if the dump is
interrupted before writing a single byte.

io_uring is quite a strange animal. AFAIK, the common pattern to use a
wait_queue is to insert a task into it and then put that task to sleep
until the waited event occur.

io_uring place tasks into wait queues and then let the the task return
to user space to do some other stuff (like core dumping). I would guess
that it is the main reason for it using the task_work feature.

So the TIF_NOTIFY_SIGNAL does get set WHILE the core dump is written.

Greetings,
Olivier



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-09 21:02       ` Olivier Langlois
@ 2021-06-09 21:05         ` Eric W. Biederman
  2021-06-09 21:26           ` Olivier Langlois
  0 siblings, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-09 21:05 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

Olivier Langlois <olivier@trillion01.com> writes:

> On Wed, 2021-06-09 at 13:33 -0700, Linus Torvalds wrote:
>> Now, the fact that we haven't cleared TIF_NOTIFY_SIGNAL for the first
>> signal is clearly the immediate cause of this, but at the same time I
>> really get the feeling that that coredump aborting code should always
>> had used fatal_signal_pending().
>
> I need clarify what does happen with the io_uring situation. If
> somehow, TIF_NOTIFY_SIGNAL wasn't cleared, I would get all the time a 0
> byte size core dump because do_coredump() does check if the dump is
> interrupted before writing a single byte.
>
> io_uring is quite a strange animal. AFAIK, the common pattern to use a
> wait_queue is to insert a task into it and then put that task to sleep
> until the waited event occur.
>
> io_uring place tasks into wait queues and then let the the task return
> to user space to do some other stuff (like core dumping). I would guess
> that it is the main reason for it using the task_work feature.
>
> So the TIF_NOTIFY_SIGNAL does get set WHILE the core dump is written.

Did you mean?

So the TIF_NOTIFY_SIGNAL does _not_ get set WHILE the core dump is written.

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-09 21:05         ` Eric W. Biederman
@ 2021-06-09 21:26           ` Olivier Langlois
  2021-06-09 21:56             ` Olivier Langlois
  2021-06-10 14:26             ` Eric W. Biederman
  0 siblings, 2 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-06-09 21:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Wed, 2021-06-09 at 16:05 -0500, Eric W. Biederman wrote:
> > 
> > So the TIF_NOTIFY_SIGNAL does get set WHILE the core dump is
> > written.
> 
> Did you mean?
> 
> So the TIF_NOTIFY_SIGNAL does _not_ get set WHILE the core dump is
> written.
> 
> 
Absolutely not. I did really mean what I have said. Bear with me that,
I am not qualifying myself as an expert kernel dev yet so feel free to
correct me if I say some heresy...

io_uring is placing my task in my TCP socket wait queue because it
wants to read data from it.

The task returns to user space and core dump with a SEGV.

now my understanding is that the code that is waking up tasks, it is
the NIC driver interrupt handler which can occur while the core dump is
written.

does that make sense?

my testing is telling me that this is exactly what happens...



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-09 21:26           ` Olivier Langlois
@ 2021-06-09 21:56             ` Olivier Langlois
  2021-06-10 14:26             ` Eric W. Biederman
  1 sibling, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-06-09 21:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Wed, 2021-06-09 at 17:26 -0400, Olivier Langlois wrote:
> On Wed, 2021-06-09 at 16:05 -0500, Eric W. Biederman wrote:
> > > 
> > > So the TIF_NOTIFY_SIGNAL does get set WHILE the core dump is
> > > written.
> > 
> > Did you mean?
> > 
> > So the TIF_NOTIFY_SIGNAL does _not_ get set WHILE the core dump is
> > written.
> > 
> > 
> Absolutely not. I did really mean what I have said. Bear with me
> that,
> I am not qualifying myself as an expert kernel dev yet so feel free
> to
> correct me if I say some heresy...
> 
> io_uring is placing my task in my TCP socket wait queue because it
> wants to read data from it.
> 
> The task returns to user space and core dump with a SEGV.
> 
> now my understanding is that the code that is waking up tasks, it is
> the NIC driver interrupt handler which can occur while the core dump
> is
> written.
> 
> does that make sense?
> 
> my testing is telling me that this is exactly what happens...
> 
> 
Another thing to know is that dump_interrupted() isn't only called from
do_coredump().

At first, I did the mistake to think that if dump_interrupt() was
returning false when called from do_coredump() all was good.

It is not the case. dump_interrupted() is also called from dump_emit()
which is called from several places by functions inside binfmt_elf.c

So dump_interrupted() is called several times during the coredump
generation.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-09 21:26           ` Olivier Langlois
  2021-06-09 21:56             ` Olivier Langlois
@ 2021-06-10 14:26             ` Eric W. Biederman
  2021-06-10 15:17               ` Olivier Langlois
  2021-06-10 18:58               ` [CFT}[PATCH] coredump: Limit what can interrupt coredumps Eric W. Biederman
  1 sibling, 2 replies; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-10 14:26 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

Olivier Langlois <olivier@trillion01.com> writes:

> On Wed, 2021-06-09 at 16:05 -0500, Eric W. Biederman wrote:
>> > 
>> > So the TIF_NOTIFY_SIGNAL does get set WHILE the core dump is
>> > written.
>> 
>> Did you mean?
>> 
>> So the TIF_NOTIFY_SIGNAL does _not_ get set WHILE the core dump is
>> written.
>> 
>> 
> Absolutely not. I did really mean what I have said. Bear with me that,
> I am not qualifying myself as an expert kernel dev yet so feel free to
> correct me if I say some heresy...

No.  I was just asking to make certain I understood what you said.

I thought you said you were getting a consistent 0 byte coredump,
and that implied that TIF_NOTIFY_SIGNAL was coming in before
the coredump even started.

> io_uring is placing my task in my TCP socket wait queue because it
> wants to read data from it.
>
> The task returns to user space and core dump with a SEGV.
>
> now my understanding is that the code that is waking up tasks, it is
> the NIC driver interrupt handler which can occur while the core dump is
> written.
>
> does that make sense?
>
> my testing is telling me that this is exactly what happens...

If you are getting partial coredumps that completely makes sense.



I was hoping that by this time Jens or Oleg would have been able to
chime in and at least confirm I am not missing something subtle.

I was afraid for a little bit that the file system code in called in
dump_emit would be checking signal_pending.  After looking into that I
see that the filesystem code very reasonably limits itself to testing
fatal_signal_pending (because by definition disk I/O on unix is not
interruptible).

So I will spin up a good version of my patch (based on your patch)
so we can unbreak coredumps.

Eric


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-06-10 14:26             ` Eric W. Biederman
@ 2021-06-10 15:17               ` Olivier Langlois
  2021-06-10 18:58               ` [CFT}[PATCH] coredump: Limit what can interrupt coredumps Eric W. Biederman
  1 sibling, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-06-10 15:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Thu, 2021-06-10 at 09:26 -0500, Eric W. Biederman wrote:
> Olivier Langlois <olivier@trillion01.com> writes:
> 
> > On Wed, 2021-06-09 at 16:05 -0500, Eric W. Biederman wrote:
> > > > 
> > > > So the TIF_NOTIFY_SIGNAL does get set WHILE the core dump is
> > > > written.
> > > 
> > > Did you mean?
> > > 
> > > So the TIF_NOTIFY_SIGNAL does _not_ get set WHILE the core dump
> > > is
> > > written.
> > > 
> > > 
> > Absolutely not. I did really mean what I have said. Bear with me
> > that,
> > I am not qualifying myself as an expert kernel dev yet so feel free
> > to
> > correct me if I say some heresy...
> 
> No.  I was just asking to make certain I understood what you said.
> 
> I thought you said you were getting a consistent 0 byte coredump,
> and that implied that TIF_NOTIFY_SIGNAL was coming in before
> the coredump even started.

due to the asynchronous nature of the problem, it is all random.

Sometimes, I do get 0 byte coredump.
Most of the times, I get a truncated one
and very rarely (this is why the issue was so annoying), I get a full
coredump.
> 
> So I will spin up a good version of my patch (based on your patch)
> so we can unbreak coredumps.
> 
That is super nice. I am looking forward it!

Greetings,
Olivier


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [CFT}[PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 14:26             ` Eric W. Biederman
  2021-06-10 15:17               ` Olivier Langlois
@ 2021-06-10 18:58               ` Eric W. Biederman
  2021-06-10 19:10                 ` Linus Torvalds
  1 sibling, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-10 18:58 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov


Olivier Langlois has been struggling with coredumps written incompletely
in processes using io_uring.

Olivier Langlois <olivier@trillion01.com> writes:
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
>
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>    with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>    set_notify_signal()

The coredump code deliberately supports being interrupted by SIGKILL,
and depends upon prepare_signal to filter out all other signals.   Now
that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
in dump_emitted by the coredump code no longer works.

Make the coredump code more robust by explicitly testing for all of
the wakeup conditions the coredump code supports.  This prevents
new wakeup conditions from breaking the coredump code, as well
as fixing the current issue.

The filesystem code that the coredump code uses already limits
itself to only aborting on fatal_signal_pending.  So it should
not develop surprising wake-up reasons either.

With dump_interrupted properly testing for the reasons it supports
being interrupted remove the special case from prepare_signal.

Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Olivier can you test this, and confirm this works for you?

 fs/coredump.c   | 2 +-
 kernel/signal.c | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..c3d8fc14b993 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -519,7 +519,7 @@ static bool dump_interrupted(void)
 	 * but then we need to teach dump_write() to restart and clear
 	 * TIF_SIGPENDING.
 	 */
-	return signal_pending(current);
+	return fatal_signal_pending(current) || freezing(current);
 }
 
 static void wait_for_dump_helpers(struct file *file)
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..83d534deeb76 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -943,8 +943,6 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
 	sigset_t flush;
 
 	if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
-		if (!(signal->flags & SIGNAL_GROUP_EXIT))
-			return sig == SIGKILL;
 		/*
 		 * The process is in the middle of dying, nothing to do.
 		 */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [CFT}[PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 18:58               ` [CFT}[PATCH] coredump: Limit what can interrupt coredumps Eric W. Biederman
@ 2021-06-10 19:10                 ` Linus Torvalds
  2021-06-10 19:18                   ` Eric W. Biederman
  0 siblings, 1 reply; 66+ messages in thread
From: Linus Torvalds @ 2021-06-10 19:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Olivier Langlois, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Thu, Jun 10, 2021 at 12:01 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index f7c6ffcbd044..83d534deeb76 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -943,8 +943,6 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>         sigset_t flush;
>
>         if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
> -               if (!(signal->flags & SIGNAL_GROUP_EXIT))
> -                       return sig == SIGKILL;
>                 /*
>                  * The process is in the middle of dying, nothing to do.
>                  */

I do think this part of the patch is correct, but I'd like to know
what triggered this change?

It seems fairly harmless - SIGKILL used to be the only signal that was
passed through in the coredump case, now you pass through all
non-ignored signals.

But since SIGKILL is the only signal that is relevant for the
fatal_signal_pending() case, this change seems irrelevant for the
coredump issue. Any other signals passed through won't matter.

End result: I think removing those two lines is likely a good idea,
but I also suspect it could/should just be a separate patch with a
separate explanation for it.

Hmm?

              Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [CFT}[PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 19:10                 ` Linus Torvalds
@ 2021-06-10 19:18                   ` Eric W. Biederman
  2021-06-10 19:50                     ` Linus Torvalds
  0 siblings, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-10 19:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olivier Langlois, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jun 10, 2021 at 12:01 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index f7c6ffcbd044..83d534deeb76 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -943,8 +943,6 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>>         sigset_t flush;
>>
>>         if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> -               if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> -                       return sig == SIGKILL;
>>                 /*
>>                  * The process is in the middle of dying, nothing to do.
>>                  */
>
> I do think this part of the patch is correct, but I'd like to know
> what triggered this change?
>
> It seems fairly harmless - SIGKILL used to be the only signal that was
> passed through in the coredump case, now you pass through all
> non-ignored signals.
>
> But since SIGKILL is the only signal that is relevant for the
> fatal_signal_pending() case, this change seems irrelevant for the
> coredump issue. Any other signals passed through won't matter.
>
> End result: I think removing those two lines is likely a good idea,
> but I also suspect it could/should just be a separate patch with a
> separate explanation for it.
>
> Hmm?

I just didn't want those two lines hiding any other issues we might
have in the coredumps.

That is probably better development thinking than minimal fix thinking.

I am annoyed at the moment that those two lines even exist and figure
they are the confusing root cause of the problem.  That if we had
realized all it would take was to call fatal_signal_pending || freezing
than we could have avoided a problem entirely.

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [CFT}[PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 19:18                   ` Eric W. Biederman
@ 2021-06-10 19:50                     ` Linus Torvalds
  2021-06-10 20:11                       ` [PATCH] " Eric W. Biederman
  0 siblings, 1 reply; 66+ messages in thread
From: Linus Torvalds @ 2021-06-10 19:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Olivier Langlois, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Thu, Jun 10, 2021 at 12:18 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I just didn't want those two lines hiding any other issues we might
> have in the coredumps.
>
> That is probably better development thinking than minimal fix thinking.

Well, I think we should first do the minimal targeted fix (the part in
fs/coredump.c).

Then we should look at whether we could do cleanups as a result of that fix.

And I suspect the cleanups might bigger than the two-liner removal.
The whole SIGNAL_GROUP_COREDUMP flag was introduced for this issue,

See commit 403bad72b67d ("coredump: only SIGKILL should interrupt the
coredumping task") which introduced this all.

Now, we have since grown other users of SIGNAL_GROUP_COREDUMP - OOM
hanmdling and the clear_child_tid thing in mm_release(). So maybe we
should keep SIGNAL_GROUP_COREDUMP around.

So maybe only those two lines end up being the ones to remove, but I'd
really like to think of it as a separate thing from the fix itself.

                Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 19:50                     ` Linus Torvalds
@ 2021-06-10 20:11                       ` Eric W. Biederman
  2021-06-10 21:04                         ` Linus Torvalds
                                           ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-10 20:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Olivier Langlois, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov


Olivier Langlois has been struggling with coredumps being incompletely written in
processes using io_uring.

Olivier Langlois <olivier@trillion01.com> writes:
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
>
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>    with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>    set_notify_signal()

The coredump code deliberately supports being interrupted by SIGKILL,
and depends upon prepare_signal to filter out all other signals.   Now
that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
in dump_emitted by the coredump code no longer works.

Make the coredump code more robust by explicitly testing for all of
the wakeup conditions the coredump code supports.  This prevents
new wakeup conditions from breaking the coredump code, as well
as fixing the current issue.

The filesystem code that the coredump code uses already limits
itself to only aborting on fatal_signal_pending.  So it should
not develop surprising wake-up reasons either.

v2: Don't remove the now unnecessary code in prepare_signal.

Cc: stable@vger.kernel.org
Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..c3d8fc14b993 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -519,7 +519,7 @@ static bool dump_interrupted(void)
 	 * but then we need to teach dump_write() to restart and clear
 	 * TIF_SIGPENDING.
 	 */
-	return signal_pending(current);
+	return fatal_signal_pending(current) || freezing(current);
 }
 
 static void wait_for_dump_helpers(struct file *file)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 20:11                       ` [PATCH] " Eric W. Biederman
@ 2021-06-10 21:04                         ` Linus Torvalds
  2021-06-12 14:36                         ` Olivier Langlois
  2021-06-14 14:10                         ` Oleg Nesterov
  2 siblings, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2021-06-10 21:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Olivier Langlois, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Thu, Jun 10, 2021 at 1:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports.  This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.

Thanks, applied.

And lots of thanks to Olivier who kept debugging the odd coredump
behavior he saw.

            Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 20:11                       ` [PATCH] " Eric W. Biederman
  2021-06-10 21:04                         ` Linus Torvalds
@ 2021-06-12 14:36                         ` Olivier Langlois
  2021-06-12 16:26                           ` Jens Axboe
  2021-06-14 14:10                         ` Oleg Nesterov
  2 siblings, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2021-06-12 14:36 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, io-uring,
	Alexander Viro, Jens Axboe, Pavel Begunkov>,
	Oleg Nesterov

On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
> 
> Olivier Langlois has been struggling with coredumps being incompletely
> written in
> processes using io_uring.
> 
> Olivier Langlois <olivier@trillion01.com> writes:
> > io_uring is a big user of task_work and any event that io_uring made
> > a
> > task waiting for that occurs during the core dump generation will
> > generate a TIF_NOTIFY_SIGNAL.
> > 
> > Here are the detailed steps of the problem:
> > 1. io_uring calls vfs_poll() to install a task to a file wait queue
> >    with io_async_wake() as the wakeup function cb from
> > io_arm_poll_handler()
> > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
> >    set_notify_signal()
> 
> The coredump code deliberately supports being interrupted by SIGKILL,
> and depends upon prepare_signal to filter out all other signals.   Now
> that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
> in dump_emitted by the coredump code no longer works.
> 
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports.  This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.
> 
> The filesystem code that the coredump code uses already limits
> itself to only aborting on fatal_signal_pending.  So it should
> not develop surprising wake-up reasons either.
> 
> v2: Don't remove the now unnecessary code in prepare_signal.
> 
> Cc: stable@vger.kernel.org
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Olivier Langlois <olivier@trillion01.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/coredump.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 2868e3e171ae..c3d8fc14b993 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>          * but then we need to teach dump_write() to restart and clear
>          * TIF_SIGPENDING.
>          */
> -       return signal_pending(current);
> +       return fatal_signal_pending(current) || freezing(current);
>  }
>  
>  static void wait_for_dump_helpers(struct file *file)

Tested-by: Olivier Langlois <olivier@trillion01.com>



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-12 14:36                         ` Olivier Langlois
@ 2021-06-12 16:26                           ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2021-06-12 16:26 UTC (permalink / raw)
  To: Olivier Langlois, Eric W. Biederman, Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, io-uring,
	Alexander Viro, Pavel Begunkov>,
	Oleg Nesterov

On 6/12/21 8:36 AM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
>>
>> Olivier Langlois has been struggling with coredumps being incompletely
>> written in
>> processes using io_uring.
>>
>> Olivier Langlois <olivier@trillion01.com> writes:
>>> io_uring is a big user of task_work and any event that io_uring made
>>> a
>>> task waiting for that occurs during the core dump generation will
>>> generate a TIF_NOTIFY_SIGNAL.
>>>
>>> Here are the detailed steps of the problem:
>>> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>>>    with io_async_wake() as the wakeup function cb from
>>> io_arm_poll_handler()
>>> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
>>> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>>>    set_notify_signal()
>>
>> The coredump code deliberately supports being interrupted by SIGKILL,
>> and depends upon prepare_signal to filter out all other signals.   Now
>> that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
>> in dump_emitted by the coredump code no longer works.
>>
>> Make the coredump code more robust by explicitly testing for all of
>> the wakeup conditions the coredump code supports.  This prevents
>> new wakeup conditions from breaking the coredump code, as well
>> as fixing the current issue.
>>
>> The filesystem code that the coredump code uses already limits
>> itself to only aborting on fatal_signal_pending.  So it should
>> not develop surprising wake-up reasons either.
>>
>> v2: Don't remove the now unnecessary code in prepare_signal.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>> Reported-by: Olivier Langlois <olivier@trillion01.com>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/coredump.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 2868e3e171ae..c3d8fc14b993 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>          * but then we need to teach dump_write() to restart and clear
>>          * TIF_SIGPENDING.
>>          */
>> -       return signal_pending(current);
>> +       return fatal_signal_pending(current) || freezing(current);
>>  }
>>  
>>  static void wait_for_dump_helpers(struct file *file)
> 
> Tested-by: Olivier Langlois <olivier@trillion01.com>

Thanks Olivier and Eric for taking care of this. I've been mostly
offline for more than a week, back at it next week.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-10 20:11                       ` [PATCH] " Eric W. Biederman
  2021-06-10 21:04                         ` Linus Torvalds
  2021-06-12 14:36                         ` Olivier Langlois
@ 2021-06-14 14:10                         ` Oleg Nesterov
  2021-06-14 16:37                           ` Eric W. Biederman
  2021-06-15 22:08                           ` Eric W. Biederman
  2 siblings, 2 replies; 66+ messages in thread
From: Oleg Nesterov @ 2021-06-14 14:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Olivier Langlois, Linux Kernel Mailing List,
	linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Pavel Begunkov>

Eric, et al, sorry for delay, I didn't read emails several days.

On 06/10, Eric W. Biederman wrote:
>
> v2: Don't remove the now unnecessary code in prepare_signal.

No, that code is still needed. Otherwise any fatal signal will be
turned into SIGKILL.

> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>  	 * but then we need to teach dump_write() to restart and clear
>  	 * TIF_SIGPENDING.
>  	 */
> -	return signal_pending(current);
> +	return fatal_signal_pending(current) || freezing(current);
>  }


Well yes, this is what the comment says.

But note that there is another reason why dump_interrupted() returns true
if signal_pending(), it assumes thagt __dump_emit()->__kernel_write() may
fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
perhaps something else...

That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
dumping threads?

Or perhaps we should change __dump_emit() to clear signal_pending() and
restart __kernel_write() if it fails or returns a short write.

Otherwise the change above doesn't look like a full fix to me.

Oleg.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-14 14:10                         ` Oleg Nesterov
@ 2021-06-14 16:37                           ` Eric W. Biederman
  2021-06-14 16:59                             ` Oleg Nesterov
  2021-06-15 22:08                           ` Eric W. Biederman
  1 sibling, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-14 16:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Olivier Langlois, Linux Kernel Mailing List,
	linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Pavel Begunkov>

Oleg Nesterov <oleg@redhat.com> writes:

> Eric, et al, sorry for delay, I didn't read emails several days.
>
> On 06/10, Eric W. Biederman wrote:
>>
>> v2: Don't remove the now unnecessary code in prepare_signal.
>
> No, that code is still needed. Otherwise any fatal signal will be
> turned into SIGKILL.
>
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>  	 * but then we need to teach dump_write() to restart and clear
>>  	 * TIF_SIGPENDING.
>>  	 */
>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>>  }
>
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns true
> if signal_pending(), it assumes thagt __dump_emit()->__kernel_write() may
> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
> perhaps something else...

The pipe_write case is a good case to consider.  In general filesystems
are only allowed to stop if fatal_signal_pending.  It is an ancient
unix/posix thing.

> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
> dumping threads?

I would very much like some clarity on TIF_NOTIFY_SIGNAL.  At the very
least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK.
I don't know what it has to do with signals.

> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.

Given that the code needs to handle pipes that seems a reasonable thing
to do.  Note.  All of the blocking of new signals in prepare_signal
is still in place.  So only a SIGKILL can come in set TIF_SIGPENDING.

So I would say that the current fix is correct (and safe to backport).
But still requires some magic in prepare_signal until we do more.

I don't understand the logic with well enough of adding work to
non-io_uring threads with task_work_add to understand why that happens
in the first place.

There are a lot of silly corners here.  Yes please let's keep on
digging.

Eric


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-14 16:37                           ` Eric W. Biederman
@ 2021-06-14 16:59                             ` Oleg Nesterov
  0 siblings, 0 replies; 66+ messages in thread
From: Oleg Nesterov @ 2021-06-14 16:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Olivier Langlois, Linux Kernel Mailing List,
	linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Pavel Begunkov>

On 06/14, Eric W. Biederman wrote:
>
> I would very much like some clarity on TIF_NOTIFY_SIGNAL.  At the very
> least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK.

No, no, no ;)

I think that, for example, freezer should be changed to use
set_notify_signal() rather than fake_signal_wake_up(). Livepatch.
And probably it will have more users.

> I don't understand the logic with well enough of adding work to
> non-io_uring threads with task_work_add to understand why that happens
> in the first place.

Same here.

Oleg.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-14 14:10                         ` Oleg Nesterov
  2021-06-14 16:37                           ` Eric W. Biederman
@ 2021-06-15 22:08                           ` Eric W. Biederman
  2021-06-16 19:23                             ` Olivier Langlois
  2021-08-05 13:06                             ` Olivier Langlois
  1 sibling, 2 replies; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-15 22:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Olivier Langlois, Linux Kernel Mailing List,
	linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Pavel Begunkov>

Oleg Nesterov <oleg@redhat.com> writes:

>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>  	 * but then we need to teach dump_write() to restart and clear
>>  	 * TIF_SIGPENDING.
>>  	 */
>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>>  }
>
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns true
> if signal_pending(), it assumes thagt __dump_emit()->__kernel_write() may
> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
> perhaps something else...
>
> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
> dumping threads?
>
> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.
>
> Otherwise the change above doesn't look like a full fix to me.

Agreed.  The coredump to a pipe will still be short.  That needs
something additional.

The problem Olivier Langlois <olivier@trillion01.com> reported was
core dumps coming up short because TIF_NOTIFY_SIGNAL was being
set during a core dump.

We can see this with pipe_write returning -ERESTARTSYS
on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
is true.

Looking further if the thread that is core dumping initiated
any io_uring work then io_ring_exit_work will use task_work_add
to request that thread clean up it's io_uring state.

Perhaps we can put a big comment in dump_emit and if we
get back -ERESTARTSYS run tracework_notify_signal.  I am not
seeing any locks held at that point in the coredump, so it
should be safe.  The coredump is run inside of file_start_write
which is the only potential complication.



The code flow is complicated but it looks like the entire
point of the exercise is to call io_uring_del_task_file
on the originating thread.  I suppose that keeps the
locking of the xarray in io_uring_task simple.


Hmm.   All of this comes from io_uring_release.
How do we get to io_uring_release?  The coredump should
be catching everything in exit_mm before exit_files?

Confused and hopeful someone can explain to me what is going on,
and perhaps simplify it.

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-15 22:08                           ` Eric W. Biederman
@ 2021-06-16 19:23                             ` Olivier Langlois
  2021-06-16 20:00                               ` Eric W. Biederman
  2021-08-05 13:06                             ` Olivier Langlois
  1 sibling, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2021-06-16 19:23 UTC (permalink / raw)
  To: Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>

On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
> 
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
> > >          * but then we need to teach dump_write() to restart and
> > > clear
> > >          * TIF_SIGPENDING.
> > >          */
> > > -       return signal_pending(current);
> > > +       return fatal_signal_pending(current) ||
> > > freezing(current);
> > >  }
> > 
> > 
> > Well yes, this is what the comment says.
> > 
> > But note that there is another reason why dump_interrupted()
> > returns true
> > if signal_pending(), it assumes thagt __dump_emit()-
> > >__kernel_write() may
> > fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
> > nfs,
> > perhaps something else...
> > 
> > That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
> > clear
> > TIF_NOTIFY_SIGNAL as well and we should change io-uring to not
> > abuse the
> > dumping threads?
> > 
> > Or perhaps we should change __dump_emit() to clear signal_pending()
> > and
> > restart __kernel_write() if it fails or returns a short write.
> > 
> > Otherwise the change above doesn't look like a full fix to me.
> 
> Agreed.  The coredump to a pipe will still be short.  That needs
> something additional.
> 
> The problem Olivier Langlois <olivier@trillion01.com> reported was
> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
> set during a core dump.
> 
> We can see this with pipe_write returning -ERESTARTSYS
> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
> is true.
> 
Eric,

I redid my test but this time instead of dumping directly into a file,
I did let the coredump be piped to the systemd coredump module and the
coredump generation isn't working as expected when piping.

So your code review conclusions are correct.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-16 19:23                             ` Olivier Langlois
@ 2021-06-16 20:00                               ` Eric W. Biederman
  2021-06-18 20:05                                 ` Olivier Langlois
  0 siblings, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:00 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Oleg Nesterov, Linus Torvalds, Linux Kernel Mailing List,
	linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Pavel Begunkov>

Olivier Langlois <olivier@trillion01.com> writes:

> I redid my test but this time instead of dumping directly into a file,
> I did let the coredump be piped to the systemd coredump module and the
> coredump generation isn't working as expected when piping.
>
> So your code review conclusions are correct.

Thank you for confirming that.

Do you know how your test program is using io_uring?

I have been trying to put the pieces together on what io_uring is doing
that stops the coredump.  The fact that it takes a little while before
it kills the coredump is a little puzzling.  The code looks like all of
the io_uring operations should have been canceled before the coredump
starts.

Thanks,
Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-16 20:00                               ` Eric W. Biederman
@ 2021-06-18 20:05                                 ` Olivier Langlois
  0 siblings, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-06-18 20:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Linus Torvalds, Linux Kernel Mailing List,
	linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Pavel Begunkov>

On Wed, 2021-06-16 at 15:00 -0500, Eric W. Biederman wrote:
> Olivier Langlois <olivier@trillion01.com> writes:
> 
> > I redid my test but this time instead of dumping directly into a
> > file,
> > I did let the coredump be piped to the systemd coredump module and
> > the
> > coredump generation isn't working as expected when piping.
> > 
> > So your code review conclusions are correct.
> 
> Thank you for confirming that.
> 
> Do you know how your test program is using io_uring?
> 
> I have been trying to put the pieces together on what io_uring is
> doing
> that stops the coredump.  The fact that it takes a little while
> before
> it kills the coredump is a little puzzling.  The code looks like all
> of
> the io_uring operations should have been canceled before the coredump
> starts.
> 
> 
With a very simple setup, I guess that this could easily be
reproducible. Make a TCP connection with a server that is streaming
non-stop data and enter a loop where you keep initiating async
OP_IOURING_READ operations on your TCP fd.

Once you have that, manually sending a SIG_SEGV is a sure fire way to
stumble into the problem. This is how I am testing the patches.

IRL, it is possible to call io_uring_enter() to submit operations and
return from the syscall without waiting on all events to have
completed. Once the process is back in userspace, if it stumble into a
bug that triggers a coredump, any remaining pending I/O operations can
set TIF_SIGNAL_NOTIFY while the coredump is generated.

I have read the part of your previous email where you share the result
of your ongoing investigation. I didn't comment as the definitive
references in io_uring matters are Jens and Pavel but I am going to
share my opinion on the matter.

I think that you did put the finger on the code cleaning up the
io_uring instance in regards to pending operations. It seems to be
io_uring_release() which is probably called from exit_files() which
happens to be after the call to exit_mm().

At first, I did entertain the idea of considering if it could be
possible to duplicate some of the operations performed by
io_uring_release() related to the infamous TIF_SIGNAL_NOTIFY setting
into io_uring_files_cancel() which is called before exit_mm().

but the idea is useless as it is not the other threads of the group
that are causing the TIF_SIGNAL_NOTIFY problem. It is the thread
calling do_coredump() which is done by the signal handing code even
before that thread enters do_exit() and start to be cleaned up. That
thread when it enters do_coredump() is still fully loaded and
operational in terms of io_uring functionality.

I guess that this io_uring cancel all pending operations hook would
have to be called from do_coredump or from get_signal() but if it is
the way to go, I feel that this is a change major enough that wouldn't
dare going there without the blessing of the maintainers in cause....



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-06-15 22:08                           ` Eric W. Biederman
  2021-06-16 19:23                             ` Olivier Langlois
@ 2021-08-05 13:06                             ` Olivier Langlois
  2021-08-10 21:48                               ` Tony Battersby
  1 sibling, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2021-08-05 13:06 UTC (permalink / raw)
  To: Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>

On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
> 
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
> > >          * but then we need to teach dump_write() to restart and
> > > clear
> > >          * TIF_SIGPENDING.
> > >          */
> > > -       return signal_pending(current);
> > > +       return fatal_signal_pending(current) || freezing(current);
> > >  }
> > 
> > 
> > Well yes, this is what the comment says.
> > 
> > But note that there is another reason why dump_interrupted() returns
> > true
> > if signal_pending(), it assumes thagt __dump_emit()->__kernel_write()
> > may
> > fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
> > nfs,
> > perhaps something else...
> > 
> > That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
> > clear
> > TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse
> > the
> > dumping threads?
> > 
> > Or perhaps we should change __dump_emit() to clear signal_pending()
> > and
> > restart __kernel_write() if it fails or returns a short write.
> > 
> > Otherwise the change above doesn't look like a full fix to me.
> 
> Agreed.  The coredump to a pipe will still be short.  That needs
> something additional.
> 
> The problem Olivier Langlois <olivier@trillion01.com> reported was
> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
> set during a core dump.
> 
> We can see this with pipe_write returning -ERESTARTSYS
> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
> is true.
> 
> Looking further if the thread that is core dumping initiated
> any io_uring work then io_ring_exit_work will use task_work_add
> to request that thread clean up it's io_uring state.
> 
> Perhaps we can put a big comment in dump_emit and if we
> get back -ERESTARTSYS run tracework_notify_signal.  I am not
> seeing any locks held at that point in the coredump, so it
> should be safe.  The coredump is run inside of file_start_write
> which is the only potential complication.
> 
> 
> 
> The code flow is complicated but it looks like the entire
> point of the exercise is to call io_uring_del_task_file
> on the originating thread.  I suppose that keeps the
> locking of the xarray in io_uring_task simple.
> 
> 
> Hmm.   All of this comes from io_uring_release.
> How do we get to io_uring_release?  The coredump should
> be catching everything in exit_mm before exit_files?
> 
> Confused and hopeful someone can explain to me what is going on,
> and perhaps simplify it.
> 
> Eric

Hi all,

I didn't forgot about this remaining issue and I have kept thinking
about it on and off.

I did try the following on 5.12.19:

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..614fe7a54c1a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		need_suid_safe = true;
 	}
 
+	io_uring_files_cancel(current->files);
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;
-- 
2.32.0

with my current understanding, io_uring_files_cancel is supposed to
cancel everything that might set the TIF_NOTIFY_SIGNAL.

I must report that in my testing with generating a core dump through a
pipe with the modif above, I still get truncated core dumps.

systemd is having a weird error:
[ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
process

and nothing is captured

so I have replaced it with a very simple shell:
$ cat /proc/sys/kernel/core_pattern 
|/home/lano1106/bin/pipe_core.sh %e %p

~/bin $ cat pipe_core.sh 
#!/bin/sh

cat > /home/lano1106/core/core.$1.$2

BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
expected core file size >= 24129536, found: 61440

I conclude from my attempt that maybe io_uring_files_cancel is not 100%
cleaning everything that it should clean.



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-05 13:06                             ` Olivier Langlois
@ 2021-08-10 21:48                               ` Tony Battersby
  2021-08-11 20:47                                 ` Olivier Langlois
  2021-08-12  1:55                                 ` Jens Axboe
  0 siblings, 2 replies; 66+ messages in thread
From: Tony Battersby @ 2021-08-10 21:48 UTC (permalink / raw)
  To: Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>

On 8/5/21 9:06 AM, Olivier Langlois wrote:
> On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>>>          * but then we need to teach dump_write() to restart and
>>>> clear
>>>>          * TIF_SIGPENDING.
>>>>          */
>>>> -       return signal_pending(current);
>>>> +       return fatal_signal_pending(current) || freezing(current);
>>>>  }
>>> Well yes, this is what the comment says.
>>>
>>> But note that there is another reason why dump_interrupted() returns
>>> true
>>> if signal_pending(), it assumes thagt __dump_emit()->__kernel_write()
>>> may
>>> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
>>> nfs,
>>> perhaps something else...
>>>
>>> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
>>> clear
>>> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse
>>> the
>>> dumping threads?
>>>
>>> Or perhaps we should change __dump_emit() to clear signal_pending()
>>> and
>>> restart __kernel_write() if it fails or returns a short write.
>>>
>>> Otherwise the change above doesn't look like a full fix to me.
>> Agreed.  The coredump to a pipe will still be short.  That needs
>> something additional.
>>
>> The problem Olivier Langlois <olivier@trillion01.com> reported was
>> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
>> set during a core dump.
>>
>> We can see this with pipe_write returning -ERESTARTSYS
>> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
>> is true.
>>
>> Looking further if the thread that is core dumping initiated
>> any io_uring work then io_ring_exit_work will use task_work_add
>> to request that thread clean up it's io_uring state.
>>
>> Perhaps we can put a big comment in dump_emit and if we
>> get back -ERESTARTSYS run tracework_notify_signal.  I am not
>> seeing any locks held at that point in the coredump, so it
>> should be safe.  The coredump is run inside of file_start_write
>> which is the only potential complication.
>>
>>
>>
>> The code flow is complicated but it looks like the entire
>> point of the exercise is to call io_uring_del_task_file
>> on the originating thread.  I suppose that keeps the
>> locking of the xarray in io_uring_task simple.
>>
>>
>> Hmm.   All of this comes from io_uring_release.
>> How do we get to io_uring_release?  The coredump should
>> be catching everything in exit_mm before exit_files?
>>
>> Confused and hopeful someone can explain to me what is going on,
>> and perhaps simplify it.
>>
>> Eric
> Hi all,
>
> I didn't forgot about this remaining issue and I have kept thinking
> about it on and off.
>
> I did try the following on 5.12.19:
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..614fe7a54c1a 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -41,6 +41,7 @@
>  #include <linux/fs.h>
>  #include <linux/path.h>
>  #include <linux/timekeeping.h>
> +#include <linux/io_uring.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		need_suid_safe = true;
>  	}
>  
> +	io_uring_files_cancel(current->files);
> +
>  	retval = coredump_wait(siginfo->si_signo, &core_state);
>  	if (retval < 0)
>  		goto fail_creds;
> --
> 2.32.0
>
> with my current understanding, io_uring_files_cancel is supposed to
> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>
> I must report that in my testing with generating a core dump through a
> pipe with the modif above, I still get truncated core dumps.
>
> systemd is having a weird error:
> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
> process
>
> and nothing is captured
>
> so I have replaced it with a very simple shell:
> $ cat /proc/sys/kernel/core_pattern 
> |/home/lano1106/bin/pipe_core.sh %e %p
>
> ~/bin $ cat pipe_core.sh 
> #!/bin/sh
>
> cat > /home/lano1106/core/core.$1.$2
>
> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
> expected core file size >= 24129536, found: 61440
>
> I conclude from my attempt that maybe io_uring_files_cancel is not 100%
> cleaning everything that it should clean.
>
>
>
I just ran into this problem also - coredumps from an io_uring program
to a pipe are truncated.  But I am using kernel 5.10.57, which does NOT
have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") or
commit 06af8679449d ("coredump: Limit what can interrupt coredumps"). 
Kernel 5.4 works though, so I bisected the problem to commit
f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
properly") in kernel 5.9.  Note that my io_uring program uses only async
buffered reads, which may be why this particular commit makes a
difference to my program.

My io_uring program is a multi-purpose long-running program with many
threads.  Most threads don't use io_uring but a few of them do. 
Normally, my core dumps are piped to a program so that they can be
compressed before being written to disk, but I can also test writing the
core dumps directly to disk.  This is what I have found:

*) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers a
coredump, the core file is written correctly, whether it is written to
disk or piped to a program, even if another thread is using io_uring at
the same time.

*) Unpatched 5.10.57: if a thread that uses io_uring triggers a
coredump, the core file is truncated, whether written directly to disk
or piped to a program.

*) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
triggers a coredump, and the core is written directly to disk, then it
is written correctly.

*) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
triggers a coredump, and the core is piped to a program, then it is
truncated.

*) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
whether written directly to disk or piped to a program.

Tony Battersby
Cybernetics


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-10 21:48                               ` Tony Battersby
@ 2021-08-11 20:47                                 ` Olivier Langlois
  2021-08-12  1:55                                 ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-08-11 20:47 UTC (permalink / raw)
  To: Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Jens Axboe, Pavel Begunkov>

On Tue, 2021-08-10 at 17:48 -0400, Tony Battersby wrote:
> > 
> I just ran into this problem also - coredumps from an io_uring
> program
> to a pipe are truncated.  But I am using kernel 5.10.57, which does
> NOT
> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> or
> commit 06af8679449d ("coredump: Limit what can interrupt
> coredumps"). 
> Kernel 5.4 works though, so I bisected the problem to commit
> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
> properly") in kernel 5.9.  Note that my io_uring program uses only
> async
> buffered reads, which may be why this particular commit makes a
> difference to my program.
> 
> My io_uring program is a multi-purpose long-running program with many
> threads.  Most threads don't use io_uring but a few of them do. 
> Normally, my core dumps are piped to a program so that they can be
> compressed before being written to disk, but I can also test writing
> the
> core dumps directly to disk.  This is what I have found:
> 
> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
> a
> coredump, the core file is written correctly, whether it is written
> to
> disk or piped to a program, even if another thread is using io_uring
> at
> the same time.
> 
> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
> coredump, the core file is truncated, whether written directly to
> disk
> or piped to a program.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is written directly to disk, then
> it
> is written correctly.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is piped to a program, then it is
> truncated.
> 
> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> whether written directly to disk or piped to a program.
> 
> Tony Battersby
> Cybernetics
> 
Tony,

this is super interesting details. I'm leaving for few days so I will
not be able to look into it until I am back but here is my
interpretation of your findings:

f38c7e3abfba makes it more likely that your task ends up in a fd read
wait queue. Previously the io_uring req queuing was failing and
returning EAGAIN. Now it ends up using io uring fast poll.

When the core dump gets written through a pipe, pipe_write must block
waiting for some event. If the task gets waken up by the io_uring wait
queue entry instead, it must somehow make pipe_write fails.

So the problem must be a mix of TIF_NOTIFY_SIGNAL and the fact that
io_uring wait queue entries aren't cleaned up while doing the core
dump.

I have a new modif to try out. I'll hopefully be able to submit a patch
to fix that once I come back (I cannot do it now or else, I'll never
leave ;-))



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-10 21:48                               ` Tony Battersby
  2021-08-11 20:47                                 ` Olivier Langlois
@ 2021-08-12  1:55                                 ` Jens Axboe
  2021-08-12 13:53                                   ` Tony Battersby
  2021-08-15 20:42                                   ` Olivier Langlois
  1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2021-08-12  1:55 UTC (permalink / raw)
  To: Tony Battersby, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/10/21 3:48 PM, Tony Battersby wrote:
> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>> On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
>>> Oleg Nesterov <oleg@redhat.com> writes:
>>>
>>>>> --- a/fs/coredump.c
>>>>> +++ b/fs/coredump.c
>>>>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>>>>          * but then we need to teach dump_write() to restart and
>>>>> clear
>>>>>          * TIF_SIGPENDING.
>>>>>          */
>>>>> -       return signal_pending(current);
>>>>> +       return fatal_signal_pending(current) || freezing(current);
>>>>>  }
>>>> Well yes, this is what the comment says.
>>>>
>>>> But note that there is another reason why dump_interrupted() returns
>>>> true
>>>> if signal_pending(), it assumes thagt __dump_emit()->__kernel_write()
>>>> may
>>>> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
>>>> nfs,
>>>> perhaps something else...
>>>>
>>>> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
>>>> clear
>>>> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse
>>>> the
>>>> dumping threads?
>>>>
>>>> Or perhaps we should change __dump_emit() to clear signal_pending()
>>>> and
>>>> restart __kernel_write() if it fails or returns a short write.
>>>>
>>>> Otherwise the change above doesn't look like a full fix to me.
>>> Agreed.  The coredump to a pipe will still be short.  That needs
>>> something additional.
>>>
>>> The problem Olivier Langlois <olivier@trillion01.com> reported was
>>> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
>>> set during a core dump.
>>>
>>> We can see this with pipe_write returning -ERESTARTSYS
>>> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
>>> is true.
>>>
>>> Looking further if the thread that is core dumping initiated
>>> any io_uring work then io_ring_exit_work will use task_work_add
>>> to request that thread clean up it's io_uring state.
>>>
>>> Perhaps we can put a big comment in dump_emit and if we
>>> get back -ERESTARTSYS run tracework_notify_signal.  I am not
>>> seeing any locks held at that point in the coredump, so it
>>> should be safe.  The coredump is run inside of file_start_write
>>> which is the only potential complication.
>>>
>>>
>>>
>>> The code flow is complicated but it looks like the entire
>>> point of the exercise is to call io_uring_del_task_file
>>> on the originating thread.  I suppose that keeps the
>>> locking of the xarray in io_uring_task simple.
>>>
>>>
>>> Hmm.   All of this comes from io_uring_release.
>>> How do we get to io_uring_release?  The coredump should
>>> be catching everything in exit_mm before exit_files?
>>>
>>> Confused and hopeful someone can explain to me what is going on,
>>> and perhaps simplify it.
>>>
>>> Eric
>> Hi all,
>>
>> I didn't forgot about this remaining issue and I have kept thinking
>> about it on and off.
>>
>> I did try the following on 5.12.19:
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 07afb5ddb1c4..614fe7a54c1a 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -41,6 +41,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/path.h>
>>  #include <linux/timekeeping.h>
>> +#include <linux/io_uring.h>
>>  
>>  #include <linux/uaccess.h>
>>  #include <asm/mmu_context.h>
>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  		need_suid_safe = true;
>>  	}
>>  
>> +	io_uring_files_cancel(current->files);
>> +
>>  	retval = coredump_wait(siginfo->si_signo, &core_state);
>>  	if (retval < 0)
>>  		goto fail_creds;
>> --
>> 2.32.0
>>
>> with my current understanding, io_uring_files_cancel is supposed to
>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>
>> I must report that in my testing with generating a core dump through a
>> pipe with the modif above, I still get truncated core dumps.
>>
>> systemd is having a weird error:
>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>> process
>>
>> and nothing is captured
>>
>> so I have replaced it with a very simple shell:
>> $ cat /proc/sys/kernel/core_pattern 
>> |/home/lano1106/bin/pipe_core.sh %e %p
>>
>> ~/bin $ cat pipe_core.sh 
>> #!/bin/sh
>>
>> cat > /home/lano1106/core/core.$1.$2
>>
>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>> expected core file size >= 24129536, found: 61440
>>
>> I conclude from my attempt that maybe io_uring_files_cancel is not 100%
>> cleaning everything that it should clean.
>>
>>
>>
> I just ran into this problem also - coredumps from an io_uring program
> to a pipe are truncated.  But I am using kernel 5.10.57, which does NOT
> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") or
> commit 06af8679449d ("coredump: Limit what can interrupt coredumps"). 
> Kernel 5.4 works though, so I bisected the problem to commit
> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
> properly") in kernel 5.9.  Note that my io_uring program uses only async
> buffered reads, which may be why this particular commit makes a
> difference to my program.
> 
> My io_uring program is a multi-purpose long-running program with many
> threads.  Most threads don't use io_uring but a few of them do. 
> Normally, my core dumps are piped to a program so that they can be
> compressed before being written to disk, but I can also test writing the
> core dumps directly to disk.  This is what I have found:
> 
> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers a
> coredump, the core file is written correctly, whether it is written to
> disk or piped to a program, even if another thread is using io_uring at
> the same time.
> 
> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
> coredump, the core file is truncated, whether written directly to disk
> or piped to a program.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is written directly to disk, then it
> is written correctly.
> 
> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> triggers a coredump, and the core is piped to a program, then it is
> truncated.
> 
> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> whether written directly to disk or piped to a program.

That is very interesting. Like Olivier mentioned, it's not that actual
commit, but rather the change of behavior implemented by it. Before that
commit, we'd hit the async workers more often, whereas after we do the
correct retry method where it's driven by the wakeup when the page is
unlocked. This is purely speculation, but perhaps the fact that the
process changes state potentially mid dump is why the dump ends up being
truncated?

I'd love to dive into this and try and figure it out. Absent a test
case, at least the above gives me an idea of what to try out. I'll see
if it makes it easier for me to create a case that does result in a
truncated core dump.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-12  1:55                                 ` Jens Axboe
@ 2021-08-12 13:53                                   ` Tony Battersby
  2021-08-15 20:42                                   ` Olivier Langlois
  1 sibling, 0 replies; 66+ messages in thread
From: Tony Battersby @ 2021-08-12 13:53 UTC (permalink / raw)
  To: Jens Axboe, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/11/21 9:55 PM, Jens Axboe wrote:
>
> That is very interesting. Like Olivier mentioned, it's not that actual
> commit, but rather the change of behavior implemented by it. Before that
> commit, we'd hit the async workers more often, whereas after we do the
> correct retry method where it's driven by the wakeup when the page is
> unlocked. This is purely speculation, but perhaps the fact that the
> process changes state potentially mid dump is why the dump ends up being
> truncated?
>
> I'd love to dive into this and try and figure it out. Absent a test
> case, at least the above gives me an idea of what to try out. I'll see
> if it makes it easier for me to create a case that does result in a
> truncated core dump.
>
If it helps, a "good" coredump from my program is about 350 MB
compressed down to about 7 MB by bzip2.  A truncated coredump varies in
size from about 60 KB to about 2 MB before compression.  The program
that receives the coredump uses bzip2 to compress the data before
writing it to disk.

Tony


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-12  1:55                                 ` Jens Axboe
  2021-08-12 13:53                                   ` Tony Battersby
@ 2021-08-15 20:42                                   ` Olivier Langlois
  2021-08-16 13:02                                     ` Pavel Begunkov
  2021-08-17 18:15                                     ` Jens Axboe
  1 sibling, 2 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-08-15 20:42 UTC (permalink / raw)
  To: Jens Axboe, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
> On 8/10/21 3:48 PM, Tony Battersby wrote:
> > On 8/5/21 9:06 AM, Olivier Langlois wrote:
> > > 
> > > Hi all,
> > > 
> > > I didn't forgot about this remaining issue and I have kept thinking
> > > about it on and off.
> > > 
> > > I did try the following on 5.12.19:
> > > 
> > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > index 07afb5ddb1c4..614fe7a54c1a 100644
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -41,6 +41,7 @@
> > >  #include <linux/fs.h>
> > >  #include <linux/path.h>
> > >  #include <linux/timekeeping.h>
> > > +#include <linux/io_uring.h>
> > >  
> > >  #include <linux/uaccess.h>
> > >  #include <asm/mmu_context.h>
> > > @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
> > > *siginfo)
> > >                 need_suid_safe = true;
> > >         }
> > >  
> > > +       io_uring_files_cancel(current->files);
> > > +
> > >         retval = coredump_wait(siginfo->si_signo, &core_state);
> > >         if (retval < 0)
> > >                 goto fail_creds;
> > > --
> > > 2.32.0
> > > 
> > > with my current understanding, io_uring_files_cancel is supposed to
> > > cancel everything that might set the TIF_NOTIFY_SIGNAL.
> > > 
> > > I must report that in my testing with generating a core dump
> > > through a
> > > pipe with the modif above, I still get truncated core dumps.
> > > 
> > > systemd is having a weird error:
> > > [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
> > > process
> > > 
> > > and nothing is captured
> > > 
> > > so I have replaced it with a very simple shell:
> > > $ cat /proc/sys/kernel/core_pattern 
> > > > /home/lano1106/bin/pipe_core.sh %e %p
> > > 
> > > ~/bin $ cat pipe_core.sh 
> > > #!/bin/sh
> > > 
> > > cat > /home/lano1106/core/core.$1.$2
> > > 
> > > BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
> > > expected core file size >= 24129536, found: 61440
> > > 
> > > I conclude from my attempt that maybe io_uring_files_cancel is not
> > > 100%
> > > cleaning everything that it should clean.
> > > 
> > > 
> > > 
> > I just ran into this problem also - coredumps from an io_uring
> > program
> > to a pipe are truncated.  But I am using kernel 5.10.57, which does
> > NOT
> > have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> > or
> > commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
> > Kernel 5.4 works though, so I bisected the problem to commit
> > f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
> > properly") in kernel 5.9.  Note that my io_uring program uses only
> > async
> > buffered reads, which may be why this particular commit makes a
> > difference to my program.
> > 
> > My io_uring program is a multi-purpose long-running program with many
> > threads.  Most threads don't use io_uring but a few of them do. 
> > Normally, my core dumps are piped to a program so that they can be
> > compressed before being written to disk, but I can also test writing
> > the
> > core dumps directly to disk.  This is what I have found:
> > 
> > *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
> > a
> > coredump, the core file is written correctly, whether it is written
> > to
> > disk or piped to a program, even if another thread is using io_uring
> > at
> > the same time.
> > 
> > *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
> > coredump, the core file is truncated, whether written directly to
> > disk
> > or piped to a program.
> > 
> > *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> > triggers a coredump, and the core is written directly to disk, then
> > it
> > is written correctly.
> > 
> > *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
> > triggers a coredump, and the core is piped to a program, then it is
> > truncated.
> > 
> > *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> > whether written directly to disk or piped to a program.
> 
> That is very interesting. Like Olivier mentioned, it's not that actual
> commit, but rather the change of behavior implemented by it. Before
> that
> commit, we'd hit the async workers more often, whereas after we do the
> correct retry method where it's driven by the wakeup when the page is
> unlocked. This is purely speculation, but perhaps the fact that the
> process changes state potentially mid dump is why the dump ends up
> being
> truncated?
> 
> I'd love to dive into this and try and figure it out. Absent a test
> case, at least the above gives me an idea of what to try out. I'll see
> if it makes it easier for me to create a case that does result in a
> truncated core dump.
> 
Jens,

When I have first encountered the issue, the very first thing that I
did try was to create a simple test program that would synthetize the
problem.

After few time consumming failed attempts, I just gave up the idea and
simply settle to my prod program that showcase systematically the
problem every time that I kill the process with a SEGV signal.

In a nutshell, all the program does is to issue read operations with
io_uring on a TCP socket on which there is a constant data stream.

Now that I have a better understanding of what is going on, I think
that one way that could reproduce the problem consistently could be
along those lines:

1. Create a pipe
2. fork a child
3. Initiate a read operation on the pipe with io_uring from the child
4. Let the parent kill its child with a core dump generating signal.
5. Write something in the pipe from the parent so that the io_uring
read operation completes while the core dump is generated.

I guess that I'll end up doing that if I cannot fix the issue with my
current setup but here is what I have attempted so far:

1. Call io_uring_files_cancel from do_coredump
2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
returning from io_uring_files_cancel

Those attempts didn't work but lurking in the io_uring dev mailing list
is starting to pay off. I thought that I did reach the bottom of the
rabbit hole in my journey of understanding io_uring but the recent
patch set sent by Hao Xu

https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t

made me realize that I still haven't assimilated all the small io_uring
nuances...

Here is my feedback. From my casual io_uring code reader point of view,
it is not 100% obvious what the difference is between
io_uring_files_cancel and io_uring_task_cancel

It seems like io_uring_files_cancel is cancelling polls only if they
have the REQ_F_INFLIGHT flag set.

I have no idea what an inflight request means and why someone would
want to call io_uring_files_cancel over io_uring_task_cancel.

I guess that if I was to meditate on the question for few hours, I
would at some point get some illumination strike me but I believe that
it could be a good idea to document in the code those concepts for
helping casual readers...

Bottomline, I now understand that io_uring_files_cancel does not cancel
all the requests. Therefore, without fully understanding what I am
doing, I am going to replace my call to io_uring_files_cancel from
do_coredump with io_uring_task_cancel and see if this finally fix the
issue for good.

What I am trying to do is to cancel pending io_uring requests to make
sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.

Maybe another solution would simply be to modify __dump_emit to make it
resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
suggested.

or maybe do both...

Not sure which approach is best. If someone has an opinion, I would be
curious to hear it.

Greetings,



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-15 20:42                                   ` Olivier Langlois
@ 2021-08-16 13:02                                     ` Pavel Begunkov
  2021-08-16 13:06                                       ` Pavel Begunkov
  2021-08-17 18:15                                     ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2021-08-16 13:02 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, Tony Battersby, Eric W. Biederman,
	Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro

On 8/15/21 9:42 PM, Olivier Langlois wrote:
[...]
> When I have first encountered the issue, the very first thing that I
> did try was to create a simple test program that would synthetize the
> problem.
> 
> After few time consumming failed attempts, I just gave up the idea and
> simply settle to my prod program that showcase systematically the
> problem every time that I kill the process with a SEGV signal.
> 
> In a nutshell, all the program does is to issue read operations with
> io_uring on a TCP socket on which there is a constant data stream.
> 
> Now that I have a better understanding of what is going on, I think
> that one way that could reproduce the problem consistently could be
> along those lines:
> 
> 1. Create a pipe
> 2. fork a child
> 3. Initiate a read operation on the pipe with io_uring from the child
> 4. Let the parent kill its child with a core dump generating signal.
> 5. Write something in the pipe from the parent so that the io_uring
> read operation completes while the core dump is generated.
> 
> I guess that I'll end up doing that if I cannot fix the issue with my
> current setup but here is what I have attempted so far:
> 
> 1. Call io_uring_files_cancel from do_coredump
> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
> returning from io_uring_files_cancel
> 
> Those attempts didn't work but lurking in the io_uring dev mailing list
> is starting to pay off. I thought that I did reach the bottom of the
> rabbit hole in my journey of understanding io_uring but the recent
> patch set sent by Hao Xu
> 
> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
> 
> made me realize that I still haven't assimilated all the small io_uring
> nuances...
> 
> Here is my feedback. From my casual io_uring code reader point of view,
> it is not 100% obvious what the difference is between
> io_uring_files_cancel and io_uring_task_cancel

As you mentioned, io_uring_task_cancel() cancels and waits for all
requests submitted by current task, used in exec() and SQPOLL because
of potential races.

io_uring_task_cancel() cancels only selected ones and


io_uring_files_cancel()
cancels and waits only some specific requests that we absolutely have
to, e.g. in 5.15 it'll be only requests referencing the ring itself.
It's used on normal task exit.

io_uring_task_cancel() cancels and waits all requests submitted by
current task, used on exec() because of races.



As you mentioned 

> 
> It seems like io_uring_files_cancel is cancelling polls only if they
> have the REQ_F_INFLIGHT flag set.
> 
> I have no idea what an inflight request means and why someone would
> want to call io_uring_files_cancel over io_uring_task_cancel.
> 
> I guess that if I was to meditate on the question for few hours, I
> would at some point get some illumination strike me but I believe that
> it could be a good idea to document in the code those concepts for
> helping casual readers...
> 
> Bottomline, I now understand that io_uring_files_cancel does not cancel
> all the requests. Therefore, without fully understanding what I am
> doing, I am going to replace my call to io_uring_files_cancel from
> do_coredump with io_uring_task_cancel and see if this finally fix the
> issue for good.
> 
> What I am trying to do is to cancel pending io_uring requests to make
> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
> 
> Maybe another solution would simply be to modify __dump_emit to make it
> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
> suggested.
> 
> or maybe do both...
> 
> Not sure which approach is best. If someone has an opinion, I would be
> curious to hear it.
> 
> Greetings,
> 
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-16 13:02                                     ` Pavel Begunkov
@ 2021-08-16 13:06                                       ` Pavel Begunkov
  0 siblings, 0 replies; 66+ messages in thread
From: Pavel Begunkov @ 2021-08-16 13:06 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, Tony Battersby, Eric W. Biederman,
	Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro

On 8/16/21 2:02 PM, Pavel Begunkov wrote:
> On 8/15/21 9:42 PM, Olivier Langlois wrote:
> [...]
>> When I have first encountered the issue, the very first thing that I
>> did try was to create a simple test program that would synthetize the
>> problem.
>>
>> After few time consumming failed attempts, I just gave up the idea and
>> simply settle to my prod program that showcase systematically the
>> problem every time that I kill the process with a SEGV signal.
>>
>> In a nutshell, all the program does is to issue read operations with
>> io_uring on a TCP socket on which there is a constant data stream.
>>
>> Now that I have a better understanding of what is going on, I think
>> that one way that could reproduce the problem consistently could be
>> along those lines:
>>
>> 1. Create a pipe
>> 2. fork a child
>> 3. Initiate a read operation on the pipe with io_uring from the child
>> 4. Let the parent kill its child with a core dump generating signal.
>> 5. Write something in the pipe from the parent so that the io_uring
>> read operation completes while the core dump is generated.
>>
>> I guess that I'll end up doing that if I cannot fix the issue with my
>> current setup but here is what I have attempted so far:
>>
>> 1. Call io_uring_files_cancel from do_coredump
>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>> returning from io_uring_files_cancel
>>
>> Those attempts didn't work but lurking in the io_uring dev mailing list
>> is starting to pay off. I thought that I did reach the bottom of the
>> rabbit hole in my journey of understanding io_uring but the recent
>> patch set sent by Hao Xu
>>
>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>
>> made me realize that I still haven't assimilated all the small io_uring
>> nuances...
>>
>> Here is my feedback. From my casual io_uring code reader point of view,
>> it is not 100% obvious what the difference is between
>> io_uring_files_cancel and io_uring_task_cancel
> 
> As you mentioned, io_uring_task_cancel() cancels and waits for all
> requests submitted by current task, used in exec() and SQPOLL because
> of potential races.

Apologies for this draft rumbling...

As you mentioned, io_uring_task_cancel() cancels and waits for all
requests submitted by current task, used in exec() and SQPOLL because
of potential races.

io_uring_task_cancel() cancels only selected ones, e.g. in 5.15
will be only requests operating on io_uring, and used during normal
exit.

Agree that the names may be not too descriptive.

>>
>> It seems like io_uring_files_cancel is cancelling polls only if they
>> have the REQ_F_INFLIGHT flag set.
>>
>> I have no idea what an inflight request means and why someone would
>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>
>> I guess that if I was to meditate on the question for few hours, I
>> would at some point get some illumination strike me but I believe that
>> it could be a good idea to document in the code those concepts for
>> helping casual readers...
>>
>> Bottomline, I now understand that io_uring_files_cancel does not cancel
>> all the requests. Therefore, without fully understanding what I am
>> doing, I am going to replace my call to io_uring_files_cancel from
>> do_coredump with io_uring_task_cancel and see if this finally fix the
>> issue for good.
>>
>> What I am trying to do is to cancel pending io_uring requests to make
>> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
>>
>> Maybe another solution would simply be to modify __dump_emit to make it
>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>> suggested.
>>
>> or maybe do both...
>>
>> Not sure which approach is best. If someone has an opinion, I would be
>> curious to hear it.
>>
>> Greetings,
>>
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-15 20:42                                   ` Olivier Langlois
  2021-08-16 13:02                                     ` Pavel Begunkov
@ 2021-08-17 18:15                                     ` Jens Axboe
  2021-08-17 18:24                                       ` Jens Axboe
  2021-08-21  9:48                                       ` Olivier Langlois
  1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2021-08-17 18:15 UTC (permalink / raw)
  To: Olivier Langlois, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/15/21 2:42 PM, Olivier Langlois wrote:
> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>> about it on and off.
>>>>
>>>> I did try the following on 5.12.19:
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -41,6 +41,7 @@
>>>>  #include <linux/fs.h>
>>>>  #include <linux/path.h>
>>>>  #include <linux/timekeeping.h>
>>>> +#include <linux/io_uring.h>
>>>>  
>>>>  #include <linux/uaccess.h>
>>>>  #include <asm/mmu_context.h>
>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>> *siginfo)
>>>>                 need_suid_safe = true;
>>>>         }
>>>>  
>>>> +       io_uring_files_cancel(current->files);
>>>> +
>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>         if (retval < 0)
>>>>                 goto fail_creds;
>>>> --
>>>> 2.32.0
>>>>
>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>
>>>> I must report that in my testing with generating a core dump
>>>> through a
>>>> pipe with the modif above, I still get truncated core dumps.
>>>>
>>>> systemd is having a weird error:
>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>> process
>>>>
>>>> and nothing is captured
>>>>
>>>> so I have replaced it with a very simple shell:
>>>> $ cat /proc/sys/kernel/core_pattern 
>>>>> /home/lano1106/bin/pipe_core.sh %e %p
>>>>
>>>> ~/bin $ cat pipe_core.sh 
>>>> #!/bin/sh
>>>>
>>>> cat > /home/lano1106/core/core.$1.$2
>>>>
>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>> expected core file size >= 24129536, found: 61440
>>>>
>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>> 100%
>>>> cleaning everything that it should clean.
>>>>
>>>>
>>>>
>>> I just ran into this problem also - coredumps from an io_uring
>>> program
>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does
>>> NOT
>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>> or
>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>> Kernel 5.4 works though, so I bisected the problem to commit
>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>> properly") in kernel 5.9.  Note that my io_uring program uses only
>>> async
>>> buffered reads, which may be why this particular commit makes a
>>> difference to my program.
>>>
>>> My io_uring program is a multi-purpose long-running program with many
>>> threads.  Most threads don't use io_uring but a few of them do. 
>>> Normally, my core dumps are piped to a program so that they can be
>>> compressed before being written to disk, but I can also test writing
>>> the
>>> core dumps directly to disk.  This is what I have found:
>>>
>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
>>> a
>>> coredump, the core file is written correctly, whether it is written
>>> to
>>> disk or piped to a program, even if another thread is using io_uring
>>> at
>>> the same time.
>>>
>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>> coredump, the core file is truncated, whether written directly to
>>> disk
>>> or piped to a program.
>>>
>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>> triggers a coredump, and the core is written directly to disk, then
>>> it
>>> is written correctly.
>>>
>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>> triggers a coredump, and the core is piped to a program, then it is
>>> truncated.
>>>
>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>> whether written directly to disk or piped to a program.
>>
>> That is very interesting. Like Olivier mentioned, it's not that actual
>> commit, but rather the change of behavior implemented by it. Before
>> that
>> commit, we'd hit the async workers more often, whereas after we do the
>> correct retry method where it's driven by the wakeup when the page is
>> unlocked. This is purely speculation, but perhaps the fact that the
>> process changes state potentially mid dump is why the dump ends up
>> being
>> truncated?
>>
>> I'd love to dive into this and try and figure it out. Absent a test
>> case, at least the above gives me an idea of what to try out. I'll see
>> if it makes it easier for me to create a case that does result in a
>> truncated core dump.
>>
> Jens,
> 
> When I have first encountered the issue, the very first thing that I
> did try was to create a simple test program that would synthetize the
> problem.
> 
> After few time consumming failed attempts, I just gave up the idea and
> simply settle to my prod program that showcase systematically the
> problem every time that I kill the process with a SEGV signal.
> 
> In a nutshell, all the program does is to issue read operations with
> io_uring on a TCP socket on which there is a constant data stream.
> 
> Now that I have a better understanding of what is going on, I think
> that one way that could reproduce the problem consistently could be
> along those lines:
> 
> 1. Create a pipe
> 2. fork a child
> 3. Initiate a read operation on the pipe with io_uring from the child
> 4. Let the parent kill its child with a core dump generating signal.
> 5. Write something in the pipe from the parent so that the io_uring
> read operation completes while the core dump is generated.
> 
> I guess that I'll end up doing that if I cannot fix the issue with my
> current setup but here is what I have attempted so far:
> 
> 1. Call io_uring_files_cancel from do_coredump
> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
> returning from io_uring_files_cancel
> 
> Those attempts didn't work but lurking in the io_uring dev mailing list
> is starting to pay off. I thought that I did reach the bottom of the
> rabbit hole in my journey of understanding io_uring but the recent
> patch set sent by Hao Xu
> 
> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
> 
> made me realize that I still haven't assimilated all the small io_uring
> nuances...
> 
> Here is my feedback. From my casual io_uring code reader point of view,
> it is not 100% obvious what the difference is between
> io_uring_files_cancel and io_uring_task_cancel
> 
> It seems like io_uring_files_cancel is cancelling polls only if they
> have the REQ_F_INFLIGHT flag set.
> 
> I have no idea what an inflight request means and why someone would
> want to call io_uring_files_cancel over io_uring_task_cancel.
> 
> I guess that if I was to meditate on the question for few hours, I
> would at some point get some illumination strike me but I believe that
> it could be a good idea to document in the code those concepts for
> helping casual readers...
> 
> Bottomline, I now understand that io_uring_files_cancel does not cancel
> all the requests. Therefore, without fully understanding what I am
> doing, I am going to replace my call to io_uring_files_cancel from
> do_coredump with io_uring_task_cancel and see if this finally fix the
> issue for good.
> 
> What I am trying to do is to cancel pending io_uring requests to make
> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
> 
> Maybe another solution would simply be to modify __dump_emit to make it
> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
> suggested.
> 
> or maybe do both...
> 
> Not sure which approach is best. If someone has an opinion, I would be
> curious to hear it.

It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
signal_pending() and cause an interruption of the core dump. Just out of
curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
set to some piped process, can you try and set it to 'core' and see if
that eliminates the truncation of the core dumps for your case?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 18:15                                     ` Jens Axboe
@ 2021-08-17 18:24                                       ` Jens Axboe
  2021-08-17 19:29                                         ` Tony Battersby
  2021-08-21  9:52                                         ` Olivier Langlois
  2021-08-21  9:48                                       ` Olivier Langlois
  1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2021-08-17 18:24 UTC (permalink / raw)
  To: Olivier Langlois, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 12:15 PM, Jens Axboe wrote:
> On 8/15/21 2:42 PM, Olivier Langlois wrote:
>> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>>> about it on and off.
>>>>>
>>>>> I did try the following on 5.12.19:
>>>>>
>>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>>> --- a/fs/coredump.c
>>>>> +++ b/fs/coredump.c
>>>>> @@ -41,6 +41,7 @@
>>>>>  #include <linux/fs.h>
>>>>>  #include <linux/path.h>
>>>>>  #include <linux/timekeeping.h>
>>>>> +#include <linux/io_uring.h>
>>>>>  
>>>>>  #include <linux/uaccess.h>
>>>>>  #include <asm/mmu_context.h>
>>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>>> *siginfo)
>>>>>                 need_suid_safe = true;
>>>>>         }
>>>>>  
>>>>> +       io_uring_files_cancel(current->files);
>>>>> +
>>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>>         if (retval < 0)
>>>>>                 goto fail_creds;
>>>>> --
>>>>> 2.32.0
>>>>>
>>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>>
>>>>> I must report that in my testing with generating a core dump
>>>>> through a
>>>>> pipe with the modif above, I still get truncated core dumps.
>>>>>
>>>>> systemd is having a weird error:
>>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>>> process
>>>>>
>>>>> and nothing is captured
>>>>>
>>>>> so I have replaced it with a very simple shell:
>>>>> $ cat /proc/sys/kernel/core_pattern 
>>>>>> /home/lano1106/bin/pipe_core.sh %e %p
>>>>>
>>>>> ~/bin $ cat pipe_core.sh 
>>>>> #!/bin/sh
>>>>>
>>>>> cat > /home/lano1106/core/core.$1.$2
>>>>>
>>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>>> expected core file size >= 24129536, found: 61440
>>>>>
>>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>>> 100%
>>>>> cleaning everything that it should clean.
>>>>>
>>>>>
>>>>>
>>>> I just ran into this problem also - coredumps from an io_uring
>>>> program
>>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does
>>>> NOT
>>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>>> or
>>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>>> Kernel 5.4 works though, so I bisected the problem to commit
>>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>>> properly") in kernel 5.9.  Note that my io_uring program uses only
>>>> async
>>>> buffered reads, which may be why this particular commit makes a
>>>> difference to my program.
>>>>
>>>> My io_uring program is a multi-purpose long-running program with many
>>>> threads.  Most threads don't use io_uring but a few of them do. 
>>>> Normally, my core dumps are piped to a program so that they can be
>>>> compressed before being written to disk, but I can also test writing
>>>> the
>>>> core dumps directly to disk.  This is what I have found:
>>>>
>>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
>>>> a
>>>> coredump, the core file is written correctly, whether it is written
>>>> to
>>>> disk or piped to a program, even if another thread is using io_uring
>>>> at
>>>> the same time.
>>>>
>>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>>> coredump, the core file is truncated, whether written directly to
>>>> disk
>>>> or piped to a program.
>>>>
>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>> triggers a coredump, and the core is written directly to disk, then
>>>> it
>>>> is written correctly.
>>>>
>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>> triggers a coredump, and the core is piped to a program, then it is
>>>> truncated.
>>>>
>>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>>> whether written directly to disk or piped to a program.
>>>
>>> That is very interesting. Like Olivier mentioned, it's not that actual
>>> commit, but rather the change of behavior implemented by it. Before
>>> that
>>> commit, we'd hit the async workers more often, whereas after we do the
>>> correct retry method where it's driven by the wakeup when the page is
>>> unlocked. This is purely speculation, but perhaps the fact that the
>>> process changes state potentially mid dump is why the dump ends up
>>> being
>>> truncated?
>>>
>>> I'd love to dive into this and try and figure it out. Absent a test
>>> case, at least the above gives me an idea of what to try out. I'll see
>>> if it makes it easier for me to create a case that does result in a
>>> truncated core dump.
>>>
>> Jens,
>>
>> When I have first encountered the issue, the very first thing that I
>> did try was to create a simple test program that would synthetize the
>> problem.
>>
>> After few time consumming failed attempts, I just gave up the idea and
>> simply settle to my prod program that showcase systematically the
>> problem every time that I kill the process with a SEGV signal.
>>
>> In a nutshell, all the program does is to issue read operations with
>> io_uring on a TCP socket on which there is a constant data stream.
>>
>> Now that I have a better understanding of what is going on, I think
>> that one way that could reproduce the problem consistently could be
>> along those lines:
>>
>> 1. Create a pipe
>> 2. fork a child
>> 3. Initiate a read operation on the pipe with io_uring from the child
>> 4. Let the parent kill its child with a core dump generating signal.
>> 5. Write something in the pipe from the parent so that the io_uring
>> read operation completes while the core dump is generated.
>>
>> I guess that I'll end up doing that if I cannot fix the issue with my
>> current setup but here is what I have attempted so far:
>>
>> 1. Call io_uring_files_cancel from do_coredump
>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>> returning from io_uring_files_cancel
>>
>> Those attempts didn't work but lurking in the io_uring dev mailing list
>> is starting to pay off. I thought that I did reach the bottom of the
>> rabbit hole in my journey of understanding io_uring but the recent
>> patch set sent by Hao Xu
>>
>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>
>> made me realize that I still haven't assimilated all the small io_uring
>> nuances...
>>
>> Here is my feedback. From my casual io_uring code reader point of view,
>> it is not 100% obvious what the difference is between
>> io_uring_files_cancel and io_uring_task_cancel
>>
>> It seems like io_uring_files_cancel is cancelling polls only if they
>> have the REQ_F_INFLIGHT flag set.
>>
>> I have no idea what an inflight request means and why someone would
>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>
>> I guess that if I was to meditate on the question for few hours, I
>> would at some point get some illumination strike me but I believe that
>> it could be a good idea to document in the code those concepts for
>> helping casual readers...
>>
>> Bottomline, I now understand that io_uring_files_cancel does not cancel
>> all the requests. Therefore, without fully understanding what I am
>> doing, I am going to replace my call to io_uring_files_cancel from
>> do_coredump with io_uring_task_cancel and see if this finally fix the
>> issue for good.
>>
>> What I am trying to do is to cancel pending io_uring requests to make
>> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
>>
>> Maybe another solution would simply be to modify __dump_emit to make it
>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>> suggested.
>>
>> or maybe do both...
>>
>> Not sure which approach is best. If someone has an opinion, I would be
>> curious to hear it.
> 
> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
> signal_pending() and cause an interruption of the core dump. Just out of
> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
> set to some piped process, can you try and set it to 'core' and see if
> that eliminates the truncation of the core dumps for your case?

And assuming that works, then I suspect this one would fix your issue
even with a piped core dump:

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..852737a9ccbf 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	};
 
 	audit_core_dumps(siginfo->si_signo);
+	io_uring_task_cancel();
 
 	binfmt = mm->binfmt;
 	if (!binfmt || !binfmt->core_dump)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 18:24                                       ` Jens Axboe
@ 2021-08-17 19:29                                         ` Tony Battersby
  2021-08-17 19:59                                           ` Jens Axboe
  2021-08-21  9:52                                         ` Olivier Langlois
  1 sibling, 1 reply; 66+ messages in thread
From: Tony Battersby @ 2021-08-17 19:29 UTC (permalink / raw)
  To: Jens Axboe, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 2:24 PM, Jens Axboe wrote:
> On 8/17/21 12:15 PM, Jens Axboe wrote:
>> On 8/15/21 2:42 PM, Olivier Langlois wrote:
>>> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>>>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>>>> about it on and off.
>>>>>>
>>>>>> I did try the following on 5.12.19:
>>>>>>
>>>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>>>> --- a/fs/coredump.c
>>>>>> +++ b/fs/coredump.c
>>>>>> @@ -41,6 +41,7 @@
>>>>>>  #include <linux/fs.h>
>>>>>>  #include <linux/path.h>
>>>>>>  #include <linux/timekeeping.h>
>>>>>> +#include <linux/io_uring.h>
>>>>>>  
>>>>>>  #include <linux/uaccess.h>
>>>>>>  #include <asm/mmu_context.h>
>>>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>>>> *siginfo)
>>>>>>                 need_suid_safe = true;
>>>>>>         }
>>>>>>  
>>>>>> +       io_uring_files_cancel(current->files);
>>>>>> +
>>>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>>>         if (retval < 0)
>>>>>>                 goto fail_creds;
>>>>>> --
>>>>>> 2.32.0
>>>>>>
>>>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>>>
>>>>>> I must report that in my testing with generating a core dump
>>>>>> through a
>>>>>> pipe with the modif above, I still get truncated core dumps.
>>>>>>
>>>>>> systemd is having a weird error:
>>>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>>>> process
>>>>>>
>>>>>> and nothing is captured
>>>>>>
>>>>>> so I have replaced it with a very simple shell:
>>>>>> $ cat /proc/sys/kernel/core_pattern 
>>>>>>> /home/lano1106/bin/pipe_core.sh %e %p
>>>>>> ~/bin $ cat pipe_core.sh 
>>>>>> #!/bin/sh
>>>>>>
>>>>>> cat > /home/lano1106/core/core.$1.$2
>>>>>>
>>>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>>>> expected core file size >= 24129536, found: 61440
>>>>>>
>>>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>>>> 100%
>>>>>> cleaning everything that it should clean.
>>>>>>
>>>>>>
>>>>>>
>>>>> I just ran into this problem also - coredumps from an io_uring
>>>>> program
>>>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does
>>>>> NOT
>>>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>>>> or
>>>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>>>> Kernel 5.4 works though, so I bisected the problem to commit
>>>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>>>> properly") in kernel 5.9.  Note that my io_uring program uses only
>>>>> async
>>>>> buffered reads, which may be why this particular commit makes a
>>>>> difference to my program.
>>>>>
>>>>> My io_uring program is a multi-purpose long-running program with many
>>>>> threads.  Most threads don't use io_uring but a few of them do. 
>>>>> Normally, my core dumps are piped to a program so that they can be
>>>>> compressed before being written to disk, but I can also test writing
>>>>> the
>>>>> core dumps directly to disk.  This is what I have found:
>>>>>
>>>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
>>>>> a
>>>>> coredump, the core file is written correctly, whether it is written
>>>>> to
>>>>> disk or piped to a program, even if another thread is using io_uring
>>>>> at
>>>>> the same time.
>>>>>
>>>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>>>> coredump, the core file is truncated, whether written directly to
>>>>> disk
>>>>> or piped to a program.
>>>>>
>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>> triggers a coredump, and the core is written directly to disk, then
>>>>> it
>>>>> is written correctly.
>>>>>
>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>> triggers a coredump, and the core is piped to a program, then it is
>>>>> truncated.
>>>>>
>>>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>>>> whether written directly to disk or piped to a program.
>>>> That is very interesting. Like Olivier mentioned, it's not that actual
>>>> commit, but rather the change of behavior implemented by it. Before
>>>> that
>>>> commit, we'd hit the async workers more often, whereas after we do the
>>>> correct retry method where it's driven by the wakeup when the page is
>>>> unlocked. This is purely speculation, but perhaps the fact that the
>>>> process changes state potentially mid dump is why the dump ends up
>>>> being
>>>> truncated?
>>>>
>>>> I'd love to dive into this and try and figure it out. Absent a test
>>>> case, at least the above gives me an idea of what to try out. I'll see
>>>> if it makes it easier for me to create a case that does result in a
>>>> truncated core dump.
>>>>
>>> Jens,
>>>
>>> When I have first encountered the issue, the very first thing that I
>>> did try was to create a simple test program that would synthetize the
>>> problem.
>>>
>>> After few time consumming failed attempts, I just gave up the idea and
>>> simply settle to my prod program that showcase systematically the
>>> problem every time that I kill the process with a SEGV signal.
>>>
>>> In a nutshell, all the program does is to issue read operations with
>>> io_uring on a TCP socket on which there is a constant data stream.
>>>
>>> Now that I have a better understanding of what is going on, I think
>>> that one way that could reproduce the problem consistently could be
>>> along those lines:
>>>
>>> 1. Create a pipe
>>> 2. fork a child
>>> 3. Initiate a read operation on the pipe with io_uring from the child
>>> 4. Let the parent kill its child with a core dump generating signal.
>>> 5. Write something in the pipe from the parent so that the io_uring
>>> read operation completes while the core dump is generated.
>>>
>>> I guess that I'll end up doing that if I cannot fix the issue with my
>>> current setup but here is what I have attempted so far:
>>>
>>> 1. Call io_uring_files_cancel from do_coredump
>>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>>> returning from io_uring_files_cancel
>>>
>>> Those attempts didn't work but lurking in the io_uring dev mailing list
>>> is starting to pay off. I thought that I did reach the bottom of the
>>> rabbit hole in my journey of understanding io_uring but the recent
>>> patch set sent by Hao Xu
>>>
>>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>>
>>> made me realize that I still haven't assimilated all the small io_uring
>>> nuances...
>>>
>>> Here is my feedback. From my casual io_uring code reader point of view,
>>> it is not 100% obvious what the difference is between
>>> io_uring_files_cancel and io_uring_task_cancel
>>>
>>> It seems like io_uring_files_cancel is cancelling polls only if they
>>> have the REQ_F_INFLIGHT flag set.
>>>
>>> I have no idea what an inflight request means and why someone would
>>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>>
>>> I guess that if I was to meditate on the question for few hours, I
>>> would at some point get some illumination strike me but I believe that
>>> it could be a good idea to document in the code those concepts for
>>> helping casual readers...
>>>
>>> Bottomline, I now understand that io_uring_files_cancel does not cancel
>>> all the requests. Therefore, without fully understanding what I am
>>> doing, I am going to replace my call to io_uring_files_cancel from
>>> do_coredump with io_uring_task_cancel and see if this finally fix the
>>> issue for good.
>>>
>>> What I am trying to do is to cancel pending io_uring requests to make
>>> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
>>>
>>> Maybe another solution would simply be to modify __dump_emit to make it
>>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>>> suggested.
>>>
>>> or maybe do both...
>>>
>>> Not sure which approach is best. If someone has an opinion, I would be
>>> curious to hear it.
>> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
>> signal_pending() and cause an interruption of the core dump. Just out of
>> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
>> set to some piped process, can you try and set it to 'core' and see if
>> that eliminates the truncation of the core dumps for your case?
> And assuming that works, then I suspect this one would fix your issue
> even with a piped core dump:
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..852737a9ccbf 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -41,6 +41,7 @@
>  #include <linux/fs.h>
>  #include <linux/path.h>
>  #include <linux/timekeeping.h>
> +#include <linux/io_uring.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  	};
>  
>  	audit_core_dumps(siginfo->si_signo);
> +	io_uring_task_cancel();
>  
>  	binfmt = mm->binfmt;
>  	if (!binfmt || !binfmt->core_dump)
>
FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
with my io_uring program.  The coredump locked up even when writing the
core file directly to disk; the zombie process could not be killed with
"kill -9".  Unfortunately I can't test with newer kernels without
spending some time on it, and I am too busy with other stuff right now.

My io_uring program does async buffered reads
(io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
(no filesystem).  One thread submits I/Os while another thread calls
io_uring_wait_cqe() and processes the completions.  To trigger the
coredump, I added an intentional abort() in the thread that submits I/Os
after running for a second.

Tony Battersby


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 19:29                                         ` Tony Battersby
@ 2021-08-17 19:59                                           ` Jens Axboe
  2021-08-17 21:28                                             ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2021-08-17 19:59 UTC (permalink / raw)
  To: Tony Battersby, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 1:29 PM, Tony Battersby wrote:
> On 8/17/21 2:24 PM, Jens Axboe wrote:
>> On 8/17/21 12:15 PM, Jens Axboe wrote:
>>> On 8/15/21 2:42 PM, Olivier Langlois wrote:
>>>> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>>>>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>>>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>>>>> about it on and off.
>>>>>>>
>>>>>>> I did try the following on 5.12.19:
>>>>>>>
>>>>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>>>>> --- a/fs/coredump.c
>>>>>>> +++ b/fs/coredump.c
>>>>>>> @@ -41,6 +41,7 @@
>>>>>>>  #include <linux/fs.h>
>>>>>>>  #include <linux/path.h>
>>>>>>>  #include <linux/timekeeping.h>
>>>>>>> +#include <linux/io_uring.h>
>>>>>>>  
>>>>>>>  #include <linux/uaccess.h>
>>>>>>>  #include <asm/mmu_context.h>
>>>>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>>>>> *siginfo)
>>>>>>>                 need_suid_safe = true;
>>>>>>>         }
>>>>>>>  
>>>>>>> +       io_uring_files_cancel(current->files);
>>>>>>> +
>>>>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>>>>         if (retval < 0)
>>>>>>>                 goto fail_creds;
>>>>>>> --
>>>>>>> 2.32.0
>>>>>>>
>>>>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>>>>
>>>>>>> I must report that in my testing with generating a core dump
>>>>>>> through a
>>>>>>> pipe with the modif above, I still get truncated core dumps.
>>>>>>>
>>>>>>> systemd is having a weird error:
>>>>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>>>>> process
>>>>>>>
>>>>>>> and nothing is captured
>>>>>>>
>>>>>>> so I have replaced it with a very simple shell:
>>>>>>> $ cat /proc/sys/kernel/core_pattern 
>>>>>>>> /home/lano1106/bin/pipe_core.sh %e %p
>>>>>>> ~/bin $ cat pipe_core.sh 
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> cat > /home/lano1106/core/core.$1.$2
>>>>>>>
>>>>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>>>>> expected core file size >= 24129536, found: 61440
>>>>>>>
>>>>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>>>>> 100%
>>>>>>> cleaning everything that it should clean.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I just ran into this problem also - coredumps from an io_uring
>>>>>> program
>>>>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does
>>>>>> NOT
>>>>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>>>>> or
>>>>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>>>>> Kernel 5.4 works though, so I bisected the problem to commit
>>>>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>>>>> properly") in kernel 5.9.  Note that my io_uring program uses only
>>>>>> async
>>>>>> buffered reads, which may be why this particular commit makes a
>>>>>> difference to my program.
>>>>>>
>>>>>> My io_uring program is a multi-purpose long-running program with many
>>>>>> threads.  Most threads don't use io_uring but a few of them do. 
>>>>>> Normally, my core dumps are piped to a program so that they can be
>>>>>> compressed before being written to disk, but I can also test writing
>>>>>> the
>>>>>> core dumps directly to disk.  This is what I have found:
>>>>>>
>>>>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
>>>>>> a
>>>>>> coredump, the core file is written correctly, whether it is written
>>>>>> to
>>>>>> disk or piped to a program, even if another thread is using io_uring
>>>>>> at
>>>>>> the same time.
>>>>>>
>>>>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>>>>> coredump, the core file is truncated, whether written directly to
>>>>>> disk
>>>>>> or piped to a program.
>>>>>>
>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>> triggers a coredump, and the core is written directly to disk, then
>>>>>> it
>>>>>> is written correctly.
>>>>>>
>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>> triggers a coredump, and the core is piped to a program, then it is
>>>>>> truncated.
>>>>>>
>>>>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>>>>> whether written directly to disk or piped to a program.
>>>>> That is very interesting. Like Olivier mentioned, it's not that actual
>>>>> commit, but rather the change of behavior implemented by it. Before
>>>>> that
>>>>> commit, we'd hit the async workers more often, whereas after we do the
>>>>> correct retry method where it's driven by the wakeup when the page is
>>>>> unlocked. This is purely speculation, but perhaps the fact that the
>>>>> process changes state potentially mid dump is why the dump ends up
>>>>> being
>>>>> truncated?
>>>>>
>>>>> I'd love to dive into this and try and figure it out. Absent a test
>>>>> case, at least the above gives me an idea of what to try out. I'll see
>>>>> if it makes it easier for me to create a case that does result in a
>>>>> truncated core dump.
>>>>>
>>>> Jens,
>>>>
>>>> When I have first encountered the issue, the very first thing that I
>>>> did try was to create a simple test program that would synthetize the
>>>> problem.
>>>>
>>>> After few time consumming failed attempts, I just gave up the idea and
>>>> simply settle to my prod program that showcase systematically the
>>>> problem every time that I kill the process with a SEGV signal.
>>>>
>>>> In a nutshell, all the program does is to issue read operations with
>>>> io_uring on a TCP socket on which there is a constant data stream.
>>>>
>>>> Now that I have a better understanding of what is going on, I think
>>>> that one way that could reproduce the problem consistently could be
>>>> along those lines:
>>>>
>>>> 1. Create a pipe
>>>> 2. fork a child
>>>> 3. Initiate a read operation on the pipe with io_uring from the child
>>>> 4. Let the parent kill its child with a core dump generating signal.
>>>> 5. Write something in the pipe from the parent so that the io_uring
>>>> read operation completes while the core dump is generated.
>>>>
>>>> I guess that I'll end up doing that if I cannot fix the issue with my
>>>> current setup but here is what I have attempted so far:
>>>>
>>>> 1. Call io_uring_files_cancel from do_coredump
>>>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>>>> returning from io_uring_files_cancel
>>>>
>>>> Those attempts didn't work but lurking in the io_uring dev mailing list
>>>> is starting to pay off. I thought that I did reach the bottom of the
>>>> rabbit hole in my journey of understanding io_uring but the recent
>>>> patch set sent by Hao Xu
>>>>
>>>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>>>
>>>> made me realize that I still haven't assimilated all the small io_uring
>>>> nuances...
>>>>
>>>> Here is my feedback. From my casual io_uring code reader point of view,
>>>> it is not 100% obvious what the difference is between
>>>> io_uring_files_cancel and io_uring_task_cancel
>>>>
>>>> It seems like io_uring_files_cancel is cancelling polls only if they
>>>> have the REQ_F_INFLIGHT flag set.
>>>>
>>>> I have no idea what an inflight request means and why someone would
>>>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>>>
>>>> I guess that if I was to meditate on the question for few hours, I
>>>> would at some point get some illumination strike me but I believe that
>>>> it could be a good idea to document in the code those concepts for
>>>> helping casual readers...
>>>>
>>>> Bottomline, I now understand that io_uring_files_cancel does not cancel
>>>> all the requests. Therefore, without fully understanding what I am
>>>> doing, I am going to replace my call to io_uring_files_cancel from
>>>> do_coredump with io_uring_task_cancel and see if this finally fix the
>>>> issue for good.
>>>>
>>>> What I am trying to do is to cancel pending io_uring requests to make
>>>> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
>>>>
>>>> Maybe another solution would simply be to modify __dump_emit to make it
>>>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>>>> suggested.
>>>>
>>>> or maybe do both...
>>>>
>>>> Not sure which approach is best. If someone has an opinion, I would be
>>>> curious to hear it.
>>> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
>>> signal_pending() and cause an interruption of the core dump. Just out of
>>> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
>>> set to some piped process, can you try and set it to 'core' and see if
>>> that eliminates the truncation of the core dumps for your case?
>> And assuming that works, then I suspect this one would fix your issue
>> even with a piped core dump:
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 07afb5ddb1c4..852737a9ccbf 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -41,6 +41,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/path.h>
>>  #include <linux/timekeeping.h>
>> +#include <linux/io_uring.h>
>>  
>>  #include <linux/uaccess.h>
>>  #include <asm/mmu_context.h>
>> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  	};
>>  
>>  	audit_core_dumps(siginfo->si_signo);
>> +	io_uring_task_cancel();
>>  
>>  	binfmt = mm->binfmt;
>>  	if (!binfmt || !binfmt->core_dump)
>>
> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
> with my io_uring program.  The coredump locked up even when writing the
> core file directly to disk; the zombie process could not be killed with
> "kill -9".  Unfortunately I can't test with newer kernels without
> spending some time on it, and I am too busy with other stuff right now.

That sounds like 5.10-stable is missing some of the cancelation
backports, and your setup makes the cancelation stall because of that.
Need to go over the 11/12/13 fixes and ensure that we've got everything
we need for those stable versions, particularly 5.10.

> My io_uring program does async buffered reads
> (io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
> (no filesystem).  One thread submits I/Os while another thread calls
> io_uring_wait_cqe() and processes the completions.  To trigger the
> coredump, I added an intentional abort() in the thread that submits I/Os
> after running for a second.

OK, so that one is also using task_work for the retry based async
buffered reads, so it makes sense.

Maybe a temporary work-around is to use 06af8679449d and eliminate the
pipe based coredump?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 19:59                                           ` Jens Axboe
@ 2021-08-17 21:28                                             ` Jens Axboe
  2021-08-17 21:39                                               ` Tony Battersby
  2021-08-18  2:57                                               ` Jens Axboe
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2021-08-17 21:28 UTC (permalink / raw)
  To: Tony Battersby, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 1:59 PM, Jens Axboe wrote:
> On 8/17/21 1:29 PM, Tony Battersby wrote:
>> On 8/17/21 2:24 PM, Jens Axboe wrote:
>>> On 8/17/21 12:15 PM, Jens Axboe wrote:
>>>> On 8/15/21 2:42 PM, Olivier Langlois wrote:
>>>>> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>>>>>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>>>>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>>>>>> about it on and off.
>>>>>>>>
>>>>>>>> I did try the following on 5.12.19:
>>>>>>>>
>>>>>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>>>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>>>>>> --- a/fs/coredump.c
>>>>>>>> +++ b/fs/coredump.c
>>>>>>>> @@ -41,6 +41,7 @@
>>>>>>>>  #include <linux/fs.h>
>>>>>>>>  #include <linux/path.h>
>>>>>>>>  #include <linux/timekeeping.h>
>>>>>>>> +#include <linux/io_uring.h>
>>>>>>>>  
>>>>>>>>  #include <linux/uaccess.h>
>>>>>>>>  #include <asm/mmu_context.h>
>>>>>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>>>>>> *siginfo)
>>>>>>>>                 need_suid_safe = true;
>>>>>>>>         }
>>>>>>>>  
>>>>>>>> +       io_uring_files_cancel(current->files);
>>>>>>>> +
>>>>>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>>>>>         if (retval < 0)
>>>>>>>>                 goto fail_creds;
>>>>>>>> --
>>>>>>>> 2.32.0
>>>>>>>>
>>>>>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>>>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>>>>>
>>>>>>>> I must report that in my testing with generating a core dump
>>>>>>>> through a
>>>>>>>> pipe with the modif above, I still get truncated core dumps.
>>>>>>>>
>>>>>>>> systemd is having a weird error:
>>>>>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>>>>>> process
>>>>>>>>
>>>>>>>> and nothing is captured
>>>>>>>>
>>>>>>>> so I have replaced it with a very simple shell:
>>>>>>>> $ cat /proc/sys/kernel/core_pattern 
>>>>>>>>> /home/lano1106/bin/pipe_core.sh %e %p
>>>>>>>> ~/bin $ cat pipe_core.sh 
>>>>>>>> #!/bin/sh
>>>>>>>>
>>>>>>>> cat > /home/lano1106/core/core.$1.$2
>>>>>>>>
>>>>>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>>>>>> expected core file size >= 24129536, found: 61440
>>>>>>>>
>>>>>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>>>>>> 100%
>>>>>>>> cleaning everything that it should clean.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I just ran into this problem also - coredumps from an io_uring
>>>>>>> program
>>>>>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does
>>>>>>> NOT
>>>>>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>>>>>> or
>>>>>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>>>>>> Kernel 5.4 works though, so I bisected the problem to commit
>>>>>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>>>>>> properly") in kernel 5.9.  Note that my io_uring program uses only
>>>>>>> async
>>>>>>> buffered reads, which may be why this particular commit makes a
>>>>>>> difference to my program.
>>>>>>>
>>>>>>> My io_uring program is a multi-purpose long-running program with many
>>>>>>> threads.  Most threads don't use io_uring but a few of them do. 
>>>>>>> Normally, my core dumps are piped to a program so that they can be
>>>>>>> compressed before being written to disk, but I can also test writing
>>>>>>> the
>>>>>>> core dumps directly to disk.  This is what I have found:
>>>>>>>
>>>>>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
>>>>>>> a
>>>>>>> coredump, the core file is written correctly, whether it is written
>>>>>>> to
>>>>>>> disk or piped to a program, even if another thread is using io_uring
>>>>>>> at
>>>>>>> the same time.
>>>>>>>
>>>>>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>>>>>> coredump, the core file is truncated, whether written directly to
>>>>>>> disk
>>>>>>> or piped to a program.
>>>>>>>
>>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>>> triggers a coredump, and the core is written directly to disk, then
>>>>>>> it
>>>>>>> is written correctly.
>>>>>>>
>>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>>> triggers a coredump, and the core is piped to a program, then it is
>>>>>>> truncated.
>>>>>>>
>>>>>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>>>>>> whether written directly to disk or piped to a program.
>>>>>> That is very interesting. Like Olivier mentioned, it's not that actual
>>>>>> commit, but rather the change of behavior implemented by it. Before
>>>>>> that
>>>>>> commit, we'd hit the async workers more often, whereas after we do the
>>>>>> correct retry method where it's driven by the wakeup when the page is
>>>>>> unlocked. This is purely speculation, but perhaps the fact that the
>>>>>> process changes state potentially mid dump is why the dump ends up
>>>>>> being
>>>>>> truncated?
>>>>>>
>>>>>> I'd love to dive into this and try and figure it out. Absent a test
>>>>>> case, at least the above gives me an idea of what to try out. I'll see
>>>>>> if it makes it easier for me to create a case that does result in a
>>>>>> truncated core dump.
>>>>>>
>>>>> Jens,
>>>>>
>>>>> When I have first encountered the issue, the very first thing that I
>>>>> did try was to create a simple test program that would synthetize the
>>>>> problem.
>>>>>
>>>>> After few time consumming failed attempts, I just gave up the idea and
>>>>> simply settle to my prod program that showcase systematically the
>>>>> problem every time that I kill the process with a SEGV signal.
>>>>>
>>>>> In a nutshell, all the program does is to issue read operations with
>>>>> io_uring on a TCP socket on which there is a constant data stream.
>>>>>
>>>>> Now that I have a better understanding of what is going on, I think
>>>>> that one way that could reproduce the problem consistently could be
>>>>> along those lines:
>>>>>
>>>>> 1. Create a pipe
>>>>> 2. fork a child
>>>>> 3. Initiate a read operation on the pipe with io_uring from the child
>>>>> 4. Let the parent kill its child with a core dump generating signal.
>>>>> 5. Write something in the pipe from the parent so that the io_uring
>>>>> read operation completes while the core dump is generated.
>>>>>
>>>>> I guess that I'll end up doing that if I cannot fix the issue with my
>>>>> current setup but here is what I have attempted so far:
>>>>>
>>>>> 1. Call io_uring_files_cancel from do_coredump
>>>>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>>>>> returning from io_uring_files_cancel
>>>>>
>>>>> Those attempts didn't work but lurking in the io_uring dev mailing list
>>>>> is starting to pay off. I thought that I did reach the bottom of the
>>>>> rabbit hole in my journey of understanding io_uring but the recent
>>>>> patch set sent by Hao Xu
>>>>>
>>>>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>>>>
>>>>> made me realize that I still haven't assimilated all the small io_uring
>>>>> nuances...
>>>>>
>>>>> Here is my feedback. From my casual io_uring code reader point of view,
>>>>> it is not 100% obvious what the difference is between
>>>>> io_uring_files_cancel and io_uring_task_cancel
>>>>>
>>>>> It seems like io_uring_files_cancel is cancelling polls only if they
>>>>> have the REQ_F_INFLIGHT flag set.
>>>>>
>>>>> I have no idea what an inflight request means and why someone would
>>>>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>>>>
>>>>> I guess that if I was to meditate on the question for few hours, I
>>>>> would at some point get some illumination strike me but I believe that
>>>>> it could be a good idea to document in the code those concepts for
>>>>> helping casual readers...
>>>>>
>>>>> Bottomline, I now understand that io_uring_files_cancel does not cancel
>>>>> all the requests. Therefore, without fully understanding what I am
>>>>> doing, I am going to replace my call to io_uring_files_cancel from
>>>>> do_coredump with io_uring_task_cancel and see if this finally fix the
>>>>> issue for good.
>>>>>
>>>>> What I am trying to do is to cancel pending io_uring requests to make
>>>>> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
>>>>>
>>>>> Maybe another solution would simply be to modify __dump_emit to make it
>>>>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>>>>> suggested.
>>>>>
>>>>> or maybe do both...
>>>>>
>>>>> Not sure which approach is best. If someone has an opinion, I would be
>>>>> curious to hear it.
>>>> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
>>>> signal_pending() and cause an interruption of the core dump. Just out of
>>>> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
>>>> set to some piped process, can you try and set it to 'core' and see if
>>>> that eliminates the truncation of the core dumps for your case?
>>> And assuming that works, then I suspect this one would fix your issue
>>> even with a piped core dump:
>>>
>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>> index 07afb5ddb1c4..852737a9ccbf 100644
>>> --- a/fs/coredump.c
>>> +++ b/fs/coredump.c
>>> @@ -41,6 +41,7 @@
>>>  #include <linux/fs.h>
>>>  #include <linux/path.h>
>>>  #include <linux/timekeeping.h>
>>> +#include <linux/io_uring.h>
>>>  
>>>  #include <linux/uaccess.h>
>>>  #include <asm/mmu_context.h>
>>> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>  	};
>>>  
>>>  	audit_core_dumps(siginfo->si_signo);
>>> +	io_uring_task_cancel();
>>>  
>>>  	binfmt = mm->binfmt;
>>>  	if (!binfmt || !binfmt->core_dump)
>>>
>> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
>> with my io_uring program.  The coredump locked up even when writing the
>> core file directly to disk; the zombie process could not be killed with
>> "kill -9".  Unfortunately I can't test with newer kernels without
>> spending some time on it, and I am too busy with other stuff right now.
> 
> That sounds like 5.10-stable is missing some of the cancelation
> backports, and your setup makes the cancelation stall because of that.
> Need to go over the 11/12/13 fixes and ensure that we've got everything
> we need for those stable versions, particularly 5.10.
> 
>> My io_uring program does async buffered reads
>> (io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
>> (no filesystem).  One thread submits I/Os while another thread calls
>> io_uring_wait_cqe() and processes the completions.  To trigger the
>> coredump, I added an intentional abort() in the thread that submits I/Os
>> after running for a second.
> 
> OK, so that one is also using task_work for the retry based async
> buffered reads, so it makes sense.
> 
> Maybe a temporary work-around is to use 06af8679449d and eliminate the
> pipe based coredump?

Another approach - don't allow TWA_SIGNAL task_work to get queued if
PF_SIGNALED has been set on the task. This is similar to how we reject
task_work_add() on process exit, and the callers must be able to handle
that already.

Can you test this one on top of your 5.10-stable?


diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..ca7c1ee44ada 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		.mm_flags = mm->flags,
 	};
 
+	/*
+	 * task_work_add() will refuse to add work after PF_SIGNALED has
+	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
+	 * if any was queued before that.
+	 */
+	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+		tracehook_notify_signal();
+
 	audit_core_dumps(siginfo->si_signo);
 
 	binfmt = mm->binfmt;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..1ab28904adc4 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
+		/*
+		 * TIF_NOTIFY_SIGNAL notifications will interfere with
+		 * a core dump in progress, reject them.
+		 */
+		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 21:28                                             ` Jens Axboe
@ 2021-08-17 21:39                                               ` Tony Battersby
  2021-08-17 22:05                                                 ` Jens Axboe
  2021-08-18  2:57                                               ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Tony Battersby @ 2021-08-17 21:39 UTC (permalink / raw)
  To: Jens Axboe, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 5:28 PM, Jens Axboe wrote:
>
> Another approach - don't allow TWA_SIGNAL task_work to get queued if
> PF_SIGNALED has been set on the task. This is similar to how we reject
> task_work_add() on process exit, and the callers must be able to handle
> that already.
>
> Can you test this one on top of your 5.10-stable?
>
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..ca7c1ee44ada 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		.mm_flags = mm->flags,
>  	};
>  
> +	/*
> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
> +	 * if any was queued before that.
> +	 */
> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
> +		tracehook_notify_signal();
> +
>  	audit_core_dumps(siginfo->si_signo);
>  
>  	binfmt = mm->binfmt;
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 1698fbe6f0e1..1ab28904adc4 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  		head = READ_ONCE(task->task_works);
>  		if (unlikely(head == &work_exited))
>  			return -ESRCH;
> +		/*
> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
> +		 * a core dump in progress, reject them.
> +		 */
> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
> +			return -ESRCH;
>  		work->next = head;
>  	} while (cmpxchg(&task->task_works, head, work) != head);
>  
>
Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.

Tony Battersby


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 21:39                                               ` Tony Battersby
@ 2021-08-17 22:05                                                 ` Jens Axboe
  2021-08-18 14:37                                                   ` Tony Battersby
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2021-08-17 22:05 UTC (permalink / raw)
  To: Tony Battersby, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 3:39 PM, Tony Battersby wrote:
> On 8/17/21 5:28 PM, Jens Axboe wrote:
>>
>> Another approach - don't allow TWA_SIGNAL task_work to get queued if
>> PF_SIGNALED has been set on the task. This is similar to how we reject
>> task_work_add() on process exit, and the callers must be able to handle
>> that already.
>>
>> Can you test this one on top of your 5.10-stable?
>>
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 07afb5ddb1c4..ca7c1ee44ada 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  		.mm_flags = mm->flags,
>>  	};
>>  
>> +	/*
>> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
>> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
>> +	 * if any was queued before that.
>> +	 */
>> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>> +		tracehook_notify_signal();
>> +
>>  	audit_core_dumps(siginfo->si_signo);
>>  
>>  	binfmt = mm->binfmt;
>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>> index 1698fbe6f0e1..1ab28904adc4 100644
>> --- a/kernel/task_work.c
>> +++ b/kernel/task_work.c
>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>  		head = READ_ONCE(task->task_works);
>>  		if (unlikely(head == &work_exited))
>>  			return -ESRCH;
>> +		/*
>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>> +		 * a core dump in progress, reject them.
>> +		 */
>> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
>> +			return -ESRCH;
>>  		work->next = head;
>>  	} while (cmpxchg(&task->task_works, head, work) != head);
>>  
>>
> Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.

Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally
untested...

diff --git a/fs/coredump.c b/fs/coredump.c
index c6acfc694f65..9e899ce67589 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		.mm_flags = mm->flags,
 	};
 
+	/*
+	 * task_work_add() will refuse to add work after PF_SIGNALED has
+	 * been set, ensure that we flush any pending TWA_SIGNAL work
+	 * if any was queued before that.
+	 */
+	if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) {
+		task_work_run();
+		spin_lock_irq(&current->sighand->siglock);
+		current->jobctl &= ~JOBCTL_TASK_WORK;
+		recalc_sigpending();
+		spin_unlock_irq(&current->sighand->siglock);
+	}
+
 	audit_core_dumps(siginfo->si_signo);
 
 	binfmt = mm->binfmt;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8d6e1217c451..93b3f262eb4a 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
+		/*
+		 * TWA_SIGNAL notifications will interfere with
+		 * a core dump in progress, reject them.
+		 */
+		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 21:28                                             ` Jens Axboe
  2021-08-17 21:39                                               ` Tony Battersby
@ 2021-08-18  2:57                                               ` Jens Axboe
  2021-08-18  2:58                                                 ` Jens Axboe
  2021-08-21 10:08                                                 ` Olivier Langlois
  1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2021-08-18  2:57 UTC (permalink / raw)
  To: Tony Battersby, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 3:28 PM, Jens Axboe wrote:
> On 8/17/21 1:59 PM, Jens Axboe wrote:
>> On 8/17/21 1:29 PM, Tony Battersby wrote:
>>> On 8/17/21 2:24 PM, Jens Axboe wrote:
>>>> On 8/17/21 12:15 PM, Jens Axboe wrote:
>>>>> On 8/15/21 2:42 PM, Olivier Langlois wrote:
>>>>>> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>>>>>>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>>>>>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I didn't forgot about this remaining issue and I have kept thinking
>>>>>>>>> about it on and off.
>>>>>>>>>
>>>>>>>>> I did try the following on 5.12.19:
>>>>>>>>>
>>>>>>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>>>>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>>>>>>> --- a/fs/coredump.c
>>>>>>>>> +++ b/fs/coredump.c
>>>>>>>>> @@ -41,6 +41,7 @@
>>>>>>>>>  #include <linux/fs.h>
>>>>>>>>>  #include <linux/path.h>
>>>>>>>>>  #include <linux/timekeeping.h>
>>>>>>>>> +#include <linux/io_uring.h>
>>>>>>>>>  
>>>>>>>>>  #include <linux/uaccess.h>
>>>>>>>>>  #include <asm/mmu_context.h>
>>>>>>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t
>>>>>>>>> *siginfo)
>>>>>>>>>                 need_suid_safe = true;
>>>>>>>>>         }
>>>>>>>>>  
>>>>>>>>> +       io_uring_files_cancel(current->files);
>>>>>>>>> +
>>>>>>>>>         retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>>>>>>         if (retval < 0)
>>>>>>>>>                 goto fail_creds;
>>>>>>>>> --
>>>>>>>>> 2.32.0
>>>>>>>>>
>>>>>>>>> with my current understanding, io_uring_files_cancel is supposed to
>>>>>>>>> cancel everything that might set the TIF_NOTIFY_SIGNAL.
>>>>>>>>>
>>>>>>>>> I must report that in my testing with generating a core dump
>>>>>>>>> through a
>>>>>>>>> pipe with the modif above, I still get truncated core dumps.
>>>>>>>>>
>>>>>>>>> systemd is having a weird error:
>>>>>>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>>>>>>> process
>>>>>>>>>
>>>>>>>>> and nothing is captured
>>>>>>>>>
>>>>>>>>> so I have replaced it with a very simple shell:
>>>>>>>>> $ cat /proc/sys/kernel/core_pattern 
>>>>>>>>>> /home/lano1106/bin/pipe_core.sh %e %p
>>>>>>>>> ~/bin $ cat pipe_core.sh 
>>>>>>>>> #!/bin/sh
>>>>>>>>>
>>>>>>>>> cat > /home/lano1106/core/core.$1.$2
>>>>>>>>>
>>>>>>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>>>>>>> expected core file size >= 24129536, found: 61440
>>>>>>>>>
>>>>>>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>>>>>>> 100%
>>>>>>>>> cleaning everything that it should clean.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I just ran into this problem also - coredumps from an io_uring
>>>>>>>> program
>>>>>>>> to a pipe are truncated.  But I am using kernel 5.10.57, which does
>>>>>>>> NOT
>>>>>>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>>>>>>> or
>>>>>>>> commit 06af8679449d ("coredump: Limit what can interrupt coredumps").
>>>>>>>> Kernel 5.4 works though, so I bisected the problem to commit
>>>>>>>> f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup
>>>>>>>> properly") in kernel 5.9.  Note that my io_uring program uses only
>>>>>>>> async
>>>>>>>> buffered reads, which may be why this particular commit makes a
>>>>>>>> difference to my program.
>>>>>>>>
>>>>>>>> My io_uring program is a multi-purpose long-running program with many
>>>>>>>> threads.  Most threads don't use io_uring but a few of them do. 
>>>>>>>> Normally, my core dumps are piped to a program so that they can be
>>>>>>>> compressed before being written to disk, but I can also test writing
>>>>>>>> the
>>>>>>>> core dumps directly to disk.  This is what I have found:
>>>>>>>>
>>>>>>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers
>>>>>>>> a
>>>>>>>> coredump, the core file is written correctly, whether it is written
>>>>>>>> to
>>>>>>>> disk or piped to a program, even if another thread is using io_uring
>>>>>>>> at
>>>>>>>> the same time.
>>>>>>>>
>>>>>>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>>>>>>> coredump, the core file is truncated, whether written directly to
>>>>>>>> disk
>>>>>>>> or piped to a program.
>>>>>>>>
>>>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>>>> triggers a coredump, and the core is written directly to disk, then
>>>>>>>> it
>>>>>>>> is written correctly.
>>>>>>>>
>>>>>>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>>>>>>> triggers a coredump, and the core is piped to a program, then it is
>>>>>>>> truncated.
>>>>>>>>
>>>>>>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>>>>>>> whether written directly to disk or piped to a program.
>>>>>>> That is very interesting. Like Olivier mentioned, it's not that actual
>>>>>>> commit, but rather the change of behavior implemented by it. Before
>>>>>>> that
>>>>>>> commit, we'd hit the async workers more often, whereas after we do the
>>>>>>> correct retry method where it's driven by the wakeup when the page is
>>>>>>> unlocked. This is purely speculation, but perhaps the fact that the
>>>>>>> process changes state potentially mid dump is why the dump ends up
>>>>>>> being
>>>>>>> truncated?
>>>>>>>
>>>>>>> I'd love to dive into this and try and figure it out. Absent a test
>>>>>>> case, at least the above gives me an idea of what to try out. I'll see
>>>>>>> if it makes it easier for me to create a case that does result in a
>>>>>>> truncated core dump.
>>>>>>>
>>>>>> Jens,
>>>>>>
>>>>>> When I have first encountered the issue, the very first thing that I
>>>>>> did try was to create a simple test program that would synthetize the
>>>>>> problem.
>>>>>>
>>>>>> After few time consumming failed attempts, I just gave up the idea and
>>>>>> simply settle to my prod program that showcase systematically the
>>>>>> problem every time that I kill the process with a SEGV signal.
>>>>>>
>>>>>> In a nutshell, all the program does is to issue read operations with
>>>>>> io_uring on a TCP socket on which there is a constant data stream.
>>>>>>
>>>>>> Now that I have a better understanding of what is going on, I think
>>>>>> that one way that could reproduce the problem consistently could be
>>>>>> along those lines:
>>>>>>
>>>>>> 1. Create a pipe
>>>>>> 2. fork a child
>>>>>> 3. Initiate a read operation on the pipe with io_uring from the child
>>>>>> 4. Let the parent kill its child with a core dump generating signal.
>>>>>> 5. Write something in the pipe from the parent so that the io_uring
>>>>>> read operation completes while the core dump is generated.
>>>>>>
>>>>>> I guess that I'll end up doing that if I cannot fix the issue with my
>>>>>> current setup but here is what I have attempted so far:
>>>>>>
>>>>>> 1. Call io_uring_files_cancel from do_coredump
>>>>>> 2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on
>>>>>> returning from io_uring_files_cancel
>>>>>>
>>>>>> Those attempts didn't work but lurking in the io_uring dev mailing list
>>>>>> is starting to pay off. I thought that I did reach the bottom of the
>>>>>> rabbit hole in my journey of understanding io_uring but the recent
>>>>>> patch set sent by Hao Xu
>>>>>>
>>>>>> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>>>>>>
>>>>>> made me realize that I still haven't assimilated all the small io_uring
>>>>>> nuances...
>>>>>>
>>>>>> Here is my feedback. From my casual io_uring code reader point of view,
>>>>>> it is not 100% obvious what the difference is between
>>>>>> io_uring_files_cancel and io_uring_task_cancel
>>>>>>
>>>>>> It seems like io_uring_files_cancel is cancelling polls only if they
>>>>>> have the REQ_F_INFLIGHT flag set.
>>>>>>
>>>>>> I have no idea what an inflight request means and why someone would
>>>>>> want to call io_uring_files_cancel over io_uring_task_cancel.
>>>>>>
>>>>>> I guess that if I was to meditate on the question for few hours, I
>>>>>> would at some point get some illumination strike me but I believe that
>>>>>> it could be a good idea to document in the code those concepts for
>>>>>> helping casual readers...
>>>>>>
>>>>>> Bottomline, I now understand that io_uring_files_cancel does not cancel
>>>>>> all the requests. Therefore, without fully understanding what I am
>>>>>> doing, I am going to replace my call to io_uring_files_cancel from
>>>>>> do_coredump with io_uring_task_cancel and see if this finally fix the
>>>>>> issue for good.
>>>>>>
>>>>>> What I am trying to do is to cancel pending io_uring requests to make
>>>>>> sure that TIF_NOTIFY_SIGNAL isn't set while core dump is generated.
>>>>>>
>>>>>> Maybe another solution would simply be to modify __dump_emit to make it
>>>>>> resilient to TIF_NOTIFY_SIGNAL as Eric W. Biederman originally
>>>>>> suggested.
>>>>>>
>>>>>> or maybe do both...
>>>>>>
>>>>>> Not sure which approach is best. If someone has an opinion, I would be
>>>>>> curious to hear it.
>>>>> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
>>>>> signal_pending() and cause an interruption of the core dump. Just out of
>>>>> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
>>>>> set to some piped process, can you try and set it to 'core' and see if
>>>>> that eliminates the truncation of the core dumps for your case?
>>>> And assuming that works, then I suspect this one would fix your issue
>>>> even with a piped core dump:
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index 07afb5ddb1c4..852737a9ccbf 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -41,6 +41,7 @@
>>>>  #include <linux/fs.h>
>>>>  #include <linux/path.h>
>>>>  #include <linux/timekeeping.h>
>>>> +#include <linux/io_uring.h>
>>>>  
>>>>  #include <linux/uaccess.h>
>>>>  #include <asm/mmu_context.h>
>>>> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>>  	};
>>>>  
>>>>  	audit_core_dumps(siginfo->si_signo);
>>>> +	io_uring_task_cancel();
>>>>  
>>>>  	binfmt = mm->binfmt;
>>>>  	if (!binfmt || !binfmt->core_dump)
>>>>
>>> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
>>> with my io_uring program.  The coredump locked up even when writing the
>>> core file directly to disk; the zombie process could not be killed with
>>> "kill -9".  Unfortunately I can't test with newer kernels without
>>> spending some time on it, and I am too busy with other stuff right now.
>>
>> That sounds like 5.10-stable is missing some of the cancelation
>> backports, and your setup makes the cancelation stall because of that.
>> Need to go over the 11/12/13 fixes and ensure that we've got everything
>> we need for those stable versions, particularly 5.10.
>>
>>> My io_uring program does async buffered reads
>>> (io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
>>> (no filesystem).  One thread submits I/Os while another thread calls
>>> io_uring_wait_cqe() and processes the completions.  To trigger the
>>> coredump, I added an intentional abort() in the thread that submits I/Os
>>> after running for a second.
>>
>> OK, so that one is also using task_work for the retry based async
>> buffered reads, so it makes sense.
>>
>> Maybe a temporary work-around is to use 06af8679449d and eliminate the
>> pipe based coredump?
> 
> Another approach - don't allow TWA_SIGNAL task_work to get queued if
> PF_SIGNALED has been set on the task. This is similar to how we reject
> task_work_add() on process exit, and the callers must be able to handle
> that already.
> 
> Can you test this one on top of your 5.10-stable?
> 
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..ca7c1ee44ada 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		.mm_flags = mm->flags,
>  	};
>  
> +	/*
> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
> +	 * if any was queued before that.
> +	 */
> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
> +		tracehook_notify_signal();
> +
>  	audit_core_dumps(siginfo->si_signo);
>  
>  	binfmt = mm->binfmt;
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 1698fbe6f0e1..1ab28904adc4 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  		head = READ_ONCE(task->task_works);
>  		if (unlikely(head == &work_exited))
>  			return -ESRCH;
> +		/*
> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
> +		 * a core dump in progress, reject them.
> +		 */
> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
> +			return -ESRCH;
>  		work->next = head;
>  	} while (cmpxchg(&task->task_works, head, work) != head);
>  
> 

Olivier, I sent a 5.10 version for Nathan, any chance you can test this
one for the current kernels? Basically this one should work for 5.11+,
and the later 5.10 version is just for 5.10. I'm going to send it out
separately for review.

I do think this is the right solution, barring a tweak maybe on testing
notify == TWA_SIGNAL first before digging into the task struct. But the
principle is sound, and it'll work for other users of TWA_SIGNAL as
well. None right now as far as I can tell, but the live patching is
switching to TIF_NOTIFY_SIGNAL as well which will also cause issues with
coredumps potentially.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-18  2:57                                               ` Jens Axboe
@ 2021-08-18  2:58                                                 ` Jens Axboe
  2021-08-21 10:08                                                 ` Olivier Langlois
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2021-08-18  2:58 UTC (permalink / raw)
  To: Tony Battersby, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

> Olivier, I sent a 5.10 version for Nathan, any chance you can test this
                                     ^^^^^^

Tony of course, my apologies.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 22:05                                                 ` Jens Axboe
@ 2021-08-18 14:37                                                   ` Tony Battersby
  2021-08-18 14:46                                                     ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Tony Battersby @ 2021-08-18 14:37 UTC (permalink / raw)
  To: Jens Axboe, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/17/21 6:05 PM, Jens Axboe wrote:
> On 8/17/21 3:39 PM, Tony Battersby wrote:
>> On 8/17/21 5:28 PM, Jens Axboe wrote:
>>> Another approach - don't allow TWA_SIGNAL task_work to get queued if
>>> PF_SIGNALED has been set on the task. This is similar to how we reject
>>> task_work_add() on process exit, and the callers must be able to handle
>>> that already.
>>>
>>> Can you test this one on top of your 5.10-stable?
>>>
>>>
>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>> index 07afb5ddb1c4..ca7c1ee44ada 100644
>>> --- a/fs/coredump.c
>>> +++ b/fs/coredump.c
>>> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>  		.mm_flags = mm->flags,
>>>  	};
>>>  
>>> +	/*
>>> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
>>> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
>>> +	 * if any was queued before that.
>>> +	 */
>>> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>>> +		tracehook_notify_signal();
>>> +
>>>  	audit_core_dumps(siginfo->si_signo);
>>>  
>>>  	binfmt = mm->binfmt;
>>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>>> index 1698fbe6f0e1..1ab28904adc4 100644
>>> --- a/kernel/task_work.c
>>> +++ b/kernel/task_work.c
>>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>>  		head = READ_ONCE(task->task_works);
>>>  		if (unlikely(head == &work_exited))
>>>  			return -ESRCH;
>>> +		/*
>>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>>> +		 * a core dump in progress, reject them.
>>> +		 */
>>> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
>>> +			return -ESRCH;
>>>  		work->next = head;
>>>  	} while (cmpxchg(&task->task_works, head, work) != head);
>>>  
>>>
>> Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.
> Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally
> untested...
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index c6acfc694f65..9e899ce67589 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		.mm_flags = mm->flags,
>  	};
>  
> +	/*
> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
> +	 * been set, ensure that we flush any pending TWA_SIGNAL work
> +	 * if any was queued before that.
> +	 */
> +	if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) {
> +		task_work_run();
> +		spin_lock_irq(&current->sighand->siglock);
> +		current->jobctl &= ~JOBCTL_TASK_WORK;
> +		recalc_sigpending();
> +		spin_unlock_irq(&current->sighand->siglock);
> +	}
> +
>  	audit_core_dumps(siginfo->si_signo);
>  
>  	binfmt = mm->binfmt;
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 8d6e1217c451..93b3f262eb4a 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  		head = READ_ONCE(task->task_works);
>  		if (unlikely(head == &work_exited))
>  			return -ESRCH;
> +		/*
> +		 * TWA_SIGNAL notifications will interfere with
> +		 * a core dump in progress, reject them.
> +		 */
> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
> +			return -ESRCH;
>  		work->next = head;
>  	} while (cmpxchg(&task->task_works, head, work) != head);
>  
>
Tested with 5.10.59 + backport 06af8679449d + the patch above.  That
fixes it for me.  I tested a couple of variations to make sure.

Thanks!

Tested-by: Tony Battersby <tonyb@cybernetics.com>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-18 14:37                                                   ` Tony Battersby
@ 2021-08-18 14:46                                                     ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2021-08-18 14:46 UTC (permalink / raw)
  To: Tony Battersby, Olivier Langlois, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/18/21 8:37 AM, Tony Battersby wrote:
> On 8/17/21 6:05 PM, Jens Axboe wrote:
>> On 8/17/21 3:39 PM, Tony Battersby wrote:
>>> On 8/17/21 5:28 PM, Jens Axboe wrote:
>>>> Another approach - don't allow TWA_SIGNAL task_work to get queued if
>>>> PF_SIGNALED has been set on the task. This is similar to how we reject
>>>> task_work_add() on process exit, and the callers must be able to handle
>>>> that already.
>>>>
>>>> Can you test this one on top of your 5.10-stable?
>>>>
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index 07afb5ddb1c4..ca7c1ee44ada 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>>  		.mm_flags = mm->flags,
>>>>  	};
>>>>  
>>>> +	/*
>>>> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
>>>> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
>>>> +	 * if any was queued before that.
>>>> +	 */
>>>> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>>>> +		tracehook_notify_signal();
>>>> +
>>>>  	audit_core_dumps(siginfo->si_signo);
>>>>  
>>>>  	binfmt = mm->binfmt;
>>>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>>>> index 1698fbe6f0e1..1ab28904adc4 100644
>>>> --- a/kernel/task_work.c
>>>> +++ b/kernel/task_work.c
>>>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>>>  		head = READ_ONCE(task->task_works);
>>>>  		if (unlikely(head == &work_exited))
>>>>  			return -ESRCH;
>>>> +		/*
>>>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>>>> +		 * a core dump in progress, reject them.
>>>> +		 */
>>>> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
>>>> +			return -ESRCH;
>>>>  		work->next = head;
>>>>  	} while (cmpxchg(&task->task_works, head, work) != head);
>>>>  
>>>>
>>> Doesn't compile.  5.10 doesn't have TIF_NOTIFY_SIGNAL.
>> Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally
>> untested...
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index c6acfc694f65..9e899ce67589 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  		.mm_flags = mm->flags,
>>  	};
>>  
>> +	/*
>> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
>> +	 * been set, ensure that we flush any pending TWA_SIGNAL work
>> +	 * if any was queued before that.
>> +	 */
>> +	if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) {
>> +		task_work_run();
>> +		spin_lock_irq(&current->sighand->siglock);
>> +		current->jobctl &= ~JOBCTL_TASK_WORK;
>> +		recalc_sigpending();
>> +		spin_unlock_irq(&current->sighand->siglock);
>> +	}
>> +
>>  	audit_core_dumps(siginfo->si_signo);
>>  
>>  	binfmt = mm->binfmt;
>> diff --git a/kernel/task_work.c b/kernel/task_work.c
>> index 8d6e1217c451..93b3f262eb4a 100644
>> --- a/kernel/task_work.c
>> +++ b/kernel/task_work.c
>> @@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>  		head = READ_ONCE(task->task_works);
>>  		if (unlikely(head == &work_exited))
>>  			return -ESRCH;
>> +		/*
>> +		 * TWA_SIGNAL notifications will interfere with
>> +		 * a core dump in progress, reject them.
>> +		 */
>> +		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
>> +			return -ESRCH;
>>  		work->next = head;
>>  	} while (cmpxchg(&task->task_works, head, work) != head);
>>  
>>
> Tested with 5.10.59 + backport 06af8679449d + the patch above.  That
> fixes it for me.  I tested a couple of variations to make sure.
> 
> Thanks!
> 
> Tested-by: Tony Battersby <tonyb@cybernetics.com>

Great, thanks for testing! The 5.10 version is a bit uglier due to how
TWA_SIGNAL used to work, but it's the most straight forward backport of
the other version I sent.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 18:15                                     ` Jens Axboe
  2021-08-17 18:24                                       ` Jens Axboe
@ 2021-08-21  9:48                                       ` Olivier Langlois
  1 sibling, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-08-21  9:48 UTC (permalink / raw)
  To: Jens Axboe, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On Tue, 2021-08-17 at 12:15 -0600, Jens Axboe wrote:
> 
> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger
> some
> signal_pending() and cause an interruption of the core dump. Just out
> of
> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
> set to some piped process, can you try and set it to 'core' and see
> if
> that eliminates the truncation of the core dumps for your case?
> 

/proc/sys/kernel/core_pattern is set to:
|/home/lano1106/bin/pipe_core.sh %e %p

It normally points to systemd coredump module. I have pointed to a
simpler program for debugging purposes.

when core_pattern points to a local file, core dump files are just
fine. That was the whole point of 

commit 06af8679449d ("coredump: Limit what can interrupt coredumps")

I have been distracted by other things this week but my last attempt to
fix this problem appears to be successful. I will send out a small
patch set shortly...


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-17 18:24                                       ` Jens Axboe
  2021-08-17 19:29                                         ` Tony Battersby
@ 2021-08-21  9:52                                         ` Olivier Langlois
  1 sibling, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-08-21  9:52 UTC (permalink / raw)
  To: Jens Axboe, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On Tue, 2021-08-17 at 12:24 -0600, Jens Axboe wrote:
> 
> And assuming that works, then I suspect this one would fix your issue
> even with a piped core dump:
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..852737a9ccbf 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -41,6 +41,7 @@
>  #include <linux/fs.h>
>  #include <linux/path.h>
>  #include <linux/timekeeping.h>
> +#include <linux/io_uring.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>         };
>  
>         audit_core_dumps(siginfo->si_signo);
> +       io_uring_task_cancel();
>  
>         binfmt = mm->binfmt;
>         if (!binfmt || !binfmt->core_dump)
> 
That is what my patch is doing. Function call is inserted at a
different place... I am not sure if one location is better than the
other or if it matters at all but there is an extra change required to
make it work...

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..614fe7a54c1a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		need_suid_safe = true;
 	}
 
+	io_uring_task_cancel();
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;




^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-18  2:57                                               ` Jens Axboe
  2021-08-18  2:58                                                 ` Jens Axboe
@ 2021-08-21 10:08                                                 ` Olivier Langlois
  2021-08-21 16:47                                                   ` Olivier Langlois
  1 sibling, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2021-08-21 10:08 UTC (permalink / raw)
  To: Jens Axboe, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote:
> 
> Olivier, I sent a 5.10 version for Nathan, any chance you can test
> this
> one for the current kernels? Basically this one should work for
> 5.11+,
> and the later 5.10 version is just for 5.10. I'm going to send it out
> separately for review.
> 
> I do think this is the right solution, barring a tweak maybe on
> testing
> notify == TWA_SIGNAL first before digging into the task struct. But
> the
> principle is sound, and it'll work for other users of TWA_SIGNAL as
> well. None right now as far as I can tell, but the live patching is
> switching to TIF_NOTIFY_SIGNAL as well which will also cause issues
> with
> coredumps potentially.
> 
Ok, I am going to give it a shot. This solution is probably superior to
the previous attempt as it does not inject io_uring dependency into the
coredump module.

The small extra change that I alluded to in my previous reply will
still be relevant even if we go with your patch...

I'll come back soon with your patch testing result and my small extra
change that I keep teasing about.

Greetings,



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-21 10:08                                                 ` Olivier Langlois
@ 2021-08-21 16:47                                                   ` Olivier Langlois
  2021-08-21 16:51                                                     ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2021-08-21 16:47 UTC (permalink / raw)
  To: Jens Axboe, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On Sat, 2021-08-21 at 06:08 -0400, Olivier Langlois wrote:
> On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote:
> > 
> > Olivier, I sent a 5.10 version for Nathan, any chance you can test
> > this
> > one for the current kernels? Basically this one should work for
> > 5.11+,
> > and the later 5.10 version is just for 5.10. I'm going to send it
> > out
> > separately for review.
> > 
> > I do think this is the right solution, barring a tweak maybe on
> > testing
> > notify == TWA_SIGNAL first before digging into the task struct. But
> > the
> > principle is sound, and it'll work for other users of TWA_SIGNAL as
> > well. None right now as far as I can tell, but the live patching is
> > switching to TIF_NOTIFY_SIGNAL as well which will also cause issues
> > with
> > coredumps potentially.
> > 
> Ok, I am going to give it a shot. This solution is probably superior
> to
> the previous attempt as it does not inject io_uring dependency into
> the
> coredump module.
> 
> The small extra change that I alluded to in my previous reply will
> still be relevant even if we go with your patch...
> 
> I'll come back soon with your patch testing result and my small extra
> change that I keep teasing about.
> 
> Greetings,
> 
Jens,

your patch doesn't compile with 5.12+. AFAIK, the reason is that
JOBCTL_TASK_WORK is gone.

Wouldn't just a call to tracehook_notify_signal from do_coredump be
enough and backward compatible with every possible stable branches?

Greetings,


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-21 16:47                                                   ` Olivier Langlois
@ 2021-08-21 16:51                                                     ` Jens Axboe
  2021-08-21 17:21                                                       ` Olivier Langlois
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2021-08-21 16:51 UTC (permalink / raw)
  To: Olivier Langlois, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On 8/21/21 10:47 AM, Olivier Langlois wrote:
> On Sat, 2021-08-21 at 06:08 -0400, Olivier Langlois wrote:
>> On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote:
>>>
>>> Olivier, I sent a 5.10 version for Nathan, any chance you can test
>>> this
>>> one for the current kernels? Basically this one should work for
>>> 5.11+,
>>> and the later 5.10 version is just for 5.10. I'm going to send it
>>> out
>>> separately for review.
>>>
>>> I do think this is the right solution, barring a tweak maybe on
>>> testing
>>> notify == TWA_SIGNAL first before digging into the task struct. But
>>> the
>>> principle is sound, and it'll work for other users of TWA_SIGNAL as
>>> well. None right now as far as I can tell, but the live patching is
>>> switching to TIF_NOTIFY_SIGNAL as well which will also cause issues
>>> with
>>> coredumps potentially.
>>>
>> Ok, I am going to give it a shot. This solution is probably superior
>> to
>> the previous attempt as it does not inject io_uring dependency into
>> the
>> coredump module.
>>
>> The small extra change that I alluded to in my previous reply will
>> still be relevant even if we go with your patch...
>>
>> I'll come back soon with your patch testing result and my small extra
>> change that I keep teasing about.
>>
>> Greetings,
>>
> Jens,
> 
> your patch doesn't compile with 5.12+. AFAIK, the reason is that
> JOBCTL_TASK_WORK is gone.
> 
> Wouldn't just a call to tracehook_notify_signal from do_coredump be
> enough and backward compatible with every possible stable branches?

That version is just for 5.10, the first I posted is applicable to
5.11+

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] coredump: Limit what can interrupt coredumps
  2021-08-21 16:51                                                     ` Jens Axboe
@ 2021-08-21 17:21                                                       ` Olivier Langlois
  0 siblings, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2021-08-21 17:21 UTC (permalink / raw)
  To: Jens Axboe, Tony Battersby, Eric W. Biederman, Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	io-uring, Alexander Viro, Pavel Begunkov>

On Sat, 2021-08-21 at 10:51 -0600, Jens Axboe wrote:
> On 8/21/21 10:47 AM, Olivier Langlois wrote:
> > Jens,
> > 
> > your patch doesn't compile with 5.12+. AFAIK, the reason is that
> > JOBCTL_TASK_WORK is gone.
> > 
> > Wouldn't just a call to tracehook_notify_signal from do_coredump be
> > enough and backward compatible with every possible stable branches?
> 
> That version is just for 5.10, the first I posted is applicable to
> 5.11+
> 
Ok, in that case, I can tell you that it partially works. There is a
small thing missing.

I have tested mine which I did share in
https://lore.kernel.org/io-uring/87pmwt6biw.fsf@disp2133/T/#m3b51d44c12e39a06b82aac1d372df05312cff833

and with another small patch added to it, it does work.

I will offer my patch later today.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
       [not found]   ` <87h7i694ij.fsf_-_@disp2133>
  2021-06-09 20:33     ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Linus Torvalds
@ 2021-10-22 14:13     ` Pavel Begunkov
  2021-12-24  1:34       ` Olivier Langlois
  1 sibling, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2021-10-22 14:13 UTC (permalink / raw)
  To: Eric W. Biederman, linux-kernel
  Cc: linux-fsdevel, io-uring, Alexander Viro, Olivier Langlois,
	Jens Axboe, Oleg Nesterov, Linus Torvalds

On 6/9/21 21:17, Eric W. Biederman wrote:
> 
> Folks,
> 
> Olivier Langlois has been struggling with coredumps getting truncated in
> tasks using io_uring.  He has also apparently been struggling with
> the some of his email messages not making it to the lists.

Looks syzbot hit something relevant, see
https://lore.kernel.org/io-uring/0000000000000012fb05cee99477@google.com/

In short, a task creates an io_uring worker thread, then the worker
submits a task_work item to the creator task and won't die until
the item is executed/cancelled. And I found that the creator task is
sleeping in do_coredump() -> wait_for_completion()

0xffffffff81343ccb is in do_coredump (fs/coredump.c:469).
464
465             if (core_waiters > 0) {
466                     struct core_thread *ptr;
467
468                     freezer_do_not_count();
469                     wait_for_completion(&core_state->startup);
470                     freezer_count();


A hack executing tws there helps (see diff below).
Any chance anyone knows what this is and how to fix it?


diff --git a/fs/coredump.c b/fs/coredump.c
index 3224dee44d30..f6f9dfb02296 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -466,7 +466,8 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
          struct core_thread *ptr;
  
          freezer_do_not_count();
-        wait_for_completion(&core_state->startup);
+        while (wait_for_completion_interruptible(&core_state->startup))
+            tracehook_notify_signal();
          freezer_count();
          /*
           * Wait for all the threads to become inactive, so that



> 
> We were talking about some of his struggles and questions in this area
> and he pointed me to this patch he thought he had posted but I could not
> find in the list archives.
> 
> In short the coredump code deliberately supports being interrupted by
> SIGKILL, and depends upon prepare_signal to filter out all other
> signals.  With the io_uring code comes an extra test in signal_pending
> for TIF_NOTIFY_SIGNAL (which is something about asking a task to run
> task_work_run).
> 
> I am baffled why the dumper thread would be getting interrupted by
> TIF_NOTIFY_SIGNAL but apparently it is.  Perhaps it is an io_uring
> thread that is causing the dump.
> 
> Now that we know the problem the question becomes how to fix this issue.
> 
> Is there any chance all of this TWA_SIGNAL logic could simply be removed
> now that io_uring threads are normal process threads?
> 
> There are only the two call sites so I perhaps the could test
> signal->flags & SIGNAL_FLAG_COREDUMP before scheduling a work on
> a process that is dumping core?
> 
> Perhaps the coredump code needs to call task_work_run before it does
> anything?
> 
> -----
> 
> From: Olivier Langlois <olivier@trillion01.com>
> Subject: [PATCH] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
> Date: Mon, 07 Jun 2021 16:25:06 -0400
> 
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
> 
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>     with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>     set_notify_signal()
> 
> Signed-off-by: Olivier Langlois <olivier@trillion01.com>
> ---
>   fs/coredump.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 2868e3e171ae..79c6e3f114db 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>   	 * but then we need to teach dump_write() to restart and clear
>   	 * TIF_SIGPENDING.
>   	 */
> -	return signal_pending(current);
> +	return task_sigpending(current);
>   }
>   
>   static void wait_for_dump_helpers(struct file *file)
> 

-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-10-22 14:13     ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Pavel Begunkov
@ 2021-12-24  1:34       ` Olivier Langlois
  2021-12-24 10:37         ` Pavel Begunkov
  0 siblings, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2021-12-24  1:34 UTC (permalink / raw)
  To: Pavel Begunkov, Eric W. Biederman, linux-kernel
  Cc: linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Oleg Nesterov, Linus Torvalds

On Fri, 2021-10-22 at 15:13 +0100, Pavel Begunkov wrote:
> On 6/9/21 21:17, Eric W. Biederman wrote:
> > 
> > Folks,
> > 
> > Olivier Langlois has been struggling with coredumps getting
> > truncated in
> > tasks using io_uring.  He has also apparently been struggling with
> > the some of his email messages not making it to the lists.
> 
> Looks syzbot hit something relevant, see
> https://lore.kernel.org/io-
> uring/0000000000000012fb05cee99477@google.com/
> 
> In short, a task creates an io_uring worker thread, then the worker
> submits a task_work item to the creator task and won't die until
> the item is executed/cancelled. And I found that the creator task is
> sleeping in do_coredump() -> wait_for_completion()
> 
> 0xffffffff81343ccb is in do_coredump (fs/coredump.c:469).
> 464
> 465             if (core_waiters > 0) {
> 466                     struct core_thread *ptr;
> 467
> 468                     freezer_do_not_count();
> 469                     wait_for_completion(&core_state->startup);
> 470                     freezer_count();
> 
> 
> A hack executing tws there helps (see diff below).
> Any chance anyone knows what this is and how to fix it?
> 
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 3224dee44d30..f6f9dfb02296 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -466,7 +466,8 @@ static int coredump_wait(int exit_code, struct
> core_state *core_state)
>           struct core_thread *ptr;
>   
>           freezer_do_not_count();
> -        wait_for_completion(&core_state->startup);
> +        while (wait_for_completion_interruptible(&core_state-
> >startup))
> +            tracehook_notify_signal();
>           freezer_count();
>           /*
>            * Wait for all the threads to become inactive, so that
> 
> 
> 
> 
Pavel,

I cannot comment on the merit of the proposed hack but my proposed
patch to fix the coredump truncation issue when a process using
io_uring core dumps that I submitted back in August is still
unreviewed!

https://lore.kernel.org/lkml/1625bc89782bf83d9d8c7c63e8ffcb651ccb15fa.1629655338.git.olivier@trillion01.com/

I have been using it since then I must have generated many dozens of
perfect core dump files with it and I have not seen a single truncated
core dump files like I used to have prior to the patch.

I am bringing back my patch to your attention because one nice side
effect of it is that it would have avoided totally the problem that you
have encountered in coredump_wait() since it does cancel io_uring
resources before calling coredump_wait()!

Greetings,
Olivier


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-12-24  1:34       ` Olivier Langlois
@ 2021-12-24 10:37         ` Pavel Begunkov
  2021-12-24 19:52           ` Eric W. Biederman
  2022-01-05 19:39           ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Olivier Langlois
  0 siblings, 2 replies; 66+ messages in thread
From: Pavel Begunkov @ 2021-12-24 10:37 UTC (permalink / raw)
  To: Olivier Langlois, Eric W. Biederman, linux-kernel
  Cc: linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Oleg Nesterov, Linus Torvalds

On 12/24/21 01:34, Olivier Langlois wrote:
> On Fri, 2021-10-22 at 15:13 +0100, Pavel Begunkov wrote:
>> On 6/9/21 21:17, Eric W. Biederman wrote:
>> In short, a task creates an io_uring worker thread, then the worker
>> submits a task_work item to the creator task and won't die until
>> the item is executed/cancelled. And I found that the creator task is
>> sleeping in do_coredump() -> wait_for_completion()
>>
[...]
>> A hack executing tws there helps (see diff below).
>> Any chance anyone knows what this is and how to fix it?
>>
[...]
> Pavel,
> 
> I cannot comment on the merit of the proposed hack but my proposed
> patch to fix the coredump truncation issue when a process using
> io_uring core dumps that I submitted back in August is still
> unreviewed!

That's unfortunate. Not like I can help in any case, but I assumed
it was dealt with by

commit 06af8679449d4ed282df13191fc52d5ba28ec536
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Jun 10 15:11:11 2021 -0500

     coredump: Limit what can interrupt coredumps
     
     Olivier Langlois has been struggling with coredumps being incompletely written in
     processes using io_uring.
     ...

> https://lore.kernel.org/lkml/1625bc89782bf83d9d8c7c63e8ffcb651ccb15fa.1629655338.git.olivier@trillion01.com/
> 
> I have been using it since then I must have generated many dozens of
> perfect core dump files with it and I have not seen a single truncated
> core dump files like I used to have prior to the patch.
> 
> I am bringing back my patch to your attention because one nice side
> effect of it is that it would have avoided totally the problem that you
> have encountered in coredump_wait() since it does cancel io_uring
> resources before calling coredump_wait()!

FWIW, I worked it around in io_uring back then by breaking the
dependency.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-12-24 10:37         ` Pavel Begunkov
@ 2021-12-24 19:52           ` Eric W. Biederman
  2021-12-28 11:24             ` Pavel Begunkov
  2022-01-05 19:39           ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Olivier Langlois
  1 sibling, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2021-12-24 19:52 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Olivier Langlois, linux-kernel, linux-fsdevel, io-uring,
	Alexander Viro, Jens Axboe, Oleg Nesterov, Linus Torvalds

Pavel Begunkov <asml.silence@gmail.com> writes:

> On 12/24/21 01:34, Olivier Langlois wrote:
>> On Fri, 2021-10-22 at 15:13 +0100, Pavel Begunkov wrote:
>>> On 6/9/21 21:17, Eric W. Biederman wrote:
>>> In short, a task creates an io_uring worker thread, then the worker
>>> submits a task_work item to the creator task and won't die until
>>> the item is executed/cancelled. And I found that the creator task is
>>> sleeping in do_coredump() -> wait_for_completion()
>>>
> [...]
>>> A hack executing tws there helps (see diff below).
>>> Any chance anyone knows what this is and how to fix it?
>>>
> [...]
>> Pavel,
>>
>> I cannot comment on the merit of the proposed hack but my proposed
>> patch to fix the coredump truncation issue when a process using
>> io_uring core dumps that I submitted back in August is still
>> unreviewed!
>
> That's unfortunate. Not like I can help in any case, but I assumed
> it was dealt with by
>
> commit 06af8679449d4ed282df13191fc52d5ba28ec536
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Thu Jun 10 15:11:11 2021 -0500
>
>     coredump: Limit what can interrupt coredumps
>       Olivier Langlois has been struggling with coredumps being incompletely
> written in
>     processes using io_uring.
>     ...

I thought it had been too.

>> https://lore.kernel.org/lkml/1625bc89782bf83d9d8c7c63e8ffcb651ccb15fa.1629655338.git.olivier@trillion01.com/
>>
>> I have been using it since then I must have generated many dozens of
>> perfect core dump files with it and I have not seen a single truncated
>> core dump files like I used to have prior to the patch.
>>
>> I am bringing back my patch to your attention because one nice side
>> effect of it is that it would have avoided totally the problem that you
>> have encountered in coredump_wait() since it does cancel io_uring
>> resources before calling coredump_wait()!
>
> FWIW, I worked it around in io_uring back then by breaking the
> dependency.

I am in the middle of untangling the dependencies between ptrace,
coredump, signal handling and maybe a few related things.

Do folks have a reproducer I can look at?  Pavel especially if you have
something that reproduces on the current kernels.

As part of that I am in the process of guaranteeing all of the coredump
work happens in get_signal so nothing of io_uring or any cleanup
anywhere else runs until the coredump completes.

I haven't quite posted the code for review because it's the holidays.
But I am aiming at v5.17 or possibly v5.18, as the code is just about
ready.

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-12-24 19:52           ` Eric W. Biederman
@ 2021-12-28 11:24             ` Pavel Begunkov
  2022-03-14 23:58               ` Eric W. Biederman
  0 siblings, 1 reply; 66+ messages in thread
From: Pavel Begunkov @ 2021-12-28 11:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Olivier Langlois, linux-kernel, linux-fsdevel, io-uring,
	Alexander Viro, Jens Axboe, Oleg Nesterov, Linus Torvalds

On 12/24/21 19:52, Eric W. Biederman wrote:
> Pavel Begunkov <asml.silence@gmail.com> writes:
[...]
>> FWIW, I worked it around in io_uring back then by breaking the
>> dependency.
> 
> I am in the middle of untangling the dependencies between ptrace,
> coredump, signal handling and maybe a few related things.

Sounds great

> Do folks have a reproducer I can look at?  Pavel especially if you have
> something that reproduces on the current kernels.

A syz reproducer was triggering it reliably, I'd try to revert the
commit below and test:
https://syzkaller.appspot.com/text?tag=ReproC&x=15d3600cb00000

It should hung a task. Syzbot report for reference:
https://syzkaller.appspot.com/bug?extid=27d62ee6f256b186883e


commit 1d5f5ea7cb7d15b9fb1cc82673ebb054f02cd7d2
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Oct 29 13:11:33 2021 +0100

     io-wq: remove worker to owner tw dependency

     INFO: task iou-wrk-6609:6612 blocked for more than 143 seconds.
           Not tainted 5.15.0-rc5-syzkaller #0
     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     task:iou-wrk-6609    state:D stack:27944 pid: 6612 ppid:  6526 flags:0x00004006
     Call Trace:
      context_switch kernel/sched/core.c:4940 [inline]
      __schedule+0xb44/0x5960 kernel/sched/core.c:6287
      schedule+0xd3/0x270 kernel/sched/core.c:6366
      schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
      do_wait_for_common kernel/sched/completion.c:85 [inline]
      __wait_for_common kernel/sched/completion.c:106 [inline]
      wait_for_common kernel/sched/completion.c:117 [inline]
      wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
      io_worker_exit fs/io-wq.c:183 [inline]
      io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
      ...

> As part of that I am in the process of guaranteeing all of the coredump
> work happens in get_signal so nothing of io_uring or any cleanup
> anywhere else runs until the coredump completes.
> 
> I haven't quite posted the code for review because it's the holidays.
> But I am aiming at v5.17 or possibly v5.18, as the code is just about
> ready.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-12-24 10:37         ` Pavel Begunkov
  2021-12-24 19:52           ` Eric W. Biederman
@ 2022-01-05 19:39           ` Olivier Langlois
  1 sibling, 0 replies; 66+ messages in thread
From: Olivier Langlois @ 2022-01-05 19:39 UTC (permalink / raw)
  To: Pavel Begunkov, Eric W. Biederman, linux-kernel
  Cc: linux-fsdevel, io-uring, Alexander Viro, Jens Axboe,
	Oleg Nesterov, Linus Torvalds

On Fri, 2021-12-24 at 10:37 +0000, Pavel Begunkov wrote:
> On 12/24/21 01:34, Olivier Langlois wrote:
> > On Fri, 2021-10-22 at 15:13 +0100, Pavel Begunkov wrote:
> > > On 6/9/21 21:17, Eric W. Biederman wrote:
> > > In short, a task creates an io_uring worker thread, then the
> > > worker
> > > submits a task_work item to the creator task and won't die until
> > > the item is executed/cancelled. And I found that the creator task
> > > is
> > > sleeping in do_coredump() -> wait_for_completion()
> > > 
> [...]
> > > A hack executing tws there helps (see diff below).
> > > Any chance anyone knows what this is and how to fix it?
> > > 
> [...]
> > Pavel,
> > 
> > I cannot comment on the merit of the proposed hack but my proposed
> > patch to fix the coredump truncation issue when a process using
> > io_uring core dumps that I submitted back in August is still
> > unreviewed!
> 
> That's unfortunate. Not like I can help in any case, but I assumed
> it was dealt with by
> 
> commit 06af8679449d4ed282df13191fc52d5ba28ec536
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Thu Jun 10 15:11:11 2021 -0500
> 
>      coredump: Limit what can interrupt coredumps
>      
>      Olivier Langlois has been struggling with coredumps being
> incompletely written in
>      processes using io_uring.
>      ...
> 
It was a partial fix only. Oleg Nesterov pointed out that the fix was
not good if if the core dump was written into a pipe.

https://lore.kernel.org/io-uring/20210614141032.GA13677@redhat.com/

> > https://lore.kernel.org/lkml/1625bc89782bf83d9d8c7c63e8ffcb651ccb15
> > fa.1629655338.git.olivier@trillion01.com/
> > 
> > I have been using it since then I must have generated many dozens
> > of
> > perfect core dump files with it and I have not seen a single
> > truncated
> > core dump files like I used to have prior to the patch.
> > 
> > I am bringing back my patch to your attention because one nice side
> > effect of it is that it would have avoided totally the problem that
> > you
> > have encountered in coredump_wait() since it does cancel io_uring
> > resources before calling coredump_wait()!
> 
> FWIW, I worked it around in io_uring back then by breaking the
> dependency.
> 
Yes I have seen your work to fix the problem.

It just seems to me that there is no good reason to keep io_uring
process requests once you are generating a core dump and simply
cancelling io_uring before generating the core dump would have avoided
the problem that you have encountered plus any other similar issues yet
to come...

Greetings,


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
  2021-12-28 11:24             ` Pavel Begunkov
@ 2022-03-14 23:58               ` Eric W. Biederman
       [not found]                 ` <8218f1a245d054c940e25142fd00a5f17238d078.camel@trillion01.com>
  0 siblings, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2022-03-14 23:58 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Olivier Langlois, linux-kernel, linux-fsdevel, io-uring,
	Alexander Viro, Jens Axboe, Oleg Nesterov, Linus Torvalds

Pavel Begunkov <asml.silence@gmail.com> writes:

> On 12/24/21 19:52, Eric W. Biederman wrote:
>> Pavel Begunkov <asml.silence@gmail.com> writes:
> [...]
>>> FWIW, I worked it around in io_uring back then by breaking the
>>> dependency.
>> I am in the middle of untangling the dependencies between ptrace,
>> coredump, signal handling and maybe a few related things.
>
> Sounds great
>
>> Do folks have a reproducer I can look at?  Pavel especially if you have
>> something that reproduces on the current kernels.
>
> A syz reproducer was triggering it reliably, I'd try to revert the
> commit below and test:
> https://syzkaller.appspot.com/text?tag=ReproC&x=15d3600cb00000
>
> It should hung a task. Syzbot report for reference:
> https://syzkaller.appspot.com/bug?extid=27d62ee6f256b186883e
>
>
> commit 1d5f5ea7cb7d15b9fb1cc82673ebb054f02cd7d2
> Author: Pavel Begunkov <asml.silence@gmail.com>
> Date:   Fri Oct 29 13:11:33 2021 +0100
>
>     io-wq: remove worker to owner tw dependency
>
>     INFO: task iou-wrk-6609:6612 blocked for more than 143 seconds.
>           Not tainted 5.15.0-rc5-syzkaller #0
>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     task:iou-wrk-6609    state:D stack:27944 pid: 6612 ppid:  6526 flags:0x00004006
>     Call Trace:
>      context_switch kernel/sched/core.c:4940 [inline]
>      __schedule+0xb44/0x5960 kernel/sched/core.c:6287
>      schedule+0xd3/0x270 kernel/sched/core.c:6366
>      schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
>      do_wait_for_common kernel/sched/completion.c:85 [inline]
>      __wait_for_common kernel/sched/completion.c:106 [inline]
>      wait_for_common kernel/sched/completion.c:117 [inline]
>      wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
>      io_worker_exit fs/io-wq.c:183 [inline]
>      io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597
>      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
>      ...

Thank you very much for this.  There were some bugs elsewhere I had to
deal with so I am slower looking at this part of the code than I was
expecting.

I have now reproduced this with the commit reverted on current kernels
and the repro.c from the syzcaller report.  I am starting to look into
how this interacts with my planned code changes in this area.

In combination with my other planned changes I think all that needs to
happen in do_coredump is to clear TIF_NOTIFY_SIGNAL along with
TIF_SIGPENDING to prevent io_uring interaction problems.  But we will
see.

The deadlock you demonstrate here shows that it is definitely not enough
to clear TIF_NOTIFY_SIGNAL (without other changes) so that
signal_pending returns false, which I was hoping was be the case.

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL
       [not found]                 ` <8218f1a245d054c940e25142fd00a5f17238d078.camel@trillion01.com>
@ 2022-06-01  3:15                   ` Jens Axboe
  2022-07-20 16:49                     ` [PATCH 0/2] coredump: Allow io_uring using apps to dump to pipes Eric W. Biederman
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2022-06-01  3:15 UTC (permalink / raw)
  To: Olivier Langlois, Eric W. Biederman, Pavel Begunkov
  Cc: linux-kernel, linux-fsdevel, io-uring, Alexander Viro,
	Oleg Nesterov, Linus Torvalds

On 5/31/22 2:06 PM, Olivier Langlois wrote:
> On Mon, 2022-03-14 at 18:58 -0500, Eric W. Biederman wrote:
>>
>> Thank you very much for this.  There were some bugs elsewhere I had
>> to
>> deal with so I am slower looking at this part of the code than I was
>> expecting.
>>
>> I have now reproduced this with the commit reverted on current
>> kernels
>> and the repro.c from the syzcaller report.  I am starting to look
>> into
>> how this interacts with my planned code changes in this area.
>>
>> In combination with my other planned changes I think all that needs
>> to
>> happen in do_coredump is to clear TIF_NOTIFY_SIGNAL along with
>> TIF_SIGPENDING to prevent io_uring interaction problems.  But we will
>> see.
>>
>> The deadlock you demonstrate here shows that it is definitely not
>> enough
>> to clear TIF_NOTIFY_SIGNAL (without other changes) so that
>> signal_pending returns false, which I was hoping was be the case.
>>
>> Eric
> 
> I have been away for some time but if this is not resoved yet, I just
> want to remind that clearing TIF_NOTIFY_SIGNAL along with
> TIF_SIGPENDING won't do it because io_uring may set them asynchronously
> to report some io completion while do_coredump() is executing.
> 
> IMHO, just calling io_uring_task_cancel() from do_coredump() before
> actually writing the dump, while maybe not the perfect solution, is the
> simplest one.
> 
> Otherwise, maybe masking interrupts locally could work but I wouldn't
> dare to explore this option personally...

Eric, are you fine with doing the cancelation based patch for now? IMHO
it's not the ideal approach, but it will resolve the issue. And it'd
honestly be great to get some closure on this so we don't have truncated
core dumps if they are interrupted by task_work.

The best solution would be to make the core dumps resilient to
task_work, but a workable solution would be nice at this point...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 0/2] coredump: Allow io_uring using apps to dump to pipes.
  2022-06-01  3:15                   ` Jens Axboe
@ 2022-07-20 16:49                     ` Eric W. Biederman
  2022-07-20 16:50                       ` [PATCH 1/2] signal: Move stopping for the coredump from do_exit into get_signal Eric W. Biederman
  2022-07-20 16:51                       ` [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring Eric W. Biederman
  0 siblings, 2 replies; 66+ messages in thread
From: Eric W. Biederman @ 2022-07-20 16:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Olivier Langlois, Pavel Begunkov, linux-kernel, linux-fsdevel,
	io-uring, Alexander Viro, Oleg Nesterov, Linus Torvalds



Folks who have been suffering from io_uring coredumping issues please
give this a spin.  I have lightly tested this version and more heavily
tested an earlier version which had more dependencies.

Unless I have missed something in cleaning up the code this should
be a comprehensive fix to the coredumping issues when using io_uring.

But folks please test and verify this.  It has taken me long enough to
get back to this point I don't properly remember how the reproducer I
have was supposed to fail.  All I can say with certainty is set of
changes has what looks like a positive effect.

Eric W. Biederman (2):
      signal: Move stopping for the coredump from do_exit into get_signal
      coredump: Allow coredumps to pipes to work with io_uring


 fs/coredump.c            | 30 ++++++++++++++++++++++++++++--
 include/linux/coredump.h |  2 ++
 kernel/exit.c            | 29 +++++------------------------
 kernel/signal.c          |  5 +++++
 mm/oom_kill.c            |  2 +-
 5 files changed, 41 insertions(+), 27 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 1/2] signal: Move stopping for the coredump from do_exit into get_signal
  2022-07-20 16:49                     ` [PATCH 0/2] coredump: Allow io_uring using apps to dump to pipes Eric W. Biederman
@ 2022-07-20 16:50                       ` Eric W. Biederman
  2022-07-20 16:51                       ` [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring Eric W. Biederman
  1 sibling, 0 replies; 66+ messages in thread
From: Eric W. Biederman @ 2022-07-20 16:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Olivier Langlois, Pavel Begunkov, linux-kernel, linux-fsdevel,
	io-uring, Alexander Viro, Oleg Nesterov, Linus Torvalds


Stopping to participate in a coredump from a kernel oops makes no
sense and is actively dangerous because the kernel is known to be
broken.  Considering to stop in a coredump from a kernel thread exit
is silly because userspace coredumps are not generated from kernel
threads.  Not stopping for a coredump in exit(2) and exit_group(2) and
related userspace exits that call do_exit or do_group_exit directly is
the current behavior of the code as the PF_SIGNALED test in
coredump_task_exit attests.

Since only tasks that pass through get_signal and set PF_SIGNALED can
join coredumps move stopping for coredumps into get_signal, where the
PF_SIGNALED test is unnecessary.  This avoids even the potential of
stopping for coredumps in the silly or dangerous places.

This can be seen to be safe by examining the few places that call do_exit:

- get_signal calling do_group_exit
  Called by get_signal to terminate the userspace process.  As stopping
  for the coredump happens now happens in get_signal the code will
  continue to participate in the coredump.

- exit_group(2) calling do_group_exit

  If a thread calls exit_group(2) while another thread in the same process
  is performing a coredump there is a race.  The thread that wins the
  race will take the lock and set SIGNAL_GROUP_EXIT.  If it is the
  thread that called do_group_exit then zap_threads will return -EAGAIN
  and no coredump will be generated.  If it is the thread that is
  coredumping that wins the race, the task that called do_group_exit
  will exit gracefully with an error code before the coredump begins.

  Having a single thread exit just before the coredump starts is not
  ideal as the semantics make no sense. (Did the group exit happen
  before the coredump or did the coredump happen before the group
  exit?).

  Eventually I intend for group exits to flow through get_signal and
  this silliness will no longer be possible.  Until then the current
  behavior when this race occurs is maintained.

- io_uring
  Called after get_signal returns to terminate the I/O worker thread
  (essentially a userspace thread that only runs kernel code) so that
  additional cleanup code can be run before do_exit.  As get_signal is
  called the prior to do_exit code will continue to participate in the
  coredump.

- make_task_dead
  Called on an unhandled kernel or hardware failure.  As the failure
  is unhandled any extra work has the potential to make the failure worse
  so being part of a coredump is not appropriate.

- kthread_exit
  Called to terminate a kernel thread as such coredumps do not exist.

- call_usermodehelper_exec_async
  Called to terminate a kernel thread if kerenel_execve fails, as it is a
  kernel thread coredumps do not exist.

- reboot, seeccomp
  For these calls of do_exit() they are semantically direct calls of
  exit(2) today.  As do_exit() does not synchronize with siglock there
  is no logical race between a coredump killing the thread and these
  threads exiting.  These threads logically exit before the coredump
  happens.  This is also the current behavior so there is nothing to
  be concerned about with respect to userspsace semantics or
  regresssions.

Moving the coredump stop for userspace threads that did not dequeue
the coredumping signal from from do_exit into get_signal in general is
safe, because the coredump in the single threaded case completely
happens in get_signal.  The code movement ensures that a
multi-threaded coredump will not have any issues because the
additional threads stop after some amount of cleanup has been done.

The coredump code is robust to all kinds of userspace changes
happening in parallel as multiple processes can share a mm.  This
makes the it safe to perform the coredump before the io_uring cleanup
happens as io_uring can't do anything another process sharing the mm
would not be doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c            | 25 ++++++++++++++++++++++++-
 include/linux/coredump.h |  2 ++
 kernel/exit.c            | 29 +++++------------------------
 kernel/signal.c          |  5 +++++
 mm/oom_kill.c            |  2 +-
 5 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index b836948c9543..67dda77c500f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -389,6 +389,29 @@ static int zap_threads(struct task_struct *tsk,
 	return nr;
 }
 
+void coredump_join(struct core_state *core_state)
+{
+	/* Stop and join the in-progress coredump */
+	struct core_thread self;
+
+	self.task = current;
+	self.next = xchg(&core_state->dumper.next, &self);
+	/*
+	 * Implies mb(), the result of xchg() must be visible
+	 * to core_state->dumper.
+	 */
+	if (atomic_dec_and_test(&core_state->nr_threads))
+		complete(&core_state->startup);
+
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!self.task) /* see coredump_finish() */
+			break;
+		freezable_schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+}
+
 static int coredump_wait(int exit_code, struct core_state *core_state)
 {
 	struct task_struct *tsk = current;
@@ -436,7 +459,7 @@ static void coredump_finish(bool core_dumped)
 		next = curr->next;
 		task = curr->task;
 		/*
-		 * see coredump_task_exit(), curr->task must not see
+		 * see coredump_join(), curr->task must not see
 		 * ->task == NULL before we read ->next.
 		 */
 		smp_mb();
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 08a1d3e7e46d..815d6099b757 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -40,8 +40,10 @@ extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
 extern int dump_align(struct coredump_params *cprm, int align);
 int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		    unsigned long len);
+extern void coredump_join(struct core_state *core_state);
 extern void do_coredump(const kernel_siginfo_t *siginfo);
 #else
+extern inline void coredump_join(struct core_state *core_state) {}
 static inline void do_coredump(const kernel_siginfo_t *siginfo) {}
 #endif
 
diff --git a/kernel/exit.c b/kernel/exit.c
index d8ecbaa514f7..2218ca02ac71 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -352,35 +352,16 @@ static void coredump_task_exit(struct task_struct *tsk)
 	 * We must hold siglock around checking core_state
 	 * and setting PF_POSTCOREDUMP.  The core-inducing thread
 	 * will increment ->nr_threads for each thread in the
-	 * group without PF_POSTCOREDUMP set.
+	 * group without PF_POSTCOREDUMP set.  Decrement ->nr_threads
+	 * and possibly complete core_state->startup to politely skip
+	 * participating in any pending coredumps.
 	 */
 	spin_lock_irq(&tsk->sighand->siglock);
 	tsk->flags |= PF_POSTCOREDUMP;
 	core_state = tsk->signal->core_state;
+	if (core_state && atomic_dec_and_test(&core_state->nr_threads))
+		complete(&core_state->startup);
 	spin_unlock_irq(&tsk->sighand->siglock);
-	if (core_state) {
-		struct core_thread self;
-
-		self.task = current;
-		if (self.task->flags & PF_SIGNALED)
-			self.next = xchg(&core_state->dumper.next, &self);
-		else
-			self.task = NULL;
-		/*
-		 * Implies mb(), the result of xchg() must be visible
-		 * to core_state->dumper.
-		 */
-		if (atomic_dec_and_test(&core_state->nr_threads))
-			complete(&core_state->startup);
-
-		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			if (!self.task) /* see coredump_finish() */
-				break;
-			freezable_schedule();
-		}
-		__set_current_state(TASK_RUNNING);
-	}
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/kernel/signal.c b/kernel/signal.c
index 8a0f114d00e0..8595c935027e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2687,6 +2687,7 @@ bool get_signal(struct ksignal *ksig)
 	}
 
 	for (;;) {
+		struct core_state *core_state;
 		struct k_sigaction *ka;
 		enum pid_type type;
 
@@ -2820,6 +2821,7 @@ bool get_signal(struct ksignal *ksig)
 		}
 
 	fatal:
+		core_state = signal->core_state;
 		spin_unlock_irq(&sighand->siglock);
 		if (unlikely(cgroup_task_frozen(current)))
 			cgroup_leave_frozen(true);
@@ -2842,6 +2844,9 @@ bool get_signal(struct ksignal *ksig)
 			 * that value and ignore the one we pass it.
 			 */
 			do_coredump(&ksig->info);
+		} else if (core_state) {
+			/* Wait for the coredump to happen */
+			coredump_join(core_state);
 		}
 
 		/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c6cf9e3cd66..1bb689fd9f81 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -845,7 +845,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 
 	/*
 	 * A coredumping process may sleep for an extended period in
-	 * coredump_task_exit(), so the oom killer cannot assume that
+	 * get_signal(), so the oom killer cannot assume that
 	 * the process will promptly exit and release memory.
 	 */
 	if (sig->core_state)
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring
  2022-07-20 16:49                     ` [PATCH 0/2] coredump: Allow io_uring using apps to dump to pipes Eric W. Biederman
  2022-07-20 16:50                       ` [PATCH 1/2] signal: Move stopping for the coredump from do_exit into get_signal Eric W. Biederman
@ 2022-07-20 16:51                       ` Eric W. Biederman
  2022-08-22 21:16                         ` Olivier Langlois
  1 sibling, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2022-07-20 16:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Olivier Langlois, Pavel Begunkov, linux-kernel, linux-fsdevel,
	io-uring, Alexander Viro, Oleg Nesterov, Linus Torvalds


Now that io_uring like everything else stops for coredumps in
get_signal the code can once again allow any interruptible
condition after coredump_wait to interrupt the coredump.

Clear TIF_NOTIFY_SIGNAL after coredump_wait, to guarantee that
anything that sets TIF_NOTIFY_SIGNAL before coredump_wait completed
won't cause the coredumps to interrupted.

With all of the other threads in the process stopped io_uring doesn't
call task_work_add on the thread running do_coredump.  Combined with
the clearing of TIF_NOTIFY_SIGNAL this allows processes that use
io_uring to coredump through pipes.

Restore dump_interrupted to be a simple call to signal_pending
effectively reverting commit 06af8679449d ("coredump: Limit what can
interrupt coredumps").  At this point only SIGKILL delivered to the
coredumping thread should be able to cause signal_pending to return
true.

A nice followup would be to find a reliable race free way to modify
task_work_add and probably set_notify_signal to skip setting
TIF_NOTIFY_SIGNAL once it is clear a task will no longer process
signals and other interruptible conditions.  That would allow
TIF_NOTIFY_SIGNAL to be cleared where TIF_SIGPENDING is cleared in
coredump_zap_process.

To be as certain as possible that this works, I tested this with
commit 1d5f5ea7cb7d ("io-wq: remove worker to owner tw dependency")
reverted.  Which means that not only is TIF_NOTIFY_SIGNAL prevented
from stopping coredumps to pipes, the sequence of stopping threads to
participate in the coredump avoids deadlocks that were possible
previously.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 67dda77c500f..c06594f56cbb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -476,7 +476,7 @@ static bool dump_interrupted(void)
 	 * but then we need to teach dump_write() to restart and clear
 	 * TIF_SIGPENDING.
 	 */
-	return fatal_signal_pending(current) || freezing(current);
+	return signal_pending(current);
 }
 
 static void wait_for_dump_helpers(struct file *file)
@@ -589,6 +589,9 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
+	/* Don't break out of interruptible sleeps */
+	clear_notify_signal();
+
 	ispipe = format_corename(&cn, &cprm, &argv, &argc);
 
 	if (ispipe) {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring
  2022-07-20 16:51                       ` [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring Eric W. Biederman
@ 2022-08-22 21:16                         ` Olivier Langlois
  2022-08-23  3:35                           ` Olivier Langlois
  0 siblings, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2022-08-22 21:16 UTC (permalink / raw)
  To: Eric W. Biederman, Jens Axboe
  Cc: Pavel Begunkov, linux-kernel, linux-fsdevel, io-uring,
	Alexander Viro, Oleg Nesterov, Linus Torvalds

On Wed, 2022-07-20 at 11:51 -0500, Eric W. Biederman wrote:
> 
> Now that io_uring like everything else stops for coredumps in
> get_signal the code can once again allow any interruptible
> condition after coredump_wait to interrupt the coredump.
> 
> Clear TIF_NOTIFY_SIGNAL after coredump_wait, to guarantee that
> anything that sets TIF_NOTIFY_SIGNAL before coredump_wait completed
> won't cause the coredumps to interrupted.
> 
> With all of the other threads in the process stopped io_uring doesn't
> call task_work_add on the thread running do_coredump.  Combined with
> the clearing of TIF_NOTIFY_SIGNAL this allows processes that use
> io_uring to coredump through pipes.
> 
> Restore dump_interrupted to be a simple call to signal_pending
> effectively reverting commit 06af8679449d ("coredump: Limit what can
> interrupt coredumps").  At this point only SIGKILL delivered to the
> coredumping thread should be able to cause signal_pending to return
> true.
> 
> A nice followup would be to find a reliable race free way to modify
> task_work_add and probably set_notify_signal to skip setting
> TIF_NOTIFY_SIGNAL once it is clear a task will no longer process
> signals and other interruptible conditions.  That would allow
> TIF_NOTIFY_SIGNAL to be cleared where TIF_SIGPENDING is cleared in
> coredump_zap_process.
> 
> To be as certain as possible that this works, I tested this with
> commit 1d5f5ea7cb7d ("io-wq: remove worker to owner tw dependency")
> reverted.  Which means that not only is TIF_NOTIFY_SIGNAL prevented
> from stopping coredumps to pipes, the sequence of stopping threads to
> participate in the coredump avoids deadlocks that were possible
> previously.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
Hi Eric,

What is stopping the task calling do_coredump() to be interrupted and
call task_work_add() from the interrupt context?

This is precisely what I was experiencing last summer when I did work
on this issue.

My understanding of how async I/O works with io_uring is that the task
is added to a wait queue without being put to sleep and when the
io_uring callback is called from the interrupt context, task_work_add()
is called so that the next time io_uring syscall is invoked, pending
work is processed to complete the I/O.

So if:

1. io_uring request is initiated AND the task is in a wait queue
2. do_coredump() is called before the I/O is completed

IMHO, this is how you end up having task_work_add() called while the
coredump is generated.

So far, the only way that I have found making sure that this was not
happening was to cancel every pending io_uring requests before writing
the coredump by calling io_uring_task_cancel():

--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -43,7 +43,7 @@
 #include <linux/timekeeping.h>
 #include <linux/sysctl.h>
 #include <linux/elf.h>
-
+#include <linux/io_uring.h>
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
 #include <asm/tlb.h>
@@ -561,6 +561,8 @@
 		need_suid_safe = true;
 	}
 
+	io_uring_task_cancel();
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;


Greetings,


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring
  2022-08-22 21:16                         ` Olivier Langlois
@ 2022-08-23  3:35                           ` Olivier Langlois
  2022-08-23 18:22                             ` Eric W. Biederman
  0 siblings, 1 reply; 66+ messages in thread
From: Olivier Langlois @ 2022-08-23  3:35 UTC (permalink / raw)
  To: Eric W. Biederman, Jens Axboe
  Cc: Pavel Begunkov, linux-kernel, linux-fsdevel, io-uring,
	Alexander Viro, Oleg Nesterov, Linus Torvalds

On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
> 
> What is stopping the task calling do_coredump() to be interrupted and
> call task_work_add() from the interrupt context?
> 
> This is precisely what I was experiencing last summer when I did work
> on this issue.
> 
> My understanding of how async I/O works with io_uring is that the
> task
> is added to a wait queue without being put to sleep and when the
> io_uring callback is called from the interrupt context,
> task_work_add()
> is called so that the next time io_uring syscall is invoked, pending
> work is processed to complete the I/O.
> 
> So if:
> 
> 1. io_uring request is initiated AND the task is in a wait queue
> 2. do_coredump() is called before the I/O is completed
> 
> IMHO, this is how you end up having task_work_add() called while the
> coredump is generated.
> 
I forgot to add that I have experienced the issue with TCP/IP I/O.

I suspect that with a TCP socket, the race condition window is much
larger than if it was disk I/O and this might make it easier to
reproduce the issue this way...

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring
  2022-08-23  3:35                           ` Olivier Langlois
@ 2022-08-23 18:22                             ` Eric W. Biederman
  2022-08-23 18:27                               ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2022-08-23 18:22 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Jens Axboe, Pavel Begunkov, linux-kernel, linux-fsdevel,
	io-uring, Alexander Viro, Oleg Nesterov, Linus Torvalds

Olivier Langlois <olivier@trillion01.com> writes:

> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>> 
>> What is stopping the task calling do_coredump() to be interrupted and
>> call task_work_add() from the interrupt context?
>> 
>> This is precisely what I was experiencing last summer when I did work
>> on this issue.
>> 
>> My understanding of how async I/O works with io_uring is that the
>> task
>> is added to a wait queue without being put to sleep and when the
>> io_uring callback is called from the interrupt context,
>> task_work_add()
>> is called so that the next time io_uring syscall is invoked, pending
>> work is processed to complete the I/O.
>> 
>> So if:
>> 
>> 1. io_uring request is initiated AND the task is in a wait queue
>> 2. do_coredump() is called before the I/O is completed
>> 
>> IMHO, this is how you end up having task_work_add() called while the
>> coredump is generated.
>> 
> I forgot to add that I have experienced the issue with TCP/IP I/O.
>
> I suspect that with a TCP socket, the race condition window is much
> larger than if it was disk I/O and this might make it easier to
> reproduce the issue this way...

I was under the apparently mistaken impression that the io_uring
task_work_add only comes from the io_uring userspace helper threads.
Those are definitely suppressed by my change.

Do you have any idea in the code where io_uring code is being called in
an interrupt context?  I would really like to trace that code path so I
have a better grasp on what is happening.

If task_work_add is being called from interrupt context then something
additional from what I have proposed certainly needs to be done.

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring
  2022-08-23 18:22                             ` Eric W. Biederman
@ 2022-08-23 18:27                               ` Jens Axboe
  2022-08-24 15:11                                 ` Eric W. Biederman
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2022-08-23 18:27 UTC (permalink / raw)
  To: Eric W. Biederman, Olivier Langlois
  Cc: Pavel Begunkov, linux-kernel, linux-fsdevel, io-uring,
	Alexander Viro, Oleg Nesterov, Linus Torvalds

On 8/23/22 12:22 PM, Eric W. Biederman wrote:
> Olivier Langlois <olivier@trillion01.com> writes:
> 
>> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>>>
>>> What is stopping the task calling do_coredump() to be interrupted and
>>> call task_work_add() from the interrupt context?
>>>
>>> This is precisely what I was experiencing last summer when I did work
>>> on this issue.
>>>
>>> My understanding of how async I/O works with io_uring is that the
>>> task
>>> is added to a wait queue without being put to sleep and when the
>>> io_uring callback is called from the interrupt context,
>>> task_work_add()
>>> is called so that the next time io_uring syscall is invoked, pending
>>> work is processed to complete the I/O.
>>>
>>> So if:
>>>
>>> 1. io_uring request is initiated AND the task is in a wait queue
>>> 2. do_coredump() is called before the I/O is completed
>>>
>>> IMHO, this is how you end up having task_work_add() called while the
>>> coredump is generated.
>>>
>> I forgot to add that I have experienced the issue with TCP/IP I/O.
>>
>> I suspect that with a TCP socket, the race condition window is much
>> larger than if it was disk I/O and this might make it easier to
>> reproduce the issue this way...
> 
> I was under the apparently mistaken impression that the io_uring
> task_work_add only comes from the io_uring userspace helper threads.
> Those are definitely suppressed by my change.
> 
> Do you have any idea in the code where io_uring code is being called in
> an interrupt context?  I would really like to trace that code path so I
> have a better grasp on what is happening.
> 
> If task_work_add is being called from interrupt context then something
> additional from what I have proposed certainly needs to be done.

task_work may come from the helper threads, but generally it does not.
One example would be doing a read from a socket. There's no data there,
poll is armed to trigger a retry. When we get the poll notification that
there's now data to be read, then we kick that off with task_work. Since
it's from the poll handler, it can trigger from interrupt context. See
the path from io_uring/poll.c:io_poll_wake() -> __io_poll_execute() ->
io_req_task_work_add() -> task_work_add().

It can also happen for regular IRQ based reads from regular files, where
the completion is actually done via task_work added from the potentially
IRQ based completion path.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring
  2022-08-23 18:27                               ` Jens Axboe
@ 2022-08-24 15:11                                 ` Eric W. Biederman
  2022-08-24 15:51                                   ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Eric W. Biederman @ 2022-08-24 15:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Olivier Langlois, Pavel Begunkov, linux-kernel, linux-fsdevel,
	io-uring, Alexander Viro, Oleg Nesterov, Linus Torvalds

Jens Axboe <axboe@kernel.dk> writes:

> On 8/23/22 12:22 PM, Eric W. Biederman wrote:
>> Olivier Langlois <olivier@trillion01.com> writes:
>> 
>>> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>>>>
>>>> What is stopping the task calling do_coredump() to be interrupted and
>>>> call task_work_add() from the interrupt context?
>>>>
>>>> This is precisely what I was experiencing last summer when I did work
>>>> on this issue.
>>>>
>>>> My understanding of how async I/O works with io_uring is that the
>>>> task
>>>> is added to a wait queue without being put to sleep and when the
>>>> io_uring callback is called from the interrupt context,
>>>> task_work_add()
>>>> is called so that the next time io_uring syscall is invoked, pending
>>>> work is processed to complete the I/O.
>>>>
>>>> So if:
>>>>
>>>> 1. io_uring request is initiated AND the task is in a wait queue
>>>> 2. do_coredump() is called before the I/O is completed
>>>>
>>>> IMHO, this is how you end up having task_work_add() called while the
>>>> coredump is generated.
>>>>
>>> I forgot to add that I have experienced the issue with TCP/IP I/O.
>>>
>>> I suspect that with a TCP socket, the race condition window is much
>>> larger than if it was disk I/O and this might make it easier to
>>> reproduce the issue this way...
>> 
>> I was under the apparently mistaken impression that the io_uring
>> task_work_add only comes from the io_uring userspace helper threads.
>> Those are definitely suppressed by my change.
>> 
>> Do you have any idea in the code where io_uring code is being called in
>> an interrupt context?  I would really like to trace that code path so I
>> have a better grasp on what is happening.
>> 
>> If task_work_add is being called from interrupt context then something
>> additional from what I have proposed certainly needs to be done.
>
> task_work may come from the helper threads, but generally it does not.
> One example would be doing a read from a socket. There's no data there,
> poll is armed to trigger a retry. When we get the poll notification that
> there's now data to be read, then we kick that off with task_work. Since
> it's from the poll handler, it can trigger from interrupt context. See
> the path from io_uring/poll.c:io_poll_wake() -> __io_poll_execute() ->
> io_req_task_work_add() -> task_work_add().

But that is a task_work to the helper thread correct?

> It can also happen for regular IRQ based reads from regular files, where
> the completion is actually done via task_work added from the potentially
> IRQ based completion path.

I can see that.

Which leaves me with the question do these task_work's directly wake up
the thread that submitted the I/O request?   Or is there likely to be
something for an I/O thread to do before an ordinary program thread is
notified.

I am asking because it is only the case of notifying ordinary program
threads that is interesting in the case of a coredump.  As I understand
it a data to read notification would typically be handled by the I/O
uring worker thread to trigger reading the data before letting userspace
know everything it asked to be done is complete.

Eric

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring
  2022-08-24 15:11                                 ` Eric W. Biederman
@ 2022-08-24 15:51                                   ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2022-08-24 15:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Olivier Langlois, Pavel Begunkov, linux-kernel, linux-fsdevel,
	io-uring, Alexander Viro, Oleg Nesterov, Linus Torvalds

On 8/24/22 9:11 AM, Eric W. Biederman wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 8/23/22 12:22 PM, Eric W. Biederman wrote:
>>> Olivier Langlois <olivier@trillion01.com> writes:
>>>
>>>> On Mon, 2022-08-22 at 17:16 -0400, Olivier Langlois wrote:
>>>>>
>>>>> What is stopping the task calling do_coredump() to be interrupted and
>>>>> call task_work_add() from the interrupt context?
>>>>>
>>>>> This is precisely what I was experiencing last summer when I did work
>>>>> on this issue.
>>>>>
>>>>> My understanding of how async I/O works with io_uring is that the
>>>>> task
>>>>> is added to a wait queue without being put to sleep and when the
>>>>> io_uring callback is called from the interrupt context,
>>>>> task_work_add()
>>>>> is called so that the next time io_uring syscall is invoked, pending
>>>>> work is processed to complete the I/O.
>>>>>
>>>>> So if:
>>>>>
>>>>> 1. io_uring request is initiated AND the task is in a wait queue
>>>>> 2. do_coredump() is called before the I/O is completed
>>>>>
>>>>> IMHO, this is how you end up having task_work_add() called while the
>>>>> coredump is generated.
>>>>>
>>>> I forgot to add that I have experienced the issue with TCP/IP I/O.
>>>>
>>>> I suspect that with a TCP socket, the race condition window is much
>>>> larger than if it was disk I/O and this might make it easier to
>>>> reproduce the issue this way...
>>>
>>> I was under the apparently mistaken impression that the io_uring
>>> task_work_add only comes from the io_uring userspace helper threads.
>>> Those are definitely suppressed by my change.
>>>
>>> Do you have any idea in the code where io_uring code is being called in
>>> an interrupt context?  I would really like to trace that code path so I
>>> have a better grasp on what is happening.
>>>
>>> If task_work_add is being called from interrupt context then something
>>> additional from what I have proposed certainly needs to be done.
>>
>> task_work may come from the helper threads, but generally it does not.
>> One example would be doing a read from a socket. There's no data there,
>> poll is armed to trigger a retry. When we get the poll notification that
>> there's now data to be read, then we kick that off with task_work. Since
>> it's from the poll handler, it can trigger from interrupt context. See
>> the path from io_uring/poll.c:io_poll_wake() -> __io_poll_execute() ->
>> io_req_task_work_add() -> task_work_add().
> 
> But that is a task_work to the helper thread correct?

No, it goes to the task that originally allocated/issued the request.
Which would be the original task, unless the request was originally
marked as going straight to a helper (IOSQE_ASYNC). For most cases,
it'll be the original task, not a helper thread.

>> It can also happen for regular IRQ based reads from regular files, where
>> the completion is actually done via task_work added from the potentially
>> IRQ based completion path.
> 
> I can see that.
> 
> Which leaves me with the question do these task_work's directly wake up
> the thread that submitted the I/O request?   Or is there likely to be
> something for an I/O thread to do before an ordinary program thread is
> notified.
> 
> I am asking because it is only the case of notifying ordinary program
> threads that is interesting in the case of a coredump.  As I understand
> it a data to read notification would typically be handled by the I/O
> uring worker thread to trigger reading the data before letting userspace
> know everything it asked to be done is complete.

By default, it'll go back to the original task. If something is
pollable, then there's no helper thread doing the request. An issue
attempt is performed by the original task, there's no data/space there,
poll is armed to trigger a retry. Retry notification will then queue
task_work with the original task to retry.

Generally for io_uring, helper threads are a slow path and aren't used
unless we have no other options. For example, if someone has a network
IO backend and there's helper thread activity, then that's generally a
sign that something is wrong. This isn't super relevant to this
particular topic, just highlighting that normally you would not expect
to see much (if any) io-wq activity at all.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2022-08-24 15:51 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <192c9697e379bf084636a8213108be6c3b948d0b.camel@trillion01.com>
     [not found] ` <9692dbb420eef43a9775f425cb8f6f33c9ba2db9.camel@trillion01.com>
     [not found]   ` <87h7i694ij.fsf_-_@disp2133>
2021-06-09 20:33     ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Linus Torvalds
2021-06-09 20:48       ` Eric W. Biederman
2021-06-09 20:52         ` Linus Torvalds
2021-06-09 21:02       ` Olivier Langlois
2021-06-09 21:05         ` Eric W. Biederman
2021-06-09 21:26           ` Olivier Langlois
2021-06-09 21:56             ` Olivier Langlois
2021-06-10 14:26             ` Eric W. Biederman
2021-06-10 15:17               ` Olivier Langlois
2021-06-10 18:58               ` [CFT}[PATCH] coredump: Limit what can interrupt coredumps Eric W. Biederman
2021-06-10 19:10                 ` Linus Torvalds
2021-06-10 19:18                   ` Eric W. Biederman
2021-06-10 19:50                     ` Linus Torvalds
2021-06-10 20:11                       ` [PATCH] " Eric W. Biederman
2021-06-10 21:04                         ` Linus Torvalds
2021-06-12 14:36                         ` Olivier Langlois
2021-06-12 16:26                           ` Jens Axboe
2021-06-14 14:10                         ` Oleg Nesterov
2021-06-14 16:37                           ` Eric W. Biederman
2021-06-14 16:59                             ` Oleg Nesterov
2021-06-15 22:08                           ` Eric W. Biederman
2021-06-16 19:23                             ` Olivier Langlois
2021-06-16 20:00                               ` Eric W. Biederman
2021-06-18 20:05                                 ` Olivier Langlois
2021-08-05 13:06                             ` Olivier Langlois
2021-08-10 21:48                               ` Tony Battersby
2021-08-11 20:47                                 ` Olivier Langlois
2021-08-12  1:55                                 ` Jens Axboe
2021-08-12 13:53                                   ` Tony Battersby
2021-08-15 20:42                                   ` Olivier Langlois
2021-08-16 13:02                                     ` Pavel Begunkov
2021-08-16 13:06                                       ` Pavel Begunkov
2021-08-17 18:15                                     ` Jens Axboe
2021-08-17 18:24                                       ` Jens Axboe
2021-08-17 19:29                                         ` Tony Battersby
2021-08-17 19:59                                           ` Jens Axboe
2021-08-17 21:28                                             ` Jens Axboe
2021-08-17 21:39                                               ` Tony Battersby
2021-08-17 22:05                                                 ` Jens Axboe
2021-08-18 14:37                                                   ` Tony Battersby
2021-08-18 14:46                                                     ` Jens Axboe
2021-08-18  2:57                                               ` Jens Axboe
2021-08-18  2:58                                                 ` Jens Axboe
2021-08-21 10:08                                                 ` Olivier Langlois
2021-08-21 16:47                                                   ` Olivier Langlois
2021-08-21 16:51                                                     ` Jens Axboe
2021-08-21 17:21                                                       ` Olivier Langlois
2021-08-21  9:52                                         ` Olivier Langlois
2021-08-21  9:48                                       ` Olivier Langlois
2021-10-22 14:13     ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Pavel Begunkov
2021-12-24  1:34       ` Olivier Langlois
2021-12-24 10:37         ` Pavel Begunkov
2021-12-24 19:52           ` Eric W. Biederman
2021-12-28 11:24             ` Pavel Begunkov
2022-03-14 23:58               ` Eric W. Biederman
     [not found]                 ` <8218f1a245d054c940e25142fd00a5f17238d078.camel@trillion01.com>
2022-06-01  3:15                   ` Jens Axboe
2022-07-20 16:49                     ` [PATCH 0/2] coredump: Allow io_uring using apps to dump to pipes Eric W. Biederman
2022-07-20 16:50                       ` [PATCH 1/2] signal: Move stopping for the coredump from do_exit into get_signal Eric W. Biederman
2022-07-20 16:51                       ` [PATCH 2/2] coredump: Allow coredumps to pipes to work with io_uring Eric W. Biederman
2022-08-22 21:16                         ` Olivier Langlois
2022-08-23  3:35                           ` Olivier Langlois
2022-08-23 18:22                             ` Eric W. Biederman
2022-08-23 18:27                               ` Jens Axboe
2022-08-24 15:11                                 ` Eric W. Biederman
2022-08-24 15:51                                   ` Jens Axboe
2022-01-05 19:39           ` [RFC] coredump: Do not interrupt dump for TIF_NOTIFY_SIGNAL Olivier Langlois

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).