* [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread
@ 2016-09-21 15:43 Roman Pen
  2016-09-21 15:43 ` [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead Roman Pen
  2016-10-20 23:07 ` [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread Andy Lutomirski
  0 siblings, 2 replies; 9+ messages in thread
From: Roman Pen @ 2016-09-21 15:43 UTC (permalink / raw)
  Cc: Roman Pen, Andy Lutomirski, Josh Poimboeuf, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Tejun Heo, x86, linux-kernel

A kthread keeps a completion structure on its own stack; task->vfork_done
points at it and is completed to wake up whoever waits for the kthread to
exit.

Since commit 2deb4be28 by Andy Lutomirski the stack is rewound
unconditionally on oops, so a later completion of task->vfork_done for any
kthread leads to stack corruption (or an endless spin while trying to take
a spinlock that now lives in garbage memory).
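
For reference, a rough sketch of the on-stack completion pattern described
above, approximated from the v4.8-era kthread() code (names and details may
differ slightly):

  static int kthread(void *_create)
  {
          struct kthread self;    /* lives in this frame, on the kthread's own stack */
          int ret;
          ...
          init_completion(&self.exited);
          init_completion(&self.parked);
          current->vfork_done = &self.exited;
          ...
          /*
           * On exit, do_exit() -> exit_mm() -> mm_release() ->
           * complete_vfork_done() completes self.exited, i.e. writes
           * into this stack frame.
           */
          do_exit(ret);
  }

Once oops_end() has rewound the stack, the subsequent do_exit() call chain
reuses the top of the stack, which is exactly where kthread()'s frame (and
therefore self.exited) lives, so completing vfork_done then operates on
clobbered memory.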

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/dumpstack.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index e0648f7..74be764 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -250,9 +250,14 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 	/*
 	 * We're not going to return, but we might be on an IST stack or
 	 * have very little stack space left.  Rewind the stack and kill
-	 * the task.
+	 * the task.  But a kthread is a special case: it keeps a completion
+	 * structure on its stack which is completed via vfork_done on exit,
+	 * so its stack must not be rewound.
 	 */
-	rewind_stack_do_exit(signr);
+	if (current->flags & PF_KTHREAD)
+		do_exit(signr);
+	else
+		rewind_stack_do_exit(signr);
 }
 NOKPROBE_SYMBOL(oops_end);
 
-- 
2.9.3


* [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
  2016-09-21 15:43 [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread Roman Pen
@ 2016-09-21 15:43 ` Roman Pen
  2016-10-20 23:08   ` Andy Lutomirski
  2016-10-21  5:39   ` Peter Zijlstra
  2016-10-20 23:07 ` [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread Andy Lutomirski
  1 sibling, 2 replies; 9+ messages in thread
From: Roman Pen @ 2016-09-21 15:43 UTC (permalink / raw)
  Cc: Roman Pen, Andy Lutomirski, Josh Poimboeuf, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Tejun Heo, linux-kernel

If panic_on_oops is not set and an oops happens inside a workqueue
kthread, the kernel kills that kthread.  This patch fixes the recursive
GPF which happens in that case with the following stack:

  [<ffffffff81397f75>] dump_stack+0x68/0x93
  [<ffffffff8106954b>] ? do_exit+0x7ab/0xc10
  [<ffffffff8108fd73>] __schedule_bug+0x83/0xe0
  [<ffffffff81716d5a>] __schedule+0x7ea/0xba0
  [<ffffffff810c864f>] ? vprintk_default+0x1f/0x30
  [<ffffffff8116a63c>] ? printk+0x48/0x50
  [<ffffffff81717150>] schedule+0x40/0x90
  [<ffffffff8106976a>] do_exit+0x9ca/0xc10
  [<ffffffff810c8e3d>] ? kmsg_dump+0x11d/0x190
  [<ffffffff810c8d37>] ? kmsg_dump+0x17/0x190
  [<ffffffff81021ee9>] oops_end+0x99/0xd0
  [<ffffffff81052da5>] no_context+0x185/0x3e0
  [<ffffffff81053083>] __bad_area_nosemaphore+0x83/0x1c0
  [<ffffffff810c820e>] ? vprintk_emit+0x25e/0x530
  [<ffffffff810531d4>] bad_area_nosemaphore+0x14/0x20
  [<ffffffff8105355c>] __do_page_fault+0xac/0x570
  [<ffffffff810c66fe>] ? console_trylock+0x1e/0xe0
  [<ffffffff81002036>] ? trace_hardirqs_off_thunk+0x1a/0x1c
  [<ffffffff81053a2c>] do_page_fault+0xc/0x10
  [<ffffffff8171f812>] page_fault+0x22/0x30
  [<ffffffff81089bc3>] ? kthread_data+0x33/0x40
  [<ffffffff8108427e>] ? wq_worker_sleeping+0xe/0x80
  [<ffffffff817169eb>] __schedule+0x47b/0xba0
  [<ffffffff81717150>] schedule+0x40/0x90
  [<ffffffff8106957d>] do_exit+0x7dd/0xc10
  [<ffffffff81021ee9>] oops_end+0x99/0xd0

The root cause is that the already-zeroed task->vfork_done member is
accessed from the wq_worker_sleeping() hook.  The zeroing happens on the
following path:

   oops_end()
   do_exit()
   exit_mm()
   mm_release()
   complete_vfork_done()

To fix the bug, dead tasks must be ignored.
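
For context, a rough sketch of the access path that faults, approximated
from the v4.8-era kthread/workqueue code (exact definitions may differ):

  /* kthread.c: the kthread descriptor is reached through vfork_done */
  static struct kthread *to_kthread(struct task_struct *k)
  {
          return container_of(k->vfork_done, struct kthread, exited);
  }

  void *kthread_data(struct task_struct *task)
  {
          return to_kthread(task)->data;  /* faults once vfork_done is NULL */
  }

  /* workqueue.c */
  struct task_struct *wq_worker_sleeping(struct task_struct *task)
  {
          struct worker *worker = kthread_data(task);
          ...
  }

Once complete_vfork_done() has set task->vfork_done to NULL, container_of()
yields a bogus pointer, the ->data dereference in kthread_data() faults, and
the fault handler calls do_exit() and __schedule() again, which produces the
recursive GPF shown above.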

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2c303e7..50772e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt)
 			 * If a worker went to sleep, notify and ask workqueue
 			 * whether it wants to wake up a task to maintain
 			 * concurrency.
+			 *
+			 * Also the following stack is possible:
+			 *    oops_end()
+			 *    do_exit()
+			 *    schedule()
+			 *
+			 * If panic_on_oops is not set and an oops happens on
+			 * a workqueue execution path, the thread will be
+			 * killed.  That is definitely sad, but to avoid making
+			 * things even worse we have to ignore dead tasks so
+			 * that we do not step on zeroed-out members (e.g.
+			 * t->vfork_done is already NULL on that path, since
+			 * we were called from do_exit()).
 			 */
-			if (prev->flags & PF_WQ_WORKER) {
+			if (prev->flags & PF_WQ_WORKER &&
+			    prev->state != TASK_DEAD) {
 				struct task_struct *to_wakeup;
 
 				to_wakeup = wq_worker_sleeping(prev);
-- 
2.9.3


* Re: [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread
  2016-09-21 15:43 [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread Roman Pen
  2016-09-21 15:43 ` [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead Roman Pen
@ 2016-10-20 23:07 ` Andy Lutomirski
  2016-10-21  5:56   ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2016-10-20 23:07 UTC (permalink / raw)
  To: Roman Pen
  Cc: Andy Lutomirski, Josh Poimboeuf, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Tejun Heo, X86 ML, linux-kernel

On Wed, Sep 21, 2016 at 8:43 AM, Roman Pen
<roman.penyaev@profitbricks.com> wrote:
> A kthread keeps a completion structure on its own stack; task->vfork_done
> points at it and is completed to wake up whoever waits for the kthread to
> exit.
>
> Since commit 2deb4be28 by Andy Lutomirski the stack is rewound
> unconditionally on oops, so a later completion of task->vfork_done for any
> kthread leads to stack corruption (or an endless spin while trying to take
> a spinlock that now lives in garbage memory).

This is sort of okay, but it will blow up pretty badly if a kthread
overflows its stack.  Would it make more sense to change
rewind_stack_do_exit() to leave a big enough gap at the top of the
stack to avoid clobbering the completion?

--Andy


* Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
  2016-09-21 15:43 ` [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead Roman Pen
@ 2016-10-20 23:08   ` Andy Lutomirski
  2016-10-21 15:47     ` Oleg Nesterov
  2016-10-21  5:39   ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2016-10-20 23:08 UTC (permalink / raw)
  To: Roman Pen, Oleg Nesterov
  Cc: Andy Lutomirski, Josh Poimboeuf, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Tejun Heo, linux-kernel

On Wed, Sep 21, 2016 at 8:43 AM, Roman Pen
<roman.penyaev@profitbricks.com> wrote:
> If panic_on_oops is not set and an oops happens inside a workqueue
> kthread, the kernel kills that kthread.  This patch fixes the recursive
> GPF which happens in that case with the following stack:

Oleg, can you take a look at this?

--Andy

>
>   [<ffffffff81397f75>] dump_stack+0x68/0x93
>   [<ffffffff8106954b>] ? do_exit+0x7ab/0xc10
>   [<ffffffff8108fd73>] __schedule_bug+0x83/0xe0
>   [<ffffffff81716d5a>] __schedule+0x7ea/0xba0
>   [<ffffffff810c864f>] ? vprintk_default+0x1f/0x30
>   [<ffffffff8116a63c>] ? printk+0x48/0x50
>   [<ffffffff81717150>] schedule+0x40/0x90
>   [<ffffffff8106976a>] do_exit+0x9ca/0xc10
>   [<ffffffff810c8e3d>] ? kmsg_dump+0x11d/0x190
>   [<ffffffff810c8d37>] ? kmsg_dump+0x17/0x190
>   [<ffffffff81021ee9>] oops_end+0x99/0xd0
>   [<ffffffff81052da5>] no_context+0x185/0x3e0
>   [<ffffffff81053083>] __bad_area_nosemaphore+0x83/0x1c0
>   [<ffffffff810c820e>] ? vprintk_emit+0x25e/0x530
>   [<ffffffff810531d4>] bad_area_nosemaphore+0x14/0x20
>   [<ffffffff8105355c>] __do_page_fault+0xac/0x570
>   [<ffffffff810c66fe>] ? console_trylock+0x1e/0xe0
>   [<ffffffff81002036>] ? trace_hardirqs_off_thunk+0x1a/0x1c
>   [<ffffffff81053a2c>] do_page_fault+0xc/0x10
>   [<ffffffff8171f812>] page_fault+0x22/0x30
>   [<ffffffff81089bc3>] ? kthread_data+0x33/0x40
>   [<ffffffff8108427e>] ? wq_worker_sleeping+0xe/0x80
>   [<ffffffff817169eb>] __schedule+0x47b/0xba0
>   [<ffffffff81717150>] schedule+0x40/0x90
>   [<ffffffff8106957d>] do_exit+0x7dd/0xc10
>   [<ffffffff81021ee9>] oops_end+0x99/0xd0
>
> The root cause is that the already-zeroed task->vfork_done member is
> accessed from the wq_worker_sleeping() hook.  The zeroing happens on the
> following path:
>
>    oops_end()
>    do_exit()
>    exit_mm()
>    mm_release()
>    complete_vfork_done()
>
> To fix the bug, dead tasks must be ignored.
>
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> ---
>  kernel/sched/core.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2c303e7..50772e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt)
>                          * If a worker went to sleep, notify and ask workqueue
>                          * whether it wants to wake up a task to maintain
>                          * concurrency.
> +                        *
> +                        * Also the following stack is possible:
> +                        *    oops_end()
> +                        *    do_exit()
> +                        *    schedule()
> +                        *
> +                        * If panic_on_oops is not set and an oops happens on
> +                        * a workqueue execution path, the thread will be
> +                        * killed.  That is definitely sad, but to avoid making
> +                        * things even worse we have to ignore dead tasks so
> +                        * that we do not step on zeroed-out members (e.g.
> +                        * t->vfork_done is already NULL on that path, since
> +                        * we were called from do_exit()).
>                          */
> -                       if (prev->flags & PF_WQ_WORKER) {
> +                       if (prev->flags & PF_WQ_WORKER &&
> +                           prev->state != TASK_DEAD) {
>                                 struct task_struct *to_wakeup;
>
>                                 to_wakeup = wq_worker_sleeping(prev);
> --
> 2.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
  2016-09-21 15:43 ` [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead Roman Pen
  2016-10-20 23:08   ` Andy Lutomirski
@ 2016-10-21  5:39   ` Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2016-10-21  5:39 UTC (permalink / raw)
  To: Roman Pen
  Cc: Andy Lutomirski, Josh Poimboeuf, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Tejun Heo, linux-kernel

On Wed, Sep 21, 2016 at 05:43:50PM +0200, Roman Pen wrote:
> If panic_on_oops is not set and an oops happens inside a workqueue
> kthread, the kernel kills that kthread.  This patch fixes the recursive
> GPF which happens in that case with the following stack:


> The root cause is that the already-zeroed task->vfork_done member is
> accessed from the wq_worker_sleeping() hook.

This is the kthread_data() -> to_kthread() thing? Could've done with
spelling out, now you had me searching all over :/


Urgh what a mess..


* Re: [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread
  2016-10-20 23:07 ` [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread Andy Lutomirski
@ 2016-10-21  5:56   ` Peter Zijlstra
  2016-10-21  8:05     ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-10-21  5:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Roman Pen, Andy Lutomirski, Josh Poimboeuf, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Tejun Heo, X86 ML, linux-kernel

On Thu, Oct 20, 2016 at 04:07:28PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 21, 2016 at 8:43 AM, Roman Pen
> <roman.penyaev@profitbricks.com> wrote:
> > A kthread keeps a completion structure on its own stack; task->vfork_done
> > points at it and is completed to wake up whoever waits for the kthread to
> > exit.
> >
> > Since commit 2deb4be28 by Andy Lutomirski the stack is rewound
> > unconditionally on oops, so a later completion of task->vfork_done for any
> > kthread leads to stack corruption (or an endless spin while trying to take
> > a spinlock that now lives in garbage memory).
> 
> This is sort of okay, but it will blow up pretty badly if a kthread
> overflows its stack.  Would it make more sense to change
> rewind_stack_do_exit() to leave a big enough gap at the top of the
> stack to avoid clobbering the completion?

We need to preserve the entire struct kthread on the stack; kthread just
abuses that pointer to stash an on-stack kthread descriptor. See
kthread():

  current->vfork_done = &self.exited;

It's a horrible, horrible thing kthread does. I suppose there might have
been some intent behind keeping that exited completion last in the
structure, but *shudder*.
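
For reference, the on-stack descriptor being discussed looked roughly like
this around v4.8 (layout approximate):

  struct kthread {
          unsigned long flags;
          unsigned int cpu;
          void *data;
          struct completion parked;
          struct completion exited;       /* current->vfork_done points here */
  };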

But yes, leaving enough stack to not clobber that might keep this horror
show working.

ISTR talk about alternative schemes for this a long time ago, but I
cannot recall :-(


* Re: [PATCH 1/2] x86/dumpstack: on oops do not rewind stack for kthread
  2016-10-21  5:56   ` Peter Zijlstra
@ 2016-10-21  8:05     ` Thomas Gleixner
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2016-10-21  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Roman Pen, Andy Lutomirski, Josh Poimboeuf,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H . Peter Anvin,
	Ingo Molnar, Tejun Heo, X86 ML, linux-kernel

On Fri, 21 Oct 2016, Peter Zijlstra wrote:
> We need to preserve the entire struct kthread on the stack; kthread just
> abuses that pointer to stash an on-stack kthread descriptor. See
> kthread():
> 
>   current->vfork_done = &self.exited;
> 
> It's a horrible, horrible thing kthread does. I suppose there might have
> been some intent behind keeping that exited completion last in the
> structure, but *shudder*.
> 
> But yes, leaving enough stack to not clobber that might keep this horror
> show working.
> 
> ISTR talk about alternative schemes for this a long time ago, but I
> cannot recall :-(

The simplest solution would be to stick struct kthread into task_struct,
but that's bloat.

But we can allocate it separately along with kthread_create_info. That's
pretty straightforward.
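
A minimal sketch of that idea (hypothetical, not an actual patch): allocate
the descriptor in kthread() instead of declaring it on the stack:

  static int kthread(void *_create)
  {
          struct kthread *self;
          ...
          self = kzalloc(sizeof(*self), GFP_KERNEL);
          if (!self)
                  do_exit(-ENOMEM);

          init_completion(&self->exited);
          init_completion(&self->parked);
          current->vfork_done = &self->exited;
          ...
  }

with to_kthread()/kthread_data() adjusted accordingly and the descriptor
freed by whoever consumes the exited completion, once nobody can reach it
any more.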

Thanks,

	tglx


* Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
  2016-10-20 23:08   ` Andy Lutomirski
@ 2016-10-21 15:47     ` Oleg Nesterov
  2016-10-24 16:01       ` Roman Penyaev
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2016-10-21 15:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Roman Pen, Andy Lutomirski, Josh Poimboeuf, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Tejun Heo, linux-kernel

On 10/20, Andy Lutomirski wrote:
>
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt)
> >                          * If a worker went to sleep, notify and ask workqueue
> >                          * whether it wants to wake up a task to maintain
> >                          * concurrency.
> > +                        *
> > +                        * Also the following stack is possible:
> > +                        *    oops_end()
> > +                        *    do_exit()
> > +                        *    schedule()
> > +                        *
> > +                        * If panic_on_oops is not set and an oops happens on
> > +                        * a workqueue execution path, the thread will be
> > +                        * killed.  That is definitely sad, but to avoid making
> > +                        * things even worse we have to ignore dead tasks so
> > +                        * that we do not step on zeroed-out members (e.g.
> > +                        * t->vfork_done is already NULL on that path, since
> > +                        * we were called from do_exit()).

And we have more problems like this. Say, if blk_flush_plug_list()
crashes it will likely crash again and again recursively.

> >                          */
> > -                       if (prev->flags & PF_WQ_WORKER) {
> > +                       if (prev->flags & PF_WQ_WORKER &&
> > +                           prev->state != TASK_DEAD) {

I don't think we should change __schedule()... Can't we simply clear
PF_WQ_WORKER in complete_vfork_done() ? Or add the PF_EXITING checks
into wq_worker_sleeping() and wq_worker_waking_up().

Or perhaps something like the change below.

Oleg.

--- x/kernel/workqueue.c
+++ x/kernel/workqueue.c
@@ -2157,6 +2157,14 @@ static void process_scheduled_works(stru
 	}
 }
 
+static void oops_handler(struct callback_head *oops_work)
+{
+	if (!(current->flags & PF_WQ_WORKER))
+		return;
+
+	current->flags &= ~PF_WQ_WORKER;	/* probably do more cleanups */
+}
+
 /**
  * worker_thread - the worker thread function
  * @__worker: self
@@ -2171,11 +2179,14 @@ static void process_scheduled_works(stru
  */
 static int worker_thread(void *__worker)
 {
+	struct callback_head oops_work;
 	struct worker *worker = __worker;
 	struct worker_pool *pool = worker->pool;
 
 	/* tell the scheduler that this is a workqueue worker */
 	worker->task->flags |= PF_WQ_WORKER;
+	init_task_work(&oops_work, oops_handler);
+	task_work_add(current, &oops_work, false);
 woke_up:
 	spin_lock_irq(&pool->lock);
 


* Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
  2016-10-21 15:47     ` Oleg Nesterov
@ 2016-10-24 16:01       ` Roman Penyaev
  0 siblings, 0 replies; 9+ messages in thread
From: Roman Penyaev @ 2016-10-24 16:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Andy Lutomirski, Josh Poimboeuf,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, H . Peter Anvin,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	linux-kernel

On Fri, Oct 21, 2016 at 5:47 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 10/20, Andy Lutomirski wrote:
>>
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt)
>> >                          * If a worker went to sleep, notify and ask workqueue
>> >                          * whether it wants to wake up a task to maintain
>> >                          * concurrency.
>> > +                        *
>> > +                        * Also the following stack is possible:
>> > +                        *    oops_end()
>> > +                        *    do_exit()
>> > +                        *    schedule()
>> > +                        *
>> > +                        * If panic_on_oops is not set and an oops happens on
>> > +                        * a workqueue execution path, the thread will be
>> > +                        * killed.  That is definitely sad, but to avoid making
>> > +                        * things even worse we have to ignore dead tasks so
>> > +                        * that we do not step on zeroed-out members (e.g.
>> > +                        * t->vfork_done is already NULL on that path, since
>> > +                        * we were called from do_exit()).
>
> And we have more problems like this. Say, if blk_flush_plug_list()
> crashes it will likely crash again and again recursively.

I will send a patch if I reproduce it :)

>
>> >                          */
>> > -                       if (prev->flags & PF_WQ_WORKER) {
>> > +                       if (prev->flags & PF_WQ_WORKER &&
>> > +                           prev->state != TASK_DEAD) {
>
> I don't think we should change __schedule()... Can't we simply clear
> PF_WQ_WORKER in complete_vfork_done() ? Or add the PF_EXITING checks
> into wq_worker_sleeping() and wq_worker_waking_up().

Yeah, handling this corner case in the wq_worker_sleeping() function is
probably much better.

>
> Or perhaps something like the change below.

That's nice stuff, thanks Oleg. I simply did not know about these
callbacks.

But the big problem is that after commit 2deb4be28 by Andy Lutomirski
we cannot rely on anything kept on the stack once we are already in
do_exit().  And putting this callback head inside the worker structure is
bloat.  I will resend this with a simple task state check in
wq_worker_sleeping().
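
A rough sketch of what such a check could look like (hypothetical, against
the v4.8-era wq_worker_sleeping() signature):

  struct task_struct *wq_worker_sleeping(struct task_struct *task)
  {
          struct worker *worker;

          /*
           * A task killed from oops_end() has already run
           * complete_vfork_done(), so vfork_done is NULL and
           * kthread_data() below would dereference garbage.
           */
          if (task->state == TASK_DEAD)
                  return NULL;

          worker = kthread_data(task);
          /* ... existing logic unchanged ... */
  }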

--
Roman


>
> Oleg.
>
> --- x/kernel/workqueue.c
> +++ x/kernel/workqueue.c
> @@ -2157,6 +2157,14 @@ static void process_scheduled_works(stru
>         }
>  }
>
> +static void oops_handler(struct callback_head *oops_work)
> +{
> +       if (!(current->flags & PF_WQ_WORKER))
> +               return;
> +
> +       current->flags &= ~PF_WQ_WORKER;        /* probably do more cleanups */
> +}
> +
>  /**
>   * worker_thread - the worker thread function
>   * @__worker: self
> @@ -2171,11 +2179,14 @@ static void process_scheduled_works(stru
>   */
>  static int worker_thread(void *__worker)
>  {
> +       struct callback_head oops_work;
>         struct worker *worker = __worker;
>         struct worker_pool *pool = worker->pool;
>
>         /* tell the scheduler that this is a workqueue worker */
>         worker->task->flags |= PF_WQ_WORKER;
> +       init_task_work(&oops_work, oops_handler);
> +       task_work_add(current, &oops_work, false);
>  woke_up:
>         spin_lock_irq(&pool->lock);
>
>

