All of lore.kernel.org
 help / color / mirror / Atom feed
* [v3.13][v3.14][Regression] kthread: make kthread_create() killable
@ 2014-03-14 20:46 Joseph Salisbury
  2014-03-15  0:43 ` Tetsuo Handa
  2014-03-17 20:02 ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable Andrew Morton
  0 siblings, 2 replies; 41+ messages in thread
From: Joseph Salisbury @ 2014-03-14 20:46 UTC (permalink / raw)
  To: penguin-kernel
  Cc: Oleg Nesterov, rientjes, akpm, Linus Torvalds, tj,
	Thomas Gleixner, LKML, Kernel Team

Hi Tetsuo,

A kernel bug report was opened against Ubuntu[0].  We performed a kernel
bisect, and found that reverting the following commit resolved this bug:


commit 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Tue Nov 12 15:06:45 2013 -0800

    kthread: make kthread_create() killable

The regression was introduced as of v3.13-rc1.

The bug indicates an issue with the SAS controller during
initialization, which prevents the system from booting.  Additional
details are available in the bug report or on request.

I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?


Thanks,

Joe

[0] http://pad.lv/1276705

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-14 20:46 [v3.13][v3.14][Regression] kthread: make kthread_create() killable Joseph Salisbury
@ 2014-03-15  0:43 ` Tetsuo Handa
  2014-03-16 15:13   ` Tetsuo Handa
  2014-03-17 20:02 ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable Andrew Morton
  1 sibling, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-15  0:43 UTC (permalink / raw)
  To: joseph.salisbury, Nagalakshmi.Nandigama, Sreekanth.Reddy
  Cc: oleg, rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team

Joseph Salisbury wrote:
> A kernel bug report was opened against Ubuntu[0].  We performed a kernel
> bisect, and found that reverting the following commit resolved this bug:

I added a comment to that bug report.

This commit by chance revealed incorrect error handling of mptsas_probe() or
mptscsih_remove().

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-15  0:43 ` Tetsuo Handa
@ 2014-03-16 15:13   ` Tetsuo Handa
  2014-03-16 16:25     ` Oleg Nesterov
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-16 15:13 UTC (permalink / raw)
  To: oleg, rientjes, akpm
  Cc: joseph.salisbury, torvalds, tj, tglx, linux-kernel, kernel-team

Hello.

Although the process who is sending SIGKILL to systemd-udevd is not
identified yet, adding wait_for_completion_timeout() can solve this
regression. Therefore, I'd like to propose this patch for 3.14-final
and 3.13-stable.

Regards.
----------
>From c6562e5d774dd1f36724197dbcb8976cccfaab53 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 16 Mar 2014 22:44:23 +0900
Subject: [PATCH] kthread: Do not leave kthread_create() immediately upon SIGKILL.

Commit 786235ee "kthread: make kthread_create() killable" changed to
leave kthread_create() as soon as receiving SIGKILL. But this change
caused boot failures if systemd-udevd received SIGKILL (probably due
to timeout) while loading SCSI controller drivers using
finit_module() [1].

Therefore, this patch changes kthread_create() to wait for 10 more
seconds after receiving SIGKILL, unless chosen by the OOM killer,
in order to give the kthreadd a chance to complete the request.

  [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: <stable@vger.kernel.org> [3.13+]
---
 kernel/kthread.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index b5ae3ee..52ae7dc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -292,6 +292,17 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 	 * new kernel thread.
 	 */
 	if (unlikely(wait_for_completion_killable(&done))) {
+		int i = 0;
+
+		/*
+		 * I got SIGKILL, but wait for 10 more seconds for completion
+		 * unless chosen by the OOM killer. This delay is there as a
+		 * workaround for boot failure caused by SIGKILL upon device
+		 * driver initialization timeout.
+		 */
+		while (i++ < 10 && !test_tsk_thread_flag(current, TIF_MEMDIE))
+			if (wait_for_completion_timeout(&done, HZ))
+				goto ready;
 		/*
 		 * If I was SIGKILLed before kthreadd (or new kernel thread)
 		 * calls complete(), leave the cleanup of this structure to
@@ -305,6 +316,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 		 */
 		wait_for_completion(&done);
 	}
+ready:
 	task = create->result;
 	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-16 15:13   ` Tetsuo Handa
@ 2014-03-16 16:25     ` Oleg Nesterov
  2014-03-17 12:38       ` [v3.13][v3.14][Regression] kthread: make kthread_create()killable Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-16 16:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, joseph.salisbury, torvalds, tj, tglx,
	linux-kernel, kernel-team

On 03/17, Tetsuo Handa wrote:
>
> Therefore, I'd like to propose this patch for 3.14-final
> and 3.13-stable.

Well, I disagree. To me, the patch tries to fix the problem in the wrong
place,

> Commit 786235ee "kthread: make kthread_create() killable" changed to
> leave kthread_create() as soon as receiving SIGKILL. But this change
> caused boot failures if systemd-udevd received SIGKILL (probably due
> to timeout) while loading SCSI controller drivers using
> finit_module() [1].

Shouldn't we fix the caller instead? It should handle the error from
kthread_create() correctly.

And could you tell who is the caller which doesn't do this? If it can't
be fixed, then, say, it can use workqueue to create a kernel thread.

> @@ -292,6 +292,17 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  	 * new kernel thread.
>  	 */
>  	if (unlikely(wait_for_completion_killable(&done))) {
> +		int i = 0;
> +
> +		/*
> +		 * I got SIGKILL, but wait for 10 more seconds for completion
> +		 * unless chosen by the OOM killer. This delay is there as a
> +		 * workaround for boot failure caused by SIGKILL upon device
> +		 * driver initialization timeout.
> +		 */
> +		while (i++ < 10 && !test_tsk_thread_flag(current, TIF_MEMDIE))
> +			if (wait_for_completion_timeout(&done, HZ))
> +				goto ready;

Personally I really dislike this hack. And btw, why we return -ENOMEM if
SIGKILL'ed? Why not EINTR ?

If nothing else we can change the caller to do

	for (;;) {
		kthread = kthread_create(...);
		if (!IS_ERR(kthread) || PTR_ERR(kthread) != -EINTR)
			break;
		// FIXME, I am stupid and can't handle SIGKILL properly
		clear_thread_flag(TIF_SIGPENDING);
	}
	recalc_sigpending();

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create()killable
  2014-03-16 16:25     ` Oleg Nesterov
@ 2014-03-17 12:38       ` Tetsuo Handa
  2014-03-17 14:22         ` Oleg Nesterov
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-17 12:38 UTC (permalink / raw)
  To: oleg
  Cc: rientjes, akpm, joseph.salisbury, torvalds, tj, tglx,
	linux-kernel, kernel-team

Oleg Nesterov wrote:
> > @@ -292,6 +292,17 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
> >  	 * new kernel thread.
> >  	 */
> >  	if (unlikely(wait_for_completion_killable(&done))) {
> > +		int i = 0;
> > +
> > +		/*
> > +		 * I got SIGKILL, but wait for 10 more seconds for completion
> > +		 * unless chosen by the OOM killer. This delay is there as a
> > +		 * workaround for boot failure caused by SIGKILL upon device
> > +		 * driver initialization timeout.
> > +		 */
> > +		while (i++ < 10 && !test_tsk_thread_flag(current, TIF_MEMDIE))
> > +			if (wait_for_completion_timeout(&done, HZ))
> > +				goto ready;
> 
> Personally I really dislike this hack. And btw, why we return -ENOMEM if
> SIGKILL'ed? Why not EINTR ?

I chose -ENOMEM because -ENOMEM looked better for conveying that current thread
was SIGKILLed by the OOM killer in order to solve no memory state. But I forgot
that -ENOMEM will not convey the reason properly if current thread was
SIGKILLed by other than the OOM killer. Maybe

  return test_tsk_thread_flag(current, TIF_MEMDIE) ? -ENOMEM : -EINTR;

rather than

  return -ENOMEM;

?

> > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > leave kthread_create() as soon as receiving SIGKILL. But this change
> > caused boot failures if systemd-udevd received SIGKILL (probably due
> > to timeout) while loading SCSI controller drivers using
> > finit_module() [1].
> 
> Shouldn't we fix the caller instead? It should handle the error from
> kthread_create() correctly.
> 
> And could you tell who is the caller which doesn't do this? If it can't
> be fixed, then, say, it can use workqueue to create a kernel thread.
> 

There are many callers. One of them is scsi_host_alloc() which is called
upon bootup in order to recognize SCSI storage devices.

To my surprise, systemd-udevd process sends SIGKILL to worker systemd-udevd
processes if they did not complete their jobs within 30 seconds. On some
machines, it takes more than 30 seconds to recognize SCSI storage devices.

On such machines, scsi_host_alloc() is called after the worker process
received SIGKILL. Since commit 786235ee "kthread: make kthread_create()
killable" broke all callers of kthread_create() who had been able to survive
SIGKILL, I think fixing this regression at kthread_create() is the appropriate
response.

Given that said, which one do we prefer?

  (a) Wait for completion forever after receiving SIGKILL, unless chosen
      by the OOM killer.

  (b) Wait for completion for only limited duration after receiving SIGKILL.

This patch is (b) which waits for only 10 seconds after receiving SIGKILL.
(a) will change "kthread: make kthread_create() killable" to
"kthread: allow the OOM-killer to kill kthread_create()".

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create()killable
  2014-03-17 12:38       ` [v3.13][v3.14][Regression] kthread: make kthread_create()killable Tetsuo Handa
@ 2014-03-17 14:22         ` Oleg Nesterov
  2014-03-18 12:03           ` [v3.13][v3.14][Regression] kthread: makekthread_create()killable Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-17 14:22 UTC (permalink / raw)
  To: Tetsuo Handa, James E.J. Bottomley, Nagalakshmi Nandigama,
	Sreekanth Reddy
  Cc: rientjes, akpm, joseph.salisbury, torvalds, tj, tglx,
	linux-kernel, kernel-team, linux-scsi

On 03/17, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> >
> > Personally I really dislike this hack. And btw, why we return -ENOMEM if
> > SIGKILL'ed? Why not EINTR ?
>
> I chose -ENOMEM because -ENOMEM looked better for conveying that current thread
> was SIGKILLed by the OOM killer in order to solve no memory state. But I forgot
> that -ENOMEM will not convey the reason properly if current thread was
> SIGKILLed by other than the OOM killer. Maybe
>
>   return test_tsk_thread_flag(current, TIF_MEMDIE) ? -ENOMEM : -EINTR;
>
> rather than
>
>   return -ENOMEM;

Why complicate the things? Following this logic you can change you
can change almost every user of, say, fatal_signal_pending() to take
TIF_MEMDIE into account. For what? Just return -EINTR.

> > > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > > leave kthread_create() as soon as receiving SIGKILL. But this change
> > > caused boot failures if systemd-udevd received SIGKILL (probably due
> > > to timeout) while loading SCSI controller drivers using
> > > finit_module() [1].
> >
> > Shouldn't we fix the caller instead? It should handle the error from
> > kthread_create() correctly.
> >
> > And could you tell who is the caller which doesn't do this? If it can't
> > be fixed, then, say, it can use workqueue to create a kernel thread.
> >
>
> There are many callers.

Who else? I mean, who else doesn't handle SIGKILL correctly ?

> One of them is scsi_host_alloc() which is called
> upon bootup in order to recognize SCSI storage devices.
>
> To my surprise, systemd-udevd process sends SIGKILL to worker systemd-udevd
> processes if they did not complete their jobs within 30 seconds. On some
> machines, it takes more than 30 seconds to recognize SCSI storage devices.
>
> On such machines, scsi_host_alloc() is called after the worker process
> received SIGKILL. Since commit 786235ee "kthread: make kthread_create()
> killable" broke all callers of kthread_create() who had been able to survive
> SIGKILL,

Well. I do not know if user-space should be blamed or not. But. The worker
process was killed, the kernel has all rights to check signal_pending() at
any moment and return to user-mode. Even if we change kthread_create() to
ignore SIGKILL it still can fail.

And probably there is a bug in the driver. IIUC, the probe hangs, the task
is killed after 30 seconds, I'd say it should not even call scsi_host_alloc()
in this case.


Lets look at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705
I do not understand what happens and I know nothing about drivers, scsci,
but https://launchpadlibrarian.net/165067008/console.log shows

	[   36.539955] scsi4: error handler thread failed to spawn, error = -12

OK, scsi_host_alloc() fails. But it can fail due to another reason,

	[   36.552694] mptsas: ioc0: WARNING - Unable to register controller with SCSI subsystem

mptsas_probe() goes to out_mptsas_probe,

	[   36.569962] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
	[   36.573954] IP: [<ffffffff8170fe52>] mutex_lock+0x12/0x2f
	[   36.573954] PGD 368dd067 PUD 368de067 PMD 0
	[   36.573954] Oops: 0002 [#1] SMP
	[   36.573954] Modules linked in: tg3 hid_generic ptp usbhid mptsas(+) mptscsih hid mptbase pps_core scsi_transport_sas
	[   36.573954] CPU: 1 PID: 130 Comm: systemd-udevd Not tainted 3.13.0-6-generic #23-Ubuntu
	[   36.573954] Hardware name: Dell Inc. PowerEdge R300/0TY179, BIOS 1.5.2 11/02/2010
	[   36.573954] task: ffff88003689b000 ti: ffff8800368e8000 task.ti: ffff8800368e8000
	[   36.573954] RIP: 0010:[<ffffffff8170fe52>]  [<ffffffff8170fe52>] mutex_lock+0x12/0x2f
	[   36.573954] RSP: 0018:ffff8800368e9b10  EFLAGS: 00010246
	[   36.573954] RAX: 0000000000000000 RBX: 0000000000000060 RCX: 0000000000000dac
	[   36.573954] RDX: 000000000000229c RSI: 00000000229e229c RDI: 0000000000000060
	[   36.573954] RBP: ffff8800368e9b18 R08: 0000000000000086 R09: 00000000000002ac
	[   36.573954] R10: ffffffff8185a460 R11: ffff8800368e98ae R12: 0000000000000060
	[   36.573954] R13: ffff880223c4b000 R14: ffff880223c4b098 R15: ffffffffa00953a0
	[   36.573954] FS:  00007f22a26a1880(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
	[   36.573954] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	[   36.573954] CR2: 0000000000000060 CR3: 00000000368dc000 CR4: 00000000000407e0
	[   36.573954] Stack:
	[   36.573954]  0000000000000000 ffff8800368e9b40 ffffffff814c851d ffff880220153000
	[   36.573954]  0000000000000000 ffff880223c4b000 ffff8800368e9b70 ffffffffa007d2a1
	[   36.573954]  ffff880220153000 00000000ffffffff ffff880223c4b000 ffff880223c4b098
	[   36.573954] Call Trace:
	[   36.573954]  [<ffffffff814c851d>] scsi_remove_host+0x1d/0x120
	[   36.573954]  [<ffffffffa007d2a1>] mptscsih_remove+0x31/0xc0 [mptscsih]
	[   36.573954]  [<ffffffffa008f259>] mptsas_probe+0x419/0x5a0 [mptsas]

and why the error path should OOPS? I think this should be fixed?

> I think fixing this regression at kthread_create() is the appropriate
> response.

I still can't understand why do you think we should "fix" kthread_create().

> Given that said, which one do we prefer?
>
>   (a) Wait for completion forever after receiving SIGKILL, unless chosen
>       by the OOM killer.
>
>   (b) Wait for completion for only limited duration after receiving SIGKILL.

Personally I dislike both. It should either react to SIGKILL or not, in the
latter case it would be better to revert this patch, imho.

If we need the urgent hack to fix the regression, then I suggest to change
scsi_host_alloc() temporary until mptsas (or whatever) is fixed.

Oleg.

--- x/drivers/scsi/hosts.c
+++ x/drivers/scsi/hosts.c
@@ -447,8 +447,18 @@ struct Scsi_Host *scsi_host_alloc(struct
 	dev_set_name(&shost->shost_dev, "host%d", shost->host_no);
 	shost->shost_dev.groups = scsi_sysfs_shost_attr_groups;
 
-	shost->ehandler = kthread_run(scsi_error_handler, shost,
-			"scsi_eh_%d", shost->host_no);
+	/*
+	 * HUGE COMMENT. and kthread_create() needs s/ENOMEM/EINTR/.
+	 */
+	for (;;) {
+		shost->ehandler = kthread_run(scsi_error_handler, shost,
+						"scsi_eh_%d", shost->host_no);
+		if (!IS_ERR(shost->ehandler) || PTR_ERR(shost->ehandler) != -EINTR)
+			break;
+		clear_thread_flag(TIF_SIGPENDING);
+	}
+	recalc_sigpending();
+
 	if (IS_ERR(shost->ehandler)) {
 		printk(KERN_WARNING "scsi%d: error handler thread failed to spawn, error = %ld\n",
 			shost->host_no, PTR_ERR(shost->ehandler));


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-14 20:46 [v3.13][v3.14][Regression] kthread: make kthread_create() killable Joseph Salisbury
  2014-03-15  0:43 ` Tetsuo Handa
@ 2014-03-17 20:02 ` Andrew Morton
  2014-03-17 20:19   ` Oleg Nesterov
                     ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread
From: Andrew Morton @ 2014-03-17 20:02 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: penguin-kernel, Oleg Nesterov, rientjes, Linus Torvalds, tj,
	Thomas Gleixner, LKML, Kernel Team

On Fri, 14 Mar 2014 16:46:26 -0400 Joseph Salisbury <joseph.salisbury@canonical.com> wrote:

> Hi Tetsuo,
> 
> A kernel bug report was opened against Ubuntu[0].  We performed a kernel
> bisect, and found that reverting the following commit resolved this bug:
> 
> 
> commit 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date:   Tue Nov 12 15:06:45 2013 -0800
> 
>     kthread: make kthread_create() killable
> 
> The regression was introduced as of v3.13-rc1.
> 
> The bug indicates an issue with the SAS controller during
> initialization, which prevents the system from booting.  Additional
> details are available in the bug report or on request.
> 
> I was hoping to get your feedback, since you are the patch author.  Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request?
> 
> [0] http://pad.lv/1276705

What process is running here?  Presumably modprobe.

A possible explanation is that modprobe has genuinely received a
SIGKILL.  Can you identify anything in this setup which might send a
SIGKILL to the modprobe process?

kthread_create_on_node() thinks that SIGKILL came from the oom-killer
and it cheerfully returns -ENOMEM, which is incorrect if that signal
came from userspace.  And I don't _think_ we prevent
userspace-originated signals from unblocking
wait_for_completion_killable()?


Root cause time: it's wrong for the oom-killer to use SIGKILL.  In fact
it's basically always wrong to send signals from in-kernel.  Signals
are a userspace IPC mechanism and using them in-kernel a) makes it hard
(or impossible) to distinguish them from userspace-originated signals
and b) permits userspace to produce surprising results in the kernel,
which I suspect is what we're seeing here.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-17 20:02 ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable Andrew Morton
@ 2014-03-17 20:19   ` Oleg Nesterov
  2014-03-17 20:39     ` Andrew Morton
  2014-06-03 13:03     ` [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL Tetsuo Handa
  2014-03-17 21:32   ` [v3.13][v3.14][Regression] kthread: make kthread_create()killable Tetsuo Handa
  2014-03-17 23:18   ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable One Thousand Gnomes
  2 siblings, 2 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-17 20:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joseph Salisbury, penguin-kernel, rientjes, Linus Torvalds, tj,
	Thomas Gleixner, LKML, Kernel Team

On 03/17, Andrew Morton wrote:
>
> On Fri, 14 Mar 2014 16:46:26 -0400 Joseph Salisbury <joseph.salisbury@canonical.com> wrote:
>
> > Hi Tetsuo,
> >
> > A kernel bug report was opened against Ubuntu[0].  We performed a kernel
> > bisect, and found that reverting the following commit resolved this bug:
> >
> >
> > commit 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date:   Tue Nov 12 15:06:45 2013 -0800
> >
> >     kthread: make kthread_create() killable
> >
> > The regression was introduced as of v3.13-rc1.
> >
> > The bug indicates an issue with the SAS controller during
> > initialization, which prevents the system from booting.  Additional
> > details are available in the bug report or on request.
> >
> > I was hoping to get your feedback, since you are the patch author.  Do
> > you think gathering any additional data will help diagnose this issue,
> > or would it be best to submit a revert request?
> >
> > [0] http://pad.lv/1276705
>
> What process is running here?  Presumably modprobe.
>
> A possible explanation is that modprobe has genuinely received a
> SIGKILL.  Can you identify anything in this setup which might send a
> SIGKILL to the modprobe process?

See also other discussion in this thread, I thinks the code in drivers/
is buggy anyway.

> kthread_create_on_node() thinks that SIGKILL came from the oom-killer
> and it cheerfully returns -ENOMEM, which is incorrect if that signal
> came from userspace.

Yes, I think it should return -EINTR.

> And I don't _think_ we prevent
> userspace-originated signals from unblocking
> wait_for_completion_killable()?

And we should not.

> Root cause time: it's wrong for the oom-killer to use SIGKILL.

Not sure... what else?

> In fact
> it's basically always wrong to send signals from in-kernel.

Well, SIGSEGV, SIGIO...

> Signals
> are a userspace IPC mechanism and using them in-kernel a) makes it hard
> (or impossible) to distinguish them from userspace-originated signals
> and b) permits userspace to produce surprising results in the kernel,
> which I suspect is what we're seeing here.

Well, I think in this case it doesn't matter who/why sends a signal.
The task is killed, it should react and exit asap. And kthread_create()
can fail in any case, at least the kernel should not crash.

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-17 20:19   ` Oleg Nesterov
@ 2014-03-17 20:39     ` Andrew Morton
  2014-03-18 17:45       ` Oleg Nesterov
  2014-06-03 13:03     ` [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL Tetsuo Handa
  1 sibling, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2014-03-17 20:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Joseph Salisbury, penguin-kernel, rientjes, Linus Torvalds, tj,
	Thomas Gleixner, LKML, Kernel Team

On Mon, 17 Mar 2014 21:19:19 +0100 Oleg Nesterov <oleg@redhat.com> wrote:

> > kthread_create_on_node() thinks that SIGKILL came from the oom-killer
> > and it cheerfully returns -ENOMEM, which is incorrect if that signal
> > came from userspace.
> 
> Yes, I think it should return -EINTR.
> 
> > And I don't _think_ we prevent
> > userspace-originated signals from unblocking
> > wait_for_completion_killable()?
> 
> And we should not.
> 
> > Root cause time: it's wrong for the oom-killer to use SIGKILL.
> 
> Not sure... what else?

If we really really have to use userspace IPC in the kernel then I
suppose we could add a new signal type (SIGKERNEL?) which userspace can
neither receive nor send.

> > In fact
> > it's basically always wrong to send signals from in-kernel.
> 
> Well, SIGSEGV, SIGIO...

Those are kernel->userspace, not kernel->kernel.

We have all sorts of cross-task mechanisms available in kernel (set_bit
and wake_up would suffice here) so we don't need to use signals with
all the baggage they bring along.

We used to do a lot more kernel signalling, but quite a lot got weeded
out years ago.  The favorite failure mode was for a task to use
TASK_INTERRUPTIBLE so that a kernel-sent signal could wake it up. 
Whoops, drivers would outright fail if the calling task happened to
have SIGINT pending!

> > Signals
> > are a userspace IPC mechanism and using them in-kernel a) makes it hard
> > (or impossible) to distinguish them from userspace-originated signals
> > and b) permits userspace to produce surprising results in the kernel,
> > which I suspect is what we're seeing here.
> 
> Well, I think in this case it doesn't matter who/why sends a signal.
> The task is killed, it should react and exit asap. And kthread_create()
> can fail in any case, at least the kernel should not crash.

Well yes.  In this "bug", userspace tried to kill udev and surprise
surprise, that's what happened.  The only real kernel bug here is the
incorrect errno.  But in the real world, the kernel changed and systems
broke - we should try to unbreak them.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create()killable
  2014-03-17 20:02 ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable Andrew Morton
  2014-03-17 20:19   ` Oleg Nesterov
@ 2014-03-17 21:32   ` Tetsuo Handa
  2014-03-17 23:18   ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable One Thousand Gnomes
  2 siblings, 0 replies; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-17 21:32 UTC (permalink / raw)
  To: akpm, joseph.salisbury
  Cc: oleg, rientjes, torvalds, tj, tglx, linux-kernel, kernel-team

Andrew Morton wrote:
> What process is running here?  Presumably modprobe.

Yes. It is a worker systemd-udevd process who is acting like modprobe .

> A possible explanation is that modprobe has genuinely received a
> SIGKILL.  Can you identify anything in this setup which might send a
> SIGKILL to the modprobe process?

It is the systemd-udevd process who is sending SIGKILL to worker
systemd-udevd processes. It uses hard coded 30 seconds timeout.

> kthread_create_on_node() thinks that SIGKILL came from the oom-killer
> and it cheerfully returns -ENOMEM, which is incorrect if that signal
> came from userspace.  And I don't _think_ we prevent
> userspace-originated signals from unblocking
> wait_for_completion_killable()?

I prefer processes being killed upon SIGKILL as soon as possible.
I expect any unkillable operations should be replaced with killable
operations, or the OOM killer may fail to solve no memory state by
choosing a process in unkillable sleep.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-17 20:02 ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable Andrew Morton
  2014-03-17 20:19   ` Oleg Nesterov
  2014-03-17 21:32   ` [v3.13][v3.14][Regression] kthread: make kthread_create()killable Tetsuo Handa
@ 2014-03-17 23:18   ` One Thousand Gnomes
  2014-03-18 17:50     ` Oleg Nesterov
  2 siblings, 1 reply; 41+ messages in thread
From: One Thousand Gnomes @ 2014-03-17 23:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joseph Salisbury, penguin-kernel, Oleg Nesterov, rientjes,
	Linus Torvalds, tj, Thomas Gleixner, LKML, Kernel Team

> Root cause time: it's wrong for the oom-killer to use SIGKILL.  In fact

It has to use SIGKILL anything else might be caught and grow the user
stack a page..

> it's basically always wrong to send signals from in-kernel.  Signals
> are a userspace IPC mechanism and using them in-kernel a) makes it hard
> (or impossible) to distinguish them from userspace-originated signals

Actually signals are a kernel messaging system someone repurposed for IPC.

> and b) permits userspace to produce surprising results in the kernel,
> which I suspect is what we're seeing here.

There is enough information for kernel side code to decide whether a
signal came from kernel or userspace. Then again - it's not clear that it
should ever have to.

Alan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: makekthread_create()killable
  2014-03-17 14:22         ` Oleg Nesterov
@ 2014-03-18 12:03           ` Tetsuo Handa
  2014-03-18 17:16             ` Oleg Nesterov
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-18 12:03 UTC (permalink / raw)
  To: oleg, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy
  Cc: rientjes, akpm, joseph.salisbury, torvalds, tj, tglx,
	linux-kernel, kernel-team, linux-scsi

Oleg Nesterov wrote:
> If we need the urgent hack to fix the regression, then I suggest to change
> scsi_host_alloc() temporary until mptsas (or whatever) is fixed.

Device initialization taking longer than 30 seconds is possible and is not a
hang up. It is systemd which needs to be fixed.

> --- x/drivers/scsi/hosts.c
> +++ x/drivers/scsi/hosts.c
> @@ -447,8 +447,18 @@ struct Scsi_Host *scsi_host_alloc(struct
>  	dev_set_name(&shost->shost_dev, "host%d", shost->host_no);
>  	shost->shost_dev.groups = scsi_sysfs_shost_attr_groups;
>  
> -	shost->ehandler = kthread_run(scsi_error_handler, shost,
> -			"scsi_eh_%d", shost->host_no);
> +	/*
> +	 * HUGE COMMENT. and kthread_create() needs s/ENOMEM/EINTR/.
> +	 */
> +	for (;;) {
> +		shost->ehandler = kthread_run(scsi_error_handler, shost,
> +						"scsi_eh_%d", shost->host_no);
> +		if (!IS_ERR(shost->ehandler) || PTR_ERR(shost->ehandler) != -EINTR)
> +			break;
> +		clear_thread_flag(TIF_SIGPENDING);
> +	}
> +	recalc_sigpending();
> +
>  	if (IS_ERR(shost->ehandler)) {
>  		printk(KERN_WARNING "scsi%d: error handler thread failed to spawn, error = %ld\n",
>  			shost->host_no, PTR_ERR(shost->ehandler));
> 
> 

I think we need a bit different version, in order to take TIF_MEMDIE flag into
account at the caller of kthread_create(), for the purpose of commit 786235ee
is "try to die as soon as possible if chosen by the OOM killer".

	for (;;) {
		shost->ehandler = kthread_run(scsi_error_handler, shost,
					      "scsi_eh_%d", shost->host_no);
		if (PTR_ERR(shost->ehandler) != -EINTR ||
		    test_thread_flag(TIF_MEMDIE))
			break;
		clear_thread_flag(TIF_SIGPENDING);
	}
	recalc_sigpending();

But I have two worrying points.

  (1) Changing return code from -ENOMEM to -EINTR may not be sufficient.

      If kmalloc(GFP_KERNEL) in kthread_create_on_node() does something that
      calls recalc_sigpending(), TIF_SIGPENDING will be set on the second call
      to kthread_run(). This will make wait_for_completion_killable() return
      -EINTR immediately because the second call to kthread_run() happens only
      when current thread already received SIGKILL (by other than the OOM
      killer). This may form an infinite busy loop.

      As I think it is difficult to prove that kmalloc(GFP_KERNEL) never sets
      TIF_SIGPENDING flag, we would need to call
      clear_thread_flag(TIF_SIGPENDING) immediately before
      wait_for_completion_killable() and call recalc_sigpending() immediately
      after wait_for_completion_killable(). Is this better than taking care of
      SIGKILL (by other than the OOM killer) on the first call to
      kthread_run() ?

  (2) I don't like scattering around test_thread_flag(TIF_MEMDIE), for there
      might be other drivers who receive SIGKILL by systemd's 30 seconds
      timeout.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: makekthread_create()killable
  2014-03-18 12:03           ` [v3.13][v3.14][Regression] kthread: makekthread_create()killable Tetsuo Handa
@ 2014-03-18 17:16             ` Oleg Nesterov
  2014-03-19 11:49               ` [v3.13][v3.14][Regression] kthread:makekthread_create()killable Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-18 17:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy, rientjes,
	akpm, joseph.salisbury, torvalds, tj, tglx, linux-kernel,
	kernel-team, linux-scsi

On 03/18, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > If we need the urgent hack to fix the regression, then I suggest to change
> > scsi_host_alloc() temporary until mptsas (or whatever) is fixed.
>
> Device initialization taking longer than 30 seconds is possible and is not a
> hang up. It is systemd which needs to be fixed.

Perhaps systemd needs the fix too, I do not know. But this is irrelevant,
I think. Or at least this should be discussed separately.

kthread_run() can fail anyway, mptsas_probe() should not crash the kernel.

And btw, it is not clear to me if in this case device initialization really
needs more than 30 seconds... My understanding is probably wrong, so please
correct me. But it seems that before your "make kthread_create() killable"

	- probe hangs

	- SIGKILL wakes it up

	- so I assume that the probe was interrupted and didn't finish
	  correctly ???

	- initialization continues, does scsi_host_alloc(), etc, and
	  everything works fine even if probe was interrupted?

So perhaps that probe should not hang and this should be fixed too ?
Do you know where exactly it hangs? And where it is woken up by SIGKILL ?
Or I totally misunderstood ?

> > --- x/drivers/scsi/hosts.c
> > +++ x/drivers/scsi/hosts.c
> > @@ -447,8 +447,18 @@ struct Scsi_Host *scsi_host_alloc(struct
> >  	dev_set_name(&shost->shost_dev, "host%d", shost->host_no);
> >  	shost->shost_dev.groups = scsi_sysfs_shost_attr_groups;
> >
> > -	shost->ehandler = kthread_run(scsi_error_handler, shost,
> > -			"scsi_eh_%d", shost->host_no);
> > +	/*
> > +	 * HUGE COMMENT. and kthread_create() needs s/ENOMEM/EINTR/.
> > +	 */
> > +	for (;;) {
> > +		shost->ehandler = kthread_run(scsi_error_handler, shost,
> > +						"scsi_eh_%d", shost->host_no);
> > +		if (!IS_ERR(shost->ehandler) || PTR_ERR(shost->ehandler) != -EINTR)
> > +			break;
> > +		clear_thread_flag(TIF_SIGPENDING);
> > +	}
> > +	recalc_sigpending();
> > +
> >  	if (IS_ERR(shost->ehandler)) {
> >  		printk(KERN_WARNING "scsi%d: error handler thread failed to spawn, error = %ld\n",
> >  			shost->host_no, PTR_ERR(shost->ehandler));
> >
> >
>
> I think we need a bit different version, in order to take TIF_MEMDIE flag into
> account at the caller of kthread_create(), for the purpose of commit 786235ee
> is "try to die as soon as possible if chosen by the OOM killer".
>
> 	for (;;) {
> 		shost->ehandler = kthread_run(scsi_error_handler, shost,
> 					      "scsi_eh_%d", shost->host_no);
> 		if (PTR_ERR(shost->ehandler) != -EINTR ||
> 		    test_thread_flag(TIF_MEMDIE))

Well, personally I do not care about TIF_MEMDIE/oom at all. We need the
temporary hack (unless we have the "right" fix right now) which should be
reverted later.

>   (1) Changing return code from -ENOMEM to -EINTR may not be sufficient.

You mean, in kthread_create() ? And, sufficient for what?

>       If kmalloc(GFP_KERNEL) in kthread_create_on_node() does something that
>       calls recalc_sigpending(),

firstly, it should not. And almost every caller of recalc_sigpending() outside
of core signal code is wrong.

> TIF_SIGPENDING will be set on the second call
>       to kthread_run(). This will make wait_for_completion_killable() return
>       -EINTR immediately because the second call to kthread_run() happens only
>       when current thread already received SIGKILL (by other than the OOM
>       killer). This may form an infinite busy loop.

Not sure I understand... Yes, wait_for_completion_killable() can return
immediately if TIF_SIGPENDING will be set again for any reason. Say, another
signal. But the next iteration will clear TIF_SIGPENDING ?

>      As I think it is difficult to prove that kmalloc(GFP_KERNEL) never sets
>      TIF_SIGPENDING flag

Ah, I see, you mean that kmalloc() can do this every time. No, this should
not happen or we have another problem.

In any case. My only point is that the regression should be fixed here, imho.
And just in case, let me repeat that the code above is the dirty temporary hack.
I won't insist on the ugly TIF_SIGPENDING games, say, we can use workqueue.

>   (2) I don't like scattering around test_thread_flag(TIF_MEMDIE),
>       might be other drivers who receive SIGKILL by systemd's 30 seconds
>       timeout.

Can't understand this part too... and it was you who suggested to check
TIF_MEMDIE.

Anyway. I agree with any hack in scsi_host_alloc/etc, this is up to
maintainers. I still think that your change uncovered the problems in
drivers/message/fusion/, these problems should be fixed somehow.

Dear maintainers, we need your help.

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-17 20:39     ` Andrew Morton
@ 2014-03-18 17:45       ` Oleg Nesterov
  0 siblings, 0 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-18 17:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joseph Salisbury, penguin-kernel, rientjes, Linus Torvalds, tj,
	Thomas Gleixner, LKML, Kernel Team

On 03/17, Andrew Morton wrote:
>
> On Mon, 17 Mar 2014 21:19:19 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > Root cause time: it's wrong for the oom-killer to use SIGKILL.
> >
> > Not sure... what else?
>
> If we really really have to use userspace IPC in the kernel then I
> suppose we could add a new signal type (SIGKERNEL?) which userspace can
> neither receive nor send.

But for what?

> > > In fact
> > > it's basically always wrong to send signals from in-kernel.
> >
> > Well, SIGSEGV, SIGIO...
>
> Those are kernel->userspace, not kernel->kernel.

OK, this is probably true.

> We have all sorts of cross-task mechanisms available in kernel (set_bit
> and wake_up would suffice here) so we don't need to use signals with
> all the baggage they bring along.

but why do we need another mechanism if we want terminate the task
for any reason?

OK. Perhaps the killed task wants to know if it was killed by kernel
or user-space... Not sure we really need this, but OK. We can add the
SIGNAL_GROUP_KILLED_BY_KERNEL flag. But it is not clear if we want
TASK_WAKEKILL_BY_KERNEL or not... And probably I missed your point.

> > Well, I think in this case it doesn't matter who/why sends a signal.
> > The task is killed, it should react and exit asap. And kthread_create()
> > can fail in any case, at least the kernel should not crash.
>
> Well yes.  In this "bug", userspace tried to kill udev and surprise
> surprise, that's what happened.  The only real kernel bug here is the
> incorrect errno.

and the kernel crash in mptsas_probe's error paths. This should be fixed.
And it seems that this driver has more problems, but I can be wrong.

> But in the real world, the kernel changed and systems
> broke - we should try to unbreak them.

Yes, yes, sure. But so far I think we should uglify the driver or scsi
(unless we have the right fix), not the poor kthread_create().

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread: make kthread_create() killable
  2014-03-17 23:18   ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable One Thousand Gnomes
@ 2014-03-18 17:50     ` Oleg Nesterov
  0 siblings, 0 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-18 17:50 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Andrew Morton, Joseph Salisbury, penguin-kernel, rientjes,
	Linus Torvalds, tj, Thomas Gleixner, LKML, Kernel Team

On 03/17, One Thousand Gnomes wrote:
>
> > and b) permits userspace to produce surprising results in the kernel,
> > which I suspect is what we're seeing here.
>
> There is enough information for kernel side code to decide whether a
> signal came from kernel or userspace.

Not really. SIGKILL can come without siginfo. Other signals too, but
SIGKILL especially.

> Then again - it's not clear that it
> should ever have to.

Yes, agreed.

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread:makekthread_create()killable
  2014-03-18 17:16             ` Oleg Nesterov
@ 2014-03-19 11:49               ` Tetsuo Handa
  2014-03-19 16:13                 ` Joseph Salisbury
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-19 11:49 UTC (permalink / raw)
  To: oleg, joseph.salisbury
  Cc: JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy, rientjes,
	akpm, torvalds, tj, tglx, linux-kernel, kernel-team, linux-scsi

Oleg Nesterov wrote:
> > > If we need the urgent hack to fix the regression, then I suggest to change
> > > scsi_host_alloc() temporary until mptsas (or whatever) is fixed.
> >
> > Device initialization taking longer than 30 seconds is possible and is not a
> > hang up. It is systemd which needs to be fixed.
> 
> Perhaps systemd needs the fix too, I do not know. But this is irrelevant,
> I think. Or at least this should be discussed separately.

I confirmed that this problem goes away if systemd-udevd supports longer
timeout.

> 
> kthread_run() can fail anyway, mptsas_probe() should not crash the kernel.

Right. But mptsas_probe() triggering an OOPS is irrelevant to kthread_run()
( comment #27 ).

> 
> And btw, it is not clear to me if in this case device initialization really
> needs more than 30 seconds... My understanding is probably wrong, so please
> correct me. But it seems that before your "make kthread_create() killable"
> 
> 	- probe hangs
> 
> 	- SIGKILL wakes it up
> 
> 	- so I assume that the probe was interrupted and didn't finish
> 	  correctly ???
> 
> 	- initialization continues, does scsi_host_alloc(), etc, and
> 	  everything works fine even if probe was interrupted?
> 

I confirmed that device initialization really took more than 30 seconds
( comments #51 and #52 ).

> So perhaps that probe should not hang and this should be fixed too ?
> Do you know where exactly it hangs? And where it is woken up by SIGKILL ?
> Or I totally misunderstood ?

The probe did not hang. SIGKILL affected only wait_for_completion_killable()
in kthread_create_on_node() called by mptsas_probe() via scsi_host_alloc().
Thus, the probe was interrupted because kthread_run() returned an error.

> > I think we need a bit different version, in order to take TIF_MEMDIE flag into
> > account at the caller of kthread_create(), for the purpose of commit 786235ee
> > is "try to die as soon as possible if chosen by the OOM killer".
> >
> > 	for (;;) {
> > 		shost->ehandler = kthread_run(scsi_error_handler, shost,
> > 					      "scsi_eh_%d", shost->host_no);
> > 		if (PTR_ERR(shost->ehandler) != -EINTR ||
> > 		    test_thread_flag(TIF_MEMDIE))
> 
> Well, personally I do not care about TIF_MEMDIE/oom at all. We need the
> temporary hack (unless we have the "right" fix right now) which should be
> reverted later.

I do seriously care about TIF_MEMDIE/oom. Last week I respond to a trouble
which hit "kernel: request_module() OOM local DoS" (RHBZ #853474) without
any malice.

> Not sure I understand... Yes, wait_for_completion_killable() can return
> immediately if TIF_SIGPENDING will be set again for any reason. Say, another
> signal. But the next iteration will clear TIF_SIGPENDING ?
> 
> >      As I think it is difficult to prove that kmalloc(GFP_KERNEL) never sets
> >      TIF_SIGPENDING flag
> 
> Ah, I see, you mean that kmalloc() can do this every time. No, this should
> not happen or we have another problem.

Then, what happens if somebody does

  while (1)
      kill(pid, SIGKILL);

where pid is the process calling kthread_run() from the "for (;;)" loop in
scsi_host_alloc()? Theoretically, it will form an infinite retry loop.
Clearing TIF_SIGPENDING does not guarantee that next
wait_for_completion_killable() does not return immediately.
Doing retry decision at scsi_host_alloc() will make things worse than
doing it at kthread_create_on_node().

> Anyway. I agree with any hack in scsi_host_alloc/etc, this is up to
> maintainers. I still think that your change uncovered the problems in
> drivers/message/fusion/, these problems should be fixed somehow.
> 
> Dear maintainers, we need your help.
> 

Right. We found that we can fix this problem by updating systemd-udevd to
support longer timeout ( comment #53 ). Joseph, would you consult systemd
maintainers?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread:makekthread_create()killable
  2014-03-19 11:49               ` [v3.13][v3.14][Regression] kthread:makekthread_create()killable Tetsuo Handa
@ 2014-03-19 16:13                 ` Joseph Salisbury
  2014-03-19 17:52                   ` Oleg Nesterov
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Salisbury @ 2014-03-19 16:13 UTC (permalink / raw)
  To: Tetsuo Handa, oleg
  Cc: JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy, rientjes,
	akpm, torvalds, tj, tglx, linux-kernel, kernel-team, linux-scsi

On 03/19/2014 07:49 AM, Tetsuo Handa wrote:
> Oleg Nesterov wrote:
>>>> If we need the urgent hack to fix the regression, then I suggest to change
>>>> scsi_host_alloc() temporary until mptsas (or whatever) is fixed.
>>> Device initialization taking longer than 30 seconds is possible and is not a
>>> hang up. It is systemd which needs to be fixed.
>> Perhaps systemd needs the fix too, I do not know. But this is irrelevant,
>> I think. Or at least this should be discussed separately.
> I confirmed that this problem goes away if systemd-udevd supports longer
> timeout.
>
>> kthread_run() can fail anyway, mptsas_probe() should not crash the kernel.
> Right. But mptsas_probe() triggering an OOPS is irrelevant to kthread_run()
> ( comment #27 ).
>
>> And btw, it is not clear to me if in this case device initialization really
>> needs more than 30 seconds... My understanding is probably wrong, so please
>> correct me. But it seems that before your "make kthread_create() killable"
>>
>> 	- probe hangs
>>
>> 	- SIGKILL wakes it up
>>
>> 	- so I assume that the probe was interrupted and didn't finish
>> 	  correctly ???
>>
>> 	- initialization continues, does scsi_host_alloc(), etc, and
>> 	  everything works fine even if probe was interrupted?
>>
> I confirmed that device initialization really took more than 30 seconds
> ( comments #51 and #52 ).
>
>> So perhaps that probe should not hang and this should be fixed too ?
>> Do you know where exactly it hangs? And where it is woken up by SIGKILL ?
>> Or I totally misunderstood ?
> The probe did not hang. SIGKILL affected only wait_for_completion_killable()
> in kthread_create_on_node() called by mptsas_probe() via scsi_host_alloc().
> Thus, the probe was interrupted because kthread_run() returned an error.
>
>>> I think we need a bit different version, in order to take TIF_MEMDIE flag into
>>> account at the caller of kthread_create(), for the purpose of commit 786235ee
>>> is "try to die as soon as possible if chosen by the OOM killer".
>>>
>>> 	for (;;) {
>>> 		shost->ehandler = kthread_run(scsi_error_handler, shost,
>>> 					      "scsi_eh_%d", shost->host_no);
>>> 		if (PTR_ERR(shost->ehandler) != -EINTR ||
>>> 		    test_thread_flag(TIF_MEMDIE))
>> Well, personally I do not care about TIF_MEMDIE/oom at all. We need the
>> temporary hack (unless we have the "right" fix right now) which should be
>> reverted later.
> I do seriously care about TIF_MEMDIE/oom. Last week I respond to a trouble
> which hit "kernel: request_module() OOM local DoS" (RHBZ #853474) without
> any malice.
>
>> Not sure I understand... Yes, wait_for_completion_killable() can return
>> immediately if TIF_SIGPENDING will be set again for any reason. Say, another
>> signal. But the next iteration will clear TIF_SIGPENDING ?
>>
>>>      As I think it is difficult to prove that kmalloc(GFP_KERNEL) never sets
>>>      TIF_SIGPENDING flag
>> Ah, I see, you mean that kmalloc() can do this every time. No, this should
>> not happen or we have another problem.
> Then, what happens if somebody does
>
>   while (1)
>       kill(pid, SIGKILL);
>
> where pid is the process calling kthread_run() from the "for (;;)" loop in
> scsi_host_alloc()? Theoretically, it will form an infinite retry loop.
> Clearing TIF_SIGPENDING does not guarantee that next
> wait_for_completion_killable() does not return immediately.
> Doing retry decision at scsi_host_alloc() will make things worse than
> doing it at kthread_create_on_node().
>
>> Anyway. I agree with any hack in scsi_host_alloc/etc, this is up to
>> maintainers. I still think that your change uncovered the problems in
>> drivers/message/fusion/, these problems should be fixed somehow.
>>
>> Dear maintainers, we need your help.
>>
> Right. We found that we can fix this problem by updating systemd-udevd to
> support longer timeout ( comment #53 ). Joseph, would you consult systemd
> maintainers?

Thanks everyone for reviewing this bug.  Message sent to systemd mailing
list:
http://lists.freedesktop.org/archives/systemd-devel/2014-March/018006.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [v3.13][v3.14][Regression] kthread:makekthread_create()killable
  2014-03-19 16:13                 ` Joseph Salisbury
@ 2014-03-19 17:52                   ` Oleg Nesterov
  2014-03-19 18:29                     ` please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable) Oleg Nesterov
  0 siblings, 1 reply; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-19 17:52 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Tetsuo Handa, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy,
	rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team,
	linux-scsi

On 03/19, Joseph Salisbury wrote:
>
> On 03/19/2014 07:49 AM, Tetsuo Handa wrote:

Hmm. Apparently I missed this email from Tetsuo. I'll reply here.

> > Oleg Nesterov wrote:
> >
> >> And btw, it is not clear to me if in this case device initialization really
> >> needs more than 30 seconds... My understanding is probably wrong, so please
> >> correct me. But it seems that before your "make kthread_create() killable"
> >>
> >> 	- probe hangs
> >>
> >> 	- SIGKILL wakes it up
> >>
> >> 	- so I assume that the probe was interrupted and didn't finish
> >> 	  correctly ???
> >>
> >> 	- initialization continues, does scsi_host_alloc(), etc, and
> >> 	  everything works fine even if probe was interrupted?
> >>
> > I confirmed that device initialization really took more than 30 seconds
> > ( comments #51 and #52 ).

Thanks. However I still think this needs more investigation. May be I'll
write another email, but given that maintainers do not care...

> >> So perhaps that probe should not hang and this should be fixed too ?
> >> Do you know where exactly it hangs? And where it is woken up by SIGKILL ?
> >> Or I totally misunderstood ?
> > The probe did not hang.

It doesn't hang forever. Otherwise see above.

> > SIGKILL affected only wait_for_completion_killable()
> > in kthread_create_on_node() called by mptsas_probe() via scsi_host_alloc().

This wad already clear,

> > Thus, the probe was interrupted because kthread_run() returned an error.

No, #51 / #52 can't prove this. I think that kthread_run() or even
scsi_host_alloc() was called with fatal_signal_pending(). What did the probe
task do before? This is not clear. But again, see above.

> >> Ah, I see, you mean that kmalloc() can do this every time. No, this should
> >> not happen or we have another problem.
> > Then, what happens if somebody does
> >
> >   while (1)
> >       kill(pid, SIGKILL);
> >
> > where pid is the process calling kthread_run() from the "for (;;)" loop in
> > scsi_host_alloc()?

Nothing good. So what?

Tetsuo, how many time I should repeat that I only tried to suggest the
temporary dirty hack to close the regression ? ;)

And once again, I agree with any change in scsi_host_alloc/etc, I suggested
this (pseudo) code for example.

> >> Dear maintainers, we need your help.
> >>
> > Right. We found that we can fix this problem by updating systemd-udevd to
> > support longer timeout ( comment #53 ). Joseph, would you consult systemd
> > maintainers?
>
> Thanks everyone for reviewing this bug.  Message sent to systemd mailing
> list:
> http://lists.freedesktop.org/archives/systemd-devel/2014-March/018006.html

OK, good, thanks.

But please do not forget that the kernel crashes. Whatever else we do, this
should be fixed anyway. And this should be fixed in driver.

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-19 17:52                   ` Oleg Nesterov
@ 2014-03-19 18:29                     ` Oleg Nesterov
  2014-03-19 19:42                       ` Oleg Nesterov
  0 siblings, 1 reply; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-19 18:29 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Tetsuo Handa, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy,
	rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team,
	linux-scsi

On 03/19, Oleg Nesterov wrote:
>
> But please do not forget that the kernel crashes. Whatever else we do, this
> should be fixed anyway. And this should be fixed in driver.

drivers/message/fusion/ is obviously buggy.

mptsas_probe() does

		sh = scsi_host_alloc(...);
		if (!sh) {
			...
			goto out_mptsas_probe;
		}
		...
	out_mptsas_probe:
		mptscsih_remove(pdev);

and mptscsih_remove() blindly calls scsi_remove_host(ioc->sh) but ->sh
was not initialized, probably it is NULL.

and scsi_remove_host(host) obviously assumes that this pointer is valid.

I think we should wait for maintainers.

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-19 18:29                     ` please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable) Oleg Nesterov
@ 2014-03-19 19:42                       ` Oleg Nesterov
  2014-03-19 21:04                         ` Joseph Salisbury
  2014-03-20 16:46                         ` Joseph Salisbury
  0 siblings, 2 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-19 19:42 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Tetsuo Handa, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy,
	rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team,
	linux-scsi

On 03/19, Oleg Nesterov wrote:
>
> On 03/19, Oleg Nesterov wrote:
> >
> > But please do not forget that the kernel crashes. Whatever else we do, this
> > should be fixed anyway. And this should be fixed in driver.
>
> drivers/message/fusion/ is obviously buggy.

Perhaps this is the only problem and Tetsuo is right, this driver
really needs more than 30 secs to probe...

But if you have a bit of free time, perhaps you can try the stupid
debugging patch below ;) Not sure it will help, but who knows.

Oleg.

diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 00d339c..5ecc27e 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -5400,12 +5400,16 @@ mptsas_init(void)
 {
 	int error;
 
+	printk(KERN_CRIT "mptsas_init start\n");
+	current->flags |= 0x1;
 	show_mptmod_ver(my_NAME, my_VERSION);
 
 	mptsas_transport_template =
 	    sas_attach_transport(&mptsas_transport_functions);
-	if (!mptsas_transport_template)
-		return -ENODEV;
+	if (!mptsas_transport_template) {
+		error = -ENODEV;
+		goto out;
+	}
 	mptsas_transport_template->eh_timed_out = mptsas_eh_timed_out;
 
 	mptsasDoneCtx = mpt_register(mptscsih_io_done, MPTSAS_DRIVER,
@@ -5428,6 +5432,9 @@ mptsas_init(void)
 	if (error)
 		sas_release_transport(mptsas_transport_template);
 
+out:
+	current->flags &= ~0x1;
+	printk(KERN_CRIT "mptsas_init end\n");
 	return error;
 }
 
diff --git a/kernel/kthread.c b/kernel/kthread.c
index b5ae3ee..78e643d 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -291,6 +291,13 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 	 * the OOM killer while kthreadd is trying to allocate memory for
 	 * new kernel thread.
 	 */
+
+	if (current->flags & 1) {
+		pr_crit("mptsas no killable wait: %d %d\n",
+			signal_pending(current), __fatal_signal_pending(current));
+		goto wait;
+	}
+
 	if (unlikely(wait_for_completion_killable(&done))) {
 		/*
 		 * If I was SIGKILLed before kthreadd (or new kernel thread)
@@ -303,6 +310,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 		 * kthreadd (or new kernel thread) will call complete()
 		 * shortly.
 		 */
+wait:
 		wait_for_completion(&done);
 	}
 	task = create->result;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b46131e..2b202bd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2655,6 +2655,14 @@ static void __sched __schedule(void)
 	unsigned long *switch_count;
 	struct rq *rq;
 	int cpu;
+	bool trace;
+
+	trace = (current->flags & 1) && current->state && !(preempt_count() & PREEMPT_ACTIVE);
+	if (trace) {
+		pr_crit("mptsas sched: %lx %d %d\n", current->state,
+			signal_pending(current), __fatal_signal_pending(current));
+		show_stack(NULL, NULL);
+	}
 
 need_resched:
 	preempt_disable();
@@ -2733,6 +2741,11 @@ need_resched:
 	sched_preempt_enable_no_resched();
 	if (need_resched())
 		goto need_resched;
+
+	if (trace) {
+		pr_crit("mptsas wake: %d %d\n",
+			signal_pending(current), __fatal_signal_pending(current));
+	}
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)
diff --git a/kernel/signal.c b/kernel/signal.c
index 52f881d..d121944 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1152,6 +1152,11 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 {
 	int from_ancestor_ns = 0;
 
+	if (t->flags & 1) {
+		pr_crit("mptsas killed %d\n", sig);
+		sched_show_task(t);
+	}
+
 #ifdef CONFIG_PID_NS
 	from_ancestor_ns = si_fromuser(info) &&
 			   !task_pid_nr_ns(current, task_active_pid_ns(t));


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-19 19:42                       ` Oleg Nesterov
@ 2014-03-19 21:04                         ` Joseph Salisbury
  2014-03-20 16:46                         ` Joseph Salisbury
  1 sibling, 0 replies; 41+ messages in thread
From: Joseph Salisbury @ 2014-03-19 21:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tetsuo Handa, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy,
	rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team,
	linux-scsi

On 03/19/2014 03:42 PM, Oleg Nesterov wrote:
> On 03/19, Oleg Nesterov wrote:
>> On 03/19, Oleg Nesterov wrote:
>>> But please do not forget that the kernel crashes. Whatever else we do, this
>>> should be fixed anyway. And this should be fixed in driver.
>> drivers/message/fusion/ is obviously buggy.
> Perhaps this is the only problem and Tetsuo is right, this driver
> really needs more than 30 secs to probe...
>
> But if you have a bit of free time, perhaps you can try the stupid
> debugging patch below ;) Not sure it will help, but who knows.
>
> Oleg.
>
> diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
> index 00d339c..5ecc27e 100644
> --- a/drivers/message/fusion/mptsas.c
> +++ b/drivers/message/fusion/mptsas.c
> @@ -5400,12 +5400,16 @@ mptsas_init(void)
>  {
>  	int error;
>  
> +	printk(KERN_CRIT "mptsas_init start\n");
> +	current->flags |= 0x1;
>  	show_mptmod_ver(my_NAME, my_VERSION);
>  
>  	mptsas_transport_template =
>  	    sas_attach_transport(&mptsas_transport_functions);
> -	if (!mptsas_transport_template)
> -		return -ENODEV;
> +	if (!mptsas_transport_template) {
> +		error = -ENODEV;
> +		goto out;
> +	}
>  	mptsas_transport_template->eh_timed_out = mptsas_eh_timed_out;
>  
>  	mptsasDoneCtx = mpt_register(mptscsih_io_done, MPTSAS_DRIVER,
> @@ -5428,6 +5432,9 @@ mptsas_init(void)
>  	if (error)
>  		sas_release_transport(mptsas_transport_template);
>  
> +out:
> +	current->flags &= ~0x1;
> +	printk(KERN_CRIT "mptsas_init end\n");
>  	return error;
>  }
>  
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index b5ae3ee..78e643d 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -291,6 +291,13 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  	 * the OOM killer while kthreadd is trying to allocate memory for
>  	 * new kernel thread.
>  	 */
> +
> +	if (current->flags & 1) {
> +		pr_crit("mptsas no killable wait: %d %d\n",
> +			signal_pending(current), __fatal_signal_pending(current));
> +		goto wait;
> +	}
> +
>  	if (unlikely(wait_for_completion_killable(&done))) {
>  		/*
>  		 * If I was SIGKILLed before kthreadd (or new kernel thread)
> @@ -303,6 +310,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  		 * kthreadd (or new kernel thread) will call complete()
>  		 * shortly.
>  		 */
> +wait:
>  		wait_for_completion(&done);
>  	}
>  	task = create->result;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b46131e..2b202bd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2655,6 +2655,14 @@ static void __sched __schedule(void)
>  	unsigned long *switch_count;
>  	struct rq *rq;
>  	int cpu;
> +	bool trace;
> +
> +	trace = (current->flags & 1) && current->state && !(preempt_count() & PREEMPT_ACTIVE);
> +	if (trace) {
> +		pr_crit("mptsas sched: %lx %d %d\n", current->state,
> +			signal_pending(current), __fatal_signal_pending(current));
> +		show_stack(NULL, NULL);
> +	}
>  
>  need_resched:
>  	preempt_disable();
> @@ -2733,6 +2741,11 @@ need_resched:
>  	sched_preempt_enable_no_resched();
>  	if (need_resched())
>  		goto need_resched;
> +
> +	if (trace) {
> +		pr_crit("mptsas wake: %d %d\n",
> +			signal_pending(current), __fatal_signal_pending(current));
> +	}
>  }
>  
>  static inline void sched_submit_work(struct task_struct *tsk)
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 52f881d..d121944 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1152,6 +1152,11 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
>  {
>  	int from_ancestor_ns = 0;
>  
> +	if (t->flags & 1) {
> +		pr_crit("mptsas killed %d\n", sig);
> +		sched_show_task(t);
> +	}
> +
>  #ifdef CONFIG_PID_NS
>  	from_ancestor_ns = si_fromuser(info) &&
>  			   !task_pid_nr_ns(current, task_active_pid_ns(t));
>
Thanks for the patch, Oleg.  I built a test kernel and asked the bug
reporter to test it [0].

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705/comments/56

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-19 19:42                       ` Oleg Nesterov
  2014-03-19 21:04                         ` Joseph Salisbury
@ 2014-03-20 16:46                         ` Joseph Salisbury
  2014-03-20 19:23                           ` Oleg Nesterov
  1 sibling, 1 reply; 41+ messages in thread
From: Joseph Salisbury @ 2014-03-20 16:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tetsuo Handa, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy,
	rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team,
	linux-scsi

On 03/19/2014 03:42 PM, Oleg Nesterov wrote:
> On 03/19, Oleg Nesterov wrote:
>> On 03/19, Oleg Nesterov wrote:
>>> But please do not forget that the kernel crashes. Whatever else we do, this
>>> should be fixed anyway. And this should be fixed in driver.
>> drivers/message/fusion/ is obviously buggy.
> Perhaps this is the only problem and Tetsuo is right, this driver
> really needs more than 30 secs to probe...
>
> But if you have a bit of free time, perhaps you can try the stupid
> debugging patch below ;) Not sure it will help, but who knows.
>
> Oleg.

There was some testing done with your test kernel.  The data collected
while running your kernel is available in the bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705/comments/58

Thanks again for the assistance!

>
> diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
> index 00d339c..5ecc27e 100644
> --- a/drivers/message/fusion/mptsas.c
> +++ b/drivers/message/fusion/mptsas.c
> @@ -5400,12 +5400,16 @@ mptsas_init(void)
>  {
>  	int error;
>  
> +	printk(KERN_CRIT "mptsas_init start\n");
> +	current->flags |= 0x1;
>  	show_mptmod_ver(my_NAME, my_VERSION);
>  
>  	mptsas_transport_template =
>  	    sas_attach_transport(&mptsas_transport_functions);
> -	if (!mptsas_transport_template)
> -		return -ENODEV;
> +	if (!mptsas_transport_template) {
> +		error = -ENODEV;
> +		goto out;
> +	}
>  	mptsas_transport_template->eh_timed_out = mptsas_eh_timed_out;
>  
>  	mptsasDoneCtx = mpt_register(mptscsih_io_done, MPTSAS_DRIVER,
> @@ -5428,6 +5432,9 @@ mptsas_init(void)
>  	if (error)
>  		sas_release_transport(mptsas_transport_template);
>  
> +out:
> +	current->flags &= ~0x1;
> +	printk(KERN_CRIT "mptsas_init end\n");
>  	return error;
>  }
>  
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index b5ae3ee..78e643d 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -291,6 +291,13 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  	 * the OOM killer while kthreadd is trying to allocate memory for
>  	 * new kernel thread.
>  	 */
> +
> +	if (current->flags & 1) {
> +		pr_crit("mptsas no killable wait: %d %d\n",
> +			signal_pending(current), __fatal_signal_pending(current));
> +		goto wait;
> +	}
> +
>  	if (unlikely(wait_for_completion_killable(&done))) {
>  		/*
>  		 * If I was SIGKILLed before kthreadd (or new kernel thread)
> @@ -303,6 +310,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  		 * kthreadd (or new kernel thread) will call complete()
>  		 * shortly.
>  		 */
> +wait:
>  		wait_for_completion(&done);
>  	}
>  	task = create->result;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b46131e..2b202bd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2655,6 +2655,14 @@ static void __sched __schedule(void)
>  	unsigned long *switch_count;
>  	struct rq *rq;
>  	int cpu;
> +	bool trace;
> +
> +	trace = (current->flags & 1) && current->state && !(preempt_count() & PREEMPT_ACTIVE);
> +	if (trace) {
> +		pr_crit("mptsas sched: %lx %d %d\n", current->state,
> +			signal_pending(current), __fatal_signal_pending(current));
> +		show_stack(NULL, NULL);
> +	}
>  
>  need_resched:
>  	preempt_disable();
> @@ -2733,6 +2741,11 @@ need_resched:
>  	sched_preempt_enable_no_resched();
>  	if (need_resched())
>  		goto need_resched;
> +
> +	if (trace) {
> +		pr_crit("mptsas wake: %d %d\n",
> +			signal_pending(current), __fatal_signal_pending(current));
> +	}
>  }
>  
>  static inline void sched_submit_work(struct task_struct *tsk)
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 52f881d..d121944 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1152,6 +1152,11 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
>  {
>  	int from_ancestor_ns = 0;
>  
> +	if (t->flags & 1) {
> +		pr_crit("mptsas killed %d\n", sig);
> +		sched_show_task(t);
> +	}
> +
>  #ifdef CONFIG_PID_NS
>  	from_ancestor_ns = si_fromuser(info) &&
>  			   !task_pid_nr_ns(current, task_active_pid_ns(t));
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-20 16:46                         ` Joseph Salisbury
@ 2014-03-20 19:23                           ` Oleg Nesterov
  2014-03-21 18:34                             ` Oleg Nesterov
  0 siblings, 1 reply; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-20 19:23 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Tetsuo Handa, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy,
	rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team,
	linux-scsi

On 03/20, Joseph Salisbury wrote:
>
> On 03/19/2014 03:42 PM, Oleg Nesterov wrote:
> > On 03/19, Oleg Nesterov wrote:
> >> On 03/19, Oleg Nesterov wrote:
> >>> But please do not forget that the kernel crashes. Whatever else we do, this
> >>> should be fixed anyway. And this should be fixed in driver.
> >> drivers/message/fusion/ is obviously buggy.
> > Perhaps this is the only problem and Tetsuo is right, this driver
> > really needs more than 30 secs to probe...
> >
> > But if you have a bit of free time, perhaps you can try the stupid
> > debugging patch below ;) Not sure it will help, but who knows.
> >
> > Oleg.
>
> There was some testing done with your test kernel.  The data collected
> while running your kernel is available in the bug report:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705/comments/58

Joseph, thanks a lot.

I'll try to read the logs tomorrow, but at first glance Tetsuo was right,
I do not see a "long" sleep in that log. And it shows that the hack around
kthread_run() in scsi_host_alloc() won't help. I am wondering why I didn't
realize this before ;)

Hmm. Perhaps we should simply change mptsas_init() to block all signals
until we have the right fix.

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-20 19:23                           ` Oleg Nesterov
@ 2014-03-21 18:34                             ` Oleg Nesterov
  2014-03-21 19:32                               ` Linus Torvalds
  2014-03-22  6:25                               ` please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable) Tetsuo Handa
  0 siblings, 2 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-21 18:34 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Tetsuo Handa, JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy,
	rientjes, akpm, torvalds, tj, tglx, linux-kernel, kernel-team,
	linux-scsi, Tomas Henzl

On 03/20, Oleg Nesterov wrote:
>
> On 03/20, Joseph Salisbury wrote:
> >
> > There was some testing done with your test kernel.  The data collected
> > while running your kernel is available in the bug report:
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705/comments/58
>
> Joseph, thanks a lot.
>
> I'll try to read the logs tomorrow, but at first glance Tetsuo was right,
> I do not see a "long" sleep in that log.

Yes, it seems that it actually needs > 30 secs. It spends most of the time
(30.13286 seconds) in

	msleep+0x20/0x30
	WaitForDoorbellInt+0x103/0x130 [mptbase]
	WaitForDoorbellReply+0x42/0x220 [mptbase]
	mpt_handshake_req_reply_wait+0x1dc/0x2c0 [mptbase]
	SendPortEnable.constprop.23+0x94/0xc0 [mptbase]

WaitForDoorbellInt() does msleep(1) in a loop. This trace starts at the line
6001, and it is repeated 3792 times, till the line 176686 which apparently
shows the trace of the 2nd WaitForDoorbellInt() in WaitForDoorbellReply().

SendPortEnable:

	if (ioc->ir_firmware || ioc->bus_type == SAS) {
		rc = mpt_handshake_req_reply_wait(ioc, req_sz,
		(u32*)&port_enable, reply_sz, (u16*)&reply_buf,
		300 /*seconds*/, sleepFlag);
	} else {
		rc = mpt_handshake_req_reply_wait(ioc, req_sz,
		(u32*)&port_enable, reply_sz, (u16*)&reply_buf,
		30 /*seconds*/, sleepFlag);
	}

I am wondering which branch calls mpt_handshake_req_reply_wait(), the
else's timeout=30 (passed to the 1st WaitForDoorbellInt) suspiciously
matches the time WaitForDoorbellInt() actually runs... But everything
works fine and at first glance the potential timeout error should be
propogated correctly. So "timeout" is probably 300. And probably this
all is fine.

All I can suggest is the dirty hack for now. User-space should be
changed too, I think, but this is another story.

Tetsuo, what do you think?

Oleg.
---


--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -5395,6 +5395,8 @@ static struct pci_driver mptsas_driver = {
 #endif
 };
 
+#include <linux/signal.h>
+
 static int __init
 mptsas_init(void)
 {
@@ -5424,7 +5426,31 @@ mptsas_init(void)
 	mpt_event_register(mptsasDoneCtx, mptsas_event_process);
 	mpt_reset_register(mptsasDoneCtx, mptsas_ioc_reset);
 
-	error = pci_register_driver(&mptsas_driver);
+	{
+		sigset_t full, save;
+		/*
+		 * KILL ME. THIS IS THE DIRTY AND WRONG HACK WAITING FOR THE
+		 * FIX FROM MAINTAINERS.
+		 *
+		 * - This driver needs a lot of time to complete SendPortEnable()
+		 *   but systemd-udevd sends SIGKILL after 30 seconds, see
+		 *   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705
+		 *
+		 *   Probably user-space should be changed, but:
+		 *
+		 * - Since commit 786235eeba0e "kthread: make kthread_create()
+		 *   killable" scsi_host_alloc() becomes killable and this SIGKILL
+		 *   crashes the kernel.
+		 *
+		 *   If scsi_host_alloc() fails mptsas_probe() blindly calls
+		 *   mptscsih_remove() and scsi_remove_host() hits host == NULL.
+		 */
+		sigfillset(&full);
+		sigprocmask(SIG_SETMASK, &full, &save);
+		error = pci_register_driver(&mptsas_driver);
+		sigprocmask(SIG_SETMASK, &save, NULL);
+	}
+
 	if (error)
 		sas_release_transport(mptsas_transport_template);
 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-21 18:34                             ` Oleg Nesterov
@ 2014-03-21 19:32                               ` Linus Torvalds
  2014-03-21 20:31                                 ` Oleg Nesterov
  2014-03-21 22:56                                 ` James Bottomley
  2014-03-22  6:25                               ` please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable) Tetsuo Handa
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2014-03-21 19:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Joseph Salisbury, Tetsuo Handa, James Bottomley,
	Nagalakshmi.Nandigama, Sreekanth.Reddy, David Rientjes,
	Andrew Morton, Tejun Heo, Thomas Gleixner,
	Linux Kernel Mailing List, Ubuntu Kernel Team, Linux SCSI List,
	Tomas Henzl

On Fri, Mar 21, 2014 at 11:34 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Yes, it seems that it actually needs > 30 secs. It spends most of the time
> (30.13286 seconds) in [..]

So how about taking a completely different approach:

 - just say that waiting for devices in the module init sequence for
over 30 seconds is really really wrong.

 - make the damn mptsas driver just register the controller from the
init sequence, and then do device discovery asynchronously.

The ATA layer does this correctly: it synchronously finds each host,
but then it does

        /* perform each probe asynchronously */
        for (i = 0; i < host->n_ports; i++) {
                struct ata_port *ap = host->ports[i];
                async_schedule(async_port_probe, ap);
        }

and I really think SCSI drivers should do the same if they have this
kind of "ports can take forever to probe" behavior.

What would be the equivalent magic to do this for SCSI? Could we just
make something like scsi_probe_and_add_lun() just always do this, the
same way ata_host_register() does it?

                   Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-21 19:32                               ` Linus Torvalds
@ 2014-03-21 20:31                                 ` Oleg Nesterov
  2014-03-21 22:56                                 ` James Bottomley
  1 sibling, 0 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-21 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joseph Salisbury, Tetsuo Handa, James Bottomley,
	Nagalakshmi.Nandigama, Sreekanth.Reddy, David Rientjes,
	Andrew Morton, Tejun Heo, Thomas Gleixner,
	Linux Kernel Mailing List, Ubuntu Kernel Team, Linux SCSI List,
	Tomas Henzl

On 03/21, Linus Torvalds wrote:
>
> On Fri, Mar 21, 2014 at 11:34 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, it seems that it actually needs > 30 secs. It spends most of the time
> > (30.13286 seconds) in [..]
>
> So how about taking a completely different approach:

Due to the lack of knowledge I can not comment (or even actually
understand) your suggestion.

But it probably falls into the "right fix" category, iow "THE FIX
FROM MAINTAINERS" the temporary sigprocmask() hack should wait for.

I can't implement this by the same reason. And of course I won't
insist that we need some really stupid (and wrong) solution until
someone who understand makes the initialization asynchronous.


>  - just say that waiting for devices in the module init sequence for
> over 30 seconds is really really wrong.
>
>  - make the damn mptsas driver just register the controller from the
> init sequence, and then do device discovery asynchronously.
>
> The ATA layer does this correctly: it synchronously finds each host,
> but then it does
>
>         /* perform each probe asynchronously */
>         for (i = 0; i < host->n_ports; i++) {
>                 struct ata_port *ap = host->ports[i];
>                 async_schedule(async_port_probe, ap);
>         }
>
> and I really think SCSI drivers should do the same if they have this
> kind of "ports can take forever to probe" behavior.
>
> What would be the equivalent magic to do this for SCSI? Could we just
> make something like scsi_probe_and_add_lun() just always do this, the
> same way ata_host_register() does it?
>
>                    Linus


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable)
  2014-03-21 19:32                               ` Linus Torvalds
  2014-03-21 20:31                                 ` Oleg Nesterov
@ 2014-03-21 22:56                                 ` James Bottomley
  1 sibling, 0 replies; 41+ messages in thread
From: James Bottomley @ 2014-03-21 22:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Joseph Salisbury, Tetsuo Handa,
	Nagalakshmi.Nandigama, Sreekanth.Reddy, David Rientjes,
	Andrew Morton, Tejun Heo, Thomas Gleixner,
	Linux Kernel Mailing List, Ubuntu Kernel Team, Linux SCSI List,
	Tomas Henzl

On Fri, 2014-03-21 at 12:32 -0700, Linus Torvalds wrote:
> On Fri, Mar 21, 2014 at 11:34 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, it seems that it actually needs > 30 secs. It spends most of the time
> > (30.13286 seconds) in [..]
> 
> So how about taking a completely different approach:
> 
>  - just say that waiting for devices in the module init sequence for
> over 30 seconds is really really wrong.
> 
>  - make the damn mptsas driver just register the controller from the
> init sequence, and then do device discovery asynchronously.
> 
> The ATA layer does this correctly: it synchronously finds each host,
> but then it does
> 
>         /* perform each probe asynchronously */
>         for (i = 0; i < host->n_ports; i++) {
>                 struct ata_port *ap = host->ports[i];
>                 async_schedule(async_port_probe, ap);
>         }
> 
> and I really think SCSI drivers should do the same if they have this
> kind of "ports can take forever to probe" behavior.
> 
> What would be the equivalent magic to do this for SCSI? Could we just
> make something like scsi_probe_and_add_lun() just always do this, the
> same way ata_host_register() does it?

Well, we do do this asynchronously.  The idea is that the add host only
initialises the actual hardware.  The port probing is supposed to be
done asynchronously (provided the async probe option is enabled in SCSI,
of course).  The way this is supposed to happen is the driver
initialises the hardware and then calls scsi_scan_host().  If the
platform is set up for async scanning, that kicks off all the async
workqueues and returns (or does it all synchronously if async scanning
isn't enabled).

It is possible fusion gets this wrong because the sas driver doesn't
really couple to SCSI's libsas, which is where it would pick up most of
the generic infrastructure for this.  Plus it depends where all the time
is being wasted.  The fusion was the last sas chipset I got the specs
for (under NDA).  It's actually table driven, so if the problem is the
controller taking ages to fill in the tables it might necessitate a
fusion specific fix.  I can see from the driver that it seems to do all
the probing itself instead of relying on probe callbacks from
scsi_scan_host(), so I know what needs to be fixed ... it's less clear
how easy this would be given how monolithic the routine looks.

James



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-21 18:34                             ` Oleg Nesterov
  2014-03-21 19:32                               ` Linus Torvalds
@ 2014-03-22  6:25                               ` Tetsuo Handa
  2014-03-22 19:25                                 ` Oleg Nesterov
  1 sibling, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-22  6:25 UTC (permalink / raw)
  To: oleg, joseph.salisbury
  Cc: JBottomley, Nagalakshmi.Nandigama, Sreekanth.Reddy, rientjes,
	akpm, torvalds, tj, tglx, linux-kernel, kernel-team, linux-scsi,
	thenzl

Oleg Nesterov wrote:
> Tetsuo, what do you think?

I don't like blocking SIGKILL while doing operations that depend on memory
allocation by other processes. If the OOM killer is triggered and it chose
the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
it generates the OOM killer deadlock.

My preference is to fix kthread_create() to handle SIGKILL gracefully.
kthread_create() did not return upon SIGKILL before commit 786235ee.
Since commit 786235ee, there is imbalance that "kmalloc(GFP_KERNEL) in
kthread_create_on_node() ignores SIGKILL unless TIF_MEMDIE is set" but
"wait_for_completion_killable() in kthread_create_on_node() does not ignore
SIGKILL even if TIF_MEMDIE is not set".

Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
commit 786235ee sounds like a kernel API breakage.
----------
>From 731f1f6dec7efaa132f751c432079b9b1fdb78d2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 22 Mar 2014 15:16:50 +0900
Subject: [PATCH] kthread: Handle SIGKILL gracefully in kthread_create().

Commit 786235ee "kthread: make kthread_create() killable" changed to
leave kthread_create() as soon as receiving SIGKILL. But this change
caused boot failures because systemd-udevd receives SIGKILL due to timeout
while loading SCSI controller drivers using finit_module() [1].

Therefore, this patch changes kthread_create() from "wait forever in
killable state" to "wait for 10 seconds in unkillable state, check for
the OOM killer every second".

This patch also changes the return value of timeout case from -ENOMEM
to -EINTR because -ENOMEM could make sense for only TIF_MEMDIE case.

  [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/kthread.c |   37 +++++++++++++++++++++----------------
 1 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index b5ae3ee..6a25a9f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -269,6 +269,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 					   const char namefmt[],
 					   ...)
 {
+	int i = 0;
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct task_struct *task;
 	struct kthread_create_info *create = kmalloc(sizeof(*create),
@@ -287,24 +288,28 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 
 	wake_up_process(kthreadd_task);
 	/*
-	 * Wait for completion in killable state, for I might be chosen by
-	 * the OOM killer while kthreadd is trying to allocate memory for
-	 * new kernel thread.
+	 * Wait for completion with 10 seconds timeout. Unless the kthreadd is
+	 * blocked for direct memory reclaim path, the kthreadd will be able to
+	 * complete the request within 10 seconds. Also, check every second if
+	 * I was chosen by the OOM killer in order to avoid the OOM killer
+	 * deadlock.
 	 */
-	if (unlikely(wait_for_completion_killable(&done))) {
-		/*
-		 * If I was SIGKILLed before kthreadd (or new kernel thread)
-		 * calls complete(), leave the cleanup of this structure to
-		 * that thread.
-		 */
-		if (xchg(&create->done, NULL))
-			return ERR_PTR(-ENOMEM);
-		/*
-		 * kthreadd (or new kernel thread) will call complete()
-		 * shortly.
-		 */
-		wait_for_completion(&done);
+	do {
+		if (likely(wait_for_completion_timeout(&done, HZ)))
+			goto ready;
+	} while (i++ < 10 && !test_thread_flag(TIF_MEMDIE));
+	/*
+	 * The kthreadd was unable to complete the request within 10 seconds
+	 * (or I was chosen by the OOM killer). Thus, give up and leave the
+	 * cleanup of this structure to the kthreadd (or new kernel thread).
+	 */
+	if (xchg(&create->done, NULL)) {
+		WARN(1, "Gave up waiting for kthreadd.\n");
+		return ERR_PTR(-EINTR);
 	}
+	/* kthreadd (or new kernel thread) will call complete() shortly. */
+	wait_for_completion(&done);
+ready:
 	task = create->result;
 	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22  6:25                               ` please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable) Tetsuo Handa
@ 2014-03-22 19:25                                 ` Oleg Nesterov
  2014-03-22 20:48                                   ` James Bottomley
                                                     ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-22 19:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: joseph.salisbury, JBottomley, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, tglx,
	linux-kernel, kernel-team, linux-scsi, thenzl

On 03/22, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > Tetsuo, what do you think?
>
> I don't like blocking SIGKILL while doing operations that depend on memory
> allocation by other processes. If the OOM killer is triggered and it chose
> the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
> it generates the OOM killer deadlock.

It seems that we do not understand each other.

I do not like this hack too. And it is even wrong, you can't really block
SIGKILL. And it is unnecessary in a sense that (I think) it is fine that
module_init() reacts to SIGKILL and aborts, just the fact it crashes the
kernel in the error paths is not fine.

The driver should be fixed anyway. As for timeout, either userspace/systemd
should be changed to not send SIGKILL after 30 secs, or (better) the driver
should be changed to not waste 30 secs.

The hack I sent can only serve as a short term solution, it should be
reverted once we have something better. And, otoh, this hack only affects
the problematic driver which should be fixed in any case.

Do you see my point? I can be wrong, but I think that you constantly
misunderstand the intent.

> My preference is to fix kthread_create() to handle SIGKILL gracefully.

And now I do not understand you too. I do not understand why should we
"fix" kthread_create().

> Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> commit 786235ee sounds like a kernel API breakage.

Personally I do not really think so, but OK. In this case lets revert
786235ee.

> Commit 786235ee "kthread: make kthread_create() killable" changed to
> leave kthread_create() as soon as receiving SIGKILL. But this change
> caused boot failures because systemd-udevd receives SIGKILL due to timeout
> while loading SCSI controller drivers using finit_module() [1].

And I still think that 786235ee simply uncovered the problems in this
driver. Perhaps we should change kthread_create() for some reason, but
(imho) not because we need to help the buggy code.

> Therefore, this patch changes kthread_create() from "wait forever in
> killable state" to "wait for 10 seconds in unkillable state, check for
> the OOM killer every second".

Personally I dislike this change. In fact I think it is ugly. But this
is only my opinion.

If you convince someone to take this patch I won't argue.


> This patch also changes the return value of timeout case from -ENOMEM
> to -EINTR because -ENOMEM could make sense for only TIF_MEMDIE case.
>
>   [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  kernel/kthread.c |   37 +++++++++++++++++++++----------------
>  1 files changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index b5ae3ee..6a25a9f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -269,6 +269,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  					   const char namefmt[],
>  					   ...)
>  {
> +	int i = 0;
>  	DECLARE_COMPLETION_ONSTACK(done);
>  	struct task_struct *task;
>  	struct kthread_create_info *create = kmalloc(sizeof(*create),
> @@ -287,24 +288,28 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>
>  	wake_up_process(kthreadd_task);
>  	/*
> -	 * Wait for completion in killable state, for I might be chosen by
> -	 * the OOM killer while kthreadd is trying to allocate memory for
> -	 * new kernel thread.
> +	 * Wait for completion with 10 seconds timeout. Unless the kthreadd is
> +	 * blocked for direct memory reclaim path, the kthreadd will be able to
> +	 * complete the request within 10 seconds. Also, check every second if
> +	 * I was chosen by the OOM killer in order to avoid the OOM killer
> +	 * deadlock.
>  	 */
> -	if (unlikely(wait_for_completion_killable(&done))) {
> -		/*
> -		 * If I was SIGKILLed before kthreadd (or new kernel thread)
> -		 * calls complete(), leave the cleanup of this structure to
> -		 * that thread.
> -		 */
> -		if (xchg(&create->done, NULL))
> -			return ERR_PTR(-ENOMEM);
> -		/*
> -		 * kthreadd (or new kernel thread) will call complete()
> -		 * shortly.
> -		 */
> -		wait_for_completion(&done);
> +	do {
> +		if (likely(wait_for_completion_timeout(&done, HZ)))
> +			goto ready;
> +	} while (i++ < 10 && !test_thread_flag(TIF_MEMDIE));
> +	/*
> +	 * The kthreadd was unable to complete the request within 10 seconds
> +	 * (or I was chosen by the OOM killer). Thus, give up and leave the
> +	 * cleanup of this structure to the kthreadd (or new kernel thread).
> +	 */
> +	if (xchg(&create->done, NULL)) {
> +		WARN(1, "Gave up waiting for kthreadd.\n");
> +		return ERR_PTR(-EINTR);
>  	}
> +	/* kthreadd (or new kernel thread) will call complete() shortly. */
> +	wait_for_completion(&done);
> +ready:
>  	task = create->result;
>  	if (!IS_ERR(task)) {
>  		static const struct sched_param param = { .sched_priority = 0 };
> --
> 1.7.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22 19:25                                 ` Oleg Nesterov
@ 2014-03-22 20:48                                   ` James Bottomley
  2014-03-24 17:01                                     ` Oleg Nesterov
  2014-03-22 21:25                                   ` Thomas Gleixner
  2014-03-22 23:50                                   ` Tetsuo Handa
  2 siblings, 1 reply; 41+ messages in thread
From: James Bottomley @ 2014-03-22 20:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tetsuo Handa, joseph.salisbury, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, tglx,
	linux-kernel, kernel-team, linux-scsi, thenzl

On Sat, 2014-03-22 at 20:25 +0100, Oleg Nesterov wrote:
> On 03/22, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > Tetsuo, what do you think?
> >
> > I don't like blocking SIGKILL while doing operations that depend on memory
> > allocation by other processes. If the OOM killer is triggered and it chose
> > the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
> > it generates the OOM killer deadlock.
> 
> It seems that we do not understand each other.
> 
> I do not like this hack too. And it is even wrong, you can't really block
> SIGKILL. And it is unnecessary in a sense that (I think) it is fine that
> module_init() reacts to SIGKILL and aborts, just the fact it crashes the
> kernel in the error paths is not fine.
> 
> The driver should be fixed anyway. As for timeout, either userspace/systemd
> should be changed to not send SIGKILL after 30 secs, or (better) the driver
> should be changed to not waste 30 secs.
> 
> The hack I sent can only serve as a short term solution, it should be
> reverted once we have something better. And, otoh, this hack only affects
> the problematic driver which should be fixed in any case.
> 
> Do you see my point? I can be wrong, but I think that you constantly
> misunderstand the intent.
> 
> > My preference is to fix kthread_create() to handle SIGKILL gracefully.
> 
> And now I do not understand you too. I do not understand why should we
> "fix" kthread_create().
> 
> > Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> > ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> > commit 786235ee sounds like a kernel API breakage.
> 
> Personally I do not really think so, but OK. In this case lets revert
> 786235ee.
> 
> > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > leave kthread_create() as soon as receiving SIGKILL. But this change
> > caused boot failures because systemd-udevd receives SIGKILL due to timeout
> > while loading SCSI controller drivers using finit_module() [1].
> 
> And I still think that 786235ee simply uncovered the problems in this
> driver. Perhaps we should change kthread_create() for some reason, but
> (imho) not because we need to help the buggy code.
> 
> > Therefore, this patch changes kthread_create() from "wait forever in
> > killable state" to "wait for 10 seconds in unkillable state, check for
> > the OOM killer every second".
> 
> Personally I dislike this change. In fact I think it is ugly. But this
> is only my opinion.
> 
> If you convince someone to take this patch I won't argue.

OK, the fix from the SCSI point of view is to make the mpt teardown path
actually work.  The whole thing looks to be a complete mess because
there's another place where it will do the wrong thing in
mptscsih_remove().  You always have to call mpt_detach() otherwise the
device doesn't get removed from the lists.  In theory this patch fixes
both bugs in the driver.

James

---

diff --git a/drivers/message/fusion/mptscsih.c b/drivers/message/fusion/mptscsih.c
index 727819c..282d39a 100644
--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -1176,10 +1176,14 @@ mptscsih_remove(struct pci_dev *pdev)
 	MPT_SCSI_HOST		*hd;
 	int sz1;
 
+	if (!host)
+		/* not brought up far enough to do scsi_host_attach() */
+		goto out;
+
 	scsi_remove_host(host);
 
 	if((hd = shost_priv(host)) == NULL)
-		return;
+		goto out;
 
 	mptscsih_shutdown(pdev);
 
@@ -1203,6 +1207,7 @@ mptscsih_remove(struct pci_dev *pdev)
 
 	scsi_host_put(host);
 
+ out:
 	mpt_detach(pdev);
 
 }



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22 19:25                                 ` Oleg Nesterov
  2014-03-22 20:48                                   ` James Bottomley
@ 2014-03-22 21:25                                   ` Thomas Gleixner
  2014-03-22 22:01                                     ` Thomas Gleixner
  2014-03-22 23:50                                   ` Tetsuo Handa
  2 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2014-03-22 21:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tetsuo Handa, joseph.salisbury, JBottomley,
	Nagalakshmi.Nandigama, Sreekanth.Reddy, rientjes, akpm, torvalds,
	tj, linux-kernel, kernel-team, linux-scsi, thenzl

On Sat, 22 Mar 2014, Oleg Nesterov wrote:
> On 03/22, Tetsuo Handa wrote:
> > Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> > ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> > commit 786235ee sounds like a kernel API breakage.
> 
> Personally I do not really think so, but OK. In this case lets revert
> 786235ee.

We have explicitely:

    mutex_lock, mutex_lock_killable and mutex_lock_interruptible.

Ditto for wait_event and wait_for_completion.

So the existance of the uninterruptible versions does not make an
argument for the kthread_create() case.
 
> > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > leave kthread_create() as soon as receiving SIGKILL. But this change
> > caused boot failures because systemd-udevd receives SIGKILL due to timeout
> > while loading SCSI controller drivers using finit_module() [1].
> 
> And I still think that 786235ee simply uncovered the problems in this
> driver. Perhaps we should change kthread_create() for some reason, but
> (imho) not because we need to help the buggy code.

Right.
 
> > Therefore, this patch changes kthread_create() from "wait forever in
> > killable state" to "wait for 10 seconds in unkillable state, check for
> > the OOM killer every second".
> 
> Personally I dislike this change. In fact I think it is ugly. But this
> is only my opinion.

It's not only ugly, it's activly wrong. It's as wrong as 786235ee
itself. And 786235ee needs to be reverted and the revert must go into
3.13.stable as well. I'll send a revert request in separate mail.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22 21:25                                   ` Thomas Gleixner
@ 2014-03-22 22:01                                     ` Thomas Gleixner
  2014-03-22 23:57                                       ` please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable) Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2014-03-22 22:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tetsuo Handa, joseph.salisbury, JBottomley,
	Nagalakshmi.Nandigama, Sreekanth.Reddy, rientjes, akpm, torvalds,
	tj, linux-kernel, kernel-team, linux-scsi, thenzl

On Sat, 22 Mar 2014, Thomas Gleixner wrote:
> On Sat, 22 Mar 2014, Oleg Nesterov wrote:
> > Personally I dislike this change. In fact I think it is ugly. But this
> > is only my opinion.
> 
> It's not only ugly, it's activly wrong. It's as wrong as 786235ee
> itself. And 786235ee needs to be reverted and the revert must go into
> 3.13.stable as well. I'll send a revert request in separate mail.

Sorry. I misread the combo of both patches. 786235ee is correct. We
definitely don't want to revert it. 

But I still think, that changing this to butt ugly heuristics with an
arbitrary chosen timeout is not the proper solution to "fix" a driver
problem and to work around a systemd policy which mandates that
modprobe must return in 30 seconds.

But then systemd/udev mutters:

   "You migh be able to work around the timeout with udev rules and
    OPTIONS+="event_timeout=120", but that code was maybe never used
    or tested, so it might not work correctly." [1]

AFAICT from the ubuntu bug system [2] nobody bothered even to try that.

And if the udev/systemd event_timeout option is broken it's way better
to fix that one instead of hacking random heuristics into the kernel.

Thanks,

	tglx

[1] http://lists.freedesktop.org/archives/systemd-devel/2014-March/018007.html
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22 19:25                                 ` Oleg Nesterov
  2014-03-22 20:48                                   ` James Bottomley
  2014-03-22 21:25                                   ` Thomas Gleixner
@ 2014-03-22 23:50                                   ` Tetsuo Handa
  2 siblings, 0 replies; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-22 23:50 UTC (permalink / raw)
  To: oleg
  Cc: joseph.salisbury, JBottomley, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, tglx,
	linux-kernel, kernel-team, linux-scsi, thenzl

Oleg Nesterov wrote:
> On 03/22, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > Tetsuo, what do you think?
> >
> > I don't like blocking SIGKILL while doing operations that depend on memory
> > allocation by other processes. If the OOM killer is triggered and it chose
> > the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
> > it generates the OOM killer deadlock.
> 
> It seems that we do not understand each other.
> 

I'm agreeing with you regarding long term solution. I think that I and you
do not understand each other regarding which approach should be used for short
term solution.

> I do not like this hack too. And it is even wrong, you can't really block
> SIGKILL. And it is unnecessary in a sense that (I think) it is fine that
> module_init() reacts to SIGKILL and aborts, just the fact it crashes the
> kernel in the error paths is not fine.

I expect that kernel code reacts to SIGKILL and aborts, but current code
(not only fusion but any code) is not ready for reacting to SIGKILL due to
use of uninterruptible versions of lock/wait etc. operators.

> The driver should be fixed anyway. As for timeout, either userspace/systemd
> should be changed to not send SIGKILL after 30 secs, or (better) the driver
> should be changed to not waste 30 secs.

I'm not asserting that we should not fix the driver and the userspace.
I agree that we should fix the driver to respond SIGKILL properly and
fix the userspace not to send SIGKILL on hard coded timeout.

> 
> The hack I sent can only serve as a short term solution, it should be
> reverted once we have something better. And, otoh, this hack only affects
> the problematic driver which should be fixed in any case.
> 
> Do you see my point? I can be wrong, but I think that you constantly
> misunderstand the intent.
> 
> > My preference is to fix kthread_create() to handle SIGKILL gracefully.
> 
> And now I do not understand you too. I do not understand why should we
> "fix" kthread_create().

I can see your point. But as for kernel for Ubuntu 14.04 LTS (which the date
of kernel freeze comes shortly), fixing kthread_create() is the safer choice,
for we haven't proved that the fusion is the only kernel code which is
disturbed by commit 786235ee.

After we confirmed that there is no more kernel code which is disturbed by
commit 786235ee, we can revert the "kthread: Handle SIGKILL gracefully in
kthread_create()." patch.

> 
> > Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> > ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> > commit 786235ee sounds like a kernel API breakage.
> 
> Personally I do not really think so, but OK. In this case lets revert
> 786235ee.
> 
> > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > leave kthread_create() as soon as receiving SIGKILL. But this change
> > caused boot failures because systemd-udevd receives SIGKILL due to timeout
> > while loading SCSI controller drivers using finit_module() [1].
> 
> And I still think that 786235ee simply uncovered the problems in this
> driver. Perhaps we should change kthread_create() for some reason, but
> (imho) not because we need to help the buggy code.
> 

I don't mean to help the buggy code. The "kthread: Handle SIGKILL gracefully
in kthread_create()." patch (or revert commit 786235ee) is short term solution
(especially for distributions which the date of kernel freeze is approaching).

> > Therefore, this patch changes kthread_create() from "wait forever in
> > killable state" to "wait for 10 seconds in unkillable state, check for
> > the OOM killer every second".
> 
> Personally I dislike this change. In fact I think it is ugly. But this
> is only my opinion.
> 
> If you convince someone to take this patch I won't argue.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22 22:01                                     ` Thomas Gleixner
@ 2014-03-22 23:57                                       ` Tetsuo Handa
  2014-03-23  8:04                                         ` Thomas Gleixner
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-03-22 23:57 UTC (permalink / raw)
  To: tglx, oleg
  Cc: joseph.salisbury, JBottomley, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, linux-kernel,
	kernel-team, linux-scsi, thenzl

Thomas Gleixner wrote:
> But then systemd/udev mutters:
> 
>    "You migh be able to work around the timeout with udev rules and
>     OPTIONS+="event_timeout=120", but that code was maybe never used
>     or tested, so it might not work correctly." [1]
> 
> AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> 
> And if the udev/systemd event_timeout option is broken it's way better
> to fix that one instead of hacking random heuristics into the kernel.

I haven't tried the event_timeout= option but I think it will not work.
The timeout is hard coded as shown below and there will be no chance for taking
the event_timeout= option into account.

---------- systemd-204/src/udev/udevd.c start ----------
(...snipped...)
                        /* check for hanging events */
                        udev_list_node_foreach(loop, &worker_list) {
                                struct worker *worker = node_to_worker(loop);

                                if (worker->state != WORKER_RUNNING)
                                        continue;

                                if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
                                        log_error("worker [%u] %s timeout; kill it\n", worker->pid,
                                            worker->event ? worker->event->devpath : "<idle>");
                                        kill(worker->pid, SIGKILL);
                                        worker->state = WORKER_KILLED;
                                        /* drop reference taken for state 'running' */
                                        worker_unref(worker);
                                        if (worker->event) {
                                                log_error("seq %llu '%s' killed\n",
                                                          udev_device_get_seqnum(worker->event->dev), worker->event->devpath);
                                                worker->event->exitcode = -64;
                                                event_queue_delete(worker->event, true);
                                                worker->event = NULL;
                                        }
                                }
                        }
(...snipped...)
---------- systemd-204/src/udev/udevd.c end ----------

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22 23:57                                       ` please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable) Tetsuo Handa
@ 2014-03-23  8:04                                         ` Thomas Gleixner
  2014-03-23 14:19                                           ` James Bottomley
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2014-03-23  8:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: oleg, joseph.salisbury, JBottomley, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, linux-kernel,
	kernel-team, linux-scsi, thenzl

On Sun, 23 Mar 2014, Tetsuo Handa wrote:

> Thomas Gleixner wrote:
> > But then systemd/udev mutters:
> > 
> >    "You migh be able to work around the timeout with udev rules and
> >     OPTIONS+="event_timeout=120", but that code was maybe never used
> >     or tested, so it might not work correctly." [1]
> > 
> > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > 
> > And if the udev/systemd event_timeout option is broken it's way better
> > to fix that one instead of hacking random heuristics into the kernel.
> 
> I haven't tried the event_timeout= option but I think it will not work.
> The timeout is hard coded as shown below and there will be no chance for taking
> the event_timeout= option into account.
> 
> ---------- systemd-204/src/udev/udevd.c start ----------
> (...snipped...)
>                         /* check for hanging events */
>                         udev_list_node_foreach(loop, &worker_list) {
>                                 struct worker *worker = node_to_worker(loop);
> 
>                                 if (worker->state != WORKER_RUNNING)
>                                         continue;
> 
>                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {

And because systemd has an immutable hardcoded random timeout we add
another hardcoded random timeout into kthread_create() to work around
that.

How broken is that?

And it seems other people have solved it:

    http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-23  8:04                                         ` Thomas Gleixner
@ 2014-03-23 14:19                                           ` James Bottomley
  2014-03-23 14:28                                             ` Thomas Gleixner
  0 siblings, 1 reply; 41+ messages in thread
From: James Bottomley @ 2014-03-23 14:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tetsuo Handa, oleg, joseph.salisbury, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, linux-kernel,
	kernel-team, linux-scsi, thenzl

On Sun, 2014-03-23 at 09:04 +0100, Thomas Gleixner wrote:
> On Sun, 23 Mar 2014, Tetsuo Handa wrote:
> 
> > Thomas Gleixner wrote:
> > > But then systemd/udev mutters:
> > > 
> > >    "You migh be able to work around the timeout with udev rules and
> > >     OPTIONS+="event_timeout=120", but that code was maybe never used
> > >     or tested, so it might not work correctly." [1]
> > > 
> > > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > > 
> > > And if the udev/systemd event_timeout option is broken it's way better
> > > to fix that one instead of hacking random heuristics into the kernel.
> > 
> > I haven't tried the event_timeout= option but I think it will not work.
> > The timeout is hard coded as shown below and there will be no chance for taking
> > the event_timeout= option into account.
> > 
> > ---------- systemd-204/src/udev/udevd.c start ----------
> > (...snipped...)
> >                         /* check for hanging events */
> >                         udev_list_node_foreach(loop, &worker_list) {
> >                                 struct worker *worker = node_to_worker(loop);
> > 
> >                                 if (worker->state != WORKER_RUNNING)
> >                                         continue;
> > 
> >                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
> 
> And because systemd has an immutable hardcoded random timeout we add
> another hardcoded random timeout into kthread_create() to work around
> that.
> 
> How broken is that?
> 
> And it seems other people have solved it:
> 
>     http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html

I agree with Thomas.  A hardcoded timeout is a systemd bug.  However,
could I get confirmation, while you can use this bug to do it, that the
patch back in this thread actually fixes the crash when scsi_alloc_host
fails, that's the serious SCSI bug, in my view?

Thanks,

James




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-23 14:19                                           ` James Bottomley
@ 2014-03-23 14:28                                             ` Thomas Gleixner
  2014-03-23 14:29                                               ` James Bottomley
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2014-03-23 14:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tetsuo Handa, oleg, joseph.salisbury, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, linux-kernel,
	kernel-team, linux-scsi, thenzl

On Sun, 23 Mar 2014, James Bottomley wrote:
> On Sun, 2014-03-23 at 09:04 +0100, Thomas Gleixner wrote:
> > On Sun, 23 Mar 2014, Tetsuo Handa wrote:
> > 
> > > Thomas Gleixner wrote:
> > > > But then systemd/udev mutters:
> > > > 
> > > >    "You migh be able to work around the timeout with udev rules and
> > > >     OPTIONS+="event_timeout=120", but that code was maybe never used
> > > >     or tested, so it might not work correctly." [1]
> > > > 
> > > > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > > > 
> > > > And if the udev/systemd event_timeout option is broken it's way better
> > > > to fix that one instead of hacking random heuristics into the kernel.
> > > 
> > > I haven't tried the event_timeout= option but I think it will not work.
> > > The timeout is hard coded as shown below and there will be no chance for taking
> > > the event_timeout= option into account.
> > > 
> > > ---------- systemd-204/src/udev/udevd.c start ----------
> > > (...snipped...)
> > >                         /* check for hanging events */
> > >                         udev_list_node_foreach(loop, &worker_list) {
> > >                                 struct worker *worker = node_to_worker(loop);
> > > 
> > >                                 if (worker->state != WORKER_RUNNING)
> > >                                         continue;
> > > 
> > >                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
> > 
> > And because systemd has an immutable hardcoded random timeout we add
> > another hardcoded random timeout into kthread_create() to work around
> > that.
> > 
> > How broken is that?
> > 
> > And it seems other people have solved it:
> > 
> >     http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html
> 
> I agree with Thomas.  A hardcoded timeout is a systemd bug.  However,
> could I get confirmation, while you can use this bug to do it, that the
> patch back in this thread actually fixes the crash when scsi_alloc_host
> fails, that's the serious SCSI bug, in my view?

Which patch, the one to kthread_create() or the one to SCSI?

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-23 14:28                                             ` Thomas Gleixner
@ 2014-03-23 14:29                                               ` James Bottomley
  0 siblings, 0 replies; 41+ messages in thread
From: James Bottomley @ 2014-03-23 14:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tetsuo Handa, oleg, joseph.salisbury, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, linux-kernel,
	kernel-team, linux-scsi, thenzl

On Sun, 2014-03-23 at 15:28 +0100, Thomas Gleixner wrote:
> On Sun, 23 Mar 2014, James Bottomley wrote:
> > On Sun, 2014-03-23 at 09:04 +0100, Thomas Gleixner wrote:
> > > On Sun, 23 Mar 2014, Tetsuo Handa wrote:
> > > 
> > > > Thomas Gleixner wrote:
> > > > > But then systemd/udev mutters:
> > > > > 
> > > > >    "You migh be able to work around the timeout with udev rules and
> > > > >     OPTIONS+="event_timeout=120", but that code was maybe never used
> > > > >     or tested, so it might not work correctly." [1]
> > > > > 
> > > > > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > > > > 
> > > > > And if the udev/systemd event_timeout option is broken it's way better
> > > > > to fix that one instead of hacking random heuristics into the kernel.
> > > > 
> > > > I haven't tried the event_timeout= option but I think it will not work.
> > > > The timeout is hard coded as shown below and there will be no chance for taking
> > > > the event_timeout= option into account.
> > > > 
> > > > ---------- systemd-204/src/udev/udevd.c start ----------
> > > > (...snipped...)
> > > >                         /* check for hanging events */
> > > >                         udev_list_node_foreach(loop, &worker_list) {
> > > >                                 struct worker *worker = node_to_worker(loop);
> > > > 
> > > >                                 if (worker->state != WORKER_RUNNING)
> > > >                                         continue;
> > > > 
> > > >                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
> > > 
> > > And because systemd has an immutable hardcoded random timeout we add
> > > another hardcoded random timeout into kthread_create() to work around
> > > that.
> > > 
> > > How broken is that?
> > > 
> > > And it seems other people have solved it:
> > > 
> > >     http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html
> > 
> > I agree with Thomas.  A hardcoded timeout is a systemd bug.  However,
> > could I get confirmation, while you can use this bug to do it, that the
> > patch back in this thread actually fixes the crash when scsi_alloc_host
> > fails, that's the serious SCSI bug, in my view?
> 
> Which patch, the one to kthread_create() or the one to SCSI?

The one to SCSI ... I'm only really interested in the oops when
scsi_host_alloc fails.

James




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)
  2014-03-22 20:48                                   ` James Bottomley
@ 2014-03-24 17:01                                     ` Oleg Nesterov
  0 siblings, 0 replies; 41+ messages in thread
From: Oleg Nesterov @ 2014-03-24 17:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tetsuo Handa, joseph.salisbury, Nagalakshmi.Nandigama,
	Sreekanth.Reddy, rientjes, akpm, torvalds, tj, tglx,
	linux-kernel, kernel-team, linux-scsi, thenzl

On 03/22, James Bottomley wrote:
>
> OK, the fix from the SCSI point of view is to make the mpt teardown path
> actually work.  The whole thing looks to be a complete mess because
> there's another place where it will do the wrong thing in
> mptscsih_remove().  You always have to call mpt_detach() otherwise the
> device doesn't get removed from the lists.  In theory this patch fixes
> both bugs in the driver.

Yes, I obviously thought about the same change ;)

But note that mpt_detach() doesn't look correct too. This is almost
off-topic, but still...

At least it needs s/cancel_delayed_work/cancel_delayed_work_sync/ to
sync with mpt_fault_reset_work().

And it is not clear if it is safe to simply destroy ->fw_event_q,
perhaps it can't have the pending works if the caller is teardown
path, and otherwise we rely on mptsas_cleanup_fw_event_q(), I do
not know... but mptsas_cleanup_fw_event_q() needs _sync too I guess,
and obviously mptsas_free_fw_event() should be called unconditionally.

OTOH, mpt_attach() doesn't verify that ->fw_event_q was created
succesfully. And, if mpt_do_ioc_recovery() fails it destroys
->reset_work_q but not ->fw_event_q. Looks like this driver needs
some cleanups.

> --- a/drivers/message/fusion/mptscsih.c
> +++ b/drivers/message/fusion/mptscsih.c
> @@ -1176,10 +1176,14 @@ mptscsih_remove(struct pci_dev *pdev)
>  	MPT_SCSI_HOST		*hd;
>  	int sz1;
>
> +	if (!host)
> +		/* not brought up far enough to do scsi_host_attach() */
                                                   ^^^^^^^^^^^^^^^^
scsi_host_alloc ;)

> +		goto out;
> +

Yes, I think this is correct. mpt_attach() does kzalloc(MPT_ADAPTER) so
we can probably rely on ->sh == NULL if _alloc failed.

Oleg.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL.
  2014-03-17 20:19   ` Oleg Nesterov
  2014-03-17 20:39     ` Andrew Morton
@ 2014-06-03 13:03     ` Tetsuo Handa
  2014-06-03 21:35       ` David Rientjes
  1 sibling, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2014-06-03 13:03 UTC (permalink / raw)
  To: oleg, akpm
  Cc: joseph.salisbury, rientjes, torvalds, tj, tglx, linux-kernel,
	kernel-team

>From c41fa0b9294900a86167d6c8db392d99dde09774 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 3 Jun 2014 21:46:24 +0900
Subject: [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL.

Commit 786235eeb "kthread: make kthread_create() killable" meant for
allowing kthread_create() to abort as soon as killed by the OOM-killer.
But returning -ENOMEM is wrong if killed by SIGKILL from userspace.
Change kthread_create() to return -EINTR upon SIGKILL.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: stable <stable@kernel.org> [3.13+]
---
 kernel/kthread.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 9a130ec..c2390f4 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -262,7 +262,7 @@ static void create_kthread(struct kthread_create_info *create)
  * kthread_stop() has been called).  The return value should be zero
  * or a negative error number; it will be passed to kthread_stop().
  *
- * Returns a task_struct or ERR_PTR(-ENOMEM).
+ * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
  */
 struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 					   void *data, int node,
@@ -298,7 +298,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 		 * that thread.
 		 */
 		if (xchg(&create->done, NULL))
-			return ERR_PTR(-ENOMEM);
+			return ERR_PTR(-EINTR);
 		/*
 		 * kthreadd (or new kernel thread) will call complete()
 		 * shortly.
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL.
  2014-06-03 13:03     ` [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL Tetsuo Handa
@ 2014-06-03 21:35       ` David Rientjes
  0 siblings, 0 replies; 41+ messages in thread
From: David Rientjes @ 2014-06-03 21:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Oleg Nesterov, Andrew Morton, joseph.salisbury, Linus Torvalds,
	Tejun Heo, tglx, linux-kernel, kernel-team

On Tue, 3 Jun 2014, Tetsuo Handa wrote:

> >From c41fa0b9294900a86167d6c8db392d99dde09774 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 3 Jun 2014 21:46:24 +0900
> Subject: [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL.
> 
> Commit 786235eeb "kthread: make kthread_create() killable" meant for
> allowing kthread_create() to abort as soon as killed by the OOM-killer.
> But returning -ENOMEM is wrong if killed by SIGKILL from userspace.
> Change kthread_create() to return -EINTR upon SIGKILL.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: stable <stable@kernel.org> [3.13+]

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2014-06-03 21:35 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-14 20:46 [v3.13][v3.14][Regression] kthread: make kthread_create() killable Joseph Salisbury
2014-03-15  0:43 ` Tetsuo Handa
2014-03-16 15:13   ` Tetsuo Handa
2014-03-16 16:25     ` Oleg Nesterov
2014-03-17 12:38       ` [v3.13][v3.14][Regression] kthread: make kthread_create()killable Tetsuo Handa
2014-03-17 14:22         ` Oleg Nesterov
2014-03-18 12:03           ` [v3.13][v3.14][Regression] kthread: makekthread_create()killable Tetsuo Handa
2014-03-18 17:16             ` Oleg Nesterov
2014-03-19 11:49               ` [v3.13][v3.14][Regression] kthread:makekthread_create()killable Tetsuo Handa
2014-03-19 16:13                 ` Joseph Salisbury
2014-03-19 17:52                   ` Oleg Nesterov
2014-03-19 18:29                     ` please fix FUSION (Was: [v3.13][v3.14][Regression] kthread:makekthread_create()killable) Oleg Nesterov
2014-03-19 19:42                       ` Oleg Nesterov
2014-03-19 21:04                         ` Joseph Salisbury
2014-03-20 16:46                         ` Joseph Salisbury
2014-03-20 19:23                           ` Oleg Nesterov
2014-03-21 18:34                             ` Oleg Nesterov
2014-03-21 19:32                               ` Linus Torvalds
2014-03-21 20:31                                 ` Oleg Nesterov
2014-03-21 22:56                                 ` James Bottomley
2014-03-22  6:25                               ` please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable) Tetsuo Handa
2014-03-22 19:25                                 ` Oleg Nesterov
2014-03-22 20:48                                   ` James Bottomley
2014-03-24 17:01                                     ` Oleg Nesterov
2014-03-22 21:25                                   ` Thomas Gleixner
2014-03-22 22:01                                     ` Thomas Gleixner
2014-03-22 23:57                                       ` please fix FUSION (Was:[v3.13][v3.14][Regression]kthread:makekthread_create()killable) Tetsuo Handa
2014-03-23  8:04                                         ` Thomas Gleixner
2014-03-23 14:19                                           ` James Bottomley
2014-03-23 14:28                                             ` Thomas Gleixner
2014-03-23 14:29                                               ` James Bottomley
2014-03-22 23:50                                   ` Tetsuo Handa
2014-03-17 20:02 ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable Andrew Morton
2014-03-17 20:19   ` Oleg Nesterov
2014-03-17 20:39     ` Andrew Morton
2014-03-18 17:45       ` Oleg Nesterov
2014-06-03 13:03     ` [PATCH] kthread: Fix return value of kthread_create() upon SIGKILL Tetsuo Handa
2014-06-03 21:35       ` David Rientjes
2014-03-17 21:32   ` [v3.13][v3.14][Regression] kthread: make kthread_create()killable Tetsuo Handa
2014-03-17 23:18   ` [v3.13][v3.14][Regression] kthread: make kthread_create() killable One Thousand Gnomes
2014-03-18 17:50     ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.