LKML Archive on lore.kernel.org
 help / Atom feed
* RFC: SysRq nice-all-RT-tasks is broken
@ 2017-03-08 15:23 Laurent Dufour
  2017-03-08 16:51 ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Laurent Dufour @ 2017-03-08 15:23 UTC (permalink / raw)
  To: Andrew Morton, Steven Rostedt
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar

Hi,

It appears that triggering the SysRq nice-all-RT-tasks from the console
while a real task is active is leading to panic the system like this :

 sysrq: SysRq : Nice All RT Tasks
 ------------[ cut here ]------------
 kernel BUG at /build/linux-twbIHf/linux-4.10.0/kernel/sched/core.c:4089!
 Oops: Exception in kernel mode, sig: 5 [#1]
 SMP NR_CPUS=2048
 NUMA
 pSeries
 Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace
fscache pseries_rng vmx_crypto binfmt_misc dm_multipath scsi_dh_rdac
scsi_dh_emc scsi_dh_alua sunrpc ip_tables x_tables autofs4 btrfs xor
raid6_pq ses enclosure scsi_transport_sas crc32c_vpmsum mlx5_core be2net
ipr devlink
 CPU: 100 PID: 0 Comm: swapper/100 Not tainted 4.10.0-8-generic #10-Ubuntu
 task: c0000003b6179c00 task.stack: c0000003b620c000
 NIP: c0000000001044f0 LR: c00000000010e524 CTR: c0000000006d0ec0
 REGS: c00000077fc978d0 TRAP: 0700   Not tainted  (4.10.0-8-generic)
 MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>
   CR: 28002228  XER: 20000001
 CFAR: c00000000010e520 SOFTE: 0
 GPR00: c00000000010e524 c00000077fc97b50 c00000000143c900 c00000038785fa00
 GPR04: c00000077fc97c00 0000000000000000 0000000000000000 ffffffffffffffff
 GPR08: 0000000000000000 0000000000010000 0000000000000000 0000000000000406
 GPR12: 0000000028002228 c000000007b58400 c0000003b620ff90 000000001ede2d80
 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 GPR20: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
 GPR24: 0000000000000000 c00000077a0f8b50 0000000000000063 0000000000000000
 GPR28: c00000038785fa00 c00000077fc97c00 0000000000000000 c00000038785fa00
 NIP [c0000000001044f0] __sched_setscheduler+0x90/0xd30
 LR [c00000000010e524] normalize_rt_tasks+0x184/0x1c0
 Call Trace:
 [c00000077fc97be0] [c00000000010e524] normalize_rt_tasks+0x184/0x1c0
 [c00000077fc97c60] [c0000000006d0ee0] sysrq_handle_unrt+0x20/0x40
 [c00000077fc97c80] [c0000000006d1c98] __handle_sysrq+0xe8/0x280
 [c00000077fc97d20] [c0000000006eab38] hvc_poll+0x1c8/0x360
 [c00000077fc97dc0] [c0000000006ebd74] hvc_handle_interrupt+0x24/0x50
 [c00000077fc97de0] [c0000000001453d0] __handle_irq_event_percpu+0x90/0x2c0
 [c00000077fc97ea0] [c000000000145638] handle_irq_event_percpu+0x38/0x90
 [c00000077fc97ee0] [c0000000001456f4] handle_irq_event+0x64/0xb0
 [c00000077fc97f10] [c00000000014add8] handle_fasteoi_irq+0xe8/0x290
 [c00000077fc97f40] [c000000000143ebc] generic_handle_irq+0x4c/0x80
 [c00000077fc97f60] [c0000000000167b0] __do_irq+0x80/0x1d0
 [c00000077fc97f90] [c000000000029414] call_do_irq+0x14/0x24
 [c0000003b620fa40] [c000000000016994] do_IRQ+0x94/0x140
 [c0000003b620fa80] [c000000000008be4] hardware_interrupt_common+0x114/0x120
 --- interrupt: 501 at plpar_hcall_norets+0x1c/0x28
     LR = check_and_cede_processor+0x34/0x50
 [c0000003b620fd70] [c0000003b620fe10] 0xc0000003b620fe10 (unreliable)
 [c0000003b620fdd0] [c00000000095db50] shared_cede_loop+0x50/0x150
 [c0000003b620fe00] [c00000000095accc] cpuidle_enter_state+0x16c/0x430
 [c0000003b620fe60] [c00000000012c8bc] call_cpuidle+0x4c/0x90
 [c0000003b620fe80] [c00000000012cce0] do_idle+0x2c0/0x330
 [c0000003b620ff00] [c00000000012cfa8] cpu_startup_entry+0x38/0x50
 [c0000003b620ff30] [c000000000044180] start_secondary+0x330/0x370
 [c0000003b620ff90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
 Instruction dump:
 7c7f1b78 7cb62b78 7cdb3378 2b8a0006 7d5507b4 419e0010 83440014 235a0063
 7f5a07b4 78290464 8129000c 552902ee <0b090000> 2f950000 419c04b8 2f950005
 ---[ end trace 891618fbca78ebe7 ]---

I got it on Power and on X86_64, but I guess it should happen in all
architectures.
Here are the steps to recreate it :
1. Create a RT task : sudo chrt -f 50 /bin/sleep 999999
2. On the console trigger the 'nice-all-RT-tasks' SYS-RQ.

The panic is triggered by the BUG_ON(in_interrupt()) introduced by this
commit:

66e5393a78b3 ("[PATCH] BUG() if setscheduler is called from interrupt
context")

Since SysRq is run from the interrupt context, the panic is expected.

Looking at the code, I'm wondering if the BUG_ON() is still required in
__sched_setscheduler(). But I'm not so confident, so requesting your
advise here.

Thanks,
Laurent.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: SysRq nice-all-RT-tasks is broken
  2017-03-08 15:23 RFC: SysRq nice-all-RT-tasks is broken Laurent Dufour
@ 2017-03-08 16:51 ` Steven Rostedt
  2017-03-08 16:57   ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2017-03-08 16:51 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Andrew Morton, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar

On Wed, 8 Mar 2017 16:23:35 +0100
Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:

> I got it on Power and on X86_64, but I guess it should happen in all
> architectures.
> Here are the steps to recreate it :
> 1. Create a RT task : sudo chrt -f 50 /bin/sleep 999999
> 2. On the console trigger the 'nice-all-RT-tasks' SYS-RQ.
> 
> The panic is triggered by the BUG_ON(in_interrupt()) introduced by this
> commit:
> 
> 66e5393a78b3 ("[PATCH] BUG() if setscheduler is called from interrupt
> context")
> 
> Since SysRq is run from the interrupt context, the panic is expected.
> 
> Looking at the code, I'm wondering if the BUG_ON() is still required in
> __sched_setscheduler(). But I'm not so confident, so requesting your
> advise here.
> 

Hmm, that commit was added in 2.6.18, and you're right, a lot has
changed since then. Have you tried removing it and running it under
lockdep, and see if it triggers any warnings?

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: SysRq nice-all-RT-tasks is broken
  2017-03-08 16:51 ` Steven Rostedt
@ 2017-03-08 16:57   ` Steven Rostedt
  2017-03-08 17:03     ` Laurent Dufour
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2017-03-08 16:57 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Andrew Morton, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar

On Wed, 8 Mar 2017 11:51:14 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:


> Hmm, that commit was added in 2.6.18, and you're right, a lot has
> changed since then. Have you tried removing it and running it under
> lockdep, and see if it triggers any warnings?

I did a little digging, and it appears that its the rt mutex wait lock
that the comment was referring to. Today that spin lock is irq safe. I
believe its safe to remove the BUG_ON(). Want me to send a patch?

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: SysRq nice-all-RT-tasks is broken
  2017-03-08 16:57   ` Steven Rostedt
@ 2017-03-08 17:03     ` Laurent Dufour
  2017-03-08 17:40       ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Laurent Dufour @ 2017-03-08 17:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar

On 08/03/2017 17:57, Steven Rostedt wrote:
> On Wed, 8 Mar 2017 11:51:14 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> 
>> Hmm, that commit was added in 2.6.18, and you're right, a lot has
>> changed since then. Have you tried removing it and running it under
>> lockdep, and see if it triggers any warnings?
> 
> I did a little digging, and it appears that its the rt mutex wait lock
> that the comment was referring to. Today that spin lock is irq safe. I
> believe its safe to remove the BUG_ON(). Want me to send a patch?

Sure, go ahead ;)

Thanks,
Laurent.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: SysRq nice-all-RT-tasks is broken
  2017-03-08 17:03     ` Laurent Dufour
@ 2017-03-08 17:40       ` Steven Rostedt
  2017-03-08 17:46         ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2017-03-08 17:40 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Andrew Morton, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra


[
  Added Peter

   Update: Laurent noticed that sysrq 'n' (nice-all-RT-tasks) calls
   __sched_setscheduler() form interrupt context. At the start of that
   function, there's a BUG_ON(in_interrupt()). The reason for that was
   due to the rt mutex pi code calling wait_lock. Which was not irq
   safe. Now it is, but that's not good enough.
]

On Wed, 8 Mar 2017 18:03:55 +0100
Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:

> On 08/03/2017 17:57, Steven Rostedt wrote:
> > On Wed, 8 Mar 2017 11:51:14 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> >   
> >> Hmm, that commit was added in 2.6.18, and you're right, a lot has
> >> changed since then. Have you tried removing it and running it under
> >> lockdep, and see if it triggers any warnings?  
> > 
> > I did a little digging, and it appears that its the rt mutex wait lock
> > that the comment was referring to. Today that spin lock is irq safe. I
> > believe its safe to remove the BUG_ON(). Want me to send a patch?  
> 
> Sure, go ahead ;)
>

Actually, it's still not safe :-/

I just noticed this in the call path:

	raw_spin_unlock_irq(&task->pi_lock);

As well as other raw_spin_unlock_irq()s.

Which would enable interrupts regardless of the previous state.

One solution is to change all those to irqsave() but that seems to be a
big step for something that is rarely done (how many years has it been
since 2.6.18?).

I wonder if we should just have a special flag sent by that sysrq
trigger. Since it is causing all tasks to go "nice" there's no need to
do the pi chain walk in __sched_setscheduler().

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: SysRq nice-all-RT-tasks is broken
  2017-03-08 17:40       ` Steven Rostedt
@ 2017-03-08 17:46         ` Steven Rostedt
  2017-03-09  9:02           ` Laurent Dufour
  2017-05-23  8:46           ` [tip:sched/core] sched/core: Allow __sched_setscheduler() in interrupts when PI is not used tip-bot for Steven Rostedt (VMware)
  0 siblings, 2 replies; 8+ messages in thread
From: Steven Rostedt @ 2017-03-08 17:46 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Andrew Morton, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra

On Wed, 8 Mar 2017 12:40:12 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> I wonder if we should just have a special flag sent by that sysrq
> trigger. Since it is causing all tasks to go "nice" there's no need to
> do the pi chain walk in __sched_setscheduler().

Hah, there already is a flag!

Laurent, can you test this patch:

-- Steve

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b31fc0..7292fa9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4129,8 +4129,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
 	struct rq *rq;
 
-	/* May grab non-irq protected spin_locks: */
-	BUG_ON(in_interrupt());
+	/* The pi code expects interrupts enabled */
+	BUG_ON(pi && in_interrupt());
 recheck:
 	/* Double check policy once rq lock held: */
 	if (policy < 0) {

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: SysRq nice-all-RT-tasks is broken
  2017-03-08 17:46         ` Steven Rostedt
@ 2017-03-09  9:02           ` Laurent Dufour
  2017-05-23  8:46           ` [tip:sched/core] sched/core: Allow __sched_setscheduler() in interrupts when PI is not used tip-bot for Steven Rostedt (VMware)
  1 sibling, 0 replies; 8+ messages in thread
From: Laurent Dufour @ 2017-03-09  9:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra

On 08/03/2017 18:46, Steven Rostedt wrote:
> On Wed, 8 Mar 2017 12:40:12 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> I wonder if we should just have a special flag sent by that sysrq
>> trigger. Since it is causing all tasks to go "nice" there's no need to
>> do the pi chain walk in __sched_setscheduler().
> 
> Hah, there already is a flag!
> 
> Laurent, can you test this patch:

Tested on ppc64, no more panic \o/

FWIW,
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

Thanks,
Laurent.

> 
> -- Steve
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3b31fc0..7292fa9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4129,8 +4129,8 @@ static int __sched_setscheduler(struct task_struct *p,
>  	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
>  	struct rq *rq;
> 
> -	/* May grab non-irq protected spin_locks: */
> -	BUG_ON(in_interrupt());
> +	/* The pi code expects interrupts enabled */
> +	BUG_ON(pi && in_interrupt());
>  recheck:
>  	/* Double check policy once rq lock held: */
>  	if (policy < 0) {
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [tip:sched/core] sched/core: Allow __sched_setscheduler() in interrupts when PI is not used
  2017-03-08 17:46         ` Steven Rostedt
  2017-03-09  9:02           ` Laurent Dufour
@ 2017-05-23  8:46           ` tip-bot for Steven Rostedt (VMware)
  1 sibling, 0 replies; 8+ messages in thread
From: tip-bot for Steven Rostedt (VMware) @ 2017-05-23  8:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: rostedt, ldufour, hpa, peterz, torvalds, mingo, tglx, linux-kernel, akpm

Commit-ID:  896bbb2522587e3b8eb2a0d204d43ccc1042a00d
Gitweb:     http://git.kernel.org/tip/896bbb2522587e3b8eb2a0d204d43ccc1042a00d
Author:     Steven Rostedt (VMware) <rostedt@goodmis.org>
AuthorDate: Thu, 9 Mar 2017 10:18:42 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 23 May 2017 10:01:34 +0200

sched/core: Allow __sched_setscheduler() in interrupts when PI is not used

When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it
added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI
is not a common occurrence, lockdep will likely never trigger if
sched_setscheduler was called from interrupt context. A BUG_ON() was added
to trigger if __sched_setscheduler() was ever called from interrupt context
because there was a possibility to take the wait_lock.

Today the wait_lock is irq safe, but the path to taking it in
sched_setscheduler() is the same as the path to taking it from normal
context. The wait_lock is taken with raw_spin_lock_irq() and released with
raw_spin_unlock_irq() which will indiscriminately enable interrupts,
which would be bad in interrupt context.

The problem is that normalize_rt_tasks, which is called by triggering the
sysrq nice-all-RT-tasks was changed to call __sched_setscheduler(), and this
is done from interrupt context!

Now __sched_setscheduler() takes a "pi" parameter that is used to know if
the priority inheritance should be called or not. As the BUG_ON() only cares
about calling the PI code, it should only bug if called from interrupt
context with the "pi" parameter set to true.

Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: dbc7f069b93a ("sched: Use replace normalize_task() with __sched_setscheduler()")
Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4a31239..877241e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4188,8 +4188,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
 	struct rq *rq;
 
-	/* May grab non-irq protected spin_locks: */
-	BUG_ON(in_interrupt());
+	/* The pi code expects interrupts enabled */
+	BUG_ON(pi && in_interrupt());
 recheck:
 	/* Double check policy once rq lock held: */
 	if (policy < 0) {

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, back to index

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-08 15:23 RFC: SysRq nice-all-RT-tasks is broken Laurent Dufour
2017-03-08 16:51 ` Steven Rostedt
2017-03-08 16:57   ` Steven Rostedt
2017-03-08 17:03     ` Laurent Dufour
2017-03-08 17:40       ` Steven Rostedt
2017-03-08 17:46         ` Steven Rostedt
2017-03-09  9:02           ` Laurent Dufour
2017-05-23  8:46           ` [tip:sched/core] sched/core: Allow __sched_setscheduler() in interrupts when PI is not used tip-bot for Steven Rostedt (VMware)

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox