linux-kernel.vger.kernel.org archive mirror
* INFO: rcu detected stall in do_idle
@ 2018-10-13  7:31 syzbot
  2018-10-16 13:24 ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: syzbot @ 2018-10-13  7:31 UTC (permalink / raw)
  To: bp, hpa, linux-kernel, mingo, nstange, syzkaller-bugs, tglx, x86

Hello,

syzbot found the following crash on:

HEAD commit:    6b3944e42e2e afs: Fix cell proc list
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1545a479400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
dashboard link: https://syzkaller.appspot.com/bug?extid=385468161961cee80c31
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14825d5e400000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=122cc531400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+385468161961cee80c31@syzkaller.appspotmail.com

sshd (5193) used greatest stack depth: 15496 bytes left
sched: DL replenish lagged too much
hrtimer: interrupt took 417330 ns
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 	(detected by 0, t=10520 jiffies, g=-1179, q=0)
rcu: All QSes seen, last rcu_sched kthread activity 10521  
(4295005425-4294994904), jiffies_till_next_fqs=1, root ->qsmask 0x0
swapper/0       R  running task    21816     0      0 0x80000000
Call Trace:
  <IRQ>
  sched_show_task.cold.83+0x2b6/0x30a kernel/sched/core.c:5296
  print_other_cpu_stall.cold.79+0xa83/0xba5 kernel/rcu/tree.c:1430
  check_cpu_stall kernel/rcu/tree.c:1557 [inline]
  __rcu_pending kernel/rcu/tree.c:3276 [inline]
  rcu_pending kernel/rcu/tree.c:3319 [inline]
  rcu_check_callbacks+0xafc/0x1990 kernel/rcu/tree.c:2665
  update_process_times+0x2d/0x70 kernel/time/timer.c:1636
  tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
  tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
  __run_hrtimer kernel/time/hrtimer.c:1398 [inline]
  __hrtimer_run_queues+0x41c/0x10d0 kernel/time/hrtimer.c:1460
  hrtimer_interrupt+0x313/0x780 kernel/time/hrtimer.c:1518
  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1029 [inline]
  smp_apic_timer_interrupt+0x1a1/0x760 arch/x86/kernel/apic/apic.c:1054
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:864
RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:798 [inline]
RIP: 0010:__do_softirq+0x2c3/0xad8 kernel/softirq.c:276
Code: c7 c0 98 f2 31 89 48 c1 e8 03 42 80 3c 30 00 0f 85 17 07 00 00 48 83  
3d e2 ef 51 01 00 0f 84 6e 06 00 00 fb 66 0f 1f 44 00 00 <48> c7 85 68 fe  
ff ff 00 91 20 89 b8 ff ff ff ff 0f bc 85 74 fe ff
RSP: 0018:ffff8801dae07ae8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff1263e53 RBX: ffffffff89276e40 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff892776bc
RBP: ffff8801dae07cc8 R08: ffffffff89276e40 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1003b5c0fa1
R13: ffffffff89276e40 R14: dffffc0000000000 R15: 0000000000000000
  invoke_softirq kernel/softirq.c:372 [inline]
  irq_exit+0x17f/0x1c0 kernel/softirq.c:412
  scheduler_ipi+0x55a/0xad0 kernel/sched/core.c:1780
  smp_reschedule_interrupt+0x109/0x650 arch/x86/kernel/smp.c:278
  reschedule_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:888
  </IRQ>
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:57
Code: e9 2c ff ff ff 48 89 c7 48 89 45 d8 e8 e3 e0 11 fa 48 8b 45 d8 e9 ca  
fe ff ff 48 89 df e8 d2 e0 11 fa eb 82 55 48 89 e5 fb f4 <5d> c3 0f 1f 84  
00 00 00 00 00 55 48 89 e5 f4 5d c3 90 90 90 90 90
RSP: 0018:ffffffff89207bb8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff02
RAX: dffffc0000000000 RBX: 1ffffffff1240f7b RCX: 0000000000000000
RDX: 1ffffffff1263e54 RSI: 0000000000000001 RDI: ffffffff8931f2a0
RBP: ffffffff89207bb8 R08: ffffffff89276e40 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff89207c78
R13: ffffffff89f3be20 R14: 0000000000000000 R15: 0000000000000000
  arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
  default_idle+0xbf/0x490 arch/x86/kernel/process.c:498
  arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:489
  default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
  cpuidle_idle_call kernel/sched/idle.c:153 [inline]
  do_idle+0x3db/0x5b0 kernel/sched/idle.c:262
  cpu_startup_entry+0x10c/0x120 kernel/sched/idle.c:368
  rest_init+0xe2/0xe5 init/main.c:442
  start_kernel+0x8f4/0x92f init/main.c:739
  x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:470
  x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:451
  secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
rcu: rcu_sched kthread starved for 10601 jiffies! g-1179 f0x2  
RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu: RCU grace-period kthread stack dump:
rcu_sched       R  running task    25144    11      2 0x80000000
Call Trace:
  context_switch kernel/sched/core.c:2825 [inline]
  __schedule+0x86c/0x1ed0 kernel/sched/core.c:3473
  schedule+0xfe/0x460 kernel/sched/core.c:3517
  schedule_timeout+0x140/0x260 kernel/time/timer.c:1804
  rcu_gp_kthread+0x9d9/0x2310 kernel/rcu/tree.c:2194
  kthread+0x35a/0x420 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


* Re: INFO: rcu detected stall in do_idle
  2018-10-13  7:31 INFO: rcu detected stall in do_idle syzbot
@ 2018-10-16 13:24 ` Thomas Gleixner
  2018-10-16 14:03   ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2018-10-16 13:24 UTC (permalink / raw)
  To: syzbot
  Cc: Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, x86, Peter Zijlstra

On Sat, 13 Oct 2018, syzbot wrote:
> syzbot found the following crash on:
> 
> HEAD commit:    6b3944e42e2e afs: Fix cell proc list
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1545a479400000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
> dashboard link: https://syzkaller.appspot.com/bug?extid=385468161961cee80c31
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14825d5e400000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=122cc531400000
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+385468161961cee80c31@syzkaller.appspotmail.com
> 
> sshd (5193) used greatest stack depth: 15496 bytes left
> sched: DL replenish lagged too much
> hrtimer: interrupt took 417330 ns
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:

It does reproduce here but with a kworker stall. Looking at the reproducer:

  *(uint32_t*)0x20000000 = 0;
  *(uint32_t*)0x20000004 = 6;
  *(uint64_t*)0x20000008 = 0;
  *(uint32_t*)0x20000010 = 0;
  *(uint32_t*)0x20000014 = 0;
  *(uint64_t*)0x20000018 = 0x9917;
  *(uint64_t*)0x20000020 = 0xffff;
  *(uint64_t*)0x20000028 = 0;
  syscall(__NR_sched_setattr, 0, 0x20000000, 0);

which means:

  struct sched_attr {
  	 .size		= 0,
	 .policy	= 6,
	 .flags		= 0,
	 .nice		= 0,
	 .priority	= 0,
	 .deadline	= 0x9917,
	 .runtime	= 0xffff,
	 .period	= 0,
  }

policy 6 is SCHED_DEADLINE
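
(For reference, the policy numbers come from the scheduler UAPI header,
include/uapi/linux/sched.h:)

  #define SCHED_NORMAL		0
  #define SCHED_FIFO		1
  #define SCHED_RR		2
  #define SCHED_BATCH		3
  /* SCHED_ISO: reserved but not implemented yet */
  #define SCHED_IDLE		5
  #define SCHED_DEADLINE	6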

That makes the thread hog the CPU and prevents all kinds of stuff from running.

Peter, is that expected behaviour?

Thanks,

	tglx


* Re: INFO: rcu detected stall in do_idle
  2018-10-16 13:24 ` Thomas Gleixner
@ 2018-10-16 14:03   ` Peter Zijlstra
  2018-10-16 14:41     ` Juri Lelli
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2018-10-16 14:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: syzbot, Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, Juri Lelli

On Tue, Oct 16, 2018 at 03:24:06PM +0200, Thomas Gleixner wrote:
> It does reproduce here but with a kworker stall. Looking at the reproducer:
> 
>   *(uint32_t*)0x20000000 = 0;
>   *(uint32_t*)0x20000004 = 6;
>   *(uint64_t*)0x20000008 = 0;
>   *(uint32_t*)0x20000010 = 0;
>   *(uint32_t*)0x20000014 = 0;
>   *(uint64_t*)0x20000018 = 0x9917;
>   *(uint64_t*)0x20000020 = 0xffff;
>   *(uint64_t*)0x20000028 = 0;
>   syscall(__NR_sched_setattr, 0, 0x20000000, 0);
> 
> which means:
> 
>   struct sched_attr {
>   	 .size		= 0,
> 	 .policy	= 6,
> 	 .flags		= 0,
> 	 .nice		= 0,
> 	 .priority	= 0,
> 	 .deadline	= 0x9917,
> 	 .runtime	= 0xffff,
> 	 .period	= 0,
>   }
> 
> policy 6 is SCHED_DEADLINE
> 
> That makes the thread hog the CPU and prevents all kinds of stuff from running.
> 
> Peter, is that expected behaviour?

Sorta, just like FIFO-99 while(1);. Except we should be rejecting the
above configuration, because of the rule:

  runtime <= deadline <= period

Juri, where were we supposed to check that?


* Re: INFO: rcu detected stall in do_idle
  2018-10-16 14:03   ` Peter Zijlstra
@ 2018-10-16 14:41     ` Juri Lelli
  2018-10-16 14:45       ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-10-16 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, syzbot, Borislav Petkov, H. Peter Anvin, LKML,
	mingo, nstange, syzkaller-bugs, juri.lelli

On 16/10/18 16:03, Peter Zijlstra wrote:
> On Tue, Oct 16, 2018 at 03:24:06PM +0200, Thomas Gleixner wrote:
> > It does reproduce here but with a kworker stall. Looking at the reproducer:
> > 
> >   *(uint32_t*)0x20000000 = 0;
> >   *(uint32_t*)0x20000004 = 6;
> >   *(uint64_t*)0x20000008 = 0;
> >   *(uint32_t*)0x20000010 = 0;
> >   *(uint32_t*)0x20000014 = 0;
> >   *(uint64_t*)0x20000018 = 0x9917;
> >   *(uint64_t*)0x20000020 = 0xffff;
> >   *(uint64_t*)0x20000028 = 0;
> >   syscall(__NR_sched_setattr, 0, 0x20000000, 0);
> > 
> > which means:
> > 
> >   struct sched_attr {
> >   	 .size		= 0,
> > 	 .policy	= 6,
> > 	 .flags		= 0,
> > 	 .nice		= 0,
> > 	 .priority	= 0,
> > 	 .deadline	= 0x9917,
> > 	 .runtime	= 0xffff,
> > 	 .period	= 0,
> >   }
> > 
> > policy 6 is SCHED_DEADLINE
> > 
> > That makes the thread hog the CPU and prevents all kinds of stuff from running.
> > 
> > Peter, is that expected behaviour?
> 
> Sorta, just like FIFO-99 while(1);. Except we should be rejecting the
> above configuration, because of the rule:
> 
>   runtime <= deadline <= period
> 
> Juri, where were we supposed to check that?

Not if period == 0.

https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2632
https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2515
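
For context, the admission check behind those links looks roughly like
this (paraphrased, not verbatim, from kernel/sched/deadline.c; note the
deadline <= period half of the rule is simply skipped when period == 0,
since a zero sched_period is later taken to mean "period = deadline"):

  static bool __checkparam_dl(const struct sched_attr *attr)
  {
  	/* deadline must be non-zero */
  	if (attr->sched_deadline == 0)
  		return false;

  	/* runtime is truncated to DL_SCALE bits, so require at least that */
  	if (attr->sched_runtime < (1ULL << DL_SCALE))
  		return false;

  	/* the MSB is reserved for wrap-around and sign issues */
  	if (attr->sched_deadline & (1ULL << 63) ||
  	    attr->sched_period & (1ULL << 63))
  		return false;

  	/* runtime <= deadline <= period (period only if != 0) */
  	if ((attr->sched_period != 0 &&
  	     attr->sched_period < attr->sched_deadline) ||
  	    attr->sched_deadline < attr->sched_runtime)
  		return false;

  	return true;
  }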

Now, maybe we should also be checking against the default 95% cap?

Best,

- Juri


* Re: INFO: rcu detected stall in do_idle
  2018-10-16 14:41     ` Juri Lelli
@ 2018-10-16 14:45       ` Thomas Gleixner
  2018-10-16 15:36         ` Juri Lelli
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2018-10-16 14:45 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML,
	mingo, nstange, syzkaller-bugs, juri.lelli

On Tue, 16 Oct 2018, Juri Lelli wrote:
> On 16/10/18 16:03, Peter Zijlstra wrote:
> > On Tue, Oct 16, 2018 at 03:24:06PM +0200, Thomas Gleixner wrote:
> > > It does reproduce here but with a kworker stall. Looking at the reproducer:
> > > 
> > >   *(uint32_t*)0x20000000 = 0;
> > >   *(uint32_t*)0x20000004 = 6;
> > >   *(uint64_t*)0x20000008 = 0;
> > >   *(uint32_t*)0x20000010 = 0;
> > >   *(uint32_t*)0x20000014 = 0;
> > >   *(uint64_t*)0x20000018 = 0x9917;
> > >   *(uint64_t*)0x20000020 = 0xffff;
> > >   *(uint64_t*)0x20000028 = 0;
> > >   syscall(__NR_sched_setattr, 0, 0x20000000, 0);
> > > 
> > > which means:
> > > 
> > >   struct sched_attr {
> > >   	 .size		= 0,
> > > 	 .policy	= 6,
> > > 	 .flags		= 0,
> > > 	 .nice		= 0,
> > > 	 .priority	= 0,
> > > 	 .deadline	= 0x9917,
> > > 	 .runtime	= 0xffff,
> > > 	 .period	= 0,
> > >   }
> > > 
> > > policy 6 is SCHED_DEADLINE
> > > 
> > > That makes the thread hog the CPU and prevents all kinds of stuff from running.
> > > 
> > > Peter, is that expected behaviour?
> > 
> > Sorta, just like FIFO-99 while(1);. Except we should be rejecting the
> > above configuration, because of the rule:
> > 
> >   runtime <= deadline <= period
> > 
> > Juri, where were we supposed to check that?
> 
> Not if period == 0.
> 
> https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2632
> https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2515
> 
> Now, maybe we should also be checking against the default 95% cap?

If the cap is active, then yes. But you want to use the actual
configuration, not the default.

Thanks,

	tglx


* Re: INFO: rcu detected stall in do_idle
  2018-10-16 14:45       ` Thomas Gleixner
@ 2018-10-16 15:36         ` Juri Lelli
  2018-10-18  8:28           ` Juri Lelli
  0 siblings, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-10-16 15:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs

On 16/10/18 16:45, Thomas Gleixner wrote:
> On Tue, 16 Oct 2018, Juri Lelli wrote:
> > On 16/10/18 16:03, Peter Zijlstra wrote:
> > > On Tue, Oct 16, 2018 at 03:24:06PM +0200, Thomas Gleixner wrote:
> > > > It does reproduce here but with a kworker stall. Looking at the reproducer:
> > > > 
> > > >   *(uint32_t*)0x20000000 = 0;
> > > >   *(uint32_t*)0x20000004 = 6;
> > > >   *(uint64_t*)0x20000008 = 0;
> > > >   *(uint32_t*)0x20000010 = 0;
> > > >   *(uint32_t*)0x20000014 = 0;
> > > >   *(uint64_t*)0x20000018 = 0x9917;
> > > >   *(uint64_t*)0x20000020 = 0xffff;
> > > >   *(uint64_t*)0x20000028 = 0;
> > > >   syscall(__NR_sched_setattr, 0, 0x20000000, 0);
> > > > 
> > > > which means:
> > > > 
> > > >   struct sched_attr {
> > > >   	 .size		= 0,
> > > > 	 .policy	= 6,
> > > > 	 .flags		= 0,
> > > > 	 .nice		= 0,
> > > > 	 .priority	= 0,
> > > > 	 .deadline	= 0x9917,
> > > > 	 .runtime	= 0xffff,
> > > > 	 .period	= 0,
> > > >   }
> > > > 
> > > > policy 6 is SCHED_DEADLINE
> > > > 
> > > > That makes the thread hog the CPU and prevents all kinds of stuff from running.
> > > > 
> > > > Peter, is that expected behaviour?
> > > 
> > > Sorta, just like FIFO-99 while(1);. Except we should be rejecting the
> > > above configuration, because of the rule:
> > > 
> > >   runtime <= deadline <= period
> > > 
> > > Juri, where were we supposed to check that?
> > 
> > Not if period == 0.
> > 
> > https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2632
> > https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2515
> > 
> > Now, maybe we should also be checking against the default 95% cap?
> 
> If the cap is active, then yes. But you want to use the actual
> configuration, not the default.

Sure.

Although DEADLINE bandwidth is "replicated" across the CPUs of a domain,
so we can still admit a while(1) on multi-CPU domains. Mmm, I guess we
should be able to fix this, however, if we also limit the per-task
maximum bandwidth considering rt_runtime/rt_period.


* Re: INFO: rcu detected stall in do_idle
  2018-10-16 15:36         ` Juri Lelli
@ 2018-10-18  8:28           ` Juri Lelli
  2018-10-18  9:48             ` Peter Zijlstra
  2018-10-18 10:23             ` luca abeni
  0 siblings, 2 replies; 36+ messages in thread
From: Juri Lelli @ 2018-10-18  8:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, Luca Abeni,
	henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

On 16/10/18 16:03, Peter Zijlstra wrote:
> On Tue, Oct 16, 2018 at 03:24:06PM +0200, Thomas Gleixner wrote:
> > It does reproduce here but with a kworker stall. Looking at the reproducer:
> > 
> >   *(uint32_t*)0x20000000 = 0;
> >   *(uint32_t*)0x20000004 = 6;
> >   *(uint64_t*)0x20000008 = 0;
> >   *(uint32_t*)0x20000010 = 0;
> >   *(uint32_t*)0x20000014 = 0;
> >   *(uint64_t*)0x20000018 = 0x9917;
> >   *(uint64_t*)0x20000020 = 0xffff;
> >   *(uint64_t*)0x20000028 = 0;
> >   syscall(__NR_sched_setattr, 0, 0x20000000, 0);
> > 
> > which means:
> > 
> >   struct sched_attr {
> >   	 .size		= 0,
> > 	 .policy	= 6,
> > 	 .flags		= 0,
> > 	 .nice		= 0,
> > 	 .priority	= 0,
> > 	 .deadline	= 0x9917,
> > 	 .runtime	= 0xffff,
> > 	 .period	= 0,
> >   }
> > 
> > policy 6 is SCHED_DEADLINE
> > 
> > That makes the thread hog the CPU and prevents all kinds of stuff from running.
> > 
> > Peter, is that expected behaviour?
> 
> Sorta, just like FIFO-99 while(1);. Except we should be rejecting the
> above configuration, because of the rule:
> 
>   runtime <= deadline <= period
> 
> Juri, where were we supposed to check that?

OK, looks like the "which means" part above had me fooled, as
we actually have ([1], where the comment is wrong)

 struct sched_attr {
    .size	= 0,
    .policy	= 6,
    .flags	= 0,
    .nice	= 0,
    .priority	= 0,
    .runtime	= 0x9917,
    .deadline	= 0xffff,
    .period	= 0,
 }

So, we seem to be correctly (in theory, see below) accepting the task.

What seems to generate the problem here is that CONFIG_HZ=100 and the
reproducer task has "tiny" runtime (~40us) and deadline (~66us)
parameters, a combination that "bypasses" the enforcing mechanism
(performed at each tick).

Another side problem seems to be that with such tiny parameters we
spend a lot of time in the while (dl_se->runtime <= 0) loop of
replenish_dl_entity() (actually uselessly, as the deadline is most
probably still going to be in the past when runtime eventually becomes
positive again), as delta_exec is huge w.r.t. runtime and runtime has
to keep up with tiny increments of dl_runtime. I guess we could
ameliorate things here by limiting the number of times we execute the
loop before bailing out.
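
The loop in question, for reference, slightly trimmed from
replenish_dl_entity(); the lag check after it is what produces the
"sched: DL replenish lagged too much" line visible in the console
output of the original report:

  while (dl_se->runtime <= 0) {
  	dl_se->deadline += pi_se->dl_period;
  	dl_se->runtime += pi_se->dl_runtime;
  }

  /*
   * If we ended up with a deadline still in the past, give up and
   * resync from "now": this is the "lagged too much" case.
   */
  if (dl_time_before(dl_se->deadline, rq_clock(rq))) {
  	printk_deferred_once("sched: DL replenish lagged too much\n");
  	dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
  	dl_se->runtime = pi_se->dl_runtime;
  }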

Enabling HRTICK makes a difference [2]. I played a bit with several
combinations and could verify that parameters in the ~50us range seem
usable. However, it is still worth mentioning that when runtime gets
close to deadline (very high bandwidth) enforcement can be tricked
again, as hrtick overhead might make the task effectively execute for
more than its runtime, overshooting the replenish instant (old
deadline), so the replenish timer is not set, letting the task continue
executing after a replenishment.

This is all however very much platform and config dependent, of course.

So, I tend to think that we might want to play safe and put some higher
minimum value for dl_runtime (it's currently at 1ULL << DL_SCALE).
Guess the problem is to pick a reasonable value, though. Maybe link it
someway to HZ? Then we might add a sysctl (or similar) thing with which
knowledgeable users can do whatever they think their platform/config can
support?
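
(To make the sysctl idea concrete, a minimal sketch; the knob name,
default and helper are purely illustrative, nothing like this exists
in mainline:)

  /* hypothetical minimum-runtime knob, in ns */
  unsigned int sysctl_sched_dl_runtime_min = 1U << DL_SCALE;

  static bool dl_runtime_acceptable(u64 runtime)
  {
  	/* keep the existing DL_SCALE floor as an absolute minimum */
  	return runtime >= max_t(u64, sysctl_sched_dl_runtime_min,
  				1ULL << DL_SCALE);
  }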

Thoughts?

I'm adding more people on Cc as I'm not sure they are following this.
Thread starts here [3].

1 - https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/sched/types.h#L70
2 - noticed that we don't actually start hrtick on setup_new_dl_entity()
    and think we should (see the sketch below)
3 - https://lore.kernel.org/lkml/000000000000a4ee200578172fde@google.com/
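
(Regarding footnote 2: the enforcement helper in question is, under
CONFIG_SCHED_HRTICK, the existing one below; the observation is that
nothing arms it when a new dl entity is set up:)

  static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
  {
  	hrtick_start(rq, p->dl.runtime);
  }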


* Re: INFO: rcu detected stall in do_idle
  2018-10-18  8:28           ` Juri Lelli
@ 2018-10-18  9:48             ` Peter Zijlstra
  2018-10-18 10:10               ` Juri Lelli
  2018-10-18 10:33               ` luca abeni
  2018-10-18 10:23             ` luca abeni
  1 sibling, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2018-10-18  9:48 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, Luca Abeni,
	henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

On Thu, Oct 18, 2018 at 10:28:38AM +0200, Juri Lelli wrote:

> Another side problem seems to be that with such tiny parameters we
> spend a lot of time in the while (dl_se->runtime <= 0) loop of
> replenish_dl_entity() (actually uselessly, as the deadline is most
> probably still going to be in the past when runtime eventually becomes
> positive again), as delta_exec is huge w.r.t. runtime and runtime has
> to keep up with tiny increments of dl_runtime. I guess we could
> ameliorate things here by limiting the number of times we execute the
> loop before bailing out.

That's the "DL replenish lagged too much" case, right? Yeah, there is
only so much we can recover from.

Funny that GCC actually emits that loop; sometimes we've had to fight
GCC not to turn that into a division.

But yes, I suppose we can put a limit on how many periods we can lag
before just giving up.

> So, I tend to think that we might want to play safe and put some higher
> minimum value for dl_runtime (it's currently at 1ULL << DL_SCALE).
> Guess the problem is to pick a reasonable value, though. Maybe link it
> someway to HZ? Then we might add a sysctl (or similar) thing with which
> knowledgeable users can do whatever they think their platform/config can
> support?

Yes, a HZ related limit sounds like something we'd want. But if we're
going to do a minimum sysctl, we should also consider adding a maximum,
if you set a massive period/deadline, you can, even with a relatively
low u, incur significant delays.

And do we want to put the limit on runtime or on period ?

That is, something like:

  TICK_NSEC/2 < period < 10*TICK_NSEC

and/or

  TICK_NSEC/2 < runtime < 10*TICK_NSEC

Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far too
low, 24Hz needs ~41ms. We can of course also limit the runtime by
capping u for users (as we should anyway).





* Re: INFO: rcu detected stall in do_idle
  2018-10-18  9:48             ` Peter Zijlstra
@ 2018-10-18 10:10               ` Juri Lelli
  2018-10-18 10:38                 ` luca abeni
  2018-10-18 10:33               ` luca abeni
  1 sibling, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-10-18 10:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, Luca Abeni,
	henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

On 18/10/18 11:48, Peter Zijlstra wrote:
> On Thu, Oct 18, 2018 at 10:28:38AM +0200, Juri Lelli wrote:
> 
> > Another side problem seems to be that with such tiny parameters we
> > spend a lot of time in the while (dl_se->runtime <= 0) loop of
> > replenish_dl_entity() (actually uselessly, as the deadline is most
> > probably still going to be in the past when runtime eventually becomes
> > positive again), as delta_exec is huge w.r.t. runtime and runtime has
> > to keep up with tiny increments of dl_runtime. I guess we could
> > ameliorate things here by limiting the number of times we execute the
> > loop before bailing out.
> 
> That's the "DL replenish lagged too much" case, right? Yeah, there is
> only so much we can recover from.

Right.

> Funny that GCC actually emits that loop; sometimes we've had to fight
> GCC not to turn that into a division.
> 
> But yes, I suppose we can put a limit on how many periods we can lag
> before just giving up.

OK.

> > So, I tend to think that we might want to play safe and put some higher
> > minimum value for dl_runtime (it's currently at 1ULL << DL_SCALE).
> > Guess the problem is to pick a reasonable value, though. Maybe link it
> > someway to HZ? Then we might add a sysctl (or similar) thing with which
> > knowledgeable users can do whatever they think their platform/config can
> > support?
> 
> Yes, a HZ related limit sounds like something we'd want. But if we're
> going to do a minimum sysctl, we should also consider adding a maximum,
> if you set a massive period/deadline, you can, even with a relatively
> low u, incur significant delays.
> 
> And do we want to put the limit on runtime or on period ?
> 
> That is, something like:
> 
>   TICK_NSEC/2 < period < 10*TICK_NSEC
> 
> and/or
> 
>   TICK_NSEC/2 < runtime < 10*TICK_NSEC
> 
> Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far too
> low, 24Hz needs ~41ms. We can of course also limit the runtime by
> capping u for users (as we should anyway).

I also thought of TICK_NSEC/2 as a reasonably safe lower limit, that
will implicitly limit period as well since

   runtime <= deadline <= period

Not sure about the upper limit, though. Lower limit is something related
to the inherent granularity of the platform/config, upper limit is more
to do with highest prio stuff with huge period delaying everything else;
doesn't seem to be related to HZ?

Maybe we could just pick something that seems reasonably big to handle
SCHED_DEADLINE users' needs and not too big to jeopardize everyone else,
say 0.5s?


* Re: INFO: rcu detected stall in do_idle
  2018-10-18  8:28           ` Juri Lelli
  2018-10-18  9:48             ` Peter Zijlstra
@ 2018-10-18 10:23             ` luca abeni
  2018-10-18 10:47               ` Juri Lelli
  1 sibling, 1 reply; 36+ messages in thread
From: luca abeni @ 2018-10-18 10:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

Hi Juri,

On Thu, 18 Oct 2018 10:28:38 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
>  struct sched_attr {
>     .size	= 0,
>     .policy	= 6,
>     .flags	= 0,
>     .nice	= 0,
>     .priority	= 0,
>     .runtime	= 0x9917,
>     .deadline	= 0xffff,
>     .period	= 0,
>  }
> 
> So, we seem to be correctly (in theory, see below) accepting the task.
> 
> What seems to generate the problem here is that CONFIG_HZ=100 and the
> reproducer task has "tiny" runtime (~40us) and deadline (~66us)
> parameters, a combination that "bypasses" the enforcing mechanism
> (performed at each tick).

Ok, so the task can execute for at most 1 tick before being throttled...
Which does not look too bad.

I missed the original emails, but maybe the issue is that the task
blocks before the tick, and when it wakes up again something goes wrong
with the deadline and runtime assignment? (maybe because the deadline
is in the past?)


> Another side problem seems to be that with such tiny parameters we
> spend a lot of time in the while (dl_se->runtime <= 0) loop of
> replenish_dl_entity() (actually uselessly, as the deadline is most
> probably still going to be in the past when runtime eventually becomes
> positive again), as delta_exec is huge w.r.t. runtime and runtime has
> to keep up with tiny increments of dl_runtime. I guess we could
> ameliorate things here by limiting the number of times we execute the
> loop before bailing out.

Actually, I think the loop will iterate at most 10ms / 39us times, which
is about 256 times, right? If this is too much (I do not know how much
time is spent executing the loop), then the solution is (as you
suggest) to increase the minimum allowed runtime.

[...]
> So, I tend to think that we might want to play safe and put some
> higher minimum value for dl_runtime (it's currently at 1ULL <<
> DL_SCALE). Guess the problem is to pick a reasonable value, though.
> Maybe link it someway to HZ?

Yes, a value dependent on HZ looks like a good idea. I would propose
HZ / N, where N is the maximum number of times you want the loop above
to be executed.


> Then we might add a sysctl (or similar)
> thing with which knowledgeable users can do whatever they think their
> platform/config can support?

I guess this can be related to the utilization limits we were
discussing some time ago... I would propose a cgroup-based interface to
set all of these limits.



				Luca


* Re: INFO: rcu detected stall in do_idle
  2018-10-18  9:48             ` Peter Zijlstra
  2018-10-18 10:10               ` Juri Lelli
@ 2018-10-18 10:33               ` luca abeni
  2018-10-19 13:14                 ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: luca abeni @ 2018-10-18 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi Peter,

On Thu, 18 Oct 2018 11:48:50 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > So, I tend to think that we might want to play safe and put some
> > higher minimum value for dl_runtime (it's currently at 1ULL <<
> > DL_SCALE). Guess the problem is to pick a reasonable value, though.
> > Maybe link it someway to HZ? Then we might add a sysctl (or
> > similar) thing with which knowledgeable users can do whatever they
> > think their platform/config can support?  
> 
> Yes, a HZ related limit sounds like something we'd want. But if we're
> going to do a minimum sysctl, we should also consider adding a
> maximum, if you set a massive period/deadline, you can, even with a
> relatively low u, incur significant delays.

I agree with this.


> And do we want to put the limit on runtime or on period ?

I think we should have a minimum allowed runtime, a maximum allowed
runtime, a minimum allowed period and a (per-user? per-control
group?) maximum allowed utilization.

I suspect having a maximum period is useless, if we already enforce a
maximum runtime.


> That is, something like:
> 
>   TICK_NSEC/2 < period < 10*TICK_NSEC

As written above I would not enforce a maximum period.


> 
> and/or
> 
>   TICK_NSEC/2 < runtime < 10*TICK_NSEC

I think (but I might be wrong) that "TICK_NSEC/2" is too large... I
would divide the tick by a larger number (how many times do we want to
allow the loop to run?)

And I think the maximum runtime should not be TICK-dependent... It is
the maximum amount of time for which we allow the deadline task to
starve non-deadline tasks, so it should be an absolute time, not
something HZ-dependent... No?



> Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far
> too low, 24Hz needs ~41ms. We can of course also limit the runtime by
> capping u for users (as we should anyway).

Regarding capping u for users: some time ago, with Juri we discussed
the idea of having per-cgroup limits on the deadline utilization... I
think this is a good idea (and if the userspace creates a cgroup per
user, this results in per-user capping - but it is more flexible in
general)



				Luca


* Re: INFO: rcu detected stall in do_idle
  2018-10-18 10:10               ` Juri Lelli
@ 2018-10-18 10:38                 ` luca abeni
  0 siblings, 0 replies; 36+ messages in thread
From: luca abeni @ 2018-10-18 10:38 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

Hi Juri,

On Thu, 18 Oct 2018 12:10:08 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> > Yes, a HZ related limit sounds like something we'd want. But if
> > we're going to do a minimum sysctl, we should also consider adding
> > a maximum, if you set a massive period/deadline, you can, even with
> > a relatively low u, incur significant delays.
> > 
> > And do we want to put the limit on runtime or on period ?
> > 
> > That is, something like:
> > 
> >   TICK_NSEC/2 < period < 10*TICK_NSEC
> > 
> > and/or
> > 
> >   TICK_NSEC/2 < runtime < 10*TICK_NSEC
> > 
> > Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far
> > too low, 24Hz needs ~41ms. We can of course also limit the runtime
> > by capping u for users (as we should anyway).  
> 
> I also thought of TICK_NSEC/2 as a reasonably safe lower limit

I tend to think that something larger than "2" should be used (maybe
10? I mean: even if HZ = 100, it might make sense to allow a runtime
equal to 1ms...)


> that will implicitly limit period as well since
> 
>    runtime <= deadline <= period

I agree that if we end up with TICK_NSEC/2 for the runtime limit then
explicitly enforcing a minimum period is not needed.



> Not sure about the upper limit, though. Lower limit is something
> related to the inherent granularity of the platform/config, upper
> limit is more to do with highest prio stuff with huge period delaying
> everything else; doesn't seem to be related to HZ?

I agree


			Luca


* Re: INFO: rcu detected stall in do_idle
  2018-10-18 10:23             ` luca abeni
@ 2018-10-18 10:47               ` Juri Lelli
  2018-10-18 11:08                 ` luca abeni
  0 siblings, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-10-18 10:47 UTC (permalink / raw)
  To: luca abeni
  Cc: Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

Hi,

On 18/10/18 12:23, luca abeni wrote:
> Hi Juri,
> 
> On Thu, 18 Oct 2018 10:28:38 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> >  struct sched_attr {
> >     .size	= 0,
> >     .policy	= 6,
> >     .flags	= 0,
> >     .nice	= 0,
> >     .priority	= 0,
> >     .runtime	= 0x9917,
> >     .deadline	= 0xffff,
> >     .period	= 0,
> >  }
> > 
> > So, we seem to be correctly (in theory, see below) accepting the task.
> > 
> > What seems to generate the problem here is that CONFIG_HZ=100 and the
> > reproducer task has "tiny" runtime (~40us) and deadline (~66us)
> > parameters, a combination that "bypasses" the enforcing mechanism
> > (performed at each tick).
> 
> Ok, so the task can execute for at most 1 tick before being throttled...
> Which does not look too bad.
> 
> I missed the original emails, but maybe the issue is that the task
> blocks before the tick, and when it wakes up again something goes wrong
> with the deadline and runtime assignment? (maybe because the deadline
> is in the past?)

No, the problem is that the task won't be throttled at all, because its
replenishing instant is always way in the past when tick occurs. :-/
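
(For the record, the mechanics, trimmed from kernel/sched/deadline.c:
when runtime is depleted, update_curr_dl() tries to arm the
replenishment timer, and start_dl_timer() refuses to arm a timer whose
expiry is already in the past, so the task is replenished on the spot
and keeps running:)

  if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
  	dl_se->dl_throttled = 1;
  	__dequeue_task_dl(rq, curr, 0);
  	if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
  		enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
  	...
  }

  /* in start_dl_timer(): */
  act = ns_to_ktime(dl_next_period(dl_se));	/* based on the stale deadline */
  ...
  if (ktime_us_delta(act, now) < 0)
  	return 0;	/* caller falls back to an immediate replenishment */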

> > Another side problem seems to be that with such tiny parameters we
> > spend a lot of time in the while (dl_se->runtime <= 0) loop of
> > replenish_dl_entity() (actually uselessly, as the deadline is most
> > probably still going to be in the past when runtime eventually becomes
> > positive again), as delta_exec is huge w.r.t. runtime and runtime has
> > to keep up with tiny increments of dl_runtime. I guess we could
> > ameliorate things here by limiting the number of times we execute the
> > loop before bailing out.
> 
> Actually, I think the loop will iterate at most 10ms / 39us times, which
> is about 256 times, right? If this is too much (I do not know how much
> time is spent executing the loop), then the solution is (as you
> suggest) to increase the minimum allowed runtime.

Yeah, it's maybe not a big issue (and fixing it won't change anything
regarding the real problem at hand). Just thought I'd mention what I was
seeing; and having the loop limit won't harm anyway I guess.

> [...]
> > So, I tend to think that we might want to play safe and put some
> > higher minimum value for dl_runtime (it's currently at 1ULL <<
> > DL_SCALE). Guess the problem is to pick a reasonable value, though.
> > Maybe link it someway to HZ?
> 
> Yes, a value dependent on HZ looks like a good idea. I would propose
> HZ / N, where N is the maximum number of times you want the loop above
> to be executed.

Mmm, it's not really about the loop, but about the granularity at which
we do enforcement.

> > Then we might add a sysctl (or similar)
> > thing with which knowledgeable users can do whatever they think their
> > platform/config can support?
> 
> I guess this can be related to the utilization limits we were
> discussing some time ago... I would propose a cgroup-based interface to
> set all of these limits.

Guess we can go down that path as well. But I'd leave it for a later stage.

Thanks,

- Juri


* Re: INFO: rcu detected stall in do_idle
  2018-10-18 10:47               ` Juri Lelli
@ 2018-10-18 11:08                 ` luca abeni
  2018-10-18 12:21                   ` Juri Lelli
  2018-10-19 11:39                   ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: luca abeni @ 2018-10-18 11:08 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

On Thu, 18 Oct 2018 12:47:13 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> Hi,
> 
> On 18/10/18 12:23, luca abeni wrote:
> > Hi Juri,
> > 
> > On Thu, 18 Oct 2018 10:28:38 +0200
> > Juri Lelli <juri.lelli@redhat.com> wrote:
> > [...]  
> > >  struct sched_attr {
> > >     .size	= 0,
> > >     .policy	= 6,
> > >     .flags	= 0,
> > >     .nice	= 0,
> > >     .priority	= 0,
> > >     .runtime	= 0x9917,
> > >     .deadline	= 0xffff,
> > >     .period	= 0,
> > >  }
> > > 
> > > So, we seem to be correctly (in theory, see below) accepting the
> > > task.
> > > 
> > > What seems to generate the problem here is that CONFIG_HZ=100 and the
> > > reproducer task has "tiny" runtime (~40us) and deadline (~66us)
> > > parameters, a combination that "bypasses" the enforcing mechanism
> > > (performed at each tick).
> > 
> > Ok, so the task can execute for at most 1 tick before being
> > throttled... Which does not look too bad.
> > 
> > I missed the original emails, but maybe the issue is that the task
> > blocks before the tick, and when it wakes up again something goes
> > wrong with the deadline and runtime assignment? (maybe because the
> > deadline is in the past?)  
> 
> No, the problem is that the task won't be throttled at all, because
> its replenishing instant is always way in the past when tick
> occurs. :-/

Ok, I see the issue now: the problem is that the "while (dl_se->runtime
<= 0)" loop is executed at replenishment time, but the deadline should
be postponed at enforcement time.

I mean: in update_curr_dl() we do:
	dl_se->runtime -= scaled_delta_exec;
	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
		...
		enqueue replenishment timer at dl_next_period(dl_se)
But dl_next_period() is based on a "wrong" deadline!


I think that inserting a
        while (dl_se->runtime <= -pi_se->dl_runtime) {
                dl_se->deadline += pi_se->dl_period;
                dl_se->runtime += pi_se->dl_runtime;
        }
immediately after "dl_se->runtime -= scaled_delta_exec;" would fix the
problem, no?
If we go this way, then we can remove the while loop from
replenish_dl_entity(), and change it in
	WARN_ON(dl_se->runtime <= -pi_se->dl_runtime);
	WARN_ON(dl_se->runtime > 0);
	dl_se->deadline += pi_se->dl_period;
        dl_se->runtime += pi_se->dl_runtime;
or something similar.
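
(Putting the two fragments together, the proposed flow in
update_curr_dl() would roughly be:)

  dl_se->runtime -= scaled_delta_exec;

  /* postpone the deadline as soon as a whole runtime is owed, so the
   * replenishment timer below is armed against a fresh deadline */
  while (dl_se->runtime <= -pi_se->dl_runtime) {
  	dl_se->deadline += pi_se->dl_period;
  	dl_se->runtime += pi_se->dl_runtime;
  }

  if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
  	...
  	/* enqueue replenishment timer at dl_next_period(dl_se),
  	   now based on the postponed deadline */
  }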


			Luca


* Re: INFO: rcu detected stall in do_idle
  2018-10-18 11:08                 ` luca abeni
@ 2018-10-18 12:21                   ` Juri Lelli
  2018-10-18 12:36                     ` luca abeni
  2018-10-19 11:39                   ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-10-18 12:21 UTC (permalink / raw)
  To: luca abeni
  Cc: Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

On 18/10/18 13:08, luca abeni wrote:
> On Thu, 18 Oct 2018 12:47:13 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > Hi,
> > 
> > On 18/10/18 12:23, luca abeni wrote:
> > > Hi Juri,
> > > 
> > > On Thu, 18 Oct 2018 10:28:38 +0200
> > > Juri Lelli <juri.lelli@redhat.com> wrote:
> > > [...]  
> > > >  struct sched_attr {
> > > >     .size	= 0,
> > > >     .policy	= 6,
> > > >     .flags	= 0,
> > > >     .nice	= 0,
> > > >     .priority	= 0,
> > > >     .runtime	= 0x9917,
> > > >     .deadline	= 0xffff,
> > > >     .period	= 0,
> > > >  }
> > > > 
> > > > So, we seem to be correctly (in theory, see below) accepting the
> > > > task.
> > > > 
> > > > What seems to generate the problem here is that CONFIG_HZ=100 and the
> > > > reproducer task has "tiny" runtime (~40us) and deadline (~66us)
> > > > parameters, a combination that "bypasses" the enforcing mechanism
> > > > (performed at each tick).
> > > 
> > > Ok, so the task can execute for at most 1 tick before being
> > > throttled... Which does not look too bad.
> > > 
> > > I missed the original emails, but maybe the issue is that the task
> > > blocks before the tick, and when it wakes up again something goes
> > > wrong with the deadline and runtime assignment? (maybe because the
> > > deadline is in the past?)  
> > 
> > No, the problem is that the task won't be throttled at all, because
> > its replenishing instant is always way in the past when tick
> > occurs. :-/
> 
> Ok, I see the issue now: the problem is that the "while (dl_se->runtime
> <= 0)" loop is executed at replenishment time, but the deadline should
> be postponed at enforcement time.
> 
> I mean: in update_curr_dl() we do:
> 	dl_se->runtime -= scaled_delta_exec;
> 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> 		...
> 		enqueue replenishment timer at dl_next_period(dl_se)
> But dl_next_period() is based on a "wrong" deadline!
> 
> 
> I think that inserting a
>         while (dl_se->runtime <= -pi_se->dl_runtime) {
>                 dl_se->deadline += pi_se->dl_period;
>                 dl_se->runtime += pi_se->dl_runtime;
>         }
> immediately after "dl_se->runtime -= scaled_delta_exec;" would fix the
> problem, no?

Mmm, I also thought of letting the task "pay back" its overrunning. But
doesn't this get us quite far from what one would expect? I mean,
enforcement granularity will be way different from task period, no?


* Re: INFO: rcu detected stall in do_idle
  2018-10-18 12:21                   ` Juri Lelli
@ 2018-10-18 12:36                     ` luca abeni
  0 siblings, 0 replies; 36+ messages in thread
From: luca abeni @ 2018-10-18 12:36 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

Hi Juri,

On Thu, 18 Oct 2018 14:21:42 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> > > > I missed the original emails, but maybe the issue is that the
> > > > task blocks before the tick, and when it wakes up again
> > > > something goes wrong with the deadline and runtime assignment?
> > > > (maybe because the deadline is in the past?)    
> > > 
> > > No, the problem is that the task won't be throttled at all,
> > > because its replenishing instant is always way in the past when
> > > tick occurs. :-/  
> > 
> > Ok, I see the issue now: the problem is that the "while
> > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > the deadline should be postponed at enforcement time.
> > 
> > I mean: in update_curr_dl() we do:
> > 	dl_se->runtime -= scaled_delta_exec;
> > 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > 		...
> > 		enqueue replenishment timer at dl_next_period(dl_se)
> > But dl_next_period() is based on a "wrong" deadline!
> > 
> > 
> > I think that inserting a
> >         while (dl_se->runtime <= -pi_se->dl_runtime) {
> >                 dl_se->deadline += pi_se->dl_period;
> >                 dl_se->runtime += pi_se->dl_runtime;
> >         }
> > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > the problem, no?  
> 
> Mmm, I also thought of letting the task "pay back" its overrunning. But
> doesn't this get us quite far from what one would expect? I mean,
> enforcement granularity will be way different from task period, no?

Yes, the granularity will be what the kernel can provide (due to the HZ
value and to the hrtick on/off state). But at least the task will not
starve non-deadline tasks (which is the bug that originated this
discussion, I think).

If I understand well, there are two different (and orthogonal) issues
here:
1) Due to a bug in the accounting / enforcement mechanisms
   (the wrong placement of the while() loop), the task consumes 100% of
   the CPU time, starving non-deadline tasks
2) Due to the large HZ value, the small runtime (and period) and the
   fact that hrtick is disabled, the kernel cannot provide the
   requested scheduling granularity

The second issue can be fixed by imposing limits on minimum and maximum
runtime and the first issue can be fixed by changing the code as I
suggested in my previous email.

I would suggest addressing both issues, with separate changes
(the current replenishment code looks strange anyway).



				Luca


* Re: INFO: rcu detected stall in do_idle
  2018-10-18 11:08                 ` luca abeni
  2018-10-18 12:21                   ` Juri Lelli
@ 2018-10-19 11:39                   ` Peter Zijlstra
  2018-10-19 20:50                     ` luca abeni
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2018-10-19 11:39 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> Ok, I see the issue now: the problem is that the "while (dl_se->runtime
> <= 0)" loop is executed at replenishment time, but the deadline should
> be postponed at enforcement time.
> 
> I mean: in update_curr_dl() we do:
> 	dl_se->runtime -= scaled_delta_exec;
> 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> 		...
> 		enqueue replenishment timer at dl_next_period(dl_se)
> But dl_next_period() is based on a "wrong" deadline!
> 
> 
> I think that inserting a
>         while (dl_se->runtime <= -pi_se->dl_runtime) {
>                 dl_se->deadline += pi_se->dl_period;
>                 dl_se->runtime += pi_se->dl_runtime;
>         }
> immediately after "dl_se->runtime -= scaled_delta_exec;" would fix the
> problem, no?

That certainly makes sense to me. The only remaining issue would then be
placing a limit on the number of times we can take that loop, which, as
you propose in a later email, can be done separately as a limit on
runtime.


* Re: INFO: rcu detected stall in do_idle
  2018-10-18 10:33               ` luca abeni
@ 2018-10-19 13:14                 ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2018-10-19 13:14 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Thu, Oct 18, 2018 at 12:33:32PM +0200, luca abeni wrote:
> Hi Peter,
> 
> On Thu, 18 Oct 2018 11:48:50 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > > So, I tend to think that we might want to play safe and put some
> > > higher minimum value for dl_runtime (it's currently at 1ULL <<
> > > DL_SCALE). Guess the problem is to pick a reasonable value, though.
> > > Maybe link it someway to HZ? Then we might add a sysctl (or
> > > similar) thing with which knowledgeable users can do whatever they
> > > think their platform/config can support?  
> > 
> > Yes, a HZ related limit sounds like something we'd want. But if we're
> > going to do a minimum sysctl, we should also consider adding a
> > maximum, if you set a massive period/deadline, you can, even with a
> > relatively low u, incur significant delays.
> 
> I agree with this.
> 
> 
> > And do we want to put the limit on runtime or on period ?
> 
> I think we should have a minimum allowed runtime, a maximum allowed
> runtime, a minimum allowed period and a (per-user? per-control
> group?) maximum allowed utilization.

I was talking about a global !root max-u, but yes the cgroup max-u makes
definite sense as well.

> I suspect having a maximum period is useless, if we already enforce a
> maximum runtime.

Probably; yes. The asymmetry is unfortunate of course.

> > That is, something like:
> > 
> >   TICK_NSEC/2 < period < 10*TICK_NSEC
> 
> As written above I would not enforce a maximum period.

I'm confused: 'period < 10*TICK_NSEC' reads like a max to me.

(irrespective of the argument on whether the max should be HZ related;
and I think you and Juri made a good argument for it not to be)

> > and/or
> > 
> >   TICK_NSEC/2 < runtime < 10*TICK_NSEC
> 
> I think (but I might be wrong) that "TICK_NSEC/2" is too large... I
> would divide the tick by a larger number (how many times do we want to
> allow the loop to run?)

It depends on how strictly we want to enforce the no-interference rule.
The smaller we make this, the less accurately we enforce, and the worse
the interference between tasks.

Note that we're only talking about a default; and HZ=100 is daft in any
case.

> And I think the maximum runtime should not be TICK-dependent... It is
> the maximum amount of time for which we allow the deadline task to
> starve non-deadline tasks, so it should be an absolute time, not
> something HZ-dependent... No?

Agreed.

> > Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far
> > too low, 24Hz needs ~41ms. We can of course also limit the runtime by
> > capping u for users (as we should anyway).
> 
> Regarding capping u for users: some time ago, with Juri we discussed
> the idea of having per-cgroup limits on the deadline utilization... I
> think this is a good idea (and if the userspace creates a cgroup per
> user, this results in per-user capping - but it is more flexible in
> general)

Agreed, that makes sense.


* Re: INFO: rcu detected stall in do_idle
  2018-10-19 11:39                   ` Peter Zijlstra
@ 2018-10-19 20:50                     ` luca abeni
  2018-10-24 12:03                       ` Juri Lelli
  0 siblings, 1 reply; 36+ messages in thread
From: luca abeni @ 2018-10-19 20:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Fri, 19 Oct 2018 13:39:42 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> > Ok, I see the issue now: the problem is that the "while
> > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > the deadline should be postponed at enforcement time.
> > 
> > I mean: in update_curr_dl() we do:
> > 	dl_se->runtime -= scaled_delta_exec;
> > 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > 		...
> > 		enqueue replenishment timer at dl_next_period(dl_se)
> > But dl_next_period() is based on a "wrong" deadline!
> > 
> > 
> > I think that inserting a
> >         while (dl_se->runtime <= -pi_se->dl_runtime) {
> >                 dl_se->deadline += pi_se->dl_period;
> >                 dl_se->runtime += pi_se->dl_runtime;
> >         }
> > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > the problem, no?  
> 
> That certainly makes sense to me.

Good; I'll try to work on this idea over the weekend.


			Thanks,
				Luca

> The only remaining issue would then
> be placing a limit on the number of times we can take that loop,
> which, as you propose in a later email, can be done separately as a
> limit on runtime.



* Re: INFO: rcu detected stall in do_idle
  2018-10-19 20:50                     ` luca abeni
@ 2018-10-24 12:03                       ` Juri Lelli
  2018-10-27 11:16                         ` Dmitry Vyukov
  2018-10-30 10:45                         ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Juri Lelli @ 2018-10-24 12:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On 19/10/18 22:50, luca abeni wrote:
> On Fri, 19 Oct 2018 13:39:42 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> > > Ok, I see the issue now: the problem is that the "while
> > > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > > the deadline should be postponed at enforcement time.
> > > 
> > > I mean: in update_curr_dl() we do:
> > > 	dl_se->runtime -= scaled_delta_exec;
> > > 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > > 		...
> > > 		enqueue replenishment timer at dl_next_period(dl_se)
> > > But dl_next_period() is based on a "wrong" deadline!
> > > 
> > > 
> > > I think that inserting a
> > >         while (dl_se->runtime <= -pi_se->dl_runtime) {
> > >                 dl_se->deadline += pi_se->dl_period;
> > >                 dl_se->runtime += pi_se->dl_runtime;
> > >         }
> > > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > > the problem, no?  
> > 
> > That certainly makes sense to me.
> 
> Good; I'll try to work on this idea over the weekend.

So, we (Luca and I) managed to spend some more time on this and found a
few more things worth sharing. I'll try to summarize what we have got so
far (including what was already discussed), because the impression is
that each point might deserve a fix or at least consideration (just
amazing how a simple random fuzzer thing can highlight all that :).
Apologies for the long email.

The reproducer runs on a CONFIG_HZ=100, CONFIG_IRQ_TIME_ACCOUNTING kernel
and does something like this (only the bits that seem to matter here)

int main(void)
{
  [...]
  [setup stuff at 0x2001d000]
  syscall(__NR_perf_event_open, 0x2001d000, 0, -1, -1, 0);
  *(uint32_t*)0x20000000 = 0;
  *(uint32_t*)0x20000004 = 6;
  *(uint64_t*)0x20000008 = 0;
  *(uint32_t*)0x20000010 = 0;
  *(uint32_t*)0x20000014 = 0;
  *(uint64_t*)0x20000018 = 0x9917; <-- ~40us
  *(uint64_t*)0x20000020 = 0xffff; <-- ~65us (~60% bandwidth)
  *(uint64_t*)0x20000028 = 0;
  syscall(__NR_sched_setattr, 0, 0x20000000, 0);
  [busy loop]
  return 0;
}

And this causes problems because the task is actually never throttled.

Pain points:

 1. Granularity of enforcement (at each tick) is huge compared with
    the task runtime. This makes starting the replenishment timer,
    when runtime is depleted, always fail (because the old deadline
    is way in the past). So, the task is fully replenished and put
    back to run.

    - Luca's proposal should help here, since the deadline is postponed
      at throttling time, and replenishment timer set to that (and it
      should be in the future)

 1.1 Even if we fix 1. in a configuration like this, the task would
     still be able to run for ~10ms (worst case) and potentially starve
     other tasks. It maybe doesn't seem too big an interval, but there
     might be other very short activities that might miss an occasion
     to run "quickly".

     - Might be fixed by imposing (via sysctl) reasonable defaults for
       minimum runtime (w.r.t. HZ, like HZ/2) and maximum for period
       (as also a very small bandwidth task can have a big runtime if
       period is big as well)

 (1.2) When runtime becomes very negative (because delta_exec was big)
       we seem to spend a lot of time inside the replenishment loop.

       - Not sure it's such a big problem, might need more profiling.
         Feeling is that once the other points are addressed this
         won't matter anymore.
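
       In case it does, the loop could be replaced by a closed form
       along these lines (untested sketch, same semantics as the
       current loop in replenish_dl_entity()):

	if (dl_se->runtime <= 0) {
		/* full periods needed to make runtime positive again */
		u64 periods = div64_u64(-dl_se->runtime,
					pi_se->dl_runtime) + 1;

		dl_se->deadline += periods * pi_se->dl_period;
		dl_se->runtime  += periods * pi_se->dl_runtime;
	}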

 2. This is related to the perf_event_open syscall the reproducer does
    before becoming DEADLINE and entering the busy loop. Enabling perf
    swevents generates a lot of hrtimer load that happens in the
    reproducer task context. Now, DEADLINE uses rq_clock() for setting
    deadlines, but rq_clock_task() for doing runtime enforcement.
    In a situation like this it seems that the amount of irq pressure
    becomes pretty big (I'm seeing this on kvm, real hw should maybe do
    better, but the pain point remains I guess), so rq_clock() and
    rq_clock_task() might become more and more skewed w.r.t. each other.
    Since rq_clock() is only used when setting absolute deadlines for
    the first time (or when resetting them in certain cases), after a
    bit the replenishment code will start to see postponed deadlines
    always in the past w.r.t. rq_clock(). And this brings us back to the
    fact that the task is never stopped, since it can't keep up with
    rq_clock().

    - Not sure yet how we want to address this [1]. We could use
      rq_clock() everywhere, but tasks might be penalized by irq
      pressure (theoretically this would mandate that irqs are
      explicitly accounted for I guess). I tried to use the skew between
      the two clocks to "fix" deadlines, but that puts us at risk of
      de-synchronizing userspace and kernel views of deadlines.
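
    For reference, the skew comes from update_rq_clock_task(): with
    CONFIG_IRQ_TIME_ACCOUNTING it conceptually does (simplified, steal
    time accounting left out)

	rq->clock      += delta;		/* rq_clock()      */
	rq->clock_task += delta - irq_delta;	/* rq_clock_task() */

    so under sustained irq pressure rq_clock_task() keeps falling
    further behind rq_clock().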

 3. HRTICK is not started for new entities.
 
    - Already got a patch for it.

This should be it, I hope. Luca (thanks a lot for your help), please
add anything I missed or correct me where I got it wrong.

Thoughts?

Best,

- Juri

1 - https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L1162

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-24 12:03                       ` Juri Lelli
@ 2018-10-27 11:16                         ` Dmitry Vyukov
  2018-10-28  8:33                           ` Juri Lelli
  2018-10-30 10:45                         ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Dmitry Vyukov @ 2018-10-27 11:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, luca abeni, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, Ingo Molnar, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

On Wed, Oct 24, 2018 at 1:03 PM, Juri Lelli <juri.lelli@redhat.com> wrote:
>
> [...]
>
> So, we (Luca and I) managed to spend some more time on this and found a
> few more things worth sharing. I'll try to summarize what we have got so
> far (including what was already discussed), because the impression is
> that each point might deserve a fix, or at least consideration (just
> amazing how a simple random fuzzer thing can highlight all that :).

1. Fuzzing finds bugs in any code. Always.
If code wasn't fuzzed, there are bugs.

2. This fuzzer is not so simple ;)


> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-27 11:16                         ` Dmitry Vyukov
@ 2018-10-28  8:33                           ` Juri Lelli
  0 siblings, 0 replies; 36+ messages in thread
From: Juri Lelli @ 2018-10-28  8:33 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, luca abeni, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, Ingo Molnar, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino,
	Daniel Bristot de Oliveira

On 27/10/18 12:16, Dmitry Vyukov wrote:
> On Wed, Oct 24, 2018 at 1:03 PM, Juri Lelli <juri.lelli@redhat.com> wrote:
> >
> > [...]
> >
> > So, we (Luca and I) managed to spend some more time on this and found a
> > few more things worth sharing. I'll try to summarize what we have got so
> > far (including what was already discussed), because the impression is
> > that each point might deserve a fix, or at least consideration (just
> > amazing how a simple random fuzzer thing can highlight all that :).
> 
> 1. Fuzzing finds bugs in any code. Always.
> If code wasn't fuzzed, there are bugs.
> 
> 2. This fuzzer is not so simple ;)

Indeed! I meant that it's amazing how the fuzzer was able to forge a
relatively simple reproducer that highlighted the problem.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-24 12:03                       ` Juri Lelli
  2018-10-27 11:16                         ` Dmitry Vyukov
@ 2018-10-30 10:45                         ` Peter Zijlstra
  2018-10-30 11:08                           ` luca abeni
  2018-10-30 11:12                           ` Juri Lelli
  1 sibling, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2018-10-30 10:45 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Wed, Oct 24, 2018 at 02:03:35PM +0200, Juri Lelli wrote:
> Pain points:
> 
>  1. Granularity of enforcement (at each tick) is huge compared with
>     the task runtime. This makes starting the replenishment timer,
>     when runtime is depleted, always fail (because the old deadline
>     is way in the past). So, the task is fully replenished and put
>     back to run.
> 
>     - Luca's proposal should help here, since the deadline is postponed
>       at throttling time, and the replenishment timer is set to that
>       (so it should be in the future)

ACK

>  1.1 Even if we fix 1. in a configuration like this, the task would
>      still be able to run for ~10ms (worst case) and potentially starve
>      other tasks. Maybe it doesn't seem too big an interval, but there
>      might be other very short activities that miss an occasion to run
>      "quickly".
> 
>      - Might be fixed by imposing (via sysctl) reasonable defaults: a
>        minimum for runtime (w.r.t. HZ, e.g. half a tick) and a maximum
>        for period (since even a very small bandwidth task can have a
>        big runtime if the period is big as well)

ACK

>  (1.2) When runtime becomes very negative (because delta_exec was big)
>        we seem to spend a lot of time inside the replenishment loop.
> 
>        - Not sure it's such a big problem, might need more profiling.
>          Feeling is that once the other points are addressed this
>          won't matter anymore.

Right, once the sysctl limits are in place, we should not have such
excessive cases anymore.

>  2. This is related to the perf_event_open syscall the reproducer does
>     before becoming DEADLINE and entering the busy loop. Enabling perf
>     swevents generates a lot of hrtimer load that happens in the
>     reproducer task context. Now, DEADLINE uses rq_clock() for setting
>     deadlines, but rq_clock_task() for doing runtime enforcement.
>     In a situation like this it seems that the amount of irq pressure
>     becomes pretty big (I'm seeing this on kvm, real hw should maybe do
>     better, but the pain point remains I guess), so rq_clock() and
>     rq_clock_task() might become more and more skewed w.r.t. each other.
>     Since rq_clock() is only used when setting absolute deadlines for
>     the first time (or when resetting them in certain cases), after a
>     bit the replenishment code will start to see postponed deadlines
>     always in the past w.r.t. rq_clock(). And this brings us back to the
>     fact that the task is never stopped, since it can't keep up with
>     rq_clock().
> 
>     - Not sure yet how we want to address this [1]. We could use
>       rq_clock() everywhere, but tasks might be penalized by irq
>       pressure (theoretically this would mandate that irqs are
>       explicitly accounted for I guess). I tried to use the skew between
>       the two clocks to "fix" deadlines, but that puts us at risk of
>       de-synchronizing userspace and kernel views of deadlines.

Hurm.. right. We knew of this issue back when we did it.
I suppose now it hurts and we need to figure something out.

By virtue of being a real-time class, we do indeed need to have deadline
on the wall-clock. But if we then don't account runtime on that same
clock, but on a potentially slower clock, we get the problem that we can
run longer than our period/deadline, which is what we're running into
here I suppose.

And yes, at some point RT workloads need to be aware of the jitter
injected by things like IRQs and such. But I believe the rationale was
that for soft real-time workloads this current semantic was 'easier'
because we get to ignore IRQ overhead for workload estimation etc.

What we could maybe do is track runtime in both rq_clock_task() and
rq_clock() and detect where the rq_clock based one exceeds the period
and then push out the deadline (and add runtime).

Maybe something along such lines; does that make sense?

---
 include/linux/sched.h   |  3 +++
 kernel/sched/deadline.c | 53 ++++++++++++++++++++++++++++++++-----------------
 2 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f8a5418b627..6aec81cb3d2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -522,6 +522,9 @@ struct sched_dl_entity {
 	u64				deadline;	/* Absolute deadline for this instance	*/
 	unsigned int			flags;		/* Specifying the scheduler behaviour	*/
 
+	u64				wallstamp;
+	s64				walltime;
+
 	/*
 	 * Some bool flags:
 	 *
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91e4202b0634..633c8f36c700 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -683,16 +683,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
 	if (dl_se->dl_yielded && dl_se->runtime > 0)
 		dl_se->runtime = 0;
 
-	/*
-	 * We keep moving the deadline away until we get some
-	 * available runtime for the entity. This ensures correct
-	 * handling of situations where the runtime overrun is
-	 * arbitrary large.
-	 */
-	while (dl_se->runtime <= 0) {
-		dl_se->deadline += pi_se->dl_period;
-		dl_se->runtime += pi_se->dl_runtime;
-	}
+	/* XXX what do we do with pi_se */
 
 	/*
 	 * At this point, the deadline really should be "in
@@ -1148,9 +1139,9 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	u64 delta_exec, scaled_delta_exec, delta_wall;
 	int cpu = cpu_of(rq);
-	u64 now;
+	u64 now, wall;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1171,6 +1162,17 @@ static void update_curr_dl(struct rq *rq)
 		return;
 	}
 
+	wall = rq_clock(rq);
+	delta_wall = wall - dl_se->wallstamp;
+	if (delta_wall > 0) {
+		dl_se->walltime += delta_wall;
+		dl_se->wallstamp = wall;
+	}
+
+	/* check if rq_clock_task() has been too slow */
+	if (unlikely(dl_se->walltime > dl_se->dl_period))
+		goto throttle;
+
 	schedstat_set(curr->se.statistics.exec_max,
 		      max(curr->se.statistics.exec_max, delta_exec));
 
@@ -1204,14 +1206,27 @@ static void update_curr_dl(struct rq *rq)
 
 	dl_se->runtime -= scaled_delta_exec;
 
-throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
+throttle:
 		dl_se->dl_throttled = 1;
 
-		/* If requested, inform the user about runtime overruns. */
-		if (dl_runtime_exceeded(dl_se) &&
-		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
-			dl_se->dl_overrun = 1;
+		if (dl_runtime_exceeded(dl_se)) {
+			/* If requested, inform the user about runtime overruns. */
+			if (dl_se->flags & SCHED_FLAG_DL_OVERRUN)
+				dl_se->dl_overrun = 1;
+
+		}
+
+		/*
+		 * We keep moving the deadline away until we get some available
+		 * runtime for the entity. This ensures correct handling of
+		 * situations where the runtime overrun is arbitrary large.
+		 */
+		while (dl_se->runtime <= 0 || dl_se->walltime > dl_se->dl_period) {
+			dl_se->deadline += dl_se->dl_period;
+			dl_se->runtime  += dl_se->dl_runtime;
+			dl_se->walltime -= dl_se->dl_period;
+		}
 
 		__dequeue_task_dl(rq, curr, 0);
 		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
@@ -1751,9 +1766,10 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq_clock_task(rq);
+	dl_se->wallstamp = rq_clock(rq);
 
 	/* Running task will never be pushed. */
-       dequeue_pushable_dl_task(rq, p);
+	dequeue_pushable_dl_task(rq, p);
 
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
@@ -1811,6 +1827,7 @@ static void set_curr_task_dl(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq_clock_task(rq);
+	p->dl.wallstamp = rq_clock(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-30 10:45                         ` Peter Zijlstra
@ 2018-10-30 11:08                           ` luca abeni
  2018-10-31 16:18                             ` Daniel Bristot de Oliveira
  2018-10-30 11:12                           ` Juri Lelli
  1 sibling, 1 reply; 36+ messages in thread
From: luca abeni @ 2018-10-30 11:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi Peter,

On Tue, 30 Oct 2018 11:45:54 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> >  2. This is related to the perf_event_open syscall the reproducer
> >     does before becoming DEADLINE and entering the busy loop.
> >     Enabling perf swevents generates a lot of hrtimer load that
> >     happens in the reproducer task context. Now, DEADLINE uses
> >     rq_clock() for setting deadlines, but rq_clock_task() for doing
> >     runtime enforcement. In a situation like this it seems that the
> >     amount of irq pressure becomes pretty big (I'm seeing this on
> >     kvm, real hw should maybe do better, but the pain point remains
> >     I guess), so rq_clock() and rq_clock_task() might become more
> >     and more skewed w.r.t. each other. Since rq_clock() is only used
> >     when setting absolute deadlines for the first time (or when
> >     resetting them in certain cases), after a bit the replenishment
> >     code will start to see postponed deadlines always in the past
> >     w.r.t. rq_clock(). And this brings us back to the fact that the
> >     task is never stopped, since it can't keep up with rq_clock().
> > 
> >     - Not sure yet how we want to address this [1]. We could use
> >       rq_clock() everywhere, but tasks might be penalized by irq
> >       pressure (theoretically this would mandate that irqs are
> >       explicitly accounted for I guess). I tried to use the skew
> >       between the two clocks to "fix" deadlines, but that puts us at
> >       risk of de-synchronizing userspace and kernel views of
> >       deadlines.
> 
> Hurm.. right. We knew of this issue back when we did it.
> I suppose now it hurts and we need to figure something out.
> 
> By virtue of being a real-time class, we do indeed need to have
> deadline on the wall-clock. But if we then don't account runtime on
> that same clock, but on a potentially slower clock, we get the
> problem that we can run longer than our period/deadline, which is
> what we're running into here I suppose.

I might be hugely misunderstanding something here, but in my impression
the issue is just that if the IRQ time is not accounted to the
-deadline task, then the non-deadline tasks might be starved.

I do not see this as a skew between two clocks, but as an accounting
thing:
- if we decide that the IRQ time is accounted to the -deadline
  task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
  then the non-deadline tasks are not starved (but of course the
  -deadline task executes for less than its reserved time in the
  period); 
- if we decide that the IRQ time is not accounted to the -deadline task
  (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
  the -deadline task executes for the expected amount of time (about
  60% of the CPU time), but an IRQ load of 40% will starve non-deadline
  tasks (this is what happens in the bug that triggered this discussion)

I think this might be seen as an admission control issue: when
CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
in the admission control (because it ends up in the task's runtime),
but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
accounted for in the admission test (the IRQ handler becomes some sort
of entity with a higher priority than -deadline tasks, on which no
accounting or enforcement is performed).
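
To put numbers on the second case, with the reproducer's parameters
this is, roughly:

	U_task = 39191ns / 65535ns ~= 0.6  (enforced on rq_clock_task())
	U_irq  ~= 0.4                      (hrtimer load, never admitted)

	U_task + U_irq ~= 1.0  =>  nothing left for non-deadline tasks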



> And yes, at some point RT workloads need to be aware of the jitter
> injected by things like IRQs and such. But I believe the rationale was
> that for soft real-time workloads this current semantic was 'easier'
> because we get to ignore IRQ overhead for workload estimation etc.
> 
> What we could maybe do is track runtime in both rq_clock_task() and
> rq_clock() and detect where the rq_clock based one exceeds the period
> and then push out the deadline (and add runtime).
> 
> Maybe something along such lines; does that make sense?

Uhm... I have to study and test your patch... I'll comment on this
later.



			Thanks,
				Luca


> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-30 10:45                         ` Peter Zijlstra
  2018-10-30 11:08                           ` luca abeni
@ 2018-10-30 11:12                           ` Juri Lelli
  2018-11-06 11:44                             ` Juri Lelli
  1 sibling, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-10-30 11:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On 30/10/18 11:45, Peter Zijlstra wrote:

[...]

> Hurm.. right. We knew of this issue back when we did it.
> I suppose now it hurts and we need to figure something out.
> 
> By virtue of being a real-time class, we do indeed need to have deadline
> on the wall-clock. But if we then don't account runtime on that same
> clock, but on a potentially slower clock, we get the problem that we can
> run longer than our period/deadline, which is what we're running into
> here I suppose.
> 
> And yes, at some point RT workloads need to be aware of the jitter
> injected by things like IRQs and such. But I believe the rationale was
> that for soft real-time workloads this current semantic was 'easier'
> because we get to ignore IRQ overhead for workload estimation etc.

Right. In this case the task is self-injecting IRQ load, but that maybe
doesn't make a big difference in how we need to treat it (supposing we
can actually distinguish).

> What we could maybe do is track runtime in both rq_clock_task() and
> rq_clock() and detect where the rq_clock based one exceeds the period
> and then push out the deadline (and add runtime).
> 
> Maybe something along such lines; does that make sense?

Yeah, I think I've got the gist of the idea. I'll play with it.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-30 11:08                           ` luca abeni
@ 2018-10-31 16:18                             ` Daniel Bristot de Oliveira
  2018-10-31 16:40                               ` Juri Lelli
  2018-10-31 17:38                               ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-10-31 16:18 UTC (permalink / raw)
  To: luca abeni, Peter Zijlstra
  Cc: Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino

On 10/30/18 12:08 PM, luca abeni wrote:
> Hi Peter,
> 
> On Tue, 30 Oct 2018 11:45:54 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
>>>  2. This is related to the perf_event_open syscall the reproducer
>>>     does before becoming DEADLINE and entering the busy loop.
>>>     Enabling perf swevents generates a lot of hrtimer load that
>>>     happens in the reproducer task context. Now, DEADLINE uses
>>>     rq_clock() for setting deadlines, but rq_clock_task() for doing
>>>     runtime enforcement. In a situation like this it seems that the
>>>     amount of irq pressure becomes pretty big (I'm seeing this on
>>>     kvm, real hw should maybe do better, but the pain point remains
>>>     I guess), so rq_clock() and rq_clock_task() might become more
>>>     and more skewed w.r.t. each other. Since rq_clock() is only used
>>>     when setting absolute deadlines for the first time (or when
>>>     resetting them in certain cases), after a bit the replenishment
>>>     code will start to see postponed deadlines always in the past
>>>     w.r.t. rq_clock(). And this brings us back to the fact that the
>>>     task is never stopped, since it can't keep up with rq_clock().
>>>
>>>     - Not sure yet how we want to address this [1]. We could use
>>>       rq_clock() everywhere, but tasks might be penalized by irq
>>>       pressure (theoretically this would mandate that irqs are
>>>       explicitly accounted for I guess). I tried to use the skew
>>>       between the two clocks to "fix" deadlines, but that puts us at
>>>       risk of de-synchronizing userspace and kernel views of
>>>       deadlines.
>>
>> Hurm.. right. We knew of this issue back when we did it.
>> I suppose now it hurts and we need to figure something out.
>>
>> By virtue of being a real-time class, we do indeed need to have
>> deadline on the wall-clock. But if we then don't account runtime on
>> that same clock, but on a potentially slower clock, we get the
>> problem that we can run longer than our period/deadline, which is
>> what we're running into here I suppose.
> 
> I might be hugely misunderstanding something here, but in my impression
> the issue is just that if the IRQ time is not accounted to the
> -deadline task, then the non-deadline tasks might be starved.
> 
> I do not see this as a skew between two clocks, but as an accounting
> thing:
> - if we decide that the IRQ time is accounted to the -deadline
>   task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
>   then the non-deadline tasks are not starved (but of course the
> >   -deadline task executes for less than its reserved time in the
>   period); 
> - if we decide that the IRQ time is not accounted to the -deadline task
>   (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
>   the -deadline task executes for the expected amount of time (about
>   60% of the CPU time), but an IRQ load of 40% will starve non-deadline
>   tasks (this is what happens in the bug that triggered this discussion)
> 
> > I think this might be seen as an admission control issue: when
> CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
> in the admission control (because it ends up in the task's runtime),
> but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
> accounted for in the admission test (the IRQ handler becomes some sort
> of entity with a higher priority than -deadline tasks, on which no
> accounting or enforcement is performed).
> 

I am sorry for taking too long to join in the discussion.

I agree with Luca. I've seen this behavior two times before: first when we were
trying to make the rt throttling have a very short runtime for non-rt
threads, and then in the proof of concept of the semi-partitioned scheduler.

At first, I started thinking of this as a skew between both clocks and disabled
IRQ_TIME_ACCOUNTING. But by ignoring IRQ accounting, we are assuming that the
IRQ runtime will be accounted as the thread's runtime. In other words, we are
just sweeping the trash under the rug, where the rug is the worst case execution
time estimation/definition (which is an even more complex problem). In the
Brazilian part of the Ph.D. we are dealing with probabilistic worst case
execution time, and to be able to use probabilistic methods, we need to remove
the noise of the IRQs from the execution time [1]. So, IMHO, using
CONFIG_IRQ_TIME_ACCOUNTING is a good thing.

The fact that we have barely any control over the execution of IRQs makes, at
first glance, the idea of considering an IRQ as a task seem absurd. But it is
not. An IRQ runs a piece of code that is, in the vast majority of cases, not
related to the current thread, so it runs another "task". In the occurrence
of more than one concurrent IRQ, the processor serves the IRQs in a
predictable order [2], so the processor schedules the IRQs as "tasks".
Finally, there are precedence constraints among threads and IRQs. For
instance, the latency can be seen as the response time of the timer IRQ
handler, plus the delta between the return of the handler and the start of
the execution of cyclictest [3]. In the theory, the idea of precedence
constraints is also about "tasks".

So IMHO, IRQs can be considered as tasks (I am considering them as such in my
model), and the place to account for this would be in the admission test.
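
Just to fix ideas, the naive shape of such a test would be something
like

	U_irq + sum_i(runtime_i / period_i) <= M

with U_irq some (measured or declared) bound on the IRQ utilization and
M the number of CPUs.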

The problem is that, to the best of my knowledge, there is no admission test
for such a task model/system:

Two levels of schedulers. A high priority scheduler that schedules a non
preemptive task set (IRQs) under fixed priority (the processor scheduler does
it, and on Intel it is a fixed priority). A lower priority task set (threads)
scheduled by the OS.

But our current admission control is more of a safeguard than an exact
admission control - that is, for multiprocessor it is necessary, but not
sufficient. (Theoretically, it works for uniprocessor, but... there is a
paper by Rob Davis somewhere that shows that if we have "context switches"
(and so schedulers, in our case) with different costs, then many things do
not hold true, for instance, Deadline Monotonic is not optimal... but I will
have to read more to get into this point; anyway, for multiprocessor it is
only necessary.)

With this in mind: we do *not* use/have an exact admission test for all cases.
By not having an exact admission test, we assume the user knows what he/she is
doing. In this case, if they have a high load of IRQs... they need to know that:

1) Their periods should be consistent with the "interference" they might receive.
2) Their tasks can miss the deadline because of IRQs (and there is no way to
avoid this without "throttling" IRQs...)

So, is it worth putting duct tape on this case?

My fear is that, by putting duct tape here, we would make things prone to more
complex errors/non-determinism... so...

I think we have another point to add to the discussion at plumbers, Juri.

[1] http://bristot.me/wp-content/uploads/2018/09/conference_071817.pdf
[2] Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3,
    section: 6.9 PRIORITY AMONG SIMULTANEOUS EXCEPTIONS AND INTERRUPTS.
[3] I will detail this a little bit more in the plumbers presentation.

> 
>> And yes, at some point RT workloads need to be aware of the jitter
>> injected by things like IRQs and such. But I believe the rationale was
>> that for soft real-time workloads this current semantic was 'easier'
>> because we get to ignore IRQ overhead for workload estimation etc.
>>
>> What we could maybe do is track runtime in both rq_clock_task() and
>> rq_clock() and detect where the rq_clock based one exceeds the period
>> and then push out the deadline (and add runtime).
>>
>> Maybe something along such lines; does that make sense?
> 
> Uhm... I have to study and test your patch... I'll comment on this
> later.
> 
> 
> 
> 			Thanks,
> 				Luca
> 
> 
>>
>> [...]
>> @@ -1148,9 +1139,9 @@ static void update_curr_dl(struct rq *rq)
>>  {
>>  	struct task_struct *curr = rq->curr;
>>  	struct sched_dl_entity *dl_se = &curr->dl;
>> -	u64 delta_exec, scaled_delta_exec;
>> +	u64 delta_exec, scaled_delta_exec, delta_wall;
>>  	int cpu = cpu_of(rq);
>> -	u64 now;
>> +	u64 now, wall;
>>  
>>  	if (!dl_task(curr) || !on_dl_rq(dl_se))
>>  		return;
>> @@ -1171,6 +1162,17 @@ static void update_curr_dl(struct rq *rq)
>>  		return;
>>  	}
>>  
>> +	wall = rq_clock(rq);
>> +	delta_wall = wall - dl_se->wallstamp;
>> +	if (delta_wall > 0) {
>> +		dl_se->walltime += delta_wall;
>> +		dl_se->wallstamp = wall;
>> +	}
>> +
>> +	/* check if rq_clock_task() has been too slow */
>> +	if (unlikely(dl_se->walltime > dl_se->dl_period))
>> +		goto throttle;
>> +

If I got it correctly, it can be the case that we would throttle a thread that,
because of IRQs, received less CPU time than expected, right?

Thanks!
-- Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-31 16:18                             ` Daniel Bristot de Oliveira
@ 2018-10-31 16:40                               ` Juri Lelli
  2018-10-31 17:39                                 ` Peter Zijlstra
  2018-10-31 17:58                                 ` Daniel Bristot de Oliveira
  2018-10-31 17:38                               ` Peter Zijlstra
  1 sibling, 2 replies; 36+ messages in thread
From: Juri Lelli @ 2018-10-31 16:40 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino

On 31/10/18 17:18, Daniel Bristot de Oliveira wrote:
> [...]
> With this in mind: we do *not* use/have an exact admission test for all cases.
> By not having an exact admission test, we assume the user knows what he/she is
> doing. In this case, if they have a high load of IRQs... they need to know that:
> 
> 1) Their periods should be consistent with the "interference" they might receive.
> 2) Their tasks can miss the deadline because of IRQs (and there is no way to
> avoid this without "throttling" IRQs...)
> 
> So, is it worth putting duct tape on this case?
> 
> My fear is that, by putting duct tape here, we would make things prone to more
> complex errors/non-determinism... so...
> 
> I think we have another point to add to the discussion at plumbers, Juri.

Yeah, sure. My fear in a case like this though is that the task that
ends up starving others is "creating" IRQ overhead on itself. Kind of a
DoS, no?

I'm seeing something along the lines of what Peter suggested as a last
resort measure we probably still need to put in place.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-31 16:18                             ` Daniel Bristot de Oliveira
  2018-10-31 16:40                               ` Juri Lelli
@ 2018-10-31 17:38                               ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2018-10-31 17:38 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: luca abeni, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino

On Wed, Oct 31, 2018 at 05:18:00PM +0100, Daniel Bristot de Oliveira wrote:
> Brazilian part of the Ph.D. we are dealing with probabilistic worst case
> execution time, and to be able to use probabilistic methods, we need to remove
> the noise of the IRQs from the execution time [1]. So, IMHO, using
> CONFIG_IRQ_TIME_ACCOUNTING is a good thing.

> With this in mind: we do *not* use/have an exact admission test for all cases.
> By not having an exact admission test, we assume the user knows what he/she is
> doing. In this case, if they have a high load of IRQs... they need to know that:

So I mostly agree with the things you said; IRQs are a different
'context' or 'task' from the normal scheduled task. For AC we can
consider an average IRQ load etc.

But even if we get AC sufficient with an average IRQ load, there are
still the cases where the IRQs cluster. So, esp. for very short
runtimes, you can get this scenario.
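
E.g., with rough numbers:

	avg IRQ utilization over 1s:  5%	-> passes an average-based AC
	one clustered burst of IRQs:  100us
	reproducer's period:          65us	-> a whole period can be
						   wiped out by one burst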

> 1) Their periods should be consistent with the "interference" they might receive.
> 2) Their tasks can miss the deadline because of IRQs (and there is no way to
> avoid this without "throttling" IRQs...)

True.

> So, is it worth putting duct tape on this case?

My point was mostly to not misbehave. Maybe I got it wrong, that
happens ;-)

> >> @@ -1171,6 +1162,17 @@ static void update_curr_dl(struct rq *rq)
> >>  		return;
> >>  	}
> >>  
> >> +	wall = rq_clock(rq);
> >> +	delta_wall = wall - dl_se->wallstamp;
> >> +	if (delta_wall > 0) {
> >> +		dl_se->walltime += delta_wall;
> >> +		dl_se->wallstamp = wall;
> >> +	}
> >> +
> >> +	/* check if rq_clock_task() has been too slow */
> >> +	if (unlikely(dl_se->walltime > dl_se->dl_period))
> >> +		goto throttle;
> >> +
> 
> If I got it correctly, it can be the case that we would throttle a thread that,
> because of IRQs, received less CPU time than expected, right?

So the thinking was that when the actual walltime runtime (which
includes IRQs) exceeds the period, our whole CBS model breaks and we
need to replenish and push forward the deadline.

Maybe it should do that and not throttle.

Also, the walltime -= period thing should be fixed to never go negative
I think.
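
E.g. something like so, on top of the earlier sketch:

	while (dl_se->runtime <= 0 || dl_se->walltime > dl_se->dl_period) {
		dl_se->deadline += dl_se->dl_period;
		dl_se->runtime  += dl_se->dl_runtime;
		dl_se->walltime -= dl_se->dl_period;
	}
	/* don't carry a negative walltime into the next instance */
	if (dl_se->walltime < 0)
		dl_se->walltime = 0;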

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-31 16:40                               ` Juri Lelli
@ 2018-10-31 17:39                                 ` Peter Zijlstra
  2018-10-31 17:58                                 ` Daniel Bristot de Oliveira
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2018-10-31 17:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Daniel Bristot de Oliveira, luca abeni, Thomas Gleixner,
	Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mingo,
	nstange, syzkaller-bugs, henrik, Tommaso Cucinotta,
	Claudio Scordino

On Wed, Oct 31, 2018 at 05:40:10PM +0100, Juri Lelli wrote:
> I'm seeing something along the lines of what Peter suggested as a last
> resort measure we probably still need to put in place.

Yes, you certainly hope to never hit that, and should not in a proper
setup. But we need something to keep things 'working'.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-31 16:40                               ` Juri Lelli
  2018-10-31 17:39                                 ` Peter Zijlstra
@ 2018-10-31 17:58                                 ` Daniel Bristot de Oliveira
  2018-11-01  5:55                                   ` Juri Lelli
  1 sibling, 1 reply; 36+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-10-31 17:58 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino

On 10/31/18 5:40 PM, Juri Lelli wrote:
> On 31/10/18 17:18, Daniel Bristot de Oliveira wrote:
>> On 10/30/18 12:08 PM, luca abeni wrote:
>>> Hi Peter,
>>>
>>> On Tue, 30 Oct 2018 11:45:54 +0100
>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>> [...]
>>>>>  2. This is related to perf_event_open syscall reproducer does
>>>>> before becoming DEADLINE and entering the busy loop. Enabling of
>>>>> perf swevents generates lot of hrtimers load that happens in the
>>>>>     reproducer task context. Now, DEADLINE uses rq_clock() for
>>>>> setting deadlines, but rq_clock_task() for doing runtime
>>>>> enforcement. In a situation like this it seems that the amount of
>>>>> irq pressure becomes pretty big (I'm seeing this on kvm, real hw
>>>>> should maybe do better, pain point remains I guess), so rq_clock()
>>>>> and rq_clock_task() might become more a more skewed w.r.t. each
>>>>> other. Since rq_clock() is only used when setting absolute
>>>>> deadlines for the first time (or when resetting them in certain
>>>>> cases), after a bit the replenishment code will start to see
>>>>> postponed deadlines always in the past w.r.t. rq_clock(). And this
>>>>> brings us back to the fact that the task is never stopped, since it
>>>>> can't keep up with rq_clock().
>>>>>
>>>>>     - Not sure yet how we want to address this [1]. We could use
>>>>>       rq_clock() everywhere, but tasks might be penalized by irq
>>>>>       pressure (theoretically this would mandate that irqs are
>>>>>       explicitly accounted for I guess). I tried to use the skew
>>>>> between the two clocks to "fix" deadlines, but that puts us at
>>>>> risks of de-synchronizing userspace and kernel views of deadlines.  
>>>>
>>>> Hurm.. right. We knew of this issue back when we did it.
>>>> I suppose now it hurts and we need to figure something out.
>>>>
>>>> By virtue of being a real-time class, we do indeed need to have
>>>> deadline on the wall-clock. But if we then don't account runtime on
>>>> that same clock, but on a potentially slower clock, we get the
>>>> problem that we can run longer than our period/deadline, which is
>>>> what we're running into here I suppose.
>>>
>>> I might be hugely misunderstanding something here, but in my impression
>>> the issue is just that if the IRQ time is not accounted to the
>>> -deadline task, then the non-deadline tasks might be starved.
>>>
>>> I do not see this as a skew between two clocks, but as an accounting
>>> thing:
>>> - if we decide that the IRQ time is accounted to the -deadline
>>>   task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
>>>   then the non-deadline tasks are not starved (but of course the
>>>   -deadline tasks executes for less than its reserved time in the
>>>   period); 
>>> - if we decide that the IRQ time is not accounted to the -deadline task
>>>   (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
>>>   the -deadline task executes for the expected amount of time (about
>>>   60% of the CPU time), but an IRQ load of 40% will starve non-deadline
>>>   tasks (this is what happens in the bug that triggered this discussion)
>>>
>>> I think this might be seen as an admission control issue: when
>>> CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
>>> in the admission control (because it ends up in the task's runtime),
>>> but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
>>> accounted for in the admission test (the IRQ handler becomes some sort
>>> of entity with a higher priority than -deadline tasks, on which no
>>> accounting or enforcement is performed).
>>>
>>
>> I am sorry for taking too long to join the discussion.
>>
>> I agree with Luca. I have seen this behavior twice before: first when
>> we were trying to make the rt throttling leave a very short runtime
>> for the non-rt threads, and then in the proof of concept of the
>> semi-partitioned scheduler.
>>
>> At first, I thought of this as a skew between the two clocks and
>> disabled IRQ_TIME_ACCOUNTING. But by ignoring IRQ accounting, we are
>> assuming that the IRQ runtime will be accounted as the thread's
>> runtime. In other words, we are just sweeping the trash under the
>> rug, where the rug is the worst-case execution time
>> estimation/definition (which is an even more complex problem). In the
>> Brazilian part of the Ph.D. we are dealing with probabilistic
>> worst-case execution time, and to be able to use probabilistic
>> methods, we need to remove the noise of the IRQs from the execution
>> time [1]. So, IMHO, using CONFIG_IRQ_TIME_ACCOUNTING is a good thing.
>>
>> The fact that we have barely any control over the execution of IRQs
>> makes the idea of considering an IRQ as a task seem absurd at first
>> glance. But it is not. An IRQ runs a piece of code that is, in the
>> vast majority of cases, not related to the current thread, so it runs
>> another "task". When more than one IRQ occurs concurrently, the
>> processor serves the IRQs in a predictable order [2], so the
>> processor schedules the IRQs as "tasks". Finally, there are
>> precedence constraints among threads and IRQs. For instance, the
>> latency can be seen as the response time of the timer IRQ handler,
>> plus the delta between the return of the handler and the start of the
>> execution of cyclictest [3]. In theory, the idea of precedence
>> constraints is also about "tasks".
>>
>> So IMHO, IRQs can be considered tasks (I am considering them in my
>> model), and the place to account for this would be in the admission
>> test.
>>
>> The problem is that, to the best of my knowledge, there is no
>> admission test for such a task model/system:
>>
>> Two levels of schedulers: a high-priority scheduler that schedules a
>> non-preemptive task set (IRQs) under fixed priority (the processor
>> does this scheduling, and on Intel it is fixed priority), and a
>> lower-priority task set (threads) scheduled by the OS.
>>
>> But our current admission control is more of a safeguard than an
>> exact admission control - that is, for multiprocessors it is
>> necessary, but not sufficient. (Theoretically, it works for
>> uniprocessors, but... there is a paper by Rob Davis somewhere showing
>> that if "context switches" (and so the scheduler, in our case) have
>> different costs, then many things do not hold true - for instance,
>> Deadline Monotonic is not optimal... but I will have to read more
>> before getting into this point; anyway, for multiprocessors it is
>> only necessary.)
>>
>> With this in mind: we do *not* use/have an exact admission test for
>> all cases. By not having an exact admission test, we assume the user
>> knows what he/she is doing. In this case, if they have a high load of
>> IRQs... they need to know that:
>>
>> 1) Their periods should be consistent with the "interference" they
>> might receive.
>> 2) Their tasks can miss deadlines because of IRQs (and there is no
>> way to avoid this without "throttling" IRQs...).
>>
>> So, is it worth putting duct tape on this case?
>>
>> My fear is that, by putting duct tape here, we would make things
>> prone to more complex errors/nondeterminism... so...
>>
>> I think we have another point to add to the discussion at Plumbers,
>> Juri.
> 
> Yeah, sure. My fear in a case like this though is that the task that
> ends up starving others is "creating" IRQ overhead on itself. Kind of
> a DoS, no?

I see your point.

But how about a non-rt thread that creates a lot of timers, and then a
DL task arrives, preempts it, and receives the interference from
interrupts that were caused by the previous thread?

Actually, enabling/disabling sched stats in a loop generates an IPI
storm on all (other) CPUs because of updates to jump labels (we will
reduce/bound that with the batched jump label update, but still, the
problem will exist). And not only that: IIRC we can also cause this
with a madvise that causes the flush of pages.
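
Something like the sketch below is enough to trigger it - illustration
only, assuming root and that /proc/sys/kernel/sched_schedstats is
available; each write flips the schedstats static key, and the jump
label update ends up IPIing all the other CPUs:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/kernel/sched_schedstats", O_WRONLY);

	if (fd < 0)
		return 1;

	for (;;) {
		/* each toggle: jump label update -> text poke -> IPIs */
		write(fd, "1", 1);
		write(fd, "0", 1);
	}
}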

> I'm seeing something along the lines of what Peter suggested as a last
> resort measure we probably still need to put in place.

I meant, I am not against the/a fix; I just think that... it is more
complicated than it seems.

For example: let's assume that we have a non-rt bad thread A on CPU 0
generating IPIs because of a static key update, and a good dl thread B
on CPU 1.

In this case, thread B could run for less than what was reserved for
it, although it was not the one causing the interrupts. It is not fair
to put a penalty on thread B.

The same is valid for a dl thread running on the same CPU that is
receiving a lot of network packets destined for another application,
and other legit cases.

In the end, if we want to avoid starving non-rt threads, we need to
prioritize them some of the time; but in that case, we are back to the
DL server for non-rt threads.

Thoughts?

Thanks,
-- Daniel


> Thanks,
> 
> - Juri
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-31 17:58                                 ` Daniel Bristot de Oliveira
@ 2018-11-01  5:55                                   ` Juri Lelli
  2018-11-02 10:00                                     ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-11-01  5:55 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino

On 31/10/18 18:58, Daniel Bristot de Oliveira wrote:
> On 10/31/18 5:40 PM, Juri Lelli wrote:
> > On 31/10/18 17:18, Daniel Bristot de Oliveira wrote:
> >> On 10/30/18 12:08 PM, luca abeni wrote:
> >>> Hi Peter,
> >>>
> >>> On Tue, 30 Oct 2018 11:45:54 +0100
> >>> Peter Zijlstra <peterz@infradead.org> wrote:
> >>> [...]
> >>>>>  2. This is related to the perf_event_open syscall the reproducer
> >>>>> does before becoming DEADLINE and entering the busy loop. Enabling
> >>>>> perf swevents generates a lot of hrtimer load that happens in the
> >>>>> reproducer task context. Now, DEADLINE uses rq_clock() for setting
> >>>>> deadlines, but rq_clock_task() for doing runtime enforcement. In a
> >>>>> situation like this it seems that the amount of irq pressure
> >>>>> becomes pretty big (I'm seeing this on kvm, real hw should maybe
> >>>>> do better, pain point remains I guess), so rq_clock() and
> >>>>> rq_clock_task() might become more and more skewed w.r.t. each
> >>>>> other. Since rq_clock() is only used when setting absolute
> >>>>> deadlines for the first time (or when resetting them in certain
> >>>>> cases), after a bit the replenishment code will start to see
> >>>>> postponed deadlines always in the past w.r.t. rq_clock(). And this
> >>>>> brings us back to the fact that the task is never stopped, since
> >>>>> it can't keep up with rq_clock().
> >>>>>
> >>>>>     - Not sure yet how we want to address this [1]. We could use
> >>>>>       rq_clock() everywhere, but tasks might be penalized by irq
> >>>>>       pressure (theoretically this would mandate that irqs are
> >>>>>       explicitly accounted for I guess). I tried to use the skew
> >>>>>       between the two clocks to "fix" deadlines, but that puts us
> >>>>>       at risk of de-synchronizing userspace and kernel views of
> >>>>>       deadlines.
> >>>>
> >>>> Hurm.. right. We knew of this issue back when we did it.
> >>>> I suppose now it hurts and we need to figure something out.
> >>>>
> >>>> By virtue of being a real-time class, we do indeed need to have
> >>>> deadline on the wall-clock. But if we then don't account runtime on
> >>>> that same clock, but on a potentially slower clock, we get the
> >>>> problem that we can run longer than our period/deadline, which is
> >>>> what we're running into here I suppose.
> >>>
> >>> I might be hugely misunderstanding something here, but in my impression
> >>> the issue is just that if the IRQ time is not accounted to the
> >>> -deadline task, then the non-deadline tasks might be starved.
> >>>
> >>> I do not see this as a skew between two clocks, but as an accounting
> >>> thing:
> >>> - if we decide that the IRQ time is accounted to the -deadline
> >>>   task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
> >>>   then the non-deadline tasks are not starved (but of course the
> >>>   -deadline task executes for less than its reserved time in the
> >>>   period); 
> >>> - if we decide that the IRQ time is not accounted to the -deadline task
> >>>   (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
> >>>   the -deadline task executes for the expected amount of time (about
> >>>   60% of the CPU time), but an IRQ load of 40% will starve non-deadline
> >>>   tasks (this is what happens in the bug that triggered this discussion)
> >>>
> >>> I think this might be seen as an admission control issue: when
> >>> CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
> >>> in the admission control (because it ends up in the task's runtime),
> >>> but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
> >>> accounted for in the admission test (the IRQ handler becomes some sort
> >>> of entity with a higher priority than -deadline tasks, on which no
> >>> accounting or enforcement is performed).
> >>>
> >>
> >> I am sorry for taking too long to join the discussion.
> >>
> >> I agree with Luca. I have seen this behavior twice before: first when
> >> we were trying to make the rt throttling leave a very short runtime
> >> for the non-rt threads, and then in the proof of concept of the
> >> semi-partitioned scheduler.
> >>
> >> At first, I thought of this as a skew between the two clocks and
> >> disabled IRQ_TIME_ACCOUNTING. But by ignoring IRQ accounting, we are
> >> assuming that the IRQ runtime will be accounted as the thread's
> >> runtime. In other words, we are just sweeping the trash under the
> >> rug, where the rug is the worst-case execution time
> >> estimation/definition (which is an even more complex problem). In the
> >> Brazilian part of the Ph.D. we are dealing with probabilistic
> >> worst-case execution time, and to be able to use probabilistic
> >> methods, we need to remove the noise of the IRQs from the execution
> >> time [1]. So, IMHO, using CONFIG_IRQ_TIME_ACCOUNTING is a good thing.
> >>
> >> The fact that we have barely any control over the execution of IRQs
> >> makes the idea of considering an IRQ as a task seem absurd at first
> >> glance. But it is not. An IRQ runs a piece of code that is, in the
> >> vast majority of cases, not related to the current thread, so it runs
> >> another "task". When more than one IRQ occurs concurrently, the
> >> processor serves the IRQs in a predictable order [2], so the
> >> processor schedules the IRQs as "tasks". Finally, there are
> >> precedence constraints among threads and IRQs. For instance, the
> >> latency can be seen as the response time of the timer IRQ handler,
> >> plus the delta between the return of the handler and the start of the
> >> execution of cyclictest [3]. In theory, the idea of precedence
> >> constraints is also about "tasks".
> >>
> >> So IMHO, IRQs can be considered tasks (I am considering them in my
> >> model), and the place to account for this would be in the admission
> >> test.
> >>
> >> The problem is that, to the best of my knowledge, there is no
> >> admission test for such a task model/system:
> >>
> >> Two levels of schedulers: a high-priority scheduler that schedules a
> >> non-preemptive task set (IRQs) under fixed priority (the processor
> >> does this scheduling, and on Intel it is fixed priority), and a
> >> lower-priority task set (threads) scheduled by the OS.
> >>
> >> But our current admission control is more of a safeguard than an
> >> exact admission control - that is, for multiprocessors it is
> >> necessary, but not sufficient. (Theoretically, it works for
> >> uniprocessors, but... there is a paper by Rob Davis somewhere showing
> >> that if "context switches" (and so the scheduler, in our case) have
> >> different costs, then many things do not hold true - for instance,
> >> Deadline Monotonic is not optimal... but I will have to read more
> >> before getting into this point; anyway, for multiprocessors it is
> >> only necessary.)
> >>
> >> With this in mind: we do *not* use/have an exact admission test for
> >> all cases. By not having an exact admission test, we assume the user
> >> knows what he/she is doing. In this case, if they have a high load of
> >> IRQs... they need to know that:
> >>
> >> 1) Their periods should be consistent with the "interference" they
> >> might receive.
> >> 2) Their tasks can miss deadlines because of IRQs (and there is no
> >> way to avoid this without "throttling" IRQs...).
> >>
> >> So, is it worth putting duct tape on this case?
> >>
> >> My fear is that, by putting duct tape here, we would make things
> >> prone to more complex errors/nondeterminism... so...
> >>
> >> I think we have another point to add to the discussion at Plumbers,
> >> Juri.
> > 
> > Yeah, sure. My fear in a case like this though is that the task that
> > ends up starving others is "creating" IRQ overhead on itself. Kind of
> > a DoS, no?
> 
> I see your point.
> 
> But how about a non-rt thread that creates a lot of timers, and then a
> DL task arrives, preempts it, and receives the interference from
> interrupts that were caused by the previous thread?
> 
> Actually, enabling/disabling sched stats in a loop generates an IPI
> storm on all (other) CPUs because of updates to jump labels (we will
> reduce/bound that with the batched jump label update, but still, the
> problem will exist). And not only that: IIRC we can also cause this
> with a madvise that causes the flush of pages.
> 
> > I'm seeing something along the lines of what Peter suggested as a last
> > resort measure we probably still need to put in place.
> 
> I meant, I am not against the/a fix; I just think that... it is more
> complicated than it seems.
> 
> For example: let's assume that we have a non-rt bad thread A on CPU 0
> generating IPIs because of a static key update, and a good dl thread B
> on CPU 1.
> 
> In this case, thread B could run for less than what was reserved for
> it, although it was not the one causing the interrupts. It is not fair
> to put a penalty on thread B.
> 
> The same is valid for a dl thread running on the same CPU that is
> receiving a lot of network packets destined for another application,
> and other legit cases.
> 
> In the end, if we want to avoid starving non-rt threads, we need to
> prioritize them some of the time; but in that case, we are back to the
> DL server for non-rt threads.
> 
> Thoughts?

And I see your point. :-)

I'd also add (maybe you mentioned this as well) that it seems the same
could happen with the RT throttling safety measure, as we are using
clock_task there as well to account runtime and throttle things.

OTOH, when something like what you describe happens, guarantees are
probably already out of the window, and we should just do our best to
at least keep the system "working"? (Maybe only warn the user that
something bad has happened.)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-11-01  5:55                                   ` Juri Lelli
@ 2018-11-02 10:00                                     ` Daniel Bristot de Oliveira
  2018-11-05 10:55                                       ` Juri Lelli
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-11-02 10:00 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino

On 11/1/18 6:55 AM, Juri Lelli wrote:
>> I meant, I am not against the/a fix; I just think that... it is more
>> complicated than it seems.
>>
>> For example: let's assume that we have a non-rt bad thread A on CPU 0
>> generating IPIs because of a static key update, and a good dl thread B
>> on CPU 1.
>>
>> In this case, thread B could run for less than what was reserved for
>> it, although it was not the one causing the interrupts. It is not fair
>> to put a penalty on thread B.
>>
>> The same is valid for a dl thread running on the same CPU that is
>> receiving a lot of network packets destined for another application,
>> and other legit cases.
>>
>> In the end, if we want to avoid starving non-rt threads, we need to
>> prioritize them some of the time; but in that case, we are back to the
>> DL server for non-rt threads.
>>
>> Thoughts?
> And I see your point. :-)
> 
> I'd also add (maybe you mentioned this as well) that it seems the same
> could happen with the RT throttling safety measure, as we are using
> clock_task there as well to account runtime and throttle things.

Yes! The same problem can happen with the rt scheduler as well! I first
saw this problem with the rt throttling mechanism, when I was trying to
make it work at microsecond granularity (it is only enforced at the
scheduler tick, so it is at ms granularity in practice). After using
hrtimers to do the enforcement at microsecond granularity, I was trying
to leave just a few us for the non-rt tasks. But as the IRQ runtime was
higher than those few us, the rt_rq was never throttled. It is the
same/similar behavior we see here.

As we think of rt throttling as "avoiding the rt workload consuming
more than rt_runtime per rt_period", and considering that IRQs are a
level of task with a fixed priority higher than all the real-time
related schedulers, i.e., deadline and rt, we can safely argue that we
can consider the IRQ time part of the pool of rt workload and account
it against the rt_runtime. The easiest way to do it is to use
rq_clock() in the measurement. I agree.
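
Just to make the two clocks concrete, a sketch - not the mainline code;
the wallstamp bookkeeping here is assumed, not shown:

	/* sketch only: the two deltas in play in update_curr_dl() */
	delta_exec = rq_clock_task(rq) - curr->se.exec_start; /* IRQ time excluded */
	delta_wall = rq_clock(rq) - wallstamp;                /* IRQ time included */

	dl_se->runtime -= delta_exec; /* what is charged today (modulo GRUB scaling) */

	/* charging delta_wall instead would put the IRQ time in the rt pool */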

The point is that the CBS has a dual goal: it avoids a task running for more
than its runtime (a throttling behavior), but it also is used as a guarantee of
runtime for the case in which the task behaves, and the system is not
overloaded. Considering we can have more load than we can schedule in a
multiprocessor - but that is another story.

The obvious reasoning here is: OK boy, but the system IS overloaded in
this case - we have an RCU stall! And that is true if you look at the
processor starving RCU. But if the system has more than one CPU, it
could have CPU time available on another CPU. So, we could just move
the dl task from one CPU to another.

Btw, that is another point. We have the AC with the sum of the
utilization of all CPUs, but we do no enforcement of per-cpu
utilization. If one sets a single thread with runtime=deadline=period
(in a system with more than one CPU) and runs it in a busy loop, we
will eventually have an RCU stall as well (I just did it on my box; I
got a soft lockup). I know this is a different problem. But, maybe,
there is a general solution for both issues:

For instance, if the sum of the execution time of all "tasks" with
priority higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs,
the hypervisor?) on a CPU is higher than the rt_runtime in the
rt_period, we need to avoid what is "avoidable" by trying to move rt
and dl threads away from that CPU. Another possibility is to bump the
priority of the OTHER class (and we are back to the DL server).

- Dude, wouldn't it be easier to just change the CBS?

Yeah, but by changing the CBS, we may end up breaking the algorithms/properties
that rely on CBS... like GRUB, user-space/kernel-space synchronization...

> OTOH, when something like what you describe happens, guarantees are
> probably already out of the window, and we should just do our best to
> at least keep the system "working"? (Maybe only warn the user that
> something bad has happened.)

Btw, don't get me wrong, I am not against changing the CBS: I am just
trying to raise other viewpoints, to avoid touching the base of the DL
scheduler, and to avoid punishing a thread that behaves well.

Anyway, notifying that the dl+rt+IRQ time is higher than the rt_runtime
is another good thing to do as well. We will be notified anyway, either
by RCU or by the softlockup detector... but those are side-effect
warnings. By notifying that we have an overload of rt-or-higher
workload, we will be pointing at the cause.
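
Something as simple as the sketch below would already point at it -
hypothetical: apart from WARN_ONCE(), every identifier here is made up
for illustration:

	/* hypothetical per-CPU check, evaluated once per rt_period */
	if (rt_time + dl_time + irq_time > rt_runtime)
		WARN_ONCE(1, "CPU%d: rt+dl+irq time above rt_runtime\n", cpu);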

Thoughts?

-- Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-11-02 10:00                                     ` Daniel Bristot de Oliveira
@ 2018-11-05 10:55                                       ` Juri Lelli
  2018-11-07 10:12                                         ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2018-11-05 10:55 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino

On 02/11/18 11:00, Daniel Bristot de Oliveira wrote:
> On 11/1/18 6:55 AM, Juri Lelli wrote:
> >> I meant, I am not against the/a fix; I just think that... it is more
> >> complicated than it seems.
> >>
> >> For example: let's assume that we have a non-rt bad thread A on CPU 0
> >> generating IPIs because of a static key update, and a good dl thread B
> >> on CPU 1.
> >>
> >> In this case, thread B could run for less than what was reserved for
> >> it, although it was not the one causing the interrupts. It is not fair
> >> to put a penalty on thread B.
> >>
> >> The same is valid for a dl thread running on the same CPU that is
> >> receiving a lot of network packets destined for another application,
> >> and other legit cases.
> >>
> >> In the end, if we want to avoid starving non-rt threads, we need to
> >> prioritize them some of the time; but in that case, we are back to the
> >> DL server for non-rt threads.
> >>
> >> Thoughts?
> > And I see your point. :-)
> > 
> > I'd also add (maybe you mentioned this as well) that it seems the same
> > could happen with the RT throttling safety measure, as we are using
> > clock_task there as well to account runtime and throttle things.
> 
> Yes! The same problem can happen with the rt scheduler as well! I first
> saw this problem with the rt throttling mechanism, when I was trying to
> make it work at microsecond granularity (it is only enforced at the
> scheduler tick, so it is at ms granularity in practice). After using
> hrtimers to do the enforcement at microsecond granularity, I was trying
> to leave just a few us for the non-rt tasks. But as the IRQ runtime was
> higher than those few us, the rt_rq was never throttled. It is the
> same/similar behavior we see here.
> 
> As we think of rt throttling as "avoiding the rt workload consuming
> more than rt_runtime per rt_period", and considering that IRQs are a
> level of task with a fixed priority higher than all the real-time
> related schedulers, i.e., deadline and rt, we can safely argue that we
> can consider the IRQ time part of the pool of rt workload and account
> it against the rt_runtime. The easiest way to do it is to use
> rq_clock() in the measurement. I agree.
> 
> The point is that the CBS has a dual goal: it avoids a task running for more
> than its runtime (a throttling behavior), but it also is used as a guarantee of
> runtime for the case in which the task behaves, and the system is not
> overloaded. Considering we can have more load than we can schedule in a
> multiprocessor - but that is another story.
> 
> The obvious reasoning here is: OK boy, but the system IS overloaded in
> this case - we have an RCU stall! And that is true if you look at the
> processor starving RCU. But if the system has more than one CPU, it
> could have CPU time available on another CPU. So, we could just move
> the dl task from one CPU to another.

Mmm, only that in this particular case I believe IRQ load will move
together with the migrating task and the problem won't really be solved. :-/

> Btw, that is another point. We have the AC with the sum of the
> utilization of all CPUs, but we do no enforcement of per-cpu
> utilization. If one sets a single thread with runtime=deadline=period
> (in a system with more than one CPU) and runs it in a busy loop, we
> will eventually have an RCU stall as well (I just did it on my box; I
> got a soft lockup). I know this is a different problem. But, maybe,
> there is a general solution for both issues:

This is true. However, the single 100% bandwidth task problem can be
solved by limiting the maximum bandwidth a single entity can ask for. Of
course we can again get to a similar sort of problem if multiple
entities are then co-scheduled on the same CPU, for which we would need
(residual) capacity awareness. This should be less likely though, as
there is a general tendency to spread tasks.

> For instance, if the sum of the execution time of all "tasks" with
> priority higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs,
> the hypervisor?) on a CPU is higher than the rt_runtime in the
> rt_period, we need to avoid what is "avoidable" by trying to move rt
> and dl threads away from that CPU. Another possibility is to bump the
> priority of the OTHER class (and we are back to the DL server).

Kind of weird though having to migrate RT (everything higher than OTHER)
only to make some room for non-RT stuff. Also because one can introduce
unwanted side effects on high prio workloads (cache related overheads,
etc.). OTHER also already has some knowledge about higher prio
activities (rt,dl,irq PELT). So this seems to really leave us with
affined tasks, of all priorities and kinds (real vs. irq).

> 
> - Dude, wouldn't it be easier to just change the CBS?
> 
> Yeah, but by changing the CBS, we may end up breaking the algorithms/properties
> that rely on CBS... like GRUB, user-space/kernel-space synchronization...
> 
> > OTOH, when something like what you describe happens, guarantees are
> > probably already out of the window, and we should just do our best to
> > at least keep the system "working"? (Maybe only warn the user that
> > something bad has happened.)
> 
> Btw, don't get me wrong, I am not against changing the CBS: I am just
> trying to raise other viewpoints, to avoid touching the base of the DL
> scheduler, and to avoid punishing a thread that behaves well.
> 
> Anyway, notifying that the dl+rt+IRQ time is higher than the rt_runtime
> is another good thing to do as well. We will be notified anyway, either
> by RCU or by the softlockup detector... but those are side-effect
> warnings. By notifying that we have an overload of rt-or-higher
> workload, we will be pointing at the cause.

Right. It doesn't solve the problem, but I guess it could help debugging.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-10-30 11:12                           ` Juri Lelli
@ 2018-11-06 11:44                             ` Juri Lelli
  0 siblings, 0 replies; 36+ messages in thread
From: Juri Lelli @ 2018-11-06 11:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
	H. Peter Anvin, LKML, mingo, nstange, syzkaller-bugs, henrik,
	Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On 30/10/18 12:12, Juri Lelli wrote:
> On 30/10/18 11:45, Peter Zijlstra wrote:
> 
> [...]
> 
> > Hurm.. right. We knew of this issue back when we did it.
> > I suppose now it hurts and we need to figure something out.
> > 
> > By virtue of being a real-time class, we do indeed need to have deadline
> > on the wall-clock. But if we then don't account runtime on that same
> > clock, but on a potentially slower clock, we get the problem that we can
> > run longer than our period/deadline, which is what we're running into
> > here I suppose.
> > 
> > And yes, at some point RT workloads need to be aware of the jitter
> > injected by things like IRQs and such. But I believe the rationale was
> > that for soft real-time workloads this current semantic was 'easier'
> > because we get to ignore IRQ overhead for workload estimation etc.
> 
> Right. In this case the task is self injecting IRQ load, but it maybe
> doesn't make a big difference on how we need to treat it (supposing we
> can actually distinguish).
> 
> > What we could maybe do is track runtime in both rq_clock_task() and
> > rq_clock() and detect where the rq_clock based one exceeds the period
> > and then push out the deadline (and add runtime).
> > 
> > Maybe something along such lines; does that make sense?
> 
> Yeah, I think I've got the gist of the idea. I'll play with it.

So, even though I consider Luca and Daniel's points/concerns more than
valid, I spent a bit of time playing with this idea and ended up with
what follows.

I'm adding a deadline adjustment when we realize that there is a
difference between delta_exec and delta_wall (IRQ load) and walltime is
bigger than the configured dl_period. The adjustment to rq_clock (even
though it of course de-synchs user/kernel deadlines) seems needed, as
w/o this kind of thing the task doesn't get throttled for long - a
dl_period at max, I guess - after having been running for quite a
while.

Another change is that I believe we want to keep runtime always below
dl_runtime while pushing the deadline away.

This doesn't seem to modify behavior for tasks that don't misbehave and
are not affected by IRQ load, and it should fix the stalls and the
starvation of lower-prio activities.

Does it make any sense? :-)

--->8---
 include/linux/sched.h   |  3 ++
 kernel/sched/deadline.c | 78 ++++++++++++++++++++++++++++++-----------
 2 files changed, 60 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57d7bc9..e5aaf6deab7e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,6 +521,9 @@ struct sched_dl_entity {
 	u64				deadline;	/* Absolute deadline for this instance	*/
 	unsigned int			flags;		/* Specifying the scheduler behaviour	*/
 
+	u64				wallstamp;
+	s64				walltime;
+
 	/*
 	 * Some bool flags:
 	 *
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 93da990ba458..061562589282 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -696,16 +696,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
 	if (dl_se->dl_yielded && dl_se->runtime > 0)
 		dl_se->runtime = 0;
 
-	/*
-	 * We keep moving the deadline away until we get some
-	 * available runtime for the entity. This ensures correct
-	 * handling of situations where the runtime overrun is
-	 * arbitrary large.
-	 */
-	while (dl_se->runtime <= 0) {
-		dl_se->deadline += pi_se->dl_period;
-		dl_se->runtime += pi_se->dl_runtime;
-	}
+	/* XXX have to deal with PI */
+	dl_se->walltime = 0;
 
 
 	/*
@@ -917,7 +909,7 @@ static int start_dl_timer(struct task_struct *p)
 	 * that it is actually coming from rq->clock and not from
 	 * hrtimer's time base reading.
 	 */
-	act = ns_to_ktime(dl_next_period(dl_se));
+	act = ns_to_ktime(dl_se->deadline - dl_se->dl_period);
 	now = hrtimer_cb_get_time(timer);
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	trace_printk("%s: cpu:%d pid:%d dl_runtime:%llu dl_deadline:%llu dl_period:%llu runtime:%lld deadline:%llu rq_clock:%llu rq_clock_task:%llu act:%lld now:%lld delta:%lld",
@@ -1174,9 +1166,10 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	u64 delta_exec, scaled_delta_exec, delta_wall;
 	int cpu = cpu_of(rq);
-	u64 now;
+	u64 now, wall;
+	bool adjusted = false;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1191,12 +1184,28 @@ static void update_curr_dl(struct rq *rq)
 	 */
 	now = rq_clock_task(rq);
 	delta_exec = now - curr->se.exec_start;
+
+	wall = rq_clock(rq);
+	delta_wall = wall - dl_se->wallstamp;
+
 	if (unlikely((s64)delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
 		return;
 	}
 
+	if (delta_wall > 0) {
+		dl_se->walltime += delta_wall;
+		dl_se->wallstamp = wall;
+	}
+
 	schedstat_set(curr->se.statistics.exec_max,
 		      max(curr->se.statistics.exec_max, delta_exec));
 
@@ -1230,22 +1239,47 @@ static void update_curr_dl(struct rq *rq)
 
 	dl_se->runtime -= scaled_delta_exec;
 
+	if (dl_runtime_exceeded(dl_se) ||
+	    dl_se->dl_yielded ||
+	    unlikely(dl_se->walltime > (s64)dl_se->dl_period)) {
 throttle:
-	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
 		dl_se->dl_throttled = 1;
 
 		/* If requested, inform the user about runtime overruns. */
 		if (dl_runtime_exceeded(dl_se) &&
 		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
 			dl_se->dl_overrun = 1;
-
 		__dequeue_task_dl(rq, curr, 0);
 
+		/*
+		 * We keep moving the deadline away until we get some available
+		 * runtime for the entity. This ensures correct handling of
+		 * situations where the runtime overrun is arbitrary large.
+		 */
+		while (dl_se->runtime <= 0 || dl_se->walltime > (s64)dl_se->dl_period) {
+
+			if (delta_exec != delta_wall &&
+			    dl_se->walltime > (s64)dl_se->dl_period && !adjusted) {
+				dl_se->deadline = wall;
+				adjusted = true;
+			}
+
+			dl_se->deadline += dl_se->dl_period;
+
+			if (dl_se->runtime <= 0)
+				dl_se->runtime  += dl_se->dl_runtime;
+
+			dl_se->walltime -= dl_se->dl_period;
+		}
+
+		WARN_ON_ONCE(dl_se->runtime > dl_se->dl_runtime);
+
 		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
 
@@ -1783,9 +1817,10 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq_clock_task(rq);
+	dl_se->wallstamp = rq_clock(rq);
 
 	/* Running task will never be pushed. */
-       dequeue_pushable_dl_task(rq, p);
+	dequeue_pushable_dl_task(rq, p);
 
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
@@ -1843,6 +1878,7 @@ static void set_curr_task_dl(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq_clock_task(rq);
+	p->dl.wallstamp = rq_clock(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: INFO: rcu detected stall in do_idle
  2018-11-05 10:55                                       ` Juri Lelli
@ 2018-11-07 10:12                                         ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-11-07 10:12 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot,
	Borislav Petkov, H. Peter Anvin, LKML, mingo, nstange,
	syzkaller-bugs, henrik, Tommaso Cucinotta, Claudio Scordino

On 11/5/18 11:55 AM, Juri Lelli wrote:
> On 02/11/18 11:00, Daniel Bristot de Oliveira wrote:
>> On 11/1/18 6:55 AM, Juri Lelli wrote:
>>>> I meant, I am not against the/a fix; I just think that... it is more
>>>> complicated than it seems.
>>>>
>>>> For example: let's assume that we have a non-rt bad thread A on CPU 0
>>>> generating IPIs because of a static key update, and a good dl thread B
>>>> on CPU 1.
>>>>
>>>> In this case, thread B could run for less than what was reserved for
>>>> it, although it was not the one causing the interrupts. It is not fair
>>>> to put a penalty on thread B.
>>>>
>>>> The same is valid for a dl thread running on the same CPU that is
>>>> receiving a lot of network packets destined for another application,
>>>> and other legit cases.
>>>>
>>>> In the end, if we want to avoid starving non-rt threads, we need to
>>>> prioritize them some of the time; but in that case, we are back to the
>>>> DL server for non-rt threads.
>>>>
>>>> Thoughts?
>>> And I see your point. :-)
>>>
>>> I'd also add (maybe you mentioned this as well) that it seems the same
>>> could happen with the RT throttling safety measure, as we are using
>>> clock_task there as well to account runtime and throttle things.
>>
>> Yes! The same problem can happen with the rt scheduler as well! I first
>> saw this problem with the rt throttling mechanism, when I was trying to
>> make it work at microsecond granularity (it is only enforced at the
>> scheduler tick, so it is at ms granularity in practice). After using
>> hrtimers to do the enforcement at microsecond granularity, I was trying
>> to leave just a few us for the non-rt tasks. But as the IRQ runtime was
>> higher than those few us, the rt_rq was never throttled. It is the
>> same/similar behavior we see here.
>>
>> As we think of rt throttling as "avoiding the rt workload consuming
>> more than rt_runtime per rt_period", and considering that IRQs are a
>> level of task with a fixed priority higher than all the real-time
>> related schedulers, i.e., deadline and rt, we can safely argue that we
>> can consider the IRQ time part of the pool of rt workload and account
>> it against the rt_runtime. The easiest way to do it is to use
>> rq_clock() in the measurement. I agree.
>>
>> The point is that the CBS has a dual goal: it avoids a task running for more
>> than its runtime (a throttling behavior), but it also is used as a guarantee of
>> runtime for the case in which the task behaves, and the system is not
>> overloaded. Considering we can have more load than we can schedule in a
>> multiprocessor - but that is another story.
>>
>> The obvious reasoning here is: OK boy, but the system IS overloaded in
>> this case - we have an RCU stall! And that is true if you look at the
>> processor starving RCU. But if the system has more than one CPU, it
>> could have CPU time available on another CPU. So, we could just move
>> the dl task from one CPU to another.
> 
> Mmm, only that in this particular case I believe IRQ load will move
> together with the migrating task and the problem won't really be solved. :-/

The thread would move to another CPU, allowing the (pinned) non-rt
tasks time to run. Later, the bad dl task would be able to return, to
avoid the problem of the deadline task doing the wrong thing on the
other CPU. In this way, non-rt threads would be able to run, avoiding
the RCU stall/softlockup.

That is the idea of the rt throttling.

>> Btw, that is another point. We have the AC with the sum of the
>> utilization of all CPUs, but we do no enforcement of per-cpu
>> utilization. If one sets a single thread with runtime=deadline=period
>> (in a system with more than one CPU) and runs it in a busy loop, we
>> will eventually have an RCU stall as well (I just did it on my box; I
>> got a soft lockup). I know this is a different problem. But, maybe,
>> there is a general solution for both issues:
> 
> This is true. However, the single 100% bandwidth task problem can be
> solved by limiting the maximum bandwidth a single entity can ask for. Of
> course we can again get to a similar sort of problem if multiple
> entities are then co-scheduled on the same CPU, for which we would need
> (residual) capacity awareness. This should be less likely though, as
> there is a general tendency to spread tasks.

Limiting the U of a task does not solve the problem. Moreover, a U = 1
task is not exactly a problem if a proper way to avoid the starvation
of non-rt threads exists. A U = 1 task can exist without causing
damage, by moving the thread between two CPUs, for instance. I know
this is very controversial, but there are many use cases for it. For
instance, NFV polling of the NIC; high-frequency-trading -rt users use
polling mechanisms as well (the discussion of whether that is right or
wrong is another chapter). In practice, these cases are a significant
part of the -rt users.

Still, the problem can happen even if you limit the U per task. You
just need two U = 0.5 tasks to fill the CPU. The global scheduler tends
to spread the load (because it migrates threads very often), I agree.
But the problem can happen, and it will - sooner or later it always
does.
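
For illustration, the busy-loop case above can be as simple as the
sketch below (error handling trimmed; the struct layout follows the
sched_setattr(2) man page). On a box with more than one CPU the AC
admits it, since it sums utilization over all CPUs:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* SCHED_DEADLINE */

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 1000000000;	/* runtime = deadline = period: U = 1 */
	attr.sched_deadline = 1000000000;
	attr.sched_period   = 1000000000;

	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		return 1;

	for (;;)
		;	/* never blocks: monopolizes a CPU, per-cpu housekeeping starves */
}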

>> For instance, if the sum of the execution time of all "tasks" with
>> priority higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs,
>> the hypervisor?) on a CPU is higher than the rt_runtime in the
>> rt_period, we need to avoid what is "avoidable" by trying to move rt
>> and dl threads away from that CPU. Another possibility is to bump the
>> priority of the OTHER class (and we are back to the DL server).
> 
> Kind of weird though having to migrate RT (everything higher than OTHER)
> only to make some room for non-RT stuff.

It is not. That is the idea of rt throttling. The rq is throttled to
avoid starving (per-cpu) non-rt threads that need to run. One could
prevent migrating rt threads, but that is not correct for a global
scheduler, as it would break the work-conserving property.

> Also because one can introduce
> unwanted side effects on high prio workloads (cache related overheads,
> etc.).

Considering that a thread will have to migrate once per rt_period (1s
by default), and only if an rq becomes overloaded, we can say that this
is practically insignificant, given the amount of migrations we already
have in the global scheduler. Well, global has a tendency to spread
tasks by migrating them anyway.

> OTHER also already has some knowledge about higher prio
> activities (rt,dl,irq PELT). So this seems to really leave us with
> affined tasks, of all priorities and kinds (real vs. irq).

I am not an expert in PELT; how does PELT deal with RT throttling?

>>
>> - Dude, wouldn't it be easier to just change the CBS?
>>
>> Yeah, but by changing the CBS, we may end up breaking the algorithms/properties
>> that rely on CBS... like GRUB, user-space/kernel-space synchronization...
>>
>>> OTOH, when something like what you describe happens, guarantees are
>>> probably already out of the window, and we should just do our best to
>>> at least keep the system "working"? (Maybe only warn the user that
>>> something bad has happened.)
>>
>> Btw, don't get me wrong, I am not against changing the CBS: I am just
>> trying to raise other viewpoints, to avoid touching the base of the DL
>> scheduler, and to avoid punishing a thread that behaves well.
>>
>> Anyway, notifying that the dl+rt+IRQ time is higher than the rt_runtime
>> is another good thing to do as well. We will be notified anyway, either
>> by RCU or by the softlockup detector... but those are side-effect
>> warnings. By notifying that we have an overload of rt-or-higher
>> workload, we will be pointing at the cause.
> 
> Right. It doesn't solve the problem, but I guess it could help debugging.
I did not say it was a solution.

I was having a look at the reproducer, and... well, a good part of the
problem can be bounded on the other side of the equation. The
reproducer enables perf sampling, and it is known that perf sampling
can cause problems, and that is why we have limits for it.

The limits point to 25 percent of CPU time for perf sampling...
considering the throttling imprecision because of HZ... we can clearly
see that the system is at > 100% of CPU usage for dl + IRQ.
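
(The limit I am talking about is the perf sampling CPU-time limit,
which defaults to 25; a quick check, assuming procfs is mounted:)

#include <stdio.h>

int main(void)
{
	int pct = 0;
	FILE *f = fopen("/proc/sys/kernel/perf_cpu_time_max_percent", "r");

	if (f && fscanf(f, "%d", &pct) == 1)
		printf("perf sampling may use up to %d%% of CPU time\n", pct);

	return 0;
}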

Again: don't get me wrong, I am aware of, and agree, that there is
another problem, namely the "readjustment of the period/runtime
considering the drift in the execution of the task caused by IRQs."
What I am pointing out here is that there are more general problems
w.r.t. the possibility of starving the per-cpu housekeeping threads
needed by the system (for instance, RCU).

There are many open issues w.r.t. the throttling mechanism, for
instance:

1) We need to take the runtime-accounting imprecision into account in
the AC.
2) The throttling needs to be designed in such a way that we try not to
starve non-rt threads on a given CPU/rq - rather than only in the
system as a whole (i.e., accounting per CPU).
3) We need to consider the IRQ workload as well, to avoid RT+DL+IRQ
using all the CPU time.

... among other things.

-- Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* INFO: rcu detected stall in do_idle
@ 2021-10-27  0:59 Hao Sun
  0 siblings, 0 replies; 36+ messages in thread
From: Hao Sun @ 2021-10-27  0:59 UTC (permalink / raw)
  To: tglx, maarten.lankhorst, mripard, tzimmermann, airlied, Daniel Vetter
  Cc: Linux Kernel Mailing List

Hello,

When using Healer to fuzz the latest Linux kernel, the following crash
was triggered.

HEAD commit: 519d81956ee2 Linux 5.15-rc6
git tree: upstream
console output:
https://drive.google.com/file/d/1_-g_-SXK01zc-1mhvVgiHr7ONSzO5WL8/view?usp=sharing
kernel config: https://drive.google.com/file/d/12PUnxIM1EPBgW4ZJmI7WJBRaY1lA83an/view?usp=sharing

Sorry, I don't have a reproducer for this crash; I hope the symbolized
report can help.
If you fix this issue, please add the following tag to the commit:
Reported-by: Hao Sun <sunhao.th@gmail.com>

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 0-...0: (1 GPs behind) idle=c05/0/0x1 softirq=126007/126009 fqs=5063
rcu: 3-...0: (6 ticks this GP) idle=2a3/0/0x3 softirq=116486/116488 fqs=5063
(detected by 1, t=10502 jiffies, g=215657, q=298)
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:mark_lock.part.0+0x136/0x27c0 kernel/locking/lockdep.c:4571
Code: 00 00 00 48 8d 04 52 48 c1 e0 06 48 05 40 ad cf 8f 48 8d 78 50
48 ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 11 00 <0f> 85
aa 1d 00 00 4c 85 68 50 0f 84 b8 06 00 00 41 ba 01 00 00 00
RSP: 0018:ffffc900000077b0 EFLAGS: 00000046
RAX: ffffffff8fcfee80 RBX: 1ffff92000000efe RCX: 1ffffffff1f9fdda
RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffffffff8fcfeed0
RBP: ffffc90000007918 R08: 0000000000000001 R09: fffffbfff1f9f526
R10: ffffffff8fcfa92f R11: fffffbfff1f9f525 R12: ffffffff8b6bd0b0
R13: 0000000000000100 R14: 0000000000000008 R15: ffffffff8b6bd0d0
FS:  0000000000000000(0000) GS:ffff888063e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005555557bec48 CR3: 000000000b68e000 CR4: 0000000000350ef0
Call Trace:
 <IRQ>
 mark_lock kernel/locking/lockdep.c:4557 [inline]
 mark_usage kernel/locking/lockdep.c:4514 [inline]
 __lock_acquire+0x8e1/0x57e0 kernel/locking/lockdep.c:4969
 lock_acquire kernel/locking/lockdep.c:5625 [inline]
 lock_acquire+0x1ab/0x520 kernel/locking/lockdep.c:5590
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
 lock_hrtimer_base+0x76/0x140 kernel/time/hrtimer.c:173
 hrtimer_try_to_cancel kernel/time/hrtimer.c:1331 [inline]
 hrtimer_try_to_cancel+0xd6/0x570 kernel/time/hrtimer.c:1316
 hrtimer_cancel+0x13/0x40 kernel/time/hrtimer.c:1443
 __disable_vblank drivers/gpu/drm/drm_vblank.c:434 [inline]
 drm_vblank_disable_and_save+0x286/0x3e0 drivers/gpu/drm/drm_vblank.c:478
 vblank_disable_fn+0x161/0x1b0 drivers/gpu/drm/drm_vblank.c:495
 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
 expire_timers kernel/time/timer.c:1466 [inline]
 __run_timers.part.0+0x6b0/0xa90 kernel/time/timer.c:1734
 __run_timers kernel/time/timer.c:1715 [inline]
 run_timer_softirq+0xb6/0x1d0 kernel/time/timer.c:1747
 __do_softirq+0x1d7/0x93b kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu kernel/softirq.c:636 [inline]
 irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
 </IRQ>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:default_idle+0xb/0x10 arch/x86/kernel/process.c:717
Code: cb 41 88 f8 e9 6f fe ff ff e8 c1 41 88 f8 e9 3d fe ff ff e8 17
31 fd ff cc cc cc cc cc cc cc eb 07 0f 00 2d 87 88 50 00 fb f4 <c3> 0f
1f 40 00 41 54 be 08 00 00 00 53 65 48 8b 1c 25 40 f0 01 00
RSP: 0018:ffffffff8b607e28 EFLAGS: 00000202
RAX: 0000000000521dc7 RBX: 0000000000000000 RCX: ffffffff8932e632
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed100c7c6542
R10: ffff888063e32a0b R11: ffffed100c7c6541 R12: 0000000000000000
R13: 0000000000000000 R14: ffffffff8d6d8dd0 R15: 0000000000000000
 default_idle_call+0xc4/0x420 kernel/sched/idle.c:112
 cpuidle_idle_call kernel/sched/idle.c:194 [inline]
 do_idle+0x3f9/0x570 kernel/sched/idle.c:306
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:403
 start_kernel+0x47a/0x49b init/main.c:1142
 secondary_startup_64_no_verify+0xb0/0xbb
Sending NMI from CPU 1 to CPUs 3:
NMI backtrace for cpu 3
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.15.0-rc6 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kvm_wait+0xb3/0x110 arch/x86/kernel/kvm.c:893
Code: 40 38 c6 74 1b 48 83 c4 10 c3 c3 e8 37 70 49 00 eb 07 0f 00 2d
de 1c 56 08 fb f4 48 83 c4 10 c3 eb 07 0f 00 2d ce 1c 56 08 f4 <48> 83
c4 10 c3 89 74 24 0c 48 89 3c 24 e8 5b 6e 49 00 8b 74 24 0c
RSP: 0018:ffffc900008a08a0 EFLAGS: 00000046
RAX: 0000000000000003 RBX: 0000000000000000 RCX: ffffffff815c2189
RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8881050f0338
RBP: ffff8881050f0338 R08: 0000000000000001 R09: ffffed1020a1e068
R10: ffff8881050f0338 R11: ffffed1020a1e067 R12: 0000000000000000
R13: ffffed1020a1e067 R14: 0000000000000001 R15: ffff888135d32840
FS:  0000000000000000(0000) GS:ffff888135d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0901c7cdb8 CR3: 0000000101dc2000 CR4: 0000000000350ee0
Call Trace:
 <IRQ>
 pv_wait arch/x86/include/asm/paravirt.h:597 [inline]
 pv_wait_head_or_lock kernel/locking/qspinlock_paravirt.h:470 [inline]
 __pv_queued_spin_lock_slowpath+0x923/0xb80 kernel/locking/qspinlock.c:508
 pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:585 [inline]
 queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:51 [inline]
 queued_spin_lock include/asm-generic/qspinlock.h:85 [inline]
 do_raw_spin_lock+0x204/0x2d0 kernel/locking/spinlock_debug.c:115
 spin_lock include/linux/spinlock.h:363 [inline]
 drm_handle_vblank+0x126/0xc70 drivers/gpu/drm/drm_vblank.c:1951
 vkms_vblank_simulate+0xd0/0x3c0 drivers/gpu/drm/vkms/vkms_crtc.c:29
 __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
 __hrtimer_run_queues+0x1b8/0xdf0 kernel/time/hrtimer.c:1749
 hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1103
 sysvec_apic_timer_interrupt+0x40/0xc0 arch/x86/kernel/apic/apic.c:1097
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:202
Code: 0f 1f 44 00 00 55 48 8b 74 24 08 48 89 fd 48 83 c7 18 e8 1e dc
25 f8 48 89 ef e8 a6 6b 26 f8 e8 d1 95 46 f8 fb bf 01 00 00 00 <e8> 36
d2 19 f8 65 8b 05 6f 2a cc 76 85 c0 74 02 5d c3 e8 bb 7f ca
RSP: 0018:ffffc900008a0e00 EFLAGS: 00000202
RAX: 0000000000627162 RBX: ffffffff866c4e90 RCX: 1ffffffff1adac49
RDX: 0000000000000000 RSI: 0000000000000102 RDI: 0000000000000001
RBP: ffff888135d20240 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000005 R11: ffffed1026ba4048 R12: ffff888135d20240
R13: ffffc900008a0e90 R14: 0000000000000000 R15: dffffc0000000000
 expire_timers kernel/time/timer.c:1465 [inline]
 __run_timers.part.0+0x6a0/0xa90 kernel/time/timer.c:1734
 __run_timers kernel/time/timer.c:1715 [inline]
 run_timer_softirq+0xb6/0x1d0 kernel/time/timer.c:1747
 __do_softirq+0x1d7/0x93b kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu kernel/softirq.c:636 [inline]
 irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
 </IRQ>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:default_idle+0xb/0x10 arch/x86/kernel/process.c:717
Code: cb 41 88 f8 e9 6f fe ff ff e8 c1 41 88 f8 e9 3d fe ff ff e8 17
31 fd ff cc cc cc cc cc cc cc eb 07 0f 00 2d 87 88 50 00 fb f4 <c3> 0f
1f 40 00 41 54 be 08 00 00 00 53 65 48 8b 1c 25 40 f0 01 00
RSP: 0018:ffffc90000777de0 EFLAGS: 00000202
RAX: 000000000062715d RBX: 0000000000000003 RCX: ffffffff8932e632
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000003 R08: 0000000000000001 R09: ffffed1026ba6542
R10: ffff888135d32a0b R11: ffffed1026ba6541 R12: 0000000000000003
R13: 0000000000000003 R14: ffffffff8d6d8dd0 R15: 0000000000000000
 default_idle_call+0xc4/0x420 kernel/sched/idle.c:112
 cpuidle_idle_call kernel/sched/idle.c:194 [inline]
 do_idle+0x3f9/0x570 kernel/sched/idle.c:306
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:403
 start_secondary+0x227/0x2f0 arch/x86/kernel/smpboot.c:270
 secondary_startup_64_no_verify+0xb0/0xbb
watchdog: BUG: soft lockup - CPU#1 stuck for 223s! [syz-executor:10710]
Modules linked in:
irq event stamp: 52796
hardirqs last  enabled at (52795): [<ffffffff89400c02>]
asm_sysvec_apic_timer_interrupt+0x12/0x20
arch/x86/include/asm/idtentry.h:638
hardirqs last disabled at (52796): [<ffffffff8932d77b>]
sysvec_apic_timer_interrupt+0xb/0xc0 arch/x86/kernel/apic/apic.c:1097
softirqs last  enabled at (5040): [<ffffffff896005ab>]
softirq_handle_end kernel/softirq.c:401 [inline]
softirqs last  enabled at (5040): [<ffffffff896005ab>]
__do_softirq+0x5ab/0x93b kernel/softirq.c:587
softirqs last disabled at (5019): [<ffffffff81464c62>] invoke_softirq
kernel/softirq.c:432 [inline]
softirqs last disabled at (5019): [<ffffffff81464c62>] __irq_exit_rcu
kernel/softirq.c:636 [inline]
softirqs last disabled at (5019): [<ffffffff81464c62>]
irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
CPU: 1 PID: 10710 Comm: syz-executor Not tainted 5.15.0-rc6 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:csd_lock_wait kernel/smp.c:440 [inline]
RIP: 0010:smp_call_function_many_cond+0x3db/0xc20 kernel/smp.c:969
Code: 0b 00 85 ed 74 4d 48 b8 00 00 00 00 00 fc ff df 4d 89 f4 4c 89
f5 49 c1 ec 03 83 e5 07 49 01 c4 83 c5 03 e8 87 46 0b 00 f3 90 <41> 0f
b6 04 24 40 38 c5 7c 08 84 c0 0f 85 a2 06 00 00 8b 43 08 31
RSP: 0018:ffffc9000813f650 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff888063e39ce0 RCX: ffff888027a81c80
RDX: 0000000000000000 RSI: ffff888027a81c80 RDI: 0000000000000002
RBP: 0000000000000003 R08: ffffffff816b1839 R09: 0000000000000000
R10: 0000000000000005 R11: fffffbfff1f9f524 R12: ffffed100c7c739d
R13: 0000000000000000 R14: ffff888063e39ce8 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37fe3ec000 CR3: 000000000b68e000 CR4: 0000000000350ee0
Call Trace:
 on_each_cpu_cond_mask+0x56/0xa0 kernel/smp.c:1135
 on_each_cpu include/linux/smp.h:71 [inline]
 text_poke_sync arch/x86/kernel/alternative.c:929 [inline]
 text_poke_bp_batch+0x1b3/0x560 arch/x86/kernel/alternative.c:1114
 text_poke_flush arch/x86/kernel/alternative.c:1268 [inline]
 text_poke_finish+0x16/0x30 arch/x86/kernel/alternative.c:1275
 arch_jump_label_transform_apply+0x13/0x20 arch/x86/kernel/jump_label.c:146
 jump_label_update+0x32d/0x440 kernel/jump_label.c:830
 static_key_disable_cpuslocked+0x156/0x1c0 kernel/jump_label.c:207
 static_key_disable+0x16/0x20 kernel/jump_label.c:215
 tracepoint_remove_func kernel/tracepoint.c:417 [inline]
 tracepoint_probe_unregister+0x608/0x9a0 kernel/tracepoint.c:551
 trace_event_reg+0x181/0x350 kernel/trace/trace_events.c:517
 perf_trace_event_unreg.isra.0+0xac/0x2b0 kernel/trace/trace_event_perf.c:162
 perf_trace_destroy+0xb5/0xf0 kernel/trace/trace_event_perf.c:243
 _free_event+0x213/0x1270 kernel/events/core.c:4982
 put_event+0x40/0x50 kernel/events/core.c:5076
 perf_event_release_kernel+0x6dd/0x8e0 kernel/events/core.c:5191
 perf_release+0x33/0x40 kernel/events/core.c:5201
 __fput+0x288/0x9f0 fs/file_table.c:280
 task_work_run+0xe0/0x1a0 kernel/task_work.c:164
 exit_task_work include/linux/task_work.h:32 [inline]
 do_exit+0xbbe/0x2dd0 kernel/exit.c:825
 do_group_exit+0x125/0x340 kernel/exit.c:922
 get_signal+0x4d5/0x25a0 kernel/signal.c:2868
 arch_do_signal_or_restart+0x2ed/0x1c40 arch/x86/kernel/signal.c:865
 handle_signal_work kernel/entry/common.c:148 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
 exit_to_user_mode_prepare+0x192/0x2a0 kernel/entry/common.c:207
 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x2000018a
Code: Unable to access opcode bytes at RIP 0x20000160.
RSP: 002b:00007f0901c5bbb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000022
RAX: fffffffffffffdfe RBX: 0000000000000009 RCX: 000000002000018a
RDX: 0000000000000000 RSI: 000000008000005d RDI: 0000000000000003
RBP: 0000000000000052 R08: 0000000000000005 R09: 0000000000000006
R10: 0000000000000007 R11: 0000000000000202 R12: 000000000000000b
R13: 000000000000000c R14: 000000000000000d R15: 00007f0901c5bdc0
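
The trace above reads as follows: the exiting syz-executor task tears
down a perf event, which unregisters the tracepoint probe, disables the
tracepoint's static key, and patches the jump-label sites through
text_poke_bp_batch(). text_poke_sync() then IPIs every online CPU and
does not return until each one has serialized, and that wait is the
csd_lock_wait() spin where the watchdog fires. A rough sketch of the
two pieces, paraphrased (simplified, not verbatim) from
arch/x86/kernel/alternative.c and kernel/smp.c around v5.15:

    static void do_sync_core(void *info)
    {
            /* serialize this CPU's instruction stream after the patch */
            sync_core();
    }

    void text_poke_sync(void)
    {
            /* wait == 1: blocks until every CPU has run do_sync_core() */
            on_each_cpu(do_sync_core, NULL, 1);
    }

    /* kernel/smp.c: the spin at smp_call_function_many_cond+0x3db above */
    static __always_inline void csd_lock_wait(struct __call_single_data *csd)
    {
            /* poll until the target CPU runs the callback and clears the flag */
            smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
    }

If any CPU sits with interrupts disabled and never services that IPI,
this loop spins forever, which is exactly the 223s soft lockup reported
on CPU 1. The NMI backtraces below show which CPU that is, and why.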
Sending NMI from CPU 1 to CPUs 0,2-3:
NMI backtrace for cpu 0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:asm_exc_int3+0x0/0x40 arch/x86/include/asm/idtentry.h:566
Code: 0f 1f 84 00 00 00 00 00 66 90 0f 01 ca 6a ff e8 56 05 00 00 48
89 e7 e8 1e a7 f2 ff e9 09 06 00 00 66 0f 1f 84 00 00 00 00 00 <0f> 01
ca 6a ff f6 44 24 10 03 75 18 ff 74 24 28 ff 74 24 28 ff 74
RSP: 0018:ffffc90000007a68 EFLAGS: 00000046
RAX: 0000000000000000 RBX: 1ffff92000000f55 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888135d223d8
RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed1026ba4478 R12: 0000000000000000
R13: ffff888135d223d8 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff888063e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005555557bec48 CR3: 000000000b68e000 CR4: 0000000000350ef0
Call Trace:
 <IRQ>
RIP: 0010:arch_static_branch arch/x86/include/asm/jump_label.h:27 [inline]
RIP: 0010:static_key_false include/linux/jump_label.h:212 [inline]
RIP: 0010:trace_lock_acquire include/trace/events/lock.h:13 [inline]
RIP: 0010:lock_acquire+0x84/0x520 kernel/locking/lockdep.c:5596
Code: f1 c7 40 04 f1 f1 00 00 c7 40 08 00 00 00 f3 c7 40 0c f3 f3 f3
f3 65 48 8b 04 25 28 00 00 00 48 89 84 24 a8 00 00 00 31 c0 cc <a2> 01
00 00 65 8b 15 d9 ec a5 7e 83 fa 07 0f 87 f3 03 00 00 89 d2
RSP: 0018:ffffc90000007a98 EFLAGS: 00000046
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
 lock_hrtimer_base+0x76/0x140 kernel/time/hrtimer.c:173
 hrtimer_try_to_cancel kernel/time/hrtimer.c:1331 [inline]
 hrtimer_try_to_cancel+0xd6/0x570 kernel/time/hrtimer.c:1316
 hrtimer_cancel+0x13/0x40 kernel/time/hrtimer.c:1443
 __disable_vblank drivers/gpu/drm/drm_vblank.c:434 [inline]
 drm_vblank_disable_and_save+0x286/0x3e0 drivers/gpu/drm/drm_vblank.c:478
 vblank_disable_fn+0x161/0x1b0 drivers/gpu/drm/drm_vblank.c:495
 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
 expire_timers kernel/time/timer.c:1466 [inline]
 __run_timers.part.0+0x6b0/0xa90 kernel/time/timer.c:1734
 __run_timers kernel/time/timer.c:1715 [inline]
 run_timer_softirq+0xb6/0x1d0 kernel/time/timer.c:1747
 __do_softirq+0x1d7/0x93b kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu kernel/softirq.c:636 [inline]
 irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
 </IRQ>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:default_idle+0xb/0x10 arch/x86/kernel/process.c:717
Code: cb 41 88 f8 e9 6f fe ff ff e8 c1 41 88 f8 e9 3d fe ff ff e8 17
31 fd ff cc cc cc cc cc cc cc eb 07 0f 00 2d 87 88 50 00 fb f4 <c3> 0f
1f 40 00 41 54 be 08 00 00 00 53 65 48 8b 1c 25 40 f0 01 00
RSP: 0018:ffffffff8b607e28 EFLAGS: 00000202
RAX: 0000000000521dc7 RBX: 0000000000000000 RCX: ffffffff8932e632
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed100c7c6542
R10: ffff888063e32a0b R11: ffffed100c7c6541 R12: 0000000000000000
R13: 0000000000000000 R14: ffffffff8d6d8dd0 R15: 0000000000000000
 default_idle_call+0xc4/0x420 kernel/sched/idle.c:112
 cpuidle_idle_call kernel/sched/idle.c:194 [inline]
 do_idle+0x3f9/0x570 kernel/sched/idle.c:306
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:403
 start_kernel+0x47a/0x49b init/main.c:1142
 secondary_startup_64_no_verify+0xb0/0xbb
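
CPU 0, for its part, is in the timer softirq disabling a DRM vblank:
vblank_disable_fn() takes the device's vblank spinlock and then calls
hrtimer_cancel() on the vkms vblank hrtimer. hrtimer_cancel() cannot
return while the timer's callback is still running on another CPU, so
it retries in a loop, roughly (paraphrased from kernel/time/hrtimer.c
around v5.15, simplified):

    int hrtimer_cancel(struct hrtimer *timer)
    {
            int ret;

            do {
                    /* -1 means the callback is running on another CPU */
                    ret = hrtimer_try_to_cancel(timer);

                    if (ret < 0)
                            hrtimer_cancel_wait_running(timer);
            } while (ret < 0);

            return ret;
    }

So CPU 0 loops here, still holding the vblank lock, until the vkms
handler finishes on CPU 3.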
NMI backtrace for cpu 3
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.15.0-rc6 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kvm_wait+0xb3/0x110 arch/x86/kernel/kvm.c:893
Code: 40 38 c6 74 1b 48 83 c4 10 c3 c3 e8 37 70 49 00 eb 07 0f 00 2d
de 1c 56 08 fb f4 48 83 c4 10 c3 eb 07 0f 00 2d ce 1c 56 08 f4 <48> 83
c4 10 c3 89 74 24 0c 48 89 3c 24 e8 5b 6e 49 00 8b 74 24 0c
RSP: 0018:ffffc900008a08a0 EFLAGS: 00000046
RAX: 0000000000000003 RBX: 0000000000000000 RCX: 1ffff11020a1e067
RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8881050f0338
RBP: ffff8881050f0338 R08: 0000000000000001 R09: ffffed1020a1e068
R10: ffff8881050f0338 R11: ffffed1020a1e067 R12: 0000000000000000
R13: ffffed1020a1e067 R14: 0000000000000001 R15: ffff888135d32840
FS:  0000000000000000(0000) GS:ffff888135d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0901c7cdb8 CR3: 0000000101dc2000 CR4: 0000000000350ee0
Call Trace:
 <IRQ>
 pv_wait arch/x86/include/asm/paravirt.h:597 [inline]
 pv_wait_head_or_lock kernel/locking/qspinlock_paravirt.h:470 [inline]
 __pv_queued_spin_lock_slowpath+0x923/0xb80 kernel/locking/qspinlock.c:508
 pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:585 [inline]
 queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:51 [inline]
 queued_spin_lock include/asm-generic/qspinlock.h:85 [inline]
 do_raw_spin_lock+0x204/0x2d0 kernel/locking/spinlock_debug.c:115
 spin_lock include/linux/spinlock.h:363 [inline]
 drm_handle_vblank+0x126/0xc70 drivers/gpu/drm/drm_vblank.c:1951
 vkms_vblank_simulate+0xd0/0x3c0 drivers/gpu/drm/vkms/vkms_crtc.c:29
 __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
 __hrtimer_run_queues+0x1b8/0xdf0 kernel/time/hrtimer.c:1749
 hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1103
 sysvec_apic_timer_interrupt+0x40/0xc0 arch/x86/kernel/apic/apic.c:1097
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:202
Code: 0f 1f 44 00 00 55 48 8b 74 24 08 48 89 fd 48 83 c7 18 e8 1e dc
25 f8 48 89 ef e8 a6 6b 26 f8 e8 d1 95 46 f8 fb bf 01 00 00 00 <e8> 36
d2 19 f8 65 8b 05 6f 2a cc 76 85 c0 74 02 5d c3 e8 bb 7f ca
RSP: 0018:ffffc900008a0e00 EFLAGS: 00000202
RAX: 0000000000627162 RBX: ffffffff866c4e90 RCX: 1ffffffff1adac49
RDX: 0000000000000000 RSI: 0000000000000102 RDI: 0000000000000001
RBP: ffff888135d20240 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000005 R11: ffffed1026ba4048 R12: ffff888135d20240
R13: ffffc900008a0e90 R14: 0000000000000000 R15: dffffc0000000000
 expire_timers kernel/time/timer.c:1465 [inline]
 __run_timers.part.0+0x6a0/0xa90 kernel/time/timer.c:1734
 __run_timers kernel/time/timer.c:1715 [inline]
 run_timer_softirq+0xb6/0x1d0 kernel/time/timer.c:1747
 __do_softirq+0x1d7/0x93b kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu kernel/softirq.c:636 [inline]
 irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
 </IRQ>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:default_idle+0xb/0x10 arch/x86/kernel/process.c:717
Code: cb 41 88 f8 e9 6f fe ff ff e8 c1 41 88 f8 e9 3d fe ff ff e8 17
31 fd ff cc cc cc cc cc cc cc eb 07 0f 00 2d 87 88 50 00 fb f4 <c3> 0f
1f 40 00 41 54 be 08 00 00 00 53 65 48 8b 1c 25 40 f0 01 00
RSP: 0018:ffffc90000777de0 EFLAGS: 00000202
RAX: 000000000062715d RBX: 0000000000000003 RCX: ffffffff8932e632
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000003 R08: 0000000000000001 R09: ffffed1026ba6542
R10: ffff888135d32a0b R11: ffffed1026ba6541 R12: 0000000000000003
R13: 0000000000000003 R14: ffffffff8d6d8dd0 R15: 0000000000000000
 default_idle_call+0xc4/0x420 kernel/sched/idle.c:112
 cpuidle_idle_call kernel/sched/idle.c:194 [inline]
 do_idle+0x3f9/0x570 kernel/sched/idle.c:306
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:403
 start_secondary+0x227/0x2f0 arch/x86/kernel/smpboot.c:270
 secondary_startup_64_no_verify+0xb0/0xbb
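
Taken together, the backtraces form a cycle: CPU 0 holds the vblank
spinlock and waits for the vkms hrtimer handler to finish; CPU 3 runs
that handler with interrupts off and spins on the same vblank lock; and
CPU 1 waits for CPU 3 to acknowledge the text_poke_sync IPI, which it
never can. A minimal userspace analogy of the CPU 0 / CPU 3 half of
the cycle (pthreads, purely illustrative, hypothetical names, not
kernel code):

    #include <pthread.h>

    static pthread_mutex_t vbl_lock = PTHREAD_MUTEX_INITIALIZER;

    /* stands in for vkms_vblank_simulate() -> drm_handle_vblank() */
    static void *vblank_handler(void *arg)
    {
            pthread_mutex_lock(&vbl_lock);   /* CPU 3: lock held below */
            pthread_mutex_unlock(&vbl_lock);
            return NULL;
    }

    int main(void)
    {
            pthread_t handler;

            pthread_mutex_lock(&vbl_lock);   /* CPU 0: take vblank lock */
            pthread_create(&handler, NULL, vblank_handler, NULL);
            /*
             * "hrtimer_cancel()": wait for the handler while still
             * holding the lock it needs -- deterministic deadlock.
             */
            pthread_join(handler, NULL);
            pthread_mutex_unlock(&vbl_lock); /* never reached */
            return 0;
    }

Build with gcc -pthread; the join never returns, just as
hrtimer_cancel() never returns in the traces above.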
NMI backtrace for cpu 2 skipped: idling at native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
NMI backtrace for cpu 2 skipped: idling at arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
NMI backtrace for cpu 2 skipped: idling at default_idle+0xb/0x10 arch/x86/kernel/process.c:716
----------------
Code disassembly (best guess):
   0: 00 00                add    %al,(%rax)
   2: 00 48 8d              add    %cl,-0x73(%rax)
   5: 04 52                add    $0x52,%al
   7: 48 c1 e0 06          shl    $0x6,%rax
   b: 48 05 40 ad cf 8f    add    $0xffffffff8fcfad40,%rax
  11: 48 8d 78 50          lea    0x50(%rax),%rdi
  15: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
  1c: fc ff df
  1f: 48 89 f9              mov    %rdi,%rcx
  22: 48 c1 e9 03          shr    $0x3,%rcx
  26: 80 3c 11 00          cmpb   $0x0,(%rcx,%rdx,1)
* 2a: 0f 85 aa 1d 00 00    jne    0x1dda <-- trapping instruction
  30: 4c 85 68 50          test   %r13,0x50(%rax)
  34: 0f 84 b8 06 00 00    je     0x6f2
  3a: 41 ba 01 00 00 00    mov    $0x1,%r10d

Thread overview: 36+ messages
2018-10-13  7:31 INFO: rcu detected stall in do_idle syzbot
2018-10-16 13:24 ` Thomas Gleixner
2018-10-16 14:03   ` Peter Zijlstra
2018-10-16 14:41     ` Juri Lelli
2018-10-16 14:45       ` Thomas Gleixner
2018-10-16 15:36         ` Juri Lelli
2018-10-18  8:28           ` Juri Lelli
2018-10-18  9:48             ` Peter Zijlstra
2018-10-18 10:10               ` Juri Lelli
2018-10-18 10:38                 ` luca abeni
2018-10-18 10:33               ` luca abeni
2018-10-19 13:14                 ` Peter Zijlstra
2018-10-18 10:23             ` luca abeni
2018-10-18 10:47               ` Juri Lelli
2018-10-18 11:08                 ` luca abeni
2018-10-18 12:21                   ` Juri Lelli
2018-10-18 12:36                     ` luca abeni
2018-10-19 11:39                   ` Peter Zijlstra
2018-10-19 20:50                     ` luca abeni
2018-10-24 12:03                       ` Juri Lelli
2018-10-27 11:16                         ` Dmitry Vyukov
2018-10-28  8:33                           ` Juri Lelli
2018-10-30 10:45                         ` Peter Zijlstra
2018-10-30 11:08                           ` luca abeni
2018-10-31 16:18                             ` Daniel Bristot de Oliveira
2018-10-31 16:40                               ` Juri Lelli
2018-10-31 17:39                                 ` Peter Zijlstra
2018-10-31 17:58                                 ` Daniel Bristot de Oliveira
2018-11-01  5:55                                   ` Juri Lelli
2018-11-02 10:00                                     ` Daniel Bristot de Oliveira
2018-11-05 10:55                                       ` Juri Lelli
2018-11-07 10:12                                         ` Daniel Bristot de Oliveira
2018-10-31 17:38                               ` Peter Zijlstra
2018-10-30 11:12                           ` Juri Lelli
2018-11-06 11:44                             ` Juri Lelli
2021-10-27  0:59 Hao Sun
