linux-kernel.vger.kernel.org archive mirror
* pick_next_task() picking the wrong task [v4.9.163]
@ 2019-03-22 21:57 Radu Rendec
  2019-03-23 10:15 ` Peter Zijlstra
  0 siblings, 1 reply; 4+ messages in thread
From: Radu Rendec @ 2019-03-22 21:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Peter Zijlstra

Hi Everyone,

I believe I'm seeing a weird behavior of pick_next_task() where it
chooses a lower priority task over a higher priority one. The scheduling
class of the two tasks is also different ('fair' vs. 'rt'). The culprit
seems to be the optimization at the beginning of the function, where
fair_sched_class.pick_next_task() is called directly.  I'm running
v4.9.163, but that piece of code is very similar in recent kernels.

My use case is quite simple: I have a real-time thread that is woken up
by a GPIO hardware interrupt. The thread sleeps most of the time in
poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
interrupt and the thread being woken up/scheduled is very important for
the application. Note that I backported my own commit 03c0a9208bb1, so
the thread is always woken up synchronously from HW interrupt context.

Most of the time things work as expected, but sometimes the scheduler
picks kworker and even the idle task before my real-time thread. I used
the trace infrastructure to figure out what happens and I'm including a
snippet below (I apologize for the wide lines).

     <idle>-0     [000] d.h2   161.202970: gpio_sysfs_irq
<-__handle_irq_event_percpu
     <idle>-0     [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
     <idle>-0     [000] d.h4   161.202998: sched_waking:
comm=irqWorker pid=1141 prio=9 target_cpu=000
     <idle>-0     [000] d.h5   161.203025: sched_wakeup:
comm=irqWorker pid=1141 prio=9 target_cpu=000
     <idle>-0     [000] d.h3   161.203047: workqueue_queue_work: work
struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60
req_cpu=1 cpu=0
     <idle>-0     [000] d.h3   161.203049: workqueue_activate_work:
work struct 806506b8
     <idle>-0     [000] d.h4   161.203061: sched_waking:
comm=kworker/0:1 pid=134 prio=120 target_cpu=000
     <idle>-0     [000] d.h5   161.203083: sched_wakeup:
comm=kworker/0:1 pid=134 prio=120 target_cpu=000
     <idle>-0     [000] d..2   161.203201: sched_switch:
prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==>
next_comm=kworker/0:1 next_pid=134 next_prio=120
kworker/0:1-134   [000] ....   161.203222: workqueue_execute_start:
work struct 806506b8: function kernfs_notify_workfn
kworker/0:1-134   [000] ...1   161.203286: schedule <-worker_thread
kworker/0:1-134   [000] d..2   161.203329: sched_switch:
prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==>
next_comm=swapper next_pid=0 next_prio=120
     <idle>-0     [000] .n.1   161.230287: schedule <-schedule_preempt_disabled
     <idle>-0     [000] d..2   161.230310: sched_switch:
prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==>
next_comm=irqWorker next_pid=1141 next_prio=9
  irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule

The system is Freescale MPC8378 (PowerPC, single processor).

I instrumented pick_next_task() with trace_printk() and I am sure that
every time the wrong task is picked, flow goes through the optimization
path and idle_sched_class.pick_next_task() is called directly. When the
right task is eventually picked, flow goes through the bottom block that
iterates over all scheduling classes. This probably makes sense: when
the scheduler runs in the context of the idle task, prev->sched_class is
no longer fair_sched_class, so the bottom block with the full iteration
is used. Note that in v4.9.163 the optimization path is taken only when
prev->sched_class is fair_sched_class, whereas in recent kernels it is
taken for both fair_sched_class and idle_sched_class.

Any help or feedback would be much appreciated. In the meantime, I will
experiment with commenting out the optimization (at the expense of a
slower scheduler, of course).

Best regards,
Radu Rendec


* Re: pick_next_task() picking the wrong task [v4.9.163]
  2019-03-22 21:57 pick_next_task() picking the wrong task [v4.9.163] Radu Rendec
@ 2019-03-23 10:15 ` Peter Zijlstra
  2019-03-25  2:17   ` Radu Rendec
  2019-03-28  0:56   ` Radu Rendec
  0 siblings, 2 replies; 4+ messages in thread
From: Peter Zijlstra @ 2019-03-23 10:15 UTC (permalink / raw)
  To: Radu Rendec; +Cc: linux-kernel, Ingo Molnar

On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> Hi Everyone,
> 
> I believe I'm seeing a weird behavior of pick_next_task() where it
> chooses a lower priority task over a higher priority one. The scheduling
> class of the two tasks is also different ('fair' vs. 'rt'). The culprit
> seems to be the optimization at the beginning of the function, where
> fair_sched_class.pick_next_task() is called directly.  I'm running
> v4.9.163, but that piece of code is very similar in recent kernels.
> 
> My use case is quite simple: I have a real-time thread that is woken up
> by a GPIO hardware interrupt. The thread sleeps most of the time in
> poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
> interrupt and the thread being woken up/scheduled is very important for
> the application. Note that I backported my own commit 03c0a9208bb1, so
> the thread is always woken up synchronously from HW interrupt context.
> 
> Most of the time things work as expected, but sometimes the scheduler
> picks kworker and even the idle task before my real-time thread. I used
> the trace infrastructure to figure out what happens and I'm including a
> snippet below (I apologize for the wide lines).

If only they were wide :/ I had to unwrap them myself..

>      <idle>-0     [000] d.h2   161.202970: gpio_sysfs_irq  <-__handle_irq_event_percpu
>      <idle>-0     [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
>      <idle>-0     [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
>      <idle>-0     [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000

weird how the next line doesn't have 'n/N' set:

>      <idle>-0     [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
>      <idle>-0     [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
>      <idle>-0     [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
>      <idle>-0     [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000

There's that kworker wakeup.

>      <idle>-0     [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120

And I agree that that is weird.

> kworker/0:1-134   [000] ....   161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
> kworker/0:1-134   [000] ...1   161.203286: schedule <-worker_thread
> kworker/0:1-134   [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
>      <idle>-0     [000] .n.1   161.230287: schedule <-schedule_preempt_disabled

Only here do I see 'n'.

>      <idle>-0     [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
>   irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule
> 
> The system is Freescale MPC8378 (PowerPC, single processor).
> 
> I instrumented pick_next_task() with trace_printk() and I am sure that
> every time the wrong task is picked, flow goes through the optimization

That's weird, because when you wake a RT task, the:

  rq->nr_running == rq->cfs.h_nr_running

condition should not be true. Maybe try adding trace_printk() to all
rq->nr_running manipulation to see what goes wobbly?

> path and idle_sched_class.pick_next_task() is called directly. When the
> right task is eventually picked, flow goes through the bottom block that
> iterates over all scheduling classes. This probably makes sense: when
> the scheduler runs in the context of the idle task, prev->sched_class is
> no longer fair_sched_class, so the bottom block with the full iteration
> is used. Note that in v4.9.163 the optimization path is taken only when
> prev->sched_class is fair_sched_class, whereas in recent kernels it is
> taken for both fair_sched_class and idle_sched_class.
> 
> Any help or feedback would be much appreciated. In the meantime, I will
> experiment with commenting out the optimization (at the expense of a
> slower scheduler, of course).

It would be very good if you could confirm on the very latest kernel,
instead of on 4.9.


* Re: pick_next_task() picking the wrong task [v4.9.163]
  2019-03-23 10:15 ` Peter Zijlstra
@ 2019-03-25  2:17   ` Radu Rendec
  2019-03-28  0:56   ` Radu Rendec
  1 sibling, 0 replies; 4+ messages in thread
From: Radu Rendec @ 2019-03-25  2:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar

On Sat, 2019-03-23 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> > Hi Everyone,
> > 
> > I believe I'm seeing a weird behavior of pick_next_task() where it
> > chooses a lower priority task over a higher priority one. The scheduling
> > class of the two tasks is also different ('fair' vs. 'rt'). The culprit
> > seems to be the optimization at the beginning of the function, where
> > fair_sched_class.pick_next_task() is called directly.  I'm running
> > v4.9.163, but that piece of code is very similar in recent kernels.
> > 
> > My use case is quite simple: I have a real-time thread that is woken up
> > by a GPIO hardware interrupt. The thread sleeps most of the time in
> > poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
> > interrupt and the thread being woken up/scheduled is very important for
> > the application. Note that I backported my own commit 03c0a9208bb1, so
> > the thread is always woken up synchronously from HW interrupt context.
> > 
> > Most of the time things work as expected, but sometimes the scheduler
> > picks kworker and even the idle task before my real-time thread. I used
> > the trace infrastructure to figure out what happens and I'm including a
> > snippet below (I apologize for the wide lines).
> 
> If only they were wide :/ I had to unwrap them myself..

Sorry about that! Wonders of using the gmail web interface. I'll pay
more attention in the future.

> >      <idle>-0     [000] d.h2   161.202970: gpio_sysfs_irq  <-__handle_irq_event_percpu
> >      <idle>-0     [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
> >      <idle>-0     [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
> >      <idle>-0     [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000
> 
> weird how the next line doesn't have 'n/N' set:
> 
> >      <idle>-0     [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
> >      <idle>-0     [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
> >      <idle>-0     [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
> >      <idle>-0     [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
> 
> There's that kworker wakeup.
> 
> >      <idle>-0     [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120
> 
> And I agree that that is weird.
> 
> > kworker/0:1-134   [000] ....   161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
> > kworker/0:1-134   [000] ...1   161.203286: schedule <-worker_thread
> > kworker/0:1-134   [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
> >      <idle>-0     [000] .n.1   161.230287: schedule <-schedule_preempt_disabled
> 
> Only here do I see 'n'.

Looking at other captures I can see 'n' starting at sched_wakeup for
irqWorker. Perhaps there was something wrong with this one or I copied
the wrong line.

> >      <idle>-0     [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
> >   irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule
> > 
> > The system is Freescale MPC8378 (PowerPC, single processor).
> > 
> > I instrumented pick_next_task() with trace_printk() and I am sure that
> > every time the wrong task is picked, flow goes through the optimization
> 
> That's weird, because when you wake a RT task, the:
> 
>   rq->nr_running == rq->cfs.h_nr_running
> 
> condition should not be true. Maybe try adding trace_printk() to all
> rq->nr_running manipulation to see what goes wobbly?

Sure, I will try that and come back with the results.

FWIW, I tried to comment out the entire optimization and force the
"slow" path that goes through every scheduling class. Surprisingly,
rt_sched_class.pick_next_task() returns NULL.

> > path and idle_sched_class.pick_next_task() is called directly. When the
> > right task is eventually picked, flow goes through the bottom block that
> > iterates over all scheduling classes. This probably makes sense: when
> > the scheduler runs in the context of the idle task, prev->sched_class is
> > no longer fair_sched_class, so the bottom block with the full iteration
> > is used. Note that in v4.9.163 the optimization path is taken only when
> > prev->sched_class is fair_sched_class, whereas in recent kernels it is
> > taken for both fair_sched_class and idle_sched_class.
> > 
> > Any help or feedback would be much appreciated. In the meantime, I will
> > experiment with commenting out the optimization (at the expense of a
> > slower scheduler, of course).
> 
> It would be very good if you could confirm on the very latest kernel,
> instead of on 4.9.

Of course. I spent quite some time on this and fortunately I've got
something. I can now reproduce the problem on 5.0.3 with a 250-line
test program on a Raspberry Pi 1B. IMO this is far better than what I
had on Friday (4.9.x with a huge proprietary application, running on
custom hardware). Not quite "the very latest kernel", but pretty close.
That's the latest thing that I could run on the Pi, as it requires a
whole bunch of custom patches that they haven't upstreamed yet (ugh!).

I can upload the full trace log and the test program and post the links
if you find any value in that. In the meantime, I'm including the log
around the HW IRQ below (this time with unfolded lines, I promise).
Thread names are different in the test program. The "interesting" thread
is now called "gpio-irq-lat".

Thank you for looking into this. I really appreciate it!

Best regards,
Radu Rendec


<idle>-0     [000] d.h1   290.675001: gpio_sysfs_irq <-__handle_irq_event_percpu
<idle>-0     [000] d.h1   290.675003: preempt_count_add <-gpio_sysfs_irq
<idle>-0     [000] d.h2   290.675005: ktime_get <-gpio_sysfs_irq
<idle>-0     [000] d.h2   290.675007: clocksource_mmio_readl_up <-ktime_get
<idle>-0     [000] d.h2   290.675022: gpio_sysfs_irq: new 290569083832 0
<idle>-0     [000] d.h2   290.675028: preempt_count_sub <-gpio_sysfs_irq
<idle>-0     [000] d.h1   290.675031: kernfs_notify <-gpio_sysfs_irq
<idle>-0     [000] d.h1   290.675033: preempt_count_add <-kernfs_notify
<idle>-0     [000] d.h2   290.675036: __wake_up <-kernfs_notify
<idle>-0     [000] d.h2   290.675038: __wake_up_common_lock <-__wake_up
<idle>-0     [000] d.h2   290.675040: preempt_count_add <-__wake_up_common_lock
<idle>-0     [000] d.h3   290.675042: __wake_up_common <-__wake_up_common_lock
<idle>-0     [000] d.h3   290.675045: pollwake <-__wake_up_common
<idle>-0     [000] d.h3   290.675048: default_wake_function <-pollwake
<idle>-0     [000] d.h3   290.675050: try_to_wake_up <-default_wake_function
<idle>-0     [000] d.h3   290.675052: preempt_count_add <-try_to_wake_up
<idle>-0     [000] d.h4   290.675056: sched_waking: comm=gpio-irq-lat pid=1164 prio=9 target_cpu=000
<idle>-0     [000] d.h4   290.675060: preempt_count_add <-try_to_wake_up
<idle>-0     [000] d.h5   290.675063: update_rq_clock <-try_to_wake_up
<idle>-0     [000] d.h5   290.675066: activate_task <-try_to_wake_up
<idle>-0     [000] d.h5   290.675069: enqueue_task_rt <-activate_task
<idle>-0     [000] d.h5   290.675072: dequeue_rt_stack <-enqueue_task_rt
<idle>-0     [000] d.h5   290.675074: dequeue_top_rt_rq.constprop.10 <-dequeue_rt_stack
<idle>-0     [000] d.h5   290.675079: preempt_count_add <-enqueue_task_rt
<idle>-0     [000] d.h6   290.675082: preempt_count_sub <-enqueue_task_rt
<idle>-0     [000] d.h5   290.675084: enqueue_top_rt_rq.constprop.12 <-enqueue_task_rt
<idle>-0     [000] d.h5   290.675087: ttwu_do_wakeup.constprop.30 <-try_to_wake_up
<idle>-0     [000] d.h5   290.675089: check_preempt_curr <-ttwu_do_wakeup.constprop.30
<idle>-0     [000] d.h5   290.675092: resched_curr <-check_preempt_curr
<idle>-0     [000] dnh5   290.675099: sched_wakeup: comm=gpio-irq-lat pid=1164 prio=9 target_cpu=000
<idle>-0     [000] dnh5   290.675101: preempt_count_sub <-try_to_wake_up
<idle>-0     [000] dnh4   290.675104: ttwu_stat.constprop.32 <-try_to_wake_up
<idle>-0     [000] dnh4   290.675106: preempt_count_sub <-try_to_wake_up
<idle>-0     [000] dnh3   290.675110: preempt_count_sub <-__wake_up_common_lock
<idle>-0     [000] dnh2   290.675114: preempt_count_sub <-kernfs_notify
<idle>-0     [000] dnh1   290.675117: preempt_count_add <-kernfs_notify
<idle>-0     [000] dnh2   290.675119: kernfs_get <-kernfs_notify
<idle>-0     [000] dnh2   290.675122: queue_work_on <-kernfs_notify
<idle>-0     [000] dnh2   290.675125: __queue_work <-queue_work_on
<idle>-0     [000] dnh2   290.675128: get_work_pool <-__queue_work
<idle>-0     [000] dnh2   290.675132: preempt_count_add <-__queue_work
<idle>-0     [000] dnh3   290.675136: workqueue_queue_work: work struct=4b4b2bdf function=kernfs_notify_workfn workqueue=8231353a req_cpu=1 cpu=0
<idle>-0     [000] dnh3   290.675141: workqueue_activate_work: work struct 4b4b2bdf
<idle>-0     [000] dnh3   290.675146: insert_work <-__queue_work
<idle>-0     [000] dnh3   290.675148: get_pwq <-insert_work
<idle>-0     [000] dnh3   290.675151: wake_up_process <-insert_work
<idle>-0     [000] dnh3   290.675153: try_to_wake_up <-wake_up_process
<idle>-0     [000] dnh3   290.675155: preempt_count_add <-try_to_wake_up
<idle>-0     [000] dnh4   290.675157: sched_waking: comm=kworker/0:2 pid=57 prio=120 target_cpu=000
<idle>-0     [000] dnh4   290.675160: preempt_count_add <-try_to_wake_up
<idle>-0     [000] dnh5   290.675162: update_rq_clock <-try_to_wake_up
<idle>-0     [000] dnh5   290.675164: activate_task <-try_to_wake_up
<idle>-0     [000] dnh5   290.675167: enqueue_task_fair <-activate_task
<idle>-0     [000] dnh5   290.675170: update_curr <-enqueue_task_fair
<idle>-0     [000] dnh5   290.675174: update_cfs_group <-enqueue_task_fair
<idle>-0     [000] dnh5   290.675178: check_spread <-enqueue_task_fair
<idle>-0     [000] dnh5   290.675180: __enqueue_entity <-enqueue_task_fair
<idle>-0     [000] dnh5   290.675184: hrtick_update <-enqueue_task_fair
<idle>-0     [000] dnh5   290.675186: wq_worker_waking_up <-try_to_wake_up
<idle>-0     [000] dnh5   290.675188: kthread_data <-wq_worker_waking_up
<idle>-0     [000] dnh5   290.675191: ttwu_do_wakeup.constprop.30 <-try_to_wake_up
<idle>-0     [000] dnh5   290.675193: check_preempt_curr <-ttwu_do_wakeup.constprop.30
<idle>-0     [000] dnh5   290.675195: resched_curr <-check_preempt_curr
<idle>-0     [000] dnh5   290.675198: sched_wakeup: comm=kworker/0:2 pid=57 prio=120 target_cpu=000
<idle>-0     [000] dnh5   290.675200: preempt_count_sub <-try_to_wake_up
<idle>-0     [000] dnh4   290.675202: ttwu_stat.constprop.32 <-try_to_wake_up
<idle>-0     [000] dnh4   290.675204: preempt_count_sub <-try_to_wake_up
<idle>-0     [000] dnh3   290.675206: preempt_count_sub <-__queue_work
<idle>-0     [000] dnh2   290.675209: preempt_count_sub <-kernfs_notify
<idle>-0     [000] dnh1   290.675212: irq_handler_exit: irq=160 ret=handled
<idle>-0     [000] dnh1   290.675215: add_interrupt_randomness <-handle_irq_event_percpu
<idle>-0     [000] dnh1   290.675218: read_current_timer <-add_interrupt_randomness
<idle>-0     [000] dnh1   290.675221: note_interrupt <-handle_irq_event_percpu
<idle>-0     [000] dnh1   290.675223: preempt_count_add <-handle_irq_event
<idle>-0     [000] dnh2   290.675226: preempt_count_sub <-handle_edge_irq
<idle>-0     [000] dnh1   290.675229: armctrl_unmask_irq <-bcm2835_gpio_irq_handler
<idle>-0     [000] dnh1   290.675231: irq_exit <-__handle_domain_irq
<idle>-0     [000] dnh1   290.675233: preempt_count_sub <-irq_exit
<idle>-0     [000] dn.1   290.675237: idle_cpu <-irq_exit
<idle>-0     [000] dn.1   290.675240: rcu_irq_exit <-irq_exit
<idle>-0     [000] dn.1   290.675242: get_next_armctrl_hwirq <-bcm2835_handle_irq
<idle>-0     [000] dn.2   290.675246: __srcu_read_lock <-trace_hardirqs_on
<idle>-0     [000] dn.2   290.675248: rcu_irq_enter_irqson <-trace_hardirqs_on
<idle>-0     [000] dn.2   290.675250: rcu_irq_enter <-rcu_irq_enter_irqson
<idle>-0     [000] dn.1   290.675253: irq_enable: caller=__irq_svc+0x78/0x94 parent=do_idle+0xc4/0x138
<idle>-0     [000] dn.2   290.675255: rcu_irq_exit_irqson <-trace_hardirqs_on
<idle>-0     [000] dn.2   290.675257: rcu_irq_exit <-rcu_irq_exit_irqson
<idle>-0     [000] dn.2   290.675259: __srcu_read_unlock <-trace_hardirqs_on
<idle>-0     [000] .n.1   290.675262: arch_cpu_idle_exit <-do_idle
<idle>-0     [000] .n.1   290.675264: ledtrig_cpu <-arch_cpu_idle_exit
<idle>-0     [000] .n.1   290.675266: led_trigger_event <-ledtrig_cpu
<idle>-0     [000] .n.1   290.675268: preempt_count_add <-led_trigger_event
<idle>-0     [000] .n.2   290.675270: preempt_count_sub <-led_trigger_event
<idle>-0     [000] .n.1   290.675273: led_trigger_event <-ledtrig_cpu
<idle>-0     [000] .n.1   290.675275: preempt_count_add <-led_trigger_event
<idle>-0     [000] .n.2   290.675277: preempt_count_sub <-led_trigger_event
<idle>-0     [000] .n.1   290.675279: tick_nohz_idle_exit <-do_idle
<idle>-0     [000] dn.2   290.675282: __srcu_read_lock <-trace_hardirqs_off
<idle>-0     [000] dn.2   290.675284: rcu_irq_enter_irqson <-trace_hardirqs_off
<idle>-0     [000] dn.2   290.675286: rcu_irq_enter <-rcu_irq_enter_irqson
<idle>-0     [000] dn.1   290.675288: irq_disable: caller=tick_nohz_idle_exit+0x24/0x130 parent=do_idle+0xcc/0x138
<idle>-0     [000] dn.2   290.675290: rcu_irq_exit_irqson <-trace_hardirqs_off
<idle>-0     [000] dn.2   290.675292: rcu_irq_exit <-rcu_irq_exit_irqson
<idle>-0     [000] dn.2   290.675294: __srcu_read_unlock <-trace_hardirqs_off
<idle>-0     [000] dn.1   290.675296: ktime_get <-tick_nohz_idle_exit
<idle>-0     [000] dn.1   290.675298: clocksource_mmio_readl_up <-ktime_get
<idle>-0     [000] dn.1   290.675301: __tick_nohz_idle_restart_tick.constprop.4 <-tick_nohz_idle_exit
<idle>-0     [000] dn.1   290.675304: timer_clear_idle <-__tick_nohz_idle_restart_tick.constprop.4
<idle>-0     [000] dn.1   290.675306: calc_load_nohz_stop <-__tick_nohz_idle_restart_tick.constprop.4
<idle>-0     [000] dn.1   290.675309: hrtimer_cancel <-__tick_nohz_idle_restart_tick.constprop.4
<idle>-0     [000] dn.1   290.675311: hrtimer_try_to_cancel <-hrtimer_cancel
<idle>-0     [000] dn.1   290.675313: preempt_count_add <-hrtimer_try_to_cancel
<idle>-0     [000] dn.2   290.675317: hrtimer_cancel: hrtimer=f8522568
<idle>-0     [000] dn.2   290.675319: __remove_hrtimer <-hrtimer_try_to_cancel
<idle>-0     [000] dn.2   290.675324: preempt_count_sub <-hrtimer_try_to_cancel
<idle>-0     [000] dn.1   290.675327: hrtimer_forward <-__tick_nohz_idle_restart_tick.constprop.4
<idle>-0     [000] dn.1   290.675331: hrtimer_start_range_ns <-__tick_nohz_idle_restart_tick.constprop.4
<idle>-0     [000] dn.1   290.675333: preempt_count_add <-hrtimer_start_range_ns
<idle>-0     [000] dn.2   290.675337: enqueue_hrtimer <-hrtimer_start_range_ns
<idle>-0     [000] dn.2   290.675341: hrtimer_start: hrtimer=f8522568 function=tick_sched_timer expires=290570000000 softexpires=290570000000 mode=ABS|PINNED
<idle>-0     [000] dn.2   290.675346: hrtimer_reprogram.constprop.7 <-hrtimer_start_range_ns
<idle>-0     [000] dn.2   290.675352: tick_program_event <-hrtimer_reprogram.constprop.7
<idle>-0     [000] dn.2   290.675356: clockevents_program_event <-tick_program_event
<idle>-0     [000] dn.2   290.675358: ktime_get <-clockevents_program_event
<idle>-0     [000] dn.2   290.675360: clocksource_mmio_readl_up <-ktime_get
<idle>-0     [000] dn.2   290.675365: bcm2835_time_set_next_event <-clockevents_program_event
<idle>-0     [000] dn.2   290.675368: preempt_count_sub <-hrtimer_start_range_ns
<idle>-0     [000] dn.1   290.675370: account_idle_ticks <-__tick_nohz_idle_restart_tick.constprop.4
<idle>-0     [000] dn.1   290.675373: account_idle_time <-account_idle_ticks
<idle>-0     [000] dn.2   290.675376: __srcu_read_lock <-trace_hardirqs_on
<idle>-0     [000] dn.2   290.675378: rcu_irq_enter_irqson <-trace_hardirqs_on
<idle>-0     [000] dn.2   290.675380: rcu_irq_enter <-rcu_irq_enter_irqson
<idle>-0     [000] dn.1   290.675382: irq_enable: caller=tick_nohz_idle_exit+0xc0/0x130 parent=do_idle+0xcc/0x138
<idle>-0     [000] dn.2   290.675385: rcu_irq_exit_irqson <-trace_hardirqs_on
<idle>-0     [000] dn.2   290.675387: rcu_irq_exit <-rcu_irq_exit_irqson
<idle>-0     [000] dn.2   290.675389: __srcu_read_unlock <-trace_hardirqs_on
<idle>-0     [000] .n.1   290.675391: schedule_idle <-do_idle
<idle>-0     [000] dn.2   290.675395: __srcu_read_lock <-trace_hardirqs_off
<idle>-0     [000] dn.2   290.675398: rcu_irq_enter_irqson <-trace_hardirqs_off
<idle>-0     [000] dn.2   290.675400: rcu_irq_enter <-rcu_irq_enter_irqson
<idle>-0     [000] dn.1   290.675402: irq_disable: caller=__schedule+0x68/0x8f0 parent=schedule_idle+0x44/0x88
<idle>-0     [000] dn.2   290.675404: rcu_irq_exit_irqson <-trace_hardirqs_off
<idle>-0     [000] dn.2   290.675406: rcu_irq_exit <-rcu_irq_exit_irqson
<idle>-0     [000] dn.2   290.675408: __srcu_read_unlock <-trace_hardirqs_off
<idle>-0     [000] dn.1   290.675410: rcu_note_context_switch <-__schedule
<idle>-0     [000] dn.1   290.675414: rcu_utilization: Start context switch
<idle>-0     [000] dn.1   290.675417: rcu_preempt_deferred_qs <-rcu_note_context_switch
<idle>-0     [000] dn.1   290.675419: rcu_qs <-rcu_note_context_switch
<idle>-0     [000] dn.1   290.675422: rcu_utilization: End context switch
<idle>-0     [000] dn.1   290.675424: preempt_count_add <-__schedule
<idle>-0     [000] dn.2   290.675427: update_rq_clock <-__schedule
<idle>-0     [000] dn.2   290.675430: pick_next_task_fair <-__schedule
<idle>-0     [000] dn.2   290.675432: put_prev_task_idle <-pick_next_task_fair
<idle>-0     [000] dn.2   290.675434: pick_next_entity <-pick_next_task_fair
<idle>-0     [000] dn.2   290.675437: clear_buddies <-pick_next_entity
<idle>-0     [000] dn.2   290.675440: set_next_entity <-pick_next_task_fair
<idle>-0     [000] d..2   290.675450: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:2 next_pid=57 next_prio=120



* Re: pick_next_task() picking the wrong task [v4.9.163]
  2019-03-23 10:15 ` Peter Zijlstra
  2019-03-25  2:17   ` Radu Rendec
@ 2019-03-28  0:56   ` Radu Rendec
  1 sibling, 0 replies; 4+ messages in thread
From: Radu Rendec @ 2019-03-28  0:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar

On Sat, 2019-03-23 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> > Hi Everyone,
> > 
> > I believe I'm seeing a weird behavior of pick_next_task() where it
> > chooses a lower priority task over a higher priority one. The scheduling
> > class of the two tasks is also different ('fair' vs. 'rt'). The culprit
> > seems to be the optimization at the beginning of the function, where
> > fair_sched_class.pick_next_task() is called directly.  I'm running
> > v4.9.163, but that piece of code is very similar in recent kernels.
> > 
> > [...]
> > 
> > I instrumented pick_next_task() with trace_printk() and I am sure that
> > every time the wrong task is picked, flow goes through the optimization
> 
> That's weird, because when you wake a RT task, the:
> 
>   rq->nr_running == rq->cfs.h_nr_running
> 
> condition should not be true. Maybe try adding trace_printk() to all
> rq->nr_running manipulation to see what goes wobbly?

The answer is in enqueue_top_rt_rq(): it returns before touching the run
queue counters because rt_rq_throttled(rt_rq) is true. So basically this
is RT throttling kicking in. I confirmed by disabling RT throttling and
testing again.

So there's nothing wrong with the scheduler. The "sched_wakeup: comm=.."
trace was a bit misleading. What happens when RT throttling kicks in is
that the task is woken (and probably changes state to TASK_RUNNING) but
not actually added to the run queue.

Thanks again for looking into this and sorry about the noise!

Best regards,
Radu Rendec


