All of lore.kernel.org
 help / color / mirror / Atom feed
* preempt_disable causes big latency
@ 2021-01-28  9:04 chensong_2000
  2021-01-28 11:30 ` Daniel Vacek
  0 siblings, 1 reply; 3+ messages in thread
From: chensong_2000 @ 2021-01-28  9:04 UTC (permalink / raw)
  To: linux-rt-users


Dear experts,

When i was running cyclictest with ltp(kernel 4.19.90 rt, arch64), 
cyclictest got a big latecy(more than 10ms), the ftrace log shows that 
modprobe called preempt_disable for a long time, which stops cyclictest 
being switched.

here is a piece of log:

modprobe-4551    0...10   712.754529: preempt_disable: 
caller=0xffff0001019900c8 parent=0xffff0001019900c8
...
modprobe-4551    0d..10   712.754589: irq_disable: caller=el1_irq+0x7c 
parent=0xffff0001019900dc
modprobe-4551    0d.h10   712.754590: irq_handler_entry:    irq=2 
name=arch_timer
modprobe-4551    0d.h20   712.754591: hrtimer_cancel: 
hrtimer=0xffff8020f4983d98
modprobe-4551    0d.h10   712.754591: hrtimer_expire_entry: 
hrtimer=0xffff8020f4983d98 now=712720218245 function=hrtimer_wakeup/0x0
modprobe-4551    0d.h20   712.754592: sched_waking: comm=cyclictest 
pid=2212 prio=9 target_cpu=000
modprobe-4551    0dNh30   712.754593: sched_wakeup: cyclictest:2212 [9] 
success=1 CPU:000
modprobe-4551    0dNh10   712.754594: hrtimer_expire_exit: 
hrtimer=0xffff8020f4983d98
modprobe-4551    0dNh10   712.754594: irq_handler_exit:     irq=2 
ret=handled

Further i found 
preemptirq_delay_test(kernel/trace/preemptirq_delay_test.c) can 
reproduce it easily and got the similar ftrace log.

The root cause i found is in el1_irq, it checks preempt count and 
TIF_NEED_RESCHED before it goes through the path el1_preempt, 
preempt_disable just stops it happening.

Then i came up an idea and did an experiment,  call 
preempt_count_set(0); and set_tsk_need_resched(task); in hrtimer_wakeup 
to meet the expectation in order to reschdule.

static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
{
     struct hrtimer_sleeper *t =
         container_of(timer, struct hrtimer_sleeper, timer);
     struct task_struct *task = t->task;

     t->task = NULL;
     if (task)
         wake_up_process(task);

     if(preempt_count())
         preempt_count_set(0);
     set_tsk_need_resched(task);

     return HRTIMER_NORESTART;
}

but, failed, system froze, no panic information.

Is there anyone having the same problem? i would appreciate it if you 
could share me some information in this case, many thanks.

BR

Song Chen




^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: preempt_disable causes big latency
  2021-01-28  9:04 preempt_disable causes big latency chensong_2000
@ 2021-01-28 11:30 ` Daniel Vacek
  2021-01-29  3:52   ` chensong_2000
  0 siblings, 1 reply; 3+ messages in thread
From: Daniel Vacek @ 2021-01-28 11:30 UTC (permalink / raw)
  To: chensong_2000; +Cc: linux-rt-users

Hi Song,

On Thu, Jan 28, 2021 at 10:12 AM chensong_2000@189.cn
<chensong_2000@189.cn> wrote:
> Dear experts,
>
> When i was running cyclictest with ltp(kernel 4.19.90 rt, arch64),
> cyclictest got a big latecy(more than 10ms), the ftrace log shows that
> modprobe called preempt_disable for a long time, which stops cyclictest
> being switched.

This works as designed. Running a latency sensitive application has
some inherent rules. You need to stick to that rules, otherwise
excessive latencies are to be expected. The best documentation (I
believe) you can find on
https://wiki.linuxfoundation.org/realtime/start (ex.
https://rt.wiki.kernel.org/).

In short, running modprobe (loading kernel modules in general) is not
allowed. For a production system, you are supposed to set up the
machine first (loading the modules during boot or before starting the
application). The application itself is bound to allowed operations
only, otherwise it has to expect blocking and unbounded latencies.

> here is a piece of log:
>
> modprobe-4551    0...10   712.754529: preempt_disable:
> caller=0xffff0001019900c8 parent=0xffff0001019900c8
> ...
> modprobe-4551    0d..10   712.754589: irq_disable: caller=el1_irq+0x7c
> parent=0xffff0001019900dc
> modprobe-4551    0d.h10   712.754590: irq_handler_entry:    irq=2
> name=arch_timer
> modprobe-4551    0d.h20   712.754591: hrtimer_cancel:
> hrtimer=0xffff8020f4983d98
> modprobe-4551    0d.h10   712.754591: hrtimer_expire_entry:
> hrtimer=0xffff8020f4983d98 now=712720218245 function=hrtimer_wakeup/0x0
> modprobe-4551    0d.h20   712.754592: sched_waking: comm=cyclictest
> pid=2212 prio=9 target_cpu=000
> modprobe-4551    0dNh30   712.754593: sched_wakeup: cyclictest:2212 [9]
> success=1 CPU:000
> modprobe-4551    0dNh10   712.754594: hrtimer_expire_exit:
> hrtimer=0xffff8020f4983d98
> modprobe-4551    0dNh10   712.754594: irq_handler_exit:     irq=2
> ret=handled

Again, this is perfectly valid.

> Further i found
> preemptirq_delay_test(kernel/trace/preemptirq_delay_test.c) can
> reproduce it easily and got the similar ftrace log.

You can find endless ways how to introduce latency to the system.
That's why some operations are not allowed. For more details see the
mentioned documentation. ^^

> The root cause i found is in el1_irq, it checks preempt count and
> TIF_NEED_RESCHED before it goes through the path el1_preempt,
> preempt_disable just stops it happening.

As it should.

> Then i came up an idea and did an experiment,  call
> preempt_count_set(0); and set_tsk_need_resched(task); in hrtimer_wakeup
> to meet the expectation in order to reschdule.

Sounds like demolishing the pillars of the bridge, yet still expecting
it not to fall. This will break your system for sure.

> static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
> {
>      struct hrtimer_sleeper *t =
>          container_of(timer, struct hrtimer_sleeper, timer);
>      struct task_struct *task = t->task;
>
>      t->task = NULL;
>      if (task)
>          wake_up_process(task);
>
>      if(preempt_count())
>          preempt_count_set(0);
>      set_tsk_need_resched(task);
>
>      return HRTIMER_NORESTART;
> }
>
> but, failed, system froze, no panic information.

Not surprised.

> Is there anyone having the same problem? i would appreciate it if you
> could share me some information in this case, many thanks.

Simply, no modprobe while running the cyclictest. The cyclictest is
supposed to be running on a machine as if the machine was in
production and definitely not tortured by arbitrary synthetic tests
(unless the test simulates a real world scenario which actually can
happen in production).
Again, all the modules should be preloaded before running cyclictest
or the production application.

Realtime system is a fragile beast and you have to handle it with special care.

Have a nice day,
Daniel

> BR
>
> Song Chen

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: preempt_disable causes big latency
  2021-01-28 11:30 ` Daniel Vacek
@ 2021-01-29  3:52   ` chensong_2000
  0 siblings, 0 replies; 3+ messages in thread
From: chensong_2000 @ 2021-01-29  3:52 UTC (permalink / raw)
  To: Daniel Vacek; +Cc: linux-rt-users

Hi Daniel,

Thanks a lot for your kind explanation and vivid metaphors, it helps me 
have better understanding about preempt rt with the docs linked below.

Like you said, Realtime system is a fragile beast, we have a lot of work 
ahead to make it more robust and have better fault-tolerant.

Thanks again and best regards,

Song

在 2021/1/28 下午7:30, Daniel Vacek 写道:
> Hi Song,
> 
> On Thu, Jan 28, 2021 at 10:12 AM chensong_2000@189.cn
> <chensong_2000@189.cn> wrote:
>> Dear experts,
>>
>> When i was running cyclictest with ltp(kernel 4.19.90 rt, arch64),
>> cyclictest got a big latecy(more than 10ms), the ftrace log shows that
>> modprobe called preempt_disable for a long time, which stops cyclictest
>> being switched.
> 
> This works as designed. Running a latency sensitive application has
> some inherent rules. You need to stick to that rules, otherwise
> excessive latencies are to be expected. The best documentation (I
> believe) you can find on
> https://wiki.linuxfoundation.org/realtime/start (ex.
> https://rt.wiki.kernel.org/).
> 
> In short, running modprobe (loading kernel modules in general) is not
> allowed. For a production system, you are supposed to set up the
> machine first (loading the modules during boot or before starting the
> application). The application itself is bound to allowed operations
> only, otherwise it has to expect blocking and unbounded latencies.
> 
>> here is a piece of log:
>>
>> modprobe-4551    0...10   712.754529: preempt_disable:
>> caller=0xffff0001019900c8 parent=0xffff0001019900c8
>> ...
>> modprobe-4551    0d..10   712.754589: irq_disable: caller=el1_irq+0x7c
>> parent=0xffff0001019900dc
>> modprobe-4551    0d.h10   712.754590: irq_handler_entry:    irq=2
>> name=arch_timer
>> modprobe-4551    0d.h20   712.754591: hrtimer_cancel:
>> hrtimer=0xffff8020f4983d98
>> modprobe-4551    0d.h10   712.754591: hrtimer_expire_entry:
>> hrtimer=0xffff8020f4983d98 now=712720218245 function=hrtimer_wakeup/0x0
>> modprobe-4551    0d.h20   712.754592: sched_waking: comm=cyclictest
>> pid=2212 prio=9 target_cpu=000
>> modprobe-4551    0dNh30   712.754593: sched_wakeup: cyclictest:2212 [9]
>> success=1 CPU:000
>> modprobe-4551    0dNh10   712.754594: hrtimer_expire_exit:
>> hrtimer=0xffff8020f4983d98
>> modprobe-4551    0dNh10   712.754594: irq_handler_exit:     irq=2
>> ret=handled
> 
> Again, this is perfectly valid.
> 
>> Further i found
>> preemptirq_delay_test(kernel/trace/preemptirq_delay_test.c) can
>> reproduce it easily and got the similar ftrace log.
> 
> You can find endless ways how to introduce latency to the system.
> That's why some operations are not allowed. For more details see the
> mentioned documentation. ^^
> 
>> The root cause i found is in el1_irq, it checks preempt count and
>> TIF_NEED_RESCHED before it goes through the path el1_preempt,
>> preempt_disable just stops it happening.
> 
> As it should.
> 
>> Then i came up an idea and did an experiment,  call
>> preempt_count_set(0); and set_tsk_need_resched(task); in hrtimer_wakeup
>> to meet the expectation in order to reschdule.
> 
> Sounds like demolishing the pillars of the bridge, yet still expecting
> it not to fall. This will break your system for sure.
> 
>> static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
>> {
>>       struct hrtimer_sleeper *t =
>>           container_of(timer, struct hrtimer_sleeper, timer);
>>       struct task_struct *task = t->task;
>>
>>       t->task = NULL;
>>       if (task)
>>           wake_up_process(task);
>>
>>       if(preempt_count())
>>           preempt_count_set(0);
>>       set_tsk_need_resched(task);
>>
>>       return HRTIMER_NORESTART;
>> }
>>
>> but, failed, system froze, no panic information.
> 
> Not surprised.
> 
>> Is there anyone having the same problem? i would appreciate it if you
>> could share me some information in this case, many thanks.
> 
> Simply, no modprobe while running the cyclictest. The cyclictest is
> supposed to be running on a machine as if the machine was in
> production and definitely not tortured by arbitrary synthetic tests
> (unless the test simulates a real world scenario which actually can
> happen in production).
> Again, all the modules should be preloaded before running cyclictest
> or the production application.
> 
> Realtime system is a fragile beast and you have to handle it with special care.
> 
> Have a nice day,
> Daniel
> 
>> BR
>>
>> Song Chen
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-01-29  3:55 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-28  9:04 preempt_disable causes big latency chensong_2000
2021-01-28 11:30 ` Daniel Vacek
2021-01-29  3:52   ` chensong_2000

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.