linux-kernel.vger.kernel.org archive mirror
* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 17:49 Mark_H_Johnson
  2004-12-10 21:09 ` Ingo Molnar
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

>The -32-15 kernel can be downloaded from the
>usual place:
>
> http://redhat.com/~mingo/realtime-preempt/

By the time I started today, the directory had -18, so that is what I built
with. Here are some initial results from running cpu_delay (the simple
RT test / user tracing) on a -18 PREEMPT_RT kernel. I have not started
any of the stress tests yet.

To recap, all IRQ # tasks, ksoftirqd/# and events/# tasks run at RT FIFO
priority 99. The test program runs at RT FIFO 30 and should use about
70% of the CPU time on the two-CPU SMP system under test.
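For reference, a dry-run sketch of how that setup could be reproduced with
chrt. The pgrep patterns are my guess at the kernel thread names, so the
commands are only printed here, not executed:

```shell
# Print (do not execute) the chrt commands that would reproduce the
# priority setup: IRQ/ksoftirqd/events threads at FIFO 99, test at FIFO 30.
for pattern in 'IRQ' 'ksoftirqd/' 'events/'; do
    echo "chrt -f -p 99 \$(pgrep -f '$pattern')"
done
echo "chrt -f 30 ./cpu_delay 0.000100"
```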

[1] I still do not get traces where cpu_delay switches CPU's. I only
get trace output if it starts and ends on a single CPU. I also had
several cases where I "triggered" a trace but got no output - I assume
those are related symptoms. For example:

# ./cpu_delay 0.000100
Delay limit set to 0.00010000 seconds
calibrating loop ....
time diff= 0.504598 or 396354830 loops/sec.
Trace activated with 0.000100 second delay.
Trace triggered with 0.000102 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000164 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000132 second delay. [00]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000128 second delay. [01]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000144 second delay. [not recorded]
Trace triggered with 0.000355 second delay. [02]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000101 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000126 second delay. [not recorded]
Trace triggered with 0.000205 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000147 second delay. [03]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000135 second delay. [04]
Trace triggered with 0.000110 second delay. [not recorded]
Trace triggered with 0.000247 second delay. [05]
Trace triggered with 0.000120 second delay. [06]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000107 second delay. [07]
Trace triggered with 0.000104 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000100 second delay. [08]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000201 second delay. [09]

# chrt -f 1 ./get_ltrace.sh 50
Current Maximum is 4965280, limit will be 50.
Resetting max latency from 4965280 to 50.
No new latency samples at Fri Dec 10 11:01:22 CST 2004.
No new latency samples at Fri Dec 10 11:01:32 CST 2004.
No new latency samples at Fri Dec 10 11:01:42 CST 2004.
No new latency samples at Fri Dec 10 11:01:53 CST 2004.
No new latency samples at Fri Dec 10 11:02:03 CST 2004.
No new latency samples at Fri Dec 10 11:02:13 CST 2004.
New trace 0 w/ 117 usec latency.
Resetting max latency from 117 to 50.
No new latency samples at Fri Dec 10 11:02:35 CST 2004.
No new latency samples at Fri Dec 10 11:02:45 CST 2004.
New trace 1 w/ 99 usec latency.
Resetting max latency from 99 to 50.
New trace 2 w/ 248 usec latency.
Resetting max latency from 248 to 50.
New trace 3 w/ 146 usec latency.
Resetting max latency from 146 to 50.
New trace 4 w/ 134 usec latency.
Resetting max latency from 134 to 50.
New trace 5 w/ 250 usec latency.
Resetting max latency from 250 to 50.
New trace 6 w/ 120 usec latency.
Resetting max latency from 120 to 50.
New trace 7 w/ 105 usec latency.
Resetting max latency from 105 to 50.
New trace 8 w/ 91 usec latency.
Resetting max latency from 91 to 50.
New trace 9 w/ 200 usec latency.

For the most part, the elapsed times in the traces closely match what
was measured by the application.

[2] The all-CPU traces appear to show some cases where both ksoftirqd
tasks are active during a preemption of the RT task. That is expected
with my priority settings. [though the first trace shows BOTH getting
activated within 2 usec!]

[3] Some traces show information from both CPU's, and then a long period
with no traces from one of them. Here is an example...

preemption latency trace v1.1.4 on 2.6.10-rc2-mm3-V0.7.32-18RT
--------------------------------------------------------------------
 latency: 99 us, #275/275, CPU#0 | (M:rt VP:0, KP:1, SP:1 HP:1 #P:2)
    -----------------
    | task: cpu_delay-3556 (uid:0 nice:0 policy:1 rt_prio:30)
    -----------------
 => started at: <00000000>
 => ended at:   <00000000>

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
<unknown-2847  1d...    0µs : smp_apic_timer_interrupt (86a89bf 0 0)
cpu_dela-3556  0d.h1    0µs : update_process_times
(smp_apic_timer_interrupt)
cpu_dela-3556  0d.h1    0µs : account_system_time (update_process_times)
cpu_dela-3556  0d.h1    0µs : check_rlimit (account_system_time)
<unknown-2847  1d.h.    0µs : update_process_times
(smp_apic_timer_interrupt)
cpu_dela-3556  0d.h1    0µs : account_it_prof (account_system_time)
...
<unknown-2847  1d.h1    4µs : _raw_spin_unlock (scheduler_tick)
cpu_dela-3556  0d.h1    4µs : irq_exit (apic_timer_interrupt)
<unknown-2847  1d.h.    4µs : rebalance_tick (scheduler_tick)
cpu_dela-3556  0d..2    4µs : do_softirq (irq_exit)
cpu_dela-3556  0d..2    4µs : __do_softirq (do_softirq)
<unknown-2847  1d.h.    5µs : irq_exit (apic_timer_interrupt)
cpu_dela-3556  0d..2    5µs : wake_up_process (do_softirq)
cpu_dela-3556  0d..2    5µs : try_to_wake_up (wake_up_process)
cpu_dela-3556  0d..2    5µs : task_rq_lock (try_to_wake_up)
<unknown-2847  1d...    5µs < (0)
cpu_dela-3556  0d..2    5µs : _raw_spin_lock (task_rq_lock)
cpu_dela-3556  0d..3    6µs : activate_task (try_to_wake_up)
cpu_dela-3556  0d..3    6µs : sched_clock (activate_task)
cpu_dela-3556  0d..3    7µs : recalc_task_prio (activate_task)
cpu_dela-3556  0d..3    7µs : effective_prio (recalc_task_prio)
cpu_dela-3556  0d..3    7µs : activate_task <ksoftirq-4> (0 1):
cpu_dela-3556  0d..3    7µs : enqueue_task (activate_task)
cpu_dela-3556  0d..3    8µs : resched_task (try_to_wake_up)
cpu_dela-3556  0dn.3    8µs : __trace_start_sched_wakeup (try_to_wake_up)
cpu_dela-3556  0dn.3    8µs : try_to_wake_up <ksoftirq-4> (0 45):
cpu_dela-3556  0dn.3    9µs : _raw_spin_unlock_irqrestore (try_to_wake_up)
cpu_dela-3556  0dn.2    9µs : preempt_schedule (try_to_wake_up)
cpu_dela-3556  0dn.2    9µs : wake_up_process (do_softirq)
cpu_dela-3556  0dn.1   10µs < (0)
cpu_dela-3556  0.n.1   10µs : preempt_schedule (up)
cpu_dela-3556  0.n..   10µs : preempt_schedule (user_trace_start)
cpu_dela-3556  0dn..   11µs : __sched_text_start (preempt_schedule)
cpu_dela-3556  0dn.1   11µs : sched_clock (__sched_text_start)
cpu_dela-3556  0dn.1   11µs : _raw_spin_lock_irq (__sched_text_start)
cpu_dela-3556  0dn.1   12µs : _raw_spin_lock_irqsave (__sched_text_start)
cpu_dela-3556  0dn.2   12µs : pull_rt_tasks (__sched_text_start)
cpu_dela-3556  0dn.2   12µs : find_next_bit (pull_rt_tasks)
cpu_dela-3556  0dn.2   13µs : find_next_bit (pull_rt_tasks)
cpu_dela-3556  0d..2   13µs : trace_array (__sched_text_start)
cpu_dela-3556  0d..2   13µs : trace_array <ksoftirq-4> (0 1):
cpu_dela-3556  0d..2   15µs : trace_array <cpu_dela-3556> (45 46):
cpu_dela-3556  0d..2   16µs+: trace_array (__sched_text_start)
ksoftirq-4     0d..2   19µs : __switch_to (__sched_text_start)
ksoftirq-4     0d..2   20µs : __sched_text_start <cpu_dela-3556> (45 0):
ksoftirq-4     0d..2   20µs : finish_task_switch (__sched_text_start)
ksoftirq-4     0d..2   20µs : smp_send_reschedule_allbutself
(finish_task_switch)
ksoftirq-4     0d..2   20µs : send_IPI_allbutself
(smp_send_reschedule_allbutself)
ksoftirq-4     0d..2   21µs : __bitmap_weight (send_IPI_allbutself)
ksoftirq-4     0d..2   21µs : __send_IPI_shortcut (send_IPI_allbutself)
ksoftirq-4     0d..2   21µs : _raw_spin_unlock (finish_task_switch)
ksoftirq-4     0d..1   22µs : trace_stop_sched_switched
(finish_task_switch)
ksoftirq-4     0....   23µs : _do_softirq (ksoftirqd)
ksoftirq-4     0d...   23µs : ___do_softirq (_do_softirq)
ksoftirq-4     0....   23µs : run_timer_softirq (___do_softirq)
ksoftirq-4     0....   24µs : _spin_lock (run_timer_softirq)
ksoftirq-4     0....   24µs : __spin_lock (_spin_lock)
ksoftirq-4     0....   24µs : __might_sleep (__spin_lock)
<unknown-2847  1d...   24µs : smp_reschedule_interrupt (86a8bd8 0 0)
ksoftirq-4     0....   24µs : _down_mutex (__spin_lock)
<unknown-2847  1d...   25µs < (0)
ksoftirq-4     0....   25µs : __down_mutex (__spin_lock)
ksoftirq-4     0....   25µs : __might_sleep (__down_mutex)
ksoftirq-4     0d...   25µs : _raw_spin_lock (__down_mutex)
ksoftirq-4     0d..1   25µs : _raw_spin_lock (__down_mutex)
ksoftirq-4     0d..2   26µs : _raw_spin_lock (__down_mutex)
ksoftirq-4     0d..3   26µs : set_new_owner (__down_mutex)
ksoftirq-4     0d..3   26µs : set_new_owner <ksoftirq-4> (0 0):
ksoftirq-4     0d..3   27µs : _raw_spin_unlock (__down_mutex)
ksoftirq-4     0d..2   27µs : _raw_spin_unlock (__down_mutex)
ksoftirq-4     0d..1   27µs : _raw_spin_unlock (__down_mutex)
... no more traces from CPU 1 ...
ksoftirq-4     0....   77µs : rcu_check_quiescent_state
(__rcu_process_callbacks)
ksoftirq-4     0....   77µs : cond_resched_all (___do_softirq)
ksoftirq-4     0....   77µs : cond_resched (___do_softirq)
ksoftirq-4     0....   78µs : cond_resched (ksoftirqd)
ksoftirq-4     0....   78µs : kthread_should_stop (ksoftirqd)
ksoftirq-4     0....   78µs : schedule (ksoftirqd)
ksoftirq-4     0....   78µs : __sched_text_start (schedule)
ksoftirq-4     0...1   79µs : sched_clock (__sched_text_start)
ksoftirq-4     0...1   79µs : _raw_spin_lock_irq (__sched_text_start)
ksoftirq-4     0...1   79µs : _raw_spin_lock_irqsave (__sched_text_start)
ksoftirq-4     0d..2   80µs : deactivate_task (__sched_text_start)
ksoftirq-4     0d..2   80µs : dequeue_task (deactivate_task)
ksoftirq-4     0d..2   81µs : trace_array (__sched_text_start)
ksoftirq-4     0d..2   82µs : trace_array <cpu_dela-3556> (45 46):
ksoftirq-4     0d..2   84µs+: trace_array (__sched_text_start)
cpu_dela-3556  0d..2   86µs : __switch_to (__sched_text_start)
cpu_dela-3556  0d..2   87µs : __sched_text_start <ksoftirq-4> (0 45):
cpu_dela-3556  0d..2   87µs : finish_task_switch (__sched_text_start)
cpu_dela-3556  0d..2   87µs : _raw_spin_unlock (finish_task_switch)
cpu_dela-3556  0d..1   88µs : trace_stop_sched_switched
(finish_task_switch)
cpu_dela-3556  0d...   89µs+< (0)
cpu_dela-3556  0d...   92µs : math_state_restore (device_not_available)
cpu_dela-3556  0d...   92µs : restore_fpu (math_state_restore)
cpu_dela-3556  0d...   93µs < (0)
cpu_dela-3556  0....   93µs > sys_gettimeofday (00000000 00000000 0000007b)
cpu_dela-3556  0....   93µs : sys_gettimeofday (sysenter_past_esp)
cpu_dela-3556  0....   94µs : user_trace_stop (sys_gettimeofday)
cpu_dela-3556  0...1   94µs : user_trace_stop (sys_gettimeofday)
cpu_dela-3556  0...1   95µs : _raw_spin_lock_irqsave (user_trace_stop)
cpu_dela-3556  0d..2   95µs : _raw_spin_unlock_irqrestore (user_trace_stop)

If I read this right, we tried to reschedule cpu_delay on CPU #1 (at
24 usec) but it never happened, and cpu_delay was still "ready to run"
on CPU #0 some 70 usec later.

[4] I have a trace where cpu_delay was bumped off CPU #1 at 20 usec
while the X server (not RT) stayed the active process on CPU #0 for another
130 usec (several traces with preempt-depth == 0) until it finally got
bumped by IRQ 0.

[5] More of a cosmetic problem: several traces still show the
application name as "unknown" - even for long-lived processes like
ksoftirqd/0 and X.

Due to the file size, I will send the traces separately.
  --Mark


^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-13 23:14 Mark_H_Johnson
  0 siblings, 0 replies; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-13 23:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

> * Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
>
> > The maximum duration of the CPU loop (as measured by the application)
> > is in the range of 1.42 msec to 2.57 compared to the nominal 1.16 msec
> > duration for -20RT.  The equivalent numbers for -20PK are 1.28 to 1.93
> > msec. [...]
>
> so -20RT has resolved all the CPU-loop-max-delay issues of the -RT
> kernel regarding the RT-priority CPU loop and in essence adds only a
> small amount of delay (100 usecs?) to the nominal (==minimum possible)
> delay?
I believe so - in other words, a minor penalty in average latency in
exchange for a reduction in maximum latency [though I am not sure I
actually measured such a reduction].

> i suspect the 100 usecs comparison is an effect of the cutoff value
> being a single value. Also, 100 usecs is so close to the DMA related
> delay which makes it hard to compare it - other than stating that -RT
> has higher CPU overhead.
I think we agree on this assessment. I don't think the threading overhead
is 100 usec; it is more likely 50 usec [at least on my hardware - see below].
That the extra 50 usec puts us over the 100 usec measure is unfortunate.

> are the ping times still considered anomalous? Could be a side-effect of
> the different flow of control between hardirq/softirq contexts. (There
> have been a (low but nonzero) number of assumptions about the flow in
> pieces of softirq code, and there could be more.)
Please note that to get the ping times I stated, I set the priority of
ksoftirqd/0 and /1 to 99 (along with the IRQ threads). Others won't be
so generous and will get different results.

I believe the longer duration with -20RT is due to the threading overhead.
Look at the numbers for -20PK and -20RT; the minimums are
  -20PK 0.089 msec (or 89 usec)
  -20RT 0.134 msec (or 134 usec)
or a difference of about 50 usec. If the sequence of steps is something
like
  Network Interrupt -T-> Network IRQ -T-> ksoftirqd/0 -> ping reply
on -RT and
  Network Interrupt -> Network IRQ -> soft IRQ -> ping reply
on -PK, the difference measured is likely reasonable.

[where -T-> represents a new thread activation, and -> represents something
like a function call]
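A back-of-the-envelope check of that reading (splitting the difference
evenly across the two -T-> activations is my assumption, not a
measurement):

```shell
# Minimum ping times from the -20RT and -20PK runs, in usec.
rt_min=134
pk_min=89
extra=$((rt_min - pk_min))
echo "threaded path adds about $extra usec"
# The -RT sequence has two -T-> thread activations that the -PK
# sequence lacks, so each would account for roughly half the difference.
echo "about $((extra / 2)) usec per thread activation"
```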

Again, the benefit of PREEMPT_RT (over PREEMPT_DESKTOP) should be reduced
maximum latencies with a modest (or very small) impact on overall
performance / throughput. We are certainly getting a lot closer to
reaching that goal.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-13 20:02 Mark_H_Johnson
  2004-12-13 22:39 ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-13 20:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

A comparison of PREEMPT_RT (no tracing) to PREEMPT_DESKTOP
(no tracing) to help answer previous requests.

Comparison of .32-20RT and .32-20PK results
20RT has PREEMPT_RT (all tracing disabled)
20PK has PREEMPT_DESKTOP and no threaded IRQ's (all tracing disabled)
2.4 has lowlat + preempt patches applied

      within 100 usec
       CPU loop (%)   Elapsed Time (sec)    2.4
Test   RT     PK        RT      PK   |   CPU  Elapsed
X     99.87  99.75      65 *    59 * |  97.20   70
top   99.35  99.97      31 *    30 * |  97.48   29
neto  96.94  98.65     113 *   135 * |  96.23   36
neti  97.05  98.59     119 *   140 * |  95.86   41
diskw 94.36  91.69      30 *    70 * |  77.64   29
diskc 93.85  98.88      98 *   151 * |  84.12   77
diskr 99.39  99.92     133 *   210 * |  90.66   86
total                  589     795   |         368
 [higher is better]  [lower is better]
* wide variation in audio duration
+ long stretch of audio duration "too fast"
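As a quick sanity check of the "total" row (just summing the elapsed-time
columns above):

```shell
# Sum the RT and PK elapsed-time columns from the table.
rt_total=$((65 + 31 + 113 + 119 + 30 + 98 + 133))
pk_total=$((59 + 30 + 135 + 140 + 70 + 151 + 210))
echo "RT total: $rt_total sec"   # 589, matching the table
echo "PK total: $pk_total sec"   # 795, matching the table
```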

With the two versions of -20, the percentage of CPU loop durations
within 100 usec of nominal is quite similar.

Looking at ping response time:
  RT 0.134 / 0.208 / 1.502 / 0.075 msec
  PK 0.089 / 0.159 / 0.467 / 0.047 msec
for min / average / max / mdev values. You can see that
-20PK does much better than -20RT in this measure.
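Those four numbers come from the summary line ping itself prints. A small
sketch of pulling them apart - the sample line is copied from the -RT run
above, and the awk split is my own:

```shell
# The "rtt min/avg/max/mdev" summary line as ping prints it at exit.
line="rtt min/avg/max/mdev = 0.134/0.208/1.502/0.075 ms"
# Split the four slash-separated fields out of the right-hand side.
stats=$(printf '%s\n' "$line" | awk -F' = ' \
    '{split($2, a, "/"); sub(/ ms/, "", a[4]); print a[1], a[2], a[3], a[4]}')
echo "min/avg/max/mdev: $stats"
```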

The maximum duration of the CPU loop (as measured by the
application) is in the range of 1.42 msec to 2.57 compared
to the nominal 1.16 msec duration for -20RT. The equivalent
numbers for -20PK are 1.28 to 1.93 msec. It's a little odd
that the big outlier for -20PK came during the X 11 stress
test - its chart was one of the smoothest (least variation)
of the set.

I repeated the test without cpu_burn (non RT, nice) for the
20PK kernel as well. For -20PK, all the elapsed times were
reduced [as I expected for both RT and PK] and, with the exception
of the network tests, were roughly the same as the 2.4 results.

-20RT
  with cpu_burn      65  31 113 119  30  98 133
  without cpu_burn   63  30 121 150  32  87  97
-20PK
  with cpu_burn      59  30 135 140  70 151 210
  without cpu_burn   62  30  94  60  27  89  93

I reran the 2.4 tests and the network tests ran in 39 and 37
seconds respectively. I guess this shows we have something odd
going on in the network stack under 2.6.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-13 17:05 Mark_H_Johnson
  2004-12-13 22:33 ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-13 17:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

A comparison of PREEMPT_RT (no tracing) to PREEMPT_DESKTOP
(with tracing) to help answer previous requests.

Comparison of .32-20RT and .32-18PK results
20RT has PREEMPT_RT (all tracing disabled)
18PK has PREEMPT_DESKTOP and no threaded IRQ's (tracing enabled)
2.4 has lowlat + preempt patches applied

      within 100 usec
       CPU loop (%)   Elapsed Time (sec)    2.4
Test   RT     PK        RT      PK   |   CPU  Elapsed
X     99.87 100.00&     65 *    64+  |  97.20   70
top   99.35 100.00&     31 *    31+  |  97.48   29
neto  96.94 100.00&    113 *   184+  |  96.23   36
neti  97.05 100.00&    119 *   170+  |  95.86   41
diskw 94.36  99.99      30 *    61+  |  77.64   29
diskc 93.85  99.34      98 *   310+  |  84.12   77
diskr 99.39  99.96     133 *   310+  |  90.66   86
total                  589    1130   |         368
 [higher is better]  [lower is better]
* wide variation in audio duration
+ long stretch of audio duration "too fast"
& 100% to digits shown, had a FEW samples > 100 usec.

The results for -20RT (without tracing) are much better than
-18RT (with tracing) but still not quite as good as -18PK
with respect to the 100 usec delays.

The overhead of threading the IRQ's (and rescheduling the
RT application making the measurements) is likely the cause of
the difference in 100 usec latency measurements.

I believe the threading overhead is confirmed by the slightly
slower ping response measurements:
  RT 0.134 / 0.208 / 1.502 / 0.075 msec
  PK 0.102 / 0.164 / 0.708 / 0.044 msec
for min / average / max / mdev values. These are much
better than with -18RT (with tracing) as well. Apparently the
tracing overhead is pretty high in the task switching area.

The maximum duration of the CPU loop (as measured by the
application) is in the range of 1.42 msec to 2.57 compared
to the nominal 1.16 msec duration for -20RT.

The non RT starvation problem appears to be mitigated quite
a bit by removing the tracing. I did a follow-up test without
cpu_burn (non RT, nice), and the elapsed times varied in an
odd pattern:

  with cpu_burn      65  31 113 119  30  98 133
  without cpu_burn   63  30 121 150  32  87  97

The disk numbers are getting much closer to the 2.4 values,
so removing cpu_burn definitely helps prevent starvation.
The larger network numbers are something I should probably
confirm on my 2.4 test system, to see if something changed
on the network to skew the results.

I will be building a -20PK without tracing for a more complete
comparison.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 22:06 Mark_H_Johnson
  2004-12-10 22:24 ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 22:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

Here is my get_ltrace.sh script (at the end).

So I read the preempt_max_latency (to see if it has changed) before
I copy the latency_trace output. I am not so sure that cat is
really doing an "atomic" read when some of the latency traces
are over 300 Kbytes in length.

Also note that some of the files were empty :-(. I don't think
I've seen that symptom before.

Note that the preempt_max_latency value DID match the last line of
the trace output in the example I described. It is just the header
that had some stale data in it.

  --Mark

--- get_ltrace.sh ---

#!/bin/sh
# Poll the kernel's max-latency counter once a second; whenever it grows,
# save the matching trace, and after 10 quiet seconds reset the counter
# back to the limit. (Uses bash arithmetic; run with bash if /bin/sh is
# not bash.)

let MAX=`cat /proc/sys/kernel/preempt_max_latency`
let I=0 J=1
let MP=${1:-1000}
echo "Current Maximum is $MAX, limit will be $MP."
while (( I < 100 )) ; do
    sleep 1s
    let NOW=`cat /proc/sys/kernel/preempt_max_latency`
    if (( MAX != NOW )) ; then
        # a new maximum was recorded - save the trace that goes with it
        echo "New trace $I w/ $NOW usec latency."
        cat /proc/latency_trace > lt.`printf "%02d" $I`
#       sync ; sync
        let I++
        let MAX=NOW
    elif (( J++ >= 10 )) ; then
        # nothing new for 10 seconds - lower the threshold back to the limit
        if (( MAX != MP )) ; then
            echo "Resetting max latency from $MAX to $MP."
            echo $MP > /proc/sys/kernel/preempt_max_latency
            let MAX=$MP
        else
            echo "No new latency samples at `date`."
        fi
        let J=1
# else do nothing...
    fi
done


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 21:58 Mark_H_Johnson
  0 siblings, 0 replies; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 21:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>the only recent thing added was the global RT balancer, which is not
>active under PREEMPT_DESKTOP. Maybe this code somehow interferes with
>your workload? Could you try to disable it, ...

Real life will intervene before I can get the kernel built & tested
[I have to pick up the kids...]. I will try this (w/o any tracing as well)
on Monday (and see if I can put RT in a good light...).

  --Mark


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 21:54 Mark_H_Johnson
  2004-12-10 22:14 ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 21:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

The code does not quite match either pattern, but is perhaps
closer to your second example.

For reference, the cpu_delay loop looks like this...

  t1 = mygettime();
  for(u=0;u<(loops/1000);u++) {
    t0 = t1;
    if (do_a_trace) {
      /* magic call: gettimeofday(0, 1) (re)arms the kernel's user trace */
      gettimeofday(0, (struct timezone*)1);
    }
    for (v=0;v<1000;v++)        /* the timed busy-work loop */
      k+=1;
    t1 = mygettime();
    if ((t1-t0)>max_delay){     /* the loop took longer than the limit */
      delay++;
      if (do_a_trace) {
        /* magic call: gettimeofday(0, 0) stops the user trace */
        gettimeofday(0,0);
        do_a_trace = 0;
        printf("Trace triggered with %f second delay.\n",t1-t0);
      }
    }
  }

I was trying to avoid the extra "rdtsc" overhead (plus the
floating point calculations), so - yes, I could have cases
where the time I measure is not caught by the kernel tracer.

[do some tests...]
Now I'm 5 for 5 with the revised code. It is odd that all the numbers
agree to within about 2 or 3 usec (application measured vs. kernel
measured). If it was as bad as I was measuring before, I would have
expected one or two to be really off.

  --Mark


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 21:31 Mark_H_Johnson
  2004-12-10 21:54 ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 21:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
>
>> [1] I still do not get traces where cpu_delay switches CPU's. I only
>> get trace output if it starts and ends on a single CPU. [...]
>
>lt001.18RT/lt.02 is such a trace. It starts on CPU#1:
>
> <unknown-3556  1...1    0µs : find_next_bit (user_trace_start)
>
>and ends on CPU#0:
>
> <unknown-3556  1...1  247µs : _raw_spin_lock_irqsave (user_trace_stop)
>
>the trace shows a typical migration of an RT task.

Hmm. Now that I look at it more closely with
 # grep '^<unknown-3556' lt001.18RT/lt.02
I can see what you mean. Though this time it moved from CPU one [22 usec]
to zero [189 usec] and back to one [238 usec] before it finally stopped
the trace at 248 usec. [the attempt to reschedule on CPU zero was
clearly ineffective]

This trace is also odd in that the duration of the trace in the header is
listed as 99 usec (and the application confirmed that) but the trace
ends at 248 usec.

Hmm. Now that I look at it, the duration in the header (99 usec) is the
duration of lt.01 (as reported by the script), while the total duration
of the trace (248 usec) is the script-reported duration of lt.02.


--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 21:06 Mark_H_Johnson
  2004-12-10 21:40 ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 21:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

Comparison of .32-18RT and .32-18PK results
RT has PREEMPT_RT,
PK has PREEMPT_DESKTOP and no threaded IRQ's.
2.4 has lowlat + preempt patches applied

      within 100 usec
       CPU loop (%)   Elapsed Time (sec)    2.4
Test   RT     PK        RT      PK   |   CPU  Elapsed
X     90.40 100.00&     73 *    64+  |  97.20   70
top   78.56 100.00&     39 *    31+  |  97.48   29
neto  92.82 100.00&    350 *   184+  |  96.23   36
neti  90.69 100.00&    350 *   170+  |  95.86   41
diskw 82.96  99.99     360 *    61+  |  77.64   29
diskc 91.41  99.34     350 *   310+  |  84.12   77
diskr 88.41  99.96     360 *   310+  |  90.66   86
total                 1881    1130   |         368
 [higher is better]  [lower is better]
* wide variation in audio duration
+ long stretch of audio duration "too fast"
& 100% to digits shown, had a FEW samples > 100 usec.

WOW! Look at the 100% values measured on -18PK. The performance
of 2.6 with PREEMPT_DESKTOP is far better than 2.4 preempt+lowlat
in every way except the non RT starvation problem. Something
fixed between -12 and -18 really made a big improvement.

It is still disturbing to see the worse results for -18RT
and I wish I knew what the cause was. Perhaps the traces I sent
earlier can provide a clue.

Other notes:

[1] -PK has many more latency traces > 250 usec [some MUCH longer]
than -RT. I ended up collecting more traces for -RT since I set
the limit to 100 usec, but got only about 8 traces > 250 usec,
compared to 40 for -PK.

[2] Some of the traces in -PK are extremely long (up to 2 msec), and
I'm glad I have an SMP system so I can run the RT application on
the other CPU. Here's an extract of one example:

preemption latency trace v1.1.4 on 2.6.10-rc2-mm3-V0.7.32-18PK
--------------------------------------------------------------------
 latency: 1698 us, #3137/3137, CPU#0 | (M:preempt VP:0, KP:1, SP:0 HP:0
#P:2)
    -----------------
    | task: kjournald-1203 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
<unknown-1203  0d..1    0µs < (54)
<unknown-1203  0...1    0µs : journal_remove_journal_head
(journal_commit_transaction)
<unknown-1203  0...1    1µs : __journal_remove_journal_head
(journal_remove_journal_head)
<unknown-1203  0...1    1µs : __brelse (__journal_remove_journal_head)
<unknown-1203  0...1    1µs : journal_free_journal_head
(journal_remove_journal_head)
<unknown-1203  0...1    1µs : kmem_cache_free (journal_free_journal_head)
<unknown-1203  0d..1    2µs : cache_flusharray (kmem_cache_free)
<unknown-1203  0d..1    2µs : _raw_spin_lock (cache_flusharray)
<unknown-1203  0d..2    2µs+: free_block (cache_flusharray)
<unknown-1203  0d..2    5µs : _raw_spin_unlock (cache_flusharray)
<unknown-1203  0d..1    5µs : memmove (cache_flusharray)
<unknown-1203  0d..1    5µs : memcpy (memmove)
<unknown-1203  0...1    6µs : inverted_lock (journal_commit_transaction)
<unknown-1203  0...1    6µs : __journal_unfile_buffer
(journal_commit_transaction)
<unknown-1203  0...1    7µs : journal_remove_journal_head
(journal_commit_transaction)
<unknown-1203  0...1    7µs : __journal_remove_journal_head
(journal_remove_journal_head)
<unknown-1203  0...1    8µs : __brelse (__journal_remove_journal_head)
<unknown-1203  0...1    8µs : journal_free_journal_head
(journal_remove_journal_head)
<unknown-1203  0...1    8µs : kmem_cache_free (journal_free_journal_head)
<unknown-1203  0...1    9µs : inverted_lock (journal_commit_transaction)
<unknown-1203  0...1    9µs : __journal_unfile_buffer (journal_commit_transaction)
...
<unknown-1203  0...1  835µs : journal_remove_journal_head (journal_commit_transaction)
<unknown-1203  0...1  835µs : __journal_remove_journal_head (journal_remove_journal_head)
<unknown-1203  0...1  836µs : __brelse (__journal_remove_journal_head)
<unknown-1203  0...1  836µs : journal_free_journal_head (journal_remove_journal_head)
<unknown-1203  0...1  837µs : kmem_cache_free (journal_free_journal_head)
<unknown-1203  0d..1  837µs : cache_flusharray (kmem_cache_free)
<unknown-1203  0d..1  837µs : _raw_spin_lock (cache_flusharray)
<unknown-1203  0d..2  838µs : free_block (cache_flusharray)
<unknown-1203  0d.h2  839µs : do_nmi (__mcount)
<unknown-1203  0d.h2  840µs : do_nmi (<08048340>)
<unknown-1203  0d.h2  840µs+: do_nmi (<00200282>)
<unknown-1203  0d..2  843µs : _raw_spin_unlock (cache_flusharray)
<unknown-1203  0d..1  844µs : memmove (cache_flusharray)
<unknown-1203  0d..1  844µs : memcpy (memmove)
<unknown-1203  0...1  844µs : inverted_lock (journal_commit_transaction)
...
<unknown-1203  0d.h2 1686µs : __mod_timer (ide_execute_command)
<unknown-1203  0d.h2 1686µs : _raw_spin_lock_irqsave (__mod_timer)
<unknown-1203  0d.h3 1687µs : _raw_spin_lock (__mod_timer)
<unknown-1203  0d.h4 1687µs : internal_add_timer (__mod_timer)
<unknown-1203  0d.h4 1687µs : _raw_spin_unlock (__mod_timer)
<unknown-1203  0d.h3 1688µs : _raw_spin_unlock_irqrestore (__mod_timer)
<unknown-1203  0d.h2 1688µs : ide_outbsync (ide_execute_command)
<unknown-1203  0d.h2 1689µs : __const_udelay (ide_execute_command)
<unknown-1203  0d.h2 1690µs : __delay (ide_execute_command)
<unknown-1203  0d.h2 1690µs : delay_tsc (__delay)
<unknown-1203  0d.h2 1691µs : _raw_spin_unlock_irqrestore (ide_dma_exec_cmd)
<unknown-1203  0..h1 1691µs : ide_dma_start (__ide_do_rw_disk)
<unknown-1203  0..h1 1692µs : ide_inb (ide_dma_start)
<unknown-1203  0..h1 1692µs : ide_outb (ide_dma_start)
<unknown-1203  0..h1 1693µs : _raw_spin_lock_irq (ide_do_request)
<unknown-1203  0..h1 1693µs : _raw_spin_lock_irqsave (ide_do_request)
<unknown-1203  0d.h2 1694µs : _raw_spin_unlock_irqrestore (ide_intr)
<unknown-1203  0d.h1 1695µs : _raw_spin_lock (__do_IRQ)
<unknown-1203  0d.h2 1695µs : note_interrupt (__do_IRQ)
<unknown-1203  0d.h2 1696µs : end_edge_ioapic_irq (__do_IRQ)
<unknown-1203  0d.h2 1696µs : _raw_spin_unlock (__do_IRQ)
<unknown-1203  0d.h1 1696µs : irq_exit (do_IRQ)

[3] The charts are very smooth on -PK, very similar to those on
2.4 other than the "too fast" symptom for the audio duration.
[the white line is shifted down when compared to the nominal
duration yellow line]

[4] -PK also had two of the non-RT starvation periods. Each was
almost 3 minutes long (should be a 5 second sleep). -RT had
none.

Will send the traces separately.
  --Mark


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 20:03 Mark_H_Johnson
  0 siblings, 0 replies; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 20:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

This time, initial test results on -18PK (PREEMPT_DESKTOP, no IRQ
threading).

Again, results running cpu_delay and collecting user triggered traces.

[1] Similar symptoms: I still do not get traces that span CPU's, plus
the missing-trace symptom.

# ./cpu_delay 0.000100
Delay limit set to 0.00010000 seconds
calibrating loop ....
time diff= 0.503365 or 397326229 loops/sec.
Trace activated with 0.000100 second delay.
Trace triggered with 0.000321 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000136 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000139 second delay. [00]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000130 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000240 second delay. [01]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000104 second delay. [02]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000130 second delay. [03]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000109 second delay. [04]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000109 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000155 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000100 second delay. [05]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000105 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000128 second delay. [06]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000123 second delay. [07]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000100 second delay. [08]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000113 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000136 second delay. [09]
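For context, the measurement idea behind a cpu_delay-style tester can be sketched as follows. This is a hypothetical reconstruction, not the actual test source (which is a C program): spin in a tight loop timing successive iterations, and treat any gap over the threshold as time the task was preempted or delayed.

```python
import time

def measure_delays(threshold_s, iterations=100_000):
    """Tight timing loop in the spirit of cpu_delay (illustrative sketch):
    any gap between two successive iterations longer than threshold_s
    means the task was scheduled away for at least that long."""
    delays = []
    last = time.perf_counter()
    for _ in range(iterations):
        now = time.perf_counter()
        gap = now - last
        if gap > threshold_s:
            delays.append(gap)  # the real tool would trigger a trace here
        last = now
    return delays

if __name__ == "__main__":
    hits = measure_delays(0.000100)
    print(f"{len(hits)} iterations exceeded the 100 usec limit")
```

The real program additionally calibrates a busy loop (the "loops/sec" line above) and arms/disarms the kernel's user tracing around each overrun.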

# chrt -f 1 ./get_ltrace.sh 50
Current Maximum is 4965280, limit will be 50.
Resetting max latency from 4965280 to 50.
No new latency samples at Fri Dec 10 13:42:31 CST 2004.
No new latency samples at Fri Dec 10 13:42:41 CST 2004.
No new latency samples at Fri Dec 10 13:42:51 CST 2004.
No new latency samples at Fri Dec 10 13:43:01 CST 2004.
New trace 0 w/ 139 usec latency.
Resetting max latency from 139 to 50.
No new latency samples at Fri Dec 10 13:43:23 CST 2004.
New trace 1 w/ 241 usec latency.
Resetting max latency from 241 to 50.
No new latency samples at Fri Dec 10 13:43:44 CST 2004.
New trace 2 w/ 103 usec latency.
Resetting max latency from 103 to 50.
New trace 3 w/ 130 usec latency.
Resetting max latency from 130 to 50.
No new latency samples at Fri Dec 10 13:44:17 CST 2004.
No new latency samples at Fri Dec 10 13:44:27 CST 2004.
No new latency samples at Fri Dec 10 13:44:37 CST 2004.
New trace 4 w/ 109 usec latency.
Resetting max latency from 109 to 50.
No new latency samples at Fri Dec 10 13:44:58 CST 2004.
New trace 5 w/ 100 usec latency.
Resetting max latency from 100 to 50.
New trace 6 w/ 128 usec latency.
Resetting max latency from 128 to 50.
New trace 7 w/ 123 usec latency.
Resetting max latency from 123 to 50.
New trace 8 w/ 100 usec latency.
Resetting max latency from 100 to 50.
No new latency samples at Fri Dec 10 13:45:54 CST 2004.
New trace 9 w/ 135 usec latency.
Resetting max latency from 135 to 50.
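For reference, the collection loop in get_ltrace.sh boils down to: read the recorded maximum latency, save the trace if it grew past the limit, then reset the maximum so the next overrun is captured. A minimal sketch of that logic (the callables stand in for reads/writes of the tracer's /proc files; this is my reconstruction, not the script itself):

```python
def poll_once(read_max, save_trace, reset_max, limit):
    """One polling step: if the recorded maximum latency exceeds `limit`,
    save the corresponding trace and reset the maximum. Returns the
    latency that was saved, or None when there was no new sample."""
    current = read_max()
    if current > limit:
        save_trace(current)   # "New trace N w/ <current> usec latency."
        reset_max(limit)      # "Resetting max latency from <current> to <limit>."
        return current
    return None

# Simulated sequence of max-latency readings, mimicking the log above.
saved = []
readings = iter([50, 139, 50, 241])
poll_once(lambda: next(readings), saved.append, lambda v: None, 50)  # no new sample
poll_once(lambda: next(readings), saved.append, lambda v: None, 50)  # saves 139
print("saved traces:", saved)
```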

I don't see a significant difference in failures to record when compared
with -18RT (7 vs. 8).

[2] I get no ksoftirqd activity (as expected).

[3] Long sequences on one CPU only were seen again. Though in looking
at these traces, I find several incomplete ones like this:

preemption latency trace v1.1.4 on 2.6.10-rc2-mm3-V0.7.32-18PK
--------------------------------------------------------------------
 latency: 139 us, #55/55, CPU#1 | (M:preempt VP:0, KP:1, SP:0 HP:0 #P:2)
    -----------------
    | task: cpu_delay-3517 (uid:0 nice:0 policy:1 rt_prio:30)
    -----------------
 => started at: <00000000>
 => ended at:   <00000000>

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
<unknown-0     0d.h1    0µs : do_nmi (default_idle)
<unknown-0     0d.h1    0µs : do_nmi (_raw_spin_lock_irqsave)
<unknown-3517  1d.H2    0µs : do_nmi (default_idle)
<unknown-0     0d.h1    0µs : do_nmi (<00200246>)
<unknown-3517  1d.H2    0µs+: do_nmi ((514))
<unknown-0     0...1    3µs : default_idle (cpu_idle)
<unknown-3517  1d.H2    3µs : do_IRQ (c013a0cc 0 0)
<unknown-3517  1d.H2    3µs : _raw_spin_lock (__do_IRQ)
<unknown-3517  1d.H3    4µs : ack_edge_ioapic_irq (__do_IRQ)
<unknown-3517  1d.H3    4µs : redirect_hardirq (__do_IRQ)
<unknown-3517  1d.H3    5µs : _raw_spin_unlock (__do_IRQ)
<unknown-3517  1d.H2    5µs : handle_IRQ_event (__do_IRQ)
<unknown-3517  1d.H2    5µs : timer_interrupt (handle_IRQ_event)
<unknown-3517  1d.H2    6µs : _raw_spin_lock (timer_interrupt)
<unknown-3517  1d.H3    6µs : mark_offset_tsc (timer_interrupt)
<unknown-3517  1d.H3    7µs : _raw_spin_lock (mark_offset_tsc)
<unknown-3517  1d.H4    7µs+: _raw_spin_lock_irqsave (mark_offset_tsc)
<unknown-3517  1d.H5   13µs : _raw_spin_unlock_irqrestore (mark_offset_tsc)
<unknown-3517  1d.H4   14µs : _raw_spin_unlock (mark_offset_tsc)
<unknown-3517  1d.H3   14µs+: _raw_spin_lock_irqsave (timer_interrupt)
<unknown-3517  1d.H4   18µs : _raw_spin_unlock_irqrestore (timer_interrupt)
<unknown-3517  1d.H3   18µs : do_timer (timer_interrupt)
<unknown-3517  1d.H3   19µs : _raw_spin_unlock (timer_interrupt)
<unknown-3517  1d.H2   19µs : _raw_spin_lock (__do_IRQ)
<unknown-3517  1d.H3   20µs : note_interrupt (__do_IRQ)
<unknown-3517  1d.H3   20µs : end_edge_ioapic_irq (__do_IRQ)
<unknown-3517  1d.H3   21µs : _raw_spin_unlock (__do_IRQ)
<unknown-3517  1d.H2   21µs : irq_exit (do_IRQ)
<unknown-3517  1d.s2   21µs < (0)
<unknown-3517  1..s1   22µs : mod_timer (rt_check_expire)
<unknown-3517  1..s1   23µs : __mod_timer (rt_check_expire)
<unknown-3517  1..s1   23µs : _raw_spin_lock_irqsave (__mod_timer)
<unknown-3517  1d.s2   24µs : _raw_spin_lock (__mod_timer)
<unknown-3517  1d.s3   24µs : internal_add_timer (__mod_timer)
<unknown-3517  1d.s3   25µs : _raw_spin_unlock (__mod_timer)
<unknown-3517  1d.s2   25µs : _raw_spin_unlock_irqrestore (__mod_timer)
<unknown-3517  1..s1   26µs : cond_resched_all (run_timer_softirq)
<unknown-3517  1..s1   26µs : cond_resched_softirq (run_timer_softirq)
<unknown-3517  1..s1   26µs : _raw_spin_lock_irq (run_timer_softirq)
<unknown-3517  1..s1   27µs : _raw_spin_lock_irqsave (run_timer_softirq)
<unknown-3517  1d.s2   28µs : _raw_spin_unlock_irq (run_timer_softirq)
<unknown-3517  1..s1   29µs : __wake_up (run_timer_softirq)
<unknown-3517  1..s1   29µs : _raw_spin_lock_irqsave (__wake_up)
<unknown-3517  1d.s2   29µs : __wake_up_common (__wake_up)
<unknown-3517  1d.s2   30µs : _raw_spin_unlock_irqrestore (run_timer_softirq)
<unknown-3517  1..s1   30µs : cond_resched_all (___do_softirq)
<unknown-3517  1..s1   30µs : cond_resched_softirq (___do_softirq)
<unknown-3517  1d...   31µs < (0)
<unknown-3517  1d...   32µs+< (0)
<unknown-3517  1....   35µs > sys_gettimeofday (00000000 00000000 0000007b)
<unknown-3517  1....   35µs : sys_gettimeofday (sysenter_past_esp)
<unknown-3517  1....   35µs : user_trace_stop (sys_gettimeofday)
<unknown-3517  1...1   36µs : user_trace_stop (sys_gettimeofday)
<unknown-3517  1...1   36µs : _raw_spin_lock_irqsave (user_trace_stop)
<unknown-3517  1d..2   37µs : _raw_spin_unlock_irqrestore (user_trace_stop)

Since the trace does not start with user_trace_start (or something
like that), I assume the missing portion is the beginning of the trace.

[4] Not sure if I can make the same comment on -18PK as I did for -18RT
on changing CPU's [since I understand we do not have the opportunity to
do so unless the IRQ's are threaded].

[5] The trace output cosmetic problems are on PREEMPT_DESKTOP as well.

Traces to come shortly in a separate message.
  --Mark


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 21:58 Mark_H_Johnson
  2004-12-09 22:55 ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 21:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>on SMP, latencytest + all IRQ threads (and ksoftirqd) at prio 99 +
>PREEMPT_RT is not comparable to PREEMPT_DESKTOP (with no IRQ threading).
Of course they are comparable. You may not consider it a FAIR
comparison, but they are comparable. I maintain the comparison shows
the increased overhead of IRQ threading - you maintain it is an
"inverse scenario" which I don't buy.

>The -RT kernel will 'split' hardirq and softirq workloads and migrate
>them to different CPUs - giving them a higher total throughput. Also, on
>PREEMPT_DESKTOP the IRQs will most likely go to one CPU only, and most
>softirq processing will be concentrated on that CPU too. Furthermore,
>the -RT kernel will agressively distribute highprio RT tasks.
Now wait a minute, how does -RT do that IRQ splitting? From what I recall
(I don't have RT up right now) taskset indicated all the IRQ tasks were
wired to CPU 0 and the only opportunity for splitting was with
ksoftirqd/0 and /1. I confirmed that by looking at the "last CPU" in top.

When I tried to change the CPU affinity of those IRQ tasks (to use both
CPU's), I got error messages. One of the responses you made at that
time was...
>> If setting it to 3 is REALLY BAD, perhaps we should prevent it.
>
>it's just like setting ksoftirqd's affinity. I agree that it's nasty,
>but there's no easy way right now.
Has this behavior changed in the last three weeks?
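The affinity change being discussed is sched_setaffinity(2), which taskset wraps. A minimal sketch using Python's binding of the same syscall, run against the current process (changing an IRQ thread's mask needs its real PID and root privileges, and on the kernels discussed here was rejected for the hard-wired IRQ threads):

```python
import os

pid = 0  # 0 = the calling process; an IRQ thread would need its real PID
before = os.sched_getaffinity(pid)
print("current affinity mask:", sorted(before))
try:
    # Ask for CPUs 0 and 1 -- the "setting it to 3" (binary 11) above.
    os.sched_setaffinity(pid, {0, 1})
except OSError as e:
    # This is the error path Mark hit for the wired IRQ threads.
    print("kernel rejected the mask:", e)
print("affinity mask now:", sorted(os.sched_getaffinity(pid)))
```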

For a CONFIG_DESKTOP data point, let's take a look at the latency traces
I just made from -12PK.
  CPU 0 - 10, 14, 16 - 38, 40 - 43,
  CPU 1 - 00 - 09, 11 - 13, 15, 39, 44
let's see - 45 total traces, 29 for CPU 0 and 16 for CPU 1. Not quite
evenly balanced but not all on one CPU either (the data IS bursty though).
The common_interrupt trace appears to show up only on CPU 0, but the
latency traces are definitely on both CPU's.

>latencytest under your priority setup measures an _inverse_ scenario. (a
>CPU hog executing at a lower priority than all IRQ traffic) I'd not be
>surprised at all if it had higher latencies under -RT than under
>PREEMPT_DESKTOP.
Why "higher latencies"? And do you mean
 - more short latencies (I'm counting a lot of just over 100 usec delays)
OR
 - longer overall latencies (which I am not expecting but seeing)
OR
something else?

Let's look at the possible scenarios:
[PK refers to "Preemptible Kernel - PREEMPT_DESKTOP" w/o IRQ threading]
[RT refers to PREEMPT_RT with the IRQ # and ksoftirqd/# threads at RT 99]
 [1] A single interrupt comes in and latencytest is NOT on the CPU
that services the interrupt. In the case of PK, latencytest is
unaffected. In the case of RT, latencytest is affected ONLY if the
IRQ # thread or ksoftirqd/# thread is on the CPU with latencytest.
In that case, latencytest is pushed to the other CPU. That switch
takes some TBD amount of time and is counted by latencytest only if
it exceeds 100 usec.
 [2] A single interrupt comes in and latencytest is on the CPU that
services the interrupt. In the case of PK, latencytest is preempted
for the duration of the interrupt and resumes. In the case of RT,
latencytest is rescheduled on the other CPU (or not) once we reach the
place where we are ready to thread the IRQ. I would think RT should do
better in this case but am not sure.
 [3] A series of interrupts comes in. In PK what I see is several
sequential delays up to 1/2 msec or so (and have traces that show that
behavior). In RT I would expect a shorter latency period (if both CPU's
are busy with IRQ's or not) than PK [though I don't have traces for
this since if I cross CPU's the trace doesn't get recorded].

I don't see how RT should have worse numbers in these scenarios
unless the overhead is more (or I'm counting more trivial latencies)
than in PK. I would expect to see in the RT case a shorter maximum
delay (which alas I do NOT see).

>It's not clear-cut which one 'wins' though: because
>even this inverse scenario will have benefits in the -RT case: due to
>SCHED_OTHER workloads not interfering with this lower-prio RT task as
>much. But i'd expect there to be a constant moving of the 'benchmark
>result' forward and backwards, even if -RT only improves things - this
>is the nature of such an inverse priority setup.

Not quite sure what you mean by this.

>so this setup generates two conflicting parameters which are inverse to
>each other, and the 'sum' of these two parameters ends up fluctuating
>wildly. Pretty much like the results you are getting. The two parameters
>are: latency of the prio 30 task, and latency of the highprio tasks. The
>better the -RT kernel gets, the better the prio 30 tasks's priorities
>get relative to SCHED_OTHER tasks - but the worse they also get, due to
>the better handling of higher-prio tasks. Where the sum ends, whether
>it's a "win" or a "loss" depends on the workload, how much highprio
>activity the lowprio threads generate, etc.
I don't see how this rationale is relevant - the amount of work for IRQ
activities that is generated by each workload should be similar. It's
one of the reasons I run the same tests over and over again.

If I create a 750 Mbyte file (one of the stress test cases), I should be
doing a series of disk writes and interrupts. Both RT and PK should do
about the same work to create that file. So the overhead on latencytest
should be about the same for both RT and PK. If the overhead is
not the same, something is wrong.

If I look at the max latency:
  RT 3.90
  PK 1.91  (both cases nominal is 1.16 msec)
From the scenarios I described above, I don't see why this result should
have occurred. Certainly nothing that should cause a delay of over
two msec on a roughly one msec task.

If I look at the % within 100 usec measure:
  RT 87% within 100 usec, 97% within 200 usec (360 seconds elapsed)
  PK 67% within 100 usec, 96% within 200 usec (57 seconds elapsed)
[note 250,000 samples in 360 seconds is 694 samples per second]
From a percentage point of view, this looks bad for PK but if I
factor in the elapsed time I get...
 - PK interrupted latencytest about 13000 times
 - RT interrupted latencytest about 32000 times
I am not sure how much of this is due to the workload (disk writes)
or due to the elapsed time aspects.
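Those two interruption counts follow directly from the sample rate and the out-of-tolerance fractions quoted above; as a quick check:

```python
# 250,000 samples in 360 seconds -> ~694 samples per second (from above).
samples_per_sec = 250_000 / 360

def interruptions(elapsed_s, frac_within_100us):
    # Samples NOT within 100 usec = times latencytest was measurably delayed.
    return samples_per_sec * elapsed_s * (1.0 - frac_within_100us)

pk = interruptions(57, 0.67)    # PREEMPT_DESKTOP run: ~13,000
rt = interruptions(360, 0.87)   # PREEMPT_RT run: ~32,500
print(f"PK ~{pk:.0f} interruptions, RT ~{rt:.0f} interruptions")
```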

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>



end of thread, other threads:[~2005-01-30 23:59 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-10 17:49 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Mark_H_Johnson
2004-12-10 21:09 ` Ingo Molnar
2004-12-10 21:12 ` Ingo Molnar
2004-12-10 21:24 ` Ingo Molnar
2004-12-13  0:16 ` Fernando Lopez-Lezcano
2004-12-13  6:47   ` Ingo Molnar
2004-12-14  0:46     ` Fernando Lopez-Lezcano
2004-12-14  4:42       ` K.R. Foley
2004-12-14  8:47         ` Rui Nuno Capela
2004-12-14 11:35           ` Ingo Molnar
2004-12-27 14:35             ` Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15) Esben Nielsen
2004-12-27 15:27               ` Steven Rostedt
2004-12-27 16:23                 ` Esben Nielsen
2004-12-27 16:39                   ` Steven Rostedt
2004-12-27 21:06                     ` Bill Huey
2004-12-27 21:48                       ` Valdis.Kletnieks
2004-12-28 21:59                       ` Lee Revell
2005-01-04 15:25                       ` Andrew McGregor
2004-12-28 21:42                   ` Lee Revell
2005-01-28  7:38               ` Ingo Molnar
2005-01-28 11:56                 ` William Lee Irwin III
2005-01-28 15:28                   ` Ingo Molnar
2005-01-28 15:55                     ` William Lee Irwin III
2005-01-28 16:16                       ` Ingo Molnar
2005-01-28 19:18                 ` Trond Myklebust
2005-01-28 19:45                   ` Ingo Molnar
2005-01-28 23:25                     ` Bill Huey
2005-01-28 21:13                   ` Lee Revell
2005-01-30 22:03                 ` Esben Nielsen
2005-01-30 23:59                   ` Kyle Moffett
  -- strict thread matches above, loose matches on Subject: below --
2004-12-13 23:14 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Mark_H_Johnson
2004-12-13 20:02 Mark_H_Johnson
2004-12-13 22:39 ` Ingo Molnar
2004-12-13 17:05 Mark_H_Johnson
2004-12-13 22:33 ` Ingo Molnar
2004-12-10 22:06 Mark_H_Johnson
2004-12-10 22:24 ` Ingo Molnar
2004-12-10 21:58 Mark_H_Johnson
2004-12-10 21:54 Mark_H_Johnson
2004-12-10 22:14 ` Ingo Molnar
2004-12-10 21:31 Mark_H_Johnson
2004-12-10 21:54 ` Ingo Molnar
2004-12-10 21:06 Mark_H_Johnson
2004-12-10 21:40 ` Ingo Molnar
2004-12-10 20:03 Mark_H_Johnson
2004-12-09 21:58 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Mark_H_Johnson
2004-12-09 22:55 ` Ingo Molnar
2004-12-10 10:53   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Ingo Molnar
2004-12-10 14:59     ` Gene Heskett
2004-12-10 15:59       ` Gene Heskett
2004-12-10 19:09     ` Lee Revell
